Description
Tested on Debian testing, with supervisor 3.3.1. If supervisord is in the process of shutting down after receiving a SIGTERM, and it receives a SIGHUP while it is waiting for some process to die, the shutdown process is interrupted and supervisord restarts instead.
Steps to reproduce:
- start supervisord with the configuration file from the attached test-supervisor.zip, which runs the test.sh script (set the path to test.sh in the configuration file first). test.sh sleeps for a few seconds after receiving SIGTERM, which leaves enough time for the race condition to occur.
- send SIGTERM to supervisord, then SIGHUP half a second later:
  $ killall -SIGTERM supervisord && sleep 0.5 && killall -SIGHUP supervisord
- supervisord restarts instead of terminating. Its log shows the SIGHUP arriving while it is still waiting for test to die:
...
2018-11-24 15:57:04,987 INFO spawned: 'test' with pid 9226
2018-11-24 15:57:05,988 INFO success: test entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2018-11-24 15:58:06,750 WARN received SIGTERM indicating exit request
2018-11-24 15:58:06,750 INFO waiting for test to die
2018-11-24 15:58:07,252 WARN received SIGHUP indicating restart request
2018-11-24 15:58:10,255 WARN killing 'test' (9226) with SIGKILL
2018-11-24 15:58:10,255 INFO waiting for test to die
2018-11-24 15:58:10,256 INFO stopped: test (terminated by SIGKILL)
2018-11-24 15:58:10,264 INFO RPC interface 'supervisor' initialized
2018-11-24 15:58:10,264 CRIT Server 'unix_http_server' running without any HTTP authentication checking
...
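The attached test.sh is not reproduced here. If it is unavailable, a minimal Python stand-in can play the same role, assuming all it has to do is stay alive and then linger for a few seconds after SIGTERM (the program's default stopsignal). The 5-second delay and the file name you give it are assumptions; point the program's `command=` line in the configuration at it.

```python
import signal
import sys
import time

# Assumed delay: test.sh "sleeps for a few seconds on SIGTERM"; 5 s is a guess.
# It only needs to exceed the 0.5 s gap between SIGTERM and SIGHUP in the
# reproduction command, so supervisord is still waiting when SIGHUP lands.
SHUTDOWN_DELAY = 5.0


def handle_sigterm(signum, frame):
    """Simulate a slow graceful shutdown, then exit with status 0."""
    name = signal.Signals(signum).name
    print(f"received {name}, shutting down in {SHUTDOWN_DELAY} s", flush=True)
    time.sleep(SHUTDOWN_DELAY)
    sys.exit(0)


def main():
    signal.signal(signal.SIGTERM, handle_sigterm)
    print("running", flush=True)
    while True:
        signal.pause()  # Unix only: block until a signal arrives


if __name__ == "__main__":
    main()
```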
Why it matters: when a systemd user session is closed, every process still running in it receives SIGTERM and SIGHUP, on the assumption that at least one of the two will make the program shut down. With supervisord, this often triggers the race condition, so the program restarts instead of stopping. systemd then waits another 90 seconds before killing it with SIGKILL. When this happens during a system shutdown or reboot, the computer hangs for 90 seconds before completing the operation, a delay that is entirely avoidable.
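The session-close sequence can be approximated without logging out. The sketch below sends SIGTERM and SIGHUP back to back with no delay, which is tighter than the 0.5-second gap in the reproduction command. It assumes supervisord's pid is read from the file named by its `pidfile` option, passed as the first argument; the function name is hypothetical, and `kill` is injectable only so the sequence can be checked without a live process.

```python
import os
import signal
import sys


def send_session_close_signals(pid, kill=os.kill):
    """Send SIGTERM immediately followed by SIGHUP, as on session close."""
    for sig in (signal.SIGTERM, signal.SIGHUP):
        kill(pid, sig)


def main(argv):
    # Hypothetical usage: python send_signals.py /path/to/supervisord.pid
    with open(argv[1]) as f:
        pid = int(f.read().strip())
    send_session_close_signals(pid)


if __name__ == "__main__":
    main(sys.argv)
```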
Possible fixes:
- ignore/queue incoming signals while a signal is already being handled (from reading older issues, this already seems to be done for supervisorctl commands, but clearly not for signals).
- introduce some kind of priority between signals, e.g. SIGTERM can interrupt a restart triggered by SIGHUP, but SIGHUP cannot interrupt a shutdown triggered by SIGTERM.
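A rough sketch of the second option, not supervisor's actual code: keep one pending request and let a signal only raise its priority. The `Request` enum and `next_request` function are hypothetical names; the signal mapping follows supervisor's documented behavior (SIGTERM, SIGINT and SIGQUIT shut down, SIGHUP restarts).

```python
import signal
from enum import IntEnum


class Request(IntEnum):
    """Pending lifecycle request, ordered by priority (names are hypothetical)."""
    NONE = 0
    RESTART = 1
    SHUTDOWN = 2


# Exit signals request a shutdown, SIGHUP requests a restart.
SIGNAL_TO_REQUEST = {
    signal.SIGTERM: Request.SHUTDOWN,
    signal.SIGINT: Request.SHUTDOWN,
    signal.SIGQUIT: Request.SHUTDOWN,
    signal.SIGHUP: Request.RESTART,
}


def next_request(current, sig):
    """Return the pending request after receiving `sig`.

    A signal can only raise the priority of the pending request, never
    lower it: SIGTERM upgrades a restart to a shutdown, while SIGHUP
    during a shutdown is ignored. Unmapped signals change nothing.
    """
    requested = SIGNAL_TO_REQUEST.get(sig, Request.NONE)
    return max(current, requested)


if __name__ == "__main__":
    # The sequence from the reproduction steps: SIGTERM, then SIGHUP.
    pending = Request.NONE
    for sig in (signal.SIGTERM, signal.SIGHUP):
        pending = next_request(pending, sig)
    print(pending.name)  # SHUTDOWN
```

The first option (ignoring signals while one is being handled) would be the stricter variant that returns `current` whenever it is not `Request.NONE`; the priority version still lets SIGTERM cut a restart short.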
I could try to make a PR once a clear strategy is chosen.