Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

supervisorctl takes too long to execute command #131

Closed
gilles opened this issue Jun 19, 2012 · 44 comments
Closed

supervisorctl takes too long to execute command #131

gilles opened this issue Jun 19, 2012 · 44 comments

Comments

@gilles
Copy link

gilles commented Jun 19, 2012

[root@ip-10-245-174-225] ~ $ supervisorctl stop XXX
^CTraceback (most recent call last):
File "/usr/local/bin/supervisorctl", line 9, in
load_entry_point('supervisor==3.0a12', 'console_scripts', 'supervisorctl')()
File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 1114, in main
c.onecmd(" ".join(options.args))
File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 144, in onecmd
return do_func(arg)
File "/usr/local/lib/python2.6/dist-packages/supervisor/supervisorctl.py", line 732, in do_stop
result = supervisor.stopProcess(name)
File "/usr/lib/python2.6/xmlrpclib.py", line 1199, in call
return self.__send(self.__name, args)
File "/usr/lib/python2.6/xmlrpclib.py", line 1489, in __request
verbose=self.__verbose
File "/usr/local/lib/python2.6/dist-packages/supervisor/xmlrpc.py", line 463, in request
r = self.connection.getresponse()
File "/usr/lib/python2.6/httplib.py", line 990, in getresponse
response.begin()
File "/usr/lib/python2.6/httplib.py", line 391, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.6/httplib.py", line 349, in _read_status
line = self.fp.readline()
File "/usr/lib/python2.6/socket.py", line 427, in readline
data = recv(1)
KeyboardInterrupt

supervisorctl status show 'STOPPED' and the process does not exists.
The log file shows:
2012-06-19 14:31:08,722 INFO stopped: XXX (terminated by SIGTERM)

I can turn on more debugging if you need

@gilles
Copy link
Author

gilles commented Jun 19, 2012

It's happening on 3.0a12 and 3.0a8

@gilles
Copy link
Author

gilles commented Jun 28, 2012

Here are some strace. This is on a debian squeeze box. Note that we don't see the problem on a CentOS5

supervisord strace: strace -p 1467 2>&1 | grep SIG
kill(1494, SIGTERM) = 0
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WNOHANG, NULL) = 1494

=> Appears immediately after the stop command is issued, no other mention of process 1494

Chilld process strace: strace -p 1494
accept(296, {sa_family=AF_INET6, sin6_port=htons(49644), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 15
futex(0x21a01f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x18d9680, FUTEX_WAKE_PRIVATE, 1) = 1
accept(296, {sa_family=AF_INET6, sin6_port=htons(49676), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 83
futex(0x1610370, FUTEX_WAKE_PRIVATE, 1) = 1
accept(296, {sa_family=AF_INET6, sin6_port=htons(49715), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 15
futex(0x1ea9a10, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x18d9680, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x21ad150, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x18d9680, FUTEX_WAKE_PRIVATE, 1) = 1
accept(296, 0x7fff2be93050, [28]) = ? ERESTARTSYS (To be restarted)
--- SIGTERM (Terminated) @ 0 (0) ---
Process 1494 detached

=> Appears immediately after the stop command is issued

=> The process is marked as 'STOPPED' and the process does not exists anymore almost immediately after the command is issued

The stop command supervisorctl stop service:service_X appears to hang but just takes a long time to complete:
date; supervisorctl stop service:service_X; date
Thu Jun 28 13:07:36 EDT 2012
service:service_X: stopped
Thu Jun 28 13:18:10 EDT 2012

Note: sometimes the start command takes a long time to complete too:
date; supervisorctl start service:service_X; date
Thu Jun 28 13:23:02 EDT 2012
service:service_X: started
Thu Jun 28 13:23:29 EDT 2012

Note: supervisord.log shows the process stopped almost immediately after the command is executed:
2012-06-28 13:07:36,390 INFO stopped: service_X (terminated by SIGTERM)

@beniwohli
Copy link

I'm seeing the same problem on one server (but not on others):

# date; supervisorctl stop project_main_gunicorn; date
Wed Jul 11 17:50:07 CEST 2012
project_main_gunicorn: stopped
Wed Jul 11 18:52:14 CEST 2012

Yes, more than one hour.

supervisor.log:

2012-07-11 17:50:08,468 INFO stopped: project_main_gunicorn (exit status 0)

This is a huge problem for us, since it basically makes an automated deployment impossible. If you need more information, just let me know.

@thieman
Copy link

thieman commented Sep 13, 2012

I'm also experiencing this issue using 3.0a12 and Python 2.7.2 on Ubuntu 11.10.

@dcrosta
Copy link

dcrosta commented Oct 3, 2012

Does anyone have a solution or work-around for this? We're seeing it with supervisor 3.0a12 on a variety of platforms (Centos 6.2, 6.3, and Amazon Linux 2012.03).

@beniwohli
Copy link

@dcrosta in my case I reported above, it turned out that one task was restarting all the time, and blocking supervisord. I'm not sure why supervisor didn't back off restarting the task, my guess is that it took slightly longer to start up and die than the "it's alive!"-threshold. But I didn't have the time to investigate it further.

@dcrosta
Copy link

dcrosta commented Oct 4, 2012

@piquadrat Hmm, I don't think we have that issue on my end. In some cases, there's just one long-lived process managed by supervisord (though admittedly that process takes sometimes up to 10s to quit on its own after receiving SIGTERM).

If someone can point me in the right direction, I'd be happy to help gather more data about what's going on.

@caioariede
Copy link
Contributor

+1 for this issue on CentOS 6

@RealJTG
Copy link

RealJTG commented Jan 9, 2013

Same problem on CentOS 6 + gunicorn. Supervisorctl succesfully start/stop task and hangs

supervisorctl stop mysite
Zzzzzzz

^CTraceback (most recent call last):
  File "/usr/bin/supervisorctl", line 6, in <module>
    main()
  File "/usr/lib/python2.6/site-packages/supervisor/supervisorctl.py", line 598, in main
    c.onecmd(" ".join(options.args))
  File "/usr/lib/python2.6/site-packages/supervisor/supervisorctl.py", line 86, in onecmd
    return func(arg)
  File "/usr/lib/python2.6/site-packages/supervisor/supervisorctl.py", line 433, in do_stop
    result = supervisor.stopProcess(processname)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.6/site-packages/supervisor/options.py", line 1308, in request
    errcode, errmsg, headers = h.getreply()
  File "/usr/lib64/python2.6/httplib.py", line 1064, in getreply
    response = self._conn.getresponse()
  File "/usr/lib64/python2.6/httplib.py", line 990, in getresponse
    response.begin()
  File "/usr/lib64/python2.6/httplib.py", line 391, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.6/httplib.py", line 349, in _read_status
    line = self.fp.readline()
  File "/usr/lib64/python2.6/socket.py", line 433, in readline
    data = recv(1)
KeyboardInterrupt

logs

2013-01-09 18:04:23,044 DEBG XML-RPC method called: supervisor.getVersion()
2013-01-09 18:04:23,044 DEBG XML-RPC method supervisor.getVersion() returned successfully
2013-01-09 18:04:23,044 INFO localhost:0 - - [09/Jan/2013:16:04:23 +0200] "POST /RPC2 HTTP/1.0" 200 254
2013-01-09 18:04:23,045 DEBG XML-RPC method called: supervisor.stopProcess()
2013-01-09 18:04:23,045 DEBG XML-RPC method supervisor.stopProcess() returned successfully
2013-01-09 18:04:23,045 DEBG killing mysite (pid 6320) with signal SIGTERM
2013-01-09 18:04:23,376 INFO stopped: mysite (exit status 0)
2013-01-09 18:04:23,377 INFO received SIGCLD indicating a child quit

@tsharju
Copy link

tsharju commented Mar 4, 2013

Has anyone found a solution to this problem? I'm facing similar issues with supervisor version 3.0a8 running on Ubuntu 12.04. Installed supervisor using apt. Problem is exactly same as described above. I run supervisorctl stop myapp and the application stops, but supervisorctl just hangs. Any help is greatly appreciated here.

@tobsch
Copy link

tobsch commented Apr 11, 2013

me to. any way out of this?

@tobsch
Copy link

tobsch commented Apr 11, 2013

okay, in my case it seems that one of the processes i am watching was constantly printing out messages.
this seems to keep supervisor too busy to return the rpc call!?

@philipcristiano
Copy link

Suffering from this as well on Ubuntu 12.04 with 3.0a8 and 3.0b1, sometimes up to 5 minutes to restart a process.

@amcmanus
Copy link

Any progress on this bug? We're seeing it on CentOS 5 and 6, supervisor 3.0a12.

@daenney
Copy link

daenney commented Jun 5, 2013

Same here. We though we originally fixed it by fixing some bogus things with a child process but that just turned out to be a side-effect.

The quick fix to this issue is restart supervisord, after we do that it starts responding normally again and can take upto a month or two before it starts to exhibit this behaviour. I'm still convinced this is somehow triggered by a child process doing something weird but we haven't been able to pin point it yet.

We do have a strace to accompany the issue:

recvfrom(3, "<?xml version='1.0'?>\n<methodRes"..., 129, 0, NULL, NULL) = 129
stat("/usr/lib/pymodules/python2.6/supervisor/rpcinterface", 0x7fffec5d61a0) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/rpcinterface.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/rpcinterfacemodule.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/rpcinterface.py", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=30019, ...}) = 0
open("/usr/lib/pymodules/python2.6/supervisor/rpcinterface.pyc", O_RDONLY) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=27233, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7f05911000
read(5, "\321\362\r\n(\2135Pc\0\0\0\0\0\0\0\0\31\0\0\0@\0\0\0s\221\1\0\0d\0"..., 4096) = 4096
fstat(5, {st_mode=S_IFREG|0644, st_size=27233, ...}) = 0
read(5, "dules/python2.6/supervisor/rpcin"..., 20480) = 20480
read(5, "RB\0\0\0(\3\0\0\0R\24\0\0\0t\4\0\0\0typeR&\0\0\0(\0\0"..., 4096) = 2657
read(5, "", 4096)                       = 0
close(5)                                = 0
munmap(0x7f7f05911000, 4096)            = 0
stat("/usr/lib/pymodules/python2.6/supervisor/events", 0x7fffec5d2c00) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/events.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/eventsmodule.so", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/pymodules/python2.6/supervisor/events.py", O_RDONLY) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=6224, ...}) = 0
open("/usr/lib/pymodules/python2.6/supervisor/events.pyc", O_RDONLY) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=11881, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7f05911000
read(6, "\321\362\r\n(\2135Pc\0\0\0\0\0\0\0\0\4\0\0\0@\0\0\0s\237\2\0\0d\0"..., 4096) = 4096
fstat(6, {st_mode=S_IFREG|0644, st_size=11881, ...}) = 0
read(6, "or/events.pyR\23\0\0\0H\0\0\0s\4\0\0\0\0\1\t\1c\1"..., 4096) = 4096
read(6, "on2.6/supervisor/events.pyR-\0\0\0\217"..., 4096) = 3689
read(6, "", 4096)                       = 0
close(6)                                = 0
munmap(0x7f7f05911000, 4096)            = 0
close(5)                                = 0
close(4)                                = 0
close(3)                                = 0
socket(PF_FILE, SOCK_STREAM, 0)         = 3
connect(3, {sa_family=AF_FILE, path="/var/run/supervisor.sock"}, 26) = 0
sendto(3, "POST /RPC2 HTTP/1.1\r\nHost: local"..., 186, 0, NULL, 0) = 186
sendto(3, "<?xml version='1.0'?>\n<methodCal"..., 177, 0, NULL, 0) = 177
recvfrom(3, 

As you can see, which is also clear from @gilles backtrace we're just waiting, to infinity and beyond. This timeouts after approximately 15 minutes.

This all happens for us on superivsor 3.0b1 and odly enough is only happening on the machines that supervise nodejs processes, the ruby-only clusters are completely unaffected in our case (or the bug just hasn't shown up yet).

@palmkevin
Copy link

+1 (Same problem occurs sometimes on Solaris)

@kalyanceg
Copy link

Any improvement on this issue?
How do folks facing this issue, run supervisord in production environment?

@daenney
Copy link

daenney commented Oct 15, 2013

No improvement though we don't seem to hit the issue anymore at all, oddly enough.

Fortunately for us everything running under supervisord is redundant so we can fairly carelessly restart a supervisord if it grid locks again, but that's really not a very good solution.

I'm currently looking at Mozilla's circus as a potential way out. They also seem to have Python 3 compatibility on their list.

@tsharju
Copy link

tsharju commented Oct 15, 2013

I had some pretty bad CPU usage issues with circus. It was a while ago and I didn't have time to investigate the issue. I started using Upstart and was really happy with that decision.

@mnaberez
Copy link
Member

I would like to fix this, but I haven't experienced it myself yet. It would help a lot if someone could find a way to reproduce it.

@philipcristiano
Copy link

Anytime I've seen this issue it's been with multiple process groups writing a lot to stdout/err. I haven't tested this though.

@kalyanceg
Copy link

I think if the subprocesses write a lot to stdout, supervisord hangs in reading those output and leaves the callback in the queue unattended. But since I need a mechanism where after a code upload from jenkins, jenkins should trigger a supervisorctl restart group, where supervisorctl cant hang on for ever. So I resorted to blocking synchronous way for stopAllProcesses, stopProcessGroup, startProcessGroups and stopAllProcesses module in rpcinterface. The modules execute almost instantly. I know its an unclean way, but I am thinking of using this as a tradeoff. Suggestions are welcome on this. Please have a look at the diff
kalyanceg@935760a#diff-1

@lxyu
Copy link

lxyu commented Mar 20, 2014

Hi all, you may want to try http://lxyu.github.io/supervisor-quick/ plugin, it may ease your pain as an temporary workaround.

I wrote it because of the same problem, and it works for me very well.

@alicee
Copy link

alicee commented May 7, 2014

@lxyu Thanks a bunch for the plugin - it makes using supervisor much less painful.

@wesdu
Copy link

wesdu commented Jun 20, 2014

@lxyu Thanks. But still have problem with other commands.

@alexeiz
Copy link

alexeiz commented Jun 30, 2014

I have the same issue with the start command. When I start a group of processes and any (or all) of them enters the FATAL state, supervisorctl tries to restart the whole group over and over again (as apparent by running the maintail command), and the start command never finishes. Starting the processes individually finishes very quickly. I believe it's the same issue, because it's caused by the callback stack never getting exhausted.

@hgdeoro
Copy link

hgdeoro commented Jul 29, 2014

Same problem on Amazon AMI 2014.03 (supervisor v2.1)

@hgdeoro
Copy link

hgdeoro commented Jul 29, 2014

Workaround (tested on Amazon 2014.03): use timeout.

$ timeout 10s supervisorctl start nginx

In my case, supervisorctl works fine, the only problem is that the process never returns, and sending the process to the background leaves those supervisorctl process running, and I didn't like that. Running supervisorctl with timeout is a ugly hack, but works...

@lxyu
Copy link

lxyu commented Jul 29, 2014

@hgdeoro try supervisor-quick, it'll ease your problem.

@daenney
Copy link

daenney commented Jul 30, 2014

@mcdonc The issue is just as present in 3 and that's the version this thread was started with so upgrading won't really solve the issue, unfortunately :(.

@mcdonc
Copy link
Member

mcdonc commented Jul 30, 2014

There's not just one issue described in this thread, however (by my count there are at least three, which are related but not identical), so upgrading is probably a wise idea anyway.

@mcdonc
Copy link
Member

mcdonc commented Jul 30, 2014

I should also note that it would be great if someone could help us reproduce "this" (whichever subproblem described in this thread that is happening to you) on our own systems.

@hgdeoro
Copy link

hgdeoro commented Jul 30, 2014

@mcdonc I've just uploaded a Dockerfile to https://github.com/hgdeoro/test-supervisord/, with some screenshots, and instructions to reproduce. The same is happenging in CentOS 6.5 and Amazon AMI servers. The Dockerfile creates a CentOS 6 container and uses supervisor 2.1 from EPEL

@mcdonc
Copy link
Member

mcdonc commented Jul 30, 2014

Awesome!

mcdonc added a commit that referenced this issue Aug 9, 2014
@mcdonc
Copy link
Member

mcdonc commented Aug 9, 2014

Work to make start and stop of multiple processes much faster has been merged into master, and will make its debut in a 4.0 release.

@mcdonc mcdonc closed this as completed Aug 9, 2014
@wooparadog
Copy link

@mcdonc Hello, is there a 4.0 release date? We'd very much like to use fast restart.

@marcinn
Copy link

marcinn commented Aug 29, 2014

... or any chance to hotfix 3.x branch? Maybe patch for 3.x will be easier and faster to release?

@mcdonc
Copy link
Member

mcdonc commented Aug 29, 2014

No, sorry. The master is currently about 3-4X slower (it uses more CPU when logging) than the 3.X branch and will need to be fixed before we make an official release; not sure how long it will take. In the meantime, you can of course use a checkout.

@mnaberez
Copy link
Member

The changes for this issue (50d1857, d948fc5, 11ffa51, 5366309) were released in Supervisor 3.2.0.

magical added a commit to osu-mist/ansible-hubot that referenced this issue Jul 7, 2016
CentOS 6.7 ships with supervisord 2.1, which is now 9 years
old and unsupported. This wouldn't necessarily be a problem,
except that supervisorctl frequently hangs when trying to
start or stop a service[1].

Install the latest version with pip instead.

[1]: Supervisor/supervisor#131
cuongdo pushed a commit to cuongdo/cockroach that referenced this issue Aug 22, 2016
The previously used version, Ubuntu 15.10, has an incredibly old version
of supervisord that had an issue where subprocesses generating lots of
output (hello `--vmodule=raft=5`) created substantial delays in starting
and stopping other processes:

Supervisor/supervisor#131

Among other things, 16.04 has a newer version of supervisord.
jaraco pushed a commit to yougov/vr.server that referenced this issue Oct 4, 2016
This prevents tasks from hanging indefinitely if supervisorctl hangs, too.
See e.g. Supervisor/supervisor#131
@hnrindani
Copy link

Seems like this is a never ending bug. I came across this recently and seems like nothing is working for me. Does anybody have the experience of working with Upstart?

@verseal
Copy link

verseal commented May 15, 2019

Stop all your jobs, and the start one by one. You'll find which job causes problem

@saaiful
Copy link

saaiful commented Nov 16, 2022

Supervisor Now Takes 3-4 Minute to start when system boot (tested in ubuntu 20-22), this makes rebooting server nerve waking!

@wtfz
Copy link

wtfz commented Oct 23, 2023

Situation: The server suddenly shutdown and reboot, after that supervisor service not working properly due to bad config (some config point to deleted directory previously). After restarting the service, any supervisorctl command takes too long to execute and sometimes hang.

My workaround that probably fixed it:

  • Reinstall supervisor service
  • Remove all supervisor.sock files in /var/run directory
  • Move all config files in /etc/supervisor/conf.d to another directory (added it later one by one to check for bad config)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests