Supervisord crashes when over 1023 files are open (even with ulimit set) #26
Comments
chuckadams
commented
Jul 7, 2011
|
I threw together a quick hack that emulates select() with poll(). I don't recommend trusting your production boxes with my skin-deep knowledge of poll(), but it does withstand the cat-bomb as well as pass all tests. https://github.com/sproingie/supervisor/commit/2d3e753b8bca6c39eca3840290b5425f583fb0db |
chuckadams
commented
Jul 7, 2011
|
The above hack handled POLLPRI improperly; that's fixed in the head of my fork. |
shimon
commented
Jul 9, 2011
|
Thanks for the quick attention, sproingie! |
chuckadams
commented
Jul 9, 2011
|
Since renaming my repo (owing to some incompatible changes I'm making), I can't seem to nail down that changeset anymore, but if you're trying it, you'll want to grab the code out of the head revision. I recall making at least one extra change, namely multiplying the timeout by 1000. Turns out that select.select specifies the timeout in seconds, whereas for select.poll it's in milliseconds. Oops. The tiny timeout caused a lot of spinning and probably even some livelock. It's still a hack job, since the proper design would be to keep the poll object around persistently and not try to make it emulate the statelessness of select(). |
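For readers following along, here is a minimal sketch of the kind of stateless select()-over-poll() shim described above. It is not the code from the fork; the function name `select_via_poll` and the event mapping are assumptions made for illustration, and it accepts plain integer descriptors only. The point it shows is the unit conversion: select.select takes its timeout in seconds, while poll() takes milliseconds.

```python
import os
import select


def select_via_poll(rlist, wlist, xlist, timeout=None):
    """Hypothetical stateless stand-in for select.select() built on select.poll().

    Takes integer file descriptors; timeout is in seconds, as with select().
    """
    rset, wset, xset = set(rlist), set(wlist), set(xlist)

    # Build one event mask per descriptor.
    interest = {}
    for fd in rset:
        interest[fd] = interest.get(fd, 0) | select.POLLIN
    for fd in wset:
        interest[fd] = interest.get(fd, 0) | select.POLLOUT
    for fd in xset:
        interest[fd] = interest.get(fd, 0) | select.POLLPRI

    # A fresh poll object per call mimics the statelessness of select().
    poller = select.poll()
    for fd, mask in interest.items():
        poller.register(fd, mask)

    # select() counts in seconds, poll() in milliseconds.
    timeout_ms = None if timeout is None else int(timeout * 1000)

    readable, writable, exceptional = [], [], []
    for fd, event in poller.poll(timeout_ms):
        if fd in rset and event & (select.POLLIN | select.POLLHUP | select.POLLERR):
            readable.append(fd)
        if fd in wset and event & (select.POLLOUT | select.POLLERR):
            writable.append(fd)
        if fd in xset and event & select.POLLPRI:
            exceptional.append(fd)
    return readable, writable, exceptional


if __name__ == "__main__":
    r, w = os.pipe()
    print(select_via_poll([r], [w], [], 0.05))  # nothing to read yet
    os.write(w, b"x")
    print(select_via_poll([r], [w], [], 0.05))  # read end now has data
    os.close(r)
    os.close(w)
```

Rebuilding the poll object on every call is what makes this a hack rather than a design; a persistent poll object with register/unregister calls would avoid the per-iteration setup cost.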
|
FWIW, I think there is a way to compile a Python (involving some FD_SETSIZE hackery IIRC) that allows for more file descriptors to be accessible by select(). Googling doesn't lead to any obvious URLs however. |
mikekhristo
commented
Sep 7, 2011
|
Running into the error in the initial post. Any advice on how to deal with it? |
|
Currently the only workaround is to compile a Python that supports > 1024 file descriptors and run supervisor under that. |
mikekhristo
commented
Sep 7, 2011
|
Any idea how to do that? Google hasn't been helpful. I have the python 2.7.2 source extracted and ready to go. |
|
Nope. As I said in a previous entry, I could not find a suitable Google entry. Likely have to either ask on python newsgroup or stackoverflow. |
mikekhristo
commented
Sep 7, 2011
timbaileyjones
commented
Sep 8, 2011
|
Shouldn't supervisor just switch from select.select to select.poll? By my math (5 fds per child), this restricts supervisor to about 204 processes, actually fewer if you subtract stdin/stdout/stderr, listeners for rpc/http, and whatever shlibs python has open. So maybe 200 or 201. For the time being, we are probably going to cope with this by running two instances of supervisord and splitting our workload among them. |
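The arithmetic behind that estimate can be written out as a short sketch. The 5-fds-per-child figure comes from the comment above; the helper name `max_children` and the `reserved` values are illustrative assumptions, not measured numbers.

```python
def max_children(fd_limit=1024, fds_per_child=5, reserved=0):
    """Upper bound on supervised children when every fd must be below fd_limit.

    reserved is a guess at descriptors used by supervisord itself
    (stdin/stdout/stderr, rpc/http listeners, open libraries, ...).
    """
    return (fd_limit - reserved) // fds_per_child


if __name__ == "__main__":
    print(max_children())             # no overhead counted
    print(max_children(reserved=20))  # assuming 20 descriptors of overhead
```

With no overhead this gives 204; assuming 15 to 20 descriptors of overhead it gives 201 or 200, which matches the range quoted in the comment.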
timbaileyjones
commented
Oct 28, 2011
|
I am taking a whack at fixing this myself, since Chuck Adams can't seem to find his change set: https://github.com/linuxtampa/supervisor |
kevin1024
commented
Nov 11, 2011
|
We just ran into this issue in production today. Not sure what our interim solution will be. Running two supervisors would be awkward. |
|
I looked into maybe trying to implement the mainloop in terms of select.poll, but it doesn't appear to work on Mac OS X, or at least the out-of-the-box Python builds on Mac OS X don't support it: http://bugs.python.org/issue5154 Bleh. |
timbaileyjones
commented
Jan 14, 2012
|
I tried making that change too... it seemed to work for the first day, but https://github.com/linuxtampa/supervisor tlj
|
|
Hi, I've started working on replacing select() with poll(); my fork is at: https://github.com/igorsobreira/supervisor/commits/master
I would love some feedback, and please let me know if I'm on the wrong track. |
|
I've sent an email with updates: http://lists.supervisord.org/pipermail/supervisor-users/2012-March/001036.html |
This was referenced Jun 18, 2012
|
See also #145 which sounds like it is also caused by this issue. |
weissi
commented
Jan 10, 2013
|
What is the status with this issue? @igorsobreira what about your pull request? |
|
@weissi my pull request has a working solution; there are no more features I had in mind that were needed. But two issues were reported on the pull request (possibly the same one, see the comments), and I haven't had time to dive into those yet. I plan to investigate this hands-on on Linux this weekend. Anyway, it needs more testing, and maybe an update to supervisor master. |
weissi
commented
Jan 10, 2013
|
Cool, thank you! |
spleeyah
commented
Apr 5, 2013
|
Any possible hope of this being addressed? :( |
jeff-minard-ck
commented
Apr 25, 2013
|
Our organization is hitting this same bug too. It's a pretty big deal. |
sandra1n
commented
Jun 27, 2013
|
Same bug. Thanks to @igorsobreira, his version works fine for me. |
frankmayer
commented
Aug 13, 2013
|
Just installed 3.0 and hitting this issue. Is there a plan to resolve this? |
akimicyu
commented
Jan 9, 2014
|
I hit the same bug. Thanks to @igorsobreira, your work is very cool. |
shimon commented Jul 7, 2011
Supervisord uses select.select to monitor filehandles related to the processes it supervises. This is problematic because select.select raises a ValueError for filehandles numbered >1023. (Observed with supervisor 3.0a8 on an Ubuntu GNU/Linux 11.04 amd64 machine.)
We ran into this problem when running approximately 254 supervised processes. Initially, we assumed it was a ulimit configuration problem, but found that the crash occurred even when running supervisord in non-daemon mode. I've been able to reproduce the stacktrace by supervising a large number of /bin/cat processes, and have included it below. Here's a conf file to run 1100 cats:
https://gist.github.com/1068713
To reproduce this bug, just install that config file and run something like:
sudo bash -c "ulimit -n 10000; supervisord -n"
You'll see a ValueError (out of range) from select.select(), called from supervisord's runforever():
https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisord.py#L218
It appears this is a limitation of Python's select() function, which raises a ValueError on file descriptors > 1023. I've seen some suggestions that beyond this limit, one should use poll() instead of select(), but I'm not an expert.
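The limit can also be checked without supervisord. The following is a rough sketch, assuming a Unix system where the process is allowed to raise its soft open-file limit above 1024; the function name `probe_high_fd` and the 4096 target are arbitrary choices for illustration. It duplicates a pipe onto a descriptor numbered 1024 or higher, then hands that descriptor to select.select and to select.poll.

```python
import fcntl
import os
import resource
import select

HIGH_FD = 1024  # lowest descriptor number that select() is expected to reject


def probe_high_fd():
    """Return (select_raised, poll_ok) for a descriptor numbered >= 1024,
    or None if this process cannot open a descriptor that high."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    wanted = 4096  # arbitrary target above 1024
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if soft != resource.RLIM_INFINITY and soft < wanted:
        try:
            # Raising the soft limit up to the hard limit needs no privileges.
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            return None

    read_end, write_end = os.pipe()
    try:
        try:
            # Ask for the lowest free descriptor numbered >= HIGH_FD.
            high = fcntl.fcntl(read_end, fcntl.F_DUPFD, HIGH_FD)
        except OSError:
            return None
        try:
            try:
                select.select([high], [], [], 0)
                select_raised = False
            except ValueError:
                select_raised = True
            poller = select.poll()
            poller.register(high, select.POLLIN)
            poll_ok = poller.poll(0) == []  # empty pipe: nothing ready, no error
            return select_raised, poll_ok
        finally:
            os.close(high)
    finally:
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    print(probe_high_fd())
```

On a build where select() is capped at 1024 descriptors, the first element of the result should be True (the same ValueError as in the report) while poll() accepts the descriptor.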
FULL TRACEBACK:
Traceback (most recent call last):
  File "/usr/bin/supervisord", line 9, in <module>
    load_entry_point('supervisor==3.0a8', 'console_scripts', 'supervisord')()
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 371, in main
    go(options)
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 381, in go
    d.main()
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 94, in main
    self.run()
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 111, in run
    self.runforever()
  File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 229, in runforever
    r, w, x = self.options.select(r, w, x, timeout)
  File "/usr/lib/pymodules/python2.7/supervisor/options.py", line 1097, in select
    return select.select(r, w, x, timeout)
ValueError: filedescriptor out of range in select()