Supervisord crashes when over 1023 files are open (even with ulimit set) #26

Closed
shimon opened this Issue Jul 7, 2011 · 27 comments

shimon commented Jul 7, 2011

Supervisord uses select.select to monitor filehandles related to the processes it supervises. This is problematic because select.select raises a ValueError for filehandles numbered >1023. (Observed with supervisor 3.0a8 on an Ubuntu GNU/Linux 11.04 amd64 machine.)

We ran into this problem when running approximately 254 supervised processes. Initially, we assumed it was a ulimit configuration problem, but found that the crash occurred even when running supervisord in non-daemon mode. I've been able to reproduce the stacktrace by supervising a large number of /bin/cat processes, and have included it below. Here's a conf file to run 1100 cats:
https://gist.github.com/1068713

To reproduce this bug, just install that config file and run something like:
sudo bash -c "ulimit -n 10000; supervisord -n"
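
The gist's contents are not reproduced in this thread, so the following is only a guess at an equivalent: a minimal sketch that generates a config with one [program:x] section per /bin/cat process. The function name build_cat_config and the section names cat0000, cat0001, ... are made up for illustration.

```python
def build_cat_config(count=1100):
    """Return supervisord config text that supervises `count` /bin/cat processes."""
    lines = ["[supervisord]", ""]
    for index in range(count):
        # One program section per cat; the names are hypothetical.
        lines.append("[program:cat%04d]" % index)
        lines.append("command=/bin/cat")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Redirect stdout to a file to get a usable config.
    print(build_cat_config())
```

Each supervised cat keeps several pipes open in supervisord, which is why 1100 of them are more than enough to push descriptor numbers past 1023.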

You'll see a ValueError (out of range) from select.select(), called from supervisord's runforever():
https://github.com/Supervisor/supervisor/blob/master/supervisor/supervisord.py#L218

It appears this is a limitation of Python's select() function, which raises a ValueError on file descriptors > 1023. I've seen some suggestions that beyond this limit, one should use poll() instead of select(), but I'm not an expert.

FULL TRACEBACK:

Traceback (most recent call last):
File "/usr/bin/supervisord", line 9, in
load_entry_point('supervisor==3.0a8', 'console_scripts', 'supervisord')()
File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 371, in main
go(options)
File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 381, in go
d.main()
File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 94, in main
self.run()
File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 111, in run
self.runforever()
File "/usr/lib/pymodules/python2.7/supervisor/supervisord.py", line 229, in runforever
r, w, x = self.options.select(r, w, x, timeout)
File "/usr/lib/pymodules/python2.7/supervisor/options.py", line 1097, in select
return select.select(r, w, x, timeout)
ValueError: filedescriptor out of range in select()
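
The ValueError in the traceback can be seen outside supervisor as well. Here is a small sketch, assuming a Python 3 build that exposes select.poll and an open-file limit above 1500 for the high-descriptor part. The helper names select_accepts and poll_accepts and the descriptor number 1500 are illustrative only.

```python
import os
import select


def select_accepts(fd):
    """Return True if select.select() is willing to monitor this descriptor."""
    try:
        select.select([fd], [], [], 0)
        return True
    except ValueError:
        # Raised for descriptors numbered FD_SETSIZE or higher.
        return False


def poll_accepts(fd):
    """Return True if poll() monitors this descriptor without flagging it invalid."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    # poll() takes its timeout in milliseconds; 0 means "do not block".
    return not any(event & select.POLLNVAL for _, event in poller.poll(0))


if __name__ == "__main__":
    read_end, write_end = os.pipe()
    print(select_accepts(read_end), poll_accepts(read_end))
    try:
        os.dup2(read_end, 1500)  # fails if the open-file limit is 1500 or lower
    except OSError:
        print("open-file limit too low to create descriptor 1500")
    else:
        print(select_accepts(1500), poll_accepts(1500))
```

The point is that the limit applies to the descriptor's number, not to how many descriptors are passed in, so raising ulimit alone does not help select().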

I threw together a quick hack that emulates select() with poll(). I don't recommend trusting your production boxes with my skin-deep knowledge of poll(), but it does withstand the cat-bomb as well as pass all tests.

https://github.com/sproingie/supervisor/commit/2d3e753b8bca6c39eca3840290b5425f583fb0db

The above hack improperly handled POLLPRI, and it's fixed in the head of my fork

https://github.com/sproingie/supervisor/compare/d5aa987d26786c46ee5397b76c6a39afd84c9d0b...sproingie:master
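
For readers who can't follow the links, emulating select() with poll() can look roughly like this. It is not the code from the fork, only a sketch of the general idea, assuming plain integer descriptors and a Python build with select.poll. The name select_via_poll is hypothetical.

```python
import select


def select_via_poll(readables, writables, exceptionals, timeout=None):
    """Hypothetical stand-in for select.select() built on select.poll()."""
    interest = {}
    for fd in readables:
        interest[fd] = interest.get(fd, 0) | select.POLLIN
    for fd in writables:
        interest[fd] = interest.get(fd, 0) | select.POLLOUT
    for fd in exceptionals:
        # select()'s "exceptional condition" set corresponds to POLLPRI.
        interest[fd] = interest.get(fd, 0) | select.POLLPRI

    poller = select.poll()
    for fd, mask in interest.items():
        poller.register(fd, mask)

    # select() takes seconds, poll() takes milliseconds (None blocks forever).
    timeout_ms = None if timeout is None else int(timeout * 1000)

    r, w, x = [], [], []
    for fd, event in poller.poll(timeout_ms):
        wanted = interest[fd]
        # A hangup is reported as readable so the caller can read EOF.
        if wanted & select.POLLIN and event & (select.POLLIN | select.POLLHUP):
            r.append(fd)
        if wanted & select.POLLOUT and event & select.POLLOUT:
            w.append(fd)
        if wanted & select.POLLPRI and event & select.POLLPRI:
            x.append(fd)
    return r, w, x
```

This sketch builds a fresh poll object on every call and silently drops POLLERR and POLLNVAL, so it is a compatibility shim rather than a good design.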

shimon commented Jul 9, 2011

Thanks for the quick attention, sproingie!

Since renaming my repo (owing to some incompatible changes I'm making), I can't seem to nail down that changeset anymore, but if you're trying it, you'll want to grab the code out of the head revision. I recall making at least one extra change, namely multiplying the timeout by 1000. Turns out that select.select specifies the timeout in seconds, whereas for select.poll it's in milliseconds. Oops. The tiny timeout caused a lot of spinning and probably even some livelock.

It's still a hack job, since the proper design would be to keep the poll object around persistently and not try to make it emulate the statelessness of select().
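
A persistent poll object might be wrapped along these lines. This is a sketch under my own assumptions, not code from any fork. The class name PersistentPoller and its method names are invented for illustration.

```python
import select


class PersistentPoller:
    """Keeps one poll object alive and tracks which descriptors it watches."""

    READ = select.POLLIN | select.POLLPRI
    WRITE = select.POLLOUT

    def __init__(self):
        self._poller = select.poll()
        self._masks = {}  # fd -> event mask currently registered

    def _set_mask(self, fd, mask):
        if mask == 0:
            if fd in self._masks:
                self._poller.unregister(fd)
                del self._masks[fd]
        elif fd in self._masks:
            self._poller.modify(fd, mask)
            self._masks[fd] = mask
        else:
            self._poller.register(fd, mask)
            self._masks[fd] = mask

    def watch_readable(self, fd):
        self._set_mask(fd, self._masks.get(fd, 0) | self.READ)

    def watch_writable(self, fd):
        self._set_mask(fd, self._masks.get(fd, 0) | self.WRITE)

    def unwatch(self, fd):
        self._set_mask(fd, 0)

    def poll(self, timeout_seconds):
        """Return (readable, writable) lists; the timeout is given in seconds."""
        readable, writable = [], []
        # poll() itself wants milliseconds.
        for fd, event in self._poller.poll(int(timeout_seconds * 1000)):
            mask = self._masks[fd]
            if mask & self.READ and event & (self.READ | select.POLLHUP):
                readable.append(fd)
            if mask & self.WRITE and event & self.WRITE:
                writable.append(fd)
        return readable, writable
```

Descriptors are registered once when a child starts and removed when it exits, so each pass through the main loop only pays for the poll() call itself.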

Owner

mcdonc commented Sep 2, 2011

FWIW, I think there is a way to compile a Python (involving some FD_SETSIZE hackery IIRC) that allows for more file descriptors to be accessible by select(). Googling doesn't lead to any obvious URLs however.

Running into the error in the initial post. Any advice on how to deal with it?

Owner

mcdonc commented Sep 7, 2011

Currently the only workaround is to compile a Python that supports > 1024 file descriptors and run supervisor under that.

Any idea how to do that? Google hasn't been helpful. I have the python 2.7.2 source extracted and ready to go.

Owner

mcdonc commented Sep 7, 2011

Nope. As I said in a previous entry, I could not find a suitable Google entry. Likely have to either ask on python newsgroup or stackoverflow.

Shouldn't supervisor just switch from select.select to select.poll? By my math (5 fds per child), this restricts supervisor to about 204 processes, actually fewer if you subtract stdin/stdout/stderr, listeners for rpc/http, and whatever shlibs python has open. So maybe 200 or 201.
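
That estimate can be written out as a quick calculation. This is a rough sketch: it assumes descriptors are handed out densely from 0 and that FD_SETSIZE is 1024. The reserved count of 20 is a made-up figure standing in for stdin/stdout/stderr, listeners, and other open files.

```python
FD_SETSIZE = 1024  # select() rejects descriptors numbered FD_SETSIZE or higher


def max_supervised_children(fds_per_child=5, reserved_fds=0):
    """Estimate how many children fit before descriptor numbers reach FD_SETSIZE."""
    return (FD_SETSIZE - reserved_fds) // fds_per_child


print(max_supervised_children())                 # 204
print(max_supervised_children(reserved_fds=20))  # 200
```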

For the time being, we are probably going to cope with this by running two instances of supervisord and splitting our workload among them.

I am taking a whack at fixing this myself, since Chuck Adams can't seem to find his change set: https://github.com/linuxtampa/supervisor

We just ran into this issue in production today. Not sure what our interim solution will be. Running two supervisors would be awkward.

Owner

mcdonc commented Jan 14, 2012

I looked into maybe trying to implement the mainloop in terms of select.poll, but it doesn't appear to work on Mac OS X, or at least the out-of-the-box Python builds on Mac OS X don't support it:

http://bugs.python.org/issue5154

Bleh.

I tried making that change too... it seemed to work for the first day, but then it got all sluggish and eventually got stuck. I don't really know what I'm doing wrong, but here's my fork.

https://github.com/linuxtampa/supervisor

tlj

Contributor

igorsobreira commented Feb 14, 2012

Hi, I've started working on replacing select() with poll(); my fork is at: https://github.com/igorsobreira/supervisor/commits/master
It kinda works now, but there is a lot still to be done. This first commit is just trying to understand the solution...

  • I need to verify what's the correct bitmask to use when registering a file descriptor with poll()
  • Most of the tests pass (I've executed just python setup.py test, not using tox yet); there are just 3 failures
  • As @mcdonc pointed out, OS X's Python doesn't have select.poll(), so I plan to use select.kqueue in this case. For this I will detect the OS and move the poll() call to self.options (as it works with select() now), which will use the correct one based on the OS
  • There is an error being raised when the process starts: Cannot allocate memory. This is probably because it's trying to read the fd but supervisor dispatcher says the process is STARTING. It should not be a big problem though

I would love some feedback, and please let me know if I'm on the wrong track.
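
The backend selection described in the list above could be sketched as follows. Note that this checks which calls the select module exposes rather than detecting the OS, which is a swapped-in technique and not what the fork does. The names pick_backend and wait_readable are hypothetical, and the POLLIN | POLLPRI mask is one plausible choice, not a confirmed answer to the bitmask question.

```python
import select


def pick_backend():
    """Return the name of the readiness API to use on this Python build."""
    if hasattr(select, "poll"):
        return "poll"
    if hasattr(select, "kqueue"):
        return "kqueue"
    return "select"


def wait_readable(fds, timeout_seconds):
    """Return the descriptors in fds that are readable, using the picked backend."""
    backend = pick_backend()
    if backend == "poll":
        poller = select.poll()
        for fd in fds:
            poller.register(fd, select.POLLIN | select.POLLPRI)
        # poll() wants milliseconds.
        events = poller.poll(int(timeout_seconds * 1000))
        ready = (select.POLLIN | select.POLLPRI | select.POLLHUP)
        return [fd for fd, event in events if event & ready]
    if backend == "kqueue":
        kq = select.kqueue()
        try:
            changes = [
                select.kevent(fd, filter=select.KQ_FILTER_READ,
                              flags=select.KQ_EV_ADD)
                for fd in fds
            ]
            # kqueue's control() takes its timeout in seconds.
            events = kq.control(changes, len(fds), timeout_seconds)
            return [event.ident for event in events]
        finally:
            kq.close()
    # Last resort; still subject to the FD_SETSIZE limit.
    readable, _, _ = select.select(fds, [], [], timeout_seconds)
    return readable
```

Feature detection keeps working if a platform's Python later gains poll(), whereas a hard-coded OS check would need updating.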

Owner

mnaberez commented Sep 11, 2012

See also #145 which sounds like it is also caused by this issue.

weissi commented Jan 10, 2013

What is the status with this issue? @igorsobreira what about your pull request?

Contributor

igorsobreira commented Jan 10, 2013

@weissi my pull request has a working solution; I mean there are no more features I had in mind that are needed. But two issues were reported on the pull request (maybe they are the same, see the comments), and I haven't had time to dive into those yet. I plan to investigate this hands-on on Linux this weekend.

Anyway, it needs more testing, and maybe an update to supervisor master.

weissi commented Jan 10, 2013

Cool, thank you!

spleeyah commented Apr 5, 2013

Any possible hope of this being addressed? :(

Our organization is hitting this same bug too. It's a pretty big deal.

Same bug. Thanks to @igorsobreira, his version works fine for me.

Just installed 3.0 and hitting this issue. Is there a plan to resolve this?

akimicyu commented Jan 9, 2014

I hit the same bug. Thanks, @igorsobreira, your work is very cool.

Owner

mnaberez commented Aug 10, 2014

Fixed in 9e6aa44 (PR #129).

@mnaberez mnaberez closed this Aug 10, 2014
