Skip to content

Fix 100% CPU usage#875

Merged
mnaberez merged 3 commits into
Supervisor:masterfrom
jaredsuttles:fix_high_cpu_usage_bug
Apr 20, 2017
Merged

Fix 100% CPU usage#875
mnaberez merged 3 commits into
Supervisor:masterfrom
jaredsuttles:fix_high_cpu_usage_bug

Conversation

@jaredsuttles

Copy link
Copy Markdown

I ran into this bug using the web UI on macOS. It seems that #589 was on the right track but has unnecessarily strict rules to unregister a file descriptor. Only need to check if the writers are not writeable anymore - readers not readable.

@mnaberez

Copy link
Copy Markdown
Member

@jaredsuttles Thanks for this patch! It looks like the CI failure is unrelated (#873). There's another poller bug reported in #858. Does this patch happen to fix that as well, and if not, could you possibly look at it as well?

@mnaberez

Copy link
Copy Markdown
Member

CI failure is unrelated and should be fixed since fa7e3f5.

@jaredsuttles

Copy link
Copy Markdown
Author

@mnaberez Looked into the other issue. I think I found a solution - please verify #885

@mnaberez

Copy link
Copy Markdown
Member

@jaredsuttles Thanks much for the #885. I will test and intend to apply it if there aren't issues. Can you please check out #903? It is another patch that aims to fix the same issue as this one, but it's completely different. I could use help figuring out the best thing to do here.

@mhariri

mhariri commented Mar 11, 2017

Copy link
Copy Markdown

Reading @jaredsuttles code, I think he has a better fix than mine. So I think this one should go in.

@mhariri mhariri left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Can we get this in? The bug is really irritating...

@jaredsuttles

Copy link
Copy Markdown
Author

@mnaberez @mhariri Just a note regarding #903 - it does indeed seem to fix the issue for kqueue poller but the other pollers are also broken if you force their use.

@mnaberez

Copy link
Copy Markdown
Member

but the other pollers are also broken if you force their use.

Could you please give me some more specific details on how they are broken so that we can create tickets for them?

Do you think it would be better to just revert the poller patch entirely? It was added only recently and prior to it, supervisord was a relatively stable program. The main purpose of the poller patch is to allow supervisord to monitor a large number of file descriptors, which came up in tickets from time to time, but wasn't a problem faced by most users. I didn't merge the poller patch so I'm less familiar with all its details. I did reach out to the author of the patch (@igorsobreira) in other tickets and he hasn't responded. I'm surprised to hear that select is broken even, considered that supervisord prior to the poller patch used select and didn't have these problems.

@jaredsuttles

Copy link
Copy Markdown
Author

@mnaberez Sorry, I simply meant the other pollers have the same 100% CPU usage issue so they would also need to be dealt with were you to use #903. You can test this by just setting for instance Poller = SelectPoller at the end of poller.py

I haven't really looked at the code prior to that merge so I can't really say one way or the other.

@mhariri

This comment has been minimized.

@mnaberez

This comment has been minimized.

@mnaberez mnaberez closed this Mar 21, 2017
@mnaberez mnaberez reopened this Mar 21, 2017
@mnaberez

Copy link
Copy Markdown
Member

@jaredsuttles Thanks for the clarification above on #903. I hope to keep the kqueue poller, otherwise we have to reopen the other issues it was implemented to address. I will test this patch and #885 locally soon and if there are no issues, I'll try to make a bug fix release with them.

@mnaberez mnaberez left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear on the changes to PollPoller in unregister_readable() and unregister_writable().

Comment thread supervisor/poller.py
self.readables.discard(fd)
self._poller.unregister(fd)
if fd in self.writables:
self._poller.register(fd, self.WRITE)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am reading this correctly, it seems like unregistering a readable causes all writables to be (re)registered(?). Could you please explain this? Same thing in unregister_writable below.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In kqueue you can unregister a file descriptor for specifically read or write but, as far as I can tell, you cannot do that with poll. The workaround is to just reregister for write if we unregistered read and vice versa.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for clarifying.

@mnaberez mnaberez left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I'll test this locally and will merge it if I don't find any issues.

@FAKERINHEART

Copy link
Copy Markdown

I tried this branch.
I found that when I used commands 'stop program_name and start program_name', it would raise 'httplib.BadStatusLine' Error.

image

My python version is 2.7.12

@jaredsuttles

Copy link
Copy Markdown
Author

@FAKERINHEART I'm not able to reproduce. This doesn't happen on master?

@mnaberez

Copy link
Copy Markdown
Member

@FAKERINHEART I'm testing this change locally and I'm also not able to see the error you posted. Can you please help us reproduce it or let us know if you're no longer seeing it?

@FAKERINHEART

Copy link
Copy Markdown

It was my own mistake for the 'BadStatusLine' error. I am sorry to forget to reply this in time.

@mnaberez

Copy link
Copy Markdown
Member

No problem, thanks for letting us know!

@mhariri

mhariri commented Apr 16, 2017

Copy link
Copy Markdown

When can we get this merged @mnaberez ?

@mnaberez

Copy link
Copy Markdown
Member

I tested this patch locally and it worked for me. I plan to merge and release it next week.

@mnaberez

Copy link
Copy Markdown
Member

@plockaby tried this patch and said in #807 (comment):

That patch did not solve my problem. I almost immediately had one of the updated systems (not the same one as before) go into a poll craziness:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 593446}, NULL) = 0
gettimeofday({1492367896, 593515}, NULL) = 0
gettimeofday({1492367896, 593584}, NULL) = 0
gettimeofday({1492367896, 593637}, NULL) = 0
gettimeofday({1492367896, 593719}, NULL) = 0
gettimeofday({1492367896, 593788}, NULL) = 0
gettimeofday({1492367896, 593826}, NULL) = 0
gettimeofday({1492367896, 593863}, NULL) = 0
gettimeofday({1492367896, 593886}, NULL) = 0
gettimeofday({1492367896, 593906}, NULL) = 0
gettimeofday({1492367896, 593948}, NULL) = 0
gettimeofday({1492367896, 593987}, NULL) = 0
gettimeofday({1492367896, 594031}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 594103}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 594331}, NULL) = 0
gettimeofday({1492367896, 594415}, NULL) = 0
gettimeofday({1492367896, 594491}, NULL) = 0
gettimeofday({1492367896, 594561}, NULL) = 0
gettimeofday({1492367896, 594642}, NULL) = 0
gettimeofday({1492367896, 594663}, NULL) = 0
gettimeofday({1492367896, 594699}, NULL) = 0
gettimeofday({1492367896, 594739}, NULL) = 0
gettimeofday({1492367896, 594769}, NULL) = 0
gettimeofday({1492367896, 594808}, NULL) = 0
gettimeofday({1492367896, 594836}, NULL) = 0
gettimeofday({1492367896, 594887}, NULL) = 0
gettimeofday({1492367896, 594934}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 595005}, NULL) = 0

lr-x------ 1 root root 64 Apr 16 10:37 0 -> /dev/null
lrwx------ 1 root root 64 Apr 16 10:37 1 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 10 -> pipe:[36606043]
lr-x------ 1 root root 64 Apr 16 10:37 11 -> pipe:[36606045]
l-wx------ 1 root root 64 Apr 16 10:37 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 16 10:37 13 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 16 10:37 14 -> pipe:[36606047]
lr-x------ 1 root root 64 Apr 16 10:37 15 -> pipe:[36606046]
lr-x------ 1 root root 64 Apr 16 10:37 16 -> pipe:[36606048]
l-wx------ 1 root root 64 Apr 16 10:37 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 16 10:37 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 16 10:37 19 -> pipe:[36606050]
lrwx------ 1 root root 64 Apr 16 10:37 2 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 20 -> pipe:[36606049]
lr-x------ 1 root root 64 Apr 16 10:37 21 -> pipe:[36606051]
l-wx------ 1 root root 64 Apr 16 10:37 22 -> /data/logs/supervisor/dart-agent.log
l-wx------ 1 root root 64 Apr 16 10:37 24 -> pipe:[36606053]
lr-x------ 1 root root 64 Apr 16 10:37 25 -> pipe:[36606052]
lr-x------ 1 root root 64 Apr 16 10:37 26 -> pipe:[36606054]
l-wx------ 1 root root 64 Apr 16 10:37 27 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 16 10:37 28 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 16 10:37 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 10:37 30 -> pipe:[36606055]
l-wx------ 1 root root 64 Apr 16 10:37 32 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 16 10:37 33 -> /data/logs/supervisor/pgwatch.err
lrwx------ 1 root root 64 Apr 16 10:37 4 -> socket:[36606038]
l-wx------ 1 root root 64 Apr 16 10:37 5 -> /data/logs/supervisor/dart-agent.err
l-wx------ 1 root root 64 Apr 16 10:37 6 -> pipe:[36606041]
lr-x------ 1 root root 64 Apr 16 10:37 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 16 10:37 8 -> pipe:[36606042]
l-wx------ 1 root root 64 Apr 16 10:37 9 -> pipe:[36606044]

@plockaby

Copy link
Copy Markdown
Contributor

I feel it may be prudent to add to this thread that this 100% cpu thing only started happening when I added minfds and childlogdir to my configurations. Of course, I haven't tried removing those to see if the problem goes away but I will try to do that today.

@jaredsuttles

Copy link
Copy Markdown
Author

@mnaberez @plockaby It sounds like this is a separate (but similar) issue. This PR still fixes the polling on macOS.

@plockaby

Copy link
Copy Markdown
Contributor

I agree that it is separate. Don't let my problem hold up merging this pull request.

@mnaberez

Copy link
Copy Markdown
Member

This is what I tried to reproduce the high CPU usage:

Minimal supervisord.conf:

[supervisord]
logfile = /tmp/supervisord.log
loglevel = blather

[inet_http_server]
port = 127.0.0.1:9001

[supervisorctl]
serverurl = http://127.0.0.1:9001

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

Start supervisord in the foreground with that config:

$ supervisord -n -c /path/to/supervisord.conf

Open a web browser to http://127.0.0.1:9001. The supervisord log in the foreground will start spewing lines like BLAT read event and won't stop. I believe this is the high CPU usage.

I was able to see this on macOS Sierra (10.12.4) with Python 2.7.13 using the current master (3d128e4) and applying this patch fixed it.

I also tried this on Ubuntu Desktop 16.04.1 LTS with Python 2.7.12. On that system, I wasn't able to see the problem before or after the patch.

@mnaberez mnaberez merged commit 610e832 into Supervisor:master Apr 20, 2017
@dkrieger

Copy link
Copy Markdown

@plockaby I'm also seeing poll craziness (very similar looking strace) and 100% CPU usage, on a linux rackspace box. did you open a separate issue for the issue you found? I'm running 3.3.3 fwiw, python 2.7.6

@plockaby

Copy link
Copy Markdown
Contributor

@dkrieger I worked around my problem by making my program's logger write to syslog and then not using supervisor to handle the logging from the event handler. I think I just hijacked other 100% CPU issues rather than making my own. Look at #807

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants