100% CPU usage (maybe caused by new poll implementation?) #807

Open
timonwong opened this Issue Aug 6, 2016 · 43 comments

timonwong commented Aug 6, 2016

We were using supervisor 3.3.0 on Ubuntu 14.04 LTS.

Recently, on some of our AWS nodes, we spotted very high CPU usage from supervisord. To work around it, we have to reload supervisord, but the problem can happen again within a day.

Reading the strace output, we saw excessive calls to both 'gettimeofday' and 'poll', so we chose to downgrade supervisor to 3.2.3.

I see there was #581, but I think it's irrelevant here. Our wild guess is that this is caused by the new poll implementation introduced in 3.3.0 (and maybe triggered by simultaneous log output?)...

Thanks in advance!

mnaberez (Member) commented Aug 6, 2016

Recently, on some of our AWS nodes, we spotted very high CPU usage from supervisord. To work around it, we have to reload supervisord, but the problem can happen again within a day.

@hathawsh also saw high CPU usage from the new poller implementation and submitted #589. That was merged before 3.3.0 was released.

Reading the strace output, we saw excessive calls to both 'gettimeofday' and 'poll', so we chose to downgrade supervisor to 3.2.3.

It's possible this is the main loop spinning, since poller.poll() does a poll and, just below it, tick() does a gettimeofday.
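
(As an illustration of that loop shape only, this is a simplified sketch and not supervisord's actual code: each pass makes one poll(2) call and then several gettimeofday(2) calls via time.time().)

import os
import select
import time

def main_loop(readable_fds, timeout_ms=1000):
    # Simplified sketch of a poll-based supervisor loop (not the real code).
    poller = select.poll()
    for fd in readable_fds:
        # the same event mask that shows up in the strace output in this issue
        poller.register(fd, select.POLLIN | select.POLLPRI | select.POLLHUP)
    while True:
        for fd, _events in poller.poll(timeout_ms):  # poll(2), up to 1s timeout
            os.read(fd, 2 ** 16)   # the real loop hands this to a dispatcher/logger
        time.time()                # gettimeofday(2) for TICK/uptime bookkeeping
        # ...the real loop also runs process state transitions and wait4(WNOHANG)
        # child reaping here, each consulting the clock again.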

Can you run 3.3.0 with loglevel=blather? That will produce as much debugging information as possible. Perhaps it will have some clues as to why this is happening.
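
(For reference, and assuming the stock configuration file location, the log level is set in the [supervisord] section:)

; e.g. in /etc/supervisord.conf
[supervisord]
loglevel=blather   ; most verbose of: critical, error, warn, info, debug, trace, blather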

and maybe triggered by simultaneous log output?

There weren't any changes to logging between 3.2.3 and 3.3.0. Since you said 3.2.3 works for you, I wouldn't suspect logging.

timonwong (Author) commented Aug 8, 2016

Hi!
Today I hit the problem again. Here is a gzipped strace log (duration ~1s):
strace.log.tar.gz
And here are the fds:

0 -> /dev/null
1 -> /dev/null
10 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
100 -> pipe:[32356299]
101 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
102 -> pipe:[32356298]
103 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
104 -> pipe:[32356300]
105 -> /data/supervisor_log/bg_task_00-stderr.log
106 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
107 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
108 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
109 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
11 -> pipe:[3024]
110 -> /data/supervisor_log/bg_task_00-stdout.log
111 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
112 -> /data/supervisor_log/bg_task_00-stderr.log
113 -> pipe:[32356302]
114 -> pipe:[32356301]
115 -> pipe:[32356303]
116 -> /data/supervisor_log/bg_task_00-stdout.log.1
117 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
118 -> /data/supervisor_log/bg_task_00-stderr.log
119 -> /data/supervisor_log/bg_task_00-stdout.log.8
12 -> pipe:[17721748]
121 -> pipe:[32356304]
124 -> /data/supervisor_log/bg_task_00-stderr.log
13 -> pipe:[17721749]
14 -> pipe:[11775]
15 -> pipe:[3025]
16 -> pipe:[11776]
17 -> /data/supervisor_log/bg_task_02-stdout.log
18 -> /data/supervisor_log/bg_task_02-stderr.log
19 -> pipe:[17721751]
2 -> /dev/null
20 -> pipe:[11777]
21 -> pipe:[17721750]
22 -> /data/supervisor_log/bg_task_03-stdout.log
23 -> /data/supervisor_log/bg_task_03-stderr.log
24 -> pipe:[11827]
25 -> pipe:[17721752]
26 -> pipe:[11828]
27 -> /data/supervisor_log/bg_task_01-stdout.log.1
28 -> /data/supervisor_log/bg_task_01-stderr.log.2
29 -> pipe:[17721754]
3 -> /data/supervisor_log/supervisord.log
30 -> pipe:[11829]
31 -> pipe:[17721753]
32 -> /data/supervisor_log/bg_task_04-stdout.log
33 -> /data/supervisor_log/bg_task_04-stderr.log
34 -> pipe:[17721755]
35 -> /data/supervisor_log/bg_task_01-stdout.log
36 -> /data/supervisor_log/bg_task_01-stderr.log.1
37 -> pipe:[17721757]
38 -> pipe:[17721756]
39 -> pipe:[17721758]
4 -> socket:[13073]
40 -> /data/supervisor_log/bg_task_01-stdout.log.3
41 -> /data/supervisor_log/bg_task_01-stderr.log
42 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
43 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
44 -> pipe:[17721759]
45 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
46 -> pipe:[17719642]
47 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
48 -> pipe:[17719643]
49 -> /data/supervisor_log/bg_task_01-stdout.log.2
5 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
50 -> pipe:[17719644]
51 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
52 -> /data/supervisor_log/bg_task_01-stderr.log.3
53 -> /data/supervisor_log/bg_task_05-stdout.log
54 -> /data/supervisor_log/bg_task_05-stderr.log
55 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
56 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
57 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
58 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
59 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
6 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
60 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
61 -> pipe:[30456289]
62 -> pipe:[30456290]
63 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
64 -> /data/supervisor_log/node_exporter.log
65 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
66 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
67 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
68 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
69 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
7 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
70 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
71 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
72 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
73 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
74 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
75 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
76 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
77 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
78 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
79 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
8 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
80 -> /data/supervisor_log/bg_task_00-stderr.log
81 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
82 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
83 -> /data/supervisor_log/bg_task_00-stderr.log
84 -> /data/supervisor_log/bg_task_00-stdout.log.2
85 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
86 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
87 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
88 -> /data/supervisor_log/bg_task_00-stderr.log
89 -> /data/supervisor_log/bg_task_00-stderr.log
9 -> pipe:[3023]
90 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
91 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
92 -> /data/supervisor_log/bg_task_00-stdout.log.8
93 -> pipe:[32356293]
94 -> pipe:[32356294]
95 -> pipe:[32356296]
96 -> pipe:[32356295]
97 -> pipe:[32356297]
98 -> /data/supervisor_log/bg_task_00-stdout.log.3
99 -> /data/supervisor_log/bg_task_00-stderr.log

I haven't enabled the blather log level yet; I'll try it now and see what's going on.

timonwong (Author) commented Aug 10, 2016

Hi, me again. Here is the supervisord log with the blather level enabled:
supervisord.log.tar.gz

blusewang commented Dec 21, 2016

Me too. Every time I log in over HTTP and restart a program, my server's CPU immediately goes to 100%.
I've wasted so much time on this bug!
I use it to manage my Laravel queue. I hate it, but I have no choice!

gonesurfing commented Dec 26, 2016

This is 100% reproducible on 3.3.1 (Anaconda package on OS X).
- Start supervisord with the HTTP server on localhost:9001
- Open a new IPython console, import xmlrpclib, and type:
server = xmlrpclib.Server('http://localhost:9001')
- Now run any method on server: the CPU instantly goes to 100% and stays there
- del(server) and the CPU goes back to normal

Bummer... this is a show stopper for what I need. Let me know if any other info is needed. I'm going to have to downgrade and see whether an older version avoids this behavior.

Update: I downgraded to 3.1.3 and this behavior is not present. Something definitely broke between those two versions, but I don't have any of the 3.2.x branch available to test easily.
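
(A minimal sketch of the reproduction described above, assuming supervisord's inet_http_server is listening on localhost:9001; getState() is just an example, any RPC call appears to do it:)

import xmlrpclib   # xmlrpc.client on Python 3

server = xmlrpclib.Server('http://localhost:9001')
print(server.supervisor.getState())   # CPU reportedly jumps to 100% and stays there
# ... the proxy (and its HTTP connection) stays alive here ...
del server                            # dropping the proxy lets the CPU return to normal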

cenkalti commented Dec 26, 2016

I can confirm this. Very frustrating 😞

mnaberez (Member) commented Dec 26, 2016

@igorsobreira A number of users have reported high CPU usage from supervisord and believe it was introduced by your patch in #129. Another case of high CPU usage was found in #589 and confirmed to be caused by that patch. Could you please look into this? There are reproduction instructions above in #807 (comment).

plockaby (Contributor) commented Apr 15, 2017

I want to add that, of the 76 servers I have running supervisord, I've had to stop/start supervisord on seven of them in the past four days because of this bug.

plockaby (Contributor) commented Apr 16, 2017

Now an eighth. A bit more detail this time: I am running supervisor 3.3.0. I installed it last May, as soon as it came out, because it had a patch of mine in it that I really wanted. I restarted supervisord on all my hosts to get the new version and it had been going just fine. Last week I made a global configuration change: I changed minfds to 100000 (because a specific process wanted it) and I set childlogdir, which had not been set before. I installed the new configuration on all my hosts and went around restarting supervisord. Now hosts are going into this poll spin loop at random. This is the strace log; these are spinning like crazy, eating up a whole CPU.

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=22, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=41, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 1 ([{fd=20, revents=POLLIN}])
gettimeofday({1492359703, 190995}, NULL) = 0
gettimeofday({1492359703, 191062}, NULL) = 0
gettimeofday({1492359703, 194873}, NULL) = 0
gettimeofday({1492359703, 194973}, NULL) = 0
gettimeofday({1492359703, 195072}, NULL) = 0
gettimeofday({1492359703, 195108}, NULL) = 0
gettimeofday({1492359703, 195153}, NULL) = 0
gettimeofday({1492359703, 195224}, NULL) = 0
gettimeofday({1492359703, 195254}, NULL) = 0
gettimeofday({1492359703, 195299}, NULL) = 0
gettimeofday({1492359703, 195327}, NULL) = 0
gettimeofday({1492359703, 195378}, NULL) = 0
gettimeofday({1492359703, 195446}, NULL) = 0
wait4(-1, 0x7ffea7758d04, WNOHANG, NULL) = 0
gettimeofday({1492359703, 195526}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=22, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=41, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 1 ([{fd=20, revents=POLLIN}])
gettimeofday({1492359703, 195874}, NULL) = 0
gettimeofday({1492359703, 195936}, NULL) = 0
gettimeofday({1492359703, 196000}, NULL) = 0
gettimeofday({1492359703, 196092}, NULL) = 0
gettimeofday({1492359703, 196166}, NULL) = 0
gettimeofday({1492359703, 196256}, NULL) = 0
gettimeofday({1492359703, 196336}, NULL) = 0
gettimeofday({1492359703, 196380}, NULL) = 0
gettimeofday({1492359703, 196520}, NULL) = 0
gettimeofday({1492359703, 196557}, NULL) = 0
gettimeofday({1492359703, 196599}, NULL) = 0
gettimeofday({1492359703, 196643}, NULL) = 0
gettimeofday({1492359703, 196689}, NULL) = 0
wait4(-1, 0x7ffea7758d04, WNOHANG, NULL) = 0
gettimeofday({1492359703, 196787}, NULL) = 0

On the particular host that is spinning right now, these are the fds:

total 0
lr-x------ 1 root root 64 Apr 11 15:42 0 -> /dev/null
lrwx------ 1 root root 64 Apr 11 15:42 1 -> socket:[74731265]
lr-x------ 1 root root 64 Apr 11 15:42 10 -> pipe:[74731311]
lr-x------ 1 root root 64 Apr 11 15:42 11 -> pipe:[74731313]
l-wx------ 1 root root 64 Apr 11 15:42 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 11 15:42 14 -> pipe:[77877459]
lr-x------ 1 root root 64 Apr 11 15:42 15 -> pipe:[74731314]
lr-x------ 1 root root 64 Apr 11 15:42 16 -> pipe:[77877460]
l-wx------ 1 root root 64 Apr 11 15:42 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 11 15:42 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 11 15:42 19 -> pipe:[74731318]
lrwx------ 1 root root 64 Apr 11 15:42 2 -> socket:[74731265]
l-wx------ 1 root root 64 Apr 11 15:42 20 -> /data/logs/supervisor/dart-agent.err
lr-x------ 1 root root 64 Apr 11 15:42 21 -> pipe:[74731319]
lr-x------ 1 root root 64 Apr 11 15:42 22 -> pipe:[77877461]
l-wx------ 1 root root 64 Apr 11 15:42 24 -> pipe:[74731321]
lr-x------ 1 root root 64 Apr 11 15:42 25 -> pipe:[74731320]
lr-x------ 1 root root 64 Apr 11 15:42 26 -> pipe:[74731322]
l-wx------ 1 root root 64 Apr 11 15:42 27 -> /data/logs/supervisor/statsd.log
l-wx------ 1 root root 64 Apr 11 15:42 28 -> /data/logs/supervisor/statsd.err
l-wx------ 1 root root 64 Apr 16 09:27 29 -> pipe:[74731324]
l-wx------ 1 root root 64 Apr 11 15:42 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 09:27 30 -> pipe:[74731323]
lr-x------ 1 root root 64 Apr 16 09:27 31 -> pipe:[74731325]
l-wx------ 1 root root 64 Apr 16 09:27 32 -> /data/logs/supervisor/redis.log
l-wx------ 1 root root 64 Apr 16 09:27 33 -> /data/logs/supervisor/redis.err
l-wx------ 1 root root 64 Apr 16 09:27 34 -> pipe:[74731327]
lr-x------ 1 root root 64 Apr 16 09:27 35 -> pipe:[74731326]
lr-x------ 1 root root 64 Apr 16 09:27 36 -> pipe:[74731328]
l-wx------ 1 root root 64 Apr 11 15:42 37 -> /data/logs/supervisor/dmca-admin-web.log
l-wx------ 1 root root 64 Apr 11 15:42 38 -> /data/logs/supervisor/dmca-admin-web.err
l-wx------ 1 root root 64 Apr 16 09:27 39 -> pipe:[74731330]
lrwx------ 1 root root 64 Apr 11 15:42 4 -> socket:[74731287]
lr-x------ 1 root root 64 Apr 16 09:27 40 -> pipe:[74731329]
lr-x------ 1 root root 64 Apr 11 15:42 41 -> pipe:[74731331]
l-wx------ 1 root root 64 Apr 11 15:42 42 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 11 15:42 43 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 11 15:42 44 -> /data/logs/supervisor/dart-agent.log
lr-x------ 1 root root 64 Apr 11 15:42 45 -> pipe:[74731332]
l-wx------ 1 root root 64 Apr 11 15:42 47 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 11 15:42 48 -> /data/logs/supervisor/pgwatch.err
l-wx------ 1 root root 64 Apr 11 15:42 5 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 11 15:42 6 -> pipe:[74731309]
lr-x------ 1 root root 64 Apr 11 15:42 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 11 15:42 8 -> pipe:[74731310]
l-wx------ 1 root root 64 Apr 11 15:42 9 -> pipe:[74731312]
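
(For reference, the global configuration change described above amounts to something like the following in the [supervisord] section; the childlogdir path here is only illustrative:)

[supervisord]
minfds=100000                       ; raised from the default (1024)
childlogdir=/data/logs/supervisor   ; newly set, was previously unset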

mnaberez (Member) commented Apr 16, 2017

@plockaby We're testing a fix for this over in #875. Are you able to try that patch?

plockaby (Contributor) commented Apr 16, 2017

I have now upgraded about two-thirds of my servers to 3.3.1 and installed the patch. If one of them takes over a CPU again I'll let you know, but it might be a few days. Sorry I didn't notice the conversation in the pull request.

plockaby (Contributor) commented Apr 16, 2017

That patch did not solve my problem. Almost immediately, one of the updated systems (not the same one as before) went into poll craziness:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 593446}, NULL) = 0
gettimeofday({1492367896, 593515}, NULL) = 0
gettimeofday({1492367896, 593584}, NULL) = 0
gettimeofday({1492367896, 593637}, NULL) = 0
gettimeofday({1492367896, 593719}, NULL) = 0
gettimeofday({1492367896, 593788}, NULL) = 0
gettimeofday({1492367896, 593826}, NULL) = 0
gettimeofday({1492367896, 593863}, NULL) = 0
gettimeofday({1492367896, 593886}, NULL) = 0
gettimeofday({1492367896, 593906}, NULL) = 0
gettimeofday({1492367896, 593948}, NULL) = 0
gettimeofday({1492367896, 593987}, NULL) = 0
gettimeofday({1492367896, 594031}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 594103}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 594331}, NULL) = 0
gettimeofday({1492367896, 594415}, NULL) = 0
gettimeofday({1492367896, 594491}, NULL) = 0
gettimeofday({1492367896, 594561}, NULL) = 0
gettimeofday({1492367896, 594642}, NULL) = 0
gettimeofday({1492367896, 594663}, NULL) = 0
gettimeofday({1492367896, 594699}, NULL) = 0
gettimeofday({1492367896, 594739}, NULL) = 0
gettimeofday({1492367896, 594769}, NULL) = 0
gettimeofday({1492367896, 594808}, NULL) = 0
gettimeofday({1492367896, 594836}, NULL) = 0
gettimeofday({1492367896, 594887}, NULL) = 0
gettimeofday({1492367896, 594934}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 595005}, NULL) = 0
lr-x------ 1 root root 64 Apr 16 10:37 0 -> /dev/null
lrwx------ 1 root root 64 Apr 16 10:37 1 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 10 -> pipe:[36606043]
lr-x------ 1 root root 64 Apr 16 10:37 11 -> pipe:[36606045]
l-wx------ 1 root root 64 Apr 16 10:37 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 16 10:37 13 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 16 10:37 14 -> pipe:[36606047]
lr-x------ 1 root root 64 Apr 16 10:37 15 -> pipe:[36606046]
lr-x------ 1 root root 64 Apr 16 10:37 16 -> pipe:[36606048]
l-wx------ 1 root root 64 Apr 16 10:37 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 16 10:37 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 16 10:37 19 -> pipe:[36606050]
lrwx------ 1 root root 64 Apr 16 10:37 2 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 20 -> pipe:[36606049]
lr-x------ 1 root root 64 Apr 16 10:37 21 -> pipe:[36606051]
l-wx------ 1 root root 64 Apr 16 10:37 22 -> /data/logs/supervisor/dart-agent.log
l-wx------ 1 root root 64 Apr 16 10:37 24 -> pipe:[36606053]
lr-x------ 1 root root 64 Apr 16 10:37 25 -> pipe:[36606052]
lr-x------ 1 root root 64 Apr 16 10:37 26 -> pipe:[36606054]
l-wx------ 1 root root 64 Apr 16 10:37 27 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 16 10:37 28 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 16 10:37 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 10:37 30 -> pipe:[36606055]
l-wx------ 1 root root 64 Apr 16 10:37 32 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 16 10:37 33 -> /data/logs/supervisor/pgwatch.err
lrwx------ 1 root root 64 Apr 16 10:37 4 -> socket:[36606038]
l-wx------ 1 root root 64 Apr 16 10:37 5 -> /data/logs/supervisor/dart-agent.err
l-wx------ 1 root root 64 Apr 16 10:37 6 -> pipe:[36606041]
lr-x------ 1 root root 64 Apr 16 10:37 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 16 10:37 8 -> pipe:[36606042]
l-wx------ 1 root root 64 Apr 16 10:37 9 -> pipe:[36606044]

mnaberez (Member) commented Apr 16, 2017

@plockaby I posted this over in #875. Can you please work with us there to reproduce?

plockaby (Contributor) commented Apr 16, 2017

Can do. I'll move to commenting over there instead.

plockaby (Contributor) commented May 12, 2017

I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts. I just wanted to bring up that this is still happening.

blusewang commented May 13, 2017

@plockaby I suggest you try using systemd instead of supervisor!

mnaberez (Member) commented May 14, 2017

I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts.

@plockaby I was able to reproduce one cause of high CPU usage and observe that #875 did fix it, which is why I merged #875. I haven't observed another cause of high CPU usage yet. If you are still experiencing problems with #875 applied, we are going to need reproduction instructions showing how to get supervisord into that state so we can work on it.

mnaberez (Member) commented Jun 3, 2017

Supervisor 3.3.2 was released to PyPI and it includes the poller fixes (#875, #885).

The only high CPU problem that I was able to reproduce is fixed by this release.

If you are experiencing a high CPU problem on Supervisor 3.3.2, please open a new issue and include instructions to reproduce it.

mnaberez closed this Jun 3, 2017

plockaby (Contributor) commented Jun 16, 2017

I've been on vacation so I'm just getting to this now.

I tried 3.3.2 and it didn't solve the CPU spikes. I actually rolled all of my hosts back to 3.2.3 under the assumption that it was a problem with the new polling code. I'm still experiencing supervisord going into polling loops that use 100% CPU.

I understand that you can't fix a problem that you can't reproduce. If I had any way to reproduce this consistently, I would show you. All I can give you, really, is situations where it is happening. Right now I can tell you that this is not caused by changes in 3.3.x. I can also tell you that it is (probably) not caused by multiple processes writing to their logs simultaneously as this doesn't happen with any more frequency on our heavily loaded hosts as compared to our lightly loaded hosts. In fact it has happened on hosts that have just one process controlled by supervisord just as often as it has happened on hosts that have forty processes.

If I ever do figure out how to reproduce it then I will create a new ticket. Would you mind if I continue to post information about when I experience the issue?

mnaberez (Member) commented Jun 16, 2017

Would you mind if I continue to post information about when I experience the issue?

Of course not. Please do that.

plockaby (Contributor) commented Jun 28, 2017

Interestingly, I've discovered that it often recovers from the polling problem. I identified one of my hosts that had started spinning CPU on polling, but it wasn't convenient to restart supervisord at the time, so I just left top running in a window on the side. After six days of 100% CPU, it stopped as suddenly as it started. Another host that I identified ran at 100% CPU for nearly five days before it stopped, but then it started again six hours later. (I've identified all of these by looking at collectd CPU graphs.)

I have an strace running on one of the hosts where the problem has cropped up several times and where I don't mind if the disk fills. When it starts on that host then I should have a bit more information about what is starting and/or stopping the problem.

I have also noticed that it is not happening on CentOS 7 hosts; so far it has only happened on CentOS 6 hosts. Unfortunately most of my hosts are still on CentOS 6. :)

plockaby (Contributor) commented Jun 30, 2017

I've narrowed the problem down even further: it is specific to the interaction between supervisord and an event listener on the communications channel between the two (i.e. my event listener's stdout). The file descriptor generating all of the poll events is my event listener's stdout file.

This explains why the problem would disappear: my event listener is configured to restart itself at random intervals and indeed when I restart it the 100% CPU spike goes away. I'm still examining the strace to figure out why the problem starts in the first place. The log itself isn't blowing up in size with insane numbers of writes.

mnaberez reopened this Jun 30, 2017

plockaby (Contributor) commented Jun 30, 2017

I was going to write a big long description with quotes from my strace but I think I discovered the problem. The problem happens when my event listener's log is rotated. Because it is an event listener and I really don't care about stdout for event listeners, my configuration is like this:

stdout_logfile = /data/logs/supervisor/dart-agent.log
stdout_logfile_maxbytes = 1MB
stdout_logfile_backups = 0

At the point where the strace shows my program starting to go crazy, this appears:

open("/data/logs/supervisor/dart-agent.log", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0

That is the log file for my event listener, and "5" is the file descriptor of that log file, which is what the poll loop is currently spinning on:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=12, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}], 14, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1498848417, 352015}, NULL) = 0
gettimeofday({1498848417, 352079}, NULL) = 0
gettimeofday({1498848417, 352141}, NULL) = 0
gettimeofday({1498848417, 352204}, NULL) = 0
gettimeofday({1498848417, 352268}, NULL) = 0
gettimeofday({1498848417, 352331}, NULL) = 0
gettimeofday({1498848417, 352483}, NULL) = 0
gettimeofday({1498848417, 352590}, NULL) = 0
gettimeofday({1498848417, 352769}, NULL) = 0
gettimeofday({1498848417, 352858}, NULL) = 0
wait4(-1, 0x7fff3600b334, WNOHANG, NULL) = 0
gettimeofday({1498848417, 352992}, NULL) = 0

I hope that is enough to go on?
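
(One hypothesis that fits this strace, offered as a side note rather than a confirmed diagnosis: poll(2) reports a regular file as readable immediately, so if a log file's descriptor ever ends up in the set that supervisord polls for POLLIN, every poll() returns at once and the loop spins. The snippet below shows that behavior in isolation; the path is just the log file named above, and any regular file works:)

import select

f = open('/data/logs/supervisor/dart-agent.log', 'rb')   # any regular file
poller = select.poll()
poller.register(f.fileno(), select.POLLIN | select.POLLPRI | select.POLLHUP)
for _ in range(3):
    # prints [(fd, POLLIN)] immediately instead of blocking up to 1 second
    print(poller.poll(1000))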

plockaby (Contributor) commented Aug 23, 2017

I have just discovered that this is not a problem specific to event listeners: a non-event-listener program just triggered the same problem. However, as far as I can tell, that's the first time that has happened.

lambdaq commented Sep 30, 2017

I once deleted a few log files and the CPU got stuck at 100%. Is that relevant?

mumutu66 commented Oct 26, 2017

Got the same problem too, on CentOS 7 with 3.3.1.
Is it confirmed that 3.3.2 solves this?

plockaby (Contributor) commented Dec 31, 2017

FWIW, I wanted to provide an update. This is still a problem. It is specific to event listeners and it happens on 3.2 and 3.3 (I have not tested 3.1 or master). The specific problem is tied to when the stdout log file rotates. I worked around it by completely disabling the stdout log file, writing stdout_logfile = NONE in my event listener's configuration. Since stdout on event listeners is mostly useless, this was not a big loss. Plus, it completely solved my problem, so there you go.
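
(A sketch of the workaround described, with a placeholder listener name, command, and events list:)

[eventlistener:mylistener]
command=/usr/local/bin/mylistener
events=PROCESS_STATE,TICK_60
stdout_logfile=NONE   ; no stdout log at all, so there is no stdout log rotation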

NKYB commented Jan 2, 2018

FWIW I can confirm that I've experienced almost exactly the same problem as described by @plockaby

  • a large number of cloud servers running Ubuntu 16.04 and Supervisor->Python
  • randomly, different machines go to 100% CPU; a reboot fixes it
  • setting stdout_logfile = NONE makes the problem go away
  • I also believe this is related to stdout_logfile rotation

plockaby (Contributor) commented Jan 2, 2018

My solution of setting stdout_logfile = NONE does not completely solve the problem. Now it spins at 100% doing POLLIN on the stderr log file when it rotates. I don't need to reboot the host, though; I just restart the event listener and the problem clears until the next time the log rotates. I suppose I could just not keep any logs from my event listeners?

engin commented Jun 29, 2018

I think mine was related to stderr redirection.
On one box, ~120 jobs are watched by supervisord, and supervisord's CPU usage was around 99%.

# supervisorctl version
3.3.4
# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.2 LTS
Release:	14.04
Codename:	trusty

find /etc/supervisor/conf.d -name "*.conf" -exec sh -c 'echo "redirect_stderr=true" >> "$0" ' {} \;

Now it has dropped to below 2%.
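
(Per program, the line appended by that find command is equivalent to adding the following to each [program:x] section; the program name and command here are placeholders:)

[program:myjob]
command=/usr/local/bin/myjob
redirect_stderr=true   ; send the child's stderr into its stdout channel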

devspyrosv commented Jul 4, 2018

Hello! I had a similar experience with supervisord 3.3.4

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial

strace -p

Shows tens of messages per second like the following:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=12, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=32, events=POLLIN|POLLPRI|POLLHUP}, {fd=38, events=POLLIN|POLLPRI|POLLHUP}, {fd=39, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=47, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=55, events=POLLIN|POLLPRI|POLLHUP}, {fd=61, events=POLLIN|POLLPRI|POLLHUP}, {fd=65, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 0 (Timeout)

Supervisord has been running for ~1.5 months and is managing 10 processes.

supervisorctl status all

Shows the same process ids (no process is constantly restarting etc).

Having read the comments above, I decided to do a little experiment: I started stopping, one by one, the services supervisord is managing, through the web interface.

Among the custom processes I am running, I have crashmailbatch and fatalmailbatch from the superlance package.

I was seeing the same frequency of polls in strace until I stopped fatalmailbatch.

Fatalmailbatch has the following config:

[eventlistener:fatalmailbatch]
command=fatalmailbatch --toEmail= --fromEmail= --smtpHost= --userName=... --password=... --subject="Alert - MYDOMAIN has crashed processes!" --interval=5.0
events=PROCESS_STATE,TICK_60
autostart=true
autorestart=true

The process was running for as long as supervisord was running without stopping.

I hope it helps somehow.

plockaby (Contributor) commented Aug 8, 2018

I know that earlier I stated this was being caused by an event listener, and that the problem went away after I modified the event listener's configuration to not log. The event listener was apparently a red herring, because I'm experiencing this problem yet again: supervisord is spinning like crazy on POLLIN on a log file. The only commonalities I'm seeing across the systems where this happens are that the file handle is consistently numbered "5" on every server, that it happens after a log rotation, and that the spinning stops when I restart the process whose log file is being spun on. It's possible this is also related to log events, which I also have enabled. I'd love to get some traction on this bug, as it has been plaguing me for almost two years. If the maintainer could give me some pointers on how I can assist in getting this addressed, I'd be grateful.

mnaberez (Member) commented Aug 8, 2018

If the maintainer could give me some pointers on how I can assist in getting this addressed, I'd be grateful.

I am one of the people with commit access to this repository, but I was not involved with any of the code in question. I am not very familiar with this particular code. The problem seems to have been introduced by #129, which was written by @igorsobreira and merged by @mcdonc. I attempted to contact @igorsobreira after reports of problems but he did not reply. Another contributor, @jaredsuttles, put some effort into the high CPU issue and submitted #875, which I merged. I appreciate all your follow-up, @plockaby, and would like to see this resolved also. Help is needed.

p01ymath commented Aug 20, 2018

In my case, it started happening a few days ago. Now I have to restart supervisor once every day, otherwise it creates unusual CPU spikes. I am using Supervisor 3.2.0 and the issue is still there. Is there any way we can solve this issue permanently? I haven't installed it from a git repository.

hathawsh (Contributor) commented Aug 21, 2018

@p01ymath: This bug has been tricky because we don't know how to reproduce it, but it sounds like you're running into it consistently. Your case could be key to solving it. Can you describe what programs Supervisor is running?

Some types of child processes do unusual things like closing the stdout file descriptor. Unusual behaviors can influence the operation of Supervisor and reveal issues.

p01ymath commented Aug 22, 2018

I've run into it twice so far, and it happens at random times. I am running Laravel queue workers using supervisor. The first time I faced this issue, I turned off logging completely and restarted supervisor. It worked, so I thought it was a logging issue.

After 2-3 days it happened again, and this time logs were already turned off, so I didn't restart supervisor directly. From htop, I found the worker that was using the majority of my CPU and turned off that individual process. My CPU usage was still 100%, so I checked htop again and found that another worker/process in the same program had started using 100% CPU. Then I turned off the whole process group and found that another worker in a different process group started using all the CPU.

Then I restarted supervisor and it worked. This happened about 3 days ago, but I haven't faced any issues since then. I am using supervisor 3.2.0 and I have 5 programs with 12 queue workers.

hathawsh (Contributor) commented Aug 24, 2018

@p01ymath, here's a guess: is it possible that some of your child processes are sending a large amount of log data through Supervisor? I'm not sure how much log data Supervisor can reasonably handle, but I suspect it would choke on a lot of logging data, especially if the child process happens to send only a character at a time.

hathawsh (Contributor) commented Sep 7, 2018

There's a new tool available called py-spy and I think it will greatly help anyone struggling with high CPU usage in Supervisor or any other Python process. If you're using a virtualenv, install it with 'bin/pip install py-spy' and run it using 'sudo bin/py-spy --pid PID' where PID is the PID of the supervisord process. It will show a 'top'-like breakdown of what the process is doing. Use ctrl-S to pause its output (for copy/paste) and ctrl-Q to resume.

AlirezaSadeghi commented Oct 28, 2018

I'm also seeing 100% CPU usage from supervisord when the supervised processes are themselves at 100% CPU. Here's the top-style output of py-spy:

 50.50%  50.50%    3.14s     3.14s   select (dist-packages/supervisor/options.py:1156)
  4.50%   9.50%   0.185s    0.355s   get_process_map (dist-packages/supervisor/supervisord.py:138)
  4.00%   5.00%   0.250s    0.305s   runforever (dist-packages/supervisor/supervisord.py:212)
  3.50%  11.00%   0.335s    0.655s   runforever (dist-packages/supervisor/supervisord.py:192)
  3.00%   3.00%   0.150s    0.150s   get_dispatchers (dist-packages/supervisor/process.py:691)
  3.00%   4.00%   0.225s    0.255s   runforever (dist-packages/supervisor/supervisord.py:214)
  3.00%   3.00%   0.140s    0.140s   get_dispatchers (dist-packages/supervisor/process.py:692)
  2.50%   3.00%   0.080s    0.105s   runforever (dist-packages/supervisor/supervisord.py:211)
  1.50%   1.50%   0.035s    0.035s   transition (dist-packages/supervisor/process.py:522)

Supervisor is taking a constant 80-100% CPU on my hosts; it's managing about 30 Node.js processes on a 32-core machine.

(I'm using v3.3.4.)

Any idea where that comes from, or whether it's expected?

dexterbt1 (Contributor) commented Oct 29, 2018

I recently deployed v3.3.4 across 6 servers running Ubuntu 18.04. Each server supervises as little as 5 and up to 70+ processes.

After a few days, almost all of the servers exhibited the high CPU usage issue. Restarting/rebooting does not help, as the problem reoccurs immediately. A server with 5 supervised child processes still experiences 100% CPU, same as the 70-process server.

I downgraded all these servers to v3.1.0 and the problem disappeared!
Right now, it's been 2 weeks with no issues at all.

I suspect the problem is related to stdout/stderr handling of supervised processes; as the CPU usage goes down if I disable/minimize the stdout/stderr output.

I'm reading a lot of comments re: the select/poll implementation and I say this is right on track.

devspyrosv commented Oct 29, 2018

Caught one of my supervisord instances doing the POLL thing with 100% CPU usage.
The logs of the processes were empty - no process was logging at the time.
In past encounters with this bug, I had tried without success the following:

  • disable logging
  • run less processes
  • run less CPU demanding processes
  • restart one or more processes
  • restart superlance plugins

This time when I found out that there was a CPU spike, I started strace and just opened the Web Interface.

The continuous polling stopped and the CPU usage became normal again.

This is no proof of course but it got me thinking.

hathawsh (Contributor) commented Oct 29, 2018

Great clues, guys. Keep them coming if you can.

devspyrosv commented Mar 1, 2019

I have 3 servers with identical specs running different counts of the same list of processes.

So let's say the servers are s1, s2 and s3, the processes are A, B, C, D, E, and the counts of the running processes are shown in the following table:

     A  B  C  D
s1   5  2  1  4
s2   2  2  4  4
s3   3  3  3  3

All the processes have the supervisord logging disabled (they log in their own separate log files).
All the servers have exactly the same packages and updates (ubuntu 16.04).
All the processes use the same python version.

Server s1 has not had any problems with high CPU usage from supervisord.
Servers s2 and s3 have the problem with the high CPU usage due to continuous polling (shown by strace -p pid_of_supervisord).
Usually I restart the processes one by one while watching strace, up to the point where strace shows that the continuous polling has stopped. Sometimes the first process restarted fixes the problem; sometimes it takes 4-5 restarts.
It's not always the same process whose restart stops the polling.
Nor has that process always been running for the same amount of time when its restart stops the polling.
Last but not least, the processes have nothing in common in terms of what they do, which non-builtin libraries they use, etc. The only thing they have in common is that they are all written in Python 3 and run under Python 3.5.x.

The restarts I mention are made through the web UI.
