100% CPU usage (maybe caused by new poll implementation?) #807

Open
timonwong opened this Issue Aug 6, 2016 · 38 comments

Comments

timonwong commented Aug 6, 2016

We were using supervisor 3.3.0 on Ubuntu 14.04 LTS.

Recently, on some of our AWS nodes, we have spotted very high CPU usage from supervisord. To work around it we have to reload supervisord, but the problem may return within a day.

Reading the strace output, we see excessive calls to both gettimeofday and poll, so for now we have chosen to downgrade supervisor to 3.2.3.

I see there was #581, but I think it is unrelated here. Our wild guess is that this is caused by the new poll implementation introduced in 3.3.0 (and maybe triggered by simultaneous log output?).

Thanks in advance!

mnaberez commented Aug 6, 2016

> Recently, on some of our AWS nodes, we have spotted very high CPU usage from supervisord. To work around it we have to reload supervisord, but the problem may return within a day.

@hathawsh also saw high CPU usage from the new poller implementation and submitted #589. That was merged before 3.3.0 was released.

> Reading the strace output, we see excessive calls to both gettimeofday and poll, so for now we have chosen to downgrade supervisor to 3.2.3.

It's possible this is the main loop spinning since poller.poll() does a poll and below it tick() does a gettimeofday.
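
To illustrate what that spinning would look like, here is a standalone sketch (not supervisord's actual code): if any fd registered with the poller is permanently "ready" and nothing drains or unregisters it, a poll() with a 1000 ms timeout returns immediately on every pass, so the loop never sleeps and the per-pass bookkeeping (where the gettimeofday calls come from) runs flat out.

import os
import select
import time

# A pipe whose write end has been closed: the read end is now permanently
# "ready" (EOF), much like a registered fd that never gets drained.
r, w = os.pipe()
os.close(w)

poller = select.poll()
poller.register(r, select.POLLIN | select.POLLPRI | select.POLLHUP)

wakeups = 0
start = time.time()
while time.time() - start < 1.0:
    poller.poll(1000)   # 1000 ms timeout, but it returns immediately every time
    wakeups += 1        # the loop's other per-pass work would run here, flat out

print("poll() returned %d times in one second" % wakeups)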

Can you run 3.3.0 with loglevel=blat? That will produce as much debugging information as possible. Perhaps it will have some clues as to why this is happening.

> and maybe triggered by simultaneous log output?

There weren't any changes to logging between 3.2.3 and 3.3.0. Since you said 3.2.3 works for you, I wouldn't suspect logging.

timonwong commented Aug 8, 2016

Hi!
Today I hit the problem again. Here is the gzipped strace log (duration ~1s):
strace.log.tar.gz
And here are the fds:

0 -> /dev/null
1 -> /dev/null
10 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
100 -> pipe:[32356299]
101 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
102 -> pipe:[32356298]
103 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
104 -> pipe:[32356300]
105 -> /data/supervisor_log/bg_task_00-stderr.log
106 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
107 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
108 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
109 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
11 -> pipe:[3024]
110 -> /data/supervisor_log/bg_task_00-stdout.log
111 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
112 -> /data/supervisor_log/bg_task_00-stderr.log
113 -> pipe:[32356302]
114 -> pipe:[32356301]
115 -> pipe:[32356303]
116 -> /data/supervisor_log/bg_task_00-stdout.log.1
117 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
118 -> /data/supervisor_log/bg_task_00-stderr.log
119 -> /data/supervisor_log/bg_task_00-stdout.log.8
12 -> pipe:[17721748]
121 -> pipe:[32356304]
124 -> /data/supervisor_log/bg_task_00-stderr.log
13 -> pipe:[17721749]
14 -> pipe:[11775]
15 -> pipe:[3025]
16 -> pipe:[11776]
17 -> /data/supervisor_log/bg_task_02-stdout.log
18 -> /data/supervisor_log/bg_task_02-stderr.log
19 -> pipe:[17721751]
2 -> /dev/null
20 -> pipe:[11777]
21 -> pipe:[17721750]
22 -> /data/supervisor_log/bg_task_03-stdout.log
23 -> /data/supervisor_log/bg_task_03-stderr.log
24 -> pipe:[11827]
25 -> pipe:[17721752]
26 -> pipe:[11828]
27 -> /data/supervisor_log/bg_task_01-stdout.log.1
28 -> /data/supervisor_log/bg_task_01-stderr.log.2
29 -> pipe:[17721754]
3 -> /data/supervisor_log/supervisord.log
30 -> pipe:[11829]
31 -> pipe:[17721753]
32 -> /data/supervisor_log/bg_task_04-stdout.log
33 -> /data/supervisor_log/bg_task_04-stderr.log
34 -> pipe:[17721755]
35 -> /data/supervisor_log/bg_task_01-stdout.log
36 -> /data/supervisor_log/bg_task_01-stderr.log.1
37 -> pipe:[17721757]
38 -> pipe:[17721756]
39 -> pipe:[17721758]
4 -> socket:[13073]
40 -> /data/supervisor_log/bg_task_01-stdout.log.3
41 -> /data/supervisor_log/bg_task_01-stderr.log
42 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
43 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
44 -> pipe:[17721759]
45 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
46 -> pipe:[17719642]
47 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
48 -> pipe:[17719643]
49 -> /data/supervisor_log/bg_task_01-stdout.log.2
5 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
50 -> pipe:[17719644]
51 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
52 -> /data/supervisor_log/bg_task_01-stderr.log.3
53 -> /data/supervisor_log/bg_task_05-stdout.log
54 -> /data/supervisor_log/bg_task_05-stderr.log
55 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
56 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
57 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
58 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
59 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
6 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
60 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
61 -> pipe:[30456289]
62 -> pipe:[30456290]
63 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
64 -> /data/supervisor_log/node_exporter.log
65 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
66 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
67 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
68 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
69 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
7 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
70 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
71 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
72 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
73 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
74 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
75 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
76 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
77 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
78 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
79 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
8 -> /data/supervisor_log/bg_task_01-stdout.log.10 (deleted)
80 -> /data/supervisor_log/bg_task_00-stderr.log
81 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
82 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
83 -> /data/supervisor_log/bg_task_00-stderr.log
84 -> /data/supervisor_log/bg_task_00-stdout.log.2
85 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
86 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
87 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
88 -> /data/supervisor_log/bg_task_00-stderr.log
89 -> /data/supervisor_log/bg_task_00-stderr.log
9 -> pipe:[3023]
90 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
91 -> /data/supervisor_log/bg_task_00-stdout.log.10 (deleted)
92 -> /data/supervisor_log/bg_task_00-stdout.log.8
93 -> pipe:[32356293]
94 -> pipe:[32356294]
95 -> pipe:[32356296]
96 -> pipe:[32356295]
97 -> pipe:[32356297]
98 -> /data/supervisor_log/bg_task_00-stdout.log.3
99 -> /data/supervisor_log/bg_task_00-stderr.log

I haven't enabled the blather log level yet; I'll try it now and see what's going on.

timonwong commented Aug 10, 2016

Hi, me again. Here is the supervisord log with the blather level enabled:
supervisord.log.tar.gz

blusewang commented Dec 21, 2016

Me too. Every time I log in over HTTP and restart a program, my server's CPU immediately goes to 100%.
I have wasted so much time on this bug!
I use it to manage my Laravel queue. I hate it, but I have no choice!

gonesurfing commented Dec 26, 2016

This is 100% reproducible on 3.3.1 (Anaconda package on OS X):
- Start supervisord with the HTTP server on localhost:9001.
- Open a new IPython console, import xmlrpclib, and type server = xmlrpclib.Server('http://localhost:9001').
- Now call any method on server; the CPU instantly goes to 100% and stays there.
- del(server) and the CPU goes back to normal.

Bummer... this is a showstopper for what I need. Let me know if any other info is needed. I'm going to have to downgrade and see if an older version doesn't show this behavior.

Update: I downgraded to 3.1.3 and this behavior is not present. Something definitely broke between those two versions, but I don't have any of the 3.2.x branch available to test easily.
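
For convenience, here are the steps above as a small script (an illustrative sketch assuming Python 2's xmlrpclib and a supervisord inet_http_server on localhost:9001 without auth; supervisor.getState is just one example call):

import time
import xmlrpclib   # Python 3: xmlrpc.client

# Any call made through the proxy triggers the behavior described above.
server = xmlrpclib.Server('http://localhost:9001')
print(server.supervisor.getState())

time.sleep(30)     # leave the client idle and watch supervisord in top
del server         # as reported above, dropping the proxy lets CPU fall back to normal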

cenkalti commented Dec 26, 2016

I can confirm this. Very frustrating 😞

mnaberez commented Dec 26, 2016

@igorsobreira A number of users have reported high CPU usage from supervisord and believe it was introduced by your patch in #129. Another case of high CPU usage was found in #589 and confirmed to be caused by the patch. Could you please look into this? There are reproduction instructions above in #807 (comment).

plockaby commented Apr 15, 2017

I want to add that, of the 76 servers I have running supervisord, I have had to stop/start supervisord on seven of them in the past four days because of this bug.

plockaby commented Apr 16, 2017

Now an eighth. A bit more detail this time. I am running supervisor 3.3.0. I installed it last May as soon as it came out because it had my patch in it that I really wanted. I restarted supervisor on all my hosts to get the new version and it had been going just fine. Last week I made a global configuration change: I changed minfds to 100000 (because a specific process wanted it) and I set childlogdir, which had not been set before. I installed that new configuration on all my hosts and went around restarting supervisord. Now hosts are going into this poll spin loop at random. Here is the strace log; these hosts are spinning like crazy, eating up a whole CPU.

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=22, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=41, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 1 ([{fd=20, revents=POLLIN}])
gettimeofday({1492359703, 190995}, NULL) = 0
gettimeofday({1492359703, 191062}, NULL) = 0
gettimeofday({1492359703, 194873}, NULL) = 0
gettimeofday({1492359703, 194973}, NULL) = 0
gettimeofday({1492359703, 195072}, NULL) = 0
gettimeofday({1492359703, 195108}, NULL) = 0
gettimeofday({1492359703, 195153}, NULL) = 0
gettimeofday({1492359703, 195224}, NULL) = 0
gettimeofday({1492359703, 195254}, NULL) = 0
gettimeofday({1492359703, 195299}, NULL) = 0
gettimeofday({1492359703, 195327}, NULL) = 0
gettimeofday({1492359703, 195378}, NULL) = 0
gettimeofday({1492359703, 195446}, NULL) = 0
wait4(-1, 0x7ffea7758d04, WNOHANG, NULL) = 0
gettimeofday({1492359703, 195526}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=22, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}, {fd=36, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=41, events=POLLIN|POLLPRI|POLLHUP}, {fd=45, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 1 ([{fd=20, revents=POLLIN}])
gettimeofday({1492359703, 195874}, NULL) = 0
gettimeofday({1492359703, 195936}, NULL) = 0
gettimeofday({1492359703, 196000}, NULL) = 0
gettimeofday({1492359703, 196092}, NULL) = 0
gettimeofday({1492359703, 196166}, NULL) = 0
gettimeofday({1492359703, 196256}, NULL) = 0
gettimeofday({1492359703, 196336}, NULL) = 0
gettimeofday({1492359703, 196380}, NULL) = 0
gettimeofday({1492359703, 196520}, NULL) = 0
gettimeofday({1492359703, 196557}, NULL) = 0
gettimeofday({1492359703, 196599}, NULL) = 0
gettimeofday({1492359703, 196643}, NULL) = 0
gettimeofday({1492359703, 196689}, NULL) = 0
wait4(-1, 0x7ffea7758d04, WNOHANG, NULL) = 0
gettimeofday({1492359703, 196787}, NULL) = 0

On the particular host that is having the problem right now, these are the fds:

total 0
lr-x------ 1 root root 64 Apr 11 15:42 0 -> /dev/null
lrwx------ 1 root root 64 Apr 11 15:42 1 -> socket:[74731265]
lr-x------ 1 root root 64 Apr 11 15:42 10 -> pipe:[74731311]
lr-x------ 1 root root 64 Apr 11 15:42 11 -> pipe:[74731313]
l-wx------ 1 root root 64 Apr 11 15:42 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 11 15:42 14 -> pipe:[77877459]
lr-x------ 1 root root 64 Apr 11 15:42 15 -> pipe:[74731314]
lr-x------ 1 root root 64 Apr 11 15:42 16 -> pipe:[77877460]
l-wx------ 1 root root 64 Apr 11 15:42 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 11 15:42 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 11 15:42 19 -> pipe:[74731318]
lrwx------ 1 root root 64 Apr 11 15:42 2 -> socket:[74731265]
l-wx------ 1 root root 64 Apr 11 15:42 20 -> /data/logs/supervisor/dart-agent.err
lr-x------ 1 root root 64 Apr 11 15:42 21 -> pipe:[74731319]
lr-x------ 1 root root 64 Apr 11 15:42 22 -> pipe:[77877461]
l-wx------ 1 root root 64 Apr 11 15:42 24 -> pipe:[74731321]
lr-x------ 1 root root 64 Apr 11 15:42 25 -> pipe:[74731320]
lr-x------ 1 root root 64 Apr 11 15:42 26 -> pipe:[74731322]
l-wx------ 1 root root 64 Apr 11 15:42 27 -> /data/logs/supervisor/statsd.log
l-wx------ 1 root root 64 Apr 11 15:42 28 -> /data/logs/supervisor/statsd.err
l-wx------ 1 root root 64 Apr 16 09:27 29 -> pipe:[74731324]
l-wx------ 1 root root 64 Apr 11 15:42 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 09:27 30 -> pipe:[74731323]
lr-x------ 1 root root 64 Apr 16 09:27 31 -> pipe:[74731325]
l-wx------ 1 root root 64 Apr 16 09:27 32 -> /data/logs/supervisor/redis.log
l-wx------ 1 root root 64 Apr 16 09:27 33 -> /data/logs/supervisor/redis.err
l-wx------ 1 root root 64 Apr 16 09:27 34 -> pipe:[74731327]
lr-x------ 1 root root 64 Apr 16 09:27 35 -> pipe:[74731326]
lr-x------ 1 root root 64 Apr 16 09:27 36 -> pipe:[74731328]
l-wx------ 1 root root 64 Apr 11 15:42 37 -> /data/logs/supervisor/dmca-admin-web.log
l-wx------ 1 root root 64 Apr 11 15:42 38 -> /data/logs/supervisor/dmca-admin-web.err
l-wx------ 1 root root 64 Apr 16 09:27 39 -> pipe:[74731330]
lrwx------ 1 root root 64 Apr 11 15:42 4 -> socket:[74731287]
lr-x------ 1 root root 64 Apr 16 09:27 40 -> pipe:[74731329]
lr-x------ 1 root root 64 Apr 11 15:42 41 -> pipe:[74731331]
l-wx------ 1 root root 64 Apr 11 15:42 42 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 11 15:42 43 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 11 15:42 44 -> /data/logs/supervisor/dart-agent.log
lr-x------ 1 root root 64 Apr 11 15:42 45 -> pipe:[74731332]
l-wx------ 1 root root 64 Apr 11 15:42 47 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 11 15:42 48 -> /data/logs/supervisor/pgwatch.err
l-wx------ 1 root root 64 Apr 11 15:42 5 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 11 15:42 6 -> pipe:[74731309]
lr-x------ 1 root root 64 Apr 11 15:42 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 11 15:42 8 -> pipe:[74731310]
l-wx------ 1 root root 64 Apr 11 15:42 9 -> pipe:[74731312]

mnaberez commented Apr 16, 2017

@plockaby We're testing a fix for this over in #875. Are you able to try that patch?

plockaby commented Apr 16, 2017

I have now upgraded to 3.3.1 and installed the patch on about two-thirds of my servers. If it takes over a CPU again I'll let you know, but it might be a few days. Sorry I didn't notice the conversation in the pull request.

plockaby commented Apr 16, 2017

That patch did not solve my problem. Almost immediately, one of the updated systems (not the same one as before) went into a poll frenzy:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 593446}, NULL) = 0
gettimeofday({1492367896, 593515}, NULL) = 0
gettimeofday({1492367896, 593584}, NULL) = 0
gettimeofday({1492367896, 593637}, NULL) = 0
gettimeofday({1492367896, 593719}, NULL) = 0
gettimeofday({1492367896, 593788}, NULL) = 0
gettimeofday({1492367896, 593826}, NULL) = 0
gettimeofday({1492367896, 593863}, NULL) = 0
gettimeofday({1492367896, 593886}, NULL) = 0
gettimeofday({1492367896, 593906}, NULL) = 0
gettimeofday({1492367896, 593948}, NULL) = 0
gettimeofday({1492367896, 593987}, NULL) = 0
gettimeofday({1492367896, 594031}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 594103}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 594331}, NULL) = 0
gettimeofday({1492367896, 594415}, NULL) = 0
gettimeofday({1492367896, 594491}, NULL) = 0
gettimeofday({1492367896, 594561}, NULL) = 0
gettimeofday({1492367896, 594642}, NULL) = 0
gettimeofday({1492367896, 594663}, NULL) = 0
gettimeofday({1492367896, 594699}, NULL) = 0
gettimeofday({1492367896, 594739}, NULL) = 0
gettimeofday({1492367896, 594769}, NULL) = 0
gettimeofday({1492367896, 594808}, NULL) = 0
gettimeofday({1492367896, 594836}, NULL) = 0
gettimeofday({1492367896, 594887}, NULL) = 0
gettimeofday({1492367896, 594934}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 595005}, NULL) = 0
lr-x------ 1 root root 64 Apr 16 10:37 0 -> /dev/null
lrwx------ 1 root root 64 Apr 16 10:37 1 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 10 -> pipe:[36606043]
lr-x------ 1 root root 64 Apr 16 10:37 11 -> pipe:[36606045]
l-wx------ 1 root root 64 Apr 16 10:37 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 16 10:37 13 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 16 10:37 14 -> pipe:[36606047]
lr-x------ 1 root root 64 Apr 16 10:37 15 -> pipe:[36606046]
lr-x------ 1 root root 64 Apr 16 10:37 16 -> pipe:[36606048]
l-wx------ 1 root root 64 Apr 16 10:37 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 16 10:37 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 16 10:37 19 -> pipe:[36606050]
lrwx------ 1 root root 64 Apr 16 10:37 2 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 20 -> pipe:[36606049]
lr-x------ 1 root root 64 Apr 16 10:37 21 -> pipe:[36606051]
l-wx------ 1 root root 64 Apr 16 10:37 22 -> /data/logs/supervisor/dart-agent.log
l-wx------ 1 root root 64 Apr 16 10:37 24 -> pipe:[36606053]
lr-x------ 1 root root 64 Apr 16 10:37 25 -> pipe:[36606052]
lr-x------ 1 root root 64 Apr 16 10:37 26 -> pipe:[36606054]
l-wx------ 1 root root 64 Apr 16 10:37 27 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 16 10:37 28 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 16 10:37 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 10:37 30 -> pipe:[36606055]
l-wx------ 1 root root 64 Apr 16 10:37 32 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 16 10:37 33 -> /data/logs/supervisor/pgwatch.err
lrwx------ 1 root root 64 Apr 16 10:37 4 -> socket:[36606038]
l-wx------ 1 root root 64 Apr 16 10:37 5 -> /data/logs/supervisor/dart-agent.err
l-wx------ 1 root root 64 Apr 16 10:37 6 -> pipe:[36606041]
lr-x------ 1 root root 64 Apr 16 10:37 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 16 10:37 8 -> pipe:[36606042]
l-wx------ 1 root root 64 Apr 16 10:37 9 -> pipe:[36606044]

mnaberez commented Apr 16, 2017

@plockaby I posted this over in #875. Can you please work with us there to reproduce?

plockaby commented Apr 16, 2017

Can do. I'll move to commenting over there instead.

plockaby commented May 12, 2017

I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts. I just wanted to bring up that this is still happening.

blusewang commented May 13, 2017

@plockaby I suggest you try systemd instead of supervisor!

mnaberez commented May 14, 2017

> I hate to harp on this but #875 didn't solve my problem and I'm still restarting supervisord every few days on random hosts.

@plockaby I was able to reproduce one cause of high CPU and observe that #875 did fix it, which is why I merged #875. I haven't observed another cause of high CPU usage yet. If you are still experiencing problems with #875 applied, we are going to need reproduction instructions that show us how to get supervisord into that state so we can work on it.

mnaberez commented Jun 3, 2017

Supervisor 3.3.2 was released to PyPI and it includes the poller fixes (#875, #885).

The only high CPU problem that I was able to reproduce is fixed by this release.

If you are experiencing a high CPU problem on Supervisor 3.3.2, please open a new issue and include instructions to reproduce it.

@mnaberez mnaberez closed this Jun 3, 2017

plockaby commented Jun 16, 2017

I've been on vacation so I'm just getting to this now.

I tried 3.3.2 and it didn't solve the CPU spikes. I actually rolled all of my hosts back to 3.2.3 under the assumption that it was a problem with the new polling code. I'm still experiencing supervisord going into polling loops that use 100% CPU.

I understand that you can't fix a problem that you can't reproduce. If I had any way to reproduce this consistently, I would show you. All I can give you, really, is situations where it is happening. Right now I can tell you that this is not caused by changes in 3.3.x. I can also tell you that it is (probably) not caused by multiple processes writing to their logs simultaneously, as it doesn't happen any more frequently on our heavily loaded hosts than on our lightly loaded ones. In fact it has happened on hosts that have just one process controlled by supervisord just as often as on hosts that have forty processes.

If I ever do figure out how to reproduce it then I will create a new ticket. Would you mind if I continue to post information about when I experience the issue?

mnaberez commented Jun 16, 2017

> Would you mind if I continue to post information about when I experience the issue?

Of course not. Please do that.

plockaby commented Jun 28, 2017

Interestingly, I've discovered that it often recovers from the polling problem on its own. I identified one of my hosts that had started spinning CPU on polling, but it wasn't convenient to restart supervisord at the time, so I just left top running in a window on the side. After six days of 100% CPU it just stopped, as suddenly as it started. Another host that I identified ran at 100% CPU for nearly five days before it stopped, but then it started again six hours later. (I've identified all of these by looking at collectd CPU graphs.)

I have an strace running on one of the hosts where the problem has cropped up several times and where I don't mind if the disk fills. When it starts on that host then I should have a bit more information about what is starting and/or stopping the problem.

I have also noticed that it is not happening on CentOS 7 hosts; so far it has only happened on CentOS 6 hosts. Unfortunately most of my hosts are still on CentOS 6. :)

plockaby commented Jun 30, 2017

I've narrowed the problem down further: it is specific to the interaction between supervisord and an event listener on the communication channel between the two (i.e. my event listener's stdout). The file descriptor that is generating all of the poll events is my event listener's stdout file.

This explains why the problem would disappear: my event listener is configured to restart itself at random intervals and indeed when I restart it the 100% CPU spike goes away. I'm still examining the strace to figure out why the problem starts in the first place. The log itself isn't blowing up in size with insane numbers of writes.

@mnaberez mnaberez reopened this Jun 30, 2017

plockaby commented Jun 30, 2017

I was going to write a big long description with quotes from my strace but I think I discovered the problem. The problem happens when my event listener's log is rotated. Because it is an event listener and I really don't care about stdout for event listeners, my configuration is like this:

stdout_logfile = /data/logs/supervisor/dart-agent.log
stdout_logfile_maxbytes = 1MB
stdout_logfile_backups = 0

At the point in the strace where things start going crazy, this appears:

open("/data/logs/supervisor/dart-agent.log", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0

That is my event listener's log file, and "5" is the file descriptor that the poll loop is currently spinning on:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=12, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}, {fd=31, events=POLLIN|POLLPRI|POLLHUP}, {fd=35, events=POLLIN|POLLPRI|POLLHUP}], 14, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1498848417, 352015}, NULL) = 0
gettimeofday({1498848417, 352079}, NULL) = 0
gettimeofday({1498848417, 352141}, NULL) = 0
gettimeofday({1498848417, 352204}, NULL) = 0
gettimeofday({1498848417, 352268}, NULL) = 0
gettimeofday({1498848417, 352331}, NULL) = 0
gettimeofday({1498848417, 352483}, NULL) = 0
gettimeofday({1498848417, 352590}, NULL) = 0
gettimeofday({1498848417, 352769}, NULL) = 0
gettimeofday({1498848417, 352858}, NULL) = 0
wait4(-1, 0x7fff3600b334, WNOHANG, NULL) = 0
gettimeofday({1498848417, 352992}, NULL) = 0

I hope that is enough to go on?
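
For what it's worth, a tiny standalone script shows why poll() can never settle down if a plain file's descriptor ends up in the poll set (an illustration only, assuming fd 5 in the trace really is the regular log file shown in the fd listings above): regular files always report ready to poll(), so the 1000 ms timeout never applies.

import os
import select

# Open an ordinary file for writing, roughly the way a child's log file is opened.
fd = os.open("/tmp/poll-demo.log", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

poller = select.poll()
poller.register(fd, select.POLLIN | select.POLLPRI | select.POLLHUP)

# A regular file is always "ready", so each poll() returns instantly instead of
# blocking for the timeout -- the same poll/gettimeofday firehose as in the strace.
for _ in range(5):
    print(poller.poll(1000))

os.close(fd)
os.unlink("/tmp/poll-demo.log")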

plockaby commented Aug 23, 2017

I have just discovered that this is not specific to event listeners: a non-event-listener program just triggered the same problem. However, that is the first time that has happened, as far as I can tell.

lambdaq commented Sep 30, 2017

I once deleted a few log files and the CPU got stuck at 100%. Is that relevant?

mumutu66 commented Oct 26, 2017

Got the same problem too, on CentOS 7 with 3.3.1.
Is it confirmed that 3.3.2 solves this?

Contributor

plockaby commented Dec 31, 2017

FWIW I wanted to provide an update. This is still a problem. It is specific to event listeners and it happens on 3.2 and 3.3 (I have not tested 3.1 or master). The specific problem is related to when the stdout log file rotates. I worked around it by completely disabling the stdout log file, writing stdout_logfile = NONE in my event listener's configuration. Since stdout on event listeners is mostly useless, this was not a big loss. Plus, it completely solved my problem, so there you go.
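
For background on why an event listener's stdout matters at all: a listener talks to supervisord over its stdin/stdout pipes using the READY/RESULT protocol, so stdout is reserved for protocol traffic rather than free-form logging (free-form output is supposed to go to stderr). A minimal listener looks roughly like this (a sketch close to the sample in the supervisor events documentation):

import sys

def write_stdout(s):
    # only event-listener protocol messages may go to stdout
    sys.stdout.write(s)
    sys.stdout.flush()

def main():
    while True:
        write_stdout('READY\n')                # tell supervisord we can take an event
        header_line = sys.stdin.readline()     # e.g. "ver:3.0 ... eventname:TICK_60 len:15"
        headers = dict(t.split(':') for t in header_line.split())
        payload = sys.stdin.read(int(headers['len']))
        sys.stderr.write(payload)              # do real work here; stderr is free-form
        sys.stderr.flush()
        write_stdout('RESULT 2\nOK')           # acknowledge the event

if __name__ == '__main__':
    main()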

NKYB commented Jan 2, 2018

FWIW I can confirm that I've experienced almost exactly the same problem as described by @plockaby

  • a large number of cloud servers running Ubuntu 16.04 and Supervisor->Python
  • randomly different machines go to 100% CPU, a reboot fixes it
  • setting stdout_logfile = NONE makes the problem go away
  • I also believe this is related to stdout_logfile rotating

Contributor

plockaby commented Jan 2, 2018

My workaround of setting stdout_logfile = NONE does not completely solve the problem. Now it spins at 100% on POLLIN for the stderr log file on rotation. I don't need to reboot the host, though: I just restart the event listener and the problem clears until the next time the log rotates. I suppose I could just not keep any logs from my event listeners?

engin commented Jun 29, 2018

I think mine was related to stderr redirection.
On one box, ~120 jobs are watched by supervisord, and supervisord's CPU usage was around 99%.

# supervisorctl version
3.3.4
# lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.2 LTS
Release:	14.04
Codename:	trusty
find /etc/supervisor/conf.d -name "*.conf" -exec sh -c 'echo "redirect_stderr=true" >> "$0" ' {} \;

Now it has dropped to below 2%.

devspyrosv commented Jul 4, 2018

Hello! I had a similar experience with supervisord 3.3.4

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial

strace -p

Shows tens of messages per second like the following:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=9, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=12, events=POLLIN|POLLPRI|POLLHUP}, {fd=18, events=POLLIN|POLLPRI|POLLHUP}, {fd=19, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=28, events=POLLIN|POLLPRI|POLLHUP}, {fd=32, events=POLLIN|POLLPRI|POLLHUP}, {fd=38, events=POLLIN|POLLPRI|POLLHUP}, {fd=39, events=POLLIN|POLLPRI|POLLHUP}, {fd=40, events=POLLIN|POLLPRI|POLLHUP}, {fd=47, events=POLLIN|POLLPRI|POLLHUP}, {fd=54, events=POLLIN|POLLPRI|POLLHUP}, {fd=55, events=POLLIN|POLLPRI|POLLHUP}, {fd=61, events=POLLIN|POLLPRI|POLLHUP}, {fd=65, events=POLLIN|POLLPRI|POLLHUP}], 18, 1000) = 0 (Timeout)

Supervisord has been running for ~1.5 months and is managing 10 processes.

supervisorctl status all

Shows the same process IDs (no process is constantly restarting, etc.).

Having read the comments above, I decided to do a little experiment: I started stopping, one by one, the services supervisord is managing, through the web interface.

Among the custom processes I am running, I have crashmailbatch and fatalmailbatch from the superlance package.

I was seeing the same frequency of polls in strace right up until I stopped fatalmailbatch.

Fatalmailbatch has the following config:

[eventlistener:fatalmailbatch]
command=fatalmailbatch --toEmail= --fromEmail= --smtpHost= --userName=... --password=... --subject="Alert - MYDOMAIN has crashed processes!" --interval=5.0
events=PROCESS_STATE,TICK_60
autostart=true
autorestart=true

The process had been running for as long as supervisord itself, without stopping.

I hope it helps somehow.

Contributor

plockaby commented Aug 8, 2018

I know that earlier I stated that this was being caused by an event listener and that changing the event listener's configuration to not log made the problem go away. The event listener was apparently a red herring, because I'm experiencing this problem yet again: supervisord is spinning like crazy on POLLIN on a log file. The only commonalities I'm seeing across all of the affected systems are that the file handle is consistently numbered "5" on every server, that it happens after a log rotation, and that when I restart the process whose log file it is spinning on, the spinning stops. It's possible that this is also related to log events, which I also have enabled. I'd love to get some traction on this bug as it has been plaguing me for almost two years. If the maintainer could give me some pointers on how I can assist in getting this addressed, I'd be grateful.
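
One low-tech way to confirm what that fd 5 points at while it is spinning (a hypothetical helper, not part of supervisor; assumes Linux with /proc available):

# Hypothetical helper, not part of supervisor: show what a given fd of a
# running supervisord points at, e.g. a rotated or deleted log file.
import os
import sys

pid, fd = sys.argv[1], sys.argv[2]   # e.g.: python whatfd.py <supervisord-pid> 5
print(os.readlink("/proc/%s/fd/%s" % (pid, fd)))

If the printed path carries a "(deleted)" suffix, the descriptor is still open on a file that has been rotated away, which would fit the log-rotation theory.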

Member

mnaberez commented Aug 8, 2018

If the maintainer could give me some pointers on how I can assist in getting this addressed I'd be grateful.

I am one of the people with commit access to this repository, but I was not involved with any of the code in question and I am not very familiar with this particular code. The problem seems to have been introduced by #129, which was written by @igorsobreira and merged by @mcdonc. I attempted to contact @igorsobreira after reports of problems, but he did not reply. Another contributor, @jaredsuttles, put some effort into the high CPU issue and submitted #875, which I merged. I appreciate all your follow-up, @plockaby, and would like to see this resolved also. Help is needed.

p01ymath commented Aug 20, 2018

In my case, it started to happen a few days back. Now I have to restart supervisor once every day, otherwise it creates unusual spikes. I am using Supervisor 3.2.0 and the issue is still there. Is there any way we can solve this issue permanently? I haven't installed it from a git repository.

Contributor

hathawsh commented Aug 21, 2018

@p01ymath: This bug has been tricky because we don't know how to reproduce it, but it sounds like you're running into it consistently. Your case could be key to solving it. Can you describe what programs Supervisor is running?

Some types of child processes do unusual things like closing the stdout file descriptor. Unusual behaviors can influence the operation of Supervisor and reveal issues.

p01ymath commented Aug 22, 2018

I have run into it twice so far, and it happens at random times. I am running Laravel queue workers using supervisor. The first time I faced this issue, I turned off logging completely and restarted supervisor. That worked, so I thought it was a logging issue.

After 2-3 days, it happened again, and this time logs were already turned off, so I didn't restart supervisor right away. From htop, I found the worker that was using the majority of my CPU and turned off that individual process. My CPU usage was still 100%, so I checked htop again and found that another worker/process in the same program had started using 100% of the CPU. Then I turned off the whole process group and found that a worker in another process group started using all the CPU.

Then I restarted supervisor and that fixed it. This happened about 3 days ago, but I haven't faced any issues since. I am using supervisor 3.2.0, and I have 5 programs with 12 queue workers.

Contributor

hathawsh commented Aug 24, 2018

@p01ymath, here's a guess: is it possible that some of your child processes are sending a large amount of log data through Supervisor? I'm not sure how much log data Supervisor can reasonably handle, but I suspect it would choke on a lot of logging data, especially if the child process happens to send only a character at a time.
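
To test that guess, a throwaway child program (my assumption, not something anyone in this thread has run) could write its output one character at a time and be run under supervisord to see whether heavy, unbuffered logging reproduces the spike:

# Hypothetical stress program: emit stdout one byte per write() so
# supervisord has to handle a flood of tiny log reads.
import sys
import time

while True:
    for ch in "one noisy log line\n":
        sys.stdout.write(ch)
        sys.stdout.flush()   # force a separate write(2) per character
    time.sleep(0.001)        # roughly 1000 lines/second of single-byte writes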

Contributor

hathawsh commented Sep 7, 2018

There's a new tool available called py-spy and I think it will greatly help anyone struggling with high CPU usage in Supervisor or any other Python process. If you're using a virtualenv, install it with 'bin/pip install py-spy' and run it using 'sudo bin/py-spy --pid PID' where PID is the PID of the supervisord process. It will show a 'top'-like breakdown of what the process is doing. Use ctrl-S to pause its output (for copy/paste) and ctrl-Q to resume.
