Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected lock of a process #829

Closed
enVolt opened this Issue Sep 24, 2016 · 11 comments

Comments

3 participants
@enVolt
Copy link

enVolt commented Sep 24, 2016

I've few Laravel's consumer process running via supervisor. Here is how my supervisor config looks like

[program:process_m]
process_name=%(program_name)s_%(process_num)02d
command=/usr/bin/php /home/user/public_html/artisan process:message
autostart=true
autorestart=true
;user=apache
numprocs=1
redirect_stderr=true
stdout_logfile=/home/user/worker.log

This task are stopping unexpectedly. From the log stream of process, it looks like the process has just stopped working, but supervisorctl status says that process is in RUNNING state. Then I added log statement almost after each line of code, to pin point the issue (My assumption was that it was something related to command). But nothing is getting logged.

strace result for that particular process says

[~]> sudo strace -p 15548
Process 15548 attached
write(3, "Processed: Illuminate\\Queue\\Call"..., 51^CProcess 15548 detached
 <detached ...>

This particular log statement is generated as stdout by the command (not logged by process to log files)

The process is locked here. It's not ending, so it's not restarting, neither it's in running state despite supervisor status being RUNNING.

What could possibly be wrong here? Is it something system specific? What can I do to avoid this?

@mnaberez mnaberez added the question label Sep 24, 2016

@mnaberez

This comment has been minimized.

Copy link
Member

mnaberez commented Sep 24, 2016

This task are stopping unexpectedly. From the log stream of process, it looks like the process has just stopped working, but supervisorctl status says that process is in RUNNING state.

Does the command supervisorctl status work when the process is in this state? If that hangs or doesn't connect, it might indicate a problem in supervisord. If supervisorctl commands work normally, that almost always indicates something is wrong with the process running under supervisord (as opposed to a bug in supervisord itself).

The process is locked here. It's not ending, so it's not restarting, neither it's in running state despite supervisor status being RUNNING.

The RUNNING state only means the process stayed up long enough to exit the STARTING state, and the process hasn't terminated. If you run supervisorctl pid <processname> and get a pid, you can see the process running with tools like ps. supervisord doesn't have any knowledge of what your process might want to do, so RUNNING doesn't imply your program is doing what it's supposed to.

@enVolt

This comment has been minimized.

Copy link
Author

enVolt commented Sep 24, 2016

supervisorctl status is working fine. To be honest, I'm not sure if it is something related to supervisor.

@mnaberez

This comment has been minimized.

Copy link
Member

mnaberez commented Sep 24, 2016

To be honest, I'm not sure if it is something related to supervisor.

Can you run your process outside of supervisord for a while (e.g. in a screen session)? If it hangs there too, that would eliminate supervisord.

@enVolt

This comment has been minimized.

Copy link
Author

enVolt commented Sep 25, 2016

This log statement

Processed: Illuminate\Queue\Call

is generated as stdout by process, and writing it to a log file (/home/user/worker.log) is part of supervisord, and strace output suggest that a write operation to a file is blocked (fd is 3), that's why I've a feeling that it might be related to supervisor. I've tried running the process in background via screen. No such problem there (basis on screen output, it crashes because of some error (application part; intended behaviour))

@mnaberez

This comment has been minimized.

Copy link
Member

mnaberez commented Sep 25, 2016

No such problem there (basis on screen output, it crashes because of some error (application part; intended behaviour))

I don't know what to make of this because on the one hand you're saying your program runs fine under screen but you're also saying it crashes under screen, so maybe it's just not running long enough to get into its "stuck" state.

I haven't seen anything in this ticket that leads me to believe there's a problem with supervisord yet. You said that supervisorctl status works which means supervisord isn't hanging up. It's always possible there's a bug in supervisord but I don't see one here yet. I think we'd need more evidence or reproduce instructions to do anything more with this.

@enVolt

This comment has been minimized.

Copy link
Author

enVolt commented Sep 26, 2016

Update -

So when a process stops, here is what supervisorctl status shows

[~] $ supervisorctl status
...
proc_worder:name_01                         RUNNING   pid 30880, uptime 14:46:35
...
...

At this point strace output is

[~] $ sudo strace -p 30881
Process 30881 attached
write(4, "Messages Received, [{\"MessageId\""..., 3029^CProcess 30881 detached
 <detached ...>

I'm not a kernel expert, but it looks like it is trying to write to a file described by fd 4.

Going to /proc/$pid/fd, ls output is

[/proc/30881/fd] $ ls -l
total 0
lr-x------ 1 appuser appuser 64 Sep 26 10:29 0 -> pipe:[1703461245]
l-wx------ 1 appuser appuser 64 Sep 26 10:29 1 -> pipe:[1703461246]
lrwx------ 1 appuser appuser 64 Sep 26 10:29 10 -> socket:[1703461335]
lrwx------ 1 appuser appuser 64 Sep 26 10:29 11 -> socket:[1703461337]
l-wx------ 1 appuser appuser 64 Sep 26 10:29 2 -> pipe:[1703461246]
lrwx------ 1 appuser appuser 64 Sep 26 10:29 3 -> socket:[1703458649]
l-wx------ 1 appuser appuser 64 Sep 26 10:29 4 -> pipe:[1703461246]
l-wx------ 1 appuser appuser 64 Sep 26 10:29 5 -> pipe:[1703461246]
l-wx------ 1 appuser appuser 64 Sep 26 10:29 6 -> /home/user/public_html/storage/logs/laravel.log
lrwx------ 1 appuser appuser 64 Sep 26 10:29 7 -> socket:[1703460296]
lr-x------ 1 appuser appuser 64 Sep 26 10:29 8 -> /etc/pki/nssdb/cert9.db
lr-x------ 1 appuser appuser 64 Sep 26 10:29 9 -> /etc/pki/nssdb/key4.db

(appuser is the user supervisord is running as, hence all the workers are running with the same)
so 4 is piped to pipe:[1703461246]
Now basis of this answer, if I want to get more info about this pipe
http://serverfault.com/questions/48330/how-can-i-get-more-info-on-open-pipes-show-in-proc-in-linux

[/proc/30881/fd] $ lsof | grep 1703461246
superviso 27466   appuser   29r     FIFO               0,10        0t0 1703461246 pipe
php       30881   appuser    1w     FIFO               0,10        0t0 1703461246 pipe
php       30881   appuser    2w     FIFO               0,10        0t0 1703461246 pipe
php       30881   appuser    4w     FIFO               0,10        0t0 1703461246 pipe
php       30881   appuser    5w     FIFO               0,10        0t0 1703461246 pipe

So this is somehow related to supervisor.

Let me know what else debug info you need on this.

@mnaberez

This comment has been minimized.

Copy link
Member

mnaberez commented Sep 26, 2016

What version of Supervisor is this?

You could try loglevel=blat in the [supervisord] section of the config file. supervisord logs messages about log activity and file descriptor changes; perhaps something in there will be interesting.

@enVolt

This comment has been minimized.

Copy link
Author

enVolt commented Sep 27, 2016

Version is 3.2.3, In /etc/supervisord.conf I've changed loglevel=blather. I'll post output from log file here, next time when it happens.

@enVolt

This comment has been minimized.

Copy link
Author

enVolt commented Sep 29, 2016

It hasn't happen since, the only change I can think of was increasing the fd limit to 4096 (ulimit -n 4096, which some how seem to be reverted), I'll update here as soon as it happens again.

(I just wish my weekend doesn't get ruined because of it)

@mnaberez

This comment has been minimized.

Copy link
Member

mnaberez commented Nov 14, 2016

Closing due to lack of activity.

@mnaberez mnaberez closed this Nov 14, 2016

@MazzMazz

This comment has been minimized.

Copy link

MazzMazz commented Jun 28, 2018

@mnaberez it would be very interesting how you solved this. It happens for me too.

Laravel Jobs with access to a Amazon FIFO queue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.