
Fix potential core dump during worker process shutdown #1883

Closed
jixam wants to merge 1 commit

Conversation

jixam
Contributor

@jixam jixam commented Nov 17, 2021

Summary

Remove redundant stop_gracefully after finish.

Motivation

When reaching max_accepts, IOLoop will

  1. run a finish callback on the Prefork worker which will
  2. send a finished=1 heartbeat on a socket which will
  3. cause the Prefork manager to SIGQUIT the worker which will
  4. have the worker call stop_gracefully on the IOLoop

This seems redundant (and slightly circular), as the IOLoop is already in the process of stopping. It is also a small bug: the worker process may have uninstalled its local $SIG{QUIT} handler between steps 2 and 3, in which case the signal causes a core dump (#1449).
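
To make the race concrete, here is a minimal standalone sketch (plain Perl, not the actual Prefork code) of why a SIGQUIT that arrives after the local handler has gone out of scope dumps core:

    use Mojo::Base -strict;

    {
      # The handler is installed with "local", so it only exists while this
      # scope (standing in for the running event loop) is active.
      local $SIG{QUIT} = sub { warn "stopping gracefully\n" };
      # ... imagine the loop running and then finishing here ...
    }

    # After the scope exits, SIGQUIT reverts to its default disposition,
    # which is "terminate and dump core", so a SIGQUIT from the manager
    # that arrives this late kills the worker the hard way.
    kill 'QUIT', $$;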

I am a bit wary of suggesting this, but I could not find anything else happening due to that finish callback, so I propose breaking the cycle at the first step.

(Most of the PR is cleaning up the heartbeat protocol now that finished is no longer used.)

References

This fixes #1449, which is a rare but real problem.

@kraih
Member

kraih commented Nov 17, 2021

I'm a big fan of simplifying code, but I'm also worried about losing important functionality. This PR needs to be double-checked.

@kraih
Member

kraih commented Jun 7, 2023

This PR needs to be rebased.

@jixam
Contributor Author

jixam commented Jun 7, 2023

Thanks, I will just close it as nobody seems interested in reviewing.

@jixam jixam closed this Jun 7, 2023
@kraih
Member

kraih commented Jun 7, 2023

That's the reason I was asking for it to be rebased: if tests don't work, nobody will review. 🙄

@jixam jixam reopened this Jun 8, 2023
@jixam
Contributor Author

jixam commented Jun 8, 2023

All tests pass and there are no conflicts, so I just reopened 🤷

@kraih kraih requested review from a team, marcusramberg, jberger and Grinnz June 8, 2023 09:18
@kraih
Member

kraih commented Jun 8, 2023

At the very least, the PR would leave the documentation incorrect, since the worker no longer goes into graceful shutdown mode.

@kraih
Member

kraih commented Jun 8, 2023

So, since the question has come up on IRC: the upside of this change is that it eliminates a possible signal handler race condition during worker shutdown that kills the worker in an unfortunate way. The downside is that, if the app developer makes a small mistake that keeps the event loop running, a worker can get stuck forever without handling any new requests and without the manager process ever knowing.

Actually not an easy decision to make.
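
As a contrived illustration of the kind of mistake meant here (an assumed example, not taken from the PR): a handler that never answers and never times out keeps a connection open, so a graceful stop that waits for all connections to close can never complete, and without the finished heartbeat the manager would never find out:

    use Mojolicious::Lite -signatures;

    # Contrived mistake: the request below is never rendered and never times
    # out, so its connection stays open and the event loop keeps running.
    get '/forever' => sub ($c) {
      $c->inactivity_timeout(0);   # disable the inactivity timeout
      $c->render_later;            # promise a response that never comes
    };

    app->start;

(The skeleton app further down in this thread is the one-liner form of the same scenario.)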

@brsakai-csco
Contributor

I'm new to this discussion, but what's the rationale for breaking the cycle at (1) instead of at (3)? Breaking it at (3) is the approach we ended up with in #2046. It seems useful to still notify the manager process about the attempted graceful shutdown, so it can time out and SIGKILL us if we get stuck.
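
For illustration only (assumed names throughout, not the actual #2046 diff), breaking the cycle at (3) could look roughly like this on the manager side: keep the finished heartbeat, skip the now-redundant SIGQUIT when a worker has already reported that it is finishing, but still arm a deadline so a stuck worker is eventually force-killed:

    use Mojo::Base -strict, -signatures;

    # Hypothetical manager-side helper, called periodically for a worker that
    # should shut down; $worker->{graceful} is assumed to mirror the finished
    # heartbeat and $timeout stands in for graceful_timeout.
    sub reap_gracefully ($pid, $worker, $timeout) {

      # Skip the redundant SIGQUIT if the worker already announced its own
      # graceful stop, avoiding the signal race described above.
      kill 'QUIT', $pid unless $worker->{graceful};

      # Either way, start the clock so a stuck worker still gets a SIGKILL.
      $worker->{graceful_started} //= time;
      kill 'KILL', $pid if time - $worker->{graceful_started} > $timeout;
    }

The point is only that the timeout path survives without the extra signal round trip.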

@rabbiveesh
Contributor

It sounds to me like we wouldn't want this change if it can end up with impossible-to-debug zombie processes.
As an app developer, I want to be able to assume that when I shut down my app, it shuts down in all instances.

@jixam
Contributor Author

jixam commented Jun 9, 2023

I didn't consider that a manual SIGQUIT is on the same code path, so I agree that this PR is a no-go as is. However, I am unsure how to change it.

Please consider this skeleton app:

perl -Mojo -E 'get "/" => sub ($c) { $c->inactivity_timeout(0); $c->render_later; }; app->start' prefork -a 100

It seems wrong to close all requests here just because max_accepts is reached. I think that is a current bug?

But for SIGQUIT, yes – the timeout should obviously apply.

@jixam
Contributor Author

jixam commented Jun 9, 2023

I am not going to update this PR as I think the fix by @brsakai-csco is better and my concern above is a separate issue.

@jixam
Contributor Author

jixam commented Jun 9, 2023

Closing in favor of #2073.

@jixam jixam closed this Jun 9, 2023
Development

Successfully merging this pull request may close these issues.

Hypnotoad / Prefork dump cores on small values of accepts and disabled keep-alive