-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tools,watchfrr: coordinate timeout when restarting a daemon #12379
base: master
Are you sure you want to change the base?
Conversation
When watchfrr restarts a hung or failed daemon, convey the timeout that it's going to use to the shell script that's actually doing the work. If the two timeouts are different, watchfrr may terminate the shell script prematurely. Also add a "kill" step to frrcommon's daemon_stop(): if a daemon is hung or deadlocked, it may not be able to react to SIGINT, and SIGKILL may be necessary. Signed-off-by: Mark Stapp <mstapp@nvidia.com>
Continuous Integration Result: SUCCESSFULCongratulations, this patch passed basic tests Tested-by: NetDEF / OpenSourceRouting.org CI System CI System Testrun URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-8527/ This is a comment from an automated CI system. |
kill -9 "$pid" | ||
|
||
# And wait a bit for the kill to take effect | ||
cnt=5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious how it should be under a high load, is this enough? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the point of this check — we sent SIGKILL
, the kernel is killing the process.
I assume this is here to try to make sure that resources are released by the still-running daemon. But stale resources are something we should clean up on start (even in the daemon itself), not by introducing delay during exit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was concerned about (and saw) the case where the system was busy, and the "KILL" step took some time to have effect. If we just return "success" to watchfrr at this point, there's some risk that the old daemon will still be present. I was just trying to give a bit of an opportunity for the system to finish cleaning up the old process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some nits
# Allow calling program to convey the timeout it may be using | ||
# | ||
if [ -n "$FRR_WATCHFRR_TIMEOUT" ] ; then | ||
$(( cnt = "$FRR_WATCHFRR_TIMEOUT" - 5 )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be more weird-shell compatible to do cnt=$(( … ))
rather than $(( cnt = … ))
also what happens if $FRR_WATCHFRR_TIMEOUT
is less than 5?
|
||
debug "kill -2 $pid, cnt $cnt" | ||
log_success_msg "Stopping $dmninst, pid $pid ..." | ||
kill -2 "$pid" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure when/where the -2
came into this script, but signal numbers are theoretically platform specific. I think for SIGINT / SIGKILL this is not a real issue, but kill -INT
and kill -KILL
are more readable too, so we should use those.
(reference for the curious: man 7 signal
, Signal numbering for standard signals
)
kill -9 "$pid" | ||
|
||
# And wait a bit for the kill to take effect | ||
cnt=5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the point of this check — we sent SIGKILL
, the kernel is killing the process.
I assume this is here to try to make sure that resources are released by the still-running daemon. But stale resources are something we should clean up on start (even in the daemon itself), not by introducing delay during exit.
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
This PR is stale because it has been open 180 days with no activity. Comment or remove the |
When watchfrr restarts a hung or failed daemon, convey the timeout that it's going to use to the shell script that's actually doing the work. If the two timeouts are different, watchfrr may terminate the shell script prematurely.
Also add a "kill" step to frrcommon's daemon_stop(): if a daemon is hung or deadlocked, it may not be able to react to SIGINT, and SIGKILL may be necessary.