Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tools,watchfrr: coordinate timeout when restarting a daemon #12379

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mjstapp
Copy link
Contributor

@mjstapp mjstapp commented Nov 23, 2022

When watchfrr restarts a hung or failed daemon, convey the timeout that it's going to use to the shell script that's actually doing the work. If the two timeouts are different, watchfrr may terminate the shell script prematurely.

Also add a "kill" step to frrcommon's daemon_stop(): if a daemon is hung or deadlocked, it may not be able to react to SIGINT, and SIGKILL may be necessary.

When watchfrr restarts a hung or failed daemon, convey the
timeout that it's going to use to the shell script that's
actually doing the work. If the two timeouts are different,
watchfrr may terminate the shell script prematurely.

Also add a "kill" step to frrcommon's daemon_stop(): if
a daemon is hung or deadlocked, it may not be able to react
to SIGINT, and SIGKILL may be necessary.

Signed-off-by: Mark Stapp <mstapp@nvidia.com>
@NetDEF-CI
Copy link
Collaborator

Continuous Integration Result: SUCCESSFUL

Congratulations, this patch passed basic tests

Tested-by: NetDEF / OpenSourceRouting.org CI System

CI System Testrun URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-8527/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

kill -9 "$pid"

# And wait a bit for the kill to take effect
cnt=5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious how it should be under a high load, is this enough? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the point of this check — we sent SIGKILL, the kernel is killing the process.

I assume this is here to try to make sure that resources are released by the still-running daemon. But stale resources are something we should clean up on start (even in the daemon itself), not by introducing delay during exit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was concerned about (and saw) the case where the system was busy, and the "KILL" step took some time to have effect. If we just return "success" to watchfrr at this point, there's some risk that the old daemon will still be present. I was just trying to give a bit of an opportunity for the system to finish cleaning up the old process.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eqvinox @ton31337 did either of you have more feedback on this? Is the approach okay as is, or do you have a suggestion on what you'd like to see changed before approving?

Copy link
Contributor

@eqvinox eqvinox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some nits

# Allow calling program to convey the timeout it may be using
#
if [ -n "$FRR_WATCHFRR_TIMEOUT" ] ; then
$(( cnt = "$FRR_WATCHFRR_TIMEOUT" - 5 ))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be more weird-shell compatible to do cnt=$(( … )) rather than $(( cnt = … ))

also what happens if $FRR_WATCHFRR_TIMEOUT is less than 5?


debug "kill -2 $pid, cnt $cnt"
log_success_msg "Stopping $dmninst, pid $pid ..."
kill -2 "$pid"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure when/where the -2 came into this script, but signal numbers are theoretically platform specific. I think for SIGINT / SIGKILL this is not a real issue, but kill -INT and kill -KILL are more readable too, so we should use those.

(reference for the curious: man 7 signal, Signal numbering for standard signals)

kill -9 "$pid"

# And wait a bit for the kill to take effect
cnt=5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the point of this check — we sent SIGKILL, the kernel is killing the process.

I assume this is here to try to make sure that resources are released by the still-running daemon. But stale resources are something we should clean up on start (even in the daemon itself), not by introducing delay during exit.

@github-actions
Copy link

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Copy link

This PR is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this PR closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants