tools,watchfrr: coordinate timeout when restarting a daemon #12379

mjstapp · 2022-11-23T19:25:57Z

When watchfrr restarts a hung or failed daemon, convey the timeout that it's going to use to the shell script that's actually doing the work. If the two timeouts are different, watchfrr may terminate the shell script prematurely.

Also add a "kill" step to frrcommon's daemon_stop(): if a daemon is hung or deadlocked, it may not be able to react to SIGINT, and SIGKILL may be necessary.
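The need for the extra "kill" step can be seen with a small stand-in process. This is only an illustrative sketch, not code from the patch: `sleep` plays the role of a hung daemon, and the variable names `survived_int` and `survived_kill` are made up for the example.

```shell
#!/bin/sh
# Simulate a daemon that cannot react to SIGINT: the child ignores INT
# and then execs 'sleep', which inherits the ignored disposition.
sh -c 'trap "" INT; exec sleep 30' &
pid=$!
sleep 1

# Step 1: polite request. The child ignores it and keeps running.
kill -INT "$pid"
sleep 1
if kill -0 "$pid" 2>/dev/null; then
	survived_int=yes
else
	survived_int=no
fi

# Step 2: SIGKILL cannot be caught or ignored.
kill -KILL "$pid"
# Reap our own child; 'wait' returns non-zero because it died from a signal.
wait "$pid" 2>/dev/null || true
if kill -0 "$pid" 2>/dev/null; then
	survived_kill=yes
else
	survived_kill=no
fi

echo "after INT: running=$survived_int, after KILL: running=$survived_kill"
```

A real daemon that is deadlocked has not chosen to ignore the signal, but the effect seen by the stop script is the same: SIGINT alone does not end the process.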

Signed-off-by: Mark Stapp <mstapp@nvidia.com>

NetDEF-CI · 2022-11-23T21:40:28Z

Continuous Integration Result: SUCCESSFUL

Congratulations, this patch passed basic tests

Tested-by: NetDEF / OpenSourceRouting.org CI System

CI System Testrun URL: https://ci1.netdef.org/browse/FRR-PULLREQ2-8527/

This is a comment from an automated CI system.
For questions and feedback in regards to this CI system, please feel free to email
Martin Winter - mwinter (at) opensourcerouting.org.

ton31337 · 2022-11-24T07:26:56Z

tools/frrcommon.sh.in

+		kill -9 "$pid"
+
+		# And wait a bit for the kill to take effect
+		cnt=5


Just curious how this behaves under high load. Is this enough? :)

eqvinox · 2022-11-28T10:15:17Z

I don't see the point of this check: we sent SIGKILL, the kernel is killing the process.

I assume this is here to try to make sure that resources are released by the still-running daemon. But stale resources are something we should clean up on start (even in the daemon itself), not by introducing delay during exit.

mjstapp · 2022-12-06

I was concerned about (and saw) the case where the system was busy, and the "KILL" step took some time to take effect. If we just return "success" to watchfrr at this point, there's some risk that the old daemon will still be present. I was just trying to give the system a bit of an opportunity to finish cleaning up the old process.

taspelund · 2022-12-13

@eqvinox @ton31337 did either of you have more feedback on this? Is the approach okay as is, or do you have a suggestion on what you'd like to see changed before approving?
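For reference, the kind of bounded wait being debated in this thread could look roughly like the sketch below. The helper name `wait_until_gone` and the retry budget are hypothetical, not taken from the patch; the point is only that the caller gets a definite answer about whether the old process is still present.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: poll until PID is gone or
# the retry budget runs out. Returns 0 if the process disappeared and
# 1 if it is still present after the last retry.
wait_until_gone() {
	pid="$1"
	tries="$2"
	while kill -0 "$pid" 2>/dev/null; do
		if [ "$tries" -le 0 ]; then
			return 1
		fi
		tries=$(( $tries - 1 ))
		sleep 1
	done
	return 0
}

# Demo with a throwaway child standing in for the daemon.
sleep 30 &
pid=$!
kill -KILL "$pid"
# Demo only: the target is our own child, so reap it first. 'kill -0'
# still succeeds on a zombie that has not been reaped yet.
wait "$pid" 2>/dev/null || true

if wait_until_gone "$pid" 5; then
	echo "pid $pid is gone"
else
	echo "pid $pid is still present"
fi
```

Whether such a wait belongs in the stop path at all, or whether stale state should instead be cleaned up on start, is exactly the open question above; the sketch does not settle it.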

eqvinox

some nits

eqvinox · 2022-11-28T10:11:38Z

tools/frrcommon.sh.in

+	# Allow calling program to convey the timeout it may be using
+	#
+	if [ -n "$FRR_WATCHFRR_TIMEOUT" ] ; then
+		$(( cnt = "$FRR_WATCHFRR_TIMEOUT" - 5 ))


It might be more compatible with unusual shells to write cnt=$(( … )) rather than $(( cnt = … )).

Also, what happens if $FRR_WATCHFRR_TIMEOUT is less than 5?
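Both points can be handled in one place. Note also that a bare `$(( cnt = … ))` line expands to its numeric result, which the shell then tries to run as a command, so the assignment form is safer regardless of shell. The following is a minimal sketch only: the helper name `stop_wait_count` and the values of `DEFAULT_CNT`, `MARGIN` and `MIN_CNT` are assumptions for illustration, not what frrcommon.sh.in uses.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: derive the wait count from
# the timeout passed in by the caller, leaving a margin for the final
# kill step. The three values below are illustrative only.
DEFAULT_CNT=60
MARGIN=5
MIN_CNT=1

stop_wait_count() {
	timeout="$1"
	case "$timeout" in
		''|*[!0-9]*)
			# Unset, empty, or not made of digits only: use the default.
			echo "$DEFAULT_CNT"
			return 0
			;;
	esac
	# Plain assignment form of arithmetic expansion.
	cnt=$(( $timeout - $MARGIN ))
	# Clamp so a timeout below the margin never yields zero or a
	# negative count.
	if [ "$cnt" -lt "$MIN_CNT" ]; then
		cnt=$MIN_CNT
	fi
	echo "$cnt"
}

cnt=$(stop_wait_count "${FRR_WATCHFRR_TIMEOUT:-}")
echo "waiting up to $cnt iterations"
```

With these illustrative values, a timeout of 30 gives a count of 25, and a timeout of 3 is clamped to 1 rather than going negative.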

eqvinox · 2022-11-28T10:13:33Z

tools/frrcommon.sh.in

+
+	debug "kill -2 $pid, cnt $cnt"
+	log_success_msg "Stopping $dmninst, pid $pid ..."
+	kill -2 "$pid"


I'm not sure when/where the -2 came into this script, but signal numbers are theoretically platform specific. I think for SIGINT / SIGKILL this is not a real issue, but kill -INT and kill -KILL are more readable too, so we should use those.

(reference for the curious: man 7 signal, Signal numbering for standard signals)
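As a small illustration of the symbolic form, the sketch below signals a throwaway `sleep` process by name and then maps the resulting exit status back to a signal name with `kill -l`. It is an example only; the variable names are made up, and `kill -INT` replaces `kill -2` in the same way that `kill -KILL` replaces `kill -9` here.

```shell
#!/bin/sh
# Start a throwaway child to signal; 'sleep' stands in for a daemon.
sleep 30 &
pid=$!

# Symbolic name instead of 'kill -9': same signal, easier to read.
kill -KILL "$pid"

# 'wait' reports a status above 128 when the child died from a signal.
status=0
wait "$pid" 2>/dev/null || status=$?

# 'kill -l' with an exit status prints the matching signal name.
signame=$(kill -l "$status")
echo "child $pid ended with status $status ($signame)"
```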

eqvinox · 2022-11-28T10:15:17Z

tools/frrcommon.sh.in

+		kill -9 "$pid"
+
+		# And wait a bit for the kill to take effect
+		cnt=5


I don't see the point of this check — we sent SIGKILL, the kernel is killing the process.

I assume this is here to try to make sure that resources are released by the still-running daemon. But stale resources are something we should clean up on start (even in the daemon itself), not by introducing delay during exit.

github-actions · 2023-01-31T14:03:55Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions · 2023-07-31T01:48:03Z

This PR is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this PR closed.

frrbot bot added tools watchfrr labels Nov 23, 2022

github-actions bot added the master label Nov 23, 2022

ton31337 reviewed Nov 24, 2022

ton31337 approved these changes Nov 24, 2022

eqvinox requested changes Nov 28, 2022

github-actions bot added the conflicts label Jan 31, 2023

github-actions bot added the autoclose label Jul 31, 2023

mjstapp commented Nov 23, 2022

NetDEF-CI commented Nov 23, 2022

ton31337 Nov 24, 2022

eqvinox Nov 28, 2022

mjstapp Dec 6, 2022

taspelund Dec 13, 2022

eqvinox left a comment

eqvinox Nov 28, 2022

eqvinox Nov 28, 2022

eqvinox Nov 28, 2022

github-actions bot commented Jan 31, 2023

github-actions bot commented Jul 31, 2023
