-
Notifications
You must be signed in to change notification settings - Fork 951
Handle SIGTERM and SIGINT #2172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The traditional way of integrating signals to event loops is to use a pipe. Put one end of pipe in global, put other end as something we monitor in the main event loop. Posix When signal handler runs, it writes signal number to the pipe. This triggers event, which reads signal nunber from pipe and behaves appropriately. |
Suggest also adding to |
I'm not sure why we want to handle these? Our system should be entirely robust against such failures; if it's simply to log them that's OK, but we should use the pipe trick to do it... |
@rustyrussell I use clightning in BTCPay through docker. If there is any restart, or update that require clightning to restart, by default, docker-compose will hang for 10 sec before killing the service. Making the downtime during update to 10 sec, by default. Now, because I don't want to kill bitcoind to not have a corrupt state, I don't use a default timeout of 10 sec, but of 300 sec! Luckily, you can ask docker-compose to use SIGKILL instead of SIGTERM for specific containers. But this is just a bandaid that bring more burden on the person integrating clightning. When the only thing to do is to handle SIGTERM like does virtually 99% of the apps which are not written in C. Also, are you really sure that SQLite (this is the DB you are using right?) will be happy to be killed in the middle of a write operation? |
I am almost sure sqlite3 would be perfectly happy with being killed; that is not much different from literal pull-the-plug. https://www.sqlite.org/howtocorrupt.html That said, cleanly closing and flushing our own logs and prints would be a nice feature especially to capture possible problems that might induce a human operator to send Would suggest that multiple trggerings of the |
You misunderstand. We currently "handle" SIGTERM now by exiting immediately. This makes no difference; we just don't get a chance to log that we got SIGTERM.
100% positive, yes. If bitcoind gets corruption under this condition AFAICT that's a bitcoind bug... |
I don't think you do, else this PR would not be very useful. Also I would be at loss explaining why the clightning container does not close cleanly without me configuring docker compose to use SIGKILL instead of SIGTERM. I can give a try again. |
This works perfectly fine:
and the corresponding process exits without noticeable delay:
|
I wonder if docker is doing something strange: our code doesn't handle SIGTERM and thus should exit as per default signal handling. We could just |
Strange... I gave a try with https://github.com/btcpayserver/btcpayserver/blob/master/BTCPayServer.Tests/docker-compose.yml docker-compose up customer_lightningd -d
docker-compose down -t 100
# Result: Shutdown immediately Now I remove the SIGKILL diff --git a/BTCPayServer.Tests/docker-compose.yml b/BTCPayServer.Tests/docker-compose.yml
index 7aba34f..c90d9a4 100644
--- a/BTCPayServer.Tests/docker-compose.yml
+++ b/BTCPayServer.Tests/docker-compose.yml
@@ -119,7 +119,6 @@ services:
customer_lightningd:
image: nicolasdorier/clightning:v0.6.2-3-dev
- stop_signal: SIGKILL
restart: unless-stopped
environment:
EXPOSE_TCP: "true" docker-compose up customer_lightningd -d
docker-compose down -t 100
# Result: Hangs for 100 seconds. I will take a look if by any chance this signal is not trapped by something in the |
Ok, so this is a mistery for me at the moment... |
I am reading https://medium.freecodecamp.org/docker-entrypoint-cmd-dockerfile-best-practices-abc591c30e21 it seems it is a problem unrelated to this PR. Now trying to find out why it does not happen for other of my images. |
Ok so here is why I seems fine in other images I use: For some images, I don't execute a entrypoint script, meaning the process receive directly SIGTERM. For clightning, I execute an entrypoint script, which execute bash with PID 1, which for some reason, block SIGTERM to reach clightning. For other images (like bitcoind), I use Proposed fix: using It seems that we don't need to handle |
Nevermind @shesek handle it with a trap: https://github.com/shesek/spark-wallet/blob/master/scripts/docker-entrypoint.sh and it works. |
Ok @shesek hack of trapping TERM does not work, because TERM does not even reach the bash script because bash run with PID 1. |
@NicolasDorier thanks for digging into this! Weird, though... |
I am reading this blog post which explain the problem quite well. https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ Playing with this now. Will make a PR once I find something that work. |
There is some linux black magic I don't quite understand yet. I tried to use But then However, on the bitcoind image, same situation I will give a try to this PR to see if it fix the situation. |
Surprisingly, this PR fix it!
Upon investigation, I think this PR should be merged (given point that @ZmnSCPxj mentioned are fixed). tACK. |
I was thinking about fixing the problem more cleanly via dumb-init. Because having this PR + However, |
Looking at
So it looks like all that is needed is to explicitly suicide when receiving SIGTERM and SIGINT, but only if we are PID 1. If self-pipe trick is more than what we want to chew, we could do something simple like:
(we should use |
I've been studying However I've been a bit distracted lately in RL. I can implement you're above solution later today and then pursue the self-pipe solution on my own time. I appreciate the help. |
We do not really need self-pipe trick; outside of docker we just not handle Only need to handle |
Signed-off-by: Mark Beckwith <wythe@intrig.com>
I also took your wording from above and used it for the code comment :-) |
@NicolasDorier if you can confirm this new version of PR is OK, I can ACK this. @rustyrussell @cdecker @niftynei Should we be more ambitious and consider also proper signalling to child processes? |
Yeah, or just use Adding stop_signal SIGKILL is not too much of a burden, but I think it is better to make it harder for the docker image user to fuck up. Also if later you need proper shutdown, there is no risk I forget to remove the SIGKILL. I will try this PR. |
tACK 9005a45 |
In conjunction to ElementsProject#2172 this fix ElementsProject#2074 if EXPOSE_TCP is false (which is the case in production) More info ElementsProject#2172 (comment)
ACK 9005a45 Still need some thought about taking on more |
In conjunction to #2172 this fix #2074 if EXPOSE_TCP is false (which is the case in production) More info #2172 (comment)
Addresses #2074.
I made
ld
a file global variable in order to do this. @rustyrussell already admitted it was "basically a global variable" in the comment below :-)It seems the code really wants to exit via
io_break
so that's the approach I took.I tried to just set a
bool
and exit themain()
for(;;)
loop by checking it. But it was not very responsive and I got an error fromhsmd
with SIGINT.