Handle SIGTERM and SIGINT #2172

wythe · 2018-12-12T21:15:32Z

Addresses #2074.

I madeld a file global variable in order to do this. @rustyrussell already admitted it was "basically a global variable" in the comment below :-)

It seems the code really wants to exit via io_break so that's the approach I took.

I tried to just set a bool and exit the main() for(;;) loop by checking it. But it was not very responsive and I got an error from hsmd with SIGINT.

ZmnSCPxj · 2018-12-13T01:27:18Z

The traditional way of integrating signals to event loops is to use a pipe. Put one end of pipe in global, put other end as something we monitor in the main event loop.

Posix write function is one of the few Posix things that can be safely called in a signal handler.

When signal handler runs, it writes signal number to the pipe. This triggers event, which reads signal nunber from pipe and behaves appropriately.

lightningd/lightningd.c

ZmnSCPxj · 2018-12-14T03:28:04Z

Suggest also adding to CHANGELOG.md --- this seems a significant enough new feature (we now handle SIGINT and SIGTERM properly)

rustyrussell · 2018-12-14T05:39:01Z

I'm not sure why we want to handle these? Our system should be entirely robust against such failures; if it's simply to log them that's OK, but we should use the pipe trick to do it...

NicolasDorier · 2018-12-14T05:47:07Z

@rustyrussell I use clightning in BTCPay through docker.

If there is any restart, or update that require clightning to restart, by default, docker-compose will hang for 10 sec before killing the service. Making the downtime during update to 10 sec, by default.
Knowing there is like 10 services in the docker-compose, if everybody did the same it would take 100 sec of downtime.

Now, because I don't want to kill bitcoind to not have a corrupt state, I don't use a default timeout of 10 sec, but of 300 sec!
So it means docker-compose will wait 300 sec for clightning to close before killing it.

Luckily, you can ask docker-compose to use SIGKILL instead of SIGTERM for specific containers. But this is just a bandaid that bring more burden on the person integrating clightning. When the only thing to do is to handle SIGTERM like does virtually 99% of the apps which are not written in C.

Also, are you really sure that SQLite (this is the DB you are using right?) will be happy to be killed in the middle of a write operation?

ZmnSCPxj · 2018-12-14T11:44:12Z

I am almost sure sqlite3 would be perfectly happy with being killed; that is not much different from literal pull-the-plug. https://www.sqlite.org/howtocorrupt.html

That said, cleanly closing and flushing our own logs and prints would be a nice feature especially to capture possible problems that might induce a human operator to send SIGTERM et al.

Would suggest that multiple trggerings of the SIGTERM or SIGINT handler will outright kill the process, in case we get stuck in an infinite loop not in the mainloop.

rustyrussell · 2018-12-15T05:28:11Z

Luckily, you can ask docker-compose to use SIGKILL instead of SIGTERM for specific containers. But this is just a bandaid that bring more burden on the person integrating clightning. When the only thing to do is to handle SIGTERM like does virtually 99% of the apps which are not written in C.

You misunderstand. We currently "handle" SIGTERM now by exiting immediately. This makes no difference; we just don't get a chance to log that we got SIGTERM.

Also, are you really sure that SQLite (this is the DB you are using right?) will be happy to be killed in the middle of a write operation?

100% positive, yes. If bitcoind gets corruption under this condition AFAICT that's a bitcoind bug...

NicolasDorier · 2018-12-15T12:18:14Z

You misunderstand. We currently "handle" SIGTERM now by exiting immediately. This makes no difference; we just don't get a chance to log that we got SIGTERM.

I don't think you do, else this PR would not be very useful. Also I would be at loss explaining why the clightning container does not close cleanly without me configuring docker compose to use SIGKILL instead of SIGTERM. I can give a try again.

cdecker · 2018-12-15T21:47:58Z

This works perfectly fine:

pkill -TERM lightningd

and the corresponding process exits without noticeable delay:

[2]    20623 terminated  lightningd/lightningd --lightning-dir=/tmp/l2 --network=bitcoin

docker is known to mishandle signals in a few cases (I always use -t to have it allocate a tty, which seems to fix any issues I have with signals not working as expected).

rustyrussell · 2018-12-15T22:51:53Z

I wonder if docker is doing something strange: our code doesn't handle SIGTERM and thus should exit as per default signal handling.

We could just signal(SIGTERM, SIG_DFL); to be absolutely sure, but i am confused...

NicolasDorier · 2018-12-16T06:55:57Z

Strange... I gave a try with https://github.com/btcpayserver/btcpayserver/blob/master/BTCPayServer.Tests/docker-compose.yml

docker-compose up customer_lightningd -d
docker-compose down -t 100
# Result: Shutdown immediately

Now I remove the SIGKILL

diff --git a/BTCPayServer.Tests/docker-compose.yml b/BTCPayServer.Tests/docker-compose.yml
index 7aba34f..c90d9a4 100644
--- a/BTCPayServer.Tests/docker-compose.yml
+++ b/BTCPayServer.Tests/docker-compose.yml
@@ -119,7 +119,6 @@ services:

   customer_lightningd:
     image: nicolasdorier/clightning:v0.6.2-3-dev
-    stop_signal: SIGKILL
     restart: unless-stopped
     environment:
       EXPOSE_TCP: "true"

docker-compose up customer_lightningd -d
docker-compose down -t 100
# Result: Hangs for 100 seconds.

I will take a look if by any chance this signal is not trapped by something in the docker-entrypoint.

NicolasDorier · 2018-12-16T07:07:33Z

Ok, so this is a mistery for me at the moment...
As @cdecker, I tried pkill -TERM lightningd which work as it should...
But docker-compose down does not seem to send the TERM signal to clightning. Which is strange, first case I see. I am searching why.

NicolasDorier · 2018-12-16T07:16:10Z

I am reading https://medium.freecodecamp.org/docker-entrypoint-cmd-dockerfile-best-practices-abc591c30e21 it seems it is a problem unrelated to this PR. Now trying to find out why it does not happen for other of my images.

NicolasDorier · 2018-12-16T07:33:23Z

Ok so here is why I seems fine in other images I use:

For some images, I don't execute a entrypoint script, meaning the process receive directly SIGTERM.

For clightning, I execute an entrypoint script, which execute bash with PID 1, which for some reason, block SIGTERM to reach clightning.

For other images (like bitcoind), I use exec which replace the bash process by the bitcoind process, making it receive the signal.

Proposed fix: using exec in the dockerfile, but I need to test if this will work even with EXPOSE_TCP=true. Ping @shesek I think your image on spark has same issue.

It seems that we don't need to handle SIGTERM explicitely in clightning after all.

NicolasDorier · 2018-12-16T07:48:02Z

Nevermind @shesek handle it with a trap: https://github.com/shesek/spark-wallet/blob/master/scripts/docker-entrypoint.sh and it works.

NicolasDorier · 2018-12-16T08:02:39Z

Ok @shesek hack of trapping TERM does not work, because TERM does not even reach the bash script because bash run with PID 1.

rustyrussell · 2018-12-17T00:28:31Z

@NicolasDorier thanks for digging into this! Weird, though...

NicolasDorier · 2018-12-17T04:35:36Z

I am reading this blog post which explain the problem quite well. https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/

Playing with this now. Will make a PR once I find something that work.

NicolasDorier · 2018-12-17T05:00:15Z

There is some linux black magic I don't quite understand yet.

I tried to use exec lightningd in the docker-entrypoint.sh, so ligthningd ends up with PID 1. (replacing bash)

But then pkill -TERM lightningd does not work anymore!

However, on the bitcoind image, same situation pkill -TERM bitcoind works.
The only difference I see now is the use of gosu in the bitcoind image OR that bticoind handle the signal explicitely.

I will give a try to this PR to see if it fix the situation.

NicolasDorier · 2018-12-17T05:35:21Z

Surprisingly, this PR fix it!

This PR merged	use `exec lightningd` (run as PID 1)	`pkill -TERM lightningd` works	`docker-compose down -t 100` terminate quick
✔️	✔️	✔️	✔️
✔️	❌	✔️	❌
❌	✔️	❌	❌
❌	❌	✔️	❌

A process running as PID 1 inside a container is treated specially by Linux: it ignores any signal with the default action. So, the process will not terminate on SIGINT or SIGTERM unless it is coded to do so.

Upon investigation, I think this PR should be merged (given point that @ZmnSCPxj mentioned are fixed). tACK.

NicolasDorier · 2018-12-17T06:14:24Z

I was thinking about fixing the problem more cleanly via dumb-init. Because having this PR + exec ligthningd will work correctly only if EXPOSE_TCP is false. dumb-init would solve for everything.

However, EXPOSE_TCP is only useful for dev time, where we don't really care about having short timeout for docker-compose. I don't like adding one dependency just for this corner case.

ZmnSCPxj · 2018-12-19T00:09:42Z

Looking at dumb-init/README.md:

However, if the process receiving the signal is PID 1, it gets special treatment by the kernel; if it hasn't registered a handler for the signal, the kernel won't fall back to default behavior, and nothing happens.

So it looks like all that is needed is to explicitly suicide when receiving SIGTERM and SIGINT, but only if we are PID 1.

If self-pipe trick is more than what we want to chew, we could do something simple like:

int on_sigint(int _ UNUSED)
{
        static const char *msg = "lightningd: SIGINT caught, exiting.\n";
        write(STDERR_FILENO, msg, strlen(msg));
        _exit(1);
}
int on_sigterm(int _ UNUSED)
{
        static const char *msg = "lightningd: SIGTERM caught, exiting.\n";
        write(STDERR_FILENO, msg, strlen(msg));
        _exit(1);
}

...

         if (1 == getpid()) {
                 signal(SIGINT, &on_sigint);
                 signal(SIGTERM, &on_sigterm);
         }

(we should use sigaction instead of signal but I am too lazy to hack it together)

wythe · 2018-12-19T17:00:39Z

I've been studying ccan/io along with your above links to create a proper solution. It's definitely been a nice exercise to understand ccan/io and the clightning architecture better.

However I've been a bit distracted lately in RL. I can implement you're above solution later today and then pursue the self-pipe solution on my own time. I appreciate the help.

ZmnSCPxj · 2018-12-20T00:18:46Z

We do not really need self-pipe trick; outside of docker we just not handle SIGTERM and SIGINT and depend on the fact that sqlite3 is 100% reliable and will never corrupt data file, because it is made of magic and not code.

Only need to handle SIGTERM and SIGINT for case that we are PID 1 of docker container, since, Linux makes special this PID and requires that some handler exist.

lightningd/lightningd.c

Signed-off-by: Mark Beckwith <wythe@intrig.com>

wythe · 2018-12-20T00:46:29Z

I also took your wording from above and used it for the code comment :-)

ZmnSCPxj · 2018-12-20T00:51:35Z

@NicolasDorier if you can confirm this new version of PR is OK, I can ACK this.

@rustyrussell @cdecker @niftynei Should we be more ambitious and consider also proper signalling to child processes?
Currently, our child processes are "only" our own daemons, none of which maintain any persistent state and can be SIGKILLed arbitrarily, but, in the future, more complex plugins may exist, which expect "correct" cleanup behavior when c-lightning docker is stopped.
I am almost tempted to just say "please put a minimal init system in Docker container". EDIT: Docker >=1.25 has --init option in run command which includes a minimal init system.

NicolasDorier · 2018-12-20T03:49:52Z

I am almost tempted to just say "please put a minimal init system in Docker container". EDIT: Docker >=1.25 has --init option in run command which includes a minimal init system.

Yeah, or just use SIGKILL. Sadly, init has been supported on docker-compose version 3 since not too long (first supported in 2.2, then dropped because it was not on swarm, and added again... docker-compose versioning is dumb), so I can't assume the users running BTCPay support it.

Adding stop_signal SIGKILL is not too much of a burden, but I think it is better to make it harder for the docker image user to fuck up. Also if later you need proper shutdown, there is no risk I forget to remove the SIGKILL.

I will try this PR.

NicolasDorier · 2018-12-20T04:15:50Z

tACK 9005a45

In conjunction to ElementsProject#2172 this fix ElementsProject#2074 if EXPOSE_TCP is false (which is the case in production) More info ElementsProject#2172 (comment)

ZmnSCPxj · 2018-12-21T00:42:08Z

ACK 9005a45

Still need some thought about taking on more init responsibilities (propagating SIGINT/SIGTERM to children, adopting orphans), or maybe just adding some minimal init implementation in the code base, or as a git submodule.

Addreseed

In conjunction to #2172 this fix #2074 if EXPOSE_TCP is false (which is the case in production) More info #2172 (comment)

wythe requested a review from cdecker as a code owner December 12, 2018 21:15

ZmnSCPxj suggested changes Dec 13, 2018

View reviewed changes

lightningd/lightningd.c Outdated Show resolved Hide resolved

cdecker added the docker label Dec 15, 2018

NicolasDorier mentioned this pull request Dec 17, 2018

Useless trap in the docker-entrypoint shesek/spark-wallet#46

Open

wythe force-pushed the shutdown branch from 6ca1e10 to 859b056 Compare December 20, 2018 00:17

ZmnSCPxj previously requested changes Dec 20, 2018

View reviewed changes

lightningd/lightningd.c Outdated Show resolved Hide resolved

Handle SIGINT and SIGTERM for PID 1

9005a45

Signed-off-by: Mark Beckwith <wythe@intrig.com>

wythe force-pushed the shutdown branch from 859b056 to 9005a45 Compare December 20, 2018 00:44

NicolasDorier mentioned this pull request Dec 20, 2018

[Docker] Make sure lightningd receives SIGTERM #2193

Merged

cdecker merged commit bcde967 into ElementsProject:master Dec 22, 2018

cdecker pushed a commit that referenced this pull request Dec 29, 2018

[Docker] Make sure lightningd receive SIGTERM

f9f4ed8

In conjunction to #2172 this fix #2074 if EXPOSE_TCP is false (which is the case in production) More info #2172 (comment)

NicolasDorier mentioned this pull request Mar 26, 2019

Docker: Add a graceful shutdown handler #2498

Closed

Handle SIGTERM and SIGINT #2172

Handle SIGTERM and SIGINT #2172

Uh oh!

Conversation

wythe commented Dec 12, 2018

Uh oh!

ZmnSCPxj commented Dec 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

ZmnSCPxj commented Dec 14, 2018

Uh oh!

rustyrussell commented Dec 14, 2018

Uh oh!

NicolasDorier commented Dec 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ZmnSCPxj commented Dec 14, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rustyrussell commented Dec 15, 2018

Uh oh!

NicolasDorier commented Dec 15, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cdecker commented Dec 15, 2018

Uh oh!

rustyrussell commented Dec 15, 2018

Uh oh!

NicolasDorier commented Dec 16, 2018

Uh oh!

NicolasDorier commented Dec 16, 2018

Uh oh!

NicolasDorier commented Dec 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NicolasDorier commented Dec 16, 2018

Uh oh!

NicolasDorier commented Dec 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NicolasDorier commented Dec 16, 2018

Uh oh!

rustyrussell commented Dec 17, 2018

Uh oh!

NicolasDorier commented Dec 17, 2018

Uh oh!

NicolasDorier commented Dec 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NicolasDorier commented Dec 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NicolasDorier commented Dec 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ZmnSCPxj commented Dec 19, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wythe commented Dec 19, 2018

Uh oh!

ZmnSCPxj commented Dec 20, 2018

Uh oh!

Uh oh!

wythe commented Dec 20, 2018

Uh oh!

ZmnSCPxj commented Dec 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NicolasDorier commented Dec 20, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NicolasDorier commented Dec 20, 2018

Uh oh!

ZmnSCPxj commented Dec 21, 2018

Uh oh!

Uh oh!

ZmnSCPxj commented Dec 13, 2018 •

edited

Loading

NicolasDorier commented Dec 14, 2018 •

edited

Loading

ZmnSCPxj commented Dec 14, 2018 •

edited

Loading

NicolasDorier commented Dec 15, 2018 •

edited

Loading

NicolasDorier commented Dec 16, 2018 •

edited

Loading

NicolasDorier commented Dec 16, 2018 •

edited

Loading

NicolasDorier commented Dec 17, 2018 •

edited

Loading

NicolasDorier commented Dec 17, 2018 •

edited

Loading

NicolasDorier commented Dec 17, 2018 •

edited

Loading

ZmnSCPxj commented Dec 19, 2018 •

edited

Loading

ZmnSCPxj commented Dec 20, 2018 •

edited

Loading

NicolasDorier commented Dec 20, 2018 •

edited

Loading