
Deregister services on SIGTERM #22

Merged
merged 12 commits into master on Dec 4, 2015

Conversation

justenwalker
Contributor

Changes

  • Handle SIGTERM, send to application
  • Add stopTimeout to config to manage wait time
  • Extend DiscoveryService - add Deregister
  • Deregister all services when application exits

Use Case

When I run my Containerbuddy images, I want them to automatically deregister from Consul when they die, so that they do not clutter up the service namespace. This is especially problematic when scaling these containers out with Mesos, where applications may be moved around frequently.

Notes

I'm a bit of a golang noob, so I'm not sure how to add tests for this, since this has some external effects. FWIW, I tested this empirically using a local Mesos and Consul cluster.

- Handle SIGTERM, send to application
- Add stopTimeout to config to manage wait time
- Extend DiscoveryService - add Deregister
- Deregister all services when application exits
@tgross
Contributor

tgross commented Dec 3, 2015

Normally I'd suggest we open an issue first, but the way you've structured this PR makes it ideal for discussing the feature idea so this is great. Thanks for the PR @justenwalker. Some points of discussion:

i want them to automatically deregister from consul when they die

Just to be specific, this patch catches SIGTERM which means we're catching docker stop (or really the equivalent remote API call) but not docker kill or involuntary exit like a crash. I do like that split in behavior though -- it means that containers intentionally removed will deregister but ones that have crashed will remain in Consul until you reap them separately, which will help out for debugging.

I'd like it if we documented the intended behavior with respect to Docker commands specifically. What you have in the README is accurate, but it'll improve usability if we explicitly say "this is what happens with docker stop vs docker kill" (e.g. the Docker docs say they fire SIGTERM + SIGKILL, so you don't have to dig into the source to verify the behavior).

so I'm not sure how to add tests for this

Tricky... You don't terminate Containerbuddy itself in the signal handler (which is correct, because the Containerbuddy process will clean itself up once it gets the exit code from the shimmed application). So you could write a test that runs an executeAndWait on something like sleep and then fires the appropriate signals and checks the exit code to make sure they were caught.

We should also update the example application to use this feature.

@@ -4,4 +4,5 @@ type DiscoveryService interface {
SendHeartbeat(*ServiceConfig)
CheckForUpstreamChanges(*BackendConfig) bool
MarkForMaintenance(*ServiceConfig)
Deregister(*ServiceConfig)
Contributor

I like that you've extended the DiscoveryService such that if MarkForMaintenance and Deregister need to be separate implementations for a non-Consul discovery backend we can do that easily.
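To illustrate that flexibility, a toy non-Consul backend might implement the extended interface like this (the types here are stand-ins, not Containerbuddy's actual structs): maintenance flags the record while Deregister removes it entirely.

```go
package main

import "fmt"

// Stand-ins for Containerbuddy's config types; the real structs
// carry much more (IDs, TTLs, health checks, and so on).
type ServiceConfig struct{ ID string }
type BackendConfig struct{ Name string }

// The interface as extended in this PR.
type DiscoveryService interface {
	SendHeartbeat(*ServiceConfig)
	CheckForUpstreamChanges(*BackendConfig) bool
	MarkForMaintenance(*ServiceConfig)
	Deregister(*ServiceConfig)
}

// InMemoryDiscovery is a hypothetical non-Consul backend showing that
// MarkForMaintenance and Deregister can have separate implementations.
type InMemoryDiscovery struct {
	registered  map[string]bool
	maintenance map[string]bool
}

func NewInMemoryDiscovery() *InMemoryDiscovery {
	return &InMemoryDiscovery{map[string]bool{}, map[string]bool{}}
}

func (d *InMemoryDiscovery) SendHeartbeat(s *ServiceConfig) { d.registered[s.ID] = true }

func (d *InMemoryDiscovery) CheckForUpstreamChanges(b *BackendConfig) bool { return false }

func (d *InMemoryDiscovery) MarkForMaintenance(s *ServiceConfig) { d.maintenance[s.ID] = true }

func (d *InMemoryDiscovery) Deregister(s *ServiceConfig) {
	delete(d.registered, s.ID)
	delete(d.maintenance, s.ID)
}

func main() {
	var disco DiscoveryService = NewInMemoryDiscovery()
	svc := &ServiceConfig{ID: "app-1"}
	disco.SendHeartbeat(svc)
	disco.MarkForMaintenance(svc)
	disco.Deregister(svc)
	fmt.Println("deregistered cleanly")
}
```
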

Also extract magic number to a named constant
@justenwalker
Contributor Author

I'm going to add some documentation like the following to Operating Containerbuddy. I noticed some weirdness with Marathon, but it applies more generally to the way you run Containerbuddy from Docker if it is not the ENTRYPOINT of the image.

Docker will automatically deliver a SIGTERM when you use docker stop, but not when using docker kill. Caveat: if Containerbuddy is executed in a shell, such as /bin/sh -c '/opt/containerbuddy .... ', then SIGTERM will not reach Containerbuddy from docker stop. This is important for systems like Marathon on Mesos, which do this by default.
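The distinction could be illustrated in the docs with the two Dockerfile ENTRYPOINT forms (paths below are hypothetical): in the exec form Containerbuddy is PID 1 and receives SIGTERM from docker stop, while in the shell form /bin/sh is PID 1 and swallows the signal.

```dockerfile
# exec form: Containerbuddy is PID 1 and receives SIGTERM on `docker stop`
ENTRYPOINT ["/opt/containerbuddy/containerbuddy", "-config", "/etc/containerbuddy.json"]

# shell form: /bin/sh is PID 1; SIGTERM stops the shell, not Containerbuddy
# ENTRYPOINT /opt/containerbuddy/containerbuddy -config /etc/containerbuddy.json
```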

@tgross
Contributor

tgross commented Dec 3, 2015

This is important for systems like Marathon on Mesos which do this by default.

Just for clarity: does Mesos/Marathon just fire SIGTERM to the shell process and pray cleanup works (yikes!), or are you saying that you don't want to use docker stop directly on a container launched by Marathon/Mesos?

@justenwalker
Contributor Author

Mesos/Marathon will issue a docker stop on the container when shutting it down (at least if you specify docker_stop_timeout in your mesos-slave config). However, the problem is when launching the container: the default is to use a shell executor, which blocks signals from propagating.

From Mesos Docs - CommandInfo to run Docker images

To run a docker image with the default command (ie: docker run image), the CommandInfo’s value must not be set. If the value is set then it will override the default command.
To run a docker image with an entrypoint defined, the CommandInfo’s shell option must be set to false. If shell option is set to true the Docker Containerizer will run the user’s command wrapped with /bin/sh -c which will also become parameters to the image entrypoint.
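If I recall Marathon's app definition API correctly, the way to get shell=false is to use args instead of cmd; a made-up example definition:

```json
{
  "id": "/myapp",
  "container": {
    "type": "DOCKER",
    "docker": { "image": "example/containerbuddy-app" }
  },
  "args": ["-config", "/etc/containerbuddy.json"]
}
```

(Field values here are illustrative only.)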

@tgross
Contributor

tgross commented Dec 3, 2015

However, the problem is when launching the container - the default is to use a shell executor, which blocks signals from propagating.

That's unfortunate but I don't see any way around it.

@justenwalker
Contributor Author

It's not Containerbuddy's responsibility to manage that, but I think it would be a good idea to make people aware of this because it manifests as "My services are not deregistering from Consul when Mesos shuts them down". I spent all night tracking this down, so I'd hate for anyone else to have to suffer through it.

- Add note about stopTimeout = -1
- Document docker stop vs docker kill
- Add some Caveats about shell ENTRYPOINT
@tgross
Contributor

tgross commented Dec 3, 2015

It's not Containerbuddy's responsibility to manage that, but I think it would be a good idea to make people aware of this because it manifests as "My services are not deregistering from Consul when Mesos shuts them down". I spent all night tracking this down, so I'd hate for anyone else to have to suffer through it.

I'm in total agreement there.

@justenwalker
Contributor Author

Added some tests for the SIGTERM handle.

Also, I found out that we can only call handleSignals once, otherwise we'll get duplicate handlers, so I had to put the SIGTERM test into a single function along with the SIGUSR1 test.

@tgross tgross mentioned this pull request Dec 3, 2015
- Split SIGUSR1 and SIGTERM tests into separate functions
@justenwalker
Contributor Author

@tgross Updated tests and removed panic on cmd not found.

@misterbisson
Contributor

This opens the door to more than just deregistering services. Consider running Couchbase in Docker. The blueprint and demo make deploying Couchbase and scaling it up easy (repo), but scaling down requires more steps that haven't been automated.

A SIGTERM handler could be exactly what's needed to add that automation. If it could also execute a user-defined executable (and wait for it), it would allow us to mark the Couchbase node for removal from the cluster and automatically rebalance the data to the remaining nodes before stopping it.

I haven't tested it, but I think the right command to call would be:

couchbase-cli rebalance -c 127.0.0.1:8091 -u $COUCHBASE_USER -p $COUCHBASE_PASS --server-remove=${IP_PRIVATE}:8091

And when that is done, it should be safe to stop (and remove/delete) the container.

@justenwalker
Contributor Author

@misterbisson Added another config for onStop support - Could be used for any arbitrary commands necessary to clean up after a service exits.

@misterbisson
Contributor

@justenwalker I was returning to this ticket to suggest moving that idea to a separate ticket and letting this PR move forward, but you were too quick!

onStop is the executable (and its arguments) that will be called immediately after the shimmed application exits. This field is optional. If the onStop handler returns a non-zero exit code, Containerbuddy will exit with this code rather than the application's exit code.

I like this a lot. I do have a question about "will be called immediately after the shimmed application exits," though. My usage scenario requires that this handler be called before the shimmed application exits, and that Containerbuddy wait for its exit before continuing.

Semantically, onStart is actually executed before the application starts. Would it be okay for onStop to similarly execute before the actual stop? Or, did you have a specific usage in mind that requires calling onStop after the shimmed app's exit?

Edit/followup

Would it be possible to move the onStop handler from main.go to signals.go, at the top of terminate?

Perhaps the PR should also move deregisterServices from main.go into the top of terminate in signals.go as well, so that the requests to the service can be stopped before the service truly stops responding to them. But perhaps you had a different plan for that too?

If my suggestion here is followed, it should also be noted that the time taken by the onStop handler plus the stopTimeout would have to be less than the Mesos docker_stop_timeout to avoid errors.

if err != nil {
log.Println(err)
}
deregisterServices(config)
Contributor

Should this be called in the top of terminate in signals.go so that the requests to the service can be stopped before the service truly stops responding to them?

Perhaps I'm misreading the intention of stopTimeout, but it would seem it would be useful for keeping the shimmed service running past the heartbeat/healthcheck TTL, no? But with the order of operations here, the shimmed app has been terminated before being deregistered, yes?

See also #22 (comment)

Contributor

Should this be called in the top of terminate in signals.go so that the requests to the service can be stopped before the service truly stops responding to them?

If we move the call to deregisterServices then we'll miss cases where the application exits normally without a docker stop (processes don't send themselves SIGTERM when they exit).

But your concern is well-founded. Maybe we need to also include the deregisterServices in the beginning of terminate; we'd need to either handle the error from the discovery service on the second call or set a flag that we've already terminated the process.
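One way to implement that flag with the standard library (deregisterAll below is a hypothetical stand-in for deregisterServices): guard the cleanup with sync.Once so it runs exactly once whether it is reached via the normal-exit path, the signal handler, or both.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	cleanupOnce     sync.Once
	deregisterCalls int
)

// deregisterAll is a stand-in for deregisterServices; sync.Once makes
// repeated calls from different exit paths a no-op after the first.
func deregisterAll() {
	cleanupOnce.Do(func() {
		deregisterCalls++
		fmt.Println("deregistering services")
	})
}

func main() {
	deregisterAll() // from the SIGTERM handler
	deregisterAll() // again from the normal exit path: does nothing
	fmt.Println("calls:", deregisterCalls)
}
```
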

Contributor Author

I don't see why the application should continue running after deregister is called. My intention for the cleanup method is to be analogous to a try / finally block.

stopTimeout is just the time that Containerbuddy will wait before killing the application itself (after sending SIGTERM) and proceeding with the cleanup.

Contributor

Right, but as written you're running the application, it exits (either normally or via SIGTERM), and then we're running deregister afterwards. This means the application exits and can't serve requests, but the discovery service will still think it's healthy.

Contributor Author

I agree. I can move it to signals.go before terminate.

Is it possible that while this is going on, the poll function re-registers it though?

Contributor

Is it possible that while this is going on, the poll function re-registers it though?

No, because we check for maintenance mode before running any poll function: https://github.com/joyent/containerbuddy/blob/288c18cc9df4bd15f64be59456f9f362655af30e/src/containerbuddy/main.go#L74


const (
// Amount of time to wait before killing the application
DEFAULT_STOP_TIMEOUT int = 5
Contributor

Golang style is not to all-caps these. Should just be lower-case so that we're not exporting it.

@justenwalker
Contributor Author

@misterbisson In my mind, OnStop makes more sense happening after the application exits rather than before, so that we know the service isn't interacting with anything before we send cleanup commands (like a finally block). Perhaps I should undo this commit and move it to a separate PR, as I've incorporated too many changes into this PR at once?

@tgross
Contributor

tgross commented Dec 4, 2015

In my mind, OnStop makes more sense happening after the application exits rather than before, so that we know the service isn't interacting with anything before we send cleanup commands (like a finally block)

I agree. A use case here would be backing up state to somewhere external after the application has exited (and has therefore given up its hold on file handles).

Perhaps I should undo this commit and move it to a separate PR - as I've kind of incorporated too many changes into this PR at once?

I think that's probably a good idea. That way we can discuss that new feature separately.

@justenwalker
Contributor Author

@tgross @misterbisson Backed out the OnStop change; will do a separate PR.

@justenwalker
Contributor Author

@misterbisson #28 created to discuss OnStop

@justenwalker
Contributor Author

@tgross I'm stopping all of the polling threads, then deregistering the service, and finally terminating the application.

I did not use the maintenance mode - maybe I should?

@tgross
Contributor

tgross commented Dec 4, 2015

I did not use the maintenance mode - maybe I should?

On my first pass through implementing maintenance mode I did this the same way you did here, but then I realized we couldn't un-stop the polling threads, so I used the toggleMaintenanceMode function at the top of the SIGUSR1 handler.

What you've done probably works just as well or better for the SIGTERM handler. I'm a little worried about implementation drift between maintenance and deregister... the deregisterServices code is similar to the loop in the SIGUSR1 handler, except with a different function called on each service. Maybe we can factor that loop out and have a func as one of its arguments?

Function that takes a config and a function which operates on each service in the config.
@justenwalker
Contributor Author

@tgross I factored out that logic into its own function: forAllServices
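For reference, a minimal sketch of that shape (the config types below are stand-ins for Containerbuddy's actual structs): one loop, with the per-service behavior passed in as a func, so the SIGTERM and SIGUSR1 handlers differ only in the function they supply.

```go
package main

import "fmt"

// Stand-ins for Containerbuddy's config types.
type ServiceConfig struct{ Name string }
type Config struct{ Services []*ServiceConfig }

// forAllServices applies fn to each service in the config, so callers
// like the deregister and maintenance handlers can share the loop.
func forAllServices(config *Config, fn func(*ServiceConfig)) {
	for _, service := range config.Services {
		fn(service)
	}
}

func main() {
	cfg := &Config{Services: []*ServiceConfig{{"app"}, {"db"}}}
	forAllServices(cfg, func(s *ServiceConfig) {
		fmt.Println("deregister:", s.Name)
	})
}
```
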

@@ -88,13 +88,16 @@ Other fields:

- `consul` is the hostname and port of the Consul discovery service.
- `onStart` is the executable (and its arguments) that will be called immediately prior to starting the shimmed application. This field is optional. If the `onStart` handler returns a non-zero exit code, Containerbuddy will exit.
- `stopTimeout` Optional amount of time in seconds to wait before killing the application. (defaults to `5`). Providing `-1` will kill the application immediately.
Contributor

Last comment: we should show this field in the JSON example above.
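Something like this, say (values are illustrative and the real example's other fields are omitted here):

```json
{
  "consul": "consul:8500",
  "stopTimeout": 5
}
```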

@tgross
Contributor

tgross commented Dec 4, 2015

Ok, @justenwalker I've verified the tests pass, that we're go fmt clean, and that the example application builds and runs. I added one last documentation comment and then I think we can merge this in.

@tgross
Contributor

tgross commented Dec 4, 2015

Add stopTimeout to example json

Ok, this looks good to merge! Thanks for the hard work in bringing this home @justenwalker!

tgross added a commit that referenced this pull request Dec 4, 2015
Deregister services on SIGTERM
@tgross tgross merged commit 833455c into TritonDataCenter:master Dec 4, 2015
@misterbisson
Contributor

@justenwalker thanks for creating the new ticket. I've picked up the conversation there, as you've probably already seen ;-)

This is a good feature to add, thanks for working and thinking it through with us!
