-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deregister services on SIGTERM #22
Conversation
- Handle SIGTERM, send to application - Add stopTimeout to config to manage wait time - Extend DiscoveryService - add Deregister - Deregister all services when application exits
Normally I'd suggest we open an issue first, but the way you've structured this PR makes it ideal for discussing the feature idea so this is great. Thanks for the PR @justenwalker. Some points of discussion:
Just to be specific, this patch catches I'd like if we documented the intended behavior with respect to Docker commands specifically. What you have in the README is accurate but it'll improve usability if we explicitly say "this is what happens with
Tricky... You don't terminate Containerbuddy itself in the signal handler (which is correct, because the Containerbuddy process will clean itself up once it gets the exit code from the shimmed application). So you could write a test that runs an We should also update the example application to use this feature. |
@@ -4,4 +4,5 @@ type DiscoveryService interface { | |||
SendHeartbeat(*ServiceConfig) | |||
CheckForUpstreamChanges(*BackendConfig) bool | |||
MarkForMaintenance(*ServiceConfig) | |||
Deregister(*ServiceConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that you've extended the DiscoveryService
such that if MarkForMaintenance
and Deregister
need to be separate implementations for a non-Consul discovery backend we can do that easily.
Also extract magic number to a named constant
I'm going to add some documentation like the following in Operating Container Buddy. I noticed some weirdness with Marathon but it applies more generally with the way you run Containerbuddy from docker if it is not the ENTRYPOINT of the image. Docker will automatically deliver a SIGTERM when you use |
Just for clarity: does Mesos/Marathon just fire SIGTERM to the shell process and pray cleanup works (yikes!), or are you saying that you don't want to use |
Mesos/Marathon will issue a From Mesos Docs - CommandInfo to run Docker images
|
That's unfortunate but I don't see any way around it. |
It's not Containerbuddy's responsibility to manage that - but I think it would be a good idea to make people aware of this because it manifests as "My services are not deregistering from Consul when Mesos shuts them down". I spent all night tracking this down, so i'd hate for anyone else to have to suffer through it. |
- Add note about stopTimeout = -1 - Document docker stop vs docker kill - Add some Caveats about shell ENTRYPOINT
048105e
to
1e9816a
Compare
I'm in total agreement there. |
Added some tests for the SIGTERM handle. Also, found out that we can only call handleSignals once, otherwise we'll get duplicate handles - so I had to dump the SIGTERM test into a single function along with SIGUSR1 |
9f37eb4
to
c425858
Compare
- Split SIGUSR1 and SIGTERM tests into separate functions
@tgross Updated tests and removed panic on cmd not found. |
This opens up the door to more than just deregistering services. Consider running Couchbase in Docker. The blueprint and demo makes deploying Couchbase and scaling it up easy (repo), but scaling down requires more steps that haven't been automated. A I haven't tested it, but I think the right command to call would be: couchbase-cli rebalance -c 127.0.0.1:8091 -u $COUCHBASE_USER -p $COUCHBASE_PASS --server-remove=${IP_PRIVATE}:8091 And when that is done, it should be safe to stop (and remove/delete) the container. |
@misterbisson Added another config for onStop support - Could be used for any arbitrary commands necessary to clean up after a service exits. |
6be9aa9
to
c67bb9b
Compare
@justenwalker I was returning to the ticket to suggest I should move that to a ticket and let this PR move forward, but you were too quick!
I like this a lot. I do have a question about "will be called immediately after the shimmed application exits," though. My usage scenario requires that this handler be called before the shimmed application exits, and that Containerbuddy wait for its exit before continuing. Semantically, Edit/followupWould it be possible to move the Perhaps the PR should also move If my suggestion here is followed, it should also be noted that the time of the |
if err != nil { | ||
log.Println(err) | ||
} | ||
deregisterServices(config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be called in the top of terminate
in signals.go
so that the requests to the service can be stopped before the service truly stops responding to them?
Perhaps I'm misreading the intention of stopTimeout
, but it would seem that would be useful to help keep the shimmed service running past the heartbeat/healthcheck TTL, no? But, with the order of operations here, the shimmed app has been terminated before being deregistered, yes?
See also #22 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be called in the top of terminate in signals.go so that the requests to the service can be stopped before the service truly stops responding to them?
If we move the call to deregisterServices
then we'll miss cases where the application exits normally without a docker stop
(processes don't call SIGTERM
on themselves when they exit).
But your concern is well-founded. Maybe we need to also include the deregisterServices
in the beginning of terminate
; we'd need to either handle the error from the discovery service on the second call or set a flag that we've already terminated the process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why the application should continue running after deregister is called. My intention for the cleanup method is to be analogous to a try / finally block.
stopTimeout is just the time that Containerbuddy will wait before killing the application itself (after sending SIGTERM) and proceeding with the cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but as-written you're running the application, it exits (via exiting normally or SIGTERM) and then we're running deregister afterwards. Which means the application exits and can't serve requests but the discovery service will still think it's healthy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I can move it to signals.go before terminate.
Is it possible that while this is going on, the poll function re-registers it though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that while this is going on, the poll function re-registers it though?
No because we check for maintenance mode for running any poll function: https://github.com/joyent/containerbuddy/blob/288c18cc9df4bd15f64be59456f9f362655af30e/src/containerbuddy/main.go#L74
|
||
const ( | ||
// Amount of time to wait before killing the application | ||
DEFAULT_STOP_TIMEOUT int = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Golang style is not to all-caps these. Should just be lower-case so that we're not exporting it.
@misterbisson In my mind, OnStop makes more sense to happen after the application exits rather than before - so that we know the service isn't interacting with anything before we send cleanup commands (like Finally block). Perhaps I should undo this commit and move it to a separate PR - as I've kind of incorporated too many changes into this PR at once? |
I agree. A use case here would be backing up state to somewhere external after the application has exited (and therefore given up hold on file handles).
I think that's probably a good idea. That way we can discuss that new feature separately. |
c67bb9b
to
36f633e
Compare
@tgross @misterbisson backed out the OnStop change. Will do separate PR |
@misterbisson #28 created to discuss OnStop |
@tgross Im stopping all of the polling threads, then deregistering the service, and finally terminating the application. I did not use the maintenance mode - maybe I should? |
My first pass thru implementing maintenance mode I did this the same way you did here but then I realized we couldn't un-stop the polling threads so I used the What you've done probably works just as well or better for the SIGTERM handler. I'm a little worried about implementation drift between maintenance and deregister... the |
Function that takes a config and a function which operates on each service in the config.
@tgross I factored out that logic into its own function: forAllServices |
@@ -88,13 +88,16 @@ Other fields: | |||
|
|||
- `consul` is the hostname and port of the Consul discovery service. | |||
- `onStart` is the executable (and its arguments) that will be called immediately prior to starting the shimmed application. This field is optional. If the `onStart` handler returns a non-zero exit code, Containerbuddy will exit. | |||
- `stopTimeout` Optional amount of time in seconds to wait before killing the application. (defaults to `5`). Providing `-1` will kill the application immediately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last comment: we should show this field in the JSON example above.
Ok, @justenwalker I've verified the tests pass, that we're |
Ok, this looks good to merge! Thanks for the hard work in bringing this home @justenwalker! |
@justenwalker thanks for creating the new ticket. I've picked up the conversation there, as you've probably already seen ;-) This is a good feature to add, thanks for working and thinking it through with us! |
Changes
stopTimeout
to config to manage wait timeUse Case
When i run my Containerbuddy images, i want them to automatically deregister from consul when they die, so that they do not clutter up the service namespace. Especially problematic when scaling these containers out with Mesos where applications may be moved around frequently.
Notes
I'm a bit of a golang noob, so I'm not sure how to add tests for this, since this has some external effects. FWIW, I tested this empirically using a local Mesos and Consul cluster.