
Deregister services on SIGTERM #22

Merged
merged 12 commits into master on Dec 4, 2015

Conversation

justenwalker
Contributor

Changes

  • Handle SIGTERM, send to application
  • Add stopTimeout to config to manage wait time
  • Extend DiscoveryService - add Deregister
  • Deregister all services when application exits

Use Case

When I run my Containerbuddy images, I want them to automatically deregister from Consul when they die, so that they do not clutter up the service namespace. This is especially problematic when scaling these containers out with Mesos, where applications may be moved around frequently.

Notes

I'm a bit of a golang noob, so I'm not sure how to add tests for this, since this has some external effects. FWIW, I tested this empirically using a local Mesos and Consul cluster.

- Handle SIGTERM, send to application
- Add stopTimeout to config to manage wait time
- Extend DiscoveryService - add Deregister
- Deregister all services when application exits
@tgross
Contributor

tgross commented Dec 3, 2015

Normally I'd suggest we open an issue first, but the way you've structured this PR makes it ideal for discussing the feature idea so this is great. Thanks for the PR @justenwalker. Some points of discussion:

i want them to automatically deregister from consul when they die

Just to be specific, this patch catches SIGTERM which means we're catching docker stop (or really the equivalent remote API call) but not docker kill or involuntary exit like a crash. I do like that split in behavior though -- it means that containers intentionally removed will deregister but ones that have crashed will remain in Consul until you reap them separately, which will help out for debugging.

I'd like it if we documented the intended behavior with respect to Docker commands specifically. What you have in the README is accurate, but it'll improve usability if we explicitly say "this is what happens with docker stop vs docker kill" (e.g. the Docker docs say they fire SIGTERM + SIGKILL, so you don't have to dig into the source to verify the behavior).

so I'm not sure how to add tests for this

Tricky... You don't terminate Containerbuddy itself in the signal handler (which is correct, because the Containerbuddy process will clean itself up once it gets the exit code from the shimmed application). So you could write a test that runs an executeAndWait on something like sleep and then fires the appropriate signals and checks the exit code to make sure they were caught.

We should also update the example application to use this feature.

@@ -4,4 +4,5 @@ type DiscoveryService interface {
SendHeartbeat(*ServiceConfig)
CheckForUpstreamChanges(*BackendConfig) bool
MarkForMaintenance(*ServiceConfig)
Deregister(*ServiceConfig)
Contributor

I like that you've extended the DiscoveryService such that if MarkForMaintenance and Deregister need to be separate implementations for a non-Consul discovery backend we can do that easily.
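To illustrate that flexibility, a toy non-Consul backend might implement the extended interface like this (the types here are stand-ins, not Containerbuddy's actual structs): maintenance flags the record while Deregister removes it entirely.

```go
package main

import "fmt"

// Stand-ins for Containerbuddy's config types; the real structs
// carry much more (IDs, TTLs, health checks, and so on).
type ServiceConfig struct{ ID string }
type BackendConfig struct{ Name string }

// The interface as extended in this PR.
type DiscoveryService interface {
	SendHeartbeat(*ServiceConfig)
	CheckForUpstreamChanges(*BackendConfig) bool
	MarkForMaintenance(*ServiceConfig)
	Deregister(*ServiceConfig)
}

// InMemoryDiscovery is a hypothetical non-Consul backend showing that
// MarkForMaintenance and Deregister can have separate implementations.
type InMemoryDiscovery struct {
	registered  map[string]bool
	maintenance map[string]bool
}

func NewInMemoryDiscovery() *InMemoryDiscovery {
	return &InMemoryDiscovery{map[string]bool{}, map[string]bool{}}
}

func (d *InMemoryDiscovery) SendHeartbeat(s *ServiceConfig) { d.registered[s.ID] = true }

func (d *InMemoryDiscovery) CheckForUpstreamChanges(b *BackendConfig) bool { return false }

func (d *InMemoryDiscovery) MarkForMaintenance(s *ServiceConfig) { d.maintenance[s.ID] = true }

func (d *InMemoryDiscovery) Deregister(s *ServiceConfig) {
	delete(d.registered, s.ID)
	delete(d.maintenance, s.ID)
}

func main() {
	var disco DiscoveryService = NewInMemoryDiscovery()
	svc := &ServiceConfig{ID: "app-1"}
	disco.SendHeartbeat(svc)
	disco.MarkForMaintenance(svc)
	disco.Deregister(svc)
	fmt.Println("deregistered cleanly")
}
```
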

Also extract magic number to a named constant
@justenwalker
Contributor Author

I'm going to add some documentation like the following to Operating Containerbuddy. I noticed some weirdness with Marathon, but it applies more generally to the way you run Containerbuddy from Docker if it is not the ENTRYPOINT of the image.

Docker will automatically deliver a SIGTERM when you use docker stop, but not when using docker kill. Caveat: if Containerbuddy is executed in a shell, such as /bin/sh -c '/opt/containerbuddy .... ', then SIGTERM will not reach Containerbuddy from docker stop. This is important for systems like Marathon on Mesos, which do this by default.
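The distinction could be illustrated in the docs with the two Dockerfile ENTRYPOINT forms (paths below are hypothetical): in the exec form Containerbuddy is PID 1 and receives SIGTERM from docker stop, while in the shell form /bin/sh is PID 1 and swallows the signal.

```dockerfile
# exec form: Containerbuddy is PID 1 and receives SIGTERM on `docker stop`
ENTRYPOINT ["/opt/containerbuddy/containerbuddy", "-config", "/etc/containerbuddy.json"]

# shell form: /bin/sh is PID 1; SIGTERM stops the shell, not Containerbuddy
# ENTRYPOINT /opt/containerbuddy/containerbuddy -config /etc/containerbuddy.json
```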

@tgross
Contributor

tgross commented Dec 3, 2015

This is important for systems like Marathon on Mesos which do this by default.

Just for clarity: does Mesos/Marathon just fire SIGTERM to the shell process and pray cleanup works (yikes!), or are you saying that you don't want to use docker stop directly on a container launched by Marathon/Mesos?

@justenwalker
Contributor Author

Mesos/Marathon will issue a docker stop on the container when shutting it down (at least if you specify docker_stop_timeout in your mesos-slave config). However, the problem is when launching the container: the default is to use a shell executor, which blocks signals from propagating.

From Mesos Docs - CommandInfo to run Docker images

To run a docker image with the default command (ie: docker run image), the CommandInfo’s value must not be set. If the value is set then it will override the default command.
To run a docker image with an entrypoint defined, the CommandInfo’s shell option must be set to false. If shell option is set to true the Docker Containerizer will run the user’s command wrapped with /bin/sh -c which will also become parameters to the image entrypoint.
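If I recall Marathon's app definition API correctly, the way to get shell=false is to use args instead of cmd; a made-up example definition:

```json
{
  "id": "/myapp",
  "container": {
    "type": "DOCKER",
    "docker": { "image": "example/containerbuddy-app" }
  },
  "args": ["-config", "/etc/containerbuddy.json"]
}
```

(Field values here are illustrative only.)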

@tgross
Contributor

tgross commented Dec 3, 2015

However, the problem is when launching the container - the default is to use a shell executor, which blocks signals from propagating.

That's unfortunate but I don't see any way around it.

@justenwalker
Contributor Author

It's not Containerbuddy's responsibility to manage that, but I think it would be a good idea to make people aware of this because it manifests as "My services are not deregistering from Consul when Mesos shuts them down". I spent all night tracking this down, so I'd hate for anyone else to have to suffer through it.

- Add note about stopTimeout = -1
- Document docker stop vs docker kill
- Add some Caveats about shell ENTRYPOINT
@tgross
Contributor

tgross commented Dec 3, 2015

It's not Containerbuddy's responsibility to manage that, but I think it would be a good idea to make people aware of this because it manifests as "My services are not deregistering from Consul when Mesos shuts them down". I spent all night tracking this down, so I'd hate for anyone else to have to suffer through it.

I'm in total agreement there.

@justenwalker
Contributor Author

Added some tests for the SIGTERM handle.

Also, I found out that we can only call handleSignals once, otherwise we'll get duplicate handlers, so I had to put the SIGTERM test into a single function along with the SIGUSR1 test.

@tgross tgross mentioned this pull request Dec 3, 2015
- Split SIGUSR1 and SIGTERM tests into separate functions
@justenwalker
Contributor Author

@tgross Updated tests and removed panic on cmd not found.

@misterbisson
Contributor

This opens the door to more than just deregistering services. Consider running Couchbase in Docker. The blueprint and demo make deploying Couchbase and scaling it up easy (repo), but scaling down requires more steps that haven't been automated.

A SIGTERM handler could be exactly what's needed to add that automation. If it could also execute a user-defined executable (and wait for it), it would allow us to mark the Couchbase node for removal from the cluster and automatically rebalance the data to the remaining nodes before stopping it.

I haven't tested it, but I think the right command to call would be:

couchbase-cli rebalance -c 127.0.0.1:8091 -u $COUCHBASE_USER -p $COUCHBASE_PASS --server-remove=${IP_PRIVATE}:8091

And when that is done, it should be safe to stop (and remove/delete) the container.

@justenwalker
Contributor Author

@misterbisson Added another config for onStop support - Could be used for any arbitrary commands necessary to clean up after a service exits.

@misterbisson
Contributor

@justenwalker I was returning to this ticket to suggest moving that idea to a separate ticket and letting this PR move forward, but you were too quick!

onStop is the executable (and its arguments) that will be called immediately after the shimmed application exits. This field is optional. If the onStop handler returns a non-zero exit code, Containerbuddy will exit with this code rather than the application's exit code.

I like this a lot. I do have a question about "will be called immediately after the shimmed application exits," though. My usage scenario requires that this handler be called before the shimmed application exits, and that Containerbuddy wait for its exit before continuing.

Semantically, onStart is actually executed before the application starts. Would it be okay for onStop to similarly execute before the actual stop? Or, did you have a specific usage in mind that requires calling onStop after the shimmed app's exit?

Edit/followup

Would it be possible to move the onStop handler from main.go to signals.go, at the top of terminate?

Perhaps the PR should also move deregisterServices from main.go into the top of terminate in signals.go as well, so that the requests to the service can be stopped before the service truly stops responding to them. But perhaps you had a different plan for that too?

If my suggestion here is followed, it should also be noted that the time taken by the onStop handler plus the stopTimeout would have to be less than the Mesos docker_stop_timeout to avoid errors.

if err != nil {
log.Println(err)
}
deregisterServices(config)
Contributor

Should this be called in the top of terminate in signals.go so that the requests to the service can be stopped before the service truly stops responding to them?

Perhaps I'm misreading the intention of stopTimeout, but it would seem it would be useful for keeping the shimmed service running past the heartbeat/healthcheck TTL, no? But with the order of operations here, the shimmed app has been terminated before being deregistered, yes?

See also #22 (comment)

Contributor

Should this be called in the top of terminate in signals.go so that the requests to the service can be stopped before the service truly stops responding to them?

If we move the call to deregisterServices then we'll miss cases where the application exits normally without a docker stop (processes don't send themselves SIGTERM when they exit).

But your concern is well-founded. Maybe we need to also include the deregisterServices in the beginning of terminate; we'd need to either handle the error from the discovery service on the second call or set a flag that we've already terminated the process.
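One way to implement that flag with the standard library (deregisterAll below is a hypothetical stand-in for deregisterServices): guard the cleanup with sync.Once so it runs exactly once whether it is reached via the normal-exit path, the signal handler, or both.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	cleanupOnce     sync.Once
	deregisterCalls int
)

// deregisterAll is a stand-in for deregisterServices; sync.Once makes
// repeated calls from different exit paths a no-op after the first.
func deregisterAll() {
	cleanupOnce.Do(func() {
		deregisterCalls++
		fmt.Println("deregistering services")
	})
}

func main() {
	deregisterAll() // from the SIGTERM handler
	deregisterAll() // again from the normal exit path: does nothing
	fmt.Println("calls:", deregisterCalls)
}
```
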

Contributor Author

I don't see why the application should continue running after deregister is called. My intention for the cleanup method is to be analogous to a try / finally block.

stopTimeout is just the time that Containerbuddy will wait before killing the application itself (after sending SIGTERM) and proceeding with the cleanup.

Contributor

Right, but as written you're running the application, it exits (either normally or via SIGTERM), and then we're running deregister afterwards. This means the application exits and can't serve requests, but the discovery service will still think it's healthy.

Contributor Author

I agree. I can move it to signals.go before terminate.

Is it possible that while this is going on, the poll function re-registers it though?

Contributor

Is it possible that while this is going on, the poll function re-registers it though?

No, because we check for maintenance mode before running any poll function: https://github.com/joyent/containerbuddy/blob/288c18cc9df4bd15f64be59456f9f362655af30e/src/containerbuddy/main.go#L74


const (
// Amount of time to wait before killing the application
DEFAULT_STOP_TIMEOUT int = 5
Contributor

Golang style is not to all-caps these. Should just be lower-case so that we're not exporting it.

@justenwalker
Contributor Author

@misterbisson In my mind, OnStop makes more sense happening after the application exits rather than before, so that we know the service isn't interacting with anything before we send cleanup commands (like a finally block). Perhaps I should undo this commit and move it to a separate PR, as I've incorporated too many changes into this PR at once?

@tgross
Contributor

tgross commented Dec 4, 2015

In my mind, OnStop makes more sense happening after the application exits rather than before, so that we know the service isn't interacting with anything before we send cleanup commands (like a finally block)

I agree. A use case here would be backing up state to somewhere external after the application has exited (and has therefore given up its hold on file handles).

Perhaps I should undo this commit and move it to a separate PR - as I've kind of incorporated too many changes into this PR at once?

I think that's probably a good idea. That way we can discuss that new feature separately.

@justenwalker
Contributor Author

@tgross @misterbisson Backed out the OnStop change; will do a separate PR.

@justenwalker
Contributor Author

@misterbisson #28 created to discuss OnStop

@justenwalker
Contributor Author

@tgross I'm stopping all of the polling threads, then deregistering the service, and finally terminating the application.

I did not use the maintenance mode - maybe I should?

@tgross
Contributor

tgross commented Dec 4, 2015

I did not use the maintenance mode - maybe I should?

On my first pass through implementing maintenance mode I did this the same way you did here, but then I realized we couldn't un-stop the polling threads, so I used the toggleMaintenanceMode function at the top of the SIGUSR1 handler.

What you've done probably works just as well or better for the SIGTERM handler. I'm a little worried about implementation drift between maintenance and deregister... the deregisterServices code is similar to the loop in the SIGUSR1 handler, except with a different function called on each service. Maybe we can factor that loop out and have a func as one of its arguments?

Function that takes a config and a function which operates on each service in the config.
@justenwalker
Contributor Author

@tgross I factored out that logic into its own function: forAllServices
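For reference, a minimal sketch of that shape (the config types below are stand-ins for Containerbuddy's actual structs): one loop, with the per-service behavior passed in as a func, so the SIGTERM and SIGUSR1 handlers differ only in the function they supply.

```go
package main

import "fmt"

// Stand-ins for Containerbuddy's config types.
type ServiceConfig struct{ Name string }
type Config struct{ Services []*ServiceConfig }

// forAllServices applies fn to each service in the config, so callers
// like the deregister and maintenance handlers can share the loop.
func forAllServices(config *Config, fn func(*ServiceConfig)) {
	for _, service := range config.Services {
		fn(service)
	}
}

func main() {
	cfg := &Config{Services: []*ServiceConfig{{"app"}, {"db"}}}
	forAllServices(cfg, func(s *ServiceConfig) {
		fmt.Println("deregister:", s.Name)
	})
}
```
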

@@ -88,13 +88,16 @@ Other fields:

- `consul` is the hostname and port of the Consul discovery service.
- `onStart` is the executable (and its arguments) that will be called immediately prior to starting the shimmed application. This field is optional. If the `onStart` handler returns a non-zero exit code, Containerbuddy will exit.
- `stopTimeout` Optional amount of time in seconds to wait before killing the application. (defaults to `5`). Providing `-1` will kill the application immediately.
Contributor

Last comment: we should show this field in the JSON example above.
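Something like this, say (values are illustrative and the real example's other fields are omitted here):

```json
{
  "consul": "consul:8500",
  "stopTimeout": 5
}
```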

@tgross
Contributor

tgross commented Dec 4, 2015

Ok, @justenwalker I've verified the tests pass, that we're go fmt clean, and that the example application builds and runs. I added one last documentation comment and then I think we can merge this in.

@tgross
Contributor

tgross commented Dec 4, 2015

Add stopTimeout to example json

Ok, this looks good to merge! Thanks for the hard work in bringing this home @justenwalker!

tgross added a commit that referenced this pull request Dec 4, 2015
Deregister services on SIGTERM
@tgross tgross merged commit 833455c into TritonDataCenter:master Dec 4, 2015
@misterbisson
Contributor

@justenwalker thanks for creating the new ticket. I've picked up the conversation there, as you've probably already seen ;-)

This is a good feature to add, thanks for working and thinking it through with us!
