Continuous health checks #746
Comments
Note that there is some movement on the Mesos side here:
We have considered this feature. The only reason we haven't done this is because of cascading failures, and the complexity around handling them. You really need to have a lightweight health check, or Singularity would have to be very smart about how to handle failures. Typically, health checks at deploy time are fairly heavyweight (to catch configuration issues), so we'd probably introduce the concept of 2 separate health checks.
Yeah, the exact same concern is why I have been deferring this feature on our side as well. It's really easy to write a health check that will bring down the entire platform as soon as one thing goes wrong. My favorite idea right now is to have a separate health check (I'm calling it the "suicide" endpoint, but that might be too morbid). If it errors, the instance is a candidate for termination, but only if all of the request's other instances are healthy. This means that we would only ever kill at most one instance at a time, and if it does not recover, we are then stuck until an administrator intervenes.
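The safety rule described above can be sketched in a few lines. This is a hypothetical illustration, not Singularity code; the `Instance` type and field names are assumptions made for the sketch:

```go
package main

import "fmt"

// Instance is a hypothetical view of one running instance of a request.
type Instance struct {
	ID      string
	Healthy bool // result of the regular health check
	Suicide bool // the separate "suicide" endpoint reported an error
}

// candidateForTermination returns at most one instance to kill: an instance
// whose suicide endpoint is failing, but only if every other instance of the
// request is healthy. This caps the blast radius at one instance at a time.
func candidateForTermination(instances []Instance) (Instance, bool) {
	for i, inst := range instances {
		if !inst.Suicide {
			continue
		}
		othersHealthy := true
		for j, other := range instances {
			if j != i && !other.Healthy {
				othersHealthy = false
				break
			}
		}
		if othersHealthy {
			return inst, true
		}
	}
	return Instance{}, false
}

func main() {
	instances := []Instance{
		{ID: "task-1", Healthy: true, Suicide: false},
		{ID: "task-2", Healthy: false, Suicide: true}, // failing, others healthy
	}
	if inst, ok := candidateForTermination(instances); ok {
		fmt.Println("terminate", inst.ID)
	} else {
		fmt.Println("no safe termination")
	}
}
```

If two or more instances are unhealthy at once, the function returns nothing and the system stays stuck until an operator steps in, which is exactly the conservative behavior proposed above.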
Yup, I like that idea.
I like the safety of the check that other members are healthy, but it would be helpful if it were configurable.
At Nitro, and now also at my current employer, we've run Singularity with Sidecar doing health checks and our custom executor managing the containers. We fail containers from the executor rather than from Singularity when the health check fails. This has worked great in production for 2+ years.

How it works is that Sidecar continually health checks the services, and the executor watches the Sidecar state for the task's health. When Sidecar fails the checks for a configurable amount of time, we take down the task from the executor and notify Mesos that it failed.

We insist on lightweight health checks that can respond without much computation. For places where the health check is not lightweight, the services in question run the checks on a timed loop internally, and the health check endpoint just reports on the latest state.

The executor is a Docker-only implementation and we run one executor per task. The executors are written in Go and take less than 16MB of RAM to run, so the overhead is minimal.
Is there an option for continuous health checks (not just right after a deployment), or are you considering this feature? This could be a good implementation of the service watchdog pattern. Aside from the potential for cascading failures, are there reasons not to do this?