
Continuous health checks #746

Open · chornyi opened this issue Nov 4, 2015 · 6 comments

@chornyi commented Nov 4, 2015

Is there an option for continuous health checks (not just right after a deployment), or are you considering this feature? It could be a good implementation of the service watchdog pattern. Aside from the potential for cascading failures, are there reasons not to do this?

@stevenschlansker
Contributor

Note that there is some movement on the Mesos side here:
https://issues.apache.org/jira/browse/MESOS-2533

@wsorenson
Contributor

We have considered this feature. The only reason we haven't done it is because of cascading failures and the complexity of handling them. You really need a lightweight health check, or Singularity would have to be very smart about how it handles failures.

Typically, health checks at deploy time are fairly heavyweight (to catch configuration issues), so we'd probably introduce the concept of two separate health checks.
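
For illustration only, here's a minimal sketch of what two separate checks could look like on the service side. The endpoint paths, downstream URL, and port are made up for the example and are not part of Singularity's API:

```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

// Heavy deploy-time check: verifies the instance can actually reach a
// downstream dependency (hypothetical URL), so configuration problems
// surface right after a deploy.
func deployCheck(dependencyURL string) http.HandlerFunc {
	client := &http.Client{Timeout: 5 * time.Second}
	return func(w http.ResponseWriter, r *http.Request) {
		resp, err := client.Get(dependencyURL)
		if err != nil {
			http.Error(w, "dependency unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		resp.Body.Close()
		w.WriteHeader(http.StatusOK)
	}
}

// Light continuous check: only reports precomputed in-memory state,
// so a watchdog can poll it frequently without adding load.
func liveCheck(healthy *atomic.Bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !healthy.Load() {
			http.Error(w, "unhealthy", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	var healthy atomic.Bool
	healthy.Store(true)

	http.HandleFunc("/healthcheck/deploy", deployCheck("http://downstream.internal/ping"))
	http.HandleFunc("/healthcheck/live", liveCheck(&healthy))
	http.ListenAndServe(":8080", nil)
}
```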

@stevenschlansker
Contributor

Yeah, the exact same concern is why I have been deferring this feature on our side as well. It's really easy to write a health check that will bring down the entire platform as soon as one thing goes wrong.

My favorite idea right now is to have a separate health check (I'm calling it the "suicide" endpoint, but that might be too morbid). If it errors, the instance is a candidate for termination, but only if all of the request's other members are healthy. This means we would only ever kill at most one instance at a time, and if it does not recover, we are stuck until an administrator intervenes.
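
A rough sketch of that termination rule (the types and function names here are hypothetical, not existing Singularity code):

```go
package main

import "fmt"

// Instance is a hypothetical view of one running task of a request.
type Instance struct {
	ID             string
	SuicideFailing bool // the "suicide" endpoint is returning errors
	Healthy        bool // the regular health check is passing
}

// pickInstanceToKill applies the proposed rule: an instance is only a
// candidate for termination if its suicide check is failing AND every other
// member of the request is healthy. That guarantees at most one instance is
// killed at a time; if it never recovers, nothing else gets killed and an
// administrator has to step in.
func pickInstanceToKill(instances []Instance) (Instance, bool) {
	for i, candidate := range instances {
		if !candidate.SuicideFailing {
			continue
		}
		othersHealthy := true
		for j, other := range instances {
			if j != i && !other.Healthy {
				othersHealthy = false
				break
			}
		}
		if othersHealthy {
			return candidate, true
		}
	}
	return Instance{}, false
}

func main() {
	instances := []Instance{
		{ID: "task-1", SuicideFailing: true, Healthy: false},
		{ID: "task-2", Healthy: true},
		{ID: "task-3", Healthy: true},
	}
	if victim, ok := pickInstanceToKill(instances); ok {
		fmt.Println("terminating", victim.ID)
	} else {
		fmt.Println("no safe candidate; leaving everything running")
	}
}
```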

@wsorenson
Contributor

Yup, I like that idea.

@markmsmith

I like the safety of checking that the other members are healthy, but it would be helpful if it were configurable.
For our use case, we're running a long-running Spark Streaming job on Mesos through Singularity. Unfortunately, if the Spark task fails twice for a given executor, that host is blacklisted. Once all the hosts are blacklisted, our job ends up with no resources and hangs. One workaround (until this is resolved) would be to use a health check on the driver to signal that the Singularity job is in a bad state and should be killed and restarted when this happens. However, from the Singularity perspective there's only one member (the driver), so we wouldn't need the check for other members being up before it restarts.

@relistan
Contributor

relistan commented Jan 18, 2019

At Nitro and now also at my current employer we've run Singularity with Sidecar doing health checks and our custom executor managing the containers. We fail containers from the executor rather than from Singularity when the health check fails. This has worked great in production for 2+ years.

How it works is that Sidecar continually health checks the services, and the executor watches the Sidecar state for the task's health. When Sidecar has failed the checks for a configurable amount of time, we take down the task from the executor and notify Mesos that it failed. We insist on lightweight health checks that can respond without much computation. Where a health check is not lightweight, the service in question runs it on a timed loop internally and the health check endpoint just reports the latest state.

The executor is a Docker-only implementation, and we run one executor per task. They are written in Go and take less than 16 MB of RAM each, so the overhead is minimal.
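
A rough sketch of that watchdog loop (not the actual executor; the Sidecar state URL and JSON field are assumptions for the example):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// sidecarStatus is a simplified, assumed shape of the health state Sidecar
// reports for a single service instance.
type sidecarStatus struct {
	Healthy bool `json:"healthy"`
}

// watchTask polls Sidecar's state for one task and exits the executor
// (which would then report TASK_FAILED to Mesos) once the task has been
// unhealthy for longer than the configured grace period.
func watchTask(stateURL string, interval, failAfter time.Duration) {
	var unhealthySince time.Time
	client := &http.Client{Timeout: 2 * time.Second}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		healthy := false
		if resp, err := client.Get(stateURL); err == nil {
			var st sidecarStatus
			if json.NewDecoder(resp.Body).Decode(&st) == nil {
				healthy = st.Healthy
			}
			resp.Body.Close()
		}

		switch {
		case healthy:
			unhealthySince = time.Time{} // reset the clock on recovery
		case unhealthySince.IsZero():
			unhealthySince = time.Now()
		case time.Since(unhealthySince) > failAfter:
			fmt.Println("task unhealthy too long, failing it")
			// A real executor would stop the Docker container and send a
			// TASK_FAILED status update to Mesos here.
			os.Exit(1)
		}
	}
}

func main() {
	// Hypothetical Sidecar state endpoint for this task.
	watchTask("http://localhost:7777/api/state.json", 5*time.Second, 30*time.Second)
}
```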
