
Continuous health checks #746

Open · chornyi opened this issue Nov 4, 2015 · 6 comments

@chornyi commented Nov 4, 2015

Is there an option for continuous health checks (not just right after a deployment), or are you considering this feature? It could be a good implementation of the service watchdog pattern. Aside from the potential for cascading failures, are there reasons not to do this?

@stevenschlansker
Contributor

Note that there is some movement on the Mesos side here:
https://issues.apache.org/jira/browse/MESOS-2533

@wsorenson
Contributor

We have considered this feature. The only reason we haven't done it is because of cascading failures and the complexity of handling them. You really need a lightweight health check, or Singularity would have to be very smart about how it handles failures.

Typically, health checks at deploy time are fairly heavyweight (to catch configuration issues), so we'd probably introduce the concept of two separate health checks.
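
For illustration only, here's a minimal sketch of what two separate checks could look like on the service side. The endpoint paths, downstream URL, and port are made up for the example and are not part of Singularity's API:

```go
package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

// Heavy deploy-time check: verifies the instance can actually reach a
// downstream dependency (hypothetical URL), so configuration problems
// surface right after a deploy.
func deployCheck(dependencyURL string) http.HandlerFunc {
	client := &http.Client{Timeout: 5 * time.Second}
	return func(w http.ResponseWriter, r *http.Request) {
		resp, err := client.Get(dependencyURL)
		if err != nil {
			http.Error(w, "dependency unreachable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		resp.Body.Close()
		w.WriteHeader(http.StatusOK)
	}
}

// Light continuous check: only reports precomputed in-memory state,
// so a watchdog can poll it frequently without adding load.
func liveCheck(healthy *atomic.Bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !healthy.Load() {
			http.Error(w, "unhealthy", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	var healthy atomic.Bool
	healthy.Store(true)

	http.HandleFunc("/healthcheck/deploy", deployCheck("http://downstream.internal/ping"))
	http.HandleFunc("/healthcheck/live", liveCheck(&healthy))
	http.ListenAndServe(":8080", nil)
}
```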

@stevenschlansker
Contributor

Yeah, the exact same concern is why I have been deferring this feature on our side as well. It's really easy to write a health check that will bring down the entire platform as soon as one thing goes wrong.

My favorite idea right now is to have a separate health check (I'm calling it the "suicide" endpoint, but that might be too morbid). If it errors, the instance is a candidate for termination, but only if all of the request's other members are healthy. This means we would only ever kill at most one instance at a time, and if it does not recover, we are stuck until an administrator intervenes.
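
A rough sketch of that termination rule (the types and function names here are hypothetical, not existing Singularity code):

```go
package main

import "fmt"

// Instance is a hypothetical view of one running task of a request.
type Instance struct {
	ID             string
	SuicideFailing bool // the "suicide" endpoint is returning errors
	Healthy        bool // the regular health check is passing
}

// pickInstanceToKill applies the proposed rule: an instance is only a
// candidate for termination if its suicide check is failing AND every other
// member of the request is healthy. That guarantees at most one instance is
// killed at a time; if it never recovers, nothing else gets killed and an
// administrator has to step in.
func pickInstanceToKill(instances []Instance) (Instance, bool) {
	for i, candidate := range instances {
		if !candidate.SuicideFailing {
			continue
		}
		othersHealthy := true
		for j, other := range instances {
			if j != i && !other.Healthy {
				othersHealthy = false
				break
			}
		}
		if othersHealthy {
			return candidate, true
		}
	}
	return Instance{}, false
}

func main() {
	instances := []Instance{
		{ID: "task-1", SuicideFailing: true, Healthy: false},
		{ID: "task-2", Healthy: true},
		{ID: "task-3", Healthy: true},
	}
	if victim, ok := pickInstanceToKill(instances); ok {
		fmt.Println("terminating", victim.ID)
	} else {
		fmt.Println("no safe candidate; leaving everything running")
	}
}
```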

@wsorenson
Contributor

Yup, I like that idea.

@markmsmith

I like the safety of checking that the other members are healthy, but it would be helpful if it were configurable.
For our use case, we're running a long-running Spark Streaming job on Mesos through Singularity. Unfortunately, if the Spark task fails twice for a given executor, that host is blacklisted. Once all the hosts are blacklisted, our job ends up with no resources and hangs. One workaround (until this is resolved) would be to use a health check on the driver to signal that the Singularity job is in a bad state and should be killed and restarted when this happens. However, from the Singularity perspective there's only one member (the driver), so we wouldn't need the check for other members being up before it restarts.

@relistan
Contributor

relistan commented Jan 18, 2019

At Nitro and now also at my current employer we've run Singularity with Sidecar doing health checks and our custom executor managing the containers. We fail containers from the executor rather than from Singularity when the health check fails. This has worked great in production for 2+ years.

How it works is that Sidecar continually health checks the services, and the executor watches the Sidecar state for the task's health. When Sidecar has failed the checks for a configurable amount of time, we take down the task from the executor and notify Mesos that it failed. We insist on lightweight health checks that can respond without much computation. Where a health check is not lightweight, the service in question runs it on a timed loop internally and the health check endpoint just reports the latest state.

The executor is a Docker-only implementation, and we run one executor per task. They are written in Go and take less than 16 MB of RAM each, so the overhead is minimal.
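
A rough sketch of that watchdog loop (not the actual executor; the Sidecar state URL and JSON field are assumptions for the example):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// sidecarStatus is a simplified, assumed shape of the health state Sidecar
// reports for a single service instance.
type sidecarStatus struct {
	Healthy bool `json:"healthy"`
}

// watchTask polls Sidecar's state for one task and exits the executor
// (which would then report TASK_FAILED to Mesos) once the task has been
// unhealthy for longer than the configured grace period.
func watchTask(stateURL string, interval, failAfter time.Duration) {
	var unhealthySince time.Time
	client := &http.Client{Timeout: 2 * time.Second}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		healthy := false
		if resp, err := client.Get(stateURL); err == nil {
			var st sidecarStatus
			if json.NewDecoder(resp.Body).Decode(&st) == nil {
				healthy = st.Healthy
			}
			resp.Body.Close()
		}

		switch {
		case healthy:
			unhealthySince = time.Time{} // reset the clock on recovery
		case unhealthySince.IsZero():
			unhealthySince = time.Now()
		case time.Since(unhealthySince) > failAfter:
			fmt.Println("task unhealthy too long, failing it")
			// A real executor would stop the Docker container and send a
			// TASK_FAILED status update to Mesos here.
			os.Exit(1)
		}
	}
}

func main() {
	// Hypothetical Sidecar state endpoint for this task.
	watchTask("http://localhost:7777/api/state.json", 5*time.Second, 30*time.Second)
}
```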
