Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smarter Healthchecks #1306

Merged
merged 37 commits into from Dec 14, 2016
Merged

Smarter Healthchecks #1306

merged 37 commits into from Dec 14, 2016

Conversation

@ssalinas
Copy link
Member

ssalinas commented Sep 27, 2016

This PR attempts to do a few things regarding our existing healthcheck setup

  • Split out the configuration into its own object so we stop cluttering the SingularityDeploy
  • Add the concept of a startup period (identified primarily by 'connection refused' responses vs actuall http responses)
    • Add a separate startupDelay (time before running any checks) and startupInterval (how often to check during this time)
    • Add a startupTimeout - time limit for app to actually respond to a check
  • Make the healthcheckTotalTimeoutSeconds a calculated field based on the others present rather than an override
  • Ability to specify a specific port number for a healthcheck (/fixes #1287)

Still TODO:

  • Add validation on maxRetries * intervalSeconds (as a replacement for having a separate static max total timeout)
  • Finish wiring up startupTimeout in deploy and new task checkers
  • Smarter reactions based on status code
  • Tests specific to startup period and status codes
  • Clean up excess startup healthchecks since they provide no extra information
  • Ensure new task checks run at an appropritate interval
  • Update deploy form
  • Update healthcheck notifications and messages in UI
  • Fix healthcheck/deploy checker timeout race condition

/cc @tpetr @darcatron

ssalinas added 2 commits Sep 27, 2016
@ssalinas ssalinas modified the milestone: 0.12.0 Sep 27, 2016
return uri;
}

@ApiModelProperty(required = false, value="Perform healthcheck on this dynamically allocated port (e.g. 0 for first port), defaults to first port")

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

nit: required = false => required=false

return portIndex;
}

@ApiModelProperty(required = false, value="Perform healthcheck on this port (portIndex cannot also be used when using this setting)")

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

nit: required = false => required=false

Objects.equal(startupDelaySeconds, that.startupDelaySeconds) &&
Objects.equal(intervalSeconds, that.intervalSeconds) &&
Objects.equal(responseTimeoutSeconds, that.responseTimeoutSeconds) &&
Objects.equal(maxRetries, that.maxRetries);

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

should this have startupInvervalSeconds as well?

private final Optional<HealthcheckProtocol> protocol;
private final Optional<Integer> startupTimeoutSeconds;
private final Optional<Integer> startupDelaySeconds;
private final Optional<Integer> startupIntervalSeconds;

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

maybe call it startupFrequencySeconds?

.add("intervalSeconds", intervalSeconds)
.add("responseTimeoutSeconds", responseTimeoutSeconds)
.add("maxRetries", maxRetries)
.toString();

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

should this have startupInvervalSeconds as well?

@@ -446,6 +472,11 @@ public String getId() {
return healthcheckMaxTotalTimeoutSeconds;
}

@ApiModelProperty(required = false, value="HTTP Healthcheck settings")

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

nit: required = false => required=false

checkBadRequest(deploy.getResources().get().getNumPorts() > deploy.getHealthcheck().get().getPortIndex().get(), String
.format("Must request %s ports for healthcheckPortIndex %s, only requested %s", deploy.getHealthcheck().get().getPortIndex().get() + 1, deploy.getHealthcheck().get().getPortIndex().get(),
deploy.getResources().get().getNumPorts()));
}

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

maybe assign HealthcheckOptions healthcheck = deploy.getHealthcheck().get() after verifying isPresent() so the later uses are a little cleaner?

if (throwable.isPresent() && throwable.get() instanceof ConnectException) {
inStartup = true;
}

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

boolean inStartup = throwable.isPresent() && throwable.get() instanceof ConnectException

private Optional<Integer> healthcheckMaxRetries;
@Deprecated
private Optional<Long> healthcheckMaxTotalTimeoutSeconds;

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

@Deprecated use {@link healthcheck}?

private final Optional<Integer> healthcheckMaxRetries;
@Deprecated
private final Optional<Long> healthcheckMaxTotalTimeoutSeconds;

This comment has been minimized.

@darcatron

darcatron Sep 29, 2016 Contributor

@Deprecated use {@link healthcheck}?

@ssalinas ssalinas modified the milestone: 0.12.0 Oct 17, 2016
@ssalinas ssalinas changed the title (WIP) Smarter Healthchecks Smarter Healthchecks Oct 20, 2016
@ssalinas
Copy link
Member Author

ssalinas commented Oct 20, 2016

First frontend+backend draft of this is done now (save for the race condition fix). The setup of the new healthchecks looks like this:
screen shot 2016-10-20 at 12 23 21 pm

UI messages are also updated to reflect this. Some examples:

  • Failed because the app never responded (never left the startup phase). This was with a 10s startupDelay and 15s startupTimeout set. Note that only a single healthcheck appreas in the table. This is because startup phase healthchecks are cleaned up after task checks run (except if it is the last healthcheck). This way we can check more often during the startup phase without cluttering healthchecks.

    screen shot 2016-10-20 at 10 51 33 am

    screen shot 2016-10-20 at 10 51 41 am

  • Failed due to a status code in the failureStatusCodes list
    screen shot 2016-10-20 at 10 51 26 am
    screen shot 2016-10-20 at 10 51 08 am

  • Failed due to too many retries
    screen shot 2016-10-20 at 10 55 51 am

Open to any feedback on the ui messages, overall setup, or additional features before we merge this into qa or stable

@tpetr @darcatron

@ssalinas ssalinas added the hs_staging label Oct 20, 2016
@ssalinas
Copy link
Member Author

ssalinas commented Dec 14, 2016

Since we recently did a release, going to merge this and I will address any bugs we find in future PRs. Been looking good in hs_stable though.

@ssalinas ssalinas merged commit ea60453 into master Dec 14, 2016
2 checks passed
2 checks passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
@ssalinas ssalinas deleted the smarter_healthchecks branch Dec 14, 2016
@ssalinas ssalinas removed hs_qa labels Dec 14, 2016
@nyarly
Copy link

nyarly commented May 9, 2017

@ssalinas I don't understand this new behavior at all. Is the reasoning documented somewhere?

Specifically, from what I can tell, once we leave startup, it looks like we poll the healthcheck URL looking for failure until the HC timeout elapses. What's the reasoning behind that? i.e. after one successful healthcheck, why not transition to "started"?

@stevenschlansker
Copy link
Contributor

stevenschlansker commented May 9, 2017

Hm, my understanding is that's exactly what happens -- any single successful check turns it to running state and no further checks will be performed.

@nyarly
Copy link

nyarly commented May 9, 2017

I skimmed the code and it seemed to be what I was concerned by. In that case, why are there multiple polls then? Because I was pretty sure that a single failure status (as defined by a configuration) marks the service UNHEALTHY, which I assume means Singularity kills the task.

@ssalinas
Copy link
Member Author

ssalinas commented May 9, 2017

There's a few things to divide up here:

  • In terms of the task state, we only start health checks once the task enters TASK_RUNNING. This just means the process is running not that it is healthy yet
  • It is correct that we stop after a single successful check

In the old method of healthchecks there was just a total amount of time/number of checks as the constraint. Now, there is a constraint around startup (i.e. time until we stop getting connection refused responses) and constraint around the actual response (i.e. time until a 'successfull' status code is received)

@ssalinas
Copy link
Member Author

ssalinas commented May 9, 2017

A single failure status does not mark the task as unhealthy unless you set it to do so. Once it has received some sort of valid http response, you then have maxRetries attempts left for the app to return an successful check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

4 participants
You can’t perform that action at this time.