Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Smarter Healthchecks #1306

Merged
merged 37 commits into from
Dec 14, 2016
Merged

Smarter Healthchecks #1306

merged 37 commits into from
Dec 14, 2016

Conversation

ssalinas
Copy link
Member

@ssalinas ssalinas commented Sep 27, 2016

This PR attempts to do a few things regarding our existing healthcheck setup

  • Split out the configuration into its own object so we stop cluttering the SingularityDeploy
  • Add the concept of a startup period (identified primarily by 'connection refused' responses vs actuall http responses)
    • Add a separate startupDelay (time before running any checks) and startupInterval (how often to check during this time)
    • Add a startupTimeout - time limit for app to actually respond to a check
  • Make the healthcheckTotalTimeoutSeconds a calculated field based on the others present rather than an override
  • Ability to specify a specific port number for a healthcheck (/fixes Singularity should support healthchecks on statically defined service ports #1287)

Still TODO:

  • Add validation on maxRetries * intervalSeconds (as a replacement for having a separate static max total timeout)
  • Finish wiring up startupTimeout in deploy and new task checkers
  • Smarter reactions based on status code
  • Tests specific to startup period and status codes
  • Clean up excess startup healthchecks since they provide no extra information
  • Ensure new task checks run at an appropritate interval
  • Update deploy form
  • Update healthcheck notifications and messages in UI
  • Fix healthcheck/deploy checker timeout race condition

/cc @tpetr @darcatron

@ssalinas ssalinas modified the milestone: 0.12.0 Sep 27, 2016
return uri;
}

@ApiModelProperty(required = false, value="Perform healthcheck on this dynamically allocated port (e.g. 0 for first port), defaults to first port")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: required = false => required=false

return portIndex;
}

@ApiModelProperty(required = false, value="Perform healthcheck on this port (portIndex cannot also be used when using this setting)")
Copy link
Contributor

@matush-v matush-v Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: required = false => required=false

Objects.equal(startupDelaySeconds, that.startupDelaySeconds) &&
Objects.equal(intervalSeconds, that.intervalSeconds) &&
Objects.equal(responseTimeoutSeconds, that.responseTimeoutSeconds) &&
Objects.equal(maxRetries, that.maxRetries);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this have startupInvervalSeconds as well?

private final Optional<HealthcheckProtocol> protocol;
private final Optional<Integer> startupTimeoutSeconds;
private final Optional<Integer> startupDelaySeconds;
private final Optional<Integer> startupIntervalSeconds;
Copy link
Contributor

@matush-v matush-v Sep 29, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call it startupFrequencySeconds?

.add("intervalSeconds", intervalSeconds)
.add("responseTimeoutSeconds", responseTimeoutSeconds)
.add("maxRetries", maxRetries)
.toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this have startupInvervalSeconds as well?

@@ -446,6 +472,11 @@ public String getId() {
return healthcheckMaxTotalTimeoutSeconds;
}

@ApiModelProperty(required = false, value="HTTP Healthcheck settings")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: required = false => required=false

checkBadRequest(deploy.getResources().get().getNumPorts() > deploy.getHealthcheck().get().getPortIndex().get(), String
.format("Must request %s ports for healthcheckPortIndex %s, only requested %s", deploy.getHealthcheck().get().getPortIndex().get() + 1, deploy.getHealthcheck().get().getPortIndex().get(),
deploy.getResources().get().getNumPorts()));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe assign HealthcheckOptions healthcheck = deploy.getHealthcheck().get() after verifying isPresent() so the later uses are a little cleaner?

if (throwable.isPresent() && throwable.get() instanceof ConnectException) {
inStartup = true;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boolean inStartup = throwable.isPresent() && throwable.get() instanceof ConnectException

private Optional<Integer> healthcheckMaxRetries;
@Deprecated
private Optional<Long> healthcheckMaxTotalTimeoutSeconds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Deprecated use {@link healthcheck}?

private final Optional<Integer> healthcheckMaxRetries;
@Deprecated
private final Optional<Long> healthcheckMaxTotalTimeoutSeconds;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Deprecated use {@link healthcheck}?

@ssalinas ssalinas modified the milestone: 0.12.0 Oct 17, 2016
@ssalinas ssalinas changed the title (WIP) Smarter Healthchecks Smarter Healthchecks Oct 20, 2016
@ssalinas
Copy link
Member Author

ssalinas commented Oct 20, 2016

First frontend+backend draft of this is done now (save for the race condition fix). The setup of the new healthchecks looks like this:
screen shot 2016-10-20 at 12 23 21 pm

UI messages are also updated to reflect this. Some examples:

  • Failed because the app never responded (never left the startup phase). This was with a 10s startupDelay and 15s startupTimeout set. Note that only a single healthcheck appreas in the table. This is because startup phase healthchecks are cleaned up after task checks run (except if it is the last healthcheck). This way we can check more often during the startup phase without cluttering healthchecks.

    screen shot 2016-10-20 at 10 51 33 am

    screen shot 2016-10-20 at 10 51 41 am

  • Failed due to a status code in the failureStatusCodes list
    screen shot 2016-10-20 at 10 51 26 am
    screen shot 2016-10-20 at 10 51 08 am

  • Failed due to too many retries
    screen shot 2016-10-20 at 10 55 51 am

Open to any feedback on the ui messages, overall setup, or additional features before we merge this into qa or stable

@tpetr @darcatron

@ssalinas
Copy link
Member Author

Since we recently did a release, going to merge this and I will address any bugs we find in future PRs. Been looking good in hs_stable though.

@ssalinas ssalinas merged commit ea60453 into master Dec 14, 2016
@ssalinas ssalinas deleted the smarter_healthchecks branch December 14, 2016 17:51
@nyarly
Copy link

nyarly commented May 9, 2017

@ssalinas I don't understand this new behavior at all. Is the reasoning documented somewhere?

Specifically, from what I can tell, once we leave startup, it looks like we poll the healthcheck URL looking for failure until the HC timeout elapses. What's the reasoning behind that? i.e. after one successful healthcheck, why not transition to "started"?

@stevenschlansker
Copy link
Contributor

Hm, my understanding is that's exactly what happens -- any single successful check turns it to running state and no further checks will be performed.

@nyarly
Copy link

nyarly commented May 9, 2017

I skimmed the code and it seemed to be what I was concerned by. In that case, why are there multiple polls then? Because I was pretty sure that a single failure status (as defined by a configuration) marks the service UNHEALTHY, which I assume means Singularity kills the task.

@ssalinas
Copy link
Member Author

ssalinas commented May 9, 2017

There's a few things to divide up here:

  • In terms of the task state, we only start health checks once the task enters TASK_RUNNING. This just means the process is running not that it is healthy yet
  • It is correct that we stop after a single successful check

In the old method of healthchecks there was just a total amount of time/number of checks as the constraint. Now, there is a constraint around startup (i.e. time until we stop getting connection refused responses) and constraint around the actual response (i.e. time until a 'successfull' status code is received)

@ssalinas
Copy link
Member Author

ssalinas commented May 9, 2017

A single failure status does not mark the task as unhealthy unless you set it to do so. Once it has received some sort of valid http response, you then have maxRetries attempts left for the app to return an successful check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Singularity should support healthchecks on statically defined service ports
4 participants