Smarter Healthchecks #1306

Merged
merged 37 commits into from Dec 14, 2016

@ssalinas
Member

ssalinas commented Sep 27, 2016

This PR attempts to do a few things regarding our existing healthcheck setup

  • Split out the configuration into its own object so we stop cluttering the SingularityDeploy
  • Add the concept of a startup period (identified primarily by 'connection refused' responses vs actual HTTP responses); a rough sketch of this distinction follows below this list
    • Add a separate startupDelay (time before running any checks) and startupInterval (how often to check during this time)
    • Add a startupTimeout - time limit for the app to actually respond to a check
  • Make the healthcheckTotalTimeoutSeconds a calculated field based on the others present rather than an override
  • Ability to specify a specific port number for a healthcheck (fixes #1287)
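
To make the startup-period idea concrete, here is a minimal, hypothetical probe (not the PR's actual checker code) showing the distinction it relies on: a refused connection means nothing is listening on the port yet, while any HTTP response, even an error status, means the startup phase is over. The URL and timeout values here are made up.

import java.io.IOException;
import java.net.ConnectException;
import java.net.HttpURLConnection;
import java.net.URL;

public class StartupProbe {
    enum Result { STILL_STARTING, RESPONDED }

    // Probe the healthcheck URL once: "connection refused" means the app is still
    // starting up; any HTTP status code means the startup phase is over and the
    // normal healthcheck rules apply.
    static Result probe(String healthcheckUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(healthcheckUrl).openConnection();
        conn.setConnectTimeout(2000);
        conn.setReadTimeout(2000);
        try {
            int status = conn.getResponseCode(); // any status, even 5xx, ends the startup phase
            System.out.println("Got HTTP " + status);
            return Result.RESPONDED;
        } catch (ConnectException ce) {
            return Result.STILL_STARTING; // connection refused: keep checking at the startup interval
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probe("http://localhost:8080/healthcheck")); // hypothetical URL
    }
}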

Still TODO:

  • Add validation on maxRetries * intervalSeconds (as a replacement for having a separate static max total timeout)
  • Finish wiring up startupTimeout in deploy and new task checkers
  • Smarter reactions based on status code
  • Tests specific to startup period and status codes
  • Clean up excess startup healthchecks since they provide no extra information
  • Ensure new task checks run at an appropriate interval
  • Update deploy form
  • Update healthcheck notifications and messages in UI
  • Fix healthcheck/deploy checker timeout race condition

/cc @tpetr @darcatron

ssalinas added some commits Sep 27, 2016

@ssalinas ssalinas modified the milestone: 0.12.0 Sep 27, 2016

Outdated review threads on SingularityBase/src/main/java/com/hubspot/deploy/HealthcheckOptions.java
private final Optional<HealthcheckProtocol> protocol;
private final Optional<Integer> startupTimeoutSeconds;
private final Optional<Integer> startupDelaySeconds;
private final Optional<Integer> startupIntervalSeconds;

@darcatron

darcatron Sep 29, 2016

Contributor

maybe call it startupFrequencySeconds?

Outdated review thread on SingularityBase/src/main/java/com/hubspot/deploy/HealthcheckOptions.java
@@ -446,6 +472,11 @@ public String getId() {
return healthcheckMaxTotalTimeoutSeconds;
}
@ApiModelProperty(required = false, value="HTTP Healthcheck settings")

@darcatron

darcatron Sep 29, 2016

Contributor

nit: required = false => required=false

Outdated review thread on ...ice/src/main/java/com/hubspot/singularity/data/SingularityValidator.java
Outdated review thread on ...om/hubspot/singularity/scheduler/SingularityHealthcheckAsyncHandler.java
Outdated review thread on ...Base/src/main/java/com/hubspot/singularity/SingularityDeployBuilder.java
Outdated review thread on ...ularityBase/src/main/java/com/hubspot/singularity/SingularityDeploy.java

@ssalinas ssalinas modified the milestone: 0.12.0 Oct 17, 2016

@ssalinas ssalinas changed the title from (WIP) Smarter Healthchecks to Smarter Healthchecks Oct 20, 2016

@ssalinas
Member

ssalinas commented Oct 20, 2016

First frontend+backend draft of this is done now (save for the race condition fix). The setup of the new healthchecks looks like this:
screen shot 2016-10-20 at 12 23 21 pm

UI messages are also updated to reflect this. Some examples:

  • Failed because the app never responded (never left the startup phase). This was with a 10s startupDelay and 15s startupTimeout set. Note that only a single healthcheck appears in the table. This is because startup-phase healthchecks are cleaned up after task checks run (except for the last healthcheck). This way we can check more often during the startup phase without cluttering the list of healthchecks.

    screen shot 2016-10-20 at 10 51 33 am

    screen shot 2016-10-20 at 10 51 41 am

  • Failed due to a status code in the failureStatusCodes list
    screen shot 2016-10-20 at 10 51 26 am
    screen shot 2016-10-20 at 10 51 08 am

  • Failed due to too many retries
    screen shot 2016-10-20 at 10 55 51 am

Open to any feedback on the UI messages, overall setup, or additional features before we merge this into qa or stable

@tpetr @darcatron
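
For readers mapping the screenshots above to behavior, here is a rough, hypothetical sketch of how a single check result could be bucketed into those failure modes. The class and method names, and the assumption that a 2xx status counts as success, are mine rather than the PR's actual code.

import java.util.Optional;
import java.util.Set;

public class CheckOutcome {
    enum Disposition { STILL_STARTING, HEALTHY, FAIL_DEPLOY, RETRY }

    // Classify one healthcheck attempt under the scheme described above.
    static Disposition classify(Optional<Integer> statusCode,
                                Set<Integer> failureStatusCodes,
                                int retriesUsed,
                                int maxRetries) {
        if (!statusCode.isPresent()) {
            // No HTTP response at all yet (e.g. connection refused): still in the startup phase
            return Disposition.STILL_STARTING;
        }
        int code = statusCode.get();
        if (code >= 200 && code < 300) { // assuming 2xx means success
            return Disposition.HEALTHY;  // a single successful check ends healthchecking
        }
        if (failureStatusCodes.contains(code)) {
            return Disposition.FAIL_DEPLOY; // the failureStatusCodes case shown above
        }
        // Any other non-success response consumes one of the maxRetries attempts
        return retriesUsed >= maxRetries ? Disposition.FAIL_DEPLOY : Disposition.RETRY;
    }
}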


@ssalinas ssalinas added the hs_staging label Oct 20, 2016

@ssalinas
Member

ssalinas commented Oct 21, 2016

Updated for the last checkbox as well. Fixed the race condition by adding a healthchecks-finished node that the deploy checker can create when it is failing a deploy. It was much more involved and error-prone to have the deploy checker block on an in-progress healthcheck. So, the deploy will still fail at the appropriate timeout, but the user will not be confused by an additional healthcheck being present
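
As an illustration of that marker-node idea only (class names and ZooKeeper path invented, not Singularity's actual implementation), the deploy checker could create a znode when it fails a deploy, and the async healthcheck handler could check for it before persisting a late-finishing check:

import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException.NodeExistsException;

public class HealthchecksFinishedMarker {
    private final CuratorFramework curator;

    public HealthchecksFinishedMarker(CuratorFramework curator) {
        this.curator = curator;
    }

    // Called by the deploy checker when it decides the deploy has failed.
    public void markFinished(String taskId) throws Exception {
        String path = "/healthchecks-finished/" + taskId; // hypothetical path
        try {
            curator.create().creatingParentsIfNeeded().forPath(path);
        } catch (NodeExistsException ignored) {
            // already marked; nothing to do
        }
    }

    // Called by the healthcheck handler before saving a late-arriving result.
    public boolean shouldPersist(String taskId) throws Exception {
        return curator.checkExists().forPath("/healthchecks-finished/" + taskId) == null;
    }
}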


@ssalinas ssalinas modified the milestones: 0.12.0, 0.13.0 Nov 4, 2016

@ssalinas ssalinas added the hs_qa label Nov 16, 2016

@ssalinas
Member

ssalinas commented Dec 14, 2016

Since we recently did a release, going to merge this and I will address any bugs we find in future PRs. Been looking good in hs_stable though.


@ssalinas ssalinas merged commit ea60453 into master Dec 14, 2016

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed

@ssalinas ssalinas deleted the smarter_healthchecks branch Dec 14, 2016

@nyarly

nyarly commented May 9, 2017

@ssalinas I don't understand this new behavior at all. Is the reasoning documented somewhere?

Specifically, from what I can tell, once we leave startup, it looks like we poll the healthcheck URL looking for failure until the HC timeout elapses. What's the reasoning behind that? i.e. after one successful healthcheck, why not transition to "started"?


@stevenschlansker
Contributor

stevenschlansker commented May 9, 2017

Hm, my understanding is that's exactly what happens -- any single successful check turns it to running state and no further checks will be performed.


@nyarly

nyarly commented May 9, 2017

I skimmed the code and it seemed to confirm what I was concerned about. In that case, why are there multiple polls? Because I was pretty sure that a single failure status (as defined by configuration) marks the service UNHEALTHY, which I assume means Singularity kills the task.


@ssalinas
Member

ssalinas commented May 9, 2017

There are a few things to divide up here:

  • In terms of the task state, we only start health checks once the task enters TASK_RUNNING. This just means the process is running, not that it is healthy yet
  • It is correct that we stop after a single successful check

In the old method of healthchecks there was just a total amount of time/number of checks as the constraint. Now, there is a constraint around startup (i.e. the time until we stop getting connection refused responses) and a constraint around the actual response (i.e. the time until a 'successful' status code is received)
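
As a rough worked example of how those two constraints could bound the overall time spent checking (illustrative arithmetic only, not necessarily the exact formula Singularity uses to compute the total timeout):

public class HealthcheckDeadline {
    public static void main(String[] args) {
        int startupDelaySeconds = 10;   // wait before the first check
        int startupTimeoutSeconds = 15; // max time to get any HTTP response at all
        int intervalSeconds = 5;        // spacing between post-startup checks
        int maxRetries = 3;             // further attempts allowed after the first real response

        // Startup phase bound: initial delay plus the time allowed to leave "connection refused"
        int startupBound = startupDelaySeconds + startupTimeoutSeconds;

        // Response phase bound: the initial check plus maxRetries more, spaced by the interval
        int responseBound = (maxRetries + 1) * intervalSeconds;

        System.out.printf("worst-case healthcheck time ~ %d + %d = %d seconds%n",
            startupBound, responseBound, startupBound + responseBound);
    }
}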


@ssalinas
Member

ssalinas commented May 9, 2017

A single failure status does not mark the task as unhealthy unless you set it to do so. Once it has received some sort of valid HTTP response, you then have maxRetries attempts left for the app to return a successful check

