Smarter Healthchecks #1306
return uri;
}

@ApiModelProperty(required = false, value="Perform healthcheck on this dynamically allocated port (e.g. 0 for first port), defaults to first port")
nit: required = false => required=false
return portIndex;
}

@ApiModelProperty(required = false, value="Perform healthcheck on this port (portIndex cannot also be used when using this setting)")
nit: required = false => required=false
Objects.equal(startupDelaySeconds, that.startupDelaySeconds) &&
Objects.equal(intervalSeconds, that.intervalSeconds) &&
Objects.equal(responseTimeoutSeconds, that.responseTimeoutSeconds) &&
Objects.equal(maxRetries, that.maxRetries);
should this have startupIntervalSeconds as well?
private final Optional<HealthcheckProtocol> protocol;
private final Optional<Integer> startupTimeoutSeconds;
private final Optional<Integer> startupDelaySeconds;
private final Optional<Integer> startupIntervalSeconds;
maybe call it startupFrequencySeconds?
.add("intervalSeconds", intervalSeconds)
.add("responseTimeoutSeconds", responseTimeoutSeconds)
.add("maxRetries", maxRetries)
.toString();
should this have startupIntervalSeconds as well?
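For illustration, a minimal sketch of what including startupIntervalSeconds in equals, hashCode and toString could look like. TimingOptions is a hypothetical, trimmed-down class (not the real options class), and it uses java.util.Objects and java.util.Optional rather than the Objects.equal / toString helper calls visible in the diff.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical, trimmed-down options class: only three timing fields are modeled.
final class TimingOptions {
  private final Optional<Integer> startupDelaySeconds;
  private final Optional<Integer> startupIntervalSeconds;
  private final Optional<Integer> intervalSeconds;

  TimingOptions(Optional<Integer> startupDelaySeconds,
                Optional<Integer> startupIntervalSeconds,
                Optional<Integer> intervalSeconds) {
    this.startupDelaySeconds = startupDelaySeconds;
    this.startupIntervalSeconds = startupIntervalSeconds;
    this.intervalSeconds = intervalSeconds;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof TimingOptions)) {
      return false;
    }
    TimingOptions that = (TimingOptions) o;
    // Every field takes part in the comparison, including startupIntervalSeconds.
    return Objects.equals(startupDelaySeconds, that.startupDelaySeconds)
        && Objects.equals(startupIntervalSeconds, that.startupIntervalSeconds)
        && Objects.equals(intervalSeconds, that.intervalSeconds);
  }

  @Override
  public int hashCode() {
    // Hash the same set of fields that equals compares.
    return Objects.hash(startupDelaySeconds, startupIntervalSeconds, intervalSeconds);
  }

  @Override
  public String toString() {
    return "TimingOptions{"
        + "startupDelaySeconds=" + startupDelaySeconds
        + ", startupIntervalSeconds=" + startupIntervalSeconds
        + ", intervalSeconds=" + intervalSeconds
        + '}';
  }
}
```

With the field left out of equals, two option sets that differ only in their startup interval would compare as equal, which is the risk the comment points at.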
@@ -446,6 +472,11 @@ public String getId() {
return healthcheckMaxTotalTimeoutSeconds;
}

@ApiModelProperty(required = false, value="HTTP Healthcheck settings")
nit: required = false => required=false
checkBadRequest(deploy.getResources().get().getNumPorts() > deploy.getHealthcheck().get().getPortIndex().get(), String
.format("Must request %s ports for healthcheckPortIndex %s, only requested %s", deploy.getHealthcheck().get().getPortIndex().get() + 1, deploy.getHealthcheck().get().getPortIndex().get(),
deploy.getResources().get().getNumPorts()));
}
maybe assign HealthcheckOptions healthcheck = deploy.getHealthcheck().get() after verifying isPresent() so the later uses are a little cleaner?
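A rough sketch of that suggestion, under some assumptions: HealthcheckOptions here is a hypothetical stand-in that models only getPortIndex(), checkBadRequest is a local stand-in that throws IllegalArgumentException (the real helper's behavior is not shown in this diff), numPorts is passed in directly instead of being read from the deploy's resources, and java.util.Optional replaces whatever Optional type the project uses.

```java
import java.util.Optional;

// Hypothetical stand-in for the real options class; only getPortIndex() is modeled.
final class HealthcheckOptions {
  private final Optional<Integer> portIndex;

  HealthcheckOptions(Optional<Integer> portIndex) {
    this.portIndex = portIndex;
  }

  Optional<Integer> getPortIndex() {
    return portIndex;
  }
}

final class PortIndexValidation {
  private PortIndexValidation() {}

  // Local stand-in for checkBadRequest: throws when the condition is false.
  static void checkBadRequest(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  static void validatePortIndex(Optional<HealthcheckOptions> maybeHealthcheck, int numPorts) {
    if (!maybeHealthcheck.isPresent()) {
      return;
    }
    // Unwrap once after the isPresent() check, then reuse the local variables.
    HealthcheckOptions healthcheck = maybeHealthcheck.get();
    if (healthcheck.getPortIndex().isPresent()) {
      int portIndex = healthcheck.getPortIndex().get();
      checkBadRequest(numPorts > portIndex,
          String.format("Must request %s ports for healthcheckPortIndex %s, only requested %s",
              portIndex + 1, portIndex, numPorts));
    }
  }
}
```

The error message is unchanged from the diff; only the repeated deploy.getHealthcheck().get().getPortIndex().get() chains are replaced by locals.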
if (throwable.isPresent() && throwable.get() instanceof ConnectException) {
inStartup = true;
}
boolean inStartup = throwable.isPresent() && throwable.get() instanceof ConnectException
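As a self-contained sketch of the single-expression form: StartupDetection and isInStartup are hypothetical names, and java.util.Optional is assumed in place of the project's Optional type.

```java
import java.net.ConnectException;
import java.util.Optional;

final class StartupDetection {
  private StartupDetection() {}

  // A check counts as "still in startup" only when it failed with a ConnectException,
  // i.e. nothing was listening on the port yet.
  static boolean isInStartup(Optional<Throwable> throwable) {
    return throwable.isPresent() && throwable.get() instanceof ConnectException;
  }
}
```

This removes the mutable flag and the if block while keeping the same condition.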
private Optional<Integer> healthcheckMaxRetries;
@Deprecated
private Optional<Long> healthcheckMaxTotalTimeoutSeconds;
@Deprecated use {@link healthcheck}?
private final Optional<Integer> healthcheckMaxRetries;
@Deprecated
private final Optional<Long> healthcheckMaxTotalTimeoutSeconds;
@Deprecated use {@link healthcheck}?
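One way this could look, as a sketch only: the @Deprecated annotation is paired with a Javadoc @deprecated tag, and a link to a member of the same class is written with a leading #. DeprecatedFieldExample and its nested Options type are hypothetical stand-ins, not the real deploy classes.

```java
import java.util.Optional;

// Hypothetical holder: only the two fields relevant to the comment are modeled.
final class DeprecatedFieldExample {
  /** Hypothetical stand-in for the consolidated healthcheck settings. */
  static final class Options {}

  /**
   * @deprecated use {@link #healthcheck} instead
   */
  @Deprecated
  private final Optional<Long> healthcheckMaxTotalTimeoutSeconds;

  private final Optional<Options> healthcheck;

  DeprecatedFieldExample(Optional<Long> healthcheckMaxTotalTimeoutSeconds,
                         Optional<Options> healthcheck) {
    this.healthcheckMaxTotalTimeoutSeconds = healthcheckMaxTotalTimeoutSeconds;
    this.healthcheck = healthcheck;
  }

  /**
   * @deprecated use {@link #getHealthcheck()} instead
   */
  @Deprecated
  Optional<Long> getHealthcheckMaxTotalTimeoutSeconds() {
    return healthcheckMaxTotalTimeoutSeconds;
  }

  Optional<Options> getHealthcheck() {
    return healthcheck;
  }
}
```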
First frontend+backend draft of this is done now (save for the race condition fix). The setup of the new healthchecks looks like this: UI messages are also updated to reflect this. Some examples:
Open to any feedback on the UI messages, overall setup, or additional features before we merge this into qa or stable @tpetr @darcatron
Since we recently did a release, going to merge this and I will address any bugs we find in future PRs. Been looking good in hs_stable though.
@ssalinas I don't understand this new behavior at all. Is the reasoning documented somewhere? Specifically, from what I can tell, once we leave startup, it looks like we poll the healthcheck URL looking for failure until the HC timeout elapses. What's the reasoning behind that? i.e. after one successful healthcheck, why not transition to "started"?
Hm, my understanding is that's exactly what happens -- any single successful check turns it to running state and no further checks will be performed.
I skimmed the code and it seemed to be what I was concerned by. In that case, why are there multiple polls then? Because I was pretty sure that a single failure status (as defined by a configuration) marks the service UNHEALTHY, which I assume means Singularity kills the task.
There's a few things to divide up here:
In the old method of healthchecks there was just a total amount of time/number of checks as the constraint. Now, there is a constraint around startup (i.e. time until we stop getting connection refused responses) and a constraint around the actual response (i.e. time until a 'successful' status code is received)
A single failure status does not mark the task as unhealthy unless you set it to do so. Once it has received some sort of valid http response, you then have |
This PR attempts to do a few things regarding our existing healthcheck setup
- startup period (identified primarily by 'connection refused' responses vs actual http responses)
- startupDelay (time before running any checks) and startupInterval (how often to check during this time)
- startupTimeout - time limit for app to actually respond to a check
- healthcheckTotalTimeoutSeconds a calculated field based on the others present rather than an override

Still TODO:
- maxRetries * intervalSeconds (as a replacement for having a separate static max total timeout)
- startupTimeout in deploy and new task checkers

/cc @tpetr @darcatron
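
To make the two constraints from this thread concrete, here is a rough sketch of how a single check result might be classified. It is not Singularity's implementation: HealthcheckPhases, Verdict, evaluate and all parameter names are hypothetical, and the rules are only my reading of the discussion (connection refused is bounded by the startup timeout, any successful response ends checking, failed responses are bounded by maxRetries).

```java
final class HealthcheckPhases {
  private HealthcheckPhases() {}

  enum Verdict { HEALTHY, RETRY, UNHEALTHY }

  /**
   * Classifies one healthcheck result (hypothetical rules, see the lead-in).
   *
   * @param connectionRefused     true when the check failed with "connection refused"
   * @param successStatus         true when an HTTP response with a success status arrived
   * @param secondsSinceStart     seconds elapsed since the task started
   * @param startupTimeoutSeconds time limit for the app to start answering at all
   * @param failedResponses       failed HTTP responses so far, including this one
   * @param maxRetries            failed responses tolerated before giving up
   */
  static Verdict evaluate(boolean connectionRefused,
                          boolean successStatus,
                          long secondsSinceStart,
                          int startupTimeoutSeconds,
                          int failedResponses,
                          int maxRetries) {
    if (connectionRefused) {
      // Startup phase: nothing is listening yet, so only the startup timeout applies.
      return secondsSinceStart > startupTimeoutSeconds ? Verdict.UNHEALTHY : Verdict.RETRY;
    }
    if (successStatus) {
      // A single successful response is enough; no further checks are needed.
      return Verdict.HEALTHY;
    }
    // Response phase: the app answered with a failure status, so retries are counted.
    return failedResponses > maxRetries ? Verdict.UNHEALTHY : Verdict.RETRY;
  }
}
```

Under these assumed rules, setting maxRetries to 0 makes a single failure status fatal, while a larger value allows polling to continue after a failed response.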