Add support for progress conditions on deployments #130
Conversation
force-pushed from fd59b7d to 4e2ca8a
force-pushed from 4e2ca8a to 8b35997
force-pushed from e955cea to 540b654
cc/ @Shopify/cloudplatform

The following events keep the deployment "progressing": https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
force-pushed from 4e33094 to 6abea04
We should also make sure the deploy output is clear about what happened in the case of a progress-based timeout, whether that's by including something in `@status` or maybe a custom prefix on the `timeout_message`.
@@ -14,6 +14,9 @@ def sync
  @rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
    .slice("replicas", "updatedReplicas", "availableReplicas", "unavailableReplicas"))
  @status = @rollout_data.map { |state_replicas, num| "#{num} #{state_replicas.chop.pluralize(num)}" }.join(", ")
  deployment_data["status"]["conditions"].each do |condition|
    @progress = condition["type"] == 'Progressing' ? condition : nil
This will set `@progress` to nil if the Progressing condition isn't last.
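A minimal plain-Ruby sketch of the ordering bug, using made-up condition data: the `each` version reassigns `@progress` on every iteration, so any condition after `Progressing` clobbers it with nil, while `find` stops at the first match.

```ruby
# Hypothetical condition data; "Available" comes after "Progressing",
# mirroring an ordering the API server can return.
conditions = [
  { "type" => "Progressing", "status" => "True" },
  { "type" => "Available",   "status" => "True" },
]

# Buggy pattern: each iteration reassigns, so the last condition wins
# and we end up with nil when "Progressing" isn't last.
progress = nil
conditions.each do |condition|
  progress = condition["type"] == "Progressing" ? condition : nil
end
buggy_result = progress

# Fixed pattern: find returns the first matching condition and stops.
fixed_result = conditions.find { |condition| condition["type"] == "Progressing" }
```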
  end

def exists?
  @found
end

def timeout
  raw_json, _err, st = kubectl.run("get", type, @name, "--output=json")
Since there isn't a default value for this (i.e. if it isn't set in the template, it won't be set ever), we can look at `@definition["spec"]["progressDeadlineSeconds"]` for this instead of querying the API. As a rule, I try to keep all API queries inside the `sync` method. This reduces the number of overall calls (since e.g. `deploy_succeeded?` might get called a bunch of times in a row), and also makes sure we keep a consistent view of the cluster within a given polling cycle.

EDIT: In the end I don't think we should override this anyway (see my other comment).
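A sketch of the suggestion, assuming `@definition` is the parsed template hash (the variable name here is a local stand-in): `dig` reads the deadline locally and returns nil when the field was never set, so no extra API call is needed.

```ruby
# Illustrative parsed template; only the relevant field is shown.
definition = { "spec" => { "progressDeadlineSeconds" => 120 } }

# dig returns nil if any key along the path is absent, so a template
# without the field naturally falls through to the class-level TIMEOUT.
deadline = definition.dig("spec", "progressDeadlineSeconds")
no_deadline = { "spec" => {} }.dig("spec", "progressDeadlineSeconds")
```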
if @progress
  @progress["status"] == 'False'
else
  super || @latest_rs && @latest_rs.deploy_timed_out?
If they had `progressDeadlineSeconds` in their spec but the status condition is missing, we'd fail the deploy with a very short timeout. That's probably not what we want. I'm thinking what you have right here is correct: if the status condition failed, time out without bothering to look at the actual `progressDeadlineSeconds` number. However, IMO the class-level hard timeout that kicks in if the condition is missing should retain its normal definition.
I am not entirely sure what you mean here? The current logic falls back to the default logic if the progress condition does not exist. Isn't that what we want?
Sorry, I probably should have commented on `def timeout` instead. I was trying to say we should remove that override so that it always uses `TIMEOUT`. Here's an example of the scenario I'm worried about:

1. `spec.progressDeadlineSeconds` is set to a low value, e.g. 30s
2. the deployment has a ton of pods and a conservative rollout strategy, so it takes, say, 5min to finish
3. partway through, for whatever reason, we don't observe the status condition (so `@progress` becomes nil)
4. `deploy_timed_out?` sees `@progress == nil` and uses `super`, which thinks the timeout is 30s
5. deploy times out even though it was still progressing, which is what we're trying to avoid with this feature

Of course landing in the `super` when `progressDeadlineSeconds` was set would be unexpected. But what I'm trying to say is if that happens for some reason, I think the sane behaviour would be to use the global timeout.
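The behaviour being argued for can be sketched like this (the method and argument names are stand-ins, not the real class API): trust the Progressing condition when it is present, and fall back to the unmodified class-level timeout check when it is missing.

```ruby
# Stand-in for the proposed deploy_timed_out? logic: `progress` mirrors
# @progress, `hard_timeout_hit` mirrors the superclass TIMEOUT check.
def timed_out?(progress, hard_timeout_hit)
  if progress
    # Condition exists: trust it, ignoring the class-level TIMEOUT.
    progress["status"] == "False"
  else
    # Condition missing: use the global timeout, NOT progressDeadlineSeconds.
    hard_timeout_hit
  end
end
```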
I'll remove the `def timeout` override 👍
@@ -0,0 +1,29 @@
apiVersion: extensions/v1beta1
Would the test/fixtures/long-running/undying-deployment.yml.erb fixture work for this? It just sleeps, and you can modify it to add the `progressDeadlineSeconds` by using the block form of `deploy_fixtures`.
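A rough sketch of that idea, assuming the block yields parsed fixtures as nested hashes keyed by filename and kind (this structure is illustrative, not the helper's exact contract):

```ruby
# Illustrative fixture structure; the real deploy_fixtures helper may
# shape this differently.
fixtures = {
  "undying-deployment.yml.erb" => {
    "Deployment" => [{ "spec" => { "replicas" => 1 } }],
  },
}

# Mutate the parsed spec before it is deployed, in the spirit of the
# block form of deploy_fixtures.
deployment = fixtures["undying-deployment.yml.erb"]["Deployment"].first
deployment["spec"]["progressDeadlineSeconds"] = 2
```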
So I tried this and was able to add the `progressDeadlineSeconds` into the spec, but adding a `readinessProbe` is not as easy. We need a `readinessProbe` so that the container fails the ready check, causing the timeout.

EDIT: nvm, figured it out
Oh true, I probably should have suggested test/fixtures/invalid/bad_probe.yml instead.
force-pushed from accd623 to d3f168e
force-pushed from 307562c to 7f8c470
@KnVerey made the changes you recommended except the one question. The tests started failing with:

Have you ever seen that before, or have any ideas why this could be happening after I wrote the change?

EDIT: Figured it out, fixing it
@@ -14,6 +14,11 @@ def sync
  @rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
    .slice("replicas", "updatedReplicas", "availableReplicas", "unavailableReplicas"))
  @status = @rollout_data.map { |state_replicas, num| "#{num} #{state_replicas.chop.pluralize(num)}" }.join(", ")
  deployment_data["status"]["conditions"].each do |condition|
I think this could be shortened to:
@progress = deployment_data["status"]["conditions"].find { |condition| condition['type'] == 'Progressing' }
force-pushed from a626642 to ddd5411
cc/ @ibawt

code LGTM, I'll just have to whip this on core and see what happens I think.

cc @dwradcliffe this is how we can avoid bumping the global timeout for deployments/services

@KnVerey can I get a 👍 on the final changes?
What?

- Adds support for `progressDeadlineSeconds` for deployments.
- If a deployment has `spec.progressDeadlineSeconds` defined, the deployment class will use the `Progressing` condition to evaluate timeout instead of the one defined for the class.

What needs to be improved?

- The test is still a work in progress. In order to replicate the case of overriding the timeout and causing a deploy to fail, I tried to use an invalid image, or an image with an invalid entrypoint. In either case, kubernetes-deploy fails the deploy due to detecting unrecoverable states. The current test attempts to deploy 10 nginx containers with a 2-second `progressDeadlineSeconds` and a `minReadySeconds` of 1 second. This causes the deploy to time out.
- Fixed the test by creating a simple busybox container that simply sleeps; the test deployment adds a readiness probe that runs `ls` with a `progressDeadlineSeconds` of 2 seconds, which causes the deployment to time out.
- Open to suggestions if there is a better way to write this test.
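For reference, the fixture described above might look roughly like this (a hedged sketch, not the actual test file; the resource name and probe details are assumptions):

```yaml
# Hypothetical fixture: a sleeping busybox container with an exec
# readiness probe, plus a 2-second progressDeadlineSeconds so the
# rollout exceeds its deadline and the deploy times out.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: slow-progress
spec:
  progressDeadlineSeconds: 2
  replicas: 1
  template:
    metadata:
      labels:
        app: slow-progress
    spec:
      containers:
      - name: app
        image: busybox
        command: ["sleep", "3600"]
        readinessProbe:
          exec:
            command: ["ls"]
```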