Add support for progress conditions on deployments #130

Merged
merged 7 commits into master from progress-deployments
Jul 19, 2017

Conversation

karanthukral
Contributor

@karanthukral karanthukral commented Jul 14, 2017

What?

  • Resolves Enable per-resource timeout overrides via an annotation #63
  • Adds support for progressDeadlineSeconds for deployments
  • If a deployment template defines spec.progressDeadlineSeconds, the deployment class uses the Progressing condition to evaluate timeout instead of the class-level timeout
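The core check can be sketched as follows (method names and structure are illustrative, not the exact implementation in this PR):

```ruby
# Sketch: decide whether a deployment has timed out based on the
# "Progressing" status condition Kubernetes maintains when
# spec.progressDeadlineSeconds is set.
def progress_condition(deployment_status)
  (deployment_status["conditions"] || []).find { |c| c["type"] == "Progressing" }
end

def deploy_timed_out?(deployment_status)
  condition = progress_condition(deployment_status)
  # The deployment controller flips the condition's status to "False"
  # (reason ProgressDeadlineExceeded) once the deadline passes.
  condition ? condition["status"] == "False" : false
end
```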

What needs to be improved?

  • The test is still a work in progress

  • In order to replicate the case of overriding the timeout and causing a deploy to fail, I tried to use an invalid image, or an image with an invalid entrypoint. In either case, kubernetes-deploy fails the deploy due to detecting unrecoverable states.

  • The current test attempts to deploy 10 nginx containers with a progressDeadlineSeconds of 2 seconds and a minReadySeconds of 1 second, which causes the deploy to time out.

  • Fixed the test by creating a simple busybox container that just sleeps; the test deployment adds a readiness probe that runs ls, with a progressDeadlineSeconds of 2 seconds, which causes the deployment to time out.

  • Open to suggestions if there is a better way to write this test

@karanthukral
Contributor Author

cc/ @Shopify/cloudplatform

@karanthukral
Contributor Author

The following events keep the deployment "progressing"

Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
- The Deployment creates a new ReplicaSet.
- The Deployment is scaling up its newest ReplicaSet.
- The Deployment is scaling down its older ReplicaSet(s).
- New Pods become ready or available (ready for at least MinReadySeconds).

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
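For reference, the Progressing condition this relies on appears in the deployment's status.conditions; a rough illustration of its shape (field values are examples, not taken from this PR):

```ruby
# Illustrative shape of a deployment's status as parsed from
# `kubectl get deployment -o json`. Values are examples only.
status = {
  "conditions" => [
    { "type" => "Available", "status" => "True" },
    { "type" => "Progressing", "status" => "False",
      "reason" => "ProgressDeadlineExceeded",
      "message" => "ReplicaSet \"web-12345\" has timed out progressing." }
  ]
}

# Pick out the Progressing condition; its reason/message explain
# why the deadline was exceeded.
progressing = status["conditions"].find { |c| c["type"] == "Progressing" }
```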

Contributor

@KnVerey KnVerey left a comment


We should also make sure the deploy output is clear about what happened in the case of a progress-based timeout, whether that's by including something in @status or maybe a custom prefix on the timeout_message.

@@ -14,6 +14,9 @@ def sync
@rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
.slice("replicas", "updatedReplicas", "availableReplicas", "unavailableReplicas"))
@status = @rollout_data.map { |state_replicas, num| "#{num} #{state_replicas.chop.pluralize(num)}" }.join(", ")
deployment_data["status"]["conditions"].each do |condition|
@progress = condition["type"] == 'Progressing' ? condition : nil
Contributor


This will set @progress to nil if the Progressing condition isn't last.

end

def exists?
@found
end

def timeout
raw_json, _err, st = kubectl.run("get", type, @name, "--output=json")
Contributor


Since there isn't a default value for this (i.e. if it isn't set in the template, it won't be set ever), we can look at @definition["spec"]["progressDeadlineSeconds"] for this instead of querying the API. As a rule, I try to keep all API queries inside the sync method. This reduces the number of overall calls (since e.g. deploy_succeeded? might get called a bunch of times in a row), and also makes sure we keep a consistent view of the cluster within a given polling cycle.

EDIT: In the end I don't think we should override this anyway (see my other comment).
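A sketch of reading the deadline from the already-parsed template hash instead of issuing another kubectl call (method name illustrative; assumes the template is available as a parsed hash like @definition):

```ruby
# Sketch: pull progressDeadlineSeconds from the parsed template hash
# rather than re-querying the API. There is no default value, so
# dig returns nil when the template doesn't set it.
def progress_deadline(definition)
  definition.dig("spec", "progressDeadlineSeconds")
end
```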

if @progress
@progress["status"] == 'False'
else
super || @latest_rs && @latest_rs.deploy_timed_out?
Contributor


If they had progressDeadlineSeconds in their spec but the status condition is missing, we'd fail the deploy with a very short timeout. That's probably not what we want. I'm thinking what you have right here is correct--if the status condition failed, time out without bothering to look at the actual progressDeadlineSeconds number. However, IMO the class-level hard timeout that kicks in if the condition is missing should retain its normal definition.

Contributor Author


I'm not entirely sure what you mean here. The current logic falls back to the default logic if the progress condition doesn't exist. Isn't that what we want?

Contributor


Sorry, I probably should have commented on def timeout instead. I was trying to say we should remove that override so that it always uses TIMEOUT. Here's an example of the scenario I'm worried about:

  • spec.progressDeadlineSeconds is set to a low value, e.g. 30s
  • the deployment has a ton of pods and a conservative rollout strategy, so it takes, say, 5min to finish
  • partway through, for whatever reason, we don't observe the status condition (so @progress becomes nil)
  • deploy_timed_out? sees @progress == nil and uses super, which thinks the timeout is 30s
  • deploy times out even though it was still progressing, which is what we're trying to avoid with this feature

Of course, landing in super when progressDeadlineSeconds was set would be unexpected. But what I'm trying to say is that if that happens for some reason, I think the sane behaviour would be to use the global timeout.
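The fallback being argued for here can be sketched like so (CLASS_TIMEOUT and the method name are illustrative, not the actual kubernetes-deploy internals):

```ruby
# Sketch: when the Progressing condition is observed, trust it;
# when it is missing, fall back to the class-level hard timeout
# rather than the (possibly very short) progressDeadlineSeconds.
CLASS_TIMEOUT = 300 # seconds; illustrative value

def timed_out?(progress_condition, seconds_elapsed)
  if progress_condition
    progress_condition["status"] == "False"
  else
    seconds_elapsed > CLASS_TIMEOUT
  end
end
```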

Contributor Author


I'll remove the def timeout override 👍

@@ -0,0 +1,29 @@
apiVersion: extensions/v1beta1
Contributor


Would the test/fixtures/long-running/undying-deployment.yml.erb fixture work for this? It just sleeps, and you can modify it to add the progressDeadlineSeconds by using the block form of deploy_fixtures.
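A rough sketch of the kind of mutation the block form allows, using a plain hash to stand in for the parsed fixtures the helper yields (structure illustrative, not the helper's real signature):

```ruby
# Sketch: the block form of deploy_fixtures yields parsed fixture
# YAML, which the test can mutate before deploying. Here `fixtures`
# stands in for that parsed data.
fixtures = {
  "undying-deployment.yml.erb" => {
    "Deployment" => [{ "spec" => { "replicas" => 1 } }]
  }
}

deployment = fixtures["undying-deployment.yml.erb"]["Deployment"].first
deployment["spec"]["progressDeadlineSeconds"] = 2
```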

Contributor Author

@karanthukral karanthukral Jul 18, 2017


So I tried this and was able to add progressDeadlineSeconds into the spec, but adding a readinessProbe is not as easy. We need a readinessProbe so that the container fails the ready check, causing the timeout.

EDIT: nvm figured it out

Contributor


Oh true, I probably should have suggested test/fixtures/invalid/bad_probe.yml instead.

@karanthukral karanthukral force-pushed the progress-deployments branch 2 times, most recently from 307562c to 7f8c470 Compare July 18, 2017 16:57
@karanthukral
Contributor Author

karanthukral commented Jul 18, 2017

@KnVerey made the changes you recommended, except for the one open question. The tests started failing with:

KubernetesDeployTest#test_deploy_result_logging_for_mixed_result_deploy:
  RuntimeError: can't modify frozen String

Have you ever seen that before, or have any ideas why this could be happening after I wrote the timeout_message method in deployment.rb?

EDIT: Figured it out, fixing it
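For context, Ruby raises that error when a frozen string (e.g. a literal under `# frozen_string_literal: true`) is mutated in place; building a new string instead avoids it. A minimal reproduction:

```ruby
# Reproduces the error class: in-place mutation of a frozen string.
msg = "Timeout reason: ".freeze

begin
  msg << "ProgressDeadlineExceeded" # raises "can't modify frozen String"
rescue RuntimeError => e
  # FrozenError (a RuntimeError subclass on newer Rubies) lands here
end

# Building a new string instead of mutating the frozen one works:
safe = msg + "ProgressDeadlineExceeded"
```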

@@ -14,6 +14,11 @@ def sync
@rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
.slice("replicas", "updatedReplicas", "availableReplicas", "unavailableReplicas"))
@status = @rollout_data.map { |state_replicas, num| "#{num} #{state_replicas.chop.pluralize(num)}" }.join(", ")
deployment_data["status"]["conditions"].each do |condition|
Contributor


I think this could be shortened to:
@progress = deployment_data["status"]["conditions"].find { |condition| condition['type'] == 'Progressing' }
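The suggested one-liner, sketched with a guard for statuses that don't carry a conditions array yet (method name illustrative):

```ruby
# The suggested refactor: find the Progressing condition in one pass,
# guarding against status.conditions being absent early in a rollout.
def find_progress_condition(deployment_data)
  conditions = deployment_data.dig("status", "conditions") || []
  conditions.find { |condition| condition["type"] == "Progressing" }
end
```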


@karanthukral
Contributor Author

cc/ @ibawt

@ibawt
Contributor

ibawt commented Jul 19, 2017

code LGTM, I'll just have to whip this on core and see what happens I think.

@ibawt
Contributor

ibawt commented Jul 19, 2017

cc @dwradcliffe this is how we can avoid bumping the global timeout for deployments/services

@karanthukral
Contributor Author

@KnVerey can I get a 👍 on the final changes?

@karanthukral karanthukral merged commit 7115434 into master Jul 19, 2017
@karanthukral karanthukral deleted the progress-deployments branch July 19, 2017 17:42
@Teots Teots mentioned this pull request Dec 15, 2017