Add support for progress conditions on deployments #130

Merged
merged 7 commits into master from progress-deployments
Jul 19, 2017

Conversation

karanthukral
Contributor

@karanthukral karanthukral commented Jul 14, 2017

What?

  • Resolves Enable per-resource timeout overrides via an annotation #63
  • Adds support for progressDeadlineSeconds for deployments
  • If a deployment template defines spec.progressDeadlineSeconds, the deployment class uses the Progressing condition to evaluate timeout instead of the class-level timeout
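The core check can be sketched as follows (method names and structure are illustrative, not the exact implementation in this PR):

```ruby
# Sketch: decide whether a deployment has timed out based on the
# "Progressing" status condition Kubernetes maintains when
# spec.progressDeadlineSeconds is set.
def progress_condition(deployment_status)
  (deployment_status["conditions"] || []).find { |c| c["type"] == "Progressing" }
end

def deploy_timed_out?(deployment_status)
  condition = progress_condition(deployment_status)
  # The deployment controller flips the condition's status to "False"
  # (reason ProgressDeadlineExceeded) once the deadline passes.
  condition ? condition["status"] == "False" : false
end
```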

What needs to be improved?

  • The test is still a work in progress

  • In order to replicate the case of overriding the timeout and causing a deploy to fail, I tried to use an invalid image, or an image with an invalid entrypoint. In either case, kubernetes-deploy fails the deploy due to detecting unrecoverable states.

  • The current test attempts to deploy 10 nginx containers with a progressDeadlineSeconds of 2 seconds and a minReadySeconds of 1 second, which causes the deploy to time out.

  • Fixed the test by creating a simple busybox container that just sleeps; the test deployment adds a readiness probe that runs ls, with a progressDeadlineSeconds of 2 seconds, which causes the deployment to time out.

  • Open to suggestions if there is a better way to write this test

@karanthukral
Contributor Author

cc/ @Shopify/cloudplatform

@karanthukral
Contributor Author

The following events keep the deployment "progressing"

Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
- The Deployment creates a new ReplicaSet.
- The Deployment is scaling up its newest ReplicaSet.
- The Deployment is scaling down its older ReplicaSet(s).
- New Pods become ready or available (ready for at least MinReadySeconds).

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/
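For reference, the Progressing condition this relies on appears in the deployment's status.conditions; a rough illustration of its shape (field values are examples, not taken from this PR):

```ruby
# Illustrative shape of a deployment's status as parsed from
# `kubectl get deployment -o json`. Values are examples only.
status = {
  "conditions" => [
    { "type" => "Available", "status" => "True" },
    { "type" => "Progressing", "status" => "False",
      "reason" => "ProgressDeadlineExceeded",
      "message" => "ReplicaSet \"web-12345\" has timed out progressing." }
  ]
}

# Pick out the Progressing condition; its reason/message explain
# why the deadline was exceeded.
progressing = status["conditions"].find { |c| c["type"] == "Progressing" }
```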

Contributor

@KnVerey KnVerey left a comment


We should also make sure the deploy output is clear about what happened in the case of a progress-based timeout, whether that's by including something in @status or maybe a custom prefix on the timeout_message.

@@ -14,6 +14,9 @@ def sync
@rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
.slice("replicas", "updatedReplicas", "availableReplicas", "unavailableReplicas"))
@status = @rollout_data.map { |state_replicas, num| "#{num} #{state_replicas.chop.pluralize(num)}" }.join(", ")
deployment_data["status"]["conditions"].each do |condition|
@progress = condition["type"] == 'Progressing' ? condition : nil
Contributor


This will set @progress to nil if the Progressing condition isn't last.

end

def exists?
@found
end

def timeout
raw_json, _err, st = kubectl.run("get", type, @name, "--output=json")
Contributor


Since there isn't a default value for this (i.e. if it isn't set in the template, it won't be set ever), we can look at @definition["spec"]["progressDeadlineSeconds"] for this instead of querying the API. As a rule, I try to keep all API queries inside the sync method. This reduces the number of overall calls (since e.g. deploy_succeeded? might get called a bunch of times in a row), and also makes sure we keep a consistent view of the cluster within a given polling cycle.

EDIT: In the end I don't think we should override this anyway (see my other comment).
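A sketch of reading the deadline from the already-parsed template hash instead of issuing another kubectl call (method name illustrative; assumes the template is available as a parsed hash like @definition):

```ruby
# Sketch: pull progressDeadlineSeconds from the parsed template hash
# rather than re-querying the API. There is no default value, so
# dig returns nil when the template doesn't set it.
def progress_deadline(definition)
  definition.dig("spec", "progressDeadlineSeconds")
end
```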

if @progress
@progress["status"] == 'False'
else
super || @latest_rs && @latest_rs.deploy_timed_out?
Contributor


If they had progressDeadlineSeconds in their spec but the status condition is missing, we'd fail the deploy with a very short timeout. That's probably not what we want. I'm thinking what you have right here is correct--if the status condition failed, time out without bothering to look at the actual progressDeadlineSeconds number. However, IMO the class-level hard timeout that kicks in if the condition is missing should retain its normal definition.

Contributor Author


I'm not entirely sure what you mean here. The current logic falls back to the default logic if the progress condition doesn't exist. Isn't that what we want?

Contributor


Sorry, I probably should have commented on def timeout instead. I was trying to say we should remove that override so that it always uses TIMEOUT. Here's an example of the scenario I'm worried about:

  • spec.progressDeadlineSeconds is set to a low value, e.g. 30s
  • the deployment has a ton of pods and a conservative rollout strategy, so it takes, say, 5min to finish
  • partway through, for whatever reason, we don't observe the status condition (so @progress becomes nil)
  • deploy_timed_out? sees @progress == nil and uses super, which thinks the timeout is 30s
  • deploy times out even though it was still progressing, which is what we're trying to avoid with this feature

Of course, landing in super when progressDeadlineSeconds was set would be unexpected. But what I'm trying to say is that if that happens for some reason, I think the sane behaviour would be to use the global timeout.
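The fallback being argued for here can be sketched like so (CLASS_TIMEOUT and the method name are illustrative, not the actual kubernetes-deploy internals):

```ruby
# Sketch: when the Progressing condition is observed, trust it;
# when it is missing, fall back to the class-level hard timeout
# rather than the (possibly very short) progressDeadlineSeconds.
CLASS_TIMEOUT = 300 # seconds; illustrative value

def timed_out?(progress_condition, seconds_elapsed)
  if progress_condition
    progress_condition["status"] == "False"
  else
    seconds_elapsed > CLASS_TIMEOUT
  end
end
```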

Contributor Author


I'll remove the def timeout override 👍

@@ -0,0 +1,29 @@
apiVersion: extensions/v1beta1
Contributor


Would the test/fixtures/long-running/undying-deployment.yml.erb fixture work for this? It just sleeps, and you can modify it to add the progressDeadlineSeconds by using the block form of deploy_fixtures.
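A rough sketch of the kind of mutation the block form allows, using a plain hash to stand in for the parsed fixtures the helper yields (structure illustrative, not the helper's real signature):

```ruby
# Sketch: the block form of deploy_fixtures yields parsed fixture
# YAML, which the test can mutate before deploying. Here `fixtures`
# stands in for that parsed data.
fixtures = {
  "undying-deployment.yml.erb" => {
    "Deployment" => [{ "spec" => { "replicas" => 1 } }]
  }
}

deployment = fixtures["undying-deployment.yml.erb"]["Deployment"].first
deployment["spec"]["progressDeadlineSeconds"] = 2
```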

Contributor Author

@karanthukral karanthukral Jul 18, 2017


So I tried this and was able to add progressDeadlineSeconds into the spec, but adding a readinessProbe is not as easy. We need a readinessProbe so that the container fails the ready check, causing the timeout.

EDIT: nvm figured it out

Contributor


Oh true, I probably should have suggested test/fixtures/invalid/bad_probe.yml instead.

@karanthukral karanthukral force-pushed the progress-deployments branch 2 times, most recently from 307562c to 7f8c470 Compare July 18, 2017 16:57
@karanthukral
Contributor Author

karanthukral commented Jul 18, 2017

@KnVerey made the changes you recommended, except for the one open question. The tests started failing with:

KubernetesDeployTest#test_deploy_result_logging_for_mixed_result_deploy:
  RuntimeError: can't modify frozen String

Have you ever seen that before, or have any ideas why this could be happening after I wrote the timeout_message method in deployment.rb?

EDIT: Figured it out, fixing it
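For context, Ruby raises that error when a frozen string (e.g. a literal under `# frozen_string_literal: true`) is mutated in place; building a new string instead avoids it. A minimal reproduction:

```ruby
# Reproduces the error class: in-place mutation of a frozen string.
msg = "Timeout reason: ".freeze

begin
  msg << "ProgressDeadlineExceeded" # raises "can't modify frozen String"
rescue RuntimeError => e
  # FrozenError (a RuntimeError subclass on newer Rubies) lands here
end

# Building a new string instead of mutating the frozen one works:
safe = msg + "ProgressDeadlineExceeded"
```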

@@ -14,6 +14,11 @@ def sync
@rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
.slice("replicas", "updatedReplicas", "availableReplicas", "unavailableReplicas"))
@status = @rollout_data.map { |state_replicas, num| "#{num} #{state_replicas.chop.pluralize(num)}" }.join(", ")
deployment_data["status"]["conditions"].each do |condition|
Contributor


I think this could be shortened to:
@progress = deployment_data["status"]["conditions"].find { |condition| condition['type'] == 'Progressing' }
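The suggested one-liner, sketched with a guard for statuses that don't carry a conditions array yet (method name illustrative):

```ruby
# The suggested refactor: find the Progressing condition in one pass,
# guarding against status.conditions being absent early in a rollout.
def find_progress_condition(deployment_data)
  conditions = deployment_data.dig("status", "conditions") || []
  conditions.find { |condition| condition["type"] == "Progressing" }
end
```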


@karanthukral
Contributor Author

cc/ @ibawt

@ibawt
Contributor

ibawt commented Jul 19, 2017

code LGTM, I'll just have to whip this on core and see what happens I think.

@ibawt
Contributor

ibawt commented Jul 19, 2017

cc @dwradcliffe this is how we can avoid bumping the global timeout for deployments/services

@karanthukral
Contributor Author

@KnVerey can I get a 👍 on the final changes?

@karanthukral karanthukral merged commit 7115434 into master Jul 19, 2017
@karanthukral karanthukral deleted the progress-deployments branch July 19, 2017 17:42
@Teots Teots mentioned this pull request Dec 15, 2017