
Return success before fully deployed #208

Merged: 7 commits into master, Jan 11, 2018

Conversation

@dturn (Contributor) commented Nov 20, 2017

Support a new annotation on deployments: kubernetes-deploy.shopify.io/required-rollout. It changes the behavior of deploy_succeeded?

  • full: maintain current behavior.
  • none: the deploy is successful as soon as the new ReplicaSet is created for the deploy.
  • maxUnavailable: only for RollingUpdate deploys. The deploy is successful once the number of updated & ready replicas is at least spec.replicas minus strategy.rollingUpdate.maxUnavailable (supports both percentages and fixed numbers).

This feature is useful when deployments take an extended period of time to fully roll out, which can happen when the cluster is running near capacity.
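
As a rough sketch of that success criterion (illustrative only; minimum_available_replicas is a hypothetical helper, not the gem's API):

def minimum_available_replicas(desired_replicas, max_unavailable)
  if max_unavailable.to_s.end_with?("%")
    # k8s rounds the maxUnavailable number down when converting a percentage,
    # so the availability requirement effectively rounds up
    (desired_replicas * (100 - max_unavailable.to_i) / 100.0).ceil
  else
    desired_replicas - max_unavailable.to_i
  end
end

minimum_available_replicas(5, "25%") # => 4
minimum_available_replicas(5, 1)     # => 4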

@dturn requested a review from KnVerey on November 20, 2017 21:39
@dturn (Author) commented Nov 21, 2017

cc @dwradcliffe @ibawt, since we were just chatting about this. It's still a WIP; I'm looking for feedback before finishing it up.

@dwradcliffe (Member):

I think it probably needs to be an annotation because each deployment might have different requirements

@KnVerey (Contributor) commented Nov 21, 2017

I agree that an annotation makes more sense. I'd go with a generic name like kubernetes-deploy.shopify.io/required-rollout so we could theoretically expand it to multiple values in the future, e.g. maxUnavailable (this PR), none (could be a more flexible replacement for the --skip-wait feature)

@dturn force-pushed the partial-rollout-success branch 2 times, most recently from 51e3c26 to 508c913, on November 30, 2017 18:09
@@ -74,7 +81,7 @@ def pretty_timeout_type

def deploy_timed_out?
# Do not use the hard timeout if progress deadline is set
@progress_condition.present? ? deploy_failing_to_progress? : super
super #@progress_condition.present? ? deploy_failing_to_progress? : super
Author:

Are we assuming that deployments always use progressDeadlineSeconds and never the timeout in this file?

@@ -74,7 +73,11 @@ def pretty_timeout_type

def deploy_timed_out?
# Do not use the hard timeout if progress deadline is set
@progress_condition.present? ? deploy_failing_to_progress? : super
if @definition['spec']['progressDeadlineSeconds'].present? && @progress_condition.present?
Author:

Looks like k8s merges in a default; this checks whether you manually set one. This might be a behavior change.

Contributor:

Yes, it would be a behaviour change. Why do you think that would be better? IMO getting the smarter timeout logic shouldn't depend on whether you liked the default value the server gives you. (interestingly, we originally intended to look at @definition but made a mistake... and I ended up thinking it was a happy accident 😄 )

Author:

I don't have strong feelings either way, though the behavior being different on first vs. subsequent deploys seems less than ideal and makes me wonder if there are cases the tests aren't capturing.

My bigger issue is that without changing this behavior, I have no idea how to write a reliable test here.

Contributor:

the behavior being different on first vs subsequent deploys

What do you mean by this? Why would that be the case?

My bigger issue is that without changing this behavior, I have no idea how to write a reliable test here.

Maybe I'm missing the obvious, but why does changing this behaviour help? I agree this is tricky to test reliably... you'd almost want a way to prevent the whole RS from coming up... I'll think about it.

def partial_deployment_success
minimum_needed = minimum_unavailable_replicas_to_succeeded

running_rs.size <= 2 &&
Contributor:

I can understand the motivation for this check, but couldn't it make the new feature ineffective for deployments with long-running jobs (like core has) that take a long time to terminate? We actually have a test asserting that terminating pods from older revisions do not impede deploy success. I'm not certain what that state looks like from a RS standpoint, but I'd expect it to be desired = 0 and replicas > 0.

@dturn force-pushed the partial-rollout-success branch 2 times, most recently from 6b7bbd8 to b65aa07, on December 14, 2017 17:23
@dturn (Author) commented Dec 14, 2017

ready for 👀 again.

@latest_rs = find_latest_rs(deployment_data)

@rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
@deployment_data = JSON.parse(raw_json)
Contributor:

Why do we need to make the entire blob available now? I may be missing something, but I don't actually see new information being used outside the sync cycle.

@dturn (Author), Dec 19, 2017:

It's going to have to be an instance var if we want to use it in minimum_unavailable_replicas_to_succeeded (or whatever it gets called). Well, that or an instance var that just captures the value.

Contributor:

Well that or an instance var that just captures the value

Yeah, if it's only one more value that you need, I'd be explicit about capturing that (and potentially resetting it on sync cycles that err) rather than the whole blob.

when 'none'
0
else
raise "#{min_unavailable_rollout} is not a valid value for required-rollout"
Contributor:

Might be helpful to include what the valid options are in this message. (see also my comment about this needing to happen earlier)

@@ -123,5 +126,35 @@ def find_latest_rs(deployment_data)
rs.sync(latest_rs_data)
rs
end

def min_unavailable_rollout
Contributor:

I'd call this something like required_rollout_type, have it always return a value, i.e.:

def required_rollout_type
  strategy = @definition.dig('metadata', 'annotations', 'kubernetes-deploy.shopify.io/required-rollout')
  strategy.presence || REQUIRED_ROLLOUT_TYPES.fetch(:full_availability)
end

The validation of the annotation value also needs to happen earlier, before the deploy might have run. I'd suggest overwriting validate_definition and calling super first, then validating required_rollout_type against a list in a constant if that passed.
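
A sketch of that validate_definition override, under the suggestion's assumptions (constant contents and message wording here are illustrative, not the merged code):

REQUIRED_ROLLOUT_TYPES = %w(full none maxUnavailable)

def validate_definition
  return false unless super
  return true if REQUIRED_ROLLOUT_TYPES.include?(required_rollout_type)
  @validation_error_msg = "'#{required_rollout_type}' is not a valid value for required-rollout. "\
    "Acceptable options: #{REQUIRED_ROLLOUT_TYPES.join(', ')}"
  false
end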

Contributor:

I like Katrina's suggestion above. I think this function as is requires too much knowledge from users (i.e. calling .blank? on its result).

@@ -42,6 +40,7 @@ def fetch_logs

def deploy_succeeded?
return false unless @latest_rs.present?
return partial_deployment_success unless min_unavailable_rollout.blank?
Contributor:

I think the new concept you're introducing should have first-class visibility here. I.e. I'd move the case statement on required_rollout_type (or whatever you name the method) right in here, and have minimum_unavailable_replicas assume it's being asked in the context of the relevant rollout strategy.


case min_unavailable_rollout
when 'maxUnavailable'
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
Contributor:

We need to use the value from sync rather than @definition, possibly with a fallback to @definition (like we do for @progress_deadline), since this value can be server-side defaulted.
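
Sketching that (deployment_data being the hash parsed during the sync cycle; the fallback shape mirrors this suggestion rather than the merged code):

synced_value = deployment_data.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
@max_unavailable = synced_value || @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')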

when 'maxUnavailable'
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
if max_unavailable =~ /%/
(desired * (100 - max_unavailable.to_i) / 100.0).to_i
Contributor:

This won't match k8s behaviour, because k8s rounds up (well, it rounds the maxUnavailable num down) and Float.to_i simply truncates (i.e. rounds down).

e.g. desired = 5, maxUnavailable = 25%, we should return 4 here and we'll return 3
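
A quick check of that difference in Ruby:

desired = 5
max_unavailable = "25%"
(desired * (100 - max_unavailable.to_i) / 100.0).to_i   # => 3 (truncates; too low)
(desired * (100 - max_unavailable.to_i) / 100.0).ceil   # => 4 (matches k8s rounding)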

Contributor:

I know there's not much in the way of infrastructure for this, but we should probably get some unit tests on this logic.

end
end

def partial_deployment_success
Contributor:

nit: returns a boolean so should end in a ?


@latest_rs.desired_replicas >= minimum_needed &&
@latest_rs.rollout_data["readyReplicas"].to_i >= minimum_needed &&
@latest_rs.rollout_data["availableReplicas"].to_i >= minimum_needed
Contributor:

nit: rollout_data seems like a bit of an implementation detail. I'd rather expose ready_replicas and available_replicas as integers, just like desired_replicas.

@@ -0,0 +1,75 @@
apiVersion: extensions/v1beta1
kind: Ingress
Contributor:

I don't think this fixture set really needs an ingress or service (doesn't matter that much since those deploy quickly, but you aren't asserting about them).

Contributor:

Doesn't look like you really use the configmap either.

command:
- sleep
- '4'
image: nginx:alpine
Contributor:

we're trying to use busybox for everything now (with imagePullPolicy: IfNotPresent and command: ["tail", "-f", "/dev/null"]) so that the suite only pulls one tiny image.

spec:
containers:
- name: app
readinessProbe:
Contributor:

Instead of having the probe sleep, would it work to set initialDelaySeconds to 4? I'd be surprised if this probe would ever pass as is, because timeoutSeconds is 1 by default. As a rule I've also been setting periodSeconds to 1 in fixtures, so that we don't get delayed 10s by the probe if something weird happens.

@dturn (Author), Dec 19, 2017:

I'd be surprised if this probe would ever pass as is ...

I don't understand this comment. The tests pass, so are you suggesting I'm not testing what I think I am?

Contributor:

Yeah, I'm confused by how this works. It could be the test, or it could just be me. If the probe should time out after 1 second by default (maybe that's not the default you're getting, somehow?), and your probe sleeps for 4, how can that work?

Author:

Looks like a k8s bug kubernetes/kubernetes#26895

@@ -632,6 +632,18 @@ def test_can_deploy_deployment_with_zero_replicas
])
end

def test_can_deploy_deployment_with_partial_rollout_success
result = deploy_fixtures("slow-cloud", subset: ["configmap-data.yml", "web.yml.erb"])
Contributor:

Since this test is the purpose of this fixture set, you shouldn't need to deploy a subset.

configMapKeyRef:
name: hello-cloud-configmap-data
key: datapoint1
- name: sidecar
Contributor:

This isn't doing anything useful, so I'd remove it.

@@ -632,6 +632,18 @@ def test_can_deploy_deployment_with_zero_replicas
])
end

def test_can_deploy_deployment_with_partial_rollout_success
Contributor:

nit: the point is more that a deployment with the maxUnavailable required-rollout annotation does not wait for full availability (and the copy in restart_task_test should talk about restarts)

assert_deploy_success(result)

assert_logs_match_all(
[%r{Deployment\/web\s+[34] replicas, 3 updatedReplicas, 2 availableReplicas, [12] unavailableReplica}]
Contributor:

could maxSurge: 0 remove the variability here?

Author:

I thought it might, but after trying it, it doesn't; it just changes the combinations.

@stefanmb (Contributor) left a comment:

Approach LGTM - I think Katrina already captured a few areas needing attention.

assert_deploy_success(result)

assert_logs_match_all(
[%r{Deployment\/web\s+[34] replicas, 3 updatedReplicas, 2 availableReplicas, [12] unavailableReplica}]
Author:

This test still isn't that reliable. But I'm skeptical that increasing the sleep or using initialDelaySeconds is going to make it better. The rest of the feedback is addressed.

Contributor:

The reason I think tuning the sleep/delay could help is that the gem only polls for statuses every 3 seconds. So e.g. if we make a scaling operation take, say, 5-7 seconds, we should have a much better chance of succeeding at a consistent point in the scaling, because we'll see every step.

If I'm wrong about that, or it makes the test hella slow, let's use a fancier regex (e.g. Deployment\/web\s+(?<replicas>\d+) replicas?, (?<updatedReplicas>\d+) updatedReplicas?, (?<availableReplicas>\d+) availableReplicas?) to parse the numbers we want out of the logs and make looser assertions (e.g. availableReplicas < replicas, updatedReplicas < replicas). It's shitty, but having multiple flaky tests is worse.
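
For instance, the looser assertion could be sketched like this (logs standing in for the captured deploy output; names are illustrative):

match = logs.match(%r{Deployment/web\s+(?<replicas>\d+) replicas?, (?<updatedReplicas>\d+) updatedReplicas?, (?<availableReplicas>\d+) availableReplicas?})
refute_nil match, "Expected a status line for Deployment/web in the logs"
assert match[:availableReplicas].to_i < match[:replicas].to_i, "Expected partial availability at success time"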

Contributor:

Are these tests reliable now?

Author:

The last two commits have passed all tests on the first pass, so I'd be comfortable calling them reliable.

Contributor:

I just ran this one 100 times locally, and it failed 3 times:

  • Deployment/web 3 replicas, 3 updatedReplicas, 3 availableReplicas
  • Deployment/web 3 replicas, 3 updatedReplicas, 3 availableReplicas
  • Deployment/web 4 replicas, 2 updatedReplicas, 3 availableReplicas, 1 unavailableReplica

That's not horrible, but considering that there are three of them, they could end up noticeably increasing the fail rate overall. I'm thinking we should make sure we have good unit coverage of the % option and remove that copy of the test. Maybe the restart copy too. @stefanmb wdyt? Do you have any ideas for improving the reliability here?

@dturn (Author), Jan 9, 2018:

We could change the restart one to be % based and take out the % one here. That would save us 1 test w/o any loss of coverage.

@KnVerey mentioned this pull request Dec 19, 2017
@dturn (Author) commented Dec 20, 2017

@KnVerey We may need to revisit the annotation not taking a number or %. It seems reasonable that you might not want to lose availability during a deployment, so you'd set maxUnavailable to 0, but still be OK with the deployment 'succeeding' when 75% of the replicas are running the new SHA.

@KnVerey (Contributor) commented Dec 20, 2017

Why would you want to declare it successful when you don't have acceptable capacity in the target revision? I'm worried it's because you just want to be able to twist the tool's arm when your deploys aren't rolling as fast as you'd like, not because that result is actually ok. That doesn't sound like the right thing to do, and I don't want to make the wrong thing easy.

@@ -19,6 +20,8 @@ def sync
conditions = deployment_data.fetch("status", {}).fetch("conditions", [])
@progress_condition = conditions.find { |condition| condition['type'] == 'Progressing' }
@progress_deadline = deployment_data['spec']['progressDeadlineSeconds']
@required_rollout = deployment_data.dig('metadata', 'annotations',
'kubernetes-deploy.shopify.io/required-rollout')
Contributor:

I'd like to keep all the logic for setting this here, and use constants for the annotation name and default value. E.g.:

if @found
  @required_rollout = deployment_data.dig('metadata', 'annotations', REQUIRED_ROLLOUT_ANNOTATION).presence || DEFAULT_REQUIRED_ROLLOUT
else
  @required_rollout = @definition.dig('metadata', 'annotations', REQUIRED_ROLLOUT_ANNOTATION).presence || DEFAULT_REQUIRED_ROLLOUT
end

super

unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout_type)
@validation_error_msg = "#{required_rollout_type} is not valid for required-rollout."\
Contributor:

This needs to accommodate the case where super already put something in this message.


unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout_type)
@validation_error_msg = "#{required_rollout_type} is not valid for required-rollout."\
" Acceptable options: #{REQUIRED_ROLLOUT_TYPES}"
Contributor:

nit: join the list with commas

end

def minimum_replicas_to_succeeded
desired = @desired_replicas
Contributor:

why not use @desired_replicas directly?

def minimum_replicas_to_succeeded
desired = @desired_replicas

case required_rollout_type
Contributor:

I still think this method should be something like min_available_replicas and should not contain this case statement. Maybe I'm missing something, but I don't think the else would ever get hit, and IMO the 'none' option should be reflected directly in the case statement in deploy_succeeded?--that case statement should reflect every valid option, and maybe even raise in its else since it should never be hit. Incidentally the success condition for the none should probably be exists?, not >= 0.
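
A sketch of the shape being suggested for deploy_succeeded? (illustrative; the 'full' condition is borrowed from the diff shown later in this thread and min_available_replicas from later commits, so this is not the merged code verbatim):

def deploy_succeeded?
  return false unless @latest_rs.present?

  case required_rollout
  when 'full'
    @rollout_data["updatedReplicas"].to_i == @desired_replicas &&
      @rollout_data["updatedReplicas"].to_i == @rollout_data["availableReplicas"].to_i
  when 'none'
    # the new ReplicaSet merely has to exist
    @latest_rs.exists?
  when 'maxUnavailable'
    minimum_needed = min_available_replicas
    @latest_rs.desired_replicas >= minimum_needed &&
      @latest_rs.ready_replicas >= minimum_needed &&
      @latest_rs.available_replicas >= minimum_needed
  else
    raise "required-rollout value '#{required_rollout}' should have been rejected by validation"
  end
end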

when 'maxUnavailable'
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
if max_unavailable =~ /%/
(desired * (100 - max_unavailable.to_i) / 100.0).ceil
Contributor:

Codecov is pointing out that this line is not covered by any test. Same with the none option.

end
end

def required_rollout
Author:

I put this back because validate_definition gets called before sync and I wasn't happy extracting the definition value there.

Contributor:

Wait a second, why don't we just use the value we see in the definition all the time? This annotation isn't something that'd be server-side defaulted.

@KnVerey (Contributor) left a comment:

@rollout_data["updatedReplicas"].to_i == @desired_replicas &&
@rollout_data["updatedReplicas"].to_i == @rollout_data["availableReplicas"].to_i
when 'none'
true
Contributor:

This case is untested. I think a test using a bad readiness probe + the annotation could work nicely.

Author:

I added a unit test for this

unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout)
@validation_error_msg ||= ''
@validation_error_msg += "#{required_rollout} is not valid for required-rollout."\
" Acceptable options: #{REQUIRED_ROLLOUT_TYPES.join(',')}"
Contributor:

Let's use the same message here as you did on L64, where you include the complete annotation name using the constant. We also still need a test for this.

Contributor:

Btw it looks like this would tack the error messages (when super added one) together without a space. In https://github.com/Shopify/kubernetes-deploy/pull/232/files#diff-d2654750695bbb60cf39d64df9b868eeR72, I'm introducing an errors array, which gets around that problem.


def initialize(namespace:, context:, definition:, logger:, parent: nil, deploy_started_at: nil)
@parent = parent
@deploy_started_at = deploy_started_at
@rollout_data = { "replicas" => 0 }
@desired_replicas = -1
@ready_replicas = -1
@available_replicas = -1
Contributor:

The same thing should technically be done in the reset section at L35 too.

Author:

done

@@ -43,10 +46,24 @@ def fetch_logs
def deploy_succeeded?
Contributor:

I'd really like to start unit testing this now that it's pretty complex. It could reduce the number of possibly-flakey integration tests we'd need (i.e. separate %/# tests would be unnecessary).

Author:

Added unit tests

@@ -123,5 +156,19 @@ def find_latest_rs(deployment_data)
rs.sync(latest_rs_data)
rs
end

def min_available_replicas
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
Contributor:

This one could very well be server-side defaulted. IIRC you originally cached the entire resource blob from the sync to get this, and I asked you to save the specific value instead. Why are we not using it at all now? Am I forgetting something from earlier review passes?

max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
if max_unavailable =~ /%/
(@desired_replicas * (100 - max_unavailable.to_i) / 100.0).ceil
else
Contributor:

So if max_unavailable is nil, we'll effectively return @desired_replicas. If that's on purpose, I think it might be worth a comment. Because of the possibility of defaulting (see my other comment), we can't really validate this at template parsing time. However, it could be worth having a separate case that prints a warning (just once, not on every polling cycle).

I think the main time this would happen is for deployments with strategy: Recreate, which I hadn't been considering previously. The only time I'm aware of us using them is single-replica deployments with PVCs, but there's nothing preventing others from using them with larger deployments. That would be a compelling use case for having the %/# option on the annotation as you suggested. That said, I still think maxUnavailable is the correct option for rollingUpdate deployments, and %/# options could be added in future PRs instead of now.
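
A sketch of the nil case with a one-time warning (the guard variable and message are illustrative, not the merged code; nil.to_i is 0, which is why the original else branch effectively returns @desired_replicas):

def min_available_replicas
  max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
  if max_unavailable.nil?
    # No rollingUpdate params (e.g. strategy: Recreate): fall back to full
    # availability, and warn once instead of on every polling cycle.
    unless @warned_about_missing_max_unavailable
      @logger.warn("No maxUnavailable found; requiring full availability")
      @warned_about_missing_max_unavailable = true
    end
    @desired_replicas
  elsif max_unavailable.to_s.end_with?('%')
    (@desired_replicas * (100 - max_unavailable.to_i) / 100.0).ceil
  else
    @desired_replicas - max_unavailable.to_i
  end
end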

Author:

I've added a validation that kubernetes-deploy.shopify.io/required-rollout = maxUnavailable isn't in use with strategy: Recreate, since that's a bit nonsensical.

Contributor:

Works for me. The strategy isn't necessarily present in the templates, but the default has been RollingUpdate for many versions, so a validation using the definition should catch all cases in practice.

unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout)
errors << "#{REQUIRED_ROLLOUT_ANNOTATION}:#{required_rollout} is invalid "\
"Acceptable options: #{REQUIRED_ROLLOUT_TYPES.join(',')}"
return false
Contributor:

I don't think we want to return here anymore. Otherwise the errors never make it into the message.

README.md Outdated
- `kubernetes-deploy.shopify.io/timeout-override`: Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours.
- _Example values_: 45s / 3m / 1h / PT0.25H
- _Compatibility_: all resource types (Note: `Deployment` timeouts are based on `spec.progressDeadlineSeconds` if present, and that field has a default value as of the `apps/v1beta1` group version. Using this annotation will have no effect on `Deployment`s that time out with "Timeout reason: ProgressDeadlineExceeded".)
- `kubernetes-deploy.shopify.io/required-rollout` (deployment): Modifies the conditions a deployment is
considered to be successful
Contributor:

For consistency with the other one (which I updated based on your suggestions on my PR 😄 ), how about:

kubernetes-deploy.shopify.io/required-rollout: Modifies how much of the rollout needs to finish before the deployment is considered successful.

  • Compatibility: Deployment
  • Valid values:
    • maxUnavailable: ...

README.md Outdated
- `kubernetes-deploy.shopify.io/timeout-override`: Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours.
- _Example values_: 45s / 3m / 1h / PT0.25H
- _Compatibility_: all resource types (Note: `Deployment` timeouts are based on `spec.progressDeadlineSeconds` if present, and that field has a default value as of the `apps/v1beta1` group version. Using this annotation will have no effect on `Deployment`s that time out with "Timeout reason: ProgressDeadlineExceeded".)
- `kubernetes-deploy.shopify.io/required-rollout` (deployment): Modifies the conditions a deployment is
considered to be successful
- `maxUnavailable`: Only for `RollingUpdate deploys`. The deploy is successful when there are less
Contributor:

This seems to be backwards ("less than X updated and ready"). Suggestion:

"The deploy is successful when minimum availability is reached in the new ReplicaSet. In other words, the number of new pods that must be ready is equal to spec.replicas - strategy.RollingUpdate.maxUnavailable (converted from percentages by rounding up, if applicable). This option is only valid for deployments that use the RollingUpdate strategy."

README.md Outdated
considered to be successful
- `maxUnavailable`: Only for `RollingUpdate deploys`. The deploy is successful when there are less
than `strategy.RollingUpdate.maxUnavailable` (supports both the % and fixed number) updated & ready replicas.
- `full`: maintain current behavior
Contributor:

The deploy is successful when all pods in the new ReplicaSet are ready.

README.md Outdated
- `maxUnavailable`: Only for `RollingUpdate deploys`. The deploy is successful when there are less
than `strategy.RollingUpdate.maxUnavailable` (supports both the % and fixed number) updated & ready replicas.
- `full`: maintain current behavior
- `none`: deploy is successful as soon as the new `replicaSet` is created for the deploy
Contributor:

The deploy is successful as soon as the new ReplicaSet is created (no pods required).

@@ -81,6 +100,24 @@ def exists?
@found
end

def validate_definition
valid = super
Contributor:

I don't think we actually need this anymore. We should be able to use @validation_errors.empty? as the return value, since that has to contain the errors from super as well, right?

valid = false
end

if required_rollout == 'maxUnavailable' && @definition.dig('spec', 'strategy') == 'recreate'
Contributor:

Is the strategy case-sensitive? I don't think it is, in which case we should make sure these string comparisons are also case-insensitive.

Author:

Good question, I have no idea. I'll just downcase everything.
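
For reference, a case-insensitive version of that comparison might look like this (note that in the Deployment spec, strategy is a map whose type field carries the name, so the dig path here differs from the diff above):

@definition.dig('spec', 'strategy', 'type').to_s.downcase == 'recreate'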

Commits: "add a restart test and unit tests", "Docs are nice"
@dturn (Author) commented Jan 9, 2018

Feedback is in. Let me know if there is anything else you'd like tweaked.

@KnVerey mentioned this pull request Jan 10, 2018
@KnVerey (Contributor) commented Jan 10, 2018

The new tests already flaked on one of the runs on my branch. Test flakes are already a problem on this repo, so I think it isn't actually ok to merge two new ones that are known to be unreliable. I put some more thought into this, and here is a faster (runs in ~19s instead of ~36s) test that passed 100 times in a row for me locally:

Test:
  def test_deploy_successful_with_partial_availability
      result = deploy_fixtures("slow-cloud", sha: "deploy1")
      assert_deploy_success(result)

      result = deploy_fixtures("slow-cloud", sha: "deploy2") do |fixtures|
        dep = fixtures["web.yml.erb"]["Deployment"].first
        container = dep["spec"]["template"]["spec"]["containers"].first
        container["readinessProbe"] = {
          "exec" => { "command" => %w(sleep 5) },
          "timeoutSeconds" => 6
        }
      end
      assert_deploy_success(result)

      new_pods = kubeclient.get_pods(namespace: @namespace, label_selector: 'name=web,app=slow-cloud,sha=deploy2')
      assert new_pods.length >= 1, "Expected at least one new pod, saw #{new_pods.length}"

      new_ready_pods = new_pods.select do |pod|
        pod.status.phase == "Running" &&
        pod.status.conditions.any? { |condition| condition["type"] == "Ready" && condition["status"] == "True" }
      end
      assert_equal 1, new_ready_pods.length, "Expected exactly one new pod to be ready, saw #{new_ready_pods.length}"
  end
Fixture:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: web
  annotations:
    shipit.shopify.io/restart: "true"
    kubernetes-deploy.shopify.io/required-rollout: maxUnavailable
spec:
  replicas: 2
  selector:
    matchLabels:
      name: web
      app: slow-cloud
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  template:
    metadata:
      labels:
        name: web
        app: slow-cloud
        sha: "<%= current_sha %>"
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: app
        image: busybox
        imagePullPolicy: IfNotPresent
        command: ["tail", "-f", "/dev/null"]
        ports:
        - containerPort: 80
          name: http
Finished in 1859.656208s, 0.0538 runs/s, 0.4302 assertions/s.

100 runs, 800 assertions, 0 failures, 0 errors, 0 skips

Unless you see a reason why the test isn't valid, let's use this instead (some adjustments will need to be made for the restart version).


pods = kubeclient.get_pods(namespace: @namespace, label_selector: 'name=web,app=slow-cloud')
new_pods = pods.select do |pod|
pod.spec.containers.select { |c| c["name"] == "app" && c.env&.find { |n| n.name == "RESTARTED_AT" } }
Contributor:

This will always be truthy, because a select with no results is []. So I'd expect that new_pods would always include all pods. I'd use any? instead.
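
A sketch of the corrected filter using any? (pod, container, and env var names taken from the snippet above):

new_pods = pods.select do |pod|
  pod.spec.containers.any? { |c| c["name"] == "app" && c.env&.find { |n| n.name == "RESTARTED_AT" } }
end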

new_ready_pods = new_pods.select do |pod|
pod.status.phase == "Running" &&
pod.status.conditions.any? { |condition| condition["type"] == "Ready" && condition["status"] == "True" } &&
pod.spec.containers.detect { |c| c["name"] == "app" && c.env&.find { |n| n.name == "RESTARTED_AT" } }
Contributor:

You shouldn't need this if L225 filters properly

- `kubernetes-deploy.shopify.io/timeout-override`: Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours.
- _Example values_: 45s / 3m / 1h / PT0.25H
- _Compatibility_: all resource types (Note: `Deployment` timeouts are based on `spec.progressDeadlineSeconds` if present, and that field has a default value as of the `apps/v1beta1` group version. Using this annotation will have no effect on `Deployment`s that time out with "Timeout reason: ProgressDeadlineExceeded".)
- `kubernetes-deploy.shopify.io/required-rollout`: Modifies how much of the rollout needs to finish
before the deployment is considered successful.
- _Compatibility_: Deployment
Contributor:

We should be consistent about how we reference objects and whether they're in backticks. The new uses both "ReplicaSet" and "replicaSet", and has "deployment" and "Deployment" and "deploy". The existing text above uses "Deployment".

Author:

Good eye. I'm going to standardize on deployment and replicaSet.

@dturn merged commit 967864e into master on Jan 11, 2018
@dturn deleted the partial-rollout-success branch on January 11, 2018 22:35