
Resolve errors for StatefulSet restart with updateStrategy: OnDelete #876

Merged
stefanmb merged 4 commits into master from fix_ownerref_error on Mar 8, 2022

Conversation

@stefanmb (Contributor) commented on Mar 7, 2022

What are you trying to accomplish with this PR?

Resolve some logic errors in StatefulSet restart handling introduced in #840.

  1. If a bare pod is present, the current delete_statefulset_pods will crash. (See below.)
  2. The current success conditions for the restart cannot be met reliably because:
     1. an unresolved bug in Kubernetes prevents currentRevision from matching updateRevision, and
     2. for STS with OnDelete, the currentReplicas field is sometimes nil.

(stacktrace collapsed)

How is this accomplished?

  1. Introduce a new test to trigger the nil error condition.
  2. Guard against the nil dereference.
  3. Remove the readiness condition that is impossible to satisfy for STS with OnDelete.

What could go wrong?
Until the upstream fix lands, we'll still have to deploy with --no-verify-result for STS with OnDelete.

Acknowledgments

Thanks to @epk for helping to debug these issues.

@stefanmb changed the title from "Add new test" to "Resolve errors for StatefulSet restart with updateStrategy: OnDelete" on Mar 8, 2022
@@ -27,11 +30,10 @@ def deploy_succeeded?
"Consider switching to rollingUpdate.")
@success_assumption_warning_shown = true
end
else
success &= desired_replicas == status_data['currentReplicas'].to_i
@stefanmb (Contributor, Author) commented on Mar 8, 2022

This property is not supported for STS with OnDelete:

kubectl rollout status sts/foo                                                                                                                                                                          
error: rollout status is only available for RollingUpdate strategy type

... therefore this condition can never be true in this case and krane will stall.

Edit: It turns out currentReplicas can disappear sometimes, so this check can sometimes cause krane to stall, see below.
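
(For illustration only, a minimal Ruby sketch of the stall mechanism, using made-up status values rather than krane's real objects: when the API omits status.currentReplicas, the hash lookup returns nil, and nil.to_i is 0, so the old equality can never become true.)

  # Illustrative only; assumed values, not real cluster data.
  status_data = { 'readyReplicas' => 2, 'updatedReplicas' => 2 }  # no 'currentReplicas'
  desired_replicas = 2

  # nil.to_i == 0 in Ruby, so this stays false once the field disappears
  # and the restart never reports success.
  desired_replicas == status_data['currentReplicas'].to_i  # => false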

@timothysmith0609 (Contributor) commented:

How was the CI test in restart_task_test passing if this is the case?

@stefanmb (Contributor, Author) commented on Mar 8, 2022

@timothysmith0609 Great question!

I had to look into this a bit. The brief answer is that certain combinations of operations can cause the currentReplicas field to become nil. The simple test as-is does not hit those conditions, so it passes.

I've pushed a version of this test that hangs (in the same way I saw in real life) here: 5c4acd0

You can observe locally that this sequence of events leaves the stateful set in the following state:

  status:
    collisionCount: 0
    currentRevision: stateful-busybox-6b5489d7d6
    observedGeneration: 2
    readyReplicas: 2
    replicas: 2
    updateRevision: stateful-busybox-96676bd54
    updatedReplicas: 2

Notice the absence of currentReplicas.

I did not push this version of the test here because it hangs due to the previously mentioned bug, so the only way to run it is with verify_result: false, which doesn't make for a good test.

If you prefer, I can drop these changes from this PR since IRL I will have to run without verify result anyway for now.

@timothysmith0609 (Contributor) commented on Mar 8, 2022

Thank you for the explanation!

> If you prefer, I can drop these changes from this PR since IRL I will have to run without verify result anyway for now.

May as well keep the changes for now, to avoid repeating the same work later.

@@ -213,7 +213,7 @@ def patch_daemonset_with_restart(record)
  def delete_statefulset_pods(record)
    pods = kubeclient.get_pods(namespace: record.metadata.namespace)
    pods.select! do |pod|
-     pod.metadata.ownerReferences.find { |ref| ref.uid == record.metadata.uid }
+     pod.metadata&.ownerReferences&.find { |ref| ref.uid == record.metadata.uid }
@stefanmb (Contributor, Author) commented:

Fixes the nil errors encountered when bare pods are present in the namespace. At Shopify nearly all app namespaces have such a pod (e.g. a secret seeder).
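
(A sketch of the patched selection, copied from the diff above with explanatory comments added; kubeclient and record come from the surrounding RestartTask code.)

  pods = kubeclient.get_pods(namespace: record.metadata.namespace)
  pods.select! do |pod|
    # Bare (unmanaged) pods carry no ownerReferences, so the lookup returns nil;
    # with &. the block evaluates to nil (falsey) and the pod is filtered out
    # instead of raising NoMethodError.
    pod.metadata&.ownerReferences&.find { |ref| ref.uid == record.metadata.uid }
  end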

@timothysmith0609 (Contributor) commented:

🤦 Thank you for fixing that

@stefanmb added the 🪲 bug label on Mar 8, 2022
@@ -19,6 +19,9 @@ def status
end

def deploy_succeeded?
success = observed_generation == current_generation &&
desired_replicas == status_data['readyReplicas'].to_i &&
status_data['currentRevision'] == status_data['updateRevision']
@stefanmb (Contributor, Author) commented:

This check is currently broken in upstream Kubernetes for OnDelete, but if we remove the check, the method will return immediately, before the pods have actually been restarted successfully.

We could (re)implement some of the controller logic to check individual pods, but I believe running with --no-verify-result until the upstream fix lands is the better option.

Upstream fix: kubernetes/kubernetes#106059
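
(For reference, the retained check from the diff above, annotated with why it cannot pass for OnDelete until the upstream fix; this is a commented copy, not new logic.)

  success = observed_generation == current_generation &&
    desired_replicas == status_data['readyReplicas'].to_i &&
    # With updateStrategy: OnDelete the controller currently never rolls
    # currentRevision forward to updateRevision (kubernetes/kubernetes#106059),
    # so this clause stays false and verification has to be skipped with
    # --no-verify-result for now.
    status_data['currentRevision'] == status_data['updateRevision']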

@@ -36,7 +36,8 @@ def test_restart_by_annotation
end

def test_restart_statefulset_on_delete_restarts_child_pods
-  result = deploy_fixtures("hello-cloud", subset: "stateful_set.yml") do |fixtures|
+  result = deploy_fixtures("hello-cloud", subset: ["configmap-data.yml", "unmanaged-pod-1.yml.erb",
@stefanmb (Contributor, Author) commented:

Deploying a bare pod in the integration test without the fix produces the expected error:

Error:
RestartTaskTest#test_restart_statefulset_on_delete_restarts_child_pods:
NoMethodError: undefined method `find' for nil:NilClass
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:216:in `block in delete_statefulset_pods'
    /opt/hostedtoolcache/Ruby/3.0.3/x64/lib/ruby/3.0.0/delegate.rb:349:in `select!'
    /opt/hostedtoolcache/Ruby/3.0.3/x64/lib/ruby/3.0.0/delegate.rb:349:in `block in delegating_block'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:215:in `delete_statefulset_pods'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:239:in `block in restart_statefulsets!'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:233:in `each'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:233:in `restart_statefulsets!'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:74:in `run!'
    /home/runner/work/krane/krane/lib/krane/restart_task.rb:49:in `run'
    /home/runner/work/krane/krane/test/integration/restart_task_test.rb:49:in `test_restart_statefulset_on_delete_restarts_child_pods'
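
(A standalone Ruby repro of the failure mode in the trace above, using OpenStruct stand-ins rather than real Kubernetes objects; the uid is a made-up placeholder.)

  require 'ostruct'

  # A bare pod has no ownerReferences, so metadata.ownerReferences is nil.
  bare_pod = OpenStruct.new(metadata: OpenStruct.new(ownerReferences: nil))
  owner_uid = "1234"  # placeholder

  # Old code path: raises NoMethodError (undefined method `find' for nil:NilClass).
  # bare_pod.metadata.ownerReferences.find { |ref| ref.uid == owner_uid }

  # Patched code path: returns nil, so the bare pod is simply filtered out.
  bare_pod.metadata&.ownerReferences&.find { |ref| ref.uid == owner_uid }  # => nil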

@stefanmb marked this pull request as ready for review March 8, 2022 03:52
@stefanmb requested a review from a team as a code owner March 8, 2022 03:52
@stefanmb requested review from JamesOwenHall and peiranliushop and removed request for a team March 8, 2022 03:52
@timothysmith0609 (Contributor) left a comment

❤️ Thank you for fixing this, just one small question

@stefanmb merged commit c9e81f7 into master on Mar 8, 2022
@stefanmb deleted the fix_ownerref_error branch on March 8, 2022 at 18:31
@stefanmb mentioned this pull request on Mar 8, 2022
Labels: 🪲 bug (Something isn't working)
3 participants