
Unmanaged pods should fail fast when evicted/preempted/deleted #353

Merged
KnVerey merged 6 commits into master from unmanaged_pod_fail_faster on Oct 24, 2018

Conversation

KnVerey
Contributor

@KnVerey KnVerey commented Oct 16, 2018

Purpose

  • Bugfix: Unmanaged pods should fail fast if they're evicted or preempted. Previously, we made pods' failure condition ignore these two statuses because they very occasionally made a parent's deploy fail incorrectly when the parent was unlucky enough to have all of its children hit one of those conditions (which is recoverable for managed pods, since the parent spawns replacement children). For unmanaged pods, these states are not recoverable and should not be ignored.
  • Improvement: We should not wait around forever if a pod is deleted out of band. If we've deployed it and it doesn't exist, it isn't magically going to appear in the future. This is especially important for the runner. Note that I'm continuing to treat non-404 errors as transient. Related to #347 (Reconsider treatment of "!exists?" in KubernetesResource classes).
  • Improvement: Same as the previous point, but handles the window where the deletion request has been made successfully but the pod isn't actually gone yet (i.e. the deletion timestamp is set). Reviewers, do you agree with this? The containers should at least have received SIGTERM when we see this, but they may still be running. (A rough sketch of how these conditions combine follows this list.)
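
A minimal sketch of how these conditions might combine for unmanaged pods, assuming helper predicates like unmanaged?, failed_phase?, non_transient_error?, terminating? and disappeared? (illustrative names, not necessarily the exact code in this diff):

    # Sketch only: for unmanaged pods, eviction/preemption, termination and
    # disappearance are all terminal; managed pods recover via their parent.
    def deploy_failed?
      return true if permanently_failed?
      return false unless unmanaged?
      terminating? || disappeared?
    end

    def permanently_failed?
      failed_phase? && (unmanaged? || non_transient_error?)
    end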

Things I don't like

  • I think we should try splitting Pod into Pod and ManagedPod. Many of its methods do different things based on whether or not it has a parent. I don't think that is necessary for this PR though.
  • The disappeared? thing should be implemented for all resources, not just this one (#347: Reconsider treatment of "!exists?" in KubernetesResource classes). I do think this is a good start though, and it is a lot more important for pods (because of kubernetes-run).
  • Because we're using kubectl to get these resources (which is really difficult to change at this point), I can't check the actual response status. The "404" check is actually looking for a non-zero exit and an error string.
  • I had to change a ton of the unit tests to accommodate the new raise_on_404 option. We could probably improve the mocking strategy to make this less painful, but I just brute-forced it for now.

cc @Shopify/cloudx

@@ -18,7 +18,7 @@ def test_run_without_verify_result_succeeds_as_soon_as_pod_is_successfully_creat
"Result: SUCCESS",
"Result verification is disabled for this task",
"The following status was observed immediately after pod creation:",
%r{Pod/task-runner-\w+\s+Pending},
%r{Pod/task-runner-\w+\s+(Pending|Running)},
Contributor Author

This change and the one below are to decrease test flakiness. It's not important to these two test cases whether the pod manages to start running in the run window.

@@ -500,7 +498,7 @@ def confirm_namespace_exists
st, err = nil
with_retries(2) do
_, err, st = kubectl.run("get", "namespace", @namespace, use_namespace: false, log_failure: true)
st.success? || err.include?(NOT_FOUND_ERROR)
st.success? || err.include?(KubernetesDeploy::Kubectl::NOT_FOUND_ERROR_TEXT)
Contributor Author

I didn't refactor this to use the new raise_on_404 because I have a WIP branch that moves this check to use Kubeclient, which would be better.

Contributor

@fw42 fw42 left a comment

LGTM, thanks for owning this!

@@ -17,7 +20,7 @@ def initialize(namespace:, context:, logger:, log_failure_by_default:, default_t
raise ArgumentError, "context is required" if context.blank?
end

def run(*args, log_failure: nil, use_context: true, use_namespace: true)
def run(*args, log_failure: nil, use_context: true, use_namespace: true, raise_on_404: false)
Contributor

nitpick: raise_on_404 seems like a bit of a leaky abstraction (the fact that a missing resource is signaled via a 404 error code is an implementation detail that the consumer of this API shouldn't need to know about). How about raise_on_missing or raise_on_resource_not_found?

end

def after_sync
end

def deleted?
@instance_data.dig('metadata', 'deletionTimestamp').present?
Contributor

So this basically means that k8s has asked the pod to "please go away", but in theory the pod might still exist and the process might still be running, right? If so, then I find "deleted?" slightly misleading (since it might still exist and might even terminate successfully, i.e. without an error, right?).

What's the reasoning for not checking whether the pod has actually already been deleted? Because this will catch deletion requests earlier?

Contributor Author

So this basically means that k8s has asked the pod to "please go away", but in theory the pod might still exist and the process might still be running, right?

Yes. I should probably call this terminating? -- that's the terminology kubectl uses.

What's the reasoning for not checking whether the pod has actually already been deleted? Because this will catch deletion requests earlier?

Yes. We've also seen a k8s bug in the past where resources would get stuck in the terminating state indefinitely, even after the underlying container was gone (though it hasn't been reported in recent versions afaik). Note that we also check whether pods have actually been deleted; if so, disappeared? will be true.

@@ -69,9 +69,18 @@ def timeout_message
header + probe_failure_msgs.join("\n") + "\n"
end

def permanent_failed_phase?
Contributor

"permanently"? dunno ESL

Contributor

nitpick, maybe something like this would be easier to follow:

def permanently_failed?
  failed_phase? && (unmanaged? || non_transient_error?)
end

@instance_data = mediator.get_instance(kubectl_resource_type, name)
@instance_data = mediator.get_instance(kubectl_resource_type, name, raise_on_404: true)
rescue KubernetesDeploy::Kubectl::ResourceNotFoundError
@disappeared = true if deploy_started?
Contributor

deploy_started? is here because if the pod hasn't been created yet then the 404 is actually expected, right?

Contributor

I think we need to prioritize the state tracking refactor. We're effectively adding a new state here.

Contributor Author

deploy_started? is here because if the pod hasn't been created yet then the 404 is actually expected, right?

Exactly.

I think we need to prioritize the state tracking refactor. We're effectively adding a new state here.

Are we? This is of course new data about the state of the resource in a general sense, but it isn't a new end state for the resource, which is what that refactor was about (the new state that triggered it was "ignored"). In other words, our end states are still succeeded, failed and timed out, and this is just a new way that resources can fail.

Contributor

I still disagree. The refactor is about mutually exclusive states, not just terminal ones.

end

cached_instance = @cache[kind].fetch(resource_name, {})
Contributor

Previously this used @cache.dig(kind, ...). That makes me believe someone thought that @cache[kind] might be nil. Is that not a concern anymore? The new code here would break in that case. Or is the only way for @cache[kind] to be nil if @cache.key?(kind) is false? i.e. can the key exist but the value legitimately be nil?

Contributor Author

The only place that can set it is this:

      @cache[kind] = JSON.parse(raw_json)["items"].each_with_object({}) do |r, instances|
        instances[r.dig("metadata", "name")] = r
      end

That's way too densely written. If I rewrite it like this it is clearer that it shouldn't be possible to have a nil value:

      instances = {}
      JSON.parse(raw_json)["items"].each do |resource|
        resource_name = resource.dig("metadata", "name")
        instances[resource_name] = resource
      end
      @cache[kind] = instances


task_runner = build_task_runner
deleter_thread = Thread.new do
loop do
Contributor

This is really scrappy. Good job! 😆


def test_deploy_failed_is_true_for_deleted_unmanaged_pods
template = build_pod_template
template["metadata"] = template["metadata"].merge("deletionTimestamp" => "2018-04-13T22:43:23Z")
Contributor

nitpick:

template["metadata"]["deletionTimestamp"] = "2018-04-13T22:43:23Z"

seems simpler


def test_deploy_failed_is_false_for_deleted_managed_pods
template = build_pod_template
template["metadata"] = template["metadata"].merge("deletionTimestamp" => "2018-04-13T22:43:23Z")
Contributor

same


assert_predicate pod, :disappeared?
assert_predicate pod, :deploy_failed?
assert_equal "Pod status: Disappeared. ", pod.failure_message
Contributor

this trailing space is a bit hacky 😄

Contributor

@dturn dturn left a comment

Improvement: same as above, but handles the window where the deletion request has been made successfully but the pod isn't actually gone yet (i.e. deletion timestamp is set). Reviewers, do you agree with this? The containers should at least have received SIGTERM when we see this, but they may still be running.

I think it's fine to say the deploy has failed. If you manually killed the pod, I think it's also ok not to show any more logs, since you should be able to get to them some other way.

My big concern here is using exceptions as control flow. But I don't see a better way.


def failure_message
if phase == FAILED_PHASE_NAME && !TRANSIENT_FAILURE_REASONS.include?(reason)
phase_problem = "Pod status: #{status}. "
phase_problem = if permanent_failed_phase?
Contributor

Should this entire if block be wrapped in if unmanaged?

retry
end
end
sleep 0.1
Contributor

What is this sleep for?

Contributor Author

It's a tiny throttle that takes effect between when we start the thread and when the pod name is generated. The one on L102 takes effect between when the pod name is generated and when the pod has been created.
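
For context, the deleter thread in this test is roughly shaped like the sketch below (the task_runner.pod_name accessor and the kubeclient calls are assumptions for illustration, not the exact diff):

    deleter_thread = Thread.new do
      loop do
        pod_name = task_runner.pod_name
        if pod_name
          begin
            kubeclient.delete_pod(pod_name, @namespace)
            break
          rescue Kubeclient::ResourceNotFoundError
            sleep 0.1 # name is known, but the pod hasn't been created yet (the throttle "on L102")
            retry
          end
        end
        sleep 0.1 # throttle while waiting for the runner to generate the pod name
      end
    end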

Contributor

Should this be in an else block?

/Pod status\: (Terminating|Disappeared)/,
])
ensure
if deleter_thread
Contributor

Why not just deleter_thread&.kill in the ensure block? If we've gotten here, I don't see how it's important that the thread finish running.

Contributor Author

Good question. This is for the case where the test is failing ultimately because of a problem in the thread (which I originally had when I wrote this). If the thread raises and you never join it, you'll never see the error and the test will suck to debug.

Contributor

Can you leave a comment like "join to ensure the error message is printed"? Also, do we want to set a limit on how long we'll wait for the join?

Contributor Author

Added the comment. Thinking about this more, I think the better thing to do is to set abort_on_exception on the thread and not join here.
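
That would be roughly a one-line change on the thread object (Thread#abort_on_exception= is standard Ruby):

    # An unhandled exception in the thread now propagates to the main thread
    # immediately, instead of staying hidden until (or unless) we join.
    deleter_thread.abort_on_exception = true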

@KnVerey
Contributor Author

KnVerey commented Oct 23, 2018

I've pushed a commit to address the comments. Please take a look.

My big concern here is using exceptions as control flow. But, I don't see a better way.

Are you talking about ResourceNotFoundError? It's not really changing the control flow so much as passing a piece of data; the consequence is setting @disappeared, not changing what we do next. Another way to do that would be to make both Kubectl#run and SyncMediator.get_instance able to pass back enhanced information (e.g. multiple values, or some new type of object). That would be a huge change, and I don't really see why it would be better. It would also be possible to have it return nil instead of {} and have that mean "not found" (other errors would still have to return {}), but that feels a bit sketchy to me and handling those possible nils everywhere would not be fun.

tl;dr I haven't thought of an alternative I like better either. Can you be more specific about your concern? Having a resource fetcher raise an exception when the target is missing doesn't seem all that weird to me tbh.
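
For reference, the exception-based shape being defended is essentially the following (the surrounding method and the empty-hash fallback are reconstructions; only the middle lines appear in the hunks above):

    def sync(mediator)
      @instance_data = mediator.get_instance(kubectl_resource_type, name, raise_on_404: true)
    rescue KubernetesDeploy::Kubectl::ResourceNotFoundError
      @disappeared = true if deploy_started?
      @instance_data = {}
    end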

end

def after_sync
end

def terminating?
Contributor

Thanks for giving this a clearer name

@@ -137,6 +137,22 @@ def test_version_info_raises_if_command_fails
end
end

def test_run_with_raise_err_on_404_raises_the_correct_thing
Contributor

forgot to rename raise_err_on_404 to raise_if_not_found here (and on line 149 too)

@@ -86,7 +82,7 @@ def failure_message
container_problems += "> #{red_name}: #{c.doom_reason}\n"
end
end
"#{phase_problem}#{container_problems}".presence
"#{phase_failure_message} #{container_problems}".lstrip.presence
Contributor

Why lstrip and not strip? Otherwise you still have trailing spaces (like the tests below show).

Contributor Author

You're right, it should be strip 🤦‍♀️

Contributor Author

That turned out to open a small can of worms. My last commit centralizes the chunking of the debug message (into that method itself) and adds a test displaying/proving the desired output.

@KnVerey KnVerey merged commit 401fa6b into master Oct 24, 2018
@KnVerey KnVerey deleted the unmanaged_pod_fail_faster branch October 24, 2018 19:31
@fw42 fw42 temporarily deployed to rubygems on October 25, 2018 17:57