Do not fail eagerly on ImagePullBackoff #477

Merged: 2 commits into master from image_pull on Apr 29, 2019
Conversation

KnVerey (Contributor) commented Apr 26, 2019

What are you trying to accomplish with this PR?

At Shopify, our largest apps are seeing many deploys failing because of transient registry errors. Kubernetes (well, the container runtime) is doing the right thing and retrying until the image is successfully pulled, usually within a couple minutes, but because kubernetes-deploy declares a resource failed as soon as it observes ImagePullBackoff, our developers are being overexposed to these transient errors.

ErrImagePull's messages give us enough information to determine whether the image actually isn't in the registry, which is the case we were trying to catch. ImagePullBackoff does not. Given how quickly the former turns into the latter regardless of the underlying error reason, our assumption that we can use ImagePullBackoff to fail the deploy is breaking down at scale.

How is this accomplished?

By always requiring "not found" to be part of the error message, which I believe is currently only the case for the initial ErrImagePull messages.
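
To illustrate the narrowed behaviour, here is a minimal standalone sketch (not the actual doom_reason implementation; fatal_image_error? and the ErrImagePull message text are made up for illustration, while the "Back-off pulling image" format is confirmed from the kubelet source linked later in this thread):

# Minimal sketch of the narrowed check; fatal_image_error? is a hypothetical helper,
# not part of kubernetes-deploy.
def fatal_image_error?(reason, message)
  %w(ErrImagePull ImagePullBackOff).include?(reason) && !!message.match(/not found/i)
end

# Illustrative ErrImagePull message naming a missing image: fail the deploy fast.
fatal_image_error?("ErrImagePull", 'manifest for registry.example.com/app:abc123 not found')
# => true

# The kubelet's ImagePullBackOff message ("Back-off pulling image ...") never contains
# "not found", so we keep waiting and let the container runtime retry the pull.
fatal_image_error?("ImagePullBackOff", 'Back-off pulling image "registry.example.com/app:abc123"')
# => false

# Transient registry error: also keep waiting instead of failing eagerly.
fatal_image_error?("ErrImagePull", 'received unexpected HTTP status: 500 Internal Server Error')
# => false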

What could go wrong?

  • We will catch cases where developers genuinely did not push their image before deploying (or did not wait for automation to do so) much less often. Essentially, we have to get lucky and observe the pod in the window between the first and second errors being thrown on the Kubernetes side.
  • As a result of the above, the experience of accidentally deploying a missing image will be inconsistent: sometimes it will fail fast, and sometimes it will hang until a timeout.
  • This helps administrators paper over registry problems.

@Shopify/cloudx

@@ -210,8 +210,7 @@ def doom_reason
 elsif limbo_reason == "CrashLoopBackOff"
   exit_code = @status.dig('lastState', 'terminated', 'exitCode')
   "Crashing repeatedly (exit #{exit_code}). See logs for more information."
-elsif %w(ImagePullBackOff ErrImagePull).include?(limbo_reason) &&
-    limbo_message.match(/(?:not found)|(?:back-off)/i)
+elsif %w(ErrImagePull ImagePullBackOff).include?(limbo_reason) && limbo_message.match(/not found/i)
KnVerey (Contributor, Author):
Alternatively, we could remove this entirely. In practice I don't think ImagePullBackoff ever includes "not found" in existing k8s versions, so this detection is extremely racy. I can see the argument that a consistent timeout would be better for UX.

dturn (Contributor):

I think we should try and be helpful where we can.

Contributor:

> In practice I don't think ImagePullBackoff ever includes "not found" in existing k8s versions

This comes from here I think: https://github.com/kubernetes/kubernetes/blob/d24fe8a801748953a5c34fd34faa8005c6ad1770/pkg/kubelet/images/image_manager.go#L126

fmt.Sprintf("Back-off pulling image %q", container.Image)

Shouldn't we then remove ImagePullBackOff, because it would always fail the "not found" string match?

KnVerey (Contributor, Author):

Yes, it was lazy of me not to look up the source. Thanks for finding it! Since it's confirmed that the message can never match, I agree that keeping ImagePullBackOff here is just misleading.

@@ -355,33 +355,6 @@ def test_invalid_k8s_spec_that_is_valid_yaml_but_has_no_template_path_in_error_p
 ], in_order: true)
 end

-def test_bad_container_image_on_unmanaged_pod_halts_and_fails_deploy
KnVerey (Contributor, Author):

This cannot be integration tested anymore because it is super racy.

@@ -707,10 +664,10 @@ def test_failed_deploy_to_nonexistent_namespace
 @namespace = original_ns
 end

-def test_failure_logs_from_unmanaged_pod_appear_in_summary_section
+def test_unmanaged_pod_failure_halts_deploy_and_displays_logs_correctly
KnVerey (Contributor, Author):

Modified this test to also cover the basic fact that it halts the deploy, which was previously covered by test_bad_container_image_on_unmanaged_pod_halts_and_fails_deploy.

@@ -58,25 +58,42 @@ def test_deploy_failed_is_false_for_intermittent_image_error
 assert_nil(pod.failure_message)
 end

-def test_deploy_failed_is_true_for_image_pull_backoff
+def test_deploy_failed_is_true_for_image_pull_backoff_with_specific_error
KnVerey (Contributor, Author):

AFAIK this doesn't actually happen though.
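
For reference, a hypothetical minitest-style sketch of what this renamed case covers (build_synced_pod and the message text are made up for illustration; the real test uses the project's own fixtures and helpers):

# Hypothetical sketch only, not the project's actual test helpers.
def test_deploy_failed_is_true_for_image_pull_backoff_with_specific_error
  # A waiting container whose back-off message explicitly says the image was not found.
  # As noted above, current kubelets don't appear to emit "not found" for ImagePullBackOff,
  # so this case is kept for completeness rather than because it is observed in practice.
  pod = build_synced_pod( # hypothetical helper
    "reason" => "ImagePullBackOff",
    "message" => 'Back-off pulling image: manifest for registry.example.com/app:abc123 not found'
  )
  assert_predicate(pod, :deploy_failed?)
end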

dturn (Contributor) left a comment

Code lgtm. That being said, we're making our code less helpful to handle a docker/gcr issue. Could we run core on this branch for a bit before merging in hopes that this gets resolved quickly?


KnVerey (Contributor, Author) commented Apr 26, 2019

> Could we run core on this branch for a bit before merging in hopes that this gets resolved quickly?

We could, but my feeling is that core has already uncovered the problem: the code we're removing here undermines a fundamental resiliency win Kubernetes provides, by making judgements based on inadequate information.

@KnVerey KnVerey requested a review from stefanmb April 26, 2019 19:42
DazWorrall (Member) left a comment

Code LGTM, and I agree that this seems not to have scaled well, so it's OK to drop this optimisation.

These failures have become so common in the last week that it would be great to get this into production ASAP. It's possible that we'll find some direct cause of the uptick in failures that we can fix, in which case we can revisit this decision, but so far neither we nor Google have found anything, so I lean towards acting now.

@KnVerey KnVerey merged commit c941927 into master Apr 29, 2019
@KnVerey KnVerey deleted the image_pull branch April 29, 2019 15:29
@KnVerey KnVerey temporarily deployed to rubygems April 29, 2019 17:38