Do not fail eagerly on ImagePullBackoff #477

Merged: 2 commits into master from image_pull on Apr 29, 2019
Conversation

KnVerey (Contributor) commented Apr 26, 2019

What are you trying to accomplish with this PR?

At Shopify, our largest apps are seeing many deploys failing because of transient registry errors. Kubernetes (well, the container runtime) is doing the right thing and retrying until the image is successfully pulled, usually within a couple minutes, but because kubernetes-deploy declares a resource failed as soon as it observes ImagePullBackoff, our developers are being overexposed to these transient errors.

ErrImagePull's messages give us enough information to determine whether the image actually isn't in the registry, which is the case we were trying to catch. ImagePullBackoff does not. Given how quickly the former turns into the latter regardless of the underlying error reason, our assumption that we can use ImagePullBackoff to fail the deploy is breaking down at scale.

How is this accomplished?

By always requiring "not found" to be part of the error message, which I believe is currently only the case for the initial ErrImagePull messages.
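
To illustrate the narrowed behaviour, here is a minimal standalone sketch (not the actual doom_reason implementation; fatal_image_error? and the ErrImagePull message text are made up for illustration, while the "Back-off pulling image" format is confirmed from the kubelet source linked later in this thread):

# Minimal sketch of the narrowed check; fatal_image_error? is a hypothetical helper,
# not part of kubernetes-deploy.
def fatal_image_error?(reason, message)
  %w(ErrImagePull ImagePullBackOff).include?(reason) && !!message.match(/not found/i)
end

# Illustrative ErrImagePull message naming a missing image: fail the deploy fast.
fatal_image_error?("ErrImagePull", 'manifest for registry.example.com/app:abc123 not found')
# => true

# The kubelet's ImagePullBackOff message ("Back-off pulling image ...") never contains
# "not found", so we keep waiting and let the container runtime retry the pull.
fatal_image_error?("ImagePullBackOff", 'Back-off pulling image "registry.example.com/app:abc123"')
# => false

# Transient registry error: also keep waiting instead of failing eagerly.
fatal_image_error?("ErrImagePull", 'received unexpected HTTP status: 500 Internal Server Error')
# => false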

What could go wrong?

  • We will catch cases where developers genuinely did not push their image before deploying (or did not wait for automation to do so) much less often. Essentially, we have to get lucky and observe the pod in the window between the first and second errors being thrown on the Kubernetes side.
  • As a result of the above, the experience of accidentally deploying a missing image will be inconsistent: sometimes it will fail fast, and sometimes it will hang until a timeout.
  • This helps administrators paper over registry problems.

@Shopify/cloudx

@@ -210,8 +210,7 @@ def doom_reason
 elsif limbo_reason == "CrashLoopBackOff"
   exit_code = @status.dig('lastState', 'terminated', 'exitCode')
   "Crashing repeatedly (exit #{exit_code}). See logs for more information."
-elsif %w(ImagePullBackOff ErrImagePull).include?(limbo_reason) &&
-    limbo_message.match(/(?:not found)|(?:back-off)/i)
+elsif %w(ErrImagePull ImagePullBackOff).include?(limbo_reason) && limbo_message.match(/not found/i)
KnVerey (Contributor, Author):
Alternatively, we could remove this entirely. In practice I don't think ImagePullBackoff ever includes "not found" in existing k8s versions, so this detection is extremely racy. I can see the argument that a consistent timeout would be better for UX.

dturn (Contributor):

I think we should try and be helpful where we can.

Contributor:

> In practice I don't think ImagePullBackoff ever includes "not found" in existing k8s versions

This comes from here I think: https://github.com/kubernetes/kubernetes/blob/d24fe8a801748953a5c34fd34faa8005c6ad1770/pkg/kubelet/images/image_manager.go#L126

fmt.Sprintf("Back-off pulling image %q", container.Image)

Shouldn't we then remove ImagePullBackOff, because it would always fail the "not found" string match?

KnVerey (Contributor, Author):

Yes, it was lazy of me not to look up the source. Thanks for finding it! Since it's confirmed that the message can never match, I agree that keeping ImagePullBackOff here is just misleading.

@@ -355,33 +355,6 @@ def test_invalid_k8s_spec_that_is_valid_yaml_but_has_no_template_path_in_error_p
 ], in_order: true)
 end

-def test_bad_container_image_on_unmanaged_pod_halts_and_fails_deploy
KnVerey (Contributor, Author):

This cannot be integration tested anymore because it is super racy.

@@ -707,10 +664,10 @@ def test_failed_deploy_to_nonexistent_namespace
 @namespace = original_ns
 end

-def test_failure_logs_from_unmanaged_pod_appear_in_summary_section
+def test_unmanaged_pod_failure_halts_deploy_and_displays_logs_correctly
KnVerey (Contributor, Author):

Modified this test to also cover the basic fact that it halts the deploy, which was previously covered by test_bad_container_image_on_unmanaged_pod_halts_and_fails_deploy.

@@ -58,25 +58,42 @@ def test_deploy_failed_is_false_for_intermittent_image_error
 assert_nil(pod.failure_message)
 end

-def test_deploy_failed_is_true_for_image_pull_backoff
+def test_deploy_failed_is_true_for_image_pull_backoff_with_specific_error
KnVerey (Contributor, Author):

AFAIK this doesn't actually happen though.
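
For reference, a hypothetical minitest-style sketch of what this renamed case covers (build_synced_pod and the message text are made up for illustration; the real test uses the project's own fixtures and helpers):

# Hypothetical sketch only, not the project's actual test helpers.
def test_deploy_failed_is_true_for_image_pull_backoff_with_specific_error
  # A waiting container whose back-off message explicitly says the image was not found.
  # As noted above, current kubelets don't appear to emit "not found" for ImagePullBackOff,
  # so this case is kept for completeness rather than because it is observed in practice.
  pod = build_synced_pod( # hypothetical helper
    "reason" => "ImagePullBackOff",
    "message" => 'Back-off pulling image: manifest for registry.example.com/app:abc123 not found'
  )
  assert_predicate(pod, :deploy_failed?)
end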

dturn (Contributor) left a comment

Code lgtm. That being said, we're making our code less helpful to handle a docker/gcr issue. Could we run core on this branch for a bit before merging in hopes that this gets resolved quickly?


KnVerey (Contributor, Author) commented Apr 26, 2019

> Could we run core on this branch for a bit before merging in hopes that this gets resolved quickly?

We could, but my feeling is that core has already uncovered the problem: the code we're removing here undermines a fundamental resiliency win Kubernetes provides, by making judgements based on inadequate information.

@KnVerey KnVerey requested a review from stefanmb April 26, 2019 19:42
DazWorrall (Member) left a comment

Code LGTM, and I agree that this seems not to have scaled well, so it's OK to drop this optimisation.

These failures have become so common in the last week that it would be great to get this into production ASAP. It's possible that we'll find some direct cause of the uptick in failures that we can fix, in which case we can revisit this decision, but so far neither we nor Google have found anything, so I lean towards acting now.

@KnVerey KnVerey merged commit c941927 into master Apr 29, 2019
@KnVerey KnVerey deleted the image_pull branch April 29, 2019 15:29
@KnVerey KnVerey temporarily deployed to rubygems April 29, 2019 17:38