Do not fail eagerly on ImagePullBackoff #477
Conversation
@@ -210,8 +210,7 @@ def doom_reason
     elsif limbo_reason == "CrashLoopBackOff"
       exit_code = @status.dig('lastState', 'terminated', 'exitCode')
       "Crashing repeatedly (exit #{exit_code}). See logs for more information."
-    elsif %w(ImagePullBackOff ErrImagePull).include?(limbo_reason) &&
-        limbo_message.match(/(?:not found)|(?:back-off)/i)
+    elsif %w(ErrImagePull ImagePullBackOff).include?(limbo_reason) && limbo_message.match(/not found/i)
Alternatively, we could remove this entirely. In practice I don't think ImagePullBackoff ever includes "not found" in existing k8s versions, so this detection is extremely racy. I can see the argument that a consistent timeout would be better for UX.
I think we should try and be helpful where we can.
> In practice I don't think ImagePullBackoff ever includes "not found" in existing k8s versions
This comes from here I think: https://github.com/kubernetes/kubernetes/blob/d24fe8a801748953a5c34fd34faa8005c6ad1770/pkg/kubelet/images/image_manager.go#L126
fmt.Sprintf("Back-off pulling image %q", container.Image)
Shouldn't we then remove ImagePullBackOff, because it would always fail the "not found" string match?
Yes, it was lazy of me not to look up the source. Thanks for finding it! Since it's confirmed that the error can never match, I agree that keeping ImagePullBackoff here is just misleading.
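A quick illustration of that, assuming the kubelet format string quoted above (the image reference below is made up):

```ruby
# The kubelet's back-off message is always of the form "Back-off pulling image \"<image>\"",
# so it can never contain "not found", no matter why the pull is failing.
image = "gcr.io/example/app:v1" # hypothetical image reference
backoff_message = %(Back-off pulling image "#{image}")

puts backoff_message                       # Back-off pulling image "gcr.io/example/app:v1"
puts !!backoff_message.match(/not found/i) # false: the new check can never fire on ImagePullBackOff
```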
@@ -355,33 +355,6 @@ def test_invalid_k8s_spec_that_is_valid_yaml_but_has_no_template_path_in_error_p
       ], in_order: true)
     end
 
-  def test_bad_container_image_on_unmanaged_pod_halts_and_fails_deploy
This cannot be integration tested anymore because it is super racy.
@@ -707,10 +664,10 @@ def test_failed_deploy_to_nonexistent_namespace
     @namespace = original_ns
   end
 
-  def test_failure_logs_from_unmanaged_pod_appear_in_summary_section
+  def test_unmanaged_pod_failure_halts_deploy_and_displays_logs_correctly
Modified this test to also cover the basic fact that it halts the deploy, which was previously covered by test_bad_container_image_on_unmanaged_pod_halts_and_fails_deploy.
@@ -58,25 +58,42 @@ def test_deploy_failed_is_false_for_intermittent_image_error
     assert_nil(pod.failure_message)
   end
 
-  def test_deploy_failed_is_true_for_image_pull_backoff
+  def test_deploy_failed_is_true_for_image_pull_backoff_with_specific_error
AFAIK this doesn't actually happen though.
Code LGTM. That being said, we're making our code less helpful in order to handle a Docker/GCR issue. Could we run core on this branch for a bit before merging, in hopes that this gets resolved quickly?
@@ -210,8 +210,7 @@ def doom_reason
     elsif limbo_reason == "CrashLoopBackOff"
       exit_code = @status.dig('lastState', 'terminated', 'exitCode')
       "Crashing repeatedly (exit #{exit_code}). See logs for more information."
-    elsif %w(ImagePullBackOff ErrImagePull).include?(limbo_reason) &&
-        limbo_message.match(/(?:not found)|(?:back-off)/i)
+    elsif %w(ErrImagePull ImagePullBackOff).include?(limbo_reason) && limbo_message.match(/not found/i)
> I think we should try and be helpful where we can.

We could, but my feeling is that core uncovered that the code we're removing here is undermining a fundamental resiliency win Kubernetes provides by making judgements based on inadequate information.
Code LGTM, and I agree that this seems not to have scaled well, so it's OK to drop this optimisation.
These failures have become so common in the last week that it would be great to get this into production ASAP. It's possible that we'll find some direct cause of the uptick in failures that we can fix, in which case we can revisit this decision, but so far neither we nor Google have found anything, so I lean towards acting now.
What are you trying to accomplish with this PR?

At Shopify, our largest apps are seeing many deploys failing because of transient registry errors. Kubernetes (well, the container runtime) is doing the right thing and retrying until the image is successfully pulled, usually within a couple of minutes, but because kubernetes-deploy declares a resource failed as soon as it observes ImagePullBackoff, our developers are being overexposed to these transient errors. ErrImagePull's messages give us enough information to determine whether the image actually isn't in the registry, which is the case we were trying to catch. ImagePullBackoff does not. Given how quickly the former turns into the latter regardless of the underlying error reason, our assumption that we can use ImagePullBackoff to fail the deploy is breaking down at scale.

How is this accomplished?

Always requiring "not found" to be part of the error message, which in reality I believe is only the case for initial ErrImagePull messages at this time (the difference is sketched below).

What could go wrong?
@Shopify/cloudx
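To make the distinction described above concrete, here is a minimal, hypothetical sketch (not the actual kubernetes-deploy code; image names and messages are made up, and real messages vary by container runtime and registry) of the fail-fast decision this PR leaves in place:

```ruby
# Minimal sketch of the remaining fail-fast rule: fail the deploy only when the
# container's waiting reason is an image-pull error AND the message says the
# image is genuinely absent; otherwise keep waiting for the retry to succeed.
def fail_deploy_for?(waiting_state)
  %w(ErrImagePull ImagePullBackOff).include?(waiting_state['reason']) &&
    !!waiting_state['message'].to_s.match(/not found/i)
end

puts fail_deploy_for?('reason' => 'ErrImagePull',
                      'message' => 'manifest for gcr.io/example/app:v1 not found')   # => true: fail fast
puts fail_deploy_for?('reason' => 'ImagePullBackOff',
                      'message' => 'Back-off pulling image "gcr.io/example/app:v1"') # => false: wait it out
```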