
Don't consider ContainerCannotRun with a 128 exit code as doomed #694

Merged
DazWorrall merged 2 commits into master from treat-container-cannot-run-128-as-transient on Feb 19, 2020

Conversation

DazWorrall
Member

What are you trying to accomplish with this PR?

Krane is treating a particular class of ContainerCannotRun termination reasons as an unrecoverable failure - and so failing quickly - when the failure is in fact transient. This PR attempts to identify that case and not consider the pod rollout doomed.

How is this accomplished?

By ignoring ContainerCannotRun termination reasons only if the exit code is 128.

The test case reflects what we are observing in production - a few pods, all on the same node, fail to start with this message (and exit code). That they are all on the same node suggests a shared resource is having issues, but we cannot match the error to any of our own daemonsets/infra, so our working assumption is that this is a problem at the kubelet/cluster layer. We can definitely see that this is a transient issue, though - within a few seconds the pods are able to start successfully.

Rather than treating ContainerCannotRun in its entirety as a transient state - we have definitely seen cases where the issue is fatal, e.g. missing executables - this PR treats only the 128 exit code as transient. I cannot find exactly where this exit code comes from, which is pretty dissatisfying, but I think it's acceptable to make this very targeted change based only on our real-world observations, to improve the user experience.
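A minimal sketch of the intended check, assuming hypothetical helper names (`termination_reason`, `termination_exit_code`) rather than the actual Krane internals:

    # Illustrative sketch only, not Krane's real API.
    # A ContainerCannotRun termination is only treated as unrecoverable when the
    # exit code is something other than 128; exit code 128 has only been observed
    # as a transient, node-level issue.
    def unrecoverable_container_cannot_run?(container_status)
      return false unless termination_reason(container_status) == "ContainerCannotRun"
      termination_exit_code(container_status) != 128
    end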

What could go wrong?

We would treat other, genuinely fatal ContainerCannotRun conditions that also exit with code 128 as transient, and Krane would take longer than it does today to declare the rollout of those resources failed or doomed.

@dturn
Contributor

dturn commented Feb 14, 2020

This causes a fast failure when there is only 1 new pod, since the replica set requires all of its pods to be doomed in order to fail:
https://github.com/Shopify/krane/blob/master/lib/krane/kubernetes_resource/replica_set.rb#L37
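For reference, the condition being described is roughly of this shape (a paraphrase of the idea, not the code at the linked line; `pods` and `doomed?` are illustrative names):

    # With a single new pod, one doomed pod is enough to satisfy `all?`,
    # so the replica set fails fast.
    def deploy_failed?
      pods.present? && pods.all?(&:doomed?)
    end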

However, if there is a 128 exit code ContainerCannotRun that does doom the pod (and therefore the rs/deployment), we're going to hit the timeout which won't print the doom reason. Do you think there are cases where that message would be helpful?

@KnVerey
Contributor

KnVerey commented Feb 14, 2020

However, if there is a 128 exit code ContainerCannotRun that does doom the pod (and therefore the rs/deployment), we're going to hit the timeout

Will we necessarily? If it is fatal, intuitively I'd expect the pods to enter CrashLoopBackoff and still be considered doomed at some point later than we would have caught it before this PR, but still earlier than the timeout.

@DazWorrall
Member Author

However, if there is a 128 exit code ContainerCannotRun that does doom the pod (and therefore the rs/deployment), we're going to hit the timeout which won't print the doom reason.

Will we necessarily? If it is fatal, intuitively I'd expect the pods to enter CrashLoopBackoff

I decided to check this as best I could - we can't reproduce the transient issue, so I created a pod with a missing executable. This causes ContainerCannotRun and does indeed end up in CrashLoopBackOff:

    lastState:
      terminated:
        containerID: docker://a0e3cbe7a7c2c6d37769be13d09cea66294bc88af4358e31bee7d36a5cb344d8
        exitCode: 127
        finishedAt: "2020-02-14T20:27:24Z"
        message: 'OCI runtime create failed: container_linux.go:344: starting container
          process caused "exec: \"/do/not/exist\": stat /do/not/exist: no such file
          or directory": unknown'
        reason: ContainerCannotRun
        startedAt: "2020-02-14T20:27:24Z"
    name: myapp-container
    ready: false
    restartCount: 5
    state:
      waiting:
        message: Back-off 2m40s restarting failed container=myapp-container pod=myapp-pod_default(fbe7b02d-4f67-11ea-9fe6-a2759a231599)
        reason: CrashLoopBackOff

However, I think my use of return here and the way the conditions are ordered mean that the ContainerCannotRun would trump the CrashLoopBackOff, and we'd end up not detecting the doomed container. I will add another test to check, and refactor if necessary.

@DazWorrall
Member Author

I think my use of return here and the way the conditions are ordered mean that the ContainerCannotRun would trump the CrashLoopBackOff, and we'd end up not detecting the doomed container.

@KnVerey @dturn I reordered the conditionals to check state before lastState, to try to fix this without having to consider the values of both at the same time. The tests are passing, but I wonder if you can think of any problems with this approach?
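A minimal sketch of the reordered check, assuming hypothetical helpers (`waiting_reason`, `last_terminated_reason`, `last_terminated_exit_code`) rather than the actual Krane pod code:

    # Illustrative only: check the container's current state before its lastState,
    # so a pod now in CrashLoopBackOff is still flagged as doomed even though its
    # lastState records ContainerCannotRun with exit code 128 (treated as transient).
    def doomed?
      return true if waiting_reason == "CrashLoopBackOff"
      if last_terminated_reason == "ContainerCannotRun"
        return last_terminated_exit_code != 128 # 128 observed to be transient
      end
      false
    end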

Contributor

@KnVerey KnVerey left a comment


The tests are passing but I wonder if you can think of any problems with this approach?

The only thing that occurs to me is that CrashLoopBackoff generally doesn't have a useful message, so if we choose to display the CrashLoopBackoff when we previously would have chosen the (e.g.) ContainerCannotRun, our error message will be less enlightening.

@DazWorrall
Member Author

@KnVerey thanks for that, I'll keep an eye on it once this ships and see if we need to tweak the messaging further.

@DazWorrall DazWorrall merged commit 59da174 into master Feb 19, 2020
@DazWorrall DazWorrall deleted the treat-container-cannot-run-128-as-transient branch February 19, 2020 17:51
KnVerey pushed a commit that referenced this pull request Apr 13, 2020

* Don't consider `ContainerCannotRun` with a 128 exit code as doomed

* Check state before lastState when testing for doomed pods