
Don't consider ContainerCannotRun with a 128 exit code as doomed #694

Merged
DazWorrall merged 2 commits into master from treat-container-cannot-run-128-as-transient on Feb 19, 2020

Conversation

DazWorrall
Member

What are you trying to accomplish with this PR?

Krane is treating a particular class of ContainerCannotRun termination reasons as an unrecoverable failure - and so failing quickly - when the failure is in fact transient. This PR attempts to identify that case and not consider the pod rollout doomed.

How is this accomplished?

By ignoring ContainerCannotRun termination reasons only if the exit code is 128.

The test case reflects what we are observing in production - a few pods, all on the same node, fail to start with this message (and exit code). That they are all on the same node suggests a shared resource is having issues, but we cannot match the error to any of our own daemonsets/infra, so our working assumption is that this is a problem at the kubelet/cluster layer. We can definitely see that this is a transient issue, though - within a few seconds the pods are able to start successfully.

Rather than treating ContainerCannotRun in its entirety as a transient state - we have definitely seen cases where the issue is fatal, e.g. missing executables - this PR treats only the 128 exit code as transient. I cannot find exactly where this exit code comes from, which is pretty dissatisfying, but I think it's acceptable to make this very targeted change based only on our real-world observations, to improve the user experience.
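A minimal sketch of the intended check, assuming hypothetical helper names (`termination_reason`, `termination_exit_code`) rather than the actual Krane internals:

    # Illustrative sketch only, not Krane's real API.
    # A ContainerCannotRun termination is only treated as unrecoverable when the
    # exit code is something other than 128; exit code 128 has only been observed
    # as a transient, node-level issue.
    def unrecoverable_container_cannot_run?(container_status)
      return false unless termination_reason(container_status) == "ContainerCannotRun"
      termination_exit_code(container_status) != 128
    end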

What could go wrong?

We would treat other, genuinely fatal ContainerCannotRun conditions that also exit with code 128 as transient, and Krane would take longer than it does today to declare the rollout of those resources failed or doomed.

@dturn
Contributor

dturn commented Feb 14, 2020

This causes a fast failure when there is only 1 new pod, since the replica set requires all of its pods to be doomed in order to fail:
https://github.com/Shopify/krane/blob/master/lib/krane/kubernetes_resource/replica_set.rb#L37
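For reference, the condition being described is roughly of this shape (a paraphrase of the idea, not the code at the linked line; `pods` and `doomed?` are illustrative names):

    # With a single new pod, one doomed pod is enough to satisfy `all?`,
    # so the replica set fails fast.
    def deploy_failed?
      pods.present? && pods.all?(&:doomed?)
    end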

However, if there is a 128 exit code ContainerCannotRun that does doom the pod (and therefore the rs/deployment), we're going to hit the timeout which won't print the doom reason. Do you think there are cases where that message would be helpful?

@KnVerey
Contributor

KnVerey commented Feb 14, 2020

However, if there is a 128 exit code ContainerCannotRun that does doom the pod (and therefore the rs/deployment), we're going to hit the timeout

Will we necessarily? If it is fatal, intuitively I'd expect the pods to enter CrashLoopBackoff and still be considered doomed at some point later than we would have caught it before this PR, but still earlier than the timeout.

@DazWorrall
Member Author

However, if there is a 128 exit code ContainerCannotRun that does doom the pod (and therefore the rs/deployment), we're going to hit the timeout which won't print the doom reason.

Will we necessarily? If it is fatal, intuitively I'd expect the pods to enter CrashLoopBackoff

I decided to check this as best I could - we can't reproduce the transient issue, so I created a pod with a missing executable. This causes ContainerCannotRun and does indeed end up in CrashLoopBackOff:

    lastState:
      terminated:
        containerID: docker://a0e3cbe7a7c2c6d37769be13d09cea66294bc88af4358e31bee7d36a5cb344d8
        exitCode: 127
        finishedAt: "2020-02-14T20:27:24Z"
        message: 'OCI runtime create failed: container_linux.go:344: starting container
          process caused "exec: \"/do/not/exist\": stat /do/not/exist: no such file
          or directory": unknown'
        reason: ContainerCannotRun
        startedAt: "2020-02-14T20:27:24Z"
    name: myapp-container
    ready: false
    restartCount: 5
    state:
      waiting:
        message: Back-off 2m40s restarting failed container=myapp-container pod=myapp-pod_default(fbe7b02d-4f67-11ea-9fe6-a2759a231599)
        reason: CrashLoopBackOff

However, I think my use of return here and the way the conditions are ordered mean that the ContainerCannotRun would trump the CrashLoopBackOff, and we'd end up not detecting the doomed container. I will add another test to check, and refactor if necessary.

@DazWorrall
Member Author

I think my use of return here and the way the conditions are ordered mean that the ContainerCannotRun would trump the CrashLoopBackOff, and we'd end up not detecting the doomed container.

@KnVerey @dturn I reordered the conditionals to check state before lastState, to try to fix this without having to consider the values of both at the same time. The tests are passing, but I wonder if you can think of any problems with this approach?
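A minimal sketch of the reordered check, assuming hypothetical helpers (`waiting_reason`, `last_terminated_reason`, `last_terminated_exit_code`) rather than the actual Krane pod code:

    # Illustrative only: check the container's current state before its lastState,
    # so a pod now in CrashLoopBackOff is still flagged as doomed even though its
    # lastState records ContainerCannotRun with exit code 128 (treated as transient).
    def doomed?
      return true if waiting_reason == "CrashLoopBackOff"
      if last_terminated_reason == "ContainerCannotRun"
        return last_terminated_exit_code != 128 # 128 observed to be transient
      end
      false
    end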

Contributor

@KnVerey KnVerey left a comment


The tests are passing but I wonder if you can think of any problems with this approach?

The only thing that occurs to me is that CrashLoopBackoff generally doesn't have a useful message, so if we choose to display the CrashLoopBackoff when we previously would have chosen the (e.g.) ContainerCannotRun, our error message will be less enlightening.

@DazWorrall
Member Author

@KnVerey thanks for that, I'll keep an eye on it once this ships and see if we need to tweak the messaging further.

@DazWorrall DazWorrall merged commit 59da174 into master Feb 19, 2020
@DazWorrall DazWorrall deleted the treat-container-cannot-run-128-as-transient branch February 19, 2020 17:51
KnVerey pushed a commit that referenced this pull request Apr 13, 2020

* Don't consider `ContainerCannotRun` with a 128 exit code as doomed

* Check state before lastState when testing for doomed pods