Don't consider ContainerCannotRun with a 128 exit code as doomed #694
Conversation
This causes fast failure when there is only 1 new pod, since the replica set requires all pods to be doomed to fail. However, if there is a 128 exit code…
Will we necessarily? If it is fatal, intuitively I'd expect the pods to enter CrashLoopBackOff and still be considered doomed at some point later than we would have caught it before this PR, but still earlier than the timeout.
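To illustrate the "all pods doomed" condition mentioned above, here is a minimal hypothetical sketch (the helper name and pod representation are illustrative, not Krane's actual implementation):

```ruby
# Hypothetical sketch: a replica set's rollout is only declared doomed
# when every one of its pods is doomed. With a single new pod, one
# failing container is enough to fail fast, which is why a transient
# ContainerCannotRun on that pod can sink the whole rollout.
def replica_set_doomed?(pods)
  !pods.empty? && pods.all? { |pod| pod[:doomed] }
end
```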
I decided to check this as best as I could; we can't reproduce the transient issue, so I created a pod with a missing executable. This causes:

```yaml
lastState:
  terminated:
    containerID: docker://a0e3cbe7a7c2c6d37769be13d09cea66294bc88af4358e31bee7d36a5cb344d8
    exitCode: 127
    finishedAt: "2020-02-14T20:27:24Z"
    message: 'OCI runtime create failed: container_linux.go:344: starting container
      process caused "exec: \"/do/not/exist\": stat /do/not/exist: no such file
      or directory": unknown'
    reason: ContainerCannotRun
    startedAt: "2020-02-14T20:27:24Z"
name: myapp-container
ready: false
restartCount: 5
state:
  waiting:
    message: Back-off 2m40s restarting failed container=myapp-container pod=myapp-pod_default(fbe7b02d-4f67-11ea-9fe6-a2759a231599)
    reason: CrashLoopBackOff
```

However I think my…
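The relevant fields of a status like the one above can be inspected programmatically. This is an illustrative sketch (not Krane's code); note that the 127 exit code here is a genuinely fatal missing-executable case, unlike the 128 code this PR targets:

```ruby
require "yaml"

# Abbreviated version of the container status shown above.
status = YAML.safe_load(<<~STATUS)
  lastState:
    terminated:
      exitCode: 127
      reason: ContainerCannotRun
  state:
    waiting:
      reason: CrashLoopBackOff
STATUS

terminated = status.dig("lastState", "terminated")
# Exit code 127 (missing executable) should still be treated as fatal;
# only a 128 exit code would be given the benefit of the doubt.
fatal = terminated["reason"] == "ContainerCannotRun" &&
        terminated["exitCode"] != 128
```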
@KnVerey @dturn I reordered the conditionals to check…
The tests are passing but I wonder if you can think of any problems with this approach?
The only thing that occurs to me is that CrashLoopBackOff generally doesn't have a useful message, so if we choose to display the CrashLoopBackOff when we previously would have chosen the (e.g.) ContainerCannotRun, our error message will be less enlightening.
@KnVerey thanks for that, I'll keep an eye on it once this ships and see if we need to tweak the messaging further.
What are you trying to accomplish with this PR?
Krane is treating a particular class of ContainerCannotRun termination reason as an unrecoverable failure, and so failing quickly, when it is in fact transient. This PR attempts to identify that case and not consider the pod rollout doomed.

How is this accomplished?
By ignoring ContainerCannotRun termination reasons only if the exit code is 128.

The test case reflects what we are observing in production: a few pods, all on the same node, fail to start with this message (and exit code). That they are all on the same node suggests that a shared resource is having issues, but we cannot match the error with any of our own daemonsets/infra, so our working assumption is that this is a problem at the kubelet/cluster layer. We definitely see that this is a transient issue, though: within a few seconds the pods are able to start successfully.
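The check described above can be sketched as follows. This is a minimal hypothetical sketch of the idea, not Krane's actual method or internals:

```ruby
# Sketch: ContainerCannotRun is only treated as an unrecoverable
# (doomed) failure when the exit code is NOT 128, since a 128 exit
# code has been observed in production to be transient.
TRANSIENT_EXIT_CODE = 128

def doomed_termination?(reason, exit_code)
  reason == "ContainerCannotRun" && exit_code != TRANSIENT_EXIT_CODE
end
```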
Rather than treat ContainerCannotRun in its entirety as a transient state (we have definitely seen cases where the issue is fatal, e.g. missing executables), instead treat only the 128 exit code as transient. I cannot find exactly where this exit code is coming from, which is pretty dissatisfying, but I think it's acceptable to make this very targeted change based only on our real-world observations, to improve the user experience.

What could go wrong?
We treat other, genuinely fatal ContainerCannotRun conditions with a 128 exit code as transient, and Krane takes longer than today to declare the rollout of those resources failed or doomed.