Notes on outage in nasa-ghg #3194
The hub pod didn't recover from the crash. Looking at describe, I saw:
And looking at the logs, I see:
Which makes sense - if a pod is restarted,
Ah excellent, I'm happy that the failure to recover is something we can improve! I've seen that when things crash, pods with mounted disks (like the hub pod) can be slower to get running again, because k8s needs to realize the disks are no longer coupled to the old container/pod. But in my experience so far, this has only been a ~5 minute issue.
- Git clone into a subdirectory of the `emptyDir` mount, as the mount itself is default owned by root. This was the cause of 2i2c-org#1695, which was previously solved by adding another initContainer. This PR removes that, speeding up starts.
- Remove existing directory if it exists before cloning. This prevents the container from failing when the *pod* is restarted due to node issues, as that doesn't clear out `emptyDirs`. And git freaks out if the repo already exists.
- Remove redundant `IfNotPresent` imagePullPolicy from alpine/git, as that is not needed when we specify a non :latest tag.
Ref 2i2c-org#3194
- Git clone into a subdirectory of the `emptyDir` mount, as the mount itself is default owned by root. This was the cause of 2i2c-org#1695, which was previously solved by adding another initContainer. This PR removes that, speeding up starts.
- Remove existing directory if it exists before cloning. This prevents the container from failing when the *pod* is restarted due to node issues, as that doesn't clear out `emptyDirs`. And git freaks out if the repo already exists.
- Remove redundant `IfNotPresent` imagePullPolicy from alpine/git, as that is not needed when we specify a non :latest tag.
Thanks to Erik's investigation in 2i2c-org#3165 (comment)
Ref 2i2c-org#3194
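The "remove, then clone into a subdirectory" idea from the commit message can be sketched roughly as follows. This is only an illustration in Python, not the actual initContainer command from #3195; the function name `fresh_clone`, the default `subdir` value, and the injectable `run` parameter are all assumptions made for the example.

```python
import shutil
import subprocess
from pathlib import Path


def fresh_clone(repo_url, mount_dir, subdir="repo", run=subprocess.run):
    """Clone repo_url into mount_dir/subdir, discarding any earlier checkout.

    Hypothetical helper: names and defaults are illustrative only.
    """
    # Clone into a subdirectory rather than the mount root, since the
    # mount itself may be owned by root.
    target = Path(mount_dir) / subdir

    # An emptyDir survives a container restart within the same pod, so a
    # previous checkout may still be here. git refuses to clone into an
    # existing non-empty directory, so remove it first.
    if target.exists():
        shutil.rmtree(target)

    run(["git", "clone", repo_url, str(target)], check=True)
    return target
```

Running it twice against the same mount directory should succeed both times, which is the property the hub pod was missing when its container restarted.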
Done in #3195 @consideRatio!
So I think the node being screwed up was kinda normal to expect - the problem was that everything else recovered and auto healed automatically, except for the hub pod. #3195 will fix that.
Closing as resolved!
Yuvi restarted the hub pod, but looking at the other pods, they all had their containers restart at `Fri, 29 Sep 2023 17:43:27 +0200` - what happened?
Looking at nodes I see that a single core node has been running for a long time, and there is one 112m old user node as well.
Looking with `kubectl describe pod` to see the "Last state" on these pods, we see this for several (all?) restarted pods:
It seems that something major crashed on the node for some reason, all pods including kube-system restarted - but it recovered to some degree at least.
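To check the "Last state" across many pods at once rather than describing them one by one, something like the following sketch could help. It assumes the input is the parsed JSON from `kubectl get pods --all-namespaces -o json`; the function name `last_terminations` is made up for this example, while the field names follow the Kubernetes Pod status schema.

```python
def last_terminations(pod_list):
    """Return one row per container that has a recorded previous termination.

    pod_list is assumed to be the parsed output of
    `kubectl get pods --all-namespaces -o json`.
    """
    rows = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for cs in statuses:
            # lastState.terminated is only set if the container was
            # terminated at least once before its current run.
            term = cs.get("lastState", {}).get("terminated")
            if term:
                rows.append(
                    (
                        pod_name,
                        cs["name"],
                        cs.get("restartCount", 0),
                        term.get("reason"),
                        term.get("finishedAt"),
                    )
                )
    return rows
```

If many pods on the same node show a `finishedAt` within the same minute, that points at a node-level event rather than individual pod failures.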