
Notes on outage in nasa-ghg #3194

Closed
consideRatio opened this issue Sep 29, 2023 · 5 comments
@consideRatio (Member)

Yuvi restarted the hub pod, but looking at the other pods, they all had their containers restarted at Fri, 29 Sep 2023 17:43:27 +0200 - what happened?

api-prod-dask-gateway-84ddd5b8-5mgpv            1/1     Running   1 (3h36m ago)   59d
controller-prod-dask-gateway-554bb6c787-vqmkl   1/1     Running   1 (3h36m ago)   59d
hub-865d686bb4-hkgxp                            2/2     Running   0               70m
proxy-67dc5db685-r64w6                          1/1     Running   1 (3h36m ago)   15d
shared-dirsize-metrics-757595f7b4-f8slg         1/1     Running   1 (3h36m ago)   10d
shared-volume-metrics-5b9bc49c59-nxkzx          1/1     Running   1 (3h36m ago)   59d
traefik-prod-dask-gateway-659d8b4598-mlx7q      1/1     Running   1 (3h36m ago)   59d
user-scheduler-6d67457c96-9vvgc                 1/1     Running   1 (3h36m ago)   15d
user-scheduler-6d67457c96-sb4mh                 1/1     Running   1 (3h36m ago)   15d

Looking at the nodes, I see that a single core node has been running for a long time (59d), and there is one 112m-old user node as well.

kubectl get node                                                  
NAME                                           STATUS   ROLES    AGE    VERSION
ip-192-168-31-131.us-west-2.compute.internal   Ready    <none>   112m   v1.27.3-eks-a5565ad
ip-192-168-8-94.us-west-2.compute.internal     Ready    <none>   59d    v1.27.3-eks-a5565ad

Looking with kubectl describe pod to see the "Last State" of these pods, we see this for several (all?) of the restarted pods:

    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 01 Aug 2023 06:20:55 +0200
      Finished:     Fri, 29 Sep 2023 17:43:27 +0200

It seems that something major crashed on the node for some reason: all pods, including those in kube-system, restarted - but the node recovered, at least to some degree.
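The per-pod inspection above can be scripted so you don't have to `kubectl describe` each pod by hand. A hedged sketch (the function name is illustrative, not from the issue; assumes `jq` is installed):

```shell
# Hypothetical helper to summarize the last terminated state of every
# container across all pods in one pass; pipe `kubectl get pods -o json`
# into it. Assumes jq is available.
summarize_last_state() {
  jq -r '
    .items[]
    | .metadata.name as $pod
    | .status.containerStatuses[]?
    | select(.lastState.terminated != null)
    | "\($pod)\t\(.lastState.terminated.reason)\t\(.lastState.terminated.finishedAt)"
  '
}
```

Usage: `kubectl get pods -o json | summarize_last_state` - pods that all show the same `finishedAt` timestamp point at a node-level event like this one.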

@yuvipanda (Member)

The hub pod didn't recover from the crash. Looking at describe, I saw:

`  Warning  BackOff  4m2s (x482 over 108m)  kubelet  Back-off restarting failed container templates-clone in pod hub-676d6ff5f9-9q45j_staging(c61f1240-7003-4f38-b7c5-67983ef9626d)`

And looking at the logs, I see:

➜ k -n staging logs hub-676d6ff5f9-9q45j -c templates-clone
fatal: destination path '/srv/repo' already exists and is not an empty directory.

Which makes sense - if a pod is restarted, templates-clone from the prior round already did its job, so the directory is not empty. I'll put up a fix for this soon.
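The fix amounts to making the clone step idempotent. A minimal sketch (function name and paths are illustrative, not the actual 2i2c initContainer script):

```shell
#!/bin/sh
# Hypothetical idempotent clone step for an initContainer: safe to run
# again after a pod restart, when the emptyDir still holds the old checkout.
idempotent_clone() {
  repo_url="$1"
  clone_dir="$2"
  # A restarted pod reuses the old emptyDir contents, so remove any leftover
  # checkout first - otherwise git refuses to clone into a non-empty dir.
  rm -rf "$clone_dir"
  git clone -q "$repo_url" "$clone_dir"
}
```

For example, `idempotent_clone https://github.com/org/templates /srv/repo/templates` (URL illustrative) succeeds whether or not a previous run already populated the target.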

@consideRatio (Member, Author)

Ah, excellent - I'm happy the failure to recover is something we can fix ourselves!

I've seen that when things crash, pods with mounted disks (like the hub pod) can be slower to get running again, because k8s needs to realize the volumes are no longer attached to the old container/pod. In my experience so far, though, this has only been a ~5 minute issue.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Sep 29, 2023
- Git clone into a subdirectory of the `emptyDir` mount, as the
  mount itself is default owned by root. This was the cause of
  2i2c-org#1695, which
  was previously solved by adding another initContainer. This PR
  removes that, speeding up starts.
- Remove existing directory if it exists before cloning. This prevents
  the container from failing when the *pod* is restarted due to node
  issues, as that doesn't clear out `emptyDirs`. And git freaks out
  if the repo already exists.
- Remove redundant `IfNotPresent` imagePullPolicy from alpine/git,
  as that is not needed when we specify a non :latest tag. Thanks
  to Erik's investigation in 2i2c-org#3165 (comment)

Ref 2i2c-org#3194
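The three bullets above could combine into an initContainer roughly like the following. This is a hedged sketch, not the actual 2i2c config - the image tag, mount path, and `<templates-repo-url>` placeholder are illustrative:

```yaml
initContainers:
  - name: templates-clone
    image: alpine/git:2.40.1   # pinned tag, so no imagePullPolicy override needed
    command:
      - sh
      - -c
      - |
        # The emptyDir mount root is owned by root, so clone into a
        # subdirectory; remove any checkout left over from a previous
        # container run first, since emptyDirs survive pod restarts
        # caused by node issues.
        rm -rf /srv/repo/repo
        git clone <templates-repo-url> /srv/repo/repo
    volumeMounts:
      - name: repo
        mountPath: /srv/repo
```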
@yuvipanda (Member)

Done in #3195 @consideRatio!

@yuvipanda (Member)

So I think the node getting into a broken state was more or less normal to expect - the problem was that everything else recovered and auto-healed, except for the hub pod. #3195 will fix that.

@consideRatio (Member, Author)

Closing as resolved!
