
Notes on outage in nasa-ghg #3194

Closed
consideRatio opened this issue Sep 29, 2023 · 5 comments
@consideRatio (Member)

Yuvi restarted the hub pod, but looking at the other pods, they all had their containers restarted at Fri, 29 Sep 2023 17:43:27 +0200 - what happened?

api-prod-dask-gateway-84ddd5b8-5mgpv            1/1     Running   1 (3h36m ago)   59d
controller-prod-dask-gateway-554bb6c787-vqmkl   1/1     Running   1 (3h36m ago)   59d
hub-865d686bb4-hkgxp                            2/2     Running   0               70m
proxy-67dc5db685-r64w6                          1/1     Running   1 (3h36m ago)   15d
shared-dirsize-metrics-757595f7b4-f8slg         1/1     Running   1 (3h36m ago)   10d
shared-volume-metrics-5b9bc49c59-nxkzx          1/1     Running   1 (3h36m ago)   59d
traefik-prod-dask-gateway-659d8b4598-mlx7q      1/1     Running   1 (3h36m ago)   59d
user-scheduler-6d67457c96-9vvgc                 1/1     Running   1 (3h36m ago)   15d
user-scheduler-6d67457c96-sb4mh                 1/1     Running   1 (3h36m ago)   15d

Looking at the nodes, I see that a single core node has been running for a long time (59d), and there is one 112m-old user node as well.

kubectl get node                                                  
NAME                                           STATUS   ROLES    AGE    VERSION
ip-192-168-31-131.us-west-2.compute.internal   Ready    <none>   112m   v1.27.3-eks-a5565ad
ip-192-168-8-94.us-west-2.compute.internal     Ready    <none>   59d    v1.27.3-eks-a5565ad

Looking with kubectl describe pod to see the "Last State" of these pods, we see this for several (all?) of the restarted pods:

    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Tue, 01 Aug 2023 06:20:55 +0200
      Finished:     Fri, 29 Sep 2023 17:43:27 +0200

It seems that something major crashed on the node for some reason: all pods, including those in kube-system, restarted - but the node recovered, at least to some degree.
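The per-pod inspection above can be scripted so you don't have to `kubectl describe` each pod by hand. A hedged sketch (the function name is illustrative, not from the issue; assumes `jq` is installed):

```shell
# Hypothetical helper to summarize the last terminated state of every
# container across all pods in one pass; pipe `kubectl get pods -o json`
# into it. Assumes jq is available.
summarize_last_state() {
  jq -r '
    .items[]
    | .metadata.name as $pod
    | .status.containerStatuses[]?
    | select(.lastState.terminated != null)
    | "\($pod)\t\(.lastState.terminated.reason)\t\(.lastState.terminated.finishedAt)"
  '
}
```

Usage: `kubectl get pods -o json | summarize_last_state` - pods that all show the same `finishedAt` timestamp point at a node-level event like this one.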

@yuvipanda (Member)

The hub pod didn't recover from the crash. Looking at describe, I saw:

`  Warning  BackOff  4m2s (x482 over 108m)  kubelet  Back-off restarting failed container templates-clone in pod hub-676d6ff5f9-9q45j_staging(c61f1240-7003-4f38-b7c5-67983ef9626d)`

And looking at the logs, I see:

➜ k -n staging logs hub-676d6ff5f9-9q45j -c templates-clone
fatal: destination path '/srv/repo' already exists and is not an empty directory.

Which makes sense - if a pod is restarted, templates-clone from the prior round already did its job, so the directory is not empty. I'll put up a fix for this soon.
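The fix amounts to making the clone step idempotent. A minimal sketch (function name and paths are illustrative, not the actual 2i2c initContainer script):

```shell
#!/bin/sh
# Hypothetical idempotent clone step for an initContainer: safe to run
# again after a pod restart, when the emptyDir still holds the old checkout.
idempotent_clone() {
  repo_url="$1"
  clone_dir="$2"
  # A restarted pod reuses the old emptyDir contents, so remove any leftover
  # checkout first - otherwise git refuses to clone into a non-empty dir.
  rm -rf "$clone_dir"
  git clone -q "$repo_url" "$clone_dir"
}
```

For example, `idempotent_clone https://github.com/org/templates /srv/repo/templates` (URL illustrative) succeeds whether or not a previous run already populated the target.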

@consideRatio (Member, Author)

Ah, excellent - I'm happy the failure to recover is something we can fix ourselves!

I've seen that when things crash, pods with mounted disks (like the hub pod) can be slower to get running again, because k8s needs to realize the volumes are no longer attached to the old container/pod. In my experience so far, though, this has only been a ~5 minute issue.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Sep 29, 2023
- Git clone into a subdirectory of the `emptyDir` mount, as the
  mount itself is default owned by root. This was the cause of
  2i2c-org#1695, which
  was previously solved by adding another initContainer. This PR
  removes that, speeding up starts.
- Remove existing directory if it exists before cloning. This prevents
  the container from failing when the *pod* is restarted due to node
  issues, as that doesn't clear out `emptyDirs`. And git freaks out
  if the repo already exists.
- Remove redundant `IfNotPresent` imagePullPolicy from alpine/git,
  as that is not needed when we specify a non :latest tag. Thanks
  to Erik's investigation in 2i2c-org#3165 (comment)

Ref 2i2c-org#3194
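The three bullets above could combine into an initContainer roughly like the following. This is a hedged sketch, not the actual 2i2c config - the image tag, mount path, and `<templates-repo-url>` placeholder are illustrative:

```yaml
initContainers:
  - name: templates-clone
    image: alpine/git:2.40.1   # pinned tag, so no imagePullPolicy override needed
    command:
      - sh
      - -c
      - |
        # The emptyDir mount root is owned by root, so clone into a
        # subdirectory; remove any checkout left over from a previous
        # container run first, since emptyDirs survive pod restarts
        # caused by node issues.
        rm -rf /srv/repo/repo
        git clone <templates-repo-url> /srv/repo/repo
    volumeMounts:
      - name: repo
        mountPath: /srv/repo
```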
@yuvipanda (Member)

Done in #3195 @consideRatio!

@yuvipanda (Member)

So I think the node getting into a broken state was more or less normal to expect - the problem was that everything else recovered and auto-healed, except for the hub pod. #3195 will fix that.

@consideRatio (Member, Author)

Closing as resolved!
