[BUG] Evicted spot nodes are not being drained #3528

Open
Pionerd opened this issue Mar 10, 2023 · 25 comments

Pionerd commented Mar 10, 2023

Describe the bug
When AKS spot nodes are evicted, they are not drained. Instead, the nodes simply disappear from the cluster, leading to unnecessary downtime.

To Reproduce
Steps to reproduce the behavior:

  1. Create an AKS cluster with a Spot node pool (eviction policy: Delete).
  2. Simulate an eviction using az vmss simulate-eviction --instance-id ${ID} -n aks-spot-23887340-vmss -g ${RG}
  3. Applications keep running for some time; we have seen 30 seconds, but also 2m30s. After that the application goes down. The node remains in the Ready state and the pods still appear to be running (according to the Kubernetes API).
  4. After ~40 seconds the node goes into the NotReady state, but the pods are not rescheduled immediately, even though they have tolerations for node.kubernetes.io/unreachable and node.kubernetes.io/not-ready that last only 2 seconds (see the check after this list).
  5. ~30 seconds later the pods are scheduled on another node.
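
A quick way to double-check step 4 (pod name and namespace are illustrative) is to dump the pod's tolerations and confirm the short tolerationSeconds values:

# Verify the tolerationSeconds actually applied to the affected pods
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .spec.tolerations[*]}{.key}={.tolerationSeconds}{"\n"}{end}' \
  | grep -E 'not-ready|unreachable'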

Expected behavior
Based on https://learn.microsoft.com/en-us/azure/aks/node-auto-repair#node-autodrain, we expect the nodes to be cordoned and drained before they are actually killed. This allows the pods to be rescheduled in a more graceful manner.
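
For reference, this is roughly the manual equivalent of the cordon-and-drain behavior we expect from node-autodrain when the spot eviction (Preempt) scheduled event arrives; the node name is illustrative:

# What we expect AKS to do automatically before the spot VM is reclaimed
kubectl cordon aks-spot-23887340-vmss000000
kubectl drain aks-spot-23887340-vmss000000 \
  --ignore-daemonsets --delete-emptydir-data --grace-period=30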

Environment (please complete the following information):

  • Kubernetes version 1.25.5
Pionerd added the bug label Mar 10, 2023

Bryce-Soghigian commented Mar 10, 2023

Thanks for reporting this. It seems related to a bug we had around scheduled event triggers; let me investigate further. That said, eviction events sometimes arrive a bit too fast for the current system to handle.

Aaron-ML commented Oct 3, 2023

@Bryce-Soghigian Did this ever get handled? We are seeing something similar today.

ghost commented Oct 20, 2023

Hello,

Any updates on this?

We are suffering from something much worse, but also related to this bug.

Setup: AKS version 1.26.6
Issue: We do not mind that a spot node just disappears (they are spot nodes, after all), but some pods go into a "zombie" state: they remain marked as Running after the spot node is deleted.

In the pod description we see status.phase: Running and status.conditions[].status: True, which means Kubernetes still considers the pod to be running, even though the node has completely disappeared from the AKS cluster.

Because of this, new deployments do not recreate the replicas (whether it is a StatefulSet or a Deployment does not matter). In our case it affects not all pods but only a few of them (I tried to find differences between the pods that were successfully redeployed and the zombie pods, but no luck; they look identical).

Screenshots from a Kubernetes IDE (image attachments not reproduced here).

Or, the same from the CLI:

kubectl get pod -n infra -o wide | grep vault-2
vault-2                                            3/3     Running     0              52d     10.50.8.171   aks-workloads2-11913460-vmss00000m

kubectl get nodes

NAME                                 STATUS   ROLES   AGE     VERSION
aks-default-14727979-vmss00000o      Ready    agent   144d    v1.26.3
aks-default-14727979-vmss000022      Ready    agent   83d     v1.26.3
aks-workloads2-11913460-vmss000003   Ready    agent   127d    v1.26.3
aks-workloads2-11913460-vmss000005   Ready    agent   127d    v1.26.3
aks-workloads2-11913460-vmss000036   Ready    agent   27d     v1.26.3
aks-workloads2-11913460-vmss00004y   Ready    agent   8d      v1.26.3
aks-workloads2-11913460-vmss000054   Ready    agent   4d5h    v1.26.3
aks-workloads2-11913460-vmss000056   Ready    agent   4d5h    v1.26.3
aks-workloads2-11913460-vmss000057   Ready    agent   4d5h    v1.26.3
aks-workloads2-11913460-vmss000059   Ready    agent   4d4h    v1.26.3
aks-workloads2-11913460-vmss00005k   Ready    agent   2d19h   v1.26.3
aks-workloads2-11913460-vmss00005o   Ready    agent   2d      v1.26.3
aks-workloads2-11913460-vmss00005t   Ready    agent   4h28m   v1.26.3
aks-workloads2-11913460-vmss00005v   Ready    agent   15m     v1.26.3

As you can see, the node aks-workloads2-11913460-vmss00000m is not present in the node list at all, yet the pod is still reported as Running on it.
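
For anyone else hitting this, here is a hedged helper (plain kubectl plus awk, nothing cluster-specific) that lists such "zombie" pods, i.e. pods whose .spec.nodeName no longer matches any node in the cluster:

# Snapshot the current node names
kubectl get nodes --no-headers -o custom-columns='NAME:.metadata.name' > /tmp/current-nodes
# Print pods bound to a node that is no longer in the cluster
# (unscheduled pods show up as <none> and can be ignored)
kubectl get pods -A --no-headers \
  -o custom-columns='NODE:.spec.nodeName,NS:.metadata.namespace,POD:.metadata.name' \
  | awk 'NR==FNR { nodes[$1]; next } !($1 in nodes)' /tmp/current-nodes -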

@NassaDevops

Hello,

Here we have a similar issue with spot nodes using eviction policy: Delete.

Whenever a node "disappears", the pods stay in a Running and Ready state, but fetching their logs fails with:

Failed to load logs: pods "aks-xxxxx-xxxxxxxx-vmss00003e" not found
Reason: NotFound (404)

We are using Kubernetes version 1.26.6.

ghost commented Nov 1, 2023

@NassaDevops
I've solved it for my case. I'm not sure it's a 100% match for this bug, but...

I found that some pods had duplicated env variables in their manifests (you can check this too: open the manifests and iterate over the env vars). During server-side apply the duplicates are merged and everything looks OK.

But (I can't explain why) at pod termination this causes the pod to get stuck. After cleaning up the duplicates, everything works as expected (for me).
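
In case it helps others, a hedged way to check for the duplicated env vars described above (resource name and namespace are illustrative; the same works for StatefulSets). Note this flattens env names across all containers, so a name legitimately used in two different containers will also show up:

# Print env var names that appear more than once in the pod template
kubectl get deploy <name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[*].env[*].name}' \
  | tr ' ' '\n' | sort | uniq -d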

@NassaDevops

Thank you for the reply.
We do have duplicate variables in the affected pods; we will remove them and let you know.

@NassaDevops

I can confirm that removing the duplicate variables from the manifest fixes the issue.
I have no idea how a duplicate variable affects the pod's ability to move to another healthy node, but thank you very much @Dima-Diachenko-work

ghost commented Nov 2, 2023

The ArgoCD GitOps tool helped me understand the root cause of this issue, though during apply rather than at deletion.
Anyway, I'm glad that my advice helped.

@frederikspang

We're currently testing out spot instances and have run into the same issues mentioned here. However, as far as I can tell, we have no duplicate environment variables.

The pods stay Ready, and so does the node; nothing is drained and no pods are rescheduled until the node is "gone".

Any ideas for debugging steps here?

@DiogoReisPinto

We are seeing the same behaviour using Kubernetes version 1.27.3. Any update on this issue?

@NassaDevops

All I can say on my part is that I reviewed my env variables, and once the duplicates were removed, the problem was resolved.

You have to check your deployments and make sure you don't have any duplicates.

@DiogoReisPinto

Yes, we checked that; no duplicates in our case.

@frederikspang

We have also checked, and as far as I can tell, no duplicates.

dtzar commented May 10, 2024

@Bryce-Soghigian - were you able to verify or get more info on this issue? @frederikspang brought this up again in our AMA yesterday.

@Bryce-Soghigian

I handed this issue off to @jason1028kr when I left the observability team in May of last year. Since then, I believe the event-based remediation required to handle spot eviction has been dropped. I could be wrong; @aritraghosh could speak more to the roadmap.

stockmaj commented Jun 5, 2024

Has this been dropped? Or is it in progress?

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

Issue needing attention of @Azure/aks-leads

@arunp-motorq

Seeing this for our cluster as well. We have zombie pods because of spot eviction.

@agra6475

We've started seeing this issue as well in multiple AKS instances. Doing any update to AKS fixes the issue.

@agra6475

#4400 looks to be the same issue.
We haven't observed the issue for ~2 weeks now.
