[BUG] Evicted spot nodes are not being drained #3528

Open
Pionerd opened this issue Mar 10, 2023 · 25 comments

Pionerd commented Mar 10, 2023

Describe the bug
When AKS spot nodes are evicted, they are not drained. Instead, the nodes simply disappear from the cluster, leading to unnecessary downtime.

To Reproduce
Steps to reproduce the behavior:

  1. Create an AKS cluster with a Spot node pool (eviction policy: Delete).
  2. Simulate an eviction using az vmss simulate-eviction --instance-id ${ID} -n aks-spot-23887340-vmss -g ${RG}
  3. Applications keep running for some time; we have seen 30 seconds, but also 2m30s. After that the application goes down. The node remains in the Ready state and the pods still appear to be running (according to the Kubernetes API).
  4. After ~40 seconds the node goes into the NotReady state, but the pods are not rescheduled immediately, even though they have tolerations for node.kubernetes.io/unreachable and node.kubernetes.io/not-ready that last only 2 seconds (see the check after this list).
  5. ~30 seconds later the pods are scheduled on another node.
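
A quick way to double-check step 4 (pod name and namespace are illustrative) is to dump the pod's tolerations and confirm the short tolerationSeconds values:

# Verify the tolerationSeconds actually applied to the affected pods
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{range .spec.tolerations[*]}{.key}={.tolerationSeconds}{"\n"}{end}' \
  | grep -E 'not-ready|unreachable'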

Expected behavior
Based on https://learn.microsoft.com/en-us/azure/aks/node-auto-repair#node-autodrain, we expect the nodes to be cordoned and drained before they are actually killed. This allows the pods to be rescheduled in a more graceful manner.
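
For reference, this is roughly the manual equivalent of the cordon-and-drain behavior we expect from node-autodrain when the spot eviction (Preempt) scheduled event arrives; the node name is illustrative:

# What we expect AKS to do automatically before the spot VM is reclaimed
kubectl cordon aks-spot-23887340-vmss000000
kubectl drain aks-spot-23887340-vmss000000 \
  --ignore-daemonsets --delete-emptydir-data --grace-period=30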

Environment (please complete the following information):

  • Kubernetes version 1.25.5
Pionerd added the bug label Mar 10, 2023

Bryce-Soghigian commented Mar 10, 2023

Thanks for reporting this. It seems related to a bug we had around scheduled event triggers; let me investigate further. That said, eviction events sometimes arrive a bit too fast for the current system to handle.

Aaron-ML commented Oct 3, 2023

@Bryce-Soghigian Did this ever get handled? We are seeing something similar today.

ghost commented Oct 20, 2023

Hello,

Any updates on this?

We are suffering from something much worse, but also related to this bug.

Setup: AKS version 1.26.6
Issue: We do not mind that a spot node just disappears (they are spot nodes, after all), but some pods go into a "zombie" state: they remain marked as Running after the spot node is deleted.

In the pod description we see status.phase: Running and status.conditions[].status: True, which means Kubernetes still considers the pod to be running, even though the node has completely disappeared from the AKS cluster.

Because of this, new deployments do not recreate the replicas (whether it is a StatefulSet or a Deployment does not matter). In our case it affects not all pods but only a few of them (I tried to find differences between the pods that were successfully redeployed and the zombie pods, but no luck; they look identical).

Screenshots from a Kubernetes IDE (image attachments not reproduced here).

Or, the same from the CLI:

kubectl get pod -n infra -o wide | grep vault-2
vault-2                                            3/3     Running     0              52d     10.50.8.171   aks-workloads2-11913460-vmss00000m

kubectl get nodes

NAME                                 STATUS   ROLES   AGE     VERSION
aks-default-14727979-vmss00000o      Ready    agent   144d    v1.26.3
aks-default-14727979-vmss000022      Ready    agent   83d     v1.26.3
aks-workloads2-11913460-vmss000003   Ready    agent   127d    v1.26.3
aks-workloads2-11913460-vmss000005   Ready    agent   127d    v1.26.3
aks-workloads2-11913460-vmss000036   Ready    agent   27d     v1.26.3
aks-workloads2-11913460-vmss00004y   Ready    agent   8d      v1.26.3
aks-workloads2-11913460-vmss000054   Ready    agent   4d5h    v1.26.3
aks-workloads2-11913460-vmss000056   Ready    agent   4d5h    v1.26.3
aks-workloads2-11913460-vmss000057   Ready    agent   4d5h    v1.26.3
aks-workloads2-11913460-vmss000059   Ready    agent   4d4h    v1.26.3
aks-workloads2-11913460-vmss00005k   Ready    agent   2d19h   v1.26.3
aks-workloads2-11913460-vmss00005o   Ready    agent   2d      v1.26.3
aks-workloads2-11913460-vmss00005t   Ready    agent   4h28m   v1.26.3
aks-workloads2-11913460-vmss00005v   Ready    agent   15m     v1.26.3

As you can see, the node aks-workloads2-11913460-vmss00000m is not present in the node list at all, yet the pod is still reported as Running on it.
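
For anyone else hitting this, here is a hedged helper (plain kubectl plus awk, nothing cluster-specific) that lists such "zombie" pods, i.e. pods whose .spec.nodeName no longer matches any node in the cluster:

# Snapshot the current node names
kubectl get nodes --no-headers -o custom-columns='NAME:.metadata.name' > /tmp/current-nodes
# Print pods bound to a node that is no longer in the cluster
# (unscheduled pods show up as <none> and can be ignored)
kubectl get pods -A --no-headers \
  -o custom-columns='NODE:.spec.nodeName,NS:.metadata.namespace,POD:.metadata.name' \
  | awk 'NR==FNR { nodes[$1]; next } !($1 in nodes)' /tmp/current-nodes -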

@NassaDevops

Hello,

Here we have a similar issue with spot nodes using eviction policy: Delete.

Whenever a node "disappears", the pods stay in a Running and Ready state, but fetching their logs fails with:

Failed to load logs: pods "aks-xxxxx-xxxxxxxx-vmss00003e" not found
Reason: NotFound (404)

We are using Kubernetes version 1.26.6.

ghost commented Nov 1, 2023

@NassaDevops
I've solved it for my case. I'm not sure it's a 100% match for this bug, but...

I found that some pods had duplicated env variables in their manifests (you can check this too: open the manifests and iterate over the env vars). During server-side apply the duplicates are merged and everything looks OK.

But (I can't explain why) at pod termination this causes the pod to get stuck. After cleaning up the duplicates, everything works as expected (for me).
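
In case it helps others, a hedged way to check for the duplicated env vars described above (resource name and namespace are illustrative; the same works for StatefulSets). Note this flattens env names across all containers, so a name legitimately used in two different containers will also show up:

# Print env var names that appear more than once in the pod template
kubectl get deploy <name> -n <namespace> \
  -o jsonpath='{.spec.template.spec.containers[*].env[*].name}' \
  | tr ' ' '\n' | sort | uniq -d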

@NassaDevops

Thank you for the reply.
We do have duplicate variables in the affected pods; we will remove them and let you know.

@NassaDevops

I can confirm that removing the duplicate variables from the manifest fixes the issue.
I have no idea how a duplicate variable affects the pod's ability to move to another healthy node, but thank you very much @Dima-Diachenko-work

ghost commented Nov 2, 2023

The ArgoCD GitOps tool helped me understand the root cause of this issue, though during apply rather than at deletion.
Anyway, I'm glad that my advice helped.

@frederikspang

We're currently testing out spot instances and have run into the same issues mentioned here. However, as far as I can tell, we have no duplicate environment variables.

The pods stay Ready, and so does the node; nothing is drained and no pods are rescheduled until the node is "gone".

Any ideas for debugging steps here?

@DiogoReisPinto

We are seeing the same behaviour using Kubernetes version 1.27.3. Any update on this issue?

@NassaDevops

All I can say on my part is that I reviewed my env variables, and once the duplicates were removed, the problem was resolved.

You have to check your deployments and make sure you don't have any duplicates.

@DiogoReisPinto

Yes, we checked that; no duplicates in our case.

@frederikspang

We have also checked, and as far as I can tell, no duplicates.

dtzar commented May 10, 2024

@Bryce-Soghigian - were you able to verify or get more info on this issue? @frederikspang brought this up again in our AMA yesterday.

@Bryce-Soghigian

I handed this issue off to @jason1028kr when I left the observability team in May of last year. Since then, I believe the event-based remediation required to handle spot eviction has been dropped. I could be wrong; @aritraghosh could speak more to the roadmap.

stockmaj commented Jun 5, 2024

Has this been dropped? Or is it in progress?

Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure

Issue needing attention of @Azure/aks-leads

@arunp-motorq

Seeing this for our cluster as well. We have zombie pods because of spot eviction.

@agra6475

We've started seeing this issue as well in multiple AKS instances. Doing any update to AKS fixes the issue.

@agra6475

#4400 looks to be the same issue.
We haven't observed the issue for ~2 weeks now.
