[BUG] Evicted spot nodes are not being drained #3528
Comments
Thanks for reporting this. It seems related to a bug we had around scheduled-event triggers; let me investigate further. That said, eviction events sometimes arrive a bit too fast for the current system to handle.
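For context on why these events are hard to catch in time: on Azure, spot evictions are announced through the Instance Metadata Service (IMDS) Scheduled Events endpoint as a `Preempt` event, typically with only about 30 seconds of notice. A minimal sketch of checking for a pending eviction from inside the node (the endpoint and API version are the documented IMDS ones; the `jq` filter is illustrative):

```sh
# Query IMDS Scheduled Events from inside the VM/node; spot evictions
# show up as events with EventType "Preempt".
curl -s -H "Metadata:true" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" \
  | jq '.Events[] | select(.EventType == "Preempt")'
```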
@Bryce-Soghigian Did this ever get handled? We are seeing something similar today.
Hello, we are seeing the same error with spot nodes using eviction policy: delete. Whenever a node "disappears", the pods remain in a Running and Ready state, but fetching their logs fails with: Failed to load logs: pods "aks-xxxxx-xxxxxxxx-vmss00003e" not found. We are using Kubernetes version 1.26.6.
@NassaDevops I found that some pods had duplicated env variables in their manifests (you may want to check this too: open the manifests and iterate over the env vars). During server-side apply the duplicates were merged and everything looked OK, but (I can't explain why) at pod termination they caused the pods to get stuck. After cleaning up the duplicates, everything works as expected (for me).
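A quick way to scan a cluster for such duplicates, as a sketch (assumes `kubectl` access and `jq`; extend to other workload kinds as needed):

```sh
# List Deployment containers whose env arrays repeat a variable name.
kubectl get deployments -A -o json | jq -r '
  .items[] as $d
  | $d.spec.template.spec.containers[]
  | ((.env // []) | map(.name)) as $names
  | select(($names | length) != ($names | unique | length))
  | "\($d.metadata.namespace)/\($d.metadata.name): container \(.name) has duplicate env vars"'
```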
Thank you for the reply.
I can confirm that removing duplicate variables in the manifest fixes the issue.
The Argo CD GitOps tool helped me understand the root cause of this issue, though during apply, not at deletion.
We're currently testing out spot instances and have run into the same issues mentioned above. However, as far as I can tell, we have no duplicate environment variables. The pods just stay Ready, and so does the node; nothing drains or reschedules the pods until the node is "gone". Any ideas for debugging steps here?
We are seeing the same behaviour using Kubernetes version 1.27.3. Any update on this issue?
All I can say on my part is that I reviewed my env variables, and once the duplicates were removed, the problem was resolved. You have to check your deployments and make sure you don't have any duplicates.
Yes, we checked that; no duplicates in our case.
We have also checked, and as far as I can tell, no duplicates.
@Bryce-Soghigian, were you able to verify or get more info on this issue? @frederikspang brought this up again in our AMA yesterday.
I handed this issue off to @jason1028kr when I left the observability team in May of last year. Since then, I believe the event-based remediation required to handle spot eviction has been dropped. I could be wrong; @aritraghosh could speak more to the roadmap.
Has this been dropped? Or is it in progress?
Action required from @aritraghosh, @julia-yin, @AllenWen-at-Azure
Issue needing attention of @Azure/aks-leads
Seeing this for our cluster as well. We have zombie pods because of spot eviction.
Issue needing attention of @Azure/aks-leads
We've started seeing this issue as well in multiple AKS instances. Applying any update to AKS fixes the issue.
Issue needing attention of @Azure/aks-leads
#4400 looks to be the same.
Issue needing attention of @Azure/aks-leads
Describe the bug
When AKS spot nodes are evicted, they are not drained. Instead, the nodes simply disappear from the cluster, leading to unnecessary downtime.
To Reproduce
Steps to reproduce the behavior:
1. Create an AKS spot node pool with eviction policy `delete`.
2. Simulate an eviction: `az vmss simulate-eviction --instance-id ${ID} -n aks-spot-23887340-vmss -g ${RG}`
3. Observe that the node receives the taints `node.kubernetes.io/unreachable` and `node.kubernetes.io/not-ready`, which last for only 2 seconds (see the watch sketch below).
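To observe the short-lived taints from step 3, a minimal watch loop can help (a sketch assuming `kubectl` access; the node name below is illustrative):

```sh
# Watch the node's taints while the eviction is being simulated.
NODE=aks-spot-23887340-vmss00003e   # hypothetical node name; replace with yours
while kubectl get node "$NODE" >/dev/null 2>&1; do
  kubectl get node "$NODE" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.effect}{"\n"}{end}'
  sleep 1
done
```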
Expected behavior
Based on https://learn.microsoft.com/en-us/azure/aks/node-auto-repair#node-autodrain, we expect the nodes to be cordoned and drained before they are actually killed. This allows the pods to be rescheduled more gracefully.
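For reference, the documented auto-drain behavior corresponds roughly to these manual commands (a sketch of the equivalent kubectl workflow, not the AKS implementation):

```sh
# Cordon to stop new scheduling, then drain to evict pods gracefully.
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --grace-period=30
```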
Environment (please complete the following information):