Cluster Autoscaler terminates on-demand instances, leading the ASG to replace a spot instance with a new on-demand instance and killing running pods. #6787
Labels: kind/bug
Cluster Autoscaler terminates on-demand instances in an ASG with a mixed instances policy of on-demand and spot instances, leading the AWS ASG to "replace" the on-demand instance (terminating a spot instance and starting a new on-demand instance), which kills pods mid-execution.
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2
What k8s version are you using (kubectl version)?:
What environment is this in?:
AWS EKS using managed node groups.
What did you expect to happen?:
Cluster autoscaler would not terminate an on-demand instance when I have OnDemandBaseCapacity set.
What happened instead?:
The cluster autoscaler terminated an on-demand instance, and the ASG then tried to "rebalance" back to the minimum on-demand base capacity by starting a new on-demand node and terminating a random spot instance without waiting for pods to finish execution. (This is an issue because we run pipeline builds in the EKS cluster.)
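For context, the base capacity in question is set in the ASG's mixed instances policy. A minimal sketch of such a configuration (the group name and numbers are illustrative, not taken from this cluster):

```shell
# Hypothetical example: keep a floor of 2 on-demand instances in a
# mixed instances policy ASG; everything above the base is 100% spot.
# Group name and values are illustrative only.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eks-runner-nodes \
  --mixed-instances-policy '{
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "capacity-optimized"
    }
  }'
```

With a policy like this, scaling an on-demand instance down below the base leaves the ASG out of balance, and it restores the base by launching a new on-demand instance and terminating a spot instance, which matches the replacement behaviour described above.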
How to reproduce it (as minimally and precisely as possible):
Sadly, this is really hard to reproduce on demand.
Anything else we need to know?:
This cluster is used as a GitLab runner cluster, so pod relocation is not an option.
The ASG terminating random instances causes pipeline jobs to die in the middle of a build.
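As an aside on the scale-down side specifically (this only affects Cluster Autoscaler's own evictions, not ASG-initiated rebalancing), pods that must not be interrupted can be annotated so that Cluster Autoscaler will not scale down the node running them. The pod name below is hypothetical; the annotation itself is the standard Cluster Autoscaler one:

```shell
# Hypothetical pod name. The safe-to-evict annotation tells Cluster
# Autoscaler not to scale down the node hosting this pod; it does NOT
# stop the AWS ASG from terminating the instance during rebalancing.
kubectl annotate pod gitlab-runner-job-abc123 \
  "cluster-autoscaler.kubernetes.io/safe-to-evict=false"
```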