
Cluster Autoscaler terminates on-demand instances, causing the ASG to replace a spot instance with a new on-demand instance and terminate running pods. #6787

Open
timpur opened this issue May 1, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

timpur commented May 1, 2024

Cluster Autoscaler terminates on-demand instances in an ASG with a mixed instances policy of on-demand and spot instances. This causes the AWS ASG to "replace" the on-demand instance (terminating a spot instance and starting a new on-demand instance), which kills pods mid-execution.
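The interaction can be illustrated with a toy model (an assumption for illustration only, not AWS's actual rebalancing algorithm): because desired capacity stays fixed, every on-demand launch the ASG performs to restore OnDemandBaseCapacity is paired with a spot termination.

```python
# Toy model of how an ASG with a mixed instances policy restores
# OnDemandBaseCapacity after cluster-autoscaler terminates an
# on-demand instance. (Simplified sketch, not AWS's real algorithm.)

def rebalance(on_demand: int, spot: int, base_capacity: int) -> tuple[int, int]:
    """Return (on_demand, spot) counts after the ASG restores the base.

    Desired capacity stays fixed, so each on-demand launch is paired
    with a spot termination -- the step that kills running pods.
    """
    desired = on_demand + spot
    while on_demand < base_capacity and spot > 0:
        on_demand += 1   # ASG launches a replacement on-demand instance...
        spot -= 1        # ...and terminates a spot instance to hold `desired`
    assert on_demand + spot == desired
    return on_demand, spot

# Cluster-autoscaler kills one of 2 on-demand nodes (base capacity 2):
# the ASG launches a new on-demand node and terminates a random spot node.
print(rebalance(on_demand=1, spot=3, base_capacity=2))  # -> (2, 2)
```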

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.14-eks-b9c9ed7", GitCommit:"7c3f2be51edd9fa5727b6ecc2c3fc3c578aa02ca", GitTreeState:"clean", BuildDate:"2024-03-02T03:46:35Z", GoVersion:"go1.21.7", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS EKS using managed node groups.

What did you expect to happen?:
The cluster autoscaler would not terminate an on-demand instance when I have OnDemandBaseCapacity set, since doing so forces the ASG to rebalance.

What happened instead?:
The cluster autoscaler terminated an on-demand instance, and the ASG then tried to "rebalance" back to the minimum on-demand base capacity by starting a new on-demand node and terminating a random spot instance without waiting for pods to finish. (This is a problem because we run pipeline builds in the EKS cluster.)

How to reproduce it (as minimally and precisely as possible):
Sadly, this is really hard to reproduce on request.

Anything else we need to know?:
This cluster is used as a GitLab runner cluster, so pod relocation is not an option.
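For the eviction half of the problem, cluster-autoscaler honors the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` pod annotation, which stops it from draining nodes that run those pods. Note this only influences cluster-autoscaler's own scale-down decisions; it does not prevent the ASG itself from replacing instances during rebalancing. A minimal sketch of building that annotation patch (applying it via the official Kubernetes Python client is an assumption shown only in a comment):

```python
# Sketch: build the strategic-merge patch that marks a pod as not safe
# to evict, so cluster-autoscaler will not scale down its node.
# This does NOT stop ASG-initiated instance replacement.

SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"

def safe_to_evict_patch(value: bool = False) -> dict:
    """Patch body setting the cluster-autoscaler safe-to-evict annotation."""
    return {"metadata": {"annotations": {SAFE_TO_EVICT: str(value).lower()}}}

# Could be applied with e.g. the official Kubernetes Python client:
#   kubernetes.client.CoreV1Api().patch_namespaced_pod(
#       name="runner-pod", namespace="gitlab", body=safe_to_evict_patch())
print(safe_to_evict_patch())
```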

ASG terminating random instances causes pipeline jobs to die in the middle of a build.

