
Cluster Autoscaler terminates on-demand instances, causing the ASG to replace a spot instance with a new on-demand instance and terminate running pods. #6787

Open
timpur opened this issue May 1, 2024 · 0 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

timpur commented May 1, 2024

Cluster Autoscaler terminates on-demand instances in an ASG with a mixed instances policy of on-demand and spot instances. This causes the AWS ASG to "replace" the on-demand instance (terminating a spot instance and starting a new on-demand instance), which kills pods mid-execution.
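The interaction can be illustrated with a toy model (an assumption for illustration only, not AWS's actual rebalancing algorithm): because desired capacity stays fixed, every on-demand launch the ASG performs to restore OnDemandBaseCapacity is paired with a spot termination.

```python
# Toy model of how an ASG with a mixed instances policy restores
# OnDemandBaseCapacity after cluster-autoscaler terminates an
# on-demand instance. (Simplified sketch, not AWS's real algorithm.)

def rebalance(on_demand: int, spot: int, base_capacity: int) -> tuple[int, int]:
    """Return (on_demand, spot) counts after the ASG restores the base.

    Desired capacity stays fixed, so each on-demand launch is paired
    with a spot termination -- the step that kills running pods.
    """
    desired = on_demand + spot
    while on_demand < base_capacity and spot > 0:
        on_demand += 1   # ASG launches a replacement on-demand instance...
        spot -= 1        # ...and terminates a spot instance to hold `desired`
    assert on_demand + spot == desired
    return on_demand, spot

# Cluster-autoscaler kills one of 2 on-demand nodes (base capacity 2):
# the ASG launches a new on-demand node and terminates a random spot node.
print(rebalance(on_demand=1, spot=3, base_capacity=2))  # -> (2, 2)
```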

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.14-eks-b9c9ed7", GitCommit:"7c3f2be51edd9fa5727b6ecc2c3fc3c578aa02ca", GitTreeState:"clean", BuildDate:"2024-03-02T03:46:35Z", GoVersion:"go1.21.7", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS EKS using managed node groups.

What did you expect to happen?:
The cluster autoscaler would not terminate an on-demand instance when I have OnDemandBaseCapacity set, since doing so forces the ASG to rebalance.

What happened instead?:
The cluster autoscaler terminated an on-demand instance, and the ASG then tried to "rebalance" back to the minimum on-demand base capacity by starting a new on-demand node and terminating a random spot instance without waiting for pods to finish. (This is a problem because we run pipeline builds in the EKS cluster.)

How to reproduce it (as minimally and precisely as possible):
Sadly, this is really hard to reproduce on request.

Anything else we need to know?:
This cluster is used as a GitLab runner cluster, so pod relocation is not an option.
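For the eviction half of the problem, cluster-autoscaler honors the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` pod annotation, which stops it from draining nodes that run those pods. Note this only influences cluster-autoscaler's own scale-down decisions; it does not prevent the ASG itself from replacing instances during rebalancing. A minimal sketch of building that annotation patch (applying it via the official Kubernetes Python client is an assumption shown only in a comment):

```python
# Sketch: build the strategic-merge patch that marks a pod as not safe
# to evict, so cluster-autoscaler will not scale down its node.
# This does NOT stop ASG-initiated instance replacement.

SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"

def safe_to_evict_patch(value: bool = False) -> dict:
    """Patch body setting the cluster-autoscaler safe-to-evict annotation."""
    return {"metadata": {"annotations": {SAFE_TO_EVICT: str(value).lower()}}}

# Could be applied with e.g. the official Kubernetes Python client:
#   kubernetes.client.CoreV1Api().patch_namespaced_pod(
#       name="runner-pod", namespace="gitlab", body=safe_to_evict_patch())
print(safe_to_evict_patch())
```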

ASG terminating random instances causes pipeline jobs to die in the middle of a build.

