Cluster AutoScaler: scale up timeout nodes should not be removed by fixIncorrectNodeGroupSizes #6746

xrmzju · 2024-04-23T06:30:08Z

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: cluster-autoscaler-release-1.30

What k8s version are you using (kubectl version)?:

v1.26

What behaviour did you expect to see?

The nodes that have timed out during the scale-up process should be specifically targeted for removal, rather than being removed randomly.

What happened instead?:

The nodes that have timed out during the scale-up process are removed randomly.

How to reproduce it (as minimally and precisely as possible):

Initiate a scale-up request.
The cloud instances are created successfully.
The new node registers successfully but stays in the 'notReady' state (for example, because of a CNI or container runtime failure).
The scale-up request is removed once it times out (refer to: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/clusterstate/clusterstate.go#L272), and the node group is then reported with an 'incorrectSize' status.
Once MaxNodeProvisionTime has elapsed, cloud instances are removed during the fixNodeGroupSize reconciliation. However, the DecreaseTargetSize function deletes cloud instances at random instead of removing the newly created nodes that timed out.

Anything else we need to know?:


xrmzju added the kind/bug Categorizes issue or PR as related to a bug. label Apr 23, 2024
