cluster-autoscaler gets stuck with "Failed to fix node group sizes" error #6128

com6056 · 2023-09-22T16:23:12Z

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: v1.28.0

What k8s version are you using (kubectl version)?:

kubectl version Output

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.11", GitCommit:"8cfcba0b15c343a8dc48567a74c29ec4844e0b9e", GitTreeState:"clean", BuildDate:"2023-06-14T09:49:38Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS via the aws provider

What did you expect to happen?:

I expect cluster-autoscaler to be able to scale ASGs up/down without issue.

What happened instead?:

cluster-autoscaler gets stuck in a deadlock with the following error:

cluster-autoscaler-aws-65dccf9965-qmprl cluster-autoscaler I0922 16:10:22.995283       1 static_autoscaler.go:709] Decreasing size of build-16-32, expected=4 current=2 delta=-2
cluster-autoscaler-aws-65dccf9965-qmprl cluster-autoscaler E0922 16:10:22.995297       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease build-16-32: attempt to delete existing nodes targetSize:4 delta:-2 existingNodes: 4

How to reproduce it (as minimally and precisely as possible):

Not entirely sure what causes it unfortunately.

Anything else we need to know?:

The text was updated successfully, but these errors were encountered:

Tenzer · 2023-10-02T15:17:56Z

The same thing happened for one of our EKS clusters. here's the full output for a loop:

I1002 14:50:40.757207       1 reflector.go:790] k8s.io/client-go/informers/factory.go:150: Watch close - *v1.ReplicaSet total 112 items received
I1002 14:50:42.157028       1 reflector.go:790] k8s.io/client-go/informers/factory.go:150: Watch close - *v1.PersistentVolumeClaim total 8 items received
I1002 14:51:09.451280       1 static_autoscaler.go:287] Starting main loop
I1002 14:51:09.451449       1 auto_scaling_groups.go:393] Regenerating instance to ASG map for ASG names: []
I1002 14:51:09.451463       1 auto_scaling_groups.go:400] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/enabled: k8s.io/cluster-autoscaler/public:]
I1002 14:51:09.586582       1 auto_scaling_groups.go:142] Updating ASG eks-x86-m7i-flex-xlarge-f6c4dd96-da24-ca3f-2824-ffbc8b741746
I1002 14:51:09.586626       1 aws_wrapper.go:703] 0 launch configurations to query
I1002 14:51:09.586632       1 aws_wrapper.go:704] 0 launch templates to query
I1002 14:51:09.586638       1 aws_wrapper.go:724] Successfully queried 0 launch configurations
I1002 14:51:09.586645       1 aws_wrapper.go:735] Successfully queried 0 launch templates
I1002 14:51:09.586651       1 aws_wrapper.go:746] Successfully queried instance requirements for 0 ASGs
I1002 14:51:09.586659       1 aws_manager.go:129] Refreshed ASG list, next refresh after 2023-10-02 14:52:09.586657471 +0000 UTC m=+375178.448258144
I1002 14:51:09.588859       1 aws_manager.go:185] Found multiple availability zones for ASG "eks-x86-m7i-flex-xlarge-f6c4dd96-da24-ca3f-2824-ffbc8b741746"; using us-east-1a for failure-domain.beta.kubernetes.io/zone label
I1002 14:51:09.589066       1 static_autoscaler.go:709] Decreasing size of eks-x86-m7i-flex-xlarge-f6c4dd96-da24-ca3f-2824-ffbc8b741746, expected=12 current=11 delta=-1
E1002 14:51:09.589091       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-x86-m7i-flex-xlarge-f6c4dd96-da24-ca3f-2824-ffbc8b741746: attempt to delete existing nodes targetSize:12 delta:-1 existingNodes: 12

There's no obvious actions taken on our side that caused the problem to appear.

Tenzer · 2023-10-09T14:29:14Z

When this has happened to us, it seems to have helped if I changed the number of instances in the node group manually, which then would get the cluster autoscaler to be able to correct the size again.
It does however seem likely to reoccur again a couple of hours later.

My hypothesis so far is that this might be related to the cluster using the m7i-flex family of instances which aren't necessarily available in all availability zones. That can cause the node group to be set to for instance have 5 instances in it, but there will only be 4 created by AWS because the node group is configured to make use of subnets/AZs where it can't create instances.

I've now replaced the node group with a new one that only has subnets/AZs enabled that has this family of instances available and hope that might help.

@com6056 Any chance you might be in a similar situation?

com6056 · 2023-11-07T18:01:30Z

We aren't using those instance types, so I don't think that is what is causing it (at least for us). We have ASGs with a mixed instance type policy and there should usually be instances available (and if not, it should just fail and fallback to a different node group).

mykhailogorsky · 2023-11-08T11:48:23Z

Same issue, I use AWS as provider and kubernetes 1.28:
attempt to delete existing nodes targetSize:13 delta:-1 existingNodes: 13
From what I noticed in function DecreaseTargetSize it doesn't receive correct actual group size with this line of code:
nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)

After downgrade of cluster-autoscaler from1.28 to 1.27.3 it started to work ok.

stefansedich · 2023-11-15T18:03:22Z

We are seeing this issue in our environments too and we are doing nothing special with instance types, at one point after a few reboots of autoscaler it worked for a little while then stopped working again.

Currently downgrading as suggested by @mykhailogorsky

com6056 · 2023-11-15T19:13:24Z

@x13n could this be caused by #5976 enabling parallel drain by default? I guess I could try 1.28 again with -parallel-drain=false to confirm 👀

com6056 · 2023-11-15T19:35:38Z

Nope still hitting it, even with --parallel-drain="false":

cluster-autoscaler-aws-6d95d65-hbgdb cluster-autoscaler I1115 19:34:18.815122       1 static_autoscaler.go:709] Decreasing size of build-32-128, expected=5 current=1 delta=-4
cluster-autoscaler-aws-6d95d65-hbgdb cluster-autoscaler E1115 19:34:18.815135       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease build-32-128: attempt to delete existing nodes targetSize:5 delta:-4 existingNodes: 5

stefansedich · 2023-11-16T02:44:05Z

We have seen no issues since our downgrade to v1.27.3 and things have been operating as we expect, seems like it is an issue introduced in 1.18.0.

rajaie-sg · 2023-11-16T06:13:56Z

We are also seeing this in v1.28.0

x13n · 2023-11-16T15:39:12Z

This is unrelated to scale down logic - this is called only to reduce target size on a node group without actually deleting any nodes. The error suggests that it would actually require deleting existing VMs to reduce the node group size. Not sure why this started in 1.28. As a data point, this may be AWS specific - I haven't seen this error on GKE.

eaterm · 2023-12-06T14:02:57Z

We seen the same issue with 1.28.1. But only when deleting a node manually from from the cluster.
After some time the autoscaler then tries to change the number of nodes in the node group and then fails.

mycrEEpy · 2023-12-12T14:42:33Z

Got the same issue today. I found out that we had more instances in our ASG than there were Nodes in Kubernetes. After finding the instance in EC2 which had not joined the cluster as Node and terminating the instance the cluster-autoscaler started to recover again.

artificial-aidan · 2023-12-20T16:23:50Z

Seeing the same issue here. Terminating the instance that failed to join also fixes it.

mohag · 2024-02-02T07:57:50Z

Same issue on 1.28.2 on AWS.

Terminated the instance that was running but not part of the cluster. (ASG seems to have created a new node then, which did join the cluster)

cluster-autoscaler was restarting with failed health checks (apparently getting 500s) and had the log messages as above.

songminglong · 2024-02-05T12:08:17Z

The culprit is the implementation of aws provider：

func (ng *AwsNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease size must be negative")
	}

	size := ng.asg.curSize
	nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
	if err != nil {
		return err
	}
	if int(size)+delta < len(nodes) {
		return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			size, delta, len(nodes))
	}
	return ng.awsManager.SetAsgSize(ng.asg, size+delta)
}

which size is almost always equal to len(nodes) because aws nodes are composed of active nodes and fake nodes (place-holder), they are always equal to the target size.

songminglong · 2024-02-05T14:57:28Z

related issue: #5829

songminglong · 2024-02-06T03:59:16Z

The culprit is the implementation of aws provider：
func (ng *AwsNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease size must be negative")
	}

	size := ng.asg.curSize
	nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
	if err != nil {
		return err
	}
	if int(size)+delta < len(nodes) {
		return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			size, delta, len(nodes))
	}
	return ng.awsManager.SetAsgSize(ng.asg, size+delta)
}
which size is almost always equal to len(nodes) because aws nodes are composed of active nodes and fake nodes (place-holder), they are always equal to the target size.

Maybe we can filter out active running nodes, ignore these stale place-holder fake nodes which status == placeholderUnfulfillableStatus, and make sure asg target size could converge to active nodes

for example:

func (ng *AwsNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("size decrease size must be negative")
	}

	size := ng.asg.curSize
	nodes, err := ng.awsManager.GetAsgNodes(ng.asg.AwsRef)
	if err != nil {
		return err
	}

	// filter out active nodes, ignore these stale place-holder fake nodes which status == placeholderUnfulfillableStatus
	// make sure asg target size could converge to active nodes
	filteredNodes := make([]AwsInstanceRef, 0)
	for i := range nodes {
		node := nodes[i]
		instanceStatus, err := ng.awsManager.GetInstanceStatus(node)
		if err != nil {
			klog.V(4).Infof("Could not get instance status, continuing anyways: %v", err)
		} else if instanceStatus != nil && *instanceStatus == placeholderUnfulfillableStatus {
			continue
		}
		filteredNodes = append(filteredNodes, node)
	}
	nodes = filteredNodes

	if int(size)+delta < len(nodes) {
		return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			size, delta, len(nodes))
	}
	return ng.awsManager.SetAsgSize(ng.asg, size+delta)
}

akloss-cibo · 2024-02-15T10:15:48Z

FWIW, we're stuck with this bug as well; downgrading to 1.27.3 isn't a great option for us:

uses the unknown EC2 instance type "m7i.48xlarge"

mhornero91 · 2024-02-26T13:57:54Z

Hello, I had exactly the same problem running in 1.29 k8s and 1.29.0 image version for cluster-autoscaler. One node was in problem and all our system wasn't scalated due to this reason during lot of hours.

fahaddd-git · 2024-03-01T15:07:21Z

Seeing this as well with cluster autoscaler@1.28.2 on EKS@1.28. If the bootstrap script (/etc/eks/bootstrap.sh) fails, the node will just sit in the ASG and all CA scaling up/down will be stopped.

pawelaugustyn · 2024-03-13T15:45:17Z

I observed a similar issue. The initial scale-up timeout caused the error to start appearing. The instance was eventually scaled up, but it wasn't registered as a Kubernetes node. When I deleted the node manually, issue has been resolved. I believe it was related to the low availability of a particular instance type (g4dn.4xlarge). EKS 1.28 with 1.28.2 cluster-autoscaler.

I0313 12:34:34.470823       1 executor.go:147] Scale-up: setting group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 size to 7
I0313 12:34:34.470926       1 auto_scaling_groups.go:265] Setting asg eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 size to 7
...
W0313 12:49:40.563382       1 clusterstate.go:266] Scale-up timed out for node group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 after 15m5.500491908s
W0313 12:49:40.584456       1 clusterstate.go:297] Disabling scale-up for node group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 until 2024-03-13 12:54:40.13918732 +0000 UTC m=+2030.419314682; errorClass=Other; errorCode=timeout
W0313 12:50:01.422358       1 orchestrator.go:582] Node group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 is not ready for scaleup - backoff
W0313 12:50:22.047950       1 orchestrator.go:582] Node group eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598 is not ready for scaleup - backoff
I0313 12:50:32.503066       1 static_autoscaler.go:709] Decreasing size of eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598, expected=7 current=6 delta=-1
E0313 12:50:32.503102       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 7

...
NOW I DELETED THE NODE FROM EC2 CONSOLE
...

E0313 15:15:27.466600       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 7
I0313 15:15:37.796147       1 static_autoscaler.go:709] Decreasing size of eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598, expected=7 current=6 delta=-1
E0313 15:15:37.796180       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 7
I0313 15:15:48.264982       1 static_autoscaler.go:709] Decreasing size of eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598, expected=7 current=6 delta=-1
E0313 15:15:48.265018       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 8
I0313 15:15:58.763859       1 static_autoscaler.go:709] Decreasing size of eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598, expected=7 current=6 delta=-1
E0313 15:15:58.763892       1 static_autoscaler.go:439] Failed to fix node group sizes: failed to decrease eks-mli_inference_g4dn4xl-02c6fc6f-788b-c2ae-e9d5-647dfa039598: attempt to delete existing nodes targetSize:7 delta:-1 existingNodes: 8

NO MORE ENTRIES RELATED TO THIS ISSUE

benjimin · 2024-03-30T01:11:21Z

I believe the fundamental issue is that cluster-autoscaler does not handle instances without node objects, that is, it does not support deleting nodes when the instance is still running, and similarly, does not handle instances that fail to produce a node object in the first place (e.g., misconfigured launch templates).

Nodes are not supposed to be deleted manually. The kubelet creates the initial node object and keeps it updated, but does not react to the node object being removed. Cluster-autoscaler uses cloud APIs to scale up ASGs when needed or shutdown instances when drained, but does not intervene in node object lifecycle. The cloud node controller is supposed to remove the node object only after the instance is shutdown and relinquished, but has no way to tell whether a still-running instance is supposed to be a node of the cluster. So, manually deleting a node creates a zombie instance. (The control plane responds by redirecting service traffic and rescheduling workloads elsewhere, but the instance is left running.)

Ideally kubelet should recreate the missing node object if necessary, but it currently doesn't. Then cluster-autoscaler starts getting confused by the sustained mismatch between ASG sizes and the node object inventory. This initially presents as cluster-autoscaler's Failed to find readiness information for <ASG> message, while tending to leave pods indefinitely in a pending state (as if it thinks a scale-up is already in progress). If later the cluster manages to satisfy capacity due to some other trigger (either by manually bumping up the ASG desired size, or waiting for a transient workload spike to scale the cluster up further, or if enough pods just happen to finish) then all workloads may be restored to health, but nonetheless (after a scaling delay) cluster-autoscaler instead starts emitting Failed to fix node group sizes: failed to decrease <ASG>... messages containing the telltale clue that targetSize == existingNodes while delta < 0.

To actually clean up the problem you need to manually identify the zombie instances and individually terminate them (or, more disruptively, just bump that ASG's desired size to zero briefly - then manually bump it back again in case that's the same ASG that cluster-autoscaler itself is run on). And then stop letting people manually delete nodes (nor run untested configs).

A possible fix here would be, for all ASGs that cluster-autoscaler is configured to manage, to have cluster-autoscaler terminate any instance beyond a certain age if it doesn't correspond to a node object. (This would treat them like failed start-ups. An alternative solution would be for kubelet to recreate the node object it syncs to whenever necessary. Another alternative would be if kubelet reacted by exiting and shutting down its host, letting the ASG clean up such instances.)

Note this proposed fix will also clean up instances that fail to connect to the cluster, which is precisely the source of the problem for some of the reports above in this thread. (If the ASG launch template is misconfigured for the cluster, it makes sense that cluster-autoscaler should be allowed to periodically retry creating the instance, effectively checking for corrections to that configuration.)

zchenyu · 2024-04-01T23:15:52Z

We noticed this with a spot instance node group on EKS. Could a preempted instance also trigger this behavior?

ahilden · 2024-04-17T03:09:44Z

FYI, we are also seeing this with one of our users. It starts again with the timeout.

05:26:42.532720       1 clusterstate.go:266] Scale-up timed out for node group eks-something-worker-80c733af6-1e9e-534e-a483-918f39b0bc05 after 15m2.467913257s

kanupriyaraheja · 2024-04-22T14:57:07Z

We observed the same issue with cluster autoscaler 1.28.0. It started with a timeout error. The instance got created in ASG but was not registered as a node in kubernetes. I attempted to recreate the issue which was difficult to do so I tried a different method of getting the same error which was deleting a node manually on kubernetes.
For Cluster autoscaler 1.28.0:
I manually deleted one of the nodes that a worker pod was running on. The instance is still present in ASG but this deletes in corresponding node in k8. The pod running on the deleted node crashes first but then goes into an indefinite pending state. Autoscaler attempts to decrease the size of the nodegroup but is unable to and it leads to the "Failed to fix node group size error".

I tested the same scenario on autoscaler 1.27.3. I manually deleted one of the nodes that a worker pod was running on. The instance is still present in ASG but this deletes in corresponding node in k8. This led to the pod crashing and then going to pending state. However the cluster autoscaler recognized the zombie instance and printed a message in the logs 1 unregistered node present. The autoscaler deleted the zombie instance and created a new one and also created a new node in kubernetes. The autoscaler resolved the problem on its own.

In 1.27.3 autoscaler gets unregistered nodes correctly hence it is able to identify zombie instances.

func getNotRegisteredNodes(allNodes []*apiv1.Node, cloudProviderNodeInstances map[string][]cloudprovider.Instance, time time.Time) []UnregisteredNode {
	registered := sets.NewString()
	for _, node := range allNodes {
		registered.Insert(node.Spec.ProviderID)
	}
	notRegistered := make([]UnregisteredNode, 0)
	for _, instances := range cloudProviderNodeInstances {
		for _, instance := range instances {
			if !registered.Has(instance.Id) {
				notRegistered = append(notRegistered, UnregisteredNode{
					Node:              fakeNode(instance, cloudprovider.FakeNodeUnregistered),
					UnregisteredSince: time,
				})
			}
		}
	}
	return notRegistered
}

mrocheleau · 2024-04-22T20:49:05Z

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes
Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.
Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s
Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.
Instance expected counts now match between ASG and EKS/K8s
Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

kappa8219 · 2024-04-23T07:14:16Z

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes

Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.

Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s

Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.

Instance expected counts now match between ASG and EKS/K8s

Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

This method works fine for EKS 1.29 also, thanks.

I'd add some tips how to deduce this "cattle out of herd".

Error is:
static_autoscaler.go:449] Failed to fix node group sizes: failed to decrease gtlb: attempt to delete existing nodes targetSize:2 delta:-1 existingNodes: 2

To see what CAS "thinks" about this ASG see Configmap cluster-autoscaler-status. Something like this:
ScaleUp: NoActivity (ready=1 cloudProviderTarget=2 ...

Than see instance list for this ASG. There should be one spare, ready=1 from the ConfigMap is wrong. To check which one I used Annotations on Nodes csi.volume.kubernetes.io/nodeid={"ebs.csi.aws.com":"i-abcdabcddfv345"}. Compare to ASG list - got instance(s) to detach.

Open question is what circumstances lead to such a situation with manual fix needed. Will try 1.29.2.

Kamalpreet-KK · 2024-04-25T11:14:34Z

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes

Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.

Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s

Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.

Instance expected counts now match between ASG and EKS/K8s

Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

Thanks for help, these solution working perfectly.

ferwasy · 2024-04-25T12:39:30Z

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

* Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes

* Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.

* Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s

* Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.

* Instance expected counts now match between ASG and EKS/K8s

* Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

Thank you! This works perfectly running EKS 1.28 and CA 1.28.2. Any plan to work on a fix? Thanks in advance.

jschwartzy · 2024-05-04T14:39:34Z

After encountering this issue on EKS 1.28 and CA 1.28.2, we reproduced the issue as described above by manually removing one of the Kubernetes nodes.
Taking a look at the difference in behavior with 1.27.3, I think the issue is caused by this change:
e5bc070#diff-d404c1cb5f21589e42a6372990c673fcb42738f61d35bda2f5eeccdd0a7c3abeL1000-R1002

When the autoscaler is able to recover from this disconnect between ASG state and Kubernetes state, you will see this in the log:

I0504 14:22:56.465489       1 static_autoscaler.go:405] 1 unregistered nodes present

This message is never present in 1.28.x versions.

To test this, I modified the getNotRegisteredNodes method to expand the check:

if (!registered.Has(instance.Id) && expRegister) || (!registered.Has(instance.Id) && instance.Status == nil)

And sure enough, on the next run of the cluster-autoscaler:

I0504 14:22:56.465489       1 static_autoscaler.go:405] 1 unregistered nodes present
...
I0504 14:25:57.187609       1 clusterstate.go:633] Found longUnregistered Nodes [aws:///us-east-1a/i-1234ab5678901a12b]
I0504 14:25:57.187651       1 static_autoscaler.go:405] 1 unregistered nodes present
I0504 14:25:57.187662       1 static_autoscaler.go:746] Marking unregistered node aws:///us-east-1a/i-1234ab5678901a12b for removal
I0504 14:25:57.187678       1 static_autoscaler.go:755] Removing 1 unregistered nodes for node group general
I0504 14:25:57.407027       1 auto_scaling_groups.go:343] Terminating EC2 instance: i-1234ab5678901a12b
I0504 14:25:57.407055       1 aws_manager.go:162] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0504 14:25:57.407118       1 static_autoscaler.go:413] Some unregistered nodes were removed

daimaxiaxie · 2024-05-06T02:20:41Z

We also encountered this problem, and I refactored part of the logic. It can solve this problem very well #6729. Works well on our large cluster.

For AWS, instance has only two states. (expectedToRegister is incorrect)

instance.Status = nil

instance.State = InstanceCreating and instance.State.ErrorInfo = OutOfResourcesErrorClass

Therefore removeOldUnregisteredNodes will be invalid. Finally enter fixNodeGroupSize and report an error.

wsaeed-tkxel · 2024-05-07T08:01:51Z

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

* Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes

* Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.

* Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s

* Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.

* Instance expected counts now match between ASG and EKS/K8s

* Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

I am using EKS 1.29 with CA 1.29.2, tried this but i detach the instance, ASG automatically adds newer instances which are also not present in the cluster and i ended up with more extra zombie instances, I uncheck the instance replacement option and the instance is k8s starts to disappear.

tooptoop4 · 2024-05-21T23:33:08Z

any ETA on fix?

chuyee · 2024-06-03T19:47:31Z

Anyone who had the problem on 1.28/29 would you please verify if #6528 (merged in 1.30) solved the problem for you? The solution itself is very similar to the one mentioned by @jschwartzy in #6128 (comment)

markshawtoronto · 2024-06-07T14:07:36Z

Anyone who had the problem on 1.28/29 would you please verify if #6528 (merged in 1.30) solved the problem for you? The solution itself is very similar to the one mentioned by @jschwartzy in #6128 (comment)

@chuyee
Confirmed, we upgraded our autoscalers to v1.30.1, even while running kubernetes 1.28 clusters on EKS and the problem was easily reproducible before but now gone for us. 🥳

ferwasy · 2024-06-07T17:19:52Z

Hello everyone. We're running Kubernetes v1.28 on EKS and have been experiencing this issue with the cluster autoscaler. Is version v1.30.1 fully compatible with Kubernetes v1.28? Thanks in advance.

wsaeed-tkxel · 2024-06-07T17:24:02Z

To be honest folks i have tried almost every version of CA after 1.27 and man it's a pain. Now that it was hurting us pretty bad as our ability to scale up or down was hit pretty bad. Now we have moved to karpenter and i can sleep like a baby.

markshawtoronto · 2024-06-07T17:57:37Z

Hello everyone. We're running Kubernetes v1.28 on EKS and have been experiencing this issue with the cluster autoscaler. Is version v1.30.1 fully compatible with Kubernetes v1.28? Thanks in advance.

@ferwasy

"We don't do cross version testing or compatibility testing in other environments. Some user reports indicate successful use of a newer version of Cluster Autoscaler with older clusters, however, there is always a chance that it won't work as expected."

So you can either upgrade to v1.30.1 (as we did) or wait for this backport PR to be released as 1.28.6 if you'd like to be more cautious and use the intended supported version.

ferwasy · 2024-06-10T13:50:04Z

Hello everyone. We're running Kubernetes v1.28 on EKS and have been experiencing this issue with the cluster autoscaler. Is version v1.30.1 fully compatible with Kubernetes v1.28? Thanks in advance.

@ferwasy

"We don't do cross version testing or compatibility testing in other environments. Some user reports indicate successful use of a newer version of Cluster Autoscaler with older clusters, however, there is always a chance that it won't work as expected."

So you can either upgrade to v1.30.1 (as we did) or wait for this backport PR to be released as 1.28.6 if you'd like to be more cautious and use the intended supported version.

@markshawtoronto thanks, will wait 1.28.6 to be released.

andresrsanchez · 2024-06-12T05:24:30Z

Hello everyone. We're running Kubernetes v1.28 on EKS and have been experiencing this issue with the cluster autoscaler. Is version v1.30.1 fully compatible with Kubernetes v1.28? Thanks in advance.

@ferwasy
"We don't do cross version testing or compatibility testing in other environments. Some user reports indicate successful use of a newer version of Cluster Autoscaler with older clusters, however, there is always a chance that it won't work as expected."
So you can either upgrade to v1.30.1 (as we did) or wait for this backport PR to be released as 1.28.6 if you'd like to be more cautious and use the intended supported version.

@markshawtoronto thanks, will wait 1.28.6 to be released.

We are in the same spot, anyone knows when it will be released?

Thanks

pawelaugustyn · 2024-06-12T07:07:35Z

It'd also be great to see 1.29 backport PR released within the v1.29.4 🙏🏻

cloudwitch · 2024-06-17T21:39:46Z

Any ETA on 1.28.6 and 1.29.4?

maur1th · 2024-06-27T21:30:42Z

@cloudwitch Probably July 24th based on https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#schedule. This is really inconvenient though so I'm pondering running with CA 1.27 on newer clusters.

samuel-esp · 2024-07-22T12:18:24Z

Was this issue fixed for 1.28.6? I don't see mentions about it on the release notes

semoog · 2024-07-22T17:10:12Z

Was this issue fixed for 1.28.6? I don't see mentions about it on the release notes

Backport is included

relaxdiego · 2024-08-15T06:00:12Z

Just had this occur on one of our EKS 1.28 clusters, fixed it by:

* Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match) - in our case this node group had 4 actual in-service nodes

* Went into the ASG directly and to the Instance Management tab - in our case this showed 5 in service instances.

* Look at each instance's details in EC2 directly to check the names/IP's, and in our case one of them in the ASG didn't have a corresponding node in EKS/K8s

* Detach that instance from the ASG - warning about "ASG may replace this instance", but it didn't.

* Instance expected counts now match between ASG and EKS/K8s

* Errors on the cluster-autoscaler pod stopped and normal operation resumed on it's own, once it had picked up the ASG updated status

Everything looks stabilized for this cluster now and can probably repeat if it does return.

For those just coming across this bug and need to perform the above workaround as a quick fix, take not that detaching an instance from an ASG does not terminate it. In our case, I actually went to the instance's page and terminated it. Then the ASG automatically recreated a new instance which was able to join the cluster properly. We just then let Cluster Autoscaler deal with any excess capacity if there were any. For completeness, this is how we performed the workaround (adapted from original steps by @mrocheleau):

Look at the node group node list to get current instance names (double confirm instances in the node group within k8s directly, these should 100% match)
Went into the ASG directly and to the Instance Management tab
Look at each instance's details in EC2 directly to check the names/IPs. In our case we found one extra instance that did not show up as a node in k8s (by using kubectl get nodes -o wide for example)
Terminate the instance. This will cause the ASG to create a new one. This is fine since cluster-autoscaler will just adjust the node group if there is any excess capacity.
Errors on the cluster-autoscaler should stop and normal operation resume.

agamez-harmonicinc · 2024-08-27T12:49:55Z

Is this fix already available in 1.29.3? I see it seems backported to 1.29, but I faced this problem 2 days ago using cluster autoscaler 1.29.3

samuel-esp · 2024-09-23T14:00:33Z

Is this fix already available in 1.29.3? I see it seems backported to 1.29, but I faced this problem 2 days ago using cluster autoscaler 1.29.3

Same with 1.29.4, can anyone confirm if the bad behavior is still present?

benmoss · 2024-10-17T23:54:51Z

It was backported to 1.29.4 and 1.28.6 and released as part of 1.30.0
1.28.6: 0c97874
1.29.4: 90591b5
1.30.0 and beyond: 6ca8414

hh-sushantkumar · 2024-10-30T09:10:31Z

Still facing this is in 1.30.2 image.

com6056 added the kind/bug Categorizes issue or PR as related to a bug. label Sep 22, 2023

jbartosik added the area/cluster-autoscaler label Oct 11, 2023

towca added the area/provider/aws Issues or PRs related to aws provider label Mar 21, 2024

sondrelg mentioned this issue Apr 25, 2024

CA: Confusing output #6768

Closed

daimaxiaxie mentioned this issue May 6, 2024

Fast scale up when an asg fault #6729

Open

ruiscosta mentioned this issue May 10, 2024

Fix deadlock in DecreaseTargetSize by filtering placeholders Fixes #6128 #6817

Closed

ruiscosta mentioned this issue May 26, 2024

Update DecreaseTargetSize to Exclude Placeholders #6866

Closed

trondhindenes mentioned this issue Jun 25, 2024

cluster-autoscaler does not scale out on pending pod #6974

Open

cluster-autoscaler gets stuck with "Failed to fix node group sizes" error #6128

cluster-autoscaler gets stuck with "Failed to fix node group sizes" error #6128

Comments

com6056 commented Sep 22, 2023 • edited Loading

Tenzer commented Oct 2, 2023

Tenzer commented Oct 9, 2023

com6056 commented Nov 7, 2023 • edited Loading

mykhailogorsky commented Nov 8, 2023

stefansedich commented Nov 15, 2023 • edited Loading

com6056 commented Nov 15, 2023

com6056 commented Nov 15, 2023

stefansedich commented Nov 16, 2023

rajaie-sg commented Nov 16, 2023

x13n commented Nov 16, 2023

eaterm commented Dec 6, 2023

mycrEEpy commented Dec 12, 2023

artificial-aidan commented Dec 20, 2023

mohag commented Feb 2, 2024

songminglong commented Feb 5, 2024

songminglong commented Feb 5, 2024

songminglong commented Feb 6, 2024

akloss-cibo commented Feb 15, 2024

mhornero91 commented Feb 26, 2024

fahaddd-git commented Mar 1, 2024 • edited Loading

pawelaugustyn commented Mar 13, 2024

benjimin commented Mar 30, 2024 • edited Loading

zchenyu commented Apr 1, 2024

ahilden commented Apr 17, 2024

kanupriyaraheja commented Apr 22, 2024

mrocheleau commented Apr 22, 2024

kappa8219 commented Apr 23, 2024 • edited Loading

Kamalpreet-KK commented Apr 25, 2024

ferwasy commented Apr 25, 2024

jschwartzy commented May 4, 2024

daimaxiaxie commented May 6, 2024 • edited Loading

wsaeed-tkxel commented May 7, 2024 • edited Loading

tooptoop4 commented May 21, 2024

chuyee commented Jun 3, 2024

markshawtoronto commented Jun 7, 2024

ferwasy commented Jun 7, 2024

wsaeed-tkxel commented Jun 7, 2024 • edited Loading

markshawtoronto commented Jun 7, 2024 • edited Loading

ferwasy commented Jun 10, 2024

andresrsanchez commented Jun 12, 2024

pawelaugustyn commented Jun 12, 2024

cloudwitch commented Jun 17, 2024

maur1th commented Jun 27, 2024

samuel-esp commented Jul 22, 2024

semoog commented Jul 22, 2024

relaxdiego commented Aug 15, 2024 • edited Loading

agamez-harmonicinc commented Aug 27, 2024

samuel-esp commented Sep 23, 2024

benmoss commented Oct 17, 2024 • edited Loading

hh-sushantkumar commented Oct 30, 2024

com6056 commented Sep 22, 2023 •

edited

Loading

com6056 commented Nov 7, 2023 •

edited

Loading

stefansedich commented Nov 15, 2023 •

edited

Loading

fahaddd-git commented Mar 1, 2024 •

edited

Loading

benjimin commented Mar 30, 2024 •

edited

Loading

kappa8219 commented Apr 23, 2024 •

edited

Loading

daimaxiaxie commented May 6, 2024 •

edited

Loading

wsaeed-tkxel commented May 7, 2024 •

edited

Loading

wsaeed-tkxel commented Jun 7, 2024 •

edited

Loading

markshawtoronto commented Jun 7, 2024 •

edited

Loading

relaxdiego commented Aug 15, 2024 •

edited

Loading

benmoss commented Oct 17, 2024 •

edited

Loading