
Cluster-Autoscaler decreases AWS AutoScalingGroup desired capacity during unregistered node removal, causing unneeded shrinking of the ASG #6795

Open
makzzz1986 opened this issue May 6, 2024 · 1 comment
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
v2.7.2

Component version:
Kubernetes server v1.26.14

What environment is this in?:
AWS EKS

What did you expect to happen?:
When Cluster-Autoscaler removes an unregistered node, it should not decrease the desired capacity of the AWS AutoScaling Group.

What happened instead?:
When a broken kubelet configuration is introduced and EC2 instances can't register as nodes, Cluster-Autoscaler terminates the instances and decreases the Desired Capacity of the AutoScaling Group by the number of terminated instances. This causes a rapid drop of instances and healthy nodes toward zero, especially if the AutoScaling Group removes the oldest instances while shrinking.

How to reproduce it (as minimally and precisely as possible):

  1. Attach a LaunchTemplate with broken UserData (any kind of invalid kubelet-extra-args) to the AutoScalingGroup.
  2. Scale any application in the cluster so that Cluster-Autoscaler increases the Desired Capacity of the AutoScalingGroup.
  3. Wait 5 minutes (the rapid desired-capacity drop can be watched with the sketch below).
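A minimal aws-sdk-go polling sketch for watching the desired capacity and in-service instance count while reproducing the steps above (not part of the original report; the ASG name and polling interval are placeholders):

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	for {
		out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
			AutoScalingGroupNames: []*string{aws.String("ASG_NAME_IS_REMOVED")},
		})
		if err != nil {
			fmt.Println("describe failed:", err)
		} else if len(out.AutoScalingGroups) > 0 {
			g := out.AutoScalingGroups[0]
			// Print the current desired capacity next to the number of
			// instances actually attached to the group.
			fmt.Printf("%s desired=%d instances=%d\n",
				time.Now().Format(time.RFC3339),
				aws.Int64Value(g.DesiredCapacity), len(g.Instances))
		}
		time.Sleep(10 * time.Second)
	}
}
```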

Anything else we need to know?:
We have experienced this a few times; some logs follow.
This is the chart of the AutoScaling Group monitoring while reproducing the behavior:
[image: ASG monitoring chart]

ASG activity log, with entries like the following appearing repeatedly:
At 2024-05-06T13:38:33Z a user request explicitly set group desired capacity changing the desired capacity from 18 to 17. At 2024-05-06T13:38:44Z a user request explicitly set group desired capacity changing the desired capacity from 17 to 15. At 2024-05-06T13:38:47Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 18 to 15. At 2024-05-06T13:38:47Z instance i-08259e65446b2711a was selected for termination. At 2024-05-06T13:38:47Z instance i-03a8da6f7f73f80f7 was selected for termination. At 2024-05-06T13:38:47Z instance i-0946a707c1d4d62f5 was selected for termination.
It then repeats, dropping the desired capacity very fast, every few seconds:

  • At 2024-05-06T13:38:33Z a user request explicitly set group desired capacity changing the desired capacity from 18 to 17. At 2024-05-06T13:38:44Z a user request explicitly set group desired capacity changing the desired capacity from 17 to 15. At 2024-05-06T13:38:47Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 18 to 15
  • At 2024-05-06T13:38:55Z a user request explicitly set group desired capacity changing the desired capacity from 15 to 13. At 2024-05-06T13:38:59Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 15 to 13
  • At 2024-05-06T13:39:06Z a user request explicitly set group desired capacity changing the desired capacity from 13 to 11. At 2024-05-06T13:39:10Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 13 to 11
  • At 2024-05-06T13:39:17Z a user request explicitly set group desired capacity changing the desired capacity from 11 to 9. At 2024-05-06T13:39:22Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 11 to 9
  • At 2024-05-06T13:39:27Z a user request explicitly set group desired capacity changing the desired capacity from 9 to 7. At 2024-05-06T13:39:34Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 9 to 7
  • At 2024-05-06T13:39:38Z instance i-045007d227f6406aa was taken out of service in response to a user request, shrinking the capacity from 7 to 6.
  • At 2024-05-06T13:39:38Z instance i-0a90f6d48b464f279 was taken out of service in response to a user request, shrinking the capacity from 6 to 5.
  • At 2024-05-06T13:39:38Z instance i-0a461072a3a045985 was taken out of service in response to a user request, shrinking the capacity from 5 to 4.

In the source code I could find that decreasing the Desired Capacity is logged, but I could not find that message in the Cluster-Autoscaler logs, only the instance-removal logs:
I0506 13:38:32.981481 1 static_autoscaler.go:289] Starting main loop
I0506 13:38:32.981659 1 auto_scaling_groups.go:367] Regenerating instance to ASG map for ASG names: []
I0506 13:38:32.981674 1 auto_scaling_groups.go:374] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/eks-01-shared-staging: k8s.io/cluster-autoscaler/enabled:]
I0506 13:38:33.089424 1 auto_scaling_groups.go:140] Updating ASG ASG_NAME_IS_REMOVED
I0506 13:38:33.089603 1 aws_wrapper.go:693] 0 launch configurations to query
I0506 13:38:33.089613 1 aws_wrapper.go:694] 0 launch templates to query
I0506 13:38:33.089618 1 aws_wrapper.go:714] Successfully queried 0 launch configurations
I0506 13:38:33.089622 1 aws_wrapper.go:725] Successfully queried 0 launch templates
I0506 13:38:33.089627 1 aws_wrapper.go:736] Successfully queried instance requirements for 0 ASGs
I0506 13:38:33.089637 1 aws_manager.go:129] Refreshed ASG list, next refresh after 2024-05-06 13:39:33.089634918 +0000 UTC m=+260083.881492346
I0506 13:38:33.093156 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.093444 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.093720 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.094109 1 clusterstate.go:623] Found longUnregistered Nodes [aws:///eu-west-1c/i-0356e7154743c269d aws:///eu-west-1b/i-0739a0918fbfaeff1 aws:///eu-west-1c/i-0d52e62f88b44e06f aws:///eu-west-1a/i-024ee7cacf172923e]
I0506 13:38:33.094144 1 static_autoscaler.go:405] 13 unregistered nodes present
I0506 13:38:33.094170 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1a/i-024ee7cacf172923e
I0506 13:38:33.273533 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-024ee7cacf172923e
I0506 13:38:33.273546 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.273594 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1b/i-0739a0918fbfaeff1
I0506 13:38:33.396748 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0739a0918fbfaeff1
I0506 13:38:33.396764 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.396811 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1c/i-0d52e62f88b44e06f
I0506 13:38:33.599688 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0d52e62f88b44e06f
I0506 13:38:33.599706 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.599755 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1c/i-0356e7154743c269d
I0506 13:38:33.835324 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0356e7154743c269d
I0506 13:38:33.835341 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.835382 1 static_autoscaler.go:413] Some unregistered nodes were removed
I0506 13:38:33.835513 1 filter_out_schedulable.go:63] Filtering out schedulables
I0506 13:38:33.835665 1 klogx.go:87] failed to find place for XXXX: cannot put pod XXXX on any node
I0506 13:38:33.835818 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.835913 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.836019 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.836028 1 filter_out_schedulable.go:120] 0 pods marked as unschedulable can be scheduled.
I0506 13:38:33.836043 1 filter_out_schedulable.go:83] No schedulable pods
I0506 13:38:33.836048 1 filter_out_daemon_sets.go:40] Filtering out daemon set pods
I0506 13:38:33.836053 1 filter_out_daemon_sets.go:49] Filtered out 0 daemon set pods, 4 unschedulable pods left
I0506 13:38:33.836067 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836072 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836078 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836084 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836362 1 orchestrator.go:109] Upcoming 0 nodes
I0506 13:38:33.837357 1 orchestrator.go:466] Pod XXXX can't be scheduled on ASG_NAME_IS_REMOVED, predicate checking error: node(s) had untolerated taint {worker-type: criticalservices}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {worker-type: criticalservices}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"worker-type", Value:"criticalservices", Effect:"NoSchedule", TimeAdded:}}
I0506 13:38:33.837372 1 orchestrator.go:468] 3 other pods similar to XXXX can't be scheduled on ASG_NAME_IS_REMOVED
I0506 13:38:33.837381 1 orchestrator.go:167] No pod can fit to ASG_NAME_IS_REMOVED
I0506 13:38:33.837485 1 orchestrator.go:466] Pod XXXX can't be scheduled on ASG_NAME_IS_REMOVED, predicate checking error: node(s) had untolerated taint {app: solr}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {app: solr}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"app", Value:"solr", Effect:"NoSchedule", TimeAdded:}}
I0506 13:38:33.837500 1 orchestrator.go:468] 3 other pods similar to XXXX can't be scheduled on ASG_NAME_IS_REMOVED
I0506 13:38:33.837510 1 orchestrator.go:167] No pod can fit to ASG_NAME_IS_REMOVED
I0506 13:38:33.837522 1 orchestrator.go:193] Best option to resize: ASG_NAME_IS_REMOVED
I0506 13:38:33.837536 1 orchestrator.go:197] Estimated 2 nodes needed in ASG_NAME_IS_REMOVED
I0506 13:38:33.837559 1 orchestrator.go:310] Final scale-up plan: [{ASG_NAME_IS_REMOVED 15->17 (max: 80)}]
I0506 13:38:33.837574 1 orchestrator.go:582] Scale-up: setting group ASG_NAME_IS_REMOVED size to 17
I0506 13:38:33.837590 1 auto_scaling_groups.go:255] Setting asg ASG_NAME_IS_REMOVED size to 17
I0506 13:38:34.001090 1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop

In the logs I can see the scale-up plan constantly trying to increase the Desired Capacity again, but it doesn't help:
I0506 13:33:23.132521 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 14->16 (max: 80)}]
I0506 13:33:33.769051 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 13->15 (max: 80)}]
I0506 13:33:44.438663 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 12->14 (max: 80)}]
I0506 13:33:54.988557 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 11->13 (max: 80)}]
I0506 13:34:05.685369 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 10->12 (max: 80)}]
I0506 13:34:16.256407 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 9->11 (max: 80)}]
I0506 13:34:26.822806 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 8->10 (max: 80)}]
I0506 13:34:37.435677 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 7->9 (max: 80)}]
I0506 13:34:47.884433 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 8->10 (max: 80)}]
I0506 13:34:58.303376 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 9->11 (max: 80)}]
I0506 13:35:09.222123 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 10->12 (max: 80)}]
I0506 13:35:19.832029 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 11->13 (max: 80)}]
I0506 13:35:30.142655 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 13->15 (max: 80)}]
I0506 13:35:40.288624 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 15->17 (max: 80)}]
I0506 13:35:50.673319 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 17->19 (max: 80)}]
I0506 13:36:00.991247 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 19->20 (max: 80)}]
I0506 13:38:12.098179 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 18->20 (max: 80)}]
I0506 13:38:22.838214 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 17->19 (max: 80)}]

@makzzz1986 makzzz1986 added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2024

makzzz1986 commented May 6, 2024

Looks like the shrinking itself is expected because of:

ShouldDecrementDesiredCapacity: aws.Bool(true),

but I can't explain how shrinking the desired capacity by the number of unregistered nodes brings the desired capacity to almost zero. Can it be that Cluster-Autoscaler requests removal of the same instance a few times and drops the desired capacity by more than one? 🤔
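For reference, here is a minimal aws-sdk-go sketch (my illustration, not the actual cluster-autoscaler code; the instance ID is a placeholder) of the AWS call the quoted field belongs to. Each successful TerminateInstanceInAutoScalingGroup call with ShouldDecrementDesiredCapacity set to true terminates the instance and lowers the desired capacity by one, so if overlapping sets of unregistered instances were terminated this way in consecutive loops, the decrements would add up:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// terminateUnregistered terminates each instance and decrements the ASG's
// desired capacity by one per successful call.
func terminateUnregistered(svc *autoscaling.AutoScaling, instanceIDs []string) {
	for _, id := range instanceIDs {
		_, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
			InstanceId:                     aws.String(id),
			ShouldDecrementDesiredCapacity: aws.Bool(true),
		})
		if err != nil {
			fmt.Println("terminate failed:", err)
		}
	}
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	// Placeholder ID; in the real failure these would be the
	// "longUnregistered" nodes from the logs above.
	terminateUnregistered(svc, []string{"i-0123456789abcdef0"})
}
```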
