Cluster-Autoscaler decreases AWS AutoScalingGroup desired capacity during unregistered node removal, which causes unneeded shrinking of the ASG #6795

Open
makzzz1986 opened this issue May 6, 2024 · 3 comments
Labels: area/cluster-autoscaler, kind/bug

Comments

@makzzz1986

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
v2.7.2

Component version:
Kubernetes server v1.26.14

What environment is this in?:
AWS EKS

What did you expect to happen?:
When Cluster-Autoscaler removes an unregistered node, it should not decrease the desired capacity of the AWS AutoScaling Group.

What happened instead?:
When a broken kubelet configuration is introduced and EC2 instances can't register as nodes, Cluster-Autoscaler terminates the instances and decreases the Desired Capacity of the AutoScaling Group by the number of terminated instances. This causes a rapid drop of instances and healthy nodes towards zero, especially if the AutoScaling Group removes the oldest instances while shrinking.

How to reproduce it (as minimally and precisely as possible):

  1. Attach a LaunchTemplate with broken UserData (e.g. any invalid kubelet-extra-args value) to the AutoScalingGroup (see the sketch after this list).
  2. Scale up any application in the cluster so that Cluster-Autoscaler increases the Desired Capacity of the AutoScalingGroup.
  3. Wait 5 minutes.
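For step 1, something along these lines can be used. This is only a hedged sketch assuming aws-sdk-go v1 and the standard EKS bootstrap script; the launch template ID, cluster name, ASG name, and the bogus kubelet flag are placeholders, not taken from this cluster:

package main

import (
	"encoding/base64"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	ltID := "lt-0123456789abcdef0" // placeholder launch template ID

	// Bootstrap with a bogus kubelet flag so instances never register as nodes.
	userData := base64.StdEncoding.EncodeToString([]byte(
		"#!/bin/bash\n/etc/eks/bootstrap.sh my-cluster --kubelet-extra-args '--not-a-real-flag=true'\n"))

	// Create a new launch template version carrying the broken UserData.
	_, err := ec2.New(sess).CreateLaunchTemplateVersion(&ec2.CreateLaunchTemplateVersionInput{
		LaunchTemplateId:   aws.String(ltID),
		SourceVersion:      aws.String("$Latest"),
		LaunchTemplateData: &ec2.RequestLaunchTemplateData{UserData: aws.String(userData)},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Point the ASG at the latest (broken) launch template version.
	_, err = autoscaling.New(sess).UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName: aws.String("ASG_NAME_IS_REMOVED"), // placeholder
		LaunchTemplate: &autoscaling.LaunchTemplateSpecification{
			LaunchTemplateId: aws.String(ltID),
			Version:          aws.String("$Latest"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}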

Anything else we need to know?:
We have experienced this a few times; relevant logs below.
This is the AutoScaling Group monitoring chart captured while reproducing the behavior:
[image: ASG monitoring chart]

ASG Activity log:
entries like the following appear repeatedly:
At 2024-05-06T13:38:33Z a user request explicitly set group desired capacity changing the desired capacity from 18 to 17. At 2024-05-06T13:38:44Z a user request explicitly set group desired capacity changing the desired capacity from 17 to 15. At 2024-05-06T13:38:47Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 18 to 15. At 2024-05-06T13:38:47Z instance i-08259e65446b2711a was selected for termination. At 2024-05-06T13:38:47Z instance i-03a8da6f7f73f80f7 was selected for termination. At 2024-05-06T13:38:47Z instance i-0946a707c1d4d62f5 was selected for termination.
Then it repeats, dropping the desired capacity every few seconds:

  • At 2024-05-06T13:38:33Z a user request explicitly set group desired capacity changing the desired capacity from 18 to 17. At 2024-05-06T13:38:44Z a user request explicitly set group desired capacity changing the desired capacity from 17 to 15. At 2024-05-06T13:38:47Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 18 to 15
  • At 2024-05-06T13:38:55Z a user request explicitly set group desired capacity changing the desired capacity from 15 to 13. At 2024-05-06T13:38:59Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 15 to 13
  • At 2024-05-06T13:39:06Z a user request explicitly set group desired capacity changing the desired capacity from 13 to 11. At 2024-05-06T13:39:10Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 13 to 11
  • At 2024-05-06T13:39:17Z a user request explicitly set group desired capacity changing the desired capacity from 11 to 9. At 2024-05-06T13:39:22Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 11 to 9
  • At 2024-05-06T13:39:27Z a user request explicitly set group desired capacity changing the desired capacity from 9 to 7. At 2024-05-06T13:39:34Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 9 to 7
  • At 2024-05-06T13:39:38Z instance i-045007d227f6406aa was taken out of service in response to a user request, shrinking the capacity from 7 to 6.
  • At 2024-05-06T13:39:38Z instance i-0a90f6d48b464f279 was taken out of service in response to a user request, shrinking the capacity from 6 to 5.
  • At 2024-05-06T13:39:38Z instance i-0a461072a3a045985 was taken out of service in response to a user request, shrinking the capacity from 5 to 4.

I could find in the source code that decreasing the Desired Capacity is logged, but I could not find such entries in the Cluster-Autoscaler logs, only the instance-removal logs:
I0506 13:38:32.981481 1 static_autoscaler.go:289] Starting main loop
I0506 13:38:32.981659 1 auto_scaling_groups.go:367] Regenerating instance to ASG map for ASG names: []
I0506 13:38:32.981674 1 auto_scaling_groups.go:374] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/eks-01-shared-staging: k8s.io/cluster-autoscaler/enabled:]
I0506 13:38:33.089424 1 auto_scaling_groups.go:140] Updating ASG ASG_NAME_IS_REMOVED
I0506 13:38:33.089603 1 aws_wrapper.go:693] 0 launch configurations to query
I0506 13:38:33.089613 1 aws_wrapper.go:694] 0 launch templates to query
I0506 13:38:33.089618 1 aws_wrapper.go:714] Successfully queried 0 launch configurations
I0506 13:38:33.089622 1 aws_wrapper.go:725] Successfully queried 0 launch templates
I0506 13:38:33.089627 1 aws_wrapper.go:736] Successfully queried instance requirements for 0 ASGs
I0506 13:38:33.089637 1 aws_manager.go:129] Refreshed ASG list, next refresh after 2024-05-06 13:39:33.089634918 +0000 UTC m=+260083.881492346
I0506 13:38:33.093156 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.093444 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.093720 1 aws_manager.go:185] Found multiple availability zones for ASG "ASG_NAME_IS_REMOVED"; using eu-west-1b for failure-domain.beta.kubernetes.io/zone label
I0506 13:38:33.094109 1 clusterstate.go:623] Found longUnregistered Nodes [aws:///eu-west-1c/i-0356e7154743c269d aws:///eu-west-1b/i-0739a0918fbfaeff1 aws:///eu-west-1c/i-0d52e62f88b44e06f aws:///eu-west-1a/i-024ee7cacf172923e]
I0506 13:38:33.094144 1 static_autoscaler.go:405] 13 unregistered nodes present
I0506 13:38:33.094170 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1a/i-024ee7cacf172923e
I0506 13:38:33.273533 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-024ee7cacf172923e
I0506 13:38:33.273546 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.273594 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1b/i-0739a0918fbfaeff1
I0506 13:38:33.396748 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0739a0918fbfaeff1
I0506 13:38:33.396764 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.396811 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1c/i-0d52e62f88b44e06f
I0506 13:38:33.599688 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0d52e62f88b44e06f
I0506 13:38:33.599706 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.599755 1 static_autoscaler.go:746] Removing unregistered node aws:///eu-west-1c/i-0356e7154743c269d
I0506 13:38:33.835324 1 auto_scaling_groups.go:318] Terminating EC2 instance: i-0356e7154743c269d
I0506 13:38:33.835341 1 aws_manager.go:161] DeleteInstances was called: scheduling an ASG list refresh for next main loop evaluation
I0506 13:38:33.835382 1 static_autoscaler.go:413] Some unregistered nodes were removed
I0506 13:38:33.835513 1 filter_out_schedulable.go:63] Filtering out schedulables
I0506 13:38:33.835665 1 klogx.go:87] failed to find place for XXXX: cannot put pod XXXX on any node
I0506 13:38:33.835818 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.835913 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.836019 1 klogx.go:87] failed to find place for XXXX based on similar pods scheduling
I0506 13:38:33.836028 1 filter_out_schedulable.go:120] 0 pods marked as unschedulable can be scheduled.
I0506 13:38:33.836043 1 filter_out_schedulable.go:83] No schedulable pods
I0506 13:38:33.836048 1 filter_out_daemon_sets.go:40] Filtering out daemon set pods
I0506 13:38:33.836053 1 filter_out_daemon_sets.go:49] Filtered out 0 daemon set pods, 4 unschedulable pods left
I0506 13:38:33.836067 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836072 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836078 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836084 1 klogx.go:87] Pod XXXX is unschedulable
I0506 13:38:33.836362 1 orchestrator.go:109] Upcoming 0 nodes
I0506 13:38:33.837357 1 orchestrator.go:466] Pod XXXX can't be scheduled on ASG_NAME_IS_REMOVED, predicate checking error: node(s) had untolerated taint {worker-type: criticalservices}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {worker-type: criticalservices}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"worker-type", Value:"criticalservices", Effect:"NoSchedule", TimeAdded:}}
I0506 13:38:33.837372 1 orchestrator.go:468] 3 other pods similar to XXXX can't be scheduled on ASG_NAME_IS_REMOVED
I0506 13:38:33.837381 1 orchestrator.go:167] No pod can fit to ASG_NAME_IS_REMOVED
I0506 13:38:33.837485 1 orchestrator.go:466] Pod XXXX can't be scheduled on ASG_NAME_IS_REMOVED, predicate checking error: node(s) had untolerated taint {app: solr}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {app: solr}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"app", Value:"solr", Effect:"NoSchedule", TimeAdded:}}
I0506 13:38:33.837500 1 orchestrator.go:468] 3 other pods similar to XXXX can't be scheduled on ASG_NAME_IS_REMOVED
I0506 13:38:33.837510 1 orchestrator.go:167] No pod can fit to ASG_NAME_IS_REMOVED
I0506 13:38:33.837522 1 orchestrator.go:193] Best option to resize: ASG_NAME_IS_REMOVED
I0506 13:38:33.837536 1 orchestrator.go:197] Estimated 2 nodes needed in ASG_NAME_IS_REMOVED
I0506 13:38:33.837559 1 orchestrator.go:310] Final scale-up plan: [{ASG_NAME_IS_REMOVED 15->17 (max: 80)}]
I0506 13:38:33.837574 1 orchestrator.go:582] Scale-up: setting group ASG_NAME_IS_REMOVED size to 17
I0506 13:38:33.837590 1 auto_scaling_groups.go:255] Setting asg ASG_NAME_IS_REMOVED size to 17
I0506 13:38:34.001090 1 eventing_scale_up_processor.go:47] Skipping event processing for unschedulable pods since there is a ScaleUp attempt this loop

In the logs I can also see the scale-up plan constantly trying to increase the Desired Capacity, but it doesn't help (a sketch of the corresponding AWS call follows the log excerpt):
I0506 13:33:23.132521 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 14->16 (max: 80)}]
I0506 13:33:33.769051 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 13->15 (max: 80)}]
I0506 13:33:44.438663 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 12->14 (max: 80)}]
I0506 13:33:54.988557 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 11->13 (max: 80)}]
I0506 13:34:05.685369 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 10->12 (max: 80)}]
I0506 13:34:16.256407 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 9->11 (max: 80)}]
I0506 13:34:26.822806 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 8->10 (max: 80)}]
I0506 13:34:37.435677 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 7->9 (max: 80)}]
I0506 13:34:47.884433 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 8->10 (max: 80)}]
I0506 13:34:58.303376 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 9->11 (max: 80)}]
I0506 13:35:09.222123 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 10->12 (max: 80)}]
I0506 13:35:19.832029 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 11->13 (max: 80)}]
I0506 13:35:30.142655 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 13->15 (max: 80)}]
I0506 13:35:40.288624 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 15->17 (max: 80)}]
I0506 13:35:50.673319 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 17->19 (max: 80)}]
I0506 13:36:00.991247 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 19->20 (max: 80)}]
I0506 13:38:12.098179 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 18->20 (max: 80)}]
I0506 13:38:22.838214 1 orchestrator.go:310] Final scale-up plan: [{eks-01-shared-staging-app-blended-20231119224503534600000004 17->19 (max: 80)}]
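For context, each "Setting asg ... size to N" line corresponds to a desired-capacity update on the AWS side, roughly like the sketch below. This is only an illustration assuming aws-sdk-go v1; the ASG name and target size are placeholders:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Scale-up path: bump the ASG desired capacity to the planned size.
	_, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("ASG_NAME_IS_REMOVED"), // placeholder
		DesiredCapacity:      aws.Int64(17),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		log.Fatal(err)
	}
}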

makzzz1986 added the kind/bug label on May 6, 2024
@makzzz1986 (Author) commented May 6, 2024

It looks like the shrinking itself is expected because of:

ShouldDecrementDesiredCapacity: aws.Bool(true),

but I can't explain how shrinking the desired capacity by the number of unregistered nodes brings it to almost zero. Can it be that Cluster-Autoscaler requests removal of the same instance a few times and drops the desired capacity by more than one? 🤔
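To make the decrement mechanics concrete, here is a minimal standalone sketch (not the actual Cluster-Autoscaler code) of removing unregistered instances through the AWS AutoScaling API with aws-sdk-go v1. Because ShouldDecrementDesiredCapacity is true, every successful call shrinks the ASG desired capacity by one in addition to terminating the instance; the instance IDs below are just examples taken from the logs above:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Example IDs of long-unregistered nodes (taken from the logs above).
	instanceIDs := []string{"i-024ee7cacf172923e", "i-0739a0918fbfaeff1"}

	for _, id := range instanceIDs {
		// Terminate the instance AND shrink the desired capacity by one.
		out, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
			InstanceId:                     aws.String(id),
			ShouldDecrementDesiredCapacity: aws.Bool(true),
		})
		if err != nil {
			log.Printf("terminate %s: %v", id, err)
			continue
		}
		fmt.Println(aws.StringValue(out.Activity.Description))
	}
}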

@adrianmoisey (Member)

/area cluster-autoscaler

@songminglong

/cc
