
[AWS] Unsafe decommissioning of nodes when ASGs are out of instances #5829

Open
theintz opened this issue Jun 2, 2023 · 18 comments · May be fixed by #5846 or #6818
Labels
area/cluster-autoscaler · area/provider/aws · kind/bug

Comments

@theintz

theintz commented Jun 2, 2023

Which component are you using?: cluster-autoscaler

What version of the component are you using?: cluster-autoscaler 1.25.1

Component version: chart cluster-autoscaler-9.28.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.9-eks-0a21954", GitCommit:"eb82cd845d007ae98d215744675dcf7ff024a5a3", GitTreeState:"clean", BuildDate:"2023-04-15T00:37:59Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?: AWS EKS

What did you expect to happen?: We are running EKS on managed nodes using cluster-autoscaler to dynamically scale up and down the sizes of the node groups. Unfortunately, sometimes AWS is unable to fulfil requests for scaling up (Could not launch On-Demand Instances. InsufficientInstanceCapacity). When that happens, cluster-autoscaler gets confused about the size of the ASG and sends unnecessary SetDesiredCapacity requests, which result in unsafe scaling down of the ASG (details below).
The expectation is that CA handles these situations transparently and does not scale down the ASG.

What happened instead?: We are observing the following behavior, by looking at both the CA logs and the CloudTrail logs of the requests sent. I don't know enough about the internal workings of the CA, so some of this is an assumption.

  • Pods are pending, waiting for capacity in a given node group.
  • CA wants to scale up this node group and sends a SetDesiredCapacity request to the ASG to increase the size (say from 3 -> 5). This happens asynchronously, so the CA fires off the request and moves on to other work. However, it updates its internal representation of the ASG to be at capacity 5.
  • ASG can’t scale up and returns that info to the CA.
  • CA comes back around to the request and realizes the ASG is blocked. It updates its internal representation (back to 3), but crucially, for some reason, it also updates the ASG itself, sending a number of SetDesiredCapacity requests to revert the capacity back down to 3.
  • In the meantime, a new instance might have successfully started, so the actual capacity of the ASG is now 4.
  • As a result of receiving the SetDesiredCapacity requests from the CA, the ASG terminates one (or more) of its running instances, resulting in the unsafe scale-down.

[Attached image: sequence of events]

See the attached image that shows the sequence of events as well.
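For reference, a minimal sketch of the two API calls involved (using aws-sdk-go directly; the group name and sizes are illustrative, and this is not the CA's own code). The second call is the revert that races with a late-arriving instance:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Scale-up: ask the ASG for 5 instances (it currently has 3).
	if _, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("example-node-group"), // illustrative name
		DesiredCapacity:      aws.Int64(5),
	}); err != nil {
		log.Fatal(err)
	}

	// Later, after seeing InsufficientInstanceCapacity, the capacity is
	// reverted to 3. If a 4th instance has launched in the meantime, the ASG
	// must terminate an instance of its own choosing to reach the new target.
	if _, err := svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("example-node-group"),
		DesiredCapacity:      aws.Int64(3),
	}); err != nil {
		log.Fatal(err)
	}
}
```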

How to reproduce it (as minimally and precisely as possible): Difficult to reproduce, as it only happens when the AWS capacity issues occur. We are regularly observing this behavior, though, and I am happy to assist with more information.

Anything else we need to know?:

theintz added the kind/bug label Jun 2, 2023
@theintz
Author

theintz commented Jun 8, 2023

I investigated this some more. Here is what I believe is happening:

  • CA recognizes that there are unschedulable pods and initiates a scale-up of the relevant ASG. This sends a SetDesiredCapacity request to the ASG, which then attempts to start new instances accordingly. The method also updates the internal representation of the ASG (the asgCache) to reflect the desired size (asgCache.setAsgSizeNoLock).
  • During the next iteration of the main loop, the asgCache is regenerated; placeholder instances are added to make up the difference between the desired size and the actual instances in the group (asgCache.createPlaceholdersForDesiredNonStartedInstances). In this method the CA also discovers that the ASG was unable to scale up and assigns an error code to the health status of the created placeholder instances.
  • Further down in the same main loop, the CA tries to clean up dangling nodes that have already been deleted. To do this it looks at all the nodes it knows about and checks their health status (ClusterStateRegistry.GetCreatedNodesWithErrors). This lookup returns exactly those placeholder instances created just before.
  • The CA proceeds to delete the nodes, calling asgCache.DeleteInstances and passing the placeholder instances as the argument.
  • That method looks at the instances to be deleted and recognizes that they are just placeholders. Yet it calls a method that sends a SetDesiredCapacity request of asgSize-1 to the ASG, thereby telling the ASG to scale in by one and terminate an instance of its own choosing, resulting in the observed behavior (see the sketch below).
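A self-contained paraphrase of that placeholder path (stand-in types and names, not the upstream source) to make the failure mode concrete:

```go
package main

import (
	"fmt"
	"strings"
)

// Stand-ins for the provider's cached types; names are illustrative only.
const placeholderInstanceNamePrefix = "i-placeholder-"

type asg struct {
	Name    string
	curSize int
}

type instanceRef struct {
	Name string
}

// deleteInstances mimics the current behavior: a placeholder never maps to a
// concrete EC2 instance, so the only action taken is to lower the ASG's
// desired capacity by one and let the ASG pick a victim itself.
func deleteInstances(group *asg, instances []instanceRef, setDesiredCapacity func(name string, size int) error) error {
	for _, inst := range instances {
		if strings.HasPrefix(inst.Name, placeholderInstanceNamePrefix) {
			fmt.Printf("instance %s is detected as a placeholder, decreasing ASG requested size\n", inst.Name)
			if err := setDesiredCapacity(group.Name, group.curSize-1); err != nil {
				return err
			}
			group.curSize--
			continue
		}
		// A real instance would instead be terminated by ID, which is safe.
	}
	return nil
}

func main() {
	// 3 instances were running, 1 more launched late, 1 placeholder remains.
	group := &asg{Name: "example-asg", curSize: 4}
	placeholders := []instanceRef{{Name: "i-placeholder-example-asg-0"}}
	_ = deleteInstances(group, placeholders, func(name string, size int) error {
		// This is the request that can terminate the late-arriving instance.
		fmt.Printf("SetDesiredCapacity(%s, %d)\n", name, size)
		return nil
	})
}
```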

I can imagine a number of fixes for this; however, I don't know the code well enough to comment on side effects.

  1. To my knowledge the placeholder instances are only ever checked in the DeleteInstances method of the asgCache; maybe that functionality could be removed altogether?
  2. A straightforward fix is to simply remove the placeholder instance from the asgCache when it is encountered, without sending the request to scale down.

Will create a PR to implement 2).
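A rough sketch of what option 2 could look like (stand-in types again; this is an illustration of the idea, not the actual PR):

```go
package sketch

import "strings"

const placeholderInstanceNamePrefix = "i-placeholder-"

// cachedInstance stands in for the entries tracked by the asgCache.
type cachedInstance struct{ Name string }

// pruneFailedPlaceholders drops failed placeholders from the cache without
// touching the ASG's desired capacity, so no running instance can be
// terminated as collateral damage.
func pruneFailedPlaceholders(cache map[string]cachedInstance, failed []cachedInstance) {
	for _, inst := range failed {
		if strings.HasPrefix(inst.Name, placeholderInstanceNamePrefix) {
			// Forget the placeholder; do not send SetDesiredCapacity.
			delete(cache, inst.Name)
		}
		// Real instances with create errors would still be deleted as before.
	}
}
```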

@steelecliftonberry

It would be great to get some insight/review here. This bug frequently (roughly daily) kills long-running batch jobs related to our model training and optimisation, by causing accidental node scale-down at times when AWS is out of instances (which is a constant occurrence).

This happens even though we set the do-not-evict annotation on these Pods, because, as described above, the AWS out-of-instances condition is not properly factored in.

@gicuuu3

gicuuu3 commented Jul 19, 2023

Hello, hello!

We're also being impacted by this issue, and it would be nice to get an update on when a fix can be expected. If not, are there any known workarounds to avoid the problem? (We've tried options like the scale-down-after-scale-up delay, but it didn't seem to work.)

Our setup has stateful pods so having the scaling group kill active nodes breaks the sessions.

Thank you for the time and effort!

@theintz
Author

theintz commented Jul 26, 2023

Hi @gicuuu3, there is a PR open with a fix that's working for us; however, I'm not sure whether it will get merged, since it might have implications for other use cases.
One more thing we did that has helped is to configure the CA such that it can pick from a number of EC2 instance classes rather than just one: instance_types = ["m6a.xlarge", "m5a.xlarge", "m6i.xlarge", "m5.xlarge", "m4.xlarge"]. This allows the underlying ASG to pick any of these classes (in descending priority), so that even when m6a and m5a are out of capacity, it can still spin up a new instance.
With these two measures in place, we haven't seen any nodes lost. I believe that either of these approaches would also work.

@gicuuu3

gicuuu3 commented Jul 26, 2023

Understood.

We've started using mixed instances and it has also helped in our scenario. I guess we'll keep track of the PR, and if we ever start hitting the problem again we can apply it to our own setup as well.

Thank you for the help!

@gicuuu3

gicuuu3 commented Oct 25, 2023

In case anyone is still struggling with this issue, another thing that helped us is enabling AWS Auto Scaling group scale-in protection:

https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html

Since our autoscaler always issues targeted removal requests, this toggle protects us from random removal due to capacity reduction, while still allowing the autoscaler to remove nodes it doesn't need in the cluster by "name".
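For anyone who wants to script that toggle, a minimal sketch with aws-sdk-go (the group name is illustrative; the same setting is available in the console, CLI, and most IaC tools):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Protect newly launched instances from scale-in, so a drop in desired
	// capacity cannot terminate them. Targeted, by-name removals (what the
	// autoscaler does when it drains a node) are unaffected by this setting.
	if _, err := svc.UpdateAutoScalingGroup(&autoscaling.UpdateAutoScalingGroupInput{
		AutoScalingGroupName:             aws.String("example-node-group"), // illustrative name
		NewInstancesProtectedFromScaleIn: aws.Bool(true),
	}); err != nil {
		log.Fatal(err)
	}
}
```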

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 31, 2024
@apy-liu

apy-liu commented Jan 31, 2024

We've also observed this happening multiple times across different clusters using cluster autoscaler v1.24.3.

  • The cluster autoscaler attempts to scale up the ASG and hits an OutOfCapacity error creating the placeholder instances as mentioned.
  • The ASG which received the scale-up request continues to attempt launching the instance and successfully does so after about 2 minutes.
  • Cluster autoscaler has already considered the scale-up failed and tries to clean up the placeholder instances with no knowledge of the successful launch.

I suspect this started happening more frequently because of this feature to early abort on OutOfCapacity errors: #4489.

The logs from the intended behavior of the PR match what we see on the cluster:

1 clusterstate.go:1043] Failed adding 14 nodes (14 unseen previously) to group kubernetes-minion-test-mc-a-amd-asg-us-east-1a due to OutOfResource.placeholder-cannot-be-fullfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
...
1 auto_scaling_groups.go:287] instance i-placeholder-kubernetes-minion-test-mc-a-amd-asg-us-east-1b-0 is detected as a placeholder, decreasing ASG requested size instead of deleting instance

This means the cluster autoscaler doesn't recognize that an instance was successfully launched, due to the early abort on the first FAILED status, and the decreased desiredCount removes a running instance with potential workloads.

If that's the case, it seems like the proposed PR above could be one fix.

Another one may be exposing a flag to respect the backoff duration even when hitting an OutOfCapacity error: #5756

@apy-liu

apy-liu commented Jan 31, 2024

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label Jan 31, 2024
@jlamillan
Contributor

jlamillan commented Feb 5, 2024

> This means cluster autoscaler doesn't recognize there was a successfully launched instance due to the early abort on the first FAILED status and the decreased desiredCount removes a running instance with potential workloads.

@apy-liu if requested instance(s) do not show up in DescribeAutoScalingGroups, they are added as "placeholder" instances - meaning they are requested but not yet created.

For example, if you attempt to scale up an ASG from 3 -> 6, but only 1 of the 3 new instances is provisioned before an InsufficientInstanceCapacity error is detected, placeholder instances will be created for the 2 missing instances which the ASG reportedly has no capacity for. Looking at the code, once that InsufficientInstanceCapacity error occurs, any requested-but-not-allocated instances for that ASG are written off as unfulfillable. DeleteInstances will be called with the "unfulfillable" instances, which in this case is handled simply by decreasing the size of the ASG by 2 (from 6 -> 4). It does seem possible that one or more of the so-called "unfulfillable" instances could actually be provisioned on the provider before DeleteInstances is called to effectively abort the original request.

If instead DeleteInstances did not reduce the size of the ASG by 2 and just deleted the placeholder instances from its cache, as in the corresponding PR, the ongoing/pending request for additional capacity on the provider may affect what capacity is available in a different ASG, basically creating a capacity deadlock.

@apy-liu

apy-liu commented Feb 5, 2024

Hey @jlamillan, thanks for taking a look and providing a detailed explanation. That makes sense to me. In that case, the PR which adds a flag to respect the backoff duration seems like a more stable approach.

> It does seem possible that one or more of the so-called "unfulfillable" instances could actually be provisioned on the provider before DeleteInstances is called to effectively abort the original request.

Attaching some sample logs to provide some context. This was what led us to that conclusion.

CA Logs:

  • 23:42:59.970791 1 auto_scaling_groups.go:438] Instance group <> has only 3 instances created while the requested count is 4. Creating placeholder instances.
  • 23:43:00.248897 1 auto_scaling_groups.go:481] ASG <> scaling failed with {AutoScalingGroupARN: "<>", AutoScalingGroupName: "<>",
  • 23:43:00.248971 1 auto_scaling_groups.go:446] Instance group <> cannot provision any more nodes!
  • 23:43:50.698297 1 clusterstate.go:1008] Found 1 instances with errorCode OutOfResource.placeholder-cannot-be-fulfilled in nodeGroup <>
  • 23:43:50.698330 1 clusterstate.go:1026] Failed adding 1 nodes (1 unseen previously) to group <> due to OutOfResource.placeholder-cannot-be-fulfilled; errorMessages=[]string{"AWS cannot provision any more instances for this node group"}
  • 23:43:50.698422 1 static_autoscaler.go:692] Deleting 1 from <> node group because of create errors
  • 23:43:50.698440 1 auto_scaling_groups.go:297] instance i-placeholder- is detected as a placeholder, decreasing ASG requested size instead of deleting instance
  • 23:43:50.698446 1 auto_scaling_groups.go:248] Setting asg <> size to 3

AWS console:

  • 23:41:33 - 23:41:33. Failed - Launching a new EC2 instance. Status reason: We currently do not have sufficient...
  • 23:41:43 - 23:41:43. Failed - Launching a new EC2 instance. Status reason: We currently do not have sufficient...
  • 23:41:53 - 23:41:53. Failed - Launching a new EC2 instance. Status reason: We currently do not have sufficient...
  • 23:42:03 - 23:42:03. Failed - Launching a new EC2 instance. Status reason: We currently do not have sufficient...
  • 23:43:07 - 23:43:23. Successful - launch of new EC2 instance in response to difference between desired and actual
  • 23:43:53 - 23:45:35. Successful - User request (CA) set group desired capacity changing the desired capacity from 4 to 3.

Right before the successful start at 23:43:07, the CA checks the status of the node group, which has a FAILED status at 23:43:00, and does a cleanup through the early-abort feature. It does look like this causes a race between the ASG finishing the scale-up and the CA cleaning up placeholder instances.

@jlamillan
Contributor

jlamillan commented Feb 7, 2024

@apy-liu or @theintz - curious if either of you have looked into using termination policies or scale-in protection as @gicuuu3 suggested as an alternative way to resolve this issue?

@apy-liu

apy-liu commented Feb 12, 2024

Yes, we're looking into the suggested mitigations, though still hoping to get to a resolution on this through one of the proposed fixes.

@apy-liu

apy-liu commented Mar 13, 2024

Hi @gjtempleton, this was the issue we raised during the SIG meeting. We would be interested to hear your feedback on it and on the suggested PR to fix it, or potentially a different approach, as we're still looking for a long-term solution. Thanks

towca added the area/provider/aws label Mar 21, 2024
@kmsarabu

kmsarabu commented Apr 2, 2024

Hi @gjtempleton, could you please outline the next steps and recommend the best approach to address this issue? Additionally, is there any restriction on the range of older versions to which a fix can be backported? Thanks

@apy-liu

apy-liu commented Apr 11, 2024

Hi @gjtempleton, thank you for providing your feedback at this week's SIG meeting. It sounds like the proposed solution would be adding a check before cleaning up the placeholder instance and updating the internal asgCache to reflect the new instance if it does come up.

You had mentioned drafting a PR for the above, so I wanted to check where we could follow along. I also wanted to follow up on the question @kmsarabu posed above about the range of older versions to which backporting is supported. Thanks!
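For the record, a rough sketch of the kind of check being discussed (the helper name and the exact decision logic are assumptions for illustration, not the contents of the actual PR):

```go
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
)

// placeholderCanBeDropped re-checks the ASG before a failed placeholder is
// allowed to trigger a capacity decrement. If the group already has as many
// real instances as its desired capacity, the "failed" launch actually came
// up late, so the placeholder should only be removed from the cache rather
// than shrinking the group.
func placeholderCanBeDropped(svc autoscalingiface.AutoScalingAPI, asgName string) (bool, error) {
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil || len(out.AutoScalingGroups) == 0 {
		return false, err
	}
	group := out.AutoScalingGroups[0]
	// Desired capacity already met by real instances: dropping the placeholder
	// is enough, and lowering the desired capacity would kill a running node.
	return int64(len(group.Instances)) >= aws.Int64Value(group.DesiredCapacity), nil
}
```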

ruiscosta added a commit to ruiscosta/autoscaler that referenced this issue May 10, 2024
This merge resolves the issue where the Kubernetes Cluster Autoscaler incorrectly decommissions actual instances instead of placeholders within AWS ASGs. The fix ensures that only placeholders are considered for scaling down when recent scaling activities fail, thereby preventing the accidental removal of active nodes. Enhanced unit tests and checks are included to ensure robustness.

Fixes kubernetes#5829
ruiscosta linked a pull request May 10, 2024 that will close this issue
@drmorr0
Contributor

drmorr0 commented May 16, 2024

@gjtempleton I am uncertain about the proposed fix (checking for recent scaling activity before changing the ASG size). I unfortunately missed the SIG meeting where this was discussed, but I think it doesn't solve the problem; it just narrows the window in which a race can occur (see my comment on #6818).

I think the ideal way to handle this (and how I've seen it done in the past) is to enable scale-in protection on the ASGs (as mentioned by @gicuuu3). This is the only way I know of to safely change the size of an ASG without accidentally terminating running nodes.

I will try to join the SIG meeting this Monday to discuss further.

@daimaxiaxie
Contributor

The problem is not limited to AWS; it also exists on other clouds, for example AliCloud. The root of the problem is that scaling activities on the cloud are asynchronous.
