
Cluster autoscaler is not able to scale down a Rancher-managed cluster once the OpenStack quota has been reached #6778

Open
dirkdaems opened this issue Apr 29, 2024 · 1 comment
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


dirkdaems commented Apr 29, 2024

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
image v1.26.2, Helm chart version 9.28.0

Component version:

What k8s version are you using (kubectl version)?:

Client Version: v1.29.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.25.7+rke2r1
WARNING: version difference between client (1.29) and server (1.25) exceeds the supported minor version skew of +/-1

What environment is this in?:
A Rancher-managed Kubernetes cluster on an OpenStack-based cloud at CloudFerro.

What did you expect to happen?:
The autoscaler should scale down the worker nodes.

What happened instead?:
The autoscaler was not able to scale down the worker nodes.

How to reproduce it (as minimally and precisely as possible):

  • Deploy a Rancher-managed Kubernetes cluster on an OpenStack-based cloud.
  • Start a workload that exceeds the OpenStack quota or exhausts the available OpenStack resources (a sketch of such a workload follows this list).
  • Stop the workload.
  • The autoscaler is no longer able to scale down the cluster.
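
For step 2, any workload whose aggregate resource requests cannot fit on the existing nodes will do. Below is a minimal sketch using client-go against the downstream cluster; the deployment name, namespace, replica count, and CPU request are placeholders chosen only to push the autoscaler to scale up until the OpenStack quota is exhausted, and are not taken from the original report.

```go
// Minimal sketch: create a Deployment whose total CPU requests cannot be
// scheduled on the current nodes, forcing the autoscaler to scale up past
// the OpenStack quota. All names and sizes are illustrative placeholders.
package main

import (
	"context"
	"log"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Kubeconfig pointing at the downstream (workload) cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	labels := map[string]string{"app": "quota-burner"}
	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "quota-burner", Namespace: "default"},
		Spec: appsv1.DeploymentSpec{
			Replicas: int32Ptr(50), // high enough to exceed the OpenStack quota
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "pause",
						Image: "registry.k8s.io/pause:3.9",
						Resources: corev1.ResourceRequirements{
							Requests: corev1.ResourceList{
								corev1.ResourceCPU: resource.MustParse("4"),
							},
						},
					}},
				},
			},
		},
	}
	_, err = clientset.AppsV1().Deployments("default").
		Create(context.Background(), deploy, metav1.CreateOptions{})
	if err != nil {
		log.Fatal(err)
	}
	// Deleting the Deployment afterwards corresponds to step 3 ("Stop the
	// workload"); the idle worker nodes should then become scale-down candidates.
}
```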

Anything else we need to know?:
Cluster autoscaler logs for one of the nodes:

I0425 13:03:40.312857       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 3m21.487973745s
I0425 13:03:52.515211       1 klogx.go:87] Node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal - cpu utilization 0.015625
I0425 13:03:52.519922       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 3m34.23905605s
I0425 13:03:52.523658       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 3m34.23905605s
I0425 13:04:04.972495       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 3m46.499174803s
I0425 13:04:04.975815       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 3m46.499174803s
I0425 13:04:16.631155       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 3m58.90075957s
I0425 13:04:16.634672       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 3m58.90075957s
I0425 13:04:29.390475       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 4m10.640029717s
I0425 13:04:29.396629       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 4m10.640029717s
I0425 13:04:41.576473       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 4m23.321489133s
I0425 13:04:41.579416       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 4m23.321489133s
I0425 13:04:54.313302       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 4m35.509533415s
I0425 13:04:54.319305       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 4m35.509533415s
I0425 13:05:07.957279       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 4m48.244325294s
I0425 13:05:07.962578       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 4m48.244325294s
I0425 13:05:20.710157       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 5m1.887658037s
I0425 13:05:20.714317       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 5m1.887658037s
I0425 13:05:33.421047       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 5m14.640802814s
I0425 13:05:33.426948       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 5m14.640802814s
I0425 13:05:45.684018       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 5m27.351820225s
I0425 13:05:45.690245       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 5m27.351820225s
I0425 13:05:58.905082       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:00:16.487044958 +0000 UTC m=+569.633138942 duration 5m40.170426795s
I0425 13:05:58.907849       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 5m40.170426795s
I0425 13:05:58.924520       1 delete.go:103] Successfully added ToBeDeletedTaint on node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal
I0425 13:05:58.924993       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal", UID:"2c53afa4-d0d7-4b52-b51f-de7f1d68d316", APIVersion:"v1", ResourceVersion:"137156560", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' marked the node as toBeDeleted/unschedulable
I0425 13:05:59.021127       1 actuator.go:161] Scale-down: removing empty node "k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal"
I0425 13:05:59.030817       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"a267dd01-cbc6-4ead-a61e-bb266a07e221", APIVersion:"v1", ResourceVersion:"137160441", FieldPath:""}): type: 'Normal' reason: 'ScaleDownEmpty' (combined from similar events): Scale-down: removing empty node "k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal"
E0425 13:06:04.022640       1 actuator.go:423] Scale-down: couldn't delete empty node, , status error: failed to delete k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal: could not find providerID in machine: k8s-stag-worker-hma-2xlarge-875ff7777-2zgkc/fleet-default
I0425 13:06:04.107440       1 delete.go:197] Releasing taint {Key:ToBeDeletedByClusterAutoscaler Value:1714050358 Effect:NoSchedule TimeAdded:<nil>} on node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal
I0425 13:06:04.122406       1 delete.go:228] Successfully released ToBeDeletedTaint on node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal
I0425 13:06:04.126439       1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal", UID:"2c53afa4-d0d7-4b52-b51f-de7f1d68d316", APIVersion:"v1", ResourceVersion:"137156560", FieldPath:""}): type: 'Warning' reason: 'ScaleDownFailed' failed to delete empty node: failed to delete k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal: could not find providerID in machine: k8s-stag-worker-hma-2xlarge-875ff7777-2zgkc/fleet-default
I0425 13:06:11.518848       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:06:09.042539031 +0000 UTC m=+922.188633015 duration 0s
I0425 13:06:11.522651       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 0s
I0425 13:06:22.852026       1 klogx.go:87] Node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal - cpu utilization 0.015625
I0425 13:06:22.858971       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:06:09.042539031 +0000 UTC m=+922.188633015 duration 12.495445283s
I0425 13:06:22.864304       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 12.495445283s
I0425 13:06:35.234276       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:06:09.042539031 +0000 UTC m=+922.188633015 duration 23.840029531s
I0425 13:06:35.238608       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 23.840029531s
I0425 13:06:46.833159       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:06:09.042539031 +0000 UTC m=+922.188633015 duration 36.227815032s
I0425 13:06:46.837574       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 36.227815032s
I0425 13:06:59.391476       1 klogx.go:87] Node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal - cpu utilization 0.015625
I0425 13:06:59.396173       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:06:09.042539031 +0000 UTC m=+922.188633015 duration 47.828123033s
I0425 13:06:59.400410       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 47.828123033s
I0425 13:07:11.918098       1 nodes.go:84] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal is unneeded since 2024-04-25 13:06:09.042539031 +0000 UTC m=+922.188633015 duration 1m0.374052861s
I0425 13:07:11.922539       1 nodes.go:123] k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal was unneeded for 1m0.374052861s

Worker node event logs:

Events:
  Type     Reason           Age               From                Message
  ----     ------           ----              ----                -------
  Normal   ScaleDown        36m               cluster-autoscaler  marked the node as toBeDeleted/unschedulable
  Warning  ScaleDownFailed  36m               cluster-autoscaler  failed to delete empty node: failed to delete k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal: could not find providerID in machine: k8s-stag-worker-hma-2xlarge-875ff7777-4fxtv/fleet-default
  Normal   RegisteredNode   30m               node-controller     Node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal event: Registered Node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal in Controller
  Normal   RegisteredNode   24m               node-controller     Node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal event: Registered Node k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal in Controller
  Warning  ScaleDownFailed  14m               cluster-autoscaler  failed to delete empty node: failed to delete k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal: could not find providerID in machine: k8s-stag-worker-hma-2xlarge-875ff7777-72kqc/fleet-default
  Normal   ScaleDown        9m (x2 over 14m)  cluster-autoscaler  marked the node as toBeDeleted/unschedulable
  Warning  ScaleDownFailed  8m54s             cluster-autoscaler  failed to delete empty node: failed to delete k8s-stag-worker-hma-2xlarge-d497dd7b-jzpq9.novalocal: could not find providerID in machine: k8s-stag-worker-hma-2xlarge-875ff7777-2zgkc/fleet-default
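
The "could not find providerID in machine" error above suggests that the machine object the autoscaler resolves for the node has no spec.providerID set, which is what one would expect for a machine whose OpenStack instance was never actually created because the quota was already exhausted. The sketch below shows one way to check this from the Rancher management cluster; the cluster.x-k8s.io/v1beta1 Machine GVR and the fleet-default namespace are assumptions inferred from the log output above, not confirmed provider internals, and this is diagnostic code rather than the autoscaler's own implementation.

```go
// Diagnostic sketch: list the cluster.x-k8s.io Machine objects in the
// fleet-default namespace of the Rancher management cluster and report any
// that are missing spec.providerID. Group/version and namespace are assumed
// from the log output in this issue.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var machineGVR = schema.GroupVersionResource{
	Group:    "cluster.x-k8s.io",
	Version:  "v1beta1",
	Resource: "machines",
}

func main() {
	// Kubeconfig must point at the Rancher management cluster, not the
	// downstream cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	machines, err := client.Resource(machineGVR).Namespace("fleet-default").
		List(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range machines.Items {
		providerID, found, _ := unstructured.NestedString(m.Object, "spec", "providerID")
		if !found || providerID == "" {
			// A Machine without a providerID is typically one whose backing
			// OpenStack instance was never created, e.g. because the quota
			// was already exhausted at scale-up time.
			fmt.Printf("machine %s has no spec.providerID\n", m.GetName())
		}
	}
}
```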
dirkdaems added the kind/bug label on Apr 29, 2024

dirkdaems commented May 8, 2024

Looks like this is solved by upgrading to Rancher server version 2.8.3, in combination with RKE2 v1.29.4 and cluster autoscaler version 1.29.
