
Cluster Autoscaler not backing off exhausted node group #6240

Closed
elohmeier opened this issue Nov 1, 2023 · 5 comments · Fixed by #6750
Labels
area/cluster-autoscaler, area/provider/hetzner, kind/bug

Comments

@elohmeier

elohmeier commented Nov 1, 2023

Which component are you using?:

Cluster Autoscaler

What version of the component are you using?:

registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.26.6+k3s1
Kustomize Version: v4.5.7
Server Version: v1.26.6+k3s1

What environment is this in?:

Hetzner Cloud

What did you expect to happen?:

When the cluster autoscaler is configured with the priority expander and multiple node groups of differing priorities are provided, it should back off after some time if the cloud provider fails to provision nodes in the high-priority node group due to resource unavailability, and then proceed to the lower-priority node groups.

What happened instead?:

The high-priority node group (pool1 in the log below) currently has no resources available to provision the requested nodes.
The cluster autoscaler is stuck in a loop trying to provision nodes in the high-priority group and never proceeds to pool2 (lower priority, resources available). I have also tried setting --max-node-group-backoff-duration=1m, with no effect.

W1101 05:15:05.399825       1 hetzner_servers_cache.go:94] Fetching servers from Hetzner API
I1101 05:15:15.806488       1 hetzner_node_group.go:438] Set node group draining-node-pool size from 0 to 0, expected delta 0
I1101 05:15:15.806519       1 hetzner_node_group.go:438] Set node group pool1 size from 1 to 1, expected delta 0
I1101 05:15:15.806525       1 hetzner_node_group.go:438] Set node group pool2 size from 0 to 0, expected delta 0
I1101 05:15:15.808727       1 scale_up.go:608] Scale-up: setting group pool1 size to 4
E1101 05:15:16.068533       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:16.079704       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:16.126786       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
W1101 05:15:16.126816       1 hetzner_servers_cache.go:94] Fetching servers from Hetzner API
I1101 05:15:26.655179       1 hetzner_node_group.go:438] Set node group pool1 size from 1 to 1, expected delta 0
I1101 05:15:26.655243       1 hetzner_node_group.go:438] Set node group pool2 size from 0 to 0, expected delta 0
I1101 05:15:26.655257       1 hetzner_node_group.go:438] Set node group draining-node-pool size from 0 to 0, expected delta 0
I1101 05:15:26.660093       1 scale_up.go:608] Scale-up: setting group pool1 size to 4
E1101 05:15:26.948368       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:26.981452       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)
E1101 05:15:27.044150       1 hetzner_node_group.go:117] failed to create error: could not create server type ccx43 in region fsn1: we are unable to provision servers for this location, try with a different location or try later (resource_unavailable)

How to reproduce it (as minimally and precisely as possible):

apiVersion: v1
data:
  priorities: |
    10:
      - pool2
    20:
      - pool1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    spec:
      containers:
      - command:
        - ./cluster-autoscaler
        - --scale-down-unneeded-time=5m
        - --cloud-provider=hetzner
        - --stderrthreshold=info
        - --nodes=0:4:CCX43:FSN1:pool1
        - --nodes=0:4:CCX43:NBG1:pool2
        - --expander=priority
        env:
        - name: HCLOUD_IMAGE
          value: debian-11
        - name: HCLOUD_TOKEN
          valueFrom:
            secretKeyRef:
              key: token
              name: hcloud
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4
        name: cluster-autoscaler
      serviceAccountName: cluster-autoscaler

Anything else we need to know?:

@elohmeier elohmeier added the kind/bug label Nov 1, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jan 31, 2024
@elohmeier
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale label Feb 13, 2024
@towca towca added the area/cluster-autoscaler and area/provider/hetzner labels Mar 21, 2024
@apricote
Member

apricote commented Apr 2, 2024

Is there something we need to do in the cloud provider to make this possible? Otherwise this looks like an area/core-autoscaler issue.

@tallaxes
Contributor

@apricote Looking at the relevant provider code, it seems possible that on failure it logs a message but neglects to return the error from IncreaseSize. This means the core autoscaler has no indication that the scale-up failed. (More generally, this could affect any provider that neglects to report an error from IncreaseSize ...)
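
For illustration, here is a minimal sketch of that pattern, assuming a deliberately simplified node group; it is not the actual hetzner_node_group.go code, and createServer and the struct fields are hypothetical stand-ins. The point is that the provisioning error must be returned from IncreaseSize rather than only logged, so the core autoscaler can register the failure and back off the node group.

package main

import (
	"fmt"
	"log"
)

// nodeGroup is a hypothetical, simplified node group used only for this sketch.
type nodeGroup struct {
	targetSize int
	maxSize    int
}

// createServer stands in for the provider's server-creation call; here it
// always fails in the same way the log above shows when capacity is exhausted.
func createServer(n *nodeGroup) error {
	return fmt.Errorf("we are unable to provision servers for this location (resource_unavailable)")
}

// IncreaseSize returns any provisioning error to the caller. Logging the
// error without returning it (the behavior described above) leaves the core
// autoscaler unaware that the scale-up failed, so it never backs off the group.
func (n *nodeGroup) IncreaseSize(delta int) error {
	if delta <= 0 {
		return fmt.Errorf("delta must be positive, have: %d", delta)
	}
	if n.targetSize+delta > n.maxSize {
		return fmt.Errorf("size increase is too large: current %d, desired %d, max %d",
			n.targetSize, n.targetSize+delta, n.maxSize)
	}
	for i := 0; i < delta; i++ {
		if err := createServer(n); err != nil {
			log.Printf("failed to create server: %v", err)
			// Returning the error is the crucial part: it lets the core
			// autoscaler mark the scale-up as failed and back off this group.
			return fmt.Errorf("failed to increase node group size: %w", err)
		}
		n.targetSize++
	}
	return nil
}

func main() {
	ng := &nodeGroup{targetSize: 1, maxSize: 4}
	if err := ng.IncreaseSize(3); err != nil {
		log.Printf("scale-up failed, the core autoscaler can now back off: %v", err)
	}
}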

@apricote
Member

Thanks for the hint @tallaxes! I opened a PR to properly return encountered errors.
