cluster-autoscaler clusterapi provider performance degrades when there are a high number of node groups #6784
Labels
area/provider/cluster-api
kind/bug
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: all versions up to and including 1.30.0
What k8s version are you using (kubectl version)?:
this affects all kubernetes versions that are compatible with the cluster autoscaler
What environment is this in?:
clusterapi provider, with more than 50 node groups (e.g. MachineDeployments, MachineSets, MachinePools)
What did you expect to happen?:
i expect the cluster autoscaler to operate as normal, regardless of the number of node groups
What happened instead?:
as the number of node groups increases, the performance of the autoscaler appears to degrade. each scan takes longer and longer to complete, and in some cases (when the node groups number in the hundreds) it can take more than 40 minutes to add a new node when pods are pending.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
this problem appears to be related to how the clusterapi provider interacts with the api server. when assessing activity in the cluster, the provider queries the api server for all the node groups, then queries again for the scalable resources, and potentially a third time for the infrastructure machine template. i suspect that these repeated queries are causing the slowdown.
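to make the suspected scaling behaviour concrete, here is a rough toy model of the query pattern described above. it is not the provider's actual code: FakeAPIServer and scan_once are hypothetical names, and the assumption that the second and third queries happen once per node group is mine, not something i have confirmed in the source. it only counts requests and does not measure latency.

```python
from dataclasses import dataclass


@dataclass
class FakeAPIServer:
    """Hypothetical stand-in for the API server; it only counts requests."""
    calls: int = 0

    def request(self, what: str) -> str:
        self.calls += 1
        return what


def scan_once(api: FakeAPIServer, node_groups: int, uses_template: bool = True) -> int:
    """Model one autoscaler pass and return the number of requests it made.

    Assumption: the scalable resource and the infrastructure machine
    template are each fetched once per node group.
    """
    before = api.calls
    api.request("list node groups")  # one query for all the node groups
    for i in range(node_groups):
        api.request(f"scalable resource {i}")  # second query, per node group
        if uses_template:
            # potential third query, per node group
            api.request(f"infrastructure machine template {i}")
    return api.calls - before


if __name__ == "__main__":
    api = FakeAPIServer()
    for n in (10, 50, 200):
        print(n, "node groups ->", scan_once(api, n), "requests per scan")
```

under these assumptions the request count per scan grows linearly with the number of node groups (1 + 2n when the template is fetched), which would fit with scans slowing down as node groups are added.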
i think it's possible that extending the scan interval time might alleviate some of the issues, but i have not confirmed anything yet.