
[Question] Aks Autoscaling not working properly #3348

Closed
teocrispy91 opened this issue Nov 18, 2022 · 10 comments

@teocrispy91

I have an AKS cluster with 4 node pools consisting of Windows and Linux node pools and a total of 700 namespaces in it. The total node count stays between 50 and 60 all the time. I cleared down more than 200 namespaces that were utilizing the cluster, but the cluster still runs between 50 and 60 nodes, while the average CPU and memory usage of the cluster is very low, below 50% at all times. I'm still not sure why the scale-down is not happening properly after clearing down the namespaces; autoscaling on the VMSS is all in place and working, but it only scales in between 50 and 60 nodes.

@teocrispy91 teocrispy91 changed the title [Question] [Question] Aks Autoscaling not working properly Nov 18, 2022
@carvido1

Hello @teocrispy91 .

Do you have any daemonsets installed on every node (monitoring agents, service mesh, mTLS, etc.) that could create enough CPU/memory consumption to prevent the autoscaler from acting on the nodes?

There are no fixed rules relating the SKU of the nodes to the number of replicas, but this is something that has to be optimised.

Could you provide a description of your nodes so there is more information about your setup?

You can use

kubectl describe nodes
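
If it helps, a quick way to get an overview of the daemonsets in the cluster and how much they request (the daemonset name and namespace below are placeholders):

# list every daemonset in the cluster and the namespace it lives in
kubectl get daemonsets --all-namespaces

# inspect the resource requests of a specific daemonset's pod template
kubectl describe daemonset <daemonset-name> -n <namespace>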

BR.

@teocrispy91
Author

teocrispy91 commented Nov 18, 2022

Actually, kube-system is one of the most highly utilised namespaces when I checked with Grafana. kube-system has many daemonsets, right, needed for every node to work properly; other than that I don't think there are daemonsets provisioned. Also, when I did kubectl describe node, in the CPU limits I could see it going over 200%, where my node pool size is D4s_v3 (4 cores). Will that be the problem? The daemonsets currently in the cluster are from kube-system and the CSI driver, nothing else apart from this.

@carvido1

Hello @teocrispy91

Some of the daemonsets, to be more exact the daemonsets from kube-system, are required for a Kubernetes cluster to be fully operational. If you don't use the CSI driver, you can disable that feature for your cluster and save some resources.
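
If you do decide to turn off the CSI storage drivers, a rough sketch of what that could look like with the Azure CLI (resource group and cluster name are placeholders; check the AKS CSI driver docs for the flags supported by your CLI version):

# disable the Azure Disk and Azure File CSI drivers on an existing cluster
az aks update --resource-group <resource-group> --name <cluster-name> --disable-disk-driver --disable-file-driver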

Regarding the daemonsets you have in kube-system, some of them are there because system node pools have certain daemonsets deployed on them. I recommend the article Use system node pools in AKS; you can have 1 system node pool and then different user node pools.

Hope this helps you with your problem.
BR

@joaguas

joaguas commented Nov 24, 2022

@teocrispy91 @carvido1

When it comes to resources (excluding other factors like affinities, zonal constraints, etc.), the cluster autoscaler (whether scaling up or down) takes action based on resource requests and not actual resource consumption.

Scaling up happens when the resources requested by a pod that needs to be scheduled can't be met by any of the existing worker nodes (pod requests > node's allocatable resources - node's allocated resources). The autoscaler then checks whether a new node from a given node pool could fulfill the pod's criteria (requests, tolerations, zonal constraints, etc.).
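
One way to see this in practice (the pod name and namespace below are placeholders): scale-up is driven by pods the scheduler can't place, and the cluster autoscaler records its decision as events on those pods.

# pods that the scheduler can't place are what trigger a scale-up
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# the Events section shows whether the autoscaler triggered (or declined) a scale-up for the pod
kubectl describe pod <pending-pod-name> -n <namespace>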

Scaling down happens when a node is considered a candidate for deletion. This can happen when a node is under-utilized, but again, this is based on resource requests vs. the node's allocatable resources and not on actual resource consumption/load:

If you're using the default values for the cluster autoscaler profile, the utilization threshold will be 50%. This means a node is below the threshold when the combined resource requests of the pods running on it are less than 50% of the node's allocatable resources, independently of whether the node's actual CPU and memory load is 10% or 90%.

If your node has 4000m allocatable CPU and there are 2 pods running there each requesting 1500m CPU (a total of 3000m), then this node (when it comes to CPU) is considered to be at 75% utilization; the pods' actual CPU consumption at a given moment (whether 10m or 1600m) won't change this.
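
To check those numbers on a real cluster, something like this should list each pod's CPU request and the node it landed on (the custom-columns expression is just one way to pull out the request field):

kubectl get pods --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,NODE:.spec.nodeName,CPU_REQUESTS:.spec.containers[*].resources.requests.cpu'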

When you describe a node, you get the following data at the bottom:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1208m (63%)   17230m (906%)
  memory             1688Mi (37%)  20934Mi (458%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:              <none>

For a node to be under-utilized, the percentages in the second column (Requests) would need to be below what's defined in the autoscaler profile's scale-down-utilization-threshold flag.
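
For reference, that threshold can be tuned through the cluster autoscaler profile. A rough Azure CLI sketch (resource group and cluster name are placeholders; 0.5 is the documented default, and the first link below lists the valid profile keys):

# allow fuller nodes to be considered for scale-down by raising the threshold
az aks update --resource-group <resource-group> --name <cluster-name> --cluster-autoscaler-profile scale-down-utilization-threshold=0.65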

https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler#using-the-autoscaler-profile
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-is-cluster-autoscaler-different-from-cpu-usage-based-node-autoscalers
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-up-work

@ghost ghost added the action-required label Dec 19, 2022
@ghost

ghost commented Dec 24, 2022

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Dec 24, 2022
@ghost

ghost commented Jan 9, 2023

Issue needing attention of @Azure/aks-leads

1 similar comment
@ghost

ghost commented Jan 24, 2023

Issue needing attention of @Azure/aks-leads

@sabbour
Contributor

sabbour commented Feb 3, 2023

@teocrispy91 does @joaguas response answer your question?

@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Feb 3, 2023
@ghost ghost added the action-required label Mar 1, 2023
@ghost

ghost commented Mar 6, 2023

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Mar 6, 2023
@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Mar 10, 2023
@ghost ghost added the stale Stale issue label Mar 18, 2023
@ghost ghost closed this as completed Mar 25, 2023
@ghost

ghost commented Mar 25, 2023

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. teocrispy91, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 24, 2023