
[Question] Aks Autoscaling not working properly #3348

Closed
teocrispy91 opened this issue Nov 18, 2022 · 10 comments

@teocrispy91

I have an AKS cluster with 4 node pools consisting of Windows and Linux node pools and a total of 700 namespaces in it. The total node count stays between 50 and 60 all the time. I cleared down more than 200 namespaces that were utilizing the cluster, but the cluster still runs between 50 and 60 nodes, while the average CPU and memory usage of the cluster is very low, below 50% at all times. I'm still not sure why the scale-down is not happening properly after clearing down the namespaces; autoscaling on the VMSS is all in place and working, but it only scales in between 50 and 60 nodes.

@teocrispy91 teocrispy91 changed the title [Question] [Question] Aks Autoscaling not working properly Nov 18, 2022
@carvido1

Hello @teocrispy91 .

Do you have any daemonsets installed on every node (monitoring agents, service mesh, mTLS, etc.) that could create enough CPU/memory consumption to prevent the autoscaler from acting on the nodes?

There are no fixed rules relating the SKU of the nodes to the number of replicas, but this is something that has to be optimised.

Could you provide a description of your nodes so there is more information about your setup?

You can use

kubectl describe nodes
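
If it helps, a quick way to get an overview of the daemonsets in the cluster and how much they request (the daemonset name and namespace below are placeholders):

# list every daemonset in the cluster and the namespace it lives in
kubectl get daemonsets --all-namespaces

# inspect the resource requests of a specific daemonset's pod template
kubectl describe daemonset <daemonset-name> -n <namespace>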

BR.

@teocrispy91
Author

teocrispy91 commented Nov 18, 2022

Actually, kube-system is one of the most highly utilised namespaces when I checked with Grafana. kube-system has many daemonsets, right, needed for every node to work properly; other than that I don't think there are daemonsets provisioned. Also, when I did kubectl describe node, in the CPU limits I could see it going over 200%, where my node pool size is D4s_v3 (4 cores). Will that be the problem? The daemonsets currently in the cluster are from kube-system and the CSI driver, nothing else apart from this.

@carvido1

Hello @teocrispy91

Some of the daemonsets, to be more exact the daemonsets from kube-system, are required for a Kubernetes cluster to be fully operational. If you don't use the CSI driver, you can disable that feature for your cluster and save some resources.
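
If you do decide to turn off the CSI storage drivers, a rough sketch of what that could look like with the Azure CLI (resource group and cluster name are placeholders; check the AKS CSI driver docs for the flags supported by your CLI version):

# disable the Azure Disk and Azure File CSI drivers on an existing cluster
az aks update --resource-group <resource-group> --name <cluster-name> --disable-disk-driver --disable-file-driver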

Regarding the daemonsets you have in kube-system, some of them are there because system node pools have certain daemonsets deployed on them. I recommend the article Use system node pools in AKS; you can have 1 system node pool and then different user node pools.

Hope this helps you with your problem.
BR

@joaguas

joaguas commented Nov 24, 2022

@teocrispy91 @carvido1

When it comes to resources (excluding other factors like affinities, zonal constraints, etc.), the cluster autoscaler (whether scaling up or down) takes action based on resource requests and not actual resource consumption.

Scaling up happens when the resources requested by a pod that needs to be scheduled can't be met by any of the existing worker nodes (pod requests > node's allocatable resources - node's allocated resources). The autoscaler then checks whether a new node from a given node pool could fulfill the pod's criteria (requests, tolerations, zonal constraints, etc.).
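
One way to see this in practice (the pod name and namespace below are placeholders): scale-up is driven by pods the scheduler can't place, and the cluster autoscaler records its decision as events on those pods.

# pods that the scheduler can't place are what trigger a scale-up
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# the Events section shows whether the autoscaler triggered (or declined) a scale-up for the pod
kubectl describe pod <pending-pod-name> -n <namespace>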

Scaling down happens when a node is considered a candidate for deletion. This can happen when a node is under-utilized, but again, this is based on resource requests vs. the node's allocatable resources and not on actual resource consumption/load:

If you're using the default values for the cluster autoscaler profile, the utilization threshold will be 50%. This means a node is below the threshold when the combined resource requests of the pods running on it are less than 50% of the node's allocatable resources, independently of whether the node's actual CPU and memory load is 10% or 90%.

If your node has 4000m allocatable CPU and there are 2 pods running there each requesting 1500m CPU (a total of 3000m), then this node (when it comes to CPU) is considered to be at 75% utilization; the pods' actual CPU consumption at a given moment (whether 10m or 1600m) won't change this.
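
To check those numbers on a real cluster, something like this should list each pod's CPU request and the node it landed on (the custom-columns expression is just one way to pull out the request field):

kubectl get pods --all-namespaces -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,NODE:.spec.nodeName,CPU_REQUESTS:.spec.containers[*].resources.requests.cpu'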

When you describe a node, you get the following data at the bottom:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1208m (63%)   17230m (906%)
  memory             1688Mi (37%)  20934Mi (458%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:              <none>

For a node to be under-utilized, the percentages in the second column (Requests) would need to be below what's defined in the autoscaler profile's scale-down-utilization-threshold flag.
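
For reference, that threshold can be tuned through the cluster autoscaler profile. A rough Azure CLI sketch (resource group and cluster name are placeholders; 0.5 is the documented default, and the first link below lists the valid profile keys):

# allow fuller nodes to be considered for scale-down by raising the threshold
az aks update --resource-group <resource-group> --name <cluster-name> --cluster-autoscaler-profile scale-down-utilization-threshold=0.65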

https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler#using-the-autoscaler-profile
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-is-cluster-autoscaler-different-from-cpu-usage-based-node-autoscalers
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-up-work

@ghost ghost added the action-required label Dec 19, 2022
@ghost

ghost commented Dec 24, 2022

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Dec 24, 2022
@ghost

ghost commented Jan 9, 2023

Issue needing attention of @Azure/aks-leads

1 similar comment
@ghost

ghost commented Jan 24, 2023

Issue needing attention of @Azure/aks-leads

@sabbour
Contributor

sabbour commented Feb 3, 2023

@teocrispy91 does @joaguas response answer your question?

@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Feb 3, 2023
@ghost ghost added the action-required label Mar 1, 2023
@ghost

ghost commented Mar 6, 2023

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Mar 6, 2023
@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Mar 10, 2023
@ghost ghost added the stale Stale issue label Mar 18, 2023
@ghost ghost closed this as completed Mar 25, 2023
@ghost

ghost commented Mar 25, 2023

This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. teocrispy91, feel free to comment again in the next 7 days to reopen, or open a new issue after that time if you still have a question/issue or suggestion.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 24, 2023