[Question] Aks Autoscaling not working properly #3348
Hello @teocrispy91. Do you have any daemonsets installed on every node, e.g. monitoring agents, service mesh, or mTLS sidecars, that could create enough CPU/memory consumption to prevent the autoscaler from acting on nodes? There are no rules tying the SKU of the nodes to the number of replicas, but this has to be optimised. Could you provide a description of your nodes so there is more information about your setup? You can use `kubectl describe nodes`. BR.
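For reference, a couple of commands (assuming you have `kubectl` access to the cluster) to see which daemonsets schedule a pod on every node and to inspect per-node requests:

```shell
# List all daemonsets across namespaces (each one runs a pod on every matching node)
kubectl get daemonsets --all-namespaces

# Show each node's allocatable resources and the requests/limits of the pods on it
kubectl describe nodes
```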
Actually, kube-system is among the most highly utilised namespaces when I checked with Grafana. kube-system has many daemonsets, which are needed for every node to work properly; other than that I don't think there are daemonsets provisioned. Also, when I ran `kubectl describe node`, I could see the CPU limits going over 200%, and my nodepool is of size D4s_v3 (4 cores). Could that be the problem? The daemonsets currently in the cluster are from kube-system and the CSI driver, nothing else.
Hello @teocrispy91. Some of the daemonsets, specifically the ones from kube-system, are required for a Kubernetes cluster to be fully operational. If you don't use the CSI driver, you can disable that feature for your cluster and save some resources. Regarding the daemonsets you have in kube-system: some of them are there because system nodepools have certain daemonsets deployed on them. I recommend the article "Use system node pools in AKS"; you can have one system nodepool and then different user nodepools. Hope this helps with your problem.
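As an illustration (names and resource group below are placeholders, not from your setup), a user nodepool can be added alongside the system nodepool with the Azure CLI:

```shell
# Add a User-mode nodepool so application workloads don't compete
# with the system daemonsets on the system nodepool.
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool1 \
  --mode User \
  --node-count 2 \
  --enable-cluster-autoscaler --min-count 1 --max-count 5
```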
When it comes to resources (excluding other factors like affinities, zonal constraints, etc.), the cluster autoscaler (whether scaling up or down) acts on resource *requests*, not actual resource consumption.

Scaling up happens when the resources requested by a pod that needs to be scheduled can't be met by any of the existing worker nodes (pod requests > node's allocatable resources - node's allocated resources). The autoscaler then checks whether a new node from a given nodepool can fulfill the pod's criteria (requests, tolerations, zonal constraints, etc.).

Scaling down happens when a node is considered a candidate for deletion. This can happen when a node is considered under-utilized, but again, this is based on resource requests vs node resources, not actual consumption/load. If you're using the default values for the cluster autoscaler profile, the utilization threshold is 50%: a node is below the threshold when the combined resource requests of the pods running on it are less than 50% of the node's allocatable resources, regardless of whether the node's actual CPU and memory load is 10% or 90%. For example, if your node has 4000m allocatable CPU and there are 2 pods running on it, each requesting 1500m CPU (a total of 3000m), then this node (CPU-wise) is considered to be at 75% utilization; the pods' actual CPU consumption at a given moment (be it 10m or 1600m) won't change this.

When you describe a node, the "Allocated resources" section at the bottom shows, for each resource, the total requests and limits both as absolute values and as percentages of the node's allocatable. For a node to be under-utilized, those request percentages would need to be below the threshold defined in the autoscaler profile: https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler#using-the-autoscaler-profile
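The request-based calculation described above can be sketched in a few lines (this is an illustration of the arithmetic, not the autoscaler's actual code; the 0.5 threshold is the documented `scale-down-utilization-threshold` default):

```python
def node_utilization(allocatable_m: int, pod_requests_m: list[int]) -> float:
    """Fraction of a node's allocatable CPU covered by pod *requests*.

    Actual CPU consumption never enters this calculation; only the
    requests declared on the pods scheduled to the node matter.
    """
    return sum(pod_requests_m) / allocatable_m

# Node with 4000m allocatable CPU, two pods each requesting 1500m:
util = node_utilization(4000, [1500, 1500])
print(f"{util:.0%}")  # 75%

# Scale-down eligibility compares requests against the threshold (default 0.5),
# so this node is NOT an under-utilization candidate, even if its real load is 10%.
print(util < 0.5)  # False
```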
Action required from @Azure/aks-pm |
Issue needing attention of @Azure/aks-leads |
@teocrispy91, does @joaguas's response answer your question? |
Action required from @Azure/aks-pm |
This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @teocrispy91, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question, issue, or suggestion. |
I have an AKS cluster with 4 nodepools (Windows and Linux) and a total of 700 namespaces. The total node count stays between 50 and 60 all the time. I cleared down more than 200 namespaces that were utilizing the cluster, but the cluster still runs at 50-60 nodes, while the average CPU and memory usage of the cluster is very low, below 50% all the time. I'm still not sure why scale-down is not happening properly after clearing the namespaces; VMSS autoscaling is in place and working, but it only scales between 50 and 60 nodes.