
azure-npm daemonset pods request 250m CPU instead of 10m. Is this configurable? #2792

Open
DaveOHenry opened this issue Feb 14, 2022 · 32 comments

Comments

@DaveOHenry

DaveOHenry commented Feb 14, 2022

Duplicate of #2033 which was closed without an answer.

@ghost ghost added the triage label Feb 14, 2022
@ghost

ghost commented Feb 14, 2022

Hi DaveOHenry, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, check whether it is covered by the AKS Troubleshooting guides or AKS Diagnostics.
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@ghost ghost added the action-required label Feb 16, 2022
@ghost

ghost commented Feb 16, 2022

Triage required from @Azure/aks-pm

@ghost

ghost commented Feb 21, 2022

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Feb 21, 2022
@ngbrown

ngbrown commented Feb 23, 2022

I would also like an answer on this. The last posting on #2033 was by @paulgmiller:

Azure NPM has had some improvements recently to try and address OOMs. That said, we also have a setting on our side that can boost the memory limit, so if you're still affected, file a support ticket and we'll try to help out.

But the issue described here is the baseline CPU reservation in the manifest for azure-npm, not the amount of memory the process uses.

In my cluster, azure-npm is reserving 13% of the CPU per node while really using ~0%.

Both @miwithro and @juan-lee were tagged on the closed, but unresolved issue.

@vakalapa

Because azure-npm is a daemonset, we apply default memory and CPU limits irrespective of the cluster size. Can you give your AKS cluster FQDN? I can check whether it is recommended to reduce those limits or not.

Even though NPM at steady state is not using a ton of CPU, when a flood of events comes in or NPM restarts for some reason, a reduced CPU limit on a fairly large cluster creates a high chance of NPM ending up in an OOM kill loop.

@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Feb 24, 2022
@ngbrown

ngbrown commented Feb 24, 2022

Lowering the requested CPU should not cause an out-of-memory kill loop, as it has nothing to do with memory. It also doesn't limit the maximum allowed CPU usage. By lowering resources.requests.cpu, more pods will fit per node. This is especially acute on 2 vCPU nodes.
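
For reference, you can check how much CPU the scheduler already considers reserved on a node with something like this (the node name is a placeholder):

    kubectl describe node <node-name> | grep -A 8 "Allocated resources"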

@vakalapa

vakalapa commented Feb 24, 2022

NPM pods watch pod, namespace, and netpol-related events and, due to some inefficiencies (which we are actively working to solve), can only work on one event at a time. When an NPM pod starts, this causes incoming events to pile up and memory usage to grow. So reducing the CPU limit would result in fewer events being processed, in turn further aggravating memory usage. We are working on an improved design, and until then we want to review a cluster's size before we reduce the CPU limits.

I agree that this limit can be costly for smaller node sizes. We have been exploring options for making these limits dynamic based on node size, but I am afraid we do not have a solution in sight. Until then we need to rely on ad-hoc requests to either increase or decrease the limits based on cluster size. Sorry for the inconvenience.

@ngbrown

ngbrown commented Feb 24, 2022

CPU time is a much more fluid resource than memory. The CPU will get to everything eventually, even on a highly loaded machine. Most clusters are going to be way oversubscribed if you add up the maximum limits, but Kubernetes prevents scheduling anything additional on a node once the requests add up to more than 100%. So requesting a minimum reservation of 13% on a two-vCore instance (or 25% on a single-vCore instance) while not consistently using it unreasonably prevents pods from being scheduled.

Other daemon sets usually set their requests.cpu right above idle, while keeping the limits.cpu much higher.

[Chart: CPU requests, limits, and usage for other daemon sets (purple: limits, yellow: requests, blue: max usage over 30 min)]

In the azure-npm case, this issue and the one before it were about lowering the minimum reserved CPU to just above idle while keeping the maximum limit at the current value.

[Chart: CPU requests, limits, and usage for the azure-npm daemon set]
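
To illustrate the shape being asked for, the container's resources block would look roughly like this (the request value is hypothetical, picked as "just above idle"; the limit and memory numbers mirror the values reported later in this thread):

    resources:
      requests:
        cpu: 15m         # hypothetical value, just above observed idle usage
        memory: 300Mi    # unchanged
      limits:
        cpu: 250m        # keep the existing ceiling for event floods
        memory: 1000Mi   # unchanged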

@ghost ghost added the action-required label Feb 26, 2022
@ghost

ghost commented Feb 26, 2022

Triage required from @Azure/aks-pm @vakalapa

@ghost ghost added the stale Stale issue label Apr 27, 2022
@ghost

ghost commented Apr 27, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ngbrown

ngbrown commented Apr 27, 2022

This should still be implemented in AKS.

#not-stale

@ghost ghost removed the stale Stale issue label Apr 27, 2022
@ghost ghost added the stale Stale issue label Jun 27, 2022
@ghost

ghost commented Jun 27, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@DaveOHenry
Author

DaveOHenry commented Jun 27, 2022 via email

@ghost ghost removed the stale Stale issue label Jun 27, 2022
@ghost ghost added the stale Stale issue label Aug 26, 2022
@ghost

ghost commented Aug 26, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ngbrown

ngbrown commented Sep 2, 2022

The advice I've gotten from Microsoft employees is that our cluster's networkPolicy should be switched from azure to calico, because this and other issues aren't being taken care of.
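
For anyone weighing that option: the network policy engine is normally chosen when the cluster is created, so switching usually means standing up a new cluster. A minimal sketch, assuming the standard az aks create flags (resource group and cluster name are placeholders):

    # Create a new AKS cluster that uses Calico instead of Azure NPM for NetworkPolicy
    az aks create \
      --resource-group <resource-group> \
      --name <cluster-name> \
      --network-plugin azure \
      --network-policy calico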

@ghost ghost removed the stale Stale issue label Sep 2, 2022
@Kapsztajn

This should be addressed from the MS side... 250m is way too much for a CPU request when the pod uses basically nothing...

@ghost ghost added the stale Stale issue label Dec 7, 2022
@ghost

ghost commented Dec 7, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@Kapsztajn

Not stale, bump.

@ghost ghost removed the stale Stale issue label Dec 7, 2022
@allyford allyford added feature-request Requested Features and removed triage action-required labels Feb 3, 2023
@westy-crey

I'd like to join in and also ping this issue. Requesting a quarter of a whole CPU is not justified for this level of usage. The limits are naturally fine to keep. This is a heavily contributing factor in why a k8s cluster over-reserves itself and allocates more nodes than are actually necessary.

@mieky

mieky commented Jul 4, 2023

We also encountered a case where trying to apply a NetworkPolicy to multiple namespaces at once had our azure-npm CPU usage go through the roof (related to #2823).

@jwstauber2

Bump, currently experiencing this problem as well, needs a resolution.

@kethahel99

kethahel99 commented Jul 26, 2023

+1, we are also experiencing this issue. It is very disruptive. 250 millicores is a lot.

@cjdell

cjdell commented Aug 8, 2023

Needs resolving. We have 3 instances of this thing using up 0.75 cores and we are now at the limit of resources for our cluster. There is barely any CPU activity so it obviously does not need it.

@welersonlisboa

Needs resolving... the issue is still happening for azure-npm.

@tomkukral

I'd like to get this solved... we are running 400 AKS nodes and azure-npm costs us a lot of money due to the incorrect requests configuration.
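
For scale: at 250m requested per node, that is 400 × 0.25 = 100 vCPUs reserved cluster-wide for azure-npm alone, independent of what it actually uses.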

@joshcoburn

I would also like to see this resolved.. at least give users the ability to tune this.

@thilan3547

Need to resolve this; it is costing us too much money at the moment.

@viktor-gustafsson

Would also like to see this resolved; such a high amount of requested resources seems ridiculous for what it is doing. This is the primary cost in our AKS cluster.

@yashak

yashak commented Jul 26, 2024

Needs resolving... the issue is still happening for azure-npm.

@restenb

restenb commented Aug 20, 2024

Same issue. azure-npm pods have high requests:

     Limits:
       cpu:     250m
       memory:  1000Mi
     Requests:
       cpu:     250m
       memory:  300Mi

While usage over time in a running cluster is usually quite low, around 1-3% of the request...
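
A quick way to compare actual consumption against that 250m request (this assumes the metrics server is available; the k8s-app=azure-npm label selector is an assumption about how the pods are labelled, so adjust it if needed):

    # Show live CPU/memory usage of the azure-npm pods
    kubectl top pod -n kube-system -l k8s-app=azure-npm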

@seguler

seguler commented Sep 5, 2024

The team is discussing a few improvements here. We will evaluate automatic scaling in the future, but we may be able to release a reduction in the short term. @vakalapa will update this thread once the plan is solidified.

@yashak

yashak commented Oct 16, 2024

Hi @vakalapa ! Any updates on this?
