
azure-npm daemonset pods request 250m CPU instead of 10m. Is this configurable? #2792

Open
DaveOHenry opened this issue Feb 14, 2022 · 32 comments

Comments

@DaveOHenry

DaveOHenry commented Feb 14, 2022

Duplicate of #2033 which was closed without an answer.

@ghost ghost added the triage label Feb 14, 2022
@ghost

ghost commented Feb 14, 2022

Hi DaveOHenry, AKS bot here 👋
Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

  1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
  2. Please abide by the AKS repo Guidelines and Code of Conduct.
  3. If you're having an issue, check whether it is covered by the AKS Troubleshooting guides or AKS Diagnostics.
  4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
  5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
  6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

@ghost ghost added the action-required label Feb 16, 2022
@ghost

ghost commented Feb 16, 2022

Triage required from @Azure/aks-pm

@ghost

ghost commented Feb 21, 2022

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Feb 21, 2022
@ngbrown

ngbrown commented Feb 23, 2022

I would also like an answer on this. The last posting on #2033 was by @paulgmiller:

Azure NPM has had some improvements recently to try and address OOMs. That said, we also have a setting on our side that can boost the memory limit, so if you're still affected, file a support ticket and we'll try to help out.

But the issue described here is the baseline CPU reservation in the manifest for azure-npm, not the amount of memory the process uses.

In my cluster, azure-npm is reserving 13% of the CPU per node while really using ~0%.

Both @miwithro and @juan-lee were tagged on the closed, but unresolved issue.

@vakalapa

Because azure-npm is a daemonset, we apply default memory and CPU limits irrespective of the cluster size. Can you give your AKS cluster FQDN? I can check whether it is recommended to reduce those limits or not.

Even though NPM at steady state is not using a ton of CPU, when a flood of events comes in or NPM restarts for some reason, a reduced CPU limit on a fairly large cluster creates a high chance of NPM ending up in an OOM kill loop.

@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels Feb 24, 2022
@ngbrown

ngbrown commented Feb 24, 2022

Lowering the requested CPU should not cause an out-of-memory kill loop, as it has nothing to do with memory. It also doesn't limit the maximum allowed CPU usage. By lowering resources.requests.cpu, more pods will fit per node. This is especially acute on 2 vCPU nodes.
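
For reference, you can check how much CPU the scheduler already considers reserved on a node with something like this (the node name is a placeholder):

    kubectl describe node <node-name> | grep -A 8 "Allocated resources"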

@vakalapa

vakalapa commented Feb 24, 2022

NPM pods watch pod, namespace, and netpol-related events and, due to some inefficiencies (which we are actively working to solve), can only work on one event at a time. When an NPM pod starts, this causes incoming events to pile up and memory usage to grow. So reducing the CPU limit would result in fewer events being processed, in turn further aggravating memory usage. We are working on an improved design, and until then we want to review a cluster's size before we reduce the CPU limits.

I agree that this limit can be costly for smaller node sizes. We have been exploring options for making these limits dynamic based on node size, but I am afraid we do not have a solution in sight. Until then we need to rely on ad-hoc requests to either increase or decrease the limits based on cluster size. Sorry for the inconvenience.

@ngbrown

ngbrown commented Feb 24, 2022

CPU time is a much more fluid resource than memory. The CPU will get to everything eventually, even on a highly loaded machine. Most clusters are going to be way oversubscribed if you add up the maximum limits, but Kubernetes prevents scheduling anything additional on a node once the requests add up to more than 100%. So requesting a minimum reservation of 13% on a two-vCore instance (or 25% on a single-vCore instance) while not consistently using it unreasonably prevents pods from being scheduled.

Other daemon sets usually set their requests.cpu right above idle, while keeping the limits.cpu much higher.

[Chart: CPU requests, limits, and usage for other daemon sets (purple: limits, yellow: requests, blue: max usage over 30 min)]

In the azure-npm case, this issue and the one before it were about lowering the minimum reserved CPU to just above idle while keeping the maximum limit at the current value.

[Chart: CPU requests, limits, and usage for the azure-npm daemon set]
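
To illustrate the shape being asked for, the container's resources block would look roughly like this (the request value is hypothetical, picked as "just above idle"; the limit and memory numbers mirror the values reported later in this thread):

    resources:
      requests:
        cpu: 15m         # hypothetical value, just above observed idle usage
        memory: 300Mi    # unchanged
      limits:
        cpu: 250m        # keep the existing ceiling for event floods
        memory: 1000Mi   # unchanged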

@ghost ghost added the action-required label Feb 26, 2022
@ghost

ghost commented Feb 26, 2022

Triage required from @Azure/aks-pm @vakalapa

@ghost ghost added the stale Stale issue label Apr 27, 2022
@ghost

ghost commented Apr 27, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ngbrown

ngbrown commented Apr 27, 2022

This should still be implemented in AKS.

#not-stale

@ghost ghost removed the stale Stale issue label Apr 27, 2022
@ghost ghost added the stale Stale issue label Jun 27, 2022
@ghost

ghost commented Jun 27, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@DaveOHenry
Author

DaveOHenry commented Jun 27, 2022 via email

@ghost ghost removed the stale Stale issue label Jun 27, 2022
@ghost ghost added the stale Stale issue label Aug 26, 2022
@ghost

ghost commented Aug 26, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@ngbrown

ngbrown commented Sep 2, 2022

The advice I've gotten from Microsoft employees is that our cluster's networkPolicy should be switched from azure to calico, because this and other issues aren't being taken care of.
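
For anyone weighing that option: the network policy engine is normally chosen when the cluster is created, so switching usually means standing up a new cluster. A minimal sketch, assuming the standard az aks create flags (resource group and cluster name are placeholders):

    # Create a new AKS cluster that uses Calico instead of Azure NPM for NetworkPolicy
    az aks create \
      --resource-group <resource-group> \
      --name <cluster-name> \
      --network-plugin azure \
      --network-policy calico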

@ghost ghost removed the stale Stale issue label Sep 2, 2022
@Kapsztajn

This should be addressed from the MS side... 250m is way too much for a CPU request when the pod uses basically nothing...

@ghost ghost added the stale Stale issue label Dec 7, 2022
@ghost

ghost commented Dec 7, 2022

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

@Kapsztajn

Not stale, bump.

@ghost ghost removed the stale Stale issue label Dec 7, 2022
@allyford allyford added feature-request Requested Features and removed triage action-required labels Feb 3, 2023
@westy-crey

I'd like to join in and also ping this issue. Requesting a quarter of a whole CPU is not justified for this level of usage. The limits are naturally fine to keep. This is a heavily contributing factor in why a k8s cluster over-reserves itself and allocates more nodes than are actually necessary.

@mieky

mieky commented Jul 4, 2023

We also encountered a case where trying to apply a NetworkPolicy to multiple namespaces at once had our azure-npm CPU usage go through the roof (related to #2823).

@jwstauber2

Bump, currently experiencing this problem as well, needs a resolution.

@kethahel99

kethahel99 commented Jul 26, 2023

+1, we are also experiencing this issue. It is very disruptive. 250 millicores is a lot.

@cjdell

cjdell commented Aug 8, 2023

Needs resolving. We have 3 instances of this thing using up 0.75 cores and we are now at the limit of resources for our cluster. There is barely any CPU activity so it obviously does not need it.

@welersonlisboa

Needs resolving... the issue is still happening for azure-npm.

@tomkukral

I'd like to get this solved... we are running 400 AKS nodes and azure-npm costs us a lot of money due to the incorrect requests configuration.
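
For scale: at 250m requested per node, that is 400 × 0.25 = 100 vCPUs reserved cluster-wide for azure-npm alone, independent of what it actually uses.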

@joshcoburn

I would also like to see this resolved.. at least give users the ability to tune this.

@thilan3547

Need to resolve this; it is costing us too much money at the moment.

@viktor-gustafsson

Would also like to see this resolved; such a high amount of requested resources seems ridiculous for what it is doing. This is the primary cost in our AKS cluster.

@yashak

yashak commented Jul 26, 2024

Needs resolving... the issue is still happening for azure-npm.

@restenb

restenb commented Aug 20, 2024

Same issue. azure-npm pods have high requests:

     Limits:
       cpu:     250m
       memory:  1000Mi
     Requests:
       cpu:     250m
       memory:  300Mi

While usage over time in a running cluster is usually quite low, around 1-3% of the request...
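
A quick way to compare actual consumption against that 250m request (this assumes the metrics server is available; the k8s-app=azure-npm label selector is an assumption about how the pods are labelled, so adjust it if needed):

    # Show live CPU/memory usage of the azure-npm pods
    kubectl top pod -n kube-system -l k8s-app=azure-npm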

@seguler

seguler commented Sep 5, 2024

The team is discussing a few improvements here. We will evaluate automatic scaling in the future, but we may be able to release a reduction in the short term. @vakalapa will update this thread once the plan is solidified.

@yashak

yashak commented Oct 16, 2024

Hi @vakalapa ! Any updates on this?
