nrpe.cfg.in thresholds for check_load seem too small #171

box293 · 2017-10-16T23:01:08Z

In the nrpe/sample-config/nrpe.cfg.in file there is a check_load command:

command[check_load]=@pluginsdir@/check_load -r -w .15,.10,.05 -c .30,.25,.20

Perhaps the thresholds are a little low. How about:

command[check_load]=@pluginsdir@/check_load -r -w 15.0,10.0,5.0 -c 30.0,25.0,20.0

@minusdavid Do you have any comments? These were set as part of 2c935fb.

minusdavid · 2017-10-23T23:22:33Z

Take a look at http://man7.org/linux/man-pages/man3/getloadavg.3.html and http://man7.org/linux/man-pages/man1/uptime.1.html. The check_load plugin uses getloadavg and thus uses the same numbers: https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_load.c#L329

1.0 represents 100% of 1 CPU.

So .30 represents 30% of 1 CPU.

Using the -r option, you specify load in terms of 1 CPU, even on a multi-CPU system.

So with the following:
command[check_load]=@pluginsdir@/check_load -r -w .15,.10,.05 -c .30,.25,.20

You'd get warnings if you have a 1 CPU system with 15% load at 1 minute, 10% load at 5 minutes, and 5% load at 15 minutes. If you ran "uptime" at the same time Nagios did a check, it would say .15,.10,.05.

If you had a 2 CPU system with 15% load on each CPU, you'd have a total load of 30% CPU usage, which in a tool like 'uptime' would show up as .3.

I'm looking at an 8 CPU system right now and 'uptime' says "load average: 1.38, 1.67, 1.83".

That means that over the last minute 138% of 800% CPU was being used (or 1.38 of 8 in the decimal notation that uptime, getloadavg, and Nagios's check_load plugin use).

If you specified the following:
command[check_load]=@pluginsdir@/check_load -r -w 15.0,10.0,5.0 -c 30.0,25.0,20.0

Your system would (hopefully) never warn, because you'd be saying you want warnings at 1500% usage of 1 CPU at 1 minute, 1000% usage at 5 minutes, and 500% usage at 15 minutes.

In terms of 'uptime' on a 1 CPU system, it would warn when you saw something like this: load average: 15.0, 10.0, 5.0. If you run 'uptime' or 'top' on one of your systems now, you probably won't see that. Or if you do... you need to upgrade your system because that is a really, really high load.
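
To make the comparison concrete, here is a minimal Python sketch of how three load averages could be checked against the -w and -c triples. The function name classify_load is hypothetical, and the rules it uses (strictly greater than, with any single average able to trip the state) are assumptions for illustration rather than something taken from the check_load source. The inputs are assumed to already be on the same scale as the thresholds, i.e. per-CPU values when -r is used.

```python
def classify_load(loads, warn, crit):
    """Return "OK", "WARNING" or "CRITICAL" for (1, 5, 15 minute) loads.

    Assumption: a state is raised as soon as any one average is
    strictly greater than its matching threshold.
    """
    status = "OK"
    for load, warn_level, crit_level in zip(loads, warn, crit):
        if load > crit_level:
            return "CRITICAL"
        if load > warn_level:
            status = "WARNING"
    return status


# Hypothetical 1 CPU system reporting "load average: 0.20, 0.05, 0.01".
sample = (0.20, 0.05, 0.01)

# Thresholds from the current sample config.
print(classify_load(sample, (0.15, 0.10, 0.05), (0.30, 0.25, 0.20)))  # WARNING

# Thresholds proposed at the top of this issue.
print(classify_load(sample, (15.0, 10.0, 5.0), (30.0, 25.0, 20.0)))  # OK
```

Under these assumptions the small thresholds trip on a lightly loaded machine, while the large ones would need a load of 15 per CPU before anything is reported.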

minusdavid · 2017-10-23T23:26:45Z

That said... the numbers in the sample "-r -w .15,.10,.05 -c .30,.25,.20" might be too low.

If I look at that 8 core system again: "load average: 1.38, 1.67, 1.83" is based on 8 cores. If we divide by 8... we'd get "0.17, 0.21, 0.23".
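
The division can be reproduced with a few lines of Python. This is only a sketch of the per-CPU scaling described in this thread: per_cpu_load is a hypothetical helper name, and the live reading at the end assumes a Unix-like host where os.getloadavg() is available.

```python
import os


def per_cpu_load(load_averages, cpu_count):
    """Divide each load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(avg / cpu_count for avg in load_averages)


# The 8 core figures quoted above.
scaled = per_cpu_load((1.38, 1.67, 1.83), 8)
print(", ".join(f"{value:.2f}" for value in scaled))  # 0.17, 0.21, 0.23

# Live values for the current host, if the platform can provide them.
try:
    print(per_cpu_load(os.getloadavg(), os.cpu_count() or 1))
except (AttributeError, OSError):
    print("load average not available on this platform")
```

These scaled numbers are the ones to compare against thresholds given with -r.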

It is a somewhat busy system but there is still tonnes of CPU power left, so I don't know. I'm not a full-time sysadmin; I'm a devops/jack-of-all-trades.

My original commit was made when I was testing my Nagios config and I realized that the sample "command[check_load]=@pluginsdir@/check_load -w 15,10,5 -c 30,25,20" was not doing what it looked like it was doing.

minusdavid · 2017-10-23T23:30:56Z

Depending on your version of 'top', you can press 1 on your keyboard to get an overview of all your CPUs at the same time. Use 'd' to change the delay to something approaching real time and you can get a good sense of where 'uptime'/'getloadavg' are getting their aggregate scores.

box293 · 2017-10-23T23:36:11Z

That is all excellent information and it makes sense.

What is required is better documentation on thresholds for the different plugins; including this information would greatly help. This is something I plan on publishing in the Nagios Support Knowledgebase in the near future.

As for the example thresholds, I'll leave that up to the devs to decide if we should leave them as they are.

minusdavid · 2017-10-23T23:42:41Z

Sounds good to me. I was just looking to see if John or I added any extra documentation at the time, but it doesn't look like it.

Cheers for planning on publishing details about the thresholds!

box293 · 2017-11-10T03:16:05Z

Here is the KB article I've created on this topic; it links back to here.

https://support.nagios.com/kb/article.php?id=771

frayber · 2019-02-15T11:14:18Z

I agree with box293: they are too low.
The old ones:

check_load -w 15,10,5 -c 30,25,20

were too high!

I think good values could be these:

check_load -r -w .8,.6,.5 -c .9,.7,.6

minusdavid · 2019-02-18T23:51:27Z

It's just a sample-config file, but having saner sample values does sound reasonable. Send a pull request?

Please refer to NagiosEnterprises#171. Even if it's not simple to set general cpu_load thresholds for every system, I suggest increasing these values because they are too low. Generally, a load of 1 per core is considered the bottleneck, so I suggest these new values.

frayber mentioned this issue Feb 19, 2019

nrpe.cfg.in thresholds for check_load #205

Closed
