nrpe.cfg.in thresholds for check_load seem too small #171

Open
box293 opened this issue Oct 16, 2017 · 8 comments

Comments

@box293
Contributor

box293 commented Oct 16, 2017

In the nrpe/sample-config/nrpe.cfg.in file there is a check_load command:

command[check_load]=@pluginsdir@/check_load -r -w .15,.10,.05 -c .30,.25,.20

Perhaps the thresholds are a little low. How about:

command[check_load]=@pluginsdir@/check_load -r -w 15.0,10.0,5.0 -c 30.0,25.0,20.0

@minusdavid Do you have any comments? These were set as part of 2c935fb.

@minusdavid
Contributor

Take a look at http://man7.org/linux/man-pages/man3/getloadavg.3.html and http://man7.org/linux/man-pages/man1/uptime.1.html. The check_load plugin uses getloadavg and thus uses the same numbers: https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_load.c#L329

1.0 represents 100% of 1 CPU.

So .30 represents 30% of 1 CPU.

Using the -r option, you specify load in terms of 1 CPU, even on a multi-CPU system.

So with the following:
command[check_load]=@pluginsdir@/check_load -r -w .15,.10,.05 -c .30,.25,.20

You'd get warnings if you have a 1 CPU system with 15% load at 1 minute, 10% load at 5 minutes, and 5% load at 15 minutes. If you ran "uptime" at the same time Nagios did a check, it would say .15,.10,.05.

If you had a 2 CPU system with 15% load on each CPU, you'd have a total load of 30% CPU usage, which in a tool like 'uptime' would show up as .3.
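The effect of -r on these thresholds can be sketched in a few lines of Python. This is a simplified model, not the plugin's code: the helper names `per_cpu` and `classify` are made up for illustration, and treating a value equal to its threshold as a breach is an assumption (check_load.c is the authority on the exact comparison).

```python
# Simplified model of "check_load -r -w .15,.10,.05 -c .30,.25,.20".
# Hypothetical helpers; the ">=" comparison is an assumption.

WARN = (0.15, 0.10, 0.05)  # 1, 5 and 15 minute warning thresholds
CRIT = (0.30, 0.25, 0.20)  # 1, 5 and 15 minute critical thresholds


def per_cpu(load, cpus):
    """Scale raw load averages (as shown by uptime) down to one CPU."""
    return tuple(value / cpus for value in load)


def classify(load, warn, crit):
    """Return CRITICAL, WARNING or OK for a triple of load averages."""
    if any(value >= limit for value, limit in zip(load, crit)):
        return "CRITICAL"
    if any(value >= limit for value, limit in zip(load, warn)):
        return "WARNING"
    return "OK"


# A 2 CPU host where uptime reports "load average: 0.36, 0.16, 0.08".
raw = (0.36, 0.16, 0.08)

print(classify(per_cpu(raw, 2), WARN, CRIT))  # with -r: per-CPU load is compared
print(classify(raw, WARN, CRIT))              # without -r: raw load is compared
```

In this model the 2 CPU host only warns when the load is scaled per CPU, while the same raw numbers compared directly would count as critical.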

I'm looking at an 8 CPU system right now and 'uptime' says "load average: 1.38, 1.67, 1.83".

That means that over the last minute, 138% of 800% CPU was being used on average (or 1.38 of 8 in decimal notation, which is what uptime, getloadavg, and Nagios's check_load plugin use).

If you specified the following:
command[check_load]=@pluginsdir@/check_load -r -w 15.0,10.0,5.0 -c 30.0,25.0,20.0

Your system would (hopefully) never warn, because you'd be saying you want warnings at 1500% usage of 1 CPU at 1 minute, 1000% usage at 5 minutes, and 500% usage at 15 minutes.

In terms of 'uptime' on a 1 CPU system, it would warn when you saw something like this: load average: 15.0, 10.0, 5.0. (With -r on a multi-CPU system, the raw 'uptime' figures would have to be that many times higher again.) If you run 'uptime' or 'top' on one of your systems now, you probably won't see that. Or if you do... you need to upgrade your system, because that is a really, really high load.

@minusdavid
Contributor

That said... the numbers in the sample "-r -w .15,.10,.05 -c .30,.25,.20" might be too low.

If I look at that 8 core system again: "load average: 1.38, 1.67, 1.83" is based on 8 cores. If we divide by 8, we get "0.17, 0.21, 0.23".
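For anyone who wants to repeat that division on their own machine, here is a minimal Python sketch. The function names are hypothetical, and using `os.cpu_count()` as the divisor is an assumption; the CPU count that check_load -r uses may be determined differently on some systems.

```python
import os


def per_cpu_load(raw, cpus):
    """Divide each raw load average by the number of CPUs."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return tuple(value / cpus for value in raw)


def live_per_cpu_load():
    """Per-CPU load for the current host (Unix-like systems only)."""
    # os.getloadavg() returns the same 1, 5 and 15 minute figures as uptime.
    return per_cpu_load(os.getloadavg(), os.cpu_count() or 1)


# The 8 core example from this comment.
sample = per_cpu_load((1.38, 1.67, 1.83), 8)
print(", ".join(f"{value:.2f}" for value in sample))  # 0.17, 0.21, 0.23
```

Calling `live_per_cpu_load()` on a host gives the figures that would be compared against the -r thresholds, under the assumption above.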

It is a somewhat busy system but there is still tonnes of CPU power left, so I don't know. I'm not a full-time sysadmin; I'm a devops/jack-of-all-trades.

My original commit was made when I was testing my Nagios config and I realized that the sample "command[check_load]=@pluginsdir@/check_load -w 15,10,5 -c 30,25,20" was not doing what it looked like it was doing.

@minusdavid
Contributor

Depending on your version of 'top', you can press 1 on your keyboard to get an overview of all your CPUs at the same time. Use 'd' to change the delay to something approaching real time and you can get a good sense of where 'uptime'/'getloadavg' are getting their aggregate scores.

@box293
Contributor Author

box293 commented Oct 23, 2017

That is all excellent information, and it makes sense.

What is required is better documentation on thresholds for the different plugins; including this information would help greatly. This is something I plan on publishing in the Nagios Support Knowledgebase in the near future.

As for the example thresholds, I'll leave that up to the devs to decide if we should leave them as they are.

@minusdavid
Contributor

Sounds good to me. I was just looking to see if John or I added any extra documentation at the time, but it doesn't look like it.

Cheers for planning on publishing details about the thresholds!

@box293
Contributor Author

box293 commented Nov 10, 2017

Here is the KB article I've created on this topic. It links back to here.

https://support.nagios.com/kb/article.php?id=771

@frayber

frayber commented Feb 15, 2019

I agree with box293: they are too low.
The old ones:

check_load -w 15,10,5 -c 30,25,20

were too high!

I think good values could be these:

check_load -r -w .8,.6,.5 -c .9,.7,.6
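To compare the three sets of values discussed in this thread, the sketch below runs each against the 8 CPU "load average: 1.38, 1.67, 1.83" figures quoted earlier. It is a rough model only: `parse_triplet` and `state` are hypothetical helpers, and counting a value equal to its threshold as a breach is an assumption rather than something taken from check_load.c.

```python
def parse_triplet(text):
    """Turn a threshold option value such as '.8,.6,.5' into three floats."""
    values = tuple(float(part) for part in text.split(","))
    if len(values) != 3:
        raise ValueError(f"expected three comma-separated values, got {text!r}")
    return values


def state(load, warn, crit):
    """Simplified threshold check (assumes '>=' counts as a breach)."""
    if any(value >= limit for value, limit in zip(load, crit)):
        return "CRITICAL"
    if any(value >= limit for value, limit in zip(load, warn)):
        return "WARNING"
    return "OK"


RAW = (1.38, 1.67, 1.83)  # uptime figures from the 8 CPU host in this thread
CPUS = 8

# name: (uses -r, -w value, -c value)
CANDIDATES = {
    "old sample (no -r)": (False, "15,10,5", "30,25,20"),
    "current sample (-r)": (True, ".15,.10,.05", ".30,.25,.20"),
    "proposed here (-r)": (True, ".8,.6,.5", ".9,.7,.6"),
}

results = {}
for name, (scaled, warn, crit) in CANDIDATES.items():
    # With -r the load is divided by the CPU count before comparison.
    load = tuple(value / CPUS for value in RAW) if scaled else RAW
    results[name] = state(load, parse_triplet(warn), parse_triplet(crit))
    print(f"{name}: {results[name]}")
```

Under these assumptions, the current sample flags that host as CRITICAL because of its 15-minute average, while the old and the proposed values both report OK.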

@minusdavid
Copy link
Contributor

It's just a sample-config file, but having saner sample values does sound reasonable. Send a pull request?

frayber added a commit to frayber/nrpe that referenced this issue Feb 19, 2019
Please refer to NagiosEnterprises#171
Even though it's not simple to set general cpu_load thresholds for every system, I suggest increasing these values because they are too low.
Generally, a load of 1 per core is considered the bottleneck, so I suggest these new values.