nrpe.cfg.in thresholds for check_load seem too small #171
Comments
Take a look at http://man7.org/linux/man-pages/man3/getloadavg.3.html and http://man7.org/linux/man-pages/man1/uptime.1.html. The check_load plugin uses getloadavg and thus reports the same numbers as uptime: https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_load.c#L329

1.0 represents 100% of 1 CPU, so .30 represents 30% of 1 CPU. Using the -r option, you specify load in terms of 1 CPU, even on a multi-CPU system. So with warning thresholds of .15,.10,.05, you'd get warnings if you have a 1 CPU system with 15% load at 1 minute, 10% load at 5 minutes, and 5% load at 15 minutes. If you ran "uptime" at the same time Nagios did a check, it would say .15,.10,.05. If you had a 2 CPU system with 15% load on each CPU, you'd have a total load of 30% CPU usage, which in a tool like 'uptime' would show up as .3.

I'm looking at an 8 CPU system right now and 'uptime' says "load average: 1.38, 1.67, 1.83". That means that over the last minute 138% of 800% CPU was being used (or 1.38 of 8 in decimal notation, which is what uptime, getloadavg and Nagios's check_load plugin use).

If you specified warning thresholds of 15,10,5 instead, your system would (hopefully) never warn, because you'd be saying you want warnings at 1500% usage of 1 CPU at 1 minute, 1000% usage at 5 minutes, and 500% usage at 15 minutes. In terms of 'uptime', it would warn when you saw something like this: load average: 15.0, 10.0, 5.0. If you run 'uptime' or 'top' on one of your systems now, you probably won't see that. Or if you do... you need to upgrade your system, because that is a really, really high load.
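To make the per-CPU scaling described above concrete, here is a minimal Python sketch. It is not the plugin's code: `per_cpu_load` and `current_per_cpu_load` are hypothetical helper names, and dividing by the CPU count is an assumption about what -r does, based on the description in this comment.

```python
import os


def per_cpu_load(load_averages, cpu_count):
    """Divide each load average by the CPU count.

    Assumption: this mirrors the scaling that check_load -r applies.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / cpu_count for load in load_averages)


def current_per_cpu_load():
    """Read the 1, 5 and 15 minute load averages and scale them per CPU.

    os.getloadavg() wraps getloadavg(3) and is only available on Unix-like
    systems; os.cpu_count() may return None, so fall back to 1.
    """
    cpus = os.cpu_count() or 1
    return per_cpu_load(os.getloadavg(), cpus)
```

Calling `current_per_cpu_load()` on a Linux host should give the three values that per-CPU thresholds would be compared against, under the assumption stated above.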
That said... the numbers in the sample "-r -w .15,.10,.05 -c .30,.25,.20" might be too low. If I look at that 8 core system again: "load average: 1.38, 1.67, 1.83" is based on 8 cores. If we divide by 8... we'd get "0.17,.21,.23". It is a somewhat busy system but there is still tonnes of CPU power left, so I don't know. I'm not a full-time sysadmin; I'm a devops/jack-of-all-trades. My original commit was made when I was testing my Nagios config and I realized that the sample "command[check_load]=@pluginsdir@/check_load -w 15,10,5 -c 30,25,20" was not doing what it looked like it was doing.
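The division above can be checked against the sample thresholds with a short Python sketch. `check_state` is a hypothetical helper, and two details are assumptions about the plugin rather than confirmed behavior: that a value must strictly exceed its threshold, and that any critical breach wins over a warning.

```python
def check_state(per_cpu, warn, crit):
    """Return 'CRITICAL', 'WARNING' or 'OK' for three per-CPU load values.

    Assumption: a value must strictly exceed its threshold to trigger,
    and a critical breach takes precedence over a warning.
    """
    if any(value > limit for value, limit in zip(per_cpu, crit)):
        return "CRITICAL"
    if any(value > limit for value, limit in zip(per_cpu, warn)):
        return "WARNING"
    return "OK"


# Figures quoted in the comment: an 8 core system and its uptime output.
raw = (1.38, 1.67, 1.83)
cpus = 8
per_cpu = tuple(value / cpus for value in raw)  # roughly 0.17, 0.21, 0.23

# Sample thresholds: -r -w .15,.10,.05 -c .30,.25,.20
state = check_state(per_cpu, (0.15, 0.10, 0.05), (0.30, 0.25, 0.20))
print(state)  # prints CRITICAL
```

Under these assumptions the 15 minute value (about 0.23) is above the .20 critical threshold, so a system described here as only somewhat busy would already be reported as critical.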
Depending on your version of 'top', you can press 1 on your keyboard to get an overview of all your CPUs at the same time. Use 'd' to change the delay to something approaching real time and you can get a good sense of where 'uptime'/'getloadavg' are getting their aggregate scores.
That is all excellent information, and it makes sense. What is required is better documentation on thresholds for the different plugins; including this information would help greatly. This is something I plan on publishing in the Nagios Support Knowledgebase in the near future. As for the example thresholds, I'll leave it up to the devs to decide whether we should leave them as they are.
Sounds good to me. I was just looking to see if John or I added any extra documentation at the time, but it doesn't look like it. Cheers for planning on publishing details about the thresholds!
Here is the KB article I've created on this topic; it links back to here.
I agree with box293: they are too low. The old values, check_load -w 15,10,5 -c 30,25,20, were too high! I think good values could be these: check_load -r -w .8,.6,.5 -c .9,.7,.6
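For a sense of what the proposed per-CPU values mean in terms of 'uptime' output, they can be scaled back up by the core count. This Python sketch assumes that -r divides the load by the number of CPUs, as described earlier in the thread; `raw_trigger_levels` is a hypothetical helper and the 8 core count is taken from the example system above.

```python
def raw_trigger_levels(per_cpu_thresholds, cpu_count):
    """Scale per-CPU thresholds up to the whole-system load averages
    that uptime would show when each threshold is reached."""
    return tuple(limit * cpu_count for limit in per_cpu_thresholds)


CPUS = 8  # the 8 core example system from this thread

warn_raw = raw_trigger_levels((0.8, 0.6, 0.5), CPUS)
crit_raw = raw_trigger_levels((0.9, 0.7, 0.6), CPUS)

print("warning above:", ", ".join(f"{level:.1f}" for level in warn_raw))
print("critical above:", ", ".join(f"{level:.1f}" for level in crit_raw))
```

On that machine the reported "load average: 1.38, 1.67, 1.83" would sit well below both sets of levels.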
It's just a sample-config file, but having saner sample values does sound reasonable. Send a pull request?
Please refer to NagiosEnterprises#171. Even though it is not simple to set general cpu_load thresholds for every system, I suggest increasing these values because they are too low. Generally, a load of 1 per core is considered the bottleneck, so I suggest these new values.
In the nrpe/sample-config/nrpe.cfg.in file there is a check_load command:
Perhaps the thresholds are a little low. How about:
@minusdavid Do you have any comment? These were set as part of 2c935fb.