Add percent_free to the API instead of only percent #234

Closed
ericloyd opened this Issue Jun 14, 2016 · 7 comments

3 participants
@ericloyd

I'm not going to redo the entire post, but as I described in a forum post (https://support.nagios.com/forum/viewtopic.php?f=7&t=38822&p=186222#p186222), there is an inconsistency that may cause order-of-magnitude problems for performance monitoring, capacity planning, threshold maintenance, and just plain head-banging-against-the-wall.

Here's the bottom line:

Nagios's default memory check is custom_check_mem which returns the % free memory. That's memory that is not in use by anything.

The problem is that memory on Linux systems can be in use by buffers and caches until it's needed for something else, and immediately reallocated with no penalty. So a better metric might be available memory; that is, memory that is available to applications.

NCPA's "api/memory/virtual/percent" node returns just that - the percent of memory that is available.

The problem is that these two values are not the same thing: on the same host they are often something like 5% for custom_check_mem and 50% for NCPA's check. Both are accurate, but they measure different things. So switching from an NRPE-based memory check (with custom_check_mem) to an NCPA-based memory check (without changing anything else) is going to seriously upset anything that relies on the value being returned (warning, critical, SLA, etc.).

Consider adding percent_free and percent_available to NCPA, and changing the default percent to be percent_free to match the default check, custom_check_mem. Both numbers can be easily calculated from the other available data.
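As a rough illustration of the gap described above, here is a minimal sketch that derives both percentages from raw byte counts. The function name and the sample numbers are hypothetical; on a real host the inputs could come from the `total`, `free`, and `available` fields of `psutil.virtual_memory()`.

```python
def memory_percentages(total, free, available):
    """Return (percent_free, percent_available) from raw byte counts.

    percent_free counts only memory that nothing is using;
    percent_available also counts memory the kernel can reclaim
    (for example buffers and cache) for applications.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    percent_free = 100.0 * free / total
    percent_available = 100.0 * available / total
    return percent_free, percent_available


GIB = 1024 ** 3
# Hypothetical host: 20 GiB total, 1 GiB untouched, 10 GiB available to apps.
pf, pa = memory_percentages(20 * GIB, 1 * GIB, 10 * GIB)
print(f"percent_free={pf:.1f} percent_available={pa:.1f}")
```

With these sample numbers the same host reports 5% free but 50% available, which is the kind of mismatch that would break thresholds carried over unchanged.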

@jomann09


jomann09 Jun 14, 2016

Member

I wanted this posted here to get more opinions, thanks for linking and posting on here.

> NCPA's "api/memory/virtual/percent" node returns just that - the percent of memory that is available.

The api/memory/virtual/percent value is actually the percentage of memory used, taken directly from psutil, a Python module which gets all this for us on every system. It is calculated as (total - available) / total * 100 and is documented here, which is why the values match the documentation.

First off, we cannot change the api/memory/virtual/percent value unless there was a bug in calculating the used percentage, since it needs to stay backwards compatible for anyone who already has checks created and wants to upgrade to 2.0.0 and above. It is definitely a strange name for used percent, which is why I would like to add some values to the returned output so that it's clearer what it's actually a percentage of. I added an issue for that (#235) and will likely get it into 2.0.0 as well.

I would not be against putting in a percent_free if this is useful.

The percent_available would not be necessary, though, because percent is the amount used, so whatever is left over is the percent actually available, even on Linux systems. This shouldn't be an issue: if you wanted to alert on 20% available, you'd just set your warning or critical to 80%, and that would give you the 20% available warning using percent.
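The threshold inversion described here can be sketched as follows. This is only an illustration: the function names are hypothetical, and the comparison assumes the usual Nagios convention that a plain numeric threshold alerts when the value rises above it.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def used_threshold(min_available_percent):
    """Convert a 'minimum percent available' goal into a 'percent used' threshold."""
    return 100.0 - min_available_percent


def check_state(percent_used, warn_min_available, crit_min_available):
    """Evaluate a used-percent reading against available-percent goals."""
    if percent_used > used_threshold(crit_min_available):
        return CRITICAL
    if percent_used > used_threshold(warn_min_available):
        return WARNING
    return OK


# Warn below 20% available, go critical below 10% available.
print(used_threshold(20))        # threshold to pass as warning
print(check_state(85, 20, 10))   # 85% used means 15% available
```

So a "warn at 20% available" policy becomes a warning threshold of 80 on the existing percent node, without needing a separate endpoint.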

I'd also like to point out that you can use custom_check_mem as a plugin within NCPA if that is easy for you to do in your environment. Or you can create your own simple plugin to give you free memory if you must. Obviously that would just be a temporary fix, if you need it now.

Just a personal opinion at the end here. I'd argue that using the free percent (without buffers and cache as a part of it) is not a very good way to monitor memory usage. I do believe the user should be shown the data when the check is performed, but since Linux loves eating up free memory for cache and buffers, there is no reason to think you only have < 10% free when in reality your apps have GBs of RAM available. Why the custom_check_mem plugin does it that way, I have no idea.

P.S. Looking at custom_check_mem, it looks like the cache is the only thing added to the free amount to get the free %, and you can actually disable that with --nocache, which makes this all even messier and means that buffers are left out.


@tmcnag


tmcnag Jun 14, 2016

Member

We can maintain backwards-compatibility by keeping /api/memory/virtual/percent as-is, and going with a more logical/sane naming convention going forward. However, there are many potential endpoints we could add:

  • /percent_used, being essentially an alias to the current /percent (used, including buffers and cache)
  • /percent_used_nobc, similar to /percent_used but without including buffers and cache
  • /percent_available (100% - /percent_used_nobc)
  • /percent_free (100% - /percent_used)

In this instance, /percent_free and /percent_used (or just /percent) should add up to 100%, while 100% - /percent_available should equal /percent_used_nobc.

Make sense? It's late and I'm tired so my math might not be right, but the gist is "maintain back-compat while promising more logical names going forward".
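To make the relationships between the four proposed nodes concrete, here is a small sketch. It follows the definitions in the list literally and makes no claim about what the existing /percent node returns; the function name and the sample numbers are hypothetical, and the inputs mirror the `total`, `free`, `buffers`, and `cached` values that Linux reports.

```python
def proposed_percentages(total, free, buffers, cached):
    """Compute the four proposed values from raw memory counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    used = total - free                   # counts buffers and cache as used
    used_nobc = used - buffers - cached   # excludes buffers and cache
    percent_used = 100.0 * used / total
    percent_used_nobc = 100.0 * used_nobc / total
    return {
        "percent_used": percent_used,
        "percent_used_nobc": percent_used_nobc,
        "percent_free": 100.0 - percent_used,
        "percent_available": 100.0 - percent_used_nobc,
    }


# Hypothetical host, in arbitrary units: 1000 total, 50 free, 100 buffers, 350 cached.
values = proposed_percentages(1000, 50, 100, 350)
for name, value in values.items():
    print(f"{name}: {value:.1f}")
```

In this sketch /percent_free and /percent_used always sum to 100, and so do /percent_available and /percent_used_nobc, matching the two identities stated above.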


@ericloyd


ericloyd Jun 14, 2016

If you need to maintain "percent" as "percent available" for back-compat, that's fine, but I recommend something in the API documentation that says it's not the same calculation as the "normal" custom_check_mem result.

I'm quite happy with specific strings for specific math. I prefer to know exactly what I'm getting anyway. Besides, all I care about is memory available to applications, or /percent_avail.

Thanks.


@jomann09 jomann09 changed the title from NCPA reports free available memory, not free unused memory (read why this is a problem) to [Linux] Add percent_free to the API instead of only percent Jun 14, 2016

@jomann09 jomann09 changed the title from [Linux] Add percent_free to the API instead of only percent to Add percent_free to the API instead of only percent Jun 15, 2016

@jomann09


jomann09 Jul 23, 2016

Member

Closing this because of what I found out with #235 which, while not displaying the percent, gives the actual data that could be useful in figuring out what is going on. I'd rather not add a bunch of percent endpoints to the API.


@jomann09 jomann09 closed this Jul 23, 2016

@ericloyd


ericloyd Jul 23, 2016

I believe this to be a mistake. The point of NCPA is to have one-stop shopping across all platforms for the same data, accessed through the same API. Now you're saying that I'll need to write a custom plugin that grabs the data from the API and calculates the used percentage by subtracting from one.


@jomann09


jomann09 Jul 23, 2016

Member

I understand where you are coming from, I do. But we already have multiple options to get the used amount, so I am not sure how this relates. I am just not exposing the exact percentage of everything as API endpoints. You're right that it's supposed to be a one-stop shop, which means it should also return the same values, using the same calculations, no matter what OS it's running on. I fail to see why the ways of finding memory listed below are not sufficient. And of course, you can always just run the nagios-plugins version of the memory check if that is really what you want to do.

Simplicity is the key here. Why make this more confusing than it needs to be? Getting the memory stats of a server already exists, and we cannot accommodate every request for an API endpoint or it would get messy really fast. Some ways to get the used memory amount of a server using NCPA:

  • Get the total used memory % and all values (free, avail, used, total): api/memory/virtual?check=true&units=Gi (Gi for Windows since that's what Explorer uses, probably G for Linux). This returns what I posted in #235, which includes free and total. It is probably the best memory check available, since it gives the admin all the data needed to figure out what the problem is and make a judgement call based on all the values.
  • Get the actual amount of memory used: api/memory/virtual/used
  • Get the actual amount of memory free: api/memory/virtual/free
  • Get the percent of memory used on the machine (the value that, if close to 100%, actually means you need more memory or something is going wrong): api/memory/virtual/percent
  • Install and use nagios-plugins from NCPA instead of NRPE
  • Write your own plugin to do what you want. The ability to use the internal NCPA Python is there, although not documented yet: you can run a Python script through the Python binary on Linux, or the Python.exe binary on Windows, both located in the main NCPA folder. That lets you use things like psutil to get any data on the server that you might want without installing any extra libraries, if the issue is that you absolutely need something that isn't being displayed.
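For the "write your own plugin" route, the calculation itself is small. The sketch below assumes the api/memory/virtual node returns JSON shaped like `{"virtual": {"total": [value, unit], "free": [value, unit], ...}}`; that shape, the function names, and the sample payload are assumptions for illustration, so check the actual response from your agent before relying on it. Fetching the JSON over HTTPS is left out.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin return codes


def percent_free_from_api(payload):
    """Compute percent free from a decoded api/memory/virtual response.

    ASSUMPTION: each value is a [number, unit] pair under a "virtual" key.
    """
    virtual = payload["virtual"]
    total = virtual["total"][0]
    free = virtual["free"][0]
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * free / total


def plugin_result(percent_free, warn_below, crit_below):
    """Return (exit_code, output_line) in Nagios plugin style."""
    if percent_free < crit_below:
        code, label = CRITICAL, "CRITICAL"
    elif percent_free < warn_below:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    line = (f"{label}: Memory free was {percent_free:.1f} % "
            f"| percent_free={percent_free:.1f}%")
    return code, line


# Hypothetical decoded response, values in bytes.
sample = {"virtual": {"total": [2000, "B"], "free": [100, "B"]}}
code, line = plugin_result(percent_free_from_api(sample), 10, 3)
print(code, line)
```

A real plugin would print the line and exit with the code; the thresholds here fire when free memory drops below them, which is the opposite direction from the used-percent check.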

I would say that the percent node name is a bit annoying, since you don't know what the value is when it is given out in check form. It just says Percent was x %, but with some changes in the future I think the output will be customizable so that our output text has more information in it.

The one thing that could be useful is making some sort of way to do math and whatnot on API endpoints without having to create a custom plugin to do that. However, due to the complexity in this based on the structure of NCPA I doubt it will happen until a larger re-write occurs.


@ericloyd


ericloyd Jul 23, 2016

I also get where you're coming from. I do. :-) Looking forward to the rewrite.
