Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory plugin: Metrics to not sum up to a stable physical memory size. #4237

Open
octo opened this issue Jan 17, 2024 · 7 comments
Open

memory plugin: Metrics to not sum up to a stable physical memory size. #4237

octo opened this issue Jan 17, 2024 · 7 comments
Labels
Bug A genuine bug linux only

Comments

@octo
Copy link
Member

octo commented Jan 17, 2024

Expected behavior

The sum of the memory/memory-* metrics should be total amount of physical memory, i.e. the value should not fluctuate.

Actual behavior

The "available" and "slab" metrics count memory that is already accounted for by other metrics. The "shared" memory metric is missing.

Steps to reproduce

See graphs in #3916. Some more information is available in #4224.

@octo
Copy link
Member Author

octo commented Jan 17, 2024

I just recently ran into this issue as part of #4224 and ended up removing the "available" and "slab" metrics for collectd 6.

The idea is that all the memory/memory-* metrics can be grouped in one graph, and these two metrics were violating that assumption. After reading the kernel sources I ended up adding the "shared" metric, because it also originates in the struct sysinfo in the kernel.

I think we should:

  • Put both, "available" and "slab" behind a flag, if they're not already.
  • Use a separate "plugin instance" when reporting these metrics, e.g. "memory-available/bytes".

@eero-t
Copy link
Contributor

eero-t commented Jan 17, 2024

Grouping "random" memory values reported by OS to get "total" is not correct. Some of them can overlap (slab, shared), and you can miss multiple relevant types of memory (for example on Linux, system RAM used by DRM subsystem for GPU allocations, which could in some situations be fairly huge amount, can depending on driver and kernel version be accounted in other reported memory types, or not have any accounted at all, except maybe in debugfs).

OS reports memory total separately, and that is what should be reported as total, and used for ratio calculations.

(IMHO it's a bit odd that swap usage comes from separate plugin, as logically that's also about memory usage.)

@eero-t
Copy link
Contributor

eero-t commented Jan 24, 2024

By (correctly) using OS reported value for total memory amount, slap & available values could be brought back.

(They are somewhat useful and removing them is bound to disrupt somebody's workflow: https://m.xkcd.com/1172/)

@eero-t
Copy link
Contributor

eero-t commented Jan 31, 2024

By (correctly) using OS reported value for total memory amount, slap & available values could be brought back.

Btw. Available is defined in OTEL spec: https://opentelemetry.io/docs/specs/semconv/system/system-metrics/#metric-systemlinuxmemoryavailable

@zzzyhtheonly
Copy link
Member

zzzyhtheonly commented Feb 5, 2024

Hi,
I'm thinking of adding "Dirty" and "Writeback" (personally use, also willing to send a PR then) in Linux for “memory” plugin, going through I find the same issue in #3916

I'm trying to resolve this issue by following the two ways @octo says, just come up with some thoughts, not sure understand the whole thing correctly :-)

  • Put both, "available" and "slab" behind a flag, if they're not already.
  1. Maybe we could put all metrics that do not contribute to "total" into another submit function, i.e., we do not use MEMORY_SUBMIT for submitting metrics like "Available", instead we write a new submit function to submit those metrics one by one.
    The disadvantage is that ValuesPercentage is not working for those metrics.
    I prefer this way since it might make it easier to move on to adding new useful metrics :)

  2. Another way is to make the sum to calculate percentages fixed as "MemTotal".
    The disadvantage is it would be hard to add new metrics since we should always figure out if it has the "Cached" and "Shm" issue found in Report MemAvailable when present in meminfo #3916

  • Use a separate "plugin instance" when reporting these metrics, e.g. "memory-available/bytes".

We could put it into a separate plugin like what "swap" does. But if we want to add new metrics later, to follow what "swap" does, we may need lots of plugins to make things reasonable, e.g, if we want "HugePages_*", we need a "hugepage" plugin, if we want "Vmalloc*", we need a "vmalloc" plugin. It might be redundant since all of the metrics belong to "memory" usage. Like @eero-t also says.

@octo
Copy link
Member Author

octo commented Feb 5, 2024

I'm trying to resolve this issue by following the two ways @octo says,

To clarify, my suggestion was to put them behind a flag and use different plugin instance. The flag is negotiable. Aggregating all metrics with the same plugin, plugin instance, and type must yield something useful, hence the different plugin instance values.

  • Use a separate "plugin instance" when reporting these metrics, e.g. "memory-available/bytes".

We could put it into a separate plugin like what "swap" does. But if we want to add new metrics later, to follow what "swap" does, we may need lots of plugins to make things reasonable, e.g, if we want "HugePages_", we need a "hugepage" plugin, if we want "Vmalloc", we need a "vmalloc" plugin. It might be redundant since all of the metrics belong to "memory" usage. Like @eero-t also says.

I'm not sure I fully understand what you're saying. I think all code dealing with Linux' /proc/meminfo should remain in the "memory" plugin. My note was about setting the "plugin instance", which is one of the field in collectd 5 metric identifiers.

To follow the provided example, the structure would look like this:

  • memory/ (plugin instance is empty)
    • memory-{free,cached,buffered,shmem,used}
  • memory-hugepages/ (plugin instance is "hugepages")
    • memory-{free,reserved,surplus,free}
  • memory-vmalloc/ (plugin instance is "vmalloc")
    • memory-{used,chunk,free}

That said, the hugepages and vmem plugins do exist.

  1. Maybe we could put all metrics that do not contribute to "total" into another submit function, i.e., we do not use MEMORY_SUBMIT for submitting metrics like "Available", instead we write a new submit function to submit those metrics one by one.
    The disadvantage is that ValuesPercentage is not working for those metrics.

I'm not a huge van of the MEMORY_SUBMIT macro – it's way too clever and used to call a very clever function (here: a function with variable arguments). Read: if you want to change the way the memory plugin reports its metrics, be my guest!

That said, the macro uses a local variable named vl as a "template", so setting vl.plugin_instance before using MEMORY_SUBMIT should do what we need.

For relative metrics, we need to answer: relative to what? I'd argue that Committed_AS should be relative to CommitLimit, HugePages_Rsvd should be relative to HugePages_Total, and VmallocUsed should be relative to VmallocTotal. None of them should be relative to MemTotal.

@octo
Copy link
Member Author

octo commented Feb 5, 2024

Btw. Available is defined in OTEL spec: https://opentelemetry.io/docs/specs/semconv/system/system-metrics/#metric-systemlinuxmemoryavailable

Note that this is a separate metric (system.linux.memory.available vs. system.memory.usage). That is becasue aggregations over all attributes of a given metric should be meaningful. The collectd 5 equivalent would be "aggregating metrics with the same plugin, plugin instance, and type should be meaningful".

I don't want to banish the "memory available" metric from collectd. But we cannot throw it in with the other memory metrics because these metrics must not be aggregated together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug A genuine bug linux only
Projects
None yet
Development

No branches or pull requests

3 participants