[CONTINT-4562] Add WSS measurement (Working Set Size) #1322
dd-mergequeue[bot] merged 9 commits into main
Conversation
lading/src/observer/linux.rs
Outdated
```rust
// WSS measures the amount of memory that has been accessed since the last poll.
// As a consequence, the poll interval impacts the measure.
// That's why we need to be sure we don't poll more often than once per minute.
if self.last_wss_sample.elapsed() > tokio::time::Duration::from_secs(60) {
```
How do we arrive at 60s as a good poll interval?
The thing is that the agent doesn't have a homogeneous workload that would execute the same pieces of code on the same pieces of data at all points in time.
Instead, it has a lot of periodic tasks that are scheduled every so often.
For example, some Python checks are scheduled once every 15s, the forwarder sends the aggregated data every 5s, the workloadmeta store polls the kubelet every 15s, etc.
So, if we were collecting WSS every second, the data accessed by the workloadmeta kubelet collector would be taken into account in only one sample out of 15, resulting in a very spiky metric.
In order to have a stable metric, we need a polling interval that is long enough to ensure that all the periodic tasks the agent runs every so often have executed at least once during that interval.
With all the agent tasks executing every 15s, a WSS polling interval of 20s could theoretically have been enough, but experiments showed that it wasn't enough to get a stable metric.
A possible future improvement (in a follow-up PR) could be to collect WSS at different intervals.
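The constraint described above can be sketched with a small, hypothetical helper: the theoretical floor for a stable sampling window is the least common multiple of the task periods (the function name and period values below are illustrative, not from the PR; 60s is the empirical margin chosen on top of that floor).

```rust
// Hypothetical sketch: the smallest window guaranteed to contain at least
// one full execution of every periodic task is the LCM of their periods.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

fn lcm(a: u64, b: u64) -> u64 {
    a / gcd(a, b) * b
}

fn min_stable_window_secs(periods_secs: &[u64]) -> u64 {
    periods_secs.iter().copied().fold(1, lcm)
}

fn main() {
    // Example periods from the discussion: checks every 15s, forwarder every 5s.
    let floor = min_stable_window_secs(&[15, 5, 15]);
    println!("theoretical floor: {floor}s");
}
```

In practice the floor was not enough: the experiments mentioned above showed that a 20s interval still produced a spiky metric, hence the 60s safety margin.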
lading/src/observer/linux.rs
Outdated
```rust
// WSS measures the amount of memory that has been accessed since the last poll.
// As a consequence, the poll interval impacts the measure.
// That's why we need to be sure we don't poll more often than once per minute.
if self.last_wss_sample.elapsed() > tokio::time::Duration::from_secs(60) {
```
Separately, we already sample smaps access every 10s, logic for this is directly above https://github.com/DataDog/lading/blob/main/lading/src/observer/linux.rs#L36-L49
It would be nice if we could line up these sampling intervals so that we can correlate if needed.
I’ve refactored this logic to make it identical to the one currently used by smaps polling in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R68-R69.
```rust
use pfnset::PfnSet;

pub(super) const PAGE_IDLE_BITMAP: &str = "/sys/kernel/mm/page_idle/bitmap";
// From https://github.com/torvalds/linux/blob/c62f4b82d57155f35befb5c8bbae176614b87623/arch/x86/include/asm/page_64_types.h#L42
```
Is this amd64-specific? If so, let's represent that here in the code so that we can add support for aarch64 in the future if needed.
I’ve added an architecture attribute in c6fc25a#diff-12dfd5a93fda4762b36064bede36873a0a6a1c3843a8f3167dc31ab915abaa36R14.
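For illustration, here is a minimal sketch of what making the architecture assumption explicit can look like (the module name and constant value are placeholders, not the PR's actual code):

```rust
// Hypothetical sketch: gate architecture-dependent constants behind
// #[cfg(target_arch = ...)] so the x86_64 assumption is explicit.
#[cfg(target_arch = "x86_64")]
mod arch {
    // The real value comes from the kernel's arch/x86 headers;
    // 12 (i.e. 4 KiB pages) is used here purely for illustration.
    pub const PAGE_SHIFT: u64 = 12;
}

// A future aarch64 port would add its own block, e.g.:
// #[cfg(target_arch = "aarch64")]
// mod arch { pub const PAGE_SHIFT: u64 = ...; }

fn main() {
    println!("page size: {} bytes", 1u64 << arch::PAGE_SHIFT);
}
```

With this layout, building on an unsupported architecture fails at compile time instead of silently using x86_64 constants.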
lading/src/observer/linux.rs
Outdated
```rust
Some(wss::Sampler::new(parent_pid)?)
} else {
    warn!(
        "{} isn't accessible. Either the kernel hasn't been compiled with CONFIG_IDLE_PAGE_TRACKING or the process doesn't have access to it. WSS sampling is not available",
```
Is there a specific capability that we can include in this error message as a hint for providing access?
I’ve added some hints about what would need to be looked at in the warning log in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R38-R52.
```rust
let page_size = page_size::get();
let mut pfn_set = PfnSet::new();

for process in ProcessDescendantsIterator::new(self.parent_pid) {
```
This seems potentially useful in our procfs iteration code as well; ideally, I'd like to have only a single way to iterate through child processes.
I’ve updated the procfs observer to make it use the same ProcessDescendantsIterator in c6fc25a#diff-d8cfa4d9cb3461bf7decb34c7751bf84cc6fa811bf34d9100370c13eca3f500fR93-R105.
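For illustration, a std-only sketch of what such an iterator can look like: a breadth-first walk over /proc/&lt;pid&gt;/task/&lt;tid&gt;/children (the PR's actual implementation is built on the procfs crate; this approximation assumes Linux and a kernel with CONFIG_PROC_CHILDREN):

```rust
use std::collections::VecDeque;
use std::fs;

// Sketch of a breadth-first walk over a process and all its descendants.
struct ProcessDescendantsIterator {
    queue: VecDeque<u32>,
}

impl ProcessDescendantsIterator {
    fn new(root: u32) -> Self {
        Self { queue: VecDeque::from([root]) }
    }
}

impl Iterator for ProcessDescendantsIterator {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        let pid = self.queue.pop_front()?;
        // Each thread of the process lists its forked children in
        // /proc/<pid>/task/<tid>/children.
        if let Ok(tasks) = fs::read_dir(format!("/proc/{pid}/task")) {
            for task in tasks.flatten() {
                if let Ok(children) = fs::read_to_string(task.path().join("children")) {
                    self.queue.extend(parse_children(&children));
                }
            }
        }
        Some(pid)
    }
}

// Parse the space-separated PID list found in the children file.
fn parse_children(s: &str) -> Vec<u32> {
    s.split_whitespace().filter_map(|t| t.parse().ok()).collect()
}

fn main() {
    for pid in ProcessDescendantsIterator::new(std::process::id()) {
        println!("{pid}");
    }
}
```

Because the queue is seeded with the root PID, the iterator yields the process itself first, then its descendants level by level.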
lading/Cargo.toml
Outdated
```toml
] }
futures = "0.3.31"
fuser = { version = "0.15", optional = true }
page_size = { version = "0.6.0" }
```
This crate doesn't look to be active -- which makes sense -- but I also think we don't require cross-platform page size reads in lading. Could we drop this and embed the heart of the crate directly in this PR? https://github.com/Elzair/page_size_rs/blob/ae67d388f3d34ac88136292901fc25e2b8c47af7/src/lib.rs#L95-L103, if I read this right.
I agree that the feature provided by this crate is so simple that we can copy it and get rid of this direct dependency.
That's what I did in c6fc25a#diff-12dfd5a93fda4762b36064bede36873a0a6a1c3843a8f3167dc31ab915abaa36R63.
However, please note that this doesn't remove the dependency on page_size entirely, as this crate was already an indirect dependency prior to this PR, as can be seen on the current main branch.
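The heart of the crate boils down to a single sysconf call; here is a Linux-only sketch without any external crate (the extern declaration and the constant value are assumptions based on glibc, not the PR's code):

```rust
// _SC_PAGESIZE as defined by glibc on Linux (assumption: glibc constants).
const SC_PAGESIZE: i32 = 30;

extern "C" {
    // long sysconf(int name); provided by the C library.
    fn sysconf(name: i32) -> i64;
}

fn page_size() -> usize {
    // SAFETY: sysconf has no preconditions and _SC_PAGESIZE is a valid name.
    unsafe { sysconf(SC_PAGESIZE) as usize }
}

fn main() {
    println!("page size: {} bytes", page_size());
}
```

The cross-platform fallbacks in the crate (Windows, no_std) are exactly what lading does not need, which is why inlining the Unix path is enough.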
lading/src/observer/linux.rs
Outdated
```rust
cgroup: cgroup_sampler,
wss: wss_sampler,
smaps_interval: 10,
last_wss_sample: tokio::time::Instant::now() - tokio::time::Duration::from_secs(61),
```
Instead of using Instant, prefer the use of Interval. The pattern used by the procfs and cgroup pollers should be followed here, so call self.wss.poll if it's set and let the poll be an infinite loop.
Or change the other pollers to behave in a like manner to your new poller.
Note also that interval ticks immediately on the first poll, which solves your problem here: you don't need to scoot the first Instant back in time anymore if you use interval.
I’ve refactored this logic to make it identical to the one currently used by smaps polling in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R68-R69.
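The behavior the reviewer points out (the first tick firing immediately, which removes the need for the `now() - 61s` trick) can be illustrated with a small std-only stand-in for tokio::time::interval (SimpleInterval is hypothetical, purely for illustration):

```rust
use std::time::{Duration, Instant};

// Sketch of why an interval removes the "scoot the Instant back" trick:
// it is due immediately on the first poll, then once every `period`.
struct SimpleInterval {
    next: Instant,
    period: Duration,
}

impl SimpleInterval {
    fn new(period: Duration) -> Self {
        // First tick is due right away, like tokio::time::interval.
        Self { next: Instant::now(), period }
    }

    fn poll(&mut self, now: Instant) -> bool {
        if now >= self.next {
            self.next = now + self.period;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut interval = SimpleInterval::new(Duration::from_secs(60));
    let t0 = Instant::now();
    println!("first poll due: {}", interval.poll(t0));
    println!("second poll due: {}", interval.poll(t0));
}
```

With an Instant-based check the first sample would only happen after a full period, which is exactly why the original code subtracted 61 seconds from the start time.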
```rust
let page_size = page_size::get();
let mut pfn_set = PfnSet::new();

for process in ProcessDescendantsIterator::new(self.parent_pid) {
```

```rust
self.page_idle_bitmap.write_all(&pfn_bitset.to_ne_bytes())?;
}

gauge!("total_wss_bytes").set((nb_pages * page_size) as f64);
```
Neat. This'll be cool to see live.
```rust
/// in a format suitable for use with the Idle Page Tracking API.
/// <https://www.kernel.org/doc/html/latest/admin-guide/mm/idle_page_tracking.html>
#[derive(Debug)]
pub(super) struct PfnSet(HashMap<u64, u64>);
```
Rust's HashMap is resistant to HashDoS as long as you don't change the hasher. That's not a concern for lading, so for performance reasons we tend to use rustc_hash::FxHashMap: it's the default hash map, but with the hasher swapped out for you.
Thanks for the advice!
I replaced HashMap with FxHashMap in c6fc25a#diff-4d74c479f8fd38760097e8767088b67282523e2f76cfa90613b007a6a51ea7ba.
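For illustration, a std-only sketch of the PfnSet idea: page-frame numbers are grouped into u64 blocks so they can be written to the page_idle bitmap one word at a time (plain HashMap is used here to stay dependency-free where the PR uses FxHashMap, and the insert logic is my assumption about the structure, not the PR's exact code):

```rust
use std::collections::HashMap;

// Sketch: each map entry covers a block of 64 page-frame numbers, one bit
// per PFN, matching the u64 granularity of /sys/kernel/mm/page_idle/bitmap.
struct PfnSet(HashMap<u64, u64>);

impl PfnSet {
    fn new() -> Self {
        Self(HashMap::with_capacity(1024))
    }

    fn insert(&mut self, pfn: u64) {
        // Key = block index, value = bitmask of the PFNs seen in that block.
        *self.0.entry(pfn / 64).or_insert(0) |= 1 << (pfn % 64);
    }
}

fn main() {
    let mut set = PfnSet::new();
    set.insert(64); // block 1, bit 0
    set.insert(65); // block 1, bit 1
    println!("block 1 mask: {:#b}", set.0[&1]);
}
```

Grouping by block keeps writes to the bitmap file aligned to the u64 stride the kernel API expects.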
```rust
impl PfnSet {
    pub(super) fn new() -> Self {
        Self(HashMap::with_capacity(1024))
```
1024 just an arbitrarily biggish number? I support the pre-allocation and have no concerns with the size, mostly just curious about the value's origin.
It's always difficult to find the right balance between a value that is too small, which leads to many re-allocations, and a value that is too big, which wastes resources (not only memory, but also CPU, as iterating over a HashMap is O(capacity) rather than O(len)).
In this case, a HashMap item represents an 8-byte block, each bit of which represents a 4 KiB page of memory.
So, a HashMap item represents between 1 and 64 pages, i.e. between 4 KiB and 256 KiB of memory.
A HashMap of 1024 items can therefore represent between 1024 × 1 × 4 KiB = 4 MiB and 1024 × 64 × 4 KiB = 256 MiB of memory.
Given that the probability for a process to have its virtual memory mapped to a contiguous portion of physical memory is rather low, we're most probably closer to the lower bound than to the upper one.
In practice, what I observed on my machine is around 3 bits set per block on average.
As the workload monitored by lading usually consumes more than a few megabytes, this initial capacity is in fact rather low.
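The bounds above can be checked with a couple of lines (the constants restate the figures from this comment; the real PfnSet capacity handling may differ):

```rust
// Each PfnSet entry is a u64 bitmask: up to 64 pages of 4 KiB each.
const PAGE_SIZE: u64 = 4096;
const PAGES_PER_ENTRY: u64 = 64;
const INITIAL_CAPACITY: u64 = 1024;

fn main() {
    // Fully scattered mapping: a single bit set per entry.
    let min_bytes = INITIAL_CAPACITY * PAGE_SIZE;
    // Fully contiguous mapping: all 64 bits set per entry.
    let max_bytes = INITIAL_CAPACITY * PAGES_PER_ENTRY * PAGE_SIZE;
    println!("min: {} MiB, max: {} MiB", min_bytes >> 20, max_bytes >> 20);
}
```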
```rust
use procfs::process::Process;

// Iterator which, given a process ID, returns the process and all its descendants
```
This could be a full doc comment; it's missing a single /:
/// Iterator which, given a process ID, returns the process and all its descendants
/merge

The merge request has been interrupted because the build took longer than expected. The current limit for the base branch 'main' is 120 minutes.
This commit re-introduces the forked but not exec'd heuristic accidentally removed in PR #1322. This commit re-introduces this heuristic, extracting it into a function and adding tests to hopefully avoid this situation in the future. Once merged we will cut a new version of lading. Signed-off-by: Brian L. Troutwine <brian.troutwine@datadoghq.com>
What does this PR do?
Implement WSS (Working Set Size) measurement with the Idle Page Tracking API.
Motivation
We need a metric that measures how much memory the target workload really uses.
The metrics we currently use to assess memory consumption, RSS and PSS, measure more the amount of memory that has been allocated, and they include reclaimable memory.
Related issues
Additional Notes
See https://www.brendangregg.com/wss.html
Here is how to validate the value reported by the new total_wss_bytes metric.

Run the Datadog Agent in a container:

```shell
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'
```

Run lading against it:

```shell
$ target/debug/lading --config-path examples/lading-idle.yaml --target-container dd-agent --prometheus-addr 127.0.0.1:8080 --warmup-duration-seconds 0 --experiment-duration-seconds 3600
```

Observe the total_pss_bytes, total_rss_bytes and total_wss_bytes metrics.

Restart the agent with a memory limit above the reported total_wss_bytes:

```shell
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY --memory 119m --memory-swap 119m gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'
```

Observe that the agent is stable (10 minutes of uptime with no restart).

Restart the agent with a memory limit below the reported total_wss_bytes:

```shell
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY --memory 98m --memory-swap 98m gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'
```

Observe that the agent is eventually OOM killed.

So, the value reported by total_wss_bytes represents the boundary between a memory limit that is too low and triggers OOM kills, and a memory limit high enough to let the target workload run fine.