[CONTINT-4562] Add WSS measurement (Working Set Size)#1322

Merged
dd-mergequeue[bot] merged 9 commits into main from lenaic/CONTINT-4562_wss on Apr 29, 2025

Conversation

Member

@L3n41c L3n41c commented Apr 16, 2025

What does this PR do?

Implement WSS (Working Set Size) measurement with the Idle Page Tracking API.

Motivation

We need a metric that measures how much memory the target workload really uses.
The metrics we currently use to assess memory consumption, RSS and PSS, instead measure the amount of memory that has been allocated, which includes reclaimable memory.

Related issues

Additional Notes

See https://www.brendangregg.com/wss.html

Here is how to validate the value reported by the new total_wss_bytes metric:

  1. Start an agent with no memory limit with:
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'
  2. Collect the memory metrics with lading:
$ target/debug/lading --config-path examples/lading-idle.yaml --target-container dd-agent --prometheus-addr 127.0.0.1:8080 --warmup-duration-seconds 0 --experiment-duration-seconds 3600
$ curl -s http://localhost:8080 | grep -E 'total_[prw]ss_bytes'
# TYPE total_rss_bytes gauge
total_rss_bytes 190119936
# TYPE total_wss_bytes gauge
total_wss_bytes 108711936
# TYPE total_pss_bytes gauge
total_pss_bytes 190111744
| metric | value in bytes |
| --- | --- |
| total_pss_bytes | 190111744 |
| total_rss_bytes | 190119936 |
| total_wss_bytes | 108711936 |
  3. Stop the agent and restart it with a memory limit equal to the measured WSS plus a 10% safety margin (119 MiB):
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY --memory 119m --memory-swap 119m gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'

Observe that the agent is stable (10 minutes of uptime with no restart):

$ docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED          STATUS                    PORTS                                                                    NAMES
27beb411fc21   gcr.io/datadoghq/agent:7.65.0-rc.9   "bash -c 'ln -s /etc…"   10 minutes ago   Up 10 minutes (healthy)   8125/udp, 8126/tcp                                                       dd-agent
  4. Stop the agent and restart it with a memory limit 10% lower than the measured WSS (98 MiB):
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY --memory 98m --memory-swap 98m gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'

Observe that the agent is eventually OOM killed:

April 16 15:24:44 lenaic-Precision-5570 kernel: Memory cgroup out of memory: Killed process 174731 (agent) total-vm:3400360kB, anon-rss:77848kB, file-rss:96800kB, shmem-rss:0kB, UID:0 pgta>
April 16 15:24:44 lenaic-Precision-5570 kernel: agent invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0

So the value reported by total_wss_bytes marks the boundary between a memory limit that is too low and triggers OOM kills, and one that is high enough to let the target workload run fine.
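As a sanity check, the ±10% arithmetic behind the 119 MiB and 98 MiB limits above can be reproduced in a few lines (`margin` is a hypothetical helper for illustration, not code from this PR):

```rust
/// Apply a signed percentage delta to a measured WSS value.
/// Hypothetical helper, not part of this PR.
fn margin(wss_bytes: u64, percent_delta: i64) -> u64 {
    let delta = (wss_bytes as i64) * percent_delta / 100;
    ((wss_bytes as i64) + delta) as u64
}

fn main() {
    let wss = 108_711_936; // total_wss_bytes measured above
    // Roughly the 119 MB and 98 MB docker --memory values used above.
    println!("upper: {} bytes", margin(wss, 10));
    println!("lower: {} bytes", margin(wss, -10));
}
```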

@L3n41c L3n41c marked this pull request as ready for review April 17, 2025 06:11
@L3n41c L3n41c requested a review from a team as a code owner April 17, 2025 06:11
@L3n41c L3n41c force-pushed the lenaic/CONTINT-4562_wss branch from 65dbd61 to 92c2f7e Compare April 18, 2025 13:10
Contributor

@scottopell scottopell left a comment


Initial pass

// WSS measures the amount of memory that has been accessed since the last poll.
// As a consequence, the poll interval impacts the measure.
// That’s why we need to be sure we don’t poll more often than once per minute.
if self.last_wss_sample.elapsed() > tokio::time::Duration::from_secs(60) {
Contributor

How do we arrive at 60s as a good poll interval?

Member Author

The thing is that the agent doesn't have a homogeneous workload that executes the same pieces of code on the same pieces of data at all points in time.
Instead, it has a lot of periodic tasks that are scheduled every so often.
For example, some Python checks are scheduled once every 15s, the forwarder sends the aggregated data every 5s, the workloadmeta store polls the kubelet every 15s, etc.
So, if we were collecting WSS every second, the data accessed by the workloadmeta kubelet collector would be taken into account in only one sample out of 15, resulting in a very spiky metric.
In order to have a stable metric, we need a polling interval that is long enough to be sure that all the periodic tasks the agent runs every so often have executed during that interval.

With all the agent tasks executed every 15s, a WSS polling interval of 20s could theoretically have been enough, but experiments showed that it wasn't enough to get a stable metric.

A possible future improvement (in a follow-up PR) could be to collect WSS at several different intervals.
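The spikiness argument can be illustrated with a toy model (not lading code): a task that touches its pages every 15s shows up in only a few 1-second samples, but in every 60-second sample.

```rust
/// Count how many polls of length `poll_secs` over `horizon_secs` observe a
/// task that runs at every multiple of `task_period_secs`. Toy model only.
fn samples_containing_task(poll_secs: u32, task_period_secs: u32, horizon_secs: u32) -> (u32, u32) {
    let total = horizon_secs / poll_secs;
    let hits = (0..total)
        .filter(|&i| {
            let start = i * poll_secs;
            // The poll window [start, start + poll_secs) sees the task if the
            // task fires at any second inside it.
            (start..start + poll_secs).any(|t| t % task_period_secs == 0)
        })
        .count() as u32;
    (hits, total)
}

fn main() {
    // 1s polls: the 15s-periodic task appears in 4 samples out of 60 (spiky).
    println!("{:?}", samples_containing_task(1, 15, 60));
    // 60s polls: every sample contains at least one task execution (stable).
    println!("{:?}", samples_containing_task(60, 15, 60));
}
```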

// WSS measures the amount of memory that has been accessed since the last poll.
// As a consequence, the poll interval impacts the measure.
// That’s why we need to be sure we don’t poll more often than once per minute.
if self.last_wss_sample.elapsed() > tokio::time::Duration::from_secs(60) {
Contributor

Separately, we already sample smaps access every 10s, logic for this is directly above https://github.com/DataDog/lading/blob/main/lading/src/observer/linux.rs#L36-L49

It would be nice if we could line up these sampling intervals so that we can correlate if needed.

Member Author

I’ve refactored this logic to make it identical to the one currently used by smaps polling in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R68-R69.

use pfnset::PfnSet;

pub(super) const PAGE_IDLE_BITMAP: &str = "/sys/kernel/mm/page_idle/bitmap";
// From https://github.com/torvalds/linux/blob/c62f4b82d57155f35befb5c8bbae176614b87623/arch/x86/include/asm/page_64_types.h#L42
Contributor

Is this amd64-specific? If so, let's represent that here in the code so that we can add support for aarch64 in the future if needed.

Member Author

Some(wss::Sampler::new(parent_pid)?)
} else {
warn!(
"{} isn’t accessible. Either the kernel hasn’t been compiled with CONFIG_IDLE_PAGE_TRACKING or the process doesn’t have access to it. WSS sampling is not available",
Contributor

Is there a specific capability that we can include in this error message as a hint for providing access?

Member Author

I’ve added some hints about what would need to be looked at in the warning log in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R38-R52.

let page_size = page_size::get();
let mut pfn_set = PfnSet::new();

for process in ProcessDescendantsIterator::new(self.parent_pid) {
Contributor

This seems potentially useful in our procfs iteration code as well; ideally I'd like to have only a single way to iterate through child processes.

Collaborator

Agreed, I like this as well.

Member Author

I’ve updated the procfs observer to make it use the same ProcessDescendantsIterator in c6fc25a#diff-d8cfa4d9cb3461bf7decb34c7751bf84cc6fa811bf34d9100370c13eca3f500fR93-R105.
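The idea behind such an iterator can be sketched against an in-memory process tree (the real implementation walks /proc via the procfs crate; `descendants` here is a hypothetical stand-in):

```rust
use std::collections::HashMap;

/// Return `root` and all of its descendants, given a parent -> children map
/// standing in for /proc. Hypothetical sketch, not the PR's iterator.
fn descendants(tree: &HashMap<u32, Vec<u32>>, root: u32) -> Vec<u32> {
    let mut out = vec![root];
    let mut stack = vec![root];
    while let Some(pid) = stack.pop() {
        if let Some(children) = tree.get(&pid) {
            for &child in children {
                out.push(child);
                stack.push(child);
            }
        }
    }
    out
}

fn main() {
    let mut tree = HashMap::new();
    tree.insert(1, vec![2, 3]); // pid 1 forks pids 2 and 3
    tree.insert(2, vec![4]);    // pid 2 forks pid 4
    let mut pids = descendants(&tree, 1);
    pids.sort_unstable();
    println!("{pids:?}"); // all four processes are visited
}
```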

] }
futures = "0.3.31"
fuser = { version = "0.15", optional = true }
page_size = { version = "0.6.0" }
Collaborator

This crate doesn't look to be active -- which makes sense -- but I also think we don't require cross-platform page-size reads in lading. Could we drop this and embed the heart of the crate directly in this PR? https://github.com/Elzair/page_size_rs/blob/ae67d388f3d34ac88136292901fc25e2b8c47af7/src/lib.rs#L95-L103 if I read this right.

Member Author

I agree that the feature provided by this crate is so simple that we can copy it and get rid of this direct dependency.
That’s what I did in c6fc25a#diff-12dfd5a93fda4762b36064bede36873a0a6a1c3843a8f3167dc31ab915abaa36R63

However, please note that it doesn't remove the dependency on page_size, as this crate was already an indirect dependency prior to this PR, as can be seen on the current main branch:

lading/Cargo.lock, lines 2148 to 2156 at 5455864:

    [[package]]
    name = "page_size"
    version = "0.6.0"
    source = "registry+https://github.com/rust-lang/crates.io-index"
    checksum = "30d5b2194ed13191c1999ae0704b7839fb18384fa22e49b57eeaa97d79ce40da"
    dependencies = [
    "libc",
    "winapi",
    ]

cgroup: cgroup_sampler,
wss: wss_sampler,
smaps_interval: 10,
last_wss_sample: tokio::time::Instant::now() - tokio::time::Duration::from_secs(61),
Collaborator

Instead of using Instant, prefer Interval. The pattern of the procfs and cgroup pollers should be followed here: call self.wss.poll if it's set and let the poll be an infinite loop.

Or change the other pollers to behave in a like manner to your new poller.

Collaborator

Note also that interval fires on the first poll, which solves your problem here: with interval you no longer need to scoot the first Instant back in time.

Member Author (Apr 25, 2025)

I’ve refactored this logic to make it identical to the one currently used by smaps polling in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R68-R69.


self.page_idle_bitmap.write_all(&pfn_bitset.to_ne_bytes())?;
}

gauge!("total_wss_bytes").set((nb_pages * page_size) as f64);
Collaborator

Neat. This'll be cool to see live.

/// in a format suitable for use with the Idle Page Tracking API.
/// <https://www.kernel.org/doc/html/latest/admin-guide/mm/idle_page_tracking.html>
#[derive(Debug)]
pub(super) struct PfnSet(HashMap<u64, u64>);
Collaborator

Rust's HashMap is resistant to HashDoS as long as you don't change the hasher. That's not a concern for lading, so for performance reasons we tend to use rustc_hash::FxHashMap: the default hash map with the hasher swapped out for you.

Member Author

Thanks for the advice!
I replaced HashMap with FxHashMap in c6fc25a#diff-4d74c479f8fd38760097e8767088b67282523e2f76cfa90613b007a6a51ea7ba.
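For readers unfamiliar with the data structure, the idea can be sketched as follows: each map entry keys on pfn / 64 and stores a 64-bit mask, matching the 8-byte granularity of /sys/kernel/mm/page_idle/bitmap. This is a hedged sketch using std's HashMap, not the PR's FxHashMap-based code:

```rust
use std::collections::HashMap;

/// Sketch of a PFN set: one u64 mask per block of 64 consecutive page-frame
/// numbers. Illustrative only; the PR uses FxHashMap and a tuned capacity.
#[derive(Debug, Default)]
struct PfnSet(HashMap<u64, u64>);

impl PfnSet {
    fn insert(&mut self, pfn: u64) {
        *self.0.entry(pfn / 64).or_insert(0) |= 1 << (pfn % 64);
    }

    /// Number of distinct PFNs recorded, i.e. the page count behind WSS.
    fn pages(&self) -> u64 {
        self.0.values().map(|mask| u64::from(mask.count_ones())).sum()
    }
}

fn main() {
    let mut set = PfnSet::default();
    set.insert(3);
    set.insert(3);   // duplicates collapse into the same bit
    set.insert(130); // lands in a different 64-PFN block
    println!("pages: {}, blocks: {}", set.pages(), set.0.len());
}
```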


impl PfnSet {
pub(super) fn new() -> Self {
Self(HashMap::with_capacity(1024))
Collaborator

Is 1024 just an arbitrarily biggish number? I support the pre-allocation and have no concerns with the size, mostly just curious about the value's origin.

Member Author

It's always difficult to find the right balance between a value that is too small, which leads to many re-allocations, and one that is too big, which wastes resources (not only memory but also CPU, since iterating over a HashMap is O(capacity) rather than O(len)).

In this case, a HashMap item represents an 8-byte block, each bit of which represents one 4 KiB page of memory.
So a HashMap item represents between 1 and 64 pages, i.e. between 4 KiB and 256 KiB of memory.
So, a HashMap of 1024 items can represent between 1024×1×4 KiB = 4 MiB and 1024×64×4 KiB = 256 MiB of memory.

Given that the probability for a process to have its virtual memory mapped to a contiguous portion of physical memory is rather low, we're most probably closer to the lower bound than to the upper one.
In practice, what I observed on my machine is around 3 bits set per block on average.

As the workload monitored by lading usually consumes more than a few megabytes, this initial capacity is in fact rather low.
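The bounds in the reply above check out numerically (a throwaway verification, assuming 4 KiB pages):

```rust
/// Bytes representable by `entries` 8-byte blocks when each block covers
/// `pages_per_entry` pages of 4 KiB. Throwaway arithmetic check.
fn bytes_representable(entries: u64, pages_per_entry: u64) -> u64 {
    const PAGE: u64 = 4096; // 4 KiB, the x86_64 default page size
    entries * pages_per_entry * PAGE
}

fn main() {
    // Lower bound: 1 page per block -> 4 MiB for 1024 entries.
    println!("{}", bytes_representable(1024, 1));
    // Upper bound: 64 pages per block -> 256 MiB for 1024 entries.
    println!("{}", bytes_representable(1024, 64));
}
```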

@@ -0,0 +1,100 @@
use procfs::process::Process;

// Iterator which, given a process ID, returns the process and all its descendants
Collaborator

This could be a full doc comment; it's missing a single /:

/// Iterator which, given a process ID, returns the process and all its descendants

Member Author

@L3n41c L3n41c requested review from blt and scottopell April 25, 2025 15:53
Member Author

L3n41c commented Apr 29, 2025

/merge


dd-devflow bot commented Apr 29, 2025

View all feedbacks in Devflow UI.

2025-04-29 05:20:01 UTC ℹ️ Start processing command /merge


2025-04-29 05:20:06 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 0s (p90).


2025-04-29 07:20:22 UTC MergeQueue: The build pipeline timed out

The merge request has been interrupted because the build 0 took longer than expected. The current limit for the base branch 'main' is 120 minutes.

Member Author

L3n41c commented Apr 29, 2025

/merge


dd-devflow bot commented Apr 29, 2025

View all feedbacks in Devflow UI.

2025-04-29 08:07:41 UTC ℹ️ Start processing command /merge


2025-04-29 08:07:46 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 0s (p90).


2025-04-29 10:07:59 UTC MergeQueue: The build pipeline timed out

The merge request has been interrupted because the build 0 took longer than expected. The current limit for the base branch 'main' is 120 minutes.

Member Author

L3n41c commented Apr 29, 2025

/merge


dd-devflow bot commented Apr 29, 2025

View all feedbacks in Devflow UI.

2025-04-29 10:13:07 UTC ℹ️ Start processing command /merge


2025-04-29 10:13:13 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 0s (p90).


2025-04-29 10:20:33 UTC ℹ️ MergeQueue: This merge request was merged

@dd-mergequeue dd-mergequeue bot merged commit 71c14f2 into main Apr 29, 2025
28 checks passed
@dd-mergequeue dd-mergequeue bot deleted the lenaic/CONTINT-4562_wss branch April 29, 2025 10:20
blt added a commit that referenced this pull request Jul 9, 2025
This commit re-introduces the forked-but-not-exec'd heuristic accidentally
removed in PR #1322, extracting it into a function and adding tests to
hopefully avoid this situation in the future.

Once merged we will cut a new version of lading.

Signed-off-by: Brian L. Troutwine <brian.troutwine@datadoghq.com>
blt added a commit that referenced this pull request Jul 9, 2025
### What does this PR do?

This commit re-introduces the forked-but-not-exec'd heuristic accidentally
removed in PR #1322, extracting it into a function and adding tests to
hopefully avoid this situation in the future.

Once merged we will cut a new version of lading.
