[CONTINT-4562] Add WSS measurement (Working Set Size)#1322

Merged
dd-mergequeue[bot] merged 9 commits into main from lenaic/CONTINT-4562_wss on Apr 29, 2025

Conversation

Member

@L3n41c L3n41c commented Apr 16, 2025

What does this PR do?

Implement WSS (Working Set Size) measurement with the Idle Page Tracking API.

Motivation

We need a metric that measures how much memory the target workload really uses.
The metrics we currently use to assess memory consumption, RSS and PSS, instead measure the amount of memory that has been allocated, which includes reclaimable memory.

Related issues

Additional Notes

See https://www.brendangregg.com/wss.html

Here is how to validate the value reported by the new total_wss_bytes metric:

  1. Start an agent with no memory limit with:
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'
  2. Collect the memory metrics with lading:
$ target/debug/lading --config-path examples/lading-idle.yaml --target-container dd-agent --prometheus-addr 127.0.0.1:8080 --warmup-duration-seconds 0 --experiment-duration-seconds 3600
$ curl -s http://localhost:8080 | grep -E 'total_[prw]ss_bytes'
# TYPE total_rss_bytes gauge
total_rss_bytes 190119936
# TYPE total_wss_bytes gauge
total_wss_bytes 108711936
# TYPE total_pss_bytes gauge
total_pss_bytes 190111744
| metric | value in bytes |
| --- | --- |
| total_pss_bytes | 190111744 |
| total_rss_bytes | 190119936 |
| total_wss_bytes | 108711936 |
  3. Stop the agent and restart it with a memory limit equal to the measured WSS plus a 10% safety margin (119 MiB):
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY --memory 119m --memory-swap 119m gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'

Observe that the agent is stable (10 minutes of uptime with no restart):

$ docker ps
CONTAINER ID   IMAGE                                COMMAND                  CREATED          STATUS                    PORTS                                                                    NAMES
27beb411fc21   gcr.io/datadoghq/agent:7.65.0-rc.9   "bash -c 'ln -s /etc…"   10 minutes ago   Up 10 minutes (healthy)   8125/udp, 8126/tcp                                                       dd-agent
  4. Stop the agent and restart it with a memory limit 10% lower than the measured WSS (98 MiB):
$ docker run --rm --cgroupns host --pid host --name dd-agent -e DD_HOSTNAME=lenaic-Precision-5570 -v /var/run/docker.sock:/var/run/docker.sock:ro -v /proc/:/host/proc/:ro -v /sys/fs/cgroup/:/host/sys/fs/cgroup:ro -e DD_API_KEY=$DD_API_KEY --memory 98m --memory-swap 98m gcr.io/datadoghq/agent:7.65.0-rc.9 bash -c 'ln -s /etc/datadog-agent/datadog-docker.yaml /etc/datadog-agent/datadog.yaml && exec /opt/datadog-agent/bin/agent/agent run'

Observe that the agent is eventually OOM killed:

April 16 15:24:44 lenaic-Precision-5570 kernel: Memory cgroup out of memory: Killed process 174731 (agent) total-vm:3400360kB, anon-rss:77848kB, file-rss:96800kB, shmem-rss:0kB, UID:0 pgta>
April 16 15:24:44 lenaic-Precision-5570 kernel: agent invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0

So the value reported by total_wss_bytes marks the boundary between a memory limit that is too low and triggers OOM kills, and one that is high enough to let the target workload run fine.
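As a sanity check, the ±10% arithmetic behind the 119 MiB and 98 MiB limits above can be reproduced in a few lines (`margin` is a hypothetical helper for illustration, not code from this PR):

```rust
/// Apply a signed percentage delta to a measured WSS value.
/// Hypothetical helper, not part of this PR.
fn margin(wss_bytes: u64, percent_delta: i64) -> u64 {
    let delta = (wss_bytes as i64) * percent_delta / 100;
    ((wss_bytes as i64) + delta) as u64
}

fn main() {
    let wss = 108_711_936; // total_wss_bytes measured above
    // Roughly the 119 MB and 98 MB docker --memory values used above.
    println!("upper: {} bytes", margin(wss, 10));
    println!("lower: {} bytes", margin(wss, -10));
}
```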

@L3n41c L3n41c marked this pull request as ready for review April 17, 2025 06:11
@L3n41c L3n41c requested a review from a team as a code owner April 17, 2025 06:11
@L3n41c L3n41c force-pushed the lenaic/CONTINT-4562_wss branch from 65dbd61 to 92c2f7e Compare April 18, 2025 13:10
Contributor

@scottopell scottopell left a comment


Initial pass

// WSS measures the amount of memory that has been accessed since the last poll.
// As a consequence, the poll interval impacts the measure.
// That’s why we need to be sure we don’t poll more often than once per minute.
if self.last_wss_sample.elapsed() > tokio::time::Duration::from_secs(60) {
Contributor

How do we arrive at 60s as a good poll interval?

Member Author

The thing is that the agent doesn't have a homogeneous workload that executes the same pieces of code on the same pieces of data at all points in time.
Instead, it has a lot of periodic tasks that are scheduled every so often.
For example, some Python checks are scheduled once every 15s, the forwarder sends the aggregated data every 5s, the workloadmeta store polls the kubelet every 15s, etc.
So, if we were collecting WSS every second, the data accessed by the workloadmeta kubelet collector would be taken into account in only one sample out of 15, resulting in a very spiky metric.
In order to have a stable metric, we need a polling interval that is long enough to be sure that all the periodic tasks the agent runs every so often have executed during that interval.

With all the agent tasks executed every 15s, a WSS polling interval of 20s could theoretically have been enough, but experiments showed that it wasn't enough to get a stable metric.

A possible future improvement (in a follow-up PR) could be to collect WSS at several different intervals.
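The spikiness argument can be illustrated with a toy model (not lading code): a task that touches its pages every 15s shows up in only a few 1-second samples, but in every 60-second sample.

```rust
/// Count how many polls of length `poll_secs` over `horizon_secs` observe a
/// task that runs at every multiple of `task_period_secs`. Toy model only.
fn samples_containing_task(poll_secs: u32, task_period_secs: u32, horizon_secs: u32) -> (u32, u32) {
    let total = horizon_secs / poll_secs;
    let hits = (0..total)
        .filter(|&i| {
            let start = i * poll_secs;
            // The poll window [start, start + poll_secs) sees the task if the
            // task fires at any second inside it.
            (start..start + poll_secs).any(|t| t % task_period_secs == 0)
        })
        .count() as u32;
    (hits, total)
}

fn main() {
    // 1s polls: the 15s-periodic task appears in 4 samples out of 60 (spiky).
    println!("{:?}", samples_containing_task(1, 15, 60));
    // 60s polls: every sample contains at least one task execution (stable).
    println!("{:?}", samples_containing_task(60, 15, 60));
}
```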

// WSS measures the amount of memory that has been accessed since the last poll.
// As a consequence, the poll interval impacts the measure.
// That’s why we need to be sure we don’t poll more often than once per minute.
if self.last_wss_sample.elapsed() > tokio::time::Duration::from_secs(60) {
Contributor

Separately, we already sample smaps access every 10s, logic for this is directly above https://github.com/DataDog/lading/blob/main/lading/src/observer/linux.rs#L36-L49

It would be nice if we could line up these sampling intervals so that we can correlate if needed.

Member Author

I’ve refactored this logic to make it identical to the one currently used by smaps polling in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R68-R69.

use pfnset::PfnSet;

pub(super) const PAGE_IDLE_BITMAP: &str = "/sys/kernel/mm/page_idle/bitmap";
// From https://github.com/torvalds/linux/blob/c62f4b82d57155f35befb5c8bbae176614b87623/arch/x86/include/asm/page_64_types.h#L42
Contributor

Is this amd64-specific? If so, let's represent that here in the code so that we can add support for aarch64 in the future if needed.

Member Author

Some(wss::Sampler::new(parent_pid)?)
} else {
warn!(
"{} isn’t accessible. Either the kernel hasn’t been compiled with CONFIG_IDLE_PAGE_TRACKING or the process doesn’t have access to it. WSS sampling is not available",
Contributor

Is there a specific capability that we can include in this error message as a hint for providing access?

Member Author

I’ve added some hints about what would need to be looked at in the warning log in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R38-R52.

let page_size = page_size::get();
let mut pfn_set = PfnSet::new();

for process in ProcessDescendantsIterator::new(self.parent_pid) {
Contributor

This seems potentially useful in our procfs iteration code as well; ideally I'd like to have only a single way to iterate through child processes.

Collaborator

Agreed, I like this as well.

Member Author

I’ve updated the procfs observer to make it use the same ProcessDescendantsIterator in c6fc25a#diff-d8cfa4d9cb3461bf7decb34c7751bf84cc6fa811bf34d9100370c13eca3f500fR93-R105.
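The idea behind such an iterator can be sketched against an in-memory process tree (the real implementation walks /proc via the procfs crate; `descendants` here is a hypothetical stand-in):

```rust
use std::collections::HashMap;

/// Return `root` and all of its descendants, given a parent -> children map
/// standing in for /proc. Hypothetical sketch, not the PR's iterator.
fn descendants(tree: &HashMap<u32, Vec<u32>>, root: u32) -> Vec<u32> {
    let mut out = vec![root];
    let mut stack = vec![root];
    while let Some(pid) = stack.pop() {
        if let Some(children) = tree.get(&pid) {
            for &child in children {
                out.push(child);
                stack.push(child);
            }
        }
    }
    out
}

fn main() {
    let mut tree = HashMap::new();
    tree.insert(1, vec![2, 3]); // pid 1 forks pids 2 and 3
    tree.insert(2, vec![4]);    // pid 2 forks pid 4
    let mut pids = descendants(&tree, 1);
    pids.sort_unstable();
    println!("{pids:?}"); // all four processes are visited
}
```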

] }
futures = "0.3.31"
fuser = { version = "0.15", optional = true }
page_size = { version = "0.6.0" }
Collaborator

This crate doesn't look to be active -- which makes sense -- but I also think we don't require cross-platform page-size reads in lading. Could we drop this and embed the heart of the crate directly in this PR? https://github.com/Elzair/page_size_rs/blob/ae67d388f3d34ac88136292901fc25e2b8c47af7/src/lib.rs#L95-L103 if I read this right.

Member Author

I agree that the feature provided by this crate is so simple that we can copy it and get rid of this direct dependency.
That’s what I did in c6fc25a#diff-12dfd5a93fda4762b36064bede36873a0a6a1c3843a8f3167dc31ab915abaa36R63

However, please note that it doesn't remove the dependency on page_size, as this crate was already an indirect dependency prior to this PR, as can be seen on the current main branch:

lading/Cargo.lock, lines 2148 to 2156 at 5455864:

    [[package]]
    name = "page_size"
    version = "0.6.0"
    source = "registry+https://github.com/rust-lang/crates.io-index"
    checksum = "30d5b2194ed13191c1999ae0704b7839fb18384fa22e49b57eeaa97d79ce40da"
    dependencies = [
    "libc",
    "winapi",
    ]

cgroup: cgroup_sampler,
wss: wss_sampler,
smaps_interval: 10,
last_wss_sample: tokio::time::Instant::now() - tokio::time::Duration::from_secs(61),
Collaborator

Instead of using Instant, prefer Interval. The pattern of the procfs and cgroup pollers should be followed here: call self.wss.poll if it's set and let the poll be an infinite loop.

Or change the other pollers to behave in a like manner to your new poller.

Collaborator

Note also that interval fires on the first poll, which solves your problem here: with interval you no longer need to scoot the first Instant back in time.

Member Author (Apr 25, 2025)

I’ve refactored this logic to make it identical to the one currently used by smaps polling in c6fc25a#diff-d0609dafc88a691f93eb0c7eaf235015ffc74b1bfbe98e3ca36aa149b5b1f0d1R68-R69.


self.page_idle_bitmap.write_all(&pfn_bitset.to_ne_bytes())?;
}

gauge!("total_wss_bytes").set((nb_pages * page_size) as f64);
Collaborator

Neat. This'll be cool to see live.

/// in a format suitable for use with the Idle Page Tracking API.
/// <https://www.kernel.org/doc/html/latest/admin-guide/mm/idle_page_tracking.html>
#[derive(Debug)]
pub(super) struct PfnSet(HashMap<u64, u64>);
Collaborator

Rust's HashMap is resistant to HashDoS as long as you don't change the hasher. That's not a concern for lading, so for performance reasons we tend to use rustc_hash::FxHashMap: the default hash map with the hasher swapped out for you.

Member Author

Thanks for the advice!
I replaced HashMap with FxHashMap in c6fc25a#diff-4d74c479f8fd38760097e8767088b67282523e2f76cfa90613b007a6a51ea7ba.
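For readers unfamiliar with the data structure, the idea can be sketched as follows: each map entry keys on pfn / 64 and stores a 64-bit mask, matching the 8-byte granularity of /sys/kernel/mm/page_idle/bitmap. This is a hedged sketch using std's HashMap, not the PR's FxHashMap-based code:

```rust
use std::collections::HashMap;

/// Sketch of a PFN set: one u64 mask per block of 64 consecutive page-frame
/// numbers. Illustrative only; the PR uses FxHashMap and a tuned capacity.
#[derive(Debug, Default)]
struct PfnSet(HashMap<u64, u64>);

impl PfnSet {
    fn insert(&mut self, pfn: u64) {
        *self.0.entry(pfn / 64).or_insert(0) |= 1 << (pfn % 64);
    }

    /// Number of distinct PFNs recorded, i.e. the page count behind WSS.
    fn pages(&self) -> u64 {
        self.0.values().map(|mask| u64::from(mask.count_ones())).sum()
    }
}

fn main() {
    let mut set = PfnSet::default();
    set.insert(3);
    set.insert(3);   // duplicates collapse into the same bit
    set.insert(130); // lands in a different 64-PFN block
    println!("pages: {}, blocks: {}", set.pages(), set.0.len());
}
```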


impl PfnSet {
pub(super) fn new() -> Self {
Self(HashMap::with_capacity(1024))
Collaborator

Is 1024 just an arbitrarily biggish number? I support the pre-allocation and have no concerns with the size, mostly just curious about the value's origin.

Member Author

It's always difficult to find the right balance between a value that is too small, which leads to many re-allocations, and one that is too big, which wastes resources (not only memory but also CPU, since iterating over a HashMap is O(capacity) rather than O(len)).

In this case, a HashMap item represents an 8-byte block, each bit of which represents one 4 KiB page of memory.
So a HashMap item represents between 1 and 64 pages, i.e. between 4 KiB and 256 KiB of memory.
So, a HashMap of 1024 items can represent between 1024×1×4 KiB = 4 MiB and 1024×64×4 KiB = 256 MiB of memory.

Given that the probability for a process to have its virtual memory mapped to a contiguous portion of physical memory is rather low, we're most probably closer to the lower bound than to the upper one.
In practice, what I observed on my machine is around 3 bits set per block on average.

As the workload monitored by lading usually consumes more than a few megabytes, this initial capacity is in fact rather low.
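The bounds in the reply above check out numerically (a throwaway verification, assuming 4 KiB pages):

```rust
/// Bytes representable by `entries` 8-byte blocks when each block covers
/// `pages_per_entry` pages of 4 KiB. Throwaway arithmetic check.
fn bytes_representable(entries: u64, pages_per_entry: u64) -> u64 {
    const PAGE: u64 = 4096; // 4 KiB, the x86_64 default page size
    entries * pages_per_entry * PAGE
}

fn main() {
    // Lower bound: 1 page per block -> 4 MiB for 1024 entries.
    println!("{}", bytes_representable(1024, 1));
    // Upper bound: 64 pages per block -> 256 MiB for 1024 entries.
    println!("{}", bytes_representable(1024, 64));
}
```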

@@ -0,0 +1,100 @@
use procfs::process::Process;

// Iterator which, given a process ID, returns the process and all its descendants
Collaborator

This could be a full doc comment; it's missing a single /:

/// Iterator which, given a process ID, returns the process and all its descendants

Member Author

@L3n41c L3n41c requested review from blt and scottopell April 25, 2025 15:53
Member Author

L3n41c commented Apr 29, 2025

/merge


dd-devflow bot commented Apr 29, 2025

View all feedbacks in Devflow UI.

2025-04-29 05:20:01 UTC ℹ️ Start processing command /merge


2025-04-29 05:20:06 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 0s (p90).


2025-04-29 07:20:22 UTC MergeQueue: The build pipeline timed out

The merge request has been interrupted because the build 0 took longer than expected. The current limit for the base branch 'main' is 120 minutes.

Member Author

L3n41c commented Apr 29, 2025

/merge


dd-devflow bot commented Apr 29, 2025

View all feedbacks in Devflow UI.

2025-04-29 08:07:41 UTC ℹ️ Start processing command /merge


2025-04-29 08:07:46 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 0s (p90).


2025-04-29 10:07:59 UTC MergeQueue: The build pipeline timed out

The merge request has been interrupted because the build 0 took longer than expected. The current limit for the base branch 'main' is 120 minutes.

Member Author

L3n41c commented Apr 29, 2025

/merge


dd-devflow bot commented Apr 29, 2025

View all feedbacks in Devflow UI.

2025-04-29 10:13:07 UTC ℹ️ Start processing command /merge


2025-04-29 10:13:13 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 0s (p90).


2025-04-29 10:20:33 UTC ℹ️ MergeQueue: This merge request was merged

@dd-mergequeue dd-mergequeue bot merged commit 71c14f2 into main Apr 29, 2025
28 checks passed
@dd-mergequeue dd-mergequeue bot deleted the lenaic/CONTINT-4562_wss branch April 29, 2025 10:20
blt added a commit that referenced this pull request Jul 9, 2025
This commit re-introduces the forked-but-not-exec'd heuristic accidentally
removed in PR #1322, extracting it into a function and adding tests to
hopefully avoid this situation in the future.

Once merged we will cut a new version of lading.

Signed-off-by: Brian L. Troutwine <brian.troutwine@datadoghq.com>
blt added a commit that referenced this pull request Jul 9, 2025
### What does this PR do?

This commit re-introduces the forked-but-not-exec'd heuristic accidentally
removed in PR #1322, extracting it into a function and adding tests to
hopefully avoid this situation in the future.

Once merged we will cut a new version of lading.
