feat: support additional cgroup formats for container-id parsing #222

zacharycmontoya · 2025-06-30T21:37:17Z

Description

This PR adds support for additional container runtimes so we can properly parse the container-id and propagate it to the Datadog Agent. The implementation still reads the cgroup file line by line, but the logic now runs a regex against particular formats. The regex was lifted from the Java Tracer and the test cases were lifted from the .NET Tracer.
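For readers who want a concrete picture of the line-by-line regex approach, here is a minimal sketch. The pattern below is illustrative only and is an assumption on my part: it is not the expression used in this PR or in the Java Tracer. It accepts a 64 hex-char container ID, or a 32 hex-char task ID followed by `-<digits>`, optionally suffixed by `.scope`. The function name `find_container_id_sketch` is hypothetical.

```cpp
#include <optional>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: scans cgroup file contents line by line and returns
// the first ID-looking token found at the end of a line.
// NOTE: the pattern is a simplified illustration, not the PR's regex.
inline std::optional<std::string> find_container_id_sketch(
    const std::string& cgroup_contents) {
  static const std::regex id_at_end(
      R"(([0-9a-f]{64}|[0-9a-f]{32}-[0-9]+)(?:\.scope)?$)");

  std::istringstream in(cgroup_contents);
  std::string line;
  std::smatch match;
  while (std::getline(in, line)) {
    // regex_search finds the leftmost position where the ID reaches the
    // end of the line; capture group 1 holds the ID without ".scope".
    if (std::regex_search(line, match, id_at_end)) {
      return match[1].str();
    }
  }
  return std::nullopt;
}
```

With this sketch, a line such as `0::/system.slice/docker-<64 hex chars>.scope` yields the 64 hex chars, and a line with no ID-shaped suffix yields `std::nullopt`.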

Motivation

We've had a customer use-case where container-id parsing from the tracer was not functioning with containers in Fargate.

Additional Notes

The pre-existing Docker container-id was modified in the test code to be a 64 hex-char UUID, from "0::/system.slice/docker-abcdef0123456789abcdef0123456789.scope" => "0::/system.slice/docker-cde7c2bab394630a42d73dc610b9c57415dced996106665d427f6d0566594411.scope". Since this format is expected in other tracers, this felt like a safe change to make. If we were previously supporting environments where only 32 hex-chars were expected after the docker- prefix, let me know and we can revert this change.

pr-commenter · 2025-06-30T23:43:27Z

Benchmarks

Benchmark execution time: 2025-07-16 22:29:41

Comparing candidate commit 9d1946b in PR branch zach.montoya/container-id-fixup with baseline commit dbe6292 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metrics, 0 unstable metrics.

codecov-commenter · 2025-06-30T23:48:03Z

Codecov Report

Attention: Patch coverage is 87.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 86.51%. Comparing base (dbe6292) to head (9d1946b).
Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/datadog/platform_util.cpp	87.50%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #222      +/-   ##
==========================================
+ Coverage   86.45%   86.51%   +0.05%     
==========================================
  Files          80       80              
  Lines        5251     5264      +13     
==========================================
+ Hits         4540     4554      +14     
+ Misses        711      710       -1

dmehala · 2025-07-07T12:28:49Z

@zacharycmontoya have you had a chance to test this manually? If so, could you share the steps so I can try it on my end as well? Also, the regex looks quite complex, have we considered reading the ECS task metadata as an alternative approach?

zacharycmontoya · 2025-07-08T16:13:46Z

> @zacharycmontoya have you had a chance to test this manually? If so, could you share the steps so I can try it on my end as well?

I have not tested this manually yet. I'll need to figure out how to test this.

> Also, the regex looks quite complex, have we considered reading the ECS task metadata as an alternative approach?

I have not considered reading the ECS task metadata, since that would seem to require issuing an HTTP request, which is less appealing to me than reading from the filesystem. Although the regex looks complex, it is maintained without many changes in other tracing repositories, so I'm not concerned about the maintenance burden.

dmehala · 2025-07-10T12:44:18Z

I understand. Given that STL regexes are slow (and I am not even mentioning all the other issues with regexes in general), might I propose a two-pass approach? First, scan the filesystem with the current algorithm. If no container ID is found, proceed with the regex path. WDYT @zacharycmontoya?

zacharycmontoya · 2025-07-10T18:18:55Z

> I understand. Given that STL regexes are slow (and I am not even mentioning all the other issues with regexes in general), might I propose a two-pass approach? First, scan the filesystem with the current algorithm. If no container ID is found, proceed with the regex path. WDYT @zacharycmontoya?

Sure, if that approach is preferable to you, I'll go ahead and implement it. My worry is that the current filesystem approach only checks for Docker runtimes, so we may end up in the regex fallback in the majority of cases.

zacharycmontoya · 2025-07-11T00:45:47Z

With regard to testing, I'm unable to create new resources directly in AWS, so I haven't gotten a true reproduction of the Fargate issue.

However, I've started this system-tests PR (system-tests#4925) to better assert the container-id logic, which should be able to run against the cpp_httpd and cpp_nginx libraries, but I don't know how to update them (they run against official releases). Do you have any suggestions?

Once I'm able to test against that system-tests branch, I'll update the logic like you suggested.

zacharycmontoya · 2025-07-15T00:20:47Z

src/datadog/platform_util.cpp

-        id.type = ContainerID::Type::container_id;
-        break;
-      }
+  if (auto maybe_id = find_container_id_from_cgroup()) {


At this moment, this changes the behavior to align more closely with the .NET and Java libraries:

Get the container-id first by reading (and parsing if available) the file /proc/self/cgroup

If getting the container-id fails, then try to get the inode, unless we are in the host cgroup namespace. The host cgroup namespace check existed before, but this PR moves it so that it only runs when invoking the inode fallback.
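The ordering described in the two steps above could be sketched roughly as follows. This is only an illustration of the control flow, not the code in `platform_util.cpp`: the `Probes` struct, `ContainerIdSketch`, and `resolve` are hypothetical names, and the three probes are injected as callbacks so the flow can be exercised without touching `/proc` or `/sys`.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical result type, loosely mirroring the ContainerID::Type values
// that appear in the diff (container_id, cgroup_inode).
struct ContainerIdSketch {
  enum class Type { container_id, cgroup_inode } type;
  std::string value;
};

// Hypothetical bundle of probes; real code would read /proc/self/cgroup,
// check the cgroup namespace, and stat /sys/fs/cgroup.
struct Probes {
  std::function<std::optional<std::string>()> container_id_from_cgroup;
  std::function<bool()> in_host_cgroup_namespace;
  std::function<std::optional<unsigned long long>()> cgroup_inode;
};

inline std::optional<ContainerIdSketch> resolve(const Probes& p) {
  // Step 1: prefer the container-id parsed from the cgroup file.
  if (auto id = p.container_id_from_cgroup()) {
    return ContainerIdSketch{ContainerIdSketch::Type::container_id, *id};
  }
  // Step 2: the host-namespace check only gates the inode fallback.
  if (p.in_host_cgroup_namespace()) {
    return std::nullopt;
  }
  if (auto inode = p.cgroup_inode()) {
    return ContainerIdSketch{ContainerIdSketch::Type::cgroup_inode,
                             std::to_string(*inode)};
  }
  return std::nullopt;
}
```

The point of the ordering is that a failing or misleading host-namespace check can no longer prevent a container-id from being reported when the cgroup file does contain one.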

Additional changes I plan to make:

This still doesn't incorporate the fast-path logic to avoid the regex usage, so I still plan to include that.

This has been tested with system-tests by running the cpp_nginx library locally against this system-tests PR (system-tests#4925). To get an nginx-datadog build, I had the CI run on the zach.montoya/test-dd-trace-cpp-container-id branch of Datadog/nginx-datadog, at its latest commit.

dmehala · 2025-07-15

In the current code, we clearly follow two paths: one for cgroup v1 and one for cgroup v2. Most importantly, since we know for sure we can’t get the container ID for cgroup v2, there’s no need to try in that case.

I spent a good amount of time reading the RFC, and I believe this version matches it better. I suggest we keep it this way.

EDIT: Given that the container ID is not found, it should report the inode. My understanding is that using the inode alone should be sufficient for host-level tag correlation. Have we had a chance to investigate why this approach might not be working as expected?

zacharycmontoya · 2025-07-15

The issue I had (maybe this is simply a Docker in Docker scenario) is that the original get_cgroup_version() implementation failed when running system-tests locally, even though getting the container ID via /proc/self/cgroup would end up succeeding. For this reason, I removed that logic entirely.

I can take a closer look at /sys/fs/cgroup to understand why the existing cgroup v1/v2 lookup failed. To emphasize, though: I both moved the host namespace check and removed the cgroup lookup because they were failing in the system-tests cpp_nginx weblog container, and revising the logic to align more closely with .NET and Java (while also passing the system-tests) yielded this.

> I spent a good amount of time reading the RFC, and I believe this version matches it better. I suggest we keep it this way.

Just to make sure I'm understanding you correctly on this, is the newly proposed code your preference? Or is your preference the previous code where we clearly separated the logical paths based on cgroup v1 / v2? The "this" and "current" words in your comment were slightly ambiguous to me when reading.

Following up on this, I ran the nginx:1.25.4 container (the one where the cpp_nginx weblog is deployed) and there's no statfs command available. When making the call from the stdlib, does statfs have to be available for that call to succeed? Because it was the statfs("/sys/fs/cgroup", &buf) that was failing before

So actually, I was able to run this, which I think functions the same as statfs():

# stat -f /sys/fs/cgroup
  File: "/sys/fs/cgroup"
    ID: 40b90acb90ae1b43 Namelen: 255     Type: tmpfs
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 2020153    Free: 2020153    Available: 2020153
Inodes: Total: 2020153    Free: 2020137

However, the type is tmpfs, which is not CGROUP_SUPER_MAGIC 0x27e0eb or CGROUP2_SUPER_MAGIC 0x63677270.

I've also confirmed that the statfs("/sys/fs/cgroup", &buf) call returns TMPFS_MAGIC 0x01021994. So this means in the Docker in Docker scenario where we can find a container-ID, we get the f_type of tmpfs. Shall we restore the previous Cgroup logic but consider tmpfs as Cgroup::v1?

dmehala

> is the newly proposed code your preference? Or is your preference the previous code where we clearly separated the logical paths based on cgroup v1 / v2? The "this" and "current" words in your comment were slightly ambiguous to me when reading

My bad. I'd rather keep the previous code.

> I've also confirmed that the statfs("/sys/fs/cgroup", &buf) call returns TMPFS_MAGIC 0x01021994

cgroup v1 controllers are usually mounted under a tmpfs source. We should restore the previous cgroup logic and treat both CGROUP_SUPER_MAGIC and TMPFS_MAGIC as cgroup v1.
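As a rough illustration of that suggestion, the mapping from the `statfs` filesystem type to a cgroup version might look like the sketch below. The magic numbers are the ones quoted in this thread; the names `classify_cgroup_fs` and `probe_cgroup_version` are hypothetical and do not claim to match the helpers in `platform_util.cpp`. The classification is kept as a pure function so it can be checked without a real cgroup mount.

```cpp
#include <cstdint>
#include <optional>

#if defined(__linux__)
#include <sys/vfs.h>  // statfs
#endif

enum class Cgroup { v1, v2 };

// Filesystem magic numbers as quoted in this thread.
constexpr std::uint64_t kCgroupSuperMagic = 0x27e0eb;
constexpr std::uint64_t kCgroup2SuperMagic = 0x63677270;
constexpr std::uint64_t kTmpfsMagic = 0x01021994;

// Hypothetical helper: maps a statfs f_type value to a cgroup version.
inline std::optional<Cgroup> classify_cgroup_fs(std::uint64_t f_type) {
  switch (f_type) {
    case kCgroupSuperMagic:
    case kTmpfsMagic:  // v1 controllers are usually mounted under tmpfs
      return Cgroup::v1;
    case kCgroup2SuperMagic:
      return Cgroup::v2;
    default:
      return std::nullopt;  // unknown filesystem type
  }
}

#if defined(__linux__)
// Hypothetical wrapper around the statfs("/sys/fs/cgroup", &buf) call
// discussed above.
inline std::optional<Cgroup> probe_cgroup_version(
    const char* path = "/sys/fs/cgroup") {
  struct statfs buf;
  if (statfs(path, &buf) != 0) {
    return std::nullopt;
  }
  return classify_cgroup_fs(static_cast<std::uint64_t>(buf.f_type));
}
#endif
```

With this mapping, the tmpfs result observed in the nginx weblog container would take the cgroup v1 path instead of being treated as an unknown filesystem.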

zacharycmontoya

Understood, will follow up with the requested changes 👍🏼

OK, I added the following:

Restore the original cgroup control flow with tmpfs in 587405f

Change the container-id lookup to run the substring logic first, then the regex as a fallback, in 376793e
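For anyone skimming the thread, the "substring first, regex as fallback" shape could be sketched as below. This is a sketch under assumptions: the fast path assumes the `docker-<id>.scope` layout shown in the PR description, the regex pass is injected as a callback rather than spelled out, and the names `Finder`, `find_docker_id_fast`, and `find_container_id_two_pass` are hypothetical.

```cpp
#include <functional>
#include <optional>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical signature for the slower regex-based pass.
using Finder =
    std::function<std::optional<std::string>(const std::string&)>;

// Fast path: plain substring search for "docker-<id>.scope" on each line.
inline std::optional<std::string> find_docker_id_fast(
    const std::string& cgroup_contents) {
  constexpr std::string_view prefix = "docker-";
  constexpr std::string_view suffix = ".scope";

  std::istringstream in(cgroup_contents);
  std::string line;
  while (std::getline(in, line)) {
    auto beg = line.find(prefix);
    if (beg == std::string::npos) continue;
    beg += prefix.size();
    const auto end = line.find(suffix, beg);
    if (end == std::string::npos || end == beg) continue;  // no or empty ID
    return line.substr(beg, end - beg);
  }
  return std::nullopt;
}

// Two passes: only pay for the regex when the cheap scan finds nothing.
inline std::optional<std::string> find_container_id_two_pass(
    const std::string& cgroup_contents, const Finder& regex_fallback) {
  if (auto id = find_docker_id_fast(cgroup_contents)) {
    return id;
  }
  return regex_fallback(cgroup_contents);
}
```

The trade-off raised earlier in the thread still applies: on non-Docker runtimes the first pass finds nothing, so the regex pass runs anyway.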

dmehala · 2025-07-16T21:14:14Z

src/datadog/platform_util.cpp

-      if (auto maybe_inode = get_inode("/sys/fs/cgroup")) {
-        id.type = ContainerID::Type::cgroup_inode;
-        id.value = std::to_string(*maybe_inode);
+      if (!is_running_in_host_namespace()) {


I’m not quite sure I understand the reasoning behind moving this here. Could you explain?

zacharycmontoya

There are two reasons why I did this:

In our Java and C# implementations, we only perform this host cgroup namespace check when we try to get the inode, so this aligns with those implementations.

I'm not sure if this is a WSL/WSL2 issue, but when I run the system-tests against the nginx weblog container we're DEFINITELY running in a container, yet the check fails. Perhaps this is the same Docker in Docker issue referenced in the code comments. Either way, this shouldn't fail and return early when we can definitely extract the container-id in the cgroup v1 scenario.

dmehala

I haven't investigated much; however, IIRC this function can fail in a Docker in Docker setup.

dd-trace-cpp/src/datadog/platform_util.cpp

Lines 313 to 316 in 31f263f

// Host namespace inode numbers are hardcoded, which allows for detection of
// whether the binary is running in host or not. However, it does not work when
// running in a Docker in Docker environment.
bool is_running_in_host_namespace() {

dmehala

LGTM! thanks.

zacharycmontoya added 3 commits June 30, 2025 14:13

Update docker container ID test to use a UUID that's 64 hex-chars long, as this is what other tracers are testing

edb4f89

Rename container::find_docker_container_id to container::find_container_id and update its implementation so that it is regex-based and covers the same set of inputs as other tracers, like Java and .NET -- the implementation is borrowed from the Java tracer

bb3d458

Add other cgroup file examples from the .NET Tracer

2174c80

zacharycmontoya requested a review from a team as a code owner June 30, 2025 21:37

zacharycmontoya requested review from dubloom and removed request for a team June 30, 2025 21:37

zacharycmontoya changed the title ~~feat: support additional group formats for container-id parsing~~ feat: support additional cgroup formats for container-id parsing Jun 30, 2025

Fix formatting

136d4cc

zacharycmontoya added 6 commits July 14, 2025 12:58

Add temporary logging to stdout and stderr for the container-id parsing

43f214a

This doesn't pass unit tests, but now I am writing some more helpful logs into our error logging so I can debug why this isn't working in system-tests

d6e9297

Remove the host namespace check since it stops us from working on Docker Desktop.

6251ac4

Also remove the eager cgroup lookup because that's stopping us from collecting the container-id on Docker Desktop

7fb67c9

Refactor to better align with Java implementation

70d6db4

Remove all uses of logger now that we're done debugging system-tests issues. Also remove unused code.

f0abaad

zacharycmontoya commented Jul 15, 2025

zacharycmontoya added 4 commits July 15, 2025 11:38

Put back the get_cgroup_version and add temporary logging to confirm the f_type returned by the statfs call

565c4e0

Restore the cgroup logic, but also include TMPFS_MAGIC as a valid value for using our cgroup v1 lookup

587405f

Remove logging (again) and reduce the diff

08909b1

Restore the previous substring approach and execute this first before invoking regex matching

376793e

dmehala reviewed Jul 16, 2025

zacharycmontoya and others added 3 commits July 16, 2025 14:36

Apply suggestions from code review

039883a

Co-authored-by: Damien Mehala <damien.mehala@datadoghq.com>

Move the regex string literals closer to their usage

d8576b9

Fix formatting

9d1946b

dmehala approved these changes Jul 22, 2025

zacharycmontoya merged commit b44a8a3 into main Jul 22, 2025
24 checks passed

zacharycmontoya deleted the zach.montoya/container-id-fixup branch July 22, 2025 16:05
