Conversation

@zacharycmontoya
Contributor

Description

This PR adds support for additional container runtimes so we can properly parse and propagate the container-id to the Datadog Agent. The implementation continues to read the cgroup file line-by-line, but the logic has been modified to run a regex against the formats used by particular runtimes. The regex was lifted from the Java Tracer here, and the test cases were lifted from the .NET Tracer here.
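
For illustration, here is a minimal sketch of the shape of that approach. The patterns below are simplified stand-ins for the actual regex lifted from the Java Tracer, and parse_container_id is a hypothetical name rather than the function added in this PR:

#include <fstream>
#include <optional>
#include <regex>
#include <string>

std::optional<std::string> parse_container_id(const std::string& cgroup_path) {
  // Simplified stand-ins for the real pattern: 64-hex-char Docker IDs,
  // Fargate-style "<32 hex>-<digits>" task IDs, and GUID-style ECS IDs.
  static const std::regex id_pattern(
      "[0-9a-f]{64}"
      "|[0-9a-f]{32}-[0-9]+"
      "|[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}");
  std::ifstream file(cgroup_path);
  std::string line;
  // Continue reading the cgroup file line-by-line, applying the regex to each.
  while (std::getline(file, line)) {
    std::smatch match;
    if (std::regex_search(line, match, id_pattern)) {
      return match.str(0);
    }
  }
  return std::nullopt;
}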

Motivation

We've had a customer use case where container-id parsing from the tracer was not functioning for containers in Fargate.

Additional Notes

The pre-existing Docker container-id in the test code was modified to be a 64 hex-char ID, from "0::/system.slice/docker-abcdef0123456789abcdef0123456789.scope" => "0::/system.slice/docker-cde7c2bab394630a42d73dc610b9c57415dced996106665d427f6d0566594411.scope". Since this format is expected in other tracers, this felt like a safe change to make. If we were previously supporting environments where only 32 hex-chars were expected after the docker- prefix, then let me know and we can revert this change.

@zacharycmontoya requested a review from a team as a code owner June 30, 2025 21:37
@zacharycmontoya requested review from dubloom and removed request for a team June 30, 2025 21:37
@zacharycmontoya changed the title from "feat: support additional group formats for container-id parsing" to "feat: support additional cgroup formats for container-id parsing" Jun 30, 2025
@pr-commenter

pr-commenter bot commented Jun 30, 2025

Benchmarks

Benchmark execution time: 2025-07-16 22:29:41

Comparing candidate commit 9d1946b in PR branch zach.montoya/container-id-fixup with baseline commit dbe6292 in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 1 metric, 0 unstable metrics.

@codecov-commenter

codecov-commenter commented Jun 30, 2025

Codecov Report

Attention: Patch coverage is 87.50000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 86.51%. Comparing base (dbe6292) to head (9d1946b).
Report is 2 commits behind head on main.

Files with missing lines        Patch %   Lines
src/datadog/platform_util.cpp   87.50%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #222      +/-   ##
==========================================
+ Coverage   86.45%   86.51%   +0.05%     
==========================================
  Files          80       80              
  Lines        5251     5264      +13     
==========================================
+ Hits         4540     4554      +14     
+ Misses        711      710       -1     


@dmehala
Collaborator

dmehala commented Jul 7, 2025

@zacharycmontoya have you had a chance to test this manually? If so, could you share the steps so I can try it on my end as well? Also, the regex looks quite complex; have we considered reading the ECS task metadata as an alternative approach?

@zacharycmontoya
Contributor Author

> @zacharycmontoya have you had a chance to test this manually? If so, could you share the steps so I can try it on my end as well?

I have not tested this manually yet. I'll need to figure out how to test this.

> Also, the regex looks quite complex; have we considered reading the ECS task metadata as an alternative approach?

I have not considered reading the ECS task metadata, as that would require issuing an HTTP request to get this information, which is less appealing to me than reading from the filesystem. Although the regex looks complex, it is also maintained with few changes in other tracing repositories, so I'm not concerned about the maintenance burden.

@dmehala
Collaborator

dmehala commented Jul 10, 2025

I understand. Given that STL regexes are slow (and I am not even mentioning all the other issues with regexes in general), might I propose a two-pass approach? First, scan the filesystem with the current algorithm; if no container ID is found, proceed with the regex path. WDYT @zacharycmontoya?
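
A minimal sketch of that two-pass shape, under the assumption that the regex path lives in a helper; find_container_id_regex is a hypothetical name, declared here only for illustration:

#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Assumed regex-based fallback, e.g. the parser added in this PR.
std::optional<std::string> find_container_id_regex(const std::string& line);

std::optional<std::string> find_container_id(const std::string& line) {
  // Pass 1: the current algorithm, a plain substring scan for "docker-"
  // followed by 64 hex chars. No std::regex cost on this path.
  static const std::string marker = "docker-";
  const auto pos = line.find(marker);
  if (pos != std::string::npos) {
    const std::size_t start = pos + marker.size();
    std::size_t len = 0;
    while (start + len < line.size() &&
           std::isxdigit(static_cast<unsigned char>(line[start + len]))) {
      ++len;
    }
    if (len == 64) return line.substr(start, 64);
  }
  // Pass 2: only pay the regex cost when the fast path finds nothing.
  return find_container_id_regex(line);
}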

@zacharycmontoya
Contributor Author

> I understand. Given that STL regexes are slow (and I am not even mentioning all the other issues with regexes in general), might I propose a two-pass approach? First, scan the filesystem with the current algorithm; if no container ID is found, proceed with the regex path. WDYT @zacharycmontoya?

Sure, if that approach is preferable to you, I'll go ahead and implement it. My worry is that the current filesystem approach only checks for Docker runtimes, so we may fall into the regex fallback case in the majority of cases.

@zacharycmontoya
Contributor Author

With regards to testing, I'm unable to create new resources directly in AWS so I haven't gotten a true reproduction of the Fargate issue.

However, I've started this system-tests PR (system-tests#4925) to better assert the container-id logic, which should be able to run against the cpp_httpd and cpp_nginx libraries, but I don't know how to update them (they run against official releases). Do you have any suggestions?

Once I'm able to test against that system-tests branch, I'll update the logic like you suggested.

id.type = ContainerID::Type::container_id;
break;
}
if (auto maybe_id = find_container_id_from_cgroup()) {
Contributor Author
@zacharycmontoya Jul 15, 2025

At this point, this changes the behavior to align more closely with the .NET and Java libraries:

  • Get the container-id first by reading (and parsing, if available) the file /proc/self/cgroup
  • If getting the container-id fails, then try to get the inode, unless we are in the host cgroup namespace. This host cgroup namespace check existed before, but this PR moves the logic so that the check only runs when invoking the inode fallback (see the sketch after this comment).

Additional changes I plan to make:

  • This still doesn't incorporate the fast-path logic to avoid the regex usage, so I still plan to include that.

This has been tested with system-tests by running the cpp_nginx library locally against this system-tests PR (system-tests#4925). To get an nginx-datadog build, I had the CI run on the zach.montoya/test-dd-trace-cpp-container-id branch of Datadog/nginx-datadog, with the latest commit here.
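
For reference, a rough sketch of the lookup order described above, reusing the helper names visible in this diff; resolve_container_id is a hypothetical wrapper and error handling is elided:

ContainerID resolve_container_id() {
  ContainerID id;
  // 1. Prefer the real container-id parsed from /proc/self/cgroup.
  if (auto maybe_id = find_container_id_from_cgroup()) {
    id.type = ContainerID::Type::container_id;
    id.value = *maybe_id;
    return id;
  }
  // 2. Fall back to the cgroup inode, but only when we are not running in
  //    the host cgroup namespace, where the inode is not useful.
  if (!is_running_in_host_namespace()) {
    if (auto maybe_inode = get_inode("/sys/fs/cgroup")) {
      id.type = ContainerID::Type::cgroup_inode;
      id.value = std::to_string(*maybe_inode);
    }
  }
  return id;
}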

Collaborator
@dmehala Jul 15, 2025

In the current code, we clearly follow two paths: one for cgroup v1 and one for cgroup v2. Most importantly, since we know for sure we can’t get the container ID for cgroup v2, there’s no need to try in that case.

I spent a good amount of time reading the RFC, and I believe this version matches it better. I suggest we keep it this way.

EDIT: Given that the container ID is not found, it should report the inode. My understanding is that using the inode alone should be sufficient for host-level tag correlation. Have we had a chance to investigate why this approach might not be working as expected?

Contributor Author
@zacharycmontoya Jul 15, 2025

The issue I had (maybe this is simply a Docker in Docker scenario) is that the original get_cgroup_version() implementation failed when running system-tests locally, even though getting the container ID via /proc/self/cgroup would end up succeeding. For this reason, I removed that logic entirely.

I can take a closer look at /sys/fs/cgroup to understand why the existing cgroup v1/v2 lookup failed, but to emphasize: I made the changes to both move the host namespace check and remove the cgroup version lookup because they were failing in the system-tests cpp_nginx weblog container, and revising the logic to align more closely with .NET and Java (while also passing the system-tests) yielded this.

Contributor Author

> I spent a good amount of time reading the RFC, and I believe this version matches it better. I suggest we keep it this way.

Just to make sure I'm understanding you correctly on this, is the newly proposed code your preference? Or is your preference the previous code where we clearly separated the logical paths based on cgroup v1 / v2? The "this" and "current" words in your comment were slightly ambiguous to me when reading.

Contributor Author

Following up on this, I ran the nginx:1.25.4 container (the one where the cpp_nginx weblog is deployed) and there's no statfs command available. When making the call from the stdlib, does statfs have to be available for that call to succeed? Because it was the statfs("/sys/fs/cgroup", &buf) call that was failing before.

Contributor Author

So actually, I was able to run this, which I think functions the same as statfs():

# stat -f /sys/fs/cgroup
  File: "/sys/fs/cgroup"
    ID: 40b90acb90ae1b43 Namelen: 255     Type: tmpfs
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 2020153    Free: 2020153    Available: 2020153
Inodes: Total: 2020153    Free: 2020137

However, the type is tmpfs, which is not CGROUP_SUPER_MAGIC 0x27e0eb or CGROUP2_SUPER_MAGIC 0x63677270.

I've also confirmed that the statfs("/sys/fs/cgroup", &buf) call returns an f_type of TMPFS_MAGIC 0x01021994. So this means that in the Docker in Docker scenario, where we can find a container-id, we get an f_type of tmpfs. Shall we restore the previous Cgroup logic but consider tmpfs as Cgroup::v1?

Collaborator
@dmehala Jul 16, 2025

> is the newly proposed code your preference? Or is your preference the previous code where we clearly separated the logical paths based on cgroup v1 / v2? The "this" and "current" words in your comment were slightly ambiguous to me when reading.

My bad. I'd rather keep the previous code.

> I've also confirmed that the statfs("/sys/fs/cgroup", &buf) call returns an f_type of TMPFS_MAGIC 0x01021994

cgroup v1 controllers are usually mounted under a tmpfs source. We should restore the previous cgroup logic and consider both CGROUP_SUPER_MAGIC and TMPFS_MAGIC.
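
A sketch of what the restored probe could look like with that accommodation; the Cgroup enum and get_cgroup_version name follow the discussion above, and the exact restored code may differ:

#include <linux/magic.h>  // CGROUP_SUPER_MAGIC, CGROUP2_SUPER_MAGIC, TMPFS_MAGIC
#include <sys/statfs.h>

enum class Cgroup { v1, v2, unknown };

Cgroup get_cgroup_version() {
  struct statfs buf;
  if (statfs("/sys/fs/cgroup", &buf) != 0) return Cgroup::unknown;
  if (buf.f_type == CGROUP2_SUPER_MAGIC) return Cgroup::v2;  // 0x63677270
  // cgroup v1 controllers are usually mounted under a tmpfs source, so
  // accept TMPFS_MAGIC (0x01021994) alongside CGROUP_SUPER_MAGIC (0x27e0eb).
  if (buf.f_type == CGROUP_SUPER_MAGIC || buf.f_type == TMPFS_MAGIC) {
    return Cgroup::v1;
  }
  return Cgroup::unknown;
}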

Contributor Author

Understood, will follow up with the requested changes 👍🏼

Contributor Author

Ok, I added the following:

  • Restore the original cgroup control flow, now accepting tmpfs, in 587405f
  • Change the container-id lookup to do the substring logic first, then the regex as a fallback, in 376793e

if (auto maybe_inode = get_inode("/sys/fs/cgroup")) {
id.type = ContainerID::Type::cgroup_inode;
id.value = std::to_string(*maybe_inode);
if (!is_running_in_host_namespace()) {
Collaborator

I'm not quite sure I understand the reasoning behind moving this here. Could you explain?

Contributor Author

There are two reasons why I did this:

  • In our Java and C# implementations, we only perform this host cgroup namespace check when we try to get the inode, so this aligns with those implementations.
  • I'm not sure if this is a WSL/WSL2 issue, but when I run the system-tests against the nginx weblog container we're DEFINITELY running in a container, yet the check fails. Perhaps this is the same Docker in Docker issue referenced in the code comments, but in any case this shouldn't fail and return early when we can definitely extract the container-id in the cgroup v1 scenario.

Collaborator

I haven't investigated too much; however, IIRC this function can fail in a Docker in Docker setup.

// Host namespace inode numbers are hardcoded, which allows for detection of
// whether the binary is running on the host or not. However, it does not work
// when running in a Docker in Docker environment.
bool is_running_in_host_namespace() {
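
For context, an illustrative sketch of how such a check is commonly implemented; the body below is an assumption, not necessarily this repository's exact code. The hardcoded value is the kernel's PROC_CGROUP_INIT_INO:

#include <sys/stat.h>

bool is_running_in_host_namespace() {
  // The initial (host) cgroup namespace has a hardcoded inode number in the
  // kernel: PROC_CGROUP_INIT_INO == 0xEFFFFFFB.
  constexpr ino_t host_cgroup_ns_inode = 0xEFFFFFFB;
  struct stat buf;
  if (stat("/proc/self/ns/cgroup", &buf) != 0) return false;
  // Note: as the comment above says, this detection breaks down in
  // Docker in Docker environments.
  return buf.st_ino == host_cgroup_ns_inode;
}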

Collaborator
@dmehala left a comment

LGTM! thanks.

@zacharycmontoya merged commit b44a8a3 into main Jul 22, 2025
24 checks passed
@zacharycmontoya deleted the zach.montoya/container-id-fixup branch July 22, 2025 16:05