[OTAGENT-823] bootstrap Dogtel extension #47532
gh-worker-dd-mergequeue-cf854d[bot] merged 18 commits into main
Go Package Import Differences
Baseline: 0b0f0c7
Files inventory check summary
File checks results against ancestor 0b0f0c74:
Results for datadog-agent_7.79.0~devel.git.297.f5a8d19.pipeline.105115926-1_amd64.deb: No change detected
Static quality checks
✅ Please find below the results from static quality gates
Successful checks
Info
27 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: 1555c13
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -2.12 | [-5.10, +0.86] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | otlp_ingest_logs | memory utilization | +1.62 | [+1.50, +1.74] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.44 | [+0.27, +0.60] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.32 | [+0.18, +0.46] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.21 | [+0.15, +0.27] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | +0.15 | [-0.08, +0.37] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.01 | [-0.53, +0.54] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.11, +0.11] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.21, +0.20] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.02 | [-0.22, +0.18] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.11, +0.07] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.05 | [-0.47, +0.38] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.05 | [-0.22, +0.13] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.05 | [-0.45, +0.35] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.07 | [-0.23, +0.10] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.07 | [-0.14, +0.00] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.21 | [-0.25, -0.17] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | -0.29 | [-0.34, -0.24] | 1 | Logs bounds checks dashboard |
| ➖ | file_tree | memory utilization | -0.40 | [-0.46, -0.34] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.83 | [-0.89, -0.76] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.85 | [-1.02, -0.69] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -1.22 | [-1.46, -0.99] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | -1.55 | [-3.15, +0.04] | 1 | Logs bounds checks dashboard |
| ➖ | docker_containers_cpu | % cpu utilization | -2.12 | [-5.10, +0.86] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 710 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 271.23MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 703 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 172.59MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 491.51MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 204.01MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 361.74 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 414.26MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
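The three criteria above can be sketched as a small predicate. This is only an illustration of the stated rule, not the Regression Detector's actual code; the `Experiment` struct, its field names, and `IsRegression` are hypothetical. The sample rows reuse values from the tables in this report.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment mirrors one row of the change-detection table.
// The type and its field names are hypothetical, chosen for this sketch.
type Experiment struct {
	Name          string
	DeltaMeanPct  float64 // the "Δ mean %" column
	CILow, CIHigh float64 // bounds of the 90% "Δ mean % CI" column
	Erratic       bool    // true if the experiment's configuration marks it erratic
}

// effectSizeTolerance is the |Δ mean %| threshold stated in the report.
const effectSizeTolerance = 5.00

// IsRegression applies the three criteria: a large enough effect,
// a confidence interval that excludes zero, and a non-erratic experiment.
func IsRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	rows := []Experiment{
		{Name: "otlp_ingest_logs", DeltaMeanPct: 1.62, CILow: 1.50, CIHigh: 1.74},
		{Name: "docker_containers_cpu", DeltaMeanPct: -2.12, CILow: -5.10, CIHigh: 0.86},
	}
	for _, r := range rows {
		fmt.Printf("%s regression=%t\n", r.Name, IsRegression(r))
	}
}
```

Under this reading, otlp_ingest_logs has an interval that excludes zero but an effect below the 5.00% tolerance, so it is not flagged; docker_containers_cpu fails both of the first two criteria.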
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
0f93f0b to 5d620f6
Introduces the dogtelextension OTel Collector extension and refactors otel-agent startup to support standalone mode (DD_OTEL_STANDALONE=true), enabling the otel-agent to run independently without a core Datadog Agent.
Key changes:
- dogtelextension (comp/otelcol/dogtelextension): New OTel Collector extension providing a tagger gRPC server, host metadata submission, and secrets resolution for standalone mode.
- Standalone/connected FX split (cmd/otel-agent/subcommands/run): Refactors otel-agent startup into commonAgentFxOptions plus mode-specific standaloneAgentFxOptions / connectedAgentFxOptions. Standalone mode wires local hostname, real secrets backend, local tagger, host metadata runner, and disables on-init config sync. Connected mode keeps remote hostname, remote tagger, and core-agent config sync.
- K8s tag enrichment (comp/core/workloadmeta/collectors/catalog-otel): New catalog-otel workloadmeta catalog (kubelet, containerd, docker, ECS, crio, podman) compiled into otel-agent via the new kubelet build tag. In standalone mode the infraattributes processor enriches spans, metrics, and logs with K8s tags (kube_deployment, kube_namespace, pod_name, etc.) via the local tagger.
Deployments require DD_KUBERNETES_KUBELET_HOST=status.hostIP, DD_KUBELET_TLS_VERIFY=false (or CA cert), and nodes/proxy RBAC on the otel-agent ServiceAccount for K8s tag enrichment.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
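As a rough illustration of the standalone/connected split described in this commit, the sketch below picks a wiring list based on the mode. It is not the PR's code: the real startup composes fx options, while here `wiringFor` and `standaloneFromEnv` are hypothetical helpers that return plain strings, and reading DD_OTEL_STANDALONE with `strconv.ParseBool` is an assumption about how the flag might be parsed.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// wiringFor lists what each mode wires, using the descriptions from the
// commit message. "common" stands in for commonAgentFxOptions, whose
// contents are not spelled out in the PR conversation.
func wiringFor(standalone bool) []string {
	wiring := []string{"common"}
	if standalone {
		return append(wiring,
			"local hostname",
			"real secrets backend",
			"local tagger",
			"host metadata runner",
			"on-init config sync disabled",
		)
	}
	return append(wiring,
		"remote hostname",
		"remote tagger",
		"core-agent config sync",
	)
}

// standaloneFromEnv is a hypothetical helper: it treats an unset or
// unparsable DD_OTEL_STANDALONE as connected mode.
func standaloneFromEnv() bool {
	v, err := strconv.ParseBool(os.Getenv("DD_OTEL_STANDALONE"))
	return err == nil && v
}

func main() {
	for _, item := range wiringFor(standaloneFromEnv()) {
		fmt.Println(item)
	}
}
```

Defaulting to connected mode when the variable is absent matches the backward-compatibility point raised in the review thread, though the eventual default is still under discussion there.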
0dcb28d to 3d7b219
truthbk left a comment
Super clean bootstrap! Also love how you were able to bring in the best of both worlds with fx + actual otel extension interfaces; and that resolves the extension configuration issue very cleanly. This is awesome.
We have to talk about what the otel-agent should default to, but this is a great start.
)
}
if acfg.GetBool("otel_standalone") {
So, I have some doubts with this: should we instead consider a check on otel_bundled? Or !acfg.GetBool("otel_standalone")?
On one hand this is better because it's backward compatible with our operator and helm charts. On the other it's not ideal because we'd have to set an env var when deploying with the otel operator/helm. We really do want to make a strong attempt to minimize the number of steps our OTel customers need to take on tooling we don't have full control over. Let's discuss this.
Customers would have to set env vars in the otel operator/helm already, e.g. DD_OTELCOLLECTOR_ENABLED. Setting one more env var is probably fine.
I think we should optimize to minimize the number of options a customer needs to set on the OpenTelemetry operator/helm chart. I feel like we can get away with a lot more of that transparently on the DD side.
I'm fine with merging this as-is, but I also think there's a chance we'll want to revisit this specifically.
…andalone mode
- Apply dogtelextension settings to DD agent pkgconfig only when
otel_standalone=true; connected mode leaves core agent config untouched.
- Make EnableMetadataCollection a *bool (like KubeletTLSVerify) so absence
preserves the agent default rather than forcing false.
- Add MetadataInterval default (1800 s) to comment.
- Gate standalone block with pkgconfig.GetBool("otel_standalone").
- Add TestDogtelExtensionConfig_ConnectedModeIgnored to assert dogtelextension
fields are no-ops in connected mode.
- Tests use DD_OTEL_STANDALONE=true env var for standalone test cases.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
60768d2 to 4c5b322
jeremy-hanna left a comment
👍 for agent-runtime owned files
truthbk left a comment
Looks good to me. Added a couple of nits you can feel free to ignore. I do think for the actual standalone vs connected default path we may have to make some changes, but we can do that later once we take on the deployment question more specifically. At that point we'll have a better understanding of what's better.
…er stream subscribers
- getDogtelExtensionConfig now returns an error when multiple dogtel* extension entries are found instead of silently picking one
- stopTaggerServer replaces unbounded GracefulStop() with a 5-second timeout that falls back to Stop(), preventing long-lived TaggerStreamEntities subscribers from blocking otel-agent termination
- Add unit tests for both changes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90662a7436
…ders list
Setting metadata_interval in the dogtel extension config was replacing
metadata_providers wholesale with a single {name: host} entry, silently
dropping any other providers (e.g. "resources") configured in datadog.yaml.
Read the existing providers first, update the host entry in place (or
append it if absent), then write back the merged list. Handle both
map[string]interface{} and the map[interface{}]interface{} type that YAML
v2 produces for maps inside sequences.
Add a regression test that pre-seeds a "resources" provider in datadog.yaml
and asserts it survives alongside the updated host interval.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
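The merge described in this commit can be sketched as follows: walk the existing providers, update the host entry in place if it exists, and append it otherwise, while accepting both map shapes. This is a minimal illustration, not the PR's implementation; `setHostInterval` is a hypothetical name, and the `interval` key is assumed to be the per-provider setting alongside `name`.

```go
package main

import "fmt"

// setHostInterval updates the "host" provider's interval in place, or
// appends a host entry when none exists. Other providers are left
// untouched. Entries may be map[string]interface{} or the
// map[interface{}]interface{} shape that YAML v2 produces.
func setHostInterval(providers []interface{}, interval int) []interface{} {
	for _, p := range providers {
		switch m := p.(type) {
		case map[string]interface{}:
			if m["name"] == "host" {
				m["interval"] = interval
				return providers
			}
		case map[interface{}]interface{}:
			if m["name"] == "host" {
				m["interval"] = interval
				return providers
			}
		}
	}
	return append(providers, map[string]interface{}{
		"name":     "host",
		"interval": interval,
	})
}

func main() {
	// A pre-existing "resources" provider, in the YAML v2 map shape.
	providers := []interface{}{
		map[interface{}]interface{}{"name": "resources", "interval": 300},
	}
	merged := setHostInterval(providers, 1800)
	for _, p := range merged {
		fmt.Println(p)
	}
}
```

Replacing the whole list with a single host entry, as the earlier code did, would drop the resources provider; merging keeps it alongside the updated host interval.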
Add TestFxRun_NoDatadogExporter_Standalone and its config fixture to cover the case where the otel-agent runs in standalone mode with no datadog exporter in the pipeline.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
gabedos left a comment
otel workloadmeta catalog lgtm!
…ta catalog
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… files
Packages with local BUILD.bazel files that load @rules_go (comp/core/log/def, comp/core/ipc/def) cannot be referenced via @com_github_datadog_... external repo labels because @rules_go is not visible in the external module context. Use local //comp/... paths instead.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consistent with ddflareextension and ddprofilingextension, exclude the dogtelextension fx and impl directories from gazelle. The impl/BUILD.bazel uses local //comp/core/... paths for sub-modules that have their own BUILD.bazel files, which gazelle would incorrectly revert to @com_github_... external refs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… dep chain
The gazelle-generated BUILD.bazel files for dogtelextension impl, fx, metrics, and metadata reference @com_github_datadog_datadog_agent_pkg_metrics and related external deps. These chain through pkg/util/buf which has a broken external BUILD.bazel (loads @rules_go unavailable in external context). Following the pattern of ddflareextension and ddprofilingextension, only def/BUILD.bazel is retained. The impl subtree cannot be built via Bazel because its transitive deps (pkg/metrics, pkg/serializer, pkg/util/grpc) lack local BUILD.bazel files and their external dep chains are broken.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
What does this PR do?
Adds standalone mode support to the otel-agent (DD_OTEL_STANDALONE=true) and introduces the dogtelextension OTel Collector extension, which provides Datadog Agent functionality in that mode.
Key changes:
Motivation
Standalone Dogtel Agent
Describe how you validated your changes
Additional Notes
Deployments using infraattributes in standalone mode require: