chore(agent-data-plane): wait for internal supervisor readiness before spawning topology by tobz · Pull Request #1661 · DataDog/saluki

tobz · 2026-05-15T14:29:36Z

Summary

This PR updates ADP to wait for all registered components in the health registry to become ready after spawning the internal supervisor before spawning the primary topology.

In #1636, the switch over for asynchronous tasks spawned by the environment provider to be supervisor-based led to a small issue where we spawned the topology before the actual workers in the supervisor were spawned, leading to metrics that weren't enriched properly. We fixed that by manually spawning the internal supervisor first, which worked... but that's not a permanent fix, and still has timing-related risks.

This PR explicitly fixes the issue by actually waiting until all registered components (and most background-y tasks in the environment provider do register themselves in the health registry) are ready after spawning the internal supervisor, and only after that occurs do we proceed with spawning the primary topology.

As part of this, we update HealthRegistry::all_ready to be async so that we can properly notify as soon as we detect all registered components are ready without having to do naive time-based polling.

Change Type

Bug fix
New feature
Non-functional (chore, refactoring, docs)
Performanc

How did you test this PR?

Updated the existing readiness unit test to handle the changes to all_ready.
Existing unit, integration, and correctness tests all passing.

References

DADP-2

tobz · 2026-05-15T14:29:51Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

pr-commenter · 2026-05-15T14:37:20Z

Binary Size Analysis (Agent Data Plane)

Target: 5ff8571 (baseline) vs 4ebaf64 (comparison) diff
Analysis Type: Stripped binaries (debug symbols excluded)
Baseline Size: 37.39 MiB
Comparison Size: 37.49 MiB
Size Change: +95.18 KiB (+0.25%)
Pass/Fail Threshold: +5%
Result: PASSED ✅

Changes by Module

Module	File Size	Symbols
`figment`	+35.37 KiB	246
`anyhow`	+30.34 KiB	538
`alloc`	-29.04 KiB	827
`serde_with`	-29.00 KiB	30
`saluki_components::sources::otlp`	-21.99 KiB	41
`datadog_agent_commons::ipc::client`	+20.71 KiB	10
`prost`	+18.89 KiB	300
`&mut serde_json`	+17.83 KiB	40
`std`	+17.18 KiB	165
`saluki_components::transforms::dogstatsd_mapper`	+14.55 KiB	10
`agent_data_plane::cli::run`	+14.25 KiB	14
`hashbrown`	-14.03 KiB	150
`backon`	-12.63 KiB	1
`core`	+12.60 KiB	2656
`serde_json`	-10.80 KiB	91
`axum`	+8.10 KiB	171
`[sections]`	-8.09 KiB	7
`saluki_context::resolver::ContextResolver`	+6.93 KiB	5
`agent_data_plane::state::metrics`	+6.68 KiB	9
`saluki_core::state::reflector`	-6.59 KiB	3

Detailed Symbol Changes

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +0.3% +48.8Ki  +0.3% +32.6Ki    [14164 Others]
   +99% +30.0Ki   +99% +30.0Ki    agent_data_plane::internal::env::workload::RemoteAgentWorkloadProvider::from_configuration::_{{closure}}::h338fcc3bf4f70534
  [NEW] +21.4Ki  [NEW] +21.2Ki    datadog_agent_commons::ipc::client::RemoteAgentClient::from_configuration::_{{closure}}::_{{closure}}::_{{closure}}::ha7761343dccf35ad
  [NEW] +18.6Ki  [NEW] +18.5Ki    agent_data_plane::internal::env::workload::build_collector::_{{closure}}::h1787f21bc6b7d8ce
 +13e3% +17.5Ki +36e3% +17.5Ki    core::ops::function::FnOnce::call_once::ha2970794d93b769a
  [NEW] +17.1Ki  [NEW] +17.0Ki    h2::server::Connection<T,B>::poll_closed::h2ad023c27f33b38f
  [NEW] +15.5Ki  [NEW] +15.4Ki    h2::server::Connection<T,B>::poll_closed::h233c464c83a997bc
  [NEW] +15.4Ki  [NEW] +15.3Ki    h2::server::Connection<T,B>::poll_closed::h5aa707cfd5fd0774
  +241% +13.5Ki  +245% +13.5Ki    h2::proto::connection::DynConnection<B>::recv_frame::h191b5f6b057766b4
  +9.3% +13.2Ki  +9.3% +13.2Ki    agent_data_plane::cli::run::handle_run_command::_{{closure}}::hd83fd04a3d04258b
  [NEW] +11.7Ki  [NEW] +11.6Ki    _<figment::value::magic::RelativePathBuf as figment::value::magic::Magic>::deserialize_from::h726ad0e065537109
  [NEW] +10.9Ki  [NEW] +10.8Ki    _<figment::value::magic::Tagged<T> as figment::value::magic::Magic>::deserialize_from::hd8381001f11e5792
  +480% +10.3Ki  +524% +10.3Ki    _<saluki_components::sources::otlp::logs::translator::OtlpLogsTranslator as core::iter::traits::iterator::Iterator>::next::h985fc6de27821f97
  [DEL] -10.7Ki  [DEL] -10.5Ki    saluki_components::sources::otlp::logs::transform::transform_log_record::h74a5ccad8f7e6e7c
 -89.1% -12.6Ki -90.2% -12.6Ki    _<backon::retry::Retry<B,T,E,Fut,FutureFn,SF$C&C$NF,AF> as core::future::future::Future>::poll::h6162df3e59c8a3da
  [DEL] -14.1Ki  [DEL] -14.0Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::ha36a9ea645ba40a7
  [DEL] -15.4Ki  [DEL] -15.1Ki    saluki_components::common::datadog::apm::_::_<impl serde_core::de::Deserialize for saluki_components::common::datadog::apm::ApmConfiguration>::deserialize::h1c89555ab26977cd
  [DEL] -17.3Ki  [DEL] -17.2Ki    saluki_components::sources::otlp::metrics::runtime_metrics::RUNTIME_METRICS_MAPPINGS::_{{closure}}::h49d858bca1863d53
 -50.7% -19.3Ki -50.8% -19.3Ki    agent_data_plane::internal::env::ADPEnvironmentProvider::from_configuration::_{{closure}}::h1b11cb271cd02f33
  [DEL] -21.1Ki  [DEL] -20.9Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::he29ece4feb8c4fe7
  [DEL] -38.1Ki  [DEL] -38.0Ki    agent_data_plane::internal::env::workload::build_collector::_{{closure}}::h2305b141e08273fd
  +0.2% +95.2Ki  +0.3% +79.0Ki    TOTAL

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6064ab45d0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-15T14:40:40Z

-    emit_startup_metrics();
-
+    info!("Waiting for topology to become healthy...");
+    health_registry.all_ready().await;


Race readiness waits against startup failures

When a topology component fails during startup before calling mark_ready (for example, the OTLP source can return an error from server_builder.build(...).await? before marking itself ready), this await never completes: the component task has exited and dropped its health handle, while HealthRegistry::all_ready only wakes on readiness transitions. The running_topology.wait_for_unexpected_finish() branch that would report the failure is not polled until after this await, so ADP can hang indefinitely during startup instead of shutting down with the component error; this readiness wait should be raced with topology/internal-supervisor failure signals.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

This PR fixes a timing-related issue in ADP startup by making the runtime explicitly wait for all components registered with the health registry to become ready after spawning the internal supervisor (and again after spawning the topology), instead of relying on scheduling order. To support this, HealthRegistry::all_ready is converted from a synchronous boolean-returning poll into an async method that uses a Notify to be woken whenever any component transitions to ready.

Changes:

Convert HealthRegistry::all_ready to an async, notification-driven wait, with each Health handle waking waiters when it marks itself ready.
Update ADP run to spawn the internal supervisor, await all_ready on the registry, then build/spawn the topology and await all_ready again.
Update the existing readiness unit test to use tokio_test::spawn and assert pending/ready transitions through the new async API.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
lib/saluki-core/src/health/mod.rs	Adds a shared `Notify` to `RegistryState`/`Health`, makes `all_ready` async with a notify-then-check loop, and updates the readiness test.
bin/agent-data-plane/src/cli/run.rs	Restructures startup so the internal supervisor is spawned first, awaited via `health_registry.all_ready()`, then the topology is built/spawned and awaited again before emitting startup metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tobz · 2026-05-15T16:04:28Z

+        loop {
+            // Register as a waiter _before_ checking to avoid missing notifications during the check.
+            let notified = readiness_notify.notified();

-        for component in &inner.component_state {
-            if !component.is_ready() {
-                return false;
+            if self.check_all_ready() {
+                return;
            }
+
+            notified.await;
        }


Incorrect.

The documentation for Notify::notified explicitly states:

The Notified future is guaranteed to receive wakeups from notify_waiters() as soon as it has been created, even if it has not yet been polled.

pr-commenter · 2026-05-15T14:53:19Z

Regression Detector (Agent Data Plane)

Run ID: 5ad0f8fc-0239-422a-a63c-41e1500ef96a
Baseline: 5ff8571c · Comparison: 4ebaf64f · Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment (35)

Experiments configured erratic: true are tagged (ignored) and skipped when determining which experiments regressed or improved. Experiments which are detected as erratic at runtime are tagged (erratic) to flag that the run's sample dispersion was high, but their regression / improvement signal still counts.

experiment	goal	Δ mean %	links
dsd_uds_1mb_3k_contexts_cpu (erratic)	cpu	⚪ +18.23	metrics profiles logs
dsd_uds_10mb_3k_contexts_cpu (erratic)	cpu	⚪ +6.72	metrics profiles logs
dsd_uds_512kb_3k_contexts_cpu (erratic)	cpu	⚪ +3.39	metrics profiles logs
dsd_uds_500mb_3k_contexts_throughput	throughput	⚪ -2.03	metrics profiles logs
otlp_ingest_metrics_5mb_cpu (erratic)	cpu	⚪ +1.76	metrics profiles logs
otlp_ingest_traces_5mb_cpu (erratic)	cpu	⚪ +1.16	metrics profiles logs
quality_gates_rss_dsd_medium	memory	⚪ +0.59	metrics profiles logs
quality_gates_rss_dsd_low	memory	⚪ +0.43	metrics profiles logs
otlp_ingest_metrics_5mb_memory	memory	⚪ +0.41	metrics profiles logs
dsd_uds_100mb_3k_contexts_cpu (erratic)	cpu	⚪ +0.37	metrics profiles logs
otlp_ingest_traces_5mb_memory	memory	⚪ +0.31	metrics profiles logs
otlp_ingest_traces_ottl_transform_5mb_memory	memory	⚪ +0.29	metrics profiles logs
otlp_ingest_traces_ottl_transform_5mb_cpu (erratic)	cpu	⚪ +0.25	metrics profiles logs
dsd_uds_100mb_3k_contexts_memory	memory	⚪ +0.13	metrics profiles logs
quality_gates_rss_dsd_heavy	memory	⚪ +0.12	metrics profiles logs
dsd_uds_10mb_3k_contexts_memory	memory	⚪ +0.12	metrics profiles logs
quality_gates_rss_dsd_ultraheavy	memory	⚪ +0.09	metrics profiles logs
otlp_ingest_traces_5mb_throughput	throughput	⚪ -0.08	metrics profiles logs
otlp_ingest_traces_ottl_transform_5mb_throughput	throughput	⚪ -0.03	metrics profiles logs
otlp_ingest_logs_5mb_throughput (ignored)	throughput	⚪ -0.01	metrics profiles logs
otlp_ingest_traces_ottl_filtering_5mb_throughput	throughput	⚪ -0.01	metrics profiles logs
dsd_uds_100mb_3k_contexts_throughput	throughput	⚪ -0.00	metrics profiles logs
dsd_uds_512kb_3k_contexts_throughput	throughput	⚪ -0.00	metrics profiles logs
dsd_uds_10mb_3k_contexts_throughput	throughput	⚪ +0.00	metrics profiles logs
dsd_uds_1mb_3k_contexts_throughput	throughput	⚪ +0.00	metrics profiles logs
otlp_ingest_metrics_5mb_throughput	throughput	⚪ +0.05	metrics profiles logs
quality_gates_rss_idle	memory	⚪ -0.07	metrics profiles logs
otlp_ingest_traces_ottl_filtering_5mb_memory	memory	⚪ -0.15	metrics profiles logs
dsd_uds_1mb_3k_contexts_memory	memory	⚪ -0.16	metrics profiles logs
otlp_ingest_logs_5mb_cpu (ignored)	cpu	⚪ -0.22	metrics profiles logs
dsd_uds_512kb_3k_contexts_memory	memory	⚪ -0.24	metrics profiles logs
dsd_uds_500mb_3k_contexts_memory	memory	⚪ -0.51	metrics profiles logs
dsd_uds_500mb_3k_contexts_cpu (erratic)	cpu	⚪ -0.71	metrics profiles logs
otlp_ingest_traces_ottl_filtering_5mb_cpu (erratic)	cpu	⚪ -1.64	metrics profiles logs
otlp_ingest_logs_5mb_memory (ignored)	memory	⚪ -5.79	metrics profiles logs

Bounds Checks: ✅ Passed (5)

experiment	check	replicates	observed	links
quality_gates_rss_dsd_heavy	memory_usage	10/10	✅ 122 MiB ≤ 140 MiB	metrics profiles logs
quality_gates_rss_dsd_low	memory_usage	10/10	✅ 39.6 MiB ≤ 50 MiB	metrics profiles logs
quality_gates_rss_dsd_medium	memory_usage	10/10	✅ 59.9 MiB ≤ 75 MiB	metrics profiles logs
quality_gates_rss_dsd_ultraheavy	memory_usage	10/10	✅ 178 MiB ≤ 200 MiB	metrics profiles logs
quality_gates_rss_idle	memory_usage	10/10	✅ 27 MiB ≤ 40 MiB	metrics profiles logs

Explanation

A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression (is_regression: true). Improvements use the matching criteria for the improving direction. Experiments configured erratic: true (tagged (ignored)) are skipped outright; experiments detected as erratic at runtime (tagged (erratic)) still count, since that flag describes sample dispersion rather than directional certainty. The Δ mean % cell is colored accordingly: 🟢 = improvement, 🔴 = regression, ⚪ = neutral. Reduction in CPU or memory is an improvement; reduction in ingress throughput is a regression.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

tobz · 2026-05-15T16:42:10Z

@@ -291,13 +278,15 @@ pub async fn handle_run_command(
        }
    }


This is valid.

I swapped back to spawning the little helper task that waits for everything to be ready before logging that the topology is healthy. Kind of a bummer but oh well.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

…e spawning topology

…locks when initialization fails

…e spawning topology (#1661) ## Summary This PR updates ADP to wait for all registered components in the health registry to become ready after spawning the internal supervisor _before_ spawning the primary topology. In #1636, the switch over for asynchronous tasks spawned by the environment provider to be supervisor-based led to a small issue where we spawned the topology before the actual workers in the supervisor were spawned, leading to metrics that weren't enriched properly. We fixed that by manually spawning the internal supervisor first, which worked... but that's not a permanent fix, and still has timing-related risks. This PR explicitly fixes the issue by actually waiting until all registered components (and most background-y tasks in the environment provider do register themselves in the health registry) are ready after spawning the internal supervisor, and only after that occurs do we proceed with spawning the primary topology. As part of this, we update `HealthRegistry::all_ready` to be async so that we can properly notify as soon as we detect all registered components are ready without having to do naive time-based polling. ## Change Type - [ ] Bug fix - [ ] New feature - [x] Non-functional (chore, refactoring, docs) - [ ] Performanc ## How did you test this PR? - [x] Updated the existing `readiness` unit test to handle the changes to `all_ready`. - [x] Existing unit, integration, and correctness tests all passing. ## References DADP-2 Co-authored-by: toby.lawrence <toby.lawrence@datadoghq.com> 55119e3

dd-octo-sts Bot added the area/core Core functionality, event model, etc. label May 15, 2026

tobz added the type/chore Updates to dependencies or general "administrative" tasks necessary to maintain the codebase/repo. label May 15, 2026

tobz marked this pull request as ready for review May 15, 2026 14:38

Copilot AI review requested due to automatic review settings May 15, 2026 14:38

tobz requested a review from a team as a code owner May 15, 2026 14:38

Copilot started reviewing on behalf of tobz May 15, 2026 14:39 View session

chatgpt-codex-connector Bot reviewed May 15, 2026

View reviewed changes

Copilot AI reviewed May 15, 2026

View reviewed changes

tobz requested a review from Copilot May 15, 2026 16:08

Copilot started reviewing on behalf of tobz May 15, 2026 16:09 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 15, 2026 17:33

tobz force-pushed the tobz/explicit-ready-check-internal-sup-topology branch from 19863c8 to 1114211 Compare May 15, 2026 17:33

Copilot started reviewing on behalf of tobz May 15, 2026 17:34 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

tobz force-pushed the tobz/explicit-ready-check-internal-sup-topology branch from 1114211 to e4cba5e Compare May 15, 2026 17:47

aqian01 approved these changes May 15, 2026

View reviewed changes

Base automatically changed from tobz/env-provider-supervisor-makeover to main May 15, 2026 18:43

tobz force-pushed the tobz/explicit-ready-check-internal-sup-topology branch from e4cba5e to f051d8f Compare May 15, 2026 18:44

tobz added 3 commits May 15, 2026 15:13

chore(agent-data-plane): wait for internal supervisor readiness befor…

5c72e6e

…e spawning topology

race all_ready checks with polling the underlying tasks to avoid dead…

00f1318

…locks when initialization fails

actually do need the separate spawn

4ebaf64

tobz force-pushed the tobz/explicit-ready-check-internal-sup-topology branch from f051d8f to 4ebaf64 Compare May 15, 2026 19:14

gh-worker-dd-devflow-36fce6 Bot added mergequeue-status: waiting mergequeue-status: queued mergequeue-status: in_progress and removed mergequeue-status: waiting mergequeue-status: queued labels May 15, 2026

gh-worker-dd-mergequeue-cf854d Bot merged commit 55119e3 into main May 15, 2026
78 checks passed

gh-worker-dd-devflow-36fce6 Bot added mergequeue-status: done and removed mergequeue-status: in_progress labels May 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(agent-data-plane): wait for internal supervisor readiness before spawning topology#1661

chore(agent-data-plane): wait for internal supervisor readiness before spawning topology#1661
gh-worker-dd-mergequeue-cf854d[bot] merged 3 commits into
mainfrom
tobz/explicit-ready-check-internal-sup-topology

tobz commented May 15, 2026 •

edited

Loading

Uh oh!

tobz commented May 15, 2026 •

edited

Loading

Uh oh!

pr-commenter Bot commented May 15, 2026 •

edited

Loading

Changes by Module

Detailed Symbol Changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

tobz May 15, 2026

Uh oh!

pr-commenter Bot commented May 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

tobz May 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tobz commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change Type

How did you test this PR?

References

Uh oh!

tobz commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pr-commenter Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Binary Size Analysis (Agent Data Plane)

Changes by Module

Detailed Symbol Changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

tobz May 15, 2026

Choose a reason for hiding this comment

Uh oh!

pr-commenter Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Regression Detector (Agent Data Plane)

Optimization Goals: ✅ No significant changes detected

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

tobz May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tobz commented May 15, 2026 •

edited

Loading

tobz commented May 15, 2026 •

edited

Loading

pr-commenter Bot commented May 15, 2026 •

edited

Loading

pr-commenter Bot commented May 15, 2026 •

edited

Loading