chore(agent-data-plane): wait for internal supervisor readiness before spawning topology#1661
Conversation
Binary Size Analysis (Agent Data Plane)Target: 5ff8571 (baseline) vs 4ebaf64 (comparison) diff
|
| Module | File Size | Symbols |
|---|---|---|
figment |
+35.37 KiB | 246 |
anyhow |
+30.34 KiB | 538 |
alloc |
-29.04 KiB | 827 |
serde_with |
-29.00 KiB | 30 |
saluki_components::sources::otlp |
-21.99 KiB | 41 |
datadog_agent_commons::ipc::client |
+20.71 KiB | 10 |
prost |
+18.89 KiB | 300 |
&mut serde_json |
+17.83 KiB | 40 |
std |
+17.18 KiB | 165 |
saluki_components::transforms::dogstatsd_mapper |
+14.55 KiB | 10 |
agent_data_plane::cli::run |
+14.25 KiB | 14 |
hashbrown |
-14.03 KiB | 150 |
backon |
-12.63 KiB | 1 |
core |
+12.60 KiB | 2656 |
serde_json |
-10.80 KiB | 91 |
axum |
+8.10 KiB | 171 |
[sections] |
-8.09 KiB | 7 |
saluki_context::resolver::ContextResolver |
+6.93 KiB | 5 |
agent_data_plane::state::metrics |
+6.68 KiB | 9 |
saluki_core::state::reflector |
-6.59 KiB | 3 |
Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
+0.3% +48.8Ki +0.3% +32.6Ki [14164 Others]
+99% +30.0Ki +99% +30.0Ki agent_data_plane::internal::env::workload::RemoteAgentWorkloadProvider::from_configuration::_{{closure}}::h338fcc3bf4f70534
[NEW] +21.4Ki [NEW] +21.2Ki datadog_agent_commons::ipc::client::RemoteAgentClient::from_configuration::_{{closure}}::_{{closure}}::_{{closure}}::ha7761343dccf35ad
[NEW] +18.6Ki [NEW] +18.5Ki agent_data_plane::internal::env::workload::build_collector::_{{closure}}::h1787f21bc6b7d8ce
+13e3% +17.5Ki +36e3% +17.5Ki core::ops::function::FnOnce::call_once::ha2970794d93b769a
[NEW] +17.1Ki [NEW] +17.0Ki h2::server::Connection<T,B>::poll_closed::h2ad023c27f33b38f
[NEW] +15.5Ki [NEW] +15.4Ki h2::server::Connection<T,B>::poll_closed::h233c464c83a997bc
[NEW] +15.4Ki [NEW] +15.3Ki h2::server::Connection<T,B>::poll_closed::h5aa707cfd5fd0774
+241% +13.5Ki +245% +13.5Ki h2::proto::connection::DynConnection<B>::recv_frame::h191b5f6b057766b4
+9.3% +13.2Ki +9.3% +13.2Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::hd83fd04a3d04258b
[NEW] +11.7Ki [NEW] +11.6Ki _<figment::value::magic::RelativePathBuf as figment::value::magic::Magic>::deserialize_from::h726ad0e065537109
[NEW] +10.9Ki [NEW] +10.8Ki _<figment::value::magic::Tagged<T> as figment::value::magic::Magic>::deserialize_from::hd8381001f11e5792
+480% +10.3Ki +524% +10.3Ki _<saluki_components::sources::otlp::logs::translator::OtlpLogsTranslator as core::iter::traits::iterator::Iterator>::next::h985fc6de27821f97
[DEL] -10.7Ki [DEL] -10.5Ki saluki_components::sources::otlp::logs::transform::transform_log_record::h74a5ccad8f7e6e7c
-89.1% -12.6Ki -90.2% -12.6Ki _<backon::retry::Retry<B,T,E,Fut,FutureFn,SF$C&C$NF,AF> as core::future::future::Future>::poll::h6162df3e59c8a3da
[DEL] -14.1Ki [DEL] -14.0Ki _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::ha36a9ea645ba40a7
[DEL] -15.4Ki [DEL] -15.1Ki saluki_components::common::datadog::apm::_::_<impl serde_core::de::Deserialize for saluki_components::common::datadog::apm::ApmConfiguration>::deserialize::h1c89555ab26977cd
[DEL] -17.3Ki [DEL] -17.2Ki saluki_components::sources::otlp::metrics::runtime_metrics::RUNTIME_METRICS_MAPPINGS::_{{closure}}::h49d858bca1863d53
-50.7% -19.3Ki -50.8% -19.3Ki agent_data_plane::internal::env::ADPEnvironmentProvider::from_configuration::_{{closure}}::h1b11cb271cd02f33
[DEL] -21.1Ki [DEL] -20.9Ki _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::he29ece4feb8c4fe7
[DEL] -38.1Ki [DEL] -38.0Ki agent_data_plane::internal::env::workload::build_collector::_{{closure}}::h2305b141e08273fd
+0.2% +95.2Ki +0.3% +79.0Ki TOTAL
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6064ab45d0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| emit_startup_metrics(); | ||
|
|
||
| info!("Waiting for topology to become healthy..."); | ||
| health_registry.all_ready().await; |
There was a problem hiding this comment.
Race readiness waits against startup failures
When a topology component fails during startup before calling mark_ready (for example, the OTLP source can return an error from server_builder.build(...).await? before marking itself ready), this await never completes: the component task has exited and dropped its health handle, while HealthRegistry::all_ready only wakes on readiness transitions. The running_topology.wait_for_unexpected_finish() branch that would report the failure is not polled until after this await, so ADP can hang indefinitely during startup instead of shutting down with the component error; this readiness wait should be raced with topology/internal-supervisor failure signals.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR fixes a timing-related issue in ADP startup by making the runtime explicitly wait for all components registered with the health registry to become ready after spawning the internal supervisor (and again after spawning the topology), instead of relying on scheduling order. To support this, HealthRegistry::all_ready is converted from a synchronous boolean-returning poll into an async method that uses a Notify to be woken whenever any component transitions to ready.
Changes:
- Convert
HealthRegistry::all_readyto an async, notification-driven wait, with eachHealthhandle waking waiters when it marks itself ready. - Update ADP
runto spawn the internal supervisor, awaitall_readyon the registry, then build/spawn the topology and awaitall_readyagain. - Update the existing readiness unit test to use
tokio_test::spawnand assert pending/ready transitions through the new async API.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| lib/saluki-core/src/health/mod.rs | Adds a shared Notify to RegistryState/Health, makes all_ready async with a notify-then-check loop, and updates the readiness test. |
| bin/agent-data-plane/src/cli/run.rs | Restructures startup so the internal supervisor is spawned first, awaited via health_registry.all_ready(), then the topology is built/spawned and awaited again before emitting startup metrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| loop { | ||
| // Register as a waiter _before_ checking to avoid missing notifications during the check. | ||
| let notified = readiness_notify.notified(); | ||
|
|
||
| for component in &inner.component_state { | ||
| if !component.is_ready() { | ||
| return false; | ||
| if self.check_all_ready() { | ||
| return; | ||
| } | ||
|
|
||
| notified.await; | ||
| } |
There was a problem hiding this comment.
Incorrect.
The documentation for Notify::notified explicitly states:
The
Notifiedfuture is guaranteed to receive wakeups fromnotify_waiters()as soon as it has been created, even if it has not yet been polled.
Regression Detector (Agent Data Plane)Run ID: Optimization Goals: ✅ No significant changes detectedFine details of change detection per experiment (35)Experiments configured
Bounds Checks: ✅ Passed (5)
ExplanationA change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression ( |
| @@ -291,13 +278,15 @@ pub async fn handle_run_command( | |||
| } | |||
| } | |||
There was a problem hiding this comment.
This is valid.
I swapped back to spawning the little helper task that waits for everything to be ready before logging that the topology is healthy. Kind of a bummer but oh well.
19863c8 to
1114211
Compare
1114211 to
e4cba5e
Compare
e4cba5e to
f051d8f
Compare
…e spawning topology
…locks when initialization fails
f051d8f to
4ebaf64
Compare
…e spawning topology (#1661) ## Summary This PR updates ADP to wait for all registered components in the health registry to become ready after spawning the internal supervisor _before_ spawning the primary topology. In #1636, the switch over for asynchronous tasks spawned by the environment provider to be supervisor-based led to a small issue where we spawned the topology before the actual workers in the supervisor were spawned, leading to metrics that weren't enriched properly. We fixed that by manually spawning the internal supervisor first, which worked... but that's not a permanent fix, and still has timing-related risks. This PR explicitly fixes the issue by actually waiting until all registered components (and most background-y tasks in the environment provider do register themselves in the health registry) are ready after spawning the internal supervisor, and only after that occurs do we proceed with spawning the primary topology. As part of this, we update `HealthRegistry::all_ready` to be async so that we can properly notify as soon as we detect all registered components are ready without having to do naive time-based polling. ## Change Type - [ ] Bug fix - [ ] New feature - [x] Non-functional (chore, refactoring, docs) - [ ] Performanc ## How did you test this PR? - [x] Updated the existing `readiness` unit test to handle the changes to `all_ready`. - [x] Existing unit, integration, and correctness tests all passing. ## References DADP-2 Co-authored-by: toby.lawrence <toby.lawrence@datadoghq.com> 55119e3

Summary
This PR updates ADP to wait for all registered components in the health registry to become ready after spawning the internal supervisor before spawning the primary topology.
In #1636, the switch over for asynchronous tasks spawned by the environment provider to be supervisor-based led to a small issue where we spawned the topology before the actual workers in the supervisor were spawned, leading to metrics that weren't enriched properly. We fixed that by manually spawning the internal supervisor first, which worked... but that's not a permanent fix, and still has timing-related risks.
This PR explicitly fixes the issue by actually waiting until all registered components (and most background-y tasks in the environment provider do register themselves in the health registry) are ready after spawning the internal supervisor, and only after that occurs do we proceed with spawning the primary topology.
As part of this, we update
HealthRegistry::all_readyto be async so that we can properly notify as soon as we detect all registered components are ready without having to do naive time-based polling.Change Type
How did you test this PR?
readinessunit test to handle the changes toall_ready.References
DADP-2