chore(env): increase gRPC max message size for Agent IPC #1584
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2053f19940
/// Maximum message size for gRPC messages.
///
/// Defaults to `128 * 1024 * 1024` (4MB).
Correct documented default gRPC size
The rustdoc for `grpc_max_message_size` says the default is `128 * 1024 * 1024` (4MB), but that expression is actually 128MB. This mismatch will mislead operators during tuning/debugging (especially when investigating message-size `ResourceExhausted` failures) because the documented default does not match runtime behavior.
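To make the arithmetic behind this comment concrete, here is a minimal sketch that evaluates the quoted expression and renders it in mebibytes. The names `MIB`, `DOCUMENTED_EXPR`, and `as_mib` are illustrative and do not come from the PR.

```rust
/// Bytes in one mebibyte.
const MIB: usize = 1024 * 1024;

/// The expression quoted in the rustdoc under review.
/// (Name is illustrative; the PR does not show the actual constant.)
const DOCUMENTED_EXPR: usize = 128 * 1024 * 1024;

/// Formats a byte count as whole mebibytes, e.g. for a doc comment or log line.
fn as_mib(bytes: usize) -> String {
    format!("{} MiB", bytes / MIB)
}
```

Here `as_mib(DOCUMENTED_EXPR)` yields `"128 MiB"` (134217728 bytes), whereas 4 MiB would be `4 * MIB`, or 4194304 bytes.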
Regression Detector (Agent Data Plane)
Regression Detector Results
Run ID: 863b0a20-ab7b-4cdd-abe9-815978e0f267
Baseline: b28ad98
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ❌ | otlp_ingest_logs_5mb_memory | memory utilization | +7.96 | [+7.54, +8.39] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | +0.02 | [-0.10, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | -1.10 | [-6.08, +3.88] | 1 | (metrics) (profiles) (logs) |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | dsd_uds_1mb_3k_contexts_cpu | % cpu utilization | +8.01 | [-44.23, +60.24] | 1 | (metrics) (profiles) (logs) |
| ❌ | otlp_ingest_logs_5mb_memory | memory utilization | +7.96 | [+7.54, +8.39] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_cpu | % cpu utilization | +5.46 | [-50.98, +61.90] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_cpu | % cpu utilization | +3.09 | [-2.84, +9.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_cpu | % cpu utilization | +1.90 | [-0.06, +3.86] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_cpu | % cpu utilization | +1.23 | [-1.06, +3.52] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_memory | memory utilization | +0.91 | [+0.72, +1.10] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_memory | memory utilization | +0.42 | [+0.27, +0.57] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_memory | memory utilization | +0.39 | [+0.23, +0.55] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_idle | memory utilization | +0.35 | [+0.31, +0.40] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_memory | memory utilization | +0.32 | [+0.16, +0.48] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_medium | memory utilization | +0.31 | [+0.14, +0.48] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_cpu | % cpu utilization | +0.25 | [-1.13, +1.63] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_cpu | % cpu utilization | +0.19 | [-1.81, +2.19] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_memory | memory utilization | +0.17 | [+0.01, +0.33] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_throughput | ingress throughput | +0.16 | [+0.08, +0.25] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_throughput | ingress throughput | +0.16 | [+0.08, +0.25] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_memory | memory utilization | +0.15 | [+0.00, +0.31] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_ultraheavy | memory utilization | +0.04 | [-0.09, +0.17] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | +0.02 | [-0.10, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_heavy | memory utilization | +0.02 | [-0.10, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.05, +0.06] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.04, +0.04] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.06, +0.05] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_throughput | ingress throughput | -0.01 | [-0.16, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_throughput | ingress throughput | -0.02 | [-0.19, +0.16] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_memory | memory utilization | -0.03 | [-0.20, +0.13] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_cpu | % cpu utilization | -0.05 | [-29.67, +29.57] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_memory | memory utilization | -0.08 | [-0.31, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_throughput | ingress throughput | -0.09 | [-0.17, -0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_memory | memory utilization | -0.25 | [-0.40, -0.11] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_low | memory utilization | -0.26 | [-0.42, -0.09] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_cpu | % cpu utilization | -0.76 | [-6.17, +4.65] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_throughput | ingress throughput | -0.87 | [-1.00, -0.74] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | -1.10 | [-6.08, +3.88] | 1 | (metrics) (profiles) (logs) |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | quality_gates_rss_dsd_heavy | memory_usage | 10/10 | 123.73MiB ≤ 140MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_low | memory_usage | 10/10 | 40.36MiB ≤ 50MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_medium | memory_usage | 10/10 | 63.04MiB ≤ 75MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | 174.30MiB ≤ 200MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_idle | memory_usage | 10/10 | 27.62MiB ≤ 40MiB | (metrics) (profiles) (logs) |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we flag a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
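The three criteria above can be sketched as a small predicate. This is an illustration only, not the detector's actual implementation: the `Experiment` struct and its field names are assumptions, and the `erratic` flag has to be supplied by the caller because the tables do not show it. The sketch also ignores the direction of the change, since the stated criteria do not mention it.

```rust
/// One row of a regression-detector table.
/// (Hypothetical type; field names are illustrative.)
struct Experiment {
    /// Estimated change in the mean, in percent ("Δ mean %").
    delta_mean_pct: f64,
    /// Lower bound of the 90% confidence interval ("Δ mean % CI").
    ci_low: f64,
    /// Upper bound of the 90% confidence interval.
    ci_high: f64,
    /// Whether the experiment's configuration marks it "erratic".
    erratic: bool,
}

/// Effect size tolerance stated in the report: |Δ mean %| ≥ 5.00%.
const EFFECT_SIZE_TOLERANCE_PCT: f64 = 5.0;

/// Returns true only when all three stated criteria hold.
fn is_regression(e: &Experiment) -> bool {
    let big_enough = e.delta_mean_pct.abs() >= EFFECT_SIZE_TOLERANCE_PCT;
    // The interval excludes zero when it lies entirely on one side of it.
    let ci_excludes_zero = e.ci_low > 0.0 || e.ci_high < 0.0;
    big_enough && ci_excludes_zero && !e.erratic
}
```

With the figures from the `dsd_uds_1mb_3k_contexts_cpu` row (+8.01, CI [-44.23, +60.24]), the predicate returns false because the interval contains zero, even though the estimated change exceeds the tolerance.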
Binary Size Analysis (Agent Data Plane)
Target: b28ad98 (baseline) vs d2717b9 (comparison) diff
| Module | File Size | Symbols |
|---|---|---|
| agent_data_plane::cli::run | +11.97 KiB | 6 |
| saluki_env::workload::providers | -9.71 KiB | 2 |
| tokio | -5.55 KiB | 139 |
| anon.7e5448950ada2a8ae92489f25c0da915.4.llvm.11965434555244657092 | +3.74 KiB | 1 |
| anon.5e5ca91f3e19cf084fe1c982a158e710.665.llvm.15530999012673535213 | -3.74 KiB | 1 |
| alloc | +3.04 KiB | 37 |
| prost | -2.71 KiB | 23 |
| core | +2.71 KiB | 882 |
| tower_layer | +2.06 KiB | 4 |
| tonic_prost | -1.92 KiB | 7 |
| hashbrown | +1.79 KiB | 21 |
| saluki_core::topology::built | +1.68 KiB | 2 |
| anon.e9981c590893643d4d7aef53edaf929e.322.llvm.6077253491057944362 | -1.57 KiB | 1 |
| anon.edc88c1f77f7fc54f3491f6dbe3b9816.26.llvm.8772311085534667432 | +1.57 KiB | 1 |
| anon.2bef91b57997dcd336602b8dc2fe8990.2.llvm.18408503694074454322 | +1.49 KiB | 1 |
| anon.b99883931ef8526ea25dab4689b7c8e4.196.llvm.3315547351039564088 | -1.49 KiB | 1 |
| agent_data_plane::internal::control_plane | -1.44 KiB | 3 |
| hyper_timeout | -1.28 KiB | 4 |
| agent_data_plane::cli::debug | -1.26 KiB | 8 |
| anon.5e5ca91f3e19cf084fe1c982a158e710.755.llvm.15530999012673535213 | -1.22 KiB | 1 |
Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
+102% +19.5Ki +103% +19.5Ki saluki_env::workload::providers::remote_agent::build_collector::_{{closure}}::hb2620575c393f4d8
+16e2% +13.5Ki +18e2% +13.5Ki agent_data_plane::state::metrics::rules::get_datadog_agent_remappings::hc50d7070764a62fc
+7.5% +11.9Ki +7.5% +11.9Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::hc4b614f155e62c55
[NEW] +4.96Ki [NEW] +458 core::ptr::drop_in_place<core::iter::adapters::map::Map<std::collections::hash::map::IntoIter<axum::routing::RouteId,axum::routing::Endpoint<saluki_env::workload::providers::remote_agent::api::RemoteAgentWorkloadState>>,axum::routing::path_router::PathRouter<saluki_env::workload::providers::remote_agent::api::RemoteAgentWorkloadState,_>::with_state<$LP$$RP$>::{{closure}}>>::hc7f5dc116f7e759c
[NEW] +4.20Ki [NEW] +48 core::ptr::drop_in_place<tower::util::map_future::MapFuture<axum::util::MapIntoResponse<tower::util::map_request::MapRequest<datadog_protos::agent_include::datadog::remoteagent::telemetry::v1::telemetry_provider_server::TelemetryProviderServer<agent_data_plane::internal::remote_agent::RemoteAgentImpl>,tonic::service::router::Routes::add_service<datadog_protos::agent_include::datadog::remoteagent::telemetry::v1::telemetry_provider_server::TelemetryProviderServer<agent_data_plane::internal::remote_agent::RemoteAgentImpl>>::{{closure}}>>,tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>,http::response::Response<axum_core::body::Body>,core::convert::Infallible>::new<axum::util::MapIntoResponse<tower::util::map_request::MapRequest<datadog_protos::agent_include::datadog::remoteagent::telemetry::v1::telemetry_provider_server::TelemetryProviderServer<agent_data_plane::internal::rem
[NEW] +4.15Ki [NEW] +3.92Ki _<futures_util::future::try_future::try_flatten::TryFlatten<Fut,<Fut as futures_core::future::TryFuture>::Ok> as core::future::future::Future>::poll::h7b344b33cc965797
[NEW] +4.11Ki [NEW] +3.89Ki serde_json::value::de::_<impl serde_core::de::Deserializer for serde_json::map::Map<alloc::string::String,serde_json::value::Value>>::deserialize_any::h2c18528b921a63c5
[NEW] +3.74Ki [NEW] +16 anon.7e5448950ada2a8ae92489f25c0da915.4.llvm.11965434555244657092
+285% +3.25Ki +312% +3.25Ki h2::proto::streams::streams::StreamRef<B>::send_data::h82ffb912b4d59cda
+105% +3.04Ki +110% +3.04Ki _<&str as tonic::metadata::map::into_metadata_key::Sealed<VE>>::append::hd82dd7f9ab5bc5f7
[DEL] -2.78Ki [DEL] -2.68Ki h2::proto::streams::prioritize::Prioritize::send_data::hc96de8ca38fb53d2
[DEL] -3.16Ki [DEL] -3.04Ki agent_data_plane::state::metrics::rules::aggregation::get_aggregation_remappings::h6285a2d5b15bd140
[DEL] -3.31Ki [DEL] -3.21Ki tonic::metadata::key::MetadataKey<VE>::from_static::hfb9aae86df16c7c8
[DEL] -3.74Ki [DEL] -16 anon.5e5ca91f3e19cf084fe1c982a158e710.665.llvm.15530999012673535213
[DEL] -4.08Ki [DEL] -3.92Ki _<futures_util::future::try_future::AndThen<Fut1,Fut2,F> as core::future::future::Future>::poll::h2805e5709a3c3a0c
-75.2% -4.19Ki -50.0% -48 core::ptr::drop_in_place<tower::util::map_future::MapFuture<axum::util::MapIntoResponse<tower::util::map_request::MapRequest<datadog_protos::agent_include::datadog::remoteagent::flare::v1::flare_provider_server::FlareProviderServer<agent_data_plane::internal::remote_agent::RemoteAgentImpl>,tonic::service::router::Routes::add_service<datadog_protos::agent_include::datadog::remoteagent::flare::v1::flare_provider_server::FlareProviderServer<agent_data_plane::internal::remote_agent::RemoteAgentImpl>>::{{closure}}>>,tower::util::boxed_clone_sync::BoxCloneSyncService<http::request::Request<axum_core::body::Body>,http::response::Response<axum_core::body::Body>,core::convert::Infallible>::new<axum::util::MapIntoResponse<tower::util::map_request::MapRequest<datadog_protos::agent_include::datadog::remoteagent::flare::v1::flare_provider_server::FlareProviderServer<agent_data_plane::internal::remote_agent::RemoteAgentImpl>,ton
[DEL] -4.95Ki [DEL] -458 core::ptr::drop_in_place<core::iter::adapters::map::Map<std::collections::hash::map::IntoIter<axum::routing::RouteId,axum::routing::Endpoint<saluki_app::logging::api::LoggingHandlerState>>,axum::routing::path_router::PathRouter<saluki_app::logging::api::LoggingHandlerState,_>::with_state<$LP$$RP$>::{{closure}}>>::h8000805f88b7f418
[DEL] -5.13Ki [DEL] -4.96Ki serde_json::value::de::_<impl serde_core::de::Deserializer for serde_json::value::Value>::deserialize_struct::h3bdc9c7ff85065c4
[DEL] -7.93Ki [DEL] -7.81Ki agent_data_plane::state::metrics::rules::dogstatsd::get_dogstatsd_remappings::h70b33f0eaff6923b
-0.2% -10.0Ki -0.2% -7.39Ki [3881 Others]
-50.9% -29.2Ki -51.0% -29.2Ki saluki_env::workload::providers::remote_agent::RemoteAgentWorkloadProvider::from_configuration::_{{closure}}::h9b1aa6330ab3f6f1
-0.0% -6.12Ki -0.0% -3.17Ki TOTAL
## Summary

As stated in the PR title.

This matches the recent change in the Agent itself ([#50202](DataDog/datadog-agent#50202)) to increase the gRPC max message size to ensure we can send large configuration snapshots without issue. We've added this setting to `RemoteAgentClientConfiguration` with a default that matches the Agent, and a `rename` that lines up with the Agent configuration setting that controls this on the Agent side.

A note: we've added a key alias for this to map the nested path to the flattened path, which means we have a change in `saluki_components`... admittedly, this is weird. I have a thought about trying to reorganize some of these very Agent-specific bits of code that live in different `saluki-*` crates into a single `datadog-agent-common` crate or something, but I'm not quite there yet.

## Change Type

- [x] Bug fix
- [ ] New feature
- [ ] Non-functional (chore, refactoring, docs)
- [ ] Performance

## How did you test this PR?

Existing tests.

## References

DADP-15

Co-authored-by: toby.lawrence <toby.lawrence@datadoghq.com>

18cfa96
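For readers unfamiliar with the pattern the Summary describes (a setting with a default, read under a flattened key with a nested-path alias), here is a rough standard-library sketch. It is not the code from this PR: the key names, the constant name, and `resolve_max_message_size` are all hypothetical, and the actual change presumably goes through the project's own configuration and `rename` machinery rather than a plain map.

```rust
use std::collections::HashMap;

/// Assumed default of 128 MiB, taken from the `128 * 1024 * 1024` expression
/// quoted in the review. (Constant name is illustrative.)
const DEFAULT_GRPC_MAX_MESSAGE_SIZE: usize = 128 * 1024 * 1024;

// Hypothetical key names for illustration only; the real Agent setting name
// is not given in this PR text.
const FLATTENED_KEY: &str = "example_ipc_grpc_max_message_size";
const NESTED_KEY: &str = "example_ipc.grpc_max_message_size";

/// Looks up the setting under the flattened key first, then under the nested
/// alias, and falls back to the default when neither key is present or the
/// value found does not parse as an unsigned integer.
fn resolve_max_message_size(raw: &HashMap<String, String>) -> usize {
    raw.get(FLATTENED_KEY)
        .or_else(|| raw.get(NESTED_KEY))
        .and_then(|value| value.parse::<usize>().ok())
        .unwrap_or(DEFAULT_GRPC_MAX_MESSAGE_SIZE)
}
```

An empty map resolves to the 128 MiB default. A map containing only the nested key resolves to that key's value. When both keys are present, the flattened key wins.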