enhancement(agent-data-plane): only ship Agent-relevant internal telemetry over RAR #1237
Binary Size Analysis (Agent Data Plane)
Target: 0a6de86 (baseline) vs 29560e0 (comparison) diff

| Module | File Size | Symbols |
|---|---|---|
| agent_data_plane::components::remapper | -43.96 KiB | 21 |
| agent_data_plane::state::metrics | +36.85 KiB | 11 |
| core | +10.17 KiB | 6115 |
| http_body_util | -10.05 KiB | 89 |
| futures_util | +8.53 KiB | 4 |
| agent_data_plane::internal::remote_agent | -4.16 KiB | 64 |
| quick_cache | -3.74 KiB | 18 |
| agent_data_plane::internal::initialize_and_launch_runtime | -3.70 KiB | 3 |
| saluki_core::topology::interconnect | -3.07 KiB | 35 |
| saluki_io::net::client | -2.83 KiB | 1 |
| agent_data_plane::cli::run | +2.48 KiB | 70 |
| saluki_components::destinations::prometheus | -1.98 KiB | 62 |
| saluki_context::resolver::TagsResolver | -1.93 KiB | 11 |
| [Unmapped] | +1.23 KiB | 1 |
| tokio | -1.18 KiB | 1796 |
| saluki_common::cache::expiry | -1.02 KiB | 14 |
| saluki_context::resolver::ContextResolver | -975 B | 7 |
| hyper | +888 B | 265 |
| prometheus_exposition::PrometheusRenderer::format_labels | +808 B | 1 |
| [sections] | -800 B | 8 |

Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
[NEW] +1.79Mi [NEW] +1.79Mi std::thread::local::LocalKey<T>::with::h31315ac4c244155e
[NEW] +114Ki [NEW] +114Ki agent_data_plane::cli::run::create_topology::_{{closure}}::h98d0ae31cb4f4046
[NEW] +84.6Ki [NEW] +84.5Ki agent_data_plane::internal::control_plane::spawn_control_plane::_{{closure}}::h46b094814d6a697c
[NEW] +64.2Ki [NEW] +64.0Ki saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::hc95a739242c3695b
[NEW] +59.9Ki [NEW] +59.8Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::h182a78b26161c4a1
[NEW] +49.5Ki [NEW] +49.4Ki saluki_app::bootstrap::AppBootstrapper::bootstrap::_{{closure}}::h24978ee6b466cfc2
[NEW] +46.1Ki [NEW] +45.9Ki _<saluki_components::forwarders::otlp::OtlpForwarder as saluki_core::components::forwarders::Forwarder>::run::_{{closure}}::h3a69188b1f9c6c42
[NEW] +45.7Ki [NEW] +45.5Ki _<saluki_components::destinations::prometheus::Prometheus as saluki_core::components::destinations::Destination>::run::_{{closure}}::h14660224772a0b86
[NEW] +44.3Ki [NEW] +44.1Ki _<saluki_components::transforms::aggregate::Aggregate as saluki_core::components::transforms::Transform>::run::_{{closure}}::hfe3b70ef3fd62917
[NEW] +44.1Ki [NEW] +43.9Ki saluki_env::workload::providers::remote_agent::RemoteAgentWorkloadProvider::from_configuration::_{{closure}}::hb4d68b3a7be09944
-0.2% -21.6Ki -0.2% -16.1Ki [15536 Others]
[DEL] -44.1Ki [DEL] -43.9Ki saluki_env::workload::providers::remote_agent::RemoteAgentWorkloadProvider::from_configuration::_{{closure}}::hbdb816f02e21b1cf
[DEL] -44.4Ki [DEL] -44.2Ki _<saluki_components::transforms::aggregate::Aggregate as saluki_core::components::transforms::Transform>::run::_{{closure}}::hb4d1ad0684099c4c
[DEL] -46.0Ki [DEL] -45.8Ki _<saluki_components::destinations::prometheus::Prometheus as saluki_core::components::destinations::Destination>::run::_{{closure}}::hea5618e9afd08b63
[DEL] -46.1Ki [DEL] -45.9Ki _<saluki_components::forwarders::otlp::OtlpForwarder as saluki_core::components::forwarders::Forwarder>::run::_{{closure}}::hb0c627893c271e2c
[DEL] -49.5Ki [DEL] -49.4Ki saluki_app::bootstrap::AppBootstrapper::bootstrap::_{{closure}}::h15f926ef5659e580
[DEL] -57.8Ki [DEL] -57.7Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::hebabd4c0d26708e2
[DEL] -64.2Ki [DEL] -64.0Ki saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::h1c7964dcd8ecbead
[DEL] -84.6Ki [DEL] -84.5Ki agent_data_plane::internal::control_plane::spawn_control_plane::_{{closure}}::h78c38079d35f4f77
[DEL] -114Ki [DEL] -114Ki agent_data_plane::cli::run::create_topology::_{{closure}}::h4a1b0b37b7e2d03e
[DEL] -1.79Mi [DEL] -1.79Mi std::thread::local::LocalKey<T>::with::he517c09e5477efe1
-0.1% -18.9Ki -0.1% -13.4Ki TOTAL
Regression Detector (Agent Data Plane)
Regression Detector Results
Run ID: 2976c95e-c611-4939-b0a7-9a67aa58e16c
Baseline: 1d50af4
Optimization Goals: ✅ No significant changes detected

| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ❌ | otlp_ingest_logs_5mb_memory | memory utilization | +19.81 | [+19.14, +20.48] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | +0.01 | [-0.13, +0.14] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | -1.20 | [-6.13, +3.74] | 1 | (metrics) (profiles) (logs) |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ❌ | otlp_ingest_logs_5mb_memory | memory utilization | +19.81 | [+19.14, +20.48] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_cpu | % cpu utilization | +1.85 | [-4.32, +8.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_cpu | % cpu utilization | +1.41 | [-29.29, +32.11] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_cpu | % cpu utilization | +1.09 | [-5.62, +7.79] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_low | memory utilization | +0.70 | [+0.51, +0.90] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_cpu | % cpu utilization | +0.46 | [-2.09, +3.01] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_memory | memory utilization | +0.45 | [+0.27, +0.63] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_memory | memory utilization | +0.40 | [+0.23, +0.58] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_medium | memory utilization | +0.33 | [+0.14, +0.52] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_memory | memory utilization | +0.31 | [+0.06, +0.55] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_memory | memory utilization | +0.27 | [+0.09, +0.44] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_cpu | % cpu utilization | +0.23 | [-2.01, +2.47] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_memory | memory utilization | +0.16 | [-0.02, +0.35] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_heavy | memory utilization | +0.02 | [-0.12, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_10mb_3k_contexts_throughput | ingress throughput | +0.01 | [-0.13, +0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_throughput | ingress throughput | +0.01 | [-0.13, +0.14] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_throughput | ingress throughput | +0.00 | [-0.06, +0.06] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_throughput | ingress throughput | +0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_throughput | ingress throughput | +0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_100mb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_throughput | ingress throughput | -0.00 | [-0.02, +0.02] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_throughput | ingress throughput | -0.00 | [-0.06, +0.05] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_throughput | ingress throughput | -0.01 | [-0.14, +0.12] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_filtering_5mb_memory | memory utilization | -0.01 | [-0.34, +0.32] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_ottl_transform_5mb_memory | memory utilization | -0.17 | [-0.42, +0.07] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_idle | memory utilization | -0.22 | [-0.26, -0.17] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_memory | memory utilization | -0.22 | [-0.39, -0.05] | 1 | (metrics) (profiles) (logs) |
| ➖ | quality_gates_rss_dsd_ultraheavy | memory utilization | -0.29 | [-0.42, -0.15] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_cpu | % cpu utilization | -0.57 | [-2.02, +0.89] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_logs_5mb_cpu | % cpu utilization | -1.20 | [-6.13, +3.74] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_1mb_3k_contexts_cpu | % cpu utilization | -1.39 | [-53.46, +50.68] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_500mb_3k_contexts_throughput | ingress throughput | -2.85 | [-2.98, -2.72] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_metrics_5mb_memory | memory utilization | -3.42 | [-3.61, -3.22] | 1 | (metrics) (profiles) (logs) |
| ➖ | dsd_uds_512kb_3k_contexts_cpu | % cpu utilization | -3.73 | [-59.38, +51.92] | 1 | (metrics) (profiles) (logs) |
| ➖ | otlp_ingest_traces_5mb_cpu | % cpu utilization | -3.87 | [-6.04, -1.69] | 1 | (metrics) (profiles) (logs) |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | quality_gates_rss_dsd_heavy | memory_usage | 10/10 | 115.12MiB ≤ 140MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_low | memory_usage | 10/10 | 33.71MiB ≤ 50MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_medium | memory_usage | 10/10 | 54.02MiB ≤ 75MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | 166.80MiB ≤ 200MiB | (metrics) (profiles) (logs) |
| ✅ | quality_gates_rss_idle | memory_usage | 10/10 | 21.11MiB ≤ 40MiB | (metrics) (profiles) (logs) |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we classify a change in performance as a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
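
The three criteria above can be written as a small predicate. This is only a sketch of the decision rule as stated in this explanation, not the regression detector's actual code: the `ExperimentResult` struct, its field names, and the `merits_investigation` function are all hypothetical, and the `erratic` flag is assumed to come from the experiment's configuration.

```rust
/// Hypothetical summary of one experiment row, mirroring the table columns.
#[derive(Debug, Clone, Copy)]
struct ExperimentResult {
    /// "Δ mean %" column.
    delta_mean_pct: f64,
    /// Lower and upper bounds of the "Δ mean % CI" column.
    ci_low: f64,
    ci_high: f64,
    /// Assumed to be read from the experiment's configuration.
    erratic: bool,
}

/// Effect size tolerance quoted in the explanation (|Δ mean %| ≥ 5.00%).
const EFFECT_SIZE_TOLERANCE_PCT: f64 = 5.0;

/// Returns true only when all three stated criteria hold.
fn merits_investigation(r: &ExperimentResult) -> bool {
    // 1. The change is large enough to merit a closer look.
    let big_enough = r.delta_mean_pct.abs() >= EFFECT_SIZE_TOLERANCE_PCT;
    // 2. The confidence interval lies entirely on one side of zero.
    let ci_excludes_zero = r.ci_low > 0.0 || r.ci_high < 0.0;
    // 3. The experiment is not marked "erratic".
    big_enough && ci_excludes_zero && !r.erratic
}
```

Under this sketch, and assuming the experiment is not marked erratic, the `otlp_ingest_logs_5mb_memory` row (+19.81, CI [+19.14, +20.48]) satisfies all three criteria. The `dsd_uds_500mb_3k_contexts_throughput` row (-2.85, CI [-2.98, -2.72]) does not, because its interval excludes zero but its effect size is under the 5.00% tolerance.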

Summary
This PR refactors how we collect and filter the internal telemetry sent to the Core Agent via RAR in order to isolate Agent-specific internal telemetry (things that get sent via COAT, basically) from the general basket of internal metrics emitted by Saluki.
As we need to publish specific metrics to the Core Agent in order to drive COAT, we have to take our internal telemetry and arrange it in a way that lines up with what the Core Agent expects, and then send all of that over RAR. This currently takes the shape of scraping everything from a Prometheus destination that exposes all internal telemetry. This is suboptimal because it means we're always pushing internal telemetry through RAR, so we can't get the origin enrichment (container tags, etc.) that a distinct autodiscovery-based check scraping the endpoint would provide.
This PR reworks how we filter, remap, and collect internal telemetry to support both the COAT telemetry we send over RAR and the general internal telemetry we expose over the Prometheus scrape endpoint. In no particular order:
- […] `prometheus-exposition`, and updated the Prometheus destination to depend on it
- […] `agent-data-plane` to use the remapper rules, as they align with the COAT-only metrics we care about sending via RAR

Overall, this allows us to send COAT-relevant metrics to RAR while still scraping internal telemetry via Prometheus, so that both can coexist without clobbering or overlapping each other (which leads to confusing metrics queries on dashboards) and without sending irrelevant metrics to RAR.
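
To illustrate the idea of using the remapper rules as the filter for what goes over RAR, here is a minimal sketch. It is not the code in this PR: `RemapRules`, `select_for_rar`, and every metric name used with them are hypothetical, and the real remapper presumably also handles tags and other details. The sketch only assumes that a metric with a matching rule is renamed and shipped over RAR, while a metric without one is left out of the RAR stream (and would still be visible on the Prometheus scrape endpoint).

```rust
use std::collections::HashMap;

/// Hypothetical rule table: Saluki internal metric name -> name the Core Agent expects.
struct RemapRules {
    rules: HashMap<&'static str, &'static str>,
}

impl RemapRules {
    fn new(pairs: &[(&'static str, &'static str)]) -> Self {
        Self { rules: pairs.iter().copied().collect() }
    }

    /// Returns the Agent-facing name if a rule matches, otherwise `None`.
    fn remap(&self, internal_name: &str) -> Option<&'static str> {
        self.rules.get(internal_name).copied()
    }
}

/// Keeps only the metrics that have a remap rule, renaming them on the way.
/// Metrics without a rule are dropped from the RAR payload.
fn select_for_rar(rules: &RemapRules, metrics: &[(&str, f64)]) -> Vec<(String, f64)> {
    metrics
        .iter()
        .filter_map(|(name, value)| rules.remap(name).map(|n| (n.to_string(), *value)))
        .collect()
}
```

With a single hypothetical rule mapping `internal_a` to `agent_a`, passing `internal_a` and `internal_b` to `select_for_rar` would yield only the renamed `agent_a` sample. Using the rule table itself as the allowlist means there is a single place that defines both what gets renamed and what gets shipped.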
Change Type
How did you test this PR?
Built and ran ADP locally, and double-checked that only COAT-specific metrics are sent via RAR and exposed in the Core Agent's telemetry endpoint. Ensured that non-remapped Saluki internal telemetry was exposed on the internal telemetry endpoint.
References
AGTMETRICS-400