fix(core): use compensated summation for histograms #1666

Merged: gh-worker-dd-mergequeue-cf854d[bot] merged 1 commit into main from mark.kirichenko/use-compensated-summation, May 18, 2026

@atanzu atanzu commented May 15, 2026

Summary

Use a modified Neumaier algorithm to calculate sums, counts, and quantiles for histogram samples.

The naive `sum += value * weight` loop suffers catastrophic cancellation when the sample stream contains values of wildly different magnitudes. The classic Kahan/Peters counter-example `{1, +1e100, 1, -1e100}` evaluates to 0 with naive summation but to the correct 2.0 with the new algorithm.
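The difference can be reproduced with a minimal sketch of the textbook Neumaier summation. This is not the code from this PR: the PR describes a *modified* variant whose changes are not shown here, and the names `NeumaierSum`, `naive_weighted_sum`, and `compensated_weighted_sum` are hypothetical. Samples are modeled as `(value, weight)` pairs with unit weights.

```rust
/// Running sum plus a compensation term that collects the low-order bits
/// lost by each floating-point addition (textbook Neumaier, not the PR's variant).
#[derive(Debug, Default, Clone, Copy)]
struct NeumaierSum {
    sum: f64,
    compensation: f64,
}

impl NeumaierSum {
    fn add(&mut self, x: f64) {
        let t = self.sum + x;
        if self.sum.abs() >= x.abs() {
            // Low-order bits of `x` were lost in `t`; recover them.
            self.compensation += (self.sum - t) + x;
        } else {
            // Low-order bits of `self.sum` were lost in `t`; recover them.
            self.compensation += (x - t) + self.sum;
        }
        self.sum = t;
    }

    fn total(&self) -> f64 {
        self.sum + self.compensation
    }
}

/// The naive loop from the PR description: `sum += value * weight`.
fn naive_weighted_sum(samples: &[(f64, f64)]) -> f64 {
    let mut sum = 0.0;
    for &(value, weight) in samples {
        sum += value * weight;
    }
    sum
}

/// The same loop, accumulating through the compensated sum instead.
fn compensated_weighted_sum(samples: &[(f64, f64)]) -> f64 {
    let mut acc = NeumaierSum::default();
    for &(value, weight) in samples {
        acc.add(value * weight);
    }
    acc.total()
}

fn main() {
    // The counter-example from the description, each sample with weight 1.
    let samples = [(1.0, 1.0), (1e100, 1.0), (1.0, 1.0), (-1e100, 1.0)];
    println!("naive       = {}", naive_weighted_sum(&samples));
    println!("compensated = {}", compensated_weighted_sum(&samples));
}
```

In the naive loop both `1.0` terms are absorbed into `1e100` and vanish. The compensated version keeps them in the correction term, so the total comes out as 2.0.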

Change Type

  • [x] Bug fix
  • [ ] New feature
  • [ ] Non-functional (chore, refactoring, docs)
  • [ ] Performance

How did you test this PR?

Added unit tests to check correctness.

References

Similar PR in Datadog-Agent (DataDog/datadog-agent#49913).

Signed-off-by: Mark Kirichenko <mark.kirichenko@datadoghq.com>
@dd-octo-sts dd-octo-sts Bot added labels May 15, 2026: area/core (Core functionality, event model, etc.), area/components (Sources, transforms, and destinations.), transform/aggregate (Aggregate transform.), destination/prometheus (Prometheus Scrape destination.), encoder/datadog-metrics (Datadog Metrics encoder.)
@tobz tobz changed the title Use compensated summation for histograms fix(chore): use compensated summation for histograms May 15, 2026

pr-commenter Bot commented May 15, 2026

Binary Size Analysis (Agent Data Plane)

Target: 55bca14 (baseline) vs df02eb3 (comparison) diff
Analysis Type: Stripped binaries (debug symbols excluded)
Baseline Size: 37.36 MiB
Comparison Size: 37.32 MiB
Size Change: -39.87 KiB (-0.10%)
Pass/Fail Threshold: +5%
Result: PASSED ✅

Changes by Module

| Module | File Size | Symbols |
| --- | --- | --- |
| core | -33.99 KiB | 1451 |
| smallvec | +22.57 KiB | 62 |
| figment | -10.78 KiB | 12 |
| saluki_core::data_model::event | -8.63 KiB | 22 |
| [Unmapped] | -4.36 KiB | 1 |
| anon.0cac25fc52ba6f4fc475348a8c66d8e3.39.llvm.4081512695721216103 | +3.90 KiB | 1 |
| anon.8bb1cfcdf181d421e8889bc3626b8144.17.llvm.125851317713716366 | -3.81 KiB | 1 |
| [sections] | -3.30 KiB | 7 |
| anon.bcfbe2edddf7aafb2d5d5e0cc5ffa1e5.16.llvm.12024661772337668300 | -3.15 KiB | 1 |
| hashbrown | +3.15 KiB | 24 |
| anon.0f2fa1d1fad1031510176699744ee20b.644.llvm.3708903403341574001 | +3.06 KiB | 1 |
| papaya | +2.56 KiB | 11 |
| anon.eb51c975e2567ebfca80d7da0abd4cd1.7.llvm.11556325092004374934 | -2.55 KiB | 1 |
| anon.40340ab29c26454e228b882d4ecb70d4.0.llvm.14605026808982526804 | +2.54 KiB | 1 |
| saluki_api::DynamicRoute::http | +1.99 KiB | 1 |
| anon.bdadb00871588c874a55b8b73ce579a9.0.llvm.8997752761958168434 | -1.93 KiB | 1 |
| anon.41b0b8befc3118c3a2b0f17ec06872eb.9.llvm.17862187855904175761 | +1.93 KiB | 1 |
| alloc | -1.91 KiB | 68 |
| serde_core | +1.88 KiB | 40 |
| tokio_util | -1.84 KiB | 11 |
Detailed Symbol Changes

    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW] +6.41Ki  [NEW] +6.33Ki    matchit::router::Router<T>::insert::h261aa988f7ce6a25
  [NEW] +6.14Ki  [NEW] +6.05Ki    matchit::router::Router<T>::insert::hfe75556626a71a8d
  [NEW] +6.07Ki  [NEW] +5.99Ki    matchit::tree::Node<T>::insert::h889792c8a42776b1
  +904% +5.75Ki +10e2% +5.75Ki    axum::routing::Router<S>::route::h0531592b945792f9
  [NEW] +5.42Ki  [NEW] +5.33Ki    serde_core::de::MapAccess::next_value::ha8691cfa6c85eadd
  +795% +5.40Ki  +908% +5.40Ki    tokio::runtime::runtime::Runtime::block_on::h59580466f3413de1
  +992% +5.27Ki +12e2% +5.27Ki    tokio::runtime::runtime::Runtime::block_on::h26cc5fddc7a2e095
  [NEW] +3.90Ki  [NEW]     +16    anon.0cac25fc52ba6f4fc475348a8c66d8e3.39.llvm.4081512695721216103
  [NEW] +3.06Ki  [NEW]     +74    anon.0f2fa1d1fad1031510176699744ee20b.644.llvm.3708903403341574001
  [DEL] -2.55Ki  [DEL]     -80    anon.eb51c975e2567ebfca80d7da0abd4cd1.7.llvm.11556325092004374934
  [DEL] -2.89Ki  [DEL] -2.77Ki    quick_cache::shard::CacheShard<Key,Val,We,B,L,Plh>::insert::haf55a3f4a81ec55a
  [DEL] -3.15Ki  [DEL]     -74    anon.bcfbe2edddf7aafb2d5d5e0cc5ffa1e5.16.llvm.12024661772337668300
  [DEL] -3.48Ki  [DEL] -2.33Ki    _<serde_core::de::impls::<impl serde_core::de::Deserialize for core::time::Duration>::deserialize::DurationVisitor as serde_core::de::Visitor>::visit_map::hed68c5e9e330be6d
  [DEL] -3.81Ki  [DEL]     -16    anon.8bb1cfcdf181d421e8889bc3626b8144.17.llvm.125851317713716366
 -51.3% -4.36Ki  [ = ]       0    [Unmapped]
  [DEL] -5.26Ki  [DEL] -5.10Ki    _<figment::value::de::ConfiguredValueDe<I> as serde_core::de::Deserializer>::deserialize_struct::hedaa6bc6b4faff3a
  [DEL] -6.01Ki  [DEL] -5.93Ki    matchit::tree::Node<T>::insert::ha63513d725e691c6
  [DEL] -6.11Ki  [DEL] -6.02Ki    matchit::router::Router<T>::insert::h02e8e157485e9a5b
  [DEL] -6.17Ki  [DEL] -6.06Ki    axum::routing::path_router::PathRouter<S,_>::route::h75fd76341dec4a4b
  [DEL] -6.39Ki  [DEL] -6.31Ki    matchit::tree::Node<T>::insert::h2e5068899f601750
  -0.6% -37.1Ki  -0.6% -25.2Ki    [7148 Others]
  -0.1% -39.9Ki  -0.1% -19.6Ki    TOTAL

pr-commenter Bot commented May 15, 2026

Regression Detector (Agent Data Plane)

Run ID: 540d69aa-d5f5-4b28-a395-65c42b03be5d
Baseline: 55bca143 · Comparison: df02eb32 · Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment (35)

Experiments configured erratic: true are tagged (ignored) and skipped when determining which experiments regressed or improved. Experiments which are detected as erratic at runtime are tagged (erratic) to flag that the run's sample dispersion was high, but their regression / improvement signal still counts.

| experiment | goal | Δ mean % |
| --- | --- | --- |
| dsd_uds_1mb_3k_contexts_cpu (erratic) | cpu | ⚪ +7.96 |
| otlp_ingest_metrics_5mb_memory | memory | ⚪ +3.73 |
| dsd_uds_500mb_3k_contexts_throughput | throughput | ⚪ -3.00 |
| dsd_uds_10mb_3k_contexts_cpu (erratic) | cpu | ⚪ +2.36 |
| otlp_ingest_traces_ottl_transform_5mb_cpu (erratic) | cpu | ⚪ +0.75 |
| dsd_uds_100mb_3k_contexts_cpu (erratic) | cpu | ⚪ +0.36 |
| dsd_uds_10mb_3k_contexts_memory | memory | ⚪ +0.28 |
| otlp_ingest_traces_ottl_transform_5mb_memory | memory | ⚪ +0.24 |
| otlp_ingest_traces_5mb_cpu (erratic) | cpu | ⚪ +0.22 |
| otlp_ingest_traces_5mb_memory | memory | ⚪ +0.16 |
| quality_gates_rss_dsd_heavy | memory | ⚪ +0.16 |
| quality_gates_rss_dsd_low | memory | ⚪ +0.14 |
| dsd_uds_512kb_3k_contexts_cpu (erratic) | cpu | ⚪ +0.10 |
| quality_gates_rss_idle | memory | ⚪ +0.09 |
| otlp_ingest_metrics_5mb_throughput | throughput | ⚪ -0.02 |
| otlp_ingest_logs_5mb_throughput (ignored) | throughput | ⚪ -0.02 |
| otlp_ingest_traces_ottl_filtering_5mb_memory | memory | ⚪ +0.01 |
| dsd_uds_512kb_3k_contexts_throughput | throughput | ⚪ -0.00 |
| dsd_uds_1mb_3k_contexts_throughput | throughput | ⚪ -0.00 |
| dsd_uds_100mb_3k_contexts_throughput | throughput | ⚪ +0.00 |
| dsd_uds_10mb_3k_contexts_throughput | throughput | ⚪ +0.02 |
| otlp_ingest_traces_5mb_throughput | throughput | ⚪ +0.03 |
| dsd_uds_500mb_3k_contexts_memory | memory | ⚪ -0.03 |
| otlp_ingest_traces_ottl_transform_5mb_throughput | throughput | ⚪ +0.08 |
| otlp_ingest_traces_ottl_filtering_5mb_throughput | throughput | ⚪ +0.12 |
| dsd_uds_512kb_3k_contexts_memory | memory | ⚪ -0.16 |
| quality_gates_rss_dsd_medium | memory | ⚪ -0.16 |
| quality_gates_rss_dsd_ultraheavy | memory | ⚪ -0.17 |
| dsd_uds_100mb_3k_contexts_memory | memory | ⚪ -0.23 |
| dsd_uds_1mb_3k_contexts_memory | memory | ⚪ -0.25 |
| otlp_ingest_traces_ottl_filtering_5mb_cpu (erratic) | cpu | ⚪ -0.55 |
| dsd_uds_500mb_3k_contexts_cpu (erratic) | cpu | ⚪ -0.94 |
| otlp_ingest_metrics_5mb_cpu (erratic) | cpu | ⚪ -1.27 |
| otlp_ingest_logs_5mb_memory (ignored) | memory | ⚪ -1.29 |
| otlp_ingest_logs_5mb_cpu (ignored) | cpu | ⚪ -1.80 |

Bounds Checks: ✅ Passed (5)

| experiment | check | replicates | observed |
| --- | --- | --- | --- |
| quality_gates_rss_dsd_heavy | memory_usage | 10/10 ✅ | 123 MiB ≤ 140 MiB |
| quality_gates_rss_dsd_low | memory_usage | 10/10 ✅ | 39.8 MiB ≤ 50 MiB |
| quality_gates_rss_dsd_medium | memory_usage | 10/10 ✅ | 60.9 MiB ≤ 75 MiB |
| quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 ✅ | 178 MiB ≤ 200 MiB |
| quality_gates_rss_idle | memory_usage | 10/10 ✅ | 27.1 MiB ≤ 40 MiB |

Explanation

A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression (is_regression: true). Improvements use the matching criteria for the improving direction. Experiments configured erratic: true (tagged (ignored)) are skipped outright; experiments detected as erratic at runtime (tagged (erratic)) still count, since that flag describes sample dispersion rather than directional certainty. The Δ mean % cell is colored accordingly: 🟢 = improvement, 🔴 = regression, ⚪ = neutral. Reduction in CPU or memory is an improvement; reduction in ingress throughput is a regression.
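The decision rule in that explanation can be written out as a small sketch. This is an illustration of the rule as stated, not the detector's actual implementation: the types `Goal`, `SmpSignal`, and `Verdict`, the function `classify`, and the idea that SMP's improvement flag mirrors `is_regression` are all assumptions made here.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Goal {
    Cpu,
    Memory,
    Throughput,
}

/// Hypothetical stand-in for SMP's own verdict (e.g. `is_regression: true`).
#[derive(Clone, Copy, Debug, PartialEq)]
enum SmpSignal {
    Regression,
    Improvement,
    Neither,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Verdict {
    Regression,  // 🔴
    Improvement, // 🟢
    Neutral,     // ⚪
    Skipped,     // configured `erratic: true`, tagged (ignored)
}

const THRESHOLD_PCT: f64 = 5.0;

fn classify(goal: Goal, delta_mean_pct: f64, signal: SmpSignal, configured_erratic: bool) -> Verdict {
    // Experiments configured as erratic are skipped outright.
    if configured_erratic {
        return Verdict::Skipped;
    }
    // The change must exceed the threshold in absolute value.
    if delta_mean_pct.abs() <= THRESHOLD_PCT {
        return Verdict::Neutral;
    }
    // More CPU or memory is a regression; less ingress throughput is a regression.
    let increase_is_regression = matches!(goal, Goal::Cpu | Goal::Memory);
    let regressing_direction = (delta_mean_pct > 0.0) == increase_is_regression;
    // Both the direction and SMP's own signal have to agree.
    match (regressing_direction, signal) {
        (true, SmpSignal::Regression) => Verdict::Regression,
        (false, SmpSignal::Improvement) => Verdict::Improvement,
        _ => Verdict::Neutral,
    }
}

fn main() {
    // A CPU experiment at +7.96% without an SMP regression signal stays neutral.
    let verdict = classify(Goal::Cpu, 7.96, SmpSignal::Neither, false);
    println!("{:?}", verdict);
}
```

Under this reading, the `dsd_uds_1mb_3k_contexts_cpu` row at +7.96 would stay ⚪ because the threshold alone is not sufficient. That only holds if SMP did not mark it as a regression, which the report does not state explicitly.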

@atanzu atanzu changed the title fix(chore): use compensated summation for histograms fix(core): use compensated summation for histograms May 18, 2026
@atanzu atanzu marked this pull request as ready for review May 18, 2026 05:25
@atanzu atanzu requested a review from a team as a code owner May 18, 2026 05:25

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df02eb32c4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

  let mut ddsketch = DDSketch::default();
  for sample in histogram.samples() {
-     ddsketch.insert_n(sample.value.into_inner(), sample.weight);
+     ddsketch.insert_n(sample.value.into_inner(), sample.weight.0 as u64);

P2: Preserve fractional histogram weights when encoding sketches

When this encoder handles a histogram built from a non-integer sample rate such as @0.21, Histogram::insert now stores sample.weight as the raw weight (~4.76), while summary.count() rounds that to the nearest sample count. This cast truncates the same sample to 4 before inserting it into the DDSketch, so encoded histogram payloads undercount fractional-weight samples and can disagree with the aggregate count/sum produced from the same histogram. Convert the raw weight using the same rounding/accounting policy before passing it to insert_n.
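The gap between truncating and rounding can be seen in a small sketch. It assumes the weight is the reciprocal of the sample rate (consistent with `@0.21` giving about 4.76) and that "rounds to the nearest sample count" means ordinary round-to-nearest. Both helper functions are hypothetical and are not part of the codebase.

```rust
/// What a bare `as u64` cast does: the fractional part is discarded.
fn weight_to_count_truncated(weight: f64) -> u64 {
    weight as u64
}

/// Hypothetical alternative: round to the nearest whole sample count first.
fn weight_to_count_rounded(weight: f64) -> u64 {
    weight.round() as u64
}

fn main() {
    // Assumption: a sample rate of 0.21 yields a raw weight of 1 / 0.21 ≈ 4.76.
    let sample_rate = 0.21_f64;
    let weight = 1.0 / sample_rate;

    println!("raw weight = {:.4}", weight);
    println!("truncated  = {}", weight_to_count_truncated(weight));
    println!("rounded    = {}", weight_to_count_rounded(weight));
}
```

For this weight the cast yields 4 while rounding yields 5, which is the kind of disagreement between the sketch and the aggregate count that the comment describes. Whether per-sample rounding matches the accounting policy used by `summary.count()` is not something this sketch can confirm.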

Useful? React with 👍 / 👎.

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit 19791e3 into main May 18, 2026
80 of 82 checks passed
dd-octo-sts Bot pushed a commit that referenced this pull request May 18, 2026
## Summary
Use modified Neumaier algorithm to calculate sums, counts, and quantiles for histogram samples.

The naive `sum += value * weight` loop suffers catastrophic cancellation when the sample stream contains values of wildly different magnitudes. The classic Kahan/Peters counter-example `{1, +1e100, 1, -1e100}` evaluates to 0 with naive summation but to the correct 2.0 with the new algorithm.

## Change Type
- [x] Bug fix
- [ ] New feature
- [ ] Non-functional (chore, refactoring, docs)
- [ ] Performance

## How did you test this PR?

Added unit tests to check correctness.

## References

[Similar PR in Datadog-Agent](DataDog/datadog-agent#49913).

Co-authored-by: mark.kirichenko <mark.kirichenko@datadoghq.com> 19791e3
@atanzu atanzu deleted the mark.kirichenko/use-compensated-summation branch May 18, 2026 05:52
Labels

area/components (Sources, transforms, and destinations.), area/core (Core functionality, event model, etc.), destination/prometheus (Prometheus Scrape destination.), encoder/datadog-metrics (Datadog Metrics encoder.), mergequeue-status: done, transform/aggregate (Aggregate transform.)

2 participants