Lazy-allocate error latency histogram on AggregateEntry #11478

Open
dougqh wants to merge 1 commit into dougqh/optimize-metric-key from dougqh/lazy-error-latencies

dougqh commented May 27, 2026

Summary

  • Defer errorLatencies histogram allocation until the first error is recorded on an entry. Most entries never see an error in their lifetime; previously each one carried a ~60-80 byte empty DDSketchHistogram for life.
  • Across a full 2048-entry table, saves ~150 KB if 95% of entries never error (the typical case).
  • SerializingMetricWriter caches the serialized form of an empty histogram (~17 bytes) and emits those cached bytes when an entry's errorLatencies is null, so the wire format is byte-identical to before.

Background

Extracted from #11389, where the same change was bundled with cardinality- and peer-tag-related work. This PR is just the lazy-errorLatencies piece; it sits between #11382 and #11387 so it can ship without depending on the cardinality machinery in #11387.

Trade-off

Entries that do see an error retain the histogram across clear() (cleared, not nulled). An always-erroring entry allocates exactly once. Same total allocation as before for that path.
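
The lazy allocation and the clear() trade-off described above might look roughly like the following sketch. `Histo` is a hypothetical stand-in for DDSketchHistogram, and the method signatures, the `errorLatenciesOrNull` accessor, and the allocation counter are assumptions for illustration. Only the names recordOneDuration, errorLatenciesForWrite, and clear come from the PR text. The sketch also assumes a single consumer thread.

```java
final class LazyErrorLatenciesSketch {
  /** Hypothetical stand-in for DDSketchHistogram; it only counts samples. */
  static final class Histo {
    static int allocations = 0; // illustration only: counts constructor calls
    private long count;

    Histo() { allocations++; }
    void accept(long durationNanos) { count++; }
    void clear() { count = 0; }
    long count() { return count; }
  }

  static final class Entry {
    // Assumption: the ok-latency histogram is still allocated eagerly.
    private final Histo okLatencies = new Histo();
    // Starts null and is only allocated when the first error is recorded.
    private Histo errorLatencies;

    void recordOneDuration(long durationNanos, boolean isError) {
      if (isError) {
        errorLatenciesForWrite().accept(durationNanos);
      } else {
        okLatencies.accept(durationNanos);
      }
    }

    // Not thread-safe: assumes one consumer thread mutates the entry.
    private Histo errorLatenciesForWrite() {
      Histo h = errorLatencies;
      if (h == null) {
        h = new Histo();
        errorLatencies = h;
      }
      return h;
    }

    Histo errorLatenciesOrNull() { return errorLatencies; }

    // Clears contents but keeps an already-allocated error histogram.
    void clear() {
      okLatencies.clear();
      if (errorLatencies != null) {
        errorLatencies.clear();
      }
    }
  }

  public static void main(String[] args) {
    Entry e = new Entry();
    e.recordOneDuration(1_000, false);
    System.out.println(e.errorLatenciesOrNull() == null); // true
    e.recordOneDuration(2_000, true);
    e.clear();
    e.recordOneDuration(3_000, true);
    System.out.println(e.errorLatenciesOrNull().count()); // 1
  }
}
```

In this sketch an entry that never errors holds one histogram instead of two, while an always-erroring entry allocates its error histogram once and reuses it across clear() cycles.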

Throughput benchmarks

This is a heap-footprint change, not a CPU one; the consumer's hot path is unchanged. The bench suite was re-run anyway as a sanity check to confirm no throughput regression vs the #11382 base. Same machine state and JMH config as the rest of the stack's runs (8 producer threads, 2×15s warmup + 5×15s measurement, 1 fork, throughput mode).

| Bench (ops/s) | v1.62.0 | master | #11382 | this PR (#11478) |
|---|---|---|---|---|
| Adversarial | 444,290 ± 1,616,937 | 14,276,351 ± 1,091,138 | 32,556,300 ± 4,321,490 | 30,609,314 ± 6,944,664 |
| HighCardinalityResource | 4,854,335 ± 1,214,233 | 8,168,005 ± 3,493,716 | 35,739,452 ± 2,556,684 | 34,552,088 ± 4,687,212 |
| HighCardinalityPeer | 6,902,209 ± 368,641 | 10,110,142 ± 3,380,594 | 37,638,634 ± 6,673,337 | 35,491,425 ± 4,970,576 |

#11478 vs #11382 is within the per-run error bar on every bench (0.94×–0.97×) — statistically indistinguishable. The CPU-side hot path didn't change: recordOneDuration now calls errorLatenciesForWrite() instead of reading a final field, but that's a single-field-load-and-branch on every entry's first error and a direct field load thereafter, which the JIT inlines flat. aggregateDropped counts are also in line with #11382, confirming the lazy field doesn't perturb the table-cap behavior.

The actual win — the ~150 KB heap reclamation at full table cap when 95% of entries never error — isn't observable in a throughput bench. It would show up in jol-based per-entry footprint inspection (one fewer histogram per entry) or in a long-running profile of allocated-bytes-per-cycle (errorLatencies allocation amortizes from "one per unique key" to "one per unique error-emitting key").
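
As a rough check of the ~150 KB figure, the arithmetic can be sketched as below. The helper name and the model (table size × never-error fraction × bytes per empty histogram) are assumptions for illustration; actual per-object sizes depend on the JVM and would need a tool such as JOL to confirm.

```java
final class FootprintEstimate {
  /** Estimated bytes reclaimed when never-erroring entries skip the empty histogram. */
  static long savedBytes(int tableSize, double neverErrorFraction, int bytesPerEmptyHistogram) {
    return Math.round(tableSize * neverErrorFraction * bytesPerEmptyHistogram);
  }

  public static void main(String[] args) {
    // Lower and upper ends of the 60-80 byte estimate quoted in the PR.
    System.out.println(savedBytes(2048, 0.95, 60)); // 116736
    System.out.println(savedBytes(2048, 0.95, 80)); // 155648
  }
}
```

Under these assumptions the saving falls between roughly 114 KiB and 152 KiB, so the ~150 KB figure corresponds to the upper (~80-byte) estimate.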

Test plan

  • :dd-trace-core:test — metrics tests pass
  • No behavior change to the client-stats wire payload

🤖 Generated with Claude Code

Each AggregateEntry allocated two DDSketchHistograms in its constructor
(ok + error latencies). DDSketchHistogram wraps a DDSketch + lazy store,
roughly 60-80 bytes per histogram even when empty. Most spans aren't
errors, so most entries' errorLatencies sit empty for life.

Now the field starts null. recordOneDuration lazy-allocates on the first
error; if no error ever lands on the entry, it stays null and ~80 bytes
of empty-histogram overhead are reclaimed. Across a full 2048-entry
table that's ~150 KB if 95% of entries never error -- the typical case.

For the wire format, SerializingMetricWriter caches the serialized form
of an empty histogram (~17 bytes) on first use and writes those cached
bytes when an entry's errorLatencies is null. The cache is per-writer
(not a global static) so each writer instance picks up the Histograms
factory state at the time of its first report, avoiding races with test
setup that registers the DDSketch factory at varying points.
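
The per-writer cache described above could be sketched as follows. This is not the actual SerializingMetricWriter code: the `Histo` interface, the `Supplier`-based factory, the `factoryCalls` counter, and the two-byte payload are all hypothetical stand-ins, chosen only to show the empty form being serialized once per writer instance and reused whenever errorLatencies is null.

```java
import java.io.ByteArrayOutputStream;
import java.util.function.Supplier;

final class CachedEmptyHistogramSketch {
  /** Hypothetical minimal histogram interface; not the real API. */
  interface Histo {
    byte[] serialize();
  }

  static final class Writer {
    private final Supplier<Histo> emptyHistogramFactory; // stands in for the Histograms factory
    private byte[] cachedEmptyBytes;                      // per-writer, filled on first use
    int factoryCalls;                                     // illustration only

    Writer(Supplier<Histo> emptyHistogramFactory) {
      this.emptyHistogramFactory = emptyHistogramFactory;
    }

    void writeErrorLatencies(Histo errorLatencies, ByteArrayOutputStream out) {
      byte[] bytes = errorLatencies != null ? errorLatencies.serialize() : emptyHistogramBytes();
      out.write(bytes, 0, bytes.length);
    }

    // Consults the factory lazily, so it sees the factory state at first report.
    private byte[] emptyHistogramBytes() {
      if (cachedEmptyBytes == null) {
        factoryCalls++;
        cachedEmptyBytes = emptyHistogramFactory.get().serialize();
      }
      return cachedEmptyBytes;
    }
  }

  public static void main(String[] args) {
    Histo empty = () -> new byte[] {0x0A, 0x00}; // made-up bytes, not the real encoding
    Writer writer = new Writer(() -> empty);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writer.writeErrorLatencies(null, out);
    writer.writeErrorLatencies(null, out);
    System.out.println(out.size() + " bytes, factory calls: " + writer.factoryCalls);
    // prints "4 bytes, factory calls: 1"
  }
}
```

Keeping the cache as an instance field rather than a static means two writers created under different factory registrations never share stale bytes, which matches the test-setup concern mentioned in the commit message.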

Trade-off: entries that DO see an error retain the histogram across
clear() (just cleared, not nulled), so always-erroring entries allocate
exactly once. Same total allocation as before for that case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

datadog-datadog-prod-us1-2 Bot commented May 27, 2026

Pipelines

⚠️ Warnings

🚦 1 Pipeline job failed

DataDog/apm-reliability/dd-trace-java | agent_integration_tests

4 failed tests due to IllegalAccessError at MetricsIntegrationTest.groovy:44.

Commit SHA: f2ee559

dd-octo-sts Bot commented May 27, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
|---|---|
| Startup | 🟢 pass |

SLO thresholds are defined based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

| Scenario | This PR | master | Change |
|---|---|---|---|
| insecure-bank / iast | 13,994 ms | 13,967 ms | +0.2% |
| insecure-bank / tracing | 12,866 ms | 13,083 ms | -1.7% |
| petclinic / appsec | 16,513 ms | 16,176 ms | +2.1% |
| petclinic / iast | 16,525 ms | 15,798 ms | +4.6% |
| petclinic / profiling | 15,574 ms | 16,489 ms | -5.5% |
| petclinic / tracing | 14,872 ms | 15,684 ms | -5.2% |

Commit: f2ee559c


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

dougqh marked this pull request as ready for review May 27, 2026 19:25
dougqh requested a review from a team as a code owner May 27, 2026 19:25
dougqh requested a review from amarziali May 27, 2026 19:25
dd-octo-sts Bot added the "tag: ai generated" label (Largely based on code generated by an AI or LLM) May 27, 2026