test(export): add benchmarks for histogram sample processing#324
test(export): add benchmarks for histogram sample processing#324bwplotka wants to merge 7 commits into
Conversation
…theft Add cases to TestSampleBuilder for: - ungrouped (interleaved) histogram samples - ungrouped (interleaved) histogram samples with first group incomplete - ungrouped (interleaved) histogram samples with first group skipped due to new bucket Enforce strict _bucket, _count, _sum ordering for test samples. TAG=agy CONV=8f508481-de1c-4e6b-ad3a-718d089a2fbe
When histogram samples in a scrape batch are ungrouped (interleaved across different series label sets for the same metric name), existing buildDistribution would return as soon as any histogram in the cache completed. This caused histogram distribution theft (attaching completed distributions to the wrong series) and sample loss for other interleaved series. Replace buildDistribution with buildHistograms to consume the entire contiguous block of samples for a histogram metric name. Cache series metadata (hash, proto, lset) on distribution entries and emit all completed distributions in deterministic order when the block ends. Incomplete or skipped histogram distributions are cleanly discarded without being emitted under incorrect series label sets. TAG=agy CONV=ae038258-882b-4fbd-8b3f-15976bc590f2
Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request adds a new benchmark file transform_bench_test.go to measure the performance of SampleBuilder when processing grouped and ungrouped histograms. The feedback recommends instantiating and closing the sampleBuilder outside of the benchmark loop to avoid unnecessary allocation and CPU overhead, which will yield more accurate benchmark results.
| b.ResetTimer() | ||
| b.ReportAllocs() | ||
|
|
||
| for i := 0; i < b.N; i++ { | ||
| sb := newSampleBuilder(cache) | ||
| batch := batches[1] | ||
| for len(batch) > 0 { | ||
| _, tail, err := sb.next(metadata, externalLabels, batch, nil) | ||
| if err != nil { | ||
| b.Fatal(err) | ||
| } | ||
| batch = tail | ||
| } | ||
| sb.close() | ||
| } |
There was a problem hiding this comment.
In the benchmark loop, a new sampleBuilder is instantiated and closed on every iteration. Since newSampleBuilder allocates a map of size 128 and a slice, doing this inside the b.N loop introduces significant allocation and CPU overhead that is unrelated to the next() method being benchmarked.
Because sampleBuilder's internal state (b.dists and b.histResultsBuf) is fully cleared or reset during each call to next() (via buildDistributions), it is completely safe to reuse a single sampleBuilder instance across all iterations of the benchmark. Moving the instantiation outside the loop and calling b.ResetTimer() after it will make the benchmark results much more accurate and focused on the actual sample processing logic.
sb := newSampleBuilder(cache)
defer sb.close()
b.ResetTimer()
b.ReportAllocs()
for i := 0; i < b.N; i++ {
batch := batches[1]
for len(batch) > 0 {
_, tail, err := sb.next(metadata, externalLabels, batch, nil)
if err != nil {
b.Fatal(err)
}
batch = tail
}
}There was a problem hiding this comment.
Do this, also use b.Loop()
bd3aa88 to
7ce3854
Compare
Efficiency Assessment: GoogleCloudPlatform/prometheus PR #323
Pull Request: GoogleCloudPlatform/prometheus#323 (
fix-histogram-theft-bug)Assessment Date: 2026-07-01
Evaluated Branch:
benchmark-histogram-export(based onorigin/fix-histogram-theft-bug)Executive Summary
PR #323 replaces
buildDistribution(singular) withbuildDistributions(plural) in transform.go. Previously, when a histogram sample was encountered, the builder consumed samples until one distribution completed and assumed it corresponded to the series of the first sample seen. On interleaved or incomplete histogram streams (such as Kong issue #14925), this caused histogram distribution theft, where buckets from one series were erroneously attached to the metadata and labels of another.With PR #323,
buildDistributionsconsumes all contiguous samples for a histogram metric family in one pass, groups them by series hash inb.dists, emits all completed distributions with their true identity, and immediately returns the cached distribution objects tosync.Pool.Key Efficiency Findings:
b.distsat the end of each metric family to immediately recycle distributions back tosync.Poolviadefer, which prevents memory retention across large scrape batches.Benchmark Results (Before vs. After)
A comprehensive benchmark suite was implemented in transform_bench_test.go testing
sampleBuilder.next()across 10, 100, and 500 histogram series per batch (each series consisting of 10 buckets +_sum+_count= 12 samples/series).Two stream topologies were tested:
_sumsamples appear first, followed by all_countsamples, followed by buckets across series.1330987e6) Timepr-323) TimeHistogramsGroupedHistogramsGroupedHistogramsGroupedHistogramsUngroupedHistogramsUngroupedHistogramsUngroupedArchitectural & Efficiency Analysis
1. Memory Efficiency & Pool Management
In the previous implementation, when
buildDistributioncompleted a series, the underlying*distributionobject remained stored inside thesampleBuilder.distsmap untilbuilder.close()was invoked at the end of the entire scrape batch. If a batch contained 1,000 unique histogram series, the old code held 1,000 pooled objects simultaneously.In PR #323,
buildDistributionsintroduces immediate cleanup viadefer:As soon as a contiguous metric family is processed, all of its
*distributionobjects are returned tosync.Poolandb.distsis cleared. When the next metric family in the scrape is evaluated, it reuses the exact same pooled objects fromsync.Pool.buildDistributionsreturns[]hashedSeries(backed by the reusedb.histResultsBufslice). Whennext()appends this slice toresult(result = append(result, histSeries...)), Go performs a single slice header/capacity adjustment worth exactly 64 bytes per batch.2. CPU Performance & Map Iteration
The minor (~3%–7%) CPU overhead in microbenchmarks is directly attributable to the two map iterations over
b.distsper metric family:b.diststo checkdist.complete()and buildmonitoring_pb.TimeSeries.b.distsin thedeferblock to callputDistribution(dist)before callingclear(b.dists).In a production scrape (e.g., 500 series taking ~3.44 ms), this delta amounts to less than 100 microseconds per scrape—a negligible trade-off for eliminating distribution theft and out-of-order errors on interleaved streams.
3. Impact of Commit
fa054e200(chore: remove order)In the initial PR commit (
ec24bbba6),buildDistributionstracked series insertion order using ab.touched []uint64slice. Commitfa054e200removedb.touchedin favor of iterating directly overb.dists(and sorting slices in test assertions instead). This optimization was effective: removingb.touchedeliminated per-sample slice growth overhead and kept the heap allocations at an absolute minimum (5 allocs/op).Running the Benchmark Suite
To execute the benchmarks locally from the repository root:
go test -run=^$ -bench=BenchmarkSampleBuilder_Histograms -benchmem ./google/exportTAG=agy
CONV=20159a34-fd32-4d67-879d-325703fa43ea