fix(export): Fix histogram theft bug on interleaved and incomplete histograms #323

Open
bwplotka wants to merge 4 commits into release-2.53.5-gmp from fix-histogram-theft-bug

bwplotka commented Jul 1, 2026

Collaborator

Summary

When histogram samples in a scrape batch are ungrouped or interleaved across different series label sets for the same metric family (such as from non-compliant sources like Kong/kong#14925), the existing buildDistribution logic in sampleBuilder returned prematurely as soon as any single histogram in the cache completed (dist.complete()).

This premature return caused two critical bugs:

  1. Histogram Distribution Theft: Because the completed distribution was returned immediately without strictly binding emission to each specific series' metadata, completed distributions could be attached to the wrong series label sets.
  2. Sample Loss & State Confusion: Returning early left the remaining interleaved samples in the contiguous block unprocessed or lost. Furthermore, incomplete or skipped histogram series (e.g., when new bucket bounds are introduced across scrapes) could disrupt valid distributions.
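
The failure mode described above can be reproduced with a deliberately simplified model. The sketch below is not the code from google/export/transform.go; the `sample`, `dist`, and `emitted` types and the `naiveFirstComplete` function are hypothetical stand-ins. It assumes a histogram counts as complete once a +Inf bucket, a sum, and a count have been seen, and that the result is labeled with the series of the first sample in the block.

```go
package main

import "fmt"

// sample is a hypothetical, simplified histogram sample (not the real type).
type sample struct {
	series string // stands in for the series label set
	kind   string // "bucket", "sum", or "count"
	le     string // bucket upper bound; only set for kind "bucket"
	value  float64
}

// dist tracks which parts of one histogram have been seen so far.
type dist struct {
	hasInf, hasSum, hasCount bool
	count                    float64
}

func (d *dist) complete() bool { return d.hasInf && d.hasSum && d.hasCount }

// emitted pairs a label set with the distribution data attached to it.
type emitted struct {
	series string
	count  float64
}

// naiveFirstComplete models the early return: it stops at the first
// completed histogram and labels it with the first sample's series.
func naiveFirstComplete(block []sample) (emitted, bool) {
	if len(block) == 0 {
		return emitted{}, false
	}
	owner := block[0].series
	cache := map[string]*dist{}
	for _, s := range block {
		d, ok := cache[s.series]
		if !ok {
			d = &dist{}
			cache[s.series] = d
		}
		switch s.kind {
		case "bucket":
			if s.le == "+Inf" {
				d.hasInf = true
			}
		case "sum":
			d.hasSum = true
		case "count":
			d.hasCount = true
			d.count = s.value
		}
		if d.complete() {
			// Early return: the rest of the block is never consumed.
			return emitted{series: owner, count: d.count}, true
		}
	}
	return emitted{}, false
}

// interleavedBlock mixes samples of series "a" and "b"; "b" completes first.
func interleavedBlock() []sample {
	return []sample{
		{series: "a", kind: "bucket", le: "+Inf", value: 3},
		{series: "b", kind: "bucket", le: "+Inf", value: 7},
		{series: "b", kind: "sum", value: 70},
		{series: "b", kind: "count", value: 7},
		{series: "a", kind: "sum", value: 30},
		{series: "a", kind: "count", value: 3},
	}
}

func main() {
	got, ok := naiveFirstComplete(interleavedBlock())
	fmt.Println(got.series, got.count, ok) // prints "a 7 true"
}
```

In this model, series "a" is emitted with the count that belongs to "b" (the theft), and the trailing sum and count samples of "a" are never processed (the sample loss).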

Key Changes

1. Contiguous Block Processing (google/export/transform.go)

  • Replaced Single-Distribution Builder: Replaced buildDistribution with buildDistributions, called from next. Instead of aborting on the first completed distribution, buildDistributions now consumes the entire contiguous block of samples for a histogram metric family name before emitting results.
  • Per-Series Metadata Caching: Extended the cached distribution struct to store series metadata (hash, proto, and lset). When the block finishes, all completed distributions are built and emitted using their own cached metadata, guaranteeing deterministic ordering and accurate label attribution.
  • Incomplete & Skipped Histogram Handling: Incomplete histograms (such as those missing a +Inf bucket) or skipped series are cleanly discarded at the end of the block without being emitted under incorrect label sets or causing sample loss for other series.
  • Memory & Allocation Optimizations: Added results []hashedSeries to sampleBuilder to pool and reuse result slices across iterations without extra heap allocations. Added a defer cleanup block in buildDistributions to return cached distribution objects to the pool (putDistribution) and clear the map after processing each metric family.
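
The block-processing approach listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: `builder`, `buildBlock`, `distribution`, `result`, and `distPool` are hypothetical names, a single `series` string stands in for the cached hash, proto, and lset, and first-seen order is used as one possible deterministic emission order.

```go
package main

import (
	"fmt"
	"sync"
)

// sample is a hypothetical, simplified histogram sample.
type sample struct {
	series string // stands in for the series label set
	kind   string // "bucket", "sum", or "count"
	le     string // bucket upper bound; only set for kind "bucket"
	value  float64
}

// distribution caches per-series metadata next to the accumulated state.
type distribution struct {
	series                   string
	hasInf, hasSum, hasCount bool
	sum, count               float64
}

func (d *distribution) complete() bool { return d.hasInf && d.hasSum && d.hasCount }

type result struct {
	series     string
	sum, count float64
}

// distPool recycles distribution objects between metric families.
var distPool = sync.Pool{New: func() interface{} { return new(distribution) }}

type builder struct {
	dists   map[string]*distribution
	order   []string // series keys in first-seen order
	results []result // reused across calls to avoid reallocating
}

func newBuilder() *builder { return &builder{dists: map[string]*distribution{}} }

// buildBlock consumes the whole contiguous block before emitting anything.
// The returned slice is reused by the next call.
func (b *builder) buildBlock(block []sample) []result {
	defer func() {
		// Return cached objects to the pool and clear per-family state.
		for key, d := range b.dists {
			*d = distribution{}
			distPool.Put(d)
			delete(b.dists, key)
		}
		b.order = b.order[:0]
	}()
	for _, s := range block {
		d, ok := b.dists[s.series]
		if !ok {
			d = distPool.Get().(*distribution)
			d.series = s.series
			b.dists[s.series] = d
			b.order = append(b.order, s.series)
		}
		switch s.kind {
		case "bucket":
			if s.le == "+Inf" {
				d.hasInf = true
			}
		case "sum":
			d.hasSum, d.sum = true, s.value
		case "count":
			d.hasCount, d.count = true, s.value
		}
	}
	b.results = b.results[:0]
	for _, key := range b.order {
		// Incomplete histograms are dropped instead of being emitted.
		if d := b.dists[key]; d.complete() {
			b.results = append(b.results, result{series: d.series, sum: d.sum, count: d.count})
		}
	}
	return b.results
}

// exampleBlock interleaves three series; "a" never receives a +Inf bucket.
func exampleBlock() []sample {
	return []sample{
		{series: "a", kind: "bucket", le: "1", value: 1},
		{series: "b", kind: "bucket", le: "+Inf", value: 4},
		{series: "a", kind: "sum", value: 0.5},
		{series: "c", kind: "bucket", le: "+Inf", value: 2},
		{series: "b", kind: "sum", value: 10},
		{series: "b", kind: "count", value: 4},
		{series: "c", kind: "sum", value: 5},
		{series: "c", kind: "count", value: 2},
		{series: "a", kind: "count", value: 1},
	}
}

func main() {
	for _, r := range newBuilder().buildBlock(exampleBlock()) {
		fmt.Printf("series=%s sum=%v count=%v\n", r.series, r.sum, r.count)
	}
}
```

Because each result is built from the metadata cached on its own distribution entry, the incomplete series "a" is discarded without affecting "b" or "c". The result values are copied out before the deferred cleanup resets the pooled objects.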

2. Regression Testing (google/export/transform_test.go)

  • Added comprehensive regression test cases in TestSampleBuilder (referencing internal regression tracking b/516519320):
    • Ungrouped (interleaved) histogram samples across multiple series.
    • Ungrouped (interleaved) histogram samples where the first group is incomplete.
    • Ungrouped (interleaved) histogram samples where the first group is skipped due to schema/bucket changes across scrapes (e.g., adding a new bucket in a subsequent scrape).
  • Added deterministic slice sorting (cmpopts.SortSlices) to test assertions when comparing multiple emitted histogram series.
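
The idea behind order-insensitive assertions can be shown with the standard library alone. The sketch below is a stdlib-only analogue of what cmpopts.SortSlices does for go-cmp comparisons, not the test code from this PR; the `series` type and `equalIgnoringOrder` helper are hypothetical, and sorting by a `labels` string is an assumed key.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// series is a hypothetical emitted histogram series.
type series struct {
	labels string
	count  float64
}

// sortedCopy returns a copy of in sorted by labels, leaving in untouched.
func sortedCopy(in []series) []series {
	out := append([]series(nil), in...)
	sort.Slice(out, func(i, j int) bool { return out[i].labels < out[j].labels })
	return out
}

// equalIgnoringOrder compares two slices after sorting copies of both,
// so the assertion does not depend on emission order.
func equalIgnoringOrder(got, want []series) bool {
	return reflect.DeepEqual(sortedCopy(got), sortedCopy(want))
}

func main() {
	got := []series{{labels: "b", count: 2}, {labels: "a", count: 1}}
	want := []series{{labels: "a", count: 1}, {labels: "b", count: 2}}
	fmt.Println(equalIgnoringOrder(got, want)) // prints "true"
}
```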

Related Issues

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request refactors the histogram processing logic in google/export/transform.go by replacing buildDistribution with buildDistributions. This change allows the exporter to handle interleaved (ungrouped) histogram samples from sources like Kong by caching and returning multiple completed distributions in a single batch. Comprehensive unit tests have also been added to cover these scenarios. The review feedback suggests clarifying the necessity of the touched slice for deterministic test ordering, pre-allocating the result slice to avoid unnecessary allocations, and cleaning up obsolete TODO comments in the new test cases.

Comment thread google/export/transform.go Outdated
Comment thread google/export/transform.go Outdated
Comment thread google/export/transform_test.go Outdated
Comment thread google/export/transform_test.go Outdated
bwplotka force-pushed the fix-histogram-theft-bug branch 3 times, most recently from 40f5d56 to bd3aa88 Compare July 1, 2026 16:43
bwplotka marked this pull request as ready for review July 1, 2026 16:43
// Whether to not emit a sample.
skip bool

hash uint64

Collaborator Author

This allocates bigger structs, but if we trust the AI benchmark, it does not make a big difference.

Assessment: #324

bwplotka requested review from bernot-dev and dashpole July 1, 2026 17:24

bwplotka commented Jul 1, 2026

Collaborator Author

Given some risks, I wonder if we shouldn't release an unofficial image first for cx to try.

bwplotka requested a review from pintohutch July 1, 2026 17:31

bwplotka commented Jul 3, 2026

Collaborator Author

bwplotka added 4 commits July 3, 2026 14:06
…theft

Add cases to TestSampleBuilder for:
- ungrouped (interleaved) histogram samples
- ungrouped (interleaved) histogram samples with first group incomplete
- ungrouped (interleaved) histogram samples with first group skipped due to new bucket

Enforce strict _bucket, _count, _sum ordering for test samples.

TAG=agy
CONV=8f508481-de1c-4e6b-ad3a-718d089a2fbe

When histogram samples in a scrape batch are ungrouped (interleaved across
different series label sets for the same metric name), existing
buildDistribution would return as soon as any histogram in the cache
completed. This caused histogram distribution theft (attaching completed
distributions to the wrong series) and sample loss for other interleaved series.

Replace buildDistribution with buildHistograms to consume the entire
contiguous block of samples for a histogram metric name. Cache series
metadata (hash, proto, lset) on distribution entries and emit all completed
distributions in deterministic order when the block ends. Incomplete or
skipped histogram distributions are cleanly discarded without being emitted
under incorrect series label sets.

TAG=agy
CONV=ae038258-882b-4fbd-8b3f-15976bc590f2
Signed-off-by: bwplotka <bwplotka@gmail.com>
Signed-off-by: bwplotka <bwplotka@gmail.com>
bwplotka force-pushed the fix-histogram-theft-bug branch from bd3aa88 to 7ce3854 Compare July 3, 2026 13:06

bwplotka commented Jul 3, 2026

Collaborator Author

/prombench
