feat(property-vals): emit-once seen cache in merger #60733
Merged
Add an optional bounded "already emitted" cache to the merger so a tuple that is still resident is suppressed instead of re-produced every flush window. Re-emission of recurring tuples is the bulk of produced rows, so this drops produced volume toward the new-distinct-value arrival rate. The cache is a quick_cache LRU keyed on the full tuple, sized to the active working set rather than total cardinality, so memory is bounded by MERGER_SEEN_CACHE_CAPACITY (default 0 = disabled), not by the number of distinct values. It is lossless on values: an evicted or forgotten tuple is re-emitted and the AggregatingMergeTree absorbs the duplicate. On a produce failure the emitted set is forgotten so the retry re-emits it. Applied in the merger only; events and groups run with reduction disabled. Hits, misses, and evictions are exported per worker for sizing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
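The cache behaviour described in this commit message can be sketched as follows. This is a minimal std-only stand-in, not the PR's implementation: the real SeenCache wraps quick_cache, whereas this sketch tracks recency with a HashMap plus a BTreeMap. The Tuple alias, the field names, and the method names are assumptions made for illustration. It shows the three properties that matter for sizing: capacity 0 disables the cache, residency is bounded by capacity rather than by cardinality, and hits, misses, and evictions are counted.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical tuple shape: (team, type, key, value).
pub type Tuple = (u64, u8, String, String);

/// Std-only stand-in for a bounded "already emitted" cache.
/// Capacity 0 means disabled: nothing is stored, nothing is suppressed.
pub struct SeenCache {
    capacity: usize,
    tick: u64,
    last_used: HashMap<Tuple, u64>, // tuple -> tick of its most recent use
    by_age: BTreeMap<u64, Tuple>,   // tick -> tuple, oldest first
    pub hits: u64,
    pub misses: u64,
    pub evictions: u64,
}

impl SeenCache {
    pub fn new(capacity: usize) -> Self {
        SeenCache {
            capacity,
            tick: 0,
            last_used: HashMap::new(),
            by_age: BTreeMap::new(),
            hits: 0,
            misses: 0,
            evictions: 0,
        }
    }

    pub fn len(&self) -> usize {
        self.last_used.len()
    }

    /// Returns true if the tuple is resident, refreshing its recency.
    pub fn contains(&mut self, t: &Tuple) -> bool {
        if self.capacity == 0 {
            return false; // disabled: every tuple is treated as new
        }
        match self.last_used.get(t).copied() {
            Some(old) => {
                self.hits += 1;
                self.touch(t.clone(), Some(old));
                true
            }
            None => {
                self.misses += 1;
                false
            }
        }
    }

    /// Records a tuple as emitted, evicting the least recently used
    /// entries until the cache is back within capacity.
    pub fn insert(&mut self, t: Tuple) {
        if self.capacity == 0 {
            return;
        }
        let old = self.last_used.get(&t).copied();
        self.touch(t, old);
        while self.last_used.len() > self.capacity {
            match self.by_age.pop_first() {
                Some((_, victim)) => {
                    self.last_used.remove(&victim);
                    self.evictions += 1;
                }
                None => break,
            }
        }
    }

    /// Moves a tuple to the "most recent" end of the age index.
    fn touch(&mut self, t: Tuple, old: Option<u64>) {
        if let Some(o) = old {
            self.by_age.remove(&o);
        }
        self.tick += 1;
        self.by_age.insert(self.tick, t.clone());
        self.last_used.insert(t, self.tick);
    }
}
```

An evicted tuple simply reports as not resident on its next flush, so it is emitted again; that is the lossless failure mode the message refers to.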
…cceeds Replace the optimistic insert + forget-on-failure with inserting into the seen cache only after the produce succeeds. The cache now only ever holds tuples that were actually emitted, so a produce failure just restores the batch to the aggregator with nothing to undo, and a tuple can never be suppressed without having been produced first. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
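The insert-after-success ordering can be sketched as below. This is an illustration under assumptions, not the merged code: the Tuple alias, the Flush enum, the flush function, and the produce closure are hypothetical names, and a plain HashSet stands in for the bounded cache, so eviction is not modelled here.

```rust
use std::collections::HashSet;

/// Hypothetical tuple shape: (team, type, key, value).
type Tuple = (u64, u8, String, String);

/// Hypothetical outcome of one flush.
#[derive(Debug, PartialEq)]
enum Flush {
    /// Number of tuples actually produced.
    Produced(usize),
    /// Produce failed; the whole batch goes back to the aggregator.
    Restored(Vec<Tuple>),
}

/// Flushes one window. `seen` stands in for the bounded cache and
/// `produce` is a hypothetical sink that may fail.
fn flush<F>(batch: Vec<Tuple>, seen: &mut HashSet<Tuple>, produce: F) -> Flush
where
    F: FnOnce(&[Tuple]) -> Result<(), String>,
{
    // Suppress tuples that were already emitted and are still resident.
    let to_emit: Vec<Tuple> = batch
        .iter()
        .filter(|t| !seen.contains(*t))
        .cloned()
        .collect();

    match produce(&to_emit) {
        Ok(()) => {
            // Only now do the tuples enter the cache, so it never holds
            // a tuple that was not actually produced.
            let n = to_emit.len();
            seen.extend(to_emit);
            Flush::Produced(n)
        }
        // Nothing was cached, so there is nothing to undo.
        Err(_) => Flush::Restored(batch),
    }
}
```

Because the cache is untouched on the error path, a retry of the restored batch sees the same tuples as new and emits them.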
aspicer
approved these changes
Jun 1, 2026
Problem
property-vals-rs re-produces a tuple once per flush window for as long as that tuple keeps appearing. This re-emission of recurring tuples across flushes increases write volume to ClickHouse.
Changes
Adds a bounded "already emitted" cache to the merger. At flush, a tuple still resident in the cache is suppressed; a new one is emitted and then inserted into the cache only after the produce succeeds.
- SeenCache wraps a quick_cache LRU keyed on the full (team, type, key, value) tuple, with a custom eviction lifecycle. This is the same structure property-defs-rs uses for its dedup cache.
- MERGER_SEEN_CACHE_CAPACITY (default 0 = disabled). Capacity bounds memory directly; it is independent of total cardinality.
Trade-offs:
- property_count degrades to a presence-ish undercount, because a suppressed occurrence does not increment the count. I will likely remove this column as a follow-up.
How did you test this code?
I'm an agent. Added and ran automated unit tests for the cache path:
🤖 Agent context
Authored with Claude Code (Opus 4.8). The cache lives in the merger because the intermediate topic is hashed by the full tuple, so each tuple has a single owning merger pod and the per-pod caches do not overlap, which makes the dedup correct without coordination. Chose an exact LRU (quick_cache) over a bloom filter on purpose: the bloom is cheaper per element but its false positives drop genuinely-new values, whereas the LRU's only failure mode is re-emitting an evicted tuple, which is lossless. Sizing is against the active working set, which was measured to be small per pod, rather than the all-time cardinality. Ships disabled (capacity 0); enabling and capacity tuning are operational, guided by the hit/miss/eviction metrics.
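The single-owner argument rests on the partition being a pure function of the full tuple. A minimal sketch of that property follows, assuming a hypothetical owning_partition helper and Rust's DefaultHasher; the actual topic partitioner used by the pipeline is not shown in this PR and will use a different hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical tuple shape: (team, type, key, value).
type Tuple = (u64, u8, String, String);

/// Hypothetical partitioner: hash the whole tuple, then reduce it
/// modulo the partition count. Every occurrence of the same tuple
/// lands on the same partition, and so on the same merger pod.
fn owning_partition(t: &Tuple, partitions: u64) -> u64 {
    assert!(partitions > 0, "partition count must be positive");
    let mut hasher = DefaultHasher::new();
    t.hash(&mut hasher);
    hasher.finish() % partitions
}
```

Since no two pods ever receive the same tuple under this scheme, a per-pod cache can suppress duplicates without consulting any other pod.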