Skip to content

feat(property-vals): emit-once seen cache in merger#60733

Merged
andyzzhao merged 3 commits into
masterfrom
andy/property-vals-merger-seen-cache
Jun 1, 2026
Merged

feat(property-vals): emit-once seen cache in merger#60733
andyzzhao merged 3 commits into
masterfrom
andy/property-vals-merger-seen-cache

Conversation

@andyzzhao
Copy link
Copy Markdown
Contributor

@andyzzhao andyzzhao commented May 29, 2026

Problem

property-vals-rs re-produces a tuple once per flush window for as long as that tuple keeps appearing. Re-emission of recurring tuples across flushes contributes to increase write volume to clickhouse.

Changes

Adds a bounded "already emitted" cache to the merger. At flush, a tuple still resident in the cache is suppressed; a new one is emitted and then inserted into the cache only after the produce succeeds.

  • New SeenCache wraps a quick_cache LRU keyed on the full (team, type, key, value) tuple, with a custom eviction lifecycle. Same structure property-defs-rs uses for its dedup cache.
  • Controlled by MERGER_SEEN_CACHE_CAPACITY (default 0 = disabled). Capacity bounds memory directly; it is independent of total cardinality.

Trade-offs:

  • This degrades property_count to a presence-ish undercount, because a suppressed occurrence does not increment the count. I will likely remove this col as a follow up.
  • On a merger redeploy the cache starts cold and produced volume returns to the full rate until it re-warms.

How did you test this code?

I'm an agent. Added and ran automated unit tests for the cache path:

  • an already-cached tuple is suppressed (no second produce, aggregator still drained)
  • a tuple forgotten on produce failure is re-emitted on retry, not lost to the cache
  • distinct new values all emit while a repeat is suppressed

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

🤖 Agent context

Authored with Claude Code (Opus 4.8). The cache lives in the merger because the intermediate topic is hashed by the full tuple, so each tuple has a single owning merger pod and the per-pod caches do not overlap, which makes the dedup correct without coordination. Chose an exact LRU (quick_cache) over a bloom filter on purpose: the bloom is cheaper per element but its false positives drop genuinely-new values, whereas the LRU's only failure mode is re-emitting an evicted tuple, which is lossless. Sizing is against the active working set, which was measured to be small per pod, rather than the all-time cardinality. Ships disabled (capacity 0); enabling and capacity tuning are operational, guided by the hit/miss/eviction metrics.

Add an optional bounded "already emitted" cache to the merger so a tuple
that is still resident is suppressed instead of re-produced every flush
window. Re-emission of recurring tuples is the bulk of produced rows, so
this drops produced volume toward the new-distinct-value arrival rate.

The cache is a quick_cache LRU keyed on the full tuple, sized to the
active working set rather than total cardinality, so memory is bounded by
MERGER_SEEN_CACHE_CAPACITY (default 0 = disabled), not by the number of
distinct values. It is lossless on values: an evicted or forgotten tuple
is re-emitted and the AggregatingMergeTree absorbs the duplicate. On a
produce failure the emitted set is forgotten so the retry re-emits it.

Applied in the merger only; events and groups run with reduction disabled.
Hits, misses, and evictions are exported per worker for sizing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 29, 2026

Reviews (1): Last reviewed commit: "feat(property-vals): emit-once seen cach..." | Re-trigger Greptile

@andyzzhao andyzzhao added the stamphog Request AI review from stamphog label Jun 1, 2026
Copy link
Copy Markdown

@stamphog stamphog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gates denied this PR because it adds a new Rust dependency (quick_cache) to Cargo.toml and Cargo.lock, which triggers the dependency/toolchain deny-list rule. This requires human review before auto-approval.

@stamphog stamphog Bot removed the stamphog Request AI review from stamphog label Jun 1, 2026
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@andyzzhao andyzzhao requested a review from a team June 1, 2026 17:06
…cceeds

Replace the optimistic insert + forget-on-failure with inserting into the
seen cache only after the produce succeeds. The cache now only ever holds
tuples that were actually emitted, so a produce failure just restores the
batch to the aggregator with nothing to undo, and a tuple can never be
suppressed without having been produced first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@andyzzhao andyzzhao merged commit 10b3138 into master Jun 1, 2026
189 checks passed
@andyzzhao andyzzhao deleted the andy/property-vals-merger-seen-cache branch June 1, 2026 20:46
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented Jun 1, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-06-01 21:28 UTC Run
prod-us ✅ Deployed 2026-06-02 10:18 UTC Run
prod-eu ✅ Deployed 2026-06-01 21:53 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants