Skip to content

feat(property-vals): aggregate property values per event behind flags#61322

Merged
andyzzhao merged 1 commit into
andy/property-vals-value-len-configfrom
andy/property-vals-event-name-svc
Jun 3, 2026
Merged

feat(property-vals): aggregate property values per event behind flags#61322
andyzzhao merged 1 commit into
andy/property-vals-value-len-configfrom
andy/property-vals-event-name-svc

Conversation

@andyzzhao
Copy link
Copy Markdown
Contributor

@andyzzhao andyzzhao commented Jun 3, 2026

Problem

We want property-value autocomplete to be scopable to a specific event (filtering "events where event = X and property = Y"). For that, the value catalog needs to know which event each value appeared on. This is the service half of that change: it threads the source event name through the aggregation pipeline so a later ClickHouse column can support per-event value lookups.

It is gated and ships dark, so it is safe to land and deploy ahead of the ClickHouse side.

Changes

  • The events worker stamps the source event name onto event-type values. Person and group values are never event-scoped (group properties are not even on the event stream), so they stay empty.
  • event_name becomes part of the in-memory aggregation key, the merger's emit-once seen cache, and the top-K cap, so event values aggregate and cap per (event, key) rather than per key.
  • Two independent flags, both default off:
    • AGGREGATE_BY_EVENT_NAME controls whether the event name enters the aggregation key. On, event values fragment per event (this is what drives the write amplification). Off, event values carry an empty name like person/group.
    • EMIT_EVENT_NAME controls whether the merger writes the event_name field on the output topic. The field is omitted when empty (serde(skip_serializing_if)), so with this off the messages to clickhouse_property_values are byte-identical to the pre-event_name format.

Splitting the two flags lets us exercise the write-amplification behavior before the ClickHouse column exists. Turn AGGREGATE_BY_EVENT_NAME on with EMIT_EVENT_NAME off to produce the same fragmented record count (and the same write throughput) the finished feature will, while the wire stays in the old format so the current ClickHouse kafka table keeps consuming it unchanged. Once the ClickHouse column is live, turn EMIT_EVENT_NAME on too.

Rollout order: deploy the ClickHouse PR, deploy this service with both flags off, optionally flip AGGREGATE_BY_EVENT_NAME alone to load-test throughput, then flip EMIT_EVENT_NAME on.

One thing to weigh: aggregating per event lowers the collapse ratio, so write throughput to ClickHouse rises (roughly 2-6x at the flush cadence depending on a team's event diversity). Lengthening FLUSH_INTERVAL_SECS is the lever if that is too much.

How did you test this code?

I am an agent (Claude Code). No manual testing. On the property-vals-rs crate: cargo fmt, cargo clippy, and cargo test (54 passed). New tests cover:

  • event values get the source event name when aggregation is on; person values stay empty in the same event
  • all values are empty when aggregation is off
  • oversized event names (over 200 chars) are not stamped, so the values still land unscoped
  • the wire field is omitted when empty (so the message matches the pre-event_name format) and round-trips back to empty
  • the merger carries event_name through from the intermediate message

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

Docs update

No docs change.

🤖 Agent context

Authored with Claude Code (Read/Edit/Bash) while working through property-value search performance.

Decisions: the two flags (AGGREGATE_BY_EVENT_NAME, EMIT_EVENT_NAME) are split so the per-event aggregation grain, and its write amplification, can be exercised on production before the consuming ClickHouse column exists, which decouples the two deploys. The output field is omitted when empty so flag-off messages stay byte-identical to the old format. Person and group are intentionally left unscoped because their values are not event-specific (groups are not on the event stream at all). The aggregation key, merger seen cache, and top-K cap all key on event_name so the per-event grain is consistent end to end. The ~2-6x write amplification was measured and is called out as a rollout consideration, not a blocker, since it only bites once aggregation is on.

Stacked on the value-length base PR: #61325

@andyzzhao andyzzhao added the stamphog Request AI review from stamphog label Jun 3, 2026
Copy link
Copy Markdown

@stamphog stamphog Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is clean, flag-gated, and wire-backwards-compatible, but this is a medium-complexity change to a production data pipeline service with zero reviews — needs at least one human sign-off before auto-approval.

@stamphog stamphog Bot removed the stamphog Request AI review from stamphog label Jun 3, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Comments Outside Diff (1)

  1. rust/property-vals-rs/src/producer.rs, line 97-100 (link)

    P2 The partition key comment enumerates the tuple fields as (team, type, key, value), but event_name is now also part of TupleKey and drives distinct aggregation buckets at stage 2. The comment reads as if the key is still the full tuple, which could mislead a future reader into thinking a per-event_name deduplication guarantee exists at the partition level (it doesn't — same-(team,type,key,value) but different event_name messages still land on the same partition, which is correct and intentional, but the comment implies otherwise).

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: rust/property-vals-rs/src/producer.rs
    Line: 97-100
    
    Comment:
    The partition key comment enumerates the tuple fields as `(team, type, key, value)`, but `event_name` is now also part of `TupleKey` and drives distinct aggregation buckets at stage 2. The comment reads as if the key is still the full tuple, which could mislead a future reader into thinking a per-`event_name` deduplication guarantee exists at the partition level (it doesn't — same-`(team,type,key,value)` but different `event_name` messages still land on the same partition, which is correct and intentional, but the comment implies otherwise).
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
rust/property-vals-rs/src/producer.rs:97-100
The partition key comment enumerates the tuple fields as `(team, type, key, value)`, but `event_name` is now also part of `TupleKey` and drives distinct aggregation buckets at stage 2. The comment reads as if the key is still the full tuple, which could mislead a future reader into thinking a per-`event_name` deduplication guarantee exists at the partition level (it doesn't — same-`(team,type,key,value)` but different `event_name` messages still land on the same partition, which is correct and intentional, but the comment implies otherwise).

```suggestion
        // Partition key on (team, type, key, value). The same tuple from any
        // pod always lands on the same partition, so the merger can collapse
        // per-pod duplicates. event_name is intentionally excluded: two tuples
        // that differ only in event_name still share a partition, which is
        // correct — the merger keys on the full TupleKey (including event_name)
        // and aggregates them independently.
```

### Issue 2 of 2
rust/property-vals-rs/src/fan_out.rs:264-275
**Missing coverage for `stamp_event_name=true` + `event.event=None`**

`stamped_event_tuples_carry_source_event_name` always constructs events with `event: Some(name)`. The silent fall-through when `event.event` is `None` (`.unwrap_or("")`) is correct, but no proptest covers `fan_out(&e, &none(), true)` with an `arb_event()` that can produce `event: None`, which would leave `event_name=""` in the key even with the flag on. Adding a parameterised case that uses `arb_event()` with `stamp_event_name=true` and asserts that event-type tuples carry `event_name` matching `event.event.as_deref().unwrap_or("")` would make this contract explicit.

Reviews (1): Last reviewed commit: "feat(property-vals): stamp source event ..." | Re-trigger Greptile

Comment on lines +264 to +275
#[test]
fn stamped_event_tuples_carry_source_event_name(name in "[a-z]{1,20}", blob in arb_blob()) {
let e = Event {
team_id: 2,
event: Some(name.clone()),
properties: Some(blob),
person_properties: None,
};
for (t, _) in fan_out(&e, &none(), true) {
prop_assert_eq!(&t.event_name, &name);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing coverage for stamp_event_name=true + event.event=None

stamped_event_tuples_carry_source_event_name always constructs events with event: Some(name). The silent fall-through when event.event is None (.unwrap_or("")) is correct, but no proptest covers fan_out(&e, &none(), true) with an arb_event() that can produce event: None, which would leave event_name="" in the key even with the flag on. Adding a parameterised case that uses arb_event() with stamp_event_name=true and asserts that event-type tuples carry event_name matching event.event.as_deref().unwrap_or("") would make this contract explicit.

Prompt To Fix With AI
This is a comment left during a code review.
Path: rust/property-vals-rs/src/fan_out.rs
Line: 264-275

Comment:
**Missing coverage for `stamp_event_name=true` + `event.event=None`**

`stamped_event_tuples_carry_source_event_name` always constructs events with `event: Some(name)`. The silent fall-through when `event.event` is `None` (`.unwrap_or("")`) is correct, but no proptest covers `fan_out(&e, &none(), true)` with an `arb_event()` that can produce `event: None`, which would leave `event_name=""` in the key even with the flag on. Adding a parameterised case that uses `arb_event()` with `stamp_event_name=true` and asserts that event-type tuples carry `event_name` matching `event.event.as_deref().unwrap_or("")` would make this contract explicit.

How can I resolve this? If you propose a fix, please make it concise.

Comment thread rust/property-vals-rs/src/fan_out.rs Outdated
@veria-ai
Copy link
Copy Markdown

veria-ai Bot commented Jun 3, 2026

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 1 · PR risk: 0/10

Adds an event_name to each aggregated tuple so event property values can be
looked up scoped to the event they appear on. The events worker keys event-
type values on the source event name; person and group values are never
event-scoped and stay empty. Oversized event names are dropped, not stamped
across many rows.

Two flags decouple this from the ClickHouse schema, both default off:
- AGGREGATE_BY_EVENT_NAME keys values on the event, producing the full
  amplified record count (each value splits into one row per event).
- EMIT_EVENT_NAME controls whether the merger writes event_name to the output
  ClickHouse reads. With AGGREGATE on and EMIT off, the pipeline produces the
  amplified count in the old wire format, so the existing kafka table ingests
  it and ClickHouse collapses the duplicates, letting the real write
  amplification be measured on prod before the ClickHouse column exists.

The aggregation, merger seen-cache, and top-K cap all key on event_name.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@andyzzhao andyzzhao force-pushed the andy/property-vals-event-name-svc branch from 81334a8 to 115b38d Compare June 3, 2026 05:31
@andyzzhao andyzzhao changed the title feat(property-vals): stamp source event name behind STAMP_EVENT_NAME feat(property-vals): aggregate property values per event behind flags Jun 3, 2026
@andyzzhao andyzzhao changed the base branch from master to andy/property-vals-value-len-config June 3, 2026 05:32
@andyzzhao andyzzhao merged commit 8f9f980 into andy/property-vals-value-len-config Jun 3, 2026
182 checks passed
@andyzzhao andyzzhao deleted the andy/property-vals-event-name-svc branch June 3, 2026 05:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant