feat(property-vals): add EXCLUDED_PROPERTY_KEYS env var to skip keys at fan_out#60160
Conversation
|
| #[test] | ||
| fn excluded_event_property_keys_are_skipped() { | ||
| let blob = r#"{"$insert_id":"abc-123","$browser":"Chrome","distinct_id":"u1","$session_id":"s1"}"#; | ||
| let exclusions = excluded(&["$insert_id", "distinct_id", "$session_id"]); | ||
| let tuples = fan_out(&event(blob), &exclusions); | ||
| assert_eq!(tuples.len(), 1, "only $browser should survive exclusion"); | ||
| assert_eq!(tuples[0].property_key, "$browser"); | ||
| assert_eq!(tuples[0].property_value, "Chrome"); | ||
| } | ||
|
|
||
| #[test] | ||
| fn excluded_person_property_keys_are_skipped() { | ||
| let ev = Event { | ||
| team_id: 2, | ||
| properties: None, | ||
| person_properties: Some( | ||
| r#"{"email":"a@b.com","$session_id":"s1","plan":"enterprise"}"#.to_string(), | ||
| ), | ||
| }; | ||
| let tuples = fan_out(&ev, &excluded(&["$session_id"])); | ||
| assert_eq!(tuples.len(), 2); | ||
| let keys: Vec<&str> = tuples.iter().map(|t| t.property_key.as_str()).collect(); | ||
| assert!(keys.contains(&"email")); | ||
| assert!(keys.contains(&"plan")); | ||
| assert!(!keys.contains(&"$session_id")); | ||
| } | ||
|
|
||
| #[test] | ||
| fn excluded_group_property_keys_are_skipped() { | ||
| let g = group_identify(0, r#"{"plan":"enterprise","internal_id":"g-42"}"#); | ||
| let tuples = fan_out_group(&g, &excluded(&["internal_id"])); | ||
| assert_eq!(tuples.len(), 1); | ||
| assert_eq!(tuples[0].property_key, "plan"); | ||
| } |
There was a problem hiding this comment.
Repeated exclusion test structure; prefer parameterised tests
excluded_event_property_keys_are_skipped, excluded_person_property_keys_are_skipped, and excluded_group_property_keys_are_skipped all exercise the same logical invariant — "keys in the exclusion list are not emitted" — across the three property types. Per the team's preference for parameterised tests, a single macro or a helper that iterates over (input_fn, exclusion_key, expected_surviving_keys) tuples would express this OnceAndOnlyOnce and make it easier to add new property types in the future.
Prompt To Fix With AI
This is a comment left during a code review.
Path: rust/property-vals-rs/src/fan_out.rs
Line: 300-333
Comment:
**Repeated exclusion test structure; prefer parameterised tests**
`excluded_event_property_keys_are_skipped`, `excluded_person_property_keys_are_skipped`, and `excluded_group_property_keys_are_skipped` all exercise the same logical invariant — "keys in the exclusion list are not emitted" — across the three property types. Per the team's preference for parameterised tests, a single macro or a helper that iterates over `(input_fn, exclusion_key, expected_surviving_keys)` tuples would express this OnceAndOnlyOnce and make it easier to add new property types in the future.
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
…at fan_out Property keys to filter out before they hit the aggregator are now config-driven via a comma-separated env var. Useful for identifier-shaped keys ($insert_id, $session_id, $window_id, distinct_id, etc.) that nobody type-aheads in autocomplete but dominate property_values row count. Data point from a recent property_values dump on team 2: the top 5 keys by row count ($insert_id, $sent_at, $feature_flag_request_id, $time, $feature_flag_evaluated_at) accounted for 64.6% of all rows but 0.56% of total appearances. Pure write amp, no read benefit. The new PropertyKeyList type parses the env var into an Arc<HashSet<String>>; fan_out and fan_out_group take it as a parameter and skip emission for keys in the set. Empty list is a no-op (current production behavior preserved). The set can be tuned without redeploying the binary — change the chart values, Argo syncs, pods pick it up on restart. Tests: - excluded_event_property_keys_are_skipped: drops $insert_id / distinct_id / $session_id, only $browser survives - excluded_person_property_keys_are_skipped: filters $session_id from person properties, leaves email and plan - excluded_group_property_keys_are_skipped: same for group properties - empty_exclusion_list_is_a_noop: PropertyKeyList::default() and "".parse() both produce empty sets and don't affect output Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reads more honestly — the type only ever expresses an exclusion. Unlike TeamList (reused for both allowed_teams and blocked_teams), this type has one job and the name should say so. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
abb9a8d to
b5994e5
Compare
There was a problem hiding this comment.
Clean, backward-compatible feature addition in a Rust service. The Arc<HashSet<String>> approach is safe for sharing across threads, the default (empty string → empty set) preserves existing behavior, and all call sites are updated consistently. The bot's comment about parameterized tests is a style preference, not a substantive concern.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New commits pushed (delta classified non_trivial_delta) — stamphog approval dismissed; re-review running automatically.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New commits pushed (delta classified non_trivial_delta) — stamphog approval dismissed; re-review running automatically.
There was a problem hiding this comment.
Clean, backward-compatible Rust feature addition. The Arc<HashSet<String>> approach is correct for shared-read concurrency, the empty-string default preserves existing behavior, and all call sites are consistently updated. The only inline comment (parameterized tests) is a style preference already flagged as non-substantive by the automated review.
…at fan_out (#60160) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Problem
Identifier-shaped properties (
$insert_id,$session_id,$window_id,distinct_id, etc.) amplify write throughput resulting in consumer lag because the table stores each row as a unique(team_id, property_type, property_key, property_value)tuple — so a property whose value is essentially unique per event (UUIDs, timestamps, request IDs) generates a new row for every event it appears on, with no aggregation collapse to absorb the volume.Confirmed empirically by joining
property_valuesrow counts against autocomplete search frequency fromsystem.query_logfor team 2 over 30 days:$insert_id$sent_at$feature_flag_request_id$time$feature_flag_evaluated_at24 of the top 25 keys (by row count) have zero searches in 30 days. The top 5 alone account for ~64% of all stored rows yet receive no autocomplete traffic — they're pure write amplification.
The exceptions to leave alone:
$session_id(258 searches, real autocomplete usage despite high cardinality),$current_url(high cardinality but also high real search demand),$set_once(modest searches but worth preserving).Avg value countisavg(property_count)per row — close to 1 means each value appears ~once (every event is a new row), which is the identifier signature. Compare to$current_url=50.4or$set_once=65.7where each unique value gets emitted dozens of times before a new one appears (real aggregation happening).Changes
config.rs: addEXCLUDED_PROPERTY_KEYSenv var, parsed via newExcludedPropertyKeystype into anArc<HashSet<String>>. Empty by default.fan_out.rs:fan_outandfan_out_grouptake the exclusion set as a parameter;emit_from_blobskips emission for keys in the set.main.rs: wrapsfan_outandfan_out_groupin closures that capture the config's exclusion set, passes them toworker_loop.Tunable without code deploys — change the chart values, Argo syncs, pods pick it up on restart.
Companion charts PR adds the env var to
apps/property-vals-rs/values.yamlso per-environment overrides can set the list.How did you test this code?
I am an agent. Local automated checks only:
cargo test -p property-vals-rs --lib— all 38 tests pass (34 existing + 4 new)excluded_event_property_keys_are_skipped— drops$insert_id/distinct_id/$session_id, only$browsersurvivesexcluded_person_property_keys_are_skipped— filters$session_idfrom person properties, leavesemailandplanexcluded_group_property_keys_are_skipped— same for group propertiesempty_exclusion_list_is_a_noop— verifiesExcludedPropertyKeys::default()and"".parse()are both no-ops (production behavior preserved when env var unset)Publish to changelog?
no