feat(experiments): add pre-UDF dedup and bitfield step checks to funnel query#54402
feat(experiments): add pre-UDF dedup and bitfield step checks to funnel query#54402
Conversation
The experiment funnel query previously scanned the events table twice (once for exposures, once for metric events) and performed an intermediate CTE-to-CTE JOIN. This collapses the 3-CTE pipeline into a 2-CTE single-scan pattern, eliminating the duplicate scan and JOIN. The precomputed exposure path is left unchanged since it already reads from a cheap preaggregated table.
…el query Two optimizations from the product-insights funnel query applied to the experiment funnel query: 1. Pre-UDF event array deduplication: before passing the per-user events array to aggregate_funnel_array, remove events "sandwiched" between neighbors with identical step sets and breakdown properties. This is especially effective for high-frequency $feature_flag_called events that fire on every page load, significantly reducing UDF input size and computation cost. 2. Bitfield step completion checks: use bitAnd on the UDF's steps_bitfield output (result.5) instead of integer comparison on step_reached for total_sum and step_counts. This is the idiomatic way to consume the UDF output and scales better for multi-step funnels. The dedup filter requires the sorted events array to be referenced three times (filter target + two arrayRotate calls). To avoid tripling the groupArray cost, the entity_metrics CTE is split into two: - entity_events: materializes the sorted array and other aggregates - entity_metrics: applies dedup + funnel UDF on the pre-computed array
Prompt To Fix All With AIThis is a comment left during a code review.
Path: posthog/hogql_queries/experiments/experiment_query_builder.py
Line: 1752-1847
Comment:
**Duplicate methods after refactor**
`_build_funnel_aggregation_expr` and `_build_funnel_aggregation_expr_optimized` are now byte-for-byte identical — the `events_alias` difference that previously distinguished them is gone because both paths now use `sorted_events_ref="entity_events.sorted_events"`. One of them can be removed, and all callers can share a single method.
```suggestion
def _build_funnel_aggregation_expr(self) -> ast.Expr:
"""
Returns the funnel evaluation expression using aggregate_funnel_array.
References entity_events.sorted_events (pre-computed in the entity_events CTE)
and applies dedup before the UDF. Used by both legacy and optimized paths.
"""
assert isinstance(self.metric, ExperimentFunnelMetric)
return funnel_evaluation_expr(
self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"
)
```
Then replace the call at line 621 with `self._build_funnel_aggregation_expr()` and delete `_build_funnel_aggregation_expr_optimized`.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/base_query_utils.py
Line: 413-441
Comment:
**Misleading docstring on breakdown condition**
The docstring says "Its breakdown props are identical to both neighbors" but `x.3` is the UDF-internal groupBy slot, always set to `array('')` in the experiment context (see `_build_sorted_events_sql`). The `x.3` equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared.
Consider updating the comment to clarify that `x.3` is the UDF groupBy payload (always `array('')` here), so the effective removal criterion is only step-set identity and timestamp bounds.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat(experiments): add pre-UDF dedup and..." | Re-trigger Greptile |
| @@ -1771,12 +1828,23 @@ def _build_variant_expr_for_funnel_optimized(self) -> ast.Expr: | |||
| }, | |||
| ) | |||
|
|
|||
| def _build_sorted_events_expr_optimized(self) -> ast.Expr: | |||
| """ | |||
| Sorted events array for the optimized path. Materialized in entity_events CTE | |||
| so it can be referenced three times in the dedup filter without tripling groupArray cost. | |||
| """ | |||
| assert isinstance(self.metric, ExperimentFunnelMetric) | |||
| return funnel_sorted_events_expr(self.metric, events_alias="base_events", include_exposure=True) | |||
|
|
|||
| def _build_funnel_aggregation_expr_optimized(self) -> ast.Expr: | |||
| """ | |||
| Funnel aggregation for the optimized path. References base_events instead of metric_events. | |||
| Funnel aggregation for the optimized path. References entity_events.sorted_events | |||
| (pre-computed in the entity_events CTE) and applies dedup before the UDF. | |||
| """ | |||
| assert isinstance(self.metric, ExperimentFunnelMetric) | |||
| return funnel_evaluation_expr(self.team, self.metric, events_alias="base_events", include_exposure=True) | |||
| return funnel_evaluation_expr( | |||
| self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events" | |||
| ) | |||
There was a problem hiding this comment.
Duplicate methods after refactor
_build_funnel_aggregation_expr and _build_funnel_aggregation_expr_optimized are now byte-for-byte identical — the events_alias difference that previously distinguished them is gone because both paths now use sorted_events_ref="entity_events.sorted_events". One of them can be removed, and all callers can share a single method.
| def _build_funnel_aggregation_expr(self) -> ast.Expr: | |
| """ | |
| Returns the funnel evaluation expression using aggregate_funnel_array. | |
| References entity_events.sorted_events (pre-computed in the entity_events CTE) | |
| and applies dedup before the UDF. Used by both legacy and optimized paths. | |
| """ | |
| assert isinstance(self.metric, ExperimentFunnelMetric) | |
| return funnel_evaluation_expr( | |
| self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events" | |
| ) |
Then replace the call at line 621 with self._build_funnel_aggregation_expr() and delete _build_funnel_aggregation_expr_optimized.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/experiment_query_builder.py
Line: 1752-1847
Comment:
**Duplicate methods after refactor**
`_build_funnel_aggregation_expr` and `_build_funnel_aggregation_expr_optimized` are now byte-for-byte identical — the `events_alias` difference that previously distinguished them is gone because both paths now use `sorted_events_ref="entity_events.sorted_events"`. One of them can be removed, and all callers can share a single method.
```suggestion
def _build_funnel_aggregation_expr(self) -> ast.Expr:
"""
Returns the funnel evaluation expression using aggregate_funnel_array.
References entity_events.sorted_events (pre-computed in the entity_events CTE)
and applies dedup before the UDF. Used by both legacy and optimized paths.
"""
assert isinstance(self.metric, ExperimentFunnelMetric)
return funnel_evaluation_expr(
self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"
)
```
Then replace the call at line 621 with `self._build_funnel_aggregation_expr()` and delete `_build_funnel_aggregation_expr_optimized`.
How can I resolve this? If you propose a fix, please make it concise.| def _build_dedup_filter_sql(sorted_events_sql: str) -> str: | ||
| """ | ||
| Wraps a sorted events array expression with the deduplication filter. | ||
|
|
||
| Removes events that are "sandwiched" between neighbors with identical step sets | ||
| and breakdown properties. This shrinks the array the UDF has to process, | ||
| which is especially effective for high-frequency exposure events like | ||
| $feature_flag_called that fire on every page load. | ||
|
|
||
| An event is removed only if ALL of the following are true: | ||
| - It matches only a single step (length(x.4) <= 1) | ||
| - Its step set is identical to both its left and right neighbors | ||
| - Its breakdown props are identical to both neighbors | ||
| - It is not the first or last element (timestamp bounds check) | ||
| """ | ||
| return f"""arrayFilter( | ||
| (x, x_before, x_after) -> not(and( | ||
| ifNull(lessOrEquals(length(x.4), 1), 0), | ||
| ifNull(equals(x.4, x_before.4), isNull(x.4) and isNull(x_before.4)), | ||
| ifNull(equals(x.4, x_after.4), isNull(x.4) and isNull(x_after.4)), | ||
| ifNull(equals(x.3, x_before.3), isNull(x.3) and isNull(x_before.3)), | ||
| ifNull(equals(x.3, x_after.3), isNull(x.3) and isNull(x_after.3)), | ||
| ifNull(greater(x.1, x_before.1), 0), | ||
| ifNull(less(x.1, x_after.1), 0) | ||
| )), | ||
| {sorted_events_sql}, | ||
| arrayRotateRight({sorted_events_sql}, 1), | ||
| arrayRotateLeft({sorted_events_sql}, 1) | ||
| )""" |
There was a problem hiding this comment.
Misleading docstring on breakdown condition
The docstring says "Its breakdown props are identical to both neighbors" but x.3 is the UDF-internal groupBy slot, always set to array('') in the experiment context (see _build_sorted_events_sql). The x.3 equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared.
Consider updating the comment to clarify that x.3 is the UDF groupBy payload (always array('') here), so the effective removal criterion is only step-set identity and timestamp bounds.
Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/base_query_utils.py
Line: 413-441
Comment:
**Misleading docstring on breakdown condition**
The docstring says "Its breakdown props are identical to both neighbors" but `x.3` is the UDF-internal groupBy slot, always set to `array('')` in the experiment context (see `_build_sorted_events_sql`). The `x.3` equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared.
Consider updating the comment to clarify that `x.3` is the UDF groupBy payload (always `array('')` here), so the effective removal criterion is only step-set identity and timestamp bounds.
How can I resolve this? If you propose a fix, please make it concise.|
This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, please remove the |
Problem
The experiment funnel query passes the raw sorted events array directly to the
aggregate_funnel_arrayUDF without deduplication. For experiments with high-frequency exposure events (e.g.,$feature_flag_calledfiring on every page load), a user with 200 flag evaluations and 1 conversion produces a 201-element array for the UDF. Most of those 200 flag events are redundant — they all map tostep_0with the same properties.Additionally, step completion checks use integer comparison on
step_reachedrather than the UDF's nativesteps_bitfieldoutput.Changes
1. Pre-UDF event array deduplication
Before passing the per-user events array to
aggregate_funnel_array, removes events "sandwiched" between neighbors with identical step sets and breakdown properties. After deduplication, consecutive duplicates collapse and the array may shrink to just a few elements.The dedup filter requires the sorted events array to be referenced three times (filter target + two
arrayRotatecalls). To avoid tripling thegroupArraycost, theentity_metricsCTE is split into:entity_events: materializes the sorted array and other aggregates viaGROUP BYentity_metrics: applies dedup + funnel UDF on the pre-computed array (noGROUP BY)Applied to both the optimized (single-scan) and legacy (precomputed exposures) query paths.
2. Bitfield step completion checks
Uses
bitAndon the UDF'ssteps_bitfieldoutput (result.5) instead of integer comparison onstep_reachedfortotal_sumandstep_counts. This is the idiomatic way to consume the UDF output and scales better for multi-step funnels.steps_event_datastill usesvalue.1(step_reached) for exact equality checks since it needs to identify the user's highest reached step, not just whether a step was completed.How did you test this code?
All 446 experiment tests pass (4 skipped). Snapshot updates reflect the new query structure.
I'm an agent and have not tested this manually — only through the existing unit/integration test suite.
🤖 LLM context
Stacked on top of #54348 (single-scan optimization). These optimizations were derived from patterns observed in the product-insights funnel query.