feat(experiments): add pre-UDF dedup and bitfield step checks to funnel query by andehen · Pull Request #54402 · PostHog/posthog

andehen · 2026-04-14T08:30:25Z

Problem

The experiment funnel query passes the raw sorted events array directly to the aggregate_funnel_array UDF without deduplication. For experiments with high-frequency exposure events (e.g., $feature_flag_called firing on every page load), a user with 200 flag evaluations and 1 conversion produces a 201-element array for the UDF. Most of those 200 flag events are redundant — they all map to step_0 with the same properties.

Additionally, step completion checks use integer comparison on step_reached rather than the UDF's native steps_bitfield output.

Changes

1. Pre-UDF event array deduplication
Before passing the per-user events array to aggregate_funnel_array, removes events "sandwiched" between neighbors with identical step sets and breakdown properties. After deduplication, consecutive duplicates collapse and the array may shrink to just a few elements.

The dedup filter requires the sorted events array to be referenced three times (filter target + two arrayRotate calls). To avoid tripling the groupArray cost, the entity_metrics CTE is split into:

entity_events: materializes the sorted array and other aggregates via GROUP BY
entity_metrics: applies dedup + funnel UDF on the pre-computed array (no GROUP BY)

Applied to both the optimized (single-scan) and legacy (precomputed exposures) query paths.

2. Bitfield step completion checks
Uses bitAnd on the UDF's steps_bitfield output (result.5) instead of integer comparison on step_reached for total_sum and step_counts. This is the idiomatic way to consume the UDF output and scales better for multi-step funnels.

steps_event_data still uses value.1 (step_reached) for exact equality checks since it needs to identify the user's highest reached step, not just whether a step was completed.

How did you test this code?

All 446 experiment tests pass (4 skipped). Snapshot updates reflect the new query structure.

I'm an agent and have not tested this manually — only through the existing unit/integration test suite.

🤖 LLM context

Stacked on top of #54348 (single-scan optimization). These optimizations were derived from patterns observed in the product-insights funnel query.

The experiment funnel query previously scanned the events table twice (once for exposures, once for metric events) and performed an intermediate CTE-to-CTE JOIN. This collapses the 3-CTE pipeline into a 2-CTE single-scan pattern, eliminating the duplicate scan and JOIN. The precomputed exposure path is left unchanged since it already reads from a cheap preaggregated table.

…el query Two optimizations from the product-insights funnel query applied to the experiment funnel query: 1. Pre-UDF event array deduplication: before passing the per-user events array to aggregate_funnel_array, remove events "sandwiched" between neighbors with identical step sets and breakdown properties. This is especially effective for high-frequency $feature_flag_called events that fire on every page load, significantly reducing UDF input size and computation cost. 2. Bitfield step completion checks: use bitAnd on the UDF's steps_bitfield output (result.5) instead of integer comparison on step_reached for total_sum and step_counts. This is the idiomatic way to consume the UDF output and scales better for multi-step funnels. The dedup filter requires the sorted events array to be referenced three times (filter target + two arrayRotate calls). To avoid tripling the groupArray cost, the entity_metrics CTE is split into two: - entity_events: materializes the sorted array and other aggregates - entity_metrics: applies dedup + funnel UDF on the pre-computed array

greptile-apps · 2026-04-14T08:44:11Z

Prompt To Fix All With AI

This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/experiment_query_builder.py
Line: 1752-1847

Comment:
**Duplicate methods after refactor**

`_build_funnel_aggregation_expr` and `_build_funnel_aggregation_expr_optimized` are now byte-for-byte identical — the `events_alias` difference that previously distinguished them is gone because both paths now use `sorted_events_ref="entity_events.sorted_events"`. One of them can be removed, and all callers can share a single method.

```suggestion
    def _build_funnel_aggregation_expr(self) -> ast.Expr:
        """
        Returns the funnel evaluation expression using aggregate_funnel_array.
        References entity_events.sorted_events (pre-computed in the entity_events CTE)
        and applies dedup before the UDF. Used by both legacy and optimized paths.
        """
        assert isinstance(self.metric, ExperimentFunnelMetric)
        return funnel_evaluation_expr(
            self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"
        )
```

Then replace the call at line 621 with `self._build_funnel_aggregation_expr()` and delete `_build_funnel_aggregation_expr_optimized`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/base_query_utils.py
Line: 413-441

Comment:
**Misleading docstring on breakdown condition**

The docstring says "Its breakdown props are identical to both neighbors" but `x.3` is the UDF-internal groupBy slot, always set to `array('')` in the experiment context (see `_build_sorted_events_sql`). The `x.3` equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared.

Consider updating the comment to clarify that `x.3` is the UDF groupBy payload (always `array('')` here), so the effective removal criterion is only step-set identity and timestamp bounds.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "feat(experiments): add pre-UDF dedup and..." | Re-trigger Greptile}

greptile-apps · 2026-04-14T08:44:15Z

@@ -1771,12 +1828,23 @@ def _build_variant_expr_for_funnel_optimized(self) -> ast.Expr:
                },
            )

+    def _build_sorted_events_expr_optimized(self) -> ast.Expr:
+        """
+        Sorted events array for the optimized path. Materialized in entity_events CTE
+        so it can be referenced three times in the dedup filter without tripling groupArray cost.
+        """
+        assert isinstance(self.metric, ExperimentFunnelMetric)
+        return funnel_sorted_events_expr(self.metric, events_alias="base_events", include_exposure=True)
+
    def _build_funnel_aggregation_expr_optimized(self) -> ast.Expr:
        """
-        Funnel aggregation for the optimized path. References base_events instead of metric_events.
+        Funnel aggregation for the optimized path. References entity_events.sorted_events
+        (pre-computed in the entity_events CTE) and applies dedup before the UDF.
        """
        assert isinstance(self.metric, ExperimentFunnelMetric)
-        return funnel_evaluation_expr(self.team, self.metric, events_alias="base_events", include_exposure=True)
+        return funnel_evaluation_expr(
+            self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"
+        )


Duplicate methods after refactor

_build_funnel_aggregation_expr and _build_funnel_aggregation_expr_optimized are now byte-for-byte identical — the events_alias difference that previously distinguished them is gone because both paths now use sorted_events_ref="entity_events.sorted_events". One of them can be removed, and all callers can share a single method.

Suggested change

def _build_funnel_aggregation_expr(self) -> ast.Expr:

"""

Returns the funnel evaluation expression using aggregate_funnel_array.

References entity_events.sorted_events (pre-computed in the entity_events CTE)

and applies dedup before the UDF. Used by both legacy and optimized paths.

"""

assert isinstance(self.metric, ExperimentFunnelMetric)

return funnel_evaluation_expr(

self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"

)

Then replace the call at line 621 with self._build_funnel_aggregation_expr() and delete _build_funnel_aggregation_expr_optimized.

Prompt To Fix With AI

This is a comment left during a code review. Path: posthog/hogql_queries/experiments/experiment_query_builder.py Line: 1752-1847 Comment: **Duplicate methods after refactor** `_build_funnel_aggregation_expr` and `_build_funnel_aggregation_expr_optimized` are now byte-for-byte identical — the `events_alias` difference that previously distinguished them is gone because both paths now use `sorted_events_ref="entity_events.sorted_events"`. One of them can be removed, and all callers can share a single method. ```suggestion def _build_funnel_aggregation_expr(self) -> ast.Expr: """ Returns the funnel evaluation expression using aggregate_funnel_array. References entity_events.sorted_events (pre-computed in the entity_events CTE) and applies dedup before the UDF. Used by both legacy and optimized paths. """ assert isinstance(self.metric, ExperimentFunnelMetric) return funnel_evaluation_expr( self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events" ) ``` Then replace the call at line 621 with `self._build_funnel_aggregation_expr()` and delete `_build_funnel_aggregation_expr_optimized`. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-04-14T08:44:16Z

+def _build_dedup_filter_sql(sorted_events_sql: str) -> str:
+    """
+    Wraps a sorted events array expression with the deduplication filter.
+
+    Removes events that are "sandwiched" between neighbors with identical step sets
+    and breakdown properties. This shrinks the array the UDF has to process,
+    which is especially effective for high-frequency exposure events like
+    $feature_flag_called that fire on every page load.
+
+    An event is removed only if ALL of the following are true:
+    - It matches only a single step (length(x.4) <= 1)
+    - Its step set is identical to both its left and right neighbors
+    - Its breakdown props are identical to both neighbors
+    - It is not the first or last element (timestamp bounds check)
+    """
+    return f"""arrayFilter(
+        (x, x_before, x_after) -> not(and(
+            ifNull(lessOrEquals(length(x.4), 1), 0),
+            ifNull(equals(x.4, x_before.4), isNull(x.4) and isNull(x_before.4)),
+            ifNull(equals(x.4, x_after.4), isNull(x.4) and isNull(x_after.4)),
+            ifNull(equals(x.3, x_before.3), isNull(x.3) and isNull(x_before.3)),
+            ifNull(equals(x.3, x_after.3), isNull(x.3) and isNull(x_after.3)),
+            ifNull(greater(x.1, x_before.1), 0),
+            ifNull(less(x.1, x_after.1), 0)
+        )),
+        {sorted_events_sql},
+        arrayRotateRight({sorted_events_sql}, 1),
+        arrayRotateLeft({sorted_events_sql}, 1)
+    )"""


Misleading docstring on breakdown condition

The docstring says "Its breakdown props are identical to both neighbors" but x.3 is the UDF-internal groupBy slot, always set to array('') in the experiment context (see _build_sorted_events_sql). The x.3 equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared.

Consider updating the comment to clarify that x.3 is the UDF groupBy payload (always array('') here), so the effective removal criterion is only step-set identity and timestamp bounds.

Prompt To Fix With AI

This is a comment left during a code review. Path: posthog/hogql_queries/experiments/base_query_utils.py Line: 413-441 Comment: **Misleading docstring on breakdown condition** The docstring says "Its breakdown props are identical to both neighbors" but `x.3` is the UDF-internal groupBy slot, always set to `array('')` in the experiment context (see `_build_sorted_events_sql`). The `x.3` equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared. Consider updating the comment to clarify that `x.3` is the UDF groupBy payload (always `array('')` here), so the effective removal criterion is only step-set identity and timestamp bounds. How can I resolve this? If you propose a fix, please make it concise.

github-actions · 2026-04-22T08:06:03Z

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, please remove the stale label – otherwise this will be closed in another week. If you want to permanently keep it open, use the waiting label.

andehen added 2 commits April 14, 2026 09:09

greptile-apps Bot reviewed Apr 14, 2026

View reviewed changes

Base automatically changed from experiments/optimize-funnel-query to master April 14, 2026 12:50

github-actions Bot added the stale label Apr 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(experiments): add pre-UDF dedup and bitfield step checks to funnel query#54402

feat(experiments): add pre-UDF dedup and bitfield step checks to funnel query#54402
andehen wants to merge 2 commits intomasterfrom
experiments/optimize-funnel-query-dedup-bitfield

andehen commented Apr 14, 2026

Uh oh!

greptile-apps Bot commented Apr 14, 2026

Uh oh!

greptile-apps Bot Apr 14, 2026

Uh oh!

greptile-apps Bot Apr 14, 2026

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

andehen commented Apr 14, 2026

Problem

Changes

How did you test this code?

🤖 LLM context

Uh oh!

greptile-apps Bot commented Apr 14, 2026

Uh oh!

greptile-apps Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant