Skip to content

feat(experiments): add pre-UDF dedup and bitfield step checks to funnel query#54402

Open
andehen wants to merge 2 commits intomasterfrom
experiments/optimize-funnel-query-dedup-bitfield
Open

feat(experiments): add pre-UDF dedup and bitfield step checks to funnel query#54402
andehen wants to merge 2 commits intomasterfrom
experiments/optimize-funnel-query-dedup-bitfield

Conversation

@andehen
Copy link
Copy Markdown
Contributor

@andehen andehen commented Apr 14, 2026

Problem

The experiment funnel query passes the raw sorted events array directly to the aggregate_funnel_array UDF without deduplication. For experiments with high-frequency exposure events (e.g., $feature_flag_called firing on every page load), a user with 200 flag evaluations and 1 conversion produces a 201-element array for the UDF. Most of those 200 flag events are redundant — they all map to step_0 with the same properties.

Additionally, step completion checks use integer comparison on step_reached rather than the UDF's native steps_bitfield output.

Changes

1. Pre-UDF event array deduplication
Before passing the per-user events array to aggregate_funnel_array, removes events "sandwiched" between neighbors with identical step sets and breakdown properties. After deduplication, consecutive duplicates collapse and the array may shrink to just a few elements.

The dedup filter requires the sorted events array to be referenced three times (filter target + two arrayRotate calls). To avoid tripling the groupArray cost, the entity_metrics CTE is split into:

  • entity_events: materializes the sorted array and other aggregates via GROUP BY
  • entity_metrics: applies dedup + funnel UDF on the pre-computed array (no GROUP BY)

Applied to both the optimized (single-scan) and legacy (precomputed exposures) query paths.

2. Bitfield step completion checks
Uses bitAnd on the UDF's steps_bitfield output (result.5) instead of integer comparison on step_reached for total_sum and step_counts. This is the idiomatic way to consume the UDF output and scales better for multi-step funnels.

steps_event_data still uses value.1 (step_reached) for exact equality checks since it needs to identify the user's highest reached step, not just whether a step was completed.

How did you test this code?

All 446 experiment tests pass (4 skipped). Snapshot updates reflect the new query structure.

I'm an agent and have not tested this manually — only through the existing unit/integration test suite.

🤖 LLM context

Stacked on top of #54348 (single-scan optimization). These optimizations were derived from patterns observed in the product-insights funnel query.

andehen added 2 commits April 14, 2026 09:09
The experiment funnel query previously scanned the events table twice
(once for exposures, once for metric events) and performed an
intermediate CTE-to-CTE JOIN. This collapses the 3-CTE pipeline into
a 2-CTE single-scan pattern, eliminating the duplicate scan and JOIN.

The precomputed exposure path is left unchanged since it already reads
from a cheap preaggregated table.
…el query

Two optimizations from the product-insights funnel query applied to
the experiment funnel query:

1. Pre-UDF event array deduplication: before passing the per-user
   events array to aggregate_funnel_array, remove events "sandwiched"
   between neighbors with identical step sets and breakdown properties.
   This is especially effective for high-frequency $feature_flag_called
   events that fire on every page load, significantly reducing UDF
   input size and computation cost.

2. Bitfield step completion checks: use bitAnd on the UDF's
   steps_bitfield output (result.5) instead of integer comparison on
   step_reached for total_sum and step_counts. This is the idiomatic
   way to consume the UDF output and scales better for multi-step
   funnels.

The dedup filter requires the sorted events array to be referenced
three times (filter target + two arrayRotate calls). To avoid tripling
the groupArray cost, the entity_metrics CTE is split into two:
- entity_events: materializes the sorted array and other aggregates
- entity_metrics: applies dedup + funnel UDF on the pre-computed array
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 14, 2026

Prompt To Fix All With AI
This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/experiment_query_builder.py
Line: 1752-1847

Comment:
**Duplicate methods after refactor**

`_build_funnel_aggregation_expr` and `_build_funnel_aggregation_expr_optimized` are now byte-for-byte identical — the `events_alias` difference that previously distinguished them is gone because both paths now use `sorted_events_ref="entity_events.sorted_events"`. One of them can be removed, and all callers can share a single method.

```suggestion
    def _build_funnel_aggregation_expr(self) -> ast.Expr:
        """
        Returns the funnel evaluation expression using aggregate_funnel_array.
        References entity_events.sorted_events (pre-computed in the entity_events CTE)
        and applies dedup before the UDF. Used by both legacy and optimized paths.
        """
        assert isinstance(self.metric, ExperimentFunnelMetric)
        return funnel_evaluation_expr(
            self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"
        )
```

Then replace the call at line 621 with `self._build_funnel_aggregation_expr()` and delete `_build_funnel_aggregation_expr_optimized`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/base_query_utils.py
Line: 413-441

Comment:
**Misleading docstring on breakdown condition**

The docstring says "Its breakdown props are identical to both neighbors" but `x.3` is the UDF-internal groupBy slot, always set to `array('')` in the experiment context (see `_build_sorted_events_sql`). The `x.3` equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared.

Consider updating the comment to clarify that `x.3` is the UDF groupBy payload (always `array('')` here), so the effective removal criterion is only step-set identity and timestamp bounds.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(experiments): add pre-UDF dedup and..." | Re-trigger Greptile

Comment on lines 1752 to +1847
@@ -1771,12 +1828,23 @@ def _build_variant_expr_for_funnel_optimized(self) -> ast.Expr:
},
)

def _build_sorted_events_expr_optimized(self) -> ast.Expr:
"""
Sorted events array for the optimized path. Materialized in entity_events CTE
so it can be referenced three times in the dedup filter without tripling groupArray cost.
"""
assert isinstance(self.metric, ExperimentFunnelMetric)
return funnel_sorted_events_expr(self.metric, events_alias="base_events", include_exposure=True)

def _build_funnel_aggregation_expr_optimized(self) -> ast.Expr:
"""
Funnel aggregation for the optimized path. References base_events instead of metric_events.
Funnel aggregation for the optimized path. References entity_events.sorted_events
(pre-computed in the entity_events CTE) and applies dedup before the UDF.
"""
assert isinstance(self.metric, ExperimentFunnelMetric)
return funnel_evaluation_expr(self.team, self.metric, events_alias="base_events", include_exposure=True)
return funnel_evaluation_expr(
self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicate methods after refactor

_build_funnel_aggregation_expr and _build_funnel_aggregation_expr_optimized are now byte-for-byte identical — the events_alias difference that previously distinguished them is gone because both paths now use sorted_events_ref="entity_events.sorted_events". One of them can be removed, and all callers can share a single method.

Suggested change
def _build_funnel_aggregation_expr(self) -> ast.Expr:
"""
Returns the funnel evaluation expression using aggregate_funnel_array.
References entity_events.sorted_events (pre-computed in the entity_events CTE)
and applies dedup before the UDF. Used by both legacy and optimized paths.
"""
assert isinstance(self.metric, ExperimentFunnelMetric)
return funnel_evaluation_expr(
self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"
)

Then replace the call at line 621 with self._build_funnel_aggregation_expr() and delete _build_funnel_aggregation_expr_optimized.

Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/experiment_query_builder.py
Line: 1752-1847

Comment:
**Duplicate methods after refactor**

`_build_funnel_aggregation_expr` and `_build_funnel_aggregation_expr_optimized` are now byte-for-byte identical — the `events_alias` difference that previously distinguished them is gone because both paths now use `sorted_events_ref="entity_events.sorted_events"`. One of them can be removed, and all callers can share a single method.

```suggestion
    def _build_funnel_aggregation_expr(self) -> ast.Expr:
        """
        Returns the funnel evaluation expression using aggregate_funnel_array.
        References entity_events.sorted_events (pre-computed in the entity_events CTE)
        and applies dedup before the UDF. Used by both legacy and optimized paths.
        """
        assert isinstance(self.metric, ExperimentFunnelMetric)
        return funnel_evaluation_expr(
            self.team, self.metric, include_exposure=True, sorted_events_ref="entity_events.sorted_events"
        )
```

Then replace the call at line 621 with `self._build_funnel_aggregation_expr()` and delete `_build_funnel_aggregation_expr_optimized`.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +413 to +441
def _build_dedup_filter_sql(sorted_events_sql: str) -> str:
"""
Wraps a sorted events array expression with the deduplication filter.

Removes events that are "sandwiched" between neighbors with identical step sets
and breakdown properties. This shrinks the array the UDF has to process,
which is especially effective for high-frequency exposure events like
$feature_flag_called that fire on every page load.

An event is removed only if ALL of the following are true:
- It matches only a single step (length(x.4) <= 1)
- Its step set is identical to both its left and right neighbors
- Its breakdown props are identical to both neighbors
- It is not the first or last element (timestamp bounds check)
"""
return f"""arrayFilter(
(x, x_before, x_after) -> not(and(
ifNull(lessOrEquals(length(x.4), 1), 0),
ifNull(equals(x.4, x_before.4), isNull(x.4) and isNull(x_before.4)),
ifNull(equals(x.4, x_after.4), isNull(x.4) and isNull(x_after.4)),
ifNull(equals(x.3, x_before.3), isNull(x.3) and isNull(x_before.3)),
ifNull(equals(x.3, x_after.3), isNull(x.3) and isNull(x_after.3)),
ifNull(greater(x.1, x_before.1), 0),
ifNull(less(x.1, x_after.1), 0)
)),
{sorted_events_sql},
arrayRotateRight({sorted_events_sql}, 1),
arrayRotateLeft({sorted_events_sql}, 1)
)"""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Misleading docstring on breakdown condition

The docstring says "Its breakdown props are identical to both neighbors" but x.3 is the UDF-internal groupBy slot, always set to array('') in the experiment context (see _build_sorted_events_sql). The x.3 equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared.

Consider updating the comment to clarify that x.3 is the UDF groupBy payload (always array('') here), so the effective removal criterion is only step-set identity and timestamp bounds.

Prompt To Fix With AI
This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/base_query_utils.py
Line: 413-441

Comment:
**Misleading docstring on breakdown condition**

The docstring says "Its breakdown props are identical to both neighbors" but `x.3` is the UDF-internal groupBy slot, always set to `array('')` in the experiment context (see `_build_sorted_events_sql`). The `x.3` equality checks are therefore trivially true for every event — they are a no-op and the comment misleads readers into thinking actual experiment breakdown values are being compared.

Consider updating the comment to clarify that `x.3` is the UDF groupBy payload (always `array('')` here), so the effective removal criterion is only step-set identity and timestamp bounds.

How can I resolve this? If you propose a fix, please make it concise.

Base automatically changed from experiments/optimize-funnel-query to master April 14, 2026 12:50
@github-actions
Copy link
Copy Markdown
Contributor

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, please remove the stale label – otherwise this will be closed in another week. If you want to permanently keep it open, use the waiting label.

@github-actions github-actions Bot added the stale label Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant