feat(experiments): report precomputation failures to error tracking by andehen · Pull Request #60512 · PostHog/posthog

andehen · 2026-05-28T18:46:57Z

Problem

When experiment exposure or metric-events precomputation fails, the query runner catches the exception, logs it, and falls back to a direct scan — so the query still returns correct results. That fallback is good for resilience, but it also masks the failure: a completely broken precomputation path looks like normal background retry noise while results keep flowing via the direct scan.

This is exactly how the toUInt8 funnel-preaggregation bug (#60380) went unnoticed in production despite failing every matching INSERT. This is Phase 1 of the follow-up plan in #60511.

Part of #60511

Changes

At the two except blocks in ExperimentQueryRunner._get_experiment_query() (exposure and metric-events precomputation), report the caught exception to error tracking via capture_exception, alongside the existing logger.exception. Each capture carries precomputation_path (exposure / metric_events), experiment_id, and metric_type for triage.

The fallback behaviour and query results are unchanged — this only makes a failing precomputation path observable.

A follow-up (Phase 2/3 in #60511) will add an experiment-specific Prometheus counter and alerting on the fallback rate.

How did you test this code?

I'm an agent (Claude Code). Automated test only — no manual testing:

Extended test_falls_back_to_events_scan_on_lazy_computation_failure to assert that when precomputation raises, the query still falls back successfully and capture_exception is invoked with precomputation_path="exposure".
Ran it locally against the dev Postgres/ClickHouse stack: 1 passed.
ruff lint/format and ty typecheck pass (via lint-staged on commit).

🤖 Agent context

Authored by Claude Code (Opus 4.8) at the request of @andehen.

This is the first of five phases scoped in #60511 (the issue itself was also drafted from a PR review comment). The plan deliberately splits "propagate" (this PR) from "add metric + alerting" (later phases) so each is independently reviewable. The fallback was kept intact by design — the goal is observability, not changing failure behaviour. capture_exception is the same helper already used by experiment_error_handler, so grouping/fingerprinting is consistent with other experiment error reporting.

When experiment exposure or metric-events precomputation fails, the query runner falls back to a direct scan and still returns results. That fallback also masked the failure: it was only logged, so a completely broken precomputation path looked like background retry noise. Capture these exceptions to error tracking (alongside the existing log) so a broken precomputation path surfaces even though the fallback keeps queries working. The query result is unchanged.

greptile-apps · 2026-05-28T18:49:42Z

Comments Outside Diff (1)

posthog/hogql_queries/experiments/test/test_experiment_exposure_preaggregation.py, line 740-741 (link)

The new capture_exception assertion only covers the exposure precomputation path; the parallel metric_events path has no corresponding test. Given the project preference for parameterised tests, both paths would naturally be covered by parameterising test_falls_back_to_events_scan_on_lazy_computation_failure over ("_ensure_exposures_precomputed", "exposure") and ("_ensure_metric_events_precomputed", "metric_events").

Prompt To Fix With AI

This is a comment left during a code review.
Path: posthog/hogql_queries/experiments/test/test_experiment_exposure_preaggregation.py
Line: 740-741

Comment:
The new `capture_exception` assertion only covers the `exposure` precomputation path; the parallel `metric_events` path has no corresponding test. Given the project preference for parameterised tests, both paths would naturally be covered by parameterising `test_falls_back_to_events_scan_on_lazy_computation_failure` over `("_ensure_exposures_precomputed", "exposure")` and `("_ensure_metric_events_precomputed", "metric_events")`.

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
posthog/hogql_queries/experiments/experiment_query_runner.py:344-357
Double log entry per failure in production: `logger.exception("exposure_lazy_computation_failed", ...)` fires first, then `capture_exception` internally calls `logger.exception(error, event_id=uuid)` when `api_key` is set. This yields two structlog entries for the same event — one with the named key and `experiment_id`, one with the raw exception and `event_id`. The same pattern repeats at the `metric_events` site. If the intent is to keep both, the existing call's context (`experiment_id`) could be moved into `additional_properties` and the outer `logger.exception` call removed, leaving all logging to `capture_exception`.

### Issue 2 of 3
posthog/hogql_queries/experiments/test/test_experiment_exposure_preaggregation.py:740-741
The new `capture_exception` assertion only covers the `exposure` precomputation path; the parallel `metric_events` path has no corresponding test. Given the project preference for parameterised tests, both paths would naturally be covered by parameterising `test_falls_back_to_events_scan_on_lazy_computation_failure` over `("_ensure_exposures_precomputed", "exposure")` and `("_ensure_metric_events_precomputed", "metric_events")`.

### Issue 3 of 3
posthog/hogql_queries/experiments/experiment_query_runner.py:353-356
The two inline comments (`# The query still succeeds via the direct-scan fallback…` and `# See note above: the direct-scan fallback masks this failure…`) explain a behavioral change in code, which runs counter to the project's rule against this pattern. The same rationale is already documented thoroughly in the PR description.

_{Reviews (1): Last reviewed commit: "feat(experiments): report precomputation..." | Re-trigger Greptile}

- Drop the redundant logger.exception; capture_exception already logs once, and the named event is preserved via a "tag" property - Parameterize the fallback test to cover both the exposure and metric_events precomputation paths - Trim inline comments to a single concise rationale

rodrigoi

github-actions · 2026-05-28T19:12:40Z

🎭 Playwright report · View test results →

⚠️ 2 flaky tests:

Redirect to appropriate place after login (chromium)
Split a person with multiple distinct IDs (chromium)

These issues are not necessarily caused by your changes.
Annoyed by this comment? Help fix flakies and failures and it'll disappear!

deployment-status-posthog · 2026-05-29T07:22:05Z

Deploy status

Environment	Status	Deployed At	Workflow
dev	✅ Deployed	2026-05-29 07:22 UTC	Run
prod-us	✅ Deployed	2026-05-29 07:35 UTC	Run
prod-eu	✅ Deployed	2026-05-29 07:40 UTC	Run

…60512)

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread posthog/hogql_queries/experiments/experiment_query_runner.py

Comment thread posthog/hogql_queries/experiments/experiment_query_runner.py

andehen requested a review from a team May 28, 2026 18:50

rodrigoi approved these changes May 28, 2026

View reviewed changes

andehen enabled auto-merge (squash) May 28, 2026 19:05

lricoy approved these changes May 28, 2026

View reviewed changes

Merge branch 'master' into experiments/propagate-precomputation-failures

4e01618

andehen merged commit 4539f65 into master May 29, 2026
225 of 228 checks passed

andehen deleted the experiments/propagate-precomputation-failures branch May 29, 2026 07:00

rorylshanks pushed a commit that referenced this pull request May 29, 2026

feat(experiments): report precomputation failures to error tracking (#…

1c93b6b

…60512)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(experiments): report precomputation failures to error tracking#60512

feat(experiments): report precomputation failures to error tracking#60512
andehen merged 3 commits into
masterfrom
experiments/propagate-precomputation-failures

andehen commented May 28, 2026

Uh oh!

greptile-apps Bot commented May 28, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

Uh oh!

Uh oh!

rodrigoi left a comment

Uh oh!

github-actions Bot commented May 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

deployment-status-posthog Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

andehen commented May 28, 2026

Problem

Changes

How did you test this code?

🤖 Agent context

Uh oh!

greptile-apps Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comments Outside Diff (1)

Uh oh!

Uh oh!

Uh oh!

rodrigoi left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

deployment-status-posthog Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploy status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented May 28, 2026 •

edited

Loading

github-actions Bot commented May 28, 2026 •

edited

Loading

deployment-status-posthog Bot commented May 29, 2026 •

edited

Loading