Skip to content

feat(experiments): report precomputation failures to error tracking#60512

Merged
andehen merged 3 commits into
masterfrom
experiments/propagate-precomputation-failures
May 29, 2026
Merged

feat(experiments): report precomputation failures to error tracking#60512
andehen merged 3 commits into
masterfrom
experiments/propagate-precomputation-failures

Conversation

@andehen
Copy link
Copy Markdown
Contributor

@andehen andehen commented May 28, 2026

Problem

When experiment exposure or metric-events precomputation fails, the query runner catches the exception, logs it, and falls back to a direct scan — so the query still returns correct results. That fallback is good for resilience, but it also masks the failure: a completely broken precomputation path looks like normal background retry noise while results keep flowing via the direct scan.

This is exactly how the toUInt8 funnel-preaggregation bug (#60380) went unnoticed in production despite failing every matching INSERT. This is Phase 1 of the follow-up plan in #60511.

Part of #60511

Changes

At the two except blocks in ExperimentQueryRunner._get_experiment_query() (exposure and metric-events precomputation), report the caught exception to error tracking via capture_exception, alongside the existing logger.exception. Each capture carries precomputation_path (exposure / metric_events), experiment_id, and metric_type for triage.

The fallback behaviour and query results are unchanged — this only makes a failing precomputation path observable.

A follow-up (Phase 2/3 in #60511) will add an experiment-specific Prometheus counter and alerting on the fallback rate.

How did you test this code?

I'm an agent (Claude Code). Automated test only — no manual testing:

  • Extended test_falls_back_to_events_scan_on_lazy_computation_failure to assert that when precomputation raises, the query still falls back successfully and capture_exception is invoked with precomputation_path="exposure".
  • Ran it locally against the dev Postgres/ClickHouse stack: 1 passed.
  • ruff lint/format and ty typecheck pass (via lint-staged on commit).

🤖 Agent context

Authored by Claude Code (Opus 4.8) at the request of @andehen.

This is the first of five phases scoped in #60511 (the issue itself was also drafted from a PR review comment). The plan deliberately splits "propagate" (this PR) from "add metric + alerting" (later phases) so each is independently reviewable. The fallback was kept intact by design — the goal is observability, not changing failure behaviour. capture_exception is the same helper already used by experiment_error_handler, so grouping/fingerprinting is consistent with other experiment error reporting.

When experiment exposure or metric-events precomputation fails, the query
runner falls back to a direct scan and still returns results. That fallback
also masked the failure: it was only logged, so a completely broken
precomputation path looked like background retry noise.

Capture these exceptions to error tracking (alongside the existing log) so a
broken precomputation path surfaces even though the fallback keeps queries
working. The query result is unchanged.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Comments Outside Diff (1)

  1. posthog/hogql_queries/experiments/test/test_experiment_exposure_preaggregation.py, line 740-741 (link)

    P2 The new capture_exception assertion only covers the exposure precomputation path; the parallel metric_events path has no corresponding test. Given the project preference for parameterised tests, both paths would naturally be covered by parameterising test_falls_back_to_events_scan_on_lazy_computation_failure over ("_ensure_exposures_precomputed", "exposure") and ("_ensure_metric_events_precomputed", "metric_events").

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: posthog/hogql_queries/experiments/test/test_experiment_exposure_preaggregation.py
    Line: 740-741
    
    Comment:
    The new `capture_exception` assertion only covers the `exposure` precomputation path; the parallel `metric_events` path has no corresponding test. Given the project preference for parameterised tests, both paths would naturally be covered by parameterising `test_falls_back_to_events_scan_on_lazy_computation_failure` over `("_ensure_exposures_precomputed", "exposure")` and `("_ensure_metric_events_precomputed", "metric_events")`.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
posthog/hogql_queries/experiments/experiment_query_runner.py:344-357
Double log entry per failure in production: `logger.exception("exposure_lazy_computation_failed", ...)` fires first, then `capture_exception` internally calls `logger.exception(error, event_id=uuid)` when `api_key` is set. This yields two structlog entries for the same event — one with the named key and `experiment_id`, one with the raw exception and `event_id`. The same pattern repeats at the `metric_events` site. If the intent is to keep both, the existing call's context (`experiment_id`) could be moved into `additional_properties` and the outer `logger.exception` call removed, leaving all logging to `capture_exception`.

### Issue 2 of 3
posthog/hogql_queries/experiments/test/test_experiment_exposure_preaggregation.py:740-741
The new `capture_exception` assertion only covers the `exposure` precomputation path; the parallel `metric_events` path has no corresponding test. Given the project preference for parameterised tests, both paths would naturally be covered by parameterising `test_falls_back_to_events_scan_on_lazy_computation_failure` over `("_ensure_exposures_precomputed", "exposure")` and `("_ensure_metric_events_precomputed", "metric_events")`.

### Issue 3 of 3
posthog/hogql_queries/experiments/experiment_query_runner.py:353-356
The two inline comments (`# The query still succeeds via the direct-scan fallback…` and `# See note above: the direct-scan fallback masks this failure…`) explain a behavioral change in code, which runs counter to the project's rule against this pattern. The same rationale is already documented thoroughly in the PR description.

Reviews (1): Last reviewed commit: "feat(experiments): report precomputation..." | Re-trigger Greptile

Comment thread posthog/hogql_queries/experiments/experiment_query_runner.py
Comment thread posthog/hogql_queries/experiments/experiment_query_runner.py
@andehen andehen requested a review from a team May 28, 2026 18:50
- Drop the redundant logger.exception; capture_exception already logs once,
  and the named event is preserved via a "tag" property
- Parameterize the fallback test to cover both the exposure and metric_events
  precomputation paths
- Trim inline comments to a single concise rationale
Copy link
Copy Markdown
Member

@rodrigoi rodrigoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Palpatine Done Well

@andehen andehen enabled auto-merge (squash) May 28, 2026 19:05
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 28, 2026

🎭 Playwright report · View test results →

⚠️ 2 flaky tests:

  • Redirect to appropriate place after login (chromium)
  • Split a person with multiple distinct IDs (chromium)

These issues are not necessarily caused by your changes.
Annoyed by this comment? Help fix flakies and failures and it'll disappear!

@andehen andehen merged commit 4539f65 into master May 29, 2026
225 of 228 checks passed
@andehen andehen deleted the experiments/propagate-precomputation-failures branch May 29, 2026 07:00
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented May 29, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-05-29 07:22 UTC Run
prod-us ✅ Deployed 2026-05-29 07:35 UTC Run
prod-eu ✅ Deployed 2026-05-29 07:40 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants