feat(experiments): report precomputation failures to error tracking #60512
When experiment exposure or metric-events precomputation fails, the query runner falls back to a direct scan and still returns results. That fallback also masked the failure: it was only logged, so a completely broken precomputation path looked like background retry noise. Capture these exceptions to error tracking (alongside the existing log) so a broken precomputation path surfaces even though the fallback keeps queries working. The query result is unchanged.
- Drop the redundant logger.exception; capture_exception already logs once, and the named event is preserved via a "tag" property
- Parameterize the fallback test to cover both the exposure and metric_events precomputation paths
- Trim inline comments to a single concise rationale

Problem
When experiment exposure or metric-events precomputation fails, the query runner catches the exception, logs it, and falls back to a direct scan — so the query still returns correct results. That fallback is good for resilience, but it also masks the failure: a completely broken precomputation path looks like normal background retry noise while results keep flowing via the direct scan.
This is exactly how the toUInt8 funnel-preaggregation bug (#60380) went unnoticed in production despite failing every matching INSERT. This is Phase 1 of the follow-up plan in #60511.
Part of #60511
Changes
At the two except blocks in ExperimentQueryRunner._get_experiment_query() (exposure and metric-events precomputation), report the caught exception to error tracking via capture_exception, alongside the existing logger.exception. Each capture carries precomputation_path (exposure/metric_events), experiment_id, and metric_type for triage.
The fallback behaviour and query results are unchanged: this only makes a failing precomputation path observable.
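For reviewers who want the shape of the change without opening the diff, here is a minimal sketch of the capture-then-fall-back pattern. It is not the actual runner code: query_with_fallback, the two callables, and the in-memory capture_exception stub are hypothetical stand-ins so the snippet runs on its own. Only the three property names come from this PR.

```python
from __future__ import annotations

from typing import Any, Callable

# Hypothetical stand-in for the error-tracking helper: it only records calls
# in memory so this sketch is runnable without any external service.
captured: list[dict[str, Any]] = []


def capture_exception(error: Exception, properties: dict[str, Any] | None = None) -> None:
    captured.append({"error": error, "properties": properties or {}})


def query_with_fallback(
    precomputed: Callable[[], str],
    direct_scan: Callable[[], str],
    *,
    precomputation_path: str,
    experiment_id: int,
    metric_type: str,
) -> str:
    """Try the precomputed path; on failure, report it and fall back."""
    try:
        return precomputed()
    except Exception as error:
        # Report the failure with triage properties, then keep serving results.
        capture_exception(
            error,
            {
                "precomputation_path": precomputation_path,
                "experiment_id": experiment_id,
                "metric_type": metric_type,
            },
        )
        return direct_scan()


def broken_precomputation() -> str:
    # Simulates a precomputation path that fails on every call.
    raise RuntimeError("precomputation INSERT failed")


result = query_with_fallback(
    broken_precomputation,
    lambda: "direct scan result",
    precomputation_path="exposure",
    experiment_id=1,
    metric_type="funnel",
)
```

The caller still gets the direct-scan result, and the failure is recorded once with enough context to tell the exposure path apart from the metric-events path.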
A follow-up (Phase 2/3 in #60511) will add an experiment-specific Prometheus counter and alerting on the fallback rate.
How did you test this code?
I'm an agent (Claude Code). Automated test only — no manual testing:
- test_falls_back_to_events_scan_on_lazy_computation_failure asserts that when precomputation raises, the query still falls back successfully and capture_exception is invoked with precomputation_path="exposure". 1 passed.
- ruff lint/format and ty typecheck pass (via lint-staged on commit).
🤖 Agent context
Authored by Claude Code (Opus 4.8) at the request of @andehen.
This is the first of five phases scoped in #60511 (the issue itself was also drafted from a PR review comment). The plan deliberately splits "propagate" (this PR) from "add metric + alerting" (later phases) so each is independently reviewable. The fallback was kept intact by design — the goal is observability, not changing failure behaviour.
capture_exception is the same helper already used by experiment_error_handler, so grouping/fingerprinting is consistent with other experiment error reporting.
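For context on the fallback test described under "How did you test this code?", a rough sketch of the assertion pattern, covering both precomputation paths, might look like the following. MiniRunner and its methods are hypothetical and much smaller than the real query runner. unittest's subTest stands in for whatever parameterization the real test suite uses.

```python
import unittest
from unittest import mock


class MiniRunner:
    """Hypothetical miniature of a runner with a precomputed path and a fallback."""

    def __init__(self, report):
        self.report = report  # callable playing the role of the error-tracking helper

    def precompute(self, path: str) -> str:
        return f"precomputed {path}"

    def run(self, path: str) -> str:
        try:
            return self.precompute(path)
        except Exception as error:
            # Report the failure, then fall back to the direct scan.
            self.report(error, {"precomputation_path": path})
            return "direct scan"


class FallbackReportingTest(unittest.TestCase):
    def test_falls_back_and_reports(self):
        for path in ("exposure", "metric_events"):
            with self.subTest(path=path):
                report = mock.Mock()
                runner = MiniRunner(report)
                # Force the precomputation step to fail for this path.
                with mock.patch.object(runner, "precompute", side_effect=RuntimeError("boom")):
                    result = runner.run(path)
                # The query still succeeds via the fallback...
                self.assertEqual(result, "direct scan")
                # ...and the failure is reported exactly once with the path name.
                report.assert_called_once()
                self.assertEqual(report.call_args.args[1]["precomputation_path"], path)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(FallbackReportingTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Asserting on both the returned result and the reporting call keeps the test honest about the PR's two guarantees: results are unchanged, and the failure is no longer silent.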