Skip to content

feat(replay-vision): prometheus metrics + structured success logs#60196

Merged
TueHaulund merged 2 commits into
masterfrom
tue/replay-vision-metrics
May 28, 2026
Merged

feat(replay-vision): prometheus metrics + structured success logs#60196
TueHaulund merged 2 commits into
masterfrom
tue/replay-vision-metrics

Conversation

@TueHaulund
Copy link
Copy Markdown
Contributor

Problem

Phase 1 of Replay Vision is functionally complete (model + workflow + manual /observe/ triggering) but has no telemetry surface beyond logger.exception calls on the failure paths. Operators can't answer "how many observations are running per scanner type", "what's the failure rate", "what's p95 latency on the Gemini call" — none of it exists in Prometheus or as structured success logs in Loki. The master plan flags this as a Phase 1 wrap-up requirement.

Changes

New metrics module (temporal/metrics.py) — module-scope Counter/Histogram from the default registry, auto-scraped by the worker's combined metrics server. Cardinality budget worst-case ~150 series.

Metric Labels Where emitted
replay_vision_observations_total status, scanner_type terminal-state activities
replay_vision_failure_kinds_total kind, scanner_type mark_observation_failed_activity
replay_vision_ineligible_kinds_total kind mark_observation_ineligible_activity
replay_vision_activity_duration_seconds activity, status @track_activity decorator on every activity
replay_vision_provider_call_seconds provider, model, scanner_type, outcome inside _call_with_retry

Structured success logs at the same hook points so Loki can answer the same questions without Prom — replay_vision.observation.{succeeded,failed,ineligible} with observation_id, scanner_type, and kind (where applicable).

Plumbing: CreateObservationOutput now carries scanner_type so the three MarkObservation*Inputs can take it as a labeled field — terminal activities label metrics without re-querying.

How did you test this code?

Agent-authored. hogli test products/replay_vision/backend/tests/test_temporal.py — 66 passed (61 existing + 5 new). New TestObservationStateMetricsAndLogs class asserts both the Prometheus increment and the structured log payload for each terminal-state transition, plus the histogram count for @track_activity.

Publish to changelog?

no

🤖 Agent context

Tool: Claude Code. Considered emitting metrics directly from workflow code (rejected — workflows must be deterministic, Prom increments are global side effects that would double-fire on replay). Considered an activity interceptor (rejected — heavier than a decorator and harder to scope to one product). Considered activity.info().activity_type for the metric label (rejected — breaks direct test invocations outside an activity context; using fn.__name__ at decoration time gives the same value with no runtime context dependency).

@TueHaulund TueHaulund force-pushed the tue/replay-vision-metrics branch from 2c50720 to e6b36bb Compare May 27, 2026 08:25
Phase 1 wrap-up. Adds the observability surface called for in the master plan.

Metrics (module-scope, default registry, auto-scraped by the worker's combined
metrics server):

- `replay_vision_observations_total{status, scanner_type}` — terminal-state counter
- `replay_vision_failure_kinds_total{kind, scanner_type}` — failures broken down by FailureKind
- `replay_vision_ineligible_kinds_total{kind}` — ineligibles broken down by IneligibleSessionKind
- `replay_vision_activity_duration_seconds{activity, status}` — wall time per activity, via @track_activity
- `replay_vision_provider_call_seconds{provider, model, scanner_type, outcome}` — Gemini call latency

Structured success logs emitted alongside the counters:
- `replay_vision.observation.succeeded`
- `replay_vision.observation.failed` (with kind)
- `replay_vision.observation.ineligible` (with kind)

Plumbing: `CreateObservationOutput` and the three `MarkObservation*Inputs`
gain a `scanner_type` field so the terminal-state activities can label
their metrics without re-querying. The `@track_activity` decorator captures
the function name at decoration time so direct test invocations work without
an activity context.
@TueHaulund TueHaulund force-pushed the tue/replay-vision-metrics branch from e6b36bb to 4fc08bb Compare May 27, 2026 08:41
@TueHaulund TueHaulund requested review from a team, arnohillen, fasyy612, ksvat and nicowaltz and removed request for a team May 27, 2026 09:05
@TueHaulund TueHaulund marked this pull request as ready for review May 27, 2026 09:05
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 27, 2026 09:05
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 27, 2026

Comments Outside Diff (1)

  1. products/replay_vision/backend/tests/test_temporal.py, line 857-880 (link)

    P2 Idempotency only tested for mark_observation_failed_activity

    test_no_counter_or_log_when_update_affects_zero_rows is parameterised on all three terminal statuses but calls only mark_observation_failed_activity in every case. The idempotency guard (if not updated: return) in mark_observation_ineligible_activity and mark_observation_succeeded_activity is never exercised by this test. Given the stated preference for parameterised tests, extending this case to also drive the matching activity for each status (e.g. parameterising on (terminal_status, activity, inputs) tuples) would close that gap.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: products/replay_vision/backend/tests/test_temporal.py
    Line: 857-880
    
    Comment:
    **Idempotency only tested for `mark_observation_failed_activity`**
    
    `test_no_counter_or_log_when_update_affects_zero_rows` is parameterised on all three terminal statuses but calls only `mark_observation_failed_activity` in every case. The idempotency guard (`if not updated: return`) in `mark_observation_ineligible_activity` and `mark_observation_succeeded_activity` is never exercised by this test. Given the stated preference for parameterised tests, extending this case to also drive the matching activity for each status (e.g. parameterising on `(terminal_status, activity, inputs)` tuples) would close that gap.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
products/replay_vision/backend/tests/test_temporal.py:857-880
**Idempotency only tested for `mark_observation_failed_activity`**

`test_no_counter_or_log_when_update_affects_zero_rows` is parameterised on all three terminal statuses but calls only `mark_observation_failed_activity` in every case. The idempotency guard (`if not updated: return`) in `mark_observation_ineligible_activity` and `mark_observation_succeeded_activity` is never exercised by this test. Given the stated preference for parameterised tests, extending this case to also drive the matching activity for each status (e.g. parameterising on `(terminal_status, activity, inputs)` tuples) would close that gap.

Reviews (1): Last reviewed commit: "feat(replay-vision): prometheus metrics ..." | Re-trigger Greptile

Copy link
Copy Markdown
Contributor

@fasyy612 fasyy612 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweeet 👌

@TueHaulund TueHaulund merged commit ec2a7f5 into master May 28, 2026
219 checks passed
@TueHaulund TueHaulund deleted the tue/replay-vision-metrics branch May 28, 2026 10:22
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented May 28, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-05-28 10:57 UTC Run
prod-us ✅ Deployed 2026-05-28 11:21 UTC Run
prod-eu ✅ Deployed 2026-05-28 11:24 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants