feat(replay-vision): add SweepScannerWorkflow for Phase 2 schedule fires #60772

Merged: TueHaulund merged 6 commits into master from tue/replay-vision-scanner-sweep on Jun 1, 2026

@TueHaulund TueHaulund commented May 30, 2026

Stacked on #60617.

Problem

Phase 2 needs a Temporal workflow that fires every 5 minutes per scanner, runs ScannerCandidateQuery, dispatches one ApplyScannerWorkflow per candidate, and advances the watermark. The per-scanner schedule lands in the next stacked PR — this is just the workflow.

Changes

  • SweepScannerWorkflow: find candidates → ABANDONed children → advance watermark. On full-batch failure raises AllChildStartsFailed and skips the watermark advance so the next fire retries the window.
  • find_scanner_candidates_activity: returns candidates + saturated flag (len == DEFAULT_CANDIDATE_LIMIT). Non-retryable on malformed saved query.
  • advance_scanner_watermark_activity: bumps last_swept_at and last_seen_session_id via .update() — no scanner_version bump, idempotent.
  • Migration 0010: adds last_seen_session_id to ReplayScanner.
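
The flow in this description can be sketched in plain asyncio, without the Temporal SDK. This is only an illustration of the find, dispatch, advance sequence as the description words it. The limit value, the `state` dict, and the `start_child` callable are hypothetical stand-ins, and review comments later in this thread indicate the merged workflow aborts on the first unexpected failure rather than only on a full-batch failure.

```python
import asyncio
from dataclasses import dataclass

DEFAULT_CANDIDATE_LIMIT = 3  # hypothetical value; the real limit is not stated in this PR


class AllChildStartsFailed(Exception):
    """Raised when every child dispatch in a batch fails."""


@dataclass
class FindResult:
    candidates: list  # session ids, in sweep order
    saturated: bool   # True when the batch filled the limit


def find_candidates(pending):
    batch = pending[:DEFAULT_CANDIDATE_LIMIT]
    return FindResult(batch, saturated=len(batch) == DEFAULT_CANDIDATE_LIMIT)


async def sweep(pending, start_child, state):
    result = find_candidates(pending)
    if not result.candidates:
        return
    # Collect outcomes instead of raising, so a full-batch failure can be detected.
    outcomes = await asyncio.gather(
        *(start_child(sid) for sid in result.candidates),
        return_exceptions=True,
    )
    if all(isinstance(o, Exception) for o in outcomes):
        # Watermark is left untouched, so the next fire retries the same window.
        raise AllChildStartsFailed(f"{len(outcomes)} child starts failed")
    # Assumption: the sketch tracks only the last dispatched session id.
    state["last_seen_session_id"] = result.candidates[-1]
```

A saturated batch signals that more candidates may remain in the same window, which is why the last session id is recorded alongside the watermark.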

How did you test this code?

I'm an agent. 15 new tests pass; 78 existing tests still pass. No manual testing.

Automatic notifications

  • Publish to changelog?
  • Alert Sales and Marketing teams?

🤖 Agent context

Agent: Claude (Claude Code). Used ALLOW_DUPLICATE for the child reuse policy to match the existing rasterize-recording dispatch — UNIQUE(scanner_id, session_id) is the durable dedup.
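
To illustrate why a permissive reuse policy can be safe when a unique constraint is the durable dedup, here is a minimal sketch using SQLite in place of the real database. The `observation` table and the `record_observation` helper are hypothetical; only the `UNIQUE(scanner_id, session_id)` idea comes from this PR.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observation ("
        " scanner_id TEXT NOT NULL,"
        " session_id TEXT NOT NULL,"
        " UNIQUE (scanner_id, session_id))"
    )
    return conn


def record_observation(conn, scanner_id, session_id):
    """Insert the pair once; return False when it already exists."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO observation (scanner_id, session_id) VALUES (?, ?)",
        (scanner_id, session_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows changed means the constraint rejected a duplicate
```

A second child started for the same scanner and session would reach the insert and find the row already present, so a duplicate dispatch does not produce a duplicate observation.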

github-actions Bot commented May 30, 2026

Migration SQL Changes

Hey 👋, we've detected some migrations on this PR. Here's the SQL output for each migration, make sure they make sense:

products/replay_vision/backend/migrations/0010_replayscanner_last_seen_session_id.py

BEGIN;
--
-- Add field last_seen_session_id to replayscanner
--
ALTER TABLE "replay_vision_replayscanner" ADD COLUMN "last_seen_session_id" varchar(200) DEFAULT '' NOT NULL;
COMMIT;

Last updated: 2026-06-01 20:37 UTC (bc5dcea)

github-actions Bot commented May 30, 2026

🔍 Migration Risk Analysis

We've analyzed your migrations for potential risks.

Summary: 1 Safe | 0 Needs Review | 0 Blocked

✅ Safe

Brief or no lock, backwards compatible

replay_vision.0010_replayscanner_last_seen_session_id
  └─ #1 ✅ AddField
     Adding NOT NULL field with constant default (safe in PG11+)
     model: replayscanner, field: last_seen_session_id

Last updated: 2026-06-01 20:38 UTC (bc5dcea)

@TueHaulund TueHaulund requested review from a team, arnohillen, fasyy612, ksvat and nicowaltz and removed request for a team May 30, 2026 20:43
@TueHaulund TueHaulund marked this pull request as ready for review May 30, 2026 20:43
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 30, 2026 20:44

greptile-apps Bot commented May 30, 2026

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
products/replay_vision/backend/tests/test_sweep.py:339-342
Redundant duplicate assertion — the same list-comprehension check appears twice in a row. The inline assert on line 339 and the `advance_calls` assertion on lines 341-342 are identical; one should be removed.

```suggestion
    advance_calls = [call for fn, call in mocks.activity_calls if fn == advance_scanner_watermark_activity]
    assert advance_calls == []
```

### Issue 2 of 2
products/replay_vision/backend/temporal/sweep_workflow.py:56-59
`asyncio.gather` with `return_exceptions=False` (the default) propagates on the **first** non-`WorkflowAlreadyStartedError` failure, not only when every dispatch fails. The comment "if every dispatch fails" misrepresents the actual semantics — even a single unexpected failure in a 100-candidate batch will skip the watermark advance. The behaviour itself is intentional and safe (the next sweep retries, already-started children get `WorkflowAlreadyStartedError`), but the comment makes it sound like partial failures are tolerated.

```suggestion
        # `return_exceptions=False`: if *any* dispatch fails with an error other
        # than WorkflowAlreadyStartedError, the first such exception propagates
        # and the watermark advance is skipped — next sweep retries the same
        # window. Already-started children are deduplicated by Temporal's
        # workflow_id and by `UNIQUE(scanner_id, session_id)` on the row.
```
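
The semantics described in Issue 2 can be reproduced with plain asyncio. In this sketch `WorkflowAlreadyStartedError` is a local stand-in for the Temporal error, and `start_child`, `start_child_tolerant`, and `dispatch_batch` are hypothetical names, not the functions in `sweep_workflow.py`.

```python
import asyncio


class WorkflowAlreadyStartedError(Exception):
    """Stand-in for the Temporal error of the same name."""


async def start_child(session_id, already_started, broken):
    await asyncio.sleep(0)  # yield once, as a real dispatch would
    if session_id in broken:
        raise RuntimeError(f"dispatch failed for {session_id}")
    if session_id in already_started:
        raise WorkflowAlreadyStartedError(session_id)


async def start_child_tolerant(session_id, already_started, broken):
    try:
        await start_child(session_id, already_started, broken)
    except WorkflowAlreadyStartedError:
        pass  # a duplicate start counts as success


async def dispatch_batch(session_ids, already_started=frozenset(), broken=frozenset()):
    # return_exceptions defaults to False: the first exception propagates to the caller.
    await asyncio.gather(
        *(start_child_tolerant(s, already_started, broken) for s in session_ids)
    )
    return True  # reached only when no dispatch raised, so the watermark may advance
```

With 100 candidates and a single broken dispatch, `dispatch_batch` raises and never returns `True`. Note that `asyncio.gather` does not cancel the remaining awaitables when one fails, so other children may still start.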

Reviews (1): Last reviewed commit: "refactor(replay-vision): simplify SweepS..." | Re-trigger Greptile

Comment thread products/replay_vision/backend/tests/test_sweep.py Outdated
Comment thread products/replay_vision/backend/temporal/sweep_workflow.py Outdated
Lays the foundation for per-scanner Temporal schedules. Wraps SessionRecordingListFromQuery with a session_end-based filter: a session is eligible when it's had no activity in the last 35 minutes and its end time is past the scanner's watermark. The wrap delegates all RecordingsQuery filter compilation to the recordings list, so a scanner's saved filters resolve identically to the UI.
- Bump _PARTITION_LOOKBACK from 6h to 26h, anchored to posthog-js's 24h
  session_id rotation cap + 2h skew/lag headroom. Adds regression test for
  long-running sessions whose start is older than 6h.
- Add keyset pagination via last_seen_session_id kwarg + lexicographic tuple
  comparison. Lets the schedule resume past a saturated batch without skipping
  sessions tied at the boundary microsecond.
- Drop the now kwarg; use datetime.now(dt.UTC) directly so inner and outer
  clocks always agree under @freeze_time.
- Push sampling into the inner HAVING via extra_having_predicates so
  un-sampled sessions are dropped before being aggregated by the outer.
- Validate max_execution_time_seconds > 0.
- Comment on inner.order_by mutation noting get_query() re-parses each call.
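
As a rough illustration of the keyset pagination item above, the following sketch applies a lexicographic tuple comparison in Python rather than in the query. The `next_page` helper and the `(session_end, session_id)` row shape are assumptions made for the example.

```python
from datetime import datetime, timezone


def next_page(rows, watermark, last_seen_session_id, limit):
    """Return up to `limit` rows strictly after the (watermark, id) cursor.

    Each row is a (session_end, session_id) tuple. Python compares tuples
    lexicographically, so rows tied on session_end are ordered by session_id.
    """
    cursor = (watermark, last_seen_session_id)
    eligible = sorted(row for row in rows if row > cursor)
    return eligible[:limit]
```

Resuming from the last row of a saturated page picks up sessions that share the same `session_end` microsecond but sort later by id, instead of skipping them as a timestamp-only cursor would.
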
Drop multiline blocks, keep at most one sentence per comment, remove plan/phase prose.
@TueHaulund TueHaulund force-pushed the tue/replay-vision-scanner-candidate-query branch from 93698c0 to f7c0345 Compare May 30, 2026 21:26
@TueHaulund TueHaulund force-pushed the tue/replay-vision-scanner-sweep branch from c898cda to a30b8c4 Compare May 30, 2026 21:33
Comment thread products/replay_vision/backend/temporal/sweep_workflow.py

veria-ai Bot commented May 30, 2026

PR overview

This pull request adds the SweepScannerWorkflow for replay-vision Phase 2 scheduled scanner runs, dispatching candidate sweep work through Temporal child workflows. The touched workflow code coordinates scanner-enabled batch execution for observation creation.

There is one open security concern around quota enforcement during sweep fan-out: a user with scanner configuration access could trigger concurrent child workflows that each see available quota and collectively exceed the intended monthly observation limit. Two prior issues have already been addressed, so the remaining risk is focused on bounding or atomically reserving quota before dispatch. The impact appears limited to quota/resource abuse rather than direct data exposure or authorization bypass.

Open issues (1)

Fixed/addressed: 2 · PR risk: 5/10

@TueHaulund TueHaulund force-pushed the tue/replay-vision-scanner-sweep branch from 4d2e011 to 80f2b38 Compare June 1, 2026 07:15
Closes a DoS vector flagged in review: a client sending events with
session_ids longer than 128 chars (the MAX_SESSION_ID_LENGTH used by the
ApplyScannerWorkflow wire payload) would wedge the sweep on Pydantic
validation. Filtering at the query layer keeps over-length sessions
invisible to the scanner so the watermark always advances.
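
A small sketch of the failure mode and the fix described above. The dataclass stands in for the Pydantic model, and `build_payloads` and `filter_candidates` are hypothetical names. The real filter runs in the query layer, where length may be measured differently from Python's `len`.

```python
from dataclasses import dataclass

MAX_SESSION_ID_LENGTH = 128


@dataclass(frozen=True)
class ApplyScannerInput:
    session_id: str

    def __post_init__(self):
        # Stand-in for wire-payload validation.
        if len(self.session_id) > MAX_SESSION_ID_LENGTH:
            raise ValueError("session_id too long")


def build_payloads(session_ids):
    # One over-length id raises and fails the whole batch.
    return [ApplyScannerInput(sid) for sid in session_ids]


def filter_candidates(session_ids):
    # Dropping over-length ids up front keeps them out of the batch entirely.
    return [sid for sid in session_ids if len(sid) <= MAX_SESSION_ID_LENGTH]
```
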
Adds the Temporal workflow that fires every 5 minutes per scanner, runs
ScannerCandidateQuery, dispatches ABANDONed ApplyScannerWorkflow children,
and advances the watermark. Per-scanner schedules and the reconciler land in
a later PR.

- Migration 0010: add last_seen_session_id to ReplayScanner (keyset
  tiebreaker for resuming saturated batches without re-emitting).
- find_scanner_candidates_activity: reads scanner row, runs the candidate
  query, returns candidates + a saturated flag. Filters enabled=True to
  short-circuit disabled scanners. Verifies the creator still has
  session_recording read on the team as a defence-in-depth check.
- advance_scanner_watermark_activity: bumps last_swept_at +
  last_seen_session_id via .update(), no scanner_version bump.
- SweepScannerWorkflow: find -> asyncio.gather over _start_child ->
  advance. First non-WorkflowAlreadyStartedError failure aborts the gather
  and skips the watermark advance; UNIQUE(scanner_id, session_id) on
  ReplayObservation dedups retries.
- ReplayScannerViewSet: dangerously_get_required_scopes adds
  session_recording:read to create/update/partial_update and initial()
  enforces the matching user_access_control check, matching the /observe/
  authorization boundary.
@TueHaulund TueHaulund force-pushed the tue/replay-vision-scanner-sweep branch from 80f2b38 to abd878f Compare June 1, 2026 07:48

@fasyy612 fasyy612 left a comment

LGTM 🙆‍♀️

Base automatically changed from tue/replay-vision-scanner-candidate-query to master June 1, 2026 17:41
            retry_policy=common.RetryPolicy(maximum_attempts=1),
        )
        if not find_result.candidates:
            return

Medium: Observation quota bypass

An authenticated user who can configure a broad enabled scanner can cause a sweep to start up to DEFAULT_CANDIDATE_LIMIT child workflows at once. Each child checks compute_quota_snapshot() independently in create_observation_activity, so concurrent children can all observe quota headroom before any pending rows are visible and create more observations than the monthly quota allows. Reserve quota atomically before dispatching, or cap the dispatch batch to the current remaining quota using a DB-side lock/claim so the sweep cannot fan out past the organization’s remaining allowance.
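
One possible shape for the atomic reservation suggested here, sketched with SQLite so it runs standalone. The `quota` table, its columns, and the `reserve` helper are hypothetical. On PostgreSQL a row lock such as `SELECT ... FOR UPDATE` would play the role of `BEGIN IMMEDIATE`.

```python
import sqlite3


def make_quota_db(limit, used=0):
    # isolation_level=None leaves transaction control to explicit BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE quota (org_id TEXT PRIMARY KEY, monthly_limit INTEGER, used INTEGER)"
    )
    conn.execute("INSERT INTO quota VALUES ('org', ?, ?)", (limit, used))
    return conn


def reserve(conn, org_id, requested):
    """Atomically claim up to `requested` observations; return how many were granted."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading headroom
    try:
        limit, used = conn.execute(
            "SELECT monthly_limit, used FROM quota WHERE org_id = ?", (org_id,)
        ).fetchone()
        granted = max(0, min(requested, limit - used))
        conn.execute(
            "UPDATE quota SET used = used + ? WHERE org_id = ?", (granted, org_id)
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return granted
```

The sweep would then dispatch only `granted` children, so concurrent children can no longer each observe the same headroom.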

@TueHaulund TueHaulund merged commit 8313cac into master Jun 1, 2026
336 of 342 checks passed
@TueHaulund TueHaulund deleted the tue/replay-vision-scanner-sweep branch June 1, 2026 20:58

deployment-status-posthog Bot commented Jun 1, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
| --- | --- | --- | --- |
| dev | ✅ Deployed | 2026-06-01 21:28 UTC | Run |
| prod-us | ✅ Deployed | 2026-06-02 10:18 UTC | Run |
| prod-eu | ✅ Deployed | 2026-06-01 21:53 UTC | Run |

2 participants