
feat: add session recording, replay, and scenario filtering to eval harness #374

Merged
BYK merged 1 commit into main from feat-eval-recording on May 18, 2026


@BYK BYK commented May 18, 2026

Summary

Adds session recording/replay to the eval harness for ~90% cost reduction on subsequent runs, plus scenario filtering for targeted testing.

New CLI flags

# First run: record session replay data
bun packages/core/eval/run.ts --mode live --record fixtures/recorded --dimensions preferences

# Subsequent runs: replay from fixtures (skip expensive upstream API calls)
bun packages/core/eval/run.ts --mode live --replay fixtures/recorded --dimensions preferences

# Run only specific scenarios
bun packages/core/eval/run.ts --mode live --scenarios pr-3-evolution --baselines lore

How it works

Session replay is the dominant eval cost: ~$2-3 per run, spread across 24+ upstream Anthropic API calls whose context windows grow with each turn. The recording system has two modes:

  1. Record mode (--record <dir>): Enables the gateway's existing startRecording() interceptor during replaySession() calls. Upstream request/response pairs are saved as NDJSON files (one per session). The interceptor is scoped to session replay only — /lore:curate and QA calls are unaffected.

  2. Replay mode (--replay <dir>): Loads recorded NDJSON fixtures and wires the gateway's getReplayInterceptor() during session replay. The gateway still fully processes every turn (temporal storage, session tracking, gradient state) — only the upstream HTTP call is replaced. /lore:curate and QA questions still use real API calls.
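
The two modes above can be sketched as a pair of wrappers around the upstream call. This is a minimal illustration, not the gateway's actual startRecording() or getReplayInterceptor() code: the RecordedExchange shape, the Upstream type, and the helper names are all hypothetical, and serving fixtures strictly in recorded order is an assumption about how matching might work.

```typescript
// Hypothetical record shape; the gateway's real NDJSON schema is not shown in this PR.
interface RecordedExchange {
  request: { url: string; body: string };
  response: { status: number; body: string };
}

// Hypothetical stand-in for "make one upstream HTTP call".
type Upstream = (
  request: RecordedExchange["request"],
) => Promise<RecordedExchange["response"]>;

// NDJSON: one JSON object per line, newline-terminated.
function toNdjson(exchanges: RecordedExchange[]): string {
  return exchanges.map((e) => JSON.stringify(e)).join("\n") + "\n";
}

function fromNdjson(text: string): RecordedExchange[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as RecordedExchange);
}

// Record mode: forward to the real upstream and keep each request/response pair.
function recording(upstream: Upstream, sink: RecordedExchange[]): Upstream {
  return async (request) => {
    const response = await upstream(request);
    sink.push({ request, response });
    return response;
  };
}

// Replay mode: return recorded responses in order (assumed), with no network call.
function replaying(fixtures: RecordedExchange[]): Upstream {
  let next = 0;
  return async () => {
    const hit = fixtures[next++];
    if (!hit) throw new Error("replay fixture exhausted");
    return hit.response;
  };
}
```

In this sketch only the upstream function is swapped out, which mirrors the description above: everything the gateway does around the call would still run in replay mode.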

Cost comparison (PR-3 evolution scenario)

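| Mode                | Session replay   | Distill + Curate | QA + Judge | Total  |
|---------------------|------------------|------------------|------------|--------|
| Record (first run)  | ~$2-3 (24 calls) | ~$0.30           | ~$0.05     | ~$2.35 |
| Replay (subsequent) | $0 (fixtures)    | ~$0.30           | ~$0.03     | ~$0.33 |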
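
As a quick check on the "~90%" figure, the totals in the table can be recomputed from the per-stage estimates. The sketch below is illustrative only; the RunCost type and variable names are hypothetical, and the dollar amounts are the approximate values quoted above, with the session replay cost evaluated at both ends of its ~$2-3 range.

```typescript
// Hypothetical helper type; amounts are the approximate per-stage costs from the table.
interface RunCost {
  sessionReplay: number;
  distillCurate: number;
  qaJudge: number;
}

const total = (c: RunCost): number => c.sessionReplay + c.distillCurate + c.qaJudge;

// Fraction of the record-run cost that a replay run avoids.
const savings = (record: RunCost, replay: RunCost): number =>
  1 - total(replay) / total(record);

const replayRun: RunCost = { sessionReplay: 0, distillCurate: 0.3, qaJudge: 0.03 };
const recordLow: RunCost = { sessionReplay: 2, distillCurate: 0.3, qaJudge: 0.05 };
const recordHigh: RunCost = { sessionReplay: 3, distillCurate: 0.3, qaJudge: 0.05 };

console.log((savings(recordLow, replayRun) * 100).toFixed(0) + "%"); // 86%
console.log((savings(recordHigh, replayRun) * 100).toFixed(0) + "%"); // 90%
```

The ~$2.35 total in the table corresponds to the low end of the session replay range, where the saving is about 86%; the ~90% figure matches the high end.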

Other changes

  • --scenarios filter: Run only specific scenarios by ID (comma-separated)
  • Removed 5s QA delay: No longer needed with /lore:curate handling curation synchronously
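
A comma-separated --scenarios value could be handled roughly as follows. This is a sketch under assumptions, not the code in harness.ts: the Scenario shape and both function names are hypothetical, and rejecting unknown IDs is a design choice made here for illustration rather than documented behavior.

```typescript
// Hypothetical minimal scenario shape; the real type lives in the eval package.
interface Scenario {
  id: string;
}

// Turn "a, b,,c" into ["a", "b", "c"]; an absent or empty flag means "no filter".
function parseScenarioIds(flag: string | undefined): string[] | undefined {
  if (flag === undefined) return undefined;
  const ids = flag
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  return ids.length > 0 ? ids : undefined;
}

// Keep only the requested scenarios, preserving their original order.
function filterScenarios<T extends Scenario>(all: T[], ids: string[] | undefined): T[] {
  if (!ids) return all;
  const known = new Set(all.map((s) => s.id));
  const unknown = ids.filter((id) => !known.has(id));
  if (unknown.length > 0) {
    // Assumed behavior: fail loudly instead of silently running nothing.
    throw new Error(`Unknown scenario id(s): ${unknown.join(", ")}`);
  }
  const wanted = new Set(ids);
  return all.filter((s) => wanted.has(s.id));
}
```

Failing on an unknown ID keeps a typo in the flag from producing an empty run that looks like a pass.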

Files Changed

  • packages/core/eval/types.ts: recordDir, replayDir, scenarios fields on EvalConfig
  • packages/core/eval/run.ts: CLI flags + validation
  • packages/core/eval/harness.ts: replaySessionWithFixtures(), scenario filtering, delay removal

…arness

- --record <dir>: save upstream API responses during session replay
- --replay <dir>: replay from saved fixtures (skip expensive upstream calls)
- --scenarios <id,...>: run only specific scenarios
- Remove 5s QA delay (unnecessary with /lore:curate)

Session replay is the dominant eval cost (~$2-3 per run, 24+ upstream
calls with growing context). Recording saves these on the first run;
subsequent runs replay from fixtures and only pay for distillation,
curation, and QA (~$0.33), a ~90% cost reduction.
@BYK BYK self-assigned this May 18, 2026
@BYK BYK merged commit 5a01821 into main May 18, 2026
10 checks passed
@BYK BYK deleted the feat-eval-recording branch May 18, 2026 21:38
This was referenced May 21, 2026