feat: add session recording, replay, and scenario filtering to eval harness #374
Merged
…arness
- --record <dir>: save upstream API responses during session replay
- --replay <dir>: replay from saved fixtures (skip expensive upstream calls)
- --scenarios <id,...>: run only specific scenarios
- Remove 5s QA delay (unnecessary with /lore:curate)
Session replay is the dominant eval cost (~$2-3 per run, 24+ upstream calls with growing context). Recording saves these on the first run; subsequent runs replay from fixtures and only pay for distillation, curation, and QA (~$0.33), a ~90% cost reduction.
Summary
Adds session recording/replay to the eval harness for ~90% cost reduction on subsequent runs, plus scenario filtering for targeted testing.
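New CLI flags
- --record <dir>: save upstream API responses during session replay
- --replay <dir>: replay from saved fixtures, skipping the upstream calls
- --scenarios <id,...>: run only the listed scenarios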
How it works
Session replay is the dominant eval cost: ~$2-3 per run, from 24+ upstream Anthropic API calls whose context windows grow with each turn. The recording system has two modes:
- Record mode (--record <dir>): Enables the gateway's existing startRecording() interceptor during replaySession() calls. Upstream request/response pairs are saved as NDJSON files (one per session). The interceptor is scoped to session replay only; /lore:curate and QA calls are unaffected.
- Replay mode (--replay <dir>): Loads recorded NDJSON fixtures and wires the gateway's getReplayInterceptor() during session replay. The gateway still fully processes every turn (temporal storage, session tracking, gradient state); only the upstream HTTP call is replaced. /lore:curate and QA questions still use real API calls.
Cost comparison (PR-3 evolution scenario)
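The saving comes from swapping each upstream call made during session replay for a fixture lookup. The following is a minimal sketch of that record-then-replay idea, not the gateway's code: the names Exchange, Upstream, recordingUpstream, and replayingUpstream are hypothetical stand-ins rather than startRecording() or getReplayInterceptor(), and serving fixtures strictly in recorded order is an assumption.

```typescript
// Hypothetical types and names; not the gateway's actual API.
interface Exchange {
  request: string;  // serialized upstream request body
  response: string; // serialized upstream response body
}

type Upstream = (request: string) => Promise<string>;

// Record mode: call the real upstream and append each exchange to a log.
function recordingUpstream(real: Upstream, log: Exchange[]): Upstream {
  return async (request: string): Promise<string> => {
    const response = await real(request);
    log.push({ request, response });
    return response;
  };
}

// NDJSON: one JSON object per line.
function toNdjson(log: Exchange[]): string {
  return log.map((e) => JSON.stringify(e)).join("\n") + "\n";
}

function fromNdjson(text: string): Exchange[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Exchange);
}

// Replay mode: serve recorded responses in order (an assumption of this
// sketch) without ever calling the real upstream.
function replayingUpstream(fixtures: Exchange[]): Upstream {
  let next = 0;
  return async (_request: string): Promise<string> => {
    const hit = fixtures[next];
    if (hit === undefined) {
      throw new Error(`fixtures exhausted after ${fixtures.length} calls`);
    }
    next += 1;
    return hit.response;
  };
}
```

In this sketch the caller cannot tell a recorded response from a live one, which mirrors the description above: everything around the upstream call still runs, and only the call itself is replaced.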
Other changes
- --scenarios filter: Run only specific scenarios by ID (comma-separated)
- Removed the 5s QA delay (unnecessary with /lore:curate handling curation synchronously)
Files Changed
- packages/core/eval/types.ts: recordDir, replayDir, scenarios fields on EvalConfig
- packages/core/eval/run.ts: CLI flags + validation
- packages/core/eval/harness.ts: replaySessionWithFixtures(), scenario filtering, delay removal
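For readers who want a concrete picture of the --scenarios filter, here is a rough sketch of parsing a comma-separated ID list and narrowing the scenario set. The Scenario shape and the parseScenarioIds and filterScenarios helpers are hypothetical, and rejecting unknown IDs is an assumption of the sketch; the actual code in run.ts and harness.ts may behave differently.

```typescript
// Hypothetical shape; the real scenario type may carry more fields.
interface Scenario {
  id: string;
}

// Parse the value of --scenarios <id,...> into trimmed, non-empty IDs.
function parseScenarioIds(flagValue: string): string[] {
  return flagValue
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}

// Keep only the requested scenarios; with no filter, run everything.
// Throwing on unknown IDs is an assumption of this sketch.
function filterScenarios<T extends Scenario>(all: T[], wanted?: string[]): T[] {
  if (wanted === undefined || wanted.length === 0) {
    return all;
  }
  const known = new Set(all.map((s) => s.id));
  const unknown = wanted.filter((id) => !known.has(id));
  if (unknown.length > 0) {
    throw new Error(`unknown scenario id(s): ${unknown.join(", ")}`);
  }
  const selected = new Set(wanted);
  return all.filter((s) => selected.has(s.id));
}
```

Failing fast on a mistyped ID is a design choice made here so that a typo does not silently produce an empty run.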