feat: add session recording, replay, and scenario filtering to eval harness by BYK · Pull Request #374 · BYK/loreai

BYK · 2026-05-18T21:36:37Z

Summary

Adds session recording/replay to the eval harness for ~90% cost reduction on subsequent runs, plus scenario filtering for targeted testing.

New CLI flags

# First run: record session replay data
bun packages/core/eval/run.ts --mode live --record fixtures/recorded --dimensions preferences

# Subsequent runs: replay from fixtures (skip expensive upstream API calls)
bun packages/core/eval/run.ts --mode live --replay fixtures/recorded --dimensions preferences

# Run only specific scenarios
bun packages/core/eval/run.ts --mode live --scenarios pr-3-evolution --baselines lore

How it works

Session replay is the dominant eval cost: ~$2-3 per run, from 24+ upstream Anthropic API calls with growing context windows. The recording system has two modes:

Record mode (--record <dir>): Enables the gateway's existing startRecording() interceptor during replaySession() calls. Upstream request/response pairs are saved as NDJSON files (one per session). The interceptor is scoped to session replay only — /lore:curate and QA calls are unaffected.
Replay mode (--replay <dir>): Loads recorded NDJSON fixtures and wires the gateway's getReplayInterceptor() during session replay. The gateway still fully processes every turn (temporal storage, session tracking, gradient state) — only the upstream HTTP call is replaced. /lore:curate and QA questions still use real API calls.
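
The record/replay split described above can be illustrated with a minimal, self-contained sketch. It is not the gateway's code: the signatures of startRecording() and getReplayInterceptor() are not shown in this PR, so every name below (Upstream, RecordedExchange, recordingInterceptor, parseNdjson, replayInterceptor) is hypothetical, and the NDJSON field layout is an assumption. It keeps fixtures in memory rather than on disk to stay runnable anywhere.

```typescript
// Hypothetical sketch: these names and the fixture shape are assumptions,
// not the loreai gateway's actual API.
type Upstream = (requestBody: string) => Promise<string>;

interface RecordedExchange {
  request: string;
  response: string;
}

// Record mode: call the real upstream, then append one NDJSON line per exchange.
function recordingInterceptor(upstream: Upstream, sink: string[]): Upstream {
  return async (requestBody) => {
    const response = await upstream(requestBody);
    const entry: RecordedExchange = { request: requestBody, response };
    sink.push(JSON.stringify(entry));
    return response;
  };
}

// Parse NDJSON text (one JSON object per line), skipping blank lines.
function parseNdjson(text: string): RecordedExchange[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as RecordedExchange);
}

// Replay mode: serve recorded responses in order instead of calling upstream.
function replayInterceptor(fixtures: RecordedExchange[]): Upstream {
  let cursor = 0;
  return async () => {
    const next = fixtures[cursor];
    if (next === undefined) {
      throw new Error(`fixtures exhausted after ${cursor} recorded calls`);
    }
    cursor += 1;
    return next.response;
  };
}
```

In this sketch the caller sees the same Upstream type in both modes, which mirrors the PR's point that the rest of the turn processing runs unchanged and only the upstream call is swapped. Replaying strictly in recorded order is a simplification; a real implementation might match on request content instead.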

Cost comparison (PR-3 evolution scenario)

Mode	Session replay	Distill + Curate	QA + Judge	Total
Record (first run)	~$2-3 (24 calls)	~$0.30	~$0.05	~$2.35-3.35
Replay (subsequent)	$0 (fixtures)	~$0.30	~$0.03	~$0.33
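
As a quick check of the ~90% figure, the arithmetic can be worked from the table's own numbers. The helper name costReductionPercent is hypothetical, and the low/high record totals simply take session replay at the two ends of the quoted ~$2-3 range.

```typescript
// Percentage saved when a replay run replaces a record run.
function costReductionPercent(recordTotal: number, replayTotal: number): number {
  if (recordTotal <= 0) {
    throw new RangeError("recordTotal must be positive");
  }
  return (1 - replayTotal / recordTotal) * 100;
}

// Column sums from the table: session replay + distill/curate + QA/judge.
const replayTotal = 0 + 0.30 + 0.03; // ~$0.33
const recordLow = 2 + 0.30 + 0.05;   // ~$2.35, session replay at $2
const recordHigh = 3 + 0.30 + 0.05;  // ~$3.35, session replay at $3

console.log(costReductionPercent(recordLow, replayTotal).toFixed(0));  // prints "86"
console.log(costReductionPercent(recordHigh, replayTotal).toFixed(0)); // prints "90"
```

So the saving is roughly 86% to 90% depending on where in the $2-3 range the session replay lands, with the ~90% headline matching the upper end.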

Other changes

--scenarios filter: Run only specific scenarios by ID (comma-separated)
Removed 5s QA delay: No longer needed with /lore:curate handling curation synchronously
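
A comma-separated ID filter like --scenarios could be sketched as follows. This is an illustration only: parseScenarioIds, filterScenarios, and the Scenario shape are hypothetical names, and rejecting unknown IDs is an assumption rather than documented harness behavior.

```typescript
// Hypothetical minimal scenario shape; the real type likely has more fields.
interface Scenario {
  id: string;
}

// Turn "a, b,,c" into ["a", "b", "c"]; an absent or empty flag means "no filter".
function parseScenarioIds(flag: string | undefined): string[] | undefined {
  if (flag === undefined) return undefined;
  const ids = flag
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  return ids.length > 0 ? ids : undefined;
}

// Keep only the requested scenarios, preserving their original order.
function filterScenarios<T extends Scenario>(all: T[], ids: string[] | undefined): T[] {
  if (ids === undefined) return all;
  const known = new Set(all.map((scenario) => scenario.id));
  const unknown = ids.filter((id) => !known.has(id));
  if (unknown.length > 0) {
    // Assumption: failing loudly on a typo is preferable to silently running nothing.
    throw new Error(`unknown scenario id(s): ${unknown.join(", ")}`);
  }
  const wanted = new Set(ids);
  return all.filter((scenario) => wanted.has(scenario.id));
}
```

For example, filterScenarios(allScenarios, parseScenarioIds("pr-3-evolution")) would keep just that one scenario, assuming it exists in allScenarios.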

Files Changed

packages/core/eval/types.ts — recordDir, replayDir, scenarios fields on EvalConfig
packages/core/eval/run.ts — CLI flags + validation
packages/core/eval/harness.ts — replaySessionWithFixtures(), scenario filtering, delay removal
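
For orientation, the three new config fields and one plausible validation rule might look like the sketch below. The field names come from this PR, but their types, the EvalConfigSketch and validateRecordingFlags names, and the specific checks (mutual exclusion of --record and --replay, non-empty directories) are assumptions; the PR does not spell out what run.ts validates.

```typescript
// Field names are from the PR; the optional/string typing is an assumption.
interface EvalConfigSketch {
  recordDir?: string;
  replayDir?: string;
  scenarios?: string[];
}

// Returns a list of human-readable problems; an empty list means the flags look consistent.
function validateRecordingFlags(config: EvalConfigSketch): string[] {
  const errors: string[] = [];
  if (config.recordDir !== undefined && config.replayDir !== undefined) {
    // Assumed rule: a single run either records fixtures or replays them.
    errors.push("--record and --replay cannot be used together");
  }
  if (config.recordDir !== undefined && config.recordDir.trim() === "") {
    errors.push("--record requires a directory");
  }
  if (config.replayDir !== undefined && config.replayDir.trim() === "") {
    errors.push("--replay requires a directory");
  }
  return errors;
}
```

Collecting all problems before reporting, rather than throwing on the first, lets a CLI print every flag mistake in one pass.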

…arness

- --record <dir>: save upstream API responses during session replay
- --replay <dir>: replay from saved fixtures (skip expensive upstream calls)
- --scenarios <id,...>: run only specific scenarios
- Remove 5s QA delay (unnecessary with /lore:curate)

Session replay is the dominant eval cost (~$2-3 per run, 24+ upstream calls with growing context). Recording saves these on the first run; subsequent runs replay from fixtures and only pay for distillation, curation, and QA (~$0.33), a ~90% cost reduction.

BYK self-assigned this May 18, 2026

BYK merged commit 5a01821 into main May 18, 2026
10 checks passed

BYK deleted the feat-eval-recording branch May 18, 2026 21:38

This was referenced May 21, 2026

publish: BYK/loreai@0.23.0 #439

Closed

publish: BYK/loreai@0.23.0 #448

Closed

BYK merged 1 commit into main from feat-eval-recording
