feat: migrate eval system to vitest-evals #441

Merged: BYK merged 1 commit into main from feat-vitest-evals on May 21, 2026

@BYK (Owner) commented May 21, 2026

Summary

Adds vitest-evals integration for structured, standardized eval runs.

Architecture

beforeAll (per scenario):
  1. Start isolated gateway (temp DB)
  2. Replay session turns through gateway (scripted interceptor)
  3. Run /lore:curate (distillation + curation)
  4. Backfill embeddings
  → Gateway is warmed up with Lore state

Each it() test:
  1. Send QA question through warmed-up gateway (LTM + recall tool)
  2. Get response
  3. FactualityJudge scores against reference answer

afterAll:
  Tear down gateway
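
The lifecycle above can be sketched without Vitest so that it runs standalone. Everything named here (`Gateway`, `createStubGateway`, `runScenario`, the `Judge` type) is a hypothetical stand-in, not code from the Lore repository: the stub only records call order, and the judge is any function that returns a score.

```typescript
// Hypothetical stand-ins: none of these names come from the Lore codebase.
interface Gateway {
  replay(turns: string[]): Promise<void>;
  curate(): Promise<void>;
  backfillEmbeddings(): Promise<void>;
  ask(question: string): Promise<string>;
  stop(): Promise<void>;
}

interface QA {
  question: string;
  reference: string;
}

// A judge maps (answer, reference) to a score; FactualityJudge plays this role in the PR.
type Judge = (answer: string, reference: string) => Promise<number>;

// Stub gateway that records the order of calls instead of doing real work.
function createStubGateway(log: string[]): Gateway {
  log.push("start"); // stands in for "start isolated gateway (temp DB)"
  return {
    replay: async (turns) => { log.push(`replay:${turns.length}`); },
    curate: async () => { log.push("curate"); },
    backfillEmbeddings: async () => { log.push("embeddings"); },
    ask: async (question) => { log.push(`ask:${question}`); return `answer to ${question}`; },
    stop: async () => { log.push("stop"); },
  };
}

async function runScenario(
  gateway: Gateway,
  turns: string[],
  questions: QA[],
  judge: Judge,
): Promise<number[]> {
  const scores: number[] = [];
  try {
    // beforeAll: warm the gateway up once per scenario.
    await gateway.replay(turns);
    await gateway.curate();
    await gateway.backfillEmbeddings();

    // One it() per question: ask through the warmed-up gateway, then score.
    for (const { question, reference } of questions) {
      const answer = await gateway.ask(question);
      scores.push(await judge(answer, reference));
    }
  } finally {
    // afterAll: tear down even if warm-up or a question failed.
    await gateway.stop();
  }
  return scores;
}
```

In the actual eval files the three phases sit in `beforeAll`, the individual `it()` cases, and `afterAll`, so the expensive replay happens once per scenario rather than once per question.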

New Files

  • vitest.evals.config.ts — Separate Vitest config for evals (600s test timeout, 30min hook timeout, single-fork, vitest-evals reporter)
  • packages/core/eval/lore-harness.ts — createHarness() wrapper around the Lore gateway with replayAndWarmup() for session replay
  • packages/core/eval/cm1.eval.ts — CM-1 scenario (400K inflated) as vitest-evals tests
  • packages/core/eval/mega-session.eval.ts — 2.3M mega-session as vitest-evals tests
  • package.json — bun run evals script
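
For orientation, a separate eval config along the lines described above might look like the sketch below. The timeouts and the reporter path come from this description; the `include` glob and the `pool`/`poolOptions` spelling are assumptions (the latter matches Vitest 1 to 3 and may be configured differently in other major versions), so treat this as an illustration rather than the file in the PR.

```typescript
// vitest.evals.config.ts (sketch, not the actual file from this PR)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Assumed glob: pick up only eval files, so the regular test run never sees them.
    include: ["packages/core/eval/**/*.eval.ts"],

    // 600s per it() case; each case is a full QA round trip plus judging.
    testTimeout: 600_000,

    // 30min for hooks; beforeAll replays a whole session and runs curation.
    hookTimeout: 30 * 60_000,

    // Single fork: eval files run one at a time in one child process.
    // Assumed spelling (Vitest 1 to 3); check the docs for the installed version.
    pool: "forks",
    poolOptions: { forks: { singleFork: true } },

    // Keep the default console output and add the reporter named in the PR.
    reporters: ["default", "vitest-evals/reporter"],
  },
});
```

Keeping this in its own config file is what lets the existing test suite stay untouched: the default config never matches `.eval.ts` files, and the long timeouts apply only to eval runs.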

What This Gets Us

  • Vitest test runner — standard vitest run output, filtering, parallelism
  • FactualityJudge — built-in LLM-as-judge (replaces custom judge.ts)
  • GitHub Actions reporter — summary + annotations via vitest-evals/reporter
  • Standard interface — each question is an it() test case

What Stays the Same

  • Scenario definitions (scenarios/*.ts) — unchanged
  • Gateway lifecycle — reused from harness.ts
  • Scripted interceptor — reused
  • Existing run.ts / harness.ts — coexist (can be removed later)

Tests

  • 1752 pass, 0 fail (existing tests unaffected; .eval.ts files are included only by the eval config)
  • Typecheck clean

@BYK self-assigned this May 21, 2026
Adds vitest-evals integration alongside the existing eval harness:

- vitest.evals.config.ts: separate Vitest config for evals (long timeouts,
  single-fork, vitest-evals reporter)
- lore-harness.ts: createHarness() wrapper around the Lore gateway with
  replayAndWarmup() for session replay setup
- cm1.eval.ts: CM-1 400K scenario as vitest-evals tests
- mega-session.eval.ts: 2.3M mega-session as vitest-evals tests
- FactualityJudge for LLM-as-judge scoring
- 'bun run evals' script

Architecture: beforeAll replays session through gateway (builds Lore state),
each it() sends a QA question through the warmed-up gateway, FactualityJudge
scores against the reference answer.
@BYK force-pushed the feat-vitest-evals branch from df8cc98 to 10de926 on May 21, 2026 09:20
@BYK merged commit 24b3d3e into main on May 21, 2026
9 of 10 checks passed
@BYK deleted the feat-vitest-evals branch on May 21, 2026 09:22
@craft-deployer (bot) mentioned this pull request on May 21, 2026 (6 tasks)