feat: migrate eval system to vitest-evals by BYK · Pull Request #441 · BYK/loreai

BYK · 2026-05-21T09:13:55Z

Summary

Adds vitest-evals integration for structured, standardized eval runs.

Architecture

beforeAll (per scenario):
  1. Start isolated gateway (temp DB)
  2. Replay session turns through gateway (scripted interceptor)
  3. Run /lore:curate (distillation + curation)
  4. Backfill embeddings
  → Gateway is warmed up with Lore state

Each it() test:
  1. Send QA question through warmed-up gateway (LTM + recall tool)
  2. Get response
  3. FactualityJudge scores against reference answer

afterAll:
  Tear down gateway
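
The lifecycle above can be sketched without any test runner. This is a minimal, dependency-free model of the phase ordering only: `Gateway`, `FakeGateway`, `runScenario`, and `containsJudge` are hypothetical names invented for illustration, the calls are synchronous for brevity, and the toy judge is a substring check, not the LLM-based FactualityJudge.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR.
interface Turn { role: "user" | "assistant"; content: string }
interface QA { question: string; reference: string }
interface Scenario { name: string; turns: Turn[]; questions: QA[] }

interface Gateway {
  replay(turn: Turn): void;    // step 2: feed one recorded turn through the gateway
  curate(): void;              // step 3: stands in for /lore:curate
  backfillEmbeddings(): void;  // step 4
  ask(question: string): string;
  stop(): void;
}

// A judge maps (answer, reference) to a score in [0, 1].
type Judge = (answer: string, reference: string) => number;

// In-memory stand-in that records call order and "remembers" curated turns.
class FakeGateway implements Gateway {
  readonly log: string[] = ["start"];
  private pending: string[] = [];
  private memory: string[] = [];

  replay(turn: Turn): void {
    this.log.push("replay");
    this.pending.push(turn.content);
  }
  curate(): void {
    this.log.push("curate");
    this.memory.push(...this.pending);
    this.pending = [];
  }
  backfillEmbeddings(): void {
    this.log.push("backfill");
  }
  ask(question: string): string {
    this.log.push("ask");
    // Naive recall: first remembered line that shares a longer word with the question.
    const words = question.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
    const hit = this.memory.find((m) => words.some((w) => m.toLowerCase().includes(w)));
    return hit ?? "unknown";
  }
  stop(): void {
    this.log.push("stop");
  }
}

function runScenario(gateway: Gateway, scenario: Scenario, judge: Judge): number[] {
  try {
    // beforeAll: warm the gateway up with the recorded session.
    for (const turn of scenario.turns) gateway.replay(turn);
    gateway.curate();
    gateway.backfillEmbeddings();
    // Each it(): one question, one judged score.
    return scenario.questions.map((qa) => judge(gateway.ask(qa.question), qa.reference));
  } finally {
    // afterAll: always tear down, even if a question throws.
    gateway.stop();
  }
}

// Toy judge: 1 if the reference appears verbatim in the answer, else 0.
const containsJudge: Judge = (answer, reference) =>
  answer.toLowerCase().includes(reference.toLowerCase()) ? 1 : 0;

const scenario: Scenario = {
  name: "demo",
  turns: [
    { role: "user", content: "The staging database runs on port 5433." },
    { role: "assistant", content: "Noted." },
  ],
  questions: [{ question: "Which port does the staging database use?", reference: "5433" }],
};

const gateway = new FakeGateway();
const scores = runScenario(gateway, scenario, containsJudge);
```

The point of the split is that the expensive warm-up runs once per scenario, while each question stays an independent, individually reportable case.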

New Files

vitest.evals.config.ts — Separate Vitest config for evals (600s test timeout, 30min hook timeout, single-fork, vitest-evals reporter)
packages/core/eval/lore-harness.ts — createHarness() wrapper around the Lore gateway with replayAndWarmup() for session replay
packages/core/eval/cm1.eval.ts — CM-1 scenario (400K inflated) as vitest-evals tests
packages/core/eval/mega-session.eval.ts — 2.3M mega-session as vitest-evals tests
package.json — bun run evals script
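
For orientation, a config with the settings listed above might look roughly like the following. This is a sketch, not the file from the PR: the timeouts, single-fork mode, and reporter path come from the description, while the `include` glob is an assumption and the `pool` / `poolOptions` names assume the Vitest 1.x to 3.x option layout.

```typescript
// vitest.evals.config.ts (illustrative sketch, not the committed file)
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // Assumed glob: only eval files, so the regular test run never picks them up.
    include: ["packages/core/eval/**/*.eval.ts"],
    testTimeout: 600_000, // 600 s per it() (one QA question)
    hookTimeout: 30 * 60_000, // 30 min for the beforeAll replay and warm-up
    pool: "forks",
    poolOptions: { forks: { singleFork: true } }, // run all eval files in one forked process
    reporters: ["vitest-evals/reporter"],
  },
});
```

With a file like this, `vitest run --config vitest.evals.config.ts` would run only the evals; the `bun run evals` script presumably wraps a command of that shape, though the PR description does not show it.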

What This Gets Us

Vitest test runner — standard vitest run output, filtering, parallelism
FactualityJudge — built-in LLM-as-judge (replaces custom judge.ts)
GitHub Actions reporter — summary + annotations via vitest-evals/reporter
Standard interface — each question is an it() test case

What Stays the Same

Scenario definitions (scenarios/*.ts) — unchanged
Gateway lifecycle — reused from harness.ts
Scripted interceptor — reused
Existing run.ts / harness.ts — coexist (can be removed later)

Tests

1752 pass, 0 fail (existing tests unaffected: .eval.ts files are included only by the eval config)
Typecheck clean

Adds vitest-evals integration alongside the existing eval harness:

- vitest.evals.config.ts: separate Vitest config for evals (long timeouts, single-fork, vitest-evals reporter)
- lore-harness.ts: createHarness() wrapper around the Lore gateway with replayAndWarmup() for session replay setup
- cm1.eval.ts: CM-1 400K scenario as vitest-evals tests
- mega-session.eval.ts: 2.3M mega-session as vitest-evals tests
- FactualityJudge for LLM-as-judge scoring
- 'bun run evals' script

Architecture: beforeAll replays session through gateway (builds Lore state), each it() sends a QA question through the warmed-up gateway, FactualityJudge scores against the reference answer.

BYK self-assigned this May 21, 2026

BYK force-pushed the feat-vitest-evals branch from df8cc98 to 10de926 Compare May 21, 2026 09:20

BYK merged commit 24b3d3e into main May 21, 2026
9 of 10 checks passed

BYK deleted the feat-vitest-evals branch May 21, 2026 09:22

craft-deployer Bot mentioned this pull request May 21, 2026 in publish: BYK/loreai@0.23.0 #448 (Closed, 6 tasks)

BYK merged 1 commit into main from feat-vitest-evals
