feat: add comprehensive eval suite for Lore's five key dimensions by BYK · Pull Request #369 · BYK/loreai

BYK · 2026-05-17T18:55:57Z

Summary

End-to-end eval framework measuring Lore's five key dimensions: context management, multi-session recall, user preference recall, cross-project learning, and cost. Includes 16 scenarios with 130 questions, 6 baselines, and full CI integration.

Infrastructure (`packages/core/eval/`)

- Two execution modes: fixture (deterministic, free, for CI regression) and live (real LLM calls, for quality benchmarks)
- LLM backend abstraction (llm-backend.ts): Anthropic (preferred), GitHub Models API (free fallback), or OpenAI, auto-detected from the environment, with retry/backoff
- Multi-metric rubric judge (judge.ts): 15 scoring criteria, 9 pre-built rubrics, 1-5 scale per criterion with weighted composites
- 6 baselines (baselines.ts): tail-window, compaction+tail, raw, lore context-only ablation, lore memory-only ablation, auto-memory (external Python sidecar)
- Gateway integration: live mode auto-starts an isolated Lore gateway, replays session transcripts to build Lore state (distillation, LTM, recall), then asks questions through the gateway
- Independent cost verification against Lore's internal tracker (cost-verifier.ts)
- CLI entry point (run.ts): dimension/baseline/model selection, JSONL output, summary reports
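
To make "1-5 scale per criterion with weighted composites" concrete, here is a minimal sketch of how such a composite could be computed. It is not taken from judge.ts: the `CriterionScore` type, the `compositeScore` function, and the criterion names and weights are all hypothetical, and the real judge may normalize differently.

```typescript
// Hypothetical shape for one judged criterion; not the actual judge.ts types.
type CriterionScore = { criterion: string; score: number; weight: number };

// Weighted mean of per-criterion scores, so the composite stays on the 1-5 scale.
function compositeScore(scores: CriterionScore[]): number {
  if (scores.length === 0) {
    throw new Error("no criteria to score");
  }
  let weightedSum = 0;
  let totalWeight = 0;
  for (const { criterion, score, weight } of scores) {
    if (score < 1 || score > 5) {
      throw new Error(`score out of range for ${criterion}: ${score}`);
    }
    if (weight <= 0) {
      throw new Error(`weight must be positive for ${criterion}`);
    }
    weightedSum += score * weight;
    totalWeight += weight;
  }
  return weightedSum / totalWeight;
}

// Illustrative criteria and weights only.
const exampleComposite = compositeScore([
  { criterion: "accuracy", score: 5, weight: 2 },
  { criterion: "completeness", score: 3, weight: 1 },
  { criterion: "specificity", score: 4, weight: 1 },
]);
// (5*2 + 3*1 + 4*1) / (2 + 1 + 1) = 17 / 4 = 4.25
```

Dividing by the total weight rather than the criterion count keeps composites comparable across rubrics that use different weightings.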

Scenarios (16 total, 130 questions)

| Dimension | Scenarios | Questions | Key Tests |
| --- | --- | --- | --- |
| Context Management | CM-1/2/3 | 35 | Early detail retention, tool output dedup, layer escalation |
| Multi-Session Recall | MSR-1/2/3 | 33 | Sequential dev, deep history, cross-model recall |
| Preference Recall | PR-1/2/3 | 22 | Explicit prefs, implicit patterns, preference evolution |
| Cross-Project | CP-1/2/3 | 14 | Gotcha transfer, architecture patterns, pref consistency |
| Cost | COST-1-5 | 26 | Tracking accuracy, short/long/multi-session cost |

CI (`.github/workflows/eval.yml`)

- Fixture mode on PRs touching core/gateway/eval code (deterministic, free)
- Live mode via workflow_dispatch or weekly schedule (requires ANTHROPIC_API_KEY secret)
- auto-memory baseline job (optional, continue-on-error)
- Results uploaded as artifacts
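
Since live mode depends on which credentials are present, the following sketch shows one way backend auto-detection and backoff could work. It is an illustration under stated assumptions, not the code in llm-backend.ts: only `ANTHROPIC_API_KEY` and the preference order (Anthropic, then GitHub Models, then OpenAI) come from this PR description; the `GITHUB_TOKEN` and `OPENAI_API_KEY` variable names, the function names, and the delay values are assumed.

```typescript
type Backend = "anthropic" | "github-models" | "openai";
type Env = Record<string, string | undefined>;

// Preference order from the PR description. GITHUB_TOKEN and OPENAI_API_KEY
// are assumed variable names, not confirmed by the source.
const CANDIDATES: ReadonlyArray<{ backend: Backend; envVar: string }> = [
  { backend: "anthropic", envVar: "ANTHROPIC_API_KEY" },
  { backend: "github-models", envVar: "GITHUB_TOKEN" },
  { backend: "openai", envVar: "OPENAI_API_KEY" },
];

// Returns the first backend whose credential is set and non-empty, or null.
function detectBackend(env: Env): Backend | null {
  for (const { backend, envVar } of CANDIDATES) {
    const value = env[envVar];
    if (value !== undefined && value.trim() !== "") {
      return backend;
    }
  }
  return null;
}

// Capped exponential backoff: baseMs, 2*baseMs, 4*baseMs, ... up to capMs.
// The default values are illustrative.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}
```

In a Node script this would be called as `detectBackend(process.env)`; taking the environment as a parameter keeps the function easy to exercise in fixture-style tests.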

First Live Results (Preference Recall, Sonnet 4.6)

| Scenario | Lore | Tail-Window | Notes |
| --- | --- | --- | --- |
| PR-1 (explicit prefs) | 3.6 | 4.9 | Captures ~60% of explicit preferences |
| PR-2 (implicit prefs) | 4.9 | 4.8 | Lore beats tail-window; LTM adds value |
| PR-3 (pref evolution) | 1.8 | 5.0 | Recalls old pref, not the update |

Key findings:

- Lore's LTM/recall adds real value for implicit behavioral patterns (PR-2: 4.9 vs 4.8)
- The curator captures ~60% of explicit preferences from short sessions, leaving room for improvement
- Preference evolution is the biggest gap: the curator doesn't supersede old entries when the user changes a preference
- Cost per preferences-only run: ~$0.42; full suite estimate: ~$5
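
The size of each gap is easier to see as a per-scenario delta. This short sketch recomputes Lore minus tail-window from the three result rows reported above; the `ResultRow` type and `scoreDelta` helper are illustrative names, not part of the eval suite.

```typescript
// Scores copied from the first live results (Preference Recall, Sonnet 4.6).
type ResultRow = { scenario: string; lore: number; tailWindow: number };

const liveResults: ResultRow[] = [
  { scenario: "PR-1", lore: 3.6, tailWindow: 4.9 },
  { scenario: "PR-2", lore: 4.9, tailWindow: 4.8 },
  { scenario: "PR-3", lore: 1.8, tailWindow: 5.0 },
];

// Lore minus tail-window, rounded to one decimal to avoid float noise.
function scoreDelta(row: ResultRow): number {
  return Math.round((row.lore - row.tailWindow) * 10) / 10;
}

const deltas = liveResults.map((row) => ({
  scenario: row.scenario,
  delta: scoreDelta(row),
}));

// The scenario where Lore trails tail-window the most.
const largestGap = deltas.reduce((worst, d) => (d.delta < worst.delta ? d : worst));

for (const { scenario, delta } of deltas) {
  console.log(`${scenario}: ${delta > 0 ? "+" : ""}${delta}`);
}
```

On these numbers the deltas are -1.3, +0.1, and -3.2, which matches the findings: a narrow win on implicit patterns and the largest deficit on preference evolution (PR-3).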

End-to-end eval framework measuring context management, multi-session recall, user preference recall, cross-project learning, and cost.

Infrastructure:
- Two execution modes: fixture (deterministic, CI) and live (real LLM calls)
- LLM backend abstraction: Anthropic, GitHub Models API, OpenAI
- Multi-metric rubric judge (1-5 scale per criterion, weighted composites)
- 6 baselines: tail-window, compaction, raw, context-only ablation, memory-only ablation, auto-mem0 (external Python sidecar)
- Independent cost verification against Lore's internal tracker
- CLI with dimension/baseline selection, JSONL output, summary reports

Scenarios (16 total, 130 questions):
- CM-1/2/3: long session retention, tool output dedup, layer escalation
- MSR-1/2/3: sequential development, deep history, cross-model recall
- PR-1/2/3: explicit prefs, implicit patterns, preference evolution
- CP-1/2/3: gotcha transfer, architecture patterns, pref consistency
- COST-1-5: tracking accuracy, short/long/multi-session cost, batch savings

CI:
- GitHub Actions workflow: fixture mode on PRs, live mode weekly
- GitHub Models API for free live evals in CI
- auto-mem0 baseline job with Python sidecar

BYK self-assigned this May 17, 2026

BYK force-pushed the feat-eval-suite branch 13 times, most recently from 2b6f943 to e44b1e6, May 17, 2026 21:44

BYK force-pushed the feat-eval-suite branch from e44b1e6 to 512adbd, May 17, 2026 22:04

BYK merged commit b4ec5a6 into main May 17, 2026
10 checks passed

BYK deleted the feat-eval-suite branch May 17, 2026 22:06

This was referenced May 18, 2026

fix: improve curator preference detection and evolution handling #372

Merged

feat: add temperature parameter to LLMClient interface #381

Closed

This was referenced May 21, 2026

publish: BYK/loreai@0.23.0 #439

Closed

publish: BYK/loreai@0.23.0 #448

Closed
