
feat: add comprehensive eval suite for Lore's five key dimensions #369

Merged
BYK merged 1 commit into main from feat-eval-suite on May 17, 2026


BYK commented May 17, 2026

Summary

End-to-end eval framework measuring Lore's five key dimensions: context management, multi-session recall, user preference recall, cross-project learning, and cost. Includes 16 scenarios with 130 questions, 6 baselines, and full CI integration.

Infrastructure (packages/core/eval/)

  • Two execution modes: fixture (deterministic, free, for CI regression) and live (real LLM calls, for quality benchmarks)
  • LLM backend abstraction (llm-backend.ts): Anthropic (preferred), GitHub Models API (free fallback), OpenAI — auto-detected from environment with proper retry/backoff
  • Multi-metric rubric judge (judge.ts): 15 scoring criteria, 9 pre-built rubrics, 1-5 scale per criterion with weighted composites
  • 6 baselines (baselines.ts): tail-window, compaction+tail, raw, lore context-only ablation, lore memory-only ablation, auto-memory (external Python sidecar)
  • Gateway integration: live mode auto-starts an isolated Lore gateway, replays session transcripts to build Lore state (distillation, LTM, recall), then asks questions through the gateway
  • Independent cost verification (cost-verifier.ts)
  • CLI entry point (run.ts): dimension/baseline/model selection, JSONL output, summary reports
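
The weighted composite used by the rubric judge can be pictured with a small sketch. This is not the code in judge.ts: the `CriterionScore` shape, the `weightedComposite` name, and the criterion names and weights in the example are all assumptions made for illustration. It only shows how per-criterion scores on a 1-5 scale could be folded into one weighted mean.

```typescript
// Hypothetical shape for one judged criterion; not taken from judge.ts.
interface CriterionScore {
  criterion: string;
  score: number; // judge's score on the 1-5 scale
  weight: number; // relative importance, must be positive
}

// Weighted mean of the per-criterion scores. The result stays on the 1-5 scale.
function weightedComposite(scores: CriterionScore[]): number {
  if (scores.length === 0) {
    throw new Error("at least one criterion is required");
  }
  let weightedSum = 0;
  let totalWeight = 0;
  for (const { criterion, score, weight } of scores) {
    if (score < 1 || score > 5) {
      throw new RangeError(`${criterion}: score ${score} is outside 1-5`);
    }
    if (weight <= 0) {
      throw new RangeError(`${criterion}: weight must be positive`);
    }
    weightedSum += score * weight;
    totalWeight += weight;
  }
  return weightedSum / totalWeight;
}

// Illustrative rubric: criterion names and weights are made up.
const exampleScores: CriterionScore[] = [
  { criterion: "accuracy", score: 5, weight: 2 },
  { criterion: "completeness", score: 3, weight: 1 },
  { criterion: "specificity", score: 4, weight: 1 },
];

const composite = weightedComposite(exampleScores); // (10 + 3 + 4) / 4 = 4.25
console.log(composite);
```

Normalizing by the total weight keeps composites comparable across rubrics that use different numbers of criteria.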

Scenarios (16 total, 130 questions)

| Dimension | Scenarios | Questions | Key Tests |
|---|---|---|---|
| Context Management | CM-1/2/3 | 35 | Early detail retention, tool output dedup, layer escalation |
| Multi-Session Recall | MSR-1/2/3 | 33 | Sequential dev, deep history, cross-model recall |
| Preference Recall | PR-1/2/3 | 22 | Explicit prefs, implicit patterns, preference evolution |
| Cross-Project | CP-1/2/3 | 14 | Gotcha transfer, architecture patterns, pref consistency |
| Cost | COST-1-5 | 26 | Tracking accuracy, short/long/multi-session cost |
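
The tracking-accuracy check in the Cost dimension comes down to recomputing spend from raw token usage and comparing it with what Lore's tracker reports. The sketch below is a guess at that idea, not the contents of cost-verifier.ts: the `Usage` and `Pricing` types, both function names, the 1% tolerance, and the prices in the example are placeholders chosen for illustration.

```typescript
// Hypothetical token usage for a single LLM call.
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical price table, in USD per million tokens.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

// Recompute total cost from raw usage, independent of any internal tracker.
function computeCost(calls: Usage[], pricing: Pricing): number {
  let total = 0;
  for (const call of calls) {
    total += (call.inputTokens / 1_000_000) * pricing.inputPerMTok;
    total += (call.outputTokens / 1_000_000) * pricing.outputPerMTok;
  }
  return total;
}

// True when the tracked figure is within a relative tolerance of the recomputed one.
function costsAgree(independent: number, tracked: number, relTol = 0.01): boolean {
  if (independent === 0) return tracked === 0;
  return Math.abs(independent - tracked) / independent <= relTol;
}

// Placeholder prices, not real model pricing.
const placeholderPricing: Pricing = { inputPerMTok: 1, outputPerMTok: 2 };
const recomputed = computeCost(
  [{ inputTokens: 1_000_000, outputTokens: 500_000 }],
  placeholderPricing,
); // 1 + 1 = 2

console.log(costsAgree(recomputed, 2.01)); // 0.5% off: within tolerance
console.log(costsAgree(recomputed, 2.5)); // 25% off: flagged
```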

CI (.github/workflows/eval.yml)

  • Fixture mode on PRs touching core/gateway/eval code (deterministic, free)
  • Live mode via workflow_dispatch or weekly schedule (requires ANTHROPIC_API_KEY secret)
  • auto-memory baseline job (optional, continue-on-error)
  • Results uploaded as artifacts
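
Since live mode depends on which credentials are present, backend auto-detection with retry/backoff might look roughly like the sketch below. It is not the code in llm-backend.ts. Only `ANTHROPIC_API_KEY` appears in this PR; the `GITHUB_TOKEN` and `OPENAI_API_KEY` variable names, the priority order after Anthropic, the function names, and the backoff parameters are assumptions.

```typescript
type BackendName = "anthropic" | "github-models" | "openai";

// Checked in priority order. Only ANTHROPIC_API_KEY is named in the PR;
// the other two variable names are assumptions for illustration.
const BACKEND_ENV: ReadonlyArray<[BackendName, string]> = [
  ["anthropic", "ANTHROPIC_API_KEY"],
  ["github-models", "GITHUB_TOKEN"],
  ["openai", "OPENAI_API_KEY"],
];

// Return the first backend whose credential is set and non-empty, if any.
function detectBackend(
  env: Record<string, string | undefined>,
): BackendName | undefined {
  for (const [name, key] of BACKEND_ENV) {
    if (env[key]) return name;
  }
  return undefined;
}

// Retry an async call with exponential backoff: baseDelayMs, 2x, 4x, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}
```

In a Node process, `detectBackend(process.env)` would select the backend once at startup, and each request could then be wrapped in `withRetry`.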

First Live Results (Preference Recall, Sonnet 4.6)

| Scenario | Lore | Tail-Window | Notes |
|---|---|---|---|
| PR-1 (explicit prefs) | 3.6 | 4.9 | Captures ~60% of explicit preferences |
| PR-2 (implicit prefs) | 4.9 | 4.8 | Lore beats tail-window; LTM adds value |
| PR-3 (pref evolution) | 1.8 | 5.0 | Recalls old pref, not the update |

Key findings:

  • Lore's LTM/recall narrowly edges out tail-window on implicit behavioral patterns (PR-2: 4.9 vs 4.8), which suggests it adds value there
  • Curator captures ~60% of explicit preferences from short sessions (PR-1: 3.6 vs 4.9), so there is room for improvement
  • Preference evolution is the biggest gap (PR-3: 1.8 vs 5.0): the curator doesn't supersede old entries when the user changes a preference
  • Cost per preferences-only run: ~$0.42; full suite estimate: ~$5
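
To make the preference-evolution gap concrete, here is a minimal sketch of what "superseding" an old entry could mean. It does not describe Lore's curator or its storage format. The `PreferenceEntry` shape, the `latestPreferences` function, and the example preferences are hypothetical.

```typescript
// Hypothetical stored preference; not Lore's actual LTM schema.
interface PreferenceEntry {
  key: string; // what the preference is about, e.g. "indentation"
  value: string;
  recordedAt: number; // session index or timestamp; larger is newer
}

// Keep only the newest entry per key, so a later statement replaces an earlier one.
function latestPreferences(
  entries: PreferenceEntry[],
): Map<string, PreferenceEntry> {
  const current = new Map<string, PreferenceEntry>();
  for (const entry of entries) {
    const existing = current.get(entry.key);
    if (!existing || entry.recordedAt >= existing.recordedAt) {
      current.set(entry.key, entry);
    }
  }
  return current;
}

// Illustrative history: the user changes their indentation preference in session 3.
const preferenceHistory: PreferenceEntry[] = [
  { key: "indentation", value: "tabs", recordedAt: 1 },
  { key: "test-runner", value: "vitest", recordedAt: 1 },
  { key: "indentation", value: "2 spaces", recordedAt: 3 },
];

const resolved = latestPreferences(preferenceHistory);
console.log(resolved.get("indentation")?.value); // "2 spaces"
```

A store that returned both indentation entries without this resolution step would be free to surface the stale one, which matches the PR-3 symptom of recalling the old preference rather than the update.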

BYK self-assigned this on May 17, 2026
BYK force-pushed the feat-eval-suite branch 13 times, most recently from 2b6f943 to e44b1e6, on May 17, 2026 at 21:44
End-to-end eval framework measuring context management, multi-session
recall, user preference recall, cross-project learning, and cost.

Infrastructure:
- Two execution modes: fixture (deterministic, CI) and live (real LLM calls)
- LLM backend abstraction: Anthropic, GitHub Models API, OpenAI
- Multi-metric rubric judge (1-5 scale per criterion, weighted composites)
- 6 baselines: tail-window, compaction, raw, context-only ablation,
  memory-only ablation, auto-mem0 (external Python sidecar)
- Independent cost verification against Lore's internal tracker
- CLI with dimension/baseline selection, JSONL output, summary reports

Scenarios (16 total, 130 questions):
- CM-1/2/3: long session retention, tool output dedup, layer escalation
- MSR-1/2/3: sequential development, deep history, cross-model recall
- PR-1/2/3: explicit prefs, implicit patterns, preference evolution
- CP-1/2/3: gotcha transfer, architecture patterns, pref consistency
- COST-1-5: tracking accuracy, short/long/multi-session cost, batch savings

CI:
- GitHub Actions workflow: fixture mode on PRs, live mode weekly
- GitHub Models API for free live evals in CI
- auto-mem0 baseline job with Python sidecar
BYK force-pushed the feat-eval-suite branch from e44b1e6 to 512adbd on May 17, 2026 at 22:04
BYK merged commit b4ec5a6 into main on May 17, 2026
10 checks passed
BYK deleted the feat-eval-suite branch on May 17, 2026 at 22:06
This was referenced May 21, 2026