
eval: add 2.3M-token mega-session scenario — Lore 4.0 vs Compaction 2.4 (+70%)#440

Merged: BYK merged 1 commit into main from eval-mega-session on May 21, 2026


@BYK BYK commented May 21, 2026

Summary

Adds a real 2.3M-token eval scenario extracted from a 5-day getsentry/cli refactoring session. At this extreme scale, Lore demonstrates a +70% advantage over classical compaction.

Results

| Metric | Lore | Compaction | Delta |
| --- | --- | --- | --- |
| Overall | 4.0/5 | 2.4/5 | +70% |
| Easy (late-session) | 4.0 | 2.4 | +67% |
| Medium (mid-session) | 3.9 | 3.0 | +29% |
| Hard (early-session) | 4.1 | 1.8 | +136% |
| Perfect scores (5.0) | 13/20 | 5/20 | 2.6x |
| Passing (≥4.0) | 14/20 | 5/20 | 2.8x |
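For readers who want to sanity-check the Delta column, here is a minimal sketch of the arithmetic. The function names are illustrative and not taken from the eval harness. Recomputing from the rounded scores in the table gives roughly +67%, +30%, and +128% for the overall, medium, and hard rows, so the reported +70%, +29%, and +136% were presumably derived from unrounded averages (an assumption; the PR does not say).

```typescript
// Relative improvement of Lore over compaction, as a percentage.
// Hypothetical helper for illustration; not part of the eval harness.
function relativeDeltaPct(lore: number, compaction: number): number {
  return ((lore - compaction) / compaction) * 100;
}

// Plain ratio, used for the count-style rows (perfect / passing).
function ratio(lore: number, compaction: number): number {
  return lore / compaction;
}

// Rounded scores copied from the results table.
const scoreRows = [
  { metric: "Overall", lore: 4.0, compaction: 2.4 },
  { metric: "Easy", lore: 4.0, compaction: 2.4 },
  { metric: "Medium", lore: 3.9, compaction: 3.0 },
  { metric: "Hard", lore: 4.1, compaction: 1.8 },
];

for (const row of scoreRows) {
  const pct = relativeDeltaPct(row.lore, row.compaction).toFixed(0);
  console.log(`${row.metric}: +${pct}% (from rounded scores)`);
}

console.log(`Perfect scores: ${ratio(13, 5).toFixed(1)}x`);
console.log(`Passing: ${ratio(14, 5).toFixed(1)}x`);
```

The count-based rows (13/20 vs 5/20 and 14/20 vs 5/20) reproduce exactly as 2.6x and 2.8x because they do not depend on rounding.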

At 2.3M tokens, compaction reduces the entire conversation to ~11K tokens of summary (roughly 200x compression), which destroys early-session details. Lore preserves them by combining distillation (17-21 distillations, ~10K tokens), a 64K-token raw tail, and a searchable temporal archive accessed via recall.
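To make the numbers above concrete, the following is a rough sketch of the compression arithmetic and of what a "raw tail" selection could look like. It is not Lore's implementation: the `Message` shape, `takeRawTail`, and the per-message token counts are all assumptions made for illustration.

```typescript
// Hypothetical message shape; the real session format is not shown in this PR.
interface Message {
  role: "user" | "assistant";
  tokens: number;
  text: string;
}

// How many original tokens each summary token has to stand in for.
function compressionRatio(originalTokens: number, summaryTokens: number): number {
  return originalTokens / summaryTokens;
}

// Keep the newest messages, in chronological order, whose combined
// token count fits within the budget (e.g. a 64K raw tail).
function takeRawTail(messages: Message[], budgetTokens: number): Message[] {
  const tail: Message[] = [];
  let used = 0;
  for (const message of [...messages].reverse()) {
    if (used + message.tokens > budgetTokens) break;
    used += message.tokens;
    tail.unshift(message);
  }
  return tail;
}

// 2.3M tokens squeezed into ~11K of summary is a little over 200x.
console.log(compressionRatio(2_300_000, 11_000).toFixed(0) + "x");

// Distillations (~10K) plus the raw tail (64K) stay far below the full session.
console.log(`Approximate Lore context: ${10_000 + 64_000} tokens`);
```

The point of the sketch is the contrast: compaction has one ~11K summary to answer every question, while the approach described above keeps recent turns verbatim and relies on distillations and recall for older material.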

Scenario

  • Source: Real getsentry/cli session (ses_33198e726ffeDyEZ4ZoowIUDJO)
  • Duration: 5 days (Mar 8-12, 2026), 95 user turns, 3959 assistant turns
  • Content: Issue triage, 7+ PRs (#370-394), architectural decisions (buildCommand migration), multi-phase plan, code reviews, design debates
  • 20 questions across easy (5), medium (7), hard (8) — targeting issue selection, PR details, architectural decisions, test counts, code cleanup reasoning
  • Fixture: 1.7MB gzipped JSON
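As an illustration of how a gzipped JSON fixture and a tiered question set like the one above might be handled, here is a minimal sketch using Node's built-in `node:zlib`. The `Fixture` and `Question` shapes, `decodeFixture`, and `countByDifficulty` are assumptions for this example; the actual schema in `mega-session.ts` is not shown in this PR.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

type Difficulty = "easy" | "medium" | "hard";

// Hypothetical shapes; the real fixture schema may differ.
interface Question {
  id: string;
  difficulty: Difficulty;
  prompt: string;
}

interface Fixture {
  sessionId: string;
  messages: { role: string; content: string }[];
}

// Decompress a gzipped JSON payload and parse it.
function decodeFixture(gz: Uint8Array): Fixture {
  const json = gunzipSync(gz).toString("utf8");
  return JSON.parse(json) as Fixture;
}

// Tally questions per difficulty tier (the PR describes 5 / 7 / 8).
function countByDifficulty(questions: Question[]): Record<Difficulty, number> {
  const counts: Record<Difficulty, number> = { easy: 0, medium: 0, hard: 0 };
  for (const q of questions) counts[q.difficulty] += 1;
  return counts;
}

// Round-trip a tiny in-memory sample instead of reading the real 1.7MB file.
const sampleFixture: Fixture = {
  sessionId: "ses_example",
  messages: [{ role: "user", content: "hello" }],
};
const decoded = decodeFixture(gzipSync(JSON.stringify(sampleFixture)));
console.log(decoded.sessionId, decoded.messages.length);
```

With a file on disk, the same function could be fed the bytes from `readFileSync(path)` in `node:fs`.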

Code Changes

  • packages/core/eval/scenarios/mega-session.ts — scenario module with 20 questions
  • packages/core/eval/scenarios/cli-refactor-session.json.gz — compressed session fixture
  • packages/core/eval/harness.ts — register mega scenario in context dimension
  • packages/core/eval/baselines.ts — fix compaction chunking for >1M token prefixes, raise the compaction threshold to ~140K tokens (it was triggering at 80K)
  • packages/core/eval/run.ts — remove tail-window from default baselines
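The baselines change mentions a compaction threshold and chunking for very large prefixes. The sketch below shows one plausible way such logic could be structured; it is not the code in `baselines.ts`. `shouldCompact`, `chunkByTokens`, and the 128K chunk size are hypothetical, and only the ~140K threshold comes from the PR.

```typescript
// Threshold from the PR description (~140K tokens).
const COMPACTION_THRESHOLD_TOKENS = 140_000;

// Decide whether the running context is large enough to trigger compaction.
function shouldCompact(
  contextTokens: number,
  threshold: number = COMPACTION_THRESHOLD_TOKENS,
): boolean {
  return contextTokens >= threshold;
}

// Greedily group per-message token counts into chunks that each fit a
// summarizer's window. A single oversized message gets a chunk of its own.
function chunkByTokens(tokenCounts: number[], maxChunkTokens: number): number[][] {
  const chunks: number[][] = [];
  let current: number[] = [];
  let used = 0;
  for (const tokens of tokenCounts) {
    if (current.length > 0 && used + tokens > maxChunkTokens) {
      chunks.push(current);
      current = [];
      used = 0;
    }
    current.push(tokens);
    used += tokens;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// A >1M-token prefix cannot be summarized in one call, so it is split first.
// The 128K window here is an assumed value for illustration.
const prefix = new Array<number>(300).fill(4_000); // 1.2M tokens in total
const prefixChunks = chunkByTokens(prefix, 128_000);
console.log(shouldCompact(80_000), shouldCompact(150_000), prefixChunks.length);
```

Each chunk would then be summarized separately and the partial summaries merged, which is the usual way to compact a prefix larger than the summarizing model's window.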

Tests

  • Typecheck clean across all 4 packages

@BYK BYK self-assigned this May 21, 2026
Extracts a real 5-day coding session (95 user turns, 3959 assistant turns,
2.37M tokens) from the Lore DB and uses it as an eval scenario with 20
questions targeting various depths: early (issue selection, first PR),
mid (architectural decisions, design debates), and late (phase execution).

No inflation needed — the session is already at mega-scale.
Fixture stored as gzipped JSON (1.7MB).

Also removes tail-window from default baselines and fixes compaction
threshold to ~140K (was triggering at 80K).
@BYK BYK force-pushed the eval-mega-session branch from 133dc0a to 4f2b864 Compare May 21, 2026 08:42
@BYK BYK merged commit 6d650e5 into main May 21, 2026
9 of 10 checks passed
@BYK BYK deleted the eval-mega-session branch May 21, 2026 09:06
BYK added a commit that referenced this pull request May 21, 2026
…ry (#442)

## Summary

Updates the landing page copy to reflect the mega-session benchmark
results and stronger value proposition.

## Changes

### Hero section
- Description: emphasizes "crystal-clear memory across sessions lasting
days, hundreds of turns, any LLM provider"
- Chip: "400K+ Token Sessions" → "Sessions Lasting Days"

### Stats strip
Already updated in #440: +70% vs compaction, 13/20 perfect scores, 2.3M+
tokens tested.

### "The Problem" section
- Compaction step: now cites the real 2.3M-token benchmark — "compaction
reduces 2.3 million tokens to an 11K summary, scoring 2.4/5. Lore scores
4.0/5."

### "The Solution" section
- Recall step: "13 out of 20 perfect recall scores where compaction
managed 5"

### Feature cards
- Cost card: "Infinite sessions, lower cost" → "Sessions as long as you
want" — "Work for days, hundreds of turns, millions of tokens — memory
stays sharp."
- Compatibility card: "Works with any provider" → "Portable memory, any
provider" — "Switch providers, switch tools, switch machines — your
memory travels with you."

### Ticker
- Lead with "2.3M tokens, 5 days, crystal-clear recall" + scores
- Added "Any provider, any tool — Portable memory that travels with you"
- Replaced "Compaction destroys details" (negative) with real benchmark
numbers (positive proof)
@craft-deployer craft-deployer Bot mentioned this pull request May 21, 2026
6 tasks