eval: add 2.3M-token mega-session scenario — Lore 4.0 vs Compaction 2.4 (+70%) #440
Merged
Extracts a real 5-day coding session (95 user turns, 3959 assistant turns, 2.37M tokens) from the Lore DB and uses it as an eval scenario with 20 questions targeting various depths: early (issue selection, first PR), mid (architectural decisions, design debates), and late (phase execution). No inflation needed — the session is already at mega-scale. Fixture stored as gzipped JSON (1.7MB). Also removes tail-window from default baselines and fixes compaction threshold to ~140K (was triggering at 80K).
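A gzipped JSON fixture like the one described above could be read along these lines. This is only a minimal sketch: the `SessionFixture` and `SessionTurn` shapes and the helper names are hypothetical, since the PR does not show the fixture schema or the loader in `mega-session.ts`. Only `gunzipSync`, `gzipSync`, and `readFileSync` are real Node.js APIs.

```typescript
import { readFileSync } from "node:fs";
import { gunzipSync, gzipSync } from "node:zlib";

// Hypothetical fixture shape: the real schema is not shown in this PR.
interface SessionTurn {
  role: "user" | "assistant";
  content: string;
}

interface SessionFixture {
  turns: SessionTurn[];
}

// Decompress a gzipped JSON buffer and parse it into the assumed shape.
function decodeFixture(gz: Uint8Array): SessionFixture {
  return JSON.parse(gunzipSync(gz).toString("utf8")) as SessionFixture;
}

// Read a .json.gz fixture from disk (path supplied by the caller).
function loadFixture(path: string): SessionFixture {
  return decodeFixture(readFileSync(path));
}

// Count turns per role, e.g. to sanity-check a fixture against the
// turn counts quoted in the PR description.
function countTurnsByRole(
  fixture: SessionFixture,
): Record<SessionTurn["role"], number> {
  const counts = { user: 0, assistant: 0 };
  for (const turn of fixture.turns) counts[turn.role] += 1;
  return counts;
}

// Round-trip a tiny in-memory sample instead of the real 1.7MB fixture.
const sample: SessionFixture = {
  turns: [
    { role: "user", content: "Pick an issue to work on." },
    { role: "assistant", content: "Looking at open issues." },
    { role: "assistant", content: "Opened the first PR." },
  ],
};
const roundTripped = decodeFixture(gzipSync(JSON.stringify(sample)));
console.log(countTurnsByRole(roundTripped));
```

Keeping decoding separate from file access lets the parsing path be exercised with a small in-memory sample, without committing a second large fixture.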
BYK added a commit that referenced this pull request on May 21, 2026
…ry (#442)

## Summary

Updates the landing page copy to reflect the mega-session benchmark results and stronger value proposition.

## Changes

### Hero section

- Description: emphasizes "crystal-clear memory across sessions lasting days, hundreds of turns, any LLM provider"
- Chip: "400K+ Token Sessions" → "Sessions Lasting Days"

### Stats strip

Already updated in #440: +70% vs compaction, 13/20 perfect scores, 2.3M+ tokens tested.

### "The Problem" section

- Compaction step: now cites the real 2.3M-token benchmark: "compaction reduces 2.3 million tokens to an 11K summary, scoring 2.4/5. Lore scores 4.0/5."

### "The Solution" section

- Recall step: "13 out of 20 perfect recall scores where compaction managed 5"

### Feature cards

- Cost card: "Infinite sessions, lower cost" → "Sessions as long as you want", with body "Work for days, hundreds of turns, millions of tokens — memory stays sharp."
- Compatibility card: "Works with any provider" → "Portable memory, any provider", with body "Switch providers, switch tools, switch machines — your memory travels with you."

### Ticker

- Lead with "2.3M tokens, 5 days, crystal-clear recall" + scores
- Added "Any provider, any tool — Portable memory that travels with you"
- Replaced "Compaction destroys details" (negative) with real benchmark numbers (positive proof)
Summary
Adds a real 2.3M-token eval scenario extracted from a 5-day getsentry/cli refactoring session. At this extreme scale, Lore demonstrates a +70% advantage over classical compaction.
Results
At 2.3M tokens, compaction reduces the entire conversation to ~11K tokens of summary (200x compression). This destroys early-session details. Lore preserves them through distillation (17-21 distillations, ~10K tokens) + 64K raw tail + searchable temporal archive via recall.
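The figures above can be checked with a few lines of arithmetic. This is a sketch using the rounded numbers quoted in this PR (2.37M session tokens, ~11K summary tokens, scores of 4.0 and 2.4); the function names are illustrative only. With these rounded scores the relative gain works out to roughly 67%, so the headline +70% is presumably a further rounding or is based on unrounded averages, which is an assumption, not something the PR states.

```typescript
// Figures quoted in the PR text (all rounded).
const sessionTokens = 2_370_000;
const compactionSummaryTokens = 11_000;
const loreScore = 4.0;
const compactionScore = 2.4;

// How many times smaller the compacted summary is than the full session.
function compressionRatio(original: number, compressed: number): number {
  return original / compressed;
}

// Relative improvement of a candidate score over a baseline, in percent.
function relativeGainPercent(candidate: number, baseline: number): number {
  return ((candidate - baseline) / baseline) * 100;
}

const ratio = compressionRatio(sessionTokens, compactionSummaryTokens);
const gain = relativeGainPercent(loreScore, compactionScore);

console.log(`compression: about ${ratio.toFixed(0)}x`);
console.log(`relative gain: about ${gain.toFixed(1)}%`);
```

A ratio a little above 200 is consistent with the "200x compression" figure in the text.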
Scenario
buildCommand migration), multi-phase plan, code reviews, design debates
Code Changes
- packages/core/eval/scenarios/mega-session.ts: scenario module with 20 questions
- packages/core/eval/scenarios/cli-refactor-session.json.gz: compressed session fixture
- packages/core/eval/harness.ts: register mega scenario in context dimension
- packages/core/eval/baselines.ts: fix compaction chunking for >1M token prefixes, fix threshold to ~140K
- packages/core/eval/run.ts: remove tail-window from default baselines
Tests