BYK · May 21, 2026
diff --git a/.lore.md b/.lore.md
@@ -33,9 +33,6 @@
 <!-- lore:019e1c27-967c-7eb4-bd0e-afb195823970 -->
 * **Bun NAPI crash on process.exit() — use safeExit() via libc \_exit()**: Bun NAPI crash on process.exit() with fastembed — use safeExit(): Loading fastembed (onnxruntime NAPI bindings) causes a C++ panic on \`process.exit()\` because Bun runs NAPI teardown destructors that throw. Fix: \`packages/gateway/src/cli/exit.ts\` exports \`safeExit(code)\` — uses \`\_exit()\` from libc via \`bun:ffi\` under Bun, falls back to \`process.exit()\` under Node.js. All gateway exit paths must use \`safeExit()\`. Do NOT call \`embedding.resetProvider()\` in test teardown \`resetPipelineState()\` — move \`resetProvider()\` to \`shutdown()\` in \`start.ts\` only. \`resetPipelineState()\` must preserve the 'fastembed unavailable' cached state.
 
-<!-- lore:019e47ac-32a9-7d38-8f6f-b6c69d35baf5 -->
-* **Eval QA session contamination: each QA question creates a new session and stores temporal messages**: Eval QA session contamination: each \`askQuestionViaGateway()\` call sends NO session headers → Tier 3 fingerprint creates a brand-new session per QA question. \`postResponse()\` stores QA question text as temporal messages. Recall with default \`scope: 'all'\` searches ALL sessions in the project, so prior QA question text matches recall queries better than actual replay content. Fix: add \`X-Lore-No-Store: true\` header support in \`postResponse()\` (pipeline.ts ~line 1966) to gate both \`temporal.store()\` calls and \`scheduleBackgroundWork()\`. Pass this header from \`askQuestionViaGateway()\`. This is a legitimate product feature (read-only gateway requests), not eval gaming.
-
 <!-- lore:019e2b12-6ea6-76dc-ab7a-a1532c60b312 -->
 * **git remote -v in hosted gateway — skip when header present, never run with client-controlled cwd**: \`LORE\_HOSTED\_MODE=1\` makes all FS-touching functions no-op: \`getGitRemote()\` returns null, \`config.load()\` skips \`.lore.json\`, agents-file/lat-reader/knowledge-watcher are no-ops. Activation: \`lore start\` (headless) enables hosted mode by default; opt-out via \`--local\` or \`LORE\_HOSTED\_MODE=0\`. \`lore run\` is always local. Flag set in \`initIfNeeded()\` from \`GatewayConfig.hostedMode\`. Never run \`git remote -v\` with client-controlled cwd. \`LORE\_REMOTE\_URL\` + local CLI: \`lore run\`/\`lore start\` skips local gateway and proxies to remote. Local CLI injects \`X-Lore-Git-Remote\`; remote gateway trusts it. CLI-less/SaaS: \`ANTHROPIC\_CUSTOM\_HEADERS\` requires a local \`lore\` CLI process — pure SaaS alternative not yet implemented.
 
@@ -77,16 +74,13 @@
 * **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled — the \`setInterval\` pattern is the prescribed fix, not existing behavior. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit. Session ground-truth: cache entries are never auto-evicted and \`prune()\` is never scheduled in current code — do not assert otherwise.
 
 <!-- lore:019e47b2-9bf3-738e-b774-efeea35399b5 -->
-* **Always investigate root causes by requesting systematic code-path analysis across multiple specific files**: When encountering unexpected system behavior (wrong scores, missing data, contamination), the user consistently requests deep investigation across multiple specific files simultaneously rather than iterative single-file exploration. They pre-identify candidate explanations and specific areas to investigate (often 3-6 numbered items), name exact files and functions to examine, and expect the assistant to trace complete execution paths end-to-end. The pattern applies to eval/pipeline debugging in the Lore system and likely generalizes to any complex multi-file debugging scenario. Always read all named files upfront, trace the full call chain, and report findings per-area rather than asking clarifying questions first.
-
-<!-- lore:019e4422-5b29-77a8-8956-488233ef16a4 -->
-* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review, investigation & workflow standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. File-by-file, skeptical; Critical+Medium fixed before merge, Low tolerated. (2) Investigation: read actual source, trace full execution paths, enumerate 2-4 candidate explanations, report confirmed/falsified verdict with line numbers. Demand concrete metrics before accepting fixes. (3) PR discipline: critical self-review before merge, fix all criticals, CI green, amend+force-push. Resolve \`.lore.md\` rebase conflicts with \`--ours\`. After merge, pull main before follow-up work. (4) Planning: write plan file, wait for explicit approval, then execute. Pull from origin/main before any exploration or edits. (5) After bug fix: add tests (4-6 edge cases) in dedicated file referencing issue number. (6) Sentry IDs start with \`LOREAI-GATEWAY-\`. (7) Run lint, typecheck, full test suite before committing. (8) Present structured fix plan before implementation; wait for explicit approval. Never re-propose explicitly rejected approaches. Always include migration versioning \[truncated — entry too long]
+* **Always investigate root causes by requesting systematic code-path analysis across multiple specific files**: (preference) When encountering unexpected system behavior, pre-identify 3-6 candidate explanations with exact files and functions, read all named files upfront, trace the full call chain end-to-end, and report findings per-area rather than asking clarifying questions first.
 
 <!-- lore:019e44c8-4e3f-7835-972f-02ed2033a842 -->
-* **Always request worker tests with a consistent 7-case spec covering compute, missing-record, cleanup retention, and sync scenarios**: Worker test files follow a consistent 7-case spec: (1) compute job — DB lookup + update, (2) missing record — skip without throw, (3) cleanup — hard-delete records archived >30 days, (4) cleanup — preserve recently archived records, (5) sync — process a batch, (6) sync — skip missing records, (7) sync — respect dryRun flag. Tests mock DB and Redis. Use Vitest project-wide (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai+ts-node May 2026 — 312ms vs 30s startup). Use kebab-case file naming.
+* **Always request worker tests with a consistent 7-case spec covering compute, missing-record, cleanup retention, and sync scenarios**: (preference) Worker test files follow a consistent 7-case spec: (1) compute job — DB lookup + update, (2) missing record — skip without throw, (3) cleanup — hard-delete records archived >30 days, (4) cleanup — preserve recently archived records, (5) sync — process a batch, (6) sync — skip missing records, (7) sync — respect dryRun flag. Tests mock DB and Redis. Use Vitest project-wide (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai+ts-node May 2026 — 312ms vs 30s startup). Use kebab-case file naming.
 
 <!-- lore:019e3cd7-97d3-7053-8f02-bb13d727662e -->
-* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args; silently fails. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). Auto-extracts protected keywords from question+referenceAnswer. 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass (up to 4) LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). \`QA\_SYSTEM\` is neutral. Post-replay embedding backfill runs before QA phase. Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\` header. Never accept eval-gaming fixes.
+* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval system: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\` header. Never accept eval-gaming fixes. Eval table consistency: per-difficulty averages must match overall average. Non-deterministic LLM output causes eval variance: re-run before concluding regression. Post-replay embedding backfill runs before QA phase.
 
 <!-- lore:019e2168-2fa4-77bd-a557-9d6dbcb40d81 -->
 * **Prefer WASM backend over native onnxruntime-node for compiled binaries**: WASM backend for Bun \`--compile\` binaries with transformers.js: \`binaryExternalsPlugin\` in esbuild redirects \`onnxruntime-node\` → \`onnxruntime-web\` via \`onResolve\` (static imports only — does NOT redirect dynamic \`import()\` calls) and patches transformers.js CDN fallback via \`onLoad\` to read \`wasmPaths\` from \`globalThis.\_\_LORE\_VENDOR\_WASM\_PATHS\_\_\` (object form \`{ mjs, wasm }\` with exact hashed \`$bunfs\` filenames — directory strings fail). WASM files embedded as Bun \`{ type: 'file' }\` assets. For npm/CJS builds, \`onnxruntime-node\` stays external. WASM is ~2x faster on batches than native. Importing \`onnxruntime-web\` explicitly alongside the redirect creates two ort instances — 'cannot register backend cpu using priority 10' error.
diff --git a/README.md b/README.md
@@ -326,31 +326,32 @@ Lore re-scans the `lat.md/` directory periodically (on session idle), so changes
 
 ## Eval results
 
-At 400K tokens (realistic coding session length), Lore significantly outperforms the standard tail-window approach across both context retention and preference recall:
+At 400K tokens (realistic coding session length), Lore outperforms standard compaction — the approach used by Claude Code, Codex, and other tools that summarize older context when the conversation grows too long:
 
 ### Context retention (400K tokens)
 
-| What's tested | Lore | Tail-window | Compaction | Lore vs TW |
-|---|---|---|---|---|
-| Easy (late-session details) | **5.0**/5 | 4.7/5 | 4.7/5 | +6% |
-| Medium (mid-session details) | **4.1**/5 | 1.3/5 | 3.9/5 | +215% |
-| Hard (early-session details) | **4.8**/5 | 1.4/5 | 4.1/5 | +243% |
-| **Average across context** | **4.6**/5 | 2.6/5 | 4.1/5 | **+77%** |
+| What's tested | Lore | Compaction | Lore vs Compaction |
+|---|---|---|---|
+| Easy (late-session details) | 4.7/5 | **4.8**/5 | −2% |
+| Medium (mid-session details) | **4.8**/5 | 4.0/5 | +19% |
+| Hard (early-session details) | **4.9**/5 | 4.7/5 | +5% |
+| **Average** | **4.8**/5 | 4.5/5 | **+7%** |
+| **Perfect scores (5.0)** | **12/15** | 9/15 | — |
 
-*Lore scores are averaged across multiple runs at 400K tokens. Tail-window and compaction baselines are from a prior eval run with the same scenarios. Tail-window drops early-session details entirely; Lore's distillation + recall preserves them — including decision alternatives, exact error messages, and debugging hypotheses.*
+*Compaction baseline: multi-pass LLM summarization matching Claude Code's auto-compact behavior (~140K threshold, 2-3 cycles at 400K tokens). Scored by LLM-as-judge on a 1–5 scale. Lore's advantage is largest on medium-difficulty questions — mid-session details like decision alternatives, exact error messages, and rejected approaches that compaction summarizes away but Lore's distillation + recall preserves.*
 
 ### Preference recall (400K tokens)
 
-| What's tested | Lore | Tail-window | Delta |
+| What's tested | Lore | Compaction | Delta |
 |---|---|---|---|
 | Explicit preferences ("always use const") | **4.96**/5 | 3.40/5 | +46% |
 | Implicit behavioral patterns | **4.83**/5 | 2.97/5 | +63% |
 | Preference evolution (user switches tools) | **5.00**/5 | 3.67/5 | +36% |
 | **Average across preferences** | **4.92**/5 | 3.34/5 | **+47%** |
 
-*Scored by LLM-as-judge on a 1–5 scale. Tail-window baseline: last 80K tokens of raw conversation (the default behavior without Lore). Evaluated at 400K tokens — the point where context management actually matters.*
+*The baseline column in this table still holds tail-window (80K) scores from a prior eval run. Compaction preference baselines are pending a re-run.*
 
-**What this means:** after 400K tokens of conversation, the standard approach loses early-session details entirely and forgets a third of your stated preferences. Lore's distillation + recall preserves both — averaging 4.6/5 on context retention where tail-window averages 2.6/5.
+**What this means:** at 400K tokens, Lore scores 4.8/5 on context retention with 12 out of 15 perfect scores — compared to compaction's 4.5/5 with 9 perfect scores. The gap is largest on mid-session details that compaction loses through repeated summarization cycles.
 
 The eval suite (16 scenarios, 130+ questions, 5 dimensions) is open source in `packages/core/eval/`. Run it yourself:
 
@@ -372,7 +373,7 @@ bun packages/core/eval/run.ts --mode live --inflate 400000
 
 **v5 — behavioral pattern detection + 400K eval.** Vector similarity-based pattern echo detection, action tagging in distillation, cross-session pattern clustering, assertion pinning for long sessions, and a scenario inflator for realistic 400K-token evaluation. This is what closed the preference gap from +15% to +47% over tail-window.
 
-**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. Context retention eval shows +77% over tail-window at 400K tokens (4.6/5 vs 2.6/5) — up from +50% in v5.
+**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. Context retention: 4.8/5 with 12/15 perfect scores, +7% over compaction at 400K tokens.
 
 ## Development setup
 

diff --git a/docs/index.html b/docs/index.html
@@ -928,11 +928,11 @@ <h1 class="sr">
 
     <div class="hero-stats sr">
       <div class="stat-cell">
-        <div class="stat-n">+77%</div>
-        <div class="stat-l">vs Tail-Window at 400K Tokens</div>
+        <div class="stat-n">12/15</div>
+        <div class="stat-l">Perfect Scores at 400K Tokens</div>
       </div>
       <div class="stat-cell">
-        <div class="stat-n">4.6</div>
+        <div class="stat-n">4.8</div>
         <div class="stat-l">out of 5.0 Detail Retention</div>
       </div>
       <div class="stat-cell">

diff --git a/packages/core/eval/baselines.ts b/packages/core/eval/baselines.ts
@@ -124,16 +124,19 @@ Conversation to summarize:
  *
  * Iterative: when the total exceeds `compactionThreshold`, compact the prefix
  * and check again. Real tools (Claude Code) auto-compact at ~83.5% of the
- * context window, and a 400K session triggers 2-3 compaction cycles. Each
- * cycle replaces the prefix with a summary, losing more detail.
+ * context window (~140K for a 200K model). A 400K session triggers 2-3
+ * compaction cycles. Each cycle replaces the prefix with a summary, losing
+ * more detail.
  */
 export async function compactionBaseline(
   turns: ConversationTurn[],
   tailBudgetTokens: number = 80_000,
   llm: EvalLLMClient,
   modelContextWindow: number = 200_000,
 ): Promise<string> {
-  // Match Claude Code's autoCompactThreshold: effectiveContextWindow * 0.835
+  // Match real tool behavior: no compaction until the conversation exceeds
+  // the auto-compact threshold. Claude Code auto-compacts at ~83.5%
+  // of (contextWindow - outputReserve). For a 200K model: ~140K threshold.
   const compactionThreshold = Math.floor(
     (modelContextWindow - Math.min(32_000, modelContextWindow * 0.15)) * 0.835,
   );
@@ -144,10 +147,10 @@ export async function compactionBaseline(
   while (compactionCount < maxCompactions) {
     const total = totalTokens(currentTurns);
 
-    // If everything fits within the threshold (or within the tail budget
-    // on the first pass), no more compaction needed.
-    if (compactionCount > 0 && total <= compactionThreshold) break;
-    if (total <= tailBudgetTokens) break;
+    // No compaction until the conversation exceeds the threshold (~140K for
+    // a 200K model). This matches real tool behavior — compaction doesn't
+    // trigger at 80K, only when context pressure is real.
+    if (total <= compactionThreshold) break;
 
     // Find the tail window cutoff
     let tailTokens = 0;

diff --git a/packages/core/eval/run.ts b/packages/core/eval/run.ts
@@ -108,7 +108,7 @@ function parseDimensions(raw: string): Dimension[] {
 function parseBaselines(raw: string): BaselineMode[] {
   if (!raw) {
     // Default baselines depend on dimensions
-    return ["lore", "tail-window", "compaction"];
+    return ["lore", "compaction"];
   }
   return raw.split(",").map((b) => b.trim() as BaselineMode);
 }
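
The threshold arithmetic changed in `compactionBaseline()` above can be checked in isolation. The sketch below copies only the formula from the patch; the helper names `compactionThreshold` and `needsCompaction` are hypothetical and do not exist in `baselines.ts`.

```typescript
// Hypothetical helpers for illustration; only the formula mirrors the patch.

// Tokens at which compaction starts for a given model context window.
function compactionThreshold(modelContextWindow: number = 200_000): number {
  // Output reserve: 32K tokens or 15% of the window, whichever is smaller.
  const outputReserve = Math.min(32_000, modelContextWindow * 0.15);
  // Auto-compact at 83.5% of what remains after the reserve.
  return Math.floor((modelContextWindow - outputReserve) * 0.835);
}

// Inverse of the loop's new break condition (`total <= compactionThreshold`).
function needsCompaction(
  totalTokens: number,
  modelContextWindow: number = 200_000,
): boolean {
  return totalTokens > compactionThreshold(modelContextWindow);
}

console.log(compactionThreshold()); // 141950
console.log(needsCompaction(80_000)); // false
console.log(needsCompaction(400_000)); // true
```

With the default 200K window the threshold is 141,950 tokens, which the comments round to ~140K. An 80K conversation therefore passes through uncompacted, whereas the old `total <= tailBudgetTokens` check made 80K the first-pass cutoff.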