BYK · May 19, 2026
diff --git a/.lore.md b/.lore.md
@@ -34,7 +34,7 @@
 * **git remote -v in hosted gateway — skip when header present, never run with client-controlled cwd**: \`LORE\_HOSTED\_MODE=1\` makes all FS-touching functions no-op: \`getGitRemote()\` returns null, \`config.load()\` skips \`.lore.json\`, agents-file/lat-reader/knowledge-watcher are no-ops. Activation: \`lore start\` (headless) enables hosted mode by default; opt-out via \`--local\` or \`LORE\_HOSTED\_MODE=0\`. \`lore run\` is always local. Flag set in \`initIfNeeded()\` from \`GatewayConfig.hostedMode\`. Never run \`git remote -v\` with client-controlled cwd. \`LORE\_REMOTE\_URL\` + local CLI: \`lore run\`/\`lore start\` skips local gateway and proxies to remote. Local CLI injects \`X-Lore-Git-Remote\`; remote gateway trusts it. CLI-less/SaaS: \`ANTHROPIC\_CUSTOM\_HEADERS\` requires a local \`lore\` CLI process — pure SaaS alternative not yet implemented.
 
 <!-- lore:019e1de2-75fe-72f5-8f20-36b4923c1ea9 -->
-* **LTM cache delete must be inside the 'changes made' guard in curator.ts**: Curator/recall path bugs: (1) \`ltmSessionCache.delete(sessionId)\` must be inside \`if (changesApplied)\` guard in curator.ts — unconditional placement forces expensive LTM rebuilds on every no-op run. (2) Recall follow-up requests must set \`cacheConversation: false\` — otherwise modified message array triggers full cache write at 5m TTL pricing. (3) Non-streaming recall follow-up path must NOT re-issue the upstream request — capture response body once to prevent double token cost and double cache prime. Strip \`recall\` from tools list to prevent re-invocation; convert \`tool\_use\`/\`tool\_result\` pair to plain text blocks. Thinking blocks must be preserved in assistant messages when extended thinking is enabled.
+* **LTM cache delete must be inside the 'changes made' guard in curator.ts**: Curator/recall path bugs: (1) \`ltmSessionCache.delete(sessionId)\` must be inside \`if (changesApplied)\` guard in curator.ts — unconditional placement forces expensive LTM rebuilds on every no-op run. (2) Recall follow-up requests must set \`cacheConversation: false\` — otherwise modified message array triggers full cache write at 5m TTL pricing. (3) Non-streaming recall follow-up path must NOT re-issue the upstream request — capture response body once to prevent double token cost. Strip \`recall\` from tools list to prevent re-invocation; convert \`tool\_use\`/\`tool\_result\` pair to plain text blocks. Thinking blocks must be preserved in assistant messages when extended thinking is enabled.
 
 <!-- lore:019e2760-86be-7f0a-978e-8aafc873b9c8 -->
 * **OpenAI/Responses API upstreams don't receive LTM — req.system passed through unchanged**: OpenAI/Responses API upstreams don't receive LTM injection — \`req.system\` is passed through unchanged. Only the Anthropic path in \`packages/gateway/src/pipeline.ts\` injects LTM into the system prompt. Sessions using OpenAI-protocol upstreams get no knowledge context. Fix: apply the same LTM injection logic to all upstream paths before forwarding. The LTM 3-block system prompt (stable preferences at 1h TTL, context-bound at 5m TTL) is Anthropic-only and must be adapted for other protocols.
@@ -52,7 +52,7 @@
 * **splitSegments() infinite recursion on oversized single messages**: splitSegments() infinite recursion on oversized single messages: In \`packages/core/src/distillation.ts\`, \`splitSegments()\` recurses infinitely when a single message exceeds \`maxSegmentTokens\` (16384). \`findSplitIndex()\` returns \`messages.length\` (=1), so \`left = messages.slice(0, 1)\` produces an identical recursive call. Triggered on large tool outputs (~49KB+). Fix: add base case after the \`totalTokens <= maxTokens\` guard — \`if (messages.length <= 1) return \[messages]\`. The oversized message becomes an indivisible segment.
 
 <!-- lore:019e1de2-7639-7b32-b4c1-e64486934c27 -->
-* **TTL downgrade hysteresis: downgradeStreak field prevents compounding cache busts**: Auto-TTL downgrade hysteresis in \`packages/gateway/src/pipeline.ts\`: downgrade from 1h→5m TTL requires 3 consecutive short-gap turns (\`ttlDowngradeStreak\` in \`SessionState\`). Block downgrade if >50% of session tokens are cached. Reset streak on any long-gap turn. Subagent turns and tool-use continuations excluded from gap recording — capture \`prevStopReason\` before line 1667 overwrites it, skip when \`prevStopReason === 'tool\_use'\` or \`isSubagentTurn\`. State persistence tiers: (1) Immediate — session identity fields on mutation. (2) Per-turn — cost snapshot piggybacked on \`saveSessionTracking\` in \`postResponse\`. (3) 30s periodic — gradient EMAs and cache warming state via dirty flag + idle scheduler. Max data loss on crash: ~30s of gradient/warmup state.
+* **TTL downgrade hysteresis: downgradeStreak field prevents compounding cache busts**: Auto-TTL downgrade hysteresis in \`packages/gateway/src/pipeline.ts\`: downgrade from 1h→5m TTL requires 3 consecutive short-gap turns (\`ttlDowngradeStreak\` in \`SessionState\`). Block downgrade if >50% of session tokens are cached. Reset streak on any long-gap turn. Subagent turns and tool-use continuations excluded from gap recording — capture \`prevStopReason\` before line 1667 overwrites it, skip when \`prevStopReason === 'tool\_use'\` or \`isSubagentTurn\`. State persistence: immediate (session identity), per-turn (cost snapshot), 30s periodic (gradient EMAs + cache warming via dirty flag). Max data loss on crash: ~30s.
 
 <!-- lore:019e1e9f-3131-733f-978e-dde6f41e29fd -->
 * **Upgrade lock double-acquisition bug: same process re-locks same file**: In \`packages/gateway/src/cli/lib/binary.ts\`, \`downloadBinaryToTemp()\` acquires a lock on \`\<execPath>.lock\` and holds it. Then \`installBinary()\` computes the same install path and tries to \`acquireLock()\` again. \`handleExistingLock()\` only allows re-entry if \`existingPid === process.ppid\` (parent), but the lock was written by the same process (\`existingPid === process.pid\`), so it throws 'Another upgrade is already in progress'. Fix: in \`handleExistingLock\`, also allow re-entry when \`existingPid === process.pid\`. Double \`releaseLock()\` is safe — \`releaseLock\` swallows errors so the second call is a no-op after the file is deleted.
@@ -70,17 +70,23 @@
 
 ### Preference
 
+<!-- lore:019e4126-cfbf-78dc-bec2-3a7ebf6b9e7d -->
+* **Always analyze root causes before proposing solutions, with explicit enumerated failure points**: When the user identifies a problem, they enumerate specific failure points explicitly and numbered before designing solutions. Mirror this structure: acknowledge the enumerated failure analysis, then address each failure point directly. Don't jump straight to a fix — validate or extend the root cause breakdown first. Also applies when helping design improvements to evals, tool descriptions, or system behavior.
+
 <!-- lore:019e40e7-96ed-746e-bccb-48f78110ad64 -->
-* **Always request critical self-review via subagent before merging PRs**: Before merging any PR, the user consistently asks the assistant to critically review its own code and PR description using a subagent for objectivity. The subagent review should identify real bugs, misleading logs, wrong parameters, dead code, and other issues categorized by severity (critical/medium/low). Only actionable issues should be fixed; cosmetic or deferred items are noted but skipped. After fixes are applied, all tests must pass before the commit is amended/pushed and the PR is merged. This pattern applies to every PR regardless of size or apparent simplicity.
+* **Always request critical self-review via subagent before merging PRs**: Before merging any PR, critically review code and PR description using a subagent for objectivity. Subagent should identify real bugs, misleading logs, wrong parameters, dead code — categorized by severity (critical/medium/low). Only fix actionable issues; note but skip cosmetic/deferred items. All tests must pass before committing and merging.
+
+<!-- lore:019e412d-38a2-7de8-a7a5-19ed025a2335 -->
+* **Always request thorough architectural understanding before implementing eval features**: When starting work on the Lore eval suite, the user consistently asks for a comprehensive exploration of the existing system before making changes or additions. This includes requesting analysis of specific files, directory structures, type definitions, scenario formats, harness execution, and baseline implementations. The user wants to understand key functions, signatures, and measurable aspects before designing or building anything new. Follow this pattern by proactively reading and summarizing all relevant eval files (types.ts, harness.ts, judge.ts, baselines.ts, scenario files) when the user begins a new eval-related task, without waiting to be asked.
 
 <!-- lore:019e2820-3ed0-7cc0-97a7-2c654df763ec -->
-* **IDs starting with LOREAI-GATEWAY- are Sentry issue IDs**: Any identifier starting with \`LOREAI-GATEWAY-\` (e.g. \`LOREAI-GATEWAY-F\`) is a Sentry issue ID for the gateway project. Always treat these as Sentry issue references when encountered in conversation — fetch the issue via Sentry CLI/API to get stack traces, user counts, and release info before investigating the codebase.
+* **IDs starting with LOREAI-GATEWAY- are Sentry issue IDs**: Any identifier starting with \`LOREAI-GATEWAY-\` (e.g. \`LOREAI-GATEWAY-F\`) is a Sentry issue ID for the gateway project. Always treat these as Sentry issue references — fetch via Sentry CLI/API to get stack traces, user counts, and release info before investigating the codebase.
 
 <!-- lore:019e3cd7-97d3-7053-8f02-bb13d727662e -->
-* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval scores must beat or match tail-window baseline — scoring below it means lost information (treat as bug). \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args; silently fails. Token estimation: chars/4 (scenario convention; chars/3 in baselines.ts for budget safety). Auto-extracts protected keywords from question+referenceAnswer. Adjusts \`question.metadata.turnIndex\` after inflation. 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` is incompatible with replay mode — run inflated scenarios in live mode only. Inflator buries preference-change turns (known issue).
+* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args; silently fails. Token estimation: chars/4 (scenario convention; chars/3 in baselines.ts for budget safety). Auto-extracts protected keywords from question+referenceAnswer. Adjusts \`question.metadata.turnIndex\` after inflation. 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode — run inflated scenarios in live mode only. Inflator buries preference-change turns (known issue).
 
 <!-- lore:019e2168-2fa4-77bd-a557-9d6dbcb40d81 -->
 * **Prefer WASM backend over native onnxruntime-node for compiled binaries**: WASM backend for Bun \`--compile\` binaries with transformers.js: \`binaryExternalsPlugin\` in esbuild redirects \`onnxruntime-node\` → \`onnxruntime-web\` via \`onResolve\` (static imports only — does NOT redirect dynamic \`import()\` calls) and patches transformers.js CDN fallback via \`onLoad\` to read \`wasmPaths\` from \`globalThis.\_\_LORE\_VENDOR\_WASM\_PATHS\_\_\` (object form \`{ mjs, wasm }\` with exact hashed \`$bunfs\` filenames — directory strings fail because Bun hashes bundled WASM filenames). WASM files embedded as Bun \`{ type: 'file' }\` assets in the wrapper; wrapper sets \`globalThis.\_\_LORE\_VENDOR\_WASM\_PATHS\_\_\` before importing the worker. No onnxruntime import in wrapper or worker. For npm/CJS builds, \`onnxruntime-node\` stays external. WASM is ~2x faster on batches than native. Importing \`onnxruntime-web\` explicitly alongside the redirect creates two ort instances — 'cannot register backend cpu using priority 10' error.
 
 <!-- lore:019e4110-4cdb-706d-b47e-514e5a349b1e -->
-* **Use Vitest as the project-wide testing framework, not Mocha + Chai + ts-node**: Use Vitest as the project-wide testing framework (migrated from Mocha + Chai + ts-node on May 19, 2026 — 312ms vs 30s startup). Always write new tests with \`import { describe, it, expect } from 'vitest'\`. Use kebab-case file naming (e.g., \`auth-integration.test.ts\`). Never revert to Mocha + Chai. Treat the most recent explicit framework directive as authoritative.
+* **Use Vitest as the project-wide testing framework, not Mocha + Chai + ts-node**: Use Vitest as the project-wide testing framework (migrated from Mocha + Chai + ts-node, May 2026 — 312ms vs 30s startup). Always write new tests with \`import { describe, it, expect } from 'vitest'\`. Use kebab-case file naming (e.g., \`auth-integration.test.ts\`). Never revert to Mocha + Chai. Treat the most recent explicit framework directive as authoritative.
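
The TTL downgrade hysteresis entry in `.lore.md` above packs several rules into one sentence. A minimal sketch of how they might compose follows. It is not the code in `packages/gateway/src/pipeline.ts`: `recordTurn`, `TurnInfo`, `TtlState`, and the `shortGapMs` threshold are hypothetical names. What happens to the streak when a downgrade is blocked (here it keeps counting) is also an assumption.

```typescript
type Ttl = "1h" | "5m";

// Hypothetical slice of SessionState; only ttlDowngradeStreak is named in the entry.
interface TtlState {
  ttl: Ttl;
  ttlDowngradeStreak: number;
}

// Hypothetical per-turn inputs, captured before the stop reason is overwritten.
interface TurnInfo {
  gapMs: number;
  prevStopReason: string | null;
  isSubagentTurn: boolean;
  cachedTokens: number;
  totalTokens: number;
}

const REQUIRED_STREAK = 3; // consecutive short-gap turns needed to downgrade
const MAX_CACHED_SHARE = 0.5; // above this cached share, the downgrade is blocked

function recordTurn(
  state: TtlState,
  turn: TurnInfo,
  shortGapMs: number, // assumed threshold; the entry does not state a value
): TtlState {
  // Tool-use continuations and subagent turns are excluded from gap recording.
  if (turn.prevStopReason === "tool_use" || turn.isSubagentTurn) {
    return state;
  }
  // Any long-gap turn resets the streak.
  if (turn.gapMs >= shortGapMs) {
    return { ttl: state.ttl, ttlDowngradeStreak: 0 };
  }
  const streak = state.ttlDowngradeStreak + 1;
  const cachedShare =
    turn.totalTokens > 0 ? turn.cachedTokens / turn.totalTokens : 0;
  const blocked = cachedShare > MAX_CACHED_SHARE;
  if (state.ttl === "1h" && streak >= REQUIRED_STREAK && !blocked) {
    return { ttl: "5m", ttlDowngradeStreak: 0 };
  }
  return { ttl: state.ttl, ttlDowngradeStreak: streak };
}
```

Returning a new state object rather than mutating in place keeps the rule easy to unit-test, whatever persistence tier the real field lives in.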
diff --git a/packages/core/eval/harness.ts b/packages/core/eval/harness.ts
@@ -482,7 +482,7 @@ async function askQuestionViaGateway(
   question: string,
   gateway: GatewayHandle,
   model: string,
-): Promise<{ hypothesis: string; tokens: TokenUsage }> {
+): Promise<{ hypothesis: string; tokens: TokenUsage; recallInvoked: boolean }> {
   const requestBody = {
     model,
     system: QA_SYSTEM,
@@ -509,6 +509,8 @@ async function askQuestionViaGateway(
     }
 
     const resp = await gateway.chat(requestBody);
+    const recallInvoked =
+      resp.headers.get("x-lore-recall-invoked") === "true";
     const data = (await resp.json()) as {
       content?: Array<{ type: string; text?: string }>;
       usage?: {
@@ -538,6 +540,7 @@ async function askQuestionViaGateway(
 
     return {
       hypothesis: text || data.error?.message || "[No response from gateway]",
+      recallInvoked,
       tokens: {
         input: data.usage?.input_tokens ?? 0,
         output: data.usage?.output_tokens ?? 0,
@@ -550,6 +553,7 @@ async function askQuestionViaGateway(
 
   return {
     hypothesis: "[Gateway rate limit exceeded after retries]",
+    recallInvoked: false,
     tokens: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, totalCost: 0 },
   };
 }
@@ -763,6 +767,7 @@ export async function runScenario(
       for (const q of scenario.questions) {
         let hypothesis: string;
         let tokens: TokenUsage;
+        let recallInvoked = false;
 
         if (config.mode === "fixture" || !llm) {
           // Fixture mode: produce a placeholder hypothesis
@@ -789,6 +794,7 @@ export async function runScenario(
           );
           hypothesis = answer.hypothesis;
           tokens = answer.tokens;
+          recallInvoked = answer.recallInvoked;
         } else {
           // Non-gateway baselines: ask via direct LLM with rendered context
           const answer = await askQuestion(q.question, context, mode, llm);
@@ -797,7 +803,7 @@ export async function runScenario(
         }
 
         // Score with the judge
-        const judgeResult = await judge(q, hypothesis, llm);
+        const judgeResult = await judge(q, hypothesis, llm, { recallInvoked });
 
         const result: EvalResult = {
           timestamp: new Date().toISOString(),
@@ -817,6 +823,7 @@ export async function runScenario(
             tags: q.metadata.tags,
             turnIndex: q.metadata.turnIndex,
             cumulativeTokens: q.metadata.cumulativeTokens,
+            recallInvoked,
           },
         };
 

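The harness change above reduces the gateway's `x-lore-recall-invoked` response header to a boolean with a strict string comparison. The sketch below shows which header values count as an invocation. `readRecallInvoked`, `HeaderReader`, and `stubHeaders` are hypothetical names used for illustration; the stub stands in for a real response's headers and lowercases names to mimic case-insensitive lookup.

```typescript
// Minimal shape of what the harness needs from response headers.
interface HeaderReader {
  get(name: string): string | null;
}

// Same comparison as the diff: only the exact string "true" counts.
function readRecallInvoked(headers: HeaderReader): boolean {
  return headers.get("x-lore-recall-invoked") === "true";
}

// Test stub: stores header names lowercased so lookups ignore case.
function stubHeaders(entries: Record<string, string>): HeaderReader {
  const lowered = new Map(
    Object.entries(entries).map(
      ([key, value]): [string, string] => [key.toLowerCase(), value],
    ),
  );
  return { get: (name) => lowered.get(name.toLowerCase()) ?? null };
}

console.log(readRecallInvoked(stubHeaders({ "X-Lore-Recall-Invoked": "true" }))); // true
console.log(readRecallInvoked(stubHeaders({ "x-lore-recall-invoked": "1" }))); // false
console.log(readRecallInvoked(stubHeaders({}))); // false
```

A missing header and any value other than `"true"` both read as `false`, the same value the rate-limit fallback and the non-gateway paths use. The flag therefore cannot distinguish "recall not invoked" from "no recall information available".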
diff --git a/packages/core/eval/judge.ts b/packages/core/eval/judge.ts
@@ -177,6 +177,17 @@ export const CROSS_PROJECT_AVAILABILITY: ScoringCriterion = {
   },
 };
 
+export const RECALL_TRIGGER: ScoringCriterion = {
+  name: "recall_trigger",
+  description:
+    "Did the answer appropriately use recall for cross-session references?",
+  scale: {
+    1: "Did not attempt recall despite clear cross-session reference cues",
+    3: "Used recall but with poor query formulation or incomplete usage",
+    5: "Proactively used recall with appropriate queries to retrieve cross-session information",
+  },
+};
+
 // ---------------------------------------------------------------------------
 // Pre-built rubrics
 // ---------------------------------------------------------------------------
@@ -281,6 +292,17 @@ export const RUBRICS = {
       cross_project_availability: 0.3,
     },
   } satisfies ScoringRubric,
+
+  /** MSR-1 cross-session cue questions */
+  crossSessionCueRecall: {
+    criteria: [FACTUAL_ACCURACY, COMPLETENESS, RECALL_TRIGGER, TEMPORAL_ATTRIBUTION],
+    weights: {
+      factual_accuracy: 0.25,
+      completeness: 0.25,
+      recall_trigger: 0.3,
+      temporal_attribution: 0.2,
+    },
+  } satisfies ScoringRubric,
 } as const;
 
 // ---------------------------------------------------------------------------
@@ -319,13 +341,25 @@ function buildJudgeUser(
   referenceAnswer: string,
   hypothesis: string,
   rubric: ScoringRubric,
+  metadata?: { recallInvoked?: boolean },
 ): string {
   const criteria = buildCriteriaDescription(rubric);
+
+  // Only include recall metadata when the rubric has a recall_trigger criterion
+  const hasRecallCriterion = rubric.criteria.some(
+    (c) => c.name === "recall_trigger",
+  );
+  const recallSection =
+    hasRecallCriterion && metadata?.recallInvoked !== undefined
+      ? `\n\n## Recall Tool Usage\nThe recall tool (cross-session memory search) was **${metadata.recallInvoked ? "invoked" : "not invoked"}** when answering this question. Factor this into the recall_trigger score.\n\n`
+      : "\n\n";
+
   return (
     `## Scoring Criteria\n\n${criteria}\n\n` +
     `## Question\n${question}\n\n` +
     `## Reference Answer\n${referenceAnswer}\n\n` +
-    `## Hypothesis (answer to evaluate)\n${hypothesis}\n\n` +
+    `## Hypothesis (answer to evaluate)\n${hypothesis}` +
+    recallSection +
     `Score each criterion on a 1-5 scale. Return JSON only.`
   );
 }
@@ -356,6 +390,7 @@ export async function judge(
   question: EvalQuestion,
   hypothesis: string,
   llm?: EvalLLMClient,
+  metadata?: { recallInvoked?: boolean },
 ): Promise<JudgeResult> {
   const { rubric } = question;
 
@@ -378,6 +413,7 @@ export async function judge(
     question.referenceAnswer,
     hypothesis,
     rubric,
+    metadata,
   );
 
   const result = await llm.prompt(JUDGE_SYSTEM, userPrompt, {