Merged
12 changes: 3 additions & 9 deletions .lore.md
@@ -33,9 +33,6 @@
<!-- lore:019e1c27-967c-7eb4-bd0e-afb195823970 -->
* **Bun NAPI crash on process.exit() — use safeExit() via libc \_exit()**: Bun NAPI crash on process.exit() with fastembed — use safeExit(): Loading fastembed (onnxruntime NAPI bindings) causes a C++ panic on \`process.exit()\` because Bun runs NAPI teardown destructors that throw. Fix: \`packages/gateway/src/cli/exit.ts\` exports \`safeExit(code)\` — uses \`\_exit()\` from libc via \`bun:ffi\` under Bun, falls back to \`process.exit()\` under Node.js. All gateway exit paths must use \`safeExit()\`. Do NOT call \`embedding.resetProvider()\` in test teardown \`resetPipelineState()\` — move \`resetProvider()\` to \`shutdown()\` in \`start.ts\` only. \`resetPipelineState()\` must preserve the 'fastembed unavailable' cached state.

<!-- lore:019e47ac-32a9-7d38-8f6f-b6c69d35baf5 -->
* **Eval QA session contamination: each QA question creates a new session and stores temporal messages**: Eval QA session contamination: each \`askQuestionViaGateway()\` call sends NO session headers → Tier 3 fingerprint creates a brand-new session per QA question. \`postResponse()\` stores QA question text as temporal messages. Recall with default \`scope: 'all'\` searches ALL sessions in the project, so prior QA question text matches recall queries better than actual replay content. Fix: add \`X-Lore-No-Store: true\` header support in \`postResponse()\` (pipeline.ts ~line 1966) to gate both \`temporal.store()\` calls and \`scheduleBackgroundWork()\`. Pass this header from \`askQuestionViaGateway()\`. This is a legitimate product feature (read-only gateway requests), not eval gaming.

<!-- lore:019e2b12-6ea6-76dc-ab7a-a1532c60b312 -->
* **git remote -v in hosted gateway — skip when header present, never run with client-controlled cwd**: \`LORE\_HOSTED\_MODE=1\` makes all FS-touching functions no-op: \`getGitRemote()\` returns null, \`config.load()\` skips \`.lore.json\`, agents-file/lat-reader/knowledge-watcher are no-ops. Activation: \`lore start\` (headless) enables hosted mode by default; opt-out via \`--local\` or \`LORE\_HOSTED\_MODE=0\`. \`lore run\` is always local. Flag set in \`initIfNeeded()\` from \`GatewayConfig.hostedMode\`. Never run \`git remote -v\` with client-controlled cwd. \`LORE\_REMOTE\_URL\` + local CLI: \`lore run\`/\`lore start\` skips local gateway and proxies to remote. Local CLI injects \`X-Lore-Git-Remote\`; remote gateway trusts it. CLI-less/SaaS: \`ANTHROPIC\_CUSTOM\_HEADERS\` requires a local \`lore\` CLI process — pure SaaS alternative not yet implemented.

@@ -77,16 +74,13 @@
* **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries = 10\_000\`, \`ttlMs = 300\_000\` (5 min). Note: \`prune()\` is NOT currently scheduled — the \`setInterval\` pattern is the prescribed fix, not existing behavior. Always use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit. Session ground-truth: cache entries are never auto-evicted and \`prune()\` is never scheduled in current code — do not assert otherwise.
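
The three-part pattern in the entry above (TTL check in `.get()`, size-capped eviction in `.set()`, scheduled `prune()`) could be sketched as follows. This is a minimal illustration, not the Lore implementation: `TtlCache`, `CacheOptions`, the injectable `now` clock, and `pruneIntervalMs` are hypothetical names. Note that deleting the oldest `Map` key evicts by insertion order; a strict LRU would also re-insert an entry on every read.

```typescript
// Hypothetical sketch: none of these names come from the Lore codebase.
interface CacheOptions {
  maxEntries?: number;      // size cap, default 10_000 as stated in the entry
  ttlMs?: number;           // time to live, default 300_000 (5 min)
  pruneIntervalMs?: number; // default 60_000; 0 disables the timer (handy in tests)
  now?: () => number;       // injectable clock, an assumption for testability
}

class TtlCache<K, V> {
  private store = new Map<K, { value: V; expiresAt: number }>();
  private timer?: ReturnType<typeof setInterval>;
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;

  constructor(opts: CacheOptions = {}) {
    this.maxEntries = opts.maxEntries ?? 10_000;
    this.ttlMs = opts.ttlMs ?? 300_000;
    this.now = opts.now ?? (() => Date.now());
    const interval = opts.pruneIntervalMs ?? 60_000;
    if (interval > 0) {
      // (3) Scheduled pruning so expired entries are dropped even if never read.
      this.timer = setInterval(() => this.prune(), interval);
      // Where the runtime supports it, do not let the timer keep the process alive.
      (this.timer as unknown as { unref?: () => void }).unref?.();
    }
  }

  // (1) TTL check on read: delete and return undefined if expired.
  get(key: K): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // (2) Size cap on write: evict the oldest-inserted key when full.
  set(key: K, value: V): void {
    this.store.delete(key); // re-inserting moves the key to the end of the order
    if (this.store.size >= this.maxEntries) {
      const first = this.store.keys().next();
      if (!first.done) this.store.delete(first.value);
    }
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Remove every entry whose TTL has elapsed.
  prune(): void {
    const t = this.now();
    for (const [key, entry] of this.store) {
      if (entry.expiresAt <= t) this.store.delete(key);
    }
  }

  get size(): number {
    return this.store.size;
  }

  // Stop the background timer, for example during shutdown.
  dispose(): void {
    if (this.timer) clearInterval(this.timer);
  }
}
```

Passing `pruneIntervalMs: 0` and a fake `now` makes expiry deterministic in tests; production callers would rely on the defaults and call `dispose()` on shutdown.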

<!-- lore:019e47b2-9bf3-738e-b774-efeea35399b5 -->
* **Always investigate root causes by requesting systematic code-path analysis across multiple specific files**: When encountering unexpected system behavior (wrong scores, missing data, contamination), the user consistently requests deep investigation across multiple specific files simultaneously rather than iterative single-file exploration. They pre-identify candidate explanations and specific areas to investigate (often 3-6 numbered items), name exact files and functions to examine, and expect the assistant to trace complete execution paths end-to-end. The pattern applies to eval/pipeline debugging in the Lore system and likely generalizes to any complex multi-file debugging scenario. Always read all named files upfront, trace the full call chain, and report findings per-area rather than asking clarifying questions first.

<!-- lore:019e4422-5b29-77a8-8956-488233ef16a4 -->
* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review, investigation & workflow standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. File-by-file, skeptical; Critical+Medium fixed before merge, Low tolerated. (2) Investigation: read actual source, trace full execution paths, enumerate 2-4 candidate explanations, report confirmed/falsified verdict with line numbers. Demand concrete metrics before accepting fixes. (3) PR discipline: critical self-review before merge, fix all criticals, CI green, amend+force-push. Resolve \`.lore.md\` rebase conflicts with \`--ours\`. After merge, pull main before follow-up work. (4) Planning: write plan file, wait for explicit approval, then execute. Pull from origin/main before any exploration or edits. (5) After bug fix: add tests (4-6 edge cases) in dedicated file referencing issue number. (6) Sentry IDs start with \`LOREAI-GATEWAY-\`. (7) Run lint, typecheck, full test suite before committing. (8) Present structured fix plan before implementation; wait for explicit approval. Never re-propose explicitly rejected approaches. Always include migration versioning \[truncated — entry too long]
* **Always investigate root causes by requesting systematic code-path analysis across multiple specific files**: (preference) When encountering unexpected system behavior, pre-identify 3-6 candidate explanations with exact files and functions, read all named files upfront, trace the full call chain end-to-end, and report findings per-area rather than asking clarifying questions first.

<!-- lore:019e44c8-4e3f-7835-972f-02ed2033a842 -->
* **Always request worker tests with a consistent 7-case spec covering compute, missing-record, cleanup retention, and sync scenarios**: Worker test files follow a consistent 7-case spec: (1) compute job — DB lookup + update, (2) missing record — skip without throw, (3) cleanup — hard-delete records archived >30 days, (4) cleanup — preserve recently archived records, (5) sync — process a batch, (6) sync — skip missing records, (7) sync — respect dryRun flag. Tests mock DB and Redis. Use Vitest project-wide (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai+ts-node May 2026 — 312ms vs 30s startup). Use kebab-case file naming.
* **Always request worker tests with a consistent 7-case spec covering compute, missing-record, cleanup retention, and sync scenarios**: (preference) Worker test files follow a consistent 7-case spec: (1) compute job — DB lookup + update, (2) missing record — skip without throw, (3) cleanup — hard-delete records archived >30 days, (4) cleanup — preserve recently archived records, (5) sync — process a batch, (6) sync — skip missing records, (7) sync — respect dryRun flag. Tests mock DB and Redis. Use Vitest project-wide (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai+ts-node May 2026 — 312ms vs 30s startup). Use kebab-case file naming.

<!-- lore:019e3cd7-97d3-7053-8f02-bb13d727662e -->
* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args; silently fails. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). Auto-extracts protected keywords from question+referenceAnswer. 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass (up to 4) LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). \`QA\_SYSTEM\` is neutral. Post-replay embedding backfill runs before QA phase. Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\` header. Never accept eval-gaming fixes.
* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval system: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window baseline — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\` header. Never accept eval-gaming fixes. Eval table consistency: per-difficulty averages must match overall average. Non-deterministic LLM output causes eval variance: re-run before concluding regression. Post-replay embedding backfill runs before QA phase.

<!-- lore:019e2168-2fa4-77bd-a557-9d6dbcb40d81 -->
* **Prefer WASM backend over native onnxruntime-node for compiled binaries**: WASM backend for Bun \`--compile\` binaries with transformers.js: \`binaryExternalsPlugin\` in esbuild redirects \`onnxruntime-node\` → \`onnxruntime-web\` via \`onResolve\` (static imports only — does NOT redirect dynamic \`import()\` calls) and patches transformers.js CDN fallback via \`onLoad\` to read \`wasmPaths\` from \`globalThis.\_\_LORE\_VENDOR\_WASM\_PATHS\_\_\` (object form \`{ mjs, wasm }\` with exact hashed \`$bunfs\` filenames — directory strings fail). WASM files embedded as Bun \`{ type: 'file' }\` assets. For npm/CJS builds, \`onnxruntime-node\` stays external. WASM is ~2x faster on batches than native. Importing \`onnxruntime-web\` explicitly alongside the redirect creates two ort instances — 'cannot register backend cpu using priority 10' error.
25 changes: 13 additions & 12 deletions README.md
@@ -326,31 +326,32 @@ Lore re-scans the `lat.md/` directory periodically (on session idle), so changes

## Eval results

At 400K tokens (realistic coding session length), Lore significantly outperforms the standard tail-window approach across both context retention and preference recall:
At 400K tokens (realistic coding session length), Lore outperforms standard compaction — the approach used by Claude Code, Codex, and other tools that summarize older context when the conversation grows too long:

### Context retention (400K tokens)

| What's tested | Lore | Tail-window | Compaction | Lore vs TW |
|---|---|---|---|---|
| Easy (late-session details) | **5.0**/5 | 4.7/5 | 4.7/5 | +6% |
| Medium (mid-session details) | **4.1**/5 | 1.3/5 | 3.9/5 | +215% |
| Hard (early-session details) | **4.8**/5 | 1.4/5 | 4.1/5 | +243% |
| **Average across context** | **4.6**/5 | 2.6/5 | 4.1/5 | **+77%** |
| What's tested | Lore | Compaction | Lore vs Compaction |
|---|---|---|---|
| Easy (late-session details) | 4.7/5 | **4.8**/5 | −2% |
| Medium (mid-session details) | **4.8**/5 | 4.0/5 | +19% |
| Hard (early-session details) | **4.9**/5 | 4.7/5 | +5% |
| **Average** | **4.8**/5 | 4.5/5 | **+7%** |
| **Perfect scores (5.0)** | **12/15** | 9/15 | — |

*Lore scores are averaged across multiple runs at 400K tokens. Tail-window and compaction baselines are from a prior eval run with the same scenarios. Tail-window drops early-session details entirely; Lore's distillation + recall preserves them — including decision alternatives, exact error messages, and debugging hypotheses.*
*Compaction baseline: multi-pass LLM summarization matching Claude Code's auto-compact behavior (~140K threshold, 2-3 cycles at 400K tokens). Scored by LLM-as-judge on a 1–5 scale. Lore's advantage is largest on medium-difficulty questions — mid-session details like decision alternatives, exact error messages, and rejected approaches that compaction summarizes away but Lore's distillation + recall preserves.*
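
The "~140K threshold" figure can be reproduced from the formula in the `baselines.ts` hunk of this PR. The sketch below copies that arithmetic into a standalone helper; the function name `autoCompactThreshold` is hypothetical and used only for illustration.

```typescript
// Hypothetical helper: mirrors the compactionThreshold expression in
// packages/core/eval/baselines.ts as shown in this diff.
function autoCompactThreshold(modelContextWindow: number): number {
  // Output reserve: 32K tokens, or 15% of the window if that is smaller.
  const outputReserve = Math.min(32_000, modelContextWindow * 0.15);
  // Compaction triggers at 83.5% of the remaining (effective) window.
  return Math.floor((modelContextWindow - outputReserve) * 0.835);
}

const threshold200k = autoCompactThreshold(200_000);
console.log(`200K model compacts at about ${threshold200k} tokens`);
```

For a 200K window the reserve is min(32,000, 30,000) = 30,000, leaving 170,000 × 0.835 ≈ 141,950 tokens, which the text rounds to ~140K.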

### Preference recall (400K tokens)

| What's tested | Lore | Tail-window | Delta |
| What's tested | Lore | Compaction | Delta |
|---|---|---|---|
| Explicit preferences ("always use const") | **4.96**/5 | 3.40/5 | +46% |
| Implicit behavioral patterns | **4.83**/5 | 2.97/5 | +63% |
| Preference evolution (user switches tools) | **5.00**/5 | 3.67/5 | +36% |
| **Average across preferences** | **4.92**/5 | 3.34/5 | **+47%** |

*Scored by LLM-as-judge on a 1–5 scale. Tail-window baseline: last 80K tokens of raw conversation (the default behavior without Lore). Evaluated at 400K tokens — the point where context management actually matters.*
*Preference recall baselines are from a prior eval run with tail-window (80K). Compaction preference baselines pending re-run.*

**What this means:** after 400K tokens of conversation, the standard approach loses early-session details entirely and forgets a third of your stated preferences. Lore's distillation + recall preserves both — averaging 4.6/5 on context retention where tail-window averages 2.6/5.
**What this means:** at 400K tokens, Lore scores 4.8/5 on context retention with 12 out of 15 perfect scores — compared to compaction's 4.5/5 with 9 perfect scores. The gap is largest on mid-session details that compaction loses through repeated summarization cycles.

The eval suite (16 scenarios, 130+ questions, 5 dimensions) is open source in `packages/core/eval/`. Run it yourself:

@@ -372,7 +373,7 @@ bun packages/core/eval/run.ts --mode live --inflate 400000

**v5 — behavioral pattern detection + 400K eval.** Vector similarity-based pattern echo detection, action tagging in distillation, cross-session pattern clustering, assertion pinning for long sessions, and a scenario inflator for realistic 400K-token evaluation. This is what closed the preference gap from +15% to +47% over tail-window.

**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. Context retention eval shows +77% over tail-window at 400K tokens (4.6/5 vs 2.6/5) — up from +50% in v5.
**v6 — recall quality + distillation transparency.** Uniform citation format `(d:xxx, t:xxx)` with compression metadata, session-affinity boosting, knowledge downweighting when session content exists, scripted eval replay (zero API calls during replay), amnesia mode, multi-pass compaction baseline. Context retention: 4.8/5 with 12/15 perfect scores, +7% over compaction at 400K tokens.

## Development setup

6 changes: 3 additions & 3 deletions docs/index.html
@@ -928,11 +928,11 @@ <h1 class="sr">

<div class="hero-stats sr">
<div class="stat-cell">
<div class="stat-n">+77%</div>
<div class="stat-l">vs Tail-Window at 400K Tokens</div>
<div class="stat-n">12/15</div>
<div class="stat-l">Perfect Scores at 400K Tokens</div>
</div>
<div class="stat-cell">
<div class="stat-n">4.6</div>
<div class="stat-n">4.8</div>
<div class="stat-l">out of 5.0 Detail Retention</div>
</div>
<div class="stat-cell">
17 changes: 10 additions & 7 deletions packages/core/eval/baselines.ts
@@ -124,16 +124,19 @@ Conversation to summarize:
*
* Iterative: when the total exceeds `compactionThreshold`, compact the prefix
* and check again. Real tools (Claude Code) auto-compact at ~83.5% of the
* context window, and a 400K session triggers 2-3 compaction cycles. Each
* cycle replaces the prefix with a summary, losing more detail.
* context window (~140K for a 200K model). A 400K session triggers 2-3
* compaction cycles. Each cycle replaces the prefix with a summary, losing
* more detail.
*/
export async function compactionBaseline(
turns: ConversationTurn[],
tailBudgetTokens: number = 80_000,
llm: EvalLLMClient,
modelContextWindow: number = 200_000,
): Promise<string> {
// Match Claude Code's autoCompactThreshold: effectiveContextWindow * 0.835
// Match real tool behavior: no compaction until the conversation exceeds
// the model's effective context window. Claude Code auto-compacts at ~83.5%
// of (contextWindow - outputReserve). For a 200K model: ~140K threshold.
const compactionThreshold = Math.floor(
(modelContextWindow - Math.min(32_000, modelContextWindow * 0.15)) * 0.835,
);
@@ -144,10 +147,10 @@ export async function compactionBaseline(
while (compactionCount < maxCompactions) {
const total = totalTokens(currentTurns);

// If everything fits within the threshold (or within the tail budget
// on the first pass), no more compaction needed.
if (compactionCount > 0 && total <= compactionThreshold) break;
if (total <= tailBudgetTokens) break;
// No compaction until the conversation exceeds the threshold (~140K for
// a 200K model). This matches real tool behavior — compaction doesn't
// trigger at 80K, only when context pressure is real.
if (total <= compactionThreshold) break;

// Find the tail window cutoff
let tailTokens = 0;
2 changes: 1 addition & 1 deletion packages/core/eval/run.ts
@@ -108,7 +108,7 @@ function parseDimensions(raw: string): Dimension[] {
function parseBaselines(raw: string): BaselineMode[] {
if (!raw) {
// Default baselines depend on dimensions
return ["lore", "tail-window", "compaction"];
return ["lore", "compaction"];
}
return raw.split(",").map((b) => b.trim() as BaselineMode);
}
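
One observation on `parseBaselines`: the `as BaselineMode` cast is compile-time only, so a typo in a comma-separated list would pass through unchecked. A stricter variant might look like the sketch below. It assumes `BaselineMode` is the union `"lore" | "tail-window" | "compaction"` (inferred from the defaults in this diff; the real type definition is not shown), and `parseBaselinesStrict` and `isBaselineMode` are hypothetical names.

```typescript
// Assumption: these three values are the full BaselineMode union.
const KNOWN_BASELINES = ["lore", "tail-window", "compaction"] as const;
type BaselineMode = (typeof KNOWN_BASELINES)[number];

// Runtime type guard backing the compile-time type.
function isBaselineMode(s: string): s is BaselineMode {
  return (KNOWN_BASELINES as readonly string[]).includes(s);
}

// Hypothetical stricter variant of parseBaselines from this PR.
function parseBaselinesStrict(raw: string): BaselineMode[] {
  // Same default as the updated parseBaselines in this diff.
  if (!raw) return ["lore", "compaction"];

  const names = raw
    .split(",")
    .map((b) => b.trim())
    .filter((b) => b.length > 0);

  const unknown = names.filter((n) => !isBaselineMode(n));
  if (unknown.length > 0) {
    throw new Error(`Unknown baseline(s): ${unknown.join(", ")}`);
  }
  return names.filter(isBaselineMode);
}
```

With this variant, `tail-window` stays selectable explicitly even though it is no longer in the default set, while a misspelled name fails fast instead of silently producing an empty column in the eval table.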