Merged
11 changes: 4 additions & 7 deletions .lore.md
@@ -32,16 +32,13 @@
* **Pi plugin: which providers can be proxied through the gateway**: Pi plugin gateway proxy compatibility by wire protocol. \*\*Proxiable\*\*: \`anthropic\` → \`/v1/messages\`: \`anthropic\`, \`fireworks\`, \`github-copilot\`; \`openai-completions\` → \`/v1/chat/completions\`: \`deepseek\`, \`xai\`, \`groq\`, \`cerebras\`, \`openrouter\`, \`huggingface\`, \`opencode\`, \`opencode-go\`; \`openai-responses\` → \`/v1/responses\`: \`openai\`, \`azure-openai-responses\`, \`openai-codex\`, \`azure-openai\`, \`lm-studio\`, \`ollama\`. \*\*Cannot proxy\*\*: \`google\`, \`google-vertex\`, \`amazon-bedrock\`, \`mistral\`. \`registerProvider(name, { baseUrl })\` overrides base URL. Gateway routes by URL path only. OpenAI streaming clients receive true incremental SSE (\`stream/openai.ts\`).

<!-- lore:019e49be-a3bc-7f5b-90f7-d859411f3a4d -->
-* **SaaS deployment: local-only vs hosted-only modes, no simultaneous hybrid**: SaaS deployment: two mutually exclusive modes — fully local (local gateway + Turso embedded replica, LLM calls direct) and fully hosted (SaaS gateway, env var only, horizontal scaling + sticky sessions or Redis). No hybrid mode: rejected for complexity/surface area. Corporate environments prefer hosted; power/privacy users prefer local. 95%+ of code shared (\`packages/core/\`, \`pipeline.ts\`, \`server.ts\`). Hosted requires horizontal scaling from day one, API key security (HSM/Vault-level), ~20-100ms added latency. Local requires local CLI. DB-per-org on Turso; control plane maps users→orgs→teams→databases. General principle: always prefer clean mutually exclusive modes over simultaneous hybrid architectures — smaller surface area, easier reasoning.
+* **SaaS deployment: local-only vs hosted-only modes, no simultaneous hybrid**: SaaS deployment: two mutually exclusive modes — fully local (local gateway + Turso embedded replica, LLM calls direct) and fully hosted (SaaS gateway, env var only, horizontal scaling + sticky sessions or Redis). No hybrid mode: rejected for complexity/surface area. Corporate environments prefer hosted; power/privacy users prefer local. 95%+ of code shared (\`packages/core/\`, \`pipeline.ts\`, \`server.ts\`). Hosted requires horizontal scaling from day one, API key security (HSM/Vault-level), ~20-100ms added latency. DB-per-org on Turso; control plane maps users→orgs→teams→databases. General principle: always prefer clean mutually exclusive modes over hybrid architectures.

### Gotcha

<!-- lore:019e1c27-967c-7eb4-bd0e-afb195823970 -->
* **Bun NAPI crash on process.exit() — use safeExit() via libc \_exit()**: Bun NAPI crash on process.exit() with fastembed — use safeExit(): Loading fastembed (onnxruntime NAPI bindings) causes a C++ panic on \`process.exit()\` because Bun runs NAPI teardown destructors that throw. Fix: \`packages/gateway/src/cli/exit.ts\` exports \`safeExit(code)\` — uses \`\_exit()\` from libc via \`bun:ffi\` under Bun, falls back to \`process.exit()\` under Node.js. All gateway exit paths must use \`safeExit()\`. Do NOT call \`embedding.resetProvider()\` in test teardown \`resetPipelineState()\` — move \`resetProvider()\` to \`shutdown()\` in \`start.ts\` only. \`resetPipelineState()\` must preserve the 'fastembed unavailable' cached state.

<!-- lore:019e49d2-d096-7dca-81df-04837eb3f247 -->
* **exportLoreFile() missing existsSync guard — fires ENOENT every 30s idle tick**: Trap: \`exportLoreFile()\` in \`packages/core/src/agents-file.ts\` line 543 calls \`writeFileSync(fp, content)\` with no \`existsSync(projectPath)\` check. The idle worker (idle.ts line 330) calls it every 30s — the catch logs the error and retries next tick, causing high-volume Sentry events. Contrast: \`exportToFile()\` uses \`mkdirSync(dirname(...), {recursive:true})\` at line 367; \`data.ts\` callers guard with \`existsSync\` at lines 324/475. Fix: add \`if (!existsSync(projectPath)) return;\` before line 543 — \`existsSync\` already imported at line 11. Don't create missing dirs (user may have intentionally deleted them).

<!-- lore:019e2b12-6ea6-76dc-ab7a-a1532c60b312 -->
* **git remote -v in hosted gateway — skip when header present, never run with client-controlled cwd**: \`LORE\_HOSTED\_MODE=1\` makes all FS-touching functions no-op: \`getGitRemote()\` returns null, \`config.load()\` skips \`.lore.json\`, agents-file/lat-reader/knowledge-watcher are no-ops. Activation: \`lore start\` (headless) enables hosted mode by default; opt-out via \`--local\` or \`LORE\_HOSTED\_MODE=0\`. \`lore run\` is always local. Flag set in \`initIfNeeded()\` from \`GatewayConfig.hostedMode\`. Never run \`git remote -v\` with client-controlled cwd. \`LORE\_REMOTE\_URL\` + local CLI: \`lore run\`/\`lore start\` skips local gateway and proxies to remote. Local CLI injects \`X-Lore-Git-Remote\`; remote gateway trusts it. CLI-less/SaaS: \`ANTHROPIC\_CUSTOM\_HEADERS\` requires a local \`lore\` CLI process — pure SaaS alternative not yet implemented.

@@ -64,7 +61,7 @@
* **TTL downgrade hysteresis: downgradeStreak field prevents compounding cache busts**: Auto-TTL downgrade hysteresis in \`packages/gateway/src/pipeline.ts\`: downgrade from 1h→5m TTL requires 3 consecutive short-gap turns (\`ttlDowngradeStreak\` in \`SessionState\`). Block downgrade if >50% of session tokens are cached. Reset streak on any long-gap turn. Subagent turns and tool-use continuations excluded from gap recording — capture \`prevStopReason\` before line 1667 overwrites it, skip when \`prevStopReason === 'tool\_use'\` or \`isSubagentTurn\`. State persistence: immediate (session identity), per-turn (cost snapshot), 30s periodic (gradient EMAs + cache warming via dirty flag). Max data loss on crash: ~30s. Recall follow-up requests must set \`cacheConversation: false\` — otherwise modified message array triggers full cache write at 5m TTL pricing.

<!-- lore:019e49d3-470f-7814-9971-fa485a78d9b5 -->
-* **Unprotected JSON.parse sites in gateway: config.ts, remote.ts, api.ts, cache-warmer.ts**: Unprotected \`JSON.parse\` sites in gateway (no try-catch): (1) \`packages/core/src/config.ts:260\` — \`.lore.json\` file read; (2) \`packages/gateway/src/cli/remote.ts:56\` — \`res.json()\` when Content-Type claims JSON but body isn't; (3) \`packages/gateway/src/api.ts:87\` — zstd-decompressed request body; (4) \`packages/gateway/src/cache-warmer.ts:609\` — stored request body from DB. Pipeline SSE parses (lines 866, 1326, 1359, 1812) are already protected. Also: \`load(projectPath)\` in \`cli/import.ts:87\` is called without \`await\` — if \`JSON.parse\` throws on malformed \`.lore.json\`, it becomes an unhandled rejection (LOREAI-GATEWAY-Y). Fix: (1) \`await load(projectPath)\` in import.ts:87; (2) wrap config.ts:260 in try-catch, fall through to default config; (3) wrap all four unprotected sites in try-catch with error logging.
+* **Unprotected JSON.parse sites in gateway: config.ts, remote.ts, api.ts, cache-warmer.ts**: Unprotected \`JSON.parse\` sites in gateway (no try-catch): (1) \`packages/core/src/config.ts:260\` — \`.lore.json\` file read — FIXED: added JSONC support via \`stripJsonComments()\` + try-catch, falls through to defaults on failure (LOREAI-GATEWAY-Y); (2) \`packages/gateway/src/cli/remote.ts:56\` — \`res.json()\` when Content-Type claims JSON but body isn't; (3) \`packages/gateway/src/api.ts:87\` — zstd-decompressed request body; (4) \`packages/gateway/src/cache-warmer.ts:609\` — stored request body from DB. Pipeline SSE parses (lines 866, 1326, 1359, 1812) are already protected. Sites 2-4 remain unprotected (no Sentry events yet). Related: \`exportLoreFile()\` in \`agents-file.ts:543\` called \`writeFileSync\` without handling missing project dirs — FIXED: try-catch swallows ENOENT silently (project dir deleted mid-session), re-throws all other FS errors (LOREAI-GATEWAY-K).

<!-- lore:019e1cd6-05d2-74c5-aea8-fd827a4a45e7 -->
* **vectorSearch() is unscoped — test cleanup must delete all embedding rows**: \`vectorSearch()\` in \`packages/core/src/ltm.ts\` queries \`knowledge WHERE embedding IS NOT NULL AND confidence > 0.2\` with no \`project\_id\` filter (intentional for cross-project search). Two gotchas: (1) Test suites scoped to one project leak embedding rows into other vectorSearch tests — \`beforeEach\` must \`DELETE FROM knowledge WHERE embedding IS NOT NULL\`. (2) \`vectorSearch()\` has no \`excludeCategories\` param — category exclusions from \`forSession()\` callers have no effect; add optional \`excludeCategories\` param and propagate from callers. Also: global entries (pid=null) force \`crossProject=true\`; confidence is clamped to \[0.0, 1.0] in \`update()\`.
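The TTL downgrade hysteresis entry in the hunk above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `pipeline.ts`: the `TurnInfo` shape, the `recordTurn` name, and the 5-minute short-gap threshold are hypothetical. Only the 3-turn streak, the 50% cached-token block, the reset on a long gap, and the exclusion of subagent and `tool_use` turns come from the entry itself.

```typescript
type Ttl = "1h" | "5m";

interface SessionState {
  ttl: Ttl;
  ttlDowngradeStreak: number;
}

// Hypothetical per-turn input; the real pipeline tracks more than this.
interface TurnInfo {
  gapMs: number; // time since the previous recorded turn
  cachedTokens: number; // session tokens served from cache
  totalTokens: number; // all session tokens
  prevStopReason?: string;
  isSubagentTurn?: boolean;
}

const SHORT_GAP_MS = 5 * 60_000; // assumption: what counts as a short gap
const REQUIRED_STREAK = 3; // from the entry: 3 consecutive short-gap turns
const MAX_CACHED_SHARE = 0.5; // from the entry: block downgrade above 50% cached

function recordTurn(state: SessionState, turn: TurnInfo): SessionState {
  // Tool-use continuations and subagent turns are excluded from gap recording.
  if (turn.prevStopReason === "tool_use" || turn.isSubagentTurn) return state;

  // Any long-gap turn resets the streak.
  if (turn.gapMs >= SHORT_GAP_MS) return { ...state, ttlDowngradeStreak: 0 };

  const streak = state.ttlDowngradeStreak + 1;
  const cachedShare =
    turn.totalTokens > 0 ? turn.cachedTokens / turn.totalTokens : 0;
  const mayDowngrade =
    state.ttl === "1h" &&
    streak >= REQUIRED_STREAK &&
    cachedShare <= MAX_CACHED_SHARE;

  return mayDowngrade
    ? { ttl: "5m", ttlDowngradeStreak: 0 }
    : { ...state, ttlDowngradeStreak: streak };
}
```

Requiring a streak means a single short-gap turn does not trigger a downgrade, and the cached-share guard avoids discarding a cache that is still carrying most of the session.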
@@ -83,10 +80,10 @@
* **Always fix cache memory leaks with TTL eviction, size cap, and scheduled pruning**: Cache memory leak fix pattern: (1) TTL check in \`.get()\` — delete and return undefined if expired; (2) LRU eviction in \`.set()\` — delete oldest key when \`store.size >= maxEntries\`; (3) \`setInterval(() => this.prune(), 60\_000)\` in constructor. Defaults: \`maxEntries=10\_000\`, \`ttlMs=300\_000\`. Note: \`prune()\` is NOT currently scheduled in existing code. Locking: use \`flock\` advisory locking instead of \`proper-lockfile\` — \`proper-lockfile@4.1.2\` fails in containerized environments where PID namespaces reset on restart, leaving stale locks. \`flock\` is automatically released on process exit. Upgrade lock double-acquisition bug (\`binary.ts\`): \`downloadBinaryToTemp()\` acquires lock on \`\<execPath>.lock\`, then \`installBinary()\` tries to re-acquire same lock. Fix: in \`handleExistingLock\`, allow re-entry when \`existingPid === process.pid\`. Double \`releaseLock()\` is safe.

<!-- lore:019e4422-5b29-77a8-8956-488233ef16a4 -->
-* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review & investigation standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. Critical+Medium fixed before merge. (2) Investigation: read actual source, trace full call chain (file+line), enumerate 2-4 candidates, report confirmed/falsified verdict. Distinguish co-required bugs. (3) PR discipline: critical self-review before merge, CI green, amend+force-push. (4) After bug fix: add tests (4-6 edge cases) referencing issue number. Worker test files follow a consistent 7-case spec. (5) Sentry IDs start with \`LOREAI-GATEWAY-\`. (6) Run lint, typecheck, full test suite before committing. Use Vitest (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai May 2026). Use kebab-case file naming. (7) Document process/workflow decisions in AGENTS.md before proceeding. (8) For new infrastructure/tooling: analyze tradeoffs, sketch architecture, wait for confirmation before implementing.
+* **Always request critical code reviews with specific file paths, line numbers, and severity classifications**: Code review & investigation standards: (1) Reviews: exact file paths, line numbers, severity (C/M/L), root causes, concrete fixes. Check state-not-cleared, consume-once flags, circuit breaker bypass, concurrency edges. Critical+Medium fixed before merge. (2) Investigation: read actual source, trace full call chain, enumerate 2-4 candidates, report confirmed/falsified verdict. (3) PR discipline: critical self-review before merge, CI green, amend+force-push. (4) After bug fix: add tests (4-6 edge cases) referencing issue number. (5) Sentry IDs start with \`LOREAI-GATEWAY-\`. (6) Run lint, typecheck, full test suite before committing. Use Vitest (\`import { describe, it, expect } from 'vitest'\`; migrated from Mocha+Chai May 2026). Use kebab-case file naming. (7) Document process/workflow decisions in AGENTS.md before proceeding. (8) For new infrastructure/tooling: analyze tradeoffs, sketch architecture, wait for confirmation before implementing.

<!-- lore:019e498a-c0e4-70c5-ad40-d4d6d9d26ff5 -->
-* **CI/PR cycle: check failing jobs, wait for bots, resolve all comments before merging**: CI/PR cycle: After every push: (1) check failing jobs via \`gh run view --log-failed --job $(gh pr checks $PR\_NO --json state,link -q '.\[] | select(.state == "FAILURE").link | split("/")\[-1]')\`; (2) wait for 'Sentry Seer' and 'Cursor BugBot'; (3) fix all failures; (4) use \`gh api graphql\` with \`reviewThreads\` filtering \`isResolved==false, isMinimized==false\` for unresolved comments; (5) address all bot/human comments, respond or mark resolved; repeat until clean. PR creation: check if already on a relevant branch; follow repo branch/commit conventions; base PR description on implementation plan; add plan as \`git notes\`; create as draft initially. Always call \`plan\_exit\` when done planning. If BugBot finds nothing, merge and move on.
+* **CI/PR cycle: check failing jobs, wait for bots, resolve all comments before merging**: CI/PR cycle: After every push: (1) check failing jobs via \`gh run view --log-failed\`; (2) wait for 'Sentry Seer' and 'Cursor BugBot'; (3) fix all failures; (4) use \`gh api graphql\` with \`reviewThreads\` filtering \`isResolved==false, isMinimized==false\` for unresolved comments; (5) address all bot/human comments, respond or mark resolved; repeat until clean. PR creation: check if already on a relevant branch; follow repo branch/commit conventions; base PR description on implementation plan; add plan as \`git notes\`; create as draft initially. Always call \`plan\_exit\` when done planning. If BugBot finds nothing, merge and move on.

<!-- lore:019e3cd7-97d3-7053-8f02-bb13d727662e -->
* **Lore eval scores must beat or match tail-window — scoring below it means lost information**: Lore eval system: \`inflateScenario(scenario, opts?)\` in \`packages/eval/src/inflate.ts\` — opts is \`{ targetTokens?, excludeKeywords? }\`, NOT positional args. Token estimation: chars/4 (inflate), chars/3 (baselines.ts). 8 replay fixtures, 16 scenarios, 130 questions, 6 baselines in CI. \`--inflate\` incompatible with replay mode. Three baselines: (1) \`tailWindowBaseline()\`: backward scan, 80K token budget, drops prefix silently. (2) \`compactionBaseline()\`: multi-pass LLM summarization at 83.5% autoCompactThreshold. (3) \`buildLoreContext()\`: 25% distilled (40K) + 40% raw (64K). Filler turns (\`isFiller:true\`) skipped during gateway replay but included in \`allTurns\` for baseline context. Scores must beat or match tail-window — scoring below means lost information (treat as bug). QA contamination fixed via \`X-Lore-No-Store\`. Non-deterministic LLM output causes variance: re-run before concluding regression.
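The cache memory leak fix pattern listed in the hunk above (TTL check in `.get()`, eviction in `.set()`, scheduled `prune()`) could look roughly like this. It is a minimal sketch, not the repository's cache: the `TtlCache` class, its constructor options, and the injectable clock are hypothetical, while the defaults `maxEntries=10_000`, `ttlMs=300_000`, and the 60 s prune interval come from the entry. Eviction here removes the oldest inserted key (Map insertion order), which only approximates LRU.

```typescript
interface CacheOptions {
  maxEntries?: number;
  ttlMs?: number;
  pruneIntervalMs?: number; // 0 disables the background timer (useful in tests)
  now?: () => number; // injectable clock, an assumption for testability
}

class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  private maxEntries: number;
  private ttlMs: number;
  private now: () => number;
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(opts: CacheOptions = {}) {
    this.maxEntries = opts.maxEntries ?? 10_000;
    this.ttlMs = opts.ttlMs ?? 300_000;
    this.now = opts.now ?? (() => Date.now());
    const interval = opts.pruneIntervalMs ?? 60_000;
    if (interval > 0) {
      // (3) Scheduled pruning so expired entries do not pile up between reads.
      this.timer = setInterval(() => this.prune(), interval);
      // Do not keep the process alive just for pruning (Node/Bun timers only).
      (this.timer as unknown as { unref?: () => void }).unref?.();
    }
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    // (1) TTL check on read: delete and report a miss if expired.
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // (2) Size cap: evict the oldest inserted key before adding a new one.
    if (!this.store.has(key) && this.store.size >= this.maxEntries) {
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.delete(key); // re-insert so the key moves to the newest position
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  prune(): void {
    const t = this.now();
    for (const [key, entry] of this.store) {
      if (entry.expiresAt <= t) this.store.delete(key);
    }
  }

  get size(): number {
    return this.store.size;
  }

  dispose(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
  }
}
```

All three parts matter together: the read-time check alone never frees keys that are not read again, and the size cap alone lets stale values survive until the cap is hit.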
13 changes: 10 additions & 3 deletions packages/core/src/agents-file.ts
@@ -540,9 +540,16 @@ export function exportLoreFile(projectPath: string): void {
   }

   // Content changed — write and update cache.
-  writeFileSync(fp, content, "utf8");
-  const { mtimeMs } = statSync(fp);
-  setCache(fp, { mtimeMs, hash: contentHash });
+  // Wrap in try-catch to silently handle ENOENT (project dir deleted/renamed
+  // mid-session). Other FS errors (EACCES, EIO) still propagate.
+  try {
+    writeFileSync(fp, content, "utf8");
+    const { mtimeMs } = statSync(fp);
+    setCache(fp, { mtimeMs, hash: contentHash });
+  } catch (e: unknown) {
+    if ((e as NodeJS.ErrnoException).code === "ENOENT") return;
+    throw e;
+  }
 }

 /**
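To see the ENOENT handling from this hunk in isolation, here is a small standalone sketch. The helper name `writeIfDirExists` and the temp-directory demo are hypothetical and not part of `agents-file.ts`; the sketch only mirrors the pattern of swallowing `ENOENT` while re-throwing every other filesystem error.

```typescript
import { mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Returns true if the file was written, false if the target directory is gone.
// Any error other than ENOENT (for example EACCES or EIO) still propagates.
function writeIfDirExists(filePath: string, content: string): boolean {
  try {
    writeFileSync(filePath, content, "utf8");
    return true;
  } catch (e: unknown) {
    if ((e as { code?: string }).code === "ENOENT") return false;
    throw e;
  }
}

// Demo: write into a fresh temp dir, delete the dir, then try again.
const demoDir = mkdtempSync(join(tmpdir(), "lore-demo-"));
const wroteWhileDirExisted = writeIfDirExists(join(demoDir, "out.md"), "hello");
rmSync(demoDir, { recursive: true, force: true });
const wroteAfterDirDeleted = writeIfDirExists(join(demoDir, "out.md"), "hello");
```

Compared with an `existsSync` check before the write, catching the error also covers the case where the directory disappears between the check and the write.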
24 changes: 21 additions & 3 deletions packages/core/src/config.ts
@@ -2,6 +2,18 @@ import { z } from "zod";
 import { existsSync, readFileSync } from "node:fs";
 import { join } from "node:path";
 import { isHostedMode } from "./hosted";
+import { warn } from "./log";
+
+/**
+ * Strip JS-style comments from a JSON string, enabling JSONC support for
+ * `.lore.json`. Preserves `//` and `/* ... *​/` inside quoted strings.
+ * Also removes trailing commas before `}` or `]`.
+ */
+function stripJsonComments(str: string): string {
+  return str
+    .replace(/("(?:[^"\\]|\\.)*")|\/\/[^\n]*|\/\*[\s\S]*?\*\//g, (m, s) => s ?? "")
+    .replace(/,\s*([}\]])/g, "$1");
+}

 export const LoreConfig = z.object({
   model: z
@@ -257,9 +269,15 @@ export async function load(directory: string): Promise<LoreConfig> {
   if (!isHostedMode()) {
     const path = join(directory, ".lore.json");
     if (existsSync(path)) {
-      const raw = JSON.parse(readFileSync(path, "utf8"));
-      current = LoreConfig.parse(raw);
-      return current;
+      try {
+        const raw = JSON.parse(stripJsonComments(readFileSync(path, "utf8")));
+        current = LoreConfig.parse(raw);
+        return current;
+      } catch (e) {
+        warn(
+          `Failed to parse ${path}: ${e instanceof Error ? e.message : e}. Using defaults.`,
+        );
+      }
     }
   }
   current = LoreConfig.parse({});
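A quick way to check what the JSONC handling in this file accepts is to run the same two regex passes on a sample. In the sketch below, `stripJsonComments` is adapted from the hunk above, while `parseJsonc`, the sample text, and the fallback value are hypothetical stand-ins for `load()` and the Zod defaults.

```typescript
// Adapted from the diff: the first pass keeps quoted strings and drops
// comments, the second pass drops trailing commas before `}` or `]`.
function stripJsonComments(str: string): string {
  return str
    .replace(
      /("(?:[^"\\]|\\.)*")|\/\/[^\n]*|\/\*[\s\S]*?\*\//g,
      (_match, quoted) => quoted ?? "",
    )
    .replace(/,\s*([}\]])/g, "$1");
}

// Hypothetical helper: parse JSONC text, falling back to defaults on failure.
function parseJsonc<T>(text: string, fallback: T): T {
  try {
    return JSON.parse(stripJsonComments(text)) as T;
  } catch {
    return fallback;
  }
}

const sample = `{
  // line comment
  "url": "https://example.com/a", /* block comment */
  "items": [1, 2,],
}`;

const parsed = parseJsonc<{ url?: string; items?: number[] }>(sample, {});
const broken = parseJsonc<{ ok: boolean }>("{ not json", { ok: false });
```

The `//` inside the URL survives because the string alternative matches before the comment alternatives. One caveat, as far as the regex reads: the trailing-comma pass runs over the whole text, so a string value that itself contains `,}` or `,]` would appear to be altered as well.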
27 changes: 27 additions & 0 deletions packages/gateway/instrument.ts
@@ -72,6 +72,19 @@ const sentryEnabled =
   sentryEnvVar === "1" ? true : sentryEnvVar === "0" ? false : !isDev;

 if (sentryEnabled && !Sentry.isInitialized()) {
+  // Transient network errors that are expected in a long-running LLM proxy.
+  // These are not actionable bugs — they occur when clients disconnect,
+  // upstreams are temporarily unavailable, or network conditions degrade.
+  const TRANSIENT_ERROR_PATTERNS = [
+    /\bEPIPE\b/,
+    /socket connection was closed unexpectedly/i,
+    /ZlibError/,
+    /The operation timed out/i,
+    /Worker upstream exhausted \d+ retries/,
+    /ECONNRESET\b/,
+    /ECONNREFUSED\b/,
+  ];
+
   Sentry.init({
     dsn: "https://0282201d6a3df3bc46423e61012ae62b@o275100.ingest.us.sentry.io/4511355222622208",

@@ -85,6 +98,20 @@ if (sentryEnabled && !Sentry.isInitialized()) {
     // Capture 100% of transactions and logs
     tracesSampleRate: 1.0,
     enableLogs: true,
+
+    // Drop transient network errors that are not actionable bugs.
+    // Each exception in the chain is tested independently so a real bug
+    // wrapping a transient cause isn't accidentally silenced.
+    beforeSend(event) {
+      const values = event.exception?.values;
+      if (values?.some((v) => {
+        const msg = `${v.type}: ${v.value}`;
+        return TRANSIENT_ERROR_PATTERNS.some((re) => re.test(msg));
+      })) {
+        return null;
+      }
+      return event;
+    },
   });

   // Bridge core's log.* calls → Sentry structured logs + error capture
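The `beforeSend` filter can be exercised without the Sentry SDK by modelling only the event fields it reads. In this sketch the `EventLike` type and both function names are hypothetical; the patterns are copied from the hunk above. As the diff reads, `values?.some(...)` drops the event when any exception in the chain matches, so an error that wraps a transient cause would also be dropped. The second function shows a stricter `every` variant for comparison; that variant is an assumption about what the code comment intends, not something the PR implements.

```typescript
// Minimal stand-in for the parts of a Sentry event that the filter inspects.
interface EventLike {
  exception?: { values?: { type?: string; value?: string }[] };
}

const TRANSIENT_ERROR_PATTERNS: RegExp[] = [
  /\bEPIPE\b/,
  /socket connection was closed unexpectedly/i,
  /ZlibError/,
  /The operation timed out/i,
  /Worker upstream exhausted \d+ retries/,
  /ECONNRESET\b/,
  /ECONNREFUSED\b/,
];

function isTransient(v: { type?: string; value?: string }): boolean {
  const msg = `${v.type}: ${v.value}`;
  return TRANSIENT_ERROR_PATTERNS.some((re) => re.test(msg));
}

// Mirrors the diff: drop the event if ANY exception in the chain is transient.
function dropIfAnyTransient(event: EventLike): EventLike | null {
  return event.exception?.values?.some(isTransient) ? null : event;
}

// Stricter variant (assumption): drop only if EVERY exception is transient.
function dropIfAllTransient(event: EventLike): EventLike | null {
  const values = event.exception?.values;
  return values && values.length > 0 && values.every(isTransient) ? null : event;
}

const plainReset: EventLike = {
  exception: { values: [{ type: "Error", value: "read ECONNRESET" }] },
};
const wrappedBug: EventLike = {
  exception: {
    values: [
      { type: "TypeError", value: "x is undefined" },
      { type: "Error", value: "read ECONNRESET" },
    ],
  },
};
```

With these inputs, both functions drop `plainReset`, while only `dropIfAnyTransient` drops `wrappedBug`.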
2 changes: 2 additions & 0 deletions packages/gateway/script/bundle.ts
@@ -331,6 +331,8 @@ export declare function startServer(config: GatewayConfig): {
   stop: () => void;
   port: number;
   hosts: string[];
+  /** Resolves when the server is listening. Present under Node.js; absent under Bun. */
+  ready?: Promise<void>;
 };

 /**
12 changes: 12 additions & 0 deletions packages/gateway/script/node-polyfills.ts
@@ -85,6 +85,17 @@ if (typeof globalThis.Bun === "undefined") {
       }
     });

+    // Node's server.listen() is async — EADDRINUSE is emitted as an 'error'
+    // event, not thrown synchronously. Expose a `ready` promise so callers
+    // (startGateway) can await successful bind and catch port conflicts.
+    const ready = new Promise<void>((resolve, reject) => {
+      server.once("listening", resolve);
+      server.once("error", reject);
+    });
+    // Prevent UnhandledPromiseRejection if the caller never awaits `ready`.
+    // The real error surfaces when startGateway() awaits it.
+    ready.catch(() => {});
+
     server.listen(opts.port, opts.hostname);

     return {
@@ -94,6 +105,7 @@ if (typeof globalThis.Bun === "undefined") {
         if (typeof addr === "object" && addr !== null) return addr.port;
         return opts.port;
       },
+      ready,
     };
   }
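The `ready` promise added in this polyfill can be tried out against any `EventEmitter`, since a Node `net.Server` emits `listening` and `error` the same way. The helper names `whenListening` and `describeBind` below are hypothetical; the sketch only mirrors the pattern from the hunk (attach both listeners before `listen()`, and add a no-op `catch` so an un-awaited rejection is not reported as unhandled).

```typescript
import { EventEmitter } from "node:events";

// Resolves on 'listening', rejects on 'error' (for example EADDRINUSE).
function whenListening(server: EventEmitter): Promise<void> {
  const ready = new Promise<void>((resolve, reject) => {
    server.once("listening", () => resolve());
    server.once("error", reject);
  });
  // Avoid an unhandled rejection if the caller never awaits `ready`;
  // callers that do await it still see the original error.
  ready.catch(() => {});
  return ready;
}

// Example caller: turn the bind outcome into a readable status string.
async function describeBind(server: EventEmitter): Promise<string> {
  try {
    await whenListening(server);
    return "listening";
  } catch (e: unknown) {
    return `failed: ${(e as { code?: string }).code ?? "unknown"}`;
  }
}
```

With a real server, the promise would be created first and `server.listen(port, hostname)` called afterwards, as in the diff, so that neither event can fire before its listener is attached.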
