feat: local embeddings, exact keyword boost, and pattern extraction by BYK · Pull Request #136 · BYK/loreai

BYK · 2026-05-07T11:14:36Z

Summary

Local embedding provider (P1): Adds fastembed (bge-small-en-v1.5, 384 dims) as default embedding provider. Runs fully on-device via ONNX Runtime — no API key, zero cost, ~150ms per query. Voyage and OpenAI remain available via .lore.json config. Existing embeddings auto-clear and re-embed on provider change via checkConfigChange().
Exact keyword match boost (P2): Adds exactTermMatchRank() as an additional RRF signal in recall. Ranks candidates by verbatim query term overlap, boosting proper nouns, file names, and technical terms that BM25's Porter stemming may dilute. Zero-cost, additive-only.
Pattern extraction at distillation time (P3): 8 conservative regex patterns detect decision/preference language ("decided to use X", "chose X over Y", "prefers X over Y") in distilled observations and seed knowledge entries via ltm.create(). Runs in both distillSegment() and metaDistill(). Zero LLM cost, <1ms runtime.
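
As a rough illustration of the P2 signal described above, the sketch below ranks candidates by verbatim query term overlap and fuses that list with another ranking via reciprocal rank fusion. It is not the code from this PR: `Candidate`, `tokenize`, `exactTermRankSketch`, `rrf`, and the constant `k = 60` are assumptions made for the example, and the real `exactTermMatchRank()` may tokenize and score differently.

```typescript
// Hypothetical candidate shape; the real types in the PR are not shown.
interface Candidate {
  id: string;
  text: string;
}

// Lowercase and split on anything that is not a letter, digit, '.', '_' or '-',
// so file names and identifiers such as "recall.ts" stay single terms.
function tokenize(s: string): string[] {
  return s
    .toLowerCase()
    .split(/[^a-z0-9._-]+/)
    .map((t) => t.replace(/^[._-]+|[._-]+$/g, "")) // trim stray punctuation
    .filter((t) => t.length > 1);
}

// Rank candidates by how many distinct query terms they contain verbatim
// (no stemming). Candidates with zero overlap are left out of the list,
// so this signal can only add score during fusion, never subtract.
function exactTermRankSketch(query: string, candidates: Candidate[]): string[] {
  const terms = Array.from(new Set(tokenize(query)));
  return candidates
    .map((c) => {
      const words = new Set(tokenize(c.text));
      return { id: c.id, hits: terms.filter((t) => words.has(t)).length };
    })
    .filter((c) => c.hits > 0)
    .sort((a, b) => b.hits - a.hits)
    .map((c) => c.id);
}

// Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank),
// with rank starting at 1. k = 60 is a commonly used value, assumed here.
function rrf(lists: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return scores;
}
```

Passing the exact-match list to `rrf` alongside the BM25 and vector lists means a document that contains a literal term such as `recall.ts` gains a little score, while documents without a verbatim match are unaffected.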

Migration

Existing Voyage users with no explicit .lore.json will silently switch to local embeddings. checkConfigChange() detects the fingerprint mismatch and auto-rebuilds all embeddings with the new provider. To stay on Voyage, add explicit config:

{ "search": { "embeddings": { "provider": "voyage", "model": "voyage-code-3", "dimensions": 1024 } } }
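
The fingerprint check mentioned above could look roughly like the sketch below. This is a guess at the idea, not the actual `checkConfigChange()` implementation: `EmbeddingConfig`, `fingerprint`, and `needsReembed` are hypothetical names, and the model string used for the local provider is an assumption.

```typescript
// Hypothetical config shape mirroring the "embeddings" block in .lore.json.
interface EmbeddingConfig {
  provider: string;
  model: string;
  dimensions: number;
}

// Collapse the fields that affect vector compatibility into one string.
function fingerprint(c: EmbeddingConfig): string {
  return `${c.provider}:${c.model}:${c.dimensions}`;
}

// A rebuild is needed when a fingerprint was stored earlier and it no longer
// matches the active config. With nothing stored there is nothing to clear.
function needsReembed(stored: string | null, current: EmbeddingConfig): boolean {
  return stored !== null && stored !== fingerprint(current);
}

// Assumed values for illustration only.
const voyageConfig: EmbeddingConfig = { provider: "voyage", model: "voyage-code-3", dimensions: 1024 };
const localConfig: EmbeddingConfig = { provider: "local", model: "bge-small-en-v1.5", dimensions: 384 };
```

Under these assumptions, a database last embedded with `voyageConfig` and opened with `localConfig` reports a mismatch, which is the situation the migration note warns about.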

Test results

796 pass, 0 fail (including 42 new tests across 3 test files).

- P1: Add local embedding provider via fastembed (bge-small-en-v1.5, 384 dims). Runs on-device via ONNX Runtime with no API key needed. Model downloads on first use (~33MB), cached locally. Made 'local' the new default provider; Voyage/OpenAI remain available via .lore.json config. Existing embeddings auto-clear and re-embed on provider change via checkConfigChange().
- P2: Add exact keyword match boost as additional RRF signal in recall. Ranks candidates by verbatim query term overlap to boost proper nouns, file names, and technical terms that BM25 stemming may dilute.
- P3: Add regex-based pattern extraction at distillation time. 8 conservative patterns detect decision/preference language in distilled observations and seed knowledge entries via ltm.create(). Runs in both distillSegment() and metaDistill() paths. Zero LLM cost.
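
To make the P3 idea concrete, here is a minimal sketch of regex-based extraction for the three phrasings quoted in the summary. The PR's actual 8 patterns are not shown on this page, so the regexes, the `ExtractedPattern` shape, and the `extractPatterns` name are all hypothetical.

```typescript
// Hypothetical result shape; the real knowledge-entry fields are not shown.
interface ExtractedPattern {
  kind: "decision" | "preference";
  subject: string;
  alternative: string | null;
}

// Illustrative patterns only. Each captures a single token made of word
// characters plus a few symbols common in package and file names.
const PATTERNS: { kind: ExtractedPattern["kind"]; re: RegExp }[] = [
  { kind: "decision", re: /\bdecided to use ([\w.@/-]+)/i },
  { kind: "decision", re: /\bchose ([\w.@/-]+) over ([\w.@/-]+)/i },
  { kind: "preference", re: /\bprefers ([\w.@/-]+) over ([\w.@/-]+)/i },
];

// Drop sentence punctuation that the character class may have swallowed.
function clean(token: string | undefined): string | null {
  if (!token) return null;
  const trimmed = token.replace(/[.,;:]+$/, "");
  return trimmed.length > 0 ? trimmed : null;
}

// Return one entry per pattern that matches the observation text.
function extractPatterns(observation: string): ExtractedPattern[] {
  const out: ExtractedPattern[] = [];
  for (const { kind, re } of PATTERNS) {
    const m = re.exec(observation);
    if (!m) continue;
    const subject = clean(m[1]);
    if (subject) out.push({ kind, subject, alternative: clean(m[2]) });
  }
  return out;
}
```

Each result could then be turned into a knowledge entry, which in this PR happens through `ltm.create()`. Keeping the patterns narrow trades recall for precision, which fits the "conservative" framing above.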

…l scoring (#137)

## Summary

- **P0: Unified embedding backfill**: Moved embedding backfill from OpenCode-only to a shared `runStartupBackfill()` in core. Both OpenCode and Pi adapters now call this at init, fixing near-zero embedding coverage (~0.5%) for Pi-served projects. Logs coverage stats to stderr at startup for immediate visibility.
- **P3: Retroactive metric backfill**: Added `backfillMetrics()` to compute `r_compression` and `c_norm` for ~2,863 pre-v12 distillations by loading source temporal messages. Idempotent, handles pruned sources gracefully. Runs synchronously at startup.
- **P2: Quality-weighted recall scoring**: Wired `c_norm` into recall by adding a distillation quality RRF list. Uniform segments (low c_norm) rank higher; old bursty segments rank lower via `c_norm + (ageDays/90) * 0.1`. Mild secondary signal that blends naturally with existing BM25 and vector relevance.

## Files Changed

| File | Change |
|---|---|
| `packages/core/src/embedding.ts` | Added `runStartupBackfill()` with coverage logging |
| `packages/core/src/distillation.ts` | Added `backfillMetrics()` for retroactive r_compression/c_norm |
| `packages/core/src/recall.ts` | Added c_norm to Distillation type, all SELECTs, and quality RRF list |
| `packages/opencode/src/index.ts` | Replaced inline backfill with core functions |
| `packages/pi/src/index.ts` | Added embedding + metric backfill calls |
| `.lore.md` | Updated project knowledge |

## Testing

- 796 tests pass, 0 failures
- Build TS errors are pre-existing (fastembed type issue on parent branch)
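
The quality expression quoted in this commit message, `c_norm + (ageDays/90) * 0.1`, can be read as a penalty where lower is better. The sketch below builds a ranked list from it under that reading. `DistillationRow`, `qualityPenalty`, and `qualityRankList` are hypothetical names, and the timestamp field is an assumption about how age is stored.

```typescript
// Hypothetical row shape; only c_norm is named in the commit message.
interface DistillationRow {
  id: string;
  c_norm: number;
  createdAtMs: number; // assumed: creation time in epoch milliseconds
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Penalty from the commit message: c_norm plus 0.1 for every 90 days of age.
function qualityPenalty(row: DistillationRow, nowMs: number): number {
  const ageDays = Math.max(0, (nowMs - row.createdAtMs) / MS_PER_DAY);
  return row.c_norm + (ageDays / 90) * 0.1;
}

// Lowest penalty first, so uniform and recent segments lead the list
// that would be handed to rank fusion.
function qualityRankList(rows: DistillationRow[], nowMs: number): string[] {
  return rows
    .map((r) => ({ id: r.id, penalty: qualityPenalty(r, nowMs) }))
    .sort((a, b) => a.penalty - b.penalty)
    .map((r) => r.id);
}
```

Because the age term only adds 0.1 per 90 days, a clearly uniform old segment can still outrank a bursty new one, which matches the "mild secondary signal" description.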

TypeScript's overload resolution can't infer the correct variant when the model identifier is resolved dynamically from config. Cast the options object to the standard init signature (BGESmallENV15 is the default) to satisfy the type checker. The enum lookup at runtime guarantees a valid model value.
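
The overload problem described above can be reproduced without the library. The sketch below uses hypothetical stand-ins (`Model`, `StandardOptions`, `CustomOptions`, `init`, `resolveModel`) that only mirror the general shape of an overloaded initializer; they are not fastembed's real declarations.

```typescript
// Hypothetical stand-ins for a library's model enum and overloaded init().
enum Model {
  Small = "small",
  Custom = "custom",
}

interface StandardOptions {
  model: Exclude<Model, Model.Custom>;
}

interface CustomOptions {
  model: Model.Custom;
  modelPath: string;
}

function init(options: StandardOptions): string;
function init(options: CustomOptions): string;
function init(options: StandardOptions | CustomOptions): string {
  return `loaded ${options.model}`;
}

// Resolve a config string to a standard model at runtime, rejecting
// unknown names and the custom variant (which would need a path).
function resolveModel(configValue: string): Model {
  const standard: string[] = [Model.Small];
  if (standard.indexOf(configValue) === -1) {
    throw new Error(`unknown model: ${configValue}`);
  }
  return configValue as Model;
}

const resolved = resolveModel("small");

// `init({ model: resolved })` is rejected by the type checker: `resolved`
// has the wide type Model, which satisfies neither overload on its own.
// The cast asserts the standard variant, relying on the runtime check above.
const loaded = init({ model: resolved } as StandardOptions);
```

The cast is only as safe as the runtime lookup that precedes it, so the validation and the assertion are best kept next to each other.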

BYK enabled auto-merge (squash) May 7, 2026 11:14

BYK added 2 commits May 7, 2026 12:48

BYK merged commit 8695f3e into main May 7, 2026
1 check passed

BYK deleted the feat/local-embeddings-keyword-boost-pattern-extract branch May 7, 2026 18:41

This was referenced May 7, 2026

publish: BYK/loreai@0.13.0 #145

Closed

publish: BYK/loreai@0.13.1 #146

Closed

publish: BYK/loreai@0.13.2 #147

Closed

publish: BYK/loreai@0.13.3 #148

Closed


BYK merged 3 commits into main from feat/local-embeddings-keyword-boost-pattern-extract
