feat: local embeddings, exact keyword boost, and pattern extraction #136

Merged
BYK merged 3 commits into main from feat/local-embeddings-keyword-boost-pattern-extract
May 7, 2026
@BYK BYK commented May 7, 2026

Summary

  • Local embedding provider (P1): Adds fastembed (bge-small-en-v1.5, 384 dims) as default embedding provider. Runs fully on-device via ONNX Runtime — no API key, zero cost, ~150ms per query. Voyage and OpenAI remain available via .lore.json config. Existing embeddings auto-clear and re-embed on provider change via checkConfigChange().

  • Exact keyword match boost (P2): Adds exactTermMatchRank() as an additional RRF signal in recall. Ranks candidates by verbatim query term overlap, boosting proper nouns, file names, and technical terms that BM25's Porter stemming may dilute. Zero-cost, additive-only.

  • Pattern extraction at distillation time (P3): 8 conservative regex patterns detect decision/preference language ("decided to use X", "chose X over Y", "prefers X over Y") in distilled observations and seed knowledge entries via ltm.create(). Runs in both distillSegment() and metaDistill(). Zero LLM cost, <1ms runtime.
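The exact keyword boost (P2) can be pictured as one more ranked list fed into reciprocal rank fusion. The following is a minimal sketch of that idea, not the actual `exactTermMatchRank()` implementation: the `Candidate` shape, the tokenizer, the helper names `exactTermRank` and `rrfFuse`, and the RRF constant `k = 60` are all assumptions made for illustration.

```typescript
// Hypothetical candidate shape; the real recall types are not shown in this PR.
interface Candidate {
  id: string;
  text: string;
}

// Lowercase and split without stemming. Dots, underscores, and hyphens are kept
// inside a term so that names like "recall.ts" survive as one token.
function terms(s: string): string[] {
  return s
    .toLowerCase()
    .split(/[^a-z0-9._-]+/)
    .map((t) => t.replace(/^[._-]+|[._-]+$/g, ""))
    .filter((t) => t.length > 0);
}

// Rank candidates by the number of distinct query terms they contain verbatim.
// Candidates with no overlap are left out, so the signal can only add score.
function exactTermRank(query: string, candidates: Candidate[]): string[] {
  const queryTerms = Array.from(new Set(terms(query)));
  return candidates
    .map((c) => {
      const docTerms = new Set(terms(c.text));
      const hits = queryTerms.filter((t) => docTerms.has(t)).length;
      return { id: c.id, hits };
    })
    .filter((r) => r.hits > 0)
    .sort((a, b) => b.hits - a.hits)
    .map((r) => r.id);
}

// Reciprocal rank fusion: every list contributes 1 / (k + rank) for each id.
function rrfFuse(lists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Passing the exact-match list to `rrfFuse` alongside the BM25 and vector lists lifts a candidate that contains the literal query terms, and leaves the relative order of the other candidates untouched.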

Migration

Existing Voyage users with no explicit .lore.json will silently switch to local embeddings. checkConfigChange() detects the fingerprint mismatch and auto-rebuilds all embeddings with the new provider. To stay on Voyage, add explicit config:

{ "search": { "embeddings": { "provider": "voyage", "model": "voyage-code-3", "dimensions": 1024 } } }
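The automatic rebuild depends on comparing a fingerprint of the active embedding config with the one recorded for the stored vectors. A rough sketch of that check follows, assuming the fingerprint is simply provider, model, and dimensions joined together. `fingerprint` and `needsReembed` are hypothetical helpers; the real `checkConfigChange()` may compute and persist this differently.

```typescript
// Field names mirror the .lore.json snippet above; the union of providers is
// taken from the PR description.
interface EmbeddingConfig {
  provider: "local" | "voyage" | "openai";
  model: string;
  dimensions: number;
}

// New default according to this PR.
const DEFAULT_LOCAL: EmbeddingConfig = {
  provider: "local",
  model: "bge-small-en-v1.5",
  dimensions: 384,
};

// Assumption: a plain joined string is enough to detect a provider change.
function fingerprint(c: EmbeddingConfig): string {
  return `${c.provider}:${c.model}:${c.dimensions}`;
}

// True when vectors exist and were produced under a different config.
// A null stored fingerprint means nothing has been embedded yet.
function needsReembed(stored: string | null, active: EmbeddingConfig): boolean {
  return stored !== null && stored !== fingerprint(active);
}
```

Under this sketch, a database last written with `voyage-code-3` at 1024 dimensions no longer matches the 384-dimension local default, which is why users without an explicit config see a full re-embed.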

Test results

796 pass, 0 fail (including 42 new tests across 3 test files).

- P1: Add local embedding provider via fastembed (bge-small-en-v1.5, 384 dims).
  Runs on-device via ONNX Runtime with no API key needed. Model downloads
  on first use (~33MB), cached locally. Made 'local' the new default provider;
  Voyage/OpenAI remain available via .lore.json config. Existing embeddings
  auto-clear and re-embed on provider change via checkConfigChange().

- P2: Add exact keyword match boost as additional RRF signal in recall.
  Ranks candidates by verbatim query term overlap to boost proper nouns,
  file names, and technical terms that BM25 stemming may dilute.

- P3: Add regex-based pattern extraction at distillation time.
  8 conservative patterns detect decision/preference language in distilled
  observations and seed knowledge entries via ltm.create(). Runs in both
  distillSegment() and metaDistill() paths. Zero LLM cost.
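To make the P3 step concrete, here is a small sketch of regex-based extraction built from the three phrasings quoted in the summary. It is illustrative only: the PR ships 8 patterns that are not listed here, and the `ExtractedPattern` shape, the `TOKEN` sub-pattern, and `extractPatterns` are assumptions, as is how results would be handed to `ltm.create()`.

```typescript
// Hypothetical result shape for one detected decision or preference.
interface ExtractedPattern {
  kind: "decision" | "preference";
  subject: string;
  over?: string;
}

// A conservative "name" token: word characters plus @ / -, with dots allowed
// only between parts so a sentence-ending period is not captured.
const TOKEN = "([\\w@/-]+(?:\\.[\\w@/-]+)*)";

const PATTERNS: Array<{ kind: ExtractedPattern["kind"]; re: RegExp }> = [
  { kind: "decision", re: new RegExp("\\bdecided to use " + TOKEN, "i") },
  { kind: "decision", re: new RegExp("\\bchose " + TOKEN + " over " + TOKEN, "i") },
  { kind: "preference", re: new RegExp("\\bprefers " + TOKEN + " over " + TOKEN, "i") },
];

// Returns at most one match per pattern for a single distilled observation.
function extractPatterns(observation: string): ExtractedPattern[] {
  const out: ExtractedPattern[] = [];
  for (const { kind, re } of PATTERNS) {
    const m = re.exec(observation);
    if (!m) continue;
    const found: ExtractedPattern = { kind, subject: m[1] };
    if (m[2] !== undefined) found.over = m[2];
    out.push(found);
  }
  return out;
}
```

Because the patterns are anchored on fixed verbs and capture only a single token, they err toward missing a decision rather than inventing one, which matches the "conservative" framing in the description.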
@BYK BYK enabled auto-merge (squash) May 7, 2026 11:14
BYK added 2 commits May 7, 2026 12:48
…l scoring (#137)

## Summary

- **P0: Unified embedding backfill** — Moved embedding backfill from
OpenCode-only to a shared `runStartupBackfill()` in core. Both OpenCode
and Pi adapters now call this at init, fixing near-zero embedding
coverage (~0.5%) for Pi-served projects. Logs coverage stats to stderr
at startup for immediate visibility.
- **P3: Retroactive metric backfill** — Added `backfillMetrics()` to
compute `r_compression` and `c_norm` for ~2,863 pre-v12 distillations by
loading source temporal messages. Idempotent, handles pruned sources
gracefully. Runs synchronously at startup.
- **P2: Quality-weighted recall scoring** — Wired `c_norm` into recall
by adding a distillation quality RRF list. Uniform segments (low c_norm)
rank higher; old bursty segments rank lower via `c_norm + (ageDays/90) *
0.1`. Mild secondary signal that blends naturally with existing BM25 and
vector relevance.
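The quality signal can be sketched as a sort on the penalty `c_norm + (ageDays/90) * 0.1`, lowest first, producing one more list for RRF. The row shape, the millisecond timestamp, and the choice to skip rows without `c_norm` are assumptions for this example; only the formula comes from the description above.

```typescript
// Hypothetical row shape; created_at is assumed to be a millisecond epoch.
interface DistillationRow {
  id: string;
  c_norm: number | null;
  created_at: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Lower is better: uniform segments have low c_norm, and every 90 days of age
// adds 0.1 to the penalty.
function qualityPenalty(cNorm: number, ageDays: number): number {
  return cNorm + (ageDays / 90) * 0.1;
}

// Produce a ranked id list for fusion. Rows without c_norm are skipped here
// (an assumption) so they neither gain nor lose from this signal.
function qualityRank(rows: DistillationRow[], now: number): string[] {
  const scored: Array<{ id: string; penalty: number }> = [];
  for (const row of rows) {
    if (row.c_norm === null) continue;
    const ageDays = Math.max(0, (now - row.created_at) / DAY_MS);
    scored.push({ id: row.id, penalty: qualityPenalty(row.c_norm, ageDays) });
  }
  return scored.sort((a, b) => a.penalty - b.penalty).map((s) => s.id);
}
```

Since the output is only a rank list, its influence is bounded by the RRF weighting, which is consistent with the description of it as a mild secondary signal.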

## Files Changed

| File | Change |
|---|---|
| `packages/core/src/embedding.ts` | Added `runStartupBackfill()` with coverage logging |
| `packages/core/src/distillation.ts` | Added `backfillMetrics()` for retroactive r_compression/c_norm |
| `packages/core/src/recall.ts` | Added c_norm to Distillation type, all SELECTs, and quality RRF list |
| `packages/opencode/src/index.ts` | Replaced inline backfill with core functions |
| `packages/pi/src/index.ts` | Added embedding + metric backfill calls |
| `.lore.md` | Updated project knowledge |

## Testing

- 796 tests pass, 0 failures
- Build TS errors are pre-existing (fastembed type issue on parent
branch)
TypeScript's overload resolution can't infer the correct variant when the
model identifier is resolved dynamically from config. Cast the options
object to the standard init signature (BGESmallENV15 is the default) to
satisfy the type checker. The enum lookup at runtime guarantees a valid
model value.
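The overload problem described here can be reproduced with a self-contained sketch. The types and the `init` function below are hypothetical stand-ins, not fastembed's real declarations; they only mirror the situation where a dynamically resolved model value matches neither overload until the options object is cast.

```typescript
// Hypothetical stand-ins for an overloaded library API.
type Model = "small" | "base" | "custom";
type StandardModel = "small" | "base";

interface StandardInit {
  model: StandardModel;
}
interface CustomInit {
  model: "custom";
  modelPath: string;
}

function init(options: StandardInit): string;
function init(options: CustomInit): string;
function init(options: StandardInit | CustomInit): string {
  return options.model === "custom" ? `custom:${options.modelPath}` : options.model;
}

// Runtime lookup: only standard models are resolvable from config here, which
// is what makes the cast below safe in practice.
const KNOWN = new Map<string, Model>([
  ["small", "small"],
  ["base", "base"],
]);

function resolveModel(name: string): Model {
  const model = KNOWN.get(name);
  if (model === undefined) throw new Error(`Unknown embedding model: ${name}`);
  return model;
}

function initFromConfig(name: string): string {
  const model = resolveModel(name);
  // init({ model }) does not compile: the wide `Model` type fits neither
  // overload. Casting to the standard signature satisfies the checker.
  return init({ model } as StandardInit);
}
```

The cast trades a compile-time guarantee for a runtime one, so it is only as safe as the lookup that precedes it.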
@BYK BYK merged commit 8695f3e into main May 7, 2026
1 check passed
@BYK BYK deleted the feat/local-embeddings-keyword-boost-pattern-extract branch May 7, 2026 18:41