feat: migrate embeddings from fastembed/BGE Small to transformers.js/Nomic v1.5 #287
Merged
…Nomic v1.5

Replace fastembed (optional dep) + BGE Small v1.5 (384d) with @huggingface/transformers (regular dep) + Nomic Embed Text v1.5 (768d).

Key improvements:
- Same-domain entry similarity drops from 0.93-0.97 to 0.46-0.70, making vector-based dedup viable (was unusable with BGE Small)
- Task instruction prefixes (search_document:/search_query:) per Nomic spec
- Matryoshka-capable: supports 64-768 dimensions
- Simplified build pipeline: no per-target staging trees, side-load libs, or dlopen of onnxruntime; transformers.js bundles its own ONNX runtime
- Net code reduction: -1035 lines, +502 lines

Tradeoff: binary size increases ~146 MB (model 17→132 MB, runtime 30→21 MB WASM, JS lib 5→45 MB). INT8 quantized model quality justifies the cost.
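The task prefixes and Matryoshka truncation listed in this commit, combined with the post-processing order the PR summary gives for the worker (mean pooling → layer norm → Matryoshka truncation → L2 normalization), can be sketched in plain TypeScript. This is a minimal illustration, not the code in embedding-worker.ts: every function name is hypothetical, the space after each prefix colon is an assumption, and inputs are plain number arrays rather than transformers.js tensors.

```typescript
// Hypothetical helpers; only the prefixes and the step order come from the PR text.
type Task = "document" | "query";

// Prepend the task instruction prefix (assumed format: "<prefix>: <text>").
function withTaskPrefix(text: string, task: Task): string {
  return (task === "document" ? "search_document: " : "search_query: ") + text;
}

// Average token vectors, skipping positions whose attention mask is 0.
function meanPool(tokens: number[][], mask: number[]): number[] {
  const dim = tokens[0].length;
  const sum = new Array<number>(dim).fill(0);
  let count = 0;
  tokens.forEach((token, i) => {
    if (mask[i] === 0) return;
    count++;
    for (let d = 0; d < dim; d++) sum[d] += token[d];
  });
  return sum.map((v) => v / Math.max(count, 1));
}

// Layer norm without learned scale/bias: zero mean, unit variance.
function layerNorm(v: number[], eps = 1e-5): number[] {
  const mean = v.reduce((a, b) => a + b, 0) / v.length;
  const variance = v.reduce((a, b) => a + (b - mean) ** 2, 0) / v.length;
  const denom = Math.sqrt(variance + eps);
  return v.map((x) => (x - mean) / denom);
}

// Scale to unit length so dot product equals cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((a, b) => a + b * b, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Full chain: pool, normalize, keep the first `dims` components, renormalize.
function postProcess(tokens: number[][], mask: number[], dims: number): number[] {
  const normed = layerNorm(meanPool(tokens, mask));
  return l2Normalize(normed.slice(0, dims));
}
```

Truncating before the final L2 step means a shortened vector (for example 256 of 768 dimensions) still has unit length, so cosine comparisons between vectors of the same truncated size stay meaningful.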
@huggingface/transformers depends on onnxruntime-node which contains .node native binary files that esbuild cannot handle. Keep it external in all esbuild steps (main bundle, worker, core build, gateway bundle). Bun's --compile step handles .node files natively.
sharp is pulled in by @huggingface/transformers for vision model image processing — not needed for text embeddings. Externalizing it drops the worker bundle from 607 KB to 489 KB and removes native sharp deps from the binary.
… Bun compile

The onnxResolvePlugin:
- Resolves onnxruntime-node/common from Bun's .bun/ store to their actual paths so esbuild can bundle the JS parts inline
- Patches binding.js to wrap the dynamic .node require in an indirect call, preventing esbuild from statically resolving .node binaries (Bun --compile handles them at runtime)
- Stubs sharp as an empty module (only used for vision models)

This fixes the CI binary build failure where esbuild couldn't handle .node files and Bun compile couldn't resolve bare onnxruntime-node imports from the esbuild output.
Replace fastembed-era JQ checks (.modelAbsoluteDirPath, .modelName) with new transformers.js vendor fields (.localModelPath, .version). Update comments to reference nomic-embed-text-v1.5 instead of fastembed/BGE Small/onnxruntime.
The binary build plugin redirects onnxruntime-node → onnxruntime-web
so transformers.js uses the WASM+SIMD ONNX runtime instead of native
NAPI bindings (which can't load from Bun's $bunfs).
Key changes:
- binaryExternalsPlugin: redirects onnxruntime-node → onnxruntime-web,
stubs sharp, resolves onnxruntime-common/web from .bun/ store
- Wrapper embeds WASM runtime files as Bun assets, passes exact $bunfs
paths via __LORE_VENDOR_WASM_PATHS__ (object form { mjs, wasm })
- Worker sets wasmPaths on onnxruntime env before importing transformers.js
- MODEL_DIR_NAME uses '/' separator (matching transformers.js localModelPath
resolution) instead of '--' (HF cache convention)
WASM backend is roughly 1.6-1.9x faster on batch embeddings than native onnxruntime-node
(95ms vs 155ms for batch-10, 413ms vs 779ms for batch-50).
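The MODEL_DIR_NAME change listed above comes down to two directory-naming conventions for the same model id. A small sketch of the difference, with hypothetical helper names and an example vendor path that does not come from this PR:

```typescript
// Hypothetical helpers; only the two separator conventions come from the commit message.
const MODEL_ID = "nomic-ai/nomic-embed-text-v1.5";

// Layout for localModelPath resolution: the model id is used as a relative path,
// so the "/" in the id becomes a real directory level.
function localModelDir(localModelPath: string, modelId: string): string {
  return `${localModelPath.replace(/\/+$/, "")}/${modelId}`;
}

// Hugging Face hub cache layout, shown for contrast: "/" is flattened to "--".
function hfCacheDirName(modelId: string): string {
  return `models--${modelId.split("/").join("--")}`;
}
```

With an assumed root of `/vendor/models`, the first helper yields `/vendor/models/nomic-ai/nomic-embed-text-v1.5`, while the second yields the flat name `models--nomic-ai--nomic-embed-text-v1.5`.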
TypeScript can't find onnxruntime-node types (it's a transitive dep in .bun/). Use string concatenation to prevent tsc from resolving the module at typecheck time — in the binary, esbuild redirects it to onnxruntime-web anyway.
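The string-concatenation trick described above might look like the following sketch. The names `ORT_SPECIFIER` and `loadOnnxRuntime` are hypothetical; the point is only that tsc resolves module types for string-literal specifiers, and a concatenated value is an ordinary `string`.

```typescript
// Built from two pieces so the import() argument is not a string literal.
// tsc then types the result loosely instead of looking up "onnxruntime-node".
const ORT_SPECIFIER: string = "onnxruntime" + "-node";

// Hypothetical loader: resolution happens only at runtime, when this is called.
async function loadOnnxRuntime(): Promise<unknown> {
  return import(ORT_SPECIFIER);
}
```

The cost of this approach is that the imported module is untyped, so any properties read from it need explicit narrowing or casts at the call site.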
Instead of importing onnxruntime at runtime in the worker (which
esbuild can't redirect for dynamic imports and Bun compile can't
resolve from .bun/), patch the transformers.js source at esbuild
bundle time to read wasmPaths from globalThis.__LORE_VENDOR_WASM_PATHS__
instead of the CDN URL.
The wrapper sets __LORE_VENDOR_WASM_PATHS__ = { mjs, wasm } with
$bunfs asset paths before importing the worker. The esbuild onLoad
hook replaces the CDN assignment in transformers.node.mjs with a
globalThis read.
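The source rewrite performed by the onLoad hook can be sketched as a pure string transform. This is an illustration under assumptions: `patchWasmPaths` is a hypothetical name, and the exact CDN expression inside transformers.node.mjs is not shown in this PR, so the sketch takes it as a parameter rather than guessing it.

```typescript
const GLOBAL_READ = "globalThis.__LORE_VENDOR_WASM_PATHS__";

// Replace the CDN expression (supplied by the caller) with a globalThis read.
// Throws when the expression is absent, so a transformers.js upgrade that
// changes the source fails the build instead of silently shipping the CDN URL.
function patchWasmPaths(source: string, cdnExpression: string): string {
  if (source.indexOf(cdnExpression) === -1) {
    throw new Error("CDN wasmPaths expression not found in source");
  }
  // split/join replaces every occurrence without regex escaping concerns.
  return source.split(cdnExpression).join(GLOBAL_READ);
}

// Inside an esbuild plugin this would be called from an onLoad callback, e.g.:
//   build.onLoad({ filter: /transformers\.node\.mjs$/ }, async (args) => ({
//     contents: patchWasmPaths(await readFile(args.path, "utf8"), CDN_EXPRESSION),
//     loader: "js",
//   }));
// where CDN_EXPRESSION is the exact text found in the installed library version.
```

Because the patched code reads the global when it runs, the wrapper has to assign `__LORE_VENDOR_WASM_PATHS__` before the worker module is imported, which matches the ordering described above.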
- Remove stale --all and --target flags from vendor-embeddings.ts CI invocations (script no longer accepts arguments)
- Update CI comment about vendor cache (shared model cache, not per-target native bindings)
- Fix .lore.md entries to reflect final WASM approach (globalThis __LORE_VENDOR_WASM_PATHS__ + esbuild onLoad patch, not wrapper onnxruntime import)
BYK added a commit that referenced this pull request on May 13, 2026
## Summary

- Enhance `deduplicate()` to use embedding cosine similarity (≥0.85 threshold) alongside title word-overlap, catching semantically identical entries with different titles
- Add `lore data reindex` CLI command for on-demand re-embedding without gateway restart
- Auto-reindex in `lore data dedup` when stale/missing embeddings detected

## Motivation

With the Nomic v1.5 migration (PR #287), same-domain distinct entries score 0.46–0.70 cosine similarity, making embedding-based dedup viable at threshold 0.85+. Previously, BGE Small produced 0.93–0.97 for all same-domain entries, so dedup was limited to title word-overlap only.

## What changed

### `packages/core/src/ltm.ts`

- `deduplicate()` now builds neighbor maps using **two signals**: title word-overlap (existing, ≥0.7 Jaccard + ≥4 shared words) OR embedding cosine similarity (new, ≥0.85). Pairs matching either signal are clustered together.
- Loads embeddings for project entries and computes pairwise similarity, with a dimension guard (`entryVec.length === otherVec.length`) to skip stale vectors.

### `packages/gateway/src/cli/data.ts`

- New `lore data reindex` command: calls `checkConfigChange()` + `backfillEmbeddings()` + `backfillDistillationEmbeddings()` directly.
- `lore data dedup` now auto-calls `checkConfigChange()` + `backfillEmbeddings()` before scanning, so stale embeddings from a model migration are refreshed automatically.

## Test results

- 1348 tests pass, typecheck clean
- Tested against real DB: found **102 duplicates across 39 clusters in 7 projects** (vs 0 with title-overlap only on the same data)
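The two-signal pairing rule described for `deduplicate()` can be sketched as follows. This is not the code in `ltm.ts`: the `Entry` shape, the tokenizer, and the union-find clustering are assumptions made for illustration. Only the thresholds, the OR combination, and the dimension guard come from the description above.

```typescript
interface Entry { id: string; title: string; embedding?: number[]; }

const COSINE_THRESHOLD = 0.85;
const JACCARD_THRESHOLD = 0.7;
const MIN_SHARED_WORDS = 4;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Assumed tokenizer: lowercase, split on non-word characters.
function words(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/\W+/).filter(Boolean));
}

// Signal 1: Jaccard overlap of title words plus a minimum shared-word count.
function titlesMatch(a: string, b: string): boolean {
  const wa = words(a), wb = words(b);
  let shared = 0;
  wa.forEach((w) => { if (wb.has(w)) shared++; });
  const union = wa.size + wb.size - shared;
  return union > 0 && shared >= MIN_SHARED_WORDS && shared / union >= JACCARD_THRESHOLD;
}

// Signal 2: cosine similarity, skipped when either vector is missing or the
// dimensions differ (a stale vector from the previous model).
function embeddingsMatch(a?: number[], b?: number[]): boolean {
  if (!a || !b || a.length !== b.length) return false;
  return cosine(a, b) >= COSINE_THRESHOLD;
}

// Pairs matching either signal are merged; union-find is an assumed strategy.
function findDuplicateClusters(entries: Entry[]): string[][] {
  const parent = entries.map((_, i) => i);
  const find = (i: number): number => (parent[i] === i ? i : (parent[i] = find(parent[i])));
  for (let i = 0; i < entries.length; i++) {
    for (let j = i + 1; j < entries.length; j++) {
      const a = entries[i], b = entries[j];
      if (titlesMatch(a.title, b.title) || embeddingsMatch(a.embedding, b.embedding)) {
        parent[find(i)] = find(j);
      }
    }
  }
  const groups = new Map<number, string[]>();
  entries.forEach((e, i) => {
    const root = find(i);
    groups.set(root, (groups.get(root) ?? []).concat(e.id));
  });
  return Array.from(groups.values()).filter((g) => g.length > 1);
}
```

The pairwise loop is quadratic in the number of entries, which is usually acceptable for per-project scans but would need an index for much larger collections.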
This was referenced May 13, 2026
Summary
Replace fastembed (optional dep) + BGE Small v1.5 (384d) with @huggingface/transformers (regular dep) + Nomic Embed Text v1.5 (768d, Matryoshka-capable)

Embedding quality improvement
Same-domain entry similarity (the gotcha that made vector dedup impossible with BGE Small): 0.93-0.97 with BGE Small v1.5, 0.46-0.70 with Nomic Embed Text v1.5.
Tested with 10 real .lore.md gotcha entries: zero pairs above 0.85 (BGE Small: all pairs above 0.93).

What changed
- packages/core/src/embedding-worker.ts: Rewritten to use pipeline('feature-extraction') with Nomic post-processing: mean pooling → layer norm → Matryoshka truncation → L2 normalization
- packages/core/src/embedding.ts: Removed fastembed probe logic (~150 lines), added task prefix support (search_document:/search_query: per Nomic spec)
- packages/core/src/embedding-vendor.ts: Simplified from fastembed model registration to transformers.js localModelPath registration
- packages/core/src/config.ts: Default model → nomic-ai/nomic-embed-text-v1.5, dimensions → 768
- packages/gateway/script/build.ts: Removed per-target staging, side-load dlopen, fastembed externals. Added binaryExternalsPlugin that redirects onnxruntime-node → onnxruntime-web and patches transformers.js CDN fallback to read WASM paths from globalThis. Wrapper embeds WASM files as Bun assets.
- packages/gateway/script/vendor-paths.ts: Updated for Nomic model (HF repo layout with onnx/ subdir, /-separated model dir)
- packages/gateway/script/vendor-embeddings.ts: Simplified from per-target staging + native binding management to a single shared model download

Binary build: WASM backend
The compiled binary uses onnxruntime-web (WASM+SIMD) instead of onnxruntime-node (native NAPI). binaryExternalsPlugin handles this:
- onResolve: redirects onnxruntime-node → onnxruntime-web, stubs sharp, resolves onnxruntime-common/onnxruntime-web from .bun/ store
- onLoad: patches transformers.js to read wasmPaths from globalThis.__LORE_VENDOR_WASM_PATHS__ instead of CDN URL
- Wrapper: embeds WASM files (.mjs + .wasm) as Bun { type: "file" } assets, sets the globalThis key with exact $bunfs paths

In npm/dev mode, native onnxruntime-node is used normally.
Benchmark (onnxruntime-node vs onnxruntime-web): 155ms vs 95ms for batch-10, 779ms vs 413ms for batch-50.
Binary size
Net increase: ~124 MB, primarily model weights. Quality improvement justifies the cost.
Test results
1348 tests pass, 0 failures across 56 files. Typecheck clean. CI green on all 3 platforms (linux-x64, darwin-arm64, windows-x64).