feat: migrate embeddings from fastembed/BGE Small to transformers.js/Nomic v1.5#287

Merged
BYK merged 9 commits into main from feat/nomic-embed-v1.5 on May 13, 2026

@BYK BYK commented May 13, 2026

Summary

  • Replace fastembed (optional dep) + BGE Small v1.5 (384d) with @huggingface/transformers (regular dep) + Nomic Embed Text v1.5 (768d, Matryoshka-capable)
  • Simplify binary build pipeline: remove per-target staging trees, side-load libs, and onnxruntime dlopen — transformers.js bundles its own ONNX runtime
  • The compiled binary uses the WASM backend (onnxruntime-web, roughly 1.6–1.9x faster than native on the batch benchmarks) via an esbuild redirect plus source patching
  • Net code reduction: 533 lines (-1035, +502) across 17 files

Embedding quality improvement

Same-domain entry similarity (the gotcha that made vector dedup impossible with BGE Small):

| Metric | BGE Small v1.5 | Nomic v1.5 |
| --- | --- | --- |
| Same-domain entry similarity | 0.93–0.97 | 0.46–0.70 |
| Relevant vs unrelated spread | ~0.1 | 0.47 |
| Vector dedup viable? | No | Yes |

Tested with 10 real .lore.md gotcha entries — zero pairs above 0.85 (BGE Small: all pairs above 0.93).
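
The check above amounts to scoring every pair of entry embeddings and counting the pairs that exceed the threshold. A minimal sketch, assuming the embeddings are already available as `Float32Array` values; `cosineSimilarity` and `countPairsAbove` are illustrative names, not functions from this PR:

```typescript
// Hypothetical helpers for illustration; not taken from the PR's code.

/** Cosine similarity of two equal-length vectors. */
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero vector as dissimilar to everything instead of returning NaN.
  return denom === 0 ? 0 : dot / denom;
}

/** Count unordered pairs whose similarity is strictly above the threshold. */
function countPairsAbove(vectors: Float32Array[], threshold: number): number {
  let count = 0;
  for (let i = 0; i < vectors.length; i++) {
    for (let j = i + 1; j < vectors.length; j++) {
      if (cosineSimilarity(vectors[i], vectors[j]) > threshold) count++;
    }
  }
  return count;
}
```

Under these assumptions, `countPairsAbove(embeddings, 0.85)` returning 0 corresponds to the Nomic result reported here.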

What changed

  • packages/core/src/embedding-worker.ts: Rewritten to use pipeline('feature-extraction') with Nomic post-processing: mean pooling → layer norm → Matryoshka truncation → L2 normalization
  • packages/core/src/embedding.ts: Removed fastembed probe logic (~150 lines), added task prefix support (search_document: / search_query: per Nomic spec)
  • packages/core/src/embedding-vendor.ts: Simplified from fastembed model registration to transformers.js localModelPath registration
  • packages/core/src/config.ts: Default model → nomic-ai/nomic-embed-text-v1.5, dimensions → 768
  • packages/gateway/script/build.ts: Removed per-target staging, side-load dlopen, fastembed externals. Added binaryExternalsPlugin that redirects onnxruntime-node → onnxruntime-web and patches transformers.js CDN fallback to read WASM paths from globalThis. Wrapper embeds WASM files as Bun assets.
  • packages/gateway/script/vendor-paths.ts: Updated for Nomic model (HF repo layout with onnx/ subdir, /-separated model dir)
  • packages/gateway/script/vendor-embeddings.ts: Simplified from per-target staging + native binding management to a single shared model download
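
The post-processing chain named for embedding-worker.ts (mean pooling → layer norm → Matryoshka truncation → L2 normalization) and the task prefixes can be sketched on plain arrays as follows. This is a rough illustration, not the worker's code: the function names are made up, the layer norm is assumed to have no learned scale or bias, and the epsilon of 1e-5 is an assumption.

```typescript
// Plain-array sketch of the post-processing chain; all names are illustrative.

/** Mean-pool token embeddings, skipping padded positions (mask value 0). */
function meanPool(tokens: number[][], attentionMask: number[]): number[] {
  if (tokens.length === 0) throw new Error("no token embeddings");
  const hidden = tokens[0].length;
  const sum = new Array<number>(hidden).fill(0);
  let kept = 0;
  for (let t = 0; t < tokens.length; t++) {
    if (attentionMask[t] === 0) continue;
    kept++;
    for (let h = 0; h < hidden; h++) sum[h] += tokens[t][h];
  }
  return sum.map((v) => v / Math.max(kept, 1));
}

/** Layer norm over the whole vector, without learned scale or bias (assumed). */
function layerNorm(vec: number[], eps = 1e-5): number[] {
  const mean = vec.reduce((s, v) => s + v, 0) / vec.length;
  const variance = vec.reduce((s, v) => s + (v - mean) ** 2, 0) / vec.length;
  const scale = 1 / Math.sqrt(variance + eps);
  return vec.map((v) => (v - mean) * scale);
}

/** Matryoshka truncation: keep only the leading `dims` components. */
function truncate(vec: number[], dims: number): number[] {
  if (dims < 1 || dims > vec.length) {
    throw new Error(`dims must be in 1..${vec.length}, got ${dims}`);
  }
  return vec.slice(0, dims);
}

/** Scale the vector to unit L2 norm (zero vectors are returned unchanged). */
function l2Normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((s, v) => s + v * v, 0));
  return norm === 0 ? vec.slice() : vec.map((v) => v / norm);
}

/** Full chain in the order listed above. */
function postProcess(tokens: number[][], attentionMask: number[], dims: number): number[] {
  return l2Normalize(truncate(layerNorm(meanPool(tokens, attentionMask)), dims));
}

type NomicTask = "search_document" | "search_query";

/** Prepend the task prefix to the text before it is embedded. */
function withTaskPrefix(task: NomicTask, text: string): string {
  return `${task}: ${text}`;
}
```

Truncation happens before the final normalization, so a shortened vector still has unit length and cosine similarity reduces to a dot product.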

Binary build: WASM backend

The compiled binary uses onnxruntime-web (WASM+SIMD) instead of onnxruntime-node (native NAPI). binaryExternalsPlugin handles this:

  1. onResolve: redirects onnxruntime-node → onnxruntime-web, stubs sharp, resolves onnxruntime-common/onnxruntime-web from .bun/ store
  2. onLoad: patches transformers.js to read wasmPaths from globalThis.__LORE_VENDOR_WASM_PATHS__ instead of CDN URL
  3. Wrapper embeds WASM files (.mjs + .wasm) as Bun { type: "file" } assets, sets the globalThis key with exact $bunfs paths

In npm/dev mode, native onnxruntime-node is used normally.
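
For readers unfamiliar with esbuild plugins, the redirect, the sharp stub, and the source patch could look roughly like the sketch below. It is not the actual binaryExternalsPlugin: the interfaces are a hand-written minimal subset shaped like esbuild's plugin API so the sketch runs standalone, and `onnxWebEntry`, `transformersFile`, `cdnExpression`, and `readSource` are hypothetical options. In particular, the exact CDN assignment being replaced is passed in as a parameter because this description does not show it.

```typescript
// Minimal stand-ins shaped like esbuild's plugin types (assumption: in a real
// build these would be imported from the `esbuild` package instead).
interface ResolveResult { path: string; namespace?: string }
interface LoadResult { contents: string; loader: "js" }
interface PluginBuild {
  onResolve(
    opts: { filter: RegExp },
    cb: (args: { path: string }) => ResolveResult | undefined,
  ): void;
  onLoad(
    opts: { filter: RegExp; namespace?: string },
    cb: (args: { path: string }) => LoadResult | undefined,
  ): void;
}
interface Plugin { name: string; setup(build: PluginBuild): void }

// Hypothetical options; the real plugin resolves paths from Bun's .bun/ store.
interface SketchOptions {
  onnxWebEntry: string;      // absolute path of the onnxruntime-web entry file
  transformersFile: RegExp;  // matches the transformers.js file to patch
  cdnExpression: string;     // exact source text of the CDN fallback (caller-supplied)
  readSource(path: string): string;
}

const WASM_PATHS_KEY = "__LORE_VENDOR_WASM_PATHS__";

/** Replace the CDN fallback expression with a read from globalThis. */
function patchCdnFallback(source: string, cdnExpression: string): string {
  if (!source.includes(cdnExpression)) {
    // Fail loudly so an upstream change does not silently disable the patch.
    throw new Error("CDN fallback expression not found; patch is out of date");
  }
  return source.replace(cdnExpression, () => `globalThis.${WASM_PATHS_KEY}`);
}

function binaryExternalsSketch(opts: SketchOptions): Plugin {
  return {
    name: "binary-externals-sketch",
    setup(build) {
      // Redirect the native runtime to the WASM runtime.
      build.onResolve({ filter: /^onnxruntime-node$/ }, () => ({ path: opts.onnxWebEntry }));
      // Route sharp into a stub namespace and serve an empty module for it.
      build.onResolve({ filter: /^sharp$/ }, () => ({ path: "sharp", namespace: "stub" }));
      build.onLoad({ filter: /.*/, namespace: "stub" }, () => ({
        contents: "export default {};",
        loader: "js",
      }));
      // Patch the transformers.js source at bundle time.
      build.onLoad({ filter: opts.transformersFile }, (args) => ({
        contents: patchCdnFallback(opts.readSource(args.path), opts.cdnExpression),
        loader: "js",
      }));
    },
  };
}
```

Throwing when the expected expression is missing is a deliberate choice in this sketch: a silent no-op would leave the binary trying to fetch WASM files from the CDN at runtime.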

Benchmark (onnxruntime-node vs onnxruntime-web):

| Workload | Native | WASM |
| --- | --- | --- |
| Model load | 630 ms | 614 ms |
| Single text | 24 ms | 24 ms |
| Batch 10 | 155 ms | 95 ms |
| Batch 50 | 779 ms | 413 ms |

Binary size

| Component | Old (fastembed + BGE Small) | New (transformers.js + Nomic v1.5) |
| --- | --- | --- |
| Model weights | ~17 MB (BGE Small INT8) | ~132 MB (Nomic v1.5 INT8) |
| ONNX runtime | ~30 MB (.so/.dylib side-load) | ~11 MB (WASM runtime files) |
| JS library | ~5 MB (fastembed) | ~500 KB (tree-shaken transformers.js) |
| Binary total | ~120 MB | ~244 MB |

Net increase: ~124 MB, primarily model weights. Quality improvement justifies the cost.

Test results

1348 tests pass, 0 failures across 56 files. Typecheck clean. CI green on all 3 platforms (linux-x64, darwin-arm64, windows-x64).

BYK added 9 commits May 13, 2026 12:22
…Nomic v1.5

Replace fastembed (optional dep) + BGE Small v1.5 (384d) with
@huggingface/transformers (regular dep) + Nomic Embed Text v1.5 (768d).

Key improvements:
- Same-domain entry similarity drops from 0.93-0.97 to 0.46-0.70,
  making vector-based dedup viable (was unusable with BGE Small)
- Task instruction prefixes (search_document:/search_query:) per Nomic spec
- Matryoshka-capable: supports 64-768 dimensions
- Simplified build pipeline: no per-target staging trees, side-load libs,
  or dlopen of onnxruntime — transformers.js bundles its own ONNX runtime
- Net code reduction: -1035 lines, +502 lines

Tradeoff: binary size increases ~146 MB (model 17→132 MB, runtime 30→21 MB
WASM, JS lib 5→45 MB). INT8 quantized model quality justifies the cost.

@huggingface/transformers depends on onnxruntime-node which contains
.node native binary files that esbuild cannot handle. Keep it external
in all esbuild steps (main bundle, worker, core build, gateway bundle).
Bun's --compile step handles .node files natively.

sharp is pulled in by @huggingface/transformers for vision model image
processing — not needed for text embeddings. Externalizing it drops the
worker bundle from 607 KB to 489 KB and removes native sharp deps from
the binary.

… Bun compile

The onnxResolvePlugin:
- Resolves onnxruntime-node/common from Bun's .bun/ store to their
  actual paths so esbuild can bundle the JS parts inline
- Patches binding.js to wrap the dynamic .node require in an indirect
  call, preventing esbuild from statically resolving .node binaries
  (Bun --compile handles them at runtime)
- Stubs sharp as an empty module (only used for vision models)

This fixes the CI binary build failure where esbuild couldn't handle
.node files and Bun compile couldn't resolve bare onnxruntime-node
imports from the esbuild output.

Replace fastembed-era JQ checks (.modelAbsoluteDirPath, .modelName)
with new transformers.js vendor fields (.localModelPath, .version).
Update comments to reference nomic-embed-text-v1.5 instead of
fastembed/BGE Small/onnxruntime.

The binary build plugin redirects onnxruntime-node → onnxruntime-web
so transformers.js uses the WASM+SIMD ONNX runtime instead of native
NAPI bindings (which can't load from Bun's $bunfs).

Key changes:
- binaryExternalsPlugin: redirects onnxruntime-node → onnxruntime-web,
  stubs sharp, resolves onnxruntime-common/web from .bun/ store
- Wrapper embeds WASM runtime files as Bun assets, passes exact $bunfs
  paths via __LORE_VENDOR_WASM_PATHS__ (object form { mjs, wasm })
- Worker sets wasmPaths on onnxruntime env before importing transformers.js
- MODEL_DIR_NAME uses '/' separator (matching transformers.js localModelPath
  resolution) instead of '--' (HF cache convention)

WASM backend is ~2x faster on batch embeddings than native onnxruntime-node
(95ms vs 155ms for batch-10, 413ms vs 779ms for batch-50).

TypeScript can't find onnxruntime-node types (it's a transitive dep in
.bun/). Use string concatenation to prevent tsc from resolving the
module at typecheck time — in the binary, esbuild redirects it to
onnxruntime-web anyway.

Instead of importing onnxruntime at runtime in the worker (which
esbuild can't redirect for dynamic imports and Bun compile can't
resolve from .bun/), patch the transformers.js source at esbuild
bundle time to read wasmPaths from globalThis.__LORE_VENDOR_WASM_PATHS__
instead of the CDN URL.

The wrapper sets __LORE_VENDOR_WASM_PATHS__ = { mjs, wasm } with
$bunfs asset paths before importing the worker. The esbuild onLoad
hook replaces the CDN assignment in transformers.node.mjs with a
globalThis read.

- Remove stale --all and --target flags from vendor-embeddings.ts CI
  invocations (script no longer accepts arguments)
- Update CI comment about vendor cache (shared model cache, not
  per-target native bindings)
- Fix .lore.md entries to reflect final WASM approach (globalThis
  __LORE_VENDOR_WASM_PATHS__ + esbuild onLoad patch, not wrapper
  onnxruntime import)
@BYK BYK merged commit 0a3cd48 into main May 13, 2026
7 checks passed
@BYK BYK deleted the feat/nomic-embed-v1.5 branch May 13, 2026 13:35
BYK added a commit that referenced this pull request May 13, 2026

## Summary

- Enhance `deduplicate()` to use embedding cosine similarity (≥0.85
threshold) alongside title word-overlap, catching semantically identical
entries with different titles
- Add `lore data reindex` CLI command for on-demand re-embedding without
gateway restart
- Auto-reindex in `lore data dedup` when stale/missing embeddings
detected

## Motivation

With the Nomic v1.5 migration (PR #287), same-domain distinct entries
score 0.46–0.70 cosine similarity — making embedding-based dedup viable
at threshold 0.85+. Previously, BGE Small produced 0.93–0.97 for all
same-domain entries, so dedup was limited to title word-overlap only.

## What changed

### `packages/core/src/ltm.ts`
- `deduplicate()` now builds neighbor maps using **two signals**: title
word-overlap (existing, ≥0.7 Jaccard + ≥4 shared words) OR embedding
cosine similarity (new, ≥0.85). Pairs matching either signal are
clustered together.
- Loads embeddings for project entries and computes pairwise similarity,
with a dimension guard (`entryVec.length === otherVec.length`) to skip
stale vectors.
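
The two-signal matching described above can be sketched as follows. This is an illustration under assumptions, not the code in `ltm.ts`: the `Entry` shape, the word tokenization, and the union-find clustering are hypothetical choices, and only the thresholds (≥0.7 Jaccard with ≥4 shared words, ≥0.85 cosine) and the dimension guard come from the description.

```typescript
// Hypothetical entry shape for illustration.
interface Entry { id: string; title: string; embedding?: Float32Array }

function titleWords(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/\W+/).filter(Boolean));
}

/** Signal 1: title word-overlap (>= 0.7 Jaccard and >= 4 shared words). */
function titleMatch(a: string, b: string): boolean {
  const wa = titleWords(a);
  const wb = titleWords(b);
  let shared = 0;
  wa.forEach((w) => { if (wb.has(w)) shared++; });
  const union = wa.size + wb.size - shared;
  return shared >= 4 && union > 0 && shared / union >= 0.7;
}

/** Signal 2: embedding cosine similarity (>= 0.85), skipping stale vectors. */
function embeddingMatch(a?: Float32Array, b?: Float32Array): boolean {
  // Dimension guard: vectors from a different model size are not comparable.
  if (!a || !b || a.length !== b.length) return false;
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na * nb);
  return denom > 0 && dot / denom >= 0.85;
}

/** Cluster entries where either signal matches; returns clusters of size > 1. */
function clusterDuplicates(entries: Entry[]): string[][] {
  const parent = entries.map((_, i) => i);
  const find = (i: number): number => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]]; // path halving
      i = parent[i];
    }
    return i;
  };
  for (let i = 0; i < entries.length; i++) {
    for (let j = i + 1; j < entries.length; j++) {
      const a = entries[i];
      const b = entries[j];
      if (titleMatch(a.title, b.title) || embeddingMatch(a.embedding, b.embedding)) {
        parent[find(i)] = find(j);
      }
    }
  }
  const groups = new Map<number, string[]>();
  entries.forEach((e, i) => {
    const root = find(i);
    const group = groups.get(root) ?? [];
    group.push(e.id);
    groups.set(root, group);
  });
  return Array.from(groups.values()).filter((g) => g.length > 1);
}
```

Because the signals are combined with OR and clusters are transitive, an entry can join a cluster through a title match with one member and an embedding match with another.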

### `packages/gateway/src/cli/data.ts`
- New `lore data reindex` command: calls `checkConfigChange()` +
`backfillEmbeddings()` + `backfillDistillationEmbeddings()` directly.
- `lore data dedup` now auto-calls `checkConfigChange()` +
`backfillEmbeddings()` before scanning, so stale embeddings from a model
migration are refreshed automatically.

## Test results

- 1348 tests pass, typecheck clean
- Tested against real DB: found **102 duplicates across 39 clusters in 7
projects** (vs 0 with title-overlap only on the same data)
This was referenced May 13, 2026