feat: add embedding-based dedup and 'lore data reindex' command #288

Merged: BYK merged 2 commits into main from feat/embedding-dedup on May 13, 2026

@BYK BYK commented May 13, 2026

Summary

  • Enhance deduplicate() to use embedding cosine similarity (≥0.85 threshold) alongside title word-overlap, catching semantically identical entries with different titles
  • Add lore data reindex CLI command for on-demand re-embedding without gateway restart
  • Auto-reindex in lore data dedup when stale or missing embeddings are detected

Motivation

With the Nomic v1.5 migration (PR #287), distinct entries in the same domain score 0.46–0.70 cosine similarity, which leaves enough headroom below a 0.85 threshold to make embedding-based dedup viable. Previously, BGE Small produced 0.93–0.97 for all same-domain entries, so distinct entries and true duplicates were indistinguishable by embedding and dedup was limited to title word-overlap.

What changed

packages/core/src/ltm.ts

  • deduplicate() now builds neighbor maps using two signals: title word-overlap (existing, ≥0.7 Jaccard + ≥4 shared words) OR embedding cosine similarity (new, ≥0.85). Pairs matching either signal are clustered together.
  • Loads embeddings for the project's entries and computes pairwise cosine similarity. A dimension guard (entryVec.length === otherVec.length) skips any pair whose vectors differ in length, which filters out stale vectors.
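
The two-signal matching described above can be sketched roughly as follows. This is an illustration, not the code in packages/core/src/ltm.ts: the Entry shape and the helper names (titleOverlap, cosine, isDuplicatePair, clusterDuplicates) are hypothetical, "+" in the title rule is read as "and", and grouping neighbors transitively into connected components is an assumption about how pairs are clustered.

```typescript
// Sketch only: all names below are hypothetical, not the real ltm.ts API.
interface Entry {
  id: string;
  title: string;
  embedding?: Float32Array; // may be missing, or stale with a different length
}

// Thresholds taken from the PR description.
const TITLE_JACCARD_MIN = 0.7;
const TITLE_SHARED_MIN = 4;
const COSINE_MIN = 0.85;

function titleWords(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/\W+/).filter((w) => w.length > 0));
}

// Jaccard coefficient and shared-word count between two titles.
function titleOverlap(a: string, b: string): { jaccard: number; shared: number } {
  const wa = titleWords(a);
  const wb = titleWords(b);
  let shared = 0;
  wa.forEach((w) => {
    if (wb.has(w)) shared++;
  });
  const union = wa.size + wb.size - shared;
  return { jaccard: union === 0 ? 0 : shared / union, shared };
}

// Cosine similarity; callers must pass vectors of equal length.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// A pair matches if EITHER signal fires.
function isDuplicatePair(a: Entry, b: Entry): boolean {
  const { jaccard, shared } = titleOverlap(a.title, b.title);
  if (jaccard >= TITLE_JACCARD_MIN && shared >= TITLE_SHARED_MIN) return true;
  const va = a.embedding;
  const vb = b.embedding;
  // Dimension guard: skip missing vectors and vectors of mismatched length.
  if (!va || !vb || va.length !== vb.length) return false;
  return cosine(va, vb) >= COSINE_MIN;
}

// Build a neighbor map, then collect connected components of size > 1.
function clusterDuplicates(entries: Entry[]): string[][] {
  const neighbors = new Map<string, Set<string>>();
  entries.forEach((e) => neighbors.set(e.id, new Set<string>()));
  for (let i = 0; i < entries.length; i++) {
    for (let j = i + 1; j < entries.length; j++) {
      if (isDuplicatePair(entries[i], entries[j])) {
        neighbors.get(entries[i].id)!.add(entries[j].id);
        neighbors.get(entries[j].id)!.add(entries[i].id);
      }
    }
  }
  const seen = new Set<string>();
  const clusters: string[][] = [];
  entries.forEach((e) => {
    if (seen.has(e.id)) return;
    seen.add(e.id);
    const stack: string[] = [e.id];
    const cluster: string[] = [];
    while (stack.length > 0) {
      const id = stack.pop()!;
      cluster.push(id);
      neighbors.get(id)!.forEach((n) => {
        if (!seen.has(n)) {
          seen.add(n);
          stack.push(n);
        }
      });
    }
    if (cluster.length > 1) clusters.push(cluster);
  });
  return clusters;
}
```

With transitive grouping, an entry that matches one neighbor by title and another by embedding pulls all three into the same cluster, which is one plausible reading of "pairs matching either signal are clustered together". The pairwise loop is quadratic in the number of entries, so a sketch like this only makes sense per project rather than across the whole database.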

packages/gateway/src/cli/data.ts

  • New lore data reindex command: calls checkConfigChange() + backfillEmbeddings() + backfillDistillationEmbeddings() directly.
  • lore data dedup now auto-calls checkConfigChange() + backfillEmbeddings() before scanning, so stale embeddings from a model migration are refreshed automatically.
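
The command flow above can be sketched as below. The PR does not show the signatures of backfillEmbeddings() or backfillDistillationEmbeddings(), so this sketch takes them as injected callbacks that are assumed to return a count of re-embedded rows; Backfillers, ReindexResult, reindex and dedupWithFreshEmbeddings are hypothetical names. It also assumes the config-change check runs inside backfillEmbeddings(), as the follow-up commit notes suggest, and that a failed backfill is reported rather than aborting the command.

```typescript
// Sketch only: the real functions live in the lore codebase and their
// signatures are not shown in this PR. Everything here is hypothetical.
interface Backfillers {
  backfillEmbeddings: () => Promise<number>;
  backfillDistillationEmbeddings: () => Promise<number>;
}

interface ReindexResult {
  entries: number; // rows re-embedded by backfillEmbeddings
  distillations: number; // rows re-embedded by backfillDistillationEmbeddings
  errors: string[]; // one message per failed backfill step
}

function describeError(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

// Run both backfills in order; a failure in one does not skip the other.
async function reindex(deps: Backfillers): Promise<ReindexResult> {
  const result: ReindexResult = { entries: 0, distillations: 0, errors: [] };
  try {
    result.entries = await deps.backfillEmbeddings();
  } catch (err) {
    result.errors.push("entries: " + describeError(err));
  }
  try {
    result.distillations = await deps.backfillDistillationEmbeddings();
  } catch (err) {
    result.errors.push("distillations: " + describeError(err));
  }
  return result;
}

// Dedup refreshes embeddings first, then runs the scan it is given.
async function dedupWithFreshEmbeddings<T>(
  deps: Backfillers,
  scan: () => T,
): Promise<{ reindex: ReindexResult; hits: T }> {
  const reindexResult = await reindex(deps);
  return { reindex: reindexResult, hits: scan() };
}
```

Passing the backfill functions in as arguments keeps the sketch independent of the real module layout and makes the ordering easy to exercise with stubs.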

Test results

  • 1348 tests pass, typecheck clean
  • Tested against real DB: found 102 duplicates across 39 clusters in 7 projects (vs 0 with title-overlap only on the same data)

BYK added 2 commits May 13, 2026 15:10
Enhance deduplicate() to use vector cosine similarity (threshold 0.85)
alongside title word-overlap. With Nomic v1.5, same-domain distinct
entries score 0.46-0.70, making embedding-based dedup viable — entries
with different titles but semantically identical content are now caught.

Add 'lore data reindex' CLI command to trigger checkConfigChange() +
backfillEmbeddings() + backfillDistillationEmbeddings() on demand,
without requiring a gateway restart.

The dedup command now auto-reindexes if the embedding config changed
(e.g. after a model migration) or if entries are missing embeddings,
ensuring vectors are fresh before running similarity comparisons.
- Restore corrupted .lore.md entry 019e20a4 (curator overwrote cache
  warming gotcha with Nomic migration text)
- Scope embedding DB query to project entry IDs instead of global fetch
- Add try/catch around fromBlob() for corrupted embedding BLOBs
- Remove double checkConfigChange() calls (backfillEmbeddings handles it)
- Dedup auto-reindex now also backfills distillation embeddings
- Add error handling around backfill calls in both cmdReindex and cmdDedup
- Rename OverlapHit → DedupHit with 'score' field (was 'coefficient')
@BYK BYK merged commit 7b9665f into main May 13, 2026
7 checks passed
@BYK BYK deleted the feat/embedding-dedup branch May 13, 2026 15:24