feat: add embedding-based dedup and 'lore data reindex' command by BYK · Pull Request #288 · BYK/loreai

BYK · 2026-05-13T15:11:18Z

Summary

- Enhance `deduplicate()` to use embedding cosine similarity (≥0.85 threshold) alongside title word-overlap, catching semantically identical entries with different titles
- Add `lore data reindex` CLI command for on-demand re-embedding without a gateway restart
- Auto-reindex in `lore data dedup` when stale or missing embeddings are detected

Motivation

With the Nomic v1.5 migration (PR #287), same-domain distinct entries score 0.46–0.70 cosine similarity — making embedding-based dedup viable at threshold 0.85+. Previously, BGE Small produced 0.93–0.97 for all same-domain entries, so dedup was limited to title word-overlap only.
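To make the threshold argument concrete, here is a minimal sketch of cosine similarity and of a check that a cut-off sits above the score range of distinct entries. The helper names `cosineSimilarity` and `thresholdSeparates` are illustrative assumptions, not code from this PR; the score ranges are the ones quoted above.

```typescript
// Cosine similarity between two equal-length vectors (illustrative helper).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("dimension mismatch");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Threshold quoted in the PR description.
const DEDUP_THRESHOLD = 0.85;

// A threshold is usable only if distinct same-domain entries always score below it.
// `distinctRange` is the [min, max] similarity observed for distinct entries.
function thresholdSeparates(
  distinctRange: [number, number],
  threshold: number = DEDUP_THRESHOLD,
): boolean {
  return distinctRange[1] < threshold;
}

const nomicOk = thresholdSeparates([0.46, 0.7]); // range reported for Nomic v1.5
const bgeOk = thresholdSeparates([0.93, 0.97]); // range reported for BGE Small
console.log({ nomicOk, bgeOk });
```

Under these ranges the 0.85 cut-off leaves headroom above distinct entries for Nomic v1.5, while with BGE Small every same-domain pair would be flagged.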

What changed

`packages/core/src/ltm.ts`

- `deduplicate()` now builds neighbor maps using two signals: title word-overlap (existing, ≥0.7 Jaccard + ≥4 shared words) OR embedding cosine similarity (new, ≥0.85). Pairs matching either signal are clustered together.
- Loads embeddings for project entries and computes pairwise similarity, with a dimension guard (`entryVec.length === otherVec.length`) to skip stale vectors.
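The two-signal neighbor map described above could look roughly like the sketch below. It is not the code in `ltm.ts`: the `Entry` shape, the tokenizer, and the function names are assumptions made for illustration, and only the thresholds and the dimension guard come from the description.

```typescript
// Hypothetical entry shape; the real type in ltm.ts is not shown in this PR.
interface Entry {
  id: string;
  title: string;
  embedding?: Float32Array;
}

function titleWords(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/\W+/).filter(Boolean));
}

// Signal 1: title word-overlap (>= 0.7 Jaccard and >= 4 shared words).
function titlesOverlap(a: string, b: string): boolean {
  const wa = titleWords(a);
  const wb = titleWords(b);
  let shared = 0;
  wa.forEach((w) => {
    if (wb.has(w)) shared++;
  });
  const union = wa.size + wb.size - shared;
  return union > 0 && shared / union >= 0.7 && shared >= 4;
}

// Signal 2: embedding cosine similarity (>= 0.85), skipping stale vectors.
function embeddingsMatch(a?: Float32Array, b?: Float32Array): boolean {
  // Dimension guard: vectors from different models are not comparable.
  if (!a || !b || a.length !== b.length) return false;
  let dot = 0;
  let na = 0;
  let nb = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    na += x * x;
    nb += y * y;
  });
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom > 0 && dot / denom >= 0.85;
}

// A pair is linked when EITHER signal fires.
function buildNeighborMap(entries: Entry[]): Map<string, Set<string>> {
  const neighbors = new Map<string, Set<string>>();
  entries.forEach((e) => neighbors.set(e.id, new Set<string>()));
  entries.forEach((a, i) => {
    entries.slice(i + 1).forEach((b) => {
      if (titlesOverlap(a.title, b.title) || embeddingsMatch(a.embedding, b.embedding)) {
        neighbors.get(a.id)?.add(b.id);
        neighbors.get(b.id)?.add(a.id);
      }
    });
  });
  return neighbors;
}

// Connected components of the neighbor map; singletons are not duplicates.
function clusterDuplicates(entries: Entry[]): string[][] {
  const neighbors = buildNeighborMap(entries);
  const seen = new Set<string>();
  const clusters: string[][] = [];
  neighbors.forEach((_, start) => {
    if (seen.has(start)) return;
    seen.add(start);
    const cluster: string[] = [];
    const stack: string[] = [start];
    while (stack.length > 0) {
      const id = stack.pop() as string;
      cluster.push(id);
      neighbors.get(id)?.forEach((n) => {
        if (!seen.has(n)) {
          seen.add(n);
          stack.push(n);
        }
      });
    }
    if (cluster.length > 1) clusters.push(cluster);
  });
  return clusters;
}
```

Clustering by connected components means a pair linked only by title and a pair linked only by embedding can end up in the same cluster when they share a member.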

`packages/gateway/src/cli/data.ts`

- New `lore data reindex` command: calls `checkConfigChange()` + `backfillEmbeddings()` + `backfillDistillationEmbeddings()` directly.
- `lore data dedup` now auto-calls `checkConfigChange()` + `backfillEmbeddings()` before scanning, so stale embeddings from a model migration are refreshed automatically.
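The control flow of the two commands can be sketched as follows. The real signatures of `backfillEmbeddings()` and `backfillDistillationEmbeddings()` are not shown in this PR, so the sketch injects them as hypothetical zero-argument async functions returning a count; `runReindex`, `runDedup`, and `BackfillDeps` are illustrative names. It also assumes, per the follow-up commit, that `backfillEmbeddings()` performs the config-change check itself.

```typescript
// Hypothetical dependency shape; the real functions may take arguments
// and return something other than a count.
interface BackfillDeps {
  backfillEmbeddings: () => Promise<number>;
  backfillDistillationEmbeddings: () => Promise<number>;
}

type Log = (msg: string) => void;

// Reindex: refresh entry embeddings, then distillation embeddings.
// Returns false instead of throwing so callers can decide how to proceed.
async function runReindex(deps: BackfillDeps, log: Log): Promise<boolean> {
  try {
    const entries = await deps.backfillEmbeddings();
    const distillations = await deps.backfillDistillationEmbeddings();
    log(`re-embedded ${entries} entries and ${distillations} distillations`);
    return true;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    log(`reindex failed: ${reason}`);
    return false;
  }
}

// Dedup: refresh embeddings first, then scan. If the refresh fails, the scan
// still runs and title overlap remains available as a signal.
async function runDedup(
  deps: BackfillDeps,
  scan: () => string[][],
  log: Log,
): Promise<string[][]> {
  const fresh = await runReindex(deps, log);
  if (!fresh) {
    log("continuing with existing embeddings");
  }
  return scan();
}
```

Passing the backfill functions in as parameters is purely a convenience for the sketch; it keeps the example self-contained and says nothing about how `data.ts` wires them up.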

Test results

1348 tests pass, typecheck clean
Tested against real DB: found 102 duplicates across 39 clusters in 7 projects (vs 0 with title-overlap only on the same data)

Enhance deduplicate() to use vector cosine similarity (threshold 0.85) alongside title word-overlap. With Nomic v1.5, same-domain distinct entries score 0.46-0.70, making embedding-based dedup viable: entries with different titles but semantically identical content are now caught.

Add 'lore data reindex' CLI command to trigger checkConfigChange() + backfillEmbeddings() + backfillDistillationEmbeddings() on demand, without requiring a gateway restart.

The dedup command now auto-reindexes if the embedding config changed (e.g. after a model migration) or if entries are missing embeddings, ensuring vectors are fresh before running similarity comparisons.

- Restore corrupted .lore.md entry 019e20a4 (curator overwrote cache warming gotcha with Nomic migration text)
- Scope embedding DB query to project entry IDs instead of global fetch
- Add try/catch around fromBlob() for corrupted embedding BLOBs
- Remove double checkConfigChange() calls (backfillEmbeddings handles it)
- Dedup auto-reindex now also backfills distillation embeddings
- Add error handling around backfill calls in both cmdReindex and cmdDedup
- Rename OverlapHit → DedupHit with 'score' field (was 'coefficient')
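As an illustration of the `fromBlob()` guard, the sketch below decodes stored vectors and skips rows whose BLOB is corrupted. The decoder body is an assumption (raw float32 bytes in platform byte order); the real `fromBlob()` and its storage format are not shown in this PR, and `EmbeddingRow` and `loadEmbeddings` are hypothetical names.

```typescript
// Hypothetical row shape returned by a query scoped to the project's entry IDs.
interface EmbeddingRow {
  entryId: string;
  blob: Uint8Array;
}

// Assumed decoder: treats the BLOB as raw float32 bytes.
function fromBlob(blob: Uint8Array): Float32Array {
  if (blob.byteLength === 0 || blob.byteLength % 4 !== 0) {
    throw new Error(`invalid embedding BLOB length: ${blob.byteLength}`);
  }
  // Copy first so the Float32Array view starts at byte offset 0.
  const copy = blob.slice();
  return new Float32Array(copy.buffer);
}

// Decode every row, dropping corrupted ones instead of aborting the dedup run.
function loadEmbeddings(rows: EmbeddingRow[]): Map<string, Float32Array> {
  const vectors = new Map<string, Float32Array>();
  for (const row of rows) {
    try {
      vectors.set(row.entryId, fromBlob(row.blob));
    } catch {
      // Corrupted BLOB: the entry can still be matched by title overlap.
    }
  }
  return vectors;
}
```

Entries dropped here simply have no embedding signal, so they can still be clustered through title word-overlap.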

BYK added 2 commits May 13, 2026 15:10

BYK merged commit 7b9665f into main May 13, 2026
7 checks passed

BYK deleted the feat/embedding-dedup branch May 13, 2026 15:24

BYK mentioned this pull request May 13, 2026

Adaptive dedup threshold with user feedback #292

Closed

This was referenced May 13, 2026

publish: BYK/loreai@0.18.0 #294

Closed

publish: BYK/loreai@0.18.0 #296

Closed
