fix: add word-overlap dedup for knowledge entries by BYK · Pull Request #286 · BYK/loreai

BYK · 2026-05-13T11:50:35Z

Summary

Knowledge entries accumulated duplicates across sessions because the LLM curator creates near-identical entries with slightly different wording. Embedding-based dedup doesn't work here: within a single codebase domain, BGE Small cosine similarities cluster at 0.85–0.97+, so any threshold produces unacceptable false positives.

Changes

ltm.ts: findFuzzyDuplicate() for real-time dedup on create(), deduplicate() for batch cleanup with star clustering
curator.ts: Post-curation dedup sweep catches duplicates as they form
agents-file.ts: Cross-machine import dedup matches by title+category instead of only UUID (handles same entry created independently on two machines)
data.ts: lore data dedup CLI subcommand (dry-run by default, --yes to apply)
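
The cross-machine import change can be pictured with a small sketch. This is not the code from agents-file.ts; the `KnowledgeEntry` shape, the `contentKey` helper, the normalization rules, and the "UUID first, then title+category" order are all assumptions made for illustration.

```typescript
// Hypothetical entry shape; the real type in agents-file.ts is not shown in this PR.
interface KnowledgeEntry {
  id: string; // UUID assigned on the machine that created the entry
  title: string;
  category: string;
}

// Assumed normalization: trim, lowercase, collapse runs of whitespace.
// The actual implementation may compare titles exactly instead.
function normalize(text: string): string {
  return text.trim().toLowerCase().replace(/\s+/g, " ");
}

// Build a machine-independent key from category and title.
function contentKey(entry: KnowledgeEntry): string {
  return normalize(entry.category) + "\u0000" + normalize(entry.title);
}

// Match an incoming entry against local entries: by UUID first,
// then by title+category so that the same entry created independently
// on two machines (different UUIDs) is still recognized.
function findImportMatch(
  incoming: KnowledgeEntry,
  local: KnowledgeEntry[],
): KnowledgeEntry | undefined {
  const byId = local.find((e) => e.id === incoming.id);
  if (byId) return byId;
  const key = contentKey(incoming);
  return local.find((e) => contentKey(e) === key);
}
```

With a UUID-only match, two machines that each curated the same entry would both keep their copy after import; the title+category fallback is what lets the importer treat them as one.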

Results

Tested on real data: 111 duplicate entries removed across 33 clusters in 2 projects with zero false positives.

…e accumulation

Knowledge entries accumulated duplicates across sessions because the LLM curator would create near-identical entries with slightly different wording. BGE Small embeddings are unusable for within-project dedup (similarity floor ~0.85+).

Changes:

- Add findFuzzyDuplicate() to ltm.ts using title word-overlap (overlap coefficient >= 0.7 with min 4 shared words) for real-time dedup on create()
- Add deduplicate() batch function with star-clustering for bulk cleanup
- Add 'lore data dedup' CLI subcommand (--yes to apply, dry-run by default)
- Wire post-curation dedup sweep in curator.ts to catch duplicates as they form
- Fix cross-machine import dedup in agents-file.ts to match by title+category instead of only by UUID (handles same entry created independently on two machines)
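
A minimal sketch of the title word-overlap check and star clustering described above, assuming a simple tokenizer and a hypothetical `Entry` shape. Only the thresholds (overlap coefficient >= 0.7, at least 4 shared words) and the name findFuzzyDuplicate() come from the commit message; the signatures, `titleWords`, `isFuzzyDuplicate`, and `starClusters` are illustrative and may differ from what ltm.ts actually does.

```typescript
// Hypothetical entry shape; the real type in ltm.ts is not shown in this PR.
interface Entry {
  id: string;
  title: string;
}

// Thresholds stated in the commit message.
const MIN_OVERLAP = 0.7;
const MIN_SHARED_WORDS = 4;

// Assumed tokenizer: lowercase, split on non-alphanumerics, drop empty tokens.
function titleWords(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0));
}

// Overlap coefficient: |A ∩ B| / min(|A|, |B|), plus the minimum shared-word guard.
function isFuzzyDuplicate(a: string, b: string): boolean {
  const wordsA = titleWords(a);
  const wordsB = titleWords(b);
  let shared = 0;
  wordsA.forEach((w) => {
    if (wordsB.has(w)) shared++;
  });
  const smaller = Math.min(wordsA.size, wordsB.size);
  if (smaller === 0) return false;
  return shared >= MIN_SHARED_WORDS && shared / smaller >= MIN_OVERLAP;
}

// Real-time check on create(): return the first existing entry with a matching title.
function findFuzzyDuplicate(title: string, existing: Entry[]): Entry | undefined {
  return existing.find((e) => isFuzzyDuplicate(title, e.title));
}

// Star clustering: each unassigned entry becomes a center and absorbs every
// other unassigned entry that matches the center directly. Members are never
// compared to each other, so matches do not chain transitively.
function starClusters(entries: Entry[]): Entry[][] {
  const assigned = new Set<string>();
  const clusters: Entry[][] = [];
  for (const center of entries) {
    if (assigned.has(center.id)) continue;
    assigned.add(center.id);
    const cluster: Entry[] = [center];
    for (const other of entries) {
      if (assigned.has(other.id)) continue;
      if (isFuzzyDuplicate(center.title, other.title)) {
        assigned.add(other.id);
        cluster.push(other);
      }
    }
    if (cluster.length > 1) clusters.push(cluster);
  }
  return clusters;
}
```

The shared-word minimum keeps short titles from matching on a couple of common words, and comparing only against the cluster center avoids merging a long chain of entries where the first and last have little in common.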

BYK merged commit 4f7bd31 into main May 13, 2026
7 checks passed

BYK deleted the fix/knowledge-dedup branch May 13, 2026 11:53

This was referenced May 13, 2026

publish: BYK/loreai@0.18.0 #294

Closed

publish: BYK/loreai@0.18.0 #296

Closed

BYK merged 1 commit into main from fix/knowledge-dedup
