fix: add word-overlap dedup for knowledge entries #286

Merged
BYK merged 1 commit into main from fix/knowledge-dedup on May 13, 2026

@BYK BYK commented May 13, 2026

Summary

Knowledge entries accumulated duplicates across sessions because the LLM curator creates near-identical entries with slightly different wording. Embedding-based dedup doesn't work here: BGE Small cosine similarities cluster at 0.85–0.97+ within the same codebase domain, so any threshold produces unacceptable false positives.

This PR adds title word-overlap dedup using the overlap coefficient (|intersection| / min(|setA|, |setB|)) with a dual threshold (≥0.7 coefficient AND ≥4 shared words). Star clustering prevents transitive snowballing, where chains of pairwise matches would otherwise pull loosely related entries into one cluster.
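
As a rough illustration of the dual threshold, here is a minimal sketch. The helper names (`titleWords`, `titleOverlap`, `isLikelyDuplicate`) and the tokenization (lowercase, split on non-alphanumerics) are assumptions made for this example, not the actual `findFuzzyDuplicate()` code in ltm.ts; only the two threshold values come from the description above.

```typescript
// Thresholds taken from the PR description.
const MIN_COEFFICIENT = 0.7;
const MIN_SHARED_WORDS = 4;

// Assumption: titles are lowercased and split on non-alphanumeric characters.
// The real tokenizer in ltm.ts may differ (stop words, stemming, etc.).
function titleWords(title: string): Set<string> {
  const words = title
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  return new Set(words);
}

interface Overlap {
  shared: number; // |intersection|
  coefficient: number; // |intersection| / min(|setA|, |setB|)
}

// Hypothetical helper: computes the overlap coefficient of two titles.
function titleOverlap(a: string, b: string): Overlap {
  const wordsA = titleWords(a);
  const wordsB = titleWords(b);
  const shared = Array.from(wordsA).filter((w) => wordsB.has(w)).length;
  const smaller = Math.min(wordsA.size, wordsB.size);
  return { shared, coefficient: smaller === 0 ? 0 : shared / smaller };
}

// Both conditions must hold: a high coefficient AND enough shared words.
function isLikelyDuplicate(a: string, b: string): boolean {
  const { shared, coefficient } = titleOverlap(a, b);
  return coefficient >= MIN_COEFFICIENT && shared >= MIN_SHARED_WORDS;
}
```

Under this sketch, "Auth flow" and "Auth flow details" have a coefficient of 1.0 but only 2 shared words, so they are not flagged. That is the case the shared-word minimum guards against: short titles can reach a high coefficient trivially.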

Changes

  • ltm.ts: findFuzzyDuplicate() for real-time dedup on create(), deduplicate() for batch cleanup with star clustering
  • curator.ts: Post-curation dedup sweep catches duplicates as they form
  • agents-file.ts: Cross-machine import dedup matches by title+category instead of only UUID (handles same entry created independently on two machines)
  • data.ts: lore data dedup CLI subcommand (dry-run by default, --yes to apply)
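
The star clustering used by the batch `deduplicate()` can be sketched as follows. This is one common reading of the technique, under the assumption that the first unassigned entry in input order becomes a cluster's center; `starCluster` and its signature are hypothetical, and the PR may pick centers differently.

```typescript
// Hypothetical sketch of star clustering. Each cluster has one center, and an
// item joins a cluster only if it matches that center directly. Items are never
// merged through a chain of intermediate matches.
function starCluster<T>(
  items: T[],
  isDuplicate: (a: T, b: T) => boolean,
): T[][] {
  const assigned = new Array<boolean>(items.length).fill(false);
  const clusters: T[][] = [];

  for (let i = 0; i < items.length; i++) {
    if (assigned[i]) continue;
    // Assumption: the first unassigned item in input order is the center.
    assigned[i] = true;
    const cluster: T[] = [items[i]];

    for (let j = i + 1; j < items.length; j++) {
      // Compare against the center only, never against other members.
      if (!assigned[j] && isDuplicate(items[i], items[j])) {
        assigned[j] = true;
        cluster.push(items[j]);
      }
    }

    // Only groups with at least one duplicate are reported.
    if (cluster.length > 1) clusters.push(cluster);
  }
  return clusters;
}
```

With a toy predicate such as `(a, b) => Math.abs(a - b) <= 1` over `[1, 2, 3]`, this yields a single cluster `[1, 2]`. A transitive (connected-components) grouping would also pull in `3` via `2`, which is the snowballing effect the description refers to.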

Results

Tested on real data: 111 duplicate entries removed across 33 clusters in 2 projects with zero false positives.

…e accumulation

Knowledge entries accumulated duplicates across sessions because the LLM curator
would create near-identical entries with slightly different wording. BGE Small
embeddings are unusable for within-project dedup (similarity floor ~0.85+).

Changes:
- Add findFuzzyDuplicate() to ltm.ts using title word-overlap (overlap coefficient
  >= 0.7 with min 4 shared words) for real-time dedup on create()
- Add deduplicate() batch function with star-clustering for bulk cleanup
- Add 'lore data dedup' CLI subcommand (--yes to apply, dry-run by default)
- Wire post-curation dedup sweep in curator.ts to catch duplicates as they form
- Fix cross-machine import dedup in agents-file.ts to match by title+category
  instead of only by UUID (handles same entry created independently on two machines)
@BYK BYK merged commit 4f7bd31 into main May 13, 2026
7 checks passed
@BYK BYK deleted the fix/knowledge-dedup branch May 13, 2026 11:53
This was referenced May 13, 2026