fix: add word-overlap dedup for knowledge entries #286
Merged
…e accumulation

Knowledge entries accumulated duplicates across sessions because the LLM curator would create near-identical entries with slightly different wording. BGE Small embeddings are unusable for within-project dedup (similarity floor ~0.85+).

Changes:
- Add findFuzzyDuplicate() to ltm.ts using title word-overlap (overlap coefficient >= 0.7 with min 4 shared words) for real-time dedup on create()
- Add deduplicate() batch function with star-clustering for bulk cleanup
- Add 'lore data dedup' CLI subcommand (--yes to apply, dry-run by default)
- Wire post-curation dedup sweep in curator.ts to catch duplicates as they form
- Fix cross-machine import dedup in agents-file.ts to match by title+category instead of only by UUID (handles same entry created independently on two machines)
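The title check described above can be sketched roughly as follows. This is an illustration, not the code from ltm.ts: the tokenizer (lowercase, split on non-alphanumerics) and the names titleWords, titleOverlap and isFuzzyDuplicate are assumptions made for the example. Only the overlap coefficient, |intersection| / min(|setA|, |setB|), and the two thresholds (0.7 and 4 shared words) come from the PR.

```typescript
// Hypothetical tokenizer: lowercase, split on non-alphanumerics, drop empties.
// The real tokenization in ltm.ts may differ (e.g. stop-word handling).
function titleWords(title: string): Set<string> {
  const words = title
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0);
  return new Set(words);
}

interface OverlapResult {
  shared: number;      // number of distinct words present in both titles
  coefficient: number; // shared / size of the smaller word set
}

// Overlap coefficient: |intersection| / min(|setA|, |setB|).
function titleOverlap(a: string, b: string): OverlapResult {
  const wordsA = titleWords(a);
  const wordsB = titleWords(b);
  let shared = 0;
  wordsA.forEach((w) => {
    if (wordsB.has(w)) shared++;
  });
  const smaller = Math.min(wordsA.size, wordsB.size);
  return { shared, coefficient: smaller === 0 ? 0 : shared / smaller };
}

// Dual threshold: both conditions must hold for two titles to count as duplicates.
function isFuzzyDuplicate(
  a: string,
  b: string,
  minCoefficient = 0.7,
  minShared = 4,
): boolean {
  const { shared, coefficient } = titleOverlap(a, b);
  return coefficient >= minCoefficient && shared >= minShared;
}
```

Under these assumptions, the shared-word minimum is what keeps short titles from matching: two three-word titles can reach a coefficient of 1.0 and still fall below the four-word floor.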
Summary
Knowledge entries accumulated duplicates across sessions because the LLM curator creates near-identical entries with slightly different wording. Embedding-based dedup doesn't work — BGE Small cosine similarity clusters at 0.85–0.97+ within the same codebase domain, making any threshold produce unacceptable false positives.
This PR adds title word-overlap dedup using the overlap coefficient (|intersection| / min(|setA|, |setB|)) with a dual threshold (≥0.7 coefficient AND ≥4 shared words). Star clustering prevents transitive snowballing.
Changes
- ltm.ts: findFuzzyDuplicate() for real-time dedup on create(), deduplicate() for batch cleanup with star clustering
- curator.ts: Post-curation dedup sweep catches duplicates as they form
- agents-file.ts: Cross-machine import dedup matches by title+category instead of only UUID (handles same entry created independently on two machines)
- data.ts: lore data dedup CLI subcommand (dry-run by default, --yes to apply)
Results
Tested on real data: 111 duplicate entries removed across 33 clusters in 2 projects with zero false positives.
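To illustrate why star clustering avoids transitive snowballing, here is a minimal sketch. It assumes that "star clustering" means each cluster has one center entry and every other member must match that center directly, rather than merely matching some other member. The function name starCluster and the generic isDuplicate predicate are hypothetical; the actual deduplicate() in ltm.ts may choose centers and order entries differently.

```typescript
// Groups items into star-shaped clusters: the first unassigned item becomes
// a center, and only items that match the center itself join its cluster.
// Returns only clusters with more than one member (i.e. actual duplicates).
function starCluster<T>(
  items: T[],
  isDuplicate: (center: T, candidate: T) => boolean,
): T[][] {
  const assigned: boolean[] = items.map(() => false);
  const clusters: T[][] = [];

  for (let i = 0; i < items.length; i++) {
    if (assigned[i]) continue;
    assigned[i] = true;
    const center = items[i];
    const cluster: T[] = [center];

    for (let j = i + 1; j < items.length; j++) {
      // Compare against the center only, never against other members,
      // so a chain A~B, B~C does not pull C into A's cluster.
      if (!assigned[j] && isDuplicate(center, items[j])) {
        assigned[j] = true;
        cluster.push(items[j]);
      }
    }

    if (cluster.length > 1) clusters.push(cluster);
  }
  return clusters;
}
```

With a toy predicate such as "numbers within 1 of each other", the input [1, 2, 3] produces a single cluster containing 1 and 2. The value 3 matches 2 but not the center 1, so it stays out, whereas a transitive (single-linkage) grouping would merge all three.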