fix: tune embedding dedup threshold to 0.935 #293

Merged
BYK merged 1 commit into main from fix/dedup-threshold on May 13, 2026

BYK commented May 13, 2026

Summary

Refines the embedding dedup threshold from 0.93 → 0.935 to eliminate a false positive caused by star clustering.

Root Cause

At threshold 0.93, star clustering bridged two distinct entries through a hub:

  • "Token-budget batching" ↔ "Nomic OOM": 0.958 (genuine duplicate)
  • "BGE Small unusable" ↔ "Nomic OOM": 0.9326 (related but distinct — different bugs)
  • "Nomic OOM" acted as the hub, pulling all three entries into one cluster

A threshold of 0.935 excludes the 0.9326 pair while keeping all genuine duplicates, which score 0.935 or higher.
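
The bridging effect described above can be reproduced with a small sketch. This is not the project's implementation: `star_clusters` is a hypothetical greedy star-clustering routine written for illustration, and the similarity between "Token-budget batching" and "BGE Small unusable" is not stated in this PR, so the sketch assumes it falls below both thresholds.

```python
# Pairwise similarities quoted in the PR description. The score between
# "Token-budget batching" and "BGE Small unusable" is not given there, so that
# pair is left out and treated as below every threshold (an assumption).
SIMILARITIES = {
    ("Token-budget batching", "Nomic OOM"): 0.958,
    ("BGE Small unusable", "Nomic OOM"): 0.9326,
}


def star_clusters(similarities, threshold):
    """Greedy star clustering (hypothetical sketch, not the project's code)."""
    neighbours = {}
    for (a, b), score in similarities.items():
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if score >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    unassigned = set(neighbours)
    clusters = []
    while unassigned:
        # The unassigned entry with the most unassigned neighbours becomes the hub.
        hub = max(sorted(unassigned), key=lambda e: len(neighbours[e] & unassigned))
        # The hub absorbs every unassigned neighbour, even if those neighbours
        # are not similar to each other. This is where bridging happens.
        members = {hub} | (neighbours[hub] & unassigned)
        clusters.append(members)
        unassigned -= members
    return clusters


print(star_clusters(SIMILARITIES, 0.93))   # one cluster containing all three entries
print(star_clusters(SIMILARITIES, 0.935))  # the duplicate pair, plus "BGE Small unusable" alone
```

At 0.93 "Nomic OOM" has two neighbours and becomes the hub for all three entries; at 0.935 the 0.9326 edge disappears and "BGE Small unusable" stays in its own cluster.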

Threshold Tuning History

| Threshold | Issue |
|-----------|-------|
| 0.85 | Too aggressive: caught related-but-distinct cross-project entries |
| 0.92 | Still caught same-subsystem entries (0.922 false positive) |
| 0.93 | Star clustering bridged 0.9326 false positive into cluster |
| 0.935 | Excludes all known false positives; all pairs above are genuine dupes |
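
One way to keep this tuning history from regressing is to check each candidate threshold against the labelled pairs mentioned in this PR. The sketch below is an assumption about how such a check could look: `KNOWN_PAIRS` and `misclassified` are hypothetical names, and only the three scores quoted here are included.

```python
# (similarity, is_genuine_duplicate) for the pairs quoted in this PR.
KNOWN_PAIRS = [
    (0.958, True),    # "Token-budget batching" vs "Nomic OOM"
    (0.9326, False),  # "BGE Small unusable" vs "Nomic OOM"
    (0.922, False),   # same-subsystem false positive seen at threshold 0.92
]


def misclassified(threshold, labelled_pairs):
    """Return the pairs whose pairwise dedup decision disagrees with their label."""
    return [
        (score, is_dupe)
        for score, is_dupe in labelled_pairs
        if (score >= threshold) != is_dupe
    ]


for candidate in (0.85, 0.92, 0.93, 0.935):
    print(candidate, misclassified(candidate, KNOWN_PAIRS))
```

This only tests pairwise decisions. It does not model clustering, so it complements rather than replaces a check on the final clusters.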

Refs #292 (adaptive threshold with user feedback — future improvement)

0.93 still caught false positives via star clustering — a hub entry
similar to two distinct entries (at 0.9326 and 0.958) pulled all three
into one cluster. 0.935 excludes 0.9326 while keeping genuine dupes.

Empirical distribution (312 Nomic v1.5 entries):
- 0.935+: all genuine duplicates
- 0.92-0.935: false positives from same-subsystem entries
- <0.92: related-but-distinct or noise
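
The bands above can be written down as a small classifier, which may be useful when inspecting new pairs by hand. This is a sketch only: `classify_similarity` and its parameter names are hypothetical, and the band edges are taken directly from the distribution listed above.

```python
def classify_similarity(score, dedup_threshold=0.935, false_positive_floor=0.92):
    """Map a similarity score to the empirical band it falls in."""
    if score >= dedup_threshold:
        return "genuine duplicate"
    if score >= false_positive_floor:
        return "same-subsystem false positive"
    return "related-but-distinct or noise"


for score in (0.958, 0.9326, 0.922, 0.85):
    print(score, classify_similarity(score))
```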

Refs #292
BYK merged commit 31208cc into main on May 13, 2026
7 checks passed
BYK deleted the fix/dedup-threshold branch on May 13, 2026 16:47