fix: tune embedding dedup threshold to 0.935 by BYK · Pull Request #293 · BYK/loreai

BYK · 2026-05-13T16:43:03Z

Summary

Refines the embedding dedup threshold from 0.93 → 0.935 to eliminate a false positive caused by star clustering.

Root Cause

At threshold 0.93, star clustering bridged two distinct entries through a hub:

"Token-budget batching" ↔ "Nomic OOM": 0.958 (genuine duplicate)
"BGE Small unusable" ↔ "Nomic OOM": 0.9326 (related but distinct — different bugs)
"Nomic OOM" acted as hub, pulling all three into one cluster

0.935 excludes the 0.9326 pair while keeping all genuine duplicates (0.935+).
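The hub effect can be reproduced with a minimal sketch. It assumes one common reading of star clustering (the entry with the most above-threshold neighbours becomes a hub and absorbs all of them), which may differ from the repository's actual implementation. The function name `star_clusters` is hypothetical. The PR does not report the similarity of the third pair ("Token-budget batching" vs "BGE Small unusable"), so the sketch assumes it falls below both thresholds.

```python
from collections import defaultdict

# Pairwise similarities quoted in this PR. The third pair
# ("Token-budget batching" vs "BGE Small unusable") is not reported;
# ASSUMPTION: it is below both thresholds, so it is simply left out.
SIMILARITIES = {
    ("Token-budget batching", "Nomic OOM"): 0.958,
    ("BGE Small unusable", "Nomic OOM"): 0.9326,
}


def star_clusters(similarities, threshold):
    """Greedy star clustering (hypothetical sketch, not the repo's code).

    Repeatedly picks the unassigned entry with the most unassigned
    neighbours as a hub and groups it with all of those neighbours.
    """
    neighbours = defaultdict(set)
    entries = set()
    for (a, b), score in similarities.items():
        entries.update((a, b))
        if score >= threshold:  # ASSUMPTION: the threshold is inclusive
            neighbours[a].add(b)
            neighbours[b].add(a)

    unassigned = set(entries)
    clusters = []
    while unassigned:
        # Sort first so ties between equally connected entries are deterministic.
        hub = max(sorted(unassigned), key=lambda e: len(neighbours[e] & unassigned))
        members = {hub} | (neighbours[hub] & unassigned)
        clusters.append(members)
        unassigned -= members
    return clusters


for threshold in (0.93, 0.935):
    print(threshold, star_clusters(SIMILARITIES, threshold))
```

At 0.93 the hub "Nomic OOM" has two neighbours and all three entries land in one cluster. At 0.935 the 0.9326 edge disappears, so "BGE Small unusable" ends up in a cluster of its own.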

Threshold Tuning History

Threshold	Issue
0.85	Too aggressive — caught related-but-distinct cross-project entries
0.92	Still caught same-subsystem entries (0.922 false positive)
0.93	Star clustering bridged 0.9326 false positive into cluster
0.935	Excludes all known false positives; every observed pair at or above it is a genuine dupe
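The history above can be turned into a small regression check. This is a sketch using only the scores quoted in this PR; the names `LABELLED_PAIRS` and `misclassified` are hypothetical, and the inclusive `>=` comparison is an assumption based on the "0.935+" wording.

```python
# Labelled similarity scores quoted in this PR: (score, is_genuine_duplicate).
LABELLED_PAIRS = [
    (0.958, True),    # "Token-budget batching" vs "Nomic OOM"
    (0.9326, False),  # "BGE Small unusable" vs "Nomic OOM"
    (0.922, False),   # same-subsystem false positive seen at threshold 0.92
]


def misclassified(threshold, labelled_pairs):
    """Return the pairs a threshold gets wrong.

    A pair is flagged as a duplicate when score >= threshold (ASSUMPTION:
    inclusive comparison), so an error is any pair whose flag disagrees
    with its label.
    """
    return [
        (score, genuine)
        for score, genuine in labelled_pairs
        if (score >= threshold) != genuine
    ]


for threshold in (0.92, 0.93, 0.935):
    print(threshold, misclassified(threshold, LABELLED_PAIRS))
```

Only 0.935 yields an empty list for these three pairs. Adding each newly found false positive to the list would make future threshold changes easier to verify.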

Refs #292 (adaptive threshold with user feedback — future improvement)

0.93 still caught false positives via star clustering: a hub entry similar to two distinct entries (at 0.9326 and 0.958) pulled all three into one cluster. 0.935 excludes 0.9326 while keeping genuine dupes.

Empirical distribution (312 Nomic v1.5 entries):
- 0.935+: all genuine duplicates
- 0.92-0.935: false positives from same-subsystem entries
- <0.92: related-but-distinct or noise

Refs #292

BYK merged commit 31208cc into main May 13, 2026
7 checks passed

BYK deleted the fix/dedup-threshold branch May 13, 2026 16:47

This was referenced May 13, 2026

publish: BYK/loreai@0.18.0 #294

Closed

publish: BYK/loreai@0.18.0 #296

Closed
