Skip to content

v1.3.0 (Phase 4): three-bucket tokenizer + weighted-DAG index

Choose a tag to compare

@MrDanRave MrDanRave released this 24 Jun 08:44
· 29 commits to master since this release
Rework the matcher per SEMANTIC_PLAN v2:
- nlp.ts: bucket tokenizer (intra/phrase/anchor), atoms w/ anchor gaps,
  splitUnits, extractBase (digit-variant father). 802.1q stays one atom.
- TitleIndex: unit-based weighted bipartite index (byToken edges + byPhrase),
  base/child digit-strip edges, IDF weights. Replaces the flat inverted index.
- scoreRegion: n-gram + single-token candidates from the index (no prefix scan,
  no COVERAGE_THRESHOLD gate, no mid-word). lexScore = IDF-weighted coverage +
  exact-token bonus; significance folds frequency-list + IDF; case penalty -1.5.
- Dedup policies: merge corroborating adjacent anchors ("hub dumbo" -> one),
  overlap -> highest confidence (suppresses contained sub-spans).
- scanRegion deleted.
- Settings: per-vault tokenizer bucket editor + restore-defaults; rebuild on change.
- SCORING weights add `accept` (Phase 7); renormalize over available signals.

Verified against 13 acceptance cases incl. the original Hub1/"hub" bug,
kruger->Dunning-Kruger, 802.1q, Story of Love (phrase), user/root/mnt
(IDF self-suppression), and the "hub dumbo" merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>