Skip to content

[bm25 1/5] refactor: extract BM25Index ABC, BM25Scorer inherits#328

Merged
ASuresh0524 merged 1 commit into
StarTrail-org:mainfrom
raoabinav:refactor/bm25-index-interface
May 22, 2026
Merged

[bm25 1/5] refactor: extract BM25Index ABC, BM25Scorer inherits#328
ASuresh0524 merged 1 commit into
StarTrail-org:mainfrom
raoabinav:refactor/bm25-index-interface

Conversation

@raoabinav
Copy link
Copy Markdown
Contributor

@raoabinav raoabinav commented May 20, 2026

Sub-PR 1 of 5 from #327.

Pure refactor, no behavior change. Extracts the two methods LeannSearcher actually uses from BM25Scorer (fit(documents), search(query, top_k) -> list[SearchResult]) into a BM25Index ABC. BM25Scorer inherits, so all existing call sites work unchanged.

Sets up the next sub-PRs to drop in an FTS5-backed implementation behind the same contract without touching the consumer.

BM25 persistence train (#327)

Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5-
backed implementation behind the same contract.

Currently the only consumer of BM25 (LeannSearcher._init_bm25 / _bm25_search)
relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) ->
list[SearchResult]`. Extracting those two methods into an ABC makes the
follow-up FTS5 implementation drop-in. See StarTrail-org#327 for the broader plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raoabinav raoabinav marked this pull request as ready for review May 20, 2026 18:14
raoabinav added a commit to raoabinav/LEANN that referenced this pull request May 20, 2026
Sub-PR 3 of 5 from StarTrail-org#327. Builds on StarTrail-org#328 and StarTrail-org#332.

New `Fts5BM25Index(BM25Index)` class backed by SQLite FTS5 (`tokenize='unicode61 remove_diacritics 2'`). fit() bulk-inserts into a fresh virtual table; search() runs `MATCH` with `-bm25()` ordering so the rest of LeannSearcher (and hybrid fusion) keeps higher-is-better.

Opt-in via `LeannBuilder(bm25_backend="fts5")`. When set, build_index writes `<index>.bm25.sqlite` and records `bm25_backend="fts5"` + `bm25_db` in meta.json. `LeannSearcher._init_bm25` honors the field: fts5 → mmap the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes.

Default `bm25_backend="memory"` so nothing changes for existing callers. Default flip happens in sub-PR 4.

Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms) so the same query text behaves consistently across backends; FTS5 syntax surprises like `:` `*` get neutralized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@raoabinav raoabinav changed the title refactor: extract BM25Index ABC, BM25Scorer inherits [bm25 1/5] refactor: extract BM25Index ABC, BM25Scorer inherits May 20, 2026
@ASuresh0524 ASuresh0524 merged commit f3bab61 into StarTrail-org:main May 22, 2026
31 checks passed
ASuresh0524 pushed a commit that referenced this pull request May 22, 2026
…-search (#332)

* refactor: extract BM25Index ABC, BM25Scorer inherits

Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5-
backed implementation behind the same contract.

Currently the only consumer of BM25 (LeannSearcher._init_bm25 / _bm25_search)
relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) ->
list[SearchResult]`. Extracting those two methods into an ABC makes the
follow-up FTS5 implementation drop-in. See #327 for the broader plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: opt-in build-time BM25 snapshot to skip fit-on-first-search

Sub-PR 2 of 5 from #327. Builds on #328 (BM25Index ABC).

Adds `LeannBuilder(prebuild_bm25=True)`. When set, build_index fits a
BM25Scorer on the chunks and pickles it to <index>.bm25.pkl, then records
the snapshot filename in meta.json under "bm25_snapshot".

LeannSearcher._init_bm25 now checks for that snapshot first: if present and
loads cleanly, it skips fitting; otherwise it falls back to today's behavior
(scan passages.jsonl and fit). Older indexes are unaffected — no snapshot
field in their meta.json, so the fit-on-search path runs.

Default stays False so this PR changes nothing for existing callers. Default
flip happens in sub-PR 4 once the FTS5 backend (sub-PR 3) lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ASuresh0524 added a commit that referenced this pull request May 23, 2026
…on (#333)

* refactor: extract BM25Index ABC, BM25Scorer inherits

Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5-
backed implementation behind the same contract.

Currently the only consumer of BM25 (LeannSearcher._init_bm25 / _bm25_search)
relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) ->
list[SearchResult]`. Extracting those two methods into an ABC makes the
follow-up FTS5 implementation drop-in. See #327 for the broader plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: opt-in build-time BM25 snapshot to skip fit-on-first-search

Sub-PR 2 of 5 from #327. Builds on #328 (BM25Index ABC).

Adds `LeannBuilder(prebuild_bm25=True)`. When set, build_index fits a
BM25Scorer on the chunks and pickles it to <index>.bm25.pkl, then records
the snapshot filename in meta.json under "bm25_snapshot".

LeannSearcher._init_bm25 now checks for that snapshot first: if present and
loads cleanly, it skips fitting; otherwise it falls back to today's behavior
(scan passages.jsonl and fit). Older indexes are unaffected — no snapshot
field in their meta.json, so the fit-on-search path runs.

Default stays False so this PR changes nothing for existing callers. Default
flip happens in sub-PR 4 once the FTS5 backend (sub-PR 3) lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: Fts5BM25Index — SQLite FTS5-backed BM25 implementation

Sub-PR 3 of 5 from #327. Builds on #328 and #332.

New `Fts5BM25Index(BM25Index)` class backed by SQLite FTS5 (`tokenize='unicode61 remove_diacritics 2'`). fit() bulk-inserts into a fresh virtual table; search() runs `MATCH` with `-bm25()` ordering so the rest of LeannSearcher (and hybrid fusion) keeps higher-is-better.

Opt-in via `LeannBuilder(bm25_backend="fts5")`. When set, build_index writes `<index>.bm25.sqlite` and records `bm25_backend="fts5"` + `bm25_db` in meta.json. `LeannSearcher._init_bm25` honors the field: fts5 → mmap the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes.

Default `bm25_backend="memory"` so nothing changes for existing callers. Default flip happens in sub-PR 4.

Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms) so the same query text behaves consistently across backends; FTS5 syntax surprises like `:` `*` get neutralized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Aakash Suresh <aakashsuresh@berkeley.edu>
ASuresh0524 added a commit that referenced this pull request May 23, 2026
* refactor: extract BM25Index ABC, BM25Scorer inherits

Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5-
backed implementation behind the same contract.

Currently the only consumer of BM25 (LeannSearcher._init_bm25 / _bm25_search)
relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) ->
list[SearchResult]`. Extracting those two methods into an ABC makes the
follow-up FTS5 implementation drop-in. See #327 for the broader plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: opt-in build-time BM25 snapshot to skip fit-on-first-search

Sub-PR 2 of 5 from #327. Builds on #328 (BM25Index ABC).

Adds `LeannBuilder(prebuild_bm25=True)`. When set, build_index fits a
BM25Scorer on the chunks and pickles it to <index>.bm25.pkl, then records
the snapshot filename in meta.json under "bm25_snapshot".

LeannSearcher._init_bm25 now checks for that snapshot first: if present and
loads cleanly, it skips fitting; otherwise it falls back to today's behavior
(scan passages.jsonl and fit). Older indexes are unaffected — no snapshot
field in their meta.json, so the fit-on-search path runs.

Default stays False so this PR changes nothing for existing callers. Default
flip happens in sub-PR 4 once the FTS5 backend (sub-PR 3) lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: Fts5BM25Index — SQLite FTS5-backed BM25 implementation

Sub-PR 3 of 5 from #327. Builds on #328 and #332.

New `Fts5BM25Index(BM25Index)` class backed by SQLite FTS5 (`tokenize='unicode61 remove_diacritics 2'`). fit() bulk-inserts into a fresh virtual table; search() runs `MATCH` with `-bm25()` ordering so the rest of LeannSearcher (and hybrid fusion) keeps higher-is-better.

Opt-in via `LeannBuilder(bm25_backend="fts5")`. When set, build_index writes `<index>.bm25.sqlite` and records `bm25_backend="fts5"` + `bm25_db` in meta.json. `LeannSearcher._init_bm25` honors the field: fts5 → mmap the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes.

Default `bm25_backend="memory"` so nothing changes for existing callers. Default flip happens in sub-PR 4.

Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms) so the same query text behaves consistently across backends; FTS5 syntax surprises like `:` `*` get neutralized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: switch default BM25 backend from "memory" to "fts5"

Sub-PR 4 of 5 from #327. Builds on the FTS5 backend in sub-PR 3.

Flips LeannBuilder(bm25_backend=...) default from "memory" to "fts5". New
indexes built without specifying a backend now write a sqlite FTS5 database
instead of in-memory state. Existing indexes are unaffected — their meta.json
either lacks bm25_backend (older, fit-on-search path) or has it set
explicitly.

In-memory `BM25Scorer` stays as `bm25_backend="memory"` for tiny corpora
where the FTS5 sqlite overhead isn't worth it. Sub-PR 5 removes it after
this default has soaked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Aakash Suresh <aakashsuresh@berkeley.edu>
ASuresh0524 added a commit that referenced this pull request May 23, 2026
…ementation (#335)

* refactor: extract BM25Index ABC, BM25Scorer inherits

Pure refactor, no behavior change. Sets up follow-up PRs that add an FTS5-
backed implementation behind the same contract.

Currently the only consumer of BM25 (LeannSearcher._init_bm25 / _bm25_search)
relies on `BM25Scorer.fit(passages)` and `BM25Scorer.search(query, top_k) ->
list[SearchResult]`. Extracting those two methods into an ABC makes the
follow-up FTS5 implementation drop-in. See #327 for the broader plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: opt-in build-time BM25 snapshot to skip fit-on-first-search

Sub-PR 2 of 5 from #327. Builds on #328 (BM25Index ABC).

Adds `LeannBuilder(prebuild_bm25=True)`. When set, build_index fits a
BM25Scorer on the chunks and pickles it to <index>.bm25.pkl, then records
the snapshot filename in meta.json under "bm25_snapshot".

LeannSearcher._init_bm25 now checks for that snapshot first: if present and
loads cleanly, it skips fitting; otherwise it falls back to today's behavior
(scan passages.jsonl and fit). Older indexes are unaffected — no snapshot
field in their meta.json, so the fit-on-search path runs.

Default stays False so this PR changes nothing for existing callers. Default
flip happens in sub-PR 4 once the FTS5 backend (sub-PR 3) lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: Fts5BM25Index — SQLite FTS5-backed BM25 implementation

Sub-PR 3 of 5 from #327. Builds on #328 and #332.

New `Fts5BM25Index(BM25Index)` class backed by SQLite FTS5 (`tokenize='unicode61 remove_diacritics 2'`). fit() bulk-inserts into a fresh virtual table; search() runs `MATCH` with `-bm25()` ordering so the rest of LeannSearcher (and hybrid fusion) keeps higher-is-better.

Opt-in via `LeannBuilder(bm25_backend="fts5")`. When set, build_index writes `<index>.bm25.sqlite` and records `bm25_backend="fts5"` + `bm25_db` in meta.json. `LeannSearcher._init_bm25` honors the field: fts5 → mmap the sqlite; memory → use the pickle from sub-PR 2; absent → fall back to fit-on-search for older indexes.

Default `bm25_backend="memory"` so nothing changes for existing callers. Default flip happens in sub-PR 4.

Query tokenization matches BM25Scorer (strip punctuation, lowercase, OR terms) so the same query text behaves consistently across backends; FTS5 syntax surprises like `:` `*` get neutralized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: switch default BM25 backend from "memory" to "fts5"

Sub-PR 4 of 5 from #327. Builds on the FTS5 backend in sub-PR 3.

Flips LeannBuilder(bm25_backend=...) default from "memory" to "fts5". New
indexes built without specifying a backend now write a sqlite FTS5 database
instead of in-memory state. Existing indexes are unaffected — their meta.json
either lacks bm25_backend (older, fit-on-search path) or has it set
explicitly.

In-memory `BM25Scorer` stays as `bm25_backend="memory"` for tiny corpora
where the FTS5 sqlite overhead isn't worth it. Sub-PR 5 removes it after
this default has soaked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop in-memory BM25Scorer; FTS5 is the only BM25 implementation

Sub-PR 5 of 5 from #327. Builds on sub-PR 4 (FTS5 default).

Deletes BM25Scorer entirely along with the build-time pickle snapshot path
(_build_bm25_snapshot, the "bm25_snapshot" meta.json field, the prebuild_bm25
kwarg) and the fit-on-search fallback inside _init_bm25. Drops the now-orphan
`from collections import Counter, defaultdict`.

`bm25_backend` is now fts5-only; passing "memory" raises ValueError. Indexes
built before this had a BM25 artifact will get a clear RuntimeError on first
hybrid/BM25 call pointing the user at `leann rebuild`.

Net: -74 LOC of dead BM25 implementation, one canonical path.

Breaking change for indexes without an FTS5 artifact. Documented in the error
message; affects only indexes built pre-sub-PR-3 (or with bm25_backend="memory"
during the soak window between sub-PRs 2-4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Aakash Suresh <aakashsuresh@berkeley.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants