DEV-1365: rank memories by BM25 instead of raw entity-overlap count#105
Conversation
The previous ranker `match_count = |wanted ∩ memory.entities|` trivially favoured memories with large entity sets — a memory tagged with 50 entities would out-overlap a precisely-tagged one of 2 regardless of relevance. `recall_memories` now ranks via BM25 (`rank_bm25.BM25Plus`, implemented in `slayer/memories/ranker.py`); IDF / avgdl are computed over the full memory corpus, an explicit set-intersection pre-filter enforces "must overlap on ≥1 entity," and BM25 is used purely to rank the eligible set. `RecallHit.match_count: int` becomes `RecallHit.score: float` across MCP, REST, CLI, and Python client. `BM25Plus` is used rather than `BM25Okapi` because at small corpus sizes the latter's IDF goes negative for terms that appear in even a moderate fraction of documents, and BM25's length normalisation inverts under negative IDF — broad memories rank above narrow ones, the exact bug DEV-1365 is trying to fix. `BM25Plus` uses `log((N+1)/df)` and stays positive. The empty-`about` recency fallback is unchanged. `inspect_model`'s Learnings section is unchanged (per-model browsing view, not a retrieval). `Memory` persisted shape is unchanged — no version bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughMemory recall ranking switches from entity-overlap counting to BM25 scoring over canonical entity sets. RecallHit exposes score: float instead of match_count: int. Service layer refactored to support recency-only fallback and BM25 ranking paths. All integration points and comprehensive test coverage updated. ChangesMemory Ranking via BM25
Sequence Diagram(s)sequenceDiagram
participant Client
participant MCP
participant Service
participant Storage
Client->>Service: POST /memories/recall (about)
MCP->>Service: recall_memories tool call
Service->>Storage: fetch candidate memories
Service->>Service: extract entities, dedupe
Service->>Service: call bm25_rank(candidates, entities)
Service->>Client: return RecallResponse with hits(score)
Service->>MCP: return RecallResponse with hits(score)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
pyproject.toml (1)
52-52: 💤 Low value
rank-bm25is effectively frozen at0.2.2— worth noting for future maintenance.The latest published version of
rank-bm25on PyPI is0.2.2, and the library's maintenance is considered inactive — it hasn't seen any new versions released to PyPI in the past 12 months. The>=0.2.2constraint is therefore equivalent to pinning exactly to0.2.2today.
BM25Plusis present in this version and works correctly, and no security vulnerabilities or license issues have been detected. This is not a blocking concern, but if the library ever becomes a supply-chain risk or a need for BM25Plus behaviour changes arises, consider replacing it with the more actively maintainedbm25salternative.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pyproject.toml` at line 52, The dependency constraint 'rank-bm25 = ">=0.2.2"' effectively pins us to 0.2.2; update the declaration to make intent explicit by either changing it to an exact pin 'rank-bm25 = "==0.2.2"' or add an inline comment next to the 'rank-bm25 = ">=0.2.2"' line explaining that the package is unmaintained and intentionally fixed to 0.2.2, and add a short TODO to consider replacing it with the actively maintained alternative 'bm25s' if supply-chain or feature needs arise; reference the dependency line 'rank-bm25 = ">=0.2.2"' when making the change.slayer/memories/service.py (1)
269-270: ⚡ Quick winUse keyword arguments for the new helper calls.
The new
_to_hit(...)andbm25_rank(...)invocations are positional. Switching these to keywords would match the repo convention and make thematched/scoreordering harder to mix up in later edits.♻️ Proposed cleanup
- learnings = [_to_hit(m, [], 0.0) for m in scored if m.query is None] - queries = [_to_hit(m, [], 0.0) for m in scored if m.query is not None] + learnings = [ + _to_hit(memory=m, matched=[], score=0.0) + for m in scored + if m.query is None + ] + queries = [ + _to_hit(memory=m, matched=[], score=0.0) + for m in scored + if m.query is not None + ] @@ - ranked = bm25_rank(ordered, query_entities) + ranked = bm25_rank(memories=ordered, query_entities=query_entities) hits = [ - _to_hit(memory, sorted(wanted & set(memory.entities)), score) + _to_hit( + memory=memory, + matched=sorted(wanted & set(memory.entities)), + score=score, + ) for memory, score in ranked ]As per coding guidelines, "Use keyword arguments for functions with more than 1 parameter."
Also applies to: 297-300
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@slayer/memories/service.py` around lines 269 - 270, The calls to the helper functions use positional args; change them to use keyword arguments instead so argument order can't be mixed up—update the _to_hit(...) calls (e.g., the two shown that build learnings and queries) to pass matched=[], score=0.0 (and any other params) by name, and do the same for any bm25_rank(...) and other _to_hit(...) invocations referenced later (around the other block mentioned) so all multi-parameter helper calls use keyword arguments.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@specs/DEV-1357-memories.md`:
- Around line 3-7: The historical note claims the v2 surface is current but
downstream text still references the old intersection-count ranking and
RecallHit.match_count; either update §5.3 and the RecallHit model to replace
RecallHit.match_count:int and "intersection-count" ranking with
RecallHit.score:float and "BM25 over canonical entity sets" (adjust prose and
any examples accordingly), or rewrite the historical note to explicitly state
that those sections are stale; also grep the repo for "match_count",
"intersection-count", and any references to RecallHit.match_count and update
them to the new field name and ranking description or mark them as deprecated to
avoid confusion.
In `@tests/test_memories_ranker.py`:
- Line 14: Update all calls to the functions _mem and bm25_rank in this test
file to use keyword arguments instead of positional arguments; specifically,
replace calls like _mem(arg1, arg2) with _mem(key1=arg1, key2=arg2) and
bm25_rank(arg1, arg2) with bm25_rank(documents=arg1, queries=arg2) (or the
actual parameter names used in the function definitions) for every occurrence
(including the call at the highlighted line and the other sites listed in the
comment ranges) so every call with more than one parameter uses explicit keyword
names.
- Around line 55-65: Update the test comment in
test_term_in_every_doc_still_returned to remove the incorrect reference to
BM25Okapi and state that the project uses BM25Plus (which prevents negative IDF)
— clarify that this test is checking the overlap-based pre-filter behavior in
bm25_rank so matching memories are retained regardless of BM25 variant IDF
behavior; mention the bm25_rank and test_term_in_every_doc_still_returned
identifiers so reviewers can find and update the corresponding comment.
---
Nitpick comments:
In `@pyproject.toml`:
- Line 52: The dependency constraint 'rank-bm25 = ">=0.2.2"' effectively pins us
to 0.2.2; update the declaration to make intent explicit by either changing it
to an exact pin 'rank-bm25 = "==0.2.2"' or add an inline comment next to the
'rank-bm25 = ">=0.2.2"' line explaining that the package is unmaintained and
intentionally fixed to 0.2.2, and add a short TODO to consider replacing it with
the actively maintained alternative 'bm25s' if supply-chain or feature needs
arise; reference the dependency line 'rank-bm25 = ">=0.2.2"' when making the
change.
In `@slayer/memories/service.py`:
- Around line 269-270: The calls to the helper functions use positional args;
change them to use keyword arguments instead so argument order can't be mixed
up—update the _to_hit(...) calls (e.g., the two shown that build learnings and
queries) to pass matched=[], score=0.0 (and any other params) by name, and do
the same for any bm25_rank(...) and other _to_hit(...) invocations referenced
later (around the other block mentioned) so all multi-parameter helper calls use
keyword arguments.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a2ad2758-1d21-4de1-8078-e6060bf1706b
⛔ Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (17)
.claude/skills/slayer-models.md.claude/skills/slayer-overview.md.claude/skills/slayer-query.mdCLAUDE.mddocs/concepts/memories.mdpyproject.tomlslayer/cli.pyslayer/mcp/server.pyslayer/memories/models.pyslayer/memories/ranker.pyslayer/memories/service.pyspecs/DEV-1357-memories.mdtests/test_memories_cli.pytests/test_memories_client.pytests/test_memories_mcp.pytests/test_memories_ranker.pytests/test_memories_rest.py
Three threads + one nitpick on the BM25 memory ranker landed; Sonar clean. All four are valid; fixes are confined to test wording, the DEV-1357 historical-note block, and a pyproject inline comment — no runtime code changes. - tests/test_memories_ranker.py: convert every _mem() / bm25_rank() call to keyword arguments per the global "kwargs for >1-param functions" rule, and rewrite the test_term_in_every_doc_still_returned comment to reflect that the implementation uses BM25Plus (not BM25Okapi — the latter is what BM25Plus is chosen to avoid). - specs/DEV-1357-memories.md: broaden the historical-note block so it enumerates explicitly that both the recall ranking algorithm AND the RecallHit response shape are superseded by DEV-1365. Spec body stays as a record of the v2 surface as shipped. - pyproject.toml: add an inline comment next to rank-bm25 noting the upstream is unmaintained at 0.2.2 with bm25s as the active alternative; the >= constraint stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|



Summary
MemoryService.recall_memoriespreviously ranked by raw entity-overlap count (match_count = |wanted ∩ memory.entities|), which trivially favoured memories with large entity sets — a memory tagged with 50 entities would out-overlap a precisely-tagged one of 2 regardless of relevance. This PR replaces that ranker with BM25 over canonical entity sets.slayer/memories/ranker.pyexposesbm25_rank(memories, query_entities), usingrank_bm25.BM25Plus(added as a core dep).storage.list_memories(entities=None), not the intersection-filtered form).RecallHit.match_count: intis replaced byRecallHit.score: floatacross MCP, REST, CLI, and Python client. Hard rename — no alias.aboutrecency fallback is unchanged (no BM25 in that branch).Why
BM25Plusand notBM25OkapiAt small corpus sizes (typical for the agent memory store),
BM25Okapi's IDF formulalog((N - df + 0.5) / (df + 0.5))goes negative for terms that appear in even a moderate fraction of documents. With negative IDF, BM25's length-normalisation logic inverts — broad memories get higher scores than narrow ones, the exact bug DEV-1365 is trying to fix. Verified with a worked example before switching variants.BM25Plususeslog((N+1)/df)(always positive), keeps the samek1=1.5 / b=0.75defaults, plus adelta=1constant so the math stays in the right regime. Documented inslayer/memories/ranker.py's module header.Things explicitly left out of scope (flag for review)
These were either ruled out during the design interview or fall outside the bug DEV-1365 calls out. Each is listed here so reviewers know it's intentional, not an oversight:
Memory.versionis not bumped. The persisted row shape is unchanged; only the response modelRecallHitshifted. No storage migration logic added.match_count. Hard rename toscore. Old field is gone, not deprecated. Callers introspecting the field break loudly rather than silently misreadingscoreas a count.k1=1.5,b=0.75,delta=1(theBM25Plusdefaults). Not exposed via env var, config, or API.inspect_model's Learnings section is unchanged. It's a per-model browsing view, not a retrieval query — still pulls memories via the storage entity-intersection filter and renders them in insertion / created_at order. No BM25 here.mydb.orders.amountis one atomic BM25 token; it does not partial-match a query formydb.ordersormydb.orders.qty. Consistent with the existing equality-on-canonical-form contract indocs/concepts/memories.md.recall_memoriesrebuilds the index over the full corpus. Memory-store sizes don't justify a cache; revisit if any deployment ever scales past ~10k memories (separate issue).storage._list_memories_rows(entities=...)is unchanged. The storage-layer entity-intersection filter is still used byinspect_modeland remains available to any future filtered consumer; only the recall path switched toentities=None.Test plan
poetry run pytest -m "not integration"— 2184 passed, 0 failedpoetry run ruff check slayer/ tests/— cleantests/test_memories_ranker.py— 10 cases including the DEV-1365 fix proof, edge cases (empty corpus, empty query, single-doc corpus, term-in-every-doc, defensive dedup, stability)test_memories_client.py,test_memories_cli.py,test_memories_mcp.py,test_memories_rest.py— each proves a precisely-tagged memory ranks above an over-broad one through the public surface, and assertsscoreis a floatbm25_rank([precise(2 entities), broad(51 entities)], ["mydb.orders.amount"])→ precise first, score 1.099 > broad's 0.691 (verified in worked example before switching variants)Doc updates
docs/concepts/memories.md— recall section now describes BM25 ranking +score: floatfieldCLAUDE.md— Memories bullet appended with DEV-1365 ranker notespecs/DEV-1357-memories.md— historical-note block added at top pointing to DEV-1365 for the field rename (the spec body is left as a record of the v2 surface as shipped)🤖 Generated with Claude Code
Summary by CodeRabbit
Improvements
Documentation
Tests