DEV-1365: rank memories by BM25 instead of raw entity-overlap count by ZmeiGorynych · Pull Request #105 · MotleyAI/slayer

ZmeiGorynych · 2026-05-07T14:57:08Z

Summary

MemoryService.recall_memories previously ranked by raw entity-overlap count (match_count = |wanted ∩ memory.entities|), which trivially favoured memories with large entity sets — a memory tagged with 50 entities would out-overlap a precisely-tagged one of 2 regardless of relevance. This PR replaces that ranker with BM25 over canonical entity sets.

New module slayer/memories/ranker.py exposes bm25_rank(memories, query_entities), using rank_bm25.BM25Plus (added as a core dep).
IDF / avgdl are computed over the full memory corpus (recall now calls storage.list_memories(entities=None), not the intersection-filtered form).
An explicit set-intersection pre-filter enforces "must overlap on ≥1 entity"; BM25 is used purely to rank the eligible set.
RecallHit.match_count: int is replaced by RecallHit.score: float across MCP, REST, CLI, and Python client. Hard rename — no alias.
The empty-about recency fallback is unchanged (no BM25 in that branch).

Why `BM25Plus` and not `BM25Okapi`

At small corpus sizes (typical for the agent memory store), BM25Okapi's IDF formula log((N - df + 0.5) / (df + 0.5)) goes negative for terms that appear in even a moderate fraction of documents. With negative IDF, BM25's length-normalisation logic inverts — broad memories get higher scores than narrow ones, the exact bug DEV-1365 is trying to fix. Verified with a worked example before switching variants. BM25Plus uses log((N+1)/df) (always positive), keeps the same k1=1.5 / b=0.75 defaults, plus a delta=1 constant so the math stays in the right regime. Documented in slayer/memories/ranker.py's module header.

Things explicitly left out of scope (flag for review)

These were either ruled out during the design interview or fall outside the bug DEV-1365 calls out. Each is listed here so reviewers know it's intentional, not an oversight:

Memory.version is not bumped. The persisted row shape is unchanged; only the response model RecallHit shifted. No storage migration logic added.
No backwards-compat alias for match_count. Hard rename to score. Old field is gone, not deprecated. Callers introspecting the field break loudly rather than silently misreading score as a count.
BM25 parameters are hardcoded. k1=1.5, b=0.75, delta=1 (the BM25Plus defaults). Not exposed via env var, config, or API.
inspect_model's Learnings section is unchanged. It's a per-model browsing view, not a retrieval query — still pulls memories via the storage entity-intersection filter and renders them in insertion / created_at order. No BM25 here.
No sub-tokenisation of canonical entity strings. mydb.orders.amount is one atomic BM25 token; it does not partial-match a query for mydb.orders or mydb.orders.qty. Consistent with the existing equality-on-canonical-form contract in docs/concepts/memories.md.
No caching of the BM25 index across calls. Every recall_memories rebuilds the index over the full corpus. Memory-store sizes don't justify a cache; revisit if any deployment ever scales past ~10k memories (separate issue).
storage._list_memories_rows(entities=...) is unchanged. The storage-layer entity-intersection filter is still used by inspect_model and remains available to any future filtered consumer; only the recall path switched to entities=None.
One additional storage round-trip per recall. Recall now always pulls every memory rather than the intersection-filtered set. Acceptable at expected memory-store sizes (≤ low thousands); flagged as a future bottleneck.
No CodeRabbit / Sonar review pass yet — first push on this branch. Address findings in follow-up commits.

Test plan

poetry run pytest -m "not integration" — 2184 passed, 0 failed
poetry run ruff check slayer/ tests/ — clean
New tests/test_memories_ranker.py — 10 cases including the DEV-1365 fix proof, edge cases (empty corpus, empty query, single-doc corpus, term-in-every-doc, defensive dedup, stability)
Per-surface fix-proof tests added to test_memories_client.py, test_memories_cli.py, test_memories_mcp.py, test_memories_rest.py — each proves a precisely-tagged memory ranks above an over-broad one through the public surface, and asserts score is a float
Manual smoke (BM25Plus over corpus of 2): bm25_rank([precise(2 entities), broad(51 entities)], ["mydb.orders.amount"]) → precise first, score 1.099 > broad's 0.691 (verified in worked example before switching variants)

Doc updates

docs/concepts/memories.md — recall section now describes BM25 ranking + score: float field
CLAUDE.md — Memories bullet appended with DEV-1365 ranker note
specs/DEV-1357-memories.md — historical-note block added at top pointing to DEV-1365 for the field rename (the spec body is left as a record of the v2 surface as shipped)

🤖 Generated with Claude Code

Summary by CodeRabbit

Improvements
- Memory recall now ranks results by BM25 relevance, surfacing more precise matches above overbroad ones.
- Recall results include a numeric relevance score to indicate match quality.
Documentation
- Updated memories and ingestion docs and guides to reflect ranking and recall behavior and new memory concepts.
Tests
- Added tests verifying BM25 ranking, score presence/type, and CLI output ordering.

The previous ranker `match_count = |wanted ∩ memory.entities|` trivially favoured memories with large entity sets — a memory tagged with 50 entities would out-overlap a precisely-tagged one of 2 regardless of relevance. `recall_memories` now ranks via BM25 (`rank_bm25.BM25Plus`, implemented in `slayer/memories/ranker.py`); IDF / avgdl are computed over the full memory corpus, an explicit set-intersection pre-filter enforces "must overlap on ≥1 entity," and BM25 is used purely to rank the eligible set. `RecallHit.match_count: int` becomes `RecallHit.score: float` across MCP, REST, CLI, and Python client. `BM25Plus` is used rather than `BM25Okapi` because at small corpus sizes the latter's IDF goes negative for terms that appear in even a moderate fraction of documents, and BM25's length normalisation inverts under negative IDF — broad memories rank above narrow ones, the exact bug DEV-1365 is trying to fix. `BM25Plus` uses `log((N+1)/df)` and stays positive. The empty-`about` recency fallback is unchanged. `inspect_model`'s Learnings section is unchanged (per-model browsing view, not a retrieval). `Memory` persisted shape is unchanged — no version bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-07T14:58:01Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 220a9c7d-8b41-42dc-bc4a-fbb826eea988

📥 Commits

Reviewing files that changed from the base of the PR and between 5e2db08 and c4a0fec.

📒 Files selected for processing (3)

pyproject.toml
specs/DEV-1357-memories.md
tests/test_memories_ranker.py

🚧 Files skipped from review as they are similar to previous changes (2)

pyproject.toml
specs/DEV-1357-memories.md

📝 Walkthrough

Walkthrough

Memory recall ranking switches from entity-overlap counting to BM25 scoring over canonical entity sets. RecallHit exposes score: float instead of match_count: int. Service layer refactored to support recency-only fallback and BM25 ranking paths. All integration points and comprehensive test coverage updated.

Changes

Memory Ranking via BM25

Layer / File(s)	Summary
Data Model & Schema `slayer/memories/models.py`	`RecallHit` field `match_count: int` replaced with `score: float` for BM25 relevance scoring.
Dependencies `pyproject.toml`	`rank-bm25 = ">=0.2.2"` added to package dependencies.
Core Ranker Implementation `slayer/memories/ranker.py`	New `bm25_rank` computes BM25Plus scores from tokenized, deduplicated memory entity lists with an overlap pre-filter; returns descending (Memory, score) pairs and exports via `__all__`.
Service Layer Refactoring `slayer/memories/service.py`	`recall_memories` now handles no-entity recency fallback and BM25 ranking with helpers `_build_recency_response` and `_build_bm25_response`; `_to_hit` includes numeric `score`.
MCP/API/CLI Integration `slayer/mcp/server.py`, `slayer/cli.py`	MCP docstring updated to BM25 semantics; CLI output now prints each hit's `score` formatted to three decimals; API shapes unchanged except RecallHit.score.
Specifications & Documentation `specs/DEV-1357-memories.md`, `CLAUDE.md`, `docs/concepts/memories.md`, `.claude/skills/*`	Spec annotated with DEV-1365 historical note (entity-overlap→BM25 and match_count→score). CLAUDE.md and docs/concepts/memories.md updated. Skill guides refined for `rank()` top-N filtering and ingestion snippet simplification.
Test Coverage `tests/test_memories_ranker.py`, `tests/test_memories_*.py`	Unit tests for `bm25_rank` (empty inputs, single-doc scoring, precision-vs-broad regression, deduplication, determinism). Integration tests across CLI, client, MCP, and REST verify BM25 outranking and numeric `score` presence.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant MCP
  participant Service
  participant Storage
  Client->>Service: POST /memories/recall (about)
  MCP->>Service: recall_memories tool call
  Service->>Storage: fetch candidate memories
  Service->>Service: extract entities, dedupe
  Service->>Service: call bm25_rank(candidates, entities)
  Service->>Client: return RecallResponse with hits(score)
  Service->>MCP: return RecallResponse with hits(score)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

MotleyAI/slayer#100: Main PR builds on agent-memory feature from PR #100 — replaces entity-overlap ranking with BM25-based ranker, changes RecallHit.match_count→score, and updates memories service, server, CLI, models, and tests accordingly.

Suggested reviewers

AivanF

Poem

🐰 I hop through memories, tidy and bright,

BM25 helps precise thoughts take flight,
Short lessons rise above the broad crowd,
Scores lined up neat and proudly avowed,
A rabbit's cheer for recall done right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.45% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title 'DEV-1365: rank memories by BM25 instead of raw entity-overlap count' accurately and specifically summarizes the main change: replacing entity-overlap counting with BM25-based ranking for memory retrieval.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch egor/dev-1365-use-bm25-for-memory-retrieval

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

pyproject.toml (1)

52-52: 💤 Low value

rank-bm25 is effectively frozen at 0.2.2 — worth noting for future maintenance.

The latest published version of rank-bm25 on PyPI is 0.2.2, and the library's maintenance is considered inactive — it hasn't seen any new versions released to PyPI in the past 12 months. The >=0.2.2 constraint is therefore equivalent to pinning exactly to 0.2.2 today.

BM25Plus is present in this version and works correctly, and no security vulnerabilities or license issues have been detected. This is not a blocking concern, but if the library ever becomes a supply-chain risk or a need for BM25Plus behaviour changes arises, consider replacing it with the more actively maintained bm25s alternative.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` at line 52, The dependency constraint 'rank-bm25 = ">=0.2.2"'
effectively pins us to 0.2.2; update the declaration to make intent explicit by
either changing it to an exact pin 'rank-bm25 = "==0.2.2"' or add an inline
comment next to the 'rank-bm25 = ">=0.2.2"' line explaining that the package is
unmaintained and intentionally fixed to 0.2.2, and add a short TODO to consider
replacing it with the actively maintained alternative 'bm25s' if supply-chain or
feature needs arise; reference the dependency line 'rank-bm25 = ">=0.2.2"' when
making the change.

slayer/memories/service.py (1)

269-270: ⚡ Quick win

Use keyword arguments for the new helper calls.

The new _to_hit(...) and bm25_rank(...) invocations are positional. Switching these to keywords would match the repo convention and make the matched/score ordering harder to mix up in later edits.

♻️ Proposed cleanup

-        learnings = [_to_hit(m, [], 0.0) for m in scored if m.query is None]
-        queries = [_to_hit(m, [], 0.0) for m in scored if m.query is not None]
+        learnings = [
+            _to_hit(memory=m, matched=[], score=0.0)
+            for m in scored
+            if m.query is None
+        ]
+        queries = [
+            _to_hit(memory=m, matched=[], score=0.0)
+            for m in scored
+            if m.query is not None
+        ]
@@
-        ranked = bm25_rank(ordered, query_entities)
+        ranked = bm25_rank(memories=ordered, query_entities=query_entities)
         hits = [
-            _to_hit(memory, sorted(wanted & set(memory.entities)), score)
+            _to_hit(
+                memory=memory,
+                matched=sorted(wanted & set(memory.entities)),
+                score=score,
+            )
             for memory, score in ranked
         ]

As per coding guidelines, "Use keyword arguments for functions with more than 1 parameter."

Also applies to: 297-300

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@slayer/memories/service.py` around lines 269 - 270, The calls to the helper
functions use positional args; change them to use keyword arguments instead so
argument order can't be mixed up—update the _to_hit(...) calls (e.g., the two
shown that build learnings and queries) to pass matched=[], score=0.0 (and any
other params) by name, and do the same for any bm25_rank(...) and other
_to_hit(...) invocations referenced later (around the other block mentioned) so
all multi-parameter helper calls use keyword arguments.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@specs/DEV-1357-memories.md`:
- Around line 3-7: The historical note claims the v2 surface is current but
downstream text still references the old intersection-count ranking and
RecallHit.match_count; either update §5.3 and the RecallHit model to replace
RecallHit.match_count:int and "intersection-count" ranking with
RecallHit.score:float and "BM25 over canonical entity sets" (adjust prose and
any examples accordingly), or rewrite the historical note to explicitly state
that those sections are stale; also grep the repo for "match_count",
"intersection-count", and any references to RecallHit.match_count and update
them to the new field name and ranking description or mark them as deprecated to
avoid confusion.

In `@tests/test_memories_ranker.py`:
- Line 14: Update all calls to the functions _mem and bm25_rank in this test
file to use keyword arguments instead of positional arguments; specifically,
replace calls like _mem(arg1, arg2) with _mem(key1=arg1, key2=arg2) and
bm25_rank(arg1, arg2) with bm25_rank(documents=arg1, queries=arg2) (or the
actual parameter names used in the function definitions) for every occurrence
(including the call at the highlighted line and the other sites listed in the
comment ranges) so every call with more than one parameter uses explicit keyword
names.
- Around line 55-65: Update the test comment in
test_term_in_every_doc_still_returned to remove the incorrect reference to
BM25Okapi and state that the project uses BM25Plus (which prevents negative IDF)
— clarify that this test is checking the overlap-based pre-filter behavior in
bm25_rank so matching memories are retained regardless of BM25 variant IDF
behavior; mention the bm25_rank and test_term_in_every_doc_still_returned
identifiers so reviewers can find and update the corresponding comment.

---

Nitpick comments:
In `@pyproject.toml`:
- Line 52: The dependency constraint 'rank-bm25 = ">=0.2.2"' effectively pins us
to 0.2.2; update the declaration to make intent explicit by either changing it
to an exact pin 'rank-bm25 = "==0.2.2"' or add an inline comment next to the
'rank-bm25 = ">=0.2.2"' line explaining that the package is unmaintained and
intentionally fixed to 0.2.2, and add a short TODO to consider replacing it with
the actively maintained alternative 'bm25s' if supply-chain or feature needs
arise; reference the dependency line 'rank-bm25 = ">=0.2.2"' when making the
change.

In `@slayer/memories/service.py`:
- Around line 269-270: The calls to the helper functions use positional args;
change them to use keyword arguments instead so argument order can't be mixed
up—update the _to_hit(...) calls (e.g., the two shown that build learnings and
queries) to pass matched=[], score=0.0 (and any other params) by name, and do
the same for any bm25_rank(...) and other _to_hit(...) invocations referenced
later (around the other block mentioned) so all multi-parameter helper calls use
keyword arguments.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a2ad2758-1d21-4de1-8078-e6060bf1706b

📥 Commits

Reviewing files that changed from the base of the PR and between 030cbfb and 5e2db08.

⛔ Files ignored due to path filters (1)

poetry.lock is excluded by !**/*.lock

📒 Files selected for processing (17)

.claude/skills/slayer-models.md
.claude/skills/slayer-overview.md
.claude/skills/slayer-query.md
CLAUDE.md
docs/concepts/memories.md
pyproject.toml
slayer/cli.py
slayer/mcp/server.py
slayer/memories/models.py
slayer/memories/ranker.py
slayer/memories/service.py
specs/DEV-1357-memories.md
tests/test_memories_cli.py
tests/test_memories_client.py
tests/test_memories_mcp.py
tests/test_memories_ranker.py
tests/test_memories_rest.py

Three threads + one nitpick on the BM25 memory ranker landed; Sonar clean. All four are valid; fixes are confined to test wording, the DEV-1357 historical-note block, and a pyproject inline comment — no runtime code changes. - tests/test_memories_ranker.py: convert every _mem() / bm25_rank() call to keyword arguments per the global "kwargs for >1-param functions" rule, and rewrite the test_term_in_every_doc_still_returned comment to reflect that the implementation uses BM25Plus (not BM25Okapi — the latter is what BM25Plus is chosen to avoid). - specs/DEV-1357-memories.md: broaden the historical-note block so it enumerates explicitly that both the recall ranking algorithm AND the RecallHit response shape are superseded by DEV-1365. Spec body stays as a record of the v2 surface as shipped. - pyproject.toml: add an inline comment next to rank-bm25 noting the upstream is unmaintained at 0.2.2 with bm25s as the active alternative; the >= constraint stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-05-07T17:56:08Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

ZmeiGorynych and others added 2 commits May 7, 2026 10:41

Update docs

990474b

coderabbitai Bot reviewed May 7, 2026

View reviewed changes

Comment thread specs/DEV-1357-memories.md Outdated

Comment thread tests/test_memories_ranker.py Outdated

Comment thread tests/test_memories_ranker.py

ZmeiGorynych merged commit 4b08be2 into main May 8, 2026
4 checks passed

This was referenced May 8, 2026

DEV-1369: pin two-mode reference semantics; drop predicate-promotion escape hatch #107

Merged

Add embedding-based semantic search (third retrieval channel) #114

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEV-1365: rank memories by BM25 instead of raw entity-overlap count#105

DEV-1365: rank memories by BM25 instead of raw entity-overlap count#105
ZmeiGorynych merged 3 commits into
mainfrom
egor/dev-1365-use-bm25-for-memory-retrieval

ZmeiGorynych commented May 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 7, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ZmeiGorynych commented May 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why BM25Plus and not BM25Okapi

Things explicitly left out of scope (flag for review)

Test plan

Doc updates

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud Bot commented May 7, 2026

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ZmeiGorynych commented May 7, 2026 •

edited by coderabbitai Bot

Loading

Why `BM25Plus` and not `BM25Okapi`

coderabbitai Bot commented May 7, 2026 •

edited

Loading