DEV-1365: rank memories by BM25 instead of raw entity-overlap count #105

Merged
ZmeiGorynych merged 3 commits into main from egor/dev-1365-use-bm25-for-memory-retrieval on May 8, 2026

Conversation

@ZmeiGorynych
Member

@ZmeiGorynych ZmeiGorynych commented May 7, 2026

Summary

MemoryService.recall_memories previously ranked by raw entity-overlap count (match_count = |wanted ∩ memory.entities|), which trivially favoured memories with large entity sets — a memory tagged with 50 entities would out-overlap a precisely-tagged one of 2 regardless of relevance. This PR replaces that ranker with BM25 over canonical entity sets.

  • New module slayer/memories/ranker.py exposes bm25_rank(memories, query_entities), using rank_bm25.BM25Plus (added as a core dep).
  • IDF / avgdl are computed over the full memory corpus (recall now calls storage.list_memories(entities=None), not the intersection-filtered form).
  • An explicit set-intersection pre-filter enforces "must overlap on ≥1 entity"; BM25 is used purely to rank the eligible set.
  • RecallHit.match_count: int is replaced by RecallHit.score: float across MCP, REST, CLI, and Python client. Hard rename — no alias.
  • The empty-about recency fallback is unchanged (no BM25 in that branch).
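The pipeline described in the bullets above (corpus-wide statistics, overlap pre-filter, BM25 used only to order the eligible set) can be sketched roughly as follows. This is a simplified pure-Python illustration, not the code in slayer/memories/ranker.py: the `Memory` dataclass and the name `bm25_rank_sketch` are hypothetical, and unlike a library implementation it scores only the query terms that a memory actually contains.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    """Hypothetical stand-in for the project's Memory model."""
    id: str
    entities: tuple


def bm25_rank_sketch(memories, query_entities, k1=1.5, b=0.75, delta=1.0):
    """Return (memory, score) pairs, best first, for memories sharing >= 1 query entity."""
    wanted = set(query_entities)
    if not memories or not wanted:
        return []
    # Treat each memory as a set of atomic entity tokens (duplicates dropped).
    docs = [set(m.entities) for m in memories]
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequencies come from the full corpus, not only the eligible subset.
    df = Counter(term for d in docs for term in d)

    ranked = []
    for memory, doc in zip(memories, docs):
        matched = wanted & doc
        if not matched:
            continue  # pre-filter: must overlap on at least one entity
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in matched:
            idf = math.log((n_docs + 1) / df[term])  # BM25Plus-style IDF from the PR text
            score += idf * (delta + (k1 + 1) / (norm + 1))  # tf is 1 after dedup
        ranked.append((memory, score))
    # sorted() is stable, so ties keep their input order.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


precise = Memory("precise", ("mydb.orders.amount", "mydb.orders.qty"))
broad = Memory("broad", tuple(["mydb.orders.amount"] + [f"mydb.t{i}.c" for i in range(50)]))
other = Memory("other", ("mydb.users.id",))
ranked = bm25_rank_sketch([broad, other, precise], ["mydb.orders.amount"])
print([m.id for m, _ in ranked])  # ['precise', 'broad']
```

The non-overlapping memory still contributes to N and avgdl, which is the point of computing statistics over the full corpus, but it never appears in the result.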

Why BM25Plus and not BM25Okapi

At small corpus sizes (typical for the agent memory store), BM25Okapi's IDF formula log((N - df + 0.5) / (df + 0.5)) goes negative for any term that appears in more than half of the documents (df > N/2). With negative IDF, BM25's length normalisation inverts: broad memories get higher scores than narrow ones, which is the exact bug DEV-1365 is trying to fix. This was verified with a worked example before switching variants. BM25Plus uses log((N+1)/df), which is always positive because df ≤ N, so length normalisation keeps favouring narrow memories. It keeps the same k1=1.5 / b=0.75 defaults and adds a delta=1 constant to the term-frequency component. Documented in slayer/memories/ranker.py's module header.
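A small sketch of the sign argument, using only the two IDF formulas quoted above. The function names are illustrative, and the sketch applies the raw Okapi formula without modelling any special handling a library may apply to negative IDF values.

```python
import math


def okapi_idf(n_docs, df):
    """Raw BM25Okapi IDF as quoted in the PR description."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def plus_idf(n_docs, df):
    """BM25Plus IDF as quoted in the PR description."""
    return math.log((n_docs + 1) / df)


def tf_part(doc_len, avgdl, k1=1.5, b=0.75):
    """Length-normalised term-frequency factor for a term that occurs once."""
    return (k1 + 1) / (k1 * (1 - b + b * doc_len / avgdl) + 1)


# Two-document corpus: a precise memory (2 entities) and a broad one (51 entities),
# both tagged with the query entity, so N = 2 and df = 2.
n_docs, df, avgdl = 2, 2, (2 + 51) / 2

okapi_precise = okapi_idf(n_docs, df) * tf_part(2, avgdl)
okapi_broad = okapi_idf(n_docs, df) * tf_part(51, avgdl)
plus_precise = plus_idf(n_docs, df) * tf_part(2, avgdl)
plus_broad = plus_idf(n_docs, df) * tf_part(51, avgdl)

print("Okapi IDF:", okapi_idf(n_docs, df), "Plus IDF:", plus_idf(n_docs, df))
print("Okapi ranks broad first:", okapi_broad > okapi_precise)
print("Plus ranks precise first:", plus_precise > plus_broad)
```

Because the term-frequency factor is larger for short documents, multiplying it by a negative IDF pushes short documents further below zero, which is the inversion described above.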

Things explicitly left out of scope (flag for review)

These were either ruled out during the design interview or fall outside the bug DEV-1365 calls out. Each is listed here so reviewers know it's intentional, not an oversight:

  • Memory.version is not bumped. The persisted row shape is unchanged; only the response model RecallHit shifted. No storage migration logic added.
  • No backwards-compat alias for match_count. Hard rename to score. Old field is gone, not deprecated. Callers introspecting the field break loudly rather than silently misreading score as a count.
  • BM25 parameters are hardcoded. k1=1.5, b=0.75, delta=1 (the BM25Plus defaults). Not exposed via env var, config, or API.
  • inspect_model's Learnings section is unchanged. It's a per-model browsing view, not a retrieval query — still pulls memories via the storage entity-intersection filter and renders them in insertion / created_at order. No BM25 here.
  • No sub-tokenisation of canonical entity strings. mydb.orders.amount is one atomic BM25 token; it does not partial-match a query for mydb.orders or mydb.orders.qty. Consistent with the existing equality-on-canonical-form contract in docs/concepts/memories.md.
  • No caching of the BM25 index across calls. Every recall_memories rebuilds the index over the full corpus. Memory-store sizes don't justify a cache; revisit if any deployment ever scales past ~10k memories (separate issue).
  • storage._list_memories_rows(entities=...) is unchanged. The storage-layer entity-intersection filter is still used by inspect_model and remains available to any future filtered consumer; only the recall path switched to entities=None.
  • One additional storage round-trip per recall. Recall now always pulls every memory rather than the intersection-filtered set. Acceptable at expected memory-store sizes (≤ low thousands); flagged as a future bottleneck.
  • No CodeRabbit / Sonar review pass yet — first push on this branch. Address findings in follow-up commits.
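To make the "no sub-tokenisation" item in the list above concrete, here is a minimal sketch of equality-on-canonical-form matching. The helper name `overlapping_entities` is hypothetical and only illustrates the contract; it is not taken from the codebase.

```python
def overlapping_entities(query_entities, memory_entities):
    """Entities match only when the full canonical strings are equal."""
    return sorted(set(query_entities) & set(memory_entities))


memory_entities = ["mydb.orders.amount"]

print(overlapping_entities(["mydb.orders.amount"], memory_entities))  # ['mydb.orders.amount']
print(overlapping_entities(["mydb.orders"], memory_entities))         # []
print(overlapping_entities(["mydb.orders.qty"], memory_entities))     # []
```

A query for the parent table or a sibling column produces no overlap, so such a memory would not pass the pre-filter under this contract.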

Test plan

  • poetry run pytest -m "not integration" — 2184 passed, 0 failed
  • poetry run ruff check slayer/ tests/ — clean
  • New tests/test_memories_ranker.py — 10 cases including the DEV-1365 fix proof, edge cases (empty corpus, empty query, single-doc corpus, term-in-every-doc, defensive dedup, stability)
  • Per-surface fix-proof tests added to test_memories_client.py, test_memories_cli.py, test_memories_mcp.py, test_memories_rest.py — each proves a precisely-tagged memory ranks above an over-broad one through the public surface, and asserts score is a float
  • Manual smoke (BM25Plus over corpus of 2): bm25_rank([precise(2 entities), broad(51 entities)], ["mydb.orders.amount"]) → precise first, score 1.099 > broad's 0.691 (verified in worked example before switching variants)
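The manual smoke figures can be reproduced by hand with a closed-form single-term calculation. The sketch below assumes both memories contain the query entity exactly once and uses the parameters stated in the description (k1=1.5, b=0.75, delta=1); the helper name is illustrative.

```python
import math


def bm25plus_single_term(n_docs, df, doc_len, avgdl, tf=1, k1=1.5, b=0.75, delta=1.0):
    """BM25Plus score of one document for a single query term."""
    idf = math.log((n_docs + 1) / df)
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (delta + tf * (k1 + 1) / (norm + tf))


avgdl = (2 + 51) / 2  # average entity count over the two-memory corpus
precise = bm25plus_single_term(n_docs=2, df=2, doc_len=2, avgdl=avgdl)
broad = bm25plus_single_term(n_docs=2, df=2, doc_len=51, avgdl=avgdl)
print(f"precise={precise:.4f} broad={broad:.4f}")  # precise=1.0998 broad=0.6918
```

These values agree with the quoted 1.099 and 0.691 up to truncation of the last digit.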

Doc updates

  • docs/concepts/memories.md — recall section now describes BM25 ranking + score: float field
  • CLAUDE.md — Memories bullet appended with DEV-1365 ranker note
  • specs/DEV-1357-memories.md — historical-note block added at top pointing to DEV-1365 for the field rename (the spec body is left as a record of the v2 surface as shipped)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Improvements

    • Memory recall now ranks results by BM25 relevance, surfacing more precise matches above overbroad ones.
    • Recall results include a numeric relevance score to indicate match quality.
  • Documentation

    • Updated memories and ingestion docs and guides to reflect ranking and recall behavior and new memory concepts.
  • Tests

    • Added tests verifying BM25 ranking, score presence/type, and CLI output ordering.

ZmeiGorynych and others added 2 commits May 7, 2026 10:41
The previous ranker `match_count = |wanted ∩ memory.entities|` trivially
favoured memories with large entity sets — a memory tagged with 50
entities would out-overlap a precisely-tagged one of 2 regardless of
relevance. `recall_memories` now ranks via BM25 (`rank_bm25.BM25Plus`,
implemented in `slayer/memories/ranker.py`); IDF / avgdl are computed
over the full memory corpus, an explicit set-intersection pre-filter
enforces "must overlap on ≥1 entity," and BM25 is used purely to rank
the eligible set. `RecallHit.match_count: int` becomes
`RecallHit.score: float` across MCP, REST, CLI, and Python client.

`BM25Plus` is used rather than `BM25Okapi` because at small corpus
sizes the latter's IDF goes negative for terms that appear in even a
moderate fraction of documents, and BM25's length normalisation
inverts under negative IDF — broad memories rank above narrow ones,
the exact bug DEV-1365 is trying to fix. `BM25Plus` uses
`log((N+1)/df)` and stays positive.

The empty-`about` recency fallback is unchanged. `inspect_model`'s
Learnings section is unchanged (per-model browsing view, not a
retrieval). `Memory` persisted shape is unchanged — no version bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 220a9c7d-8b41-42dc-bc4a-fbb826eea988

📥 Commits

Reviewing files that changed from the base of the PR and between 5e2db08 and c4a0fec.

📒 Files selected for processing (3)
  • pyproject.toml
  • specs/DEV-1357-memories.md
  • tests/test_memories_ranker.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • pyproject.toml
  • specs/DEV-1357-memories.md

📝 Walkthrough

Memory recall ranking switches from entity-overlap counting to BM25 scoring over canonical entity sets. RecallHit exposes score: float instead of match_count: int. Service layer refactored to support recency-only fallback and BM25 ranking paths. All integration points and comprehensive test coverage updated.

Changes

Memory Ranking via BM25

| Layer / File(s) | Summary |
|---|---|
| Data Model & Schema (slayer/memories/models.py) | RecallHit field match_count: int replaced with score: float for BM25 relevance scoring. |
| Dependencies (pyproject.toml) | rank-bm25 = ">=0.2.2" added to package dependencies. |
| Core Ranker Implementation (slayer/memories/ranker.py) | New bm25_rank computes BM25Plus scores from tokenized, deduplicated memory entity lists with an overlap pre-filter; returns descending (Memory, score) pairs and exports via `__all__`. |
| Service Layer Refactoring (slayer/memories/service.py) | recall_memories now handles no-entity recency fallback and BM25 ranking with helpers _build_recency_response and _build_bm25_response; _to_hit includes numeric score. |
| MCP/API/CLI Integration (slayer/mcp/server.py, slayer/cli.py) | MCP docstring updated to BM25 semantics; CLI output now prints each hit's score formatted to three decimals; API shapes unchanged except RecallHit.score. |
| Specifications & Documentation (specs/DEV-1357-memories.md, CLAUDE.md, docs/concepts/memories.md, `.claude/skills/*`) | Spec annotated with DEV-1365 historical note (entity-overlap→BM25 and match_count→score). CLAUDE.md and docs/concepts/memories.md updated. Skill guides refined for rank() top-N filtering and ingestion snippet simplification. |
| Test Coverage (tests/test_memories_ranker.py, `tests/test_memories_*.py`) | Unit tests for bm25_rank (empty inputs, single-doc scoring, precision-vs-broad regression, deduplication, determinism). Integration tests across CLI, client, MCP, and REST verify BM25 outranking and numeric score presence. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant MCP
  participant Service
  participant Storage
  Client->>Service: POST /memories/recall (about)
  MCP->>Service: recall_memories tool call
  Service->>Storage: fetch candidate memories
  Service->>Service: extract entities, dedupe
  Service->>Service: call bm25_rank(candidates, entities)
  Service->>Client: return RecallResponse with hits(score)
  Service->>MCP: return RecallResponse with hits(score)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • MotleyAI/slayer#100: Main PR builds on agent-memory feature from PR #100 — replaces entity-overlap ranking with BM25-based ranker, changes RecallHit.match_count→score, and updates memories service, server, CLI, models, and tests accordingly.

Suggested reviewers

  • AivanF

Poem

🐰 I hop through memories, tidy and bright,

BM25 helps precise thoughts take flight,
Short lessons rise above the broad crowd,
Scores lined up neat and proudly avowed,
A rabbit's cheer for recall done right!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.45% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'DEV-1365: rank memories by BM25 instead of raw entity-overlap count' accurately and specifically summarizes the main change: replacing entity-overlap counting with BM25-based ranking for memory retrieval. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
pyproject.toml (1)

52-52: 💤 Low value

rank-bm25 is effectively frozen at 0.2.2 — worth noting for future maintenance.

The latest published version of rank-bm25 on PyPI is 0.2.2, and the library's maintenance is considered inactive — it hasn't seen any new versions released to PyPI in the past 12 months. The >=0.2.2 constraint is therefore equivalent to pinning exactly to 0.2.2 today.

BM25Plus is present in this version and works correctly, and no security vulnerabilities or license issues have been detected. This is not a blocking concern, but if the library ever becomes a supply-chain risk, or if the BM25Plus behaviour ever needs to change, consider replacing it with the more actively maintained bm25s alternative.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` at line 52, The dependency constraint 'rank-bm25 = ">=0.2.2"'
effectively pins us to 0.2.2; update the declaration to make intent explicit by
either changing it to an exact pin 'rank-bm25 = "==0.2.2"' or add an inline
comment next to the 'rank-bm25 = ">=0.2.2"' line explaining that the package is
unmaintained and intentionally fixed to 0.2.2, and add a short TODO to consider
replacing it with the actively maintained alternative 'bm25s' if supply-chain or
feature needs arise; reference the dependency line 'rank-bm25 = ">=0.2.2"' when
making the change.
slayer/memories/service.py (1)

269-270: ⚡ Quick win

Use keyword arguments for the new helper calls.

The new _to_hit(...) and bm25_rank(...) invocations are positional. Switching these to keywords would match the repo convention and make the matched/score ordering harder to mix up in later edits.

♻️ Proposed cleanup
-        learnings = [_to_hit(m, [], 0.0) for m in scored if m.query is None]
-        queries = [_to_hit(m, [], 0.0) for m in scored if m.query is not None]
+        learnings = [
+            _to_hit(memory=m, matched=[], score=0.0)
+            for m in scored
+            if m.query is None
+        ]
+        queries = [
+            _to_hit(memory=m, matched=[], score=0.0)
+            for m in scored
+            if m.query is not None
+        ]
@@
-        ranked = bm25_rank(ordered, query_entities)
+        ranked = bm25_rank(memories=ordered, query_entities=query_entities)
         hits = [
-            _to_hit(memory, sorted(wanted & set(memory.entities)), score)
+            _to_hit(
+                memory=memory,
+                matched=sorted(wanted & set(memory.entities)),
+                score=score,
+            )
             for memory, score in ranked
         ]

As per coding guidelines, "Use keyword arguments for functions with more than 1 parameter."

Also applies to: 297-300

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@slayer/memories/service.py` around lines 269 - 270, The calls to the helper
functions use positional args; change them to use keyword arguments instead so
argument order can't be mixed up—update the _to_hit(...) calls (e.g., the two
shown that build learnings and queries) to pass matched=[], score=0.0 (and any
other params) by name, and do the same for any bm25_rank(...) and other
_to_hit(...) invocations referenced later (around the other block mentioned) so
all multi-parameter helper calls use keyword arguments.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@specs/DEV-1357-memories.md`:
- Around line 3-7: The historical note claims the v2 surface is current but
downstream text still references the old intersection-count ranking and
RecallHit.match_count; either update §5.3 and the RecallHit model to replace
RecallHit.match_count:int and "intersection-count" ranking with
RecallHit.score:float and "BM25 over canonical entity sets" (adjust prose and
any examples accordingly), or rewrite the historical note to explicitly state
that those sections are stale; also grep the repo for "match_count",
"intersection-count", and any references to RecallHit.match_count and update
them to the new field name and ranking description or mark them as deprecated to
avoid confusion.

In `@tests/test_memories_ranker.py`:
- Line 14: Update all calls to the functions _mem and bm25_rank in this test
file to use keyword arguments instead of positional arguments; specifically,
replace calls like _mem(arg1, arg2) with _mem(key1=arg1, key2=arg2) and
bm25_rank(arg1, arg2) with bm25_rank(memories=arg1, query_entities=arg2) (the
parameter names used elsewhere in this PR) for every occurrence (including the
call at the highlighted line and the other sites listed in the comment ranges)
so every call with more than one parameter uses explicit keyword names.
- Around line 55-65: Update the test comment in
test_term_in_every_doc_still_returned to remove the incorrect reference to
BM25Okapi and state that the project uses BM25Plus (which prevents negative IDF)
— clarify that this test is checking the overlap-based pre-filter behavior in
bm25_rank so matching memories are retained regardless of BM25 variant IDF
behavior; mention the bm25_rank and test_term_in_every_doc_still_returned
identifiers so reviewers can find and update the corresponding comment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a2ad2758-1d21-4de1-8078-e6060bf1706b

📥 Commits

Reviewing files that changed from the base of the PR and between 030cbfb and 5e2db08.

⛔ Files ignored due to path filters (1)
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (17)
  • .claude/skills/slayer-models.md
  • .claude/skills/slayer-overview.md
  • .claude/skills/slayer-query.md
  • CLAUDE.md
  • docs/concepts/memories.md
  • pyproject.toml
  • slayer/cli.py
  • slayer/mcp/server.py
  • slayer/memories/models.py
  • slayer/memories/ranker.py
  • slayer/memories/service.py
  • specs/DEV-1357-memories.md
  • tests/test_memories_cli.py
  • tests/test_memories_client.py
  • tests/test_memories_mcp.py
  • tests/test_memories_ranker.py
  • tests/test_memories_rest.py

Comment thread specs/DEV-1357-memories.md Outdated
Comment thread tests/test_memories_ranker.py Outdated
Comment thread tests/test_memories_ranker.py
Three threads + one nitpick on the BM25 memory ranker landed; Sonar
clean. All four are valid; fixes are confined to test wording, the
DEV-1357 historical-note block, and a pyproject inline comment — no
runtime code changes.

- tests/test_memories_ranker.py: convert every _mem() / bm25_rank() call
  to keyword arguments per the global "kwargs for >1-param functions"
  rule, and rewrite the test_term_in_every_doc_still_returned comment
  to reflect that the implementation uses BM25Plus (not BM25Okapi —
  the latter is what BM25Plus is chosen to avoid).
- specs/DEV-1357-memories.md: broaden the historical-note block so it
  enumerates explicitly that both the recall ranking algorithm AND the
  RecallHit response shape are superseded by DEV-1365. Spec body stays
  as a record of the v2 surface as shipped.
- pyproject.toml: add an inline comment next to rank-bm25 noting the
  upstream is unmaintained at 0.2.2 with bm25s as the active
  alternative; the >= constraint stays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud Bot commented May 7, 2026

@ZmeiGorynych ZmeiGorynych merged commit 4b08be2 into main May 8, 2026
4 checks passed