feat(ranking): BM25 context relevance + Thompson sampling + unified ranker#91

Merged: Gradata merged 2 commits into `main` from `feat/ranking-bm25-thompson` on Apr 15, 2026

@Gradata Gradata commented Apr 15, 2026

Summary

Addresses the two ranking upgrades called out in the 2026-04 autoresearch synthesis (section 5) that were unblocked by the `rules.injected` emit in #86:

  • BM25 over (category + description + tags) replaces the substring keyword-overlap scorer in `rule_ranker.py` as the context-relevance signal. Expected +2-5% on injection relevance.
  • Thompson sampling over existing Beta(alpha, beta) posteriors as an opt-in mode. Solves cold-start: newly graduated PATTERN-tier rules get exploration weight instead of being buried under older RULE-tier posteriors.
  • Unifies the three ranking paths (`inject_brain_rules.py`, `agent_precontext.py`, `rule_ranker.py`) so the algorithm that ships is the one the ablation tests exercise.

Design

BM25

  • Uses `bm25s` - pure-Python, single-import, zero C extensions.
  • Added as `ranking` optional extra in `pyproject.toml` and rolled into `all`. Not a required dep - the SDK stays zero-required-deps.
  • Gated behind `try/except ImportError`; when `bm25s` is absent at import or call time, `_context_component` falls back to the legacy substring-overlap scorer. Covered by a monkeypatched fallback test.
  • Corpus: `category + description + tags` per rule. Query: `task_type + context_keywords`. Scores are max-normalized to [0, 1] and plugged into the existing 20% context weight slot.
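To make the corpus, query, and max-normalization concrete, here is a minimal sketch that uses a plain Okapi BM25 written from scratch. It is a stand-in for illustration only: it does not call the `bm25s` API, and the tokenizer, the `k1`/`b` defaults, and the helper names (`rule_document`, `bm25_scores`, `max_normalize`) are assumptions rather than what `rule_ranker.py` actually does.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer: lowercase, split on whitespace and underscores.
    return [t for t in text.lower().replace("_", " ").split() if t]


def rule_document(rule: dict) -> list[str]:
    # Corpus entry per rule: category + description + tags.
    tags = " ".join(rule.get("tags", []))
    return tokenize(f"{rule.get('category', '')} {rule.get('description', '')} {tags}")


def bm25_scores(docs: list[list[str]], query: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    # Plain Okapi BM25; the bm25s package may use a different variant.
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n or 1.0
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def max_normalize(scores: list[float]) -> list[float]:
    # Divide by the best score so the result fits the [0, 1] context slot.
    top = max(scores, default=0.0)
    return [s / top if top > 0 else 0.0 for s in scores]


rules = [
    {"category": "EMAIL", "description": "keep subject lines short", "tags": ["email", "tone"]},
    {"category": "CODE", "description": "prefer explicit loops", "tags": ["python"]},
]
query = tokenize("email_draft subject")  # task_type + context_keywords
context = max_normalize(bm25_scores([rule_document(r) for r in rules], query))
```

Max-normalization keeps the best match at 1.0 regardless of corpus size, which is what lets the value drop into a fixed 20% weight slot without rescaling the other components.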

Thompson sampling

  • Opt-in via `GRADATA_THOMPSON_RANKING=1` (default off - preserves current ranker behavior).
  • When on, the 25% confidence weight slot uses `p ~ Beta(alpha, beta_param)` instead of the mean confidence.
  • Uses stdlib `random.betavariate` - no numpy dep added.
  • New `session_seed` argument on `rank_rules` makes sampling deterministic within a session. Same seed -> same top-K; different seeds -> different orderings (validated by tests).
  • Hardens against malformed Beta params (zero / negative alpha or beta_param are clamped to 1e-3).
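The sampling step described above can be sketched with the standard library alone. This is an illustration under assumptions, not the shipped code: the helper names (`thompson_enabled`, `sample_confidences`) and the default of 1.0 for missing parameters are hypothetical; only the env flag name, `random.betavariate`, the seed behavior, and the 1e-3 clamp come from the description.

```python
import os
import random

MIN_BETA_PARAM = 1e-3  # clamp value stated in the PR description


def thompson_enabled(env=None) -> bool:
    # Opt-in flag; anything other than "1" keeps the mean-confidence path.
    env = os.environ if env is None else env
    return env.get("GRADATA_THOMPSON_RANKING", "") == "1"


def sample_confidences(rules: list[dict], session_seed=None) -> list[float]:
    # A dedicated Random instance makes draws reproducible for a given seed
    # without touching the global random state.
    rng = random.Random(session_seed)
    samples = []
    for rule in rules:
        # Assumption: missing params default to 1.0 (a uniform prior).
        alpha = max(float(rule.get("alpha", 1.0)), MIN_BETA_PARAM)
        beta_param = max(float(rule.get("beta_param", 1.0)), MIN_BETA_PARAM)
        samples.append(rng.betavariate(alpha, beta_param))
    return samples


rules = [
    {"alpha": 40.0, "beta_param": 4.0},   # established rule, tight posterior
    {"alpha": 2.0, "beta_param": 1.0},    # newly graduated rule, wide posterior
    {"alpha": 0.0, "beta_param": -2.0},   # malformed, clamped before sampling
    {},                                   # missing fields
]
first = sample_confidences(rules, session_seed=7)
again = sample_confidences(rules, session_seed=7)
other = sample_confidences(rules, session_seed=8)
```

Because the wide posterior occasionally draws a value above the tight one, a new rule can outrank an older rule in some sessions, which is the exploration behavior the opt-in mode is meant to provide.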

Unified ranking

  • `inject_brain_rules.py` and `agent_precontext.py` now both call `rule_ranker.rank_rules`. The linear `state_bonus + conf_norm + conf` scorer and the sub-agent `_relevance_score` helper are replaced.
  • The qmd wiki category match is preserved as an optional `wiki_boost: dict[str, float]` signal fed into the context component (+0.3 by default), not a hard pre-filter. BM25 can now rescue strong cross-category matches the wiki missed.
  • Back-compat `_score` shim kept in `inject_brain_rules.py` so existing tests / external callers don't break.
  • Did not touch `rule_engine.apply_rules` - it has many callers and changing its signature was out of scope. If future work wants to route `apply_rules` through the unified ranker too, it should be a separate PR with care on the public API.
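The difference between a hard pre-filter and a boost can be shown in a few lines. The sketch below assumes the boost is added to the base context score and clamped at 1.0; the clamp, the category normalization, and the helper names are assumptions, since the description only states that `wiki_boost` is a `dict[str, float]` with +0.3 as the default value.

```python
DEFAULT_WIKI_BOOST = 0.3  # default stated in the PR description


def build_wiki_boost(matched_categories, boost: float = DEFAULT_WIKI_BOOST) -> dict:
    # Hypothetical helper: one entry per category the wiki lookup matched.
    return {category.upper(): boost for category in matched_categories}


def boosted_context(base_score: float, category: str, wiki_boost=None) -> float:
    # Assumption: the boost is additive and the result is clamped to 1.0.
    bonus = (wiki_boost or {}).get(category.upper(), 0.0)
    return min(1.0, base_score + bonus)


wiki_boost = build_wiki_boost(["email"])
in_category = boosted_context(0.2, "EMAIL", wiki_boost)     # weak match, boosted
cross_category = boosted_context(0.9, "SALES", wiki_boost)  # strong match, no boost
```

With a pre-filter, the `SALES` rule would have been dropped before scoring. With a boost, it keeps its 0.9 and still outranks the boosted `EMAIL` rule.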

Weights (unchanged)

  • 30% scope match
  • 25% confidence (or Beta-sampled p when Thompson is on)
  • 20% context relevance (BM25 normalized when bm25s available, keyword fallback otherwise)
  • 15% recency
  • 10% fire count
  • ±0.10 effectiveness bonus
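As a worked example of how these weights combine, the sketch below computes a composite score from per-component values in [0, 1]. The component key names, the additive treatment of the effectiveness bonus, and the clamping are assumptions; only the weights themselves come from the list above.

```python
WEIGHTS = {
    "scope": 0.30,
    "confidence": 0.25,   # mean confidence, or a Beta sample in Thompson mode
    "context": 0.20,
    "recency": 0.15,
    "fire_count": 0.10,
}
MAX_EFFECTIVENESS_BONUS = 0.10


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def composite_score(components: dict, effectiveness_bonus: float = 0.0) -> float:
    # Weighted sum of the five components, each clamped to [0, 1].
    base = sum(
        weight * clamp(components.get(name, 0.0), 0.0, 1.0)
        for name, weight in WEIGHTS.items()
    )
    # Assumption: the bonus is additive and limited to +/-0.10.
    bonus = clamp(effectiveness_bonus, -MAX_EFFECTIVENESS_BONUS, MAX_EFFECTIVENESS_BONUS)
    return base + bonus


example = composite_score(
    {"scope": 1.0, "confidence": 0.8, "context": 0.5, "recency": 0.2, "fire_count": 0.0}
)
# 0.30*1.0 + 0.25*0.8 + 0.20*0.5 + 0.15*0.2 + 0.10*0.0 = 0.63
```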

Test plan

  • BM25 path ranks topical rules above unrelated ones when `bm25s` is installed
  • BM25 path falls back cleanly to keyword scorer when `bm25s` is absent (monkeypatched)
  • `sys.modules["bm25s"] = None` simulation doesn't crash
  • Thompson mode: same seed -> same output
  • Thompson mode: different seeds -> at least two distinct orderings across 10 seeds (exploration is real)
  • Thompson mode: zero / negative Beta params don't crash
  • `max_rules` respected
  • Output sorted descending by score
  • Empty / single-rule inputs handled
  • Missing `alpha` / `beta_param` fields don't crash in either mode
  • `wiki_boost` routes through the unified ranker and raises relevance
  • Full test suite: 2575 passed, 24 skipped (`pytest tests/ -x -q --ignore=tests/test_integration_full.py`)
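The `sys.modules["bm25s"] = None` simulation in the test plan relies on a documented property of the import system: a `None` entry in `sys.modules` makes any import of that name raise `ModuleNotFoundError`, a subclass of `ImportError`. A minimal sketch of that pattern follows. The helper names (`optional_import`, `hidden_module`, `choose_scorer`, `keyword_overlap`) are hypothetical and the real tests may use pytest's `monkeypatch` instead.

```python
import importlib
import sys
from contextlib import contextmanager


def optional_import(name: str):
    # Returns the module, or None when it cannot be imported.
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


@contextmanager
def hidden_module(name: str):
    # Simulate an absent package, then restore the previous state.
    missing = object()
    saved = sys.modules.get(name, missing)
    sys.modules[name] = None
    try:
        yield
    finally:
        if saved is missing:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = saved


def keyword_overlap(text: str, keywords: list[str]) -> float:
    # Legacy-style fallback: fraction of keywords found as substrings.
    if not keywords:
        return 0.0
    haystack = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in haystack)
    return hits / len(keywords)


def choose_scorer(backend_name: str = "bm25s") -> str:
    # Resolved at call time, so a package that disappears later still falls back.
    return "bm25" if optional_import(backend_name) is not None else "keyword"


with hidden_module("bm25s"):
    scorer_when_hidden = choose_scorer()

fallback_score = keyword_overlap("Keep email subject lines short", ["email", "subject", "sql"])
```

Checking availability at call time rather than only at import time is what covers the "absent at import or call time" case from the design notes.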

What I did NOT do

  • `rule_engine.apply_rules` - out of scope per task instructions; has many callers and changing its signature is higher risk than warranted here.
  • Removing the keyword fallback - shipped additive as instructed.
  • numpy dependency - Thompson uses `random.betavariate` from stdlib.

Generated with Gradata

…anker

Upgrades rule injection ranking to address two issues surfaced by the
2026-04 autoresearch synthesis (sec 5):

- **BM25 over (category + description + tags)** replaces substring
  keyword overlap as the context-relevance signal. Uses the pure-Python
  `bm25s` package, gated behind `try/except ImportError` so the SDK
  stays zero-required-deps — falls back cleanly to the existing keyword
  scorer when bm25s is unavailable. Added as a `ranking` optional extra
  and rolled into the `all` group.

- **Thompson sampling over (alpha, beta) posteriors** as an opt-in mode
  (`GRADATA_THOMPSON_RANKING=1`). When enabled, the confidence term is
  replaced by `p ~ Beta(alpha, beta_param)` sampled via stdlib
  `random.betavariate`, giving exploration weight to newly graduated
  PATTERN-tier rules with uncertain posteriors. Deterministic within a
  session via new `session_seed` argument — same seed yields the same
  top-K across invocations, different seeds diverge as expected.

- **Unified ranking paths**. `inject_brain_rules.py` (SessionStart) and
  `agent_precontext.py` (PreToolUse/Agent) now both call
  `rule_ranker.rank_rules`, so the algorithm the ablation tests exercise
  is the algorithm that ships. The qmd wiki-category signal is preserved as an
  optional `wiki_boost: dict[str, float]` input instead of a hard
  pre-filter — BM25 can still rescue strong cross-category matches.

- Kept `_score` shim in `inject_brain_rules.py` for back-compat with
  existing tests; did not touch `rule_engine.apply_rules` to avoid
  churn on its many callers.

Tests: 14 new cases in `tests/test_ranking_v2.py` covering BM25 win
over irrelevant rules, runtime fallback when bm25s is monkeypatched
out, Thompson determinism under seed, seed divergence across runs,
guards on bad Beta params, empty / single-rule / missing-Beta inputs,
max-K respect, and wiki_boost routing. Full suite: 2575 passed,
24 skipped.
@greptile-apps greptile-apps Bot left a comment

Gradata has reached the 50-review limit for trial accounts. To continue receiving code reviews, upgrade your plan.

coderabbitai Bot commented Apr 15, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b52d9fd5-8d3b-4520-aaf3-0042c7e35e33

📥 Commits

Reviewing files that changed from the base of the PR and between 08673ef and 49f2d1f.

📒 Files selected for processing (2)
  • src/gradata/hooks/agent_precontext.py
  • src/gradata/hooks/inject_brain_rules.py

📝 Walkthrough
  • BM25-based context relevance: rank_rules() uses optional bm25s (bm25s>=0.2.0) to score context (category+description+tags) with a keyword-overlap fallback when bm25s is unavailable.
  • Thompson sampling: opt-in via GRADATA_THOMPSON_RANKING=1 to sample confidence from Beta(alpha, beta_param) (deterministic within a session via session_seed); beta params clamped to 1e-3; no numpy dependency added.
  • Unified ranking: inject_brain_rules.py and agent_precontext.py now call rank_rules() so both code paths use the same ranking algorithm.
  • Wiki boost refactor: wiki category matching changed from a hard pre-filter to an optional wiki_boost dict fed into the context component (default +0.3).
  • API additions (non-breaking): rank_rules(...) accepts new optional parameters wiki_boost: dict[str, float] | None and session_seed: int | None.
  • Dependency change: pyproject.toml adds an optional ranking extras group with bm25s and includes it in the all extras; import guarded so runtime fallback preserves behavior.
  • Tests: new tests (tests/test_ranking_v2.py) cover BM25 behavior and fallback, Thompson sampling determinism and robustness, ordering/max_rules/edge cases, and wiki_boost; test run: 2575 passed, 24 skipped.
  • Backward compatibility: a _score shim retained in inject_brain_rules.py; existing ranking weights unchanged.

Walkthrough

Adds BM25-backed context relevance and optional Thompson-sampling to rule/lesson ranking, exposes a new optional dependency group ranking for bm25s>=0.2.0, and integrates the enhanced rank_rules into agent precontext and brain-rules injection with deterministic per-session seeding.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Dependencies | `pyproject.toml` | Add optional dependency group `ranking = ["bm25s>=0.2.0"]` and include it in the `all` extras. |
| Rule Ranker Core | `src/gradata/rules/rule_ranker.py` | Expanded `rank_rules()` to use BM25 (optional) for context relevance with a keyword fallback, added `wiki_boost` and `session_seed` params, and optional Thompson Beta sampling (controlled by env flag) with deterministic RNG. |
| Hook Integrations | `src/gradata/hooks/agent_precontext.py`, `src/gradata/hooks/inject_brain_rules.py` | Replaced local heuristic ranking with `rank_rules()` calls. Added adapters to convert lessons→rule dicts, build `context_keywords` and `wiki_boost`, and compute deterministic `session_seed`; ranked outputs re-map to original lesson objects. |
| Tests | `tests/test_ranking_v2.py` | New test suite covering BM25 relevance (with conditional skip/fallback), Thompson sampling determinism/seed divergence and edge cases, `wiki_boost` behavior, ordering invariants, and `max_rules` truncation. |
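The hook-side adapter work summarized in the table (lessons→rule dicts plus a deterministic `session_seed`) might look roughly like the sketch below. The `Lesson` fields, the helper names, and the SHA-256 derivation are all assumptions; the summary only states that a deterministic seed is computed and that ranked dicts carry the original lesson in a `_lesson` field. A digest is used here because Python's built-in `hash()` on strings is salted per process, so it would not give the same seed across invocations.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Lesson:
    # Hypothetical lesson shape, for illustration only.
    category: str
    description: str
    tags: list = field(default_factory=list)
    alpha: float = 1.0
    beta_param: float = 1.0


def lesson_to_rule_dict(lesson: Lesson) -> dict:
    # Flatten the lesson into the dict shape a ranker can score, and keep a
    # back-reference so ranked output can be mapped to the original object.
    return {
        "category": lesson.category,
        "description": lesson.description,
        "tags": list(lesson.tags),
        "alpha": lesson.alpha,
        "beta_param": lesson.beta_param,
        "_lesson": lesson,
    }


def session_seed_from_id(session_id: str) -> int:
    # Stable across processes, unlike the salted built-in hash().
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


lesson = Lesson("EMAIL", "keep subject lines short", ["email"])
rule_dict = lesson_to_rule_dict(lesson)
seed = session_seed_from_id("session-123")
```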

Sequence Diagram(s)

sequenceDiagram
    participant Hook as Agent/Brain Hook
    participant Adapter as Lesson→Rule Adapter
    participant Ranker as rank_rules
    participant BM25 as BM25 Scorer
    participant Sampler as Thompson Sampler
    participant Results as Ranked Rules

    Hook->>Adapter: provide lessons + metadata (agent_type, session_number, session_id)
    Adapter->>Ranker: call rank_rules(rule_dicts, context_keywords, wiki_boost, session_seed, max_rules)
    Ranker->>BM25: request relevance scores for context (if bm25s present)
    alt BM25 available
        BM25-->>Ranker: normalized scores per rule
    else
        Ranker->>Ranker: compute keyword hit-ratio fallback
    end
    Ranker->>Sampler: if Thompson enabled, sample Beta(alpha,beta_param) per rule using session_seed
    Sampler-->>Ranker: sampled confidence values
    Ranker->>Ranker: combine context score + wiki_boost + confidence -> composite score
    Ranker-->>Results: sort by composite score and truncate to max_rules
    Results-->>Hook: return ranked lessons via stored _lesson field

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 44.83% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the three main changes: BM25 context relevance, Thompson sampling, and unified ranker across multiple modules. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering the motivation, design, implementation details, test coverage, and intentional scope boundaries. |

@coderabbitai coderabbitai Bot added the feature label Apr 15, 2026
Previous comprehensions `[rd.get("_lesson") for rd in ranked if ... is not None]`
don't narrow types for pyright — .get() returns Optional, and the predicate
runs a second call which pyright can't tie back. Switch to explicit loop
with local variable so the narrowing sticks.

Closes 8 reportOptionalMemberAccess errors that failed CI on PR #91.
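
The narrowing issue this commit describes can be illustrated as follows. The `Lesson` class and function name are hypothetical; the point is only that binding the `.get()` result to a local variable gives the type checker a single value to narrow, whereas a comprehension that calls `.get()` twice does not.

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    # Hypothetical lesson type, for illustration only.
    description: str


def descriptions_from_ranked(ranked: list[dict[str, Lesson]]) -> list[str]:
    # Flagged form: the filter and the member access are separate .get()
    # calls, so the `is not None` check does not narrow the second one:
    #   [rd.get("_lesson").description for rd in ranked
    #    if rd.get("_lesson") is not None]
    descriptions: list[str] = []
    for rd in ranked:
        lesson = rd.get("_lesson")  # Lesson | None
        if lesson is None:
            continue
        # `lesson` is narrowed to Lesson from here on.
        descriptions.append(lesson.description)
    return descriptions


ranked = [{"_lesson": Lesson("keep subject lines short")}, {}]
result = descriptions_from_ranked(ranked)
```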
@Gradata Gradata merged commit 8d8b5aa into main Apr 15, 2026
16 checks passed
@Gradata Gradata deleted the feat/ranking-bm25-thompson branch April 17, 2026 19:50