fix: rebuild pipeline-agnostic qrels via TREC pooling (EVAL)#208
Conversation
Rebuild eval qrels using TREC-style pooling from both FTS5 and hybrid_rrf pipelines. Add frustration query texts to `DEFAULT_QUERY_SUITE`. Fix `make_comparable` and rank-based scoring in the hybrid benchmark.

Results: hybrid RRF beats FTS5 on all metrics with fair qrels:

- ndcg@10: 0.910 → 0.930 (+2.1%)
- mrr: 0.949 → 1.000 (+5.4%)
- recall@20: 0.671 → 0.691 (+3.1%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
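For context on the hybrid side of the comparison, reciprocal rank fusion can be sketched in a few lines. This is a generic illustration, not this repo's implementation; the function name and the conventional `k=60` constant are assumptions:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists of doc ids.

    Each document accumulates 1 / (k + rank) from every list it appears in,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort doc ids by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing an FTS ranking with a vector ranking that both put the same document first keeps that document on top even when the tails disagree.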
Caution: Review failed. Pull request was closed or merged during review.

📝 Walkthrough

The pull request introduces candidate ID collection centralization via `collect_candidate_ids()`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant build_qrels as build_qrels()
    participant collect as collect_candidate_ids()
    participant FTS5 as pipeline_fts5_only()
    participant Fallback as fallback_fts_candidates()
    participant Hybrid as pipeline_hybrid_rrf()
    participant Store as VectorStore
    User->>build_qrels: mode="pool"
    build_qrels->>collect: collect candidates
    collect->>FTS5: run FTS5 search
    FTS5-->>collect: candidate_ids (possibly empty)
    alt FTS5 returns empty
        collect->>Fallback: get fallback candidates
        Fallback-->>collect: fallback_ids
    end
    alt mode=="pool"
        collect->>Hybrid: augment with hybrid-RRF
        Hybrid->>Store: hybrid_search(query_embedding, query_text)
        Store-->>Hybrid: ranked results
        Hybrid-->>collect: hybrid_ids with scores
        collect->>collect: deduplicate & combine
    end
    collect-->>build_qrels: combined_candidates
    build_qrels-->>User: graded qrels
```
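The diagram's fallback and deduplicate-and-combine steps can be sketched as follows. This is a hypothetical simplification: the real collector calls the pipelines itself, while here the candidate lists are passed in directly:

```python
def collect_candidate_ids(fts5_ids, fallback_ids, hybrid_ids, mode="pool"):
    """Pool candidate ids from multiple retrievers, de-duplicating
    by first occurrence (FTS5 results keep priority over hybrid ones)."""
    # Fall back to FTS fallback candidates only when FTS5 returned nothing.
    candidates = list(fts5_ids) or list(fallback_ids)
    if mode == "pool":
        # Augment the pool with hybrid-RRF candidates.
        candidates.extend(hybrid_ids)
    seen, pooled = set(), []
    for cid in candidates:
        if cid not in seen:
            seen.add(cid)
            pooled.append(cid)
    return pooled
```

Keeping first occurrence (rather than re-sorting) matters for pooling fairness: the qrels see every pipeline's candidates, but no pipeline's ordering dominates the graded set.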
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```python
("frustration_003", "[Request interrupted by user]"),
(
```
🟢 Low: `eval/benchmark.py:30`
Entries `frustration_003` and `frustration_018` use `"[Request interrupted by user]"` as query text, which is a session metadata marker rather than a meaningful frustration query. When benchmark pipelines run FTS search on this literal phrase, they will return results matching the words "Request", "interrupted", "by", "user" instead of actual user frustration patterns, polluting relevance metrics with irrelevant data points. Consider removing these placeholder entries from `MINED_FRUSTRATION_QUERY_SUITE`.
```diff
- ("frustration_003", "[Request interrupted by user]"),
- ("frustration_004", 'Check surfaces 19 and 21 for DONE_FIXES_1 and DONE_FIXES_2 signals. Run: for surf in surface:19 surface:21; do echo "=== $surf ===" && cmux read-screen --surface "$surf" --lines 10; done. Report which are done. When both are done, notify user via Telegram and delete this cron.'),
- (
-     "frustration_005",
-     "Transcript quality is rough — detected English only, so Hebrew parts are garbled. We'll need to re-run forcing Hebrew. But that's for later.\n\nOn your request — the calendar is already clean and rebooked (I deleted the old events and put in 16 new ones). Let me just add your actual sleep from last night as a past event.",
- ),
- (
-     "frustration_006",
-     "Server is running on port 3000. It opened in Brave though — click Allow on the Whoop page in Zen (or whichever browser has it). Once you authorize, it'll save the token to Supabase + local cache automatically.\n\nCLAUDE_COUNTER: 7",
- ),
```
Evidence trail:

- `src/brainlayer/eval/benchmark.py:30`: `("frustration_003", "[Request interrupted by user]"),`
- `src/brainlayer/eval/benchmark.py:88`: `("frustration_018", "[Request interrupted by user]"),`
- `src/brainlayer/eval/benchmark.py:152-171`: `DEFAULT_QUERY_SUITE` includes `*MINED_FRUSTRATION_QUERY_SUITE`
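One way to act on this suggestion without hand-deleting entries is to filter the marker out when assembling the suite. A hypothetical helper, not code from the PR; the marker string is the one quoted above:

```python
# Session-metadata markers that should never be used as benchmark queries.
PLACEHOLDER_MARKERS = {"[Request interrupted by user]"}

def drop_placeholder_queries(suite):
    """Return the (query_id, query_text) pairs whose text is a real query,
    skipping entries that are only session-metadata markers."""
    return [(qid, text) for qid, text in suite if text not in PLACEHOLDER_MARKERS]
```

Filtering at suite-construction time keeps the mined data intact on disk while guaranteeing the benchmark never searches on the literal marker phrase.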
Summary
Eval Results (pooled qrels)

| Metric | FTS5 | Hybrid RRF | Δ |
| --- | --- | --- | --- |
| ndcg@10 | 0.910 | 0.930 | +2.1% |
| mrr | 0.949 | 1.000 | +5.4% |
| recall@20 | 0.671 | 0.691 | +3.1% |

Hybrid RRF wins on all metrics. MRR = 1.0 means the most relevant result is always ranked first.
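As a toy illustration of the MRR claim (a hypothetical helper, not ranx): MRR averages 1/rank of the first relevant result across queries, so it equals 1.0 exactly when every query's top-ranked result is relevant.

```python
def mean_reciprocal_rank(runs):
    """runs: list of (ranked_doc_ids, relevant_id_set) pairs, one per query."""
    total = 0.0
    for ranked, relevant in runs:
        rr = 0.0  # reciprocal rank stays 0 if no relevant doc is retrieved
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break  # only the first relevant hit counts
        total += rr
    return total / len(runs)
```

With one query hitting at rank 1 and another at rank 2, MRR is (1.0 + 0.5) / 2 = 0.75; any query whose first relevant hit is below rank 1 pulls the score under 1.0.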
Test plan

- `pytest tests/test_eval_framework.py -q` (11 passed)
- `ruff check` on touched files

🤖 Generated with Claude Code
Note
Rebuild qrels via TREC pooling combining FTS5 and hybrid RRF pipelines
- Adds `collect_candidate_ids` in build_qrels.py that pools candidates from both `pipeline_fts5_only` and `pipeline_hybrid_rrf`, de-duplicating by first occurrence; the default mode is `pool`.
- Implements `pipeline_hybrid_rrf` in benchmark.py, replacing the previous `NotImplementedError` with a real hybrid search call that returns rank-based `1/(rank+1)` scores.
- Adds `MINED_FRUSTRATION_QUERY_SUITE` and merges it into `DEFAULT_QUERY_SUITE`; `evaluate_pipeline` now passes `make_comparable=True` to `ranx.evaluate`, and runs use a `baseline` prefix.
- `build_qrels` now requires a `VectorStore` (for embedding) instead of `ReadOnlyBenchmarkStore`, and grades up to 20 results per query instead of the previous default.

📊 Macroscope summarized b9387c0. 4 files reviewed, 2 issues evaluated, 0 issues filtered, 1 comment posted.
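The rank-based scoring mentioned above can be sketched as follows (a hypothetical helper; `rank` is the 0-based position in the ranked result list):

```python
def rank_based_scores(ranked_ids):
    """Convert a ranked list of doc ids into descending 1/(rank+1) scores:
    1.0 for the top result, 0.5 for the second, 1/3 for the third, ..."""
    return {doc_id: 1.0 / (rank + 1) for rank, doc_id in enumerate(ranked_ids)}
```

Scores of this shape let rank-only retrievers feed evaluation code that expects per-document scores, while preserving the original ordering.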