
fix: rebuild pipeline-agnostic qrels via TREC pooling (EVAL)#208

Merged
EtanHey merged 1 commit into main from feat/eval-pooled-qrels on Apr 5, 2026

Conversation


@EtanHey EtanHey commented Apr 5, 2026

Summary

  • Rebuild eval qrels using TREC-style pooling from both FTS5 and hybrid_rrf pipelines (49 queries, 1,433 judgments)
  • Add 24 mined frustration query texts to DEFAULT_QUERY_SUITE
  • Fix make_comparable in evaluate(), rank-based scoring in hybrid pipeline
  • Checked-in benchmark results proving hybrid RRF improvement
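
The TREC-style pooling above merges ranked candidate lists from the two pipelines before grading. A minimal sketch of the pool-and-deduplicate-by-first-occurrence behavior (the pipeline callables here are hypothetical stand-ins, not the repository's actual functions):

```python
from typing import Callable, Iterable


def pool_candidates(
    query: str,
    pipelines: Iterable[Callable[[str], list[str]]],
) -> list[str]:
    """Pool candidate doc ids from several retrieval pipelines,
    de-duplicating by first occurrence (TREC-style pooling)."""
    seen: set[str] = set()
    pooled: list[str] = []
    for run in pipelines:
        for doc_id in run(query):
            if doc_id not in seen:
                seen.add(doc_id)
                pooled.append(doc_id)
    return pooled
```

Because ids are kept in first-seen order, the pool is stable regardless of how many pipelines contribute, and no pipeline's candidates are privileged during grading.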

Eval Results (pooled qrels)

Metric      FTS5    Hybrid RRF   Delta
ndcg@10     0.910   0.930        +2.1%
mrr         0.949   1.000        +5.4%
map@10      0.332   0.341        +2.6%
recall@20   0.671   0.691        +3.1%

Hybrid RRF wins on every metric. MRR = 1.0 means that for every query, the top-ranked result is relevant (reciprocal rank of 1 across the board).
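
MRR is the mean over queries of 1/rank of the first relevant result, so MRR = 1.0 requires a relevant hit at rank 1 for every single query. A quick illustration:

```python
def mean_reciprocal_rank(first_relevant_ranks: list[int]) -> float:
    """MRR over queries, given the 1-based rank of each query's
    first relevant result."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# Every query's top result is relevant -> MRR is exactly 1.0:
# mean_reciprocal_rank([1, 1, 1]) == 1.0
# One query slips to rank 2 -> MRR drops:
# mean_reciprocal_rank([1, 2]) == 0.75
```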

Test plan

  • pytest tests/test_eval_framework.py -q (11 passed)
  • ruff check on touched files

🤖 Generated with Claude Code

Note

Rebuild qrels via TREC pooling combining FTS5 and hybrid RRF pipelines

  • Adds collect_candidate_ids in build_qrels.py that pools candidates from both pipeline_fts5_only and pipeline_hybrid_rrf, de-duplicating by first occurrence; default mode is pool.
  • Implements pipeline_hybrid_rrf in benchmark.py, replacing the previous NotImplementedError with a real hybrid search call that returns rank-based 1/(rank+1) scores.
  • Adds MINED_FRUSTRATION_QUERY_SUITE and merges it into DEFAULT_QUERY_SUITE; evaluate_pipeline now passes make_comparable=True to ranx.evaluate.
  • Updates run_benchmark.py to abort when no valid queries are found and names output files per pipeline instead of using a hardcoded baseline prefix.
  • Behavioral Change: default build_qrels now requires a VectorStore (for embedding) instead of ReadOnlyBenchmarkStore, and grades up to 20 results per query instead of the previous default.
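
The 1/(rank+1) scoring mentioned for pipeline_hybrid_rrf can be sketched as follows (a simplified illustration assuming 0-based ranks, not the repository's actual implementation):

```python
def rank_based_scores(ranked_ids: list[str]) -> dict[str, float]:
    """Map a rank-ordered list of doc ids to descending scores via
    1/(rank+1): rank 0 -> 1.0, rank 1 -> 0.5, rank 2 -> ~0.333."""
    return {doc_id: 1.0 / (rank + 1) for rank, doc_id in enumerate(ranked_ids)}
```

Synthetic scores like these preserve the ranking, which is all that rank-based metrics (ndcg@k, mrr, map@k) consume.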
📊 Macroscope summarized b9387c0. 4 files reviewed, 2 issues evaluated, 0 issues filtered, 1 comment posted


Summary by CodeRabbit

Release Notes

  • New Features

    • Hybrid RRF search pipeline is now fully implemented and functional.
    • Added frustration-themed queries to the evaluation suite.
    • CLI now supports pool and FTS-only modes for candidate selection.
  • Improvements

    • Benchmark output filenames now include pipeline names for clarity.
    • Enhanced query filtering to exclude empty query text.
    • Default candidate result count increased to 20.
  • Tests

    • Expanded evaluation framework test coverage with new integrity checks and pipeline consistency validation.

Rebuild eval qrels using TREC-style pooling from both FTS5 and hybrid_rrf
pipelines. Add frustration query texts to DEFAULT_QUERY_SUITE. Fix
make_comparable and rank-based scoring in hybrid benchmark.

Results: hybrid RRF beats FTS5 on all metrics with fair qrels:
  ndcg@10: 0.910 → 0.930 (+2.1%)
  mrr: 0.949 → 1.000 (+5.4%)
  recall@20: 0.671 → 0.691 (+3.1%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Apr 5, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

The pull request introduces candidate ID collection centralization via collect_candidate_ids(), enabling dual-mode query processing: "fts-only" returns FTS5 results; "pool" mode augments with hybrid-RRF results. build_qrels signature updated to accept mode parameter and configure appropriate stores. Benchmark runner validation and filename generation also updated. Evaluation framework expanded with frustration query suite and working hybrid-RRF implementation.

Changes

Changes by cohort:

  • Candidate Collection & Build Logic — scripts/build_qrels.py: Introduced collect_candidate_ids() to centralize FTS5 + optional hybrid-RRF candidate selection. Updated the build_qrels signature to accept a mode parameter ("pool" or "fts-only") plus store-type selection logic. CLI updated with mutually exclusive --pool / --fts-only flags; default n_results changed to 20.
  • Benchmark Runner — scripts/run_benchmark.py: Added validation requiring a non-empty intersection of benchmark queries with qrels. Changed the result filename from the fixed baseline_{timestamp}.json to the pipeline-specific {pipeline_name}_{timestamp}.json.
  • Evaluation Framework — src/brainlayer/eval/benchmark.py: Added MINED_FRUSTRATION_QUERY_SUITE and expanded DEFAULT_QUERY_SUITE. Tightened queries_in_qrels() filtering to require non-empty query text. Updated evaluation to call ranx.evaluate(..., make_comparable=True). Implemented a working pipeline_hybrid_rrf() with an optional embed_fn parameter; it now calls store.hybrid_search() and converts rank order to inverse-rank scores.
  • Test Coverage — tests/test_eval_framework.py: Added coverage for pipeline/qrels mismatch handling, expanded qrels format validation, frustration query suite integrity checks, pooled qrels consistency across pipelines, and unit tests for pipeline_hybrid_rrf rank scoring. Environment defaults are set at import time to disable Numba JIT and localize caches.
  • Evaluation Results Artifacts — tests/eval_results/fts5_20260405T005855Z.json, tests/eval_results/hybrid_rrf_20260405T005847Z.json: New JSON evaluation output files recording retrieval metrics (map@10, mrr, ndcg@10, recall@20), qrels metadata, and query counts for the fts5 and hybrid_rrf pipelines.
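
The make_comparable=True flag passed to ranx.evaluate restricts scoring to queries present in both the qrels and the run, so pipelines with differing query coverage are compared on the shared subset. Conceptually (an illustrative stdlib sketch, not ranx's implementation):

```python
def make_comparable(
    qrels: dict[str, dict[str, int]],
    run: dict[str, dict[str, float]],
) -> tuple[dict[str, dict[str, int]], dict[str, dict[str, float]]]:
    """Keep only queries present in both qrels and run, mirroring what
    ranx.evaluate(..., make_comparable=True) does before scoring."""
    shared = qrels.keys() & run.keys()
    return (
        {q: qrels[q] for q in shared},
        {q: run[q] for q in shared},
    )
```

Without this step, queries missing from one side would silently skew per-query averages like mrr and map@10.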

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant build_qrels as build_qrels()
    participant collect as collect_candidate_ids()
    participant FTS5 as pipeline_fts5_only()
    participant Fallback as fallback_fts_candidates()
    participant Hybrid as pipeline_hybrid_rrf()
    participant Store as VectorStore

    User->>build_qrels: mode="pool"
    build_qrels->>collect: collect candidates
    collect->>FTS5: run FTS5 search
    FTS5-->>collect: candidate_ids (possibly empty)
    alt FTS5 returns empty
        collect->>Fallback: get fallback candidates
        Fallback-->>collect: fallback_ids
    end
    alt mode=="pool"
        collect->>Hybrid: augment with hybrid-RRF
        Hybrid->>Store: hybrid_search(query_embedding, query_text)
        Store-->>Hybrid: ranked results
        Hybrid-->>collect: hybrid_ids with scores
        collect->>collect: deduplicate & combine
    end
    collect-->>build_qrels: combined_candidates
    build_qrels-->>User: graded qrels

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Poem

🐰 A pooling hop brings candidates galore,
FTS5 first, then hybrid encore,
Dedup the ids, combine the score,
Frustration queries added to core—
Benchmark dreams, we couldn't ask for more! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 12.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'fix: rebuild pipeline-agnostic qrels via TREC pooling (EVAL)' accurately captures the main change: rebuilding evaluation qrels using TREC-style pooling from both FTS5 and hybrid_rrf pipelines to produce pipeline-agnostic relevance judgments.


Comment on lines +30 to +31
("frustration_003", "[Request interrupted by user]"),
(

🟢 Low eval/benchmark.py:30

Entries frustration_003 and frustration_018 use "[Request interrupted by user]" as query text, which is a session metadata marker rather than a meaningful frustration query. When benchmark pipelines run FTS search on this literal phrase, they will return results for the words "Request", "interrupted", "by", "user" instead of matching actual user frustration patterns, polluting relevance metrics with irrelevant data points. Consider removing these placeholder entries from MINED_FRUSTRATION_QUERY_SUITE.

-    ("frustration_003", "[Request interrupted by user]"),
-    ("frustration_004", 'Check surfaces 19 and 21 for DONE_FIXES_1 and DONE_FIXES_2 signals. Run: for surf in surface:19 surface:21; do echo "=== $surf ===" && cmux read-screen --surface "$surf" --lines 10; done. Report which are done. When both are done, notify user via Telegram and delete this cron.'),
-    (
-        "frustration_005",
-        "Transcript quality is rough — detected English only, so Hebrew parts are garbled. We'll need to re-run forcing Hebrew. But that's for later.\n\nOn your request — the calendar is already clean and rebooked (I deleted the old events and put in 16 new ones). Let me just add your actual sleep from last night as a past event.",
-    ),
-    (
-        "frustration_006",
-        "Server is running on port 3000. It opened in Brave though — click Allow on the Whoop page in Zen (or whichever browser has it). Once you authorize, it'll save the token to Supabase + local cache automatically.\n\nCLAUDE_COUNTER: 7",
-    ),
Evidence trail:
src/brainlayer/eval/benchmark.py:30 - `("frustration_003", "[Request interrupted by user]"),`
src/brainlayer/eval/benchmark.py:88 - `("frustration_018", "[Request interrupted by user]"),`
src/brainlayer/eval/benchmark.py:152-171 - `DEFAULT_QUERY_SUITE` includes `*MINED_FRUSTRATION_QUERY_SUITE`
Commit: REVIEWED_COMMIT

@EtanHey EtanHey merged commit bf667e6 into main Apr 5, 2026
4 of 6 checks passed
