fix: rebuild pipeline-agnostic qrels via TREC pooling (EVAL)#208
Conversation
Rebuild eval qrels using TREC-style pooling from both FTS5 and hybrid_rrf pipelines. Add frustration query texts to `DEFAULT_QUERY_SUITE`. Fix `make_comparable` and rank-based scoring in the hybrid benchmark.

Results: hybrid RRF beats FTS5 on all metrics with fair qrels:

- ndcg@10: 0.910 → 0.930 (+2.1%)
- mrr: 0.949 → 1.000 (+5.4%)
- recall@20: 0.671 → 0.691 (+3.1%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
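For context on the hybrid side of the comparison, reciprocal rank fusion can be sketched in a few lines. This is a generic illustration, not this repo's implementation; the function name and the conventional `k=60` constant are assumptions:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists of doc ids.

    Each document accumulates 1 / (k + rank) from every list it appears in,
    so documents ranked highly by multiple retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Sort doc ids by fused score, best first.
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing an FTS ranking with a vector ranking that both put the same document first keeps that document on top even when the tails disagree.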
Caution: Review failed. Pull request was closed or merged during review.

📝 Walkthrough

The pull request introduces candidate ID collection centralization via `collect_candidate_ids()`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant build_qrels as build_qrels()
    participant collect as collect_candidate_ids()
    participant FTS5 as pipeline_fts5_only()
    participant Fallback as fallback_fts_candidates()
    participant Hybrid as pipeline_hybrid_rrf()
    participant Store as VectorStore
    User->>build_qrels: mode="pool"
    build_qrels->>collect: collect candidates
    collect->>FTS5: run FTS5 search
    FTS5-->>collect: candidate_ids (possibly empty)
    alt FTS5 returns empty
        collect->>Fallback: get fallback candidates
        Fallback-->>collect: fallback_ids
    end
    alt mode=="pool"
        collect->>Hybrid: augment with hybrid-RRF
        Hybrid->>Store: hybrid_search(query_embedding, query_text)
        Store-->>Hybrid: ranked results
        Hybrid-->>collect: hybrid_ids with scores
        collect->>collect: deduplicate & combine
    end
    collect-->>build_qrels: combined_candidates
    build_qrels-->>User: graded qrels
```
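The diagram's fallback and deduplicate-and-combine steps can be sketched as follows. This is a hypothetical simplification: the real collector calls the pipelines itself, while here the candidate lists are passed in directly:

```python
def collect_candidate_ids(fts5_ids, fallback_ids, hybrid_ids, mode="pool"):
    """Pool candidate ids from multiple retrievers, de-duplicating
    by first occurrence (FTS5 results keep priority over hybrid ones)."""
    # Fall back to FTS fallback candidates only when FTS5 returned nothing.
    candidates = list(fts5_ids) or list(fallback_ids)
    if mode == "pool":
        # Augment the pool with hybrid-RRF candidates.
        candidates.extend(hybrid_ids)
    seen, pooled = set(), []
    for cid in candidates:
        if cid not in seen:
            seen.add(cid)
            pooled.append(cid)
    return pooled
```

Keeping first occurrence (rather than re-sorting) matters for pooling fairness: the qrels see every pipeline's candidates, but no pipeline's ordering dominates the graded set.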
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
```python
("frustration_003", "[Request interrupted by user]"),
(
```
🟢 Low: `eval/benchmark.py:30`
Entries `frustration_003` and `frustration_018` use `"[Request interrupted by user]"` as query text, which is a session metadata marker rather than a meaningful frustration query. When benchmark pipelines run FTS search on this literal phrase, they will return results matching the words "Request", "interrupted", "by", "user" instead of actual user frustration patterns, polluting relevance metrics with irrelevant data points. Consider removing these placeholder entries from `MINED_FRUSTRATION_QUERY_SUITE`.
```diff
- ("frustration_003", "[Request interrupted by user]"),
- ("frustration_004", 'Check surfaces 19 and 21 for DONE_FIXES_1 and DONE_FIXES_2 signals. Run: for surf in surface:19 surface:21; do echo "=== $surf ===" && cmux read-screen --surface "$surf" --lines 10; done. Report which are done. When both are done, notify user via Telegram and delete this cron.'),
- (
-     "frustration_005",
-     "Transcript quality is rough — detected English only, so Hebrew parts are garbled. We'll need to re-run forcing Hebrew. But that's for later.\n\nOn your request — the calendar is already clean and rebooked (I deleted the old events and put in 16 new ones). Let me just add your actual sleep from last night as a past event.",
- ),
- (
-     "frustration_006",
-     "Server is running on port 3000. It opened in Brave though — click Allow on the Whoop page in Zen (or whichever browser has it). Once you authorize, it'll save the token to Supabase + local cache automatically.\n\nCLAUDE_COUNTER: 7",
- ),
```
Evidence trail:

- `src/brainlayer/eval/benchmark.py:30`: `("frustration_003", "[Request interrupted by user]"),`
- `src/brainlayer/eval/benchmark.py:88`: `("frustration_018", "[Request interrupted by user]"),`
- `src/brainlayer/eval/benchmark.py:152-171`: `DEFAULT_QUERY_SUITE` includes `*MINED_FRUSTRATION_QUERY_SUITE`
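One way to act on this suggestion without hand-deleting entries is to filter the marker out when assembling the suite. A hypothetical helper, not code from the PR; the marker string is the one quoted above:

```python
# Session-metadata markers that should never be used as benchmark queries.
PLACEHOLDER_MARKERS = {"[Request interrupted by user]"}

def drop_placeholder_queries(suite):
    """Return the (query_id, query_text) pairs whose text is a real query,
    skipping entries that are only session-metadata markers."""
    return [(qid, text) for qid, text in suite if text not in PLACEHOLDER_MARKERS]
```

Filtering at suite-construction time keeps the mined data intact on disk while guaranteeing the benchmark never searches on the literal marker phrase.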
Summary
Eval Results (pooled qrels)

| Metric | FTS5 | Hybrid RRF | Δ |
| --- | --- | --- | --- |
| ndcg@10 | 0.910 | 0.930 | +2.1% |
| mrr | 0.949 | 1.000 | +5.4% |
| recall@20 | 0.671 | 0.691 | +3.1% |

Hybrid RRF wins on all metrics. MRR = 1.0 means the most relevant result is always ranked first.
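As a toy illustration of the MRR claim (a hypothetical helper, not ranx): MRR averages 1/rank of the first relevant result across queries, so it equals 1.0 exactly when every query's top-ranked result is relevant.

```python
def mean_reciprocal_rank(runs):
    """runs: list of (ranked_doc_ids, relevant_id_set) pairs, one per query."""
    total = 0.0
    for ranked, relevant in runs:
        rr = 0.0  # reciprocal rank stays 0 if no relevant doc is retrieved
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break  # only the first relevant hit counts
        total += rr
    return total / len(runs)
```

With one query hitting at rank 1 and another at rank 2, MRR is (1.0 + 0.5) / 2 = 0.75; any query whose first relevant hit is below rank 1 pulls the score under 1.0.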
Test plan

- `pytest tests/test_eval_framework.py -q` (11 passed)
- `ruff check` on touched files

🤖 Generated with Claude Code
Note
Rebuild qrels via TREC pooling combining FTS5 and hybrid RRF pipelines
- Adds `collect_candidate_ids` in build_qrels.py that pools candidates from both `pipeline_fts5_only` and `pipeline_hybrid_rrf`, de-duplicating by first occurrence; the default mode is `pool`.
- Implements `pipeline_hybrid_rrf` in benchmark.py, replacing the previous `NotImplementedError` with a real hybrid search call that returns rank-based `1/(rank+1)` scores.
- Adds `MINED_FRUSTRATION_QUERY_SUITE` and merges it into `DEFAULT_QUERY_SUITE`; `evaluate_pipeline` now passes `make_comparable=True` to `ranx.evaluate`, and runs use a `baseline` prefix.
- `build_qrels` now requires a `VectorStore` (for embedding) instead of `ReadOnlyBenchmarkStore`, and grades up to 20 results per query instead of the previous default.

📊 Macroscope summarized b9387c0. 4 files reviewed, 2 issues evaluated, 0 issues filtered, 1 comment posted.
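The rank-based scoring mentioned above can be sketched as follows (a hypothetical helper; `rank` is the 0-based position in the ranked result list):

```python
def rank_based_scores(ranked_ids):
    """Convert a ranked list of doc ids into descending 1/(rank+1) scores:
    1.0 for the top result, 0.5 for the second, 1/3 for the third, ..."""
    return {doc_id: 1.0 / (rank + 1) for rank, doc_id in enumerate(ranked_ids)}
```

Scores of this shape let rank-only retrievers feed evaluation code that expects per-document scores, while preserving the original ordering.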