ablation: rerank shortlist size sweep #10
Open
mukund-setti wants to merge 3 commits into main from
added 3 commits
April 25, 2026 13:51
The project switched from OpenRouter to Gemini after the initial update/ landed; the LLM auto-mode probe was still reading the deleted openrouter_api_key field, so the heuristic path was always selected even when a real key was configured. Probe gemini_api_key / google_api_key first, fall back to openrouter_api_key, and treat all of them as optional so offline mode keeps working. Made-with: Cursor
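A minimal sketch of the probe order described in this commit message, assuming the settings object exposes the keys as plain optional attributes. Only the three field names come from the commit; `resolve_llm_key`, `pick_mode`, and the mode names "llm" and "heuristic" are hypothetical and used for illustration.

```python
from types import SimpleNamespace
from typing import Optional

# Probe order as stated in the commit message: Gemini/Google keys first,
# then the legacy OpenRouter key.
_KEY_FIELDS = ("gemini_api_key", "google_api_key", "openrouter_api_key")


def resolve_llm_key(settings: object) -> Optional[str]:
    """Return the first non-empty key, or None so offline mode keeps working."""
    for field in _KEY_FIELDS:
        value = getattr(settings, field, None)  # a missing field is not an error
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


def pick_mode(settings: object) -> str:
    # Hypothetical mode names; the real auto-mode probe may differ.
    return "llm" if resolve_llm_key(settings) else "heuristic"


# Usage: no key configured falls back to the heuristic path.
offline = SimpleNamespace()
configured = SimpleNamespace(google_api_key="g-key", openrouter_api_key="o-key")
print(pick_mode(offline), pick_mode(configured))
```

Treating every field as optional means a deleted or renamed field degrades to the next candidate rather than forcing the heuristic path.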
Compares BAAI/bge-small-en-v1.5, maidalun1020/bce-embedding-base_v1, and intfloat/e5-small-v2 on the 15-query eval set with BGE-reranker-base held constant. Quality identical across all 3 (12/14 Recall@3). BGE-small wins on latency (3.2s/query vs 5.7s BCE vs 7.6s E5), cold start (9.7s vs 29.7s BCE), and footprint (33MB vs 280MB BCE). Decision: keep BGE-small. Branch preserved as Q&A evidence, not intended to merge.
Single-parameter sweep over _STAGE1_TOPN in {4, 8, 12, 16, 20, 30} held
against BGE-small (production), BCE, and E5-small. Reranker, corpus, and
15-query eval set are constant. Per cell: 1 deterministic quality run +
5 in-series latency reps reported as mean / median / min-max. Cold start
measured once per encoder.
Verdict: keep _STAGE1_TOPN = 8. BGE Recall@3 peaks at topN=8 (13/14),
holds through topN=16, then drops to 12/14 at topN >= 20 -- and all
three encoders show the same non-monotone shape with elbows at
topN <= 12. Going topN=8 -> 4 saves ~700ms but loses 2/14 on R@3;
going topN=8 -> 12+ costs 1700-6500ms for zero R@3 gain.
Production retriever.py is NOT modified; topN is injected at runtime by
mutating the module constant between configs. Results, mermaid charts,
per-query disagreement table for BGE topN=4 vs topN=30, findings, and
recommendation in agent/scripts/ablation-results.md.
Total runtime: ~99 min (81 min main sweep + ~18 min finalize step for
two E5 cells the main run dropped after a process exit).
Made-with: Cursor
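The runtime injection mentioned above (mutating the module constant between configs) could look roughly like the sketch below. It is not the code in scripts/ablation_topn.py: the `retriever` object here is a stand-in module built in place, `override_constant` and `shortlist_size` are hypothetical helpers, and the starting value is a placeholder.

```python
import types
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def override_constant(module: types.ModuleType, name: str, value: Any) -> Iterator[None]:
    """Temporarily set module.<name> = value and restore the original on exit."""
    original = getattr(module, name)
    setattr(module, name, value)
    try:
        yield
    finally:
        setattr(module, name, original)


# Stand-in for the production retriever module (hypothetical, for illustration).
retriever = types.ModuleType("retriever")
retriever._STAGE1_TOPN = 20  # placeholder default


def shortlist_size() -> int:
    # Reads the constant at call time, so the override is visible here.
    return retriever._STAGE1_TOPN


seen = []
for topn in (4, 8, 12, 16, 20, 30):
    with override_constant(retriever, "_STAGE1_TOPN", topn):
        seen.append(shortlist_size())

print(seen, retriever._STAGE1_TOPN)
```

This approach only works if the code under test looks the constant up at call time. A value captured earlier, for example as a default argument, would not see the override.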
Summary
99-minute ablation study sweeping rerank shortlist size (topN) across 6 values
and 3 encoders, with 5 latency repetitions per cell.
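As a rough illustration of how five in-series latency repetitions can be reduced to the mean / median / min-max figures reported per cell, here is a sketch using only the standard library. `time_reps` and `summarize` are hypothetical names, and the timed callable is a placeholder for a real query run.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_reps(fn: Callable[[], object], reps: int = 5) -> List[float]:
    """Run fn `reps` times in series and return wall-clock latencies in ms."""
    samples = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Collapse raw samples into the four summary statistics."""
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
    }


# Usage with a placeholder workload standing in for one eval query.
stats = summarize(time_reps(lambda: sum(range(10_000))))
print(sorted(stats))
```

Reporting the median next to the mean keeps a single slow repetition from dominating the headline number.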
Headline finding
For the production encoder (BGE-small):
topN=8 sits at peak Recall@3 and runs queries 3x faster than the current
main default of 20. Higher topN values lose recall because the reranker promotes
distractors that a smaller shortlist would have excluded.
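The distractor effect can be reproduced on a toy example. The sketch below assumes a two-stage pipeline in which stage 1 keeps the topN documents by encoder score and the reranker then reorders that shortlist. All scores and document IDs are made up for illustration and are not taken from the ablation data.

```python
from typing import List, NamedTuple


class Doc(NamedTuple):
    doc_id: str
    stage1: float  # encoder similarity (made-up value)
    rerank: float  # reranker score (made-up value)


def retrieve_then_rerank(docs: List[Doc], top_n: int, k: int = 3) -> List[str]:
    """Shortlist by stage-1 score, reorder by reranker score, return top-k IDs."""
    shortlist = sorted(docs, key=lambda d: d.stage1, reverse=True)[:top_n]
    reranked = sorted(shortlist, key=lambda d: d.rerank, reverse=True)
    return [d.doc_id for d in reranked[:k]]


docs = [
    Doc("gold", 0.90, 0.70),
    Doc("a", 0.85, 0.80),
    Doc("b", 0.80, 0.75),
    Doc("c", 0.75, 0.10),
    # Low stage-1 score, but the reranker likes them.
    Doc("distractor-1", 0.40, 0.95),
    Doc("distractor-2", 0.35, 0.90),
]

small = retrieve_then_rerank(docs, top_n=4)  # distractors never reach the reranker
large = retrieve_then_rerank(docs, top_n=6)  # distractors enter and outrank "gold"
print(small, large)
```

With the smaller shortlist the gold document stays in the top 3. With the larger one, two documents that stage 1 had ranked low are promoted above it, which is the non-monotone behaviour described above.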
Cross-encoder check
All 3 encoders show the same elbow shape — peak quality at topN ∈ {4, 8},
degradation past topN=16. This isn't an artifact of BGE's embedding geometry;
it's a property of cross-encoder reranking on a small structured corpus.
Recommendation
Change _STAGE1_TOPN from 20 → 8 on main.
Risk is small: the only quality regression at topN=8 vs topN=12+ is on a
single query (Important Failure Behavior) which BGE topN=8 already gets right
in the test set.
Reproducibility
    cd agent
    uv run python scripts/ablation_topn.py

Writes scripts/ablation-results.md with full per-cell tables, mermaid charts,
disagreement breakdowns, and findings.
Branch is evidence + recommendation
Whether to actually land the topN change is a separate decision. The ablation
data lives here regardless.