ablation: rerank shortlist size sweep#10

Open
mukund-setti wants to merge 3 commits into main from retrieval-ablation
Conversation

@mukund-setti
Collaborator

Summary

99-minute ablation study sweeping rerank shortlist size (topN) across 6 values
and 3 encoders, with 5 latency repetitions per cell.

Headline finding

For the production encoder (BGE-small):

| topN | Recall@3 | Mean latency |
|------|----------|--------------|
| 4 | 11/14 | 404 ms |
| 8 | 13/14 | 1095 ms |
| 12 | 13/14 | 2805 ms |
| 20 (current main) | 12/14 | 3322 ms |
| 30 | 12/14 | 7574 ms |

topN=8 reaches peak Recall@3 with ~3x faster queries than the current
main default of 20. Higher topN values lose recall because the reranker
promotes distractors that a smaller shortlist would have excluded.

Cross-encoder check

All 3 encoders show the same elbow shape — peak quality at topN ∈ {4, 8},
degradation past topN=16. This isn't an artifact of BGE's embedding geometry;
it's a property of cross-encoder reranking on a small structured corpus.
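The elbow shape falls out of the two-stage structure itself. A minimal sketch of a retrieve-then-rerank pipeline (the scoring functions are stand-ins, not the repo's retriever.py) shows how widening the shortlist lets the reranker see, and sometimes promote, candidates the first stage had correctly ranked below the cutoff:

```python
# Hypothetical two-stage retrieval sketch: a cheap bi-encoder builds the
# topN shortlist, then an expensive cross-encoder reorders only that
# shortlist. Function names and signatures are illustrative.
from typing import Callable

def retrieve(query: str,
             corpus: list[str],
             embed_score: Callable[[str, str], float],
             cross_score: Callable[[str, str], float],
             top_n: int,
             k: int = 3) -> list[str]:
    # Stage 1: bi-encoder similarity picks the topN shortlist.
    shortlist = sorted(corpus, key=lambda d: embed_score(query, d),
                       reverse=True)[:top_n]
    # Stage 2: cross-encoder reranks the shortlist. A larger top_n admits
    # documents the bi-encoder ranked below the cutoff; if the reranker
    # scores one of those distractors highly, it displaces a correct hit.
    return sorted(shortlist, key=lambda d: cross_score(query, d),
                  reverse=True)[:k]
```

With deliberately disagreeing scorers, the same query returns different top-k results at different shortlist sizes, which is exactly the non-monotone behavior the sweep measures.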

Recommendation

Change _STAGE1_TOPN from 20 → 8 on main.

Risk is small: the only per-query disagreement between topN=8 and topN=12+
is a single query (Important Failure Behavior), and BGE at topN=8 already
answers it correctly in the test set.

Reproducibility

```shell
cd agent
uv run python scripts/ablation_topn.py
```

Writes scripts/ablation-results.md with full per-cell tables, mermaid charts,
disagreement breakdowns, and findings.

Branch is evidence + recommendation

Whether to actually land the topN change is a separate decision. The ablation
data lives here regardless.

Mukund Ummadisetti added 3 commits April 25, 2026 13:51
The project switched from OpenRouter to Gemini after the initial update/
landed; the LLM auto-mode probe was still reading the deleted
openrouter_api_key field, so the heuristic path was always selected even
when a real key was configured. Probe gemini_api_key / google_api_key
first, fall back to openrouter_api_key, and treat all of them as optional
so offline mode keeps working.

Made-with: Cursor
Compares BAAI/bge-small-en-v1.5, maidalun1020/bce-embedding-base_v1, and
intfloat/e5-small-v2 on the 15-query eval set with BGE-reranker-base held
constant. Quality is identical across all 3 (12/14 Recall@3). BGE-small
wins on latency (3.2 s/query vs 5.7 s BCE vs 7.6 s E5), cold start
(9.7 s vs 29.7 s BCE), and footprint (33 MB vs 280 MB BCE).

Decision: keep BGE-small. Branch preserved as Q&A evidence, not intended
to merge.
Single-parameter sweep over _STAGE1_TOPN in {4, 8, 12, 16, 20, 30} held
against BGE-small (production), BCE, and E5-small. Reranker, corpus, and
15-query eval set are constant. Per cell: 1 deterministic quality run +
5 in-series latency reps reported as mean / median / min-max. Cold start
measured once per encoder.
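The per-cell latency reduction described above can be sketched as a small helper; the function name and return shape are illustrative, not taken from ablation_topn.py:

```python
# Hypothetical aggregation for one ablation cell: a handful of in-series
# latency repetitions reduced to the mean / median / min-max reported in
# the result tables.
import statistics

def summarize_latency(reps_ms: list[float]) -> dict[str, float]:
    return {
        "mean": statistics.mean(reps_ms),
        "median": statistics.median(reps_ms),
        "min": min(reps_ms),
        "max": max(reps_ms),
    }
```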

Verdict: keep _STAGE1_TOPN = 8. BGE Recall@3 peaks at topN=8 (13/14),
holds through topN=16, then drops to 12/14 at topN >= 20 -- and all
three encoders show the same non-monotone shape with elbows at
topN <= 12. Going topN=8 -> 4 saves ~700ms but loses 2/14 on R@3;
going topN=8 -> 12+ costs 1700-6500ms for zero R@3 gain.

Production retriever.py is NOT modified; topN is injected at runtime by
mutating the module constant between configs. Results, mermaid charts,
per-query disagreement table for BGE topN=4 vs topN=30, findings, and
recommendation in agent/scripts/ablation-results.md.

Total runtime: ~99 min (81 min main sweep + ~18 min finalize step for
two E5 cells the main run dropped after a process exit).

Made-with: Cursor