ablation: rerank shortlist size sweep#10

Open
mukund-setti wants to merge 3 commits into main from retrieval-ablation
Conversation

@mukund-setti
Collaborator

Summary

99-minute ablation study sweeping rerank shortlist size (topN) across 6 values
and 3 encoders, with 5 latency repetitions per cell.

Headline finding

For the production encoder (BGE-small):

| topN | Recall@3 | Mean latency |
|------|----------|--------------|
| 4 | 11/14 | 404 ms |
| 8 | 13/14 | 1095 ms |
| 12 | 13/14 | 2805 ms |
| 20 (current main) | 12/14 | 3322 ms |
| 30 | 12/14 | 7574 ms |

topN=8 reaches peak Recall@3 with ~3x faster queries than the current
main default of 20. Higher topN values lose recall because the reranker
promotes distractors that a smaller shortlist would have excluded.

Cross-encoder check

All 3 encoders show the same elbow shape — peak quality at topN ∈ {4, 8},
degradation past topN=16. This isn't an artifact of BGE's embedding geometry;
it's a property of cross-encoder reranking on a small structured corpus.
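The elbow shape falls out of the two-stage structure itself. A minimal sketch of a retrieve-then-rerank pipeline (the scoring functions are stand-ins, not the repo's retriever.py) shows how widening the shortlist lets the reranker see, and sometimes promote, candidates the first stage had correctly ranked below the cutoff:

```python
# Hypothetical two-stage retrieval sketch: a cheap bi-encoder builds the
# topN shortlist, then an expensive cross-encoder reorders only that
# shortlist. Function names and signatures are illustrative.
from typing import Callable

def retrieve(query: str,
             corpus: list[str],
             embed_score: Callable[[str, str], float],
             cross_score: Callable[[str, str], float],
             top_n: int,
             k: int = 3) -> list[str]:
    # Stage 1: bi-encoder similarity picks the topN shortlist.
    shortlist = sorted(corpus, key=lambda d: embed_score(query, d),
                       reverse=True)[:top_n]
    # Stage 2: cross-encoder reranks the shortlist. A larger top_n admits
    # documents the bi-encoder ranked below the cutoff; if the reranker
    # scores one of those distractors highly, it displaces a correct hit.
    return sorted(shortlist, key=lambda d: cross_score(query, d),
                  reverse=True)[:k]
```

With deliberately disagreeing scorers, the same query returns different top-k results at different shortlist sizes, which is exactly the non-monotone behavior the sweep measures.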

Recommendation

Change _STAGE1_TOPN from 20 → 8 on main.

Risk is small: the only per-query disagreement between topN=8 and topN=12+
is a single query (Important Failure Behavior), and BGE at topN=8 already
answers it correctly in the test set.

Reproducibility

```shell
cd agent
uv run python scripts/ablation_topn.py
```

Writes scripts/ablation-results.md with full per-cell tables, mermaid charts,
disagreement breakdowns, and findings.

Branch is evidence + recommendation

Whether to actually land the topN change is a separate decision. The ablation
data lives here regardless.

Mukund Ummadisetti added 3 commits April 25, 2026 13:51
The project switched from OpenRouter to Gemini after the initial update/
landed; the LLM auto-mode probe was still reading the deleted
openrouter_api_key field, so the heuristic path was always selected even
when a real key was configured. Probe gemini_api_key / google_api_key
first, fall back to openrouter_api_key, and treat all of them as optional
so offline mode keeps working.

Made-with: Cursor
Compares BAAI/bge-small-en-v1.5, maidalun1020/bce-embedding-base_v1, and
intfloat/e5-small-v2 on the 15-query eval set with BGE-reranker-base held
constant. Quality is identical across all 3 (12/14 Recall@3). BGE-small
wins on latency (3.2 s/query vs 5.7 s BCE vs 7.6 s E5), cold start
(9.7 s vs 29.7 s BCE), and footprint (33 MB vs 280 MB BCE).

Decision: keep BGE-small. Branch preserved as Q&A evidence, not intended
to merge.
Single-parameter sweep over _STAGE1_TOPN in {4, 8, 12, 16, 20, 30} held
against BGE-small (production), BCE, and E5-small. Reranker, corpus, and
15-query eval set are constant. Per cell: 1 deterministic quality run +
5 in-series latency reps reported as mean / median / min-max. Cold start
measured once per encoder.
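The per-cell latency reduction described above can be sketched as a small helper; the function name and return shape are illustrative, not taken from ablation_topn.py:

```python
# Hypothetical aggregation for one ablation cell: a handful of in-series
# latency repetitions reduced to the mean / median / min-max reported in
# the result tables.
import statistics

def summarize_latency(reps_ms: list[float]) -> dict[str, float]:
    return {
        "mean": statistics.mean(reps_ms),
        "median": statistics.median(reps_ms),
        "min": min(reps_ms),
        "max": max(reps_ms),
    }
```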

Verdict: keep _STAGE1_TOPN = 8. BGE Recall@3 peaks at topN=8 (13/14),
holds through topN=16, then drops to 12/14 at topN >= 20 -- and all
three encoders show the same non-monotone shape with elbows at
topN <= 12. Going topN=8 -> 4 saves ~700ms but loses 2/14 on R@3;
going topN=8 -> 12+ costs 1700-6500ms for zero R@3 gain.

Production retriever.py is NOT modified; topN is injected at runtime by
mutating the module constant between configs. Results, mermaid charts,
per-query disagreement table for BGE topN=4 vs topN=30, findings, and
recommendation in agent/scripts/ablation-results.md.

Total runtime: ~99 min (81 min main sweep + ~18 min finalize step for
two E5 cells the main run dropped after a process exit).

Made-with: Cursor