This repository contains an end-to-end research notebook that benchmarks three Retrieval-Augmented Generation (RAG) control strategies for multi-hop question answering on HotpotQA.
The project focuses on one core question:
Can adaptive, verification-guided retrieval preserve strong answer quality while reducing unnecessary model calls compared to fixed-depth recursion?
Current repository contents:
- `nb.ipynb`: primary implementation, benchmarking, and visualization notebook
At this stage, the notebook is the project. It includes setup, model loading, retrieval, generation, adaptive control logic, evaluation, and plotting.
Single-pass RAG is often insufficient for multi-hop QA because required evidence may be distributed across multiple passages. A fixed recursive strategy can recover some missed evidence but spends the same compute budget on easy and hard questions.
This project implements an adaptive controller that uses claim-level verification to decide whether to:
- stop and return an answer,
- refine the query,
- retrieve more evidence, or
- abstain when support is insufficient.
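The four-way decision above can be sketched as a small policy function. This is a minimal illustration, not the notebook's actual policy: the `Action` enum, the `choose_action` name, every threshold value, and the way partial support is mapped to "refine" versus "retrieve more" are assumptions chosen for the example.

```python
from enum import Enum
from typing import Sequence


class Action(Enum):
    STOP = "stop"
    REFINE = "refine"
    RETRIEVE_MORE = "retrieve_more"
    ABSTAIN = "abstain"


def choose_action(
    claim_scores: Sequence[float],
    step: int,
    max_steps: int = 3,
    entail_threshold: float = 0.5,  # hypothetical: score at which a claim counts as supported
    stop_fraction: float = 0.8,     # hypothetical: supported share needed to stop
    refine_fraction: float = 0.4,   # hypothetical: supported share needed to refine the query
) -> Action:
    """Map per-claim entailment scores to one of the four control actions."""
    if claim_scores:
        supported = sum(s >= entail_threshold for s in claim_scores) / len(claim_scores)
    else:
        # No extractable claims is treated as no support at all.
        supported = 0.0

    if supported >= stop_fraction:
        return Action.STOP
    if step >= max_steps:
        # Budget exhausted without enough support: decline to answer.
        return Action.ABSTAIN
    if supported >= refine_fraction:
        # Partly supported: assume a sharper query is the cheaper fix.
        return Action.REFINE
    # Mostly unsupported: assume the evidence itself is missing.
    return Action.RETRIEVE_MORE
```

Keeping the policy a pure function of the verification scores and the step counter makes threshold ablations straightforward, since no model needs to be reloaded to replay a different setting.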
All three modes share the same dataset sample, retriever stack, and generator model; they differ only in control policy.
standard
- One retrieval pass, one generation pass.
- Fastest, lowest control overhead.
recursive
- Fixed retrieval schedule across multiple steps.
- More compute, no adaptive stopping.
adaptive
- Iterative loop with claim extraction and NLI-based verification.
- Uses policy actions to continue, refine, stop, or abstain.
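To make the difference in control policy concrete, the sketch below runs the three modes through one loop with stubbed components. It is a simplification under stated assumptions: `run_mode`, its callable parameters, and the widening retrieval schedule (`k * step`) are hypothetical, and the adaptive branch only models early stopping, not query refinement or abstention.

```python
from typing import Callable, List, Tuple


def run_mode(
    mode: str,
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str, List[str]], str],
    is_supported: Callable[[str, List[str]], bool],
    max_steps: int = 3,
    k: int = 3,
) -> Tuple[str, int]:
    """Return (answer, llm_calls) for one question under the given mode."""
    if mode not in {"standard", "recursive", "adaptive"}:
        raise ValueError(f"unknown mode: {mode}")

    # Standard is a single pass; the other two may use the full step budget.
    steps = 1 if mode == "standard" else max_steps
    answer, llm_calls = "", 0
    for step in range(1, steps + 1):
        docs = retrieve(question, k * step)  # hypothetical schedule: widen each step
        answer = generate(question, docs)
        llm_calls += 1
        # Only adaptive mode may stop early, and only once the answer verifies.
        if mode == "adaptive" and is_supported(answer, docs):
            break
    return answer, llm_calls


# Stub components so the control flow can be exercised without any model.
corpus = list("abcdefghi")
stub_retrieve = lambda q, k: corpus[:k]
stub_generate = lambda q, docs: f"answer from {len(docs)} docs"
stub_verify = lambda ans, docs: len(docs) >= 6  # "supported" once 6 docs are seen

for mode in ("standard", "recursive", "adaptive"):
    print(mode, run_mode(mode, "q", stub_retrieve, stub_generate, stub_verify))
```

With these stubs, standard uses one call, recursive always uses three, and adaptive stops after two, which is the kind of saving the benchmark measures.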
The notebook pipeline combines the following components:
- Dataset: HotpotQA (`distractor` split)
- Retrieval embeddings: `BAAI/bge-small-en-v1.5`
- Vector index: FAISS (`IndexFlatIP`)
- Generator: `microsoft/Phi-3-mini-4k-instruct` (4-bit quantized)
- Verifier: `cross-encoder/nli-deberta-v3-small`
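`IndexFlatIP` performs exact (brute-force) inner-product search. The sketch below reproduces that computation in plain NumPy as a stand-in, so it runs without FAISS or an embedding model. The toy 2-D vectors and the helper names `normalize` and `search_inner_product` are hypothetical, and L2-normalizing the embeddings (which makes the inner product equal to cosine similarity) is an assumption about the notebook, not something confirmed here.

```python
import numpy as np


def normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize each row so that inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def search_inner_product(index_vectors: np.ndarray, query_vectors: np.ndarray, k: int):
    """Exact top-k search by inner product, returning (scores, ids) per query."""
    scores = query_vectors @ index_vectors.T       # shape: (n_queries, n_passages)
    top = np.argsort(-scores, axis=1)[:, :k]       # highest scores first
    return np.take_along_axis(scores, top, axis=1), top


# Toy "passage embeddings" and one "query embedding".
passages = normalize(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], dtype=np.float32))
query = normalize(np.array([[2.0, 0.1]], dtype=np.float32))

scores, ids = search_inner_product(passages, query, k=2)
print(ids)  # [[0 2]]
```

With FAISS itself, the equivalent calls would be `faiss.IndexFlatIP(dim)`, `index.add(passages)`, and `index.search(query, k)`, which also return scores and ids in that order.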
| Layer | Component | Used In Notebook |
|---|---|---|
| Generator LLM | microsoft/Phi-3-mini-4k-instruct | Answer generation, reasoning traces, query decomposition |
| Quantization | bitsandbytes (4-bit NF4) | Memory-efficient model loading |
| Embedding model | BAAI/bge-small-en-v1.5 | Query/passage embeddings for retrieval |
| Verifier model | cross-encoder/nli-deberta-v3-small | Claim-level entailment checking |
| Vector DB | FAISS (IndexFlatIP) | Similarity search over passage embeddings |
| Dataset | hotpot_qa (distractor, validation sample) | Multi-hop QA benchmark data |
| Framework stack | transformers, accelerate, sentence-transformers | Model loading and inference |
| Analysis stack | pandas, matplotlib, numpy | Metrics aggregation and plots |
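Claim-level entailment checking with an NLI cross-encoder comes down to turning three logits per (passage, claim) pair into an entailment probability. The sketch below shows only that post-processing step on made-up logits. The label order in `LABELS` is an assumption taken from the verifier's model card and should be checked against the loaded model; `entailment_probs` is a hypothetical helper name.

```python
import numpy as np

# Assumed label order for the NLI cross-encoder; verify against the model config.
LABELS = ("contradiction", "entailment", "neutral")


def entailment_probs(logits) -> np.ndarray:
    """Softmax each row of NLI logits and return the entailment probability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs[:, LABELS.index("entailment")]


# Made-up logits for two (passage, claim) pairs.
fake_logits = [[-5.0, 5.0, -5.0],   # strongly entailed
               [0.0, 0.0, 0.0]]     # uninformative
print(entailment_probs(fake_logits))
```

In practice the logits would come from something like `CrossEncoder("cross-encoder/nli-deberta-v3-small").predict([(passage, claim)])` in sentence-transformers, with the retrieved passage as the premise and the extracted claim as the hypothesis.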
High-level flow:
- Build a deduplicated passage corpus from sampled HotpotQA contexts.
- Embed corpus and index with FAISS.
- Retrieve top passages for a query.
- Generate answer (+ reasoning trace format).
- Extract claims from answer/reasoning.
- Verify claims against retrieved docs.
- Use policy thresholds to stop, refine, retrieve more, or abstain.
- Score outputs and aggregate metrics per mode.
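The first step of the flow above, building a deduplicated passage corpus, can be sketched as follows. The example assumes the Hugging Face `hotpot_qa` record layout, where `context` holds parallel `title` and `sentences` lists; `build_corpus` and the choice of `(title, text)` as the deduplication key are hypothetical, and the notebook may key on something else.

```python
from typing import Dict, List


def build_corpus(examples: List[dict]) -> List[Dict[str, str]]:
    """Flatten per-question contexts into one passage list without duplicates."""
    seen = set()
    corpus: List[Dict[str, str]] = []
    for example in examples:
        context = example["context"]
        for title, sentences in zip(context["title"], context["sentences"]):
            # Join the sentence list of one article into a single passage.
            text = " ".join(s.strip() for s in sentences).strip()
            key = (title, text)
            if text and key not in seen:
                seen.add(key)
                corpus.append({"title": title, "text": text})
    return corpus


# Two toy questions that share the "Paris" passage.
toy = [
    {"context": {"title": ["Paris", "Seine"],
                 "sentences": [["Paris is the capital of France.", " It lies on the Seine."],
                               ["The Seine is a river."]]}},
    {"context": {"title": ["Paris", "Lyon"],
                 "sentences": [["Paris is the capital of France.", " It lies on the Seine."],
                               ["Lyon is a city in France."]]}},
]
corpus = build_corpus(toy)
print(len(corpus))  # 3
```

Deduplicating before embedding matters because the same Wikipedia paragraph can appear in the contexts of many sampled questions, and duplicates would otherwise crowd the top-k results.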
The notebook is organized into clear stages:
- Setup and imports
- Dataset loading and corpus construction
- FAISS index construction
- Phi-3 model loading (4-bit)
- NLI verifier loading
- Core retrieval, generation, and verification functions
- Pipeline orchestrator (`run_pipeline`)
- Benchmark runner (all modes over sampled questions)
- Results and tabular evaluation
- Step-efficiency and ablation cells
- Final visualizations
The notebook reports a broad set of quality and efficiency metrics:
- Exact Match (EM %)
- F1 and length-penalized F1
- Faithfulness / grounding score
- True hallucination rate
- Retrieval-failure proxy rate (faithful but wrong)
- Abstention rate
- Average steps
- Average LLM calls
- Average latency
- EM per LLM call (cost-efficiency proxy)
- Adaptive step-efficiency curve
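For reference, Exact Match and token-level F1 are commonly computed with SQuAD-style answer normalization, as sketched below. This is an assumption about the scoring, not a copy of the notebook's code; the function names are hypothetical, and the length-penalized F1 variant is not shown because its definition is specific to the notebook.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))            # 1.0
print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```

The second example shows why a length penalty can be useful: a verbose answer keeps full recall and loses only precision, so plain F1 still gives partial credit for padding.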
The notebook benchmark compares the standard, recursive, and adaptive modes on 75 sampled HotpotQA validation questions.
| Mode | EM (%) | Length-Penalized F1 | Faithfulness (%) | True Hallucination (%) | Avg LLM Calls | Avg Latency (s) |
|---|---|---|---|---|---|---|
| Standard | 65.3 | 0.672 | 94.7 | 5.3 | 1.0 | 5.8 |
| Recursive | 64.0 | 0.665 | 95.5 | 4.0 | 3.0 | 20.5 |
| Adaptive | 68.0 | 0.697 | 100.0 | 0.0 | 1.2 | 6.3 |
Adaptive step-efficiency curve:
- Step 1: 65%
- Step 2: 71%
- Step 3: 0%
- Adaptive mode achieves the best EM (68.0%) and the best length-penalized F1 (0.697) in this run.
- Adaptive mode reaches 100.0% faithfulness and 0.0% true hallucinations in the reported benchmark.
- Recursive mode is the most expensive, averaging 3.0 LLM calls and 20.5 s latency per question.
- Adaptive mode stays close to standard in cost (1.2 vs. 1.0 LLM calls, 6.3 s vs. 5.8 s) while outperforming it on every quality metric in the table.
- The gap between faithfulness (94.7% to 100.0%) and EM (64.0% to 68.0%) indicates that retrieval limitations remain even when answers are evidence-grounded.
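The EM-per-LLM-call proxy can be recomputed from the table as a quick check. The values below are copied from the reported averages, so the ratios are approximate and may differ slightly from whatever the notebook computes per question; the dictionary layout is only for illustration.

```python
# Reported averages from the benchmark table (EM in percent, LLM calls per question).
results = {
    "standard": {"em": 65.3, "llm_calls": 1.0},
    "recursive": {"em": 64.0, "llm_calls": 3.0},
    "adaptive": {"em": 68.0, "llm_calls": 1.2},
}

# EM points obtained per LLM call, rounded to one decimal place.
em_per_call = {
    mode: round(row["em"] / row["llm_calls"], 1) for mode, row in results.items()
}
print(em_per_call)
# {'standard': 65.3, 'recursive': 21.3, 'adaptive': 56.7}
```

On this proxy, adaptive is far more efficient than recursive, while standard remains the cheapest per EM point because it never spends a second call.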
- Open `nb.ipynb` in Kaggle, VS Code Jupyter, or JupyterLab.
- Ensure GPU runtime is enabled (recommended).
- Run notebook cells sequentially from top to bottom.
- Let model downloads and index construction finish.
- Run benchmark and evaluation cells.
- Inspect printed tables and generated plots.
The notebook installs required packages in its setup cell:
- `transformers==4.41.2`
- `accelerate==0.30.1`
- `bitsandbytes`
- `sentence-transformers==2.7.0`
- `faiss-cpu`
- `datasets`
- `pandas`
- `matplotlib`
- GPU is strongly recommended; the notebook is tuned for Kaggle T4-like resources.
- First run may be slow due to model/dataset downloads.
- Benchmark runtime scales with sample size and mode count.
- Quantized loading is used to fit the generator model more reliably in constrained VRAM.
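A typical way to load a causal LM in 4-bit NF4 with `transformers` and `bitsandbytes` is sketched below. It is not taken from the notebook: the `load_generator` name, the float16 compute dtype (chosen here with T4-class GPUs in mind), and `device_map="auto"` are assumptions, and calling the function requires a CUDA GPU plus a model download.

```python
def load_generator(model_id: str = "microsoft/Phi-3-mini-4k-instruct"):
    """Load tokenizer and 4-bit NF4 quantized model; needs a CUDA GPU to run."""
    # Imports are deferred so that defining this function does not itself
    # require torch, transformers, or bitsandbytes to be installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NF4, as listed in the stack table
        bnb_4bit_compute_dtype=torch.float16,  # assumption: float16 compute dtype
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",                     # let accelerate place the weights
    )
    return tokenizer, model
```

Weights are stored in 4 bits and dequantized to the compute dtype on the fly, which trades some speed for a much smaller memory footprint.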
When run in Kaggle-style environments, the notebook writes artifacts such as:
- summary CSV files (for benchmark metrics)
- benchmark figures (PNG plots)
Paths are currently configured to Kaggle working directories in parts of the notebook (for example, /kaggle/working/...). If you run locally, adjust output paths accordingly.
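One way to make the output location portable is to fall back to a local folder when the Kaggle directory is absent. This is a small sketch, not the notebook's code: `resolve_output_dir` and the local `outputs` folder name are hypothetical.

```python
from pathlib import Path


def resolve_output_dir(kaggle_dir: str = "/kaggle/working", local_dir: str = "outputs") -> Path:
    """Use the Kaggle working directory when it exists, else a local folder."""
    kaggle_path = Path(kaggle_dir)
    out_dir = kaggle_path if kaggle_path.is_dir() else Path(local_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no-op if it already exists
    return out_dir
```

Calling `resolve_output_dir()` once near the top of the notebook and building every CSV and PNG path from its return value would keep the rest of the cells unchanged across environments.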
To keep runs comparable:
- Use the same sample size and split settings.
- Keep package versions aligned with the setup cell.
- Avoid changing policy thresholds unless conducting an ablation.
- Report both quality and efficiency metrics, not EM alone.
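Version alignment can be checked before a run with a few lines of standard-library code. The pins below are the three versioned packages from the setup cell; `installed_version` and `find_mismatches` are hypothetical helper names, and the lookup is injectable so the check can be exercised without the packages installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, List, Optional

# Pinned versions as listed in the notebook's setup cell.
PINS: Dict[str, str] = {
    "transformers": "4.41.2",
    "accelerate": "0.30.1",
    "sentence-transformers": "2.7.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> List[str]:
    """List every pinned package whose installed version differs from the pin."""
    problems = []
    for package, wanted in pins.items():
        found = lookup(package)
        if found != wanted:
            problems.append(f"{package}: expected {wanted}, found {found or 'not installed'}")
    return problems


for line in find_mismatches(PINS):
    print("version mismatch ->", line)
```

An empty result means the environment matches the pins; anything printed is worth recording next to the benchmark numbers.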
- Results depend on retrieval quality from the sampled context corpus.
- Benchmark outcomes can vary across hardware/runtime conditions.
- String/heuristic correctness checks may not capture all semantic equivalences.
- Some policy thresholds are tuned heuristically and may require retuning for new datasets.
This repository is intended for:
- research experimentation on adaptive RAG control,
- classroom or project demonstrations of retrieval-verification loops,
- extending to stronger verifiers, retrievers, or alternative policy logic.
The goal is to demonstrate that adaptive, verification-driven retrieval can reach competitive multi-hop QA quality while using less average compute than fixed recursive retrieval and lowering the risk of unsupported answers.