Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering
This is the official implementation of the paper "CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering", accepted at The Web Conference 2026 (WWW 2026).
Retrieval-Augmented Generation (RAG) has become the standard approach for knowledge-intensive NLP tasks. However, multi-hop question answering (MHQA) remains challenging as it requires integrating evidence from multiple documents. Existing iterative RAG pipelines face three critical issues:
- Efficiency degradation as reasoning hops increase
- Redundant information in retrieved context
- Entity drift during multi-hop decomposition
CompactRAG addresses these challenges through a novel two-call framework that separates corpus processing from online inference:
- Offline Stage: An LLM reads the corpus once and constructs an atomic QA knowledge base of concise, fact-level QA pairs that reduce redundancy and better align with question semantics.
- Online Stage: A complex query is decomposed into dependency-ordered sub-questions, each resolved by lightweight modules:
  - Answer Extractor (RoBERTa-base): extracts grounded entities from retrieved QA pairs
  - Sub-Question Rewriter (FLAN-T5-small): resolves entity references to prevent semantic drift

The main LLM is invoked only twice per query, once for decomposition and once for final synthesis, so LLM usage is independent of hop depth (see the sketch below).
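For orientation, here is a minimal sketch of how such a two-call loop can be wired together. It is an illustration under assumptions, not the repository's implementation (that lives in `src/core/`): the prompts, the stock `roberta-base-squad2` and `flan-t5-small` checkpoints, the toy lexical retriever, and helper names like `llm_call` are placeholders, and it presumes an OpenAI-compatible LLM endpoint such as the vLLM server set up below.

```python
# Illustrative sketch of the online two-call loop (placeholders noted above).
from openai import OpenAI
from transformers import pipeline

llm = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM endpoint
# Stand-ins for the fine-tuned lightweight modules:
extractor = pipeline("question-answering", model="deepset/roberta-base-squad2")
rewriter = pipeline("text2text-generation", model="google/flan-t5-small")

def llm_call(prompt):
    """One call to the main LLM."""
    resp = llm.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return resp.choices[0].message.content

def retrieve(sub_question, qa_kb, topk=5):
    """Toy lexical retriever over atomic QA pairs (placeholder for the real retriever)."""
    top = sorted(
        qa_kb,
        key=lambda pair: len(set(sub_question.lower().split()) & set(pair["question"].lower().split())),
        reverse=True,
    )[:topk]
    return "\n".join(f"Q: {p['question']} A: {p['answer']}" for p in top)

def answer(question, qa_kb):
    # LLM call 1: decompose into dependency-ordered sub-questions (one per line).
    plan = llm_call(f"Decompose into ordered sub-questions, one per line:\n{question}")
    sub_questions = [s.strip("-* ").strip() for s in plan.splitlines() if s.strip()]

    facts, prev_answer = [], ""
    for subq in sub_questions:
        if prev_answer:
            # Rewriter resolves references to the previous answer to prevent entity drift.
            subq = rewriter(
                f"Rewrite so it is self-contained, given the previous answer '{prev_answer}': {subq}",
                max_new_tokens=48,
            )[0]["generated_text"]
        context = retrieve(subq, qa_kb)                        # top-k atomic QA pairs
        prev_answer = extractor(question=subq, context=context)["answer"]
        facts.append(f"{subq} -> {prev_answer}")

    # LLM call 2: synthesize the final answer from the collected facts.
    return llm_call("Facts:\n" + "\n".join(facts) + f"\n\nQuestion: {question}\nAnswer concisely.")
```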
Our main contributions are:

- Scalability Analysis: We analyze scalability issues in iterative RAG pipelines, showing how token consumption and LLM calls grow with reasoning depth.
- Two-Call Framework: We introduce CompactRAG, a two-call RAG framework that uses an offline atomic QA knowledge base and lightweight online modules to enable efficient multi-hop inference.
- Comprehensive Evaluation: We evaluate CompactRAG on HotpotQA, 2WikiMultiHopQA, and MuSiQue; results show competitive accuracy and large reductions in inference token usage compared to strong iterative baselines.
Requirements:

- Python 3.8+
- CUDA 12.x (for GPU acceleration)
- 24GB+ GPU memory (for vLLM inference)
```bash
# Clone the repository
git clone https://github.com/your-username/CompactRAG.git
cd CompactRAG

# Create virtual environment (optional but recommended)
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Download the required models:
- LLM: Meta-Llama-3-8B-Instruct (or your preferred model)
- Answer Extractor: RoBERTa-base
- Question Rewriter: FLAN-T5-small
Update the model path in start_vllm_server.sh:
```bash
MODEL_PATH="/path/to/your/Meta-Llama-3-8B-Instruct/"
```

Then start the server:

```bash
bash start_vllm_server.sh
```

This starts the vLLM OpenAI-compatible API server at http://localhost:8000.
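As a quick sanity check (not part of the repository), you can query the endpoint with the `openai` Python client; the model name below is an assumption and must match whatever model the server is actually serving.

```python
# Minimal check that the vLLM OpenAI-compatible server is reachable.
# Assumes the `openai` package is installed and the model name matches
# the model passed to start_vllm_server.sh.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)
```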
```bash
# Run on a single benchmark with default settings
python src/run.py --benchmark musique --method qa --model llama8b

# Run with specific parameters
python src/run.py --benchmark hotpotqa --method qa --topk 5 --model llama8b
```

```bash
# Run all benchmarks with all methods
./run.sh

# Run specific benchmarks and methods
./run.sh -b musique,2wiki -m qa --topk 5

# Run with parallel jobs
./run.sh -j 4 -b hotpotqa -m qa

# Specify output and log directories
./run.sh -o data/results -l logs
```

Available run.sh options:

| Option | Description | Default |
|---|---|---|
| -b, --benchmarks | Comma-separated list of benchmarks | hotpotqa,2wiki,musique |
| -m, --methods | Comma-separated list of methods | qa |
| --model | Model name for inference | llama8b |
| --topk | Number of top-k QA pairs for retrieval | 5 |
| --iterations | Number of iterations for the itergen method | 2 |
| -j, --parallel-jobs | Number of parallel jobs | 1 |
| -o, --output-dir | Output directory | data/results |
| --log-dir | Log directory | logs |
Supported benchmarks:

- HotpotQA: Distractor setting with 10 Wikipedia paragraphs per question
- 2WikiMultiHopQA: Multi-hop reasoning across Wikipedia articles
- MuSiQue: Answerable subset requiring multi-step reasoning
Supported methods:

- qa: CompactRAG main method
- itergen: Iterative generation baseline
- selfask: Self-Ask baseline
- ircot: IRCoT baseline
To train the lightweight modules:

```bash
# Train the Answer Extractor (RoBERTa-base)
python train/train_roberta_qa.py

# Train the Sub-Question Rewriter (FLAN-T5-small)
python train/train_question_rewrite.py
```

To synthesize the atomic QA knowledge base and the question rewrite pairs:

```bash
# Synthesize QA pairs for knowledge base
python train/SynthesisQA.py

# Synthesize question rewrite pairs
python train/SynthesisRewrite.py
```

Project structure:

```
CompactRAG/
├── src/
│ ├── core/ # Core algorithms and methods
│ ├── metrics/ # Evaluation metrics (EM, F1, LLM-Acc)
│ ├── prompt/ # Prompt templates
│ ├── service/ # API service utilities
│ ├── utils/ # Helper functions
│ └── run.py # Main entry point
├── train/ # Training scripts for lightweight modules
├── data/
│ ├── QA/ # Atomic QA knowledge base
│ ├── sampled/ # Sampled test data
│ └── results/ # Evaluation results
├── images/ # Framework figures
├── run.sh # Batch evaluation script
├── start_vllm_server.sh # vLLM server startup script
└── requirements.txt # Python dependencies
```
CompactRAG achieves competitive accuracy while significantly reducing computational overhead:
- Only 2 LLM calls per query (vs. 2n+1 for n-hop questions in iterative methods; see the quick comparison below)
- Reduced token consumption through atomic QA retrieval
- Consistent performance across different hop depths
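For concreteness, a tiny illustration of the first point; this simply restates the 2 vs. 2n+1 call counts claimed above and is not a new measurement.

```python
# LLM calls per query: CompactRAG is constant at 2, while an iterative
# pipeline needs roughly 2n + 1 calls for an n-hop question.
for n in (2, 3, 4):
    print(f"{n}-hop: CompactRAG = 2 calls, iterative ≈ {2 * n + 1} calls")
```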
If you find this work useful, please cite our paper:
```bibtex
@inproceedings{yang2026compactrag,
  title={CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering},
  author={Yang, Hao and Yang, Zhiyu and Zhang, Xupeng and Wei, Wei and Zhang, Yunjie and Yang, Lin},
  booktitle={Proceedings of the ACM Web Conference 2026 (WWW '26)},
  year={2026},
  publisher={ACM}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
We thank the authors of HotpotQA, 2WikiMultiHopQA, and MuSiQue for providing the benchmark datasets.
