# RAGU Baseline Replication on Google Colab

This notebook orchestrates the end-to-end pipeline described in *Uncertainty Quantification in Retrieval-Augmented Question Answering (Perez et al., 2024)*. It mirrors the steps used in the paper so you can reproduce the retrieval-augmented QA baseline, generate stochastic samples, and evaluate uncertainty metrics (ECE, AUROC, semantic entropy, etc.) inside a Colab runtime.

> **Tip:** The full pipeline is GPU intensive. Use a Colab runtime with an A100 or higher-memory GPU (Colab Pro/Pro+) and plenty of disk space (>150 GB) if you plan to download the full DPR corpus and run large language models such as Qwen2-72B.


## 0. Runtime diagnostics
Check the attached GPU before proceeding. You need a recent NVIDIA GPU (A100/H100 class) to load the large AWQ quantized models used in the paper. For debugging or sanity checks you can temporarily switch to a lighter model (e.g., `Qwen/Qwen2-7B-Instruct`).


In [None]:
!nvidia-smi
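
If you would rather gate the rest of the notebook on the GPU check than read the table by eye, the `nvidia-smi` CSV query mode can be parsed in a few lines. This is a minimal sketch: the helper names and the 40 GiB threshold are assumptions chosen for illustration, not values taken from the paper.

```python
import subprocess

MIN_MEMORY_GIB = 40  # hypothetical threshold; adjust to the model you plan to load


def parse_gpu_csv(csv_text):
    """Parse the output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so GPU names containing commas stay intact.
        name, mem_mib = [part.strip() for part in line.rsplit(",", 1)]
        gpus.append({"name": name, "memory_gib": int(mem_mib) / 1024})
    return gpus


def query_gpus():
    """Return the attached GPUs, or an empty list if nvidia-smi cannot be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(result.stdout)


gpus = query_gpus()
if not gpus:
    print("No NVIDIA GPU detected; switch the Colab runtime type before continuing.")
for gpu in gpus:
    verdict = "OK" if gpu["memory_gib"] >= MIN_MEMORY_GIB else "small; use a lighter model"
    print(f"{gpu['name']}: {gpu['memory_gib']:.0f} GiB ({verdict})")
```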


## 1. Install Python dependencies
This installs the exact toolchain used across the repository:

* `vllm` for fast autoregressive decoding.
* `contriever` for dense retrieval.
* `lm-polygraph`, `xgboost`, `wandb`, and metric toolkits required by the uncertainty scripts.
* `faiss-gpu` (preferred) or `faiss-cpu` for passage search; switch to CPU if your runtime lacks CUDA headers.

> **Note:** Colab runtimes start with an older version of `pip`. Upgrading it first avoids resolver errors when installing `vllm` wheels.


In [None]:
%%bash
set -euo pipefail
python -m pip install --upgrade pip
python -m pip install --no-cache-dir \
  accelerate==0.32.1 \
  bitsandbytes==0.43.1 \
  datasets==2.20.0 \
  faiss-gpu==1.7.4 \
  git+https://github.com/facebookresearch/contriever.git \
  git+https://github.com/EleutherAI/lm-polygraph.git \
  huggingface_hub==0.23.4 \
  jsonlines==4.0.0 \
  numpy==1.26.4 \
  pandas==2.2.2 \
  pyarrow==16.1.0 \
  scikit-learn==1.5.1 \
  sentencepiece==0.2.0 \
  torch==2.3.1 \
  torchvision==0.18.1 \
  torchaudio==2.3.1 \
  transformers==4.42.3 \
  vllm==0.4.2 \
  wandb==0.17.4 \
  xgboost==2.1.1
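
After the install finishes, it can help to confirm which versions actually ended up in the environment, since the resolver may replace a pin to satisfy another package. The sketch below uses only the standard library; the `PINNED` list is an illustrative subset of the packages above, not an exhaustive check.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Illustrative subset of the pinned packages.
PINNED = ["torch", "transformers", "vllm", "datasets", "xgboost", "wandb"]

for name, version in installed_versions(PINNED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```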


### Configure runtime-wide environment variables
W&B is set to offline mode by default, so runs are logged locally instead of being uploaded, which keeps the pipeline self-contained. Tokenizer parallelism is also disabled to silence the related warnings. Adjust these flags if you prefer online tracking.


In [None]:
import os
os.environ.setdefault("WANDB_MODE", "offline")
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
print("WANDB_MODE=", os.environ["WANDB_MODE"])


## 2. Clone the RAGU and Contriever repositories
The paper couples this repository with Facebook AI's Contriever retriever. Both repos are cloned into `/content` (the default working directory in Colab). Re-run these cells only if you want a fresh checkout.


In [None]:
import subprocess
from pathlib import Path

WORKSPACE = Path("/content")
RAGU_REPO = "https://github.com/lauhaide/ragu.git"
CONTRIEVER_REPO = "https://github.com/facebookresearch/contriever.git"

RAGU_DIR = WORKSPACE / "ragu"
CONTRIEVER_DIR = WORKSPACE / "contriever"

if not RAGU_DIR.exists():
    subprocess.run(["git", "clone", "--depth", "1", RAGU_REPO, str(RAGU_DIR)], check=True)
else:
    print("RAGU repo already present at", RAGU_DIR)

if not CONTRIEVER_DIR.exists():
    subprocess.run(["git", "clone", "--depth", "1", CONTRIEVER_REPO, str(CONTRIEVER_DIR)], check=True)
else:
    print("Contriever repo already present at", CONTRIEVER_DIR)


Create directories for inputs (retrieval-ready datasets) and outputs (LLM generations, metrics). These paths mirror the `${HOMEDATA}` and `${HOMEOUT}` variables used in the original shell scripts.


In [None]:
from pathlib import Path

DATA_DIR = Path("/content/ragu_data")
OUTPUT_DIR = Path("/content/ragu_outputs")
CACHE_DIR = Path("/content/ragu_cache")
for path in (DATA_DIR, OUTPUT_DIR, CACHE_DIR):
    path.mkdir(parents=True, exist_ok=True)

print("DATA_DIR:", DATA_DIR)
print("OUTPUT_DIR:", OUTPUT_DIR)
print("CACHE_DIR:", CACHE_DIR)


Switch the working directory to the cloned RAGU repo so that relative imports (e.g., `../utils`) resolve exactly as expected by the authors' scripts.


In [None]:
import os
os.chdir(RAGU_DIR)
print("Current working directory:", os.getcwd())


## 3. Authenticate with Hugging Face (optional but recommended)
Gated checkpoints such as Llama 3.1 require an access token tied to a Hugging Face account that has accepted the model license; openly available checkpoints such as Qwen2-72B download without one. Uncomment and run the login cell, or set the `HF_TOKEN` environment variable beforehand.


In [None]:
# from huggingface_hub import login
# login(token="hf_...", add_to_git_credential=True)


## 4. Prepare QA data in the expected JSONL format
The helper below pulls a slice of Natural Questions (`nq_open`) via 🤗 Datasets, converts it to the RAGU schema (question, answers, q_id), and saves it under `${DATA_DIR}`. Replace the dataset loader if you already have DPR-formatted files.

* `DATASET_NAME`: identifier used downstream.
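* `SPLIT`: label used in output file names, typically `train`, `dev`, or `test`. The loader below always reads the `validation` split of `nq_open` and saves it under the `dev` label.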
* `SAMPLE_SIZE`: shrink this during dry runs to conserve GPU time.


In [None]:
from datasets import load_dataset
import jsonlines
from pathlib import Path

DATASET_NAME = "nq"
SPLIT = "dev"
SAMPLE_SIZE = 200  # set to None to keep the full split

raw_split = load_dataset("nq_open", split="validation")
if SAMPLE_SIZE is not None:
    raw_split = raw_split.select(range(SAMPLE_SIZE))

output_path = DATA_DIR / f"{DATASET_NAME}-{SPLIT}.jsonl"
with jsonlines.open(output_path, mode="w") as writer:
    for idx, row in enumerate(raw_split):
        answers = row["answer"] if isinstance(row["answer"], list) else [row["answer"]]
        writer.write({
            "question": row["question"],
            "answers": answers,
            "q_id": f"{DATASET_NAME}-{SPLIT}-{idx}"
        })

print(f"Saved {len(raw_split)} examples to {output_path}")
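
Before handing the file to the retriever, you may want to check that every line follows the schema described above (`question`, `answers`, `q_id`). The following is a small sketch of such a check; the helper names are hypothetical and the rules only cover the three fields this notebook writes.

```python
import json

# Field names and types written by the conversion cell above.
REQUIRED_FIELDS = {"question": str, "answers": list, "q_id": str}


def validate_record(record):
    """Return a list of problems for one record (an empty list means it is valid)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    if isinstance(record.get("answers"), list) and not record["answers"]:
        problems.append("answers is empty")
    return problems


def validate_jsonl(path):
    """Map 1-based line numbers to their problems for a JSONL file."""
    report = {}
    with open(path, encoding="utf-8") as fin:
        for line_no, line in enumerate(fin, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            problems = validate_record(json.loads(line))
            if problems:
                report[line_no] = problems
    return report
```

Calling `validate_jsonl(output_path)` on the file saved above should return an empty dictionary if the conversion worked.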


## 5. Stage dense retrieval assets (DPR Wikipedia)
Contriever expects the DPR-formatted Wikipedia passages plus the pre-computed embeddings released by the original authors. These files collectively exceed 80 GB, so mount Google Drive or attach a persistent disk before enabling the download flag.

Set the boolean switches below to `True` the first time you run the notebook. Subsequent runs will skip the downloads if the files already exist.


In [None]:
import shlex
import subprocess
from pathlib import Path

psgs_tsv = DATA_DIR / "psgs_w100.tsv"
embeddings_dir = DATA_DIR / "wikipedia_embeddings"

DOWNLOAD_PASSAGES = False  # <-- change to True for the full 21M-passage corpus (~2.7 GB compressed)
DOWNLOAD_EMBEDDINGS = False  # <-- change to True for the pre-computed passage embeddings (~80 GB)

if psgs_tsv.exists():
    print("Passage file already present at", psgs_tsv)
elif DOWNLOAD_PASSAGES:
    url = "https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz"
    subprocess.run(shlex.split(f"wget {url} -O {psgs_tsv}.gz"), check=True)
    subprocess.run(["gunzip", f"{psgs_tsv}.gz"], check=True)
else:
    print("Skipping passage download (toggle DOWNLOAD_PASSAGES to True).")

if embeddings_dir.exists():
    print("Embeddings already present at", embeddings_dir)
elif DOWNLOAD_EMBEDDINGS:
    # Do not create embeddings_dir up front: tar creates it on extraction, so a
    # failed download cannot leave an empty directory that makes later runs skip.
    url = "https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar"
    tar_path = Path(f"{embeddings_dir}.tar")
    subprocess.run(shlex.split(f"wget {url} -O {tar_path}"), check=True)
    subprocess.run(["tar", "-xf", str(tar_path), "-C", str(DATA_DIR)], check=True)
    tar_path.unlink()
else:
    print("Skipping embedding download (toggle DOWNLOAD_EMBEDDINGS to True).")

print("Passage file exists:", psgs_tsv.exists())
print("Embeddings directory exists:", embeddings_dir.exists())
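
Because the downloads are large, it is worth confirming that the target disk has room before flipping either switch. Here is a minimal sketch using `shutil.disk_usage`; the helper names and the 10% safety margin are assumptions for illustration.

```python
import shutil


def free_space_gb(path):
    """Free space, in decimal gigabytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path, required_gb, margin=1.1):
    """True if `path` has at least `required_gb` free, plus a safety margin."""
    return free_space_gb(path) >= required_gb * margin


# "." is used so the sketch runs anywhere; in this notebook you would pass DATA_DIR.
print(f"Free space: {free_space_gb('.'):.1f} GB")
print("Room for an 80 GB download:", has_room(".", 80))
```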


If you do **not** have the official embeddings, you can still smoke-test the pipeline on a miniature corpus of a few thousand passages. Uncomment the block below to create the toy corpus directly inside Colab.


In [None]:
# from datasets import load_dataset
# import numpy as np
# import faiss
# import torch
# from transformers import AutoTokenizer, AutoModel
#
# MINI_CORPUS_SIZE = 5000
# CORPUS_PATH = DATA_DIR / "mini_psgs.tsv"
#
# if not CORPUS_PATH.exists():
#     wiki = load_dataset("wiki_dpr", "psgs_w100", split=f"train[:{MINI_CORPUS_SIZE}]")
#     with CORPUS_PATH.open("w") as fout:
#         for row in wiki:
#             # DPR-style TSV column order is id, text, title.
#             fout.write(f"{row['id']}\t{row['text']}\t{row['title']}\n")
#     print(f"Mini corpus saved to {CORPUS_PATH}")
# else:
#     print("Mini corpus already exists at", CORPUS_PATH)


## 6. Run Contriever retrieval
Point the retriever at your JSONL file. When using the full DPR assets, set `PASSAGES_TSV` to `psgs_w100.tsv` and `EMB_PATTERN` to `wikipedia_embeddings/*`. For the miniature corpus, swap in `mini_psgs.tsv` and drop the `--passages_embeddings` argument to trigger on-the-fly encoding.


In [None]:
import subprocess
from pathlib import Path

PASSAGES_TSV = psgs_tsv if psgs_tsv.exists() else DATA_DIR / "mini_psgs.tsv"
EMB_PATTERN = str(embeddings_dir / "*") if embeddings_dir.exists() else ""
OUTPUT_PREFIX = DATA_DIR / f"{DATASET_NAME}-{SPLIT}-ctx20"
OUTPUT_PREFIX.mkdir(parents=True, exist_ok=True)

cmd = [
    "python",
    str((Path("/content") / "contriever" / "passage_retrieval.py")),
    "--model_name_or_path", "facebook/contriever-msmarco",
    "--passages", str(PASSAGES_TSV),
    "--data", str(output_path),
    "--output_dir", str(OUTPUT_PREFIX),
    "--n_docs", "20",
]

if EMB_PATTERN:
    cmd.extend(["--passages_embeddings", EMB_PATTERN])

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)


The retriever writes a JSONL file that augments each example with a `ctxs` list. Confirm the schema before moving on.


In [None]:
import jsonlines
from itertools import islice

retrieved_file = OUTPUT_PREFIX / f"{DATASET_NAME}-{SPLIT}.jsonl"
print("Retrieved file:", retrieved_file)
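with jsonlines.open(retrieved_file) as reader:
    # Inspect only the first record and print one context instead of all 20.
    for record in islice(reader, 1):
        print({k: record[k] for k in ["question", "answers"]})
        print("Number of contexts:", len(record["ctxs"]))
        print("First context:", record["ctxs"][:1])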
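
To check the whole file rather than a single record, a summary along the following lines can flag questions with missing or incomplete contexts. It is a sketch under assumptions: the `title` and `text` keys are what Contriever-style retrieval output usually contains, so adjust `required_keys` if your file differs, and the helper name is hypothetical.

```python
def summarize_contexts(record, expected_n=20, required_keys=("title", "text")):
    """Report how many contexts a record has and which ones lack required keys."""
    ctxs = record.get("ctxs", [])
    incomplete = [
        idx for idx, ctx in enumerate(ctxs)
        if any(key not in ctx for key in required_keys)
    ]
    return {
        "n_ctxs": len(ctxs),
        "has_expected_n": len(ctxs) == expected_n,
        "incomplete_ctx_indices": incomplete,
    }


# Toy record for illustration; real records come from the retrieved JSONL file.
toy = {
    "question": "who wrote hamlet",
    "ctxs": [
        {"id": "1", "title": "Hamlet", "text": "Hamlet is a tragedy ...", "score": "1.5"},
        {"id": "2", "title": "Shakespeare"},  # deliberately missing "text"
    ],
}
print(summarize_contexts(toy, expected_n=2))
```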


## 7. Generate most-likely RAG answers with vLLM
Use the same CLI that the paper employs (`retrieval_qa/run_baseline_lm.py`). Edit `MODEL_NAME` to whichever checkpoint fits your GPU budget.

* For faithful replication: `Qwen/Qwen2-72B-Instruct-AWQ` with AWQ quantization.
* For smoke tests: `Qwen/Qwen2-7B-Instruct` or `meta-llama/Meta-Llama-3-8B-Instruct`.

The script writes model predictions, token log-probs, PMI statistics, and other diagnostic fields to the file given by `--result_fp` (`RESULT_FILE` below).


In [None]:
import subprocess

MODEL_NAME = "Qwen/Qwen2-7B-Instruct"
RESULT_FILE = OUTPUT_DIR / f"{MODEL_NAME.split('/')[-1]}-{DATASET_NAME}-{SPLIT}-RAGQA.jsonl"

cmd = [
    "python", "retrieval_qa/run_baseline_lm.py",
    "--model_name", MODEL_NAME,
    "--split", SPLIT,
    "--input_file", str(retrieved_file),
    "--result_fp", str(RESULT_FILE),
    "--prompt_name", "chat_directRagQA_REAR3",
    "--chat_template",
    "--top_n", "5",
    "--temperature", "0.0",
    "--top_p", "1",
    "--max_new_tokens", "50",
    "--do_stop",
    "--logprobs", "1",
    "--compute_pmi",
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)


Inspect a sample prediction and its field names to verify that the generation and the diagnostic statistics (PMI terms, Fisher-Rao statistics) were written correctly.


In [None]:
with jsonlines.open(RESULT_FILE) as reader:
    sample = next(reader)

print("Fields:", list(sample.keys()))
print("Predicted answer:", sample.get("predicted_answer"))
print("Accuracy field `acc` (not the LLM-judged Acc_LM from step 8):", sample.get("acc"))


## 8. Evaluate answers with the LLM-based accuracy judge
The paper uses `Qwen/Qwen2-72B-Instruct-AWQ` as the evaluator (`run_compute_accLM.py`). For lighter runs, you can substitute a smaller instruction-tuned model. Set `--eval_distil` when evaluating per-passage generations; omit it for single-answer files.


In [None]:
EVAL_MODEL = "Qwen/Qwen2-7B-Instruct"
ACC_RESULT_FILE = OUTPUT_DIR / f"{RESULT_FILE.stem}-acc.jsonl"

cmd = [
    "python", "retrieval_qa/run_compute_accLM.py",
    "--model_name", EVAL_MODEL,
    "--input_file", str(RESULT_FILE),
    "--result_fp", str(ACC_RESULT_FILE),
    "--acc",
    "--top_n", "5",
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)


## 9. Sample stochastic generations for semantic uncertainty
`semantic_uncertainty/generate.py` draws multiple stochastic answers per question; Semantic Entropy, PMI, Fisher-Rao, and the other estimators are computed from these samples. Keep `--num_generations` modest in Colab to cap inference cost.


In [None]:
import subprocess

SAMPLES_FILE = OUTPUT_DIR / f"{MODEL_NAME.split('/')[-1]}-{DATASET_NAME}-{SPLIT}-samples.jsonl"

cmd = [
    "python", "semantic_uncertainty/generate.py",
    "--model", MODEL_NAME,
    "--input_file", str(retrieved_file),
    "--result_fp", str(SAMPLES_FILE),
    "--prompt_name", "chat_directRagQA_REAR3",
    "--chat_template",
    "--top_n", "5",
    "--split", SPLIT,
    "--max_new_tokens", "50",
    "--do_stop",
    "--num_generations", "5",
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)
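
To build intuition for what the sampled answers are used for, the sketch below computes a discrete form of semantic entropy: answers are grouped into clusters and the entropy of the cluster frequencies is reported. Note the simplification: clustering here is plain normalized string matching, a stand-in for the entailment-based clustering that semantic entropy normally relies on, and none of these helper names come from the RAGU repository.

```python
import math
import string
from collections import Counter


def normalize(answer):
    """Lowercase and strip punctuation/extra spaces (stand-in for semantic clustering)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def discrete_semantic_entropy(answers):
    """Entropy (in nats) of the empirical distribution over answer clusters."""
    if not answers:
        return 0.0
    clusters = Counter(normalize(a) for a in answers)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())


# Five hypothetical samples for one question: three agree, two differ.
samples = ["Paris", "paris.", "Paris", "Lyon", "Marseille"]
print(f"Semantic entropy: {discrete_semantic_entropy(samples):.3f} nats")
```

Low entropy means the samples mostly agree, while high entropy means they scatter across many distinct answers, which is the signal the uncertainty metrics build on.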


## 10. Compute uncertainty metrics and calibration curves
`semantic_uncertainty/generate_answers.py` aggregates the outputs above into AUROC, AURAC, ECE, and Semantic Entropy scores. Because it also supports Passage Utility predictors, keep the `--no-compute_utility` flag set unless you have already trained those models.


In [None]:
import subprocess

UNCERTAINTY_DIR = OUTPUT_DIR / "uncertainty_runs"
UNCERTAINTY_DIR.mkdir(exist_ok=True)

cmd = [
    "python", "semantic_uncertainty/generate_answers.py",
    "--dataset", DATASET_NAME,
    "--precomputed_gen",
    "--no-get_training_set_generations",
    "--no-get_training_set_generations_most_likely_only",
    "--eval_mode", SPLIT,
    "--most_likely_file", str(ACC_RESULT_FILE),
    "--samples_file", str(SAMPLES_FILE),
    "--original_file", str(output_path),
    "--top_n", "5",
    "--acc_LM",
    "--metric", "llm",
    "--entity", "offline",
    "--experiment_lot", "colab",
    "--num_samples", "100",
    "--compute_uncertainties",
    "--no-compute_utility",
]

print("Running:", " ".join(cmd))
subprocess.run(cmd, check=True)
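
As a reference point for the AUROC numbers, here is a dependency-free sketch of the quantity itself: the probability that a randomly chosen incorrect answer receives a higher uncertainty score than a randomly chosen correct one, with ties counted as one half. It is an illustration of the metric, not the implementation used by the RAGU scripts, and the quadratic loop is only suitable for small samples.

```python
def auroc(uncertainties, is_correct):
    """AUROC of an uncertainty score used to detect incorrect answers."""
    wrong = [u for u, c in zip(uncertainties, is_correct) if not c]
    right = [u for u, c in zip(uncertainties, is_correct) if c]
    if not wrong or not right:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for u_wrong in wrong:
        for u_right in right:
            if u_wrong > u_right:
                wins += 1.0  # incorrect answer ranked as more uncertain
            elif u_wrong == u_right:
                wins += 0.5  # ties count as half
    return wins / (len(wrong) * len(right))


# Toy example: two incorrect answers (first two) and two correct ones.
scores = [0.9, 0.4, 0.35, 0.8]
labels = [False, False, True, True]
print("AUROC:", auroc(scores, labels))
```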


## 11. Inspect the resulting calibration metrics
The uncertainty script writes pickle files (`uncertainty_measures.pkl`, `experiment_details.pkl`) and logs summary scores. Load them directly to verify AUROC/ECE values against the numbers reported in the paper.


In [None]:
import pickle

measures_path = UNCERTAINTY_DIR / "uncertainty_measures.pkl"
if measures_path.exists():
    with measures_path.open("rb") as fin:
        measures = pickle.load(fin)
    print("Available uncertainty measures:", list(measures.get("uncertainty_measures", {}).keys()))
else:
    print("Could not find", measures_path)
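
When comparing ECE values, it helps to have the definition at hand. The sketch below implements the common equal-width-bin estimate: predictions are bucketed by confidence, and the gap between mean confidence and accuracy in each bucket is averaged with weights proportional to bucket size. The number of bins and the way confidences are derived from uncertainty scores are assumptions here and may differ from what the RAGU scripts do.

```python
def expected_calibration_error(confidences, is_correct, n_bins=10):
    """Equal-width-bin ECE for confidences in [0, 1]."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, is_correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, float(correct)))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(hit for _, hit in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: an overconfident model (90% confidence, 50% accuracy).
print(f"ECE: {expected_calibration_error([0.9, 0.9], [True, False]):.2f}")
```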


## 12. Next steps
* Increase `SAMPLE_SIZE` and `--num_generations` once the pipeline works end-to-end.
* Swap in the exact checkpoints from the paper (`Qwen/Qwen2-72B-Instruct-AWQ`, `gemma-2-9b-it`, etc.) and rerun the evaluation cells.
* Re-enable `--compute_utility` with Passage Utility annotations when you are ready to benchmark BR-RAG against the published baselines.

Happy experimenting!
