# RAGU Colab Pipeline (A100 GPU)

This notebook walks through the exact sequence of scripts used in the RAGU repository to recreate the paper's retrieval-augmented question answering pipeline on Google Colab.

> **Important:** In Colab select **Runtime → Change runtime type → GPU → A100** before running the notebook. A different GPU may not have enough VRAM for the vLLM generation steps.

In [None]:
!nvidia-smi
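
To check the GPU requirement programmatically rather than by reading the table, one option is to parse the CSV output of `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits`. The sketch below is illustrative only: `parse_gpu_line`, `query_gpus`, and the `MIN_VRAM_MIB` threshold are assumptions and not part of the RAGU repository.

```python
import subprocess

MIN_VRAM_MIB = 40_000  # hypothetical threshold, roughly the 40 GB A100 variant


def parse_gpu_line(line):
    """Split one 'name, memory' CSV row from nvidia-smi into (name, MiB)."""
    name, memory = (field.strip() for field in line.rsplit(',', 1))
    return name, int(memory)


def query_gpus():
    """Ask nvidia-smi for the name and total memory (MiB) of every visible GPU."""
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=name,memory.total',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_line(row) for row in result.stdout.splitlines() if row.strip()]


# Parsing demo on a sample row; call query_gpus() on the Colab runtime itself.
name, mib = parse_gpu_line('NVIDIA A100-SXM4-40GB, 40960')
print(name, mib, 'OK' if mib >= MIN_VRAM_MIB else 'may be too small')
```

On the runtime, `query_gpus()` returns one `(name, MiB)` pair per GPU, which can be compared against the threshold in the same way.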


## 1. Install Docker Engine in Colab
Installing Docker lets us run the pre-built `lauhaide/algomo:v1` image so we don't have to manage Python packages manually. The commands below install Docker, start the daemon, and confirm that it is available in the Colab runtime.

In [None]:
%%bash
set -e
apt-get update
apt-get install -y docker.io
service docker start
docker --version

> **Note:** If Docker fails to start (for example with `Cannot connect to the Docker daemon`), restart the Colab runtime and rerun the setup cells above before continuing.
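
The daemon check described in the note can also be done from a Python cell before any long-running step. This is a minimal sketch, assuming that a zero exit code from `docker info` means the daemon is reachable; `docker_daemon_ready` is a hypothetical helper name.

```python
import shutil
import subprocess


def docker_daemon_ready(timeout=20):
    """Return True if the docker CLI exists and `docker info` succeeds."""
    if shutil.which('docker') is None:
        return False  # CLI is not installed or not on PATH
    try:
        result = subprocess.run(
            ['docker', 'info'],
            capture_output=True, text=True, timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # daemon hung or the binary could not be executed
    return result.returncode == 0


if docker_daemon_ready():
    print('Docker daemon is reachable.')
else:
    print('Docker is not ready; restart the runtime and rerun the setup cells.')
```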

In [None]:
# Confirm that containers can see the GPU (this needs the NVIDIA container runtime on the host)
!docker run --gpus all --rm nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi


## 2. Clone repositories and create working directories

In [None]:
%%bash
set -e
cd /content
if [ ! -d ragu ]; then
  git clone https://github.com/lauraperez/ragu.git
fi
if [ ! -d contriever ]; then
  git clone https://github.com/facebookresearch/contriever.git
fi

In [None]:
from pathlib import Path
import os

BASE_DIR = Path('/content')
RAGU_DIR = BASE_DIR / 'ragu'
CONTRIEVER_DIR = BASE_DIR / 'contriever'
DATA_DIR = RAGU_DIR / 'colab_data'
OUTPUT_DIR = RAGU_DIR / 'colab_outputs'
CACHE_DIR = BASE_DIR / 'ragu_cache'

for path in (DATA_DIR, OUTPUT_DIR, CACHE_DIR):
    path.mkdir(parents=True, exist_ok=True)

os.chdir(RAGU_DIR)
print('Working directory set to:', RAGU_DIR)
print('Cache directory:', CACHE_DIR)

## 3. Provide an optional Hugging Face token
If a model requires authentication, store your token in the `HF_TOKEN` environment variable so the Docker container can access it. Public models can skip this step.

In [None]:
import os
from getpass import getpass

if not os.environ.get('HF_TOKEN'):
    token = getpass('Enter Hugging Face token (leave blank to skip): ').strip()
    if token:
        os.environ['HF_TOKEN'] = token
        print('Stored HF token in the environment for this session.')
    else:
        print('Proceeding without a Hugging Face token.')
else:
    print('Using Hugging Face token from environment.')
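
The Docker cells later in the notebook forward the token with `-e HF_TOKEN="${HF_TOKEN}"`, which passes an empty variable when no token was entered. If you launch containers from Python instead, the flags can be built only for variables that are actually set. The helper `docker_env_args` below is a hypothetical illustration, not something the RAGU scripts provide.

```python
import os


def docker_env_args(names, environ=None):
    """Build `-e NAME` flags for the variables that are set and non-empty."""
    environ = os.environ if environ is None else environ
    args = []
    for name in names:
        if environ.get(name):
            # `-e NAME` with no value makes docker copy the host's value,
            # which keeps the secret out of the command line itself.
            args += ['-e', name]
    return args


print(docker_env_args(['HF_TOKEN'], {'HF_TOKEN': 'hf_example'}))
print(docker_env_args(['HF_TOKEN'], {}))
```

The returned list can be spliced into a `subprocess.run(['docker', 'run', *flags, ...])` call.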

## 4. Build the retrieval dataset input
We fetch the `nq_open` validation split, write it out as raw DPR-style JSON, and then pass that file to `create_retrieval_data.py` to produce the retrieval input. Both steps run inside the Docker container so they reuse the image's dependencies.

In [None]:
%%bash
set -e
# Generate raw DPR-style JSON inside the container
docker run --gpus all --rm \
  -e HF_TOKEN="${HF_TOKEN}" \
  -v /content/ragu:/workspace/ragu \
  -w /workspace/ragu \
  lauhaide/algomo:v1 \
  /bin/bash -lc "python - <<'PY'
import json
from datasets import load_dataset
from pathlib import Path
base = Path('colab_data')
base.mkdir(parents=True, exist_ok=True)
raw_path = base / 'nq_open_validation_raw.json'
ds = load_dataset('nq_open', split='validation')
records = []
for idx, example in enumerate(ds):
    records.append({
        'question': example['question'],
        'short_answers': example['answer'],
        'example_id': f'nq-open-{idx}'
    })
raw_path.write_text(json.dumps({'data': records}))
print('Saved raw DPR JSON to', raw_path)
PY"

# Convert the raw JSON into the JSONL input used by the retrieval step
docker run --gpus all --rm \
  -v /content/ragu:/workspace/ragu \
  -w /workspace/ragu/data_creation \
  lauhaide/algomo:v1 \
  python create_retrieval_data.py \
    --dataset NQ \
    --input_file /workspace/ragu/colab_data/nq_open_validation_raw.json \
    --output_file /workspace/ragu/colab_data/nq_open_validation.jsonl


In [None]:
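raw_input_path = DATA_DIR / 'nq_open_validation_raw.json'
processed_input_path = DATA_DIR / 'nq_open_validation.jsonl'
# Report both paths and whether the previous cell actually produced them
for label, path in (('Raw DPR JSON', raw_input_path),
                    ('Processed retrieval input', processed_input_path)):
    print(f'{label}: {path} (exists: {path.exists()})')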
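
Before moving on, it can help to confirm that the raw file has the shape written by the first container: a top-level `data` list whose records carry `question`, `short_answers`, and `example_id`. The following is a minimal sketch; `check_raw_records` is a hypothetical helper and the demo record is made up for illustration.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {'question', 'short_answers', 'example_id'}


def check_raw_records(path):
    """Return the record count, raising ValueError on a malformed record."""
    records = json.loads(Path(path).read_text())['data']
    for position, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f'record {position} lacks {sorted(missing)}')
    return len(records)


# Demo on a one-record stand-in file; on Colab pass raw_input_path instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / 'demo_raw.json'
    demo.write_text(json.dumps({'data': [
        {'question': 'who wrote hamlet', 'short_answers': ['Shakespeare'],
         'example_id': 'nq-open-0'},
    ]}))
    print('Records:', check_raw_records(demo))
```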

## 5. Download Contriever passages and embeddings
Download the Wikipedia passages plus the pre-built Contriever embeddings into the shared cache directory using the Docker image.

In [None]:
%%bash
set -e
CACHE_ROOT=/content/ragu_cache/contriever
mkdir -p "${CACHE_ROOT}"

docker run --gpus all --rm \
  -v /content/ragu_cache:/workspace/cache \
  -w /workspace/cache/contriever \
  lauhaide/algomo:v1 \
  /bin/bash -lc "set -e
mkdir -p /workspace/cache/contriever
cd /workspace/cache/contriever
if [ ! -f psgs_w100.tsv.gz ]; then
  wget -q https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
fi
if [ ! -f wikipedia_embeddings.tar ]; then
  wget -q https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
fi
if [ ! -d wikipedia_embeddings ]; then
  mkdir -p wikipedia_embeddings
  tar -xf wikipedia_embeddings.tar -C wikipedia_embeddings --strip-components=1
fi"
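
Because `wget -q` prints nothing and the `-f` / `-d` guards skip anything that already exists, a quick look at the cache can confirm the artifacts are really there. The sketch below only catches the absent or zero-byte case, not a truncated download; `missing_artifacts` is a hypothetical helper.

```python
from pathlib import Path

EXPECTED = ('psgs_w100.tsv.gz', 'wikipedia_embeddings')


def missing_artifacts(cache_root, expected=EXPECTED):
    """List expected files or folders that are absent or empty under cache_root."""
    root = Path(cache_root)
    missing = []
    for name in expected:
        target = root / name
        if target.is_dir():
            empty = not any(target.iterdir())
        elif target.is_file():
            empty = target.stat().st_size == 0
        else:
            empty = True  # not present at all
        if empty:
            missing.append(name)
    return missing


# Host-side cache location used by the download cell above.
print('Missing:', missing_artifacts('/content/ragu_cache/contriever'))
```

An empty list means both the passages file and the extracted embeddings folder are present and non-empty.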


## 6. Run dense retrieval (Contriever)
Execute the Contriever retrieval script from inside Docker so that dependencies such as FAISS are satisfied by the image.

In [None]:
%%bash
set -e
# Host-side paths, kept for reference; the container sees the same files under /workspace/ragu
DATA_PATH=/content/ragu/colab_data/nq_open_validation.jsonl
OUTPUT_PATH=/content/ragu/colab_data/nq_open_validation_retrieved.jsonl

docker run --gpus all --rm \
  -v /content/ragu:/workspace/ragu \
  -v /content/contriever:/workspace/contriever \
  -v /content/ragu_cache:/workspace/cache \
  -w /workspace/contriever/scripts \
  lauhaide/algomo:v1 \
  python passage_retrieval.py \
    --model_name_or_path facebook/contriever-msmarco \
    --passages /workspace/cache/contriever/psgs_w100.tsv.gz \
    --passages_embeddings "/workspace/cache/contriever/wikipedia_embeddings/*" \
    --data /workspace/ragu/colab_data/nq_open_validation.jsonl \
    --output_dir /workspace/ragu/colab_data/nq_open_validation_retrieved.jsonl \
    --n_docs 5
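
The output location is passed through a flag named `--output_dir`, so the results may end up either as a single file or inside a folder with that name. It is worth checking what was written before starting generation. The sketch below locates the result files in either case and counts retrieved passages per question. It assumes one JSON object per line with the passages stored under a `ctxs` key; that field name and both helper names are assumptions for illustration.

```python
import json
from pathlib import Path


def find_retrieval_files(path):
    """Return the result files at `path`, whether it is a file or a folder."""
    path = Path(path)
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.is_file())
    return [path] if path.is_file() else []


def contexts_per_question(lines, key='ctxs'):
    """Count retrieved passages per JSON line; `ctxs` is an assumed field name."""
    return [len(json.loads(line).get(key, [])) for line in lines if line.strip()]


# Made-up sample lines standing in for the retrieval output.
sample = ['{"question": "q1", "ctxs": [{"id": "1"}, {"id": "2"}]}',
          '{"question": "q2", "ctxs": []}']
print(contexts_per_question(sample))
```

On Colab, `find_retrieval_files('/content/ragu/colab_data/nq_open_validation_retrieved.jsonl')` shows which files exist, and each file's lines can be passed to `contexts_per_question`.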


## 7. Run RAG answer generation with `run_baseline_lm.py`
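Launch the generator inside Docker (vLLM is already included in the image). `google/gemma-2-9b-it` is a gated model on Hugging Face, so it needs the token from Section 3 and an account that has accepted the model's license. You can change `GENERATOR_MODEL` to use a smaller model if VRAM is limited.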

In [None]:
%%bash
set -e
cd /content/ragu
GENERATOR_MODEL=google/gemma-2-9b-it
INPUT_FILE=colab_data/nq_open_validation_retrieved.jsonl
OUTPUT_FILE=colab_outputs/gemma2_nq_validation.jsonl

docker run --gpus all --rm \
  -e HF_TOKEN="${HF_TOKEN}" \
  -e HF_HOME=/workspace/cache/huggingface \
  -v /content/ragu:/workspace/ragu \
  -v /content/ragu_cache:/workspace/cache \
  -w /workspace/ragu \
  lauhaide/algomo:v1 \
  python retrieval_qa/run_baseline_lm.py \
    --model_name ${GENERATOR_MODEL} \
    --split validation \
    --input_file ${INPUT_FILE} \
    --result_fp ${OUTPUT_FILE} \
    --prompt_name chat_directRagQA_REAR3 \
    --chat_template \
    --top_n 5 \
    --temperature 0.0 \
    --top_p 1.0 \
    --max_new_tokens 50 \
    --logprobs 1 \
    --compute_pmi \
    --batch_size 2
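
A rough way to judge whether a generator fits in VRAM is to estimate the memory taken by its weights alone: parameter count times bytes per parameter. The sketch below does that arithmetic. The round parameter counts are illustrative assumptions rather than exact model sizes, and the KV cache and activations that vLLM also allocates come on top of this figure.

```python
def weight_memory_gib(n_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


# 2 bytes per parameter corresponds to 16-bit weights (fp16 or bf16).
for label, n_params in [('~9B-parameter model', 9e9), ('~2B-parameter model', 2e9)]:
    print(f'{label}: {weight_memory_gib(n_params):.1f} GiB in 16-bit precision')
```

If the estimate already approaches the card's capacity, a smaller `GENERATOR_MODEL` is the safer choice.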


## 8. Score answers with `run_compute_accLM.py`
The judge model also runs from inside the Docker image. Adjust `EVAL_MODEL` or `--batch_size` if you hit memory limits.

In [None]:
%%bash
set -e
cd /content/ragu
EVAL_MODEL=Qwen/Qwen2-1.5B-Instruct
PRED_FILE=colab_outputs/gemma2_nq_validation.jsonl
SCORED_FILE=colab_outputs/gemma2_nq_validation_acclm.jsonl

docker run --gpus all --rm \
  -e HF_TOKEN="${HF_TOKEN}" \
  -e HF_HOME=/workspace/cache/huggingface \
  -v /content/ragu:/workspace/ragu \
  -v /content/ragu_cache:/workspace/cache \
  -w /workspace/ragu \
  lauhaide/algomo:v1 \
  python retrieval_qa/run_compute_accLM.py \
    --model_name ${EVAL_MODEL} \
    --input_file ${PRED_FILE} \
    --result_fp ${SCORED_FILE} \
    --acc \
    --top_n 5 \
    --batch_size 4 \
    --prompt_name chat_accuracy_eval-rlhf-calib \
    --chat_template


## 9. Inspect aggregate metrics

In [None]:
import json
import numpy as np

scored_path = OUTPUT_DIR / 'gemma2_nq_validation_acclm.jsonl'
# Read one JSON object per line with the standard library (no extra install on the host)
with open(scored_path) as handle:
    entries = [json.loads(line) for line in handle if line.strip()]
# Entries without an `acc_LM` score count as incorrect
acc_values = [entry.get('acc_LM', 0) for entry in entries]
print('Total examples:', len(entries))
print('LM-judged accuracy:', np.mean(acc_values))
print('Sample prediction:', entries[0]['output'])
print('Sample gold answers:', entries[0]['golds'])
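
The cell above treats an entry with no `acc_LM` field as a wrong answer, which can hide examples the judge never scored. A minimal sketch that reports scored and unscored entries separately follows; `summarize_accuracy` is a hypothetical helper and the demo entries are made up.

```python
def summarize_accuracy(entries, key='acc_LM'):
    """Return (mean over scored entries, number scored, number missing the key)."""
    scores = [entry[key] for entry in entries if key in entry]
    missing = len(entries) - len(scores)
    mean = sum(scores) / len(scores) if scores else float('nan')
    return mean, len(scores), missing


demo_entries = [{'acc_LM': 1}, {'acc_LM': 0}, {'output': 'unscored'}]
mean, scored, missing = summarize_accuracy(demo_entries)
print(f'accuracy={mean:.2f} over {scored} scored entries, {missing} missing')
```

Calling `summarize_accuracy(entries)` on the loaded results gives the same accuracy as the cell above only when no entry is missing a score.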


## 10. Next steps
* Swap in alternative generator or judge models by editing the variables in Sections 7–8.
* Increase `--n_docs` or `--top_n` to match the paper's retrieval depth.
* Persist `colab_data` and `colab_outputs` folders to Google Drive if you need to reuse them.
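
For the last point, one possible approach is to copy the two folders into a mounted Drive directory. The sketch below uses only the standard library; `persist_folders` is a hypothetical helper, and the Drive destination in the trailing comment is an assumed path you would adapt.

```python
import shutil
from pathlib import Path


def persist_folders(folders, dest_root):
    """Copy each existing folder into dest_root, overwriting files already there."""
    dest_root = Path(dest_root)
    copied = []
    for folder in map(Path, folders):
        if folder.is_dir():
            target = dest_root / folder.name
            # dirs_exist_ok lets repeated runs refresh an earlier copy (Python 3.8+)
            shutil.copytree(folder, target, dirs_exist_ok=True)
            copied.append(target)
    return copied


# On Colab, after `from google.colab import drive; drive.mount('/content/drive')`:
# persist_folders(['/content/ragu/colab_data', '/content/ragu/colab_outputs'],
#                 '/content/drive/MyDrive/ragu_runs')  # hypothetical destination
```

Folders that do not exist are skipped, so the call is safe to run before every step has finished.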