# Evaluating Reranking

In this notebook, we will:
1. Compute the **baseline nDCG** using BM25 results.
2. Load the trained neural reranker model.
3. Apply the model to rerank the BM25 candidates.
4. Compute the **new nDCG** after reranking.

## Setup

In this section, we configure the environment and paths needed for reranking and evaluation:

1. **Parameters**  
- `BATCH_SIZE` defines how many (query, document) pairs are processed per batch.
- `VOCAB_SIZE` is the tokenizer vocabulary size. It must match the value used during training (see `reranking_model_training.ipynb`).  
- `MAX_Q_LENGTH` and `MAX_D_LENGTH` are the maximum sequence lengths for queries and documents, which keep padding and truncation consistent. They should also match the values used during training (see `reranking_model_training.ipynb`).  

2. **File paths**  
- `CORPUS_PATH` → the full text corpus (`MEDLINE_2024_Baseline.jsonl`).  
- `QUESTIONS_PATH` → the evaluation questions.  
- `BM25_PATH` → top candidate documents retrieved by BM25 for each question.  
- `MODEL_PATH` → the checkpoint of the trained neural reranker.  
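- `OUPUT_FILE` → a ranked-results file in `OUTPUT_DIR` (`ranked_questions.jsonl`). The baseline nDCG cell below reads it, while the reranked results are written separately to `ranked_questions_model.jsonl`.  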
- `TRAIN_Q_FILE` and `TRAIN_BM25_FILE` → training data and BM25 candidates used during model training.  
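
To make the role of `MAX_Q_LENGTH` and `MAX_D_LENGTH` concrete, here is a minimal sketch of padding and truncation to a fixed length. The actual logic lives in `build_collate_fn`, which is not shown in this notebook, so the helper name `pad_or_truncate`, right-side padding, and the pad id `0` are assumptions (the pad id is at least consistent with `padding_idx=0` in the model printout further down).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return exactly max_length ids: cut long sequences, right-pad short ones.

    Hypothetical helper for illustration; not part of src.utils.
    """
    clipped = list(token_ids[:max_length])
    return clipped + [pad_id] * (max_length - len(clipped))


print(pad_or_truncate([5, 8, 13], 5))              # [5, 8, 13, 0, 0]
print(pad_or_truncate([5, 8, 13, 21, 34, 55], 5))  # [5, 8, 13, 21, 34]
```

If evaluation used different lengths than training, the model would see inputs shaped differently from those it was fitted on, which is why these values should be copied from the training notebook.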
 


In [35]:
# Setup
import os
import ujson
import torch
import sys
from torch.utils.data import DataLoader

# Add parent directory to path to import from src
sys.path.append('..')

import src.evaluation as ndcg 
from src.model import CNNInteractionBasedModel, Tokenizer, PointWiseDataset
from src.utils import build_collate_fn, get_questions, get_all_doc_texts

BATCH_SIZE = 64
VOCAB_SIZE = 284342
MAX_Q_LENGTH = 30
MAX_D_LENGTH = 1785

# Paths
OUTPUT_DIR = '../output'
CORPUS_PATH = "../data/MEDLINE_2024_Baseline.jsonl"
QUESTIONS_PATH = "../data/questions.jsonl"
BM25_PATH = "../data/questions_bm25_ranked.jsonl"
MODEL_PATH = OUTPUT_DIR + "/model/model_20251001_153857_1.pt"
OUPUT_FILE = OUTPUT_DIR + "/ranked_questions.jsonl"
TRAIN_Q_FILE = "../data/training_data.jsonl"
TRAIN_BM25_FILE = "../data/training_data_bm25_ranked.jsonl"

## Computing Ranking Metrics (BM25 Results) 

The system includes a script to compute the **Normalized Discounted Cumulative Gain (nDCG)** metric, which evaluates the quality of the ranked retrieval results. To compute it, execute the `nDCG.py` script or, as in the cell below, call `ndcg.compute_average_ndcg` directly.

### How nDCG Works

- **DCG (Discounted Cumulative Gain)**: Measures the gain (relevance) of each document in the result list, discounted by its position in the list.
- **IDCG (Ideal DCG)**: The maximum possible DCG achievable, obtained by an ideal ranking of documents.
- **nDCG**: The ratio of DCG to IDCG, normalized to a value between 0 and 1.
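
The three definitions above can be sketched in a few lines. This is an illustrative version, not the code in `src.evaluation`: it assumes binary relevance (gain 1 for a relevant document, 0 otherwise) and the common `log2(rank + 1)` discount, and the function names are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """DCG over the first k gains; 0-based position i is discounted by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """nDCG@k with binary gains (assumption: a document is either relevant or not)."""
    relevant = set(relevant_doc_ids)
    gains = [1.0 if doc_id in relevant else 0.0 for doc_id in ranked_doc_ids]
    # Ideal ranking: all relevant documents placed at the top of the list.
    ideal_gains = [1.0] * min(len(relevant), k)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# One relevant document ranked second: DCG = 1 / log2(3), IDCG = 1.
print(round(ndcg_at_k(["d2", "d1", "d3"], ["d1"], k=10), 4))  # 0.6309
```

Moving the single relevant document from rank 1 to rank 2 already drops the score from 1.0 to about 0.63, which shows how strongly the metric rewards placing relevant documents at the top.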

In [33]:
# Compute nDCG@10 for the results stored in OUPUT_FILE
ndcg.compute_average_ndcg(
    questions_file_path=QUESTIONS_PATH,
    results_file_path=OUPUT_FILE,
    k=10
)

Query ID: 63f73f1b33942b094c000008, nDCG@10: 1.0000
Query ID: 643d41e757b1c7a315000037, nDCG@10: 0.7000
Query ID: 643c88a257b1c7a315000030, nDCG@10: 0.2824
Query ID: 64403c4257b1c7a31500004f, nDCG@10: 0.6309
Query ID: 6441302d57b1c7a315000056, nDCG@10: 0.6625
Query ID: 63f042e2f36125a426000022, nDCG@10: 0.3801
Query ID: 64184483690f196b51000038, nDCG@10: 0.8226
Query ID: 643de76757b1c7a315000039, nDCG@10: 0.4993
Query ID: 64403ab057b1c7a31500004d, nDCG@10: 1.0000
Query ID: 64179139690f196b5100002f, nDCG@10: 0.0736
Query ID: 63f02b50f36125a426000014, nDCG@10: 0.2022
Query ID: 6411b678201352f04a000036, nDCG@10: 0.6180
Query ID: 643bc8f957b1c7a31500002b, nDCG@10: 0.5916
Query ID: 64403be357b1c7a31500004e, nDCG@10: 0.3155
Query ID: 644289c457b1c7a31500005e, nDCG@10: 0.5965
Query ID: 63f02ec1f36125a426000017, nDCG@10: 0.0000
Query ID: 641c516d690f196b5100003f, nDCG@10: 0.6326
Query ID: 64371c5957b1c7a31500002a, nDCG@10: 0.5972
Query ID: 6440396957b1c7a31500004b, nDCG@10: 0.6309
Query ID: 64

In [34]:
# Load trained model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Fit the tokenizer on the training questions and documents
tokenizer = Tokenizer()
questions = get_questions(TRAIN_Q_FILE)
documents = get_all_doc_texts(TRAIN_Q_FILE, TRAIN_BM25_FILE, CORPUS_PATH)
tokenizer.fit(questions + documents)

# Restore the trained weights and switch to evaluation mode (disables dropout)
model = CNNInteractionBasedModel(vocab_size=VOCAB_SIZE)
model.load_state_dict(torch.load(MODEL_PATH, map_location=device))
model.to(device)
model.eval()

Using device: cpu


CNNInteractionBasedModel(
  (embedding): Embedding(284342, 300, padding_idx=0)
  (conv): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
  (activation): ReLU()
  (pool): AdaptiveMaxPool2d(output_size=(1, 1))
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=32, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [36]:
# Prepare dataset for reranking
dataset = PointWiseDataset(QUESTIONS_PATH, BM25_PATH, CORPUS_PATH, tokenizer, return_label=False)

loader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=build_collate_fn(tokenizer, MAX_Q_LENGTH, MAX_D_LENGTH),
    pin_memory=(device.type == "cuda")
)


### Reranking with the Neural Model

1. **Batch scoring**  
- For each batch from the DataLoader, we take the tokenized queries and candidate documents.  
- The model outputs a **relevance score** for each (query, document) pair.  

2. **Collect scores per query**  
- We store the `(document_id, score)` pairs for every query.  

3. **Sort candidates**  
- For each query, we sort the candidate documents in descending order of model score.  
- This step produces the final reranked list of documents for each query.  

4. **Save results**  
- The results are saved in a JSONL file with the format:
   ```json
   {
      "query_id": "...",
      "retrieved_documents": ["doc1", "doc2", "doc3", ...]
   }
   ```
- This keeps the same structure as the BM25 file, making it easy to compare baseline vs reranked performance.  

This reranking step does not retrieve new documents — it only **reorders the BM25 shortlist** according to the learned neural model.
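
The grouping and sorting steps can be tried on toy data before running the full loop. The sketch below is a simplified illustration: the `rerank` helper and the scores are made up, and it takes plain `(query_id, doc_id, score)` triples instead of model outputs.

```python
from collections import defaultdict


def rerank(scored_pairs):
    """Group (query_id, doc_id, score) triples by query and sort by score, highest first."""
    per_query = defaultdict(list)
    for qid, did, score in scored_pairs:
        per_query[qid].append((did, score))
    return {
        qid: [did for did, _ in sorted(pairs, key=lambda p: p[1], reverse=True)]
        for qid, pairs in per_query.items()
    }


# Hypothetical scores for two queries
toy_scores = [
    ("q1", "docA", 0.20),
    ("q1", "docB", 0.90),
    ("q2", "docC", 0.55),
    ("q1", "docC", 0.40),
]
print(rerank(toy_scores))  # {'q1': ['docB', 'docC', 'docA'], 'q2': ['docC']}
```

Each query keeps exactly the documents it started with; only their order changes.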

In [37]:
# Run reranking
reranked_results = {}
with torch.no_grad():
    for batch in loader:
        q_tokens = batch["question_token_ids"].to(device)
        d_tokens = batch["document_token_ids"].to(device)
        qids = batch["query_ids"]
        dids = batch["document_ids"]

        scores = model(q_tokens, d_tokens)

        for i, qid in enumerate(qids):
            if qid not in reranked_results:
                reranked_results[qid] = []
            reranked_results[qid].append((dids[i], float(scores[i])))

# sort by model score
for qid, doc_scores in reranked_results.items():
    reranked_results[qid] = [doc for doc, _ in sorted(doc_scores, key=lambda x: x[1], reverse=True)]

output_file = os.path.join(OUTPUT_DIR, 'ranked_questions_model.jsonl')
with open(output_file, 'w') as f:
    for qid, docs in reranked_results.items():
        entry = {
            "query_id": qid,
            "retrieved_documents": docs
        }
        f.write(ujson.dumps(entry) + '\n')

print(f"Reranked results saved to: {output_file}")

Reranked results saved to: ../output/ranked_questions_model.jsonl


## Evaluate Retrieved Documents (Model Reranking)

Here we compute the **Normalized Discounted Cumulative Gain (nDCG)** metric, which evaluates the quality of the ranked retrieval results after model reranking.
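
To compare the two runs side by side, per-query scores can be collected into dictionaries and subtracted. The sketch below assumes the scores are already available as `{query_id: nDCG}` mappings. Whether `compute_average_ndcg` returns such a structure is not shown here, so the helper `compare_runs` and the toy values are purely illustrative.

```python
def compare_runs(baseline, reranked):
    """Return (mean_baseline, mean_reranked, per-query deltas) over shared query ids."""
    shared = sorted(set(baseline) & set(reranked))
    if not shared:
        raise ValueError("the two runs have no query ids in common")
    deltas = {qid: reranked[qid] - baseline[qid] for qid in shared}
    mean_baseline = sum(baseline[qid] for qid in shared) / len(shared)
    mean_reranked = sum(reranked[qid] for qid in shared) / len(shared)
    return mean_baseline, mean_reranked, deltas


# Illustrative values only, not results from this notebook
toy_baseline = {"q1": 1.0, "q2": 0.5, "q3": 0.0}
toy_reranked = {"q1": 0.5, "q2": 0.5, "q3": 0.25}

mean_b, mean_r, deltas = compare_runs(toy_baseline, toy_reranked)
print(f"baseline={mean_b:.4f} reranked={mean_r:.4f}")
for qid, delta in deltas.items():
    print(f"{qid}: {delta:+.4f}")
```

A negative delta marks a query where reranking pushed relevant documents further down than the baseline ranking did.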


In [39]:
print("Reranked nDCG@10 (Model)")

# Compute nDCG after reranking
ndcg.compute_average_ndcg(
    questions_file_path=QUESTIONS_PATH,
    results_file_path=output_file,
    k=10
    )

Reranked nDCG@10 (Model)
Query ID: 63f73f1b33942b094c000008, nDCG@10: 0.4307
Query ID: 643d41e757b1c7a315000037, nDCG@10: 0.0000
Query ID: 643c88a257b1c7a315000030, nDCG@10: 0.0000
Query ID: 64403c4257b1c7a31500004f, nDCG@10: 0.6309
Query ID: 6441302d57b1c7a315000056, nDCG@10: 0.0000
Query ID: 63f042e2f36125a426000022, nDCG@10: 0.0000
Query ID: 64184483690f196b51000038, nDCG@10: 0.4524
Query ID: 643de76757b1c7a315000039, nDCG@10: 0.4903
Query ID: 64403ab057b1c7a31500004d, nDCG@10: 0.0000
Query ID: 64179139690f196b5100002f, nDCG@10: 0.0000
Query ID: 63f02b50f36125a426000014, nDCG@10: 0.0000
Query ID: 6411b678201352f04a000036, nDCG@10: 0.2581
Query ID: 643bc8f957b1c7a31500002b, nDCG@10: 0.1009
Query ID: 64403be357b1c7a31500004e, nDCG@10: 0.0000
Query ID: 644289c457b1c7a31500005e, nDCG@10: 0.0000
Query ID: 63f02ec1f36125a426000017, nDCG@10: 0.0000
Query ID: 641c516d690f196b5100003f, nDCG@10: 0.1774
Query ID: 64371c5957b1c7a31500002a, nDCG@10: 0.3806
Query ID: 6440396957b1c7a31500004b, nDC