# ⚖️ Notebook 3: Retrieval Benchmark (Dense vs Sparse)
**Author:** Gabriele Righi

**Project:** Dense vs Sparse Retrieval Reproducibility

## 🎯 Objective
This notebook performs the **core comparative analysis** of the project. It evaluates the retrieval effectiveness (Quality) and efficiency (Speed/RAM) of different indexing strategies on the **Natural Questions (NQ)** dataset.

## ⚙️ Key Operations
1.  **Index Loading:** Loads the pre-built Faiss indexes (HNSW FP32, HNSW INT8, Flat INT8) created in Notebook 2.
2.  **Dense Retrieval:** Executes search queries to measure **nDCG@10**, **Recall@10**, and **QPS** (Queries Per Second).
3.  **Sparse Baseline (BM25):** Calculates the BM25 baseline using the optimized `bm25s` library to replicate Table 3 of the original paper.
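4.  **Results Aggregation:** Compiles all metrics into CSV reports: `final_results_nq.csv` (dense indexes), `final_results_with_bm25.csv` (dense indexes plus the BM25 row), and `paper_final_tables.csv`, which adds memory and latency columns to replicate **Table 3** (Effectiveness) and **Table 4** (Efficiency).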
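
The two effectiveness metrics named above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the notebook's evaluation code: the helper names (`dcg_at_k`, `ndcg_at_k`, `recall_at_k`) and the toy document IDs are hypothetical, and relevance judgments are assumed to be a dict mapping document ID to a graded score.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(retrieved_ids, relevant, k=10):
    """nDCG@k: DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [relevant.get(doc_id, 0) for doc_id in retrieved_ids[:k]]
    ideal = sorted(relevant.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved_ids, relevant, k=10):
    """Recall@k: fraction of the relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant)) / len(relevant)

# Hypothetical toy query: two relevant documents, only one retrieved (at rank 2).
toy_qrels = {"d1": 1, "d7": 1}
toy_ranking = ["d3", "d1", "d9"]
toy_ndcg = ndcg_at_k(toy_ranking, toy_qrels)
toy_recall = recall_at_k(toy_ranking, toy_qrels)
print(f"nDCG@10={toy_ndcg:.4f} | Recall@10={toy_recall:.4f}")
```

QPS is the remaining metric: the number of queries divided by the wall-clock time of the search call.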

## 📂 Inputs & Outputs
* **Input:** `doc_embeddings.npy` (NB1), `.faiss` indexes (NB2), NQ Dataset.
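* **Output:** `final_results_nq.csv` (dense benchmark metrics), `final_results_with_bm25.csv` (with the BM25 row added), and `paper_final_tables.csv` (with memory and latency columns).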
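
For orientation, the raw size of the embedding input can be estimated from the corpus size and vector dimension used later in this notebook (2,681,468 documents, 768 dimensions). The sketch below is plain arithmetic; the helper name `raw_vectors_gib` is hypothetical, and it counts only the vectors themselves, not any index overhead.

```python
N_DOCS = 2_681_468  # NQ corpus size, as used in the final table cell
DIM = 768           # embedding dimension of BGE-Base

def raw_vectors_gib(n_docs, dim, bytes_per_dim):
    """Size of the bare vectors in GiB (1024**3 bytes), without index overhead."""
    return n_docs * dim * bytes_per_dim / 1024**3

fp32_gib = raw_vectors_gib(N_DOCS, DIM, 4)  # 4 bytes per dimension
int8_gib = raw_vectors_gib(N_DOCS, DIM, 1)  # 1 byte per dimension
print(f"FP32 vectors: {fp32_gib:.2f} GiB | INT8 vectors: {int8_gib:.2f} GiB")
```

The FP32 figure is why the exact Flat FP32 index, which is built in memory from `doc_embeddings.npy`, may not fit on a small machine.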

---

## ⚠️ CRITICAL SETUP: ENVIRONMENT & DEPENDENCY FIX

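**Did you get a `ValueError: numpy.dtype size changed` below?**
Don't panic! This is expected: Cell 1 replaces core libraries (NumPy, pandas) while their old versions are still loaded in the running session, so compiled extensions see a binary-incompatible NumPy until the session restarts. The pinned versions keep `beir`, `faiss`, and `bm25s` compatible.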

### 🛑 FOLLOW THESE STEPS CAREFULLY:
1.  **RUN CELL 1:** Execute the installation cell below. It might crash or show the error.
2.  **RESTART THE SESSION:** Go to the top menu: **Run** -> **Restart and clear cell output**.
3.  **SKIP CELL 1:** Once restarted, do **NOT** run the installation cell again.
4.  **PROCEED:** Go directly to **Cell 2** and run the rest of the notebook.

The code should run cleanly after the restart! 🚀
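
After the restart, one optional way to confirm that the pinned versions are active without rerunning Cell 1 is to query the installed package metadata. This is a hedged sketch: the helpers `major_version` and `installed_version` are hypothetical names, not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def major_version(version_string):
    """Return the leading integer of a version string such as '1.26.4'."""
    return int(version_string.split(".")[0])

def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Report the packages that Cell 1 pins.
for package in ("numpy", "pandas", "scipy", "scikit-learn"):
    print(f"{package}: {installed_version(package)}")

numpy_version = installed_version("numpy")
if numpy_version is not None and major_version(numpy_version) >= 2:
    print("NumPy 2.x detected: rerun Cell 1, then restart the session.")
```

Reading metadata does not import the packages, so the check itself cannot trigger the binary-incompatibility error.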

In [None]:
# CELL 1: ROBUST ENVIRONMENT SETUP
# ==============================================================================
# 1. UNINSTALL conflicting libraries first (The "Nuclear" Step)
!pip uninstall -y numpy pandas scipy scikit-learn

# 2. INSTALL a coherent ecosystem compatible with Numpy 1.x
# We use --no-cache-dir to ensure we download fresh compatible wheels
!pip install "numpy<2.0" "pandas==2.2.2" "scipy==1.13.1" "scikit-learn==1.5.0" --no-cache-dir --force-reinstall

# 3. INSTALL Project Libraries
!pip install faiss-cpu sentence-transformers beir pyserini --no-deps

# 4. INSTALL Core ML Libraries
!pip install tqdm transformers torch torchvision

# 5. VERIFY Installation
import numpy as np
import pandas as pd
print(f"✅ Setup Success. Numpy: {np.__version__} | Pandas: {pd.__version__}")

# CELL 2: COMPLETE BENCHMARK SCRIPT

In [None]:
import faiss
import numpy as np
import pandas as pd
import os
import gc
import time
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

print("🚀 STARTING FINAL EVALUATION BENCHMARK...")

# ==========================================
# 1. FILE DETECTION
# ==========================================
print("\n[1/5] Detecting Index Files...")
input_root = '/kaggle/input'
paths = {
    'doc_emb': None,       
    'hnsw_fp32': None, 
    'hnsw_int8': None, 
    'flat_int8': None
}

for root, _, files in os.walk(input_root):
    if 'doc_embeddings.npy' in files: 
        paths['doc_emb'] = os.path.join(root, 'doc_embeddings.npy')
    if 'hnsw_index.faiss' in files: 
        paths['hnsw_fp32'] = os.path.join(root, 'hnsw_index.faiss')
    if 'hnsw_int8_index.faiss' in files: 
        paths['hnsw_int8'] = os.path.join(root, 'hnsw_int8_index.faiss')
    if 'flat_int8_index.faiss' in files: 
        paths['flat_int8'] = os.path.join(root, 'flat_int8_index.faiss')

print(f"📂 Files found: {[k for k, v in paths.items() if v]}")

# ==========================================
# 2. LOAD DATASET (Natural Questions)
# ==========================================
print("\n[2/5] Loading NQ Dataset...")
qrels_ds = load_dataset("Cohere/beir-embed-english-v3", "nq-qrels", split="test")
queries_ds = load_dataset("Cohere/beir-embed-english-v3", "nq-queries", split="test")
corpus_ds = load_dataset("Cohere/beir-embed-english-v3", "nq-corpus", split="train")

doc_ids_list = corpus_ds['_id']
query_ids_ordered = queries_ds['_id']

qrels = {}
q_key = 'query_id' if 'query_id' in qrels_ds.column_names else 'query-id'
c_key = 'corpus_id' if 'corpus_id' in qrels_ds.column_names else 'corpus-id'

for row in qrels_ds:
    qid = str(row[q_key])
    did = str(row[c_key])
    if qid not in qrels: qrels[qid] = {}
    qrels[qid][did] = int(row['score'])

# ==========================================
# 3. QUERY ENCODING (WITH INSTRUCTION)
# ==========================================
print("\n[3/5] Encoding Queries (With BGE Instruction)...")
instruction = "Represent this sentence for searching relevant passages: "
query_texts = queries_ds['text']
queries_with_instruction = [instruction + q for q in query_texts]

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
query_embeddings = model.encode(queries_with_instruction, batch_size=32, convert_to_numpy=True, normalize_embeddings=True)
print(f"✅ Queries Encoded. Shape: {query_embeddings.shape}")

del model
gc.collect()

# ==========================================
# 4. EVALUATION FUNCTION
# ==========================================
def evaluate(index, index_name, q_embs, k=10):
    print(f"\n🔎 Testing: {index_name} ...")
    start_time = time.time()
    scores, indices = index.search(q_embs, k)
    search_time = time.time() - start_time
    qps = len(q_embs) / search_time
    
    ndcg_list = []
    recall_list = []
    
    for i, (res_indices, res_scores) in enumerate(zip(indices, scores)):
        if i >= len(query_ids_ordered): break
        qid = str(query_ids_ordered[i])
        
        if qid not in qrels: continue
        relevant_docs = qrels[qid]
        
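        retrieved_ids = []
        for idx in res_indices:
            # Faiss pads missing results with -1. A bare `idx < len(...)` check lets
            # -1 through and silently maps it to the last document, so check both bounds.
            if 0 <= idx < len(doc_ids_list):
                retrieved_ids.append(str(doc_ids_list[idx]))
            else:
                retrieved_ids.append("-1")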

        dcg = 0.0
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_docs:
                dcg += relevant_docs[doc_id] / np.log2(rank + 1)
        
        ideal_rels = sorted(relevant_docs.values(), reverse=True)[:k]
        idcg = sum(r / np.log2(rank + 2) for rank, r in enumerate(ideal_rels))
        ndcg_list.append(dcg / idcg if idcg > 0 else 0)

        rel_set = set(relevant_docs.keys())
        ret_set = set(retrieved_ids)
        if len(rel_set) > 0:
            recall_list.append(len(rel_set & ret_set) / len(rel_set))
            
    mean_ndcg = np.mean(ndcg_list)
    mean_recall = np.mean(recall_list)
    
    print(f"   👉 Result: nDCG@10={mean_ndcg:.4f} | Recall@10={mean_recall:.4f} | QPS={qps:.1f}")
    
    return {
        'Method': index_name, 
        'nDCG@10': round(mean_ndcg, 4), 
        'Recall@10': round(mean_recall, 4),
        'QPS': round(qps, 1)
    }

# ==========================================
# 5. RUN EXPERIMENTS
# ==========================================
results = []

if paths['hnsw_fp32']:
    idx = faiss.read_index(paths['hnsw_fp32'])
    results.append(evaluate(idx, "HNSW FP32", query_embeddings))
    del idx; gc.collect()

if paths['hnsw_int8']:
    idx = faiss.read_index(paths['hnsw_int8'])
    results.append(evaluate(idx, "HNSW INT8", query_embeddings))
    del idx; gc.collect()

if paths['flat_int8']:
    idx = faiss.read_index(paths['flat_int8'])
    results.append(evaluate(idx, "Flat INT8", query_embeddings))
    del idx; gc.collect()

if paths['doc_emb']:
    print("\n------------------------------------------------")
    print("Building Flat FP32 (Exact Search) in Memory...")
    try:
        doc_embeddings = np.load(paths['doc_emb'], mmap_mode='r')
        d = doc_embeddings.shape[1]
        idx = faiss.IndexFlatIP(d)
        # Faiss expects a C-contiguous float32 array; this also copies the memmap into RAM.
        idx.add(np.ascontiguousarray(doc_embeddings, dtype=np.float32))
        results.append(evaluate(idx, "Flat FP32 (Exact)", query_embeddings))
        del idx; del doc_embeddings; gc.collect()
    except Exception as e:
        print(f"⚠️ Flat FP32 skipped (likely a memory limit): {e}")

# ==========================================
# 6. SAVE RESULTS
# ==========================================
print("\n🏆 FINAL BENCHMARK RESULTS:")
df = pd.DataFrame(results)
print(df)

df.to_csv("final_results_nq.csv", index=False)
print("\n✅ Results saved to 'final_results_nq.csv'")

# 📊 BM25 Baseline Calculation (Sparse Retrieval)

In this section, we compute the **BM25** scores to serve as a standardized baseline for our dense retrieval experiments (Table 1 & 3 replication).

We use the `bm25s` library, a high-performance Python implementation that can index the entire Natural Questions (NQ) corpus (about 2.68M documents) efficiently in RAM without external dependencies like Java/Lucene.

**Metrics computed:** nDCG@10, Recall@10, and QPS.
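
To make the scoring idea concrete, here is a minimal pure-Python BM25 sketch on a toy corpus. It is an illustration under stated assumptions, not the `bm25s` implementation: the function `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF form, and the pre-stemmed toy tokens are all choices made for this example, and `bm25s` supports several scoring variants that may differ in detail.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF: rarer terms receive a larger weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, already lowercased, stopword-filtered, and stemmed.
toy_corpus = [
    ["capit", "franc", "pari"],
    ["pari", "hilton", "hotel"],
    ["berlin", "capit", "germani"],
]
toy_query = ["capit", "franc"]
toy_scores = bm25_scores(toy_query, toy_corpus)
best_doc = max(range(len(toy_scores)), key=toy_scores.__getitem__)
print(f"Scores: {[round(s, 3) for s in toy_scores]} | best document index: {best_doc}")
```

The document that matches both query terms ranks first, and the rarer term contributes more to its score than the term shared with another document.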

In [None]:
# ==========================================
# BM25 CALCULATION (SPARSE BASELINE)
# ==========================================
import pandas as pd
import numpy as np
import time
import os

print("🚀 STARTING BM25 CALCULATION (Using fast 'bm25s' library)...")

# 1. Install optimized libraries
#    We install 'PyStemmer' (wrapper for C Stemming lib) to make tokenization fast.
#    Note: Ignore any "JAX/CUDA" warnings in the output; they are irrelevant for CPU execution.
!pip install PyStemmer bm25s --no-deps

import bm25s
import Stemmer 

# ---------------------------------------------------------
# A. DATA PREPARATION
# ---------------------------------------------------------
# BM25 operates on raw text. We extract it from the datasets loaded in memory.
print("   📥 Extracting raw text from Corpus and Queries...")
corpus_texts = corpus_ds['text']
query_texts = queries_ds['text']

# ---------------------------------------------------------
# B. TOKENIZATION & STEMMING
# ---------------------------------------------------------
print("   ✂️  Tokenizing Corpus (this may take 2-4 minutes)...")
# We use the English stemmer to reduce words to their root (e.g., "running" -> "run")
stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(corpus_texts, stopwords="en", stemmer=stemmer)
query_tokens = bm25s.tokenize(query_texts, stopwords="en", stemmer=stemmer)

# ---------------------------------------------------------
# C. INDEX CONSTRUCTION
# ---------------------------------------------------------
print("   🏗️  Building BM25 Index...")
retriever = bm25s.BM25()
retriever.index(corpus_tokens)
print("   ✅ Index built successfully!")

# ---------------------------------------------------------
# D. RETRIEVAL (SEARCH)
# ---------------------------------------------------------
print("   🔎 Executing Search (Top-10 for all queries)...")
start_time = time.time()

# Perform search
results, scores = retriever.retrieve(query_tokens, k=10)

search_time = time.time() - start_time
qps_bm25 = len(query_texts) / search_time
print(f"   ✅ Search completed in {search_time:.2f}s (QPS: {qps_bm25:.1f})")

# ---------------------------------------------------------
# E. EVALUATION (nDCG & Recall)
# ---------------------------------------------------------
print("   📊 Calculating Metrics...")
ndcg_list = []
recall_list = []

for i, res_indices in enumerate(results):
    # Safety check for index alignment
    if i >= len(query_ids_ordered): break
    
    qid = str(query_ids_ordered[i])
    
    # Skip if we don't have ground truth for this query
    if qid not in qrels: continue
    
    relevant_docs = qrels[qid]
    
    # Map internal BM25 integer IDs back to dataset String IDs
    retrieved_ids = []
    for idx in res_indices:
        # Check both bounds so a negative index can never wrap around to the last document.
        if 0 <= idx < len(doc_ids_list):
            retrieved_ids.append(str(doc_ids_list[idx]))
        else:
            retrieved_ids.append("-1")

    # --- nDCG Calculation ---
    dcg = 0.0
    for rank, doc_id in enumerate(retrieved_ids, 1):
        if doc_id in relevant_docs:
            dcg += relevant_docs[doc_id] / np.log2(rank + 1)
    
    ideal_rels = sorted(relevant_docs.values(), reverse=True)[:10]
    idcg = sum(r / np.log2(rank + 2) for rank, r in enumerate(ideal_rels))
    ndcg_list.append(dcg / idcg if idcg > 0 else 0)

    # --- Recall Calculation ---
    rel_set = set(relevant_docs.keys())
    ret_set = set(retrieved_ids)
    if len(rel_set) > 0:
        recall_list.append(len(rel_set & ret_set) / len(rel_set))

mean_ndcg_bm25 = np.mean(ndcg_list)
mean_recall_bm25 = np.mean(recall_list)

print(f"\n🏆 BM25 RESULTS:")
print(f"   👉 nDCG@10:   {mean_ndcg_bm25:.4f}")
print(f"   👉 Recall@10: {mean_recall_bm25:.4f}")

# ---------------------------------------------------------
# F. UPDATE RESULTS TABLE
# ---------------------------------------------------------
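# Round to the same precision as the dense rows so the final table is consistent.
bm25_result = {
    'Method': 'BM25 (Sparse)', 
    'nDCG@10': round(float(mean_ndcg_bm25), 4), 
    'Recall@10': round(float(mean_recall_bm25), 4), 
    'QPS': round(qps_bm25, 1)
}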

# Load previous results, append BM25, and save
try:
    csv_path = "final_results_nq.csv"
    if os.path.exists(csv_path):
        df_final = pd.read_csv(csv_path)
        # Remove old BM25 entry if it exists to avoid duplicates
        df_final = df_final[df_final['Method'] != 'BM25 (Sparse)']
        # Add new result
        df_final = pd.concat([df_final, pd.DataFrame([bm25_result])], ignore_index=True)
    else:
        df_final = pd.DataFrame([bm25_result])
    
    print("\n📄 UPDATED FINAL TABLE:")
    print(df_final)
    df_final.to_csv("final_results_with_bm25.csv", index=False)
    print("✅ Saved to 'final_results_with_bm25.csv'")
except Exception as e:
    print(f"⚠️ Error updating CSV: {e}")

In [None]:
# ==========================================
# FINAL PAPER TABLE GENERATION
# ==========================================
import pandas as pd
import numpy as np

# 1. Load the final results
df = pd.read_csv("final_results_with_bm25.csv")

# Constants for NQ Dataset (needed for memory calculation)
N_DOCS = 2681468   # Number of documents in NQ
DIM = 768          # Dimension of BGE-Base vectors

# ==========================================
# CALCULATE MISSING METRICS (Table 4)
# ==========================================

def calculate_theoretical_memory(method_name):
    """Estimates RAM usage in GB (computed as GiB, 1024**3 bytes) from vector size; not a measured value."""
    if "BM25" in method_name:
        # BM25 index size varies, but typically smaller than dense FP32.
        # Estimate: ~0.5 GB for vocabulary + inverted index of this size
        return 0.50 
    
    # Dense Retrieval Logic
    if "INT8" in method_name:
        bytes_per_vec = DIM * 1  # 1 byte per dimension (Quantized)
    else: 
        bytes_per_vec = DIM * 4  # 4 bytes per dimension (FP32)
    
    raw_size_gb = (N_DOCS * bytes_per_vec) / (1024**3)
    
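    # HNSW adds graph overhead (links between nodes).
    # Assumption: +25% on top of the raw vectors (a typical 20-30% figure for FP32).
    # The links are not quantized, so for HNSW INT8 this likely underestimates the overhead.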
    if "HNSW" in method_name:
        return raw_size_gb * 1.25
    else:
        return raw_size_gb

# 1. Compute Memory (GB)
df['Memory (GB)'] = df['Method'].apply(calculate_theoretical_memory).round(2)

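# 2. Compute Latency (ms per query) -> Latency = 1000 / QPS
#    Note: QPS comes from one batched search call, so this is the average time per
#    query at batch throughput, not the latency of a single isolated query.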
df['Latency (ms)'] = (1000 / df['QPS']).round(2)

# ==========================================
# FORMATTING TABLES
# ==========================================

# --- TABLE 3 REPLICATION: Effectiveness ---
# Focus on nDCG and Recall
print("\n" + "="*40)
print("📄 TABLE 3: RETRIEVAL EFFECTIVENESS")
print("="*40)
table_3 = df[['Method', 'nDCG@10', 'Recall@10']].sort_values(by='nDCG@10', ascending=False)
print(table_3.to_markdown(index=False))


# --- TABLE 4 REPLICATION: Efficiency ---
# Focus on Memory, QPS, and Latency
print("\n" + "="*40)
print("⚙️ TABLE 4: RETRIEVAL EFFICIENCY")
print("="*40)
# We sort by QPS (Speed) to show the fastest first
table_4 = df[['Method', 'Memory (GB)', 'QPS', 'Latency (ms)']].sort_values(by='QPS', ascending=False)
print(table_4.to_markdown(index=False))

# --- SAVE FORMATTED DATA ---
df.to_csv("paper_final_tables.csv", index=False)
print("\n✅ Formatted tables saved to 'paper_final_tables.csv'")