# RAG Benchmark Evaluation on Natural Questions

This notebook evaluates a RAG (Retrieval-Augmented Generation) system on the Natural Questions dataset.

## Setup and Initialization

### Imports and dependencies

Import all necessary libraries for dataset loading, embeddings, LLM, text processing, and evaluation metrics.

In [1]:
import os
import pandas as pd
import json
import evaluate

from datetime import datetime
from langchain_core.messages import SystemMessage, HumanMessage
from tqdm import tqdm
from datasets import load_dataset
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.document import Document
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint, ChatHuggingFace
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv

### Config File

Load the JSON configuration file that sets the dataset split, output folder, embedding model, chunk size, chunk overlap, and LLM parameters.

In [2]:
config_file = "configs/custom_benchmark_config.json"

with open(config_file, "r") as f:
    config = json.load(f)
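
The config file itself is not shown in this notebook. As a rough sketch, the keys below are the ones read by the later cells; every value is a hypothetical placeholder, not the setting used for the run shown here, and `missing_config_keys` is an illustrative helper rather than part of the notebook.

```python
import json

# Keys that the cells in this notebook read from `config`.
REQUIRED_KEYS = {
    "dataset_split", "output_folder", "embedding_model_name",
    "chunk_size", "chunk_overlap", "llm_model", "temperature",
    "max_new_tokens", "top_k", "num_valid_examples", "use_llm",
}

def missing_config_keys(cfg):
    """Return the sorted list of required keys that are absent from cfg."""
    return sorted(REQUIRED_KEYS - set(cfg))

# Hypothetical example values (assumptions for illustration only).
example_config = json.loads("""
{
    "dataset_split": "validation[:100]",
    "output_folder": "results",
    "embedding_model_name": "sentence-transformers/all-MiniLM-L6-v2",
    "chunk_size": 500,
    "chunk_overlap": 50,
    "llm_model": "HuggingFaceH4/zephyr-7b-beta",
    "temperature": 0.1,
    "max_new_tokens": 128,
    "top_k": 3,
    "num_valid_examples": 10,
    "use_llm": false
}
""")

print(missing_config_keys(example_config))
```

Checking for missing keys right after loading can surface a typo in the config before any model is downloaded.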

### Load API Key

Load the Hugging Face API key from environment variables to authenticate model requests.

In [3]:
load_dotenv()
HF_API_KEY = os.getenv("API_KEY2")

### Load dataset

Load the Natural Questions dataset subset for benchmarking.

In [4]:
dataset = load_dataset("natural_questions", split=config["dataset_split"])

Resolving data files:   0%|          | 0/287 [00:00<?, ?it/s]


## Model Initialization

### Embeddings initialization

Create the embedding model for converting text chunks into vector representations.

In [5]:
embedding_model = HuggingFaceEmbeddings(
    model_name=config["embedding_model_name"]
)

### Text splitter configuration

Define how the document is split into overlapping chunks for embedding and retrieval.

In [6]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=config["chunk_size"],
    chunk_overlap=config["chunk_overlap"]
)
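
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces); `sliding_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap lowers the chance that an answer is cut in half at a chunk boundary, at the cost of more chunks to embed.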

### LLM setup

Configure the language model endpoint on Hugging Face Hub for text generation.

In [7]:
llm = HuggingFaceEndpoint(
    repo_id=config["llm_model"],
    huggingfacehub_api_token=HF_API_KEY,
    task="text-generation",
    temperature=config["temperature"],
    max_new_tokens=config["max_new_tokens"],
)
chat_model = ChatHuggingFace(llm=llm)

### Retrieval and querying

Functions to retrieve relevant documents locally and query the LLM with that context.

In [8]:
def retrieve_local(query, vectorstore, k=config["top_k"]):
    docs_faiss = vectorstore.similarity_search(query, k=k)
    return [d.page_content for d in docs_faiss]
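
Conceptually, `similarity_search` embeds the query and returns the `k` chunks whose vectors lie closest to it. The toy sketch below assumes Euclidean distance and uses hand-written two-dimensional vectors in place of real embeddings; `top_k_nearest` is a hypothetical name, and a real FAISS index is far more efficient than this linear scan.

```python
import math

def top_k_nearest(query_vec, doc_vecs, k):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    # Sort by distance (ties broken by index) and keep the k best.
    return [idx for _, idx in sorted(distances)[:k]]

# Toy "embeddings" standing in for the vectors of three chunks.
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(top_k_nearest([0.9, 1.0], doc_vecs, k=2))
# [1, 0]
```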

In [9]:
def ask(query, context):
    messages = [
        SystemMessage(content=f"Just answer queries based on {context}."),
        HumanMessage(content=f"""
        Answer the query: {query} based uniquely on the context: {context}, don't make up anything, just say what the context contains. If the information is not in the context, you must say you don't know. You must answer only the specified question and nothing else.
        """)
    ]
    response = chat_model.invoke(messages)
    return response.content

### Data extraction and preprocessing

Functions to extract the golden long and short answers from a dataset sample and to strip HTML tokens from the document text.

In [10]:
def extract_answers(sample):
    tokens = sample["document"]["tokens"]
    short_answer = ""
    start = sample["annotations"]["short_answers"][0]["start_token"]
    end = sample["annotations"]["short_answers"][0]["end_token"]
    if len(start) > 0:
        # Use the same integer bounds for both slices so tokens and HTML flags stay aligned.
        s, e = int(start[0]), int(end[0])
        short_answer = " ".join([
            t for t, html in zip(tokens["token"][s:e], tokens["is_html"][s:e]) if not html
        ])

    long_answer = ""
    if sample["annotations"]["long_answer"][0]["start_token"] != -1:
        start = sample["annotations"]["long_answer"][0]["start_token"]
        end = sample["annotations"]["long_answer"][0]["end_token"]
        long_answer = " ".join([
            t for t, html in zip(tokens["token"][start:end], tokens["is_html"][start:end]) if not html
        ])

    return long_answer or "", short_answer or ""

In [11]:
def preprocess_text(sample):
    tokens = sample["document"]["tokens"]
    return " ".join([t for t, html in zip(tokens["token"], tokens["is_html"]) if not html])
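
Both functions above rely on the same idea: the document is stored as parallel lists of tokens and HTML flags, and an answer is a token span with the HTML tokens filtered out. The sketch below shows this on a hand-made toy document; `span_text` is a hypothetical helper and the data is invented for illustration, following the field layout the functions above assume.

```python
def span_text(tokens, start, end):
    """Join the non-HTML tokens in tokens[start:end] into a single string."""
    words = tokens["token"][start:end]
    flags = tokens["is_html"][start:end]
    return " ".join(word for word, is_html in zip(words, flags) if not is_html)

# Toy document in the parallel-list layout assumed above.
toy_tokens = {
    "token": ["<P>", "Paris", "is", "the", "capital", "of", "France", "</P>"],
    "is_html": [True, False, False, False, False, False, False, True],
}

print(span_text(toy_tokens, 0, len(toy_tokens["token"])))  # whole document
print(span_text(toy_tokens, 1, 2))                         # a short-answer span
```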

## Benchmarking

### Retrieve benchmark

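To evaluate retrieval, we find the longest substring of the golden context that also appears in the prediction and divide its length by the length of the golden context. A score of 1 means the prediction contains the entire golden context verbatim; a score near 0 means almost nothing was recovered.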

In [12]:
def find_longest_match(str1, str2):
    max_len = 0
    len2 = len(str2)

    for i in range(len2):
        for j in range(i + 1, len2 + 1):
            substr = str2[i:j]
            if substr in str1 and len(substr) > max_len:
                max_len = len(substr)

    return max_len / len2 if len2 > 0 else 0
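
The nested loops above test every substring of the golden context, which becomes slow for long passages. As a possible alternative, the standard library's `difflib.SequenceMatcher` can find the longest common block directly. This is a sketch with a hypothetical function name, `longest_match_ratio`; it is meant to return the same ratio, with `autojunk=False` so that frequent characters are not ignored on long inputs.

```python
from difflib import SequenceMatcher

def longest_match_ratio(prediction, reference):
    """Length of the longest substring shared by both strings, divided by len(reference)."""
    if not reference:
        return 0
    matcher = SequenceMatcher(None, prediction, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(prediction), 0, len(reference))
    return match.size / len(reference)

# " cat sat" (8 characters) is the longest shared substring; the reference has 14.
print(longest_match_ratio("the cat sat", "a cat sat down"))
```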


## Evaluation benchmark

Load the ROUGE and BLEU metrics used to score the model's predictions against the golden context.

In [13]:
metric_rouge = evaluate.load("rouge")
metric_bleu = evaluate.load("bleu")
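
For intuition about the `rougeL` score reported later, ROUGE-L is based on the longest common subsequence (LCS) of tokens shared by the prediction and the reference. The sketch below is a simplified version that lowercases and splits on whitespace; the `evaluate` implementation applies its own tokenization, so its numbers may differ. Both function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """Simplified ROUGE-L F1 over lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# LCS is 3 tokens: precision 3/3, recall 3/6, so F1 is 2/3.
print(rouge_l_f1("the cat sat", "the cat sat on the mat"))
```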

### Single sample processing

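Process one sample: build the context, optionally query the LLM, extract the golden answers, and compute the metrics. The same metrics are also computed for a "golden" scenario in which the golden context is used in place of the retrieved context.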

In [14]:
def process_sample(i, sample, use_llm):
    golden_context, golden_answer = extract_answers(sample)
    if golden_answer == "" or golden_context == "":
        return None
    
    query = sample["question"]["text"]
    
    # Standard request
    text = preprocess_text(sample)
    docs = [Document(page_content=text)]
    split_docs = splitter.split_documents(docs)
    vectorstore = FAISS.from_documents(split_docs, embedding_model)
    context = retrieve_local(query, vectorstore)
    
    if use_llm:
        prediction = ask(query, context)
    else:   
        prediction = context[0]
        
    rouge = metric_rouge.compute(predictions=[prediction], references=[golden_context])
    bleu = metric_bleu.compute(predictions=[prediction], references=[golden_context])
    longest_match = find_longest_match(prediction, golden_context)
    
    # Golden answer scenario
    if use_llm:
        prediction_golden = ask(query, golden_context)
    else:
        prediction_golden = golden_context
        
    rouge_golden = metric_rouge.compute(predictions=[prediction_golden], references=[golden_context])
    bleu_golden = metric_bleu.compute(predictions=[prediction_golden], references=[golden_context])
    longest_match_golden = find_longest_match(prediction_golden, golden_context)

    result = {
        "index": i,
        "rougeL": rouge["rougeL"],
        "rougeL_golden": rouge_golden["rougeL"],
        "bleu": bleu["bleu"],
        "bleu_golden": bleu_golden["bleu"],
        "longest_match": longest_match,
        "longest_match_golden": longest_match_golden,
        "query": query,
        "gold_answer": golden_answer,
        "prediction": prediction,
        "prediction_golden": prediction_golden,
    }
    result.update({k: str(v) for k, v in config.items()})  # Add config values
    return result



### Full benchmark loop

Run the evaluation until the target number of valid samples with answers is reached, skipping samples with empty references.

In [15]:
num_valid_examples = config["num_valid_examples"]
use_llm = config["use_llm"]

results = []
i = 0

with tqdm(total=num_valid_examples) as pbar:
    while len(results) < num_valid_examples and i < len(dataset):
        sample = dataset[i]
        
        result = process_sample(i, sample, use_llm)
            
        if result:
            results.append(result)
            pbar.update(1)
            
        i += 1

print(f"Processed {len(results)} valid samples out of {i} total samples")

100%|██████████| 10/10 [00:37<00:00,  3.73s/it]

Processed 10 valid samples out of 40 total samples





### Save and display results

Print the per-sample predictions and metric scores, then save the metrics to CSV and the configuration used to JSON.

In [16]:
for r in results:
    print(f"\n\n===================================== SAMPLE {r['index']} =====================================")
    print(f"QUERY: {r['query']}\n")
    print(f"PREDICTION: {r['prediction']}\n")
    print(f"PREDICTION (GOLDEN): {r['prediction_golden']}\n")
    print(f"GOLDEN ANSWER: {r['gold_answer']}\n")
    print(f"rougeL: {r['rougeL']}, rougeL_golden: {r['rougeL_golden']}")
    print(f"bleu: {r['bleu']}, bleu_golden: {r['bleu_golden']}")
    print(f"longest match: {r['longest_match']}, longest match golden: {r['longest_match_golden']}")



QUERY: when did now thats what i call music come out

PREDICTION: Now That 's What I Call Music ( 1983 ) Now That 's What I Call Music II ( 1984 ) Now That 's What I Call Music ( also simply titled Now or Now 1 ) is the first album from the popular Now ! series

PREDICTION (GOLDEN): Now That 's What I Call Music ( also simply titled Now or Now 1 ) is the first album from the popular Now ! series that was released in the United Kingdom on 28 November 1983 . Initial pressings were released on vinyl and audio cassette . To celebrate the 25th anniversary of the album and series , the album was re-released on CD for the first time in 2009 . However , alternative longer mixes of Only For Love , Double Dutch and Candy Girl were included in place of the original shorter single mixes from 1983 . A double vinyl re-release followed for Record Store Day on 18 April 2015 . In July 2018 , the album was newly remastered and re-released on CD , vinyl and cassette to commemorate the release of the 10

In [17]:
output_folder = config["output_folder"]
# Single quotes inside the f-string keep this valid on Python versions before 3.12.
run_name = f"run_custom_{datetime.now().strftime('%Y_%m_%d_%H%M%S')}"
out_path = os.path.join(output_folder, run_name)
os.makedirs(out_path, exist_ok=True)

results_df = pd.DataFrame(results)
results_df.drop(["prediction_golden", "prediction"], axis=1, inplace=True)

file_path = os.path.join(out_path, "results.csv")
results_df.to_csv(file_path, index=False)

with open(os.path.join(out_path, "used_config.json"), "w") as f:
    json.dump(config, f, indent=4)
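
To summarize a run in a single line per metric, the per-sample scores can be averaged. The sketch below assumes each entry of `results` has the metric keys produced by `process_sample`; `average_metrics` is a hypothetical helper and the two toy rows are invented for illustration.

```python
from statistics import mean

# Metric fields stored for every processed sample.
METRIC_KEYS = [
    "rougeL", "rougeL_golden",
    "bleu", "bleu_golden",
    "longest_match", "longest_match_golden",
]

def average_metrics(results):
    """Return the mean of each metric over all result rows (empty dict if no rows)."""
    if not results:
        return {}
    return {key: mean(float(row[key]) for row in results) for key in METRIC_KEYS}

# Toy rows standing in for the notebook's `results` list.
toy_results = [
    {"rougeL": 0.25, "rougeL_golden": 1.0, "bleu": 0.0, "bleu_golden": 1.0,
     "longest_match": 0.25, "longest_match_golden": 1.0},
    {"rougeL": 0.75, "rougeL_golden": 1.0, "bleu": 0.5, "bleu_golden": 1.0,
     "longest_match": 0.75, "longest_match_golden": 1.0},
]

for name, value in average_metrics(toy_results).items():
    print(f"{name}: {value:.3f}")
```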