# RAG Benchmark Evaluation on Natural Questions

This notebook evaluates a RAG (Retrieval-Augmented Generation) system on the Natural Questions dataset.

## Setup and Initialization

### Imports and dependencies

Import all necessary libraries for dataset loading, embeddings, LLM, text processing, and evaluation metrics.

In [None]:
import os
import time
import re

import pandas as pd
import json
import evaluate
import numpy as np

from datetime import datetime
from langchain_core.messages import SystemMessage, HumanMessage
from tqdm.notebook import tqdm
from datasets import load_dataset
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.document import Document
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint, ChatHuggingFace
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer, util
from collections import defaultdict

### Config File

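Load the benchmark configuration from a JSON file. It defines the dataset split, output folder, embedding model, chunk size and overlap, retrieval depth (`top_k`, `k_values`), and LLM parameters. The `only_retrieve` flag skips the generation step so that only retrieval metrics are computed.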

In [None]:
config_file = "configs/custom_benchmark_config.json"

with open(config_file, "r") as f:
    config = json.load(f)

only_retrieve = config["only_retrieve"]
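
The keys read throughout this notebook suggest a config file shaped roughly like the sketch below. Every value is an illustrative placeholder rather than the setting used for any reported run, and `missing_keys` is a hypothetical helper (not part of the notebook) for catching a misspelled or absent key before a long benchmark starts.

```python
# Keys the notebook reads from the config; all values are placeholders.
example_config = {
    "dataset_split": "validation[:200]",
    "only_retrieve": True,
    "embedding_model_name": "sentence-transformers/all-MiniLM-L6-v2",
    "chunk_size": 500,
    "chunk_overlap": 50,
    "top_k": 5,
    "k_values": [1, 3, 5],
    "llm_model": "<hub-repo-id>",
    "temperature": 0.1,
    "max_new_tokens": 256,
    "num_valid_examples": 50,
    "output_folder": "results",
    "run_name": "example_run",
}

REQUIRED_KEYS = set(example_config)


def missing_keys(cfg):
    """Return the required keys absent from cfg, sorted for stable output."""
    return sorted(REQUIRED_KEYS - set(cfg))


print(missing_keys(example_config))  # []
print(missing_keys({"top_k": 5}))    # every other key is reported
```

Running such a check right after `json.load` fails fast instead of raising a `KeyError` halfway through the benchmark.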

### Load API Key

Load the Hugging Face API key from the `.env` file or the environment (variable `API_KEY10`) to authenticate model requests.

In [None]:
load_dotenv()
HF_API_KEY = os.getenv("API_KEY10")

### Load dataset

Load the Natural Questions dataset subset for benchmarking.

In [None]:
dataset = load_dataset("natural_questions", split=config["dataset_split"])

## Model Initialization

### Embeddings initialization

Create the embedding model for converting text chunks into vector representations.

In [None]:
embedding_model = HuggingFaceEmbeddings(
    model_name=config["embedding_model_name"]
)

### Text splitter configuration

Define how the document is split into overlapping chunks for embedding and retrieval.

In [None]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=config["chunk_size"],
    chunk_overlap=config["chunk_overlap"]
)
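
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified sliding-window sketch. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and spaces; `sliding_chunks` is a hypothetical helper that only illustrates the two parameters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window repeats the last two characters of the previous one, so a short answer that straddles a boundary still appears whole in at least one chunk as long as it is no longer than the overlap.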

### LLM setup

Configure the language model endpoint on Hugging Face Hub for text generation.

In [None]:
if not only_retrieve:
    llm = HuggingFaceEndpoint(
        repo_id=config["llm_model"],
        huggingfacehub_api_token=HF_API_KEY,
        task="text-generation",
        temperature=config["temperature"],
        max_new_tokens=config["max_new_tokens"],
    )
    chat_model = ChatHuggingFace(llm=llm)

### Retrieval and querying

Functions to retrieve relevant documents locally and query the LLM with that context.

In [None]:
def retrieve_local(query, vectorstore, k=config["top_k"]):
    docs_faiss = vectorstore.similarity_search(query, k=k)
    return [d.page_content for d in docs_faiss]

In [None]:
def get_relevant_doc_ids(split_docs, golden_answer=None):
    relevant_ids = []
    for doc in split_docs:
        content = doc.page_content.lower()
        if golden_answer and golden_answer.strip() and golden_answer.strip().lower() in content:
            relevant_ids.append(doc.metadata["doc_id"])
    return relevant_ids
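
The relevance labels come from a case-insensitive substring test of the golden answer against each chunk. The sketch below applies the same test to plain strings instead of `Document` objects; `label_relevant` and the example chunks are hypothetical and exist only to show the behaviour, including one limitation.

```python
def label_relevant(chunks, answer):
    """Return ids of the chunks that contain the answer, ignoring case."""
    needle = answer.strip().lower()
    if not needle:
        return []
    return [f"doc_{i}" for i, chunk in enumerate(chunks) if needle in chunk.lower()]


chunks = ["The album was released on 28 November", "November 1983 in the UK."]

print(label_relevant(chunks, "November"))          # ['doc_0', 'doc_1']
print(label_relevant(chunks, "28 November 1983"))  # []
```

The second call returns no ids because the answer is split across two chunks, so a sample like this would score zero recall however good the retriever is. A sufficiently large `chunk_overlap` reduces how often that happens.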

In [None]:
def ask(query, context):
    # Retrieved context arrives as a list of chunks; join it so the prompt
    # contains plain text rather than the repr of a Python list.
    if not isinstance(context, str):
        context = "\n\n".join(context)
    messages = [
        SystemMessage(content=f"Just answer queries based on {context}."),
        HumanMessage(content=f"""
        Answer the query: {query} based uniquely on the context: {context}, don't make up anything, just say what the context contains. If the information is not in the context, you must say you don't know. You must answer only the specified question and nothing else.
        """)
    ]
    response = chat_model.invoke(messages)
    # Strip the reasoning block that some models emit before the answer.
    cleaned = re.sub(r"<think>[\s\S]*?</think>\s*", "", response.content)
    return cleaned.strip()

# Example: ask("What is the capital of France?", "France is a country in Europe.")

In [None]:
def clean_think_tags(text):
    """Remove <think>...</think> reasoning blocks and the whitespace around them."""
    cleaned = re.sub(r"\s*<think>[\s\S]*?</think>\s*", "", text)
    return cleaned.strip()

sample = """
<think>
Okay, let me try to figure this out. The user is asking whether the statement "The context contains information about the 'Now That's What I Call Music ( original UK album )', which is referenced in the statement "Now That's What I Call Music (", indicating that it is related to the album series. However, it does not specify the exact release year of the original UK album. To provide a precise answer, I would need to refer to external sources or more detailed context. Based on the information provided, I do not know the exact release year of the original UK album."

First, I need to parse the given text. The text mentions that the album was released in the UK on 28 November 1983. The user is saying that the context provided doesn't specify the exact release year of the original UK album. Wait, but the text does say 28 November 1983. So the user is pointing out that even though the text mentions the release date, they are saying that the context doesn't specify it. That seems contradictory.

Wait, the user's statement is a bit confusing. Let me read it again. The user is saying that the context (the text provided) contains information about the "Now That's What I Call Music ( original UK album )", which is referenced in the statement "Now That's What I Call Music (", indicating that it's related to the series. However, the context does not specify the exact release year. But the text clearly states the release date as 28 November 1983. So the user is saying that the context doesn't specify it, but the text does. That seems like a contradiction. 

Wait, maybe the user is saying that the context (the text) does not specify the release year, but the user is aware that the text does. So the user is pointing out that the context (the text) doesn't have the release year, but in reality, the text does. Therefore, the user's statement is incorrect because the text does specify the release year. 

Alternatively, maybe the user is confused. Let me check the original text again. The original text says: "Initial pressings were released on vinyl and audio cassette. To celebrate the 25th anniversary of the album and series, the album was re-released on CD for the first time in 2009. However, alternative longer mixes of Only For Love, Double Dutch and Candy Girl were included in place of the original shorter single mixes from 1983. A double vinyl re-release followed for Record Store Day on 18 April 2015. In July 2018, the album was newly remastered and re-released on CD, vinyl and cassette to commemorate the release of the 100th volume of the series."

So the original release date is mentioned as 28 November 1983. Therefore, the user's statement that the context does not specify the exact release year is incorrect because the text does specify it. Therefore, the answer is that the user's statement is incorrect because the context does provide the release year. The user is wrong in their assertion that the context does not specify it. The correct answer is that the user's statement is incorrect because the text does include the release year.
</think>babaab
balblablalbalbalb
"""

print(clean_think_tags(sample))

In [None]:
def semantic_check(query, ground_truth, prediction):
    # `query` is kept for a uniform call signature; the judge prompt only
    # compares the prediction against the ground truth.
    messages = [
        HumanMessage(content=f"""
        {ground_truth}

        Given the previous statements, you MUST say (briefly) whether the following phrase is correct:

        {prediction}
        """)
    ]
    response = chat_model.invoke(messages)
    # Strip the reasoning block that some models emit before the verdict.
    cleaned = re.sub(r"\s*<think>[\s\S]*?</think>\s*", "", response.content)
    return cleaned.strip()


### Data extraction and preprocessing

Functions to extract the gold long and short answers from a dataset sample and to rebuild the plain document text without HTML tokens.

In [None]:
def extract_answers(sample):
    tokens = sample["document"]["tokens"]

    # Only the first annotation of each sample is used.
    short_annotation = sample["annotations"]["short_answers"][0]
    short_answer = ""
    if len(short_annotation["start_token"]) > 0:
        # Take the first short-answer span; cast once so both slices use ints.
        start = int(short_annotation["start_token"][0])
        end = int(short_annotation["end_token"][0])
        short_answer = " ".join(
            t for t, html in zip(tokens["token"][start:end], tokens["is_html"][start:end]) if not html
        )

    long_annotation = sample["annotations"]["long_answer"][0]
    long_answer = ""
    # A start token of -1 marks a sample without a long answer.
    if long_annotation["start_token"] != -1:
        start = int(long_annotation["start_token"])
        end = int(long_annotation["end_token"])
        long_answer = " ".join(
            t for t, html in zip(tokens["token"][start:end], tokens["is_html"][start:end]) if not html
        )

    return long_answer, short_answer

In [None]:
def preprocess_text(sample):
    tokens = sample["document"]["tokens"]
    return " ".join([t for t, html in zip(tokens["token"], tokens["is_html"]) if not html])
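
Both functions rely on the same layout: parallel `token` and `is_html` lists, with answers given as token-index spans. The toy sample below is hand-made to mirror the fields accessed above (it is not a real Natural Questions record), and `strip_html_tokens` is a hypothetical stand-in for the filtering step.

```python
def strip_html_tokens(tokens, start=None, end=None):
    """Join the non-HTML tokens in tokens[start:end] with single spaces."""
    pairs = zip(tokens["token"][start:end], tokens["is_html"][start:end])
    return " ".join(t for t, html in pairs if not html)


# Hand-made tokens in the same shape as sample["document"]["tokens"].
toy_tokens = {
    "token": ["<P>", "Paris", "is", "the", "capital", "of", "France", ".", "</P>"],
    "is_html": [True, False, False, False, False, False, False, False, True],
}

print(strip_html_tokens(toy_tokens))        # Paris is the capital of France .
print(strip_html_tokens(toy_tokens, 1, 2))  # Paris
```

Because tokens are joined with spaces, punctuation ends up detached (`France .`), which is worth keeping in mind when the extracted answers are later matched as substrings.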

## Benchmarking 

### Retrieve benchmark

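`find_longest_match` returns the length of the longest substring of the golden context that also appears in the prediction, as a fraction of the golden context length. A value near 1 means the prediction reproduces most of the golden context verbatim. The search enumerates every substring of the golden context, so it becomes slow for long passages.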

In [None]:
def find_longest_match(str1, str2):
    max_len = 0
    len2 = len(str2)

    for i in range(len2):
        for j in range(i + 1, len2 + 1):
            substr = str2[i:j]
            if substr in str1 and len(substr) > max_len:
                max_len = len(substr)

    return max_len / len2 if len2 > 0 else 0
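
The brute-force search above tests every substring of `str2`, which gets expensive on long golden contexts. One possible alternative, sketched here under the assumption that only the ratio is needed, is the standard dynamic-programming longest-common-substring recurrence. The name `longest_match_ratio` is hypothetical and the function is not used elsewhere in this notebook.

```python
def longest_match_ratio(prediction, reference):
    """Length of the longest common substring divided by len(reference)."""
    if not reference:
        return 0.0
    best = 0
    # prev[j] = length of the common suffix ending at the previous prediction
    # character and reference[j - 1].
    prev = [0] * (len(reference) + 1)
    for ch_p in prediction:
        curr = [0] * (len(reference) + 1)
        for j, ch_r in enumerate(reference, start=1):
            if ch_p == ch_r:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best / len(reference)


print(longest_match_ratio("released in 1983", "1983"))  # 1.0
print(longest_match_ratio("abcxyz", "xyzabc"))          # 0.5
```

This runs in time proportional to `len(prediction) * len(reference)` and keeps only two rows of the table in memory.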


## Evaluation benchmark

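Load the ROUGE, BLEU, and BLEURT metrics and a sentence-embedding model for comparing the model's predictions against the golden context. The cell after that defines the retrieval metrics: precision@k, recall@k, average precision, and nDCG@k.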

In [None]:
if not only_retrieve:
    metric_rouge = evaluate.load("rouge")
    metric_bleu = evaluate.load("bleu")
    metric_bleurt = evaluate.load("bleurt", "bleurt-large-512")
    comparison_model = SentenceTransformer(config["embedding_model_name"])

In [None]:
def precision_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    return len(set(retrieved_k) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    retrieved_k = retrieved[:k]
    return len(set(retrieved_k) & set(relevant)) / len(relevant) if relevant else 0

def average_precision(retrieved, relevant, k):
    score = 0.0
    num_hits = 0.0
    for i, doc_id in enumerate(retrieved[:k]):
        if doc_id in relevant:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    return score / len(relevant) if relevant else 0

def dcg_at_k(retrieved, relevant, k):
    dcg = 0.0
    for i, doc_id in enumerate(retrieved[:k]):
        if doc_id in relevant:
            dcg += 1.0 / np.log2(i + 2)
    return dcg

def ndcg_at_k(retrieved, relevant, k):
    dcg = dcg_at_k(retrieved, relevant, k)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / idcg if idcg > 0 else 0
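
As a sanity check on the definitions above, the following worked example computes the same four quantities by hand for one small ranking. The document ids are made up for illustration.

```python
import math

retrieved = ["doc_3", "doc_0", "doc_7"]  # ranked retriever output
relevant = {"doc_0", "doc_7"}            # chunks containing the answer
k = 3

hits = [doc_id in relevant for doc_id in retrieved[:k]]  # [False, True, True]

precision = sum(hits) / k
recall = sum(hits) / len(relevant)

# Average precision: precision at each rank where a hit occurs,
# summed and divided by the number of relevant chunks.
num_hits, ap_sum = 0, 0.0
for rank, hit in enumerate(hits, start=1):
    if hit:
        num_hits += 1
        ap_sum += num_hits / rank
ap = ap_sum / len(relevant)

# DCG discounts each hit by log2(rank + 1); the ideal DCG puts all hits first.
dcg = sum(1 / math.log2(i + 2) for i, hit in enumerate(hits) if hit)
idcg = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
ndcg = dcg / idcg

print(round(precision, 3), recall, round(ap, 3), round(ndcg, 3))
# 0.667 1.0 0.583 0.693
```

Recall is perfect here, yet average precision and nDCG are penalised because the first-ranked chunk is not relevant, which is the kind of ordering effect the rank-aware metrics are meant to expose.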

### Single sample processing

Process one sample: build a vector store from the document chunks, compute the retrieval metrics, and, unless `only_retrieve` is set, query the LLM and compute the generation metrics.

In [None]:
def process_sample(i, sample):
    k_values = config.get("k_values", [1, 3, 5, 10])
    golden_context, golden_answer = extract_answers(sample)
    if golden_answer == "" or golden_context == "":
        return None
    
    query = sample["question"]["text"]

    text = preprocess_text(sample)
    docs = [Document(page_content=text)]
    split_docs = splitter.split_documents(docs)
    for idx, doc in enumerate(split_docs):
        doc.metadata = {"doc_id": f"doc_{idx}"}
    vectorstore = FAISS.from_documents(split_docs, embedding_model)

    # Retrieve enough chunks to cover the largest k used by the metrics below.
    docs_faiss = vectorstore.similarity_search(query, k=max(config["top_k"], max(k_values)))
    retrieved_ids = [d.metadata["doc_id"] for d in docs_faiss]

    relevant_ids = get_relevant_doc_ids(split_docs, golden_answer)

    precision_dict = {}
    recall_dict = {}
    map_dict = {}
    ndcg_dict = {}
    for k in k_values:
        precision_dict[f"precision@{k}"] = precision_at_k(retrieved_ids, relevant_ids, k)
        recall_dict[f"recall@{k}"] = recall_at_k(retrieved_ids, relevant_ids, k)
        map_dict[f"map@{k}"] = average_precision(retrieved_ids, relevant_ids, k)
        ndcg_dict[f"ndcg@{k}"] = ndcg_at_k(retrieved_ids, relevant_ids, k)

    if not only_retrieve:
        golden_embeddings = comparison_model.encode(golden_context, convert_to_tensor=True)
        
        context = retrieve_local(query, vectorstore)
        prediction = ask(query, context)
        
        rouge = metric_rouge.compute(predictions=[prediction], references=[golden_context])
        bleu = metric_bleu.compute(predictions=[prediction], references=[golden_context])
        bleurt = metric_bleurt.compute(predictions=[prediction], references=[golden_context])
        longest_match = find_longest_match(prediction, golden_context)
        pred_embeddings = comparison_model.encode(prediction, convert_to_tensor=True)
        cosine_similarity = util.pytorch_cos_sim(pred_embeddings, golden_embeddings)
        semantic_evaluation = semantic_check(query, golden_context, prediction)

        prediction_golden = ask(query, golden_context)
            
        rouge_golden = metric_rouge.compute(predictions=[prediction_golden], references=[golden_context])
        bleu_golden = metric_bleu.compute(predictions=[prediction_golden], references=[golden_context])
        bleurt_golden = metric_bleurt.compute(predictions=[prediction_golden], references=[golden_context])
        longest_match_golden = find_longest_match(prediction_golden, golden_context)
        pred_embeddings_golden = comparison_model.encode(prediction_golden, convert_to_tensor=True)
        cosine_similarity_golden = util.pytorch_cos_sim(pred_embeddings_golden, golden_embeddings)
        semantic_evaluation_golden = semantic_check(query, golden_context, prediction_golden)
    
        result = {
            "index": i,
            "rougeL": rouge["rougeL"],
            "rougeL_golden": rouge_golden["rougeL"],
            "bleu": bleu["bleu"],
            "bleu_golden": bleu_golden["bleu"],
            "bleurt": bleurt["scores"][0],
            "bleurt_golden": bleurt_golden["scores"][0],
            "cosine_similarity": cosine_similarity.item(),
            "cosine_similarity_golden": cosine_similarity_golden.item(),
            "longest_match": longest_match,
            "longest_match_golden": longest_match_golden,
            "query": query,
            "gold_answer": golden_answer,
            "golden_context": golden_context,
            "prediction": prediction,
            "prediction_golden": prediction_golden,
            "semantic_evaluation": semantic_evaluation,
            "semantic_evaluation_golden": semantic_evaluation_golden,
            "retrieved_docs": context,
            "relevant_ids": relevant_ids,
            "retrieved_ids": retrieved_ids,
            "precision_dict": precision_dict,
            "recall_dict": recall_dict,
            "map_dict": map_dict,
            "ndcg_dict": ndcg_dict
        }
    else:
        result = {
            "index": i,
            "query": query,
            "gold_answer": golden_answer,
            "golden_context": golden_context,
            "relevant_ids": relevant_ids,
            "retrieved_ids": retrieved_ids,
            "precision_dict": precision_dict,
            "recall_dict": recall_dict,
            "map_dict": map_dict,
            "ndcg_dict": ndcg_dict
        }
    return result

### Full benchmark loop

Run the evaluation until the target number of valid samples with answers is reached, skipping samples with empty references.

In [None]:
num_valid_examples = config["num_valid_examples"]

results = []
i = 0

with tqdm(total=num_valid_examples) as pbar:
    while len(results) < num_valid_examples and i < len(dataset):
        sample = dataset[i]
        
        start_time = time.perf_counter()
        result = process_sample(i, sample)
        elapsed_time = time.perf_counter() - start_time
            
        if result:
            result["elapsed_time"] = elapsed_time
            results.append(result)
            pbar.update(1)
            
        i += 1

print(f"Processed {len(results)} valid samples out of {i} total samples")

### Save and display results

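Print the per-sample results, then save the queries, model responses, generation metrics, averaged retrieval metrics, and the configuration used to CSV and JSON files.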

In [None]:
for r in results:
    print(f"\n\n===================================== SAMPLE {r['index']} =====================================")
    print(f"QUERY: {r['query']}\n")
    print(f"GROUND TRUTH: {r['golden_context']}\n")
    print(f"GOLDEN ANSWER: {r['gold_answer']}\n")
    if not only_retrieve:
        print(f"PREDICTION: {r['prediction']}\n")
        print(f"PREDICTION SEMANTIC EVALUATION: {r['semantic_evaluation']}")
        print(f"PREDICTION GOLDEN: {r['prediction_golden']}\n")
        print(f"PREDICTION GOLDEN SEMANTIC EVALUATION: {r['semantic_evaluation_golden']}")
        print(f"rougeL: {r['rougeL']}, rougeL_golden: {r['rougeL_golden']}")
        print(f"bleu: {r['bleu']}, bleu_golden: {r['bleu_golden']}")
        print(f"bleurt: {r['bleurt']}, bleurt_golden: {r['bleurt_golden']}")
        print(f"cosine_similarity: {r['cosine_similarity']}, cosine_similarity_golden: {r['cosine_similarity_golden']}")
        print(f"longest match: {r['longest_match']}, longest_match_golden: {r['longest_match_golden']}")
    print("retrieved_ids:", r['retrieved_ids'])
    print("relevant_ids:", r['relevant_ids'])
    print("precision: ", r['precision_dict'])
    print("recall: ", r['recall_dict'])
    print("map: ", r['map_dict'])
    print("ndcg: ", r['ndcg_dict'])

In [None]:
output_folder = config["output_folder"]
name = config["run_name"]
run_name = f"{name}_{datetime.now().strftime('%Y_%m_%d_%H%M%S')}"
out_path = os.path.join(output_folder, run_name)
os.makedirs(out_path, exist_ok=True)

results_df = pd.DataFrame(results)

# Split metrics into LLM (generation) and retrieval
general_infos = [
    "query", "gold_answer", "golden_context", "elapsed_time"
]
llm_data = [
    "prediction", "prediction_golden", "semantic_evaluation", "semantic_evaluation_golden"
]
llm_metrics = [
    "rougeL", "rougeL_golden", "bleu", "bleu_golden", "bleurt", "bleurt_golden",
    "cosine_similarity", "cosine_similarity_golden", "longest_match", "longest_match_golden"
]
retrieval_data = [
    "relevant_ids", "retrieved_ids"
]
retrieval_metrics = ["precision_dict", "recall_dict", "map_dict", "ndcg_dict"]

# Always include these columns
common_cols = ["index"]

# Save general infos of test run
general_infos_df = results_df[common_cols + general_infos]
general_infos_json = general_infos_df.to_dict(orient="records")
with open(os.path.join(out_path, "queries.json"), "w", encoding="utf-8") as f:
    json.dump(general_infos_json, f, indent=4, ensure_ascii=False)

# Save LLM metrics
if not only_retrieve:
    results_df[llm_metrics].to_csv(os.path.join(out_path, "generation_metrics.csv"), index=False)
    # Save LLM responses
    llm_data_df = results_df[common_cols + llm_data]
    general_infos_json = llm_data_df.to_dict(orient="records")
    with open(os.path.join(out_path, "responses.json"), "w", encoding="utf-8") as f:
        json.dump(general_infos_json, f, indent=4, ensure_ascii=False)

# Save retrieval metrics
sums = {m: defaultdict(float) for m in retrieval_metrics}
counts = {m: defaultdict(int) for m in retrieval_metrics}

for r in results:
    for m in retrieval_metrics:
        for k, v in r[m].items():
            sums[m][k] += float(v)
            counts[m][k] += 1

means = {
    m: {k: (sums[m][k] / counts[m][k]) for k in sums[m]}
    for m in retrieval_metrics
}

# One row per k, one column per metric (mean over all samples)
k_values = config.get("k_values", [1, 3, 5, 10])
means_df = pd.DataFrame({"k": k_values})
for name, per_k in means.items():
    metric = name.replace("_dict", "")  # e.g. "precision_dict" -> "precision"
    means_df[metric] = [per_k.get(f"{metric}@{k}") for k in k_values]
means_df.to_csv(os.path.join(out_path, "retrieval_metrics.csv"), index=False)

# Save retrieved and relevant chunk ids
retrieval_data_df = results_df[common_cols + retrieval_data]
retrieval_data_json = retrieval_data_df.to_dict(orient="records")
with open(os.path.join(out_path, "retrieved_docs.json"), "w", encoding="utf-8") as f:
    json.dump(retrieval_data_json, f, indent=4, ensure_ascii=False)

# Save config
with open(os.path.join(out_path, "used_config.json"), "w") as f:
    json.dump(config, f, indent=4)
