# RAG Benchmark Evaluation on Natural Questions

This notebook evaluates a RAG (Retrieval-Augmented Generation) system on the Natural Questions dataset.

## Setup and Initialization

### Imports and dependencies

Import all necessary libraries for dataset loading, embeddings, LLM, text processing, and evaluation metrics.

In [357]:
import os
import time

import pandas as pd
import json
import evaluate

from datetime import datetime
from langchain_core.messages import SystemMessage, HumanMessage
from tqdm.notebook import tqdm
from datasets import load_dataset
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.document import Document
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpoint, ChatHuggingFace
from langchain.text_splitter import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer, util

### Config File

Load the benchmark configuration from a JSON file: dataset split, output folder, embedding model, chunk size and overlap, number of retrieved chunks, and LLM parameters.

In [358]:
config_file = "configs/custom_benchmark_config.json"

with open(config_file, "r") as f:
    config = json.load(f)
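
The later cells read a fixed set of keys from this file. As a minimal sketch, a small check can fail early when one is missing. The helper name `validate_config` and every example value below are assumptions for illustration, not the settings used in this run.

```python
import json

# Keys read by the later cells of this notebook.
REQUIRED_KEYS = {
    "dataset_split", "output_folder", "embedding_model_name",
    "chunk_size", "chunk_overlap", "llm_model", "temperature",
    "max_new_tokens", "top_k", "num_valid_examples",
}

def validate_config(cfg):
    """Return cfg unchanged, or raise KeyError listing the missing keys."""
    missing = sorted(REQUIRED_KEYS - set(cfg))
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    return cfg

# Hypothetical example values, for illustration only.
example_config = json.loads("""
{
    "dataset_split": "validation[:100]",
    "output_folder": "results",
    "embedding_model_name": "sentence-transformers/all-MiniLM-L6-v2",
    "chunk_size": 500,
    "chunk_overlap": 50,
    "llm_model": "HuggingFaceH4/zephyr-7b-beta",
    "temperature": 0.1,
    "max_new_tokens": 256,
    "top_k": 3,
    "num_valid_examples": 3
}
""")
validate_config(example_config)
```

Calling `validate_config(config)` right after `json.load` would surface a missing key before any model is downloaded.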

### Load API Key

Load the Hugging Face API key from environment variables to authenticate model requests.

In [359]:
load_dotenv()
HF_API_KEY = os.getenv("API_KEY7")

### Load dataset

Load the Natural Questions dataset subset for benchmarking.

In [360]:
dataset = load_dataset("natural_questions", split=config["dataset_split"])

Resolving data files:   0%|          | 0/287 [00:00<?, ?it/s]

## Model Initialization

### Embeddings initialization

Create the embedding model for converting text chunks into vector representations.

In [361]:
embedding_model = HuggingFaceEmbeddings(
    model_name=config["embedding_model_name"]
)

### Text splitter configuration

Define how the document is split into overlapping chunks for embedding and retrieval.

In [362]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=config["chunk_size"],
    chunk_overlap=config["chunk_overlap"]
)
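
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With an overlap, a sentence cut at a chunk boundary still appears whole in the neighboring chunk, at the cost of embedding some text twice.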

### LLM setup

Configure the language model endpoint on Hugging Face Hub for text generation.

In [363]:
llm = HuggingFaceEndpoint(
    repo_id=config["llm_model"],
    huggingfacehub_api_token=HF_API_KEY,
    task="text-generation",
    temperature=config["temperature"],
    max_new_tokens=config["max_new_tokens"],
)
chat_model = ChatHuggingFace(llm=llm)

### Retrieval and querying

Functions to retrieve relevant documents locally and query the LLM with that context.

In [364]:
def retrieve_local(query, vectorstore, k=config["top_k"]):
    docs_faiss = vectorstore.similarity_search(query, k=k)
    return [d.page_content for d in docs_faiss]
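
Conceptually, `similarity_search` ranks the stored chunk vectors by closeness to the query vector and keeps the best `k`. The sketch below illustrates that idea with cosine similarity on toy 2-D vectors. The distance actually used depends on how the FAISS index is configured, and the helper names `cosine` and `top_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k):
    """Return the indices of the k chunks most similar to the query, best first."""
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy embeddings: chunk 1 points almost in the same direction as the query.
toy_chunks = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_chunks, k=2))
```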

In [365]:
def ask(query, context):
    messages = [
        SystemMessage(content=f"Just answer queries based on {context}."),
        HumanMessage(content=f"""
        Answer the query: {query} based uniquely on the context: {context}, don't make up anything, just say what the context contains. If the information is not in the context, you must say you don't know. You must answer only the specified question and nothing else.
        """)
    ]
    response = chat_model.invoke(messages)
    return response.content

#ask("What is the capital of France?", "France is a country in Europe.")



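### Semantic check

Ask the LLM to judge whether a prediction is consistent with the ground-truth passage.

In [None]: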
def semantic_check(query, ground_truth, prediction):
    # LLM-as-judge: ask the chat model whether `prediction` is correct given
    # `ground_truth`. Note that `query` is currently unused in the prompt.
    messages = [
        HumanMessage(content=f"""
        {ground_truth}

        given the previous statements, you MUST say (shortly) if the following phrase is correct

        {prediction}
        """)
    ]
    response = chat_model.invoke(messages)
    return response.content


### Data extraction and preprocessing

Functions to extract the long and short answers from a dataset sample and to rebuild the document text without HTML tokens.

In [367]:
def extract_answers(sample):
    tokens = sample["document"]["tokens"]

    # Short answer: first span of the first annotation, skipping HTML tokens.
    short_answer = ""
    starts = sample["annotations"]["short_answers"][0]["start_token"]
    ends = sample["annotations"]["short_answers"][0]["end_token"]
    if len(starts) > 0:
        # Cast once so both slices use plain Python ints.
        start, end = int(starts[0]), int(ends[0])
        short_answer = " ".join([
            t for t, html in zip(tokens["token"][start:end], tokens["is_html"][start:end]) if not html
        ])

    # Long answer: a start_token of -1 is treated as "no long answer".
    long_answer = ""
    if sample["annotations"]["long_answer"][0]["start_token"] != -1:
        start = sample["annotations"]["long_answer"][0]["start_token"]
        end = sample["annotations"]["long_answer"][0]["end_token"]
        long_answer = " ".join([
            t for t, html in zip(tokens["token"][start:end], tokens["is_html"][start:end]) if not html
        ])

    return long_answer, short_answer

In [368]:
def preprocess_text(sample):
    tokens = sample["document"]["tokens"]
    return " ".join([t for t, html in zip(tokens["token"], tokens["is_html"]) if not html])
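
Both functions rely on the same idea: keep the tokens whose `is_html` flag is false and join them with spaces. The sketch below shows this on a toy token table that mirrors only the fields accessed above, not the full dataset schema. The helper name `join_text_tokens` and the toy data are assumptions for illustration.

```python
def join_text_tokens(tokens, start=0, end=None):
    """Join the tokens in [start, end) with spaces, skipping those flagged as HTML."""
    words = tokens["token"][start:end]
    flags = tokens["is_html"][start:end]
    return " ".join(t for t, html in zip(words, flags) if not html)

# Toy sample: only the fields read by the notebook's functions are included.
toy_tokens = {
    "token": ["<P>", "Paris", "is", "the", "capital", "of", "France", ".", "</P>"],
    "is_html": [True, False, False, False, False, False, False, False, True],
}

full_text = join_text_tokens(toy_tokens)     # whole document, as in preprocess_text
answer = join_text_tokens(toy_tokens, 1, 2)  # a one-token answer span
print(full_text)
print(answer)
```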

## Benchmarking 

### Retrieve benchmark

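By measuring the longest verbatim overlap between the prediction and the golden context, we can estimate how much of the relevant passage the pipeline retrieved and reproduced. `find_longest_match` returns the length of the longest substring of `str2` that also appears in `str1`, divided by the length of `str2`, so the score lies between 0 and 1.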

In [369]:
def find_longest_match(str1, str2):
    max_len = 0
    len2 = len(str2)

    for i in range(len2):
        for j in range(i + 1, len2 + 1):
            substr = str2[i:j]
            if substr in str1 and len(substr) > max_len:
                max_len = len(substr)

    return max_len / len2 if len2 > 0 else 0
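
The nested loops above test every substring of `str2`, which becomes slow on long passages. As a hedged alternative, the same ratio can be computed with the standard library's `difflib.SequenceMatcher`. The function name `longest_match_ratio` is hypothetical, and `autojunk=False` is set so that frequent characters are not ignored.

```python
from difflib import SequenceMatcher

def longest_match_ratio(prediction, reference):
    """Length of the longest substring of reference found in prediction, divided by len(reference)."""
    if not reference:
        return 0
    matcher = SequenceMatcher(None, prediction, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(prediction), 0, len(reference))
    return match.size / len(reference)

print(longest_match_ratio("the cat sat", "a cat"))
```

Here the longest shared substring is `" cat"` (four characters including the space), so the ratio is four fifths of the reference length.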


### Evaluation benchmark

Load the ROUGE, BLEU, and BLEURT metrics and a sentence-embedding model for scoring the model's predictions against the golden context.

In [370]:
metric_rouge = evaluate.load("rouge")
metric_bleu = evaluate.load("bleu")
metric_bleurt = evaluate.load("bleurt", "bleurt-large-512")
comparison_model = SentenceTransformer(config["embedding_model_name"])

INFO:tensorflow:Reading checkpoint C:\Users\ruben\.cache\huggingface\metrics\bleurt\bleurt-large-512\downloads\extracted\b2c310c50992e604abf5e8f9f4695388f5f2e7bfdfc3d5d81a5f4693d2f300d3\bleurt-large-512.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:512
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.


### Single sample processing

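Process one sample: extract the golden context and answer, build the retrieval context, query the LLM with both the retrieved and the golden context, and compute the metrics for each prediction.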

In [371]:
def process_sample(i, sample):
    golden_context, golden_answer = extract_answers(sample)
    if golden_answer == "" or golden_context == "":
        return None
    golden_embeddings = comparison_model.encode(golden_context, convert_to_tensor=True)
    
    query = sample["question"]["text"]

    text = preprocess_text(sample)
    docs = [Document(page_content=text)]
    split_docs = splitter.split_documents(docs)
    vectorstore = FAISS.from_documents(split_docs, embedding_model)
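    # Join the retrieved chunks into one string so the prompt does not contain a Python list repr.
    context = "\n\n".join(retrieve_local(query, vectorstore))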
    
    prediction = ask(query, context)
        
    rouge = metric_rouge.compute(predictions=[prediction], references=[golden_context])
    bleu = metric_bleu.compute(predictions=[prediction], references=[golden_context])
    bleurt = metric_bleurt.compute(predictions=[prediction], references=[golden_context])
    longest_match = find_longest_match(prediction, golden_context)
    pred_embeddings = comparison_model.encode(prediction, convert_to_tensor=True)
    cosine_similarity = util.pytorch_cos_sim(pred_embeddings, golden_embeddings)
    semantic_evaluation = semantic_check(query, golden_context, prediction)

    # Reference run: answer the same query from the golden context instead of the retrieved chunks.
    prediction_golden = ask(query, golden_context)
        
    rouge_golden = metric_rouge.compute(predictions=[prediction_golden], references=[golden_context])
    bleu_golden = metric_bleu.compute(predictions=[prediction_golden], references=[golden_context])
    bleurt_golden = metric_bleurt.compute(predictions=[prediction_golden], references=[golden_context])
    longest_match_golden = find_longest_match(prediction_golden, golden_context)
    pred_embeddings_golden = comparison_model.encode(prediction_golden, convert_to_tensor=True)
    cosine_similarity_golden = util.pytorch_cos_sim(pred_embeddings_golden, golden_embeddings)
    semantic_evaluation_golden = semantic_check(query, golden_context, prediction_golden)
    
    result = {
        "index": i,
        "rougeL": rouge["rougeL"],
        "rougeL_golden": rouge_golden["rougeL"],
        "bleu": bleu["bleu"],
        "bleu_golden": bleu_golden["bleu"],
        "bleurt": bleurt["scores"][0],
        "bleurt_golden": bleurt_golden["scores"][0],
        "cosine_similarity": cosine_similarity.item(),
        "cosine_similarity_golden": cosine_similarity_golden.item(),
        "longest_match": longest_match,
        "longest_match_golden": longest_match_golden,
        "query": query,
        "gold_answer": golden_answer,
        "golden_context": golden_context,
        "prediction": prediction,
        "prediction_golden": prediction_golden,
        "semantic_evaluation": semantic_evaluation,
        "semantic_evaluation_golden": semantic_evaluation_golden
    }
    return result



### Full benchmark loop

Run the evaluation until the target number of valid samples with answers is reached, skipping samples with empty references.

In [372]:
num_valid_examples = config["num_valid_examples"]

results = []
i = 0

with tqdm(total=num_valid_examples) as pbar:
    while len(results) < num_valid_examples and i < len(dataset):
        sample = dataset[i]
        
        start_time = time.perf_counter()
        result = process_sample(i, sample)
        elapsed_time = time.perf_counter() - start_time
            
        if result:
            result["elapsed_time"] = elapsed_time
            results.append(result)
            pbar.update(1)
            
        i += 1

print(f"Processed {len(results)} valid samples out of {i} total samples")

  0%|          | 0/3 [00:00<?, ?it/s]

Processed 3 valid samples out of 17 total samples


### Save and display results

Print the per-sample results, then save the metrics to CSV and the responses and configuration to JSON.

In [373]:
for r in results:
    print(f"\n\n===================================== SAMPLE {r['index']} =====================================")
    print(f"QUERY: {r['query']}\n")
    print(f"GROUND TRUTH: {r['golden_context']}\n")
    print(f"GOLDEN ANSWER: {r['gold_answer']}\n")
    print(f"PREDICTION: {r['prediction']}\n")
    print(f"PREDICTION SEMANTIC EVALUATION: {r['semantic_evaluation']}")
    print(f"PREDICTION GOLDEN: {r['prediction_golden']}\n")
    print(f"PREDICTION GOLDEN SEMANTIC EVALUATION: {r['semantic_evaluation_golden']}")
    print(f"rougeL: {r['rougeL']}, rougeL_golden: {r['rougeL_golden']}")
    print(f"bleu: {r['bleu']}, bleu_golden: {r['bleu_golden']}")
    print(f"bleurt: {r['bleurt']}, bleurt_golden: {r['bleurt_golden']}")
    print(f"cosine_similarity: {r['cosine_similarity']}, cosine_similarity_golden: {r['cosine_similarity_golden']}")
    print(f"longest match: {r['longest_match']}, longest_match_golden: {r['longest_match_golden']}")



QUERY: when did now thats what i call music come out

GROUND TRUTH: Now That 's What I Call Music ( also simply titled Now or Now 1 ) is the first album from the popular Now ! series that was released in the United Kingdom on 28 November 1983 . Initial pressings were released on vinyl and audio cassette . To celebrate the 25th anniversary of the album and series , the album was re-released on CD for the first time in 2009 . However , alternative longer mixes of Only For Love , Double Dutch and Candy Girl were included in place of the original shorter single mixes from 1983 . A double vinyl re-release followed for Record Store Day on 18 April 2015 . In July 2018 , the album was newly remastered and re-released on CD , vinyl and cassette to commemorate the release of the 100th volume of the series .

GOLDEN ANSWER: 28 November 1983

PREDICTION: The context provided states that "Now That's What I Call Music ( 1983 ) Now That's What I Call Music II ( 1984 ) Now That's What I Call Music (

In [374]:
output_folder = config["output_folder"]
run_name = f"run_custom_{datetime.now().strftime('%Y_%m_%d_%H%M%S')}"
out_path = os.path.join(output_folder, run_name)
os.makedirs(out_path, exist_ok=True)

results_df = pd.DataFrame(results)
response = [
    {
        "index": row["index"],
        "query": row["query"],
        "gold_answer": row["gold_answer"],
        "prediction": row["prediction"],
        "semantic_evaluation": row["semantic_evaluation"],
        "golden_context": row["golden_context"],
        "semantic_evaluation_golden": row["semantic_evaluation_golden"],
        "elapsed_time": row["elapsed_time"]
    }
    for _, row in results_df.iterrows()
]
results_df.drop(["prediction_golden", "prediction", "query", "gold_answer", "elapsed_time"], axis=1, inplace=True)

file_path = os.path.join(out_path, "results.csv")
results_df.to_csv(file_path, index=False)

with open(os.path.join(out_path, "responses.json"), "w") as f:
    json.dump(response, f, indent=4)
    
with open(os.path.join(out_path, "used_config.json"), "w") as f:
    json.dump(config, f, indent=4)
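
The per-sample dictionaries collected in `results` can also be summarized by averaging each metric together with its `_golden` counterpart. This is a minimal sketch using only the standard library. The helper name `average_metrics` and the toy rows are assumptions for illustration, not values from this run.

```python
from statistics import mean

# Metric names stored by process_sample; each also has a "<name>_golden" column.
METRICS = ["rougeL", "bleu", "bleurt", "cosine_similarity", "longest_match"]

def average_metrics(rows, metrics=METRICS):
    """Average each metric and its *_golden counterpart over a list of result dicts."""
    if not rows:
        return {}
    columns = [name for m in metrics for name in (m, f"{m}_golden")]
    return {c: mean(r[c] for r in rows) for c in columns}

# Toy rows with a single metric, for illustration only.
toy_rows = [
    {"rougeL": 0.25, "rougeL_golden": 0.5},
    {"rougeL": 0.75, "rougeL_golden": 1.0},
]
print(average_metrics(toy_rows, metrics=["rougeL"]))
```

Comparing each average with its `_golden` counterpart gives a rough sense of how much score is lost between answering from the golden context and answering from the retrieved chunks.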