# Evaluation of a RAG Pipeline using the MS MARCO dataset

This notebook presents the implementation of a simple Retrieval-Augmented Generation (RAG) pipeline, and its evaluation using advanced frameworks that provide a collection of metrics that assess the performance of the retrieval and generation components of the pipeline, such as **RAGChecker**, **ARES**, **RAGAs**, and **AutoRAG**.

The pipeline will be developed using the **LangChain** framework, employing the **Chroma** vector database, that will store the corpus embeddings and act both as the vector store and the retrieval component, and OpenAI's `GPT-4o mini` model as the generator component.

## 1. Inserting API Keys

* LangChain API

In [5]:
import os
import getpass

os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Insert LangChain API Key:")

Insert LangChain API Key: ········


* OpenAI API

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Insert OpenAI API key:")

Insert OpenAI API key: ········


## 2. Dataset Preprocessing

MS MARCO (Microsoft MAchine Reading COmprehension) is a large general-purpose dataset, containing factual user queries that are derived from Bing's public search logs. This dataset is a great choice for benchmarking retrieval and RAG-based systems, because it examines 3 aspects of the system:

**a)** It evaluates whether the system retrieved useful text segments for answering the initial query or not, and if possible, the system should generate an answer.

**b)** It assesses whether the system can generate a coherent and context-aware answer (or "well-formed") based on the retrieved text segments.

**c)** It inspects the way the system ranks the retrieved text segments.

In [18]:
from datasets import load_dataset

ms_marco = load_dataset("microsoft/ms_marco", "v1.1", cache_dir="./Datasets/", trust_remote_code=True)
print(ms_marco)

DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 10047
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 82326
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 9650
    })
})


I selected the training subset of the MS Marco dataset to serve as the retrieval corpus, consisting of 82,326 instances.

Each instance of the dataset consists of a `query` and it's identifier (`id`), the `query type`, a dictionary of context `passages`, the query's `answer`, and a field named `wellFormedAnswers`, which is a list of string features.

In [19]:
ms_marco_train = ms_marco["train"]
example = ms_marco_train[0]

print(f"The keys of the 'passages' dictionary are: {example['passages'].keys()}")

The keys of the 'passages' dictionary are: dict_keys(['is_selected', 'passage_text', 'url'])


As we can observe, the `passages` dictionary contains a list of `passage_text` contexts, their corresponding `url`, and a flag (`is_selected`) indicating whether the context contributed to the final answer.

However, given the fact that Ι need to assess the retrieval capabilities of the RAG pipeline, Ι need to use all the context passages, not just the one that formulated the final answer. 

The **preprocessing** of the retrieval corpus includes creating a list of LangChain `Document` objects for every `passage_text` of every training instance, and using the `url`, as well as the corresponding query's `query_id` and `query_type` as the keys of the `metadata` dictionary.

In [4]:
from langchain_core.documents import Document

ms_marco_retrieval_documents = []

for example in ms_marco_train:
    # metadata about the query
    query_id = example["query_id"]
    query_type = example["query_type"]

    for i in range(len(example["passages"]["passage_text"])):
        # actual context passage
        context = example["passages"]["passage_text"][i]
        # metadata about the context passage
        url = example["passages"]["url"][i]

        doc = Document(
            page_content=context,
            metadata={"url": url, "query_id": query_id, "query_type": query_type}
        )

        ms_marco_retrieval_documents.append(doc)

print(f"Total number of Documents created: {len(ms_marco_retrieval_documents)}")

Total number of Documents created: 676193


At this point, Ι took the decision to consider each `Document` object of the retrieval documents list, a chunk, which will be later embedded in the Chroma vector store.

In [5]:
ms_marco_chunks = ms_marco_retrieval_documents

## 3. Chroma Vector Store

My RAG pipeline employs a Chroma vector store that uses OpenAI's `text-embedding-3-large` to embed the chunks and store their embeddings.

In [20]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [8]:
from langchain_chroma import Chroma
from tqdm import tqdm
from uuid import uuid4
import os

# Initialization of the Chroma vector store
ms_marco_vector_store = Chroma(
    collection_name="MS_MARCO",
    embedding_function=embeddings,
    persist_directory="./Vector_Stores/MS_Marco/"
)


# If the vector store's directory is empty, then we need to insert the corpus chunks to it.  
if len(os.listdir("./Vector_Stores/MS_Marco")) == 1:
    print("Chroma vector store is empty. Inserting chunk embeddings....")
    
    # creating a unique identifier for each dataset chunk to be stored
    uuids = [str(uuid4()) for _ in range(len(ms_marco_chunks))]
    
    # storing the chunks in batches, in order to reduce the amount of API calls
    batch_size = 1000
    
    # Tracking the process using a progress bar
    for i in tqdm(range(0, len(ms_marco_chunks), batch_size)):
        batch_chunks = ms_marco_chunks[i:i+batch_size]
        batch_ids = uuids[i:i+batch_size]
    
        ms_marco_vector_store.add_documents(documents=batch_chunks, ids=batch_ids)

    print("Chunk embeddings added to the vector store.")

print("Chroma vector store initialized.")

Chroma vector store initialized.


## 4. Testing the RAG pipeline

In this section, Ι define the retrieval and generation functionalities. Then, Ι query the Chroma vector stores that also acts as the retriever.

Given that the MS Marco training set contains 82326 instances and that Chroma doesn't support parallel execution of queries, Ι randomly sampled 100 QA instances from the dataset, by selecting their corresponding indices, so that Ι can access each data instance both in the dataset (`ms_marco_train`) and in the QA list (`ms_marco_qa`).

In [21]:
import random

# Setting the seed for reproducibility of the experiments
random.seed(100)

sampled_indices = random.sample(list(range(82326)), 100)

ms_marco_subset = []

for index in sampled_indices:
    ms_marco_subset.append(ms_marco_train[index])

At this point, I define a `results` list that stores dictionaries of (query, query_id, ground-truth answer, generated answer, retrieved context passages) instances of the sampled dataset

In [9]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_chroma import Chroma
from langchain import hub
from tqdm import tqdm

# Instantiating the MS Marco's Chroma vector store
ms_marco_vector_store = Chroma(
    collection_name="MS_MARCO",
    embedding_function=embeddings,
    persist_directory="./Vector_Stores/MS_Marco/"
)

# Creating a retriever out of the vector store
ms_marco_retriever = ms_marco_vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 10, 'lambda_mult': 0.25}
)

# Instantiating the generator model GPT-4o mini
generator_llm = ChatOpenAI(model="gpt-4o-mini")

# Defining the prompt template of the RAG pipeline
prompt_template = """Answer the following question:
\n\n
{question}
\n\n

Using the following list of context passages:
\n\n
{context}
\n\n"""

prompt = ChatPromptTemplate.from_template(prompt_template)

# RAG PIPELINE
print("Testing the RAG pipeline and collecting the retrieved passages and generated answers....")
results = []

for qa in tqdm(ms_marco_subset):
    # RETRIEVAL
    retrieved_chunks = ms_marco_retriever.invoke(qa["query"])

    # Processing the retrieved chunks, by adding their url source in the end of the text segment,
    # so that i only add the important text of each chunk, and not the metadata.  
    retrieved_context = ""
    for context in retrieved_chunks:
        retrieved_context += context.page_content + "\nSource: " + context.metadata["url"] + " \n\n "

    # GENERATION
    # Configuring the input prompt and calling GPT-4o mini, by promting it.
    prompt_message = prompt.invoke({"question": qa["query"], "context": retrieved_context})
    response = generator_llm.invoke(prompt_message)

    
    result = {
        "query_id": qa["query_id"],
        "query": qa["query"],
        "ground_truth_answer": qa["answers"],
        "generated_answer": response.content,
        "retrieved_chunks": retrieved_chunks
    }
    results.append(result)

print("The testing process is completed.")

Testing the RAG pipeline and collecting the retrieved passages and generated answers....


100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [06:59<00:00,  4.19s/it]

The testing process is completed.





**Checkpoint -** Storing the results of the testing process of the RAG pipeline in a `.json` file, saving them, as a checkpoint, for the evaluation process. 

In [10]:
import json

# Preprocessing of the results, since the retrieved chunks, which are `Document` objects, are not JSON serializable
serializable_results = []

for res in results:
    serializable_res = res.copy()
    serializable_res["retrieved_chunks"] = [{"page_content": chunk.page_content, "metadata": chunk.metadata} for chunk in res["retrieved_chunks"]]
    serializable_results.append(serializable_res)
    
# Saving the results list to a JSON file
with open("./Output_Files/ms_marco_results.json", "w", encoding="utf-8") as file:
    json.dump(serializable_results, file, indent=4)

## 5. RAG Evaluation

### 5.1 RAGChecker

Loading the RAG pipeline's test results:

In [11]:
import json

with open("./Output_Files/ms_marco_results.json", "r", encoding="utf-8") as file:
    json_results = json.load(file)   


RAGChecker requires that the RAG pipeline's results should be formatted as:

```json
{
    results: [
        {
            "query_id": <query's identifier as a string value>,
            "query": <The actual input query>,
            "gt_answer": <The ground-truth answer provided in the dataset>,
            "response": <RAG pipeline's generated response>,
            "retrieved_context": [ <The list of the retrieved chunks, which are pertinent to the input  query>
                {
                    "doc_id": <The document identifier of the retrieved context passage as a string value>,
                    "text": <The actual retrieved context passage>
                },
                {
                    "doc_id": <The document identifier of the retrieved context passage as a string value>,
                    "text": <The actual retrieved context passage>
                }
                ......
            ]
        }
        ......
    ]
}
```

However, the RAG pipeline's results stored in `json_results.json` have the following format:

```json
[
    {
        "query_id": <The input query's identifier>,
        "query": <The actual input query>,
        "ground_truth_answer": <The ground-truth answer provided in the dataset>,
        "generated_answer": <RAG pipeline's generated response>,
        "retrieved_chunks": [
            {
                "page_content": <The context passage of the retrieved chunk>,
                "metadata": {
                    "query_id": <The input query's identifier that ensures 
                                 the retrieved chunk is related to the input query>,
                    "query_type": <The category of the input query>,
                    "url": <The url source of the retrieved chunk>
                }
            }
        ]
    }
    .......
]
```

Given that, a necessary preprocessing step is to convert the results' format to the exact format specified by RAGChecker. 

For the MS MARCO dataset, 

In [12]:
ragchecker_results = {"results": []}

for res in json_results:
    if (len(res["ground_truth_answer"]) > 0):
        formatted_res = {
            "query_id": str(res["query_id"]),
            "query": res["query"],
            "gt_answer": res["ground_truth_answer"][0],
            "response": res["generated_answer"],
            "retrieved_context": []
        }
    
        for chunk in res["retrieved_chunks"]:
            formatted_res["retrieved_context"].append({"doc_id": str(chunk["metadata"]["query_id"]), "text": chunk["page_content"]})
    
        ragchecker_results["results"].append(formatted_res)

**Checkpoint -** Storing the RAG pipeline's results in the format required by RAGChecker.

In [13]:
with open("./Output_Files/ms_marco_ragchecker.json", "w", encoding="utf-8") as file:
    json.dump(ragchecker_results, file, indent=4)

**RAGChecker Evaluation**

RAGChecker computes a set of overall, retrieval, and generation metrics:

| Metric | Description |
|--------|-------------|
| Precision | The fraction of correct generated claims $c_i^{(m)}$ in the generated response $m$ |
| Recall | The fraction of ground-truth claims $c_i^{(gt)}$ that can be found in the model response |
| F1 Score | The harmonic mean of Precision and Recall |
| Claim Recall | The fraction of ground-truth claims $c^{(gt)}$ that can be found in the set of retrieved chunks $\{chunk_j\}$ |
| Context Precision | The fraction of relevant chunks $\{r\text{-}chunk_j\}$ from the $k$ retrieved chunks |
| Faithfulness | The fraction of the model-generated claims $c_i^{(m)}$ that can be attributed to retrieved chunks. |
| Relevant Noise Sensitivity | The fraction of generated claims $c_i^{(m)}$ that are incorrect and extracted from relevant retrieved chunks |
| Irrelevant Noise Sensitivity | The fraction of generated claims $c_i^{(m)}$ that are incorrect and extracted from irrelevant retrieved chunks |
| Hallucination | The fraction of generated claims $c_i^{(m)}$ that belong neither in the ground-truth answer $gt$ nor in any retrieved chunk |
| Self Knowledge |  The fraction of generated responses $c_i^{(m)}$ that can be traced in the ground-truth $gt$ but not in any retrieved chunk |
| Context Utilization | The fraction of ground-truth claims $c_i^{(gt)}$ that can be found in the set of retrieved chunks, that can also be extracted from the generated response $m$ |

Executing the RAGChecker evaluation, and using `gpt-3.5-turbo` both as the extractor and the checker model.

In [4]:
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

# initialization of rag_results from json/dict
with open("./Output_Files/ms_marco_ragchecker.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# Setting up the evaluator using "gpt-4o-mini" as the extractor and checker model.
evaluator = RAGChecker(
    extractor_name="gpt-3.5-turbo",
    checker_name="gpt-3.5-turbo",
    batch_size_extractor=10,
    batch_size_checker=10
)

# Evaluating results on all metrics, holistic, retrieval, and generation metrics
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)

[32m2025-01-06 11:05:23.592[0m | [1mINFO    [0m | [36mragchecker.evaluator[0m:[36mextract_claims[0m:[36m113[0m - [1mExtracting claims for gt_answer of 100 RAG results.[0m

[32m2025-01-06 11:05:45.607[0m | [1mINFO    [0m | [36mragchecker.evaluator[0m:[36mcheck_claims[0m:[36m173[0m - [1mChecking retrieved2answer for 100 RAG results.[0m

[32m2025-01-06 11:07:24.670[0m | [1mINFO    [0m | [36mragchecker.evaluator[0m:[36mextract_claims[0m:[36m113[0m - [1mExtracting claims for response of 100 RAG results.[0m

[32m2025-01-06 11:08:01.402[0m | [1mINFO    [0m | [36mragchecker.evaluator[0m:[36mcheck_claims[0m:[36m173[0m - [1mChecking retrieved2response for 100 RAG results.[0m

[32m2025-01-06 11:12:12.919[0m | [1mINFO    [0m | [36mragchecker.evaluator[0m:[36mcheck_claims[0m:[36m173[0m - [1mChecking response2answer for 100 RAG results.[0m

[32m2025-01-06 11:12:25.195[0m | [1mINFO    [0m | [36mragchecker.evaluator[0m:[36mcheck_claims

RAGResults(
  100 RAG results,
  Metrics:
  {
    "overall_metrics": {
      "precision": 64.0,
      "recall": 73.8,
      "f1": 61.9
    },
    "retriever_metrics": {
      "claim_recall": 93.5,
      "context_precision": 83.9
    },
    "generator_metrics": {
      "context_utilization": 77.8,
      "noise_sensitivity_in_relevant": 28.4,
      "noise_sensitivity_in_irrelevant": 2.7,
      "hallucination": 4.5,
      "self_knowledge": 0.8,
      "faithfulness": 94.7
    }
  }
)





Storing the RAGChecker's evalaution results in a text file that will also store the evaluation results of the other 3 frameworks.

In [5]:
with open("./Output_Files/ms_marco_framework_results.txt", "a+", encoding="utf-8") as file:
    file.write("RAGChecker results:\n\n" + str(rag_results))

### 5.2 RAGAs

Due to dependency conflicts between modules required by RAGChecker and RAGAs, such as transformers and scikit-learn, RAGAs is executed on a different anaconda environment, that is installed and set up with the following commands in the anaconda terminal:

```
conda create --name rag_eval_ragas python=3.11.5
conda install jupyter
pip install ragas 
```
**The next 2 cells should be run in the "rag_eval_ragas" anaconda environment.**

Activating the environment with the command:

```
conda activate rag_eval_ragas
```

Loading the RAG pipeline's test results:

In [None]:
import json

with open("./Output_Files/ms_marco_results.json", "r", encoding="utf-8") as file:
    json_results = json.load(file)  

**RAGAs Evaluation**

RAGAs provides a series of RAG evaluation metrics as well as general LLM evaluation metrics, but in this setting, I chose to compute the subset of those metrics that are described in the thesis report:

| RAG Metric                     | Description                                                                                                     |
|-------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Faithfulness                  | The fraction of inferred claims $|V|$ from the total claims $|S|$                                              |
| Answer Relevance             | The average cosine similarity of the input query and a set of LLM-generated queries that can produce the same answer |
| Context Relevance            | The fraction of sentences in the context that successfully confront the question                              |
| Context Precision@$K$        | The fraction of contextually relevant retrieved chunks — average of Precision@$k$ for all $k$ up to $K$        |
| Context Recall               | The fraction of relevant chunks that were actually retrieved based on the input query                          |
| Context Entity Recall        | How many ground-truth entities were present in the generated answer                                            |
| Relevant/Irrelevant Noise Sensitivity | Same metrics as RAGChecker                                                                                 |


| General LLM Metric | Description |
|--------------------|-------------|
| Non-LLM Semantic Similarity | The similarity (on a scale of 0 to 1) between the model response and the ground-truth answer using distance measurements|
| BLEU | The similarity (on a scale of 0 to 1) between the model response and the ground-truth answer using the n-gram precision |
| ROUGE | The fraction of overlap between the generated and reference response based on the n-gram precision, recall and F1 score |
| Exact Match | Tests whether the generated response is exactly the same (1), or not (0), as the ground-truth answer |
| String Presence | Checks whether the generated response contains (1), or not (0), several parts or keywords of the reference answer |

However, calculating all those metrics for 100 RAG result instances is a resource-intensive process, requiring almost 10 minutes per instance!

That's why Ι decided only to calcuate the 3 most fundamental metrics: **Faithfulness**, **Answer Relevance** (defined as response relevancy in the code), and **Context Relevance** (defined as context recall in the code).

In [None]:
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from tqdm import tqdm

# Importing RAG metrics
from ragas.metrics import Faithfulness, ResponseRelevancy, LLMContextRecall

# Defining the LLM and Embeddings model that are required parameters for some evaluation metrics
llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# The LLM and Embeddings model need to be wrapped so that they conform with RAGAs' interface
langchain_llm = LangchainLLMWrapper(langchain_llm=llm)
langchain_embeddings = LangchainEmbeddingsWrapper(embeddings=embeddings)

# Defining all the scoring functions that leverage the LLM to compute the respective evaluation metric 
faithfulness = Faithfulness(llm=langchain_llm)
answer_relevance = ResponseRelevancy(llm=langchain_llm, embeddings=langchain_embeddings)
context_relevance = LLMContextRecall(llm=langchain_llm)


# Initializing Metrics dictionary
# In every iteration of the evaluation process, 
# this dictionary will hold the running sum of each metric for the evaluated dataset instances up to this point,
# and, in the end, will store the average value for every metric
metrics =  {
    "faithfulness" : 0,
    "answer_relevance" : 0,
    "context_relevance" : 0
}

# Main evaluation loop
print("Evaluating the RAG pipeline results using RAGAS....")

for rag_result in tqdm(json_results):

    # Each rag result instance must be converted into a single-turn sample instance
    sample = SingleTurnSample(
        user_input=rag_result["query"],
        response=rag_result["generated_answer"],
        reference=rag_result["ground_truth_answer"][0], # ground_truth_answer is a list and its first element is the actual string value
        retrieved_contexts=[context["page_content"] for context in rag_result["retrieved_chunks"]]
    )

    # calculating the metrics for each rag result instance and storing the running sum to the 'metrics' dicitionary
    metrics["faithfulness"] += await faithfulness.single_turn_ascore(sample)
    metrics["answer_relevance"] += await answer_relevance.single_turn_ascore(sample)
    metrics["context_relevance"] += await context_relevance.single_turn_ascore(sample)

# After the evaluation loop is completed, i compute the average of each metric on the 100 rag result instances.
for metric_label in list(metrics.keys()):
    metrics[metric_label] /= 100

print("The evaluation process has been completed.")

Evaluating the RAG pipeline results using RAGAS....


100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [41:33<00:00, 24.93s/it]

The evaluation process has been completed.





In [None]:
print(f"The RAGAs average metrics for the sample of 100 RAG results are:\n {metrics}")

The RAGAs average metrics for the sample of 100 RAG results are:
 {'faithfulness': 0.912651358488059, 'answer_relevance': 0.6664825653535421, 'context_relevance': 0.833}


In [None]:
with open("./Output_Files/ms_marco_framework_results.txt", "a+", encoding="utf-8") as file:
    file.write("\n\nRAGAs results:\n\n" + str(metrics))

## 6. Failed Attempts

**The implementation attempts of the ARES and AutoRAG evaluation frameworks were unsuccessful, — potentially due to undocumented assumptions, configuration mismatches, or limitations in the way I applied it.**

**These outcomes may not fully reflect the frameworks’ capabilities, but rather real-world integration challenges and time constraints typical in academic research.**

### 6.1 ARES

Due to dependency conflicts between modules required by RAGChecker and ARES, ARES is executed on a different anaconda environment, that is installed and set up with the following commands:

```
conda create --name rag_eval_ares python=3.11.5
conda install jupyter
pip install ares-ai
```

**Only the last 2 cells of this subsection should be run on "rag-eval-ares".**

**The previous 5 cells should be executed on the "base" environment, due to the use of GPT-4o-mini.**

Activating the environment with the command:

```
conda activate rag_eval_ares
```

Loading the RAG pipeline's test results:

In [8]:
import json

with open("./Output_Files/ms_marco_results.json", "r", encoding="utf-8") as file:
    json_results = json.load(file)  

For the evaluation process, ARES requires that a series of configuration parameters is set up, and they are:

1. `in_domain_prompts_dataset`: A .tsv file that contains a set of few-shot examples, each represented as a Query-Document-Answer triple, as well as 'YES' and 'NO' labels for each of the context relevance, answer relevance, and answer faithfulness metrics.
2. `unlabeled_evaluation_set`: A .tsv file that contains the input Query-Document-Answer triples to be evaluated. 
3. `model_choice`: The LLM used by ARES.

**In-domain Prompts Dataset Creation**

In [31]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
import random
import re

# Setting the seed for reproducibility of the experiments
random.seed(100)

# Repeating the process of selecting indices to estimate the set of remaining available indices
sampled_indices = random.sample(range(82326), 100)
available_indices = set(range(82326)) - set(sampled_indices)

# Selecting a new set of examples, in order to evaluate the labels of the few-shot prompt dataset
few_shot_indices = random.sample(list(available_indices), 5)

# Storing the selected dataset instances in a list
examples = []

for index in few_shot_indices:
    example = {}
    example["query"] = ms_marco_train[index]["query"]
    example["context"] = []
    example["answer"] = ms_marco_train[index]["answers"]
    

    for context in ms_marco_train[index]["passages"]["passage_text"]:
        example["context"].append(context)

    examples.append(example)


# Invoking GPT-4o mini to create 10 examples with different combinations of [[Yes]] and [[No]] labels 
# for context relevance, answer relevance, and answer faithfulness
prompt_template = """
You are tasked with generating ten combinations of `Query - Document - Answer` triples 
from the examples provided. Each combination should result in various combinations of the following labels:

- Context Relevance Label: [[Yes]] or [[No]]
- Answer Relevance Label: [[Yes]] or [[No]]
- Answer Faithfulness Label: [[Yes]] or [[No]]

Definitions:
1. **Context Relevance**: A context passage is relevant if it provides meaningful information about the query.
2. **Answer Relevance**: An answer is relevant if it logically answers the query based on the provided context passage.
3. **Answer Faithfulness**: An answer is faithful if it aligns with the factual information presented in the context passage.

The input examples are provided in a Python list, where each example is a dictionary that contains the query, a list of context passages, 
and a list of answers:

{examples}
"""

# Creating the prompt message
prompt = ChatPromptTemplate.from_template(prompt_template)
prompt_message = prompt.invoke({"examples": examples})

# Invoking GPT-4o mini
model = ChatOpenAI(model="gpt-4o-mini")
response = model.invoke(prompt_message)

# Regex pattern that will separate the  generated Query - Document - Answer - Labels instances
pattern = r"\*\*Query\*\*: (.*?)\s+\*\*Document\*\*: (.*?)\s+\*\*Answer\*\*: (.*?)\s+\*\*Labels\*\*: Context Relevance: \[\[(.*?)\]\], Answer Relevance: \[\[(.*?)\]\], Answer Faithfulness: \[\[(.*?)\]\]"

# Extracting all the generated examples in a list of tuples 
generated_examples = re.findall(pattern, response.content, re.DOTALL)
generated_examples

[('"what is a document control specialist"\n   -',
  '"Document control specialists primarily work in industries such as engineering or pharmaceutical manufacturing, where they fine-tune literature and manuals that the company releases to simplify the language and/or create clear instructions."\n   -',
  '"Document control specialists store, manage and track company documents. They scan, image, organize and maintain documents, adhering to the company\'s document lifecycle procedures, and they archive inactive records in accordance with the records retention schedule."\n   -',
  'Yes',
  'Yes',
  'Yes'),
 ('"what year did tom\'s diner come out"\n   -',
  '"Tom\'s Diner is a song written in 1981 by American recording artist Suzanne Vega. It was first released as a track on the January 1984 issue of Fast Folk Musical Magazine."\n   -',
  '"1987"\n   -',
  'Yes',
  'No',
  'No'),
 ('"what does gluteus medius weakness cause"\n   -',
  '"Weakness in the muscle, nerve damage, or problems with

Formatting the generated answers in a DataFrame

In [56]:
import pandas as pd

Query = []
Document = []
Answer = []
Context_Relevance_Label = []
Answer_Relevance_Label = []
Answer_Faithfulness_Label = []

for example in generated_examples:
    Query.append(example[0][1:-6])
    Document.append(example[1][1:-6])
    Answer.append(example[2][1:-6])
    Context_Relevance_Label.append([[example[3]]])
    Answer_Relevance_Label.append([[example[4]]])
    Answer_Faithfulness_Label.append([[example[5]]])

few_shot_examples = pd.DataFrame({
    "Query": Query,
    "Document": Document,
    "Answer": Answer,
    "Context_Relevance_Label": Context_Relevance_Label,
    "Answer_Relevance_Label": Answer_Relevance_Label,
    "Answer_Faithfulness_Label": Answer_Faithfulness_Label
})

few_shot_examples

Unnamed: 0,Query,Document,Answer,Context_Relevance_Label,Answer_Relevance_Label,Answer_Faithfulness_Label
0,what is a document control specialist,Document control specialists primarily work in...,"Document control specialists store, manage and...",[[Yes]],[[Yes]],[[Yes]]
1,what year did tom's diner come out,Tom's Diner is a song written in 1981 by Ameri...,1987,[[Yes]],[[No]],[[No]]
2,what does gluteus medius weakness cause,"Weakness in the muscle, nerve damage, or probl...",Weakness of the right gluteus medius will caus...,[[Yes]],[[Yes]],[[Yes]]
3,viruses that infect bacteria are,Bacteriophages are viruses that infect bacteria.,A bacteriophage,[[Yes]],[[Yes]],[[Yes]]
4,functions of progesterone during pregnancy,Progesterone plays a role in maintaining pregn...,Progesterone is produced in the placenta and l...,[[Yes]],[[Yes]],[[Yes]]
5,what is a document control specialist,Most of what document control specialists work...,Document control specialists primarily work in...,[[Yes]],[[Yes]],[[Yes]]
6,what year did tom's diner come out,"On that day in New York, however, the weather ...",1987,[[No]],[[No]],[[No]]
7,what does gluteus medius weakness cause,"A trendelenburg gait, in which there is weakne...",Weakness of the right gluteus medius will caus...,[[Yes]],[[Yes]],[[Yes]]
8,viruses that infect bacteria are,Classification of viruses is defined by host p...,A bacteriophage,[[No]],[[No]],[[No]]
9,"functions of progesterone during pregnancy""","If a pregnancy occurs, progesterone is produce...",Progesterone is produced in the placenta and l...,[[Yes]],[[Yes]],[[Yes]]


Storing the few-shot examples DataFrame in a `.tsv` file in the Ouput_Files folder.

In [80]:
few_shot_examples.to_csv("./Output_Files/ms_marco_ares_few_shot_examples.csv", sep="\t", index=False)

Extracting the RAG results from the corresponding `.json` file and storing them in a `.tsv` file.

In [81]:
rag_results_df = pd.read_json("./Output_Files/ms_marco_results.json")

Query_rag = list(rag_results_df["query"])
Answer_rag = list(rag_results_df["generated_answer"])
Documents_rag = []

context_passages = list(rag_results_df["retrieved_chunks"])
for context in context_passages:
    document = ""

    for chunk in context:
        document += chunk["page_content"] + "\n\n"

    Documents_rag.append(document)

rag_results_ares_format = pd.DataFrame({
    "Query": Query_rag,
    "Document": Documents_rag,
    "Answer": Answer_rag
})

rag_results_ares_format.to_csv("./Output_Files/ms_marco_ares_rag_results.csv", sep="\t", index=False)

**ARES Evaluation**

ARES evaluates the 3 foundational RAG metrics:

| Metric | Description |
|--------|-------------|
| Context Relevance | Tests if the returned context is relevant to the input query |
| Answer Faithfulness | Tests if the generated answer is grounded on the retrieved context or if it contains hallucinations |
| Answer Relevance | Tests if the generated answer is contextually relevant to the query and retrieved context |

**NOTE -**  In order to import the framework I needed to change the "load_metric" function from the datasets module, to the "load" function, in the `anaconda3\envs\rag_eval_ares\Lib\site-packages\ares\LLM_as_a_Judge_Adaptation\General_Binary_Classifier.py` file.

In [4]:
from ares import ARES

ues_idp_config = {
    "in_domain_prompts_dataset": "./Output_Files/ms_marco_ares_few_shot_examples.csv",
    "unlabeled_evaluation_set": "./Output_Files/ms_marco_ares_rag_results.csv",
    "model": "gpt-4o-mini"
}

ares = ARES(ues_idp=ues_idp_config)

vLLM not imported.


In [None]:
results = ares.ues_idp()
results

Evaluating large subset with gpt-3.5-turbo-1106:   0%|          | 0/100 [00:00<?, ?it/s]

Attempt 1 failed with error: Client.__init__() got an unexpected keyword argument 'proxies'
Attempt 2 failed with error: Client.__init__() got an unexpected keyword argument 'proxies'


### 6.2 AutoRAG

Due to dependency conflicts between modules required by RAGChecker and ARES, ARES is executed on a different anaconda environment, that is installed and set up with the following commands:

```
conda create --name rag_eval_autorag python=3.11.5
conda install jupyter
pip install autorag
```

**Only the last 2 cells of this subsection should be run on "rag-eval-autorag".**

**The previous 4 cells should be executed on the "base" environment, due to the use of GPT-4o-mini.**

Activating the environment with the command:

```
conda activate rag_eval_autorag
```

Loading the RAG pipeline's test results:

In [1]:
import pandas as pd

df = pd.read_json("./Output_Files/ms_marco_results.json")

**AutoRAG Evaluation**

AutoRAG offers a variety of retrieval and generation evaluation metrics:

| Metric | Description |
|--------|-------------|
| Mean Reciprocal Rank (MRR) | the average of the reciprocal ranks - the inverse of ranks (index positions) of the first relevant retrieved chunk in the top-k list of chunks of all user queries, $U$ |
| Mean Average Precision (MAP) |  The average of the Average Precision (AP) terms across all user queries, U , each of which is the average of the precision for all the positions up to which a new relevant chunk was retrieved in the top-k list of chunks |
| Normalized Distributed Cumulative Gain (NDCG) | The fraction of DCG (Sum of the relevance of the top-$k$ retrieved chunks, divided by their logarithmic rank) over IDCG (Ideal value for DCG) |
| BLEU | Tests the overlap precision of n-grams in the generated response that are also present in the ground-truth answer |
| ROUGE | Tests the overlap recall of n-grams on the ground-truth answer that are also present in the generated response | 
| METEOR | Focuses on linking unigrams in the generated and ground-truth answers, computes the mean F1 Score which significantly weights (9) recall, and scales it by a term that penalizes higher degrees of fragmentation in the generated unigrams. |
| SemScore | Computes Semantic Text Similarity between the generated response and the ground-truth reference, using a pre-trained LLM  |
| BERTScore | Computes precision, recall, and F1 score on the embedded generated and ground-truth answers |
| G-Eval | Evaluation framework that assess the coherence and fluency of the generated response, as well as its consistency and relevance in terms of the question and the retrieved context. |

**Checkpoint -** Saving the MS Marco results in a `.parquet` file

In [9]:
df.to_parquet("./Output_Files/ms_marco_results.parquet", index=False)

qa_df = pd.read_parquet("./Output_Files/ms_marco_results.parquet", engine="pyarrow")
qa_df

Unnamed: 0,query_id,query,ground_truth_answer,generated_answer,retrieved_chunks
0,38805,what does the optic nerve do in the eye,[Senses light and creates impulses that go thr...,The optic nerve is a sensory nerve that connec...,"[{'metadata': {'query_id': 56676, 'query_type'..."
1,79993,what is cyclobenzaprine hcl,"[Muscle relaxant., A prescription muscle relax...","Cyclobenzaprine hydrochloride, commonly referr...","[{'metadata': {'query_id': 79993, 'query_type'..."
2,79389,camp lejeune phone number,[910-451-1113.],The phone number for Camp Lejeune Operator (Di...,"[{'metadata': {'query_id': 79389, 'query_type'..."
3,42621,what is chakalaka,[Chakalaka is a South African vegetable relish...,Chakalaka is a traditional South African veget...,"[{'metadata': {'query_id': 42621, 'query_type'..."
4,71268,soil temperature for germination,[45 to 85 °F],Soil temperature is a critical factor for seed...,"[{'metadata': {'query_id': 31018, 'query_type'..."
...,...,...,...,...,...
95,21340,how many ft are in a meter,[3.28 Feet],There are approximately 3.2808399 feet in a me...,"[{'metadata': {'query_id': 21340, 'query_type'..."
96,34655,what are puberty hormones,"[Puberty is started by hormones, which are che...",Puberty hormones are chemicals produced by the...,"[{'metadata': {'query_id': 34655, 'query_type'..."
97,82091,how to use laser wheel alignment gauges,[Used by BMW and Ford to calibrate automated w...,To use laser wheel alignment gauges effectivel...,"[{'metadata': {'query_id': 82091, 'query_type'..."
98,44995,average price per square foot for custom home ...,[$132],The average price per square foot for custom h...,"[{'metadata': {'query_id': 44995, 'query_type'..."


The evaluation of the retrieval and generation using AutoRAG requires the following additional information:
1. The index of the retrieved chunk that actually addresses the question (`retrieval_gt`)
2. The embeddings of the query and retrieved chunks that will be later used to calculate the similarity between the query and each retrieved chunk. 

In [13]:
from tqdm import tqdm

retrieval_gt = []
query_embeddings = []
context_passage_embeddings = []


for instance in tqdm(ms_marco_subset):
    # In order to find the retrieved chunk that actually addresses the question,
    # we first trace the '1' in the "is_selected" list and then extract the corresponding url in the "url" list
    gt_index = instance["passages"]["is_selected"].index(1)
    gt_url = instance["passages"]["url"][gt_index]

    # Then, we traverse the "retrieved_chunks" list of the corresponding dataFrame instance, and if we find
    # the same url in one of the chunks, we found the retrieved chunks that answered the question, 
    # and store its "query_id" to the "retrieval_gt" list.
    for chunk in qa_df.iloc[index]["retrieved_chunks"]:
        if chunk["metadata"]["url"] == gt_url:
            retrieval_gt.append([chunk["metadata"]["query_id"]])
            break
    else:
        retrieval_gt.append([-1])

    # Storing the query and retrieved chunk embeddings in lists that will be assigned to new dataFrame columns
    query_embeddings.append(embeddings.embed_query(instance["query"]))

    passage_embeddings = []
    for passage in instance["passages"]["passage_text"]:
        passage_embeddings.append(embeddings.embed_query(passage))
    context_passage_embeddings.append(passage_embeddings)
    
    

df["retrieval_gt"] = retrieval_gt
df["query_embeddings"] = query_embeddings
df["context_passage_embeddings"] = context_passage_embeddings

df.to_parquet("./Output_Files/ms_marco_results.parquet", index=False)

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [09:21<00:00,  5.61s/it]


In [15]:
qa_df = pd.read_parquet("./Output_Files/ms_marco_results.parquet", engine="pyarrow")
qa_df

Unnamed: 0,query_id,query,ground_truth_answer,generated_answer,retrieved_chunks,retrieval_gt,query_embeddings,context_passage_embeddings
0,38805,what does the optic nerve do in the eye,[Senses light and creates impulses that go thr...,The optic nerve is a sensory nerve that connec...,"[{'metadata': {'query_id': 56676, 'query_type'...",[56676],"[-0.011821986176073551, 0.005827188957482576, ...","[[-0.002295552985742688, -0.010894258506596088..."
1,79993,what is cyclobenzaprine hcl,"[Muscle relaxant., A prescription muscle relax...","Cyclobenzaprine hydrochloride, commonly referr...","[{'metadata': {'query_id': 79993, 'query_type'...",[79993],"[-0.009452899917960167, -0.00997375138103962, ...","[[0.007091563194990158, -0.008067519403994083,..."
2,79389,camp lejeune phone number,[910-451-1113.],The phone number for Camp Lejeune Operator (Di...,"[{'metadata': {'query_id': 79389, 'query_type'...",[79389],"[0.012590419501066208, 0.002833800157532096, 0...","[[0.005907153245061636, -0.02618991769850254, ..."
3,42621,what is chakalaka,[Chakalaka is a South African vegetable relish...,Chakalaka is a traditional South African veget...,"[{'metadata': {'query_id': 42621, 'query_type'...",[-1],"[-0.006089339964091778, -0.0024807783775031567...","[[0.011079679243266582, -0.025997988879680634,..."
4,71268,soil temperature for germination,[45 to 85 °F],Soil temperature is a critical factor for seed...,"[{'metadata': {'query_id': 31018, 'query_type'...",[-1],"[0.014579696580767632, -0.02120109833776951, 0...","[[0.021847618743777275, -0.033776286989450455,..."
...,...,...,...,...,...,...,...,...
95,21340,how many ft are in a meter,[3.28 Feet],There are approximately 3.2808399 feet in a me...,"[{'metadata': {'query_id': 21340, 'query_type'...",[21340],"[-0.011784564703702927, -0.023685170337557793,...","[[-0.015622134320437908, -0.001858986448496580..."
96,34655,what are puberty hormones,"[Puberty is started by hormones, which are che...",Puberty hormones are chemicals produced by the...,"[{'metadata': {'query_id': 34655, 'query_type'...",[100089],"[0.012578323483467102, 0.005482597276568413, -...","[[0.028954099863767624, 0.0065473830327391624,..."
97,82091,how to use laser wheel alignment gauges,[Used by BMW and Ford to calibrate automated w...,To use laser wheel alignment gauges effectivel...,"[{'metadata': {'query_id': 82091, 'query_type'...",[82091],"[0.00804098043590784, -0.03547366335988045, -0...","[[0.0037620707880705595, 0.000761187809985131,..."
98,44995,average price per square foot for custom home ...,[$132],The average price per square foot for custom h...,"[{'metadata': {'query_id': 44995, 'query_type'...",[-1],"[-0.015348628163337708, -0.02563305012881756, ...","[[-0.03981368988752365, -0.008323799818754196,..."


**Retrieval Evaluation**

According to the docs page https://docs.auto-rag.com/test_your_rag.html, the retrieval evaluation requires the execution of a decorated function that re-executes the retrieval process and calculates the defined "metrics".

However, since the RAG pipeline has been tested, I don't need to re-execute the retrieval process. I only need to define the `retrieved_ids` which are the `query_ids` of the retrieved chunks, and the `retrieve_scores` that denote the similarity between the query and each chunk embeddings, which are calculated as the cosine similarity.

In [7]:
import pandas as pd
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
from autorag.schema.metricinput import MetricInput
from autorag.evaluation import evaluate_retrieval
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

qa_df = pd.read_parquet("./Output_Files/ms_marco_results.parquet", engine="pyarrow")

metric_inputs = list(map(lambda x: MetricInput(
    query=list(x[1]["query"]),
    retrieval_gt=list(x[1]["retrieval_gt"]),
), qa_df.iterrows()))

@evaluate_retrieval(
    metric_inputs=metric_inputs,
    metrics=["retrieval_f1", "retrieval_recall", "retrieval_precision",
                   "retrieval_ndcg", "retrieval_map", "retrieval_mrr"]
)
def custom_retrieval(qa_df):
    # Your custom retrieval function
    # You have to return the retrieved_contents, retrieved_ids, retrieve_scores as List
    retrieved_contents = []
    retrieved_ids = []
    retrieve_scores = []

    
    for qa in qa_df.iterrows():
        contents = []
        ids = []
        scores = []

        query_embedding = embeddings.embed_query(qa[1]["query"])
        for chunk in qa[1]["retrieved_chunks"]:
            contents.append(chunk["page_content"])
            ids.append(chunk["metadata"]["query_id"])

        for context_embeddings in qa[1]["context_passage_embeddings"]:
            scores.append(cosine_similarity([qa[1]["query_embeddings"]], [context_embeddings]))
            
        retrieved_contents.append(contents)
        retrieved_ids.append(ids)
        retrieve_scores.append(scores)

    return retrieved_contents, retrieved_ids, retrieve_scores


# Combine results into a single DataFrame
retrieval_result_df = custom_retrieval(qa_df)
retrieval_result_df

Unnamed: 0,retrieved_contents,retrieved_ids,retrieve_scores,retrieval_f1,retrieval_recall,retrieval_precision,retrieval_ndcg,retrieval_map,retrieval_mrr
0,[Optic nerve: The optic nerve connects the eye...,"[56676, 77305, 38805, 90193, 53747, 47630, 442...","[[[0.6380005750603623]], [[0.5933254283966266]...",,,,,,
1,[cyclobenzaprine hydrochloride. [sī′kləben′zəp...,"[79993, 42232, 79993, 79993, 79993, 79993, 889...","[[[0.7078694965655192]], [[0.6631842320507195]...",,,,,,
2,[Camp Lejeune Directory. You can find even mor...,"[79389, 79389, 79389, 79389, 56120, 71850, 761...","[[[0.693111411705623]], [[0.3407399904958373]]...",,,,,,
3,[Chakalaka is a traditional South African vege...,"[42621, 42621, 42621, 84689, 62704, 29259, 101...","[[[0.6372815416791178]], [[0.7352107426040818]...",,,,,,
4,[1 Temperature: Germination can take place ove...,"[31018, 31018, 31374, 31018, 73336, 31374, 346...","[[[0.6540902612008017]], [[0.6123671853648576]...",,,,,,
...,...,...,...,...,...,...,...,...,...
95,[1 Meter = 3.2808399 Feet. Meter (metre in bri...,"[21340, 94933, 101291, 21340, 21340, 34334, 21...","[[[0.5496155175872887]], [[0.5301979457191792]...",,,,,,
96,"[Early on in puberty, these hormones (which, a...","[34655, 100089, 34655, 100089, 80808, 38882, 8...","[[[0.5991622322216412]], [[0.644392264256834]]...",,,,,,
97,[ASSEMBLY & CALIBRATION The GA50 laser wheel a...,"[82091, 82091, 82091, 82091, 51363, 60381, 792...","[[[0.5557077345826977]], [[0.6523398462892922]...",,,,,,
98,[The average custom home building costs for cu...,"[44995, 97357, 55153, 98757, 38888, 31649, 973...","[[[0.5166889495960728]], [[0.48625652142155706...",,,,,,


**Generation Evaluation**

Similarly to the retrieval evaluation, I don't need to re-try the generation process of the RAG pipeline, because I already obtain the generated asnwers for each instance (the only data needed for the decorated function).

In [14]:
import pandas as pd
from autorag.schema.metricinput import MetricInput
from autorag.evaluation import evaluate_generation

# Load QA dataset
qa_df = pd.read_parquet("./Output_Files/ms_marco_results.parquet", engine="pyarrow")

metric_inputs = list(map(lambda x: MetricInput(
    query=list(x[1]["query"]),
    retrieval_gt=x[1]["ground_truth_answer"],
), qa_df.iterrows()))

# Define custom generation function with decorator
@evaluate_generation(
    metric_inputs=metric_inputs,
    metrics=["bleu", "meteor", "rouge"]
)
def custom_generation(qa_df):
    generated_texts = []

    for qa in qa_df.iterrows():
        generated_texts.append(qa[1]["generated_answer"])
         
    return generated_texts, [[1, 30]] * len(generated_texts), [[-1, -1.3]] * len(generated_texts)

# Evaluate generation performance
generation_result_df = custom_generation(qa_df)
generation_result_df

Unnamed: 0,generated_texts,generated_tokens,generated_log_probs,bleu,meteor,rouge
0,The optic nerve is a sensory nerve that connec...,"[1, 30]","[-1, -1.3]",,,
1,"Cyclobenzaprine hydrochloride, commonly referr...","[1, 30]","[-1, -1.3]",,,
2,The phone number for Camp Lejeune Operator (Di...,"[1, 30]","[-1, -1.3]",,,
3,Chakalaka is a traditional South African veget...,"[1, 30]","[-1, -1.3]",,,
4,Soil temperature is a critical factor for seed...,"[1, 30]","[-1, -1.3]",,,
...,...,...,...,...,...,...
95,There are approximately 3.2808399 feet in a me...,"[1, 30]","[-1, -1.3]",,,
96,Puberty hormones are chemicals produced by the...,"[1, 30]","[-1, -1.3]",,,
97,To use laser wheel alignment gauges effectivel...,"[1, 30]","[-1, -1.3]",,,
98,The average price per square foot for custom h...,"[1, 30]","[-1, -1.3]",,,
