# Evaluation of a RAG Pipeline using the LegalBench-RAG dataset

This notebook implements a simple Retrieval-Augmented Generation (RAG) pipeline and evaluates it with frameworks such as **RAGChecker**, **ARES**, **RAGAs**, and **AutoRAG**, each of which provides a collection of metrics for assessing the retrieval and generation components of the pipeline.

The pipeline is built with the **LangChain** framework. It uses the **SQLiteVec** vector database, which stores the corpus embeddings and acts as both the vector store and the retrieval component, and OpenAI's `GPT-4o mini` model as the generator component.

## 1. Inserting API Keys

* LangChain API

In [1]:
import os
import getpass

os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("Insert LangChain API Key:")

Insert LangChain API Key: ········


* OpenAI API

In [16]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Insert OpenAI API key:")

Insert OpenAI API key: ········


## 2. Dataset Preprocessing

LegalBench-RAG is a legal benchmark that assesses RAG systems on their capacity to understand legal jargon and retrieve the legal documents pertinent to a question. It compiles legal documents from 4 datasets:

**a)** Privacy Question-Answering (**PrivacyQA**): PrivacyQA consists of 1,750 questions about the contents of privacy policies of mobile applications, with over 3,500 annotations from experts.

**b)** Contract Understanding Atticus Dataset (**CUAD**): CUAD consists of more than 500 legal documents and more than 13,000 annotations made by legal experts who are members of the Atticus Project (a non-profit organization of legal experts). The annotations span 41 label categories.

**c)** Merger Agreement Understanding Dataset (**MAUD**): MAUD is a reading comprehension dataset including over 39,000 examples and over 47,000 annotations, originating from the American Bar Association's 2021 Public Target Deal Points Study.

**d)** Contract Natural Language Inference (**ContractNLI**): ContractNLI contains 607 legal contracts and addresses the contract review automation task. The system has to decide whether each hypothesis in a given set is entailed by a specific document.

The LegalBench-RAG dataset contains 4 folders, each consisting of a set of .txt legal documents, which I preprocess into `Document` objects.

In [17]:
import os
from langchain_core.documents import Document

# List of input corpus directories
directories = ["./Datasets/contractnli/", "./Datasets/cuad/", "./Datasets/maud/", "./Datasets/privacy_qa/"]

legalbench_rag_retrieval_docs = []

# For every folder in the input corpus
for folder in directories:

    # For every .txt file in each folder
    for file in os.listdir(folder):

        # Making sure that we only account for text files
        if file.endswith(".txt"):

            filepath = folder + file
            with open(filepath, "r", encoding="utf-8") as f:
                doc = Document(
                    page_content=f.read(),
                    metadata={"source_dataset": folder, "filename": file}
                )

                legalbench_rag_retrieval_docs.append(doc)

print(f"Total number of Documents created: {len(legalbench_rag_retrieval_docs)}")

Total number of Documents created: 698


The dataset also includes 4 .json files, one for each folder, that contain test cases made up of a question and the answer spans taken from a document. I extract the QA pairs and store them in a list as dictionaries. Additionally, I assign an arbitrary `query_id` field to each QA pair, based on the order in which they are accessed.

In [18]:
import json

legalbench_rag_qa = []

benchmarks = ["./Datasets/benchmarks/contractnli.json", "./Datasets/benchmarks/cuad.json", 
              "./Datasets/benchmarks/maud.json", "./Datasets/benchmarks/privacy_qa.json"]

query_id = 1

for benchmark in benchmarks:

    # I load each .json file
    with open(benchmark, "r", encoding="utf-8") as file:
        data = json.load(file)

        # Each test instance contains a query,
        for example in data["tests"]:
            query = example["query"]

            answers = []
            # and a set of answer snippets with corresponding source file
            for snippet in example["snippets"]:
                answers.append(snippet["answer"] + "\nSource: " + snippet["file_path"])
                
            legalbench_rag_qa.append({"query_id": query_id, "query": query, "answers": answers})
            query_id += 1

print(f"Total number of question-answer pairs created: {len(legalbench_rag_qa)}")

Total number of question-answer pairs created: 6889
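
To make the record layout concrete, here is a minimal sketch that applies the same extraction to a tiny in-memory benchmark. The field names (`tests`, `query`, `snippets`, `file_path`, `answer`) are the ones read by the cell above; the query, the answer text, the file path, and the helper name `extract_qa_pairs` are invented for illustration.

```python
# Hypothetical miniature benchmark using the same field names as the real .json files.
toy_benchmark = {
    "tests": [
        {
            "query": "Does the agreement allow sharing data with third parties?",
            "snippets": [
                {
                    "file_path": "toy/agreement.txt",
                    "answer": "The Receiving Party shall not disclose Confidential Information.",
                },
            ],
        }
    ]
}


def extract_qa_pairs(data, first_query_id=1):
    """Flatten one benchmark dictionary into a list of QA dictionaries."""
    pairs = []
    for offset, example in enumerate(data["tests"]):
        # Each answer string carries its snippet text followed by the source file path.
        answers = [s["answer"] + "\nSource: " + s["file_path"] for s in example["snippets"]]
        pairs.append(
            {"query_id": first_query_id + offset, "query": example["query"], "answers": answers}
        )
    return pairs


toy_pairs = extract_qa_pairs(toy_benchmark)
print(toy_pairs[0]["answers"][0])
```

Each entry's `answers` field is therefore a list of strings, one per snippet, which is why later cells index it with `[0]` when a single reference answer is needed.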


### 2.1 Chunking

In [19]:
total_length = 0

for doc in legalbench_rag_retrieval_docs:
    total_length += len(doc.page_content)

print(f"The average size (in terms of characters) of the legal documents is: {total_length/len(legalbench_rag_retrieval_docs)}")

The average size (in terms of characters) of the legal documents is: 105009.388252149


The average size of the legal documents is approximately 105,000 characters, while GPT-4o mini has a context window of 128,000 tokens. Passing the 10 most similar documents to the model in full would easily exceed this limit. That's why I split the retrieval corpus into chunks of 2,000 characters with a 500-character overlap, so that the top-10 retrieved chunks fit comfortably within the model's context window while the overlap preserves contextual coherence across chunk boundaries.

For the chunking process, I use LangChain's `RecursiveCharacterTextSplitter`, which splits the input `Document` objects into chunks of at most the specified number of characters.
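
A back-of-the-envelope check of this context budget can be sketched as follows. The ratio of 4 characters per token is an assumed rule of thumb for English text, not a value measured on this corpus, so the token counts are only rough estimates.

```python
# Rough context-budget arithmetic.
CHARS_PER_TOKEN = 4              # assumption: rule-of-thumb ratio, not measured here
CONTEXT_WINDOW_TOKENS = 128_000  # GPT-4o mini context window
TOP_K = 10                       # number of retrieved items per query

avg_doc_chars = 105_000          # average document length reported above
chunk_chars = 2_000              # chosen chunk size

whole_docs_tokens = TOP_K * avg_doc_chars / CHARS_PER_TOKEN
chunks_tokens = TOP_K * chunk_chars / CHARS_PER_TOKEN

print(f"10 whole documents: ~{whole_docs_tokens:,.0f} tokens")
print(f"10 chunks:          ~{chunks_tokens:,.0f} tokens")
print(f"Whole documents fit: {whole_docs_tokens <= CONTEXT_WINDOW_TOKENS}")
print(f"Chunks fit:          {chunks_tokens <= CONTEXT_WINDOW_TOKENS}")
```

Under this assumption, ten average-sized documents come to roughly twice the context window, whereas ten 2,000-character chunks use only a small fraction of it.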

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=500, add_start_index=True)

legalbench_rag_chunks = text_splitter.split_documents(legalbench_rag_retrieval_docs)

print(f"Total number of chunks: {len(legalbench_rag_chunks)}")

Total number of chunks: 58387
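
To illustrate what a chunk size and a chunk overlap mean, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_window_chunks` and the toy text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        # Store the start offset next to the chunk, similar in spirit to add_start_index=True
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(toy_text, chunk_size=10, chunk_overlap=4)
for start, chunk in toy_chunks:
    print(start, chunk)
```

With a chunk size of 10 and an overlap of 4, each window starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next.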


## 3. SQLiteVec Vector Store

My RAG pipeline employs a SQLiteVec vector store that uses OpenAI's `text-embedding-3-large` model to embed the chunks and store their embeddings.

In [20]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

**NOTE:** Unlike the Chroma vector store, which checks whether a vector store already exists in the specified directory and loads it if so, SQLiteVec will try to recreate the database from scratch.

**Note on a typo found after running the cell:** the `print()` calls in the cell below should refer to the SQLiteVec vector store rather than the Chroma vector store.

In [9]:
from langchain_community.vectorstores import SQLiteVec
from tqdm import tqdm
from uuid import uuid4
import os

# Initialization of the SQLiteVec vector store
legalbench_rag_vector_store = SQLiteVec(
    table="LegalBench_RAG",
    db_file="./Vector_Stores/LegalBench_RAG/legalbench_rag.db",
    embedding=embeddings,
    connection=None
)

print("Chroma vector store is empty. Inserting chunk embeddings....")
    
# creating a unique identifier for each dataset chunk to be stored
uuids = [str(uuid4()) for _ in range(len(legalbench_rag_chunks))]

batch_size = 1000

# Lists storing the text passages and their corresponding metadata
batch_texts = []
batch_metadata = []

for i in tqdm(range(0, len(legalbench_rag_chunks), batch_size)):
    batch_texts = [chunk.page_content for chunk in legalbench_rag_chunks[i:i+batch_size]]
    batch_metadata = [chunk.metadata for chunk in legalbench_rag_chunks[i:i+batch_size]]
    
    legalbench_rag_vector_store.add_texts(
        texts=batch_texts,
        metadatas=batch_metadata
    )
print("Chunk embeddings added to the vector store.")

print("Chroma vector store initialized.")

Chroma vector store is empty. Inserting chunk embeddings....


100%|██████████████████████████████████████████████████████████████████████████████████| 59/59 [31:00<00:00, 31.54s/it]

Chunk embeddings added to the vector store.
Chroma vector store initialized.
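
Since re-running the ingestion cell would embed all chunks again (about 31 minutes in the run above), one option is to check whether the database file already holds data before inserting. The following is a minimal sketch that uses only Python's built-in `sqlite3` module; it assumes the store keeps its rows in an ordinary SQLite table named after the `table` argument, and the helper name `table_has_rows` is hypothetical rather than part of LangChain.

```python
import os
import sqlite3
import tempfile


def table_has_rows(db_file, table):
    """Return True if db_file exists and its table `table` holds at least one row."""
    if not os.path.exists(db_file):
        return False
    connection = sqlite3.connect(db_file)
    try:
        found = connection.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
        ).fetchone()
        if found is None:
            return False
        # Table names cannot be bound as parameters, so the name is quoted manually.
        return connection.execute(f'SELECT 1 FROM "{table}" LIMIT 1').fetchone() is not None
    finally:
        connection.close()


# Demonstration on a throwaway database (not the real vector store).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_db = os.path.join(tmp_dir, "demo.db")
    before = table_has_rows(demo_db, "LegalBench_RAG")

    demo_connection = sqlite3.connect(demo_db)
    demo_connection.execute('CREATE TABLE "LegalBench_RAG" (text TEXT)')
    demo_connection.execute('INSERT INTO "LegalBench_RAG" VALUES (?)', ("toy chunk",))
    demo_connection.commit()
    demo_connection.close()

    after = table_has_rows(demo_db, "LegalBench_RAG")

print(before, after)
```

In the notebook, such a check could wrap the batching loop so that chunks are only embedded when the table is missing or empty.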





## 4. Testing the RAG pipeline

In this section, I define the retrieval and generation functionalities. Then, I query the SQLiteVec vector store, which also acts as the retriever.

Given that the LegalBench-RAG QA set contains 6,889 instances and that SQLiteVec doesn't support parallel execution of queries, I randomly sample 100 QA instances from the dataset by selecting their corresponding indices.
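
Conceptually, the retrieval step ranks the stored chunk embeddings by their similarity to the query embedding and keeps the best `k`. The sketch below shows this with toy 2-dimensional vectors and cosine similarity; both are illustrative assumptions, since real embeddings have far more dimensions and the distance measure SQLiteVec applies internally may differ. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k):
    """Return the indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, v), i) for i, v in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Invented toy embeddings: four "chunks" and one "query".
toy_chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query_vector = [0.9, 0.1]

toy_top = top_k(toy_query_vector, toy_chunk_vectors, k=2)
print(toy_top)
```

The query vector points almost in the same direction as the first chunk vector and reasonably close to the third, so those two indices are returned.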

In [7]:
import random

# Setting the seed for reproducibility of the experiments
random.seed(100)

sampled_indices = random.sample(range(6889), 100)

# Defining the list of remaining available indices that will later be used in the ARES evaluation process 
available_indices = set(range(6889)) - set(sampled_indices)

legalbench_rag_subset = []

for index in sampled_indices:
    legalbench_rag_subset.append(legalbench_rag_qa[index])

* At this point, I define a `results` list that stores one dictionary per sampled instance, holding its query, query_id, ground-truth answer, generated answer, and retrieved context passages.

In [9]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import SQLiteVec
from langchain import hub
from tqdm import tqdm

# Instantiating the LegalBench-RAG's SQLiteVec vector store
connection = SQLiteVec.create_connection(db_file="./Vector_Stores/LegalBench_RAG/legalbench_rag.db")

legalbench_rag_vector_store = SQLiteVec(
    table="LegalBench_RAG",
    embedding=embeddings,
    connection=connection
)

# Defining the prompt template of the RAG pipeline
prompt_template = """Answer the following question:
\n\n
{question}
\n\n

Using the following list of context passages:
\n\n
{context}
\n\n"""

prompt = ChatPromptTemplate.from_template(prompt_template)


# Instantiating the generator model GPT-4o mini
generator_llm = ChatOpenAI(model="gpt-4o-mini")


# RAG PIPELINE
print("Testing the RAG pipeline and collecting the retrieved passages and generated answers....")
results = []

for qa in tqdm(legalbench_rag_subset):
    # RETRIEVAL
    retrieved_chunks = legalbench_rag_vector_store.similarity_search(qa["query"], k=10)
    
    # Concatenating the text of the retrieved chunks into a single context string,
    # so that only the passage text of each chunk, and not its metadata, reaches the prompt.
    retrieved_context = ""
    for context in retrieved_chunks:
        retrieved_context += context.page_content + "\n\n"

    # GENERATION
    # Formatting the input prompt and calling GPT-4o mini with it.
    prompt_message = prompt.format_prompt(question=qa["query"], context=retrieved_context)
    response = generator_llm.invoke(prompt_message)

    result = {
        "query_id": qa["query_id"],
        "query": qa["query"],
        "ground_truth_answer": qa["answers"],
        "generated_answer": response.content,
        "retrieved_chunks": retrieved_chunks
    }
    results.append(result)

print("The testing process is completed.")

Testing the RAG pipeline and collecting the retrieved passages and generated answers....


100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [10:50<00:00,  6.51s/it]

The testing process is completed.





**Checkpoint -** Storing the results of the RAG pipeline's test run in a `.json` file, as a checkpoint for the evaluation process.

In [10]:
import json

# Preprocessing of the results, since the retrieved chunks, which are `Document` objects, are not JSON serializable
serializable_results = []

for res in results:
    serializable_res = res.copy()
    serializable_res["retrieved_chunks"] = [{"page_content": chunk.page_content} for chunk in res["retrieved_chunks"]]
    serializable_results.append(serializable_res)
    
# Saving the results list to a JSON file
with open("./Output_Files/legalbench_rag_results.json", "w", encoding="utf-8") as file:
    json.dump(serializable_results, file, indent=4)

## 5. RAG Evaluation

### 5.1 RAGChecker

Loading the RAG pipeline's test results:

In [11]:
import json

with open("./Output_Files/legalbench_rag_results.json", "r", encoding="utf-8") as file:
    json_results = json.load(file)

RAGChecker requires that the RAG pipeline's results should be formatted as:

```json
{
    "results": [
        {
            "query_id": <query's identifier as a string value>,
            "query": <The actual input query>,
            "gt_answer": <The ground-truth answer provided in the dataset>,
            "response": <RAG pipeline's generated response>,
            "retrieved_context": [ <The list of the retrieved chunks, which are pertinent to the input  query>
                {
                    "doc_id": <The document identifier of the retrieved context passage as a string value>,
                    "text": <The actual retrieved context passage>
                },
                {
                    "doc_id": <The document identifier of the retrieved context passage as a string value>,
                    "text": <The actual retrieved context passage>
                }
                ......
            ]
        }
        ......
    ]
}
```

However, the RAG pipeline's results stored in `legalbench_rag_results.json` (loaded above as `json_results`) have the following format:

```json
[
    {
        "query_id": <The input query's identifier>,
        "query": <The actual input query>,
        "ground_truth_answer": <The ground-truth answer provided in the dataset>,
        "generated_answer": <RAG pipeline's generated response>,
        "retrieved_chunks": [
            {
                "page_content": <The context passage of the retrieved chunk>
            }
        ]
    }
    .......
]
```

Given that, a necessary preprocessing step is to convert the results' format to the exact format specified by RAGChecker. 

In [12]:
ragchecker_results = {"results": []}

for res in json_results:
    if (len(res["ground_truth_answer"]) > 0):
        formatted_res = {
            "query_id": str(res["query_id"]),
            "query": res["query"],
            "gt_answer": res["ground_truth_answer"][0],
            "response": res["generated_answer"],
            "retrieved_context": []
        }
    
        for chunk in res["retrieved_chunks"]:
            formatted_res["retrieved_context"].append({"text": chunk["page_content"]})
    
        ragchecker_results["results"].append(formatted_res)

**Checkpoint -** Storing the RAG pipeline's results in the format required by RAGChecker.

In [13]:
with open("./Output_Files/legalbench_rag_ragchecker.json", "w", encoding="utf-8") as file:
    json.dump(ragchecker_results, file, indent=4)

**RAGChecker Evaluation**

RAGChecker computes a set of overall, retrieval, and generation metrics:

| Metric | Description |
|--------|-------------|
| Precision | The fraction of correct generated claims $c_i^{(m)}$ in the generated response $m$ |
| Recall | The fraction of ground-truth claims $c_i^{(gt)}$ that can be found in the model response |
| F1 Score | The harmonic mean of Precision and Recall |
| Claim Recall | The fraction of ground-truth claims $c^{(gt)}$ that can be found in the set of retrieved chunks $\{chunk_j\}$ |
| Context Precision | The fraction of relevant chunks $\{r\text{-}chunk_j\}$ from the $k$ retrieved chunks |
| Faithfulness | The fraction of the model-generated claims $c_i^{(m)}$ that can be attributed to retrieved chunks. |
| Relevant Noise Sensitivity | The fraction of generated claims $c_i^{(m)}$ that are incorrect and extracted from relevant retrieved chunks |
| Irrelevant Noise Sensitivity | The fraction of generated claims $c_i^{(m)}$ that are incorrect and extracted from irrelevant retrieved chunks |
| Hallucination | The fraction of generated claims $c_i^{(m)}$ that belong neither in the ground-truth answer $gt$ nor in any retrieved chunk |
| Self Knowledge | The fraction of generated claims $c_i^{(m)}$ that can be traced in the ground-truth $gt$ but not in any retrieved chunk |
| Context Utilization | The fraction of ground-truth claims $c_i^{(gt)}$ that can be found in the set of retrieved chunks, that can also be extracted from the generated response $m$ |

Running the RAGChecker evaluation with `gpt-3.5-turbo` as both the extractor and the checker model.

In [17]:
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

# initialization of rag_results from json/dict
with open("./Output_Files/legalbench_rag_ragchecker.json") as fp:
    rag_results = RAGResults.from_json(fp.read())

# Setting up the evaluator using "gpt-3.5-turbo" as the extractor and checker model.
evaluator = RAGChecker(
    extractor_name="gpt-3.5-turbo",
    checker_name="gpt-3.5-turbo",
    batch_size_extractor=10,
    batch_size_checker=10
)

# Evaluating results on all metrics, holistic, retrieval, and generation metrics
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)

* 'fields' has been removed
2025-01-05 19:01:23.735 | INFO     | ragchecker.evaluator:extract_claims:113 - Extracting claims for gt_answer of 100 RAG results.
2025-01-05 19:02:01.104 | INFO     | ragchecker.evaluator:check_claims:173 - Checking retrieved2answer for 100 RAG results.
2025-01-05 19:04:55.848 | INFO     | ragchecker.evaluator:extract_claims:113 - Extracting claims for response of 100 RAG results.
2025-01-05 19:05:36.159 | INFO     | ragchecker.evaluator:check_claims:173 - Checking answer2response for 100 RAG results.
2025-01-05 19:05:56.075 | INFO     | ragchecker.evaluator:check_claims:173 - Checking response2answer for 100 RAG results.
2025-01-05 19:06:10.659 | INFO     | ragchecker.evaluator:check_claims:173 - Checking retrieved2response for 100 RAG results.
100%|████████████████████████████████████████████████████████████████████████████████| 191/191 [03:18<00:00, 

RAGResults(
  100 RAG results,
  Metrics:
  {
    "overall_metrics": {
      "precision": 66.1,
      "recall": 64.5,
      "f1": 60.9
    },
    "retriever_metrics": {
      "claim_recall": 82.6,
      "context_precision": 91.6
    },
    "generator_metrics": {
      "context_utilization": 75.6,
      "noise_sensitivity_in_relevant": 26.3,
      "noise_sensitivity_in_irrelevant": 0.7,
      "hallucination": 6.8,
      "self_knowledge": 0.8,
      "faithfulness": 92.3
    }
  }
)





Storing RAGChecker's evaluation results in a text file that will also hold the evaluation results of the other 3 frameworks.

In [18]:
with open("./Output_Files/legalbench_rag_framework_results.txt", "w", encoding="utf-8") as file:
    file.write("RAGChecker results:\n\n" + str(rag_results))

### 5.2 RAGAs

Due to dependency conflicts between packages required by both RAGChecker and RAGAs, such as transformers and scikit-learn, RAGAs is executed in a separate Anaconda environment, which is created and set up with the following commands:

```
conda create --name rag_eval_ragas python=3.11.5
conda install jupyter
pip install ragas 
```

**The cells of this subsection should be executed on the "rag_eval_ragas" environment.**

Activating the environment with the command:
```
conda activate rag_eval_ragas
```

Loading the RAG pipeline's test results:

In [4]:
import json

with open("./Output_Files/legalbench_rag_results.json", "r", encoding="utf-8") as file:
    json_results = json.load(file)

**RAGAs Evaluation**

RAGAs provides a series of RAG evaluation metrics as well as general LLM evaluation metrics, but in this setting, I chose to compute the subset of those metrics that are described in the thesis report:

| RAG Metric | Description |
|--------|-------------|
| Faithfulness | The fraction of inferred claims $|V|$ from the total claims $|S|$ |
| Answer Relevance | The average cosine similarity of the input query and a set of LLM-generated queries that can possibly result in the same generated answer |
| Context Relevance | The fraction of the sentences in the context that successfully address the question |
| Context Precision@$K$ | The fraction of contextually relevant retrieved chunks - Average of Precision@$k$ for all $k$ up to $K$ |
| Context Recall | The fraction of relevant chunks that were actually retrieved based on the input query |
| Context Entity Recall | Examines how many of the ground-truth entities are present in the retrieved context |
| Relevant/Irrelevant Noise Sensitivity | Same metrics as RAGChecker |
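
As an illustration of the Context Precision@$K$ row, the following sketch averages Precision@$k$ over the ranks that hold a relevant chunk, which is how this metric is commonly defined. Treat the exact weighting as an assumption about the definition rather than a copy of the RAGAs implementation; the relevance flags are invented, whereas in RAGAs they are judged by an LLM.

```python
def context_precision_at_k(relevance):
    """Average of Precision@k over the ranks k that hold a relevant chunk.

    relevance: list of 0/1 flags, one per retrieved chunk, in ranked order.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # Precision@rank, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0


# Invented ranking: the 1st and 3rd of 5 retrieved chunks are relevant.
toy_relevance = [1, 0, 1, 0, 0]
toy_score = context_precision_at_k(toy_relevance)
print(round(toy_score, 4))
```

For this ranking the score is (1/1 + 2/3) / 2, so relevant chunks that appear early in the ranking raise the score more than the same chunks appearing late.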

| General LLM Metric | Description |
|--------------------|-------------|
| Non-LLM Semantic Similarity | The similarity (on a scale of 0 to 1) between the model response and the ground-truth answer using distance measurements|
| BLEU | The similarity (on a scale of 0 to 1) between the model response and the ground-truth answer using the n-gram precision |
| ROUGE | The fraction of overlap between the generated and reference response based on the n-gram precision, recall and F1 score |
| Exact Match | Tests whether the generated response is exactly the same (1), or not (0), as the ground-truth answer |
| String Presence | Checks whether the generated response contains (1), or not (0), several parts or keywords of the reference answer |
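
The two binary metrics in the table above can be sketched in a few lines. These are simplified stand-ins written for illustration, with hypothetical function names and invented strings; the RAGAs implementations may normalize or match text differently.

```python
def exact_match(response, reference):
    """1 if the response is character-for-character identical to the reference, else 0."""
    return int(response == reference)


def string_presence(response, reference):
    """1 if the reference text appears somewhere inside the response, else 0."""
    return int(reference in response)


# Invented example strings.
toy_response = "The notice period is 30 days."
toy_reference = "30 days"

toy_exact = exact_match(toy_response, toy_reference)
toy_presence = string_presence(toy_response, toy_reference)
print(toy_exact, toy_presence)
```

Long generated answers will rarely match a reference exactly, which is why a containment check can be the more informative of the two for RAG outputs.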

However, as with the MS MARCO dataset, calculating all of these metrics for 100 RAG result instances is a resource-intensive process, requiring almost 10 minutes per instance!

That's why I decided to calculate only the 3 most fundamental metrics: **Faithfulness**, **Answer Relevance** (defined as response relevancy in the code), and **Context Relevance** (defined as context recall in the code).

In [10]:
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from tqdm import tqdm

# Importing RAG metrics
from ragas.metrics import Faithfulness, ResponseRelevancy, LLMContextRecall

# Defining the LLM and Embeddings model that are required parameters for some evaluation metrics
llm = ChatOpenAI(model="gpt-4o-mini")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# The LLM and Embeddings model need to be wrapped so that they conform with RAGAs' interface
langchain_llm = LangchainLLMWrapper(langchain_llm=llm)
langchain_embeddings = LangchainEmbeddingsWrapper(embeddings=embeddings)

# Defining all the scoring functions that leverage the LLM to compute the respective evaluation metric 
faithfulness = Faithfulness(llm=langchain_llm)
answer_relevance = ResponseRelevancy(llm=langchain_llm, embeddings=langchain_embeddings)
context_relevance = LLMContextRecall(llm=langchain_llm)


# Initializing Metrics dictionary
# In every iteration of the evaluation process, 
# this dictionary will hold the running sum of each metric for the evaluated dataset instances up to this point,
# and, in the end, will store the average value for every metric
metrics =  {
    "faithfulness" : 0,
    "answer_relevance" : 0,
    "context_relevance" : 0
}

# Main evaluation loop
print("Evaluating the RAG pipeline results using RAGAS....")

for rag_result in tqdm(json_results):

    # Each rag result instance must be converted into a single-turn sample instance
    sample = SingleTurnSample(
        user_input=rag_result["query"],
        response=rag_result["generated_answer"],
        reference=rag_result["ground_truth_answer"][0], # ground_truth_answer is a list and its first element is the actual string value
        retrieved_contexts=[context["page_content"] for context in rag_result["retrieved_chunks"]]
    )

    # calculating the metrics for each rag result instance and adding them to the running sums in the 'metrics' dictionary
    metrics["faithfulness"] += await faithfulness.single_turn_ascore(sample)
    metrics["answer_relevance"] += await answer_relevance.single_turn_ascore(sample)
    metrics["context_relevance"] += await context_relevance.single_turn_ascore(sample)

# After the evaluation loop is completed, I compute the average of each metric over all evaluated rag result instances.
for metric_label in metrics:
    metrics[metric_label] /= len(json_results)

print("The evaluation process has been completed.")

Evaluating the RAG pipeline results using RAGAS....


100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [44:55<00:00, 26.95s/it]

The evaluation process has been completed.





In [11]:
print(f"The RAGAs average metrics for the sample of 100 RAG results are:\n {metrics}")

The RAGAs average metrics for the sample of 100 RAG results are:
 {'faithfulness': 0.8721074718643704, 'answer_relevance': 0.7764295769799838, 'context_relevance': 0.5683333333333332}
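
Because each metric call mostly waits on LLM requests, one way to shorten such a run might be to score the metrics of a sample concurrently rather than one after another. The sketch below shows the pattern with `asyncio.gather`; `toy_metric_score` is a hypothetical stand-in for a metric's async scoring call, and its scores are invented.

```python
import asyncio


async def toy_metric_score(name, sample):
    """Hypothetical stand-in for a metric's async scoring call."""
    await asyncio.sleep(0.01)  # simulates waiting on an LLM request
    return len(sample) / 10    # invented score, for illustration only


async def score_sample(sample, metric_names):
    """Score one sample on all metrics concurrently instead of sequentially."""
    scores = await asyncio.gather(
        *(toy_metric_score(name, sample) for name in metric_names)
    )
    return dict(zip(metric_names, scores))


toy_metric_names = ["faithfulness", "answer_relevance", "context_relevance"]
toy_scores = asyncio.run(score_sample("abcde", toy_metric_names))
print(toy_scores)
```

Inside a notebook, where an event loop is already running, the call would be `await score_sample(...)` instead of `asyncio.run(...)`. Whether this helps in practice depends on the API rate limits of the underlying models.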


In [12]:
with open("./Output_Files/legalbench_rag_framework_results.txt", "a+", encoding="utf-8") as file:
    file.write("\n\nRAGAs results:\n\n" + str(metrics))