# RAG Evaluation Test Set Generation

This example shows how to use the [Ragas](https://docs.ragas.io/en/stable/) framework (```v0.1.22```) to generate a **test set** for evaluating the quality of a RAG pipeline. We then use the Python [LangChain](https://python.langchain.com/docs/introduction/) library to run the test questions through this pipeline and evaluate the quality of the results.

### <u>Requirements</u>
1. Because you will be accessing the LLMs and embedding models through Vector AI Engineering's Kaleidoscope Service (Vector Inference + Autoscaling), you will need to request a KScope API Key:

      Run the following command (replace ```<user_id>``` and ```<password>```) from **within the cluster** to obtain the API Key. The ```access_token``` in the output is your KScope API Key.
  ```bash
  curl -X POST -d "grant_type=password" -d "username=<user_id>" -d "password=<password>" https://kscope.vectorinstitute.ai/token
  ```
2. After obtaining the `.env` configurations, create the ```.kscope.env``` file in your home directory (```/h/<user_id>```) and set the following environment variables:
- For local models through Kaleidoscope (KScope):
    ```bash
    export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"
    export OPENAI_API_KEY=<kscope_api_key>
    ```
- For OpenAI models:
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=<openai_api_key>
   ```
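The notebook later loads these files with `load_env_file()` from the repo's `utils.load_secrets` module, whose implementation is not shown here. As a rough sketch of what such a loader might do, the hypothetical `parse_export_lines` helper below reads `export KEY=value` lines into a dictionary. The function names and parsing rules are assumptions for illustration, not the repo's actual code.

```python
import os
from pathlib import Path


def parse_export_lines(text: str) -> dict[str, str]:
    """Parse `export KEY=value` / `KEY="value"` lines (hypothetical helper)."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        # Drop an optional leading `export `
        if line.startswith("export "):
            line = line[len("export "):].strip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line
        # Remove surrounding quotes, if any
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def load_env_file_sketch(path: Path) -> dict[str, str]:
    """Read an env file and copy its variables into os.environ (assumed behaviour)."""
    env = parse_export_lines(path.read_text())
    os.environ.update(env)
    return env


sample = (
    'export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"\n'
    "export OPENAI_API_KEY=abc123\n"
)
parsed = parse_export_lines(sample)
print(parsed["OPENAI_BASE_URL"])
```

Keeping the key in a file under your home directory, rather than in the notebook itself, avoids committing secrets along with the code.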

## Set up the RAG workflow environment

#### Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
#!pip install pdfplumber


In [3]:
#import pdfplumber

In [4]:
import numpy as np
import os
import sys

from datasets import Dataset
from pathlib import Path

from langchain.chains import RetrievalQA
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas import evaluate
from ragas.metrics import Faithfulness, ContextPrecision, AnswerCorrectness
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

#### Load config files

In [5]:
# Add root folder of the rag_bootcamp repo to PYTHONPATH
current_dir = Path().resolve()
parent_dir = current_dir.parent
sys.path.insert(0, str(parent_dir))



In [6]:
from utils.load_secrets import load_env_file
load_env_file()

#### Set up some helper functions

In [7]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

#### Make sure other necessary items are in place

In [8]:
# Look for the source_documents folder and make sure it contains at least one PDF file
documents_path = "./source_documents"
if not os.path.exists(documents_path):
    print(f"ERROR: The {documents_path} subfolder must exist under this notebook")
else:
    # Only list the folder once we know it exists (os.listdir raises otherwise)
    contains_pdf = any(name.lower().endswith(".pdf") for name in os.listdir(documents_path))
    if not contains_pdf:
        print(f"ERROR: The {documents_path} subfolder must contain at least one .pdf file")

## Generate a synthetic test set

#### Start by loading in the documents we'll be using to augment our RAG generations

In [9]:
#loader = PyPDFDirectoryLoader(documents_path)
#documents = loader.load()
#for document in documents:
#    document.metadata['file_name'] = document.metadata['source']
    
#%%time
# Load the IBIS pdfs
#directory_path = "./source_documents"
directory_path = "/projects/RAG2/scotia-2/Datasets-Scotia-2/IBIS"
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source documents: {len(docs)}")
for document in docs:
    document.metadata['file_name'] = document.metadata['source']
# Split the documents into smaller chunks
#text_splitter = RecursiveCharacterTextSplitter(chunk_size=3600, chunk_overlap=32)
#chunks = text_splitter.split_documents(docs)
#print(f"Number of text chunks: {len(chunks)}")

Number of source documents: 280


In [9]:

from langchain.document_loaders import TextLoader

In [12]:
directory_path = "/projects/RAG2/scotia-2/Datasets-Scotia-2/PDF_text"

documents = []
for filename in os.listdir(directory_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory_path, filename)
        print(file_path)
        loader = TextLoader(file_path)

        # loader.load() returns a list of Document objects
        document = loader.load()
        documents.extend(document)
        #chunks2 =text_splitter.split_documents(document)
        #print(f"Number of text chunks: {len(chunks2)}")
        #chunks= chunks +chunks2

/projects/RAG2/scotia-2/Datasets-Scotia-2/PDF_text/11114CA Wheat Farming in Canada Industry Report.txt
/projects/RAG2/scotia-2/Datasets-Scotia-2/PDF_text/48422CA Local Specialized Freight Trucking in Canada Industry Report.txt
/projects/RAG2/scotia-2/Datasets-Scotia-2/PDF_text/44111CA New Car Dealers in Canada Industry Report.txt
/projects/RAG2/scotia-2/Datasets-Scotia-2/PDF_text/48412CA Long-Distance Freight Trucking in Canada Industry Report.txt
/projects/RAG2/scotia-2/Datasets-Scotia-2/PDF_text/33639CA Auto Parts Manufacturing in Canada Industry Report.txt
/projects/RAG2/scotia-2/Datasets-Scotia-2/PDF_text/48423CA Long-Distance Specialized Freight Trucking in Canada Industry .txt
/projects/RAG2/scotia-2/Datasets-Scotia-2/PDF_text/11115CA Corn Farming in Canada Industry Repor.txt


In [13]:
type(document)

list

#### Now use OpenAI to generate a test set from the data in these documents (This takes about 2-3 minutes)

**Important:** The LLM and embedding model used for test set generation should be more capable than the model being evaluated. Hence, we will use OpenAI GPT-4o and OpenAI embeddings for this purpose.

Store your OpenAI API key in ```~/.ragas_openai.env``` using the following format (this is in addition to ```~/.kscope.env```):

```bash
export RAGAS_OPENAI_BASE_URL="https://api.openai.com/v1"
export RAGAS_OPENAI_API_KEY=<openai_api_key>
```

In [14]:
from utils.load_secrets import load_env_file_ragas
load_env_file_ragas()

In [15]:
generator_llm = ChatOpenAI(
    model="gpt-4o",
    base_url=os.environ["RAGAS_OPENAI_BASE_URL"],
    api_key=os.environ["RAGAS_OPENAI_API_KEY"],
)
generator_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    base_url=os.environ["RAGAS_OPENAI_BASE_URL"],
    api_key=os.environ["RAGAS_OPENAI_API_KEY"],
)

In [16]:
#generator_llm = (
#    model="DeepSeek-R1-Distill-Llama-8B",
#    base_url=os.environ["OPENAI_BASE_URL"],
#    api_key=os.environ["OPENAI_API_KEY"],
#)
# Define the RAG embeddings model (different than the OpenAI embedding model defined above for test set generation)
model_kwargs = {'device': 'cuda', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
print(f"Setting up the RAG LLM...")
llm = ChatOpenAI(
    #model="DeepSeek-R1-Distill-Llama-8B",
    model="Meta-Llama-3.1-8B-Instruct",
    temperature=0,
    max_tokens=256,
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the RAG LLM...


In [18]:
%%time
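# Create the generator. Note: this run passes the RAG LLM (`llm`) and the HuggingFace
# `embeddings` defined above; to generate with the OpenAI models instead, pass
# `generator_llm` and `generator_embeddings`.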
generator = TestsetGenerator.from_langchain(
    generator_llm=llm,
    critic_llm=llm,
    embeddings=embeddings,
)

# Generate the test set
testset = generator.generate_with_langchain_docs(
    documents=documents, 
    test_size=50,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

embedding nodes:   0%|          | 0/260 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/50 [00:00<?, ?it/s]

Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)

Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. 

CPU times: user 35.4 s, sys: 1.92 s, total: 37.4 s
Wall time: 8min 30s
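The `distributions` argument above assigns a share of the `test_size` questions to each evolution type. As a rough illustration of the arithmetic only (this is not Ragas's internal code, and `target_counts` is a hypothetical helper), the shares used here translate into the following per-type targets. How fractional targets are rounded, and how many questions survive parsing failures, is up to the library.

```python
def target_counts(test_size: int, distributions: dict[str, float]) -> dict[str, float]:
    """Return the number of questions each evolution type is nominally allotted."""
    total_share = sum(distributions.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"shares must sum to 1.0, got {total_share}")
    return {name: test_size * share for name, share in distributions.items()}


# Same shares as in the generation cell, keyed by name for illustration
counts = target_counts(50, {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25})
for name, count in counts.items():
    print(f"{name}: {count}")
```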


#### Preview the test dataset so far

In [19]:
testset1 = testset.to_pandas()

In [20]:
# testset1.to_parquet ('../../Testing_Data/testset_llama.parquet')
# testset1.to_csv ('../../Testing_Data/testset_llama.csv', index =False)

In [25]:
import joblib

# Save (serialize) the object to a file
joblib.dump(testset, "../../Testing_Data/testset_openai_clean.pkl")


['../../Testing_Data/testset_openai_clean.pkl']

In [22]:
testset1.to_parquet ('../../Testing_Data/testset_openai_clean.parquet')
testset1.to_csv ('../../Testing_Data/testset_openai_clean.csv', index =False)

In [26]:
#load 
# Load (deserialize) the object from file
testset2 = joblib.load("../../Testing_Data/testset_openai_clean.pkl")


In [27]:
print (testset==testset2)

True


## Now, start the RAG pipeline!

#### Choose the RAG LLM and embedding model
Note: These are different from the OpenAI LLM and embedding model defined above for test set generation.

In [28]:
RAG_LLM_MODEL_NAME = "Meta-Llama-3.1-8B-Instruct"
RAG_EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

#### Generate answers for all the questions in our test set

Go through the embedding, storage and retrieval steps.

In [31]:
# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2500, chunk_overlap=32)
chunks = text_splitter.split_documents(documents)
print(f"Number of text chunks: {len(chunks)}")

Number of text chunks: 233
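To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified stand-in, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line breaks. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two characters of the previous one, which is the role the 32-character overlap plays in the cell above: a sentence cut at a chunk boundary still appears partly in both neighbours.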


In [32]:
# Define the RAG embeddings model (different than the OpenAI embedding model defined above for test set generation)
model_kwargs = {'device': 'cuda', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print(f"Setting up the RAG embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name=RAG_EMBEDDING_MODEL_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the RAG embeddings model...


In [33]:
# Create the vector store and the retriever
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
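The retriever returns the `k` chunks whose embeddings are closest to the query embedding. With `normalize_embeddings=True` every vector has unit length, so the dot product of two vectors equals their cosine similarity, and ranking by Euclidean distance gives the same order as ranking by cosine similarity. The sketch below illustrates top-k retrieval in plain Python with toy 2-D vectors. It is not how FAISS is implemented, and `normalize`, `dot`, and `top_k` are hypothetical helpers.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], doc_vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k documents most similar to the query."""
    q = normalize(query)
    # For unit vectors, the dot product is the cosine similarity
    scored = [(dot(q, normalize(d)), i) for i, d in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranked = top_k([2.0, 0.1], docs, k=2)
print(ranked)
```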

In [34]:
RAG_LLM_MODEL_NAME

'Meta-Llama-3.1-8B-Instruct'

In [35]:
%%time
# Define the RAG LLM (different than the OpenAI LLM defined above for test set generation)
print(f"Setting up the RAG LLM...")
llm = ChatOpenAI(
    model=RAG_LLM_MODEL_NAME,
    temperature=0,
    max_tokens=256,
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)

Setting up the RAG LLM...
CPU times: user 12.6 ms, sys: 7.32 ms, total: 19.9 ms
Wall time: 19.1 ms


Iterate over the questions in our synthetic test set and run each one through the RAG pipeline to see what answers are returned. (This also takes 2-3 minutes.)

In [36]:
%%time
dataset = testset.to_dataset()
answers = np.empty(len(dataset), dtype=object)

# Build the RAG pipeline once and reuse it for every question
rag_pipeline = RetrievalQA.from_llm(
    llm=llm,
    retriever=retriever
)

for index, row in enumerate(dataset):
    query = row["question"]

    # Run the query through the RAG pipeline
    answer = rag_pipeline.invoke(input=query)
    answer = answer["result"]
    print(f"Result {index}\nQuestion: {query}\nAnswer: {answer}\n")

    # Store the result
    answers[index] = answer

Result 0
Question: Here is a question that can be fully answered from the given context:

"What drives the demand for dry bulk transportation in the local specialized freight trucking industry in Canada?"

This question can be answered by referencing the context, which states that "Dry bulk transportation swells alongside construction activity" and that "Strong construction activity has supported this segment but is expected to trend downward through the end of 2024 as tightened monetary policy pressures businesses' financing ability.
Answer: The demand for dry bulk transportation in the local specialized freight trucking industry in Canada is driven by construction activity.

Result 1
Question: Here is a question that can be fully answered from the given context:

"What are the expected outcomes of the government's pro-immigration policies on the driver shortage and turnover in the long-distance specialized freight trucking industry in Canada?
Answer: The expected outcomes of the gove

Result 12
Question: Here is a question that can be fully answered from the given context:

"What is the expected Compound Annual Growth Rate (CAGR) of revenue for local specialized freight trucking companies in Canada through the end of 2029?
Answer: The expected Compound Annual Growth Rate (CAGR) of revenue for local specialized freight trucking companies in Canada through the end of 2029 is 1.5%.

Result 13
Question: Here is a question that can be fully answered from the given context:

"According to the Electric Vehicle Availability Standard, what percentage of new cars, SUVs, crossovers, and light-duty pickup trucks sold in Canada must be 100% zero-emission by 2035?"

This question can be answered directly from the context, which states: "Canada unveiled the Electric Vehicle Availability Standard, requiring all new cars, SUVs, crossovers and light-duty pickup trucks sold in Canada to be 100% zero-emission by 2035.
Answer: According to the Electric Vehicle Availability Standard, 100

Result 23
Question: Here is a question that can be fully answered from the given context:

"Despite the importance of livestock transportation, what has increased the complexity and documentation requirements for this service in Canada?"

This question can be answered by referencing the following sentence in the context:

"The lack of viable alternatives causes livestock companies to depend on trucking for animal transportation. Nonetheless, stringent animal welfare regulations in Canada have increased the complexity and documentation requirements for livestock transport.
Answer: Stringent animal welfare regulations in Canada have increased the complexity and documentation requirements for livestock transport.

Result 24
Question: Here is a question that can be fully answered from the given context:

"Are regulatory barriers at both the provincial and federal levels governing the testing of autonomous vehicles in Canada?"

This question is formed using the topic "Regulatory barriers in

Result 37
Question: Here's a rewritten version of the question:

"What triggers industry growth in Canada when a key competitor lowers prices?"

I've used abbreviations like "key" instead of "major" and "triggers" instead of "drives" to make the question shorter and more indirect.
Answer: Based on the provided context, it appears that industry growth in Canada is influenced by various factors, but there is no specific information on what triggers growth when a key competitor lowers prices.

However, the context does mention that severe input shortages have decimated the supply of cars, leading to higher costs and prices for consumers. This has given buyers significant pricing power, allowing them to negotiate prices with dealers.

Additionally, the context mentions that technological advancements in freight trucking will remain crucial for future competitiveness, and that investments in electric and autonomous vehicles, as well as Internet of Things (IoT) integrations, will optimize lo

Add the list of answers into our original dataset. Now we have a complete test set that is ready for evaluation.

In [37]:
dataset = dataset.add_column("answer", answers)

## Evaluate the results

#### Preview the final test set

In [38]:
dataset.to_pandas()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done | answer |
|---|---|---|---|---|---|---|---|
| 0 | Here is a question that can be fully answered ... | [ers benefit both parties and will remain popu... | Construction activity drives the demand for dr... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | The demand for dry bulk transportation in the ... |
| 1 | Here is a question that can be fully answered ... | [ navigate economic downturns and invest in ne... | The government's pro-immigration policies are ... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | The expected outcomes of the government's pro-... |
| 2 | Here is a question that can be fully answered ... | [ Ireland\n•\nFreight Trucking in China\nLon... | The rebound in economic growth is expected to ... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | The rebound in economic growth is expected to ... |
| 3 | Here is a question that can be fully answered ... | [.8 pp\nProfit per Business\n$69,567\nCurrent ... | The growth of e-commerce in the long-distance ... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | According to the context, the growth of e-comm... |
| 4 | Here is a question that can be fully answered ... | [.8 1.9 2.6 1.6 3.2 2.8 2.5\nLabor intensive C... | The answer to given question is not present in... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | To calculate the average debt ratio of corn fa... |
| 5 | Here is a question that can be fully answered ... | [ Columbia\n43 11.3 886.4 11.2 40.9 4.0 1,885 ... | Ontario has the majority of new car dealership... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | The province in Canada with the majority of ne... |
| 6 | Here is a question that can be fully answered ... | [ landscape is much more demanding than for tr... | Successful businesses in the specialized truck... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | According to the context, the key factors that... |
| 7 | Here is a question that can be fully answered ... | [ landscape is much more demanding than for tr... | Successful businesses in the specialized truck... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | According to the context, the key factors that... |
| 8 | Here is a question that can be fully answered ... | [Transportation and Warehousing In Canada • ... | IBISWorld specializes in industry research wit... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | IBISWorld specializes in industry research wit... |
| 9 | Here is a question that can be fully answered ... | [ who don't enjoy the same pricing power.\n•... | The expected long-term effects of rising truck... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | According to the context, the expected long-te... |


Run the evaluation query to score the results. In this evaluation, we are looking at the following metrics:
- *[Faithfulness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/faithfulness.html)*: Are all the claims that are made in the answer inferred from the given context(s)?
- *[Context Precision](https://docs.ragas.io/en/v0.1.21/concepts/metrics/context_precision.html)*: Did our retriever return good results that matched the question it was being asked?
- *[Answer Correctness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/answer_correctness.html)*: Was the generated answer correct? Was it complete?
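As a rough guide to what these scores mean numerically, the Ragas documentation describes faithfulness as the fraction of claims in the answer that are supported by the retrieved context, and context precision as an average of precision@k taken over the ranks that hold relevant chunks. The sketch below implements those two formulas on hand-labelled inputs. In Ragas the claim and relevance judgements come from an LLM, and the function names here are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the context supports."""
    if not claim_supported:
        return float("nan")  # no claims: left undefined in this sketch (an assumption)
    return sum(claim_supported) / len(claim_supported)


def context_precision_score(relevant: list[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    total_relevant = sum(relevant)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            score += hits / k  # precision@k, counted only at relevant ranks
    return score / total_relevant


# Two of three claims supported; relevant chunks at ranks 1 and 3
print(faithfulness_score([True, True, False]))
print(context_precision_score([True, False, True]))
```

Under this definition, a faithfulness score such as 0.666667 is consistent with two of three extracted claims being supported, and context precision rewards retrievers that place relevant chunks near the top of the ranking.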

In [39]:
%%time
score = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(),
        ContextPrecision(),
        AnswerCorrectness(),
    ],
    llm=llm, # NOTE: `llm` is the RAG LLM defined above (Meta-Llama-3.1-8B-Instruct), not the OpenAI model
    embeddings=embeddings,
)
score.to_pandas()

Evaluating:   0%|          | 0/132 [00:00<?, ?it/s]

Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.


CPU times: user 9.56 s, sys: 374 ms, total: 9.94 s
Wall time: 2min 12s


| | question | contexts | ground_truth | evolution_type | metadata | episode_done | answer | faithfulness | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Here is a question that can be fully answered ... | [ers benefit both parties and will remain popu... | Construction activity drives the demand for dr... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | The demand for dry bulk transportation in the ... | 1.0 | 1.0 | 0.733687 |
| 1 | Here is a question that can be fully answered ... | [ navigate economic downturns and invest in ne... | The government's pro-immigration policies are ... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | The expected outcomes of the government's pro-... | NaN | 1.0 | 0.65177 |
| 2 | Here is a question that can be fully answered ... | [ Ireland\n•\nFreight Trucking in China\nLon... | The rebound in economic growth is expected to ... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | The rebound in economic growth is expected to ... | NaN | 1.0 | 0.792041 |
| 3 | Here is a question that can be fully answered ... | [.8 pp\nProfit per Business\n$69,567\nCurrent ... | The growth of e-commerce in the long-distance ... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | According to the context, the growth of e-comm... | NaN | 1.0 | 0.895714 |
| 4 | Here is a question that can be fully answered ... | [.8 1.9 2.6 1.6 3.2 2.8 2.5\nLabor intensive C... | The answer to given question is not present in... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | To calculate the average debt ratio of corn fa... | 0.666667 | 0.0 | 0.351871 |
| 5 | Here is a question that can be fully answered ... | [ Columbia\n43 11.3 886.4 11.2 40.9 4.0 1,885 ... | Ontario has the majority of new car dealership... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | The province in Canada with the majority of ne... | 1.0 | 1.0 | 0.537397 |
| 6 | Here is a question that can be fully answered ... | [ landscape is much more demanding than for tr... | Successful businesses in the specialized truck... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | According to the context, the key factors that... | NaN | 1.0 | NaN |
| 7 | Here is a question that can be fully answered ... | [ landscape is much more demanding than for tr... | Successful businesses in the specialized truck... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | According to the context, the key factors that... | NaN | 1.0 | 0.781102 |
| 8 | Here is a question that can be fully answered ... | [Transportation and Warehousing In Canada • ... | IBISWorld specializes in industry research wit... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | IBISWorld specializes in industry research wit... | 1.0 | 1.0 | 0.566585 |
| 9 | Here is a question that can be fully answered ... | [ who don't enjoy the same pricing power.\n•... | The expected long-term effects of rising truck... | simple | [{'source': '/projects/RAG2/scotia-2/Datasets-... | True | According to the context, the expected long-te... | NaN | 1.0 | NaN |


In [40]:
score["answer_correctness"].mean()

0.6325593395188253

In [41]:
score["faithfulness"].mean()

0.8
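Some rows in the score table have a missing (NaN) value for a metric, which is presumably what a failed parse leaves behind, so the means above are most likely computed only over the rows that did receive a score. A minimal sketch of such a NaN-aware mean follows, using the ten `faithfulness` values from the preview. The remaining rows are not shown in this document, so the result differs from the full-dataset mean, and `nan_mean` is a hypothetical helper.

```python
import math


def nan_mean(values: list[float]) -> float:
    """Average the values that are not NaN; return NaN if none remain."""
    kept = [v for v in values if not math.isnan(v)]
    if not kept:
        return float("nan")
    return sum(kept) / len(kept)


nan = float("nan")
# Faithfulness column of the ten preview rows (missing scores written as NaN)
preview_faithfulness = [1.0, nan, nan, nan, 0.666667, 1.0, nan, nan, 1.0, nan]
print(round(nan_mean(preview_faithfulness), 4))
```

Only four of the ten preview rows contribute to this average, so a reported mean can rest on far fewer samples than the test set size suggests. It is worth checking how many rows were scored before comparing runs.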

In [94]:
#"Meta-Llama-3.1-8B-Instruct"
#chunksize, answer correctness, faithfulness, number of question
#10000, 0.592, 0.698, 50
# 5000, 0.617, 0.700, 50
# 3500, 0.677, 0.734, 50
# 3000, 0.688, 0.800, 50
# 2000, 0.629, 0.742, 50
# 1000, 0.644, 0.703, 50
#  500, 0.619, 0.503, 50
#  250, 0.594, 0.527, 50



In [None]:
# 2500, 0.63, 0.8
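The commented-out notes above record one evaluation run per chunk size. A small sketch that transcribes the eight 50-question rows and picks the best setting by each metric (the tuple layout and variable names are illustrative):

```python
# (chunk_size, answer_correctness, faithfulness), copied from the notes above
results = [
    (10000, 0.592, 0.698),
    (5000, 0.617, 0.700),
    (3500, 0.677, 0.734),
    (3000, 0.688, 0.800),
    (2000, 0.629, 0.742),
    (1000, 0.644, 0.703),
    (500, 0.619, 0.503),
    (250, 0.594, 0.527),
]

# Pick the row with the highest value for each metric
best_by_correctness = max(results, key=lambda row: row[1])
best_by_faithfulness = max(results, key=lambda row: row[2])

print(f"Best answer correctness: chunk_size={best_by_correctness[0]}")
print(f"Best faithfulness: chunk_size={best_by_faithfulness[0]}")
```

In these notes a chunk size of 3000 gives the highest value on both metrics, with lower scores at both very large and very small chunk sizes. Each row is a single run, so small differences between neighbouring settings may not be significant.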