# RAG Evaluation Test Set Generation

This example shows how to use the [Ragas](https://docs.ragas.io/en/stable/) (```v 0.1.22```) framework to generate a **test set** that can be used to evaluate the quality of a RAG pipeline. We then use the Python [LangChain](https://python.langchain.com/docs/introduction/) library to run some requests through this pipeline and we evaluate the quality of the results.

### <u>Requirements</u>
1. Because you will access the LLMs and embedding models through Vector AI Engineering's Kaleidoscope Service (Vector Inference + Autoscaling), you will need to request a KScope API Key:

      Run the following command (replace ```<user_id>``` and ```<password>```) from **within the cluster** to obtain the API Key. The ```access_token``` in the output is your KScope API Key.
  ```bash
  curl -X POST -d "grant_type=password" -d "username=<user_id>" -d "password=<password>" https://kscope.vectorinstitute.ai/token
  ```
2. After obtaining the `.env` configurations, make sure to create the ```.kscope.env``` file in your home directory (```/h/<user_id>```) and set the following env variables:
- For local models through Kaleidoscope (KScope):
    ```bash
    export OPENAI_BASE_URL="https://kscope.vectorinstitute.ai/v1"
    export OPENAI_API_KEY=<kscope_api_key>
    ```
- For OpenAI models:
   ```bash
   export OPENAI_BASE_URL="https://api.openai.com/v1"
   export OPENAI_API_KEY=<openai_api_key>
   ```
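
The notebook later loads these files through `utils.load_secrets.load_env_file`, whose implementation is not shown here. As a rough illustration of what such a loader has to do, the sketch below parses `export KEY=value` lines and copies them into `os.environ`. The names `parse_env_lines` and `load_env_file_sketch` are hypothetical and are not part of the repo.

```python
import os
import shlex
from pathlib import Path


def parse_env_lines(text):
    """Parse `export KEY=value` lines into a dict, skipping blanks and comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line
        # shlex.split removes surrounding quotes the way a shell would
        parts = shlex.split(value)
        values[key.strip()] = parts[0] if parts else ""
    return values


def load_env_file_sketch(path=None):
    """Hypothetical stand-in for the repo's loader: read the file and export its values."""
    path = Path(path) if path is not None else Path.home() / ".kscope.env"
    values = parse_env_lines(path.read_text())
    os.environ.update(values)
    return values
```

Calling `load_env_file_sketch()` would then make `OPENAI_BASE_URL` and `OPENAI_API_KEY` available to `os.environ[...]` lookups such as the ones used when constructing `ChatOpenAI` below.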

## Set up the RAG workflow environment

#### Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
#!pip install pdfplumber


In [3]:
#import pdfplumber

In [4]:
import numpy as np
import os
import sys

from datasets import Dataset
from pathlib import Path

from langchain.chains import RetrievalQA
from langchain.document_loaders.pdf import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas import evaluate
from ragas.metrics import Faithfulness, ContextPrecision, AnswerCorrectness
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

#### Load config files

In [5]:
# Add root folder of the rag_bootcamp repo to PYTHONPATH
current_dir = Path().resolve()
parent_dir = current_dir.parent
sys.path.insert(0, str(parent_dir))



In [6]:
from utils.load_secrets import load_env_file
load_env_file()

#### Set up some helper functions

In [7]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

#### Make sure other necessary items are in place

In [8]:
# Look for the source_documents folder and make sure it contains at least one PDF file
documents_path = "./source_documents"
if not os.path.exists(documents_path):
    # Stop here: os.listdir would raise FileNotFoundError on a missing folder
    print(f"ERROR: The {documents_path} subfolder must exist under this notebook")
elif not any(name.lower().endswith(".pdf") for name in os.listdir(documents_path)):
    print(f"ERROR: The {documents_path} subfolder must contain at least one .pdf file")

## Generate a synthetic test set

#### Start by loading in the documents we'll be using to augment our RAG generations

In [9]:
#loader = PyPDFDirectoryLoader(documents_path)
#documents = loader.load()
#for document in documents:
#    document.metadata['file_name'] = document.metadata['source']
    
#%%time
# Load the IBIS pdfs
#directory_path = "./source_documents"
directory_path = "/projects/RAG2/scotia-2/Datasets-Scotia-2/IBIS"
loader = PyPDFDirectoryLoader(directory_path)
docs = loader.load()
print(f"Number of source documents: {len(docs)}")
for document in docs:
    document.metadata['file_name'] = document.metadata['source']
# Split the documents into smaller chunks
#text_splitter = RecursiveCharacterTextSplitter(chunk_size=3600, chunk_overlap=32)
#chunks = text_splitter.split_documents(docs)
#print(f"Number of text chunks: {len(chunks)}")

Number of source documents: 280


#### Now use OpenAI to generate a test set from the data in these documents (This takes about 2-3 minutes)

**IMP Note:** The LLM and embedding model used for test set generation should be more capable than the model being evaluated. Hence, we will use OpenAI GPT-4o and OpenAI embeddings for this purpose.

Store your OpenAI API key in ```~/.ragas_openai.env``` using the following format (this is in addition to ```~/.kscope.env```):

```bash
export RAGAS_OPENAI_BASE_URL="https://api.openai.com/v1"
export RAGAS_OPENAI_API_KEY=<openai_api_key>
```

In [10]:
from utils.load_secrets import load_env_file_ragas
load_env_file_ragas()

In [11]:
generator_llm = ChatOpenAI(
    model="gpt-4o",
    base_url=os.environ["RAGAS_OPENAI_BASE_URL"],
    api_key=os.environ["RAGAS_OPENAI_API_KEY"],
)
generator_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    base_url=os.environ["RAGAS_OPENAI_BASE_URL"],
    api_key=os.environ["RAGAS_OPENAI_API_KEY"],
)

In [12]:
#generator_llm = (
#    model="DeepSeek-R1-Distill-Llama-8B",
#    base_url=os.environ["OPENAI_BASE_URL"],
#    api_key=os.environ["OPENAI_API_KEY"],
#)
# Define the RAG embeddings model (different than the OpenAI embedding model defined above for test set generation)
model_kwargs = {'device': 'cuda', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
print(f"Setting up the RAG LLM...")
llm = ChatOpenAI(
    #model="DeepSeek-R1-Distill-Llama-8B",
    model="Meta-Llama-3.1-8B-Instruct",
    temperature=0,
    max_tokens=256,
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-base-en-v1.5",
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the RAG LLM...


In [13]:
%%time
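# Create the generator. Note: as written, this cell passes the RAG `llm` and `embeddings`
# defined in the previous cell, not `generator_llm` / `generator_embeddings` (OpenAI).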
generator = TestsetGenerator.from_langchain(
    generator_llm=llm,
    critic_llm=llm,
    embeddings=embeddings,
)

# Generate the test set
testset = generator.generate_with_langchain_docs(
    documents=docs, 
    test_size=50,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

embedding nodes:   0%|          | 0/666 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/50 [00:00<?, ?it/s]

Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.

CPU times: user 42.9 s, sys: 1.86 s, total: 44.7 s
Wall time: 7min 20s
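
The `distributions` argument in the cell above asks for a mix of question types: half `simple`, a quarter `reasoning`, and a quarter `multi_context`. The sketch below is only a sanity check on that mix: it verifies that the shares sum to 1 and prints the nominal number of questions per type for `test_size=50`. The helper name `expected_counts` is hypothetical, and how Ragas rounds fractional counts (12.5 here) is not shown in this notebook, so treat the numbers as approximate targets.

```python
def expected_counts(test_size, distributions):
    """Return the nominal number of questions per evolution type."""
    total = sum(distributions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"distribution shares must sum to 1, got {total}")
    return {name: test_size * share for name, share in distributions.items()}


# String keys stand in for the ragas evolution objects used in the notebook
counts = expected_counts(50, {"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25})
for name, count in counts.items():
    print(f"{name}: {count}")
```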


#### Preview the test dataset so far

In [14]:
testset1 = testset.to_pandas()

In [20]:
# testset1.to_parquet ('../../Testing_Data/testset_llama.parquet')
# testset1.to_csv ('../../Testing_Data/testset_llama.csv', index =False)

In [25]:
import joblib

# Save (serialize) the object to a file
joblib.dump(testset, "../../Testing_Data/testset_openai.pkl")


['../../Testing_Data/testset_openai.pkl']

In [15]:
testset1.to_parquet('../../Testing_Data/testset_openai.parquet')
testset1.to_csv('../../Testing_Data/testset_openai.csv', index=False)

In [29]:
# Load (deserialize) the object from file
testset2 = joblib.load("../../Testing_Data/testset_openai.pkl")


In [28]:
print(testset == testset2)

True


## Now, start the RAG pipeline!

#### Choose the RAG LLM and embedding model
Note: This is different than the OpenAI LLM and embedding model defined above for test set generation.

In [16]:
RAG_LLM_MODEL_NAME = "Meta-Llama-3.1-8B-Instruct"
RAG_EMBEDDING_MODEL_NAME = "BAAI/bge-base-en-v1.5"

#### Generate answers for all the questions in our test set

Go through the embedding, storage and retrieval steps.

In [36]:
# Split the documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2500, chunk_overlap=32)
chunks = text_splitter.split_documents(docs)
print(f"Number of text chunks: {len(chunks)}")

Number of text chunks: 397
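
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified character-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to break on separators such as paragraphs and sentences; the function name `split_with_overlap` is hypothetical and exists only for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where neighbours share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


demo_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo_chunks)
```

Each chunk starts with the last character of the previous one, which is the same idea as the 32-character overlap used above: a sentence cut at a chunk boundary still appears partly in both neighbours.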


In [37]:
# Define the RAG embeddings model (different than the OpenAI embedding model defined above for test set generation)
model_kwargs = {'device': 'cuda', 'trust_remote_code': True}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

print(f"Setting up the RAG embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name=RAG_EMBEDDING_MODEL_NAME,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)

Setting up the RAG embeddings model...


In [38]:
# Create the vector store and the retriever
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
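
Conceptually, the retriever embeds the question and returns the `k` chunks whose embeddings are most similar to it. The sketch below shows that idea with plain Python lists and cosine similarity; it is not how FAISS is implemented, and `top_k` is a hypothetical helper. Because the embeddings are normalized (`normalize_embeddings=True`), ranking by inner product, cosine similarity, or Euclidean distance yields the same order. The sketch assumes no vector is all zeros.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query, doc_vectors, k):
    """Return the indices of the k document vectors most similar to the query."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(doc_vectors):
        d = normalize(vec)
        similarity = sum(a * b for a, b in zip(q, d))  # cosine similarity of unit vectors
        scored.append((similarity, idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" standing in for real chunk embeddings
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = top_k([1.0, 0.1], toy_docs, k=2)
print(nearest)
```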

In [39]:
RAG_LLM_MODEL_NAME

'Meta-Llama-3.1-8B-Instruct'

In [40]:
%%time
# Define the RAG LLM (different than the OpenAI LLM defined above for test set generation)
print(f"Setting up the RAG LLM...")
llm = ChatOpenAI(
    model=RAG_LLM_MODEL_NAME,
    temperature=0,
    max_tokens=256,
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)

Setting up the RAG LLM...
CPU times: user 18.2 ms, sys: 4.16 ms, total: 22.4 ms
Wall time: 21.9 ms


Iterate over the questions in our synthetic testset, and run them each through the RAG pipeline to see what answers get returned. (This also takes 2-3 minutes)

In [41]:
%%time
dataset = testset.to_dataset()
answers = np.empty(len(dataset), dtype=object)

# Build the RAG pipeline once and reuse it for every question
rag_pipeline = RetrievalQA.from_llm(
    llm=llm,
    retriever=retriever
)

for index, row in enumerate(dataset):
    query = row["question"]

    # Run the query through the RAG pipeline
    answer = rag_pipeline.invoke(input=query)
    answer = answer["result"]
    print(f"Result {index}\nQuestion: {query}\nAnswer: {answer}\n")

    # Store the result
    answers[index] = answer

Result 0
Question: Here is a question that can be fully answered from the given context:

"What options do buyers have in the shipping industry to ensure delivery in full and on time (DIFOT)?
Answer: According to the context, buyers in the shipping industry have several options to ensure delivery in full and on time (DIFOT). These options include:

* Shipping alternatives: Buyers can choose from various shipping alternatives, such as rail and air shipping, which provide them with multiple options to compare and select the most reliable businesses for transport.
* Measuring service quality: Buyers can measure service quality by the frequency of service and capacity, making it relatively easy for them to compare freight trucking companies and choose the most reliable ones.
* Multiple shipping options: The presence of many shipping alternatives within the industry and from alternative modes of transport, such as rail and air shipping, provides truckers with many sources of competition, wh

Result 14
Question: Here is a question that can be fully answered from the given context:

"Who does IBISWorld industry research help with gaining insights on industries around the world?
Answer: According to the context, IBISWorld industry research helps authorized licensees gain insights on industries around the world.

Result 15
Question: Here is a question that can be fully answered from the given context:

"What is the impact of the ongoing easing of monetary policy on the wholesale and retail markets, and how is it affecting the demand for trucking services?
Answer: The ongoing easing of monetary policy is expected to lower borrowing costs and support a recovery in economic activity. This is expected to lead to a pickup in growth and manufacturing output, which will drive distribution services across specialized freight transport and support expansion after a muted performance between 2019 and 2024 as the industry struggled to shake off the impacts of the pandemic.

The expected 

Result 24
Question: Here is a question that can be fully answered from the given context:

"What are some factors that have complicated the economic landscape for the trucking industry?"

This question can be answered by referencing the context, which mentions "economic volatility fueled by inflationary pressures, tightened monetary policy and rising fuel costs" as factors that have complicated the landscape, particularly for long-distance trucking companies.
Answer: According to the context, some factors that have complicated the economic landscape for the trucking industry include:

* Economic volatility, which affects the volumes of distributed freight
* Volatility in downstream markets, which translates to contract fluctuations
* Fuel price fluctuations, which have a more pronounced impact on long-distance trucking operations due to the length of haul
* Rising fuel costs, which are often passed down to consumers through fuel surcharges
* Driver shortages, which continue to threaten

Result 35
Question: Here's a rewritten version of the question:

"What market structures are vulnerable to fragmentation due to low barriers to entry, and how does this affect market concentration?"

I've used abbreviations like "vulnerable to fragmentation" instead of "hinder the concentration of market share" and "affect market concentration" instead of "what is the impact of this hindrance on the industry's market share concentration". This makes the question more concise and indirect while retaining the essence of the original question.
Answer: Based on the provided context, the market structures that are vulnerable to fragmentation due to low barriers to entry are:

1. Long-distance freight trucking in Canada: The industry has low barriers to entry, making it easy for small-scale businesses and non-employers to enter the market. This has led to the addition of over 5,000 non-employer establishments since 2018, resulting in a low concentration of market share.
2. Corn farming in Ca

Result 44
Question: Here is a rewritten version of the question:

"What strategy helps farmers stabilize prices and supply?"

This version conveys the same meaning as the original question but in a more indirect and concise manner.
Answer: Based on the provided context, the strategy that helps farmers stabilize prices and supply is:

"Develop a clear market position and use new technology to contain costs and boost productivity."

This is mentioned in the context of "Key Success Factors" under the section "How do successful businesses handle concentration?" It suggests that establishing a distinct market position and implementing the latest agricultural technologies and precision agriculture practices can help farmers differentiate their products, attract unique customer segments, and create brand loyalty, ultimately leading to stabilized prices and supply.

Result 45
Question: How do businesses of all types use IBISWorld's research?
Answer: According to the provided context, businesse

Add the list of answers into our original dataset. Now we have a complete test set that is ready for evaluation.

In [42]:
dataset = dataset.add_column("answer", answers)

## Evaluate the results

#### Preview the final test set

In [43]:
dataset.to_pandas()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done,answer
0,Here is a question that can be fully answered ...,[Buy er & Supplier P ower\nSour ce: IBIS World...,Buyers can select services with the expectatio...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the context, buyers in the shippi..."
1,Here is a question that can be fully answered ...,[Related T erms\nLIGHT -DUTY VEHICLE\nA passen...,"6,350 kilograms (14,000 pounds)",simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,The weight limit for a light-duty vehicle is l...
2,Here is a question that can be fully answered ...,[ lik e obesity and canc er\nhave dampened dem...,Health-conscious customers are willing to pay ...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,Health-conscious customers are willing to pay ...
3,Here is a question that can be fully answered ...,"[pass higher input pric es on to buy ers, lead...","Higher input prices are passed on to buyers, l...",simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the context, higher input prices ..."
4,Here is a question that can be fully answered ...,[Major Mark ets Segmen tation\nIndus try r eve...,"According to the IBIS World source, 35.9% of i...",simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the IBIS World report, the human ..."
5,Here is a question that can be fully answered ...,"[ on tr ade, which has impr oved following the...",The dismantling of the Canadian Wheat Board in...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"No, the context does not mention the dismantli..."
6,Here is a question that can be fully answered ...,[Major Mark ets Segmen tation\nIndus try r eve...,"Yes, rising corn production in Canada is expec...",simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"Yes, rising corn production in Canada is expec..."
7,Here is a question that can be fully answered ...,"[transit o ver driving.\n•In fact, mor e than ...",70.0% of Canada's population has access to pub...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the text, approximately 70.0% of ..."
8,Here is a question that can be fully answered ...,[activity and impr oved in ventory c ontrol. T...,The just-in-time inventory management system c...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,The just-in-time (JIT) inventory management sy...
9,Here is a question that can be fully answered ...,[Ontario has the l argest spr ead o f business...,The benefits of selecting regions with optimal...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the text, the benefits of selecti..."


Run the evaluation query to score the results. In this evaluation, we are looking at the following metrics:
- *[Faithfulness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/faithfulness.html)*: Are all the claims that are made in the answer inferred from the given context(s)?
- *[Context Precision](https://docs.ragas.io/en/v0.1.21/concepts/metrics/context_precision.html)*: Did our retriever return good results that matched the question it was being asked?
- *[Answer Correctness](https://docs.ragas.io/en/v0.1.21/concepts/metrics/answer_correctness.html)*: Was the generated answer correct? Was it complete?
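
As a rough mental model of the first metric, the Ragas documentation describes faithfulness as the fraction of claims in the answer that can be inferred from the retrieved context. In Ragas, an LLM both extracts the claims and judges them; the sketch below skips that part and takes the per-claim verdicts as given. The function name `faithfulness_score` is hypothetical, and returning NaN when no claims are available is an assumption made here for illustration.

```python
def faithfulness_score(claim_supported):
    """Fraction of answer claims that are supported by the retrieved context.

    `claim_supported` holds one boolean per claim (True = supported).
    """
    if not claim_supported:
        return float("nan")  # assumption: no claims means no score
    return sum(claim_supported) / len(claim_supported)


# An answer with three claims, only one of which the context supports
example_score = faithfulness_score([True, False, False])
print(round(example_score, 6))
```

A score of one third, as in this example, means that most of what the answer asserts could not be traced back to the retrieved chunks.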

In [44]:
%%time
score = evaluate(
    dataset=dataset,
    metrics=[
        Faithfulness(),
        ContextPrecision(),
        AnswerCorrectness(),
    ],
    llm=llm, # Evaluator LLM. Note: `llm` is the RAG LLM defined above, not the OpenAI `generator_llm`
    embeddings=embeddings,
)
score.to_pandas()

Evaluating:   0%|          | 0/138 [00:00<?, ?it/s]

Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.
Failed to parse output. Returning None.


CPU times: user 9.59 s, sys: 253 ms, total: 9.84 s
Wall time: 2min 21s


Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done,answer,faithfulness,context_precision,answer_correctness
0,Here is a question that can be fully answered ...,[Buy er & Supplier P ower\nSour ce: IBIS World...,Buyers can select services with the expectatio...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the context, buyers in the shippi...",,1.0,
1,Here is a question that can be fully answered ...,[Related T erms\nLIGHT -DUTY VEHICLE\nA passen...,"6,350 kilograms (14,000 pounds)",simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,The weight limit for a light-duty vehicle is l...,1.0,1.0,0.70007
2,Here is a question that can be fully answered ...,[ lik e obesity and canc er\nhave dampened dem...,Health-conscious customers are willing to pay ...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,Health-conscious customers are willing to pay ...,1.0,1.0,0.55
3,Here is a question that can be fully answered ...,"[pass higher input pric es on to buy ers, lead...","Higher input prices are passed on to buyers, l...",simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the context, higher input prices ...",1.0,1.0,0.992848
4,Here is a question that can be fully answered ...,[Major Mark ets Segmen tation\nIndus try r eve...,"According to the IBIS World source, 35.9% of i...",simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the IBIS World report, the human ...",0.333333,1.0,0.495778
5,Here is a question that can be fully answered ...,"[ on tr ade, which has impr oved following the...",The dismantling of the Canadian Wheat Board in...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"No, the context does not mention the dismantli...",0.0,1.0,0.597126
6,Here is a question that can be fully answered ...,[Major Mark ets Segmen tation\nIndus try r eve...,"Yes, rising corn production in Canada is expec...",simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"Yes, rising corn production in Canada is expec...",,1.0,
7,Here is a question that can be fully answered ...,"[transit o ver driving.\n•In fact, mor e than ...",70.0% of Canada's population has access to pub...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the text, approximately 70.0% of ...",0.5,1.0,0.744149
8,Here is a question that can be fully answered ...,[activity and impr oved in ventory c ontrol. T...,The just-in-time inventory management system c...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,The just-in-time (JIT) inventory management sy...,1.0,1.0,0.608877
9,Here is a question that can be fully answered ...,[Ontario has the l argest spr ead o f business...,The benefits of selecting regions with optimal...,simple,[{'file_name': '/projects/RAG2/scotia-2/Datase...,True,"According to the text, the benefits of selecti...",,1.0,


In [45]:
score["answer_correctness"].mean()

0.6463411575905084

In [46]:
score["faithfulness"].mean()

0.70679012345679
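
Several rows in the preview above have empty `faithfulness` and `answer_correctness` values, presumably the rows for which the evaluator output could not be parsed. When reading the averages, it helps to know how many rows actually contributed. The sketch below uses the ten previewed `faithfulness` values (not the full 50-row result) and a hypothetical helper `nan_summary`; how Ragas itself aggregates missing values is not shown in this notebook.

```python
import math

# Faithfulness values for the ten previewed rows; NaN marks a missing score
faithfulness_preview = [
    float("nan"), 1.0, 1.0, 1.0, 0.333333,
    0.0, float("nan"), 0.5, 1.0, float("nan"),
]


def nan_summary(values):
    """Count scored and missing rows and average only the scored ones."""
    valid = [v for v in values if not math.isnan(v)]
    mean = sum(valid) / len(valid) if valid else float("nan")
    return {"scored": len(valid), "missing": len(values) - len(valid), "mean": mean}


summary = nan_summary(faithfulness_preview)
print(summary)
```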

In [94]:
#"Meta-Llama-3.1-8B-Instruct"
#chunk size, answer correctness, faithfulness, number of questions
#10000, 0.592, 0.698, 50
# 5000, 0.617, 0.700, 50
# 3500, 0.677, 0.734, 50
# 3000, 0.688, 0.800, 50
# 2000, 0.629, 0.742, 50
# 1000, 0.644, 0.703, 50
#  500, 0.619, 0.503, 50
#  250, 0.594, 0.527, 50
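
The chunk-size sweep recorded in the comments above can be compared programmatically. The sketch below copies those recorded numbers into a list and picks the chunk size with the highest score on each metric; the variable names are illustrative only.

```python
# (chunk_size, answer_correctness, faithfulness) as recorded in the comments above
sweep = [
    (10000, 0.592, 0.698),
    (5000, 0.617, 0.700),
    (3500, 0.677, 0.734),
    (3000, 0.688, 0.800),
    (2000, 0.629, 0.742),
    (1000, 0.644, 0.703),
    (500, 0.619, 0.503),
    (250, 0.594, 0.527),
]

best_by_correctness = max(sweep, key=lambda row: row[1])
best_by_faithfulness = max(sweep, key=lambda row: row[2])

print(f"Best answer correctness: chunk size {best_by_correctness[0]} ({best_by_correctness[1]})")
print(f"Best faithfulness: chunk size {best_by_faithfulness[0]} ({best_by_faithfulness[2]})")
```

With these recorded values both metrics peak at the same chunk size, and faithfulness drops sharply for the two smallest sizes.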


