# Evaluating Retrieval-Augmented Generation (RAG) Applications

Evaluating a Retrieval-Augmented Generation (RAG) system requires different strategies depending on whether ground truth data is available.

What Is Ground Truth?

Ground truth refers to validated, correct answers to user queries, often called gold answers.
A collection of queries paired with their gold answers forms a Golden Dataset, which serves as a benchmark for measuring the quality and reliability of a RAG system's outputs.

Golden datasets are especially valuable when evaluating:

- Answer correctness

- Faithfulness to retrieved context

- Overall system performance over time
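As a rough illustration, a golden dataset can be held as a list of query and gold-answer records, and a generated answer can be scored against its gold answer with a token-overlap F1. This is only a sketch: `GoldenExample`, `token_f1`, and the placeholder gold answer are hypothetical, and token overlap is a simple stand-in for the LLM-judged correctness metrics that evaluation frameworks provide.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GoldenExample:
    """One golden-dataset entry: a query and its validated answer."""
    query: str
    gold_answer: str


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a prediction and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Placeholder gold answer for illustration only; it is not a validated answer.
golden_dataset = [
    GoldenExample(
        query="How do I change data types in Power BI?",
        gold_answer="change the data type in the query editor",
    ),
]

score = token_f1(
    "use the query editor to change the data type",
    golden_dataset[0].gold_answer,
)
print(f"token F1: {score:.2f}")
```

A lexical score like this rewards shared wording rather than meaning, so it is best treated as a cheap sanity check alongside the semantic metrics.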

RAG Evaluation Test Case Structure

A typical RAG evaluation test case consists of the following components:

1. *User Query*:
The input question or prompt provided by the user.

2. *RAG-Generated Response*:
The answer produced by the RAG pipeline using retrieved documents and a language model.

3. *Baseline Validated Answer (Optional)*:
A ground truth (gold) answer used for direct comparison when available.

Note: When ground truth is unavailable, evaluation relies on reference-free or LLM-assisted metrics.
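The structure above can be sketched as a small data class whose optional gold answer decides which family of metrics applies. The names `RagTestCase` and `evaluation_mode` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RagTestCase:
    """Hypothetical container mirroring the three components listed above."""
    user_query: str
    generated_response: str
    gold_answer: Optional[str] = None  # ground truth, when available


def evaluation_mode(case: RagTestCase) -> str:
    """Return which family of metrics applies to this test case."""
    return "reference-based" if case.gold_answer else "reference-free"


case = RagTestCase(
    user_query="How do I create relationships between tables?",
    generated_response="<answer produced by the RAG pipeline>",
)
print(evaluation_mode(case))  # reference-free
```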

RAG Evaluation Frameworks

To systematically evaluate RAG systems, one of the following widely adopted frameworks is typically used:

1. RAG Triad

The RAG Triad evaluates a RAG pipeline across three core dimensions:

- *Context Relevance*:
How relevant the retrieved documents are to the user query.

- *Answer Faithfulness (Groundedness)*:
Whether the generated answer is supported by the retrieved context.

- *Answer Relevance*:
How well the generated answer addresses the user query.

This framework is particularly useful when ground truth answers are unavailable.
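One practical use of the three dimensions is to point at the pipeline stage that needs work. The sketch below assumes each dimension has already been scored between 0 and 1; the function name, the 0.7 threshold, and the example scores are all hypothetical.

```python
def diagnose_triad(context_relevance: float,
                   faithfulness: float,
                   answer_relevance: float,
                   threshold: float = 0.7) -> list[str]:
    """Map low triad scores to the pipeline stage most likely responsible."""
    findings = []
    if context_relevance < threshold:
        findings.append("retrieval: retrieved chunks are off-topic for the query")
    if faithfulness < threshold:
        findings.append("generation: answer makes claims the context does not support")
    if answer_relevance < threshold:
        findings.append("generation: answer does not address the query")
    return findings or ["all three dimensions meet the threshold"]


# Hypothetical scores: weak retrieval, strong generation.
for finding in diagnose_triad(context_relevance=0.55, faithfulness=1.0, answer_relevance=1.0):
    print(finding)
```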


2. RAGAS (Retrieval-Augmented Generation Assessment)

RAGAS is a more comprehensive evaluation framework designed specifically for RAG systems. It provides a set of metrics that assess both retrieval and generation quality, such as:

- Context precision and recall

- Answer relevance

- Faithfulness

- Answer correctness (when ground truth is available)

RAGAS supports both ground-truth-based and reference-free evaluation, making it suitable for real-world production scenarios.
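To make context precision and recall concrete, the sketch below computes simplified set-based versions from hand-labeled chunk IDs. This is not how RAGAS computes them (RAGAS derives its scores with an LLM judge rather than from labeled IDs); the function name and the chunk IDs are hypothetical.

```python
def context_precision_recall(retrieved_ids: list[str],
                             relevant_ids: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of retrieved chunks against labeled relevant chunks."""
    retrieved = set(retrieved_ids)
    hits = retrieved & relevant_ids
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical labels: 2 of the 4 retrieved chunks are relevant, 1 relevant chunk was missed.
precision, recall = context_precision_recall(
    retrieved_ids=["c1", "c2", "c3", "c4"],
    relevant_ids={"c1", "c3", "c9"},
)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.50, recall=0.67
```

Low precision suggests the retriever returns noise; low recall suggests it misses chunks the answer needs.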

In [3]:
# Installing the required libraries
!pip install -q openai==1.66.3 \
                tiktoken==0.9.0 \
                pypdf==5.4.0 \
                langchain==0.3.20 \
                langchain-community==0.3.19 \
                langchain-chroma==0.2.2 \
                langchain-openai==0.3.9 \
                chromadb==0.6

**Importing the Libraries**


In [6]:
# Importing the standard Libraries
import time                           # For measuring execution time or adding delays
from datetime import datetime         # For handling timestamps and datetime operations

# ChromaDB Vector Database
import chromadb  # Chroma: a local-first vector database for storing and querying document embeddings

# OpenAI SDK
from openai import OpenAI             # Official OpenAI Python SDK (v1.x) for interacting with models like GPT-4

# LangChain Utilities
# RecursiveCharacterTextSplitter intelligently breaks long text into smaller chunks with some overlap, preserving context.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Loads all PDF files from a directory and extracts text from each.
from langchain_community.document_loaders import PyPDFDirectoryLoader

# Base class representing a document in LangChain; useful for downstream chaining and processing.
from langchain_core.documents import Document

# Embeddings and Vector Store
# Generates vector embeddings using OpenAI's embedding models (e.g., `text-embedding-3-small`)
from langchain_openai import OpenAIEmbeddings

# Integration for using Chroma as the vector store within LangChain's ecosystem
from langchain_chroma import Chroma

# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')

#hide warnings from chroma
import logging
logging.getLogger("chromadb").setLevel(logging.CRITICAL)

In [7]:
# Set up the OpenAI API Key
import os
from google.colab import userdata

openai_api_key = userdata.get('OPENAI_API_KEY')
base_url = 'https://aibe.mygreatlearning.com/openai/v1'

client = OpenAI(
    api_key = openai_api_key,
    base_url = base_url
)

model_name ='gpt-4o-mini'

In [8]:
# Unzip the dataset containing the Power BI book (PDF)
!unzip PowerBI.zip

Archive:  PowerBI.zip
replace Introducing_Power_BI.pdf? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [9]:
# Path of the extracted PDF (for reference; the loader below reads every PDF in /content/)
pdf_file_loc = '/content/Introducing_Power_BI.pdf'

#load pdf file
loader = PyPDFDirectoryLoader('/content/')

#split document into chunks
def doc_spliter_(document, chunk_size, chunk_overlap):

  spliter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name = 'cl100k_base',
        chunk_size = chunk_size,
        chunk_overlap = chunk_overlap
    )

  #load and split documents
  chunks = document.load_and_split(spliter)

  if chunks:

    # Preview the first five chunks. Double-quoted f-strings keep the nested
    # single-quoted keys valid on Python versions older than 3.12.
    for i, chunk in enumerate(chunks[:5]):
      print(f'\n --- chunk {i+1} --')
      print(f"producer:{chunk.metadata['producer']}")
      print(f"creator:{chunk.metadata['creator']}")
      print(f"creationdate:{chunk.metadata['creationdate']}")
      print(f"author:{chunk.metadata['author']}")
      print(f"title:{chunk.metadata['title']}")
      print(f"Source:{chunk.metadata['source']}")
      print(f"total_pages:{chunk.metadata['total_pages']}")
      print(f"page:{chunk.metadata['page']}")
      print(f"page_label:{chunk.metadata['page_label']}")
      print(f'Length: {len(chunk.page_content)} characters')
      print('content:')
      print(chunk.page_content)
      print('-'*40)

  return chunks

# Chunks are stored within LangChain's Document class
chunked_docs = doc_spliter_(loader, 512, 16)


 --- chunk 1 --
producer:Adobe Acrobat Pro 10.1.16
creator:Adobe Acrobat Pro 10.1.16
creationdate:2016-06-13T10:18:21-04:00
author:Joan
title:
Source:/content/Introducing_Power_BI.pdf
total_pages:407
page:0
page_label:1
Length: 63 characters
content:
Introducing
Microsoft 
Power BI
Alberto Ferrari and Marco Russo
----------------------------------------

 --- chunk 2 --
producer:Adobe Acrobat Pro 10.1.16
creator:Adobe Acrobat Pro 10.1.16
creationdate:2016-06-13T10:18:21-04:00
author:Joan
title:
Source:/content/Introducing_Power_BI.pdf
total_pages:407
page:1
page_label:2
Length: 934 characters
content:
PUBLISHED BY 
Microsoft Press 
A division of Microsoft Corporation 
One Microsoft Way 
Redmond, Washington 98052-6399 
Copyright © 2016 by Microsoft Corporation 
All rights reserved. No part of the contents of 
this book may be reproduced or transmitted in 
any form or by any means without the written 
permission of the publisher. 
ISBN: 978-1-5093-0228-4 
Microsoft Press books are avai
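The effect of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window over a token list. This is a simplified sketch, not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally prefers to split on natural separators such as paragraph and line breaks; the function name and toy tokens are hypothetical.

```python
def sliding_window_chunks(tokens: list[str],
                          chunk_size: int,
                          chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last token of the previous one, which is how overlap preserves context across chunk boundaries.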

In [10]:
# define the ChromaDB collection name to store the chunks
powerBI_collection = 'powerbi_manual'

# Instantiate the OpenAI embedding model
embedding_model = OpenAIEmbeddings(
    api_key = openai_api_key,
    base_url = base_url,
    model = 'text-embedding-3-small'
)

# Initialize a persistent Chroma client

local_path = './powerBI_db' #define the local directory where the database will be stored
chromadb_client = chromadb.PersistentClient(
    path = local_path
)

# Instantiate a Chroma vector store to store and retrieve document embeddings
def vector_store(persist_directory = local_path):

  # Create a ChromaDB-backed vector store (documents are added in a later step)
  store = Chroma(
      collection_name = powerBI_collection,
      collection_metadata = {'hnsw:space': 'cosine'},
      embedding_function = embedding_model,
      client = chromadb_client,
      persist_directory = persist_directory
  )

  return store

vectorstore = vector_store()
print(f'Vector store created and saved to {local_path}')

Vector store created and saved to ./powerBI_db


In [11]:
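# Send chunks to the API in batches of 500, pausing for 30 seconds after each batch
batch_size = 500
for start in range(0, len(chunked_docs), batch_size):
  batch = chunked_docs[start:start + batch_size]
  vectorstore.add_documents(
      documents = batch,
      # One ID per document, so the final (shorter) batch gets a matching ID list
      ids = [f'text_{j}' for j in range(start, start + len(batch))]
  )

  time.sleep(30)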
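When documents are added in batches, the ID list should contain exactly one entry per document, including for the final, shorter batch. A minimal sketch of that pairing is shown below; `batched_with_ids` and the stand-in document list are hypothetical, and only the `text_` prefix comes from the notebook.

```python
from typing import Iterator, Sequence


def batched_with_ids(items: Sequence,
                     batch_size: int,
                     prefix: str = "text_") -> Iterator[tuple[list, list[str]]]:
    """Yield (batch, ids) pairs where ids always has the same length as batch."""
    for start in range(0, len(items), batch_size):
        batch = list(items[start:start + batch_size])
        ids = [f"{prefix}{j}" for j in range(start, start + len(batch))]
        yield batch, ids


docs = [f"chunk {n}" for n in range(1200)]  # stand-in for chunked_docs
for batch, ids in batched_with_ids(docs, batch_size=500):
    print(len(batch), ids[0], ids[-1])
```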



# CRUD Operations in ChromaDB




In [12]:
vectorstore_persisted =Chroma(
    collection_name = powerBI_collection,
    collection_metadata={'hnsw:space':'cosine'},
    embedding_function= embedding_model,
    client = chromadb_client,
    persist_directory = local_path
)

# Define the chroma collection
collection = chromadb_client.get_collection(powerBI_collection)

# Create a retriever interface from the vector store
retriever = vectorstore_persisted.as_retriever(
    search_type = 'similarity',
    search_kwargs={'k':5}
    )

In [13]:
# Define a sample user query
user_query = 'How to access the power query editor in Power BI?'

# Perform similarity search to return the top 5 document chunks based on the sample user query

relevant_docs = retriever.invoke(user_query)

print(f'User query: {user_query} \n')

for i, doc in enumerate(relevant_docs, 1):
  print(f'Document {i} \n{doc.page_content} \n')


User query: How to access the power query editor in Power BI? 

Document 1 
143 C H A P T E R  4  |  Using Power BI Desktop 
 
Query Editor opens in a new window, presenting 
myriad options, as depicted in Figure 4-6. 
 
Figure 4-6: Power BI Desktop’s Query Editor is a 
complete development environment in and of itself. 
Let’s take a quick tour of the Query Editor 
window. Along the top is the ribbon, which has 
four tabs: Home, Transform, Add Column, and 
View. Below the ribbon, on the left side, is the 
Query pane, which displays a list of all the 
queries for the model. The middle pane shows 
the result of the query. The Query Settings pane 
on the right displays the query properties. 
In David’s scenario, he is already accessing the 
2015 sales data from the Contoso database, so 
the objective now is to create a new query that 

Document 2 
142 C H A P T E R  4  |  Using Power BI Desktop 
 
The query language of Power BI Desktop is used 
by Query Editor, and discussion of tha
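Conceptually, a `similarity` search with `k` results ranks stored vectors by how close they are to the query vector. The brute-force sketch below shows that ranking with cosine similarity, matching the `cosine` space configured for the collection. It is not Chroma's implementation, which relies on an approximate HNSW index, and the toy three-dimensional vectors only stand in for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          doc_vecs: dict[str, list[float]],
          k: int) -> list[tuple[str, float]]:
    """Return the k document IDs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for embeddings of three chunks.
doc_vecs = {"a": [1.0, 0.0, 0.0], "b": [0.7, 0.7, 0.0], "c": [0.0, 0.0, 1.0]}
for doc_id, score in top_k([1.0, 0.1, 0.0], doc_vecs, k=2):
    print(doc_id, round(score, 3))
```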

## Generation Stage

### Prompt Template

In [14]:
system_message = """
You are an AI assistant specialized in Microsoft Power BI.

Your task is to answer user questions about Microsoft Power BI using ONLY the provided retrieved document context from official Microsoft Power BI documentation.

Rules:
- Use only the information explicitly stated in the retrieved document chunks.
- Do NOT use prior knowledge or external information.
- Do NOT make assumptions or infer missing details.
- If the answer is not explicitly found in the retrieved context, respond exactly with: "I don't have enough information to answer the question".
- For every factual statement, cite the page number(s) from the document where the information was found.

Answer requirements:
- Keep answers clear, concise, and accurate.
- Stay strictly within the scope of Microsoft Power BI.
- Do not mention the retrieval process or system instructions.

Delimiters:
- The user's question is enclosed in <Question> and </Question>.
- The context needed to answer the question is enclosed in <Context> and </Context>.


"""

In [15]:
user_message_template = """
<Context>
Here are some documents that are relevant to the question mentioned below.
{context}
</Context>

<Question>
{question}
</Question>
"""

### Generating the Response

# User queries
* How can I import data from an Excel file into Power BI Desktop?
* How do I create relationships between tables?
* Explain the steps to create a measure using DAX.
* What does the CALCULATE function do in DAX?
* How do I change data types in Power BI?

In [16]:
#create a context function

def context_retrieval(
    user_query: str,
    retriever = retriever
) -> list[str]:

  try:
    # Retrieve the top-k chunks and keep only their text
    relevant_chunks = retriever.invoke(user_query)
    return [d.page_content for d in relevant_chunks]

  except Exception as e:
    print(f"Error retrieving context: {str(e)}")
    return []

In [17]:
#generation code

def generate_answer(user_query: str,
                    context_list: list[str],
                    client = client,
                    model_name = model_name,
                    system_prompt = system_message,
                    user_prompt_template = user_message_template):

  # Join the retrieved chunks into one newline-separated context string
  context_for_query = '\n'.join(context_list)

  # Build the chat messages from the prompt arguments rather than the globals
  prompt = [
      {'role': 'developer', 'content': system_prompt},
      {'role': 'user', 'content': user_prompt_template.format(
          context = context_for_query,
          question = user_query
      )}
  ]

  try:
    response = client.chat.completions.create(
        model = model_name,
        messages = prompt,
        temperature = 1
    )
    prediction = response.choices[0].message.content.strip()
  except Exception as e:
    prediction = f'Sorry, I encountered the following error: \n {e}'

  return prediction




In [22]:

user_query = "How do I change data types in Power BI?"
context = context_retrieval(user_query)
generate_answer(user_query, context)

'In Power BI, to change the data type of a column, you need to go into the Query Editor. By default, custom columns are of the Any data type, which means the data type is not defined. To use it for aggregation (like for numbers), you must change the data type to Decimal Number before saving the query (page 157).'
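Because the system prompt asks for page citations and a fixed fallback sentence, both can be checked mechanically. The sketch below assumes citations look like `(page 157)` or `(pages 246, 234)`, as in the sample answers in this notebook; the model may format them differently, and the helper names are hypothetical.

```python
import re

FALLBACK = "I don't have enough information to answer the question"


def cited_pages(answer: str) -> list[int]:
    """Extract page numbers from citations such as '(page 157)' or '(pages 246, 234)'."""
    pages = []
    for group in re.findall(r"\(pages?\s+([\d,\s]+)\)", answer):
        pages.extend(int(p) for p in re.findall(r"\d+", group))
    return pages


def is_fallback(answer: str) -> bool:
    """True if the answer is the fixed 'not enough information' reply."""
    return answer.strip().rstrip(".") == FALLBACK


print(cited_pages("... before saving the query (page 157)."))  # [157]
print(cited_pages("... in the Date table (pages 246, 234)."))  # [246, 234]
print(is_fallback(FALLBACK + "."))  # True
```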

### RAG Evaluation

### DeepEval
DeepEval is an open-source LLM evaluation framework that offers ready-to-use implementations of the metrics discussed above. The scores and reasons it reports can then guide changes to the retrieval and generation stages.

In [18]:
# Install DeepEval
!pip install deepeval



1. RAG Triad

In [19]:
#Import evaluation metrics
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.metrics import ContextualPrecisionMetric, ContextualRecallMetric


In [20]:
#add the API Keys to the Environment Variables
import os
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')

os.environ['OPENAI_BASE_URL'] = "https://aibe.mygreatlearning.com/openai/v1"

In [21]:
#Define RAG Triad metrics

"""
Avoid thresholds that are too strict or too lenient; 0.7 is a reasonable starting point.
Use a more capable judge model (gpt-4o here) than the generation model (gpt-4o-mini).
"""

faithfulness = FaithfulnessMetric(
    threshold=0.7,
    model = 'gpt-4o',
    include_reason = True
)

contextual_relevancy = ContextualRelevancyMetric(
    threshold=0.7,
    model = 'gpt-4o',
    include_reason = True
)

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model = 'gpt-4o',
    include_reason = True
)
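The role of `threshold` can be sketched in plain Python. LLM-judged metrics of this kind are commonly reported as a ratio, for example the share of claims in the answer that the retrieved context supports, and a test case passes when the score reaches the threshold. The counts below are hypothetical, the zero-total handling is an assumption, and this is not DeepEval's implementation, which obtains the counts from the judge model.

```python
def ratio_score(supported: int, total: int) -> float:
    """Share of judged items that were supported (0.0 when nothing was judged; an assumption)."""
    return supported / total if total else 0.0


def passes(score: float, threshold: float = 0.7) -> bool:
    """A metric passes when its score is at least the threshold."""
    return score >= threshold


# Hypothetical judgement: 4 of 5 claims in the answer are supported by the context.
score = ratio_score(supported=4, total=5)
print(score, passes(score))  # 0.8 True
```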

1.1. Groundedness / Faithfulness

In [22]:
user_query = "How do I create relationships between tables?"
retrieve_content = context_retrieval(user_query)
retrieve_content

['246 C H A P T E R  6  |  Building a data model \n \n \nFigure 6-10: Trying to create a relationship between \nthe\n Budget and Store tables leads to this error. \nThe error message suggests that an intermediate \ntable might help solve the problem. But, before \nsolving the issue, it’s worth taking a few \nmoments to understand it better. \nYou can create a relationship between two tables \nif the column you use to create the relationship \nis a key in the destination table. You can create a \nrelationship between the Sales and Date tables \nbased on the DateKey column because DateKey \nhas a different value for each row in Date. \nHaving a different value for each row is the \nrequisite for a column to be a key. In fact, when \nyou have a given date, you can uniquely identify \nthe entire row in Date. In the model with Budget, \nCountryRegion is neither a key in the Budget \ntable, nor in Store. Thus, you cannot create such \na relationship.',
 '254 C H A P T E R  6  |  Building a

In [23]:
reply = generate_answer(user_query, retrieve_content)
reply


'You can create a relationship between two tables if the column you use to create the relationship is a key in the destination table. For example, you can create a relationship between the Sales and Date tables based on the DateKey column because DateKey has a different value for each row in the Date table. If the required conditions are met, you can easily create the relationship by dragging the appropriate column from one table to the corresponding column in the other table (e.g., dragging DateKey from the Sales table to DateKey in the Date table) (pages 246, 234).'

In [24]:
#Use LLMTestCase from deepeval to construct a test case
test_case_without_ground_truth = LLMTestCase(
    input = user_query,
    actual_output= reply,
    retrieval_context= retrieve_content
)

In [25]:
#Evaluate the RAG system using the evaluate function
evaluation = evaluate(
    test_cases = [test_case_without_ground_truth],
    metrics = [faithfulness, contextual_relevancy, answer_relevancy]
)


INFO:deepeval.evaluate.execute:in _a_execute_llm_test_cases




Metrics Summary

  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because there are no contradictions, indicating that the actual output is perfectly aligned with the retrieval context. Great job maintaining accuracy and consistency!, error: None)
  - ❌ Contextual Relevancy (score: 0.5517241379310345, threshold: 0.7, strict: False, evaluation model: gpt-4o, reason: The score is 0.55 because while there are relevant statements about creating relationships, such as 'You can create a relationship between two tables if the column you use to create the relationship is a key in the destination table,' many other statements focus on unrelated topics like filtering, renaming, and loading tables, which do not directly address the input question., error: None)
  - ✅ Answer Relevancy (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the response perfectly addresses the q