🔧 **Setup Required**: Before running this notebook, please follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys. **You will need to ensure you've executed the Indexing pipeline before completing this exercise.**

# RAGAS Evaluation Tutorial: A Complete Guide to RAG System Assessment

This notebook provides a comprehensive walkthrough for evaluating Retrieval-Augmented Generation (RAG) systems using the RAGAS (RAG Assessment) framework. 

## What You'll Learn

In this tutorial, you will:
1. **Set up RAGAS** - Import necessary libraries and understand the evaluation framework
2. **Test the RAG System** - Run queries against the Hybrid RAG implementation
3. **Prepare Evaluation Data** - Format synthetic test data for RAGAS evaluation
4. **Run Comprehensive Metrics** - Evaluate your RAG system using multiple assessment criteria
5. **Interpret Results** - Understand what the metrics tell us about system performance

## Prerequisites

- Basic understanding of RAG (Retrieval-Augmented Generation) systems
- Familiarity with Python and Pandas
- OpenAI API key configured in your environment

Let's begin!

## Step 1: Import Required Libraries and RAG System

First, let's import the RAG system we'll be evaluating:
- **Hybrid RAG**: Uses both dense and sparse retrieval methods for comprehensive document retrieval

We'll also import utility functions for pretty-printing results.
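To make "dense and sparse retrieval" concrete, here is a minimal sketch of one common way to merge two ranked result lists, reciprocal rank fusion. It is an illustration only: this notebook does not show how `HybridRAGSuperComponent` combines its retrievers, and the document IDs and the `k=60` constant are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked highly in any list receive a larger contribution
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical embedding-retriever results
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise above those found by only one retriever, which is the intuition behind hybrid retrieval.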

In [1]:
import os
from scripts.rag.hybridrag import HybridRAGSuperComponent
from scripts.rag.pretty_print import pretty_print_rag_answer
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore

In [2]:
document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

In [3]:
# Create the Hybrid RAG SuperComponent with base parameters
hybrid_rag_sc = HybridRAGSuperComponent(
    document_store=document_store
)

In [4]:
# Define a sample query to test the RAG system
query = "Who uses ChatGPT more, men or women, and how does this change by 2025"

# Run the Hybrid RAG system
print("Running Hybrid RAG system...")
hybrid_answer = hybrid_rag_sc.run(query=query)

Running Hybrid RAG system...


## Step 2: Test Your RAG System with a Sample Query

Before running evaluation metrics, let's test the RAG system with a sample query to understand its behavior. This helps us verify everything is working correctly.

In [5]:
# Display the Hybrid RAG results in a formatted way
pretty_print_rag_answer(hybrid_answer, "Hybrid RAG", query)

🔍 HYBRID RAG ANSWER
📝 Query: Who uses ChatGPT more, men or women, and how does this change by 2025
--------------------------------------------------------------------------------
💬 Answer:
   By the first half of 2025, the share of active users with typically
   feminine and typically masculine names reaches near-parity. By June
   2025, active users are more likely to have typically feminine names,
   indicating that the gender gap in ChatGPT usage has closed substantially
   over time. Initially, around 80% of weekly active users were those with
   typically masculine names, but this changed as the user base evolved.

📚 Source Documents (3 found):
--------------------------------------------------
1. Source: Unknown source
   Preview: However, in the first half of 2025, we see the share of active users with typically feminine and typically
masculine names reach near-parity. By June ...

2. Source: Unknown source
   Preview: Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida,

**Observation**: Notice the response quality from the system:
- Answer completeness and accuracy
- Retrieved context relevance
- Response structure and clarity

This manual review gives us intuition, but RAGAS provides systematic evaluation metrics.

## Step 3: Introduction to RAGAS Framework

**RAGAS** (RAG Assessment) is a framework designed to evaluate RAG systems comprehensively. It provides several key metrics:

### Basic RAGAS Metrics We'll Use:

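1. **Faithfulness** - How factually consistent the answer is with the retrieved context
2. **Response Relevancy** - How relevant the answer is to the question
3. **LLM Context Recall** - How much of the ground truth is covered by the retrieved contexts
4. **Factual Correctness** - How closely the facts in the answer match the ground truth

These core metrics cover both retrieval and generation quality while keeping evaluation simple and efficient.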

Let's import RAGAS and prepare our evaluation dataset.

In [6]:
# Import RAGAS core modules
from ragas import evaluation
print("RAGAS imported successfully!")

RAGAS imported successfully!


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
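The `huggingface/tokenizers` warning above is harmless, and the message itself names the remedy. A minimal sketch, assuming it runs in the first cell, before any library that loads `tokenizers` is imported:

```python
import os

# Set this before importing libraries that use Hugging Face tokenizers,
# so forked worker processes do not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```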


## Step 4: Load and Examine Evaluation Dataset

For systematic evaluation, we need a test dataset with:
- **Questions** - Queries to test the RAG system
- **Ground Truth Answers** - Expected correct responses  
- **Reference Contexts** - Documents that should be retrieved

We'll use a synthetic dataset that was pre-generated for evaluation purposes.
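Before spending API calls, it can help to confirm that the file has the columns this notebook relies on. A minimal sketch using only the standard library; `missing_columns` is a hypothetical helper, and the inline CSV text stands in for the real file with a made-up row.

```python
import csv
import io

REQUIRED_COLUMNS = {"user_input", "reference_contexts", "reference"}

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return sorted(set(required) - set(header))

# Stand-in for an opened dataset file; the row content is illustrative only
sample = io.StringIO(
    "user_input,reference_contexts,reference,synthesizer_name\n"
    'What is RAG?,"[\'some context\']",An answer,example_synthesizer\n'
)
print(missing_columns(sample))  # an empty list means every required column exists
```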

In [7]:
# Import additional required libraries
from haystack.components.generators import OpenAIGenerator
from ragas import EvaluationDataset
import pandas as pd

# Load the synthetic evaluation dataset
print("Loading evaluation dataset...")
dataset = pd.read_csv("./data_for_eval/synthetic_tests_advanced_branching_2.csv")

print("Dataset loaded successfully!")
print(f"Dataset contains {len(dataset)} test cases")
print(f"Columns: {list(dataset.columns)}")

Loading evaluation dataset...
Dataset loaded successfully!
Dataset contains 3 test cases
Columns: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']


In [8]:
# Let's examine the structure of our evaluation dataset
print("Examining the evaluation dataset structure:")
print("=" * 50)
dataset

Examining the evaluation dataset structure:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What role does TikTok play in the context of a...,"[""What is AI, how does it work and why are som...","TikTok, along with other social platforms like...",single_hop_specific_query_synthesizer
1,How does the use of automated classifiers ensu...,['<1-hop>\n\nPrivacy via Automated Classifiers...,The use of automated classifiers ensures priva...,multi_hop_specific_query_synthesizer
2,How do Context-Augmented Message Classificatio...,['<1-hop>\n\n” (truncated)\n[user]: “10 more”\...,Context-Augmented Message Classifications ensu...,multi_hop_abstract_query_synthesizer


## Step 5: Generate RAG Responses for Evaluation

Now we'll run our RAG system on all test queries to generate responses that RAGAS can evaluate. This is where we collect the data needed for systematic assessment.

**What we're doing:**
- Run each test query through our Hybrid RAG system
- Extract both the generated response and retrieved contexts
- Store results in the format RAGAS expects

In [9]:
# Create a helper function to run RAG system and extract needed data
def run_rag_system(query, rag_system):
    """
    Helper function to run a RAG system and return formatted response
    
    Args:
        query: The question to ask the RAG system
        rag_system: The RAG pipeline to use
    
    Returns:
        dict: The response text under 'response' and the retrieved
              document contents under 'reference'
    """
    response = rag_system.run(query=query)
    return {
        'response': response['replies'][0] if 'replies' in response else '',
        'reference': [doc.content for doc in response.get('documents', [])]
    }

# Apply the hybrid RAG system to all test queries
print("Running Hybrid RAG on all evaluation queries...")
print("This may take a few minutes depending on dataset size...")

hybrid_results = dataset['user_input'].apply(lambda query: run_rag_system(query, hybrid_rag_sc))

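# Extract response and retrieved contexts into separate columns.
# Note: the helper returns the retrieved contexts under the 'reference' key, so the
# second assignment overwrites the ground-truth 'reference' column loaded from the CSV.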
dataset['response'] = hybrid_results.apply(lambda x: x['response'])
dataset['reference'] = hybrid_results.apply(lambda x: x['reference'])

print(f"Successfully generated responses for {len(dataset)} queries!")
print(f"Added columns: 'response', 'reference' (retrieved contexts)")
print(f"Final dataset shape: {dataset.shape}")

Running Hybrid RAG on all evaluation queries...
This may take a few minutes depending on dataset size...
Successfully generated responses for 3 queries!
Added columns: 'response', 'reference' (retrieved contexts)
Final dataset shape: (3, 5)
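To dry-run the extraction step without Elasticsearch or API calls, a stub pipeline can mimic the output shape the helper reads (`replies` and `documents` with a `.content` attribute). This is a minimal sketch: `StubRAG`, `StubDocument`, and `extract_response_and_contexts` are hypothetical names, and the sketch stores contexts under a `retrieved_contexts` key rather than `reference`.

```python
from dataclasses import dataclass

@dataclass
class StubDocument:
    content: str  # mirrors the `.content` attribute read by the helper

class StubRAG:
    """Hypothetical offline stand-in for the RAG pipeline (not part of the project)."""

    def run(self, query):
        return {
            "replies": [f"Stub answer to: {query}"],
            "documents": [StubDocument("context one"), StubDocument("context two")],
        }

def extract_response_and_contexts(response):
    """Mirror the helper's extraction, guarding against a missing or empty reply list."""
    replies = response.get("replies") or [""]
    return {
        "response": replies[0],
        "retrieved_contexts": [doc.content for doc in response.get("documents", [])],
    }

row = extract_response_and_contexts(StubRAG().run(query="test question"))
print(row)
```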


## Step 6: Format Data for RAGAS Evaluation

RAGAS expects data in a specific format called `SingleTurnSample`. We need to transform our dataset to match these requirements:

**RAGAS Required Fields:**
- `user_input` (str) - The question/query
- `response` (str) - RAG system's answer  
- `retrieved_contexts` (List[str]) - Documents retrieved by RAG
- `reference` (str) - Ground truth answer

Let's transform our data to match this schema.
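As a sanity check on the field list above, here is a minimal sketch that validates one record before it goes into the evaluation DataFrame. The record values are made up and `validate_sample` is a hypothetical helper, not part of RAGAS. Note that the ground-truth answer (`reference`) and the retrieved documents (`retrieved_contexts`) are kept in separate fields.

```python
def validate_sample(sample):
    """Return a list of problems found in one single-turn evaluation record."""
    problems = []
    for field in ("user_input", "response", "reference"):
        if not isinstance(sample.get(field), str):
            problems.append(f"{field} must be a string")
    contexts = sample.get("retrieved_contexts")
    if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
        problems.append("retrieved_contexts must be a list of strings")
    return problems

good_sample = {
    "user_input": "What is the capital of France?",       # the question
    "response": "Paris is the capital of France.",        # RAG system's answer
    "retrieved_contexts": ["Paris is France's capital."],  # documents retrieved by RAG
    "reference": "Paris",                                  # ground-truth answer
}
bad_sample = dict(good_sample, retrieved_contexts="Paris is France's capital.")

print(validate_sample(good_sample))
print(validate_sample(bad_sample))
```

The second record fails because its contexts are a single string rather than a list, a common slip when contexts are read back from a CSV.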

In [10]:
# Transform dataset format for RAGAS SingleTurnSample requirements
import ast

def parse_contexts(context_str):
    """
    Parse string representation of list to actual list
    
    Args:
        context_str: String or list containing context information
    
    Returns:
        list: Parsed context as a list of strings
    """
    try:
        if isinstance(context_str, str):
            return ast.literal_eval(context_str)
        elif isinstance(context_str, list):
            return context_str
        else:
            return []
    except (ValueError, SyntaxError):
        if isinstance(context_str, str):
            return [context_str]
        return []

print("Preparing dataset for RAGAS evaluation...")

# Parse reference_contexts from string to list (ground truth contexts)
dataset['reference_contexts_parsed'] = dataset['reference_contexts'].apply(parse_contexts)

# Ensure user_input is a string, not a list
dataset['user_input'] = dataset['user_input'].apply(
    lambda x: x[0] if isinstance(x, list) else x
)

# Create the final RAGAS-compatible dataset
ragas_dataset = pd.DataFrame({
    'user_input': dataset['user_input'],           # Question/query as string
    'response': dataset['response'],                # RAG response as string
    'retrieved_contexts': dataset['reference'],     # Retrieved contexts as list (stored under 'reference' in Step 5)
    # Caution: this uses the first reference context as the reference text,
    # not the ground-truth answer from the CSV, which Step 5 overwrote.
    'reference': dataset['reference_contexts_parsed'].apply(
        lambda x: x[0] if x else ""
    )
})

print(f"Dataset successfully formatted for RAGAS!")
print(f"Final evaluation dataset shape: {ragas_dataset.shape}")
print(f"Columns: {list(ragas_dataset.columns)}")

Preparing dataset for RAGAS evaluation...
Dataset successfully formatted for RAGAS!
Final evaluation dataset shape: (3, 4)
Columns: ['user_input', 'response', 'retrieved_contexts', 'reference']


In [11]:
# Create the RAGAS EvaluationDataset object
from ragas import EvaluationDataset

print("Creating RAGAS EvaluationDataset...")

# Create evaluation dataset with the properly formatted data
evaluation_dataset = EvaluationDataset.from_pandas(ragas_dataset)

print("EvaluationDataset created successfully!")
print(f"Dataset size: {len(evaluation_dataset)} samples")
print(f"Sample type: {evaluation_dataset.get_sample_type()}")
print("Ready for evaluation!")

Creating RAGAS EvaluationDataset...
EvaluationDataset created successfully!
Dataset size: 3 samples
Sample type: <class 'ragas.dataset_schema.SingleTurnSample'>
Ready for evaluation!


In [12]:
# Let's examine our final dataset structure
print("Final RAGAS dataset structure:")
print("=" * 50)
ragas_dataset

Final RAGAS dataset structure:


Unnamed: 0,user_input,response,retrieved_contexts,reference
0,What role does TikTok play in the context of a...,I don't have enough information to answer.,"[What is AI, how does it work and why are some...","What is AI, how does it work and why are some ..."
1,How does the use of automated classifiers ensu...,The use of automated classifiers ensures priva...,"[We describe the contents of each dataset, the...",<1-hop>\n\nPrivacy via Automated Classifiers.N...
2,How do Context-Augmented Message Classificatio...,Context-Augmented Message Classifications ensu...,[” (truncated)\n[user]: “10 more”\nTable 2:Ill...,<1-hop>\n\n” (truncated)\n[user]: “10 more”\nT...


## Step 7: Configure RAGAS Evaluator

RAGAS needs an LLM to evaluate responses. We'll use OpenAI's `gpt-4o-mini` model, created as a Haystack `OpenAIGenerator` and wrapped in RAGAS's `HaystackLLMWrapper`. This LLM will act as the "judge" that scores our RAG system's performance.

In [None]:
# Set up the evaluator LLM for RAGAS
from ragas import evaluate
from ragas.llms import HaystackLLMWrapper
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret
from dotenv import load_dotenv
load_dotenv(".env")

print("Setting up RAGAS evaluator LLM...")

model = OpenAIGenerator(model="gpt-4o-mini",
                        api_key=Secret.from_env_var("OPENAI_API_KEY"))
evaluator_llm = HaystackLLMWrapper(haystack_generator=model)

print("Evaluator LLM configured successfully!")
print("This LLM will act as the 'judge' for evaluating RAG performance")

Setting up RAGAS evaluator LLM...
Evaluator LLM configured successfully!
This LLM will act as the 'judge' for evaluating RAG performance


## Step 8: Run Basic RAGAS Evaluation

Now for the main event! We'll run RAGAS evaluation using basic metrics to get a focused assessment of our RAG system.

### Basic Metrics We're Using:

1. **Faithfulness**: Are the responses factually consistent with retrieved contexts?
2. **Response Relevancy**: How relevant are responses to the input questions?
3. **LLM Context Recall**: How well does the system retrieve relevant contexts that contain the ground truth answer?
4. **Factual Correctness**: How factually accurate is the generated response compared to the ground truth?

These core metrics provide essential insights while keeping evaluation simple and efficient.

**Note**: This process uses API calls and may take several minutes to complete.

In [16]:
# Import basic evaluation metrics
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy
from ragas import evaluate, RunConfig

print("Starting basic RAGAS evaluation...")
print("Using core metrics: Faithfulness, Response Relevancy, LLM Context Recall, Factual Correctness")
print()

# Use basic metrics only
basic_metrics = [
    Faithfulness(), 
    ResponseRelevancy(),
    LLMContextRecall(),
    FactualCorrectness()
]

print("Running evaluation with basic metrics...")
print("This may take several minutes... Please wait...")

try:
    # Create evaluation dataset
    evaluation_dataset = EvaluationDataset.from_pandas(ragas_dataset)
    
    # Configure evaluation settings with extended timeout
    custom_run_config = RunConfig(timeout=360)
    
    # Run the evaluation
    result = evaluate(
        dataset=evaluation_dataset,
        metrics=basic_metrics,
        llm=evaluator_llm,
        run_config=custom_run_config
    )
    
    print("Evaluation completed successfully!")
    print("\nRAGAS Basic Evaluation Results:")
    print("=" * 50)
    print(result)
    
except Exception as e:
    print(f"Evaluation failed with error: {e}")
    print(f"Error type: {type(e).__name__}")


Starting basic RAGAS evaluation...
Using core metrics: Faithfulness, Response Relevancy, LLM Context Recall, Factual Correctness

Running evaluation with basic metrics...
This may take several minutes... Please wait...


Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.


Evaluation completed successfully!

RAGAS Basic Evaluation Results:
{'faithfulness': 0.6275, 'answer_relevancy': 0.6473, 'context_recall': 0.9471, 'factual_correctness(mode=f1)': 0.3633}


## Step 9: Interpret Your Results

Let's examine the evaluation results in detail. Each metric provides insights into different aspects of your RAG system's performance.

In [None]:
# Display detailed results with interpretation
print("BASIC RAGAS EVALUATION RESULTS")
print("=" * 50)
print()
print(result)

### Understanding Your Basic RAGAS Scores
*Simplified interpretation of core metrics*

## **Basic Metric Explanations**

### **1. [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/)**
*Measures how factually consistent responses are with retrieved context*

**What it measures:** Whether your RAG system sticks to the facts or "hallucinates" information.

**Formula:** `Faithfulness = (Claims supported by context) / (Total claims in response)`

**Score meaning:**
- **High scores (>0.8)** = Responses stick to retrieved facts
- **Medium scores (0.5-0.8)** = Generally accurate with some unsupported claims
- **Low scores (<0.5)** = System making many unsupported claims or hallucinating

**Example:**
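- Context: "Albert Einstein (born 14 March 1879) was a German-born theoretical physicist"
- Good response: "Einstein was born in Germany on 14th March 1879" → Score: 1.0 (both claims are supported)
- Poor response: "Einstein was born in Germany on 20th March 1879" → Score: 0.5 (the birth date is not supported, so 1 of 2 claims holds)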
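The arithmetic behind the two example scores can be sketched as follows. The claim verdicts are hand-labelled here as an assumption; in RAGAS, the evaluator LLM extracts the claims and judges each one against the retrieved context.

```python
def faithfulness_score(claim_supported):
    """Fraction of response claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported.values()) / len(claim_supported)

# Each response is split into claims, mapped to a supported/unsupported verdict
good = {
    "Einstein was born in Germany": True,
    "Einstein was born on 14 March 1879": True,
}
poor = {
    "Einstein was born in Germany": True,
    "Einstein was born on 20 March 1879": False,  # date contradicts the context
}
print(faithfulness_score(good), faithfulness_score(poor))  # 1.0 0.5
```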

---

### **2. [Response Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/)**
*Measures how relevant responses are to the input questions*

**What it measures:** Whether your system directly answers what was asked.

**How it works:** 
1. Generate artificial questions from the response
2. Calculate similarity between original question and generated questions
3. Higher similarity = more relevant response
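The steps above can be sketched with toy vectors. This is a simplified illustration: the 2-D vectors stand in for real embeddings of the questions, and the question-generation step (done by the evaluator LLM) is assumed to have already happened.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def response_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original and the generated questions."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0

original = [1.0, 0.0]                 # embedding of the user's question (toy values)
generated = [[1.0, 0.0], [0.0, 1.0]]  # embeddings of questions generated from the response
print(response_relevancy(original, generated))  # 0.5
```

One generated question matches the original exactly and the other is unrelated, so the score lands halfway.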

**Score meaning:**
- **High scores (>0.8)** = Responses directly answer what's asked
- **Medium scores (0.5-0.8)** = Generally relevant with some tangential content
- **Low scores (<0.5)** = Responses are off-topic or incomplete

**Example:**
- Question: "Where is France and what is its capital?"
- Poor answer: "France is in western Europe" → Lower relevancy (incomplete)
- Good answer: "France is in western Europe and Paris is its capital" → Higher relevancy

---

### **3. [LLM Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/)**
*Measures how well the retrieval system finds contexts containing the ground truth answer*

**What it measures:** The effectiveness of your retrieval component in finding relevant information.

**How it works:**
1. Compare retrieved contexts against the ground truth reference answer
2. Determine what proportion of the ground truth is present in retrieved contexts
3. Higher scores = better retrieval coverage
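As a rough sketch of the proportion described above: split the reference into claims and count how many are supported by at least one retrieved context. The substring check below is a naive stand-in chosen for illustration; RAGAS uses the evaluator LLM to make that judgement.

```python
def context_recall(reference_claims, retrieved_contexts, is_supported):
    """Fraction of reference claims that at least one retrieved context supports."""
    if not reference_claims:
        return 0.0
    supported = sum(
        any(is_supported(claim, ctx) for ctx in retrieved_contexts)
        for claim in reference_claims
    )
    return supported / len(reference_claims)

def substring_judge(claim, context):
    """Naive stand-in judge: case-insensitive substring match."""
    return claim.lower() in context.lower()

claims = ["Paris is the capital of France", "Paris is located on the Seine"]
contexts = ["Paris is the capital of France and its largest city."]
print(context_recall(claims, contexts, substring_judge))  # 0.5
```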

**Score meaning:**
- **High scores (>0.8)** = Retrieval finds most relevant documents
- **Medium scores (0.5-0.8)** = Some relevant documents missed
- **Low scores (<0.5)** = Retrieval is missing critical information

**Example:**
- Ground truth: "Paris is the capital of France, located on the Seine River"
- Good retrieval: Contains documents about Paris, France, and the Seine → High recall
- Poor retrieval: Only general France info, nothing about Paris → Low recall

---

### **4. [Factual Correctness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness/)**
*Measures factual accuracy of the generated response compared to ground truth*

**What it measures:** How well the generated answer matches the expected correct answer.

**How it works:**
1. Extract factual claims from both the generated response and ground truth
2. Compare claims for correctness and completeness
3. Compute precision, recall, and F1 over the matched claims (the result in this notebook is reported as `mode=f1`)
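A minimal sketch of how claim-level counts turn into an F1 score of the kind reported as `factual_correctness(mode=f1)`. The counts are hypothetical; in RAGAS, the claims are extracted and compared by the evaluator LLM.

```python
def factual_correctness(tp, fp, fn):
    """Precision, recall, and F1 from claim-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts: 2 response claims match the reference (TP),
# 1 response claim is not in the reference (FP),
# 1 reference claim is missing from the response (FN)
precision, recall, f1 = factual_correctness(tp=2, fp=1, fn=1)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```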

**Score meaning:**
- **High scores (>0.8)** = Response contains correct facts matching ground truth
- **Medium scores (0.5-0.8)** = Some facts correct, some missing or incorrect
- **Low scores (<0.5)** = Response has many factual errors or omissions

**Example:**
- Ground truth: "The Eiffel Tower is 330 meters tall and was built in 1889"
- Good response: "The Eiffel Tower stands at 330 meters and was constructed in 1889" → High correctness
- Poor response: "The Eiffel Tower is about 300 meters tall and was built in the 1880s" → Lower correctness

---

## **Improvement Strategies by Score Pattern**

**If Faithfulness is Low:**
- Add explicit instructions to stick to retrieved context
- Implement fact-checking components  
- Use stronger grounding techniques in prompts
- Review and improve document quality in knowledge base

**If Response Relevancy is Low:**
- Improve query understanding and processing
- Add query classification for better routing
- Enhance prompt engineering to focus on question requirements
- Consider query expansion or reformulation techniques

**If LLM Context Recall is Low:**
- Improve retrieval strategy (adjust embedding model, search parameters)
- Increase number of retrieved documents (top_k parameter)
- Enhance document chunking strategy
- Consider hybrid search (dense + sparse retrieval)
- Improve document preprocessing and indexing

**If Factual Correctness is Low:**
- Improve prompt engineering to emphasize accuracy
- Use stronger/larger generation models
- Implement retrieval result reranking
- Add fact verification steps
- Ensure ground truth data quality is high

**General Improvements:**
- **All scores low**: Review overall RAG architecture and data quality
- **All scores high**: System is performing well - consider advanced metrics for fine-tuning
- **Mixed scores**: Focus on the lowest-performing area first
- **High Recall but Low Faithfulness**: Generation model not using retrieved context properly
- **Low Recall but High Faithfulness**: Good generation but poor retrieval - fix retrieval first
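To apply the "focus on the lowest-performing area first" advice, the scores can be sorted and labelled with the bands used in this tutorial. A minimal sketch: `triage` is a hypothetical helper, the 0.5 and 0.8 thresholds come from the score tables above, and the scores are copied from the Step 8 output.

```python
def triage(scores, low=0.5, high=0.8):
    """Label each metric with a score band and list the weakest metric first."""
    def band(value):
        if value > high:
            return "high"
        if value < low:
            return "low"
        return "medium"

    ordered = sorted(scores.items(), key=lambda item: item[1])
    return [(name, value, band(value)) for name, value in ordered]

scores = {
    "faithfulness": 0.6275,
    "answer_relevancy": 0.6473,
    "context_recall": 0.9471,
    "factual_correctness(mode=f1)": 0.3633,
}
for name, value, label in triage(scores):
    print(f"{name}: {value:.4f} ({label})")
```

For this run, factual correctness comes out first and is the only metric in the low band, while context recall is high, so the generation side deserves attention before retrieval.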

## Congratulations!

You've successfully completed a basic RAGAS evaluation of your RAG system! 

**What you've accomplished:**

- **Set up RAGAS framework** - Imported and configured the evaluation toolkit  
- **Tested RAG system** - Ran queries against the Hybrid RAG implementation  
- **Prepared evaluation data** - Formatted synthetic test data for systematic assessment  
- **Ran basic metrics** - Evaluated your RAG system using core performance measures  
- **Interpreted results** - Learned how to understand and act on RAGAS scores  

## **Next Steps**

Now that you have basic evaluation working, you can:

1. **Expand metrics** - Add more RAGAS metrics like Context Precision and Noise Sensitivity
2. **Compare systems** - Evaluate different RAG implementations side by side  
3. **Iterate and improve** - Use these insights to enhance your RAG system
4. **Automate evaluation** - Integrate RAGAS into your development pipeline

Great job on completing this essential RAG evaluation workflow!