🔧 **Setup Required**: Before running this notebook, follow the [setup instructions](../README.md#setup-instructions) to configure your environment and API keys. **You must also run the indexing pipeline before starting this exercise.**

# RAGAS Evaluation Tutorial: A Complete Guide to RAG System Assessment

This notebook provides a comprehensive walkthrough for evaluating Retrieval-Augmented Generation (RAG) systems using the RAGAS (RAG Assessment) framework. 

## What You'll Learn

In this tutorial, you will:
1. **Set up RAGAS** - Import necessary libraries and understand the evaluation framework
2. **Test RAG Systems** - Run queries against different RAG implementations (Hybrid vs Naive)
3. **Prepare Evaluation Data** - Format synthetic test data for RAGAS evaluation
4. **Run Comprehensive Metrics** - Evaluate your RAG system using multiple assessment criteria
5. **Interpret Results** - Understand what the metrics tell us about system performance

## Prerequisites

- Basic understanding of RAG (Retrieval-Augmented Generation) systems
- Familiarity with Python and Pandas
- OpenAI API key configured in your environment

Let's begin!

## Step 1: Import Required Libraries and RAG Systems

First, let's import the RAG systems we'll be evaluating. We have two implementations:
- **Hybrid RAG**: Uses both dense and sparse retrieval methods
- **Naive RAG**: Uses a simpler retrieval approach

We'll also import utility functions for pretty-printing results.

In [1]:
from scripts.rag.hybridrag import hybrid_rag_sc
from scripts.rag.naiverag import naive_rag_sc
from scripts.rag.pretty_print import pretty_print_rag_answer



In [2]:
# Define a sample query to test both RAG systems
query = "Who uses ChatGPT more, men or women, and how does this change by 2025"

# Run the Hybrid RAG system
print("🔄 Running Hybrid RAG system...")
hybrid_answer = hybrid_rag_sc.run(query=query)

🔄 Running Hybrid RAG system...


Batches: 100%|██████████| 1/1 [00:00<00:00,  5.70it/s]



## Step 2: Test Your RAG Systems with a Sample Query

Before running evaluation metrics, let's test both RAG systems with a sample query to understand their behavior. This helps us verify everything is working correctly.

In [3]:
# Display the Hybrid RAG results in a formatted way
pretty_print_rag_answer(hybrid_answer, "Hybrid RAG", query)

🔍 HYBRID RAG ANSWER
📝 Query: Who uses ChatGPT more, men or women, and how does this change by 2025
--------------------------------------------------------------------------------
💬 Answer:
   Initially, a significant share (around 80%) of weekly active users of
   ChatGPT in the first few months after its release were users with
   typically masculine first names. However, by the first half of 2025, the
   share of active users with typically feminine names reached near-parity
   with those having typically masculine names. By June 2025, active users
   were more likely to have typically feminine names, suggesting that
   gender gaps in ChatGPT usage have closed substantially over time.

📚 Source Documents (3 found):
--------------------------------------------------
1. Source: Unknown source
   Preview: The prompts for each of these automated classifiers (with the
exception of interaction quality) are available in Appendix A. Values represent the aver...

2. Source: Unknown source
  

In [4]:
# Run the Naive RAG system for comparison
print("🔄 Running Naive RAG system...")
naive_answer = naive_rag_sc.run(query=query)

🔄 Running Naive RAG system...


Batches: 100%|██████████| 1/1 [00:00<00:00, 14.94it/s]



In [5]:
# Display the Naive RAG results
pretty_print_rag_answer(naive_answer, "Naive RAG", query)

🔍 NAIVE RAG ANSWER
📝 Query: Who uses ChatGPT more, men or women, and how does this change by 2025
--------------------------------------------------------------------------------
💬 Answer:
   Initially, around 80% of the weekly active users (WAU) in the first few
   months after ChatGPT was released had typically masculine first names.
   However, by the first half of 2025, the share of active users with
   typically feminine and typically masculine names reached near-parity. By
   June 2025, it was observed that active users were more likely to have
   typically feminine names, indicating that gender gaps in ChatGPT usage
   have closed significantly over time.

📚 Source Documents (3 found):
--------------------------------------------------
1. Source: Unknown source
   Preview: The prompts for each of these automated classifiers (with the
exception of interaction quality) are available in Appendix A. Values represent the aver...

2. Source: Unknown source
   Preview: X’s indicate tha

**💡 Observation**: Compare the responses from both systems. Notice differences in:
- Answer quality and completeness
- Retrieved context relevance
- Response structure and clarity

This manual comparison gives us intuition, but RAGAS provides systematic evaluation metrics.

## Step 3: Introduction to RAGAS Framework

**RAGAS** (RAG Assessment) is a framework designed to evaluate RAG systems comprehensively. It provides several key metrics:

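### Key RAGAS Metrics We'll Use:

1. **Faithfulness** - How factually consistent the answer is with the retrieved context
2. **Answer Relevancy** - How relevant the answer is to the question
3. **Context Recall** - How much of the reference information the retrieved context covers
4. **Context Entity Recall** - How many entities from the reference appear in the retrieved context
5. **Factual Correctness** - How accurate the response's factual claims are against the reference
6. **Noise Sensitivity** - How often the response contains incorrect claims (lower is better)

**Context Precision** (how well relevant chunks are ranked above irrelevant ones) is explained in Step 9 for reference, but it is not among the metrics computed in this notebook.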

Let's import RAGAS and prepare our evaluation dataset.

In [6]:
# Import RAGAS core modules
from ragas import evaluation
print("✅ RAGAS imported successfully!")

✅ RAGAS imported successfully!


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
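
The tokenizers warning above is harmless, and the message itself names the fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs at the top of the notebook before any library that uses Hugging Face tokenizers is imported:

```python
import os

# Set this before importing or running anything that uses Hugging Face tokenizers,
# so that forked worker processes do not trigger the parallelism warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Choosing `"false"` matches what the library falls back to anyway after the fork, so it should not change results.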


## Step 4: Load and Examine Evaluation Dataset

For systematic evaluation, we need a test dataset with:
- **Questions** - Queries to test the RAG system
- **Ground Truth Answers** - Expected correct responses  
- **Reference Contexts** - Documents that should be retrieved

We'll use a synthetic dataset that was pre-generated for evaluation purposes.

In [7]:
# Import additional required libraries
from haystack.components.generators import OpenAIGenerator
from ragas import EvaluationDataset
import pandas as pd

# Load the synthetic evaluation dataset
print("📊 Loading evaluation dataset...")
dataset = pd.read_csv("./data_for_eval/synthetic_tests_advanced_branching_10.csv")
dataset_5 = dataset.copy()

print(f"✅ Dataset loaded successfully!")
print(f"📈 Dataset contains {len(dataset_5)} test cases")
print(f"🔍 Columns: {list(dataset_5.columns)}")

📊 Loading evaluation dataset...
✅ Dataset loaded successfully!
📈 Dataset contains 10 test cases
🔍 Columns: ['user_input', 'reference_contexts', 'reference', 'synthesizer_name']


In [8]:
# Let's examine the structure of our evaluation dataset
print("🔍 Examining the evaluation dataset structure:")
print("=" * 50)
dataset_5

🔍 Examining the evaluation dataset structure:


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are the ethical implications and concerns...,"['What is AI, how does it work and why are som...","The rise of Meta AI, like other generative AI ...",single_hop_specific_query_synthesizer
1,What is the estimated energy consumption of th...,['How does AI effect the environment?\nIt is n...,Some researchers estimate that the AI industry...,single_hop_specific_query_synthesizer
2,Wut is the significanse of Artificial Intellig...,['This article was published in 2018. To read ...,Artificial Intelligence (AI) is a technology t...,single_hop_specific_query_synthesizer
3,What does Figure 22 illustrate about the varia...,['<1-hop>\n\n37% of messages are work-related\...,Figure 22 illustrates the variation in ChatGPT...,multi_hop_specific_query_synthesizer
4,What does Figure 22 show about how ChatGPT is ...,['<1-hop>\n\nPanel A.Work Related\n Panel B1.A...,Figure 22 illustrates the classification of wo...,multi_hop_specific_query_synthesizer
5,How does ChatGPT Business usage vary by occupa...,['<1-hop>\n\nCorporate users may also use Chat...,ChatGPT Business usage varies significantly by...,multi_hop_specific_query_synthesizer
6,How does the environmental impact of artificia...,"['<1-hop>\n\nWhat is AI, how does it work and ...",The environmental impact of artificial intelli...,multi_hop_abstract_query_synthesizer
7,How do privacy protections and de-identificati...,['<1-hop>\n\nWe describe the contents of each ...,Privacy protections in the analysis of ChatGPT...,multi_hop_abstract_query_synthesizer
8,What trends can be observed in user cohort ana...,['<1-hop>\n\nThe yellow line represents the fi...,User cohort analysis reveals that there has be...,multi_hop_abstract_query_synthesizer
9,What are the environmental concerns related to...,"['<1-hop>\n\nWhat is AI, how does it work and ...",The environmental concerns related to artifici...,multi_hop_abstract_query_synthesizer


## Step 5: Generate RAG Responses for Evaluation

Now we'll run our RAG system on all test queries to generate responses that RAGAS can evaluate. This is where we collect the data needed for systematic assessment.

**What we're doing:**
- Run each test query through our Hybrid RAG system
- Extract both the generated response and retrieved contexts
- Store results in the format RAGAS expects

In [None]:
# Create a helper function to run RAG systems and extract needed data
def run_rag_system(query, rag_system):
    """
    Helper function to run a RAG system and return formatted response
    
    Args:
        query: The question to ask the RAG system
        rag_system: The RAG pipeline to use
    
    Returns:
        dict: Contains the response text and retrieved document contexts
    """
    response = rag_system.run(query=query)
    return {
        'response': response['replies'][0] if 'replies' in response else '',
        'reference': [doc.content for doc in response.get('documents', [])]
    }

# Wrap the helper so it can be applied directly to the 'user_input' column
def run_hybrid_rag(query):
    return run_rag_system(query, hybrid_rag_sc)

# Apply the hybrid RAG system to all test queries
print("🔄 Running Hybrid RAG on all evaluation queries...")
print("This may take a few minutes depending on dataset size...")

hybrid_results = dataset_5['user_input'].apply(run_hybrid_rag)

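# Extract the response and the retrieved contexts into separate columns.
# Note: writing to 'reference' overwrites the ground-truth answers loaded from the CSV;
# the untouched originals remain available in dataset['reference'].
dataset_5['response'] = hybrid_results.apply(lambda x: x['response'])
dataset_5['reference'] = hybrid_results.apply(lambda x: x['reference'])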

print(f"✅ Successfully generated responses for {len(dataset_5)} queries!")
print(f"📊 Added columns: 'response', 'reference' (retrieved contexts)")
print(f"📈 Final dataset shape: {dataset_5.shape}")

🔄 Running Hybrid RAG on all evaluation queries...
This may take a few minutes depending on dataset size...


Batches: 100%|██████████| 1/1 [00:00<00:00, 14.19it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00,  8.21it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00,  7.46it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00,  9.04it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00,  7.66it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00,  7.97it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00,  7.45it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00, 18.98it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00,  9.40it/s]

Batches: 100%|██████████| 1/1 [00:00<00:00,  9.01it/s]



✅ Successfully generated responses for 10 queries!
📊 Added columns: 'response', 'reference' (retrieved contexts)
📈 Final dataset shape: (10, 5)
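
The helper above stores the retrieved contexts under the key `reference`, so assigning them to `dataset_5['reference']` replaces the ground-truth answers from the CSV. The sketch below shows one possible variant that keeps both by using a separate `retrieved_contexts` key. `StubRag` and `StubDocument` are hypothetical stand-ins for `hybrid_rag_sc` and its documents, included only so the example runs without the indexing pipeline.

```python
from dataclasses import dataclass


@dataclass
class StubDocument:
    """Hypothetical stand-in for a retrieved document with a `content` field."""
    content: str


class StubRag:
    """Hypothetical stand-in for `hybrid_rag_sc`; returns a fixed reply and documents."""

    def run(self, query):
        return {
            "replies": [f"Answer to: {query}"],
            "documents": [StubDocument("chunk A"), StubDocument("chunk B")],
        }


def run_rag_system_keep_reference(query, rag_system):
    """Return the reply and retrieved contexts under keys that do not clash with 'reference'."""
    result = rag_system.run(query=query)
    replies = result.get("replies", [])
    return {
        # Guard against a missing or empty 'replies' list
        "response": replies[0] if replies else "",
        "retrieved_contexts": [doc.content for doc in result.get("documents", [])],
    }


# One test case shaped like a row of the evaluation CSV
row = {"user_input": "What is RAG?", "reference": "Ground-truth answer"}
row.update(run_rag_system_keep_reference(row["user_input"], StubRag()))

print(row)
```

With this layout the ground-truth `reference` survives, and the later formatting step could map `retrieved_contexts` directly instead of reading it back from `reference`.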


## Step 6: Format Data for RAGAS Evaluation

RAGAS expects data in a specific format called `SingleTurnSample`. We need to transform our dataset to match these requirements:

**RAGAS Required Fields:**
- `user_input` (str) - The question/query
- `response` (str) - RAG system's answer  
- `retrieved_contexts` (List[str]) - Documents retrieved by RAG
- `reference` (str) - Ground truth text (this notebook passes the first reference context here)

Let's transform our data to match this schema.

In [10]:
# Transform dataset format for RAGAS SingleTurnSample requirements
import ast

def parse_contexts(context_str):
    """
    Parse string representation of list to actual list
    
    Args:
        context_str: String or list containing context information
    
    Returns:
        list: Parsed context as a list of strings
    """
    try:
        if isinstance(context_str, str):
            return ast.literal_eval(context_str)
        elif isinstance(context_str, list):
            return context_str
        else:
            return []
    except (ValueError, SyntaxError):
        if isinstance(context_str, str):
            return [context_str]
        return []

print("🔄 Preparing dataset for RAGAS evaluation...")

# Create a clean dataset for RAGAS evaluation
eval_dataset = dataset_5.copy()

# Parse reference_contexts from string to list (ground truth contexts)
eval_dataset['reference_contexts_parsed'] = eval_dataset['reference_contexts'].apply(parse_contexts)

# Ensure user_input is a string, not a list
eval_dataset['user_input'] = eval_dataset['user_input'].apply(
    lambda x: x[0] if isinstance(x, list) else x
)

# Create the final RAGAS-compatible dataset
ragas_dataset = pd.DataFrame({
    'user_input': eval_dataset['user_input'],           # Question/query as string
    'response': eval_dataset['response'],                # RAG response as string  
    'retrieved_contexts': eval_dataset['reference'],     # Retrieved contexts as list (stored under 'reference' in Step 5)
    'reference': eval_dataset['reference_contexts_parsed'].apply(
        lambda x: x[0] if x else ""
    )  # First reference context as string (stands in for the ground-truth answer)
})

print(f"✅ Dataset successfully formatted for RAGAS!")
print(f"📊 Final evaluation dataset shape: {ragas_dataset.shape}")
print(f"🔍 Columns: {list(ragas_dataset.columns)}")

🔄 Preparing dataset for RAGAS evaluation...
✅ Dataset successfully formatted for RAGAS!
📊 Final evaluation dataset shape: (10, 4)
🔍 Columns: ['user_input', 'response', 'retrieved_contexts', 'reference']
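
The required fields listed in this step can be checked row by row before building the `EvaluationDataset`. The sketch below is a hand-rolled check, not part of RAGAS; the function name `check_single_turn_row` and the sample rows are made up for illustration.

```python
def check_single_turn_row(row):
    """Return a list of problems found in one row; an empty list means the row looks valid."""
    problems = []
    for field in ("user_input", "response", "reference"):
        if not isinstance(row.get(field), str):
            problems.append(f"{field} should be a str")
    contexts = row.get("retrieved_contexts")
    if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
        problems.append("retrieved_contexts should be a list of str")
    return problems


good_row = {
    "user_input": "What is RAG?",
    "response": "Retrieval-Augmented Generation.",
    "retrieved_contexts": ["RAG combines retrieval with generation."],
    "reference": "RAG is Retrieval-Augmented Generation.",
}

# A common mistake: contexts left as the string representation of a list
bad_row = dict(good_row, retrieved_contexts="['RAG combines retrieval with generation.']")

print(check_single_turn_row(good_row))
print(check_single_turn_row(bad_row))
```

For the DataFrame built above, `ragas_dataset.to_dict("records")` yields rows in this shape.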


In [11]:
# Create the RAGAS EvaluationDataset object
from ragas import EvaluationDataset

print("🔄 Creating RAGAS EvaluationDataset...")

# Create evaluation dataset with the properly formatted data
evaluation_dataset = EvaluationDataset.from_pandas(ragas_dataset)

print("✅ EvaluationDataset created successfully!")
print(f"📊 Dataset size: {len(evaluation_dataset)} samples")
print(f"🔧 Sample type: {evaluation_dataset.get_sample_type()}")
print("🎯 Ready for evaluation!")

🔄 Creating RAGAS EvaluationDataset...
✅ EvaluationDataset created successfully!
📊 Dataset size: 10 samples
🔧 Sample type: <class 'ragas.dataset_schema.SingleTurnSample'>
🎯 Ready for evaluation!


In [12]:
# Let's examine our final dataset structure
print("🔍 Final RAGAS dataset structure:")
print("=" * 50)
ragas_dataset

🔍 Final RAGAS dataset structure:


Unnamed: 0,user_input,response,retrieved_contexts,reference
0,What are the ethical implications and concerns...,The ethical implications and concerns surround...,"[What is AI, how does it work and why are some...","What is AI, how does it work and why are some ..."
1,What is the estimated energy consumption of th...,The estimated energy consumption of the AI ind...,"[What is AI, how does it work and why are some...",How does AI effect the environment?\nIt is not...
2,Wut is the significanse of Artificial Intellig...,The significance of artificial intelligence (A...,[This article was published in 2018. To read m...,This article was published in 2018. To read mo...
3,What does Figure 22 illustrate about the varia...,Figure 22 illustrates that there is a variatio...,[The prompts for each of these automated class...,<1-hop>\n\n37% of messages are work-related\nf...
4,What does Figure 22 show about how ChatGPT is ...,Figure 22 presents variation in ChatGPT usage ...,[X’s indicate that the ranking is\nunavailable...,<1-hop>\n\nPanel A.Work Related\n Panel B1.Ask...
5,How does ChatGPT Business usage vary by occupa...,I don't have enough information to answer.,[The prompts for each of these automated class...,<1-hop>\n\nCorporate users may also use ChatGP...
6,How does the environmental impact of artificia...,The environmental impact of artificial intelli...,"[What is AI, how does it work and why are some...","<1-hop>\n\nWhat is AI, how does it work and wh..."
7,How do privacy protections and de-identificati...,Privacy protections and de-identification proc...,[•Sampled from all ChatGPT users:a random samp...,<1-hop>\n\nWe describe the contents of each da...
8,What trends can be observed in user cohort ana...,The user cohort analysis regarding ChatGPT que...,[X’s indicate that the ranking is\nunavailable...,<1-hop>\n\nThe yellow line represents the firs...
9,What are the environmental concerns related to...,The environmental concerns related to artifici...,"[What is AI, how does it work and why are some...","<1-hop>\n\nWhat is AI, how does it work and wh..."


## Step 7: Configure RAGAS Evaluator

RAGAS needs an LLM to evaluate responses. We'll use OpenAI's GPT model wrapped in RAGAS's Haystack wrapper. This LLM will act as the "judge" that scores our RAG system's performance.

In [13]:
# Set up the evaluator LLM for RAGAS
from ragas import evaluate
from ragas.llms import HaystackLLMWrapper
from haystack.components.generators import OpenAIGenerator

print("🔧 Setting up RAGAS evaluator LLM...")

# Create evaluator LLM using OpenAI GPT model
# Note: Using a smaller model to reduce API costs while maintaining quality
evaluator_llm = HaystackLLMWrapper(OpenAIGenerator(model="gpt-4o-mini"))

print("✅ Evaluator LLM configured successfully!")
print("💡 This LLM will act as the 'judge' for evaluating RAG performance")

🔧 Setting up RAGAS evaluator LLM...
✅ Evaluator LLM configured successfully!
💡 This LLM will act as the 'judge' for evaluating RAG performance


## Step 8: Run Comprehensive RAGAS Evaluation

Now for the main event! We'll run RAGAS evaluation using multiple metrics to get a comprehensive assessment of our RAG system.

### Metrics We're Using:

1. **LLMContextRecall**: How well did retrieval capture relevant information from the knowledge base?
2. **Faithfulness**: Are the responses factually consistent with retrieved contexts?
3. **FactualCorrectness**: Are the factual claims in responses accurate?
4. **ResponseRelevancy**: How relevant are responses to the input questions?
5. **ContextEntityRecall**: Does retrieval capture important entities mentioned in ground truth?
6. **NoiseSensitivity**: How robust is the system to irrelevant context?

⚠️ **Note**: This process uses API calls and may take several minutes to complete.

In [14]:
# Import all the evaluation metrics we'll use
from ragas.metrics import (
    LLMContextRecall, 
    Faithfulness, 
    FactualCorrectness, 
    ResponseRelevancy, 
    ContextEntityRecall, 
    NoiseSensitivity
)
from ragas import evaluate, RunConfig

print("🚀 Starting comprehensive RAGAS evaluation...")
print("📊 Metrics to be computed:")
print("   • LLMContextRecall - Retrieval quality")
print("   • Faithfulness - Factual consistency")  
print("   • FactualCorrectness - Accuracy of claims")
print("   • ResponseRelevancy - Answer relevance")
print("   • ContextEntityRecall - Entity coverage")
print("   • NoiseSensitivity - Robustness to noise")
print()
print("⏱️ This may take several minutes... Please wait...")

# Configure evaluation settings with extended timeout
custom_run_config = RunConfig(timeout=360)  # 6-minute timeout per evaluation

# Run the comprehensive evaluation
baseline_result = evaluate(
    dataset=evaluation_dataset,
    metrics=[
        LLMContextRecall(), 
        Faithfulness(), 
        FactualCorrectness(), 
        ResponseRelevancy(), 
        ContextEntityRecall(), 
        NoiseSensitivity()
    ],
    llm=evaluator_llm,
    run_config=custom_run_config
)

print("🎉 Evaluation completed successfully!")
baseline_result

🚀 Starting comprehensive RAGAS evaluation...
📊 Metrics to be computed:
   • LLMContextRecall - Retrieval quality
   • Faithfulness - Factual consistency
   • FactualCorrectness - Accuracy of claims
   • ResponseRelevancy - Answer relevance
   • ContextEntityRecall - Entity coverage
   • NoiseSensitivity - Robustness to noise

⏱️ This may take several minutes... Please wait...


Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]
LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.

🎉 Evaluation completed successfully!


{'context_recall': 0.7714, 'faithfulness': 0.8888, 'factual_correctness(mode=f1)': 0.3086, 'answer_relevancy': 0.7632, 'context_entity_recall': 0.4553, 'noise_sensitivity(mode=relevant)': 0.3369}

## Step 9: Interpret Your Results

Let's examine the evaluation results in detail. Each metric provides insights into different aspects of your RAG system's performance.

In [17]:
# Display detailed results with interpretation
print("📊 DETAILED EVALUATION RESULTS")
print("=" * 50)
print()

# Display the results
baseline_result

📊 DETAILED EVALUATION RESULTS



{'context_recall': 0.7714, 'faithfulness': 0.8888, 'factual_correctness(mode=f1)': 0.3086, 'answer_relevancy': 0.7632, 'context_entity_recall': 0.4553, 'noise_sensitivity(mode=relevant)': 0.3369}

### 📈 Understanding Your RAGAS Scores
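*Metric definitions are based on the official RAGAS documentation*

**Score Interpretation Guide** (rule-of-thumb bands for higher-is-better metrics; for Noise Sensitivity, lower is better):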

- **🟢 0.8-1.0**: Excellent performance
- **🟡 0.6-0.8**: Good performance  
- **🟠 0.4-0.6**: Fair performance - needs improvement
- **🔴 0.0-0.4**: Poor performance - significant issues
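
The bands above can be applied to the scores from this run with a few lines of code. This is a sketch under two assumptions: a score that sits exactly on a boundary (for example 0.8) is assigned to the higher band, and Noise Sensitivity is reported separately because lower is better for that metric. The helper name `score_band` is made up for illustration.

```python
def score_band(score):
    """Map a higher-is-better score in [0, 1] to one of the rule-of-thumb bands."""
    if score >= 0.8:
        return "excellent"
    if score >= 0.6:
        return "good"
    if score >= 0.4:
        return "fair"
    return "poor"


# Scores copied from the evaluation output of this notebook
results = {
    "context_recall": 0.7714,
    "faithfulness": 0.8888,
    "factual_correctness(mode=f1)": 0.3086,
    "answer_relevancy": 0.7632,
    "context_entity_recall": 0.4553,
    "noise_sensitivity(mode=relevant)": 0.3369,
}

for name, score in results.items():
    if name.startswith("noise_sensitivity"):
        # Lower is better for this metric, so the bands do not apply
        print(f"{name}: {score:.4f} (lower is better)")
    else:
        print(f"{name}: {score:.4f} -> {score_band(score)}")
```

Read this way, faithfulness is the strongest result of the run, while factual correctness is the metric most in need of attention.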

---

## **Detailed Metric Explanations** 📊

### **1. [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/)** 🎯
*Evaluates the retriever's ability to rank relevant chunks higher than irrelevant ones*

**What it measures:** How well your system prioritizes relevant documents at the top of retrieval results.

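**Formula:** `Precision@k = (Relevant chunks in top k) / (Total chunks in top k)`, and `Context Precision@K = (Sum of Precision@k over the ranks k that hold a relevant chunk) / (Number of relevant chunks in top K)`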

**Score meaning:**
- **High scores (>0.8)** = Relevant documents appear at the top of search results
- **Low scores (<0.5)** = Important documents buried under irrelevant ones

**Real example from RAGAS docs:**
- Query: "Where is the Eiffel Tower located?"
- Good retrieval: ["Eiffel Tower is in Paris", "Brandenburg Gate is in Berlin"] → Score: 0.99
- Poor retrieval: ["Brandenburg Gate is in Berlin", "Eiffel Tower is in Paris"] → Score: 0.50
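
The two scores in this example can be reproduced by hand. The sketch below assumes the relevance of each chunk is already known as a 0/1 flag in rank order; in RAGAS that judgment comes from an LLM. The function name `context_precision` is chosen for illustration.

```python
def context_precision(relevance):
    """Average Precision@k over the ranks that hold a relevant chunk.

    relevance: list of 0/1 flags, one per retrieved chunk, in rank order.
    """
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # Precision@k, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0


print(context_precision([1, 0]))  # relevant chunk ranked first
print(context_precision([0, 1]))  # relevant chunk ranked second
```

The first case evaluates to exactly 1.0 here; the 0.99 quoted above is presumably the same value after rounding or a small smoothing term in the library.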

---

### **2. [Context Recall (LLMContextRecall)](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/)** 🔍
*Measures how many relevant documents were successfully retrieved*

**What it measures:** Whether your system finds all the important information available in your knowledge base.

**Formula:** `Context Recall = (Claims in reference supported by retrieved context) / (Total claims in reference)`

**Score meaning:**
- **High scores (>0.8)** = Your system finds most relevant information
- **Low scores (<0.5)** = Missing important context from knowledge base

**Key insight:** This metric focuses on "not missing anything important" - essential for comprehensive answers.

---

### **3. [Context Entity Recall](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_entities_recall/)** 👥
*Measures recall of important entities (people, places, dates, etc.)*

**What it measures:** How well your retrieval captures key entities mentioned in the ground truth.

**Formula:** `Entity Recall = (Common entities between retrieved & reference) / (Total entities in reference)`

**Score meaning:**
- **High scores (>0.8)** = Important entities (names, dates, places) are retrieved
- **Low scores (<0.5)** = Missing key entities from context

**Use case:** Especially important for fact-based applications like tourism help desks, historical QA.

**Example from RAGAS:**
- Reference entities: ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
- Retrieved entities: ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
- Score: 4/6 = 0.67
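
The arithmetic in this example is a set intersection. A minimal sketch, assuming the entities have already been extracted (RAGAS uses an LLM for that step):

```python
reference_entities = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
retrieved_entities = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}

# Entities that appear in both the reference and the retrieved context
common = reference_entities & retrieved_entities
entity_recall = len(common) / len(reference_entities)

print(sorted(common))
print(round(entity_recall, 2))
```

Note that `India` does not lower the score: extra entities in the retrieved context are ignored, and only missing reference entities (`Yamuna`, `1631`) count against recall.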

---

### **4. [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/)** 🤝
*Measures how factually consistent responses are with retrieved context*

**What it measures:** Whether your RAG system sticks to the facts or "hallucinates" information.

**Formula:** `Faithfulness = (Claims supported by context) / (Total claims in response)`

**Score meaning:**
- **High scores (>0.8)** = Responses stick to retrieved facts
- **Low scores (<0.5)** = System making unsupported claims or hallucinating

**Example from RAGAS:**
- Context: "Albert Einstein (born 14 March 1879) was a German-born theoretical physicist"
- Good response: "Einstein was born in Germany on 14th March 1879" → Score: 1.0
- Poor response: "Einstein was born in Germany on 20th March 1879" → Score: 0.5
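
The 0.5 in the poor response comes from splitting it into two claims, of which only one is supported. The sketch below hard-codes the claims and verdicts by hand, assuming the context supports the birthplace claim; in RAGAS both the claim extraction and the verdicts come from the judge LLM.

```python
def faithfulness_score(claim_verdicts):
    """claim_verdicts: dict mapping each claim in the response to True if the context supports it."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts.values()) / len(claim_verdicts)


good_response = {
    "Einstein was born in Germany": True,
    "Einstein was born on 14th March 1879": True,
}
poor_response = {
    "Einstein was born in Germany": True,
    "Einstein was born on 20th March 1879": False,  # the date contradicts the context
}

print(faithfulness_score(good_response))
print(faithfulness_score(poor_response))
```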

---

### **5. [Answer Relevancy (ResponseRelevancy)](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/)** 📌
*Measures how relevant responses are to the input questions*

**What it measures:** Whether your system directly answers what was asked.

**How it works:** 
1. Generate artificial questions from the response
2. Calculate the cosine similarity between the embedding of the original question and the embedding of each generated question, then average the results
3. Higher similarity = more relevant response

**Score meaning:**
- **High scores (>0.8)** = Responses directly answer what's asked
- **Low scores (<0.5)** = Responses are off-topic or incomplete

**Example:**
- Question: "Where is France and what is its capital?"
- Poor answer: "France is in western Europe" → Lower relevancy
- Good answer: "France is in western Europe and Paris is its capital" → Higher relevancy
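
The similarity step can be illustrated with plain cosine similarity. The vectors below are made-up 3-dimensional stand-ins for real embeddings, chosen so that questions generated from the complete answer point in nearly the same direction as the original question; the function names are also illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevancy(question_vec, generated_vecs):
    """Mean similarity between the original question and each generated question."""
    return sum(cosine_similarity(question_vec, g) for g in generated_vecs) / len(generated_vecs)


# Toy embeddings: axis 0 ~ "location", axis 1 ~ "capital" (purely illustrative)
question = [1.0, 1.0, 0.0]
from_good_answer = [[1.0, 1.0, 0.0], [0.9, 1.0, 0.1]]
from_poor_answer = [[1.0, 0.0, 0.0], [1.0, 0.1, 0.0]]

print(round(relevancy(question, from_good_answer), 3))
print(round(relevancy(question, from_poor_answer), 3))
```

The incomplete answer scores lower because the questions it could have answered cover only the location part of the original question.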

---

### **6. [Noise Sensitivity](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/noise_sensitivity/)** 🔇
*Measures how often the system makes errors when using irrelevant context*

**What it measures:** System robustness to irrelevant or noisy retrieved documents.

**Formula:** `Noise Sensitivity = (Incorrect claims in response) / (Total claims in response)`

**Score meaning (LOWER is better):**
- **Low scores (<0.2)** = Handles irrelevant context well
- **High scores (>0.5)** = Gets confused by noise in retrieved documents

**Key insight:** Unlike other metrics, **lower scores indicate better performance** for noise sensitivity.

---

## **Improvement Strategies by Score Pattern** 🚀

**If Context Precision/Recall are Low:**
- Improve embedding models or add hybrid search
- Tune retrieval parameters (similarity thresholds, chunk size)
- Expand knowledge base coverage

**If Faithfulness is Low:**
- Add explicit instructions to stick to retrieved context
- Implement fact-checking components  
- Use stronger grounding techniques

**If Response Relevancy is Low:**
- Improve query understanding and processing
- Add query classification for better routing
- Enhance prompt engineering

**If Noise Sensitivity is High:**
- Improve context ranking and filtering
- Add reranking components after retrieval
- Implement context compression techniques

## 🎓 Congratulations!

You've successfully completed a comprehensive RAGAS evaluation of your RAG system! 

**What you've accomplished:**

✅ **Set up RAGAS framework** - Imported and configured the evaluation toolkit  
✅ **Tested RAG systems** - Compared Hybrid vs Naive RAG approaches  
✅ **Prepared evaluation data** - Formatted synthetic test data for systematic assessment  
✅ **Ran comprehensive metrics** - Evaluated 6 key performance dimensions  
✅ **Interpreted results** - Learned how to understand and act on RAGAS scores  
