# RAGAS Evaluation Study: Retrieval-Augmented Generation Assessment

This study demonstrates the implementation and analysis of **RAGAS** (Retrieval-Augmented Generation Assessment) for evaluating RAG systems using EPAM DIAL models.

## Study Objectives

This document covers the following key areas:

1. **System Setup** - Configuration of EPAM DIAL API integration
2. **Dataset Preparation** - Creation and loading of evaluation datasets
3. **Model Configuration** - Setup of LLM and embedding models
4. **Evaluation Execution** - Implementation of RAGAS metrics
5. **Results Analysis** - Interpretation and export of evaluation results

## RAGAS Metrics Framework

The following metrics will be evaluated in this study:

- **Context Recall** - Measures the completeness of retrieved context relative to ground truth
- **Context Precision** - Evaluates the relevance of retrieved context to the query
- **Faithfulness** - Assesses how well answers are grounded in the provided context
- **Answer Correctness** - Compares answer accuracy against ground truth references
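The four metrics above can be read as simple ratios once the per-claim judgments are in hand. The sketch below is a simplified reading of the definitions in the RAGAS documentation, not the library's own code. The judgments (which claims are supported, which chunks are relevant) are passed in by hand, whereas RAGAS obtains them from the LLM. The function names are hypothetical, and the 0.75/0.25 weighting is assumed to be the library default.

```python
from typing import Sequence, Tuple


def claim_ratio(verdicts: Sequence[bool]) -> float:
    """Fraction of claims judged as supported.

    Context Recall: claims come from the ground truth, checked against the context.
    Faithfulness:   claims come from the answer, checked against the context.
    """
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def context_precision_at_k(relevant: Sequence[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at a relevant rank
    return weighted / hits if hits else 0.0


def answer_correctness_score(
    tp: int,
    fp: int,
    fn: int,
    similarity: float,
    weights: Tuple[float, float] = (0.75, 0.25),  # assumed default weighting
) -> float:
    """Weighted blend of a claim-level F1 score and an embedding similarity."""
    denominator = tp + 0.5 * (fp + fn)
    f1 = tp / denominator if denominator else 0.0
    return weights[0] * f1 + weights[1] * similarity


# Two of four ground-truth claims are covered by the retrieved context.
recall = claim_ratio([True, True, False, False])
# Relevant chunks sit at ranks 1 and 3; rank 2 is noise.
precision = context_precision_at_k([True, False, True])
```

Because precision@k is only counted at relevant ranks, a relevant chunk that appears late in the ranking lowers Context Precision even when nothing irrelevant is added after it.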


In [None]:
# Import Required Libraries and Dependencies
import pandas as pd
import os
from dotenv import load_dotenv
from datasets import Dataset

# Import custom utilities for RAGAS evaluation
from utils import (
    create_ragas_dataset,
    create_langchain_llm, 
    create_langchain_embeddings
)

# Import RAGAS evaluation framework
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision, 
    faithfulness,
    answer_correctness
)

print("Library imports completed successfully.")
print("Dependencies loaded:")
print("  - pandas: Data manipulation and analysis")
print("  - datasets: HuggingFace dataset handling")
print("  - ragas: RAG evaluation framework")
print("  - utils: Custom EPAM DIAL integration modules")


Library imports completed successfully.
Dependencies loaded:
  - pandas: Data manipulation and analysis
  - datasets: HuggingFace dataset handling
  - ragas: RAG evaluation framework
  - utils: Custom EPAM DIAL integration modules


## Dataset Preparation and Loading

This section demonstrates the loading of a diverse evaluation dataset designed to test various RAG system behaviors. The dataset includes carefully crafted examples that demonstrate different combinations of RAGAS metrics.

### Dataset Design Principles

The evaluation dataset is structured to test the following scenarios:

1. **Perfect Performance** - Complete context with accurate answers
2. **Context Recall Issues** - Missing important information in retrieved context
3. **Context Precision Problems** - Irrelevant information retrieved
4. **Faithfulness Violations** - Answers not grounded in provided context
5. **Answer Correctness Errors** - Incorrect answers despite good context
6. **Mixed Scenarios** - Various combinations of the above issues
7. **Partial Context Recall** - Some relevant information missing
8. **High Precision, Low Recall** - Highly relevant but incomplete context

This design allows for comprehensive evaluation of RAG system performance across different failure modes.

The dataset contains 8 hand-crafted examples, one per scenario, and each is designed to produce a distinct score pattern:

### Dataset Scenarios:
1. **PERFECT SCORES** - All metrics should be high
2. **LOW CONTEXT RECALL** - Missing important information
3. **LOW CONTEXT PRECISION** - Irrelevant context retrieved
4. **LOW FAITHFULNESS** - Answer not grounded in context (hallucination)
5. **LOW ANSWER CORRECTNESS** - Wrong answer despite good context
6. **MIXED SCENARIO** - Good context but partial answer
7. **PARTIAL CONTEXT RECALL** - Some relevant info missing
8. **HIGH PRECISION, LOW RECALL** - Very relevant but incomplete

Comparing these expected patterns with the measured scores shows how each RAGAS metric behaves in different scenarios.
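The loader `create_ragas_dataset` comes from the local `utils` module and is not shown in this notebook. As a rough sketch, a check like the one below could be run on each record before evaluation. It assumes the five column names that the notebook prints for the dataset. The helper `validate_record` is hypothetical, and the `context`, `ground_truth`, and `retrieved_contexts` values in the sample are placeholders rather than the study's data.

```python
from typing import Any, Dict, List

# Column names as reported by the notebook for the loaded dataset.
REQUIRED_COLUMNS = ("question", "answer", "context", "ground_truth", "retrieved_contexts")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one record; an empty list means it looks usable."""
    problems = []
    for column in REQUIRED_COLUMNS:
        if column not in record:
            problems.append(f"missing column: {column}")

    # Text fields must be non-empty strings when present.
    for column in ("question", "answer", "ground_truth"):
        if column in record:
            value = record[column]
            if not isinstance(value, str) or not value.strip():
                problems.append(f"{column} must be a non-empty string")

    # Retrieved contexts must be a non-empty list of strings when present.
    if "retrieved_contexts" in record:
        contexts = record["retrieved_contexts"]
        if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
            problems.append("retrieved_contexts must be a list of strings")
        elif not contexts:
            problems.append("retrieved_contexts is empty")
    return problems


sample = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "context": "placeholder context text",          # hypothetical value
    "ground_truth": "placeholder reference answer",  # hypothetical value
    "retrieved_contexts": ["placeholder context text"],
}
print(validate_record(sample))
```

Running the check before calling `evaluate` surfaces schema problems locally, before any API calls are spent on a malformed record.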


In [19]:
# Load our diverse dataset
print("Information: Loading Diverse Evaluation Dataset...")
print("=" * 50)

# Create the dataset directly as RAGAS-compatible format
ragas_dataset = create_ragas_dataset()

print(f"Dataset shape: {ragas_dataset.shape}")
print(f"Columns: {ragas_dataset.column_names}")
print("\nFirst few rows:")
print(ragas_dataset.to_pandas().head())

print("\nInformation: Dataset Summary:")
print(f"  - Total examples: {len(ragas_dataset)}")
print(f"  - Questions: {len(set(ragas_dataset['question']))} unique")
print(f"  - Average answer length: {sum(len(a) for a in ragas_dataset['answer']) / len(ragas_dataset):.1f} characters")
print(f"  - Average context length: {sum(len(c) for c in ragas_dataset['context']) / len(ragas_dataset):.1f} characters")

print(f"\nInformation: Dataset Scenarios:")
scenarios = [
    "1. PERFECT SCORES - All metrics should be high",
    "2. LOW CONTEXT RECALL - Missing important information", 
    "3. LOW CONTEXT PRECISION - Irrelevant context retrieved",
    "4. LOW FAITHFULNESS - Answer not grounded in context (hallucination)",
    "5. LOW ANSWER CORRECTNESS - Wrong answer despite good context",
    "6. MIXED SCENARIO - Good context but partial answer",
    "7. PARTIAL CONTEXT RECALL - Some relevant info missing",
    "8. HIGH PRECISION, LOW RECALL - Very relevant but incomplete"
]

for scenario in scenarios:
    print(f"  {scenario}")

print(f"\nStatus: Success Diverse dataset ready for RAGAS evaluation!")


Information: Loading Diverse Evaluation Dataset...
Dataset shape: (8, 5)
Columns: ['question', 'answer', 'context', 'ground_truth', 'retrieved_contexts']

First few rows:
                                  question  \
0           What is the capital of France?   
1  What are the main ingredients in pizza?   
2            How does photosynthesis work?   
3         What is the population of Tokyo?   
4              Who wrote Romeo and Juliet?   

                                              answer  \
0                    The capital of France is Paris.   
1  Pizza contains dough, tomato sauce, cheese, an...   
2  Plants use sunlight, water, and carbon dioxide...   
3  Tokyo has approximately 50 million people and ...   
4    Charles Dickens wrote Romeo and Juliet in 1850.   

                                             context  \
0  France is a country located in Western Europe....   
1                   Pizza is a popular Italian dish.   
2  Cooking is a great hobby. Many people enjoy 

## Configure LangChain Models for RAGAS

This section creates the LangChain wrappers for the EPAM DIAL models that RAGAS uses during evaluation.


In [20]:
# Configure our EPAM DIAL models for RAGAS
print("Information: Configuring LangChain Models for RAGAS...")
print("=" * 50)

# Create LangChain LLM wrapper (judge model for all four metrics)
# Using original deployments that have access
langchain_llm = create_langchain_llm(deployment_name="gpt-4.1-mini-2025-04-14")
print("Status: Success LangChain LLM configured: gpt-4.1-mini-2025-04-14")

# Create LangChain Embedding wrapper (semantic similarity part of Answer Correctness)
# Using original deployments that have access
langchain_embeddings = create_langchain_embeddings(deployment_name="text-embedding-3-small-1")
print("Status: Success LangChain Embeddings configured: text-embedding-3-small-1")

print("\nInformation: Model Configuration Complete!")
print("These LangChain wrappers will be used for:")
print("  - LangChain LLM: Context Recall, Context Precision, Faithfulness & Answer Correctness evaluation")
print("  - LangChain Embeddings: Answer Correctness (semantic similarity) evaluation")


Information: Configuring LangChain Models for RAGAS...
Status: Success LangChain LLM configured: gpt-4.1-mini-2025-04-14
Status: Success LangChain Embeddings configured: text-embedding-3-small-1

Information: Model Configuration Complete!
These LangChain wrappers will be used for:
  - LangChain LLM: Context Recall, Context Precision, Faithfulness & Answer Correctness evaluation
  - LangChain Embeddings: Answer Correctness (semantic similarity) evaluation
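The helpers `create_langchain_llm` and `create_langchain_embeddings` are defined in `utils` and are not reproduced here. The sketch below shows one way such a helper might be built. It assumes DIAL is reached through an Azure OpenAI-compatible endpoint and that `langchain_openai` is installed. The environment variable names, `DialSettings`, `load_dial_settings`, and `build_llm` are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names; use whatever your .env file actually defines.
REQUIRED_VARS = ("DIAL_ENDPOINT", "DIAL_API_KEY", "DIAL_API_VERSION")


@dataclass(frozen=True)
class DialSettings:
    endpoint: str
    api_key: str
    api_version: str


def load_dial_settings(env: Mapping[str, str] = os.environ) -> DialSettings:
    """Read connection settings and fail early with a clear message if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing DIAL settings: {', '.join(missing)}")
    return DialSettings(
        endpoint=env["DIAL_ENDPOINT"],
        api_key=env["DIAL_API_KEY"],
        api_version=env["DIAL_API_VERSION"],
    )


def build_llm(settings: DialSettings, deployment_name: str):
    """Create a chat model wrapper for one deployment (sketch, not the utils code)."""
    # Imported lazily so the settings check above works without langchain installed.
    from langchain_openai import AzureChatOpenAI

    return AzureChatOpenAI(
        azure_endpoint=settings.endpoint,
        azure_deployment=deployment_name,
        api_key=settings.api_key,
        api_version=settings.api_version,
        temperature=0,  # keep judge outputs as repeatable as possible
    )
```

Checking the settings before constructing the client turns a missing key into an immediate, readable error instead of a failure partway through the evaluation run.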


## Run RAGAS Evaluation

This section runs the RAGAS evaluation using the configured models and dataset.


In [21]:
# Run RAGAS evaluation
print("Information: Running RAGAS Evaluation...")
print("=" * 40)

# Define the metrics we want to evaluate
metrics = [
    context_recall,
    context_precision,
    faithfulness,
    answer_correctness]

print("Information: Evaluating metrics:")
for metric in metrics:
    print(f"  - {metric.__class__.__name__}")

print("\nInformation: Starting evaluation (this may take a few minutes)...")

# Run the evaluation
try:
    result = evaluate(
        ragas_dataset,
        metrics=metrics,
        llm=langchain_llm,
        embeddings=langchain_embeddings
    )
    
    print("Status: Success Evaluation completed successfully!")
    
except Exception as e:
    print(f"Error: Evaluation failed: {e}")
    print("This might be due to API access restrictions or model availability.")


Information: Running RAGAS Evaluation...
Information: Evaluating metrics:
  - ContextRecall
  - ContextPrecision
  - Faithfulness
  - AnswerCorrectness

Information: Starting evaluation (this may take a few minutes)...


Evaluating:   0%|          | 0/32 [00:00<?, ?it/s]

Status: Success Evaluation completed successfully!


## Analyze Results & Understand Metric Variations

This section analyzes the evaluation results and what they mean for our RAG system. Because the dataset is diverse, the scores should vary by example in ways that show how each RAGAS metric works.

### Expected Results by Scenario:
- **Example 1 (France)**: High scores across all metrics (perfect scenario)
- **Example 2 (Pizza)**: Low Context Recall (missing ingredient details)
- **Example 3 (Photosynthesis)**: Low Context Precision (irrelevant cooking info)
- **Example 4 (Tokyo)**: Low Faithfulness (hallucinated population data)
- **Example 5 (Romeo & Juliet)**: Low Answer Correctness (wrong author)
- **Example 6 (Exercise)**: Mixed scores (good context, partial answer)
- **Example 7 (Machine Learning)**: Low Context Recall (missing details)
- **Example 8 (Speed of Light)**: High Precision, Low Recall (precise but minimal)
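The expectations above can be checked against measured scores mechanically. The sketch below flags metrics that fall below a cut-off and compares them with the metrics a scenario was designed to depress. The 0.5 threshold and the helper names are assumptions for illustration. The Example 2 values are copied from this notebook's recorded output.

```python
from typing import Dict, List, Set

METRICS = ("context_recall", "context_precision", "faithfulness", "answer_correctness")


def low_metrics(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Names of the metrics whose score is below the threshold."""
    return {name for name in METRICS if scores[name] < threshold}


def compare_with_design(
    scores: Dict[str, float],
    expected_low: Set[str],
    threshold: float = 0.5,  # assumed cut-off between "low" and "high"
) -> Dict[str, List[str]]:
    """Report where measured scores deviate from the scenario's intended pattern."""
    observed_low = low_metrics(scores, threshold)
    return {
        "unexpected_low": sorted(observed_low - expected_low),
        "unexpected_high": sorted(expected_low - observed_low),
    }


# Example 2 (Pizza): designed to lower Context Recall only.
example_2 = {
    "context_recall": 0.000,
    "context_precision": 0.000,
    "faithfulness": 0.000,
    "answer_correctness": 0.941,
}
report = compare_with_design(example_2, expected_low={"context_recall"})
print(report)
```

For Example 2 the report lists Context Precision and Faithfulness as unexpectedly low, which suggests that a sparse context affects more than the one metric the scenario targets.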


In [22]:
# Analyze the results with detailed explanations
print("Information: Analyzing RAGAS Results...")
print("=" * 50)

try:
    # Display the results
    print("Status: Success Overall Scores:")
    print(result)
    
    # Convert to DataFrame for better analysis
    results_df = result.to_pandas()
    
    print("\nInformation: Detailed Results by Example:")
    print("-" * 50)
    
    # Add scenario descriptions to results
    scenarios = [
        "1. PERFECT SCORES - All metrics should be high",
        "2. LOW CONTEXT RECALL - Missing important information", 
        "3. LOW CONTEXT PRECISION - Irrelevant context retrieved",
        "4. LOW FAITHFULNESS - Answer not grounded in context (hallucination)",
        "5. LOW ANSWER CORRECTNESS - Wrong answer despite good context",
        "6. MIXED SCENARIO - Good context but partial answer",
        "7. PARTIAL CONTEXT RECALL - Some relevant info missing",
        "8. HIGH PRECISION, LOW RECALL - Very relevant but incomplete"
    ]
    
    for i, (idx, row) in enumerate(results_df.iterrows()):
        print(f"\nExample {i+1}: {row['user_input']}")
        print(f"Scenario: {scenarios[i]}")
        print(f"Context Recall: {row['context_recall']:.3f}")
        print(f"Context Precision: {row['context_precision']:.3f}")
        print(f"Faithfulness: {row['faithfulness']:.3f}")
        print(f"Answer Correctness: {row['answer_correctness']:.3f}")
    
    # Calculate average scores
    print("\nInformation: Average Scores Across All Examples:")
    print("-" * 50)
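    for metric in metrics:
        # metric.name matches the DataFrame column (e.g. "context_recall");
        # the lowercased class name ("contextrecall") would never match
        metric_name = metric.name
        if metric_name in results_df.columns:
            avg_score = results_df[metric_name].mean()
            print(f"  - {metric_name}: {avg_score:.3f}")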
    
    # Interpret the results
    print("\nInformation: Metric Interpretation Guide:")
    print("-" * 50)
    print("  - Scores range from 0 to 1 (higher is better)")
    print("  - Context Recall: How complete is the retrieved context?")
    print("  - Context Precision: How relevant is the retrieved context?")
    print("  - Faithfulness: How well answers are grounded in context?")
    print("  - Answer Correctness: How accurate is the answer vs ground truth?")
    
    print("\nInformation: Analysis Summary")
    print("-" * 30)
    print("  - Look for patterns in the scores across different scenarios")
    print("  - Notice how different types of problems affect different metrics")
    print("  - Use these insights to improve your RAG system!")
    
except NameError:
    print("Error: No results available. Evaluation may have failed.")
    print("Please check your API configuration and try again.")
except Exception as e:
    print(f"Error: Error analyzing results: {e}")


Information: Analyzing RAGAS Results...
Status: Success Overall Scores:
{'context_recall': 0.5000, 'context_precision': 0.6250, 'faithfulness': 0.4375, 'answer_correctness': 0.6004}

Information: Detailed Results by Example:
--------------------------------------------------

Example 1: What is the capital of France?
Scenario: 1. PERFECT SCORES - All metrics should be high
Context Recall: 1.000
Context Precision: 1.000
Faithfulness: 1.000
Answer Correctness: 0.862

Example 2: What are the main ingredients in pizza?
Scenario: 2. LOW CONTEXT RECALL - Missing important information
Context Recall: 0.000
Context Precision: 0.000
Faithfulness: 0.000
Answer Correctness: 0.941

Example 3: How does photosynthesis work?
Scenario: 3. LOW CONTEXT PRECISION - Irrelevant context retrieved
Context Recall: 0.000
Context Precision: 0.000
Faithfulness: 0.000
Answer Correctness: 0.960

Example 4: What is the population of Tokyo?
Scenario: 4. LOW FAITHFULNESS - Answer not grounded in context (hallucinatio

## Export Results

This section saves the results to a CSV file for further analysis and reporting.


In [23]:
# Export results to CSV
print("Information: Exporting Results...")
print("=" * 30)

try:
    results_df.to_csv('ragas_evaluation_results.csv', index=False)
    print("Status: Success Complete evaluation saved to 'ragas_evaluation_results.csv'")
    
    print(f"\nInformation: File created:")
    print(f"  - ragas_evaluation_results.csv ({len(results_df)} rows)")
    print(f"  - Contains: questions, answers, contexts, ground truth, and all RAGAS scores")
    
except NameError:
    print("Error: No results to export. Please run the evaluation first.")
except Exception as e:
    print(f"Error: Error exporting results: {e}")

Information: Exporting Results...
Status: Success Complete evaluation saved to 'ragas_evaluation_results.csv'

Information: File created:
  - ragas_evaluation_results.csv (8 rows)
  - Contains: questions, answers, contexts, ground truth, and all RAGAS scores
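Once exported, the scores can be re-read without pandas or RAGAS installed, for example in a lightweight reporting script. The sketch below uses only the standard library. It assumes the metric columns carry the names shown in the overall scores and that missing values were written as empty cells. `mean_scores` is a hypothetical helper.

```python
import csv
from statistics import mean
from typing import Dict, Iterable


def mean_scores(csv_path: str, metric_columns: Iterable[str]) -> Dict[str, float]:
    """Average each metric column of an exported results file, skipping empty cells."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    averages = {}
    for column in metric_columns:
        values = [float(row[column]) for row in rows if row.get(column) not in (None, "")]
        if values:  # columns that are absent or fully empty are left out
            averages[column] = mean(values)
    return averages


# Usage, assuming the export above has been run in the working directory:
# mean_scores("ragas_evaluation_results.csv",
#             ["context_recall", "context_precision", "faithfulness", "answer_correctness"])
```

Skipping empty cells keeps one failed metric computation from turning the whole column average into an error.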


## Summary & Key Learnings

Congratulations! You've successfully set up and run a RAGAS evaluation using EPAM DIAL models with a **diverse dataset** that demonstrates different metric combinations.

### Study Results
1. **Connected to EPAM DIAL API** using Azure OpenAI endpoints
2. **Created LangChain wrappers** for LLM and embedding models
3. **Loaded diverse evaluation dataset** with 8 carefully crafted scenarios
4. **Ran RAGAS metrics** for comprehensive RAG evaluation
5. **Analyzed metric variations** across different problem types
6. **Exported results** for further analysis

### RAGAS Metrics Demonstrated:
The examples cited below are the ones designed to trigger each pattern. Measured scores can differ from the design: Examples 2 and 3 each scored 0.000 on Context Recall, Context Precision, and Faithfulness.

- **Context Recall**: How complete is the retrieved context?
  - *Low scores*: Missing important information (Examples 2, 7, 8)
  - *High scores*: Complete relevant information (Example 1)

- **Context Precision**: How relevant is the retrieved context?
  - *Low scores*: Irrelevant information retrieved (Example 3)
  - *High scores*: Highly relevant context (Examples 1, 8)

- **Faithfulness**: Is the answer grounded in context?
  - *Low scores*: Hallucinated information (Example 4)
  - *High scores*: Well-grounded answers (Examples 1, 6)

- **Answer Correctness**: How accurate is the answer vs ground truth?
  - *Low scores*: Wrong answers despite good context (Example 5)
  - *High scores*: Accurate answers (Examples 1, 8)

### Recommendations
1. **Real Dataset**: Replace the hand-crafted examples with data from your actual RAG system
2. **More Metrics**: Add additional RAGAS metrics like `answer_relevancy`
3. **Batch Evaluation**: Evaluate larger datasets with real retrieval systems
4. **Monitoring**: Set up regular evaluation pipelines
5. **Optimization**: Use results to improve your RAG system's retrieval and generation
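For the monitoring recommendation, a regular pipeline needs a pass/fail rule on top of the raw scores. The sketch below is one minimal form of such a gate. The thresholds are placeholders to be tuned per project and `failing_metrics` is a hypothetical helper. The overall scores are the ones reported in this study.

```python
from typing import Dict, Optional, Tuple


def failing_metrics(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
) -> Dict[str, Tuple[Optional[float], float]]:
    """Map each failing metric to (measured score, required minimum).

    A metric with no measured score counts as failing.
    """
    failures = {}
    for name, minimum in thresholds.items():
        measured = scores.get(name)
        if measured is None or measured < minimum:
            failures[name] = (measured, minimum)
    return failures


# Overall scores reported by the evaluation in this study.
overall = {
    "context_recall": 0.5000,
    "context_precision": 0.6250,
    "faithfulness": 0.4375,
    "answer_correctness": 0.6004,
}
# Placeholder thresholds; choose values that fit your own quality bar.
minimums = {name: 0.6 for name in overall}

failures = failing_metrics(overall, minimums)
for name, (measured, minimum) in sorted(failures.items()):
    print(f"{name}: {measured} is below the required {minimum}")
```

In a scheduled job, a non-empty result could be turned into a non-zero exit code so that a regression in retrieval or generation quality blocks a release.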

### Resources:
- [RAGAS Documentation](https://docs.ragas.io/)
- [EPAM DIAL Documentation](https://dial.epam.com/)
- [LangChain Azure Integration](https://python.langchain.com/docs/integrations/llms/azure_openai)
