# 🎯 RAGAS Evaluation Demo with EPAM DIAL

This notebook demonstrates how to use **RAGAS** (Retrieval-Augmented Generation Assessment) to evaluate RAG systems using EPAM DIAL models.

## 📋 What We'll Learn:
1. **Setup** - Connect to EPAM DIAL API
2. **Dataset** - Load our fake evaluation dataset
3. **Models** - Configure LLM and embedding models
4. **Evaluation** - Run RAGAS metrics
5. **Results** - Analyze and export results

## 🔧 RAGAS Metrics We'll Evaluate:
- **Context Recall** - How complete is the retrieved context?
- **Context Precision** - How relevant is the retrieved context?
- **Faithfulness** - Is the answer grounded in the context?
- **Answer Correctness** - How accurate is the answer compared to ground truth?
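
To make the scoring idea concrete: Faithfulness and Context Recall both reduce to a simple ratio once each statement has been judged as supported or not. The sketch below assumes those verdicts are already available (in RAGAS they come from LLM prompts, which this example does not reproduce). `supported_ratio` is a hypothetical helper, not part of the RAGAS API, and the verdict lists are made up for illustration.

```python
from typing import Sequence


def supported_ratio(verdicts: Sequence[bool]) -> float:
    """Fraction of statements judged as supported; 0.0 for an empty list."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Faithfulness: are the claims made in the answer supported by the context?
# Illustrative verdicts only; these are not taken from the notebook's run.
answer_claim_verdicts = [True, True, False, True]
faithfulness_score = supported_ratio(answer_claim_verdicts)

# Context recall: can the ground-truth claims be found in the retrieved context?
ground_truth_claim_verdicts = [True, True]
context_recall_score = supported_ratio(ground_truth_claim_verdicts)

print(f"faithfulness   = {faithfulness_score:.2f}")
print(f"context recall = {context_recall_score:.2f}")
```

Context Precision and Answer Correctness combine several signals (ranking of relevant chunks, factual overlap, and semantic similarity), so they do not fit this one-line ratio.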


In [1]:
# Step 1: Import Required Libraries
import pandas as pd
import os
from dotenv import load_dotenv
from datasets import Dataset

# Import our custom utilities
from utils import (
    create_fake_ragas_dataset, 
    create_ragas_dataset,
    create_langchain_llm, 
    create_langchain_embeddings
)

# Import RAGAS components
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision, 
    faithfulness,
    answer_correctness
)

print("[SUCCESS] All imports successful!")
print("[INFO] Libraries loaded:")
print("  - pandas for data handling")
print("  - datasets for RAGAS format")
print("  - utils for EPAM DIAL integration") 
print("  - ragas for evaluation metrics")


[SUCCESS] All imports successful!
[INFO] Libraries loaded:
  - pandas for data handling
  - datasets for RAGAS format
  - utils for EPAM DIAL integration
  - ragas for evaluation metrics


## 📊 Step 2: Load Evaluation Dataset

Let's load our fake dataset, which holds the fields RAGAS evaluates: the question, the generated answer, the retrieved context, and the ground truth.
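
For orientation, one row of such a dataset could look like the sketch below. The column names are the ones printed by the dataset cell in this notebook. The question and answer echo its first row, while the context and ground-truth strings are stand-ins written for this example, and `missing_columns` is a hypothetical helper rather than something from `utils`.

```python
# Column names as printed by the dataset cell in this notebook.
EXPECTED_COLUMNS = ("question", "answer", "context", "ground_truth", "retrieved_contexts")

# One illustrative row. The question and answer echo the notebook's first row;
# the context and ground-truth strings are stand-ins written for this sketch.
example_row = {
    "question": "What is the capital of France?",
    "answer": "The capital of France is Paris.",
    "context": "Paris is the capital of France.",
    "ground_truth": "Paris",
    "retrieved_contexts": ["Paris is the capital of France."],
}


def missing_columns(row: dict) -> list:
    """Return the expected column names that are absent from a row."""
    return [name for name in EXPECTED_COLUMNS if name not in row]


print(missing_columns(example_row))
print(missing_columns({"question": "Only a question"}))
```

A check like this is cheap to run before an evaluation, since a missing column otherwise tends to surface only after the metric calls have started.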


In [None]:
# Load our fake dataset
print("[INFO] Loading Evaluation Dataset...")
print("=" * 40)

# Create the dataset directly as RAGAS-compatible format
ragas_dataset = create_ragas_dataset()

print(f"Dataset shape: {ragas_dataset.shape}")
print(f"Columns: {ragas_dataset.column_names}")
print("\nFirst few rows:")
print(ragas_dataset.to_pandas().head())

print("\n[INFO] Dataset Summary:")
print(f"  - Total examples: {len(ragas_dataset)}")
print(f"  - Questions: {len(set(ragas_dataset['question']))} unique")
print(f"  - Average answer length: {sum(len(a) for a in ragas_dataset['answer']) / len(ragas_dataset):.1f} characters")
print(f"  - Average context length: {sum(len(c) for c in ragas_dataset['context']) / len(ragas_dataset):.1f} characters")

print(f"\n[SUCCESS] Dataset ready for RAGAS evaluation!")


[INFO] Loading Evaluation Dataset...
Dataset shape: (5, 5)
Columns: ['question', 'answer', 'context', 'ground_truth', 'retrieved_contexts']

First few rows:
                                  question  \
0           What is the capital of France?   
1  What are the main ingredients in pizza?   
2            How does photosynthesis work?   
3         What is the population of Tokyo?   
4              Who wrote Romeo and Juliet?   

                                              answer  \
0                    The capital of France is Paris.   
1  Pizza typically contains dough, tomato sauce, ...   
2  Plants use sunlight, water, and carbon dioxide...   
3         Tokyo has approximately 14 million people.   
4  William Shakespeare wrote Romeo and Juliet, on...   

                                             context  \
0  France is a country located in Western Europe....   
1  Pizza is a popular Italian dish consisting of ...   
2  Photosynthesis is the process by which plants ...   
3  To

## 🤖 Step 3: Configure LangChain Models for RAGAS

Now let's create our LangChain wrappers for the EPAM DIAL models that RAGAS will use for evaluation.
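
The `utils` helpers themselves are not shown in this notebook. As a rough sketch of what such a helper might need before it can build a wrapper, the function below collects connection settings from environment variables. The variable names (`DIAL_API_KEY`, `DIAL_ENDPOINT`, `DIAL_API_VERSION`), the default API version, and `load_dial_settings` are all assumptions made for illustration. The returned keys are chosen to resemble the keyword arguments of LangChain's Azure OpenAI classes, but that mapping should be checked against the installed version.

```python
import os
from typing import Mapping, Optional


def load_dial_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect connection settings, failing early if a required one is missing.

    All variable names here are hypothetical; adapt them to your own .env file.
    """
    env = os.environ if env is None else env
    required = ("DIAL_API_KEY", "DIAL_ENDPOINT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_key": env["DIAL_API_KEY"],
        # Strip a trailing slash so later URL joins do not produce "//".
        "azure_endpoint": env["DIAL_ENDPOINT"].rstrip("/"),
        # Hypothetical default; use the version your DIAL deployment expects.
        "api_version": env.get("DIAL_API_VERSION", "2024-02-01"),
    }


# Usage with an explicit mapping, so the example runs without real credentials.
settings = load_dial_settings(
    {"DIAL_API_KEY": "dummy-key", "DIAL_ENDPOINT": "https://example.invalid/"}
)
print(settings["azure_endpoint"])
```

Failing early on missing settings gives a clearer error than an authentication failure in the middle of an evaluation run.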


In [3]:
# Configure our EPAM DIAL models for RAGAS
print("[INFO] Configuring LangChain Models for RAGAS...")
print("=" * 50)

# Create LangChain LLM wrapper (the judge model behind all four metrics)
# Using deployments this account has access to
langchain_llm = create_langchain_llm(deployment_name="gpt-4.1-mini-2025-04-14")
print("[SUCCESS] LangChain LLM configured: gpt-4.1-mini-2025-04-14")

# Create LangChain Embedding wrapper (semantic-similarity part of Answer Correctness)
# Using deployments this account has access to
langchain_embeddings = create_langchain_embeddings(deployment_name="text-embedding-3-small-1")
print("[SUCCESS] LangChain Embeddings configured: text-embedding-3-small-1")

print("\n[INFO] Model Configuration Complete!")
print("These LangChain wrappers will be used for:")
print("  - LangChain LLM: Context Recall, Context Precision, Faithfulness & Answer Correctness evaluation")
print("  - LangChain Embeddings: semantic similarity in Answer Correctness")


[INFO] Configuring LangChain Models for RAGAS...
[SUCCESS] LangChain LLM configured: gpt-4.1-mini-2025-04-14
[SUCCESS] LangChain Embeddings configured: text-embedding-3-small-1

[INFO] Model Configuration Complete!
These LangChain wrappers will be used for:
  - LangChain LLM: Context Recall, Context Precision, Faithfulness & Answer Correctness evaluation
  - LangChain Embeddings: semantic similarity in Answer Correctness


## 📈 Step 4: Run RAGAS Evaluation

Now let's run the RAGAS evaluation using our configured LangChain models and dataset!


In [4]:
# Run RAGAS evaluation
print("[INFO] Running RAGAS Evaluation...")
print("=" * 40)

# Define the metrics we want to evaluate
metrics = [
    context_recall,
    context_precision,
    faithfulness,
    answer_correctness]

print("[INFO] Evaluating metrics:")
for metric in metrics:
    print(f"  - {metric.__class__.__name__}")

print("\n[INFO] Starting evaluation (this may take a few minutes)...")

# Run the evaluation
try:
    result = evaluate(
        ragas_dataset,
        metrics=metrics,
        llm=langchain_llm,
        embeddings=langchain_embeddings
    )
    
    print("[SUCCESS] Evaluation completed successfully!")
    
except Exception as e:
    print(f"[ERROR] Evaluation failed: {e}")
    print("This might be due to API access restrictions or model availability.")


[INFO] Running RAGAS Evaluation...
[INFO] Evaluating metrics:
  - ContextRecall
  - ContextPrecision
  - Faithfulness
  - AnswerCorrectness

[INFO] Starting evaluation (this may take a few minutes)...


Evaluating:   0%|          | 0/20 [00:00<?, ?it/s]

[SUCCESS] Evaluation completed successfully!


## 📊 Step 5: Analyze Results

Let's analyze the evaluation results and understand what they mean for our RAG system.
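
A quick way to read the overall scores is to sort them from weakest to strongest. The sketch below hard-codes the four values this run reports in its output so that it runs on its own; in the notebook they would come from `result`. Keep in mind that the dataset has only five hand-made rows, so these numbers say little about a real system.

```python
# Overall scores as reported in this notebook's output.
overall_scores = {
    "context_recall": 1.0000,
    "context_precision": 1.0000,
    "faithfulness": 0.7000,
    "answer_correctness": 0.6708,
}

# Sort ascending so the metric that most needs attention comes first.
ranked = sorted(overall_scores.items(), key=lambda item: item[1])
for name, score in ranked:
    print(f"{name:<20} {score:.3f}")

weakest_metric, weakest_score = ranked[0]
print(f"Weakest metric: {weakest_metric} ({weakest_score:.3f})")
```

Here both retrieval metrics are at 1.0 while the two answer-side metrics are lower, which hints that the generated answers, not the retrieved context, are where to look first.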


In [12]:
# Analyze the results
print("[INFO] Analyzing RAGAS Results...")
print("=" * 40)

try:
    # Display the results
    print("[SUCCESS] Overall Scores:")
    print(result)
    
    # Convert to DataFrame for better analysis
    results_df = result.to_pandas()
    
    print("\n[INFO] Detailed Results:")
    print(results_df)
    
    # Calculate average scores
    print("\n[INFO] Average Scores:")
    for metric in metrics:
        metric_name = metric.name  # e.g. "context_recall", which matches the results_df column name
        if metric_name in results_df.columns:
            avg_score = results_df[metric_name].mean()
            print(f"  - {metric_name}: {avg_score:.3f}")
    
    # Interpret the results
    print("\n[INFO] Result Interpretation:")
    print("  - Scores range from 0 to 1 (higher is better)")
    print("  - Context Recall: How complete is the retrieved context?")
    print("  - Context Precision: How relevant is the retrieved context?")
    print("  - Faithfulness: Is the answer grounded in the context?")
    print("  - Answer Correctness: How accurate is the answer compared to ground truth?")
    
except NameError:
    print("[ERROR] No results available. Evaluation may have failed.")
    print("Please check your API configuration and try again.")
except Exception as e:
    print(f"[ERROR] Error analyzing results: {e}")


[INFO] Analyzing RAGAS Results...
[SUCCESS] Overall Scores:
{'context_recall': 1.0000, 'context_precision': 1.0000, 'faithfulness': 0.7000, 'answer_correctness': 0.6708}

[INFO] Detailed Results:
                                user_input  \
0           What is the capital of France?   
1  What are the main ingredients in pizza?   
2            How does photosynthesis work?   
3         What is the population of Tokyo?   
4              Who wrote Romeo and Juliet?   

                                  retrieved_contexts  \
0  [France is a country located in Western Europe...   
1  [Pizza is a popular Italian dish consisting of...   
2  [Photosynthesis is the process by which plants...   
3  [Tokyo is the capital city of Japan and one of...   
4  [Romeo and Juliet is a tragic play written by ...   

                                            response  \
0                    The capital of France is Paris.   
1  Pizza typically contains dough, tomato sauce, ...   
2  Plants use sunlight

## 💾 Step 6: Export Results

Let's save our results to a CSV file for further analysis and reporting.


In [13]:
# Export results to CSV
print("[INFO] Exporting Results...")
print("=" * 30)

try:
    results_df.to_csv('ragas_evaluation_results.csv', index=False)
    print("[SUCCESS] Complete evaluation saved to 'ragas_evaluation_results.csv'")
    
    print(f"\n[INFO] File created:")
    print(f"  - ragas_evaluation_results.csv ({len(results_df)} rows)")
    print(f"  - Contains: questions, answers, contexts, ground truth, and all RAGAS scores")
    
except NameError:
    print("[ERROR] No results to export. Please run the evaluation first.")
except Exception as e:
    print(f"[ERROR] Error exporting results: {e}")

[INFO] Exporting Results...
[SUCCESS] Complete evaluation saved to 'ragas_evaluation_results.csv'

[INFO] File created:
  - ragas_evaluation_results.csv (5 rows)
  - Contains: questions, answers, contexts, ground truth, and all RAGAS scores


## 🎉 Summary & Next Steps

Congratulations! You've successfully set up and run a RAGAS evaluation using EPAM DIAL models.

### ✅ What We Accomplished:
1. **Connected to EPAM DIAL API** using Azure OpenAI endpoints
2. **Created LangChain wrappers** for LLM and embedding models
3. **Loaded evaluation dataset** with proper RAGAS format
4. **Ran RAGAS metrics** for comprehensive RAG evaluation
5. **Analyzed and exported results** for further use

### 🔧 RAGAS Metrics Explained:
- **Context Recall**: How complete is the retrieved context?
- **Context Precision**: How relevant is the retrieved context?
- **Faithfulness**: Is the answer grounded in context?
- **Answer Correctness**: How accurate is the answer compared to ground truth?

### 🚀 Next Steps:
1. **Real Dataset**: Replace fake data with your actual RAG system data
2. **More Metrics**: Add additional RAGAS metrics like `answer_relevancy`
3. **Batch Evaluation**: Evaluate larger datasets
4. **Monitoring**: Set up regular evaluation pipelines
5. **Optimization**: Use results to improve your RAG system
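
As one possible starting point for the monitoring step, a pipeline could compare each new run against a stored baseline and flag metrics that dropped noticeably. In the sketch below the baseline is this notebook's run, while the "current" scores, the tolerance, and `find_regressions` are all hypothetical.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> dict:
    """Return {metric: drop} for metrics that fell by more than `tolerance`."""
    regressions = {}
    for metric, baseline_score in baseline.items():
        current_score = current.get(metric)
        if current_score is None:
            continue  # metric not evaluated in the current run
        drop = baseline_score - current_score
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


# Baseline: the overall scores reported by this notebook's run.
baseline_scores = {
    "context_recall": 1.0,
    "context_precision": 1.0,
    "faithfulness": 0.70,
    "answer_correctness": 0.6708,
}

# Hypothetical later run, invented for this example.
current_scores = {
    "context_recall": 1.0,
    "context_precision": 0.98,
    "faithfulness": 0.60,
    "answer_correctness": 0.66,
}

regressions = find_regressions(baseline_scores, current_scores)
print(regressions)
```

A scheduled job could fail or send an alert whenever the returned dictionary is non-empty. The tolerance needs tuning, since LLM-judged scores vary somewhat between runs.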

### 📚 Resources:
- [RAGAS Documentation](https://docs.ragas.io/)
- [EPAM DIAL Documentation](https://dial.epam.com/)
- [LangChain Azure Integration](https://python.langchain.com/docs/integrations/llms/azure_openai)
