# RAG (Retrieval-Augmented Generation) Evaluation Tutorial

## A Comprehensive Guide to Building and Evaluating RAG Systems with DeepEval

---

### Overview

This tutorial walks through building a RAG (Retrieval-Augmented Generation) system and evaluating it with the **DeepEval** framework. A RAG system retrieves relevant passages from source documents and passes them to a large language model, so that the generated answers are grounded in those documents.

### What You'll Learn

1. **PDF Document Processing**: How to load and chunk PDF documents for embedding
2. **Vector Store Management**: Using ChromaDB to store and retrieve document embeddings
3. **RAG Pipeline Implementation**: Building an end-to-end question-answering system
4. **Retrieval Evaluation**: Measuring how well the system retrieves relevant information
   - Contextual Precision
   - Contextual Recall
   - Contextual Relevancy
5. **Generator Evaluation**: Assessing the quality of generated answers
   - Answer Relevancy
   - Faithfulness
   - Hallucination Check
   - G-Eval (Custom Metrics)
6. **Model Performance Comparison**: Tracking and comparing different LLM backends

### LLM Backend Options

This notebook supports two LLM backends:
- **OpenAI API**: Cloud-based, requires API key
- **Ollama**: Local inference, runs models on your machine

You can switch between backends and compare their performance on the same evaluation tasks.

### Prerequisites

- Python 3.8+
- OpenAI API key (for OpenAI backend)
- Ollama installed locally (for Ollama backend)
- The healthcare PDF document at `data/07.Healthcare_2016.pdf`

---

## Section 1: Environment Setup and Dependencies

### Installing Required Packages

Before we begin, we need to install all the necessary Python packages. This includes:

- **langchain**: Framework for building LLM applications
- **langchain-openai**: OpenAI integration for LangChain
- **langchain-community**: Community integrations including Ollama
- **chromadb**: Vector database for storing embeddings
- **deepeval**: Evaluation framework for LLM applications
- **pymupdf**: PDF parsing library
- **tenacity**: Retry logic for API calls
- **pandas**: Data manipulation for performance tracking

In [None]:
# Install required packages
!pip install -q langchain langchain-openai langchain-community chromadb deepeval pymupdf tenacity pandas python-dotenv ollama openai

### Importing Libraries

Now we import all the necessary libraries. We organize imports by functionality:

1. **Standard library imports**: `os`, `json`, `datetime`, and `typing` for system operations, timestamps, and type hints
2. **LangChain imports**: For document loading, text splitting, and LLM interactions
3. **ChromaDB imports**: For vector storage and retrieval
4. **DeepEval imports**: For RAG evaluation metrics
5. **Utility imports**: For retry logic and data handling

In [None]:
# Standard library imports
import os
import json
from datetime import datetime
from typing import Optional, List, Dict, Any

# Data handling
import pandas as pd

# LangChain imports
from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain.schema import Document

# ChromaDB imports
import chromadb
from langchain_community.vectorstores import Chroma

# DeepEval imports
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval
)
from deepeval.models.base_model import DeepEvalBaseLLM

# Retry logic
from tenacity import retry, stop_after_attempt, wait_exponential

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

---

## Section 2: Configuration and Authentication

### Setting Up LLM Backend Options

This notebook supports two LLM backends:

1. **OpenAI API**: Uses cloud-based models like GPT-4o-mini
   - Requires `OPENAI_API_KEY` environment variable
   - Typically higher answer quality than small local models, but usage is billed
   
2. **Ollama**: Uses locally-hosted open-source models
   - Free to use once models are downloaded
   - Runs entirely on your machine
   - Supports models like Llama 3, Mistral, etc.

### Configuration Class

We create a configuration class to manage all settings in one place. This makes it easy to switch between backends and models.

In [None]:
class RAGConfig:
    """
    Configuration class for RAG system settings.
    
    This class centralizes all configuration options including:
    - LLM backend selection (OpenAI or Ollama)
    - Model names for chat and embeddings
    - Document processing parameters
    - Vector store settings
    """
    
    def __init__(
        self,
        backend: str = "openai",  # "openai" or "ollama"
        chat_model: Optional[str] = None,  # None selects the backend default
        embedding_model: Optional[str] = None,  # None selects the backend default
        chunk_size: int = 600,
        chunk_overlap: int = 100,
        persist_directory: str = "./single_pdf_rag_eval_db",
        pdf_path: str = "data/07.Healthcare_2016.pdf"
    ):
        self.backend = backend.lower()
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.persist_directory = persist_directory
        self.pdf_path = pdf_path
        
        # Set default models based on backend
        if self.backend == "openai":
            self.chat_model = chat_model or "gpt-4o-mini"
            self.embedding_model = embedding_model or "text-embedding-3-small"
        elif self.backend == "ollama":
            self.chat_model = chat_model or "llama3.2"
            self.embedding_model = embedding_model or "nomic-embed-text"
        else:
            raise ValueError(f"Unknown backend: {backend}. Use 'openai' or 'ollama'")
    
    def __repr__(self):
        return (
            f"RAGConfig(backend='{self.backend}', "
            f"chat_model='{self.chat_model}', "
            f"embedding_model='{self.embedding_model}')"
        )


# Display available configuration options
print("Available Backends:")
print("  - 'openai': Uses OpenAI API (requires OPENAI_API_KEY)")
print("  - 'ollama': Uses local Ollama models")
print("\nDefault Models:")
print("  OpenAI: gpt-4o-mini (chat), text-embedding-3-small (embeddings)")
print("  Ollama: llama3.2 (chat), nomic-embed-text (embeddings)")

### Initialize Configuration

Choose your backend and model configuration here. You can run the notebook multiple times with different configurations to compare performance.

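**Important**:
- For OpenAI, ensure `OPENAI_API_KEY` is set in your environment
- For Ollama, ensure Ollama is running and models are downloaded
- When you switch embedding models, also change `PERSIST_DIRECTORY` (or delete the old directory): a vector store persisted with one embedding model is loaded as-is and cannot be queried correctly with another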
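
A minimal pre-flight sketch for the key and file checks, so that a missing prerequisite surfaces before any model call is made. The helper name `missing_prerequisites` and its parameters are hypothetical and not part of this notebook; it only inspects an environment mapping and the file system, and it does not verify that an Ollama server is actually reachable.

```python
import os
from typing import List, Mapping, Optional


def missing_prerequisites(
    backend: str,
    pdf_path: str,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    env = os.environ if env is None else env
    problems = []

    if backend == "openai":
        # The OpenAI backend needs an API key in the environment
        if not env.get("OPENAI_API_KEY"):
            problems.append("OPENAI_API_KEY is not set")
    elif backend != "ollama":
        problems.append(f"Unknown backend: {backend!r}")

    # Both backends need the source document on disk
    if not os.path.exists(pdf_path):
        problems.append(f"PDF file not found: {pdf_path}")

    return problems


# Example with an explicit (empty) environment and a path that does not exist
for problem in missing_prerequisites("openai", "no_such_file.pdf", env={}):
    print("-", problem)
```

Passing `env` explicitly keeps the function easy to test; leaving it as `None` falls back to the real process environment.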

In [None]:
# ============================================================
# CONFIGURATION - MODIFY THIS CELL TO CHANGE SETTINGS
# ============================================================

# Choose backend: "openai" or "ollama"
BACKEND = "openai"

# Optional: Specify custom models (or leave as None for defaults)
CHAT_MODEL = None  # e.g., "gpt-4o", "llama3.2", "mistral"
EMBEDDING_MODEL = None  # e.g., "text-embedding-3-large", "nomic-embed-text"

# Document settings
PDF_PATH = "data/07.Healthcare_2016.pdf"
PERSIST_DIRECTORY = "./single_pdf_rag_eval_db"

# Create configuration
config = RAGConfig(
    backend=BACKEND,
    chat_model=CHAT_MODEL,
    embedding_model=EMBEDDING_MODEL,
    pdf_path=PDF_PATH,
    persist_directory=PERSIST_DIRECTORY
)

print(f"Configuration initialized: {config}")

### Load Environment Variables and Initialize Clients

Now we set up the LLM and embedding clients based on the selected backend. The code automatically handles:

1. **OpenAI Backend**:
   - Loads `OPENAI_API_KEY` from environment variables
   - Creates `ChatOpenAI` and `OpenAIEmbeddings` clients

2. **Ollama Backend**:
   - Connects to local Ollama server (default: http://localhost:11434)
   - Creates `ChatOllama` and `OllamaEmbeddings` clients

In [None]:
def initialize_clients(config: RAGConfig):
    """
    Initialize LLM and embedding clients based on the configuration.
    
    Parameters:
    -----------
    config : RAGConfig
        Configuration object specifying backend and model settings
        
    Returns:
    --------
    tuple : (chat_model, embeddings_model)
        Initialized chat and embeddings clients
    """
    
    if config.backend == "openai":
        # Load variables from a local .env file (if one exists), then read the key
        from dotenv import load_dotenv
        load_dotenv()
        api_key = os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError(
                "OPENAI_API_KEY not found in environment variables. "
                "Please set it using: export OPENAI_API_KEY='your-key-here'"
            )
        
        # Initialize OpenAI clients
        chat_model = ChatOpenAI(
            model=config.chat_model,
            temperature=0,
            api_key=api_key
        )
        
        embeddings_model = OpenAIEmbeddings(
            model=config.embedding_model,
            api_key=api_key
        )
        
        print(f"✓ OpenAI clients initialized")
        print(f"  Chat model: {config.chat_model}")
        print(f"  Embeddings model: {config.embedding_model}")
        
    elif config.backend == "ollama":
        # Initialize Ollama clients
        chat_model = ChatOllama(
            model=config.chat_model,
            temperature=0
        )
        
        embeddings_model = OllamaEmbeddings(
            model=config.embedding_model
        )
        
        print(f"✓ Ollama clients initialized")
        print(f"  Chat model: {config.chat_model}")
        print(f"  Embeddings model: {config.embedding_model}")
        print(f"  Note: Ensure Ollama is running (ollama serve)")
    
    return chat_model, embeddings_model


# Initialize clients
chat_model, embeddings_model = initialize_clients(config)

---

## Section 3: DeepEval Model Wrapper

### Why We Need a Wrapper

**Important Note**: DeepEval's metrics expect an evaluation model that implements its `DeepEvalBaseLLM` interface, so a LangChain chat model cannot be passed to them directly. To use our LLMs with DeepEval's evaluation metrics, we create a wrapper class that adapts a LangChain chat model to that interface.

### Understanding the Wrapper Class

The wrapper class (`CustomModelWrapper`) implements DeepEval's `DeepEvalBaseLLM` interface. Let's walk through each method:

1. **`__init__`**: The constructor takes a LangChain chat model and stores it for use in other methods.

2. **`load_model`**: Returns the stored model. DeepEval may call this internally.

3. **`generate`**: Takes a prompt string, calls the model's `invoke` method to send the prompt to the LLM, and returns the generated response content.

4. **`a_generate`**: Similar to `generate` but asynchronous. It awaits the response and extracts the content.

5. **`get_model_name`**: Returns a string identifier naming the model for DeepEval reporting and tracking.
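
The delegation pattern behind these methods can be tried without any API access. The sketch below is an illustration only: `FakeMessage`, `FakeChatModel`, and `MiniWrapper` are hypothetical stand-ins (not LangChain or DeepEval classes) that mimic the `invoke`/`ainvoke` calls and the `.content` attribute the wrapper relies on.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Stand-in for a chat response object exposing `.content`."""
    content: str


class FakeChatModel:
    """Stand-in for a chat model with sync and async entry points."""

    def invoke(self, prompt: str) -> FakeMessage:
        return FakeMessage(content=f"echo: {prompt}")

    async def ainvoke(self, prompt: str) -> FakeMessage:
        return self.invoke(prompt)


class MiniWrapper:
    """Same method names as the wrapper described above, minus the DeepEval base class."""

    def __init__(self, model, model_name: str):
        self.model = model
        self._model_name = model_name

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        # Synchronous path: call the model and unwrap the text
        return self.model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        # Asynchronous path: await the model, then unwrap the text
        response = await self.model.ainvoke(prompt)
        return response.content

    def get_model_name(self) -> str:
        return self._model_name


wrapper = MiniWrapper(FakeChatModel(), model_name="fake_echo")
print(wrapper.generate("ping"))
print(asyncio.run(wrapper.a_generate("ping")))
```

Both calls return a plain string, which is the contract the evaluation metrics depend on.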

In [None]:
class CustomModelWrapper(DeepEvalBaseLLM):
    """
    Wrapper class to make LangChain chat models compatible with DeepEval.
    
    This wrapper bridges the gap between LangChain's ChatModel interface
    and DeepEval's expected model interface. It enables us to use any
    LangChain-compatible model (OpenAI, Ollama, etc.) with DeepEval's
    evaluation metrics.
    
    Parameters:
    -----------
    model : ChatModel
        A LangChain chat model (ChatOpenAI, ChatOllama, etc.)
    model_name : str
        A descriptive name for the model (used in reporting)
    """
    
    def __init__(self, model, model_name: str):
        """
        Initialize the wrapper with a LangChain chat model.
        
        The model is stored in self.model for use in other methods.
        """
        self.model = model
        self._model_name = model_name
    
    def load_model(self):
        """
        Return the stored model.
        
        DeepEval may call this method internally to access the underlying model.
        """
        return self.model
    
    def generate(self, prompt: str) -> str:
        """
        Generate a response for the given prompt (synchronous).
        
        This method:
        1. Calls self.model.invoke(prompt) to send the prompt to the LLM
        2. Extracts and returns the response content
        
        Parameters:
        -----------
        prompt : str
            The input prompt to send to the model
            
        Returns:
        --------
        str : The generated response text
        """
        response = self.model.invoke(prompt)
        return response.content
    
    async def a_generate(self, prompt: str) -> str:
        """
        Generate a response for the given prompt (asynchronous).
        
        Similar to generate() but asynchronous - it awaits the response
        and extracts the content.
        
        Parameters:
        -----------
        prompt : str
            The input prompt to send to the model
            
        Returns:
        --------
        str : The generated response text
        """
        response = await self.model.ainvoke(prompt)
        return response.content
    
    def get_model_name(self) -> str:
        """
        Return the model identifier for DeepEval reporting and tracking.
        
        This name appears in evaluation reports and helps identify
        which model was used for each evaluation run.
        """
        return self._model_name


# Create the wrapped model for DeepEval
# We pass our chat_model into the CustomModelWrapper class
wrapped_model = CustomModelWrapper(
    model=chat_model,
    model_name=f"{config.backend}_{config.chat_model}"
)

print(f"✓ Created wrapped model: {wrapped_model.get_model_name()}")
print("\nThis wrapped model generates responses like the original model,")
print("but is fully compatible with DeepEval's interface.")

---

## Section 4: Performance Tracking System

### Why Track Performance?

When comparing different LLM backends (OpenAI vs Ollama) and models, it's essential to track performance metrics systematically. This allows you to:

1. Compare how different models perform on the same evaluation tasks
2. Track improvements or regressions over time
3. Make informed decisions about which model to use in production

### Performance Tracker Class

We create a `PerformanceTracker` class that:
- Stores evaluation results with timestamps and run numbers
- Supports multiple evaluation categories (retrieval, generation)
- Exports results to CSV and displays summary tables
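
To make the comparison concrete, here is a small standard-library sketch of the kind of per-model aggregation such a log enables. The records and scores are made-up illustration values, not real evaluation results, and `mean_score_by_model` is a hypothetical helper rather than part of the tracker.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Made-up records in the same shape the tracker stores (illustration only)
records = [
    {"backend": "openai", "metric_name": "Contextual Recall", "score": 0.8},
    {"backend": "openai", "metric_name": "Contextual Recall", "score": 0.6},
    {"backend": "ollama", "metric_name": "Contextual Recall", "score": 0.5},
]


def mean_score_by_model(rows: List[dict]) -> Dict[Tuple[str, str], float]:
    """Average the score for each (backend, metric_name) pair."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["backend"], row["metric_name"])].append(row["score"])
    return {key: round(mean(scores), 3) for key, scores in grouped.items()}


for (backend, metric), avg in mean_score_by_model(records).items():
    print(f"{backend:8s} {metric:20s} {avg}")
```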

In [None]:
class PerformanceTracker:
    """
    Track and compare model performance across evaluation runs.
    
    This class maintains a log of all evaluation results, allowing
    comparison between different models, backends, and configurations.
    
    Attributes:
    -----------
    results : list
        List of dictionaries containing evaluation results
    run_counter : int
        Counter for tracking run numbers
    """
    
    def __init__(self):
        """Initialize the performance tracker with empty results."""
        self.results = []
        self.run_counter = 0
    
    def log_result(
        self,
        metric_name: str,
        metric_category: str,  # "retrieval" or "generation"
        score: float,
        success: bool,
        backend: str,
        model_name: str,
        test_variant: str = "standard",  # e.g. "standard", "clean_context", "with_noise", "at_k"
        reason: Optional[str] = None,
        additional_info: Optional[dict] = None
    ):
        """
        Log an evaluation result.
        
        Parameters:
        -----------
        metric_name : str
            Name of the evaluation metric
        metric_category : str
            Category: "retrieval" or "generation"
        score : float
            The evaluation score (0.0 to 1.0)
        success : bool
            Whether the evaluation passed the threshold
        backend : str
            LLM backend used ("openai" or "ollama")
        model_name : str
            Specific model name
        test_variant : str
            Variant of the test (e.g. standard, clean_context, with_noise, at_k)
        reason : str
            Explanation of the score
        additional_info : dict
            Any additional metadata
        """
        self.run_counter += 1
        
        result = {
            "run_number": self.run_counter,
            "timestamp": datetime.now().isoformat(),
            "metric_name": metric_name,
            "metric_category": metric_category,
            "test_variant": test_variant,
            "score": score,
            "success": success,
            "backend": backend,
            "model_name": model_name,
            "reason": reason
        }
        
        if additional_info:
            result.update(additional_info)
        
        self.results.append(result)
        
    def get_dataframe(self) -> pd.DataFrame:
        """Return results as a pandas DataFrame."""
        return pd.DataFrame(self.results)
    
    def get_summary(self) -> pd.DataFrame:
        """
        Get a summary of results grouped by model and metric.
        
        Returns:
        --------
        pd.DataFrame : Summary statistics
        """
        df = self.get_dataframe()
        if df.empty:
            return df
        
        summary = df.groupby(['backend', 'model_name', 'metric_name', 'test_variant']).agg({
            'score': ['mean', 'std', 'count'],
            'success': 'mean'
        }).round(3)
        
        return summary
    
    def save_to_csv(self, filepath: str = "rag_performance_results.csv"):
        """Save results to a CSV file."""
        df = self.get_dataframe()
        df.to_csv(filepath, index=False)
        print(f"✓ Results saved to {filepath}")
    
    def display_results(self) -> Optional[pd.DataFrame]:
        """Return a display-ready results table, or None if nothing is logged yet."""
        df = self.get_dataframe()
        if df.empty:
            print("No results recorded yet.")
            return
        
        # Select key columns for display
        display_cols = [
            'run_number', 'timestamp', 'metric_name', 'test_variant',
            'score', 'success', 'backend', 'model_name'
        ]
        
        # Format timestamp for readability
        display_df = df[display_cols].copy()
        display_df['timestamp'] = pd.to_datetime(display_df['timestamp']).dt.strftime('%Y-%m-%d %H:%M:%S')
        display_df['score'] = display_df['score'].round(3)
        
        return display_df


# Initialize the global performance tracker
performance_tracker = PerformanceTracker()

print("✓ Performance Tracker initialized")
print("\nThe tracker will log all evaluation results with:")
print("  - Run number and timestamp")
print("  - Metric name and category")
print("  - Score and success status")
print("  - Backend and model information")

---

## Section 5: PDF Document Loading

### Loading PDFs with LangChain

The first step in our RAG pipeline is loading the source document. We use LangChain's `PyMuPDFLoader` to load PDF content. PyMuPDF is a fast and accurate PDF parsing library.

### The `load_pdf_with_langchain` Function

This function:
1. Takes a PDF file path as input
2. Uses `PyMuPDFLoader` to parse the PDF
3. Returns one LangChain `Document` per page (chunking happens later, in Section 6)

Each `Document` includes:
- `page_content`: The extracted text
- `metadata`: Information about the source (file path, page number, etc.)

In [None]:
def load_pdf_with_langchain(pdf_path: str) -> List[Document]:
    """
    Load a PDF file and extract its text content using LangChain's PyMuPDFLoader.
    
    PyMuPDF (also known as fitz) is a fast and accurate PDF parsing library
    that extracts text while preserving document structure.
    
    Parameters:
    -----------
    pdf_path : str
        Path to the PDF file to load
        
    Returns:
    --------
    List[Document] : List of LangChain Document objects containing:
        - page_content: The extracted text from each page
        - metadata: Source information (file path, page number)
    """
    
    # Verify file exists
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")
    
    # Initialize the PyMuPDF loader
    loader = PyMuPDFLoader(pdf_path)
    
    # Load and return documents
    documents = loader.load()
    
    print(f"✓ Loaded {len(documents)} pages from: {pdf_path}")
    
    return documents


# The function is only defined here; it is called later inside the pipeline
print("Function 'load_pdf_with_langchain' defined.")
print(f"\nPDF path configured: {config.pdf_path}")

---

## Section 6: Document Chunking

### Why Chunk Documents?

Large documents need to be split into smaller chunks for several reasons:

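1. **Embedding Limitations**: Embedding models accept only a limited number of tokens per input
2. **Retrieval Precision**: Smaller chunks let the retriever return only the passages relevant to a query
3. **Context Window**: The retrieved text must fit in the LLM's context window alongside the question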

### RecursiveCharacterTextSplitter

We use LangChain's `RecursiveCharacterTextSplitter` which:
- Splits text at natural boundaries (paragraphs, sentences, words)
- Maintains chunk overlap to preserve context at boundaries

### Configuration

- **chunk_size**: at most 600 characters per chunk
- **chunk_overlap**: 100 characters overlap between consecutive chunks

The overlap ensures that context is preserved across chunk boundaries, which is important when relevant information spans multiple chunks.
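
The effect of overlap is easiest to see with a fixed-width sliding window. This is a deliberately simplified sketch and not the algorithm `RecursiveCharacterTextSplitter` uses (which also prefers to break at separators such as paragraph and line breaks); `sliding_window_chunks` is a hypothetical helper written for illustration.

```python
from typing import List


def sliding_window_chunks(
    text: str, chunk_size: int = 600, chunk_overlap: int = 100
) -> List[str]:
    """Cut text into fixed-width windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 1,600-character text with the notebook's defaults (600 / 100)
pieces = sliding_window_chunks("x" * 1600)
print(len(pieces), [len(p) for p in pieces])

# A tiny example where the shared characters are visible
print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With a 100-character overlap each window advances by 500 characters, so the last 100 characters of one chunk reappear at the start of the next.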

In [None]:
def chunk_documents(documents: List[Document], chunk_size: int = 600, chunk_overlap: int = 100) -> List[Document]:
    """
    Split documents into smaller chunks for embedding.
    
    Uses RecursiveCharacterTextSplitter which splits text at natural
    boundaries (paragraphs, sentences, words) while maintaining overlap
    to preserve context at chunk boundaries.
    
    Parameters:
    -----------
    documents : List[Document]
        List of LangChain Document objects to split
    chunk_size : int
        Maximum size of each chunk in characters (default: 600)
    chunk_overlap : int
        Number of characters to overlap between chunks (default: 100)
        
    Returns:
    --------
    List[Document] : List of chunked documents
    """
    
    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )
    
    # Split all documents
    chunks = text_splitter.split_documents(documents)
    
    print(f"✓ Split into {len(chunks)} chunks")
    print(f"  Chunk size: {chunk_size} chars")
    print(f"  Overlap: {chunk_overlap} chars")
    
    return chunks


print("Function 'chunk_documents' defined.")

---

## Section 7: ChromaDB Configuration

### About ChromaDB

ChromaDB is an open-source embedding database that:
- Stores document embeddings as vectors
- Enables fast similarity search
- Persists data to disk for reuse
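
Conceptually, similarity search ranks stored vectors by how close they are to the query vector. The sketch below uses a linear scan with cosine similarity over toy three-dimensional vectors; it is an illustration of the idea only. A real vector database relies on an index instead of a full scan, its distance metric is configurable, and the names `cosine_similarity`, `top_k`, and `toy_store` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], store: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (real ones have hundreds or thousands of dimensions)
toy_store = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}

for doc_id, score in top_k([1.0, 0.1, 0.0], toy_store, k=2):
    print(doc_id, round(score, 3))
```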

In [None]:
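# Set environment variables before the vector store is created
os.environ["TOKENIZERS_PARALLELISM"] = "true"  # Hugging Face tokenizers setting (not a ChromaDB option)
os.environ["ANONYMIZED_TELEMETRY"] = "False"  # Opt out of ChromaDB's anonymized usage telemetry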

print("✓ ChromaDB configuration set")

---

## Section 8: Vector Store Management

### The `store_embeddings` Function

This function manages embeddings using ChromaDB with retry logic for robustness.
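
To get a feel for how long a retried call can block, the sketch below models a clamped exponential backoff. It assumes the wait before each retry is `multiplier * base ** (attempt - 1)`, clamped to the range `[min_wait, max_wait]`; that formula is an assumption about how such a policy behaves, so check the retry library's documentation before relying on exact numbers. `backoff_schedule` is a hypothetical helper, and the values passed in are the ones used in this notebook (6 attempts, `min=45`, `max=120`).

```python
from typing import List


def backoff_schedule(
    attempts: int,
    min_wait: float,
    max_wait: float,
    multiplier: float = 1.0,
    base: float = 2.0,
) -> List[float]:
    """Waits (in seconds) between consecutive attempts; one fewer than `attempts`."""
    waits = []
    for attempt in range(1, attempts):
        raw = multiplier * base ** (attempt - 1)  # 1, 2, 4, 8, ...
        waits.append(max(min_wait, min(raw, max_wait)))  # clamp to [min, max]
    return waits


schedule = backoff_schedule(attempts=6, min_wait=45, max_wait=120)
print("waits:", schedule)
print("worst-case total wait (s):", sum(schedule))
```

Under this assumption the 45-second floor dominates every wait within six attempts, so the growth only becomes visible with more attempts or a smaller minimum.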

In [None]:
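# Retry transient failures (e.g. rate limits) with backoff, but fail fast on
# argument errors: a ValueError will not succeed on a later attempt.
from tenacity import retry_if_not_exception_type

@retry(
    stop=stop_after_attempt(6),
    wait=wait_exponential(min=45, max=120),
    retry=retry_if_not_exception_type(ValueError),
    reraise=True
)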
def store_embeddings(
    persist_directory: str,
    docs: Optional[List[Document]] = None,
    embedding_function=None
) -> Chroma:
    """
    Manage embeddings using ChromaDB - load existing or create new.
    """
    if embedding_function is None:
        raise ValueError("embedding_function is required")
    
    if os.path.exists(persist_directory) and os.listdir(persist_directory):
        print(f"Loading existing vector store from: {persist_directory}")
        vector_store = Chroma(
            persist_directory=persist_directory,
            embedding_function=embedding_function
        )
        print(f"✓ Loaded vector store")
    else:
        if docs is None:
            raise ValueError("docs parameter required for first-time embedding")
        print(f"Creating new vector store in: {persist_directory}")
        vector_store = Chroma.from_documents(
            documents=docs,
            embedding=embedding_function,
            persist_directory=persist_directory
        )
        print(f"✓ Created vector store with {len(docs)} chunks")
    
    return vector_store

print("Function 'store_embeddings' defined.")

---

## Section 9: Retrieval and Answer Generation

In [None]:
def retrieve_chunks(query: str, vector_store: Chroma, top_k: int = 5) -> List[Document]:
    """Find the most semantically relevant chunks for a query."""
    results = vector_store.similarity_search(query, k=top_k)
    
    seen_content = set()
    unique_results = []
    for doc in results:
        if doc.page_content not in seen_content:
            seen_content.add(doc.page_content)
            unique_results.append(doc)
    
    print(f"Retrieved {len(unique_results)} unique chunks")
    return unique_results


def generate_answer(query: str, chunks: List[Document], chat_model) -> str:
    """Generate an answer using the LLM based on retrieved chunks."""
    context = "\n\n".join([chunk.page_content for chunk in chunks])
    
    prompt = f"""Use only the context below to answer the question. 
If the answer is not found in the context, say "This query is not as per the PDF."

Context:
{context}

Question: {query}

Answer:"""
    
    response = chat_model.invoke(prompt)
    return response.content

print("Functions 'retrieve_chunks' and 'generate_answer' defined.")

---

## Section 10: Complete RAG Pipeline

In [None]:
def pdf_chatbot_pipeline(
    pdf_path: str,
    query: str,
    persist_directory: str,
    chat_model,
    embedding_function,
    chunk_size: int = 600,
    chunk_overlap: int = 100,
    top_k: int = 5
) -> Dict[str, Any]:
    """Complete RAG pipeline for PDF-based question answering."""
    
    print("="*60)
    print("PDF Chatbot Pipeline")
    print("="*60)
    
    # Check if vector store exists
    if os.path.exists(persist_directory) and os.listdir(persist_directory):
        print("\n[Step 1] Loading existing vector store...")
        vector_store = Chroma(
            persist_directory=persist_directory,
            embedding_function=embedding_function
        )
    else:
        print("\n[Step 1] No existing embeddings found. Processing PDF...")
        documents = load_pdf_with_langchain(pdf_path)
        chunks = chunk_documents(documents, chunk_size, chunk_overlap)
        print(f"\nLoaded {len(chunks)} document chunks from PDF")
        
        print(f"\n[Step 2] Creating new vector store...")
        vector_store = store_embeddings(
            persist_directory=persist_directory,
            docs=chunks,
            embedding_function=embedding_function
        )
    
    print("\n[Step 3] Retrieving relevant chunks...")
    retrieved_chunks = retrieve_chunks(query, vector_store, top_k)
    
    print("\n[Step 4] Generating answer...")
    answer = generate_answer(query, retrieved_chunks, chat_model)
    
    context = [chunk.page_content for chunk in retrieved_chunks]
    
    print("\n" + "="*60)
    print("Pipeline Complete")
    print("="*60)
    
    return {
        'question': query,
        'context': context,
        'answer': answer,
        'chunks': retrieved_chunks
    }

print("Function 'pdf_chatbot_pipeline' defined.")

---

## Section 11: Running the Pipeline

### Test Query: How does MIoT improve hospital safety?

In [None]:
# Define the test query
test_query = "How does MIoT improve hospital safety?"

# Run the pipeline
response = pdf_chatbot_pipeline(
    pdf_path=config.pdf_path,
    query=test_query,
    persist_directory=config.persist_directory,
    chat_model=chat_model,
    embedding_function=embeddings_model,
    chunk_size=config.chunk_size,
    chunk_overlap=config.chunk_overlap
)

print("\n" + "="*60)
print("RESPONSE")
print("="*60)
print(f"\nQuestion: {response['question']}")
print(f"\nAnswer:\n{response['answer']}")

In [None]:
# Store results for evaluation
retrieved_context = response['context']
ai_output = response['answer']

print(f"Retrieved {len(retrieved_context)} context chunks for evaluation")

---

## Section 12: Preparing for Evaluation

### Human-Written Expected Response and Noise Chunks

In [None]:
# Human-written expected response
expected_output = """MIoT (Medical Internet of Things) improves hospital safety in several key ways:

1. Error Reduction: MIoT systems help minimize preventable medical errors through real-time 
   monitoring and automated alerts.

2. Real-time Monitoring: Connected medical devices provide continuous patient monitoring.

3. Interoperability: The ICE standard enables cross-device communication.

4. Infection Prevention: Connected systems can track hygiene compliance.

5. Asset Management: IoT sensors track medical equipment location and status.

6. Decision Support: Aggregated data provides better clinical decision-making."""

# Create noise chunks
noise_chunks = [
    """Healthcare marketing strategies have evolved significantly in the digital age. 
    Social media platforms enable hospitals to reach potential patients through targeted 
    advertising campaigns.""",
    
    """Healthcare consulting fees vary widely depending on the scope of services provided. 
    Management consultants typically charge between $200-500 per hour.""",
    
    """Patient behavior research indicates that convenience is a primary factor in 
    healthcare provider selection. Studies show that 67% prefer providers within 10 miles.""",
    
    """Healthcare economics examines the production and consumption of health and healthcare. 
    Supply and demand dynamics play crucial roles in determining costs."""
]

# Create context variants (list concatenation already builds new lists)
clean_context = retrieved_context.copy()
context_with_noise = retrieved_context + noise_chunks
noise_at_top_context = noise_chunks + retrieved_context

print("Context variants created:")
print(f"  Clean context: {len(clean_context)} chunks")
print(f"  Context with noise: {len(context_with_noise)} chunks")
print(f"  Noise at top: {len(noise_at_top_context)} chunks")

---

## Section 13: Retrieval Evaluation - Contextual Precision

**Contextual Precision** measures whether the chunks that are relevant to the expected output are ranked above the irrelevant ones in the retrieval context.
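
The sensitivity to ranking can be illustrated with a rank-weighted precision. This sketch assumes a formula of the kind described in DeepEval's documentation (precision at each rank that holds a relevant chunk, averaged over the relevant chunks). In the real metric the relevance verdicts come from the LLM judge; here they are hard-coded booleans, and `weighted_precision` is a hypothetical helper.

```python
from typing import Sequence


def weighted_precision(relevance: Sequence[bool]) -> float:
    """Rank-weighted precision over an ordered list of relevance verdicts."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0

    score = 0.0
    relevant_so_far = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            score += relevant_so_far / rank  # precision at this rank
    return score / total_relevant


relevant_first = weighted_precision([True, True, False, False])
noise_first = weighted_precision([False, False, True, True])
print(f"relevant chunks first: {relevant_first:.3f}")
print(f"noise ranked on top:   {noise_first:.3f}")
```

The same four chunks score lower when the irrelevant ones come first, which is why a variant with noise placed at the top of the context is a useful stress test for this metric.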

In [None]:
# Define Contextual Precision metric
contextual_precision_metric = ContextualPrecisionMetric(
    threshold=0.6,
    model=wrapped_model,
    verbose_mode=True
)

# Test with clean context
precision_test_clean = LLMTestCase(
    input=test_query,
    actual_output=ai_output,
    expected_output=expected_output,
    retrieval_context=clean_context
)

print("Running Contextual Precision evaluation (Clean Context)...")
precision_results = evaluate(test_cases=[precision_test_clean], metrics=[contextual_precision_metric])

score = precision_results.test_results[0].metrics_data[0].score
success = precision_results.test_results[0].metrics_data[0].success
reason = precision_results.test_results[0].metrics_data[0].reason

print(f"\nResults: Success={success}, Score={score}")
print(f"Reason: {reason}")

performance_tracker.log_result(
    metric_name="Contextual Precision",
    metric_category="retrieval",
    score=score, success=success,
    backend=config.backend, model_name=config.chat_model,
    test_variant="clean_context", reason=reason
)

---

## Section 14: Retrieval Evaluation - Contextual Recall

**Contextual Recall** measures how much of the expected output can be attributed to the retrieved context.
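
As a rough mental model, the score is the share of expected-output statements that the retrieved context supports. The sketch below assumes that ratio and uses hard-coded, made-up verdicts for six statements; in the real metric an LLM judge decides which statements are attributable. `recall_ratio` is a hypothetical helper.

```python
from typing import Sequence


def recall_ratio(attributable: Sequence[bool]) -> float:
    """Share of expected-output statements supported by the retrieved context."""
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)


# Made-up verdicts: 4 of 6 expected statements are found in the context
verdicts = [True, True, True, False, False, True]
threshold = 0.6

score = recall_ratio(verdicts)
print(f"score={score:.3f}, passes threshold {threshold}: {score >= threshold}")
```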

In [None]:
# Define Contextual Recall metric
contextual_recall_metric = ContextualRecallMetric(
    threshold=0.6,
    model=wrapped_model,
    verbose_mode=True
)

# Test with clean context
recall_test_clean = LLMTestCase(
    input=test_query,
    actual_output=ai_output,
    expected_output=expected_output,
    retrieval_context=clean_context
)

print("Running Contextual Recall evaluation (Clean Context)...")
recall_results = evaluate(test_cases=[recall_test_clean], metrics=[contextual_recall_metric])

score = recall_results.test_results[0].metrics_data[0].score
success = recall_results.test_results[0].metrics_data[0].success
reason = recall_results.test_results[0].metrics_data[0].reason

print(f"\nResults: Success={success}, Score={score}")
print(f"Reason: {reason}")

performance_tracker.log_result(
    metric_name="Contextual Recall",
    metric_category="retrieval",
    score=score, success=success,
    backend=config.backend, model_name=config.chat_model,
    test_variant="clean_context", reason=reason
)

---

## Section 15: Retrieval Evaluation - Contextual Relevancy

**Contextual Relevancy** measures how relevant the retrieved content is for answering the query.
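
The sketch below illustrates the ratio behind this metric as described in the DeepEval documentation: relevant statements divided by all statements found in the retrieval context. The verdict lists are hypothetical; the judge model would normally produce them.

```python
from typing import List


def contextual_relevancy(statement_is_relevant: List[bool]) -> float:
    """Fraction of statements in the retrieved context that are relevant."""
    if not statement_is_relevant:
        return 0.0
    return sum(statement_is_relevant) / len(statement_is_relevant)


# Hypothetical verdicts for statements extracted from the retrieved chunks
clean = [True, True, True, False]
noise = [False, False, False, False]

print(contextual_relevancy(clean))          # 0.75
print(contextual_relevancy(clean + noise))  # 0.375
print(contextual_relevancy(noise + clean))  # 0.375
```

Unlike Contextual Precision, the position of the noise makes no difference here; only its volume does.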

In [None]:
# Define Contextual Relevancy metric
contextual_relevancy_metric = ContextualRelevancyMetric(
    threshold=0.6,
    model=wrapped_model,
    verbose_mode=True
)

# Test with clean context
relevancy_test_clean = LLMTestCase(
    input=test_query,
    actual_output=ai_output,
    retrieval_context=clean_context
)

print("Running Contextual Relevancy evaluation (Clean Context)...")
relevancy_results = evaluate(test_cases=[relevancy_test_clean], metrics=[contextual_relevancy_metric])

score = relevancy_results.test_results[0].metrics_data[0].score
success = relevancy_results.test_results[0].metrics_data[0].success
reason = relevancy_results.test_results[0].metrics_data[0].reason

print(f"\nResults: Success={success}, Score={score}")
print(f"Reason: {reason}")

performance_tracker.log_result(
    metric_name="Contextual Relevancy",
    metric_category="retrieval",
    score=score, success=success,
    backend=config.backend, model_name=config.chat_model,
    test_variant="clean_context", reason=reason
)

---

## Section 16: Generator Evaluation - Answer Relevancy

**Answer Relevancy** evaluates how well the generated response addresses the user's query. It needs only `input` and `actual_output`, so no retrieval context is passed.

In [None]:
# Define Answer Relevancy metric
answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.6,
    model=wrapped_model,
    verbose_mode=True
)

answer_relevancy_test = LLMTestCase(
    input=test_query,
    actual_output=ai_output
)

print("Running Answer Relevancy evaluation...")
answer_results = evaluate(test_cases=[answer_relevancy_test], metrics=[answer_relevancy_metric])

score = answer_results.test_results[0].metrics_data[0].score
success = answer_results.test_results[0].metrics_data[0].success
reason = answer_results.test_results[0].metrics_data[0].reason

print(f"\nResults: Success={success}, Score={score}")
print(f"Reason: {reason}")

performance_tracker.log_result(
    metric_name="Answer Relevancy",
    metric_category="generation",
    score=score, success=success,
    backend=config.backend, model_name=config.chat_model,
    test_variant="standard", reason=reason
)

---

## Section 17: Generator Evaluation - Faithfulness

**Faithfulness** checks if the model's response is factually consistent with the retrieved context.
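
A minimal sketch of the scoring idea, assuming the definition in the DeepEval documentation: a claim counts as truthful when it does not contradict the retrieval context. The verdict labels and the sample verdicts are assumptions made for illustration; the judge model would normally extract the claims and judge them.

```python
from typing import List


def faithfulness_score(verdicts: List[str]) -> float:
    """Fraction of claims whose verdict is anything other than "no".

    "yes" means the context supports the claim, "no" means the context
    contradicts it, and "idk" means the context does not address it.
    """
    if not verdicts:
        # Assumption of this sketch: an answer with no claims contradicts nothing
        return 1.0
    not_contradicted = sum(1 for verdict in verdicts if verdict.strip().lower() != "no")
    return not_contradicted / len(verdicts)


# Hypothetical verdicts for four claims extracted from an answer
verdicts = ["yes", "yes", "idk", "no"]

score = faithfulness_score(verdicts)
print(f"Faithfulness: {score:.2f}, passes 0.6 threshold: {score >= 0.6}")
```

Under this reading, a claim the context never mentions does not lower the score; only an outright contradiction does.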

In [None]:
# Define Faithfulness metric
faithfulness_metric = FaithfulnessMetric(
    threshold=0.6,
    model=wrapped_model,
    verbose_mode=True
)

faithfulness_test = LLMTestCase(
    input=test_query,
    actual_output=ai_output,
    retrieval_context=clean_context
)

print("Running Faithfulness evaluation...")
faithfulness_results = evaluate(test_cases=[faithfulness_test], metrics=[faithfulness_metric])

score = faithfulness_results.test_results[0].metrics_data[0].score
success = faithfulness_results.test_results[0].metrics_data[0].success
reason = faithfulness_results.test_results[0].metrics_data[0].reason

print(f"\nResults: Success={success}, Score={score}")
print(f"Reason: {reason}")

performance_tracker.log_result(
    metric_name="Faithfulness",
    metric_category="generation",
    score=score, success=success,
    backend=config.backend, model_name=config.chat_model,
    test_variant="standard", reason=reason
)

---

## Section 18: Generator Evaluation - Hallucination Check

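The **Hallucination** metric detects fabricated information by checking the answer against the supplied `context`. Unlike the other metrics in this tutorial, lower scores are better, so `threshold` is the maximum score that still passes.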
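
The inverted pass condition is easy to get wrong, so here is a small sketch of it. It assumes the definition in the DeepEval documentation, where the score is the fraction of context entries that the answer contradicts; the verdicts are hypothetical.

```python
from typing import List


def hallucination_score(context_contradicted: List[bool]) -> float:
    """Fraction of context entries that the answer contradicts."""
    if not context_contradicted:
        return 0.0
    return sum(context_contradicted) / len(context_contradicted)


def passes(score: float, threshold: float = 0.6) -> bool:
    """For this metric the threshold is a ceiling, not a floor."""
    return score <= threshold


# With a single context entry the score can only be 0.0 or 1.0
print(hallucination_score([False]), passes(hallucination_score([False])))  # 0.0 True
print(hallucination_score([True]), passes(hallucination_score([True])))    # 1.0 False

# With several entries the score becomes more granular
print(hallucination_score([True, False, False, False]))  # 0.25
```

Because the test case in this section passes a single context entry, the result behaves like a pass/fail check rather than a graded score.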

In [None]:
# Define Hallucination metric
hallucination_metric = HallucinationMetric(
    threshold=0.6,
    model=wrapped_model,
    verbose_mode=True
)

hallucination_test = LLMTestCase(
    input=test_query,
    actual_output=ai_output,
    context=[expected_output]
)

print("Running Hallucination Check...")
print("Note: For hallucination, LOWER scores are BETTER!")
hallucination_results = evaluate(test_cases=[hallucination_test], metrics=[hallucination_metric])

score = hallucination_results.test_results[0].metrics_data[0].score
success = hallucination_results.test_results[0].metrics_data[0].success
reason = hallucination_results.test_results[0].metrics_data[0].reason

print(f"\nResults: Success={success}, Score={score}")
print(f"Reason: {reason}")

performance_tracker.log_result(
    metric_name="Hallucination",
    metric_category="generation",
    score=score, success=success,
    backend=config.backend, model_name=config.chat_model,
    test_variant="standard", reason=reason
)

---

## Section 19: Generator Evaluation - G-Eval (Custom Metric)

**G-Eval** allows custom evaluation criteria tailored to your use case.
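
For background, the G-Eval paper proposes weighting each candidate score by the probability the judge model assigns to it, instead of taking a single sampled score. The sketch below shows that calculation with hypothetical probabilities. Whether and how DeepEval applies it depends on the library version and on whether the backend exposes token probabilities, so this is not a description of what `GEval` does internally. Dividing by 10 to reach a 0-1 score is likewise an assumption of this sketch.

```python
from typing import Dict


def probability_weighted_score(score_probs: Dict[int, float]) -> float:
    """Expected score: each candidate score weighted by its probability."""
    total_prob = sum(score_probs.values())
    if total_prob <= 0:
        raise ValueError("score_probs must contain positive probability mass")
    weighted = sum(score * prob for score, prob in score_probs.items())
    # Renormalize in case the listed probabilities do not sum to exactly 1
    return weighted / total_prob


# Hypothetical probabilities a judge model assigns to scores on a 1-10 scale
score_probs = {6: 0.1, 7: 0.2, 8: 0.5, 9: 0.2}

raw_score = probability_weighted_score(score_probs)
normalized = raw_score / 10  # assumed mapping of the 1-10 scale onto 0-1
print(round(raw_score, 2), round(normalized, 2))  # 7.8 0.78
```

The weighted value lands between the integer scores, which gives finer-grained results than a single integer rating would.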

In [None]:
# Define G-Eval metric
from deepeval.test_case import LLMTestCaseParams

geval_metric = GEval(
    name="RAG Fact Checker",
    criteria="Evaluate for accuracy, completeness, grounding, and factual consistency.",
    evaluation_steps=[
        "Break the answer into individual factual statements.",
        "Check if each statement is relevant to the question.",
        "Compare each statement with the expected answer.",
        "Verify that each statement is grounded in the context.",
        "Flag any unsupported statements.",
        "Calculate overall score based on accuracy and completeness."
    ],
    # GEval must be told which test case fields the judge may read
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.6,
    model=wrapped_model,
    verbose_mode=True
)

geval_test = LLMTestCase(
    input=test_query,
    actual_output=ai_output,
    expected_output=expected_output,
    retrieval_context=clean_context
)

print("Running G-Eval (RAG Fact Checker)...")
geval_results = evaluate(test_cases=[geval_test], metrics=[geval_metric])

score = geval_results.test_results[0].metrics_data[0].score
success = geval_results.test_results[0].metrics_data[0].success
reason = geval_results.test_results[0].metrics_data[0].reason

print(f"\nResults: Success={success}, Score={score}")
print(f"Reason: {reason}")

performance_tracker.log_result(
    metric_name="G-Eval (RAG Fact Checker)",
    metric_category="generation",
    score=score, success=success,
    backend=config.backend, model_name=config.chat_model,
    test_variant="standard", reason=reason
)

---

## Section 20: Model Performance Summary

In [None]:
# Display all results
print("\n" + "="*80)
print("COMPLETE MODEL PERFORMANCE REPORT")
print("="*80)
print(f"\nBackend: {config.backend}")
print(f"Chat Model: {config.chat_model}")
print(f"Embedding Model: {config.embedding_model}")
print(f"Total Evaluations: {performance_tracker.run_counter}")

results_df = performance_tracker.display_results()
print("\n" + "-"*80)
print(results_df.to_string() if results_df is not None else "No results")

In [None]:
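# Save results; Ollama tags such as "name:tag" contain characters that are
# not valid in filenames on every platform, so sanitize the model name first
import re
safe_model_name = re.sub(r"[^A-Za-z0-9._-]+", "_", config.chat_model)
output_filename = f"rag_performance_{config.backend}_{safe_model_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
performance_tracker.save_to_csv(output_filename)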

print(f"\n✓ Results saved to: {output_filename}")
print("\nRun this notebook with different configurations to compare models!")
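
Once several runs have been saved, the results can be compared per backend and model. The following is a minimal sketch using only the standard library. It assumes each record carries the field names passed to `log_result` in the cells above; the columns actually written by `save_to_csv` may differ, and the rows and model names here are hypothetical placeholders.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical rows; the keys mirror the arguments passed to log_result,
# and "model-a" / "model-b" are placeholder model names
rows: List[dict] = [
    {"backend": "openai", "model_name": "model-a", "metric_name": "Faithfulness", "score": 0.9},
    {"backend": "openai", "model_name": "model-a", "metric_name": "Faithfulness", "score": 0.7},
    {"backend": "openai", "model_name": "model-a", "metric_name": "Answer Relevancy", "score": 1.0},
    {"backend": "ollama", "model_name": "model-b", "metric_name": "Faithfulness", "score": 0.5},
    {"backend": "ollama", "model_name": "model-b", "metric_name": "Answer Relevancy", "score": 0.75},
]


def mean_scores(rows: List[dict]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Average each metric's score per (backend, model_name) pair."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = (row["backend"], row["model_name"])
        grouped[key][row["metric_name"]].append(float(row["score"]))
    return {
        key: {metric: mean(scores) for metric, scores in metrics.items()}
        for key, metrics in grouped.items()
    }


summary = mean_scores(rows)
for (backend, model_name), metrics in summary.items():
    for metric, avg in sorted(metrics.items()):
        print(f"{backend:<8} {model_name:<10} {metric:<18} {avg:.2f}")
```

With real data, the rows could come from `csv.DictReader` over the saved files, provided the column names match the keys used here.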

---

## Appendix: Deviations from Original Transcript

This notebook has been modified from the original tutorial:

### 1. Removed Azure Dependencies
- **Original**: Used Azure OpenAI service with Azure-specific authentication
- **Modified**: Uses standard OpenAI API with `OPENAI_API_KEY` environment variable

### 2. Added Ollama Support
- **Original**: Only supported Azure OpenAI
- **Modified**: Supports both OpenAI API and local Ollama inference

### 3. Added Performance Tracking
- **Original**: Displayed results inline without tracking
- **Modified**: Includes `PerformanceTracker` class for model comparison

### 4. Simplified Wrapper Class
- **Original**: `AzureChatModelWrapper` bridged Azure OpenAI and DeepEval
- **Modified**: `CustomModelWrapper` works with any LangChain chat model

---

**End of Tutorial**

You now have the knowledge to:
- Build an end-to-end RAG pipeline
- Evaluate retrieval and generation quality
- Compare different LLM backends
- Track and improve model performance over time