# Lab 3.5.6: RAGAS Evaluation Framework

**Module:** 3.5 - RAG Systems & Vector Databases  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐ (Advanced)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

- [ ] Understand the key metrics for evaluating RAG systems
- [ ] Create evaluation datasets with ground truth
- [ ] Use RAGAS-style metrics to measure faithfulness, answer relevancy, context precision, and context recall
- [ ] Build custom evaluation metrics
- [ ] Set quality thresholds for production deployment

---

## 📚 Prerequisites

- Completed: Labs 3.5.1-3.5.5
- A working RAG pipeline to evaluate

---

## 🌍 Real-World Context

**The Problem:** Your RAG system is deployed, but how do you know it's actually working well? Users complain sometimes, but you can't manually check every response.

**The Solution:** Systematic evaluation with a metrics framework such as RAGAS. Like unit tests for code, these metrics catch regressions before users do.

**Industry Standard:** Observability and evaluation tools such as Arize, LangSmith, and Weights & Biases build their evaluation tooling around these concepts.

---

## 🧒 ELI5: RAG Evaluation Metrics

> **Imagine grading a student's open-book exam:**
>
> **Faithfulness**: Did they only use information from the book? (No making stuff up!)
>
> **Answer Relevancy**: Did they actually answer the question? (Not just copy random paragraphs)
>
> **Context Precision**: Did they find the RIGHT pages to look at? (Found relevant sections)
>
> **Context Recall**: Did they find ALL the relevant pages? (Didn't miss important info)
>
> A good student (and a good RAG) scores high on all four!
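
When you know which chunks are relevant, the two retrieval ideas above reduce to set arithmetic. The sketch below is a simplified, rank-unaware illustration that assumes hypothetical chunk IDs and a hand-labelled relevant set; RAGAS itself estimates these quantities with an LLM judge rather than labelled IDs.

```python
def context_precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for retrieved chunk IDs against a labelled relevant set."""
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    # Precision: what share of the retrieved chunks were relevant?
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant chunks were retrieved?
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical IDs: 4 chunks retrieved, 2 of them relevant, 3 relevant chunks exist
precision, recall = context_precision_recall(
    ["c1", "c2", "c3", "c4"], {"c1", "c3", "c9"}
)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.50, recall=0.67
```

High precision with low recall means the retriever is picky but misses material; the reverse means it finds everything but buries it in noise.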

---

## Part 1: Setup

In [None]:
# Install dependencies
!pip install -q \
    ragas==0.1.21 \
    langchain langchain-community langchain-huggingface \
    chromadb sentence-transformers \
    datasets \
    ollama

print("✅ Dependencies installed!")

In [None]:
import os
import time
import json
from pathlib import Path
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, asdict
import numpy as np

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

import ollama

import torch
import gc

print(f"CUDA: {torch.cuda.is_available()}")

In [None]:
# Load documents and build RAG pipeline
DOCS_PATH = Path("../data/sample_documents")

documents = []
for file_path in sorted(DOCS_PATH.glob("*.md")):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = f.read()
    documents.append(Document(
        page_content=content,
        metadata={"source": file_path.name}
    ))

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

print(f"📚 Loaded {len(documents)} documents → {len(chunks)} chunks")

In [None]:
# Build the RAG pipeline
print("🔄 Building RAG pipeline...")

embedding_model = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda" if torch.cuda.is_available() else "cpu"},
    encode_kwargs={"normalize_embeddings": True}
)

import shutil
CHROMA_PATH = "./eval_chroma_db"
if Path(CHROMA_PATH).exists():
    shutil.rmtree(CHROMA_PATH)

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=CHROMA_PATH
)

print("✅ RAG pipeline ready!")

In [None]:
# RAG query function
LLM_MODEL = "qwen3:8b"

def rag_query(question: str, k: int = 5) -> Dict[str, Any]:
    """
    Execute RAG query and return structured result for evaluation.
    """
    # Retrieve
    results = vectorstore.similarity_search(question, k=k)
    contexts = [doc.page_content for doc in results]
    
    # Generate
    context_str = "\n\n---\n\n".join(contexts)
    prompt = f"""Answer the question based ONLY on the provided context.
If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context_str}

Question: {question}

Answer:"""
    
    response = ollama.chat(
        model=LLM_MODEL,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return {
        "question": question,
        "contexts": contexts,
        "answer": response["message"]["content"]
    }

# Test
test_result = rag_query("What is DGX Spark's memory capacity?")
print("✅ RAG query working!")
print(f"   Answer: {test_result['answer'][:100]}...")

---

## Part 2: Understanding RAGAS Metrics

### The Four Key Metrics

| Metric | What It Measures | Input Required |
|--------|-----------------|----------------|
| **Faithfulness** | Is the answer grounded in context? | Question, Context, Answer |
| **Answer Relevancy** | Does the answer address the question? | Question, Answer |
| **Context Precision** | Are retrieved docs relevant? | Question, Context, Ground Truth |
| **Context Recall** | Are all needed docs retrieved? | Context, Ground Truth |
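
RAGAS describes faithfulness as the fraction of claims in the answer that the context supports. The sketch below shows only that arithmetic. The claims and their verdicts are hand-written placeholders standing in for what an LLM judge would produce, and scoring an empty claim list as 0 is an assumption of this sketch.

```python
def faithfulness_score(claim_verdicts: dict[str, bool]) -> float:
    """Fraction of answer claims judged as supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0  # assumption: an answer with no checkable claims scores 0
    return sum(claim_verdicts.values()) / len(claim_verdicts)

# Placeholder verdicts; in practice an LLM judge would produce these
verdicts = {
    "DGX Spark has 128GB of memory": True,
    "The memory is shared between CPU and GPU": True,
    "It ships with 16 GPUs": False,  # made-up unsupported claim for illustration
}
print(round(faithfulness_score(verdicts), 2))  # 0.67
```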

### Python Dataclasses

We'll use Python's `@dataclass` decorator to create structured data containers:

```python
from dataclasses import dataclass, asdict

@dataclass
class Example:
    name: str
    value: int = 0  # Default value
    
# Usage:
ex = Example(name="test", value=42)
print(ex.name)  # "test"
print(asdict(ex))  # {"name": "test", "value": 42}
```

Key functions:
- `@dataclass`: Decorator that auto-generates `__init__`, `__repr__`, etc.
- `asdict(obj)`: Converts a dataclass instance to a dictionary
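
One detail worth knowing when a field should default to a list: dataclasses reject a bare `[]` as a default, and `field(default_factory=list)` is the standard-library way to give each instance its own list. A minimal sketch using a hypothetical `Sample` class:

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class Sample:
    question: str
    # default_factory builds a fresh list per instance instead of sharing one
    contexts: List[str] = field(default_factory=list)

a, b = Sample("q1"), Sample("q2")
a.contexts.append("chunk")
print(b.contexts)  # []
print(asdict(a))   # {'question': 'q1', 'contexts': ['chunk']}
```

Using `None` as the default with an `Optional[...]` type hint is an equally valid choice when "not filled in yet" needs to be distinguishable from "empty".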

In [None]:
# RAGAS metrics rely on an LLM judge, so we implement simplified custom metrics
# that work with a local LLM via Ollama

from dataclasses import dataclass, asdict

@dataclass
class EvaluationSample:
    """
    A single sample for RAG evaluation.
    
    @dataclass automatically generates:
    - __init__(question, ground_truth, contexts=None, answer=None)
    - __repr__ for pretty printing
    - __eq__ for comparison
    """
    question: str
    ground_truth: str
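    contexts: Optional[List[str]] = None  # Filled in after running the RAG query
    answer: Optional[str] = None          # Filled in after running the RAG query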

    
@dataclass 
class EvaluationResult:
    """
    Evaluation results for a single sample.
    
    Properties can be defined in dataclasses using @property decorator.
    """
    question: str
    faithfulness: float
    answer_relevancy: float
    context_precision: float
    context_recall: float
    
    @property
    def average(self) -> float:
        """Compute average of all metrics."""
        return (self.faithfulness + self.answer_relevancy + 
                self.context_precision + self.context_recall) / 4

---

## Part 3: Creating an Evaluation Dataset

In [None]:
# Create evaluation dataset with ground truth answers
evaluation_dataset = [
    EvaluationSample(
        question="What is the memory capacity of DGX Spark?",
        ground_truth="DGX Spark has 128GB of unified LPDDR5X memory that is shared between CPU and GPU."
    ),
    EvaluationSample(
        question="How many CUDA cores does DGX Spark have?",
        ground_truth="DGX Spark has 6,144 CUDA cores."
    ),
    EvaluationSample(
        question="What is the attention mechanism in transformers?",
        ground_truth="The attention mechanism allows each position in a sequence to attend to all other positions, computing Query, Key, and Value vectors to capture relationships."
    ),
    EvaluationSample(
        question="How does LoRA reduce memory requirements?",
        ground_truth="LoRA freezes pretrained weights and injects trainable low-rank decomposition matrices, training only 0.1-1% of parameters."
    ),
    EvaluationSample(
        question="What is QLoRA?",
        ground_truth="QLoRA combines LoRA with 4-bit quantization (NF4), keeping the base model quantized while training LoRA adapters in FP16/BF16."
    ),
    EvaluationSample(
        question="What is GPTQ quantization?",
        ground_truth="GPTQ is a one-shot weight quantization method using approximate second-order information to find optimal quantized values."
    ),
    EvaluationSample(
        question="What are the advantages of RAG over fine-tuning?",
        ground_truth="RAG provides dynamic knowledge updates, grounded responses with source citations, scalable knowledge without retraining, and domain expertise."
    ),
    EvaluationSample(
        question="What is hybrid search in RAG?",
        ground_truth="Hybrid search combines dense retrieval (embeddings) with sparse retrieval (BM25/keywords) using fusion methods like RRF."
    ),
    EvaluationSample(
        question="How does ChromaDB compare to FAISS?",
        ground_truth="ChromaDB is Python-native and easy to use, while FAISS offers GPU acceleration and better performance at scale but lacks built-in filtering."
    ),
    EvaluationSample(
        question="What is positional encoding in transformers?",
        ground_truth="Positional encoding adds position information to tokens since transformers process positions in parallel, using sinusoidal functions or learned embeddings."
    ),
]

print(f"📋 Created {len(evaluation_dataset)} evaluation samples")

In [None]:
# Run RAG on all evaluation samples
print("🔄 Running RAG on evaluation samples...")

for i, sample in enumerate(evaluation_dataset):
    result = rag_query(sample.question)
    sample.contexts = result["contexts"]
    sample.answer = result["answer"]
    print(f"   [{i+1}/{len(evaluation_dataset)}] {sample.question[:40]}...")

print("✅ All samples processed!")

---

## Part 4: Implementing Custom Evaluation Metrics

We'll implement simplified versions of RAGAS metrics that work with local LLMs.
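
Local judge models often wrap the score in extra text, and some reasoning-tuned models may emit a `<think>...</think>` block containing other numbers before the verdict. The sketch below shows a stricter way to pull out the score. It assumes the last standalone number in the cleaned response is the verdict; `parse_score` is a hypothetical helper, not part of RAGAS or Ollama.

```python
import re
from typing import Optional

ALLOWED_SCORES = (0.0, 0.5, 1.0)

def parse_score(response: str) -> Optional[float]:
    """Extract a 0.0/0.5/1.0 score from a judge response, or None if absent."""
    # Drop any <think>...</think> reasoning block before looking for numbers
    cleaned = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Match standalone numbers only, so "10" or "0.75" are not read as "1" or "0"
    numbers = re.findall(r"(?<![\d.])\d+(?:\.\d+)?(?!\.?\d)", cleaned)
    # Assumption: the last allowed number in the response is the verdict
    for token in reversed(numbers):
        value = float(token)
        if value in ALLOWED_SCORES:
            return value
    return None

print(parse_score("<think>maybe 0.5?</think>\nScore: 1.0"))  # 1.0
print(parse_score("I rate this 10 out of 10"))               # None
```

Returning `None` instead of a default score lets the caller decide whether to retry, skip the sample, or count it as a parsing failure, rather than silently blending failures into the average.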

In [None]:
class RAGEvaluator:
    """
    Custom RAG evaluator using local LLM.
    """
    
    def __init__(self, llm_model: str = "qwen3:8b"):
        self.llm_model = llm_model
        
    def _llm_judge(self, prompt: str) -> str:
        """Get LLM judgment."""
        response = ollama.chat(
            model=self.llm_model,
            messages=[{"role": "user", "content": prompt}]
        )
        return response["message"]["content"].strip()
    
    def evaluate_faithfulness(self, sample: EvaluationSample) -> float:
        """
        Evaluate if the answer is grounded in the provided context.
        Returns score from 0 to 1.
        """
        context_str = "\n".join(sample.contexts[:3])  # Use top 3 contexts
        
        prompt = f"""You are an expert evaluator. Determine if the answer is faithfully grounded in the provided context.

CONTEXT:
{context_str}

ANSWER:
{sample.answer}

EVALUATION CRITERIA:
- Score 1.0: All claims in the answer are supported by the context
- Score 0.5: Some claims are supported, some are not
- Score 0.0: None of the claims are supported by the context (hallucination)

Respond with ONLY a number: 0.0, 0.5, or 1.0"""
        
        try:
            response = self._llm_judge(prompt)
            # Extract number from response
            for val in ["1.0", "0.5", "0.0", "1", "0"]:
                if val in response:
                    return float(val)
            return 0.5  # Default if parsing fails
        except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
            return 0.5
    
    def evaluate_answer_relevancy(self, sample: EvaluationSample) -> float:
        """
        Evaluate if the answer addresses the question.
        """
        prompt = f"""You are an expert evaluator. Determine if the answer directly addresses the question.

QUESTION:
{sample.question}

ANSWER:
{sample.answer}

EVALUATION CRITERIA:
- Score 1.0: The answer directly and completely addresses the question
- Score 0.5: The answer partially addresses the question or is incomplete
- Score 0.0: The answer does not address the question at all

Respond with ONLY a number: 0.0, 0.5, or 1.0"""
        
        try:
            response = self._llm_judge(prompt)
            for val in ["1.0", "0.5", "0.0", "1", "0"]:
                if val in response:
                    return float(val)
            return 0.5
        except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
            return 0.5
    
    def evaluate_context_precision(self, sample: EvaluationSample) -> float:
        """
        Evaluate if retrieved contexts are relevant to the question.
        """
        relevant_count = 0
        
        for context in sample.contexts[:5]:  # Check top 5
            prompt = f"""Is this context relevant to answering the question?

QUESTION: {sample.question}

CONTEXT: {context[:500]}

Respond with ONLY: YES or NO"""
            
            try:
                response = self._llm_judge(prompt).upper()
                if "YES" in response:
                    relevant_count += 1
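            except Exception:  # Treat a failed judgment as "not relevant"
                pass
        
        # Guard against division by zero when no contexts were retrieved
        checked = min(5, len(sample.contexts))
        return relevant_count / checked if checked else 0.0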
    
    def evaluate_context_recall(self, sample: EvaluationSample) -> float:
        """
        Evaluate if the key information from ground truth is in the contexts.
        """
        context_str = "\n".join(sample.contexts)
        
        prompt = f"""Does the provided context contain the information needed to produce this ground truth answer?

GROUND TRUTH ANSWER:
{sample.ground_truth}

RETRIEVED CONTEXT:
{context_str[:2000]}

EVALUATION:
- Score 1.0: All key information from ground truth is present in context
- Score 0.5: Some key information is present
- Score 0.0: Little to no relevant information in context

Respond with ONLY a number: 0.0, 0.5, or 1.0"""
        
        try:
            response = self._llm_judge(prompt)
            for val in ["1.0", "0.5", "0.0", "1", "0"]:
                if val in response:
                    return float(val)
            return 0.5
        except Exception:  # Avoid a bare except, which would also swallow KeyboardInterrupt
            return 0.5
    
    def evaluate(self, sample: EvaluationSample) -> EvaluationResult:
        """
        Run all evaluations on a sample.
        """
        return EvaluationResult(
            question=sample.question,
            faithfulness=self.evaluate_faithfulness(sample),
            answer_relevancy=self.evaluate_answer_relevancy(sample),
            context_precision=self.evaluate_context_precision(sample),
            context_recall=self.evaluate_context_recall(sample)
        )


# Create evaluator
evaluator = RAGEvaluator(llm_model=LLM_MODEL)
print("✅ Evaluator ready!")

---

## Part 5: Running Evaluation

In [None]:
# Run evaluation on all samples
print("🔬 Running evaluation...")
print("   (This may take a few minutes)\n")

results = []
for i, sample in enumerate(evaluation_dataset):
    print(f"   [{i+1}/{len(evaluation_dataset)}] Evaluating: {sample.question[:40]}...")
    result = evaluator.evaluate(sample)
    results.append(result)
    print(f"       Faithfulness: {result.faithfulness:.2f}, Relevancy: {result.answer_relevancy:.2f}, "
          f"Precision: {result.context_precision:.2f}, Recall: {result.context_recall:.2f}")

print("\n✅ Evaluation complete!")

In [None]:
# Aggregate results
def calculate_aggregate_metrics(results: List[EvaluationResult]) -> Dict[str, float]:
    """Calculate aggregate metrics across all samples."""
    faithfulness = np.mean([r.faithfulness for r in results])
    answer_relevancy = np.mean([r.answer_relevancy for r in results])
    context_precision = np.mean([r.context_precision for r in results])
    context_recall = np.mean([r.context_recall for r in results])
    
    return {
        "faithfulness": faithfulness,
        "answer_relevancy": answer_relevancy,
        "context_precision": context_precision,
        "context_recall": context_recall,
        "average": (faithfulness + answer_relevancy + context_precision + context_recall) / 4
    }

metrics = calculate_aggregate_metrics(results)

print("\n" + "=" * 60)
print("📊 EVALUATION RESULTS")
print("=" * 60)
print(f"\n{'Metric':<25} {'Score':<10} {'Status'}")
print("-" * 50)

for metric, score in metrics.items():
    if metric == "average":
        continue
    status = "✅ Good" if score >= 0.7 else ("⚠️ Needs Work" if score >= 0.5 else "❌ Poor")
    bar = "█" * int(score * 20) + "░" * (20 - int(score * 20))
    print(f"{metric:<25} {bar} {score:.2f} {status}")

print("-" * 50)
print(f"{'OVERALL AVERAGE':<25} {metrics['average']:.2f}")
print("=" * 60)

---

## Part 6: Detailed Analysis

In [None]:
# Find problematic samples
print("\n🔍 Detailed Sample Analysis:")
print("=" * 70)

for i, (sample, result) in enumerate(zip(evaluation_dataset, results)):
    avg = result.average
    status = "✅" if avg >= 0.75 else ("⚠️" if avg >= 0.5 else "❌")
    
    print(f"\n{status} Sample {i+1}: {sample.question[:50]}...")
    print(f"   Faithfulness: {result.faithfulness:.2f} | "
          f"Relevancy: {result.answer_relevancy:.2f} | "
          f"Precision: {result.context_precision:.2f} | "
          f"Recall: {result.context_recall:.2f}")
    print(f"   Average: {avg:.2f}")
    
    # Show answer preview for low scores
    if avg < 0.5:
        print(f"   Answer: {sample.answer[:100]}...")
        print(f"   Ground Truth: {sample.ground_truth[:100]}...")

In [None]:
# Identify failure patterns
print("\n📈 Failure Pattern Analysis:")
print("=" * 60)

low_faithfulness = [r for r in results if r.faithfulness < 0.5]
low_relevancy = [r for r in results if r.answer_relevancy < 0.5]
low_precision = [r for r in results if r.context_precision < 0.5]
low_recall = [r for r in results if r.context_recall < 0.5]

print(f"\n🔸 Low Faithfulness ({len(low_faithfulness)}/{len(results)}): Hallucination issues")
print(f"🔸 Low Answer Relevancy ({len(low_relevancy)}/{len(results)}): Not answering question")
print(f"🔸 Low Context Precision ({len(low_precision)}/{len(results)}): Retrieving irrelevant docs")
print(f"🔸 Low Context Recall ({len(low_recall)}/{len(results)}): Missing relevant docs")

# Recommendations
print("\n💡 Recommendations:")
if len(low_faithfulness) > len(results) * 0.3:
    print("   - Improve prompt to emphasize using only context")
    print("   - Consider adding 'if unsure, say so' instruction")
if len(low_precision) > len(results) * 0.3:
    print("   - Try hybrid search (dense + sparse)")
    print("   - Add reranking stage")
if len(low_recall) > len(results) * 0.3:
    print("   - Increase chunk overlap")
    print("   - Try smaller chunk sizes for more granular retrieval")

---

## Part 7: Setting Quality Thresholds

In [None]:
# Define quality thresholds for production
@dataclass
class QualityThresholds:
    """Quality thresholds for production deployment."""
    min_faithfulness: float = 0.7
    min_answer_relevancy: float = 0.7
    min_context_precision: float = 0.6
    min_context_recall: float = 0.6
    min_average: float = 0.65


def check_production_ready(
    metrics: Dict[str, float],
    thresholds: QualityThresholds = QualityThresholds()
) -> Dict[str, Any]:
    """
    Check if the system meets production quality thresholds.
    """
    checks = {
        "faithfulness": metrics["faithfulness"] >= thresholds.min_faithfulness,
        "answer_relevancy": metrics["answer_relevancy"] >= thresholds.min_answer_relevancy,
        "context_precision": metrics["context_precision"] >= thresholds.min_context_precision,
        "context_recall": metrics["context_recall"] >= thresholds.min_context_recall,
        "average": metrics["average"] >= thresholds.min_average,
    }
    
    return {
        "passed": all(checks.values()),
        "checks": checks,
        "metrics": metrics,
        "thresholds": asdict(thresholds)
    }


# Check production readiness
thresholds = QualityThresholds()
prod_check = check_production_ready(metrics, thresholds)

print("\n" + "=" * 60)
print("🚀 PRODUCTION READINESS CHECK")
print("=" * 60)

for metric, passed in prod_check["checks"].items():
    icon = "✅" if passed else "❌"
    actual = metrics.get(metric, 0)
    threshold = getattr(thresholds, f"min_{metric}", 0)
    print(f"   {icon} {metric}: {actual:.2f} (threshold: {threshold:.2f})")

print("-" * 60)
if prod_check["passed"]:
    print("🎉 SYSTEM IS PRODUCTION READY!")
else:
    print("⚠️ SYSTEM NEEDS IMPROVEMENT BEFORE PRODUCTION")

In [None]:
# Save evaluation report
report = {
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
    "num_samples": len(evaluation_dataset),
    "metrics": metrics,
    "production_ready": prod_check["passed"],
    "thresholds": asdict(thresholds),
    "sample_results": [
        {
            "question": r.question,
            "faithfulness": r.faithfulness,
            "answer_relevancy": r.answer_relevancy,
            "context_precision": r.context_precision,
            "context_recall": r.context_recall,
            "average": r.average
        }
        for r in results
    ]
}

report_path = Path("./evaluation_report.json")
with open(report_path, 'w') as f:
    json.dump(report, f, indent=2)

print(f"\n📄 Evaluation report saved to: {report_path}")

---

## ⚠️ Common Mistakes

### Mistake 1: Evaluating Without Ground Truth
```python
# ❌ Wrong: No ground truth, can't measure recall
eval_samples = [{"question": "..."}]  # Missing ground_truth

# ✅ Right: Include ground truth for meaningful evaluation
eval_samples = [
    {"question": "...", "ground_truth": "Expected answer..."}
]
```

### Mistake 2: Too Few Evaluation Samples
```python
# ❌ Wrong: 5 samples isn't statistically meaningful
eval_dataset = samples[:5]

# ✅ Right: At least 50 samples for reliable metrics
eval_dataset = samples[:50]
```
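
To see why sample count matters, it helps to look at how uncertain the mean score is. The sketch below uses a percentile bootstrap from the standard library. `bootstrap_ci` is a hypothetical helper, and the scores are made up on this lab's 0.0 / 0.5 / 1.0 scale.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up scores for illustration only
small = [1.0, 0.5, 1.0, 0.0, 1.0]
large = small * 10  # same mean, 50 samples

for name, scores in [("n=5", small), ("n=50", large)]:
    lo, hi = bootstrap_ci(scores)
    print(f"{name}: mean={statistics.fmean(scores):.2f}, 95% CI width={hi - lo:.2f}")
```

Both datasets have the same mean, but the interval around the 5-sample mean is much wider, so a threshold check on it says far less about the system.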

### Mistake 3: Not Testing Edge Cases
```python
# ❌ Wrong: Only testing happy-path questions
questions = ["What is X?", "How does Y work?"]

# ✅ Right: Include edge cases
questions = [
    "What is X?",                          # Normal
    "What is the recipe for pizza?",       # Out of domain
    "Compare X and Y and Z in detail",     # Complex
    "X?",                                   # Minimal query
]
```
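
For the out-of-domain case, the RAG prompt in this lab tells the model to say "I don't have enough information." The sketch below checks for that refusal with a plain substring match. `is_refusal` is a hypothetical helper, the sample answers are made up, and a substring check will miss paraphrased refusals.

```python
REFUSAL_MARKER = "i don't have enough information"

def is_refusal(answer: str) -> bool:
    """True if the answer contains the refusal phrase requested in the RAG prompt."""
    # Lowercase and normalize curly apostrophes so minor formatting differences still match
    normalized = answer.lower().replace("\u2019", "'")
    return REFUSAL_MARKER in normalized

print(is_refusal("I don't have enough information."))        # True
print(is_refusal("Pizza dough needs flour, water, yeast."))  # False
```

For out-of-domain questions a refusal is the correct behavior, so an evaluation set can score these samples on whether the system refused rather than on the four metrics.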

---

## ✋ Try It Yourself

### Exercise 1: Add More Samples
Expand the evaluation dataset to 30+ samples covering all documents.

### Exercise 2: Edge Case Testing
Add samples for: out-of-domain questions, ambiguous questions, and multi-hop reasoning.

### Exercise 3: A/B Comparison
Compare your current RAG with one using different chunking or retrieval settings.

<details>
<summary>💡 Hint for Exercise 3</summary>

```python
# Build two RAG variants
# (build_rag, evaluate_all, and calculate_average are placeholder helpers you write yourself)
rag_v1 = build_rag(chunk_size=256)
rag_v2 = build_rag(chunk_size=1024)

# Evaluate both
results_v1 = evaluate_all(rag_v1, eval_dataset)
results_v2 = evaluate_all(rag_v2, eval_dataset)

# Compare
print(f"V1 Average: {calculate_average(results_v1):.2f}")
print(f"V2 Average: {calculate_average(results_v2):.2f}")
```
</details>
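
When comparing two variants, a single overall average can hide a trade-off between metrics. The sketch below compares two aggregate metric dictionaries per metric. `compare_runs` and its `min_delta` cutoff are hypothetical, and the scores are made up.

```python
def compare_runs(metrics_a, metrics_b, min_delta=0.05):
    """Label each metric as improved, regressed, or unchanged between two runs."""
    comparison = {}
    for name in metrics_a:
        delta = metrics_b[name] - metrics_a[name]
        # Differences smaller than min_delta are treated as noise
        if delta >= min_delta:
            verdict = "improved"
        elif delta <= -min_delta:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        comparison[name] = (round(delta, 2), verdict)
    return comparison

# Made-up aggregate scores for two hypothetical variants
v1 = {"faithfulness": 0.80, "context_precision": 0.55, "context_recall": 0.70}
v2 = {"faithfulness": 0.78, "context_precision": 0.70, "context_recall": 0.60}
for name, (delta, verdict) in compare_runs(v1, v2).items():
    print(f"{name:<20} {delta:+.2f}  {verdict}")
```

In this made-up example the second variant gains precision but loses recall, which an overall average would largely cancel out.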

---

## 🎉 Checkpoint

You've learned:
- ✅ The four key RAGAS metrics: Faithfulness, Relevancy, Precision, Recall
- ✅ How to create evaluation datasets with ground truth
- ✅ How to implement custom evaluation metrics
- ✅ How to set and check production quality thresholds

**Key Insight:** Systematic evaluation is what separates demo projects from production systems. Always measure before deploying!

---

## 🧹 Cleanup

In [None]:
# Clean up
del embedding_model, vectorstore
gc.collect()
torch.cuda.empty_cache()

if Path(CHROMA_PATH).exists():
    shutil.rmtree(CHROMA_PATH)

print("✅ Cleanup complete!")

---

## Next Steps

In the final lab, we'll build a **production-ready RAG system** with all the best practices!

➡️ Continue to [Lab 3.5.7: Production RAG System](./lab-3.5.7-production-rag.ipynb)