# Module 7: Evaluating RAG Applications

## Learning Objectives
- Understand what components to evaluate in RAG systems
- Learn key evaluation metrics for RAG
- Use MLflow for RAG evaluation
- Implement evaluation best practices
- Interpret evaluation results


## 1. Components to Evaluate

### 1.1 Chunking

**What to evaluate**:
- **Method**: Fixed-size, semantic, header-based, etc.
- **Size**: Chunk size and overlap
- **Quality**: Do chunks preserve semantic meaning?

**Metrics**:
- Chunk size distribution
- Overlap percentage
- Semantic coherence
- Context preservation
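
The first two metrics can be computed directly from the chunk list. The sketch below assumes chunks are plain strings from a sliding-window splitter and measures overlap as the longest suffix of one chunk that is also a prefix of the next. `overlap_chars` and `chunk_stats` are illustrative names, not part of any library.

```python
import statistics

def overlap_chars(prev, nxt):
    """Length of the longest suffix of `prev` that is also a prefix of `nxt`."""
    for size in range(min(len(prev), len(nxt)), 0, -1):
        if prev.endswith(nxt[:size]):
            return size
    return 0

def chunk_stats(chunks):
    """Size distribution and mean overlap (as a percentage of chunk length)."""
    if not chunks:
        return {"count": 0}
    sizes = [len(c) for c in chunks]
    overlaps = [
        overlap_chars(a, b) / len(a) for a, b in zip(chunks, chunks[1:]) if a
    ]
    return {
        "count": len(chunks),
        "mean_size": statistics.mean(sizes),
        "min_size": min(sizes),
        "max_size": max(sizes),
        "mean_overlap_pct": 100 * statistics.mean(overlaps) if overlaps else 0.0,
    }

text = "abcdefghijklmnopqrstuvwxyz"
# Fixed-size chunks of 10 characters with a stride of 8 (2-character overlap)
chunks = [text[i:i + 10] for i in range(0, len(text) - 2, 8)]
stats = chunk_stats(chunks)
print(stats)
```

Semantic coherence and context preservation are harder to reduce to a formula and usually need human or LLM-based review.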

### 1.2 Embedding Model

**What to evaluate**:
- **Model selection**: Is the model appropriate for the domain?
- **Quality**: Do embeddings capture semantic relationships?
- **Alignment**: Do query embeddings land close to the embeddings of the documents that answer them?

**Metrics**:
- Embedding quality (semantic similarity)
- Query-document alignment
- Domain-specific performance
- Dimensionality and efficiency
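
One simple way to probe query-document alignment is to compare cosine similarity between a query and documents labeled relevant against documents labeled irrelevant. This is a minimal sketch with made-up 3-dimensional vectors standing in for real embeddings; `alignment_gap` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def alignment_gap(query_vec, relevant_vecs, irrelevant_vecs):
    """Mean similarity to relevant docs minus mean similarity to irrelevant docs."""
    rel = [cosine_similarity(query_vec, v) for v in relevant_vecs]
    irr = [cosine_similarity(query_vec, v) for v in irrelevant_vecs]
    return sum(rel) / len(rel) - sum(irr) / len(irr)

# Toy vectors; in practice these come from the embedding model under test
query = [1.0, 0.0, 0.0]
relevant = [[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]]
irrelevant = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

gap = alignment_gap(query, relevant, irrelevant)
print(f"alignment gap: {gap:.3f}")
```

A larger positive gap suggests the model separates relevant from irrelevant documents for that query; a gap near zero suggests it does not.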

### 1.3 Vector Store

**What to evaluate**:
- **Retrieval**: How well does retrieval work?
- **Performance**: Query latency and throughput
- **Scalability**: Performance at scale

**Metrics**:
- Retrieval precision
- Retrieval recall
- Query latency
- Throughput
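
Query latency and throughput can be measured with a small timing harness. The sketch below is an assumption-laden example: `fake_search` is a stand-in for a real vector store call, and the percentile choices (p50, p95) are common conventions rather than requirements.

```python
import statistics
import time

def measure_latency(search_fn, queries):
    """Time each query and report latency percentiles and queries per second."""
    latencies_ms = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "qps": len(queries) / elapsed,
    }

def fake_search(query):
    # Stand-in for a real vector store lookup; replace with your client call
    return sorted(range(1000), key=lambda i: (i * 7919) % 1000)[:5]

report = measure_latency(fake_search, ["What is RAG?"] * 200)
print(report)
```

For scalability, the same harness can be rerun against indexes of increasing size to see how the percentiles move.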

### 1.4 Retrieval and Re-ranker

**What to evaluate**:
- **Initial retrieval**: Quality of first-stage retrieval
- **Re-ranking**: Improvement from re-ranking
- **Filtering**: Effectiveness of metadata filters

**Metrics**:
- Top-k accuracy
- Re-ranking improvement
- Filter effectiveness
- Overall retrieval quality
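
Top-k accuracy and re-ranking improvement can be sketched as follows. This example assumes top-k accuracy means "at least one relevant chunk appears in the top k" (often called hit rate) and uses mean reciprocal rank (MRR) as one possible way to quantify what the re-ranker adds. The document IDs are invented.

```python
def hit_rate_at_k(ranked_lists, relevant_sets, k):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(doc in relevant for doc in ranked[:k])
    )
    return hits / len(ranked_lists)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant doc (0 if none is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Two queries: relevant docs, first-stage ranking, ranking after re-ranking
relevant = [{"d1"}, {"d7"}]
first_stage = [["d3", "d2", "d1"], ["d5", "d7", "d6"]]
reranked = [["d1", "d3", "d2"], ["d7", "d5", "d6"]]

improvement = (
    mean_reciprocal_rank(reranked, relevant)
    - mean_reciprocal_rank(first_stage, relevant)
)
print(f"hit rate@1 before: {hit_rate_at_k(first_stage, relevant, 1):.2f}")
print(f"hit rate@1 after:  {hit_rate_at_k(reranked, relevant, 1):.2f}")
print(f"MRR improvement:   {improvement:.3f}")
```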

### 1.5 Generator

**What to evaluate**:
- **Response quality**: Accuracy, relevance, completeness
- **Faithfulness**: Is the response grounded in context?
- **Format**: Does it meet requirements?

**Metrics**:
- Answer accuracy
- Answer relevancy
- Faithfulness
- Completeness


## 2. Key Evaluation Metrics

### 2.1 Context Precision

**Definition**: Proportion of retrieved chunks that are relevant to the query

**Formula**: Relevant chunks retrieved / Total chunks retrieved

**Example**:
- Query: "What is RAG?"
- Retrieved: 5 chunks
- Relevant: 4 chunks
- Context Precision: 4/5 = 0.8

**Interpretation**: Higher is better (1.0 = all retrieved chunks are relevant)
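
The formula translates directly into code. A minimal sketch, assuming chunks are identified by IDs and that relevance labels are already available (from human annotation or an LLM judge); the IDs below are invented to reproduce the 4/5 example.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Relevant chunks retrieved / total chunks retrieved."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)

retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c1", "c2", "c3", "c5"}  # c4 is the one irrelevant chunk

precision = context_precision(retrieved, relevant)
print(precision)
```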

### 2.2 Context Relevancy

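**Definition**: A graded score of how closely the retrieved chunks match the query. Context precision counts each chunk as relevant or not; context relevancy measures the degree of relevance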

**Measures**: 
- Semantic relevance
- Topic alignment
- Information usefulness

**Evaluation**: 
- Human evaluation (0-1 scale)
- LLM-based evaluation
- Semantic similarity scores

**Interpretation**: Higher is better

### 2.3 Context Recall

**Definition**: Proportion of all relevant chunks that were retrieved

**Formula**: Relevant chunks retrieved / Total relevant chunks available

**Example**:
- Total relevant chunks in corpus: 10
- Retrieved relevant chunks: 7
- Context Recall: 7/10 = 0.7

**Interpretation**: Higher is better (1.0 = all relevant chunks retrieved)

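**Trade-off**: Retrieving more chunks (a larger top-k) tends to raise recall but usually lowers precision, because the extra chunks are less likely to be relevant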
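
The trade-off between recall and precision is easiest to see by sweeping top-k over a fixed ranking. This sketch assumes unique chunk IDs and known relevance labels; the ranking is invented for illustration.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Context precision and recall when only the top-k ranked chunks are kept."""
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

ranked = ["r1", "r2", "x1", "r3", "x2", "x3", "r4", "x4", "x5", "x6"]
relevant = {"r1", "r2", "r3", "r4", "r5"}  # r5 is never retrieved

for k in (2, 5, 10):
    precision, recall = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k:>2}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy ranking, recall climbs from 0.40 to 0.80 as k grows from 2 to 10, while precision falls from 1.00 to 0.40.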

### 2.4 Faithfulness

**Definition**: The extent to which every claim in the generated answer is supported by the retrieved context

**Measures**:
- No hallucinations
- Grounded in provided context
- No contradictions

**Evaluation**:
- Check if answer can be derived from context
- Verify claims against context
- Detect hallucinations

**Interpretation**: Higher is better (1.0 = completely faithful)

**Critical**: A low faithfulness score means the answer contains claims the context does not support (hallucinations), which makes the system unreliable
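
The "verify claims against context" step can be sketched as supported claims divided by total claims. The example below is a deliberately crude proxy: it treats each sentence as one claim and calls a claim supported when at least 70% of its words appear in the context. Both the sentence splitting and the threshold are assumptions; production systems typically use an LLM judge or an NLI model for this check.

```python
import re

def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def faithfulness_score(answer, context_chunks, threshold=0.7):
    """Share of answer sentences whose words are mostly found in the context."""
    context_tokens = _tokens(" ".join(context_chunks))
    claims = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        claim_tokens = _tokens(claim)
        if not claim_tokens:
            continue
        coverage = len(claim_tokens & context_tokens) / len(claim_tokens)
        if coverage >= threshold:
            supported += 1
    return supported / len(claims)

context = [
    "RAG combines a retriever with a generator.",
    "The retriever fetches relevant chunks from a vector store.",
]
answer = "RAG combines a retriever with a generator. It was invented in 1995 by NASA."

score = faithfulness_score(answer, context)
print(score)  # the second sentence has no support in the context
```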

### 2.5 Answer Relevancy

**Definition**: How relevant is the generated answer to the query?

**Measures**:
- Does it answer the question?
- Is it on-topic?
- Is it useful?

**Evaluation**:
- Human evaluation
- LLM-based evaluation
- Semantic similarity to expected answer

**Interpretation**: Higher is better

### 2.6 Answer Correctness

**Definition**: Is the generated answer factually correct?

**Measures**:
- Factual accuracy
- Correct information
- No errors

**Evaluation**:
- Compare to ground truth
- Fact-checking
- Expert review

**Interpretation**: Higher is better

**Note**: Requires ground truth or reference answers
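
When reference answers exist, one simple automated comparison is token-level F1 between the generated answer and the reference. This is a lexical sketch only: it rewards word overlap, not factual equivalence, so it complements rather than replaces fact-checking and expert review. The example strings are invented.

```python
import re
from collections import Counter

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "RAG is Retrieval-Augmented Generation"
prediction = "RAG stands for Retrieval-Augmented Generation"

f1 = token_f1(prediction, reference)
print(f"{f1:.3f}")
```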


## 3. MLflow Evaluation for RAG

### 3.1 Setting Up Evaluation

**Create evaluation dataset**:

```python
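import pandas as pd

# One row per test case; a DataFrame can be passed straight to mlflow.evaluate
eval_dataset = pd.DataFrame(
    [
        {
            "query": "What is RAG?",
            "expected_answer": "RAG is Retrieval-Augmented Generation...",
            "context": ["chunk1", "chunk2", ...],
        },
        # ... more examples
    ]
)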
```

### 3.2 Running Evaluation

**Evaluate RAG chain**:

```python
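import mlflow
import pandas as pd

# Load the registered model (stage-based URI; newer MLflow versions favor aliases)
rag_model = mlflow.pyfunc.load_model("models:/rag-qa-system/Production")

# Evaluate with the built-in question-answering metrics (MLflow 2.x API)
results = mlflow.evaluate(
    model=rag_model,
    data=pd.DataFrame(eval_dataset),  # evaluate() expects tabular data
    feature_names=["query"],          # column passed to the model as input
    targets="expected_answer",        # ground-truth column
    model_type="question-answering",
    evaluators="default",
)

# Aggregate metrics as a dict of name -> value
print(results.metrics)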
```

### 3.3 Custom RAG Evaluators

**Create custom evaluator**:

```python
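import mlflow
import pandas as pd
from mlflow.metrics import MetricValue
from mlflow.models import make_metric

def faithfulness_eval_fn(predictions, targets, metrics, context):
    # Placeholder scoring: share of answer tokens that also occur in the
    # retrieved context. Swap in an LLM judge or NLI model for real use.
    # `context` is matched by name to a column of the evaluation data
    # (MLflow 2.x default evaluator behavior; verify for your version).
    scores = []
    for answer, chunks in zip(predictions, context):
        answer_tokens = set(str(answer).lower().split())
        context_tokens = set(" ".join(map(str, chunks)).lower().split())
        scores.append(len(answer_tokens & context_tokens) / max(len(answer_tokens), 1))
    mean_score = sum(scores) / max(len(scores), 1)
    return MetricValue(scores=scores, aggregate_results={"mean": mean_score})

faithfulness = make_metric(
    eval_fn=faithfulness_eval_fn, greater_is_better=True, name="faithfulness"
)

results = mlflow.evaluate(
    model=rag_model,
    data=pd.DataFrame(eval_dataset),
    targets="expected_answer",
    evaluators="default",
    extra_metrics=[faithfulness],  # `custom_metrics` is deprecated
)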
```

### 3.4 Logging Evaluation Results

**Log to MLflow**:

```python
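import mlflow
import pandas as pd

with mlflow.start_run():
    # evaluate_rag is a placeholder for your own evaluation routine; it is
    # assumed to return a dict of metric name -> float
    results = evaluate_rag(rag_model, eval_dataset)

    # Log metrics
    mlflow.log_metric("context_precision", results["context_precision"])
    mlflow.log_metric("context_recall", results["context_recall"])
    mlflow.log_metric("faithfulness", results["faithfulness"])
    mlflow.log_metric("answer_relevancy", results["answer_relevancy"])

    # Log evaluation dataset (log_table takes a DataFrame or a dict of columns)
    mlflow.log_table(data=pd.DataFrame(eval_dataset), artifact_file="eval_dataset.json")

    # Log the full results dict as a JSON artifact
    mlflow.log_dict(results, "evaluation_results.json")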
```


## 4. Evaluation Best Practices

### 4.1 Create Comprehensive Test Sets

**Include**:
- Diverse query types
- Different difficulty levels
- Edge cases
- Domain-specific queries

**Size** (rules of thumb; scale up with the variety of queries you expect):
- Minimum: 50-100 examples
- Recommended: 200+ examples
- Production: 1000+ examples

### 4.2 Evaluate End-to-End

**Don't just evaluate components in isolation**:
- Evaluate full RAG chain
- Test real-world scenarios
- Measure user-facing metrics

### 4.3 Use Multiple Metrics

**Don't rely on a single metric**:
- Combine retrieval and generation metrics
- Balance precision and recall
- Consider faithfulness and relevancy

### 4.4 Regular Evaluation

**Evaluate regularly**:
- After model updates
- After data updates
- After configuration changes
- Periodic production evaluation

### 4.5 Human Evaluation

**Combine automated and human evaluation**:
- Automated: scales to large test sets and scores consistently
- Human: catches quality problems and nuance that automated metrics miss
- Use both for a comprehensive assessment

### 4.6 Track Over Time

**Monitor metrics over time**:
- Track degradation
- Identify regressions
- Measure improvements
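
Identifying regressions can be as simple as comparing the latest scores against a stored baseline. A minimal sketch, assuming each evaluation run yields a dict of metric name to score; the 0.02 tolerance and the numbers are illustrative, and `find_regressions` is a hypothetical helper.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose score dropped by more than `tolerance`."""
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is not None and old - new > tolerance:
            regressions[name] = round(new - old, 4)
    return regressions

baseline = {"context_precision": 0.80, "context_recall": 0.70, "faithfulness": 0.92}
current = {"context_precision": 0.81, "context_recall": 0.62, "faithfulness": 0.91}

regressions = find_regressions(baseline, current)
print(regressions)  # only context_recall dropped by more than the tolerance
```

A check like this can run after every model, data, or configuration change and fail the pipeline when the returned dict is non-empty.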

## 5. Interpreting Results

### 5.1 Low Context Precision

**Symptom**: Retrieved chunks are not relevant

**Possible Causes**:
- Poor embedding model
- Wrong chunking strategy
- Mismatched query-document embeddings

**Solutions**:
- Try different embedding model
- Adjust chunking
- Improve query processing

### 5.2 Low Context Recall

**Symptom**: Missing relevant chunks

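**Possible Causes**:
- Top-k too small
- Search misses relevant chunks (for example, the query and the documents use different vocabulary)
- Chunks too granular, so the relevant information is spread across more chunks than top-k returns

**Solutions**:
- Increase top-k
- Improve retrieval (for example, rewrite or expand the query, or try a different embedding model)
- Use larger chunks or more overlap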

### 5.3 Low Faithfulness

**Symptom**: Answers not grounded in context

**Possible Causes**:
- Poor prompt engineering
- Model ignoring context
- Irrelevant context

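**Solutions**:
- Rewrite the prompt so it instructs the model to answer only from the provided context
- Improve retrieval so the context actually contains the answer
- Tell the model to say so when the context is insufficient instead of guessing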

### 5.4 Low Answer Relevancy

**Symptom**: Answers don't address the query

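**Possible Causes**:
- Poor retrieval (see low context precision and recall above)
- Context that is on-topic but does not contain the answer
- Generator drifts off-topic or answers a different question than the one asked

**Solutions**:
- Improve retrieval
- Select context more carefully (re-ranking, metadata filters)
- Tighten the prompt or fine-tune the model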
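
The symptom-to-fix mapping in this section can be turned into a small triage helper. A sketch under stated assumptions: the 0.7 threshold is arbitrary and should be tuned per application, the scores are invented, and `PLAYBOOK` and `diagnose` are hypothetical names.

```python
# Symptom and first fixes to try, condensed from sections 5.1 to 5.4
PLAYBOOK = {
    "context_precision": "retrieved chunks are not relevant; try a different "
    "embedding model, adjust chunking, improve query processing",
    "context_recall": "relevant chunks are missing; increase top-k, improve "
    "retrieval, adjust chunking",
    "faithfulness": "answers are not grounded in context; improve prompts, "
    "use better context, instruct the model to use the context",
    "answer_relevancy": "answers do not address the query; improve retrieval, "
    "select better context, consider fine-tuning",
}

def diagnose(metrics, threshold=0.7):
    """List a hint for every known metric that scores below `threshold`."""
    return [
        f"{name} = {score:.2f}: {PLAYBOOK[name]}"
        for name, score in metrics.items()
        if name in PLAYBOOK and score < threshold
    ]

scores = {
    "context_precision": 0.85,
    "context_recall": 0.55,
    "faithfulness": 0.90,
    "answer_relevancy": 0.65,
}
hints = diagnose(scores)
for hint in hints:
    print(hint)
```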

## 6. Summary

### Key Takeaways

1. **Evaluate all components** - chunking, embeddings, vector store, retrieval and re-ranking, generation
2. **Use multiple metrics** - precision, recall, faithfulness, relevancy, correctness
3. **MLflow helps** - track experiments, log metrics, compare versions
4. **Regular evaluation** - catch issues early, track improvements
5. **Interpret results** - understand what metrics mean and how to improve

### Continuous Improvement

RAG evaluation is an ongoing process:
- Start with basic metrics
- Refine evaluation over time
- Add domain-specific metrics
- Monitor in production


## Exercises

1. **Exercise 1**: Create an evaluation dataset for a RAG application
2. **Exercise 2**: Calculate context precision and recall for a retrieval system
3. **Exercise 3**: Evaluate faithfulness of generated answers
4. **Exercise 4**: Use MLflow to track and compare RAG evaluation results
5. **Exercise 5**: Interpret evaluation results and suggest improvements
