# Module 6: Building RAG Applications with MLflow

## Learning Objectives
- Understand the complete RAG application workflow
- Learn how to assemble RAG applications
- Use MLflow for RAG solution management
- Deploy and serve RAG applications
- Manage RAG chains and pipelines


## 1. RAG Application Workflow Diagram

### Complete RAG Application Flow

```
┌─────────────────────────────────────────────────────────────┐
│              RAG Application Workflow                        │
└─────────────────────────────────────────────────────────────┘

┌──────────────┐
│  User Query  │
└──────┬───────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Query Processing                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  1. Embed query                                       │  │
│  │  2. Search vector store                               │  │
│  │  3. Retrieve top-k chunks                             │  │
│  └──────────────────────────────────────────────────────┘  │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Context Assembly                                            │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  - Combine retrieved chunks                           │  │
│  │  - Apply filters                                      │  │
│  │  - Re-rank if needed                                  │  │
│  └──────────────────────────────────────────────────────┘  │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Prompt Construction                                         │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  Context: [Retrieved chunks]                         │  │
│  │  Question: [User query]                              │  │
│  │  Instructions: [System prompts]                       │  │
│  └──────────────────────────────────────────────────────┘  │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌─────────────────────────────────────────────────────────────┐
│  Generation                                                  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  LLM generates response                              │  │
│  │  - Using augmented prompt                           │  │
│  │  - With retrieved context                            │  │
│  └──────────────────────────────────────────────────────┘  │
└──────┬───────────────────────────────────────────────────────┘
       │
       ▼
┌────────────────┐
│    Response    │
│ (with sources) │
└────────────────┘
```

### Key Components

1. **Query Processing**: Embed the query, search the vector store, and retrieve the top-k chunks
2. **Context Assembly**: Combine, filter, and optionally re-rank the retrieved chunks
3. **Prompt Construction**: Build the augmented prompt from context, question, and instructions
4. **Generation**: The LLM generates a response from the augmented prompt
5. **Response**: Return the answer with source citations


## 2. Assembling a RAG Application

### 2.1 Core Components

A RAG application consists of:

1. **Retrieval Component**:
   - Vector search index
   - Embedding model
   - Retrieval logic

2. **Augmentation Component**:
   - Context assembly
   - Prompt construction
   - Filtering and ranking

3. **Generation Component**:
   - LLM model
   - Prompt engineering
   - Response formatting

### 2.2 Building Blocks

#### Retrieval Block

```python
def retrieve(query: str, top_k: int = 5):
    # `embedding_model` and `vector_search` are placeholders for your
    # embedding function and vector index client.

    # 1. Embed query
    query_embedding = embedding_model(query)

    # 2. Search vector store
    results = vector_search.query(
        query_vector=query_embedding,
        top_k=top_k
    )

    # 3. Return chunks
    return results
```
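To experiment with the retrieval block without a real vector database, a brute-force in-memory index is often enough. The sketch below is an illustration only: `Chunk`, `toy_embed`, and `InMemoryVectorSearch` are hypothetical names, and the letter-frequency "embedding" is a stand-in for a real embedding model. It only mirrors the `query(query_vector, top_k)` call shape used above.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk type with the `.text` / `.metadata` fields used in this module."""
    text: str
    metadata: dict = field(default_factory=dict)


def toy_embed(text: str) -> list:
    """Stand-in for an embedding model: a 26-dimensional letter-frequency vector."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: list, b: list) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorSearch:
    """Brute-force index that scores every stored chunk against the query vector."""

    def __init__(self, chunks: list):
        self._items = [(toy_embed(c.text), c) for c in chunks]

    def query(self, query_vector: list, top_k: int = 5) -> list:
        scored = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[0]),
            reverse=True,
        )
        return [chunk for _, chunk in scored[:top_k]]


index = InMemoryVectorSearch([
    Chunk("vector search finds similar embeddings", {"source": "doc1"}),
    Chunk("zzz qqq xxx", {"source": "doc2"}),
])
top = index.query(toy_embed("vector search"), top_k=1)
print(top[0].metadata["source"])
```

A production index would use approximate nearest-neighbor search rather than scoring every chunk, but the interface seen by the rest of the chain can stay the same.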

#### Augmentation Block

```python
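def augment_prompt(query: str, chunks: list) -> str:
    # 1. Combine chunks (each chunk is expected to expose a `.text` attribute)
    context = "\n\n".join(chunk.text for chunk in chunks)

    # 2. Construct prompt; built piece by piece so that the function's
    #    indentation does not leak into the text sent to the LLM
    prompt = (
        "Context:\n"
        f"{context}\n\n"
        f"Question: {query}\n\n"
        "Answer based on the context above:"
    )

    return prompt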
```
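The workflow diagram also lists filtering and re-ranking as part of context assembly. A minimal sketch of that step follows, assuming each retrieved chunk arrives as a `(score, text)` pair and that the context budget is measured in characters (a real system would more likely count tokens). The function name `assemble_context` is hypothetical; the parameter names echo the `similarity_threshold` and `max_context_length` settings used in the configuration example later in this module.

```python
def assemble_context(scored_chunks, similarity_threshold=0.7, max_context_length=2000):
    """Filter, re-rank, and concatenate chunks within a character budget.

    scored_chunks: list of (score, text) pairs (an assumed input shape).
    """
    # Filter: drop chunks below the similarity threshold
    kept = [(score, text) for score, text in scored_chunks if score >= similarity_threshold]

    # Re-rank: highest score first
    kept.sort(key=lambda pair: pair[0], reverse=True)

    # Combine: number each chunk and stop before exceeding the budget
    parts, used = [], 0
    for i, (_, text) in enumerate(kept, start=1):
        piece = f"[{i}] {text}"
        separator = 2 if parts else 0  # length of the "\n\n" joiner
        if used + separator + len(piece) > max_context_length:
            break
        parts.append(piece)
        used += separator + len(piece)
    return "\n\n".join(parts)


context = assemble_context([(0.9, "alpha"), (0.5, "beta"), (0.8, "gamma")])
print(context)
```

Numbering the chunks in the context makes it easier to ask the LLM to cite sources by index.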

#### Generation Block

```python
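def generate(prompt: str):
    # 1. Call LLM (`llm` is a placeholder for your model client)
    response = llm.generate(prompt)

    # 2. Format response (returned unchanged here; add post-processing as needed)
    return response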
```

### 2.3 Complete RAG Chain

```python
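class RAGChain:
    def __init__(self, embedding_model, vector_search, llm):
        self.embedding_model = embedding_model
        self.vector_search = vector_search
        self.llm = llm

    def retrieve(self, query: str, top_k: int = 5):
        # Embed the query and fetch the top-k most similar chunks
        query_embedding = self.embedding_model(query)
        return self.vector_search.query(
            query_vector=query_embedding,
            top_k=top_k
        )

    def augment_prompt(self, query: str, chunks: list) -> str:
        # Combine the chunks into a context block and build the prompt
        context = "\n\n".join(chunk.text for chunk in chunks)
        return (
            f"Context:\n{context}\n\n"
            f"Question: {query}\n\n"
            "Answer based on the context above:"
        )

    def generate(self, prompt: str):
        return self.llm.generate(prompt)

    def __call__(self, query: str):
        # 1. Retrieve
        chunks = self.retrieve(query)

        # 2. Augment
        prompt = self.augment_prompt(query, chunks)

        # 3. Generate
        response = self.generate(prompt)

        # 4. Return with sources
        return {
            "answer": response,
            "sources": [chunk.metadata for chunk in chunks]
        }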
```
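Because the chain only depends on three injected components, it can be exercised end to end with simple stand-ins before any real model or index is available. The following sketch wires the same three steps into one function; `StubVectorSearch`, `EchoLLM`, and `answer` are hypothetical names, and the stubs only imitate the call shapes used in this section (`embedding_model(query)`, `vector_search.query(...)`, `llm.generate(prompt)`).

```python
from types import SimpleNamespace


class StubVectorSearch:
    """Stand-in index: ignores the query vector and returns the first top_k chunks."""

    def __init__(self, chunks):
        self.chunks = chunks

    def query(self, query_vector, top_k=5):
        return self.chunks[:top_k]


class EchoLLM:
    """Stand-in LLM: reports the prompt length instead of generating text."""

    def generate(self, prompt: str) -> str:
        return f"(stub answer, prompt had {len(prompt)} characters)"


def answer(query, embedding_model, vector_search, llm, top_k=5):
    # Retrieve -> augment -> generate, returning the answer with its sources
    chunks = vector_search.query(query_vector=embedding_model(query), top_k=top_k)
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer based on the context above:"
    )
    return {"answer": llm.generate(prompt), "sources": [c.metadata for c in chunks]}


chunks = [
    SimpleNamespace(text="RAG combines retrieval with generation.", metadata={"source": "intro.md"}),
    SimpleNamespace(text="MLflow tracks experiments.", metadata={"source": "mlflow.md"}),
]
result = answer(
    "What is RAG?",
    embedding_model=lambda text: [float(len(text))],  # toy embedding, for illustration only
    vector_search=StubVectorSearch(chunks),
    llm=EchoLLM(),
    top_k=1,
)
print(result)
```

Swapping a stub for a real client should not require changes to the chain logic, as long as the real client exposes the same method.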


## 3. Using MLflow for RAG Solutions

### 3.1 What is MLflow?

**MLflow** is an open-source platform for managing the ML lifecycle, including:
- **Tracking**: Log parameters, metrics, and artifacts
- **Models**: Package and deploy models
- **Registry**: Centralized model store
- **Projects**: Reproducible ML workflows

### 3.2 Why MLflow for RAG?

**Benefits**:
- **Version Control**: Track different RAG configurations
- **Experimentation**: Compare different approaches
- **Reproducibility**: Reproduce RAG setups
- **Deployment**: Package and deploy RAG chains
- **Monitoring**: Track performance over time

### 3.3 MLflow Components for RAG

#### Tracking RAG Experiments

```python
import mlflow

# Start experiment
mlflow.set_experiment("rag-experiments")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("chunk_size", 500)
    mlflow.log_param("top_k", 5)
    mlflow.log_param("embedding_model", "BGE-large")
    mlflow.log_param("llm", "gpt-4")
    
    # Log metrics (illustrative values; in practice, compute them from an evaluation run)
    mlflow.log_metric("retrieval_precision", 0.85)
    mlflow.log_metric("answer_accuracy", 0.92)
    mlflow.log_metric("latency_ms", 250)
    
    # Log artifacts (prompts, examples); the file must already exist on disk
    mlflow.log_artifact("prompt_template.txt")
```

#### Logging RAG Models

```python
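# Log RAG chain as MLflow model
# Note: `python_model` is expected to be an instance of a
# `mlflow.pyfunc.PythonModel` subclass, so the RAGChain typically needs a
# thin wrapper that implements `predict`.
mlflow.pyfunc.log_model(
    "rag_chain",
    python_model=RAGChain(...),
    registered_model_name="rag-qa-system"
)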
```

#### Model Registry

```python
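# Register model (not needed if `registered_model_name` was already passed to log_model)
mlflow.register_model(
    "runs:/<run_id>/rag_chain",
    "rag-qa-system"
)

# Transition to production
# Note: model stages are deprecated in recent MLflow releases in favor of
# aliases, e.g. client.set_registered_model_alias("rag-qa-system", "production", 1)
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="rag-qa-system",
    version=1,
    stage="Production"
)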
```

### 3.4 RAG-Specific MLflow Features

#### Custom RAG Metrics

```python
# Log custom RAG metrics
mlflow.log_metric("context_precision", 0.88)
mlflow.log_metric("context_recall", 0.82)
mlflow.log_metric("faithfulness", 0.91)
mlflow.log_metric("answer_relevancy", 0.89)
```
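The values logged above have to come from somewhere. As a rough illustration, retrieval-side precision and recall can be computed from the IDs of retrieved chunks and a hand-labeled set of relevant chunk IDs. This is a simplified, set-based sketch; evaluation frameworks often define context precision and recall differently (for example rank-aware or LLM-judged), and the function names here are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are labeled relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant chunks that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


retrieved = ["a", "b", "c", "d"]   # IDs returned by the retriever
relevant = ["a", "c", "e"]         # IDs a human marked as relevant

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
print(precision, recall)
```

The resulting numbers can then be passed to `mlflow.log_metric` under whichever metric names the team has agreed on.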

#### Prompt Versioning

```python
# Log prompt templates
mlflow.log_text(prompt_template, "prompt_template.txt")

# Track prompt changes
mlflow.log_param("prompt_version", "v2.1")
```
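A hand-maintained label such as `v2.1` can drift from the template it is supposed to describe. One option, shown here as an assumption rather than a built-in MLflow feature, is to also log a short fingerprint derived from the template text, so that two runs with the same fingerprint are known to have used the same prompt. The helper name `prompt_fingerprint` is hypothetical.

```python
import hashlib


def prompt_fingerprint(template: str) -> str:
    """Return a short, stable hash of a prompt template.

    Leading/trailing blank space and trailing spaces on each line are
    ignored, so purely cosmetic edits do not change the fingerprint.
    """
    normalized = "\n".join(line.rstrip() for line in template.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


template_a = "Context:\n{context}\n\nQuestion: {query}\n"
template_b = "Context:  \n{context}\n\nQuestion: {query}"       # only whitespace differs
template_c = "Context:\n{context}\n\nQ: {query}\n"              # wording differs

fp_a = prompt_fingerprint(template_a)
fp_b = prompt_fingerprint(template_b)
fp_c = prompt_fingerprint(template_c)
print(fp_a, fp_b, fp_c)
```

The fingerprint could be logged next to the version label, for example with `mlflow.log_param("prompt_fingerprint", fp_a)`.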

#### Retrieval Analysis

```python
# Log retrieval results
mlflow.log_dict(retrieval_results, "retrieval_results.json")
```
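The text does not specify what `retrieval_results` contains. One plausible shape, sketched below under that assumption, is a JSON-serializable record per query with the rank, score, source, and a short preview of each chunk. The helper `build_retrieval_record` and the `(score, source, text)` input tuples are hypothetical.

```python
import json


def build_retrieval_record(query, scored_chunks, preview_chars=80):
    """Build a JSON-serializable summary of one retrieval call.

    scored_chunks: list of (score, source, text) tuples (an assumed input shape).
    """
    return {
        "query": query,
        "results": [
            {
                "rank": rank,
                "score": round(float(score), 4),
                "source": source,
                "preview": text[:preview_chars],
            }
            for rank, (score, source, text) in enumerate(scored_chunks, start=1)
        ],
    }


record = build_retrieval_record(
    "What is RAG?",
    [
        (0.91234, "intro.md", "RAG combines retrieval with generation."),
        (0.73, "mlflow.md", "MLflow tracks experiments."),
    ],
)
# Round-trip through JSON to confirm the record can be stored as an artifact
serialized = json.dumps(record, indent=2)
print(serialized)
```

Keeping only a preview of each chunk keeps the artifact small while still showing what was retrieved and in which order.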


## 4. Deploying RAG Applications

### 4.1 Model Serving

**Deploy RAG chain as MLflow model**:

```python
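# Deploy to Model Serving (config keys follow the Databricks serving API; adjust for your workspace)
from mlflow.deployments import get_deploy_client
client = get_deploy_client("databricks")
client.create_endpoint(
    name="rag-qa-endpoint",
    config={"served_entities": [{
        "entity_name": "rag-qa-system",
        "entity_version": "1",
        "workload_size": "Small",
        "scale_to_zero_enabled": True,
    }]},
)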
```

### 4.2 Serving RAG Chains

**Query deployed RAG endpoint**:

```python
# Query endpoint
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
response = client.predict(
    endpoint="rag-qa-endpoint",
    inputs={"dataframe_records": [{"query": "What is RAG?"}]}
)
```

### 4.3 Batch Processing

**Process multiple queries**:

```python
# Batch processing: load the model once, then score all queries in one call
queries = ["What is RAG?", "How does vector search work?"]
model = mlflow.pyfunc.load_model("models:/rag-qa-system/Production")
results = model.predict(queries)
```
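For large query sets, sending everything in a single `predict` call may exceed memory or rate limits. A minimal sketch of chunked batch processing follows; `predict_in_batches` is a hypothetical helper, and `predict_fn` stands for any callable that maps a list of queries to a list of results (for example a loaded model's `predict` method).

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(predict_fn, queries, batch_size=32):
    """Call predict_fn once per batch and concatenate the results in order."""
    results = []
    for batch in batched(queries, batch_size):
        results.extend(predict_fn(batch))
    return results


batch_sizes_seen = []


def fake_predict(batch):
    # Stand-in for a model: records the batch size and upper-cases each query
    batch_sizes_seen.append(len(batch))
    return [query.upper() for query in batch]


queries = ["q1", "q2", "q3", "q4", "q5"]
results = predict_in_batches(fake_predict, queries, batch_size=2)
print(results, batch_sizes_seen)
```

The right batch size depends on the model and endpoint limits, so it is worth treating it as a tunable parameter.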

### 4.4 Real-time Serving

**REST API endpoint**:

```python
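# REST API
import os
import requests

# Assumption: the access token is supplied via an environment variable
token = os.environ["DATABRICKS_TOKEN"]

response = requests.post(
    "https://<workspace>.databricks.com/serving-endpoints/rag-qa-endpoint/invocations",
    headers={"Authorization": f"Bearer {token}"},
    # MLflow-served models expect a standard input key such as "dataframe_records"
    json={"dataframe_records": [{"query": "What is RAG?"}]},
    timeout=30
)
response.raise_for_status()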
```


## 5. Managing RAG Chains

### 5.1 Chain Configuration

**Store configuration**:

```python
rag_config = {
    "retrieval": {
        "top_k": 5,
        "embedding_model": "BGE-large",
        "similarity_threshold": 0.7
    },
    "augmentation": {
        "max_context_length": 2000,
        "include_metadata": True
    },
    "generation": {
        "llm": "gpt-4",
        "temperature": 0.7,
        "max_tokens": 500
    }
}

mlflow.log_dict(rag_config, "rag_config.json")
```
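Logging the configuration as a JSON artifact preserves it, but artifacts are harder to compare across runs than parameters. A common complement, sketched here with a hypothetical helper `flatten_config`, is to flatten the nested dictionary into dotted keys that can be logged as individual parameters.

```python
def flatten_config(config: dict, prefix: str = "") -> dict:
    """Flatten a nested dict into a single level with dotted keys."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat


example_config = {
    "retrieval": {"top_k": 5, "similarity_threshold": 0.7},
    "generation": {"llm": "gpt-4", "temperature": 0.7},
}
flat = flatten_config(example_config)
print(flat)
```

The flattened dictionary can then be passed to `mlflow.log_params(flat)`, which makes each setting a sortable column when comparing runs.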

### 5.2 Version Management

**Track versions**:

```python
# Version different components
mlflow.log_param("rag_chain_version", "v1.2.0")
mlflow.log_param("embedding_model_version", "BGE-large-v1.5")
mlflow.log_param("llm_version", "gpt-4-turbo")
```

### 5.3 A/B Testing

**Compare different configurations**:

```python
# Run A with configuration A
with mlflow.start_run(run_name="config-a"):
    # ... RAG chain A
    mlflow.log_metric("accuracy", 0.92)

# Run B with configuration B
with mlflow.start_run(run_name="config-b"):
    # ... RAG chain B
    mlflow.log_metric("accuracy", 0.89)
```
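For the comparison to be meaningful, both configurations should be scored on the same evaluation set with the same metric. The sketch below illustrates this with exact-match accuracy, which is a deliberately crude metric chosen for brevity; `exact_match_accuracy` and the two lambda "chains" are hypothetical stand-ins for real RAG chains.

```python
def exact_match_accuracy(answer_fn, eval_set):
    """Fraction of questions whose answer equals the expected string (case-insensitive)."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    correct = sum(
        1
        for question, expected in eval_set
        if answer_fn(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(eval_set)


# Shared evaluation set: (question, expected answer)
eval_set = [("capital of France?", "Paris"), ("2 + 2?", "4")]

# Stand-ins for two chain configurations
chain_a = lambda q: {"capital of France?": "Paris", "2 + 2?": "4"}[q]
chain_b = lambda q: {"capital of France?": "paris ", "2 + 2?": "5"}[q]

scores = {
    "config-a": exact_match_accuracy(chain_a, eval_set),
    "config-b": exact_match_accuracy(chain_b, eval_set),
}
best = max(scores, key=scores.get)
print(scores, best)
```

Each score would be logged inside the corresponding run, so the comparison can later be reproduced from the tracking UI.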

### 5.4 Monitoring

**Track performance**:

```python
# Log performance metrics
mlflow.log_metric("avg_latency_ms", 250)
mlflow.log_metric("success_rate", 0.98)
mlflow.log_metric("error_rate", 0.02)
```
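The metrics above are aggregates, so something has to collect per-request measurements first. A minimal sketch of such a collector follows; `RequestMonitor` is a hypothetical helper that wraps any callable, times it, counts failures, and produces a summary using the same metric names as above.

```python
import time


class RequestMonitor:
    """Collects latency and error counts for calls made through `call`."""

    def __init__(self):
        self.latencies_ms = []
        self.errors = 0

    def call(self, fn, *args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise  # re-raise so the caller still sees the failure
        finally:
            # Latency is recorded for successful and failed calls alike
            self.latencies_ms.append((time.perf_counter() - start) * 1000.0)

    def summary(self):
        total = len(self.latencies_ms)
        if total == 0:
            return {"avg_latency_ms": 0.0, "success_rate": 0.0, "error_rate": 0.0}
        error_rate = self.errors / total
        return {
            "avg_latency_ms": sum(self.latencies_ms) / total,
            "success_rate": 1.0 - error_rate,
            "error_rate": error_rate,
        }


def flaky_chain(query):
    # Stand-in for a RAG chain that rejects empty queries
    if not query:
        raise ValueError("empty query")
    return "ok"


monitor = RequestMonitor()
for q in ["a", "b", "", "c"]:
    try:
        monitor.call(flaky_chain, q)
    except ValueError:
        pass

summary = monitor.summary()
print(summary)
```

Each entry of the summary could be logged periodically with `mlflow.log_metric`, optionally with a `step` so that trends over time are visible.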


## 6. Best Practices

### 6.1 Experimentation

**Track everything**:
- Parameters (chunk size, top_k, models)
- Metrics (accuracy, latency, cost)
- Artifacts (prompts, examples, results)

### 6.2 Versioning

**Version all components**:
- RAG chain code
- Prompt templates
- Model versions
- Configuration

### 6.3 Testing

**Test before deployment**:
- Unit tests for each component
- Integration tests for full chain
- Performance tests
- Quality tests
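
As an illustration of component-level testing, the sketch below uses `unittest.mock` so that no real retriever or LLM is needed. `run_chain` is a hypothetical, simplified chain function written for this example; the tests check that the prompt sent to the LLM contains both the retrieved context and the question, and that sources are returned.

```python
import unittest
from types import SimpleNamespace
from unittest.mock import Mock


def run_chain(query, retriever, llm):
    """Simplified chain for testing: retrieve, build prompt, generate."""
    chunks = retriever(query)
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return {"answer": llm.generate(prompt), "sources": [c.metadata for c in chunks]}


class RunChainTest(unittest.TestCase):
    def setUp(self):
        chunk = SimpleNamespace(
            text="MLflow tracks experiments.", metadata={"source": "mlflow.md"}
        )
        self.retriever = Mock(return_value=[chunk])
        self.llm = Mock()
        self.llm.generate.return_value = "It tracks experiments."

    def test_prompt_contains_context_and_question(self):
        run_chain("What does MLflow do?", self.retriever, self.llm)
        prompt = self.llm.generate.call_args.args[0]
        self.assertIn("MLflow tracks experiments.", prompt)
        self.assertIn("What does MLflow do?", prompt)

    def test_sources_are_returned(self):
        result = run_chain("What does MLflow do?", self.retriever, self.llm)
        self.assertEqual(result["sources"], [{"source": "mlflow.md"}])
        self.retriever.assert_called_once_with("What does MLflow do?")


# Run the tests programmatically (avoids unittest.main(), which exits the interpreter)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RunChainTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests of this kind cover the wiring of the chain; answer quality still needs a separate evaluation set.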

### 6.4 Monitoring

**Monitor in production**:
- Latency
- Error rates
- Quality metrics
- Cost

## 7. Summary and Next Steps

### Key Takeaways

1. **RAG applications** consist of retrieval, augmentation, and generation
2. **MLflow** helps manage the RAG lifecycle (experimentation, deployment, monitoring)
3. **Version control** is crucial for RAG components
4. **Deployment** options include real-time serving and batch processing
5. **Monitoring** ensures production quality

### Next Module: Evaluating RAG Applications

In the next module, we'll explore:
- Components to evaluate in RAG systems
- Evaluation metrics (precision, recall, faithfulness, etc.)
- MLflow evaluation for RAG
- Best practices for RAG evaluation


## Exercises

1. **Exercise 1**: Build a complete RAG chain with retrieval, augmentation, and generation
2. **Exercise 2**: Use MLflow to track a RAG experiment with different configurations
3. **Exercise 3**: Deploy a RAG chain as an MLflow model
4. **Exercise 4**: Compare two RAG configurations using MLflow
5. **Exercise 5**: Set up monitoring for a deployed RAG application
