# Task 4.3.1: MLflow Setup for LLM Experiment Tracking

**Module:** 4.3 - MLOps & Experiment Tracking  
**Time:** 2 hours  
**Difficulty:** ⭐⭐ (Beginner-Intermediate)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand why experiment tracking is essential for ML projects
- [ ] Set up and configure MLflow tracking server on DGX Spark
- [ ] Log parameters, metrics, and artifacts from training runs
- [ ] Navigate the MLflow UI to compare experiments
- [ ] Understand what MLflow's PyTorch autologging covers (PyTorch Lightning training) and when manual logging is still needed

---

## Prerequisites

- Completed: Module 4.2 (AI Safety) or basic understanding of model training
- Knowledge of: Python, basic ML training loops
- Running inside NGC PyTorch container with port 5000 exposed

---

## Real-World Context

Picture this: You're working on fine-tuning an LLM for your company. After 50 experiments over 3 weeks, your boss asks:

> "Which configuration gave us the best results? And can you reproduce it?"

Without proper tracking, you're digging through terminal logs, trying to remember which hyperparameters you used, and hoping you saved that one good checkpoint somewhere.

**MLflow solves this.** It is an open-source platform, originally developed at Databricks, that records each run's parameters, metrics, and artifacts so that results stay reproducible and comparable.

---

## ELI5: What is Experiment Tracking?

> **Imagine you're baking the perfect chocolate chip cookie.**
>
> You try different recipes: more sugar, less butter, different oven temperatures. But if you don't write down what you did each time, you'll forget which batch was the best!
>
> Experiment tracking is your **recipe book** for machine learning. Every time you train a model, you write down:
> - What ingredients (hyperparameters) you used
> - How it turned out (metrics like accuracy, loss)
> - The actual cookie (the saved model)
>
> **In AI terms:** MLflow is like a lab notebook that automatically records everything about your ML experiments, so you can always find and reproduce your best results.

---

## Part 1: Setting Up MLflow

### Understanding MLflow Components

MLflow has four main components:

| Component | Purpose | Analogy |
|-----------|---------|--------|
| **Tracking** | Log experiments | Recipe book |
| **Projects** | Package code | Recipe cards |
| **Models** | Package models | Cookie tin |
| **Registry** | Version models | Cookbook editions |

Today we'll focus on **Tracking** - the foundation of everything else.

In [None]:
# First, let's install and verify MLflow
!pip install mlflow -q

import mlflow
print(f"MLflow version: {mlflow.__version__}")

In [None]:
# Check our DGX Spark environment
import torch
import os

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

### Starting the MLflow Tracking Server

In production, you'd run the MLflow server as a separate process. For learning, we'll use MLflow's local file-based tracking.

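**Note:** To run the full UI server, execute this in a separate terminal from the notebook's working directory (so that `./mlruns` points at the same folder), then open `http://localhost:5000` in your browser: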
```bash
mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri ./mlruns
```

In [None]:
# Configure MLflow to use local storage
# In production, you'd use a remote tracking server

MLFLOW_TRACKING_DIR = "./mlruns"
os.makedirs(MLFLOW_TRACKING_DIR, exist_ok=True)

# Set tracking URI to local directory
mlflow.set_tracking_uri(f"file://{os.path.abspath(MLFLOW_TRACKING_DIR)}")

print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"\nMLflow will store experiments in: {os.path.abspath(MLFLOW_TRACKING_DIR)}")

### What Just Happened?

We configured MLflow to store all experiment data locally. In a team setting, you'd point this to a shared server so everyone can see each other's experiments.

---

## Part 2: Your First Tracked Experiment

Let's track a simple training run. We'll create a toy classification problem to demonstrate the concepts.

In [None]:
# Create a toy dataset for demonstration
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

# Set seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Generate synthetic data
n_samples = 1000
n_features = 20

X = torch.randn(n_samples, n_features)
# True weights for our classification
true_weights = torch.randn(n_features)
y = (X @ true_weights > 0).float()

# Split into train/val
train_X, val_X = X[:800], X[800:]
train_y, val_y = y[:800], y[800:]

print(f"Training samples: {len(train_X)}")
print(f"Validation samples: {len(val_X)}")
print(f"Features: {n_features}")
print(f"Class balance: {y.mean():.2%} positive")

In [None]:
# Define a simple neural network
class SimpleClassifier(nn.Module):
    """A simple 2-layer neural network for binary classification."""
    
    def __init__(self, input_dim: int, hidden_dim: int, dropout: float = 0.1):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.network(x).squeeze(-1)

# Quick test
model = SimpleClassifier(n_features, 64)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

### Now Let's Track an Experiment!

This is where MLflow shines. We'll log:
- **Parameters**: hyperparameters like learning rate, hidden size
- **Metrics**: loss, accuracy (tracked over time)
- **Artifacts**: saved model, plots
- **Tags**: metadata for organization

In [None]:
def train_with_mlflow(
    hidden_dim: int = 64,
    learning_rate: float = 0.01,
    dropout: float = 0.1,
    epochs: int = 20,
    batch_size: int = 32,
    experiment_name: str = "Binary-Classification-Demo"
):
    """
    Train a simple classifier with full MLflow tracking.
    
    This demonstrates the core MLflow logging capabilities.
    """
    # Set or create experiment
    mlflow.set_experiment(experiment_name)
    
    # Start a new run
    with mlflow.start_run(run_name=f"hidden-{hidden_dim}-lr-{learning_rate}"):
        # Log parameters (the recipe)
        mlflow.log_param("hidden_dim", hidden_dim)
        mlflow.log_param("learning_rate", learning_rate)
        mlflow.log_param("dropout", dropout)
        mlflow.log_param("epochs", epochs)
        mlflow.log_param("batch_size", batch_size)
        
        # Log tags (metadata)
        mlflow.set_tag("model_type", "SimpleClassifier")
        mlflow.set_tag("developer", "SPARK")
        mlflow.set_tag("environment", "DGX Spark")
        
        # Create model
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model = SimpleClassifier(n_features, hidden_dim, dropout).to(device)
        
        # Log model architecture as a text artifact (no temporary file needed)
        mlflow.log_text(str(model), "model_architecture.txt")
        
        # Setup training
        criterion = nn.BCELoss()
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)
        
        train_dataset = TensorDataset(train_X.to(device), train_y.to(device))
        train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        
        # Training loop
        best_val_acc = 0.0
        
        for epoch in range(epochs):
            model.train()
            epoch_loss = 0.0
            
            for batch_X, batch_y in train_loader:
                optimizer.zero_grad()
                outputs = model(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            
            # Calculate metrics
            avg_loss = epoch_loss / len(train_loader)
            
            # Validation
            model.eval()
            with torch.no_grad():
                val_outputs = model(val_X.to(device))
                val_preds = (val_outputs > 0.5).float()
                val_acc = (val_preds == val_y.to(device)).float().mean().item()
                val_loss = criterion(val_outputs, val_y.to(device)).item()
            
            # Log metrics with step number
            mlflow.log_metrics({
                "train_loss": avg_loss,
                "val_loss": val_loss,
                "val_accuracy": val_acc
            }, step=epoch)
            
            if val_acc > best_val_acc:
                best_val_acc = val_acc
            
            if epoch % 5 == 0:
                print(f"Epoch {epoch:2d}: train_loss={avg_loss:.4f}, val_acc={val_acc:.4f}")
        
        # Log final metrics
        mlflow.log_metric("best_val_accuracy", best_val_acc)
        
        # Log the model
        mlflow.pytorch.log_model(model, "model")
        
        print(f"\nRun completed! Best validation accuracy: {best_val_acc:.4f}")
        print(f"Run ID: {mlflow.active_run().info.run_id}")
        
        return model, best_val_acc

In [None]:
# Run our first tracked experiment!
model, acc = train_with_mlflow(
    hidden_dim=64,
    learning_rate=0.01,
    epochs=20
)

### What Just Happened?

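With a handful of logging calls inside `train_with_mlflow`, the run now has:
1. An experiment called "Binary-Classification-Demo" (created by `mlflow.set_experiment` on first use)
2. A run with our parameters and tags logged
3. Loss and accuracy recorded at each epoch
4. The trained model saved as an artifact
5. A unique run ID, generated by MLflow, for future reference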

Let's run a few more experiments with different configurations to see the power of comparison!

In [None]:
# Run multiple experiments with different hyperparameters
experiments = [
    {"hidden_dim": 32, "learning_rate": 0.001, "dropout": 0.1},
    {"hidden_dim": 128, "learning_rate": 0.01, "dropout": 0.2},
    {"hidden_dim": 64, "learning_rate": 0.1, "dropout": 0.0},
]

results = []
for config in experiments:
    print(f"\n{'='*50}")
    print(f"Running: {config}")
    print(f"{'='*50}")
    _, acc = train_with_mlflow(**config, epochs=20)
    results.append((config, acc))

print("\n" + "="*50)
print("Summary of all experiments:")
print("="*50)
for config, acc in results:
    print(f"Accuracy: {acc:.4f} | {config}")

---

## Part 3: Exploring Experiments Programmatically

The MLflow UI is great, but you can also query experiments with Python!

In [None]:
# List all experiments
from mlflow.tracking import MlflowClient

client = MlflowClient()

print("All experiments:")
for exp in client.search_experiments():
    print(f"  - {exp.name} (ID: {exp.experiment_id})")

In [None]:
# Get all runs from our experiment
experiment = client.get_experiment_by_name("Binary-Classification-Demo")

runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["metrics.best_val_accuracy DESC"]
)

print(f"Found {len(runs)} runs in 'Binary-Classification-Demo'")
print("\nTop runs by validation accuracy:")
print("-" * 70)

for run in runs[:5]:
    params = run.data.params
    metrics = run.data.metrics
    print(f"Run: {run.info.run_id[:8]}... | "
          f"Acc: {metrics.get('best_val_accuracy', 0):.4f} | "
          f"hidden={params.get('hidden_dim', 'N/A')}, "
          f"lr={params.get('learning_rate', 'N/A')}")

In [None]:
# Load the best model
best_run = runs[0]
print(f"Loading best model from run: {best_run.info.run_id}")

# Load model from MLflow
best_model_uri = f"runs:/{best_run.info.run_id}/model"
loaded_model = mlflow.pytorch.load_model(best_model_uri)

# Test it
loaded_model.eval()
with torch.no_grad():
    device = next(loaded_model.parameters()).device
    test_output = loaded_model(val_X[:5].to(device))
    print(f"\nSample predictions: {test_output.cpu().numpy().round(3)}")
    print(f"Actual labels:      {val_y[:5].numpy()}")

---

## Part 4: MLflow Autologging for PyTorch

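MLflow can log metrics automatically, but for PyTorch this comes with a caveat: `mlflow.pytorch.autolog()` instruments PyTorch Lightning's `Trainer.fit()`. A hand-written training loop like the ones in this notebook is not instrumented, so it still needs explicit `mlflow.log_metric` calls.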

In [None]:
# Enable autologging for PyTorch
mlflow.pytorch.autolog(
    log_every_n_epoch=1,
    log_models=True,
    disable=False,
    exclusive=False
)

print("PyTorch autologging enabled!")
print("When training runs through PyTorch Lightning's Trainer.fit(), MLflow will automatically log:")
print("  - Model parameters")
print("  - Training/validation metrics")
print("  - Model architecture")
print("  - Saved model artifacts")

In [None]:
# mlflow.pytorch.autolog() instruments PyTorch Lightning's Trainer.fit().
# The loop below is plain PyTorch, so nothing is captured automatically:
# we log the metric and the model explicitly.
mlflow.set_experiment("Autolog-Demo")

with mlflow.start_run(run_name="autolog-example"):
    # Create model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = SimpleClassifier(n_features, 64).to(device)
    
    # Plain PyTorch training loop (not instrumented by autolog)
    optimizer = optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.BCELoss()
    
    for epoch in range(10):
        model.train()
        outputs = model(train_X.to(device))
        loss = criterion(outputs, train_y.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Logged explicitly; autolog does not see this hand-written loop
        mlflow.log_metric("epoch_loss", loss.item(), step=epoch)
    
    # Log the final model
    mlflow.pytorch.log_model(model, "autolog_model")
    print("Autolog run complete!")

---

## Part 5: Tracking LLM Fine-Tuning (Preview)

Let's see how you'd track a real LLM fine-tuning job. We'll use a mock example that shows the pattern.

In [None]:
def mock_llm_finetuning_run():
    """
    Demonstrates how to track an LLM fine-tuning experiment.
    This is a mock example - replace with actual training code.
    """
    mlflow.set_experiment("LLM-Finetuning-Demo")
    
    with mlflow.start_run(run_name="phi2-lora-r16"):
        # Log LLM-specific parameters
        mlflow.log_params({
            # Model config
            "base_model": "microsoft/phi-2",
            "model_size": "2.7B",
            
            # LoRA config
            "adapter_type": "LoRA",
            "lora_r": 16,
            "lora_alpha": 32,
            "lora_dropout": 0.05,
            "target_modules": "q_proj,v_proj,k_proj,o_proj",
            
            # Training config
            "batch_size": 4,
            "gradient_accumulation_steps": 8,
            "effective_batch_size": 32,
            "learning_rate": 2e-4,
            "lr_scheduler": "cosine",
            "warmup_ratio": 0.03,
            "num_epochs": 3,
            "max_seq_length": 2048,
            
            # Quantization
            "quantization": "4-bit",
            "bnb_4bit_compute_dtype": "bfloat16",
            
            # Data
            "dataset": "custom_instruction_data",
            "train_samples": 10000,
            "val_samples": 1000
        })
        
        # Log DGX Spark environment info
        mlflow.set_tags({
            "hardware": "DGX Spark",
            "gpu_memory": "128GB Unified",
            "gpu_type": "Blackwell GB10",
            "container": "nvcr.io/nvidia/pytorch:25.11-py3"
        })
        
        # Simulate training metrics over 100 training steps
        import random
        random.seed(42)
        
        base_loss = 2.5
        for step in range(100):
            # Simulate decreasing loss
            train_loss = base_loss * (0.95 ** (step / 10)) + random.uniform(-0.05, 0.05)
            val_loss = train_loss * 1.1 + random.uniform(-0.03, 0.03)
            
            mlflow.log_metrics({
                "train/loss": train_loss,
                "val/loss": val_loss,
                "train/learning_rate": 2e-4 * max(0.1, 1 - step/100),
                "train/tokens_per_second": 5000 + random.randint(-100, 100),
                "memory/gpu_allocated_gb": 45 + random.uniform(-2, 2)
            }, step=step)
        
        # Log final evaluation metrics
        mlflow.log_metrics({
            "eval/final_loss": 1.23,
            "eval/perplexity": 3.42,
            "eval/accuracy": 0.78,
            "training_time_hours": 2.5
        })
        
        print("Mock LLM fine-tuning run logged!")
        print(f"Run ID: {mlflow.active_run().info.run_id}")

mock_llm_finetuning_run()
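
Two of the mock values above are derived from others: `effective_batch_size` is `batch_size` times `gradient_accumulation_steps`, and `eval/perplexity` matches the exponential of `eval/final_loss`. A minimal sketch of computing them in code instead of typing them in, so the logged values cannot drift out of sync. It assumes the loss is a mean per-token cross-entropy measured in nats, which is what makes `exp(loss)` a perplexity.

```python
import math

# Values taken from the mock run above
batch_size = 4
gradient_accumulation_steps = 8
final_eval_loss = 1.23

# Number of samples contributing to each optimizer step
effective_batch_size = batch_size * gradient_accumulation_steps

# Perplexity from mean per-token cross-entropy (assumed to be in nats)
perplexity = math.exp(final_eval_loss)

derived_params = {"effective_batch_size": effective_batch_size}
derived_metrics = {"eval/perplexity": round(perplexity, 2)}

print(derived_params)
print(derived_metrics)
```

Inside a run, these dictionaries could be passed to `mlflow.log_params` and `mlflow.log_metrics` alongside the hand-set values.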

---

## Try It Yourself

Now it's your turn! Create an experiment that tracks different optimization algorithms.

<details>
<summary>Hint</summary>

Try these optimizers: `Adam`, `SGD`, `AdamW`, and compare their convergence.
Log the optimizer name as a parameter and track the loss curves.

</details>

In [None]:
# YOUR CODE HERE
# Create an experiment called "Optimizer-Comparison"
# Try at least 3 different optimizers
# Log the optimizer type, learning rate, and track training loss

# Starter code:
optimizers_to_try = [
    ("Adam", optim.Adam),
    ("SGD", optim.SGD),
    ("AdamW", optim.AdamW),
]

# Your experiment code here...

---

## Common Mistakes

### Mistake 1: Not Setting Experiment Name

```python
# Wrong - runs go to "Default" experiment
with mlflow.start_run():
    mlflow.log_metric("loss", 0.5)

# Right - organized by experiment
mlflow.set_experiment("My-Project")
with mlflow.start_run():
    mlflow.log_metric("loss", 0.5)
```
**Why:** Without an experiment name, every run lands in the shared "Default" experiment, where unrelated projects mix together and become hard to find later.

### Mistake 2: Forgetting to End Runs

```python
# Wrong - orphaned run if exception occurs
mlflow.start_run()
mlflow.log_metric("loss", 0.5)
# Forgot mlflow.end_run()!

# Right - context manager handles cleanup
with mlflow.start_run():
    mlflow.log_metric("loss", 0.5)
# Automatically ends run
```
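**Why:** A run that is never ended stays active in the current process and keeps the status "RUNNING" in the UI. The next `mlflow.start_run()` then fails because a run is already active, and anything logged in between lands in the wrong run.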

### Mistake 3: Logging After Run Ends

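```python
# Wrong - no run is active after the block, so MLflow silently starts a NEW run
with mlflow.start_run():
    pass
mlflow.log_metric("final_loss", 0.3)  # No error: lands in a stray run

# Right
with mlflow.start_run():
    # ... training ...
    mlflow.log_metric("final_loss", 0.3)  # Inside context
```
**Why:** The fluent API logs to the active run. When none is active, `mlflow.log_metric` does not raise an error; it starts a fresh run and logs there, so the value ends up detached from your training run. Keep all logging inside the run context.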

### Mistake 4: Not Using Consistent Naming

```python
# Wrong - inconsistent metric names
mlflow.log_metric("Loss", 0.5)
mlflow.log_metric("training_loss", 0.4)
mlflow.log_metric("LOSS", 0.3)

# Right - consistent naming convention
mlflow.log_metric("train/loss", 0.5)
mlflow.log_metric("val/loss", 0.4)
mlflow.log_metric("test/loss", 0.3)
```
**Why:** MLflow treats each distinct name as a separate metric, so inconsistently named values never line up on the same chart when you compare runs in the UI.
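
One way to enforce a convention is to build metric names through a small helper instead of typing them by hand. The function below is a hypothetical sketch, not part of MLflow: `metric_key` and `ALLOWED_SPLITS` are names invented here, and the `split/name` pattern simply mirrors the "Right" example above.

```python
# Hypothetical convention: every metric is "<split>/<name>", all lowercase
ALLOWED_SPLITS = ("train", "val", "test")


def metric_key(split: str, name: str) -> str:
    """Build a normalized metric name such as 'train/loss'."""
    split = split.strip().lower()
    if split not in ALLOWED_SPLITS:
        raise ValueError(f"Unknown split {split!r}; expected one of {ALLOWED_SPLITS}")
    # Lowercase the name and replace spaces so variants collapse to one key
    name = name.strip().lower().replace(" ", "_")
    return f"{split}/{name}"


print(metric_key("Train", "Loss"))
print(metric_key("val", "Token Accuracy"))
```

A call such as `mlflow.log_metric(metric_key("train", "loss"), 0.5)` then always produces the same key, however the caller capitalizes it.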

---

## Checkpoint

You've learned:
- How to set up MLflow tracking for experiments
- How to log parameters, metrics, and artifacts
- How to query and compare runs programmatically
- How to load saved models from previous runs
- Best practices for LLM fine-tuning tracking

---

## Challenge (Optional)

Create a hyperparameter search that:
1. Uses a grid of learning rates: [0.001, 0.01, 0.1]
2. Uses hidden dimensions: [32, 64, 128]
3. Tracks all 9 combinations with MLflow
4. Finds and loads the best model
5. Creates a summary visualization

---

## Further Reading

- [MLflow Documentation](https://mlflow.org/docs/latest/index.html)
- [MLflow PyTorch Guide](https://mlflow.org/docs/latest/python_api/mlflow.pytorch.html)
- [MLflow Best Practices](https://mlflow.org/docs/latest/tracking.html#organizing-runs-into-experiments)

---

## Cleanup

In [None]:
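# Clear GPU memory
import torch
import gc

# Drop model references if they exist (some cells may have been skipped)
for name in ("model", "loaded_model"):
    globals().pop(name, None)
gc.collect()

# Only touch the CUDA allocator when a GPU is present
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU memory cached: {torch.cuda.memory_reserved()/1024**3:.2f} GB")

print("Cleanup complete!")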

---

## Next Steps

In the next notebook, we'll explore **Weights & Biases (W&B)** - another popular experiment tracking tool with powerful visualization features. You'll learn when to use MLflow vs W&B and how to integrate both into your workflow.

**Continue to:** [02-wandb-integration.ipynb](02-wandb-integration.ipynb)