# Lab 4.3.1: MLflow Experiment Tracking

**Module:** 4.3 - MLOps & Experiment Tracking  
**Time:** 2 hours  
**Difficulty:** ⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand why experiment tracking is essential for ML development
- [ ] Set up MLflow tracking on DGX Spark with a local file store
- [ ] Log parameters, metrics, and artifacts systematically
- [ ] Compare experiments and find optimal hyperparameters
- [ ] Use the MLflow UI for experiment visualization

---

## 📚 Prerequisites

- Completed: Module 4.2 (AI Safety)
- Knowledge of: Python, basic ML training loops, PyTorch fundamentals
- Hardware: DGX Spark (or any CUDA-capable GPU)

---

## 🌍 Real-World Context

**The Nightmare Scenario:**

You train a model, achieve 95% accuracy... and then:
- "Wait, what learning rate did I use?"
- "Which version of the dataset was this?"
- "Did I use dropout or not?"

**Every ML engineer has been there.** That's why industry leaders use experiment tracking:

| Company | Use Case | Scale |
|---------|----------|-------|
| **Netflix** | Recommendation model A/B tests | 10,000+ experiments/month |
| **Airbnb** | Search ranking experiments | Every model version tracked |
| **Tesla** | Autonomous driving model iterations | Petabytes of run data |
| **OpenAI** | GPT training runs | Every hyperparameter logged |

**MLflow** is one of the most widely used open-source tools for this. It has four core components:

1. **Tracking** - Log everything about your experiments
2. **Projects** - Package code for reproducibility  
3. **Models** - Manage and deploy models
4. **Registry** - Version and stage models for production

---

## 🧒 ELI5: What is Experiment Tracking?

> **Imagine you're a chef creating a new chocolate chip cookie recipe.**
>
> Every time you bake a batch, you'd want to write down:
> - **Ingredients** (parameters): 2 cups flour, 1/2 cup sugar, 1 tsp vanilla
> - **How it turned out** (metrics): Taste 8/10, texture 7/10, looks 9/10
> - **A photo** (artifacts): What the cookies actually looked like
>
> After 50 batches, you flip through your notebook and see:
> - "Aha! More brown butter = chewier cookies!"
> - "Batch #37 was the best - those exact ingredients!"
> - "Overbaking always drops the texture score"
>
> **Experiment tracking is your ML notebook!**
> - **Parameters:** learning_rate=0.001, batch_size=32, epochs=10
> - **Metrics:** loss=0.15, accuracy=92%, F1=0.89
> - **Artifacts:** model weights, confusion matrix, training curves
>
> Without it, you're baking cookies with no notes - hoping to remember what worked!

---

## Part 1: Setting Up MLflow on DGX Spark

### Why MLflow Works Great on DGX Spark

- **Pure Python**: No special ARM64 compilation needed
- **Lightweight**: Minimal overhead even for large experiments
- **Local-first**: Works without internet, perfect for secure environments
- **Scalable**: Same API whether tracking 10 or 10,000 runs

In [None]:
# Install MLflow if needed
import subprocess
import sys

try:
    import mlflow
    print(f"✅ MLflow already installed: v{mlflow.__version__}")
except ImportError:
    print("📦 Installing MLflow...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "mlflow", "-q"])
    import mlflow
    print(f"✅ MLflow installed: v{mlflow.__version__}")

In [None]:
import mlflow
import mlflow.pytorch
import os
import json
from pathlib import Path
from datetime import datetime

# Additional imports we'll need
import numpy as np
import matplotlib.pyplot as plt

print(f"MLflow version: {mlflow.__version__}")
print(f"Python version: {sys.version.split()[0]}")

In [None]:
# Configure MLflow storage location
# We'll use a local directory - in production you'd use a server

NOTEBOOK_DIR = Path.cwd()
MODULE_DIR = (NOTEBOOK_DIR / "..").resolve()
MLFLOW_DIR = MODULE_DIR / "mlflow"
MLFLOW_DIR.mkdir(exist_ok=True)

# Set tracking URI - this is where all experiment data is stored
tracking_uri = MLFLOW_DIR.as_uri()  # file:///... URI; also escapes spaces in the path
mlflow.set_tracking_uri(tracking_uri)

print(f"📁 MLflow storage: {MLFLOW_DIR}")
print(f"🔗 Tracking URI: {mlflow.get_tracking_uri()}")
print()
print("💡 Storage Options:")
print("   file:///path/to/dir    - Local development")
print("   http://localhost:5000  - Local server (team sharing)")
print("   http://mlflow.corp.com - Production server")

### 🔍 What Just Happened?

We configured MLflow to store experiment data locally. The `tracking_uri` tells MLflow:
- **Where to save** parameters, metrics, and artifacts
- **How to connect** (file system vs. HTTP server)

For DGX Spark development, local storage is perfect. For team collaboration, you'd run an MLflow server.
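
The scheme at the front of the tracking URI is what distinguishes a local file store from an HTTP server. As a minimal stdlib-only sketch (the directory name `mlflow-demo` and the helper `describe_tracking_uri` are hypothetical, made up for illustration, and not part of MLflow), a file URI can be built with `Path.as_uri()` and its scheme inspected:

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def describe_tracking_uri(uri: str) -> str:
    """Classify a tracking URI by its scheme (hypothetical helper, not an MLflow API)."""
    scheme = urlparse(uri).scheme
    if scheme == "file":
        return "local file store"
    if scheme in ("http", "https"):
        return "remote tracking server"
    return f"other backend ({scheme or 'no scheme'})"


# Path.as_uri() requires an absolute path and percent-encodes special characters
demo_dir = Path(tempfile.gettempdir()).resolve() / "mlflow-demo"
local_uri = demo_dir.as_uri()

print(local_uri)
print(describe_tracking_uri(local_uri))
print(describe_tracking_uri("http://localhost:5000"))
```

Building the URI from a resolved `Path` avoids hand-assembling the `file://` prefix, which is easy to get wrong when the path contains spaces.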

---

## Part 2: Your First Experiment

### 🧒 ELI5: Experiments vs Runs

> **Experiment** = A recipe book ("Chocolate Chip Cookies")
>
> **Run** = One attempt at the recipe ("Batch #5: with brown butter")
>
> You might have:
> - Experiment: "LLM Fine-tuning"
>   - Run 1: Phi-2, lr=1e-4, epochs=3
>   - Run 2: Phi-2, lr=1e-5, epochs=5
>   - Run 3: Phi-2, lr=1e-4, with LoRA rank=16

In [None]:
# Create or get an experiment
EXPERIMENT_NAME = "LLM-Finetuning-Demo"

# Check if experiment exists
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

if experiment is None:
    # Create new experiment with metadata tags
    experiment_id = mlflow.create_experiment(
        EXPERIMENT_NAME,
        tags={
            "project": "dgx-spark-curriculum",
            "module": "4.3",
            "hardware": "DGX Spark",
            "created_by": "student"
        }
    )
    print(f"✨ Created new experiment: '{EXPERIMENT_NAME}'")
    print(f"   Experiment ID: {experiment_id}")
else:
    experiment_id = experiment.experiment_id
    print(f"📂 Using existing experiment: '{EXPERIMENT_NAME}'")
    print(f"   Experiment ID: {experiment_id}")

# Set as active experiment (all future runs go here)
mlflow.set_experiment(EXPERIMENT_NAME)

In [None]:
# Start your first run!
# The context manager ensures the run is properly closed even if errors occur

with mlflow.start_run(run_name="my-first-run") as run:
    
    # ========== LOG PARAMETERS ==========
    # Parameters are the INPUTS to your experiment
    # Things you set BEFORE training starts
    
    mlflow.log_param("model_name", "microsoft/phi-2")
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("batch_size", 8)
    mlflow.log_param("epochs", 3)
    mlflow.log_param("lora_rank", 16)
    mlflow.log_param("lora_alpha", 32)
    mlflow.log_param("optimizer", "AdamW")
    
    # You can also log multiple params at once
    mlflow.log_params({
        "warmup_steps": 100,
        "weight_decay": 0.01,
        "gradient_accumulation": 4
    })
    
    # ========== LOG METRICS ==========
    # Metrics are the OUTPUTS/RESULTS of your experiment
    # Things measured DURING or AFTER training
    
    # Final metrics (single values)
    mlflow.log_metric("final_train_loss", 0.45)
    mlflow.log_metric("final_eval_loss", 0.52)
    mlflow.log_metric("final_accuracy", 0.87)
    mlflow.log_metric("training_time_minutes", 45)
    mlflow.log_metric("gpu_memory_gb", 24.5)
    
    # Metrics over time (for charts) - use 'step' parameter
    for epoch in range(3):
        # Simulate decreasing loss, increasing accuracy
        train_loss = 0.8 - epoch * 0.15
        eval_loss = 0.9 - epoch * 0.12
        accuracy = 0.65 + epoch * 0.08
        
        mlflow.log_metrics({
            "epoch_train_loss": train_loss,
            "epoch_eval_loss": eval_loss,
            "epoch_accuracy": accuracy
        }, step=epoch)
    
    # ========== LOG TAGS ==========
    # Tags are metadata for organization and filtering
    
    mlflow.set_tag("hardware", "DGX Spark")
    mlflow.set_tag("framework", "pytorch")
    mlflow.set_tag("status", "completed")
    mlflow.set_tag("dataset", "alpaca-cleaned")
    mlflow.set_tag("notes", "First successful run with LoRA")
    
    # Save run info for later
    run_id = run.info.run_id
    artifact_uri = run.info.artifact_uri

print(f"\n🎉 Run completed!")
print(f"   Run ID: {run_id}")
print(f"   Artifact URI: {artifact_uri}")

In [None]:
# View what we logged
run_data = mlflow.get_run(run_id)

print("📊 LOGGED DATA SUMMARY")
print("=" * 60)

print("\n📌 Parameters (inputs):")
for key, value in sorted(run_data.data.params.items()):
    print(f"   {key}: {value}")

print("\n📈 Metrics (outputs):")
for key, value in sorted(run_data.data.metrics.items()):
    print(f"   {key}: {value}")

print("\n🏷️ Tags (metadata):")
for key, value in sorted(run_data.data.tags.items()):
    if not key.startswith("mlflow."):  # Skip internal tags
        print(f"   {key}: {value}")

### 🔍 Understanding Parameters vs Metrics vs Tags

| Type | When Set | Example | Use Case |
|------|----------|---------|----------|
| **Parameters** | Before training | `learning_rate=0.001` | Hyperparameter search |
| **Metrics** | During/after training | `accuracy=0.95` | Performance comparison |
| **Tags** | Anytime | `status=completed` | Organization & filtering |

---

## Part 3: Logging Artifacts

Artifacts are **files** associated with a run: models, plots, configs, predictions, etc.

### 🧒 ELI5: What Are Artifacts?

> Back to our cookie recipe:
> - Parameters = ingredient list
> - Metrics = taste scores
> - **Artifacts = photos of the actual cookies!**
>
> In ML:
> - Model weights (the trained model itself)
> - Training curves (loss over time plots)
> - Config files (exact settings used)
> - Predictions (sample outputs)

In [None]:
def create_training_curves():
    """Generate realistic-looking training curves."""
    epochs = np.arange(1, 11)
    
    # Realistic training dynamics
    train_loss = 1.2 * np.exp(-epochs * 0.35) + 0.08 + np.random.normal(0, 0.02, len(epochs))
    val_loss = 1.3 * np.exp(-epochs * 0.28) + 0.12 + np.random.normal(0, 0.03, len(epochs))
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Loss plot
    axes[0].plot(epochs, train_loss, 'b-o', label='Training Loss', linewidth=2, markersize=6)
    axes[0].plot(epochs, val_loss, 'r-s', label='Validation Loss', linewidth=2, markersize=6)
    axes[0].set_xlabel('Epoch', fontsize=12)
    axes[0].set_ylabel('Loss', fontsize=12)
    axes[0].set_title('Training Progress', fontsize=14)
    axes[0].legend(fontsize=11)
    axes[0].grid(True, alpha=0.3)
    axes[0].set_ylim(0, 1.5)
    
    # Accuracy plot
    train_acc = 1 - train_loss * 0.5
    val_acc = 1 - val_loss * 0.5
    axes[1].plot(epochs, train_acc, 'b-o', label='Training Accuracy', linewidth=2, markersize=6)
    axes[1].plot(epochs, val_acc, 'r-s', label='Validation Accuracy', linewidth=2, markersize=6)
    axes[1].set_xlabel('Epoch', fontsize=12)
    axes[1].set_ylabel('Accuracy', fontsize=12)
    axes[1].set_title('Accuracy Progress', fontsize=14)
    axes[1].legend(fontsize=11)
    axes[1].grid(True, alpha=0.3)
    axes[1].set_ylim(0.3, 1.0)
    
    plt.tight_layout()
    return fig

# Preview the plot
fig = create_training_curves()
plt.show()
plt.close()

In [None]:
# Create a run with artifacts
with mlflow.start_run(run_name="run-with-artifacts") as run:
    
    # Log parameters and metrics as before
    mlflow.log_params({
        "model_name": "microsoft/phi-2",
        "learning_rate": 2e-4,
        "batch_size": 16,
        "epochs": 10
    })
    
    mlflow.log_metrics({
        "final_train_loss": 0.12,
        "final_val_loss": 0.18,
        "final_accuracy": 0.91
    })
    
    # ========== ARTIFACT 1: Training Plot ==========
    fig = create_training_curves()
    plot_path = "/tmp/training_curves.png"
    fig.savefig(plot_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    
    # Log to 'plots' subdirectory in artifacts
    mlflow.log_artifact(plot_path, artifact_path="plots")
    print(f"📊 Logged plot: {plot_path}")
    
    # ========== ARTIFACT 2: Config File ==========
    config = {
        "model": {
            "name": "microsoft/phi-2",
            "dtype": "bfloat16",
            "max_length": 2048
        },
        "training": {
            "learning_rate": 2e-4,
            "batch_size": 16,
            "epochs": 10,
            "warmup_ratio": 0.1,
            "weight_decay": 0.01
        },
        "lora": {
            "rank": 16,
            "alpha": 32,
            "dropout": 0.05,
            "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]
        },
        "hardware": {
            "device": "DGX Spark",
            "memory_gb": 128,
            "precision": "bfloat16"
        }
    }
    
    config_path = "/tmp/training_config.json"
    with open(config_path, 'w') as f:
        json.dump(config, f, indent=2)
    
    mlflow.log_artifact(config_path, artifact_path="configs")
    print(f"⚙️ Logged config: {config_path}")
    
    # ========== ARTIFACT 3: Training Notes ==========
    notes = """# Training Notes

## Setup
- Hardware: DGX Spark with 128GB unified memory
- Framework: PyTorch 2.x with bfloat16
- Method: LoRA fine-tuning (rank=16, alpha=32)

## Observations
- Learning rate 2e-4 worked well (1e-4 was too slow)
- Batch size 16 maximized GPU utilization
- Warmup helped stabilize early training

## Issues
- Initial runs had gradient overflow - fixed with gradient clipping
- Had to reduce context length from 4096 to 2048 for memory

## Next Steps
- Try rank=32 for better quality
- Experiment with learning rate scheduling
"""
    
    notes_path = "/tmp/training_notes.md"
    with open(notes_path, 'w') as f:
        f.write(notes)
    
    mlflow.log_artifact(notes_path)  # Root of artifacts
    print(f"📝 Logged notes: {notes_path}")
    
    # ========== ARTIFACT 4: Sample Predictions ==========
    predictions = [
        {"input": "What is machine learning?", 
         "output": "Machine learning is a subset of AI that enables systems to learn from data..."},
        {"input": "Explain gradient descent",
         "output": "Gradient descent is an optimization algorithm that minimizes loss by..."},
        {"input": "What is a transformer?",
         "output": "A transformer is a neural network architecture that uses self-attention..."}
    ]
    
    predictions_path = "/tmp/sample_predictions.json"
    with open(predictions_path, 'w') as f:
        json.dump(predictions, f, indent=2)
    
    mlflow.log_artifact(predictions_path, artifact_path="predictions")
    print(f"🔮 Logged predictions: {predictions_path}")
    
    artifact_run_id = run.info.run_id

print(f"\n✅ Run completed with artifacts!")
print(f"   Run ID: {artifact_run_id}")

In [None]:
# List all artifacts for this run
client = mlflow.tracking.MlflowClient()

print("📁 ARTIFACTS STRUCTURE")
print("=" * 50)

def list_artifacts_recursive(run_id, path="", depth=0):
    """Recursively list all artifacts, indenting by nesting depth."""
    artifacts = client.list_artifacts(run_id, path)
    for artifact in artifacts:
        indent = "  " * depth
        if artifact.is_dir:
            print(f"{indent}📂 {artifact.path}/")
            list_artifacts_recursive(run_id, artifact.path, depth + 1)
        else:
            size = artifact.file_size if artifact.file_size else 0
            print(f"{indent}📄 {artifact.path} ({size:,} bytes)")

list_artifacts_recursive(artifact_run_id)

---

## Part 4: Logging PyTorch Models

MLflow has native support for PyTorch models. This is incredibly powerful for:
- **Reproducibility**: Load the exact model from any run
- **Deployment**: MLflow models can be served as REST APIs
- **Versioning**: Track model evolution over time

In [None]:
import torch
import torch.nn as nn

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
class SentimentClassifier(nn.Module):
    """
    Simple sentiment classifier for demonstration.
    In practice, you'd use a transformer-based model.
    """
    
    def __init__(self, vocab_size: int = 30000, embed_dim: int = 256, 
                 hidden_dim: int = 512, num_classes: int = 3):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes)
        )
    
    def forward(self, x):
        # x: (batch, seq_len)
        embedded = self.embedding(x)  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (2, batch, hidden_dim)
        
        # Concatenate forward and backward hidden states
        hidden = torch.cat([hidden[0], hidden[1]], dim=1)  # (batch, hidden_dim * 2)
        
        return self.classifier(hidden)

# Create model
model = SentimentClassifier()
model = model.to(device)

total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"\n📊 Model Statistics:")
print(f"   Total parameters: {total_params:,}")
print(f"   Trainable parameters: {trainable_params:,}")
print(f"   Model size: ~{total_params * 4 / 1e6:.1f} MB (FP32)")

In [None]:
# Log the model to MLflow
with mlflow.start_run(run_name="sentiment-model-v1") as run:
    
    # Log model architecture parameters
    mlflow.log_params({
        "vocab_size": 30000,
        "embed_dim": 256,
        "hidden_dim": 512,
        "num_classes": 3,
        "total_params": total_params,
        "architecture": "BiLSTM"
    })
    
    # Simulate training metrics
    for epoch in range(5):
        train_loss = 0.8 * (0.7 ** epoch) + np.random.normal(0, 0.02)
        val_loss = 0.9 * (0.75 ** epoch) + np.random.normal(0, 0.03)
        accuracy = 0.6 + epoch * 0.08 + np.random.normal(0, 0.01)
        
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "accuracy": min(accuracy, 0.98)
        }, step=epoch)
    
    # Create sample input for signature
    sample_input = torch.randint(0, 30000, (1, 128)).to(device)
    
    # Log the model!
    # This saves: weights, architecture info, requirements, input example
    mlflow.pytorch.log_model(
        model,
        artifact_path="model",
        input_example=sample_input.cpu().numpy(),
        registered_model_name=None  # Don't register yet (we'll do this in Lab 4.3.6)
    )
    
    model_run_id = run.info.run_id

print(f"\n✅ Model logged successfully!")
print(f"   Run ID: {model_run_id}")
print(f"   Model URI: runs:/{model_run_id}/model")

In [None]:
# Load the model back from MLflow
model_uri = f"runs:/{model_run_id}/model"

print(f"Loading model from: {model_uri}")
loaded_model = mlflow.pytorch.load_model(model_uri)
loaded_model = loaded_model.to(device)
loaded_model.eval()

print(f"\n✅ Model loaded!")
print(f"   Type: {type(loaded_model).__name__}")

# Test inference
test_input = torch.randint(0, 30000, (2, 128)).to(device)
with torch.no_grad():
    output = loaded_model(test_input)

print(f"\n🔮 Inference test:")
print(f"   Input shape: {test_input.shape}")
print(f"   Output shape: {output.shape}")
print(f"   Predictions: {torch.argmax(output, dim=1).tolist()}")

---

## Part 5: Running a Hyperparameter Sweep

The real power of experiment tracking: running many experiments and finding the best configuration.

### 🧒 ELI5: Hyperparameter Sweeps

> Imagine trying every possible cookie recipe variation:
> - Sugar: 1/4 cup, 1/2 cup, 3/4 cup
> - Butter: 1/2 cup, 1 cup
> - Baking time: 10 min, 12 min, 15 min
>
> That's 3 × 2 × 3 = 18 batches of cookies!
>
> With good notes (MLflow), you can easily find:
> - "1/2 cup sugar + 1 cup butter + 12 min = BEST cookies!"
>
> In ML, we call this a **hyperparameter sweep** or **grid search**.

In [None]:
import itertools
import random

# Define hyperparameter grid
param_grid = {
    "learning_rate": [1e-5, 5e-5, 1e-4, 5e-4],
    "batch_size": [8, 16, 32],
    "lora_rank": [8, 16, 32]
}

# Calculate total combinations
total_combinations = 1
for values in param_grid.values():
    total_combinations *= len(values)

print(f"🔬 HYPERPARAMETER SWEEP")
print("=" * 50)
print(f"Parameters to search:")
for param, values in param_grid.items():
    print(f"   {param}: {values}")
print(f"\nTotal combinations: {total_combinations}")
print("\nRunning sweep...")
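
A grid like this can also be expanded without naming each key by hand. The helper below is a minimal sketch (the name `iter_grid` is hypothetical, not an MLflow or scikit-learn function) that turns any dictionary of value lists into one configuration dictionary per combination:

```python
import itertools


def iter_grid(grid: dict):
    """Yield one {name: value} dict per combination in the grid (hypothetical helper)."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


demo_grid = {
    "learning_rate": [1e-5, 5e-5, 1e-4, 5e-4],
    "batch_size": [8, 16, 32],
    "lora_rank": [8, 16, 32],
}

configs = list(iter_grid(demo_grid))
print(len(configs))   # 4 * 3 * 3 = 36 combinations
print(configs[0])     # first combination: the first value of every parameter
```

Each yielded dictionary can be passed directly to `mlflow.log_params`, so adding a new hyperparameter to the grid needs no change to the loop.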

In [None]:
# Create a new experiment for the sweep
SWEEP_EXPERIMENT = "Hyperparameter-Sweep-Demo"
mlflow.set_experiment(SWEEP_EXPERIMENT)

# Run all combinations
results = []

for lr, bs, rank in itertools.product(
    param_grid["learning_rate"],
    param_grid["batch_size"],
    param_grid["lora_rank"]
):
    run_name = f"lr={lr}_bs={bs}_r={rank}"
    
    with mlflow.start_run(run_name=run_name):
        # Log parameters
        mlflow.log_params({
            "learning_rate": lr,
            "batch_size": bs,
            "lora_rank": rank,
            "model": "phi-2",
            "epochs": 5
        })
        
        # Simulate training with realistic patterns
        # (In real life, you'd actually train here!)
        
        # Higher LR = faster convergence but more noise
        lr_factor = np.log10(lr) + 5  # Roughly 0-1.7 for this grid (1e-5 to 5e-4)
        
        # Larger batch = more stable but slower
        bs_factor = np.log2(bs) / 5  # Normalized: ~0.6-1.0
        
        # Higher rank = better capacity
        rank_factor = rank / 32  # Normalized: 0.25-1.0
        
        # Simulate final metrics
        base_accuracy = 0.75 + 0.1 * rank_factor + 0.05 * lr_factor
        noise = random.gauss(0, 0.02)
        final_accuracy = min(0.98, max(0.5, base_accuracy + noise))
        
        final_loss = (1 - final_accuracy) * 2 + random.gauss(0, 0.05)
        final_loss = max(0.05, final_loss)
        
        # Memory usage increases with batch size and rank
        memory_gb = 10 + bs * 0.5 + rank * 0.3
        
        # Training time inversely related to batch size
        training_time = 120 / (bs / 8) * (rank / 16)
        
        # Log step-wise metrics
        for epoch in range(5):
            epoch_loss = final_loss * (2 - epoch * 0.2) + random.gauss(0, 0.03)
            epoch_acc = final_accuracy * (0.7 + epoch * 0.06) + random.gauss(0, 0.02)
            
            mlflow.log_metrics({
                "train_loss": epoch_loss,
                "accuracy": min(0.98, epoch_acc)
            }, step=epoch)
        
        # Log final metrics
        mlflow.log_metrics({
            "final_loss": final_loss,
            "final_accuracy": final_accuracy,
            "memory_gb": memory_gb,
            "training_time_min": training_time
        })
        
        results.append({
            "lr": lr, "bs": bs, "rank": rank,
            "accuracy": final_accuracy, "loss": final_loss
        })

print(f"\n✅ Sweep complete! {len(results)} runs logged.")

In [None]:
# Query and analyze results
import pandas as pd

# Get experiment ID
sweep_exp = mlflow.get_experiment_by_name(SWEEP_EXPERIMENT)

# Search for all runs, sorted by accuracy
runs_df = mlflow.search_runs(
    experiment_ids=[sweep_exp.experiment_id],
    filter_string="",
    order_by=["metrics.final_accuracy DESC"]
)

# Display top results
display_cols = [
    "params.learning_rate",
    "params.batch_size", 
    "params.lora_rank",
    "metrics.final_accuracy",
    "metrics.final_loss",
    "metrics.memory_gb"
]

print("🏆 TOP 10 CONFIGURATIONS (by accuracy)")
print("=" * 80)
print(runs_df[display_cols].head(10).to_string(index=False))

In [None]:
# Find the best run
best_run = runs_df.iloc[0]

print("\n🥇 BEST CONFIGURATION")
print("=" * 50)
print(f"Run ID: {best_run['run_id'][:8]}...")
print(f"\nHyperparameters:")
print(f"   Learning Rate: {best_run['params.learning_rate']}")
print(f"   Batch Size: {best_run['params.batch_size']}")
print(f"   LoRA Rank: {best_run['params.lora_rank']}")
print(f"\nResults:")
print(f"   Accuracy: {best_run['metrics.final_accuracy']:.4f}")
print(f"   Loss: {best_run['metrics.final_loss']:.4f}")
print(f"   Memory: {best_run['metrics.memory_gb']:.1f} GB")

In [None]:
# Visualize the sweep results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Prepare data
sweep_data = runs_df.copy()
sweep_data['lr'] = sweep_data['params.learning_rate'].astype(float)
sweep_data['bs'] = sweep_data['params.batch_size'].astype(int)
sweep_data['rank'] = sweep_data['params.lora_rank'].astype(int)

# Plot 1: Learning Rate vs Accuracy (colored by rank)
for rank in sorted(sweep_data['rank'].unique()):
    subset = sweep_data[sweep_data['rank'] == rank]
    axes[0, 0].scatter(
        subset['lr'], 
        subset['metrics.final_accuracy'],
        label=f'rank={rank}',
        s=100, alpha=0.7
    )
axes[0, 0].set_xscale('log')
axes[0, 0].set_xlabel('Learning Rate')
axes[0, 0].set_ylabel('Final Accuracy')
axes[0, 0].set_title('Learning Rate vs Accuracy')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Batch Size vs Loss (colored by rank)
for rank in sorted(sweep_data['rank'].unique()):
    subset = sweep_data[sweep_data['rank'] == rank]
    axes[0, 1].scatter(
        subset['bs'],
        subset['metrics.final_loss'],
        label=f'rank={rank}',
        s=100, alpha=0.7
    )
axes[0, 1].set_xlabel('Batch Size')
axes[0, 1].set_ylabel('Final Loss')
axes[0, 1].set_title('Batch Size vs Loss')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Accuracy vs Memory (trade-off scatter)
sc = axes[1, 0].scatter(
    sweep_data['metrics.memory_gb'],
    sweep_data['metrics.final_accuracy'],
    c=sweep_data['rank'],
    cmap='viridis',
    s=100, alpha=0.7
)
axes[1, 0].set_xlabel('Memory Usage (GB)')
axes[1, 0].set_ylabel('Final Accuracy')
axes[1, 0].set_title('Accuracy vs Memory Trade-off')
plt.colorbar(sc, ax=axes[1, 0], label='LoRA Rank')
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Heatmap of mean accuracy per LR x Rank
pivot = sweep_data.pivot_table(
    values='metrics.final_accuracy',
    index='lr',
    columns='rank',
    aggfunc='mean'
)
im = axes[1, 1].imshow(pivot.values, cmap='RdYlGn', aspect='auto')
axes[1, 1].set_xticks(range(len(pivot.columns)))
axes[1, 1].set_xticklabels(pivot.columns)
axes[1, 1].set_yticks(range(len(pivot.index)))
axes[1, 1].set_yticklabels([f'{x:.0e}' for x in pivot.index])
axes[1, 1].set_xlabel('LoRA Rank')
axes[1, 1].set_ylabel('Learning Rate')
axes[1, 1].set_title('Mean Accuracy Heatmap')
plt.colorbar(im, ax=axes[1, 1], label='Accuracy')

plt.tight_layout()
plt.savefig('/tmp/sweep_analysis.png', dpi=150)
plt.show()

print("\n📊 Saved sweep analysis to /tmp/sweep_analysis.png")
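
The accuracy-versus-memory scatter shows every run, but only some of them are worth considering: those that no other run beats on both memory and accuracy. A minimal stdlib sketch of picking out that Pareto front follows; the function name `pareto_front` and the sample points are made up for illustration.

```python
def pareto_front(points):
    """Return the (memory_gb, accuracy) points that no other point dominates.

    A point is dominated if another point uses no more memory and reaches
    at least the same accuracy, with one of the two strictly better.
    """
    front = []
    best_acc = float("-inf")
    # Sort by memory ascending; for equal memory, highest accuracy first
    for mem, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:  # strictly better than every cheaper (or equal-cost) run
            front.append((mem, acc))
            best_acc = acc
    return front


demo_points = [(16.4, 0.80), (20.4, 0.86), (18.8, 0.84), (30.8, 0.85), (35.6, 0.93)]
print(pareto_front(demo_points))
```

With the sweep DataFrame, the input could be built as `zip(sweep_data["metrics.memory_gb"], sweep_data["metrics.final_accuracy"])`.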

---

## Part 6: The MLflow UI

MLflow includes a web interface for browsing, filtering, and comparing experiments.

In [None]:
print(f"""
🖥️  STARTING THE MLFLOW UI
{'=' * 60}

To view your experiments in a web browser:

1. Open a terminal and run:
   
   mlflow ui --backend-store-uri {MLFLOW_DIR} --host 0.0.0.0 --port 5000

2. Open in browser: http://localhost:5000

{'=' * 60}

📊 UI Features:
   • Compare runs side-by-side
   • View metric charts over time
   • Download artifacts
   • Filter and search runs
   • Export to CSV

{'=' * 60}

🐳 For DGX Spark with Docker, start your container with port exposed:

   docker run --gpus all -it --rm \\
       -v $HOME/workspace:/workspace \\
       -v $HOME/.cache/huggingface:/root/.cache/huggingface \\
       --ipc=host \\
       -p 5000:5000 \\
       nvcr.io/nvidia/pytorch:25.11-py3

""")

---

## Part 7: Autologging

MLflow can automatically capture metrics from popular frameworks - no manual logging needed!

In [None]:
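# Enable autologging for PyTorch
# Note: mlflow.pytorch.autolog() hooks into PyTorch Lightning training
# (Trainer.fit on a LightningModule); a hand-written training loop is not captured.
mlflow.pytorch.autolog(
    log_models=True,        # Automatically log model artifacts
    log_every_n_epoch=1,    # Log metrics every epoch
    log_every_n_step=None,  # Don't log every step (too much data)
    registered_model_name=None  # Don't auto-register
)

print("✅ PyTorch autologging enabled!")
print("""
With autologging, MLflow captures from PyTorch Lightning training:
• Training loss and other metrics logged by the LightningModule
• Model summary (architecture)
• Optimizer parameters
• Model artifacts

GPU and memory usage are not part of autologging; they are covered by
MLflow's separate system metrics logging.

Plain PyTorch training loops still need the manual log_* calls shown earlier.
""")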

In [None]:
# Example of MLflow logging with the HuggingFace Trainer
example_code = '''
# The HuggingFace Trainer reports to MLflow through its built-in MLflow callback.
# Note: mlflow.transformers.autolog() does not log training runs by itself,
# so the integration is switched on with report_to="mlflow".

import mlflow
from transformers import Trainer, TrainingArguments

# Your normal training code, with MLflow reporting enabled
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    logging_steps=100,
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    report_to="mlflow",     # send Trainer logs to MLflow
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Training arguments and metrics are logged to MLflow while this runs
trainer.train()

# Check the MLflow UI - you'll see:
# - All training arguments
# - Loss curves
# - Evaluation metrics
# - Model checkpoints (only if HF_MLFLOW_LOG_ARTIFACTS is enabled)
'''

print("📝 Example: MLflow logging with HuggingFace Trainer")
print("=" * 50)
print(example_code)

---

## ✋ Try It Yourself: Exercise

**Task:** Create your own experiment tracking workflow.

1. Create a new experiment called `"my-first-experiment"`
2. Run at least 5 training simulations with different hyperparameters
3. Log:
   - At least 4 parameters
   - Metrics over time (multiple epochs)
   - One artifact (plot or config file)
4. Query to find the best run
5. Create a visualization comparing runs

<details>
<summary>💡 Hint</summary>

```python
# Step 1: Create experiment
mlflow.set_experiment("my-first-experiment")

# Step 2: Loop through configurations
for lr in [1e-5, 1e-4, 1e-3]:
    for dropout in [0.1, 0.3]:
        with mlflow.start_run(run_name=f"lr={lr}_drop={dropout}"):
            # Log params, simulate training, log metrics
            ...

# Step 3: Query results
runs = mlflow.search_runs(order_by=["metrics.accuracy DESC"])
```
</details>

In [None]:
# YOUR CODE HERE

# Step 1: Create experiment


# Step 2: Run training simulations


# Step 3: Log artifacts


# Step 4: Query for best run


# Step 5: Visualize results


---

## ⚠️ Common Mistakes

### Mistake 1: Not Using Context Managers

In [None]:
# ❌ WRONG: Run never ends if code crashes
# mlflow.start_run()
# ... training code that might crash ...
# mlflow.end_run()  # Never reached!

# ✅ RIGHT: Use context manager - run closes automatically
# with mlflow.start_run():
#     ... training code ...
#     # Run ends automatically, even on error

print("Always use 'with mlflow.start_run():' for automatic cleanup!")

### Mistake 2: Logging Too Frequently

In [None]:
# ❌ WRONG: Logging every step creates huge databases
# for step in range(1_000_000):
#     mlflow.log_metric("loss", loss, step=step)  # 1M entries!

# ✅ RIGHT: Log at reasonable intervals
# for step in range(1_000_000):
#     if step % 1000 == 0:  # Every 1000 steps
#         mlflow.log_metric("loss", loss, step=step)  # Only 1000 entries

print("Log metrics at reasonable intervals (every N steps/epochs).")
print("For long training: every 100-1000 steps is usually enough.")

### Mistake 3: Not Setting Experiment

In [None]:
# ❌ WRONG: All runs go to "Default" experiment
# with mlflow.start_run():
#     ...  # Which project is this for??

# ✅ RIGHT: Always set experiment first
# mlflow.set_experiment("my-project-name")
# with mlflow.start_run():
#     ...  # Clearly organized!

print("Always call mlflow.set_experiment() before starting runs!")
print("This keeps your experiments organized and findable.")

### Mistake 4: Nested Runs Without Explicit Control

In [None]:
# ❌ WRONG: Starting a run while another is active, without nested=True
# with mlflow.start_run():  # Parent run
#     for i in range(3):
#         with mlflow.start_run():  # Raises: a run is already active
#             ...

# ✅ RIGHT: Explicit nested runs OR separate runs
# Option A: Separate runs
# for i in range(3):
#     with mlflow.start_run(run_name=f"run-{i}"):
#         ...

# Option B: Explicit nested runs
# with mlflow.start_run(run_name="parent") as parent:
#     for i in range(3):
#         with mlflow.start_run(run_name=f"child-{i}", nested=True):
#             ...

print("Use nested=True when you intentionally want nested runs.")
print("For hyperparameter sweeps, separate runs are usually better.")

---

## 🎉 Checkpoint

You've learned:
- ✅ Setting up MLflow tracking on DGX Spark
- ✅ Logging parameters, metrics, and artifacts
- ✅ Running hyperparameter sweeps and finding optimal configs
- ✅ Using the MLflow UI for visualization
- ✅ Autologging with PyTorch and Transformers

---

## 🚀 Challenge (Optional)

**Set up production-grade MLflow:**

1. Run MLflow server with SQLite/PostgreSQL backend
2. Use MinIO or S3 for artifact storage
3. Set up authentication with nginx reverse proxy
4. Create a CI/CD pipeline that logs training runs automatically

---

## 📖 Further Reading

- [MLflow Documentation](https://mlflow.org/docs/latest/)
- [MLflow Tracking Guide](https://mlflow.org/docs/latest/tracking.html)
- [MLflow with PyTorch](https://mlflow.org/docs/latest/python_api/mlflow.pytorch.html)
- [MLflow with Transformers](https://mlflow.org/docs/latest/llms/transformers/index.html)

---

## 🧹 Cleanup

In [None]:
# Clean up resources
import gc

# Clear matplotlib figures
plt.close('all')

# Clear Python garbage
gc.collect()

# Clear GPU memory if available
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU memory freed. Current usage: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

print(f"\n📁 MLflow data saved to: {MLFLOW_DIR}")
print(f"\n🖥️  To view results, run:")
print(f"   mlflow ui --backend-store-uri {MLFLOW_DIR} --port 5000")

---

## 📝 Summary

In this lab, we:

1. **Set up** MLflow for local experiment tracking on DGX Spark
2. **Created** experiments and logged training runs with parameters, metrics, and artifacts
3. **Logged** PyTorch models for later retrieval
4. **Ran** a hyperparameter sweep and analyzed results
5. **Learned** common pitfalls and best practices for experiment tracking

**Next up:** Lab 4.3.2 - Weights & Biases Integration for team collaboration and advanced visualizations!