# Task 15.3: MLflow Experiment Tracking

**Module:** 15 - Benchmarking, Evaluation & MLOps  
**Time:** 2 hours  
**Difficulty:** ⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand why experiment tracking matters
- [ ] Set up MLflow tracking on DGX Spark
- [ ] Log parameters, metrics, and artifacts
- [ ] Create reproducible experiments
- [ ] Use the MLflow UI to compare runs

---

## 📚 Prerequisites

- Completed: Tasks 15.1-15.2
- Knowledge of: Python, basic ML training loops
- Hardware: Any system (MLflow is lightweight)

---

## 🌍 Real-World Context

**Have you ever trained a model, gotten great results... and then forgotten what hyperparameters you used?**

This happens to everyone! That's why companies use experiment tracking:

- **Netflix:** Tracks thousands of recommendation model experiments
- **Airbnb:** Logs all search ranking experiments
- **Meta:** Records every model training run

**MLflow** is a widely used open-source platform for:
1. **Tracking** - Log everything about your experiments
2. **Projects** - Package code for reproducibility
3. **Models** - Manage and deploy models
4. **Registry** - Version and stage models

---

## 🧒 ELI5: What is Experiment Tracking?

> **Imagine you're a chef creating a new recipe.** Each time you cook, you'd want to record:
> - **Ingredients** (parameters): 2 cups flour, 1 tsp salt
> - **How it turned out** (metrics): Taste score 8/10, texture 7/10
> - **The actual dish** (artifacts): A photo of the result
>
> After many attempts, you can look back and see:
> - "Aha! More butter = better taste!"
> - "The recipe from Tuesday was the best!"
>
> **Experiment tracking is the same for ML!** You record:
> - **Parameters:** learning rate, batch size, model architecture
> - **Metrics:** accuracy, loss, training time
> - **Artifacts:** model weights, plots, predictions
>
> Then you can compare and find what works best!

---

## Part 1: Setting Up MLflow

Let's install and configure MLflow.

In [None]:
# Install MLflow
# MLflow is pure Python and works well on ARM64 (DGX Spark)

import subprocess
import sys

try:
    import mlflow
    print(f"MLflow already installed: {mlflow.__version__}")
except ImportError:
    print("Installing MLflow...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "mlflow", "-q"])
    import mlflow
    print(f"MLflow installed: {mlflow.__version__}")

In [None]:
import mlflow
import mlflow.pytorch
import os
import json
from datetime import datetime

print(f"MLflow version: {mlflow.__version__}")

In [None]:
# Create directory for MLflow data
# Use Path for robust cross-platform path handling
from pathlib import Path

# Get the module directory relative to the notebooks folder
# Assumes the notebook is launched from notebooks/, so the module
# directory is the parent of the current working directory
NOTEBOOK_DIR = Path.cwd()
MODULE_DIR = (NOTEBOOK_DIR / "..").resolve()  # Go up from notebooks/
MLFLOW_DIR = str(MODULE_DIR / "mlflow")

os.makedirs(MLFLOW_DIR, exist_ok=True)

# Set tracking URI to local directory
# With a tracking server, you'd use: mlflow.set_tracking_uri("http://localhost:5000")
# Path.as_uri() builds a valid file:// URI on every platform
mlflow.set_tracking_uri(Path(MLFLOW_DIR).as_uri())

print(f"MLflow tracking directory: {MLFLOW_DIR}")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")

### 🔍 MLflow Storage Options

| Storage Type | URI Format | Use Case |
|-------------|------------|----------|
| Local Files | `file:///path/to/mlflow` | Development |
| Local Server | `http://localhost:5000` | Team sharing |
| Remote Server | `http://mlflow.company.com` | Production |
| Databricks | `databricks` | Databricks platform |

For DGX Spark, local files work great for personal use. For teams, run a server!
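
One way to switch between these storage options without editing the notebook is to read the URI from the environment. The sketch below is only an illustration: `resolve_tracking_uri` is a hypothetical helper (not part of MLflow), and it assumes you want the standard `MLFLOW_TRACKING_URI` environment variable to take precedence over a local folder.

```python
import os
from pathlib import Path


def resolve_tracking_uri(local_dir, env=None):
    """Return the MLFLOW_TRACKING_URI override if set, else a file:// URI for local_dir."""
    env = os.environ if env is None else env
    override = env.get("MLFLOW_TRACKING_URI", "").strip()
    if override:
        # For example "http://localhost:5000" when a team server is running
        return override
    # Path.as_uri() needs an absolute path, so resolve() first
    return Path(local_dir).expanduser().resolve().as_uri()


# Explicit (hypothetical) environments make both branches easy to see
local_uri = resolve_tracking_uri("mlflow", env={})
server_uri = resolve_tracking_uri(
    "mlflow", env={"MLFLOW_TRACKING_URI": "http://localhost:5000"}
)
print(local_uri)
print(server_uri)
```

The returned string can be passed straight to `mlflow.set_tracking_uri(...)`, so the same notebook works for personal use and against a shared server.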

---

## Part 2: Creating Your First Experiment

Experiments in MLflow are like folders for related runs.

In [None]:
# Create or get an experiment
experiment_name = "LLM-Finetuning-Demo"

# Check if experiment exists
experiment = mlflow.get_experiment_by_name(experiment_name)

if experiment is None:
    experiment_id = mlflow.create_experiment(
        experiment_name,
        tags={
            "project": "dgx-spark-curriculum",
            "module": "15",
            "task": "experiment-tracking"
        }
    )
    print(f"Created experiment '{experiment_name}' with ID: {experiment_id}")
else:
    experiment_id = experiment.experiment_id
    print(f"Using existing experiment '{experiment_name}' with ID: {experiment_id}")

# Set as active experiment
mlflow.set_experiment(experiment_name)

### 🧒 ELI5: Experiments vs Runs

> **Experiment** = A folder ("Chocolate Chip Cookie Recipes")
>
> **Run** = One attempt ("Recipe #5: with brown butter")
>
> You might have:
> - Experiment: "LLM Finetuning"
>   - Run 1: learning_rate=1e-4, epochs=3
>   - Run 2: learning_rate=1e-5, epochs=5
>   - Run 3: learning_rate=1e-4, epochs=3, with LoRA

In [None]:
# Start a run and log some data
with mlflow.start_run(run_name="demo-run-1") as run:
    
    # Log parameters (inputs to your experiment)
    mlflow.log_param("model_name", "microsoft/phi-2")
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("batch_size", 8)
    mlflow.log_param("epochs", 3)
    mlflow.log_param("lora_rank", 16)
    
    # Log metrics (outputs/results)
    mlflow.log_metric("train_loss", 0.45)
    mlflow.log_metric("eval_loss", 0.52)
    mlflow.log_metric("accuracy", 0.87)
    mlflow.log_metric("training_time_seconds", 3600)
    
    # Log metrics over time (for charts)
    for epoch in range(3):
        mlflow.log_metric("epoch_loss", 0.5 - epoch * 0.1, step=epoch)
        mlflow.log_metric("epoch_accuracy", 0.7 + epoch * 0.05, step=epoch)
    
    # Log tags (metadata)
    mlflow.set_tag("gpu", "DGX Spark")
    mlflow.set_tag("framework", "pytorch")
    mlflow.set_tag("status", "completed")
    
    print(f"Run ID: {run.info.run_id}")
    print(f"Artifact URI: {run.info.artifact_uri}")

In [None]:
# View what we logged
run_data = mlflow.get_run(run.info.run_id)

print("\n📊 Logged Data:")
print("="*50)
print("\nParameters:")
for key, value in run_data.data.params.items():
    print(f"  {key}: {value}")

print("\nMetrics:")
for key, value in run_data.data.metrics.items():
    print(f"  {key}: {value}")

print("\nTags:")
for key, value in run_data.data.tags.items():
    if not key.startswith("mlflow."):  # Skip internal tags
        print(f"  {key}: {value}")

---

## Part 3: Logging Artifacts

Artifacts are files associated with a run: models, plots, configs, etc.
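
When a run produces several files, it can be convenient to stage them in a temporary directory whose layout mirrors the artifact tree you want, rather than writing to fixed paths. The following is a minimal sketch under that assumption; `stage_artifacts` and the sample config are hypothetical names used for illustration only.

```python
import json
import tempfile
from pathlib import Path


def stage_artifacts(staging_dir, config, notes):
    """Write run files under staging_dir, mirroring the desired artifact layout."""
    root = Path(staging_dir)
    (root / "configs").mkdir(parents=True, exist_ok=True)

    config_path = root / "configs" / "config.json"
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

    notes_path = root / "notes.txt"
    notes_path.write_text(notes, encoding="utf-8")
    return [config_path, notes_path]


with tempfile.TemporaryDirectory() as tmp:
    written = stage_artifacts(
        tmp,
        config={"training": {"learning_rate": 2e-4, "batch_size": 16}},
        notes="Used LoRA for efficient finetuning\n",
    )
    # Inside an active run, mlflow.log_artifacts(tmp) would upload this whole tree
    staged = sorted(p.relative_to(tmp).as_posix() for p in written)

print(staged)
```

The temporary directory is removed when the `with` block exits, so any upload has to happen inside the block.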

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Create a sample plot to log
def create_training_plot():
    """Create a training progress plot."""
    epochs = np.arange(1, 11)
    train_loss = np.exp(-epochs * 0.3) + 0.1
    val_loss = np.exp(-epochs * 0.25) + 0.15
    
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(epochs, train_loss, 'b-', label='Training Loss', linewidth=2)
    ax.plot(epochs, val_loss, 'r--', label='Validation Loss', linewidth=2)
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Loss')
    ax.set_title('Training Progress')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    return fig

# Create plot
fig = create_training_plot()
plt.show()

In [None]:
# Log artifacts in a new run
with mlflow.start_run(run_name="demo-run-with-artifacts") as run:
    
    # Log parameters
    mlflow.log_params({
        "model_name": "microsoft/phi-2",
        "learning_rate": 2e-4,
        "batch_size": 16,
        "epochs": 10
    })
    
    # Log metrics
    mlflow.log_metrics({
        "final_train_loss": 0.15,
        "final_val_loss": 0.22,
        "accuracy": 0.91
    })
    
    # Save and log the plot
    fig = create_training_plot()
    plot_path = "/tmp/training_progress.png"
    fig.savefig(plot_path, dpi=150, bbox_inches='tight')
    plt.close(fig)
    
    mlflow.log_artifact(plot_path, artifact_path="plots")
    print(f"Logged plot: {plot_path}")
    
    # Log a config file
    config = {
        "model": {
            "name": "microsoft/phi-2",
            "dtype": "bfloat16"
        },
        "training": {
            "learning_rate": 2e-4,
            "batch_size": 16,
            "epochs": 10,
            "warmup_steps": 100
        },
        "lora": {
            "rank": 16,
            "alpha": 32,
            "dropout": 0.1
        }
    }
    
    config_path = "/tmp/config.json"
    with open(config_path, 'w') as f:
        json.dump(config, f, indent=2)
    
    mlflow.log_artifact(config_path, artifact_path="configs")
    print(f"Logged config: {config_path}")
    
    # Log a text file with notes
    notes = """Training Notes
================
- Used LoRA for efficient finetuning
- Training on DGX Spark with 128GB unified memory
- Dataset: Custom instruction dataset (10k samples)
- Notable: Lower learning rate helped stability
"""
    notes_path = "/tmp/notes.txt"
    with open(notes_path, 'w') as f:
        f.write(notes)
    
    mlflow.log_artifact(notes_path)
    print(f"Logged notes: {notes_path}")
    
    print(f"\n✅ Run ID: {run.info.run_id}")

In [None]:
# List artifacts for the run
client = mlflow.tracking.MlflowClient()

print("\n📁 Artifacts:")
for artifact in client.list_artifacts(run.info.run_id):
    print(f"  {artifact.path} ({artifact.file_size if artifact.file_size else 'directory'})")
    if artifact.is_dir:
        for sub_artifact in client.list_artifacts(run.info.run_id, artifact.path):
            print(f"    └── {sub_artifact.path}")

---

## Part 4: Logging PyTorch Models

MLflow has built-in support for logging PyTorch models.

In [None]:
import torch
import torch.nn as nn

# Create a simple model for demonstration
class SimpleClassifier(nn.Module):
    """Simple neural network classifier."""
    
    def __init__(self, input_size: int = 768, hidden_size: int = 256, num_classes: int = 10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size, hidden_size // 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_size // 2, num_classes)
        )
    
    def forward(self, x):
        return self.layers(x)

# Create model instance
model = SimpleClassifier()
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

In [None]:
# Simulate training and log the model
with mlflow.start_run(run_name="model-logging-demo") as run:
    
    # Log model architecture info
    mlflow.log_params({
        "input_size": 768,
        "hidden_size": 256,
        "num_classes": 10,
        "total_params": sum(p.numel() for p in model.parameters())
    })
    
    # Simulate training metrics
    for epoch in range(5):
        train_loss = 1.0 * (0.7 ** epoch)
        val_loss = 1.2 * (0.75 ** epoch)
        accuracy = 0.5 + epoch * 0.1
        
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "accuracy": accuracy
        }, step=epoch)
    
    # Log the PyTorch model
    # This saves:
    # - Model weights
    # - Model signature (input/output schema)
    # - Requirements (pip dependencies)
    
    # Create a sample input for signature inference
    sample_input = torch.randn(1, 768)
    
    mlflow.pytorch.log_model(
        model,
        artifact_path="model",
        registered_model_name=None,  # Don't register yet
        input_example=sample_input.numpy()
    )
    
    print(f"✅ Model logged to run: {run.info.run_id}")

In [None]:
# Load the model back from MLflow
model_uri = f"runs:/{run.info.run_id}/model"
loaded_model = mlflow.pytorch.load_model(model_uri)

print(f"Loaded model from: {model_uri}")
print(f"Model type: {type(loaded_model)}")

# Test inference
test_input = torch.randn(1, 768)
with torch.no_grad():
    output = loaded_model(test_input)
print(f"Output shape: {output.shape}")

---

## Part 5: Comparing Experiments

The real power of MLflow is comparing multiple runs.
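
Comparisons are only meaningful if a sweep can be repeated. The sketch below shows one possible way to make a simulated sweep reproducible: it enumerates the grid with `itertools.product` and draws noise from a dedicated, seeded `random.Random`. The function names and the toy loss formula are assumptions for illustration, and no MLflow calls are made here.

```python
import itertools
import random


def simulate_final_loss(lr, bs, rng, epochs=5):
    """Toy stand-in for training: returns a noisy loss after the last epoch."""
    lr_factor = 1.0 + (lr - 1e-4) * 1000
    bs_factor = 1.0 - (bs - 16) * 0.01
    loss = 1.0
    for epoch in range(epochs):
        noise = rng.gauss(0, 0.05)
        loss = max(0.1, (0.7 ** (epoch * lr_factor)) * bs_factor + noise)
    return loss


def run_sweep(learning_rates, batch_sizes, seed=0):
    """Evaluate every (lr, bs) pair with a seeded generator so results repeat."""
    rng = random.Random(seed)
    results = []
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        results.append({
            "run_name": f"lr={lr}_bs={bs}",
            "learning_rate": lr,
            "batch_size": bs,
            "final_loss": simulate_final_loss(lr, bs, rng),
        })
    return results


results = run_sweep([1e-5, 1e-4, 1e-3], [8, 16, 32], seed=42)
best = min(results, key=lambda r: r["final_loss"])
print(len(results), best["run_name"])
```

In a tracked sweep, each dictionary would be logged inside its own `with mlflow.start_run(run_name=...)` block, and logging the seed as a parameter keeps the run repeatable later.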

In [None]:
# Run multiple experiments with different hyperparameters
import random

# Hyperparameter grid
learning_rates = [1e-5, 1e-4, 1e-3]
batch_sizes = [8, 16, 32]

print("🔬 Running hyperparameter sweep...")
print("=" * 50)

for lr in learning_rates:
    for bs in batch_sizes:
        with mlflow.start_run(run_name=f"lr={lr}_bs={bs}"):
            
            # Log parameters
            mlflow.log_params({
                "learning_rate": lr,
                "batch_size": bs,
                "epochs": 5,
                "model": "phi-2"
            })
            
            # Simulate training with some "realistic" patterns
            # Higher LR = faster convergence but more variance
            # Larger batch = more stable but slower convergence
            
            base_loss = 1.0
            lr_factor = 1.0 + (lr - 1e-4) * 1000  # LR affects convergence
            bs_factor = 1.0 - (bs - 16) * 0.01   # BS affects stability
            
            for epoch in range(5):
                noise = random.gauss(0, 0.05)
                loss = base_loss * (0.7 ** (epoch * lr_factor)) * bs_factor + noise
                loss = max(0.1, loss)  # Floor
                
                accuracy = min(0.95, 0.5 + epoch * 0.1 * lr_factor + noise)
                
                mlflow.log_metrics({
                    "loss": loss,
                    "accuracy": accuracy
                }, step=epoch)
            
            # Log final metrics
            mlflow.log_metrics({
                "final_loss": loss,
                "final_accuracy": accuracy
            })
            
            print(f"  lr={lr}, bs={bs} -> loss={loss:.4f}, acc={accuracy:.4f}")

print("\n✅ Hyperparameter sweep complete!")

In [None]:
# Query runs and find the best one
import pandas as pd

# Search for all runs in this experiment
runs_df = mlflow.search_runs(
    experiment_ids=[experiment_id],
    filter_string="",
    order_by=["metrics.final_accuracy DESC"]
)

# Display relevant columns
display_cols = [
    "run_id",
    "params.learning_rate",
    "params.batch_size",
    "metrics.final_loss",
    "metrics.final_accuracy"
]

available_cols = [c for c in display_cols if c in runs_df.columns]

print("\n📊 All Runs (sorted by accuracy):")
print(runs_df[available_cols].head(10).to_string())

In [None]:
# Find the best run (skip runs that never logged final_accuracy)
scored_runs = runs_df.dropna(subset=["metrics.final_accuracy"])
best_run = scored_runs.iloc[0]

print("\n🏆 Best Run:")
print(f"  Run ID: {best_run['run_id'][:8]}...")
print(f"  Learning Rate: {best_run['params.learning_rate']}")
print(f"  Batch Size: {best_run['params.batch_size']}")
print(f"  Final Accuracy: {best_run['metrics.final_accuracy']:.4f}")
print(f"  Final Loss: {best_run['metrics.final_loss']:.4f}")

In [None]:
# Visualize the hyperparameter sweep
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Filter for runs with hyperparameter data
sweep_runs = runs_df[runs_df['params.learning_rate'].notna()].copy()

if len(sweep_runs) > 0:
    # Convert to numeric for plotting
    sweep_runs['lr'] = sweep_runs['params.learning_rate'].astype(float)
    sweep_runs['bs'] = sweep_runs['params.batch_size'].astype(float)
    
    # Plot 1: Learning Rate vs Accuracy
    for bs in sweep_runs['bs'].unique():
        subset = sweep_runs[sweep_runs['bs'] == bs]
        axes[0].scatter(
            subset['lr'], 
            subset['metrics.final_accuracy'],
            label=f'batch_size={int(bs)}',
            s=100
        )
    
    axes[0].set_xscale('log')
    axes[0].set_xlabel('Learning Rate')
    axes[0].set_ylabel('Final Accuracy')
    axes[0].set_title('Learning Rate vs Accuracy')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Plot 2: Batch Size vs Loss
    for lr in sweep_runs['lr'].unique():
        subset = sweep_runs[sweep_runs['lr'] == lr]
        axes[1].scatter(
            subset['bs'],
            subset['metrics.final_loss'],
            label=f'lr={lr}',
            s=100
        )
    
    axes[1].set_xlabel('Batch Size')
    axes[1].set_ylabel('Final Loss')
    axes[1].set_title('Batch Size vs Loss')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("/tmp/hyperparam_sweep.png", dpi=150)
plt.show()

print("\n📊 Saved hyperparameter sweep visualization")

---

## Part 6: Starting the MLflow UI

MLflow includes a beautiful web UI for exploring experiments.
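
If you launch the UI from Python rather than from a terminal, the server process keeps running until something stops it. One possible pattern, sketched here with a hypothetical `managed_process` helper, is a context manager that terminates the process on exit. A placeholder command stands in for `mlflow ui` so the sketch runs without a server; how cleanly a real server's worker processes shut down is not covered by this sketch.

```python
import subprocess
import sys
from contextlib import contextmanager


@contextmanager
def managed_process(cmd, shutdown_timeout=10):
    """Start cmd in the background and make sure it is stopped on exit."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        yield proc
    finally:
        proc.terminate()  # polite stop request
        try:
            proc.wait(timeout=shutdown_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # force stop if the request was ignored
            proc.wait()


# Placeholder command; for the UI you would pass something like
# ["mlflow", "ui", "--backend-store-uri", MLFLOW_DIR, "--port", "5000"]
placeholder_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]

with managed_process(placeholder_cmd) as proc:
    running_inside = proc.poll() is None  # still alive inside the block

stopped_after = proc.poll() is not None  # exit code is set once stopped
print(running_inside, stopped_after)
```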

In [None]:
# Instructions for starting the MLflow UI
# An f-string is used so that both the separator and MLFLOW_DIR are interpolated
print(f"""
🖥️  Starting the MLflow UI
{'=' * 50}

To view your experiments in a web browser, run this in a terminal:

    mlflow ui --backend-store-uri {MLFLOW_DIR} --host 0.0.0.0 --port 5000

Then open: http://localhost:5000

Features of the UI:
- 📊 Compare runs side-by-side
- 📈 View metric charts over time
- 📁 Download artifacts
- 🔍 Filter and search runs
- 📋 Export to CSV

For DGX Spark with Docker, expose port 5000:

    docker run -p 5000:5000 ... 

""")

In [None]:
# Optionally start MLflow server in background (for demo purposes)
# Note: This will run in the background; you'll need to kill it manually

# Uncomment to run:
# import subprocess
# server_process = subprocess.Popen(
#     ["mlflow", "ui", "--backend-store-uri", MLFLOW_DIR, "--host", "0.0.0.0", "--port", "5000"],
#     stdout=subprocess.DEVNULL,
#     stderr=subprocess.DEVNULL
# )
# print(f"MLflow UI started on http://localhost:5000 (PID: {server_process.pid})")

---

## Part 7: Autologging

MLflow can automatically log metrics from popular frameworks!

In [None]:
# Enable autologging for PyTorch
# Note: mlflow.pytorch.autolog() hooks into PyTorch Lightning's Trainer.fit().
# A hand-written torch.nn.Module training loop still needs explicit log_* calls.
mlflow.pytorch.autolog(
    log_models=True,           # Log model artifacts
    log_every_n_epoch=1,       # Log metrics every epoch
    log_every_n_step=None,     # Don't log every step (too much data)
    registered_model_name=None # Don't auto-register
)

print("✅ PyTorch autologging enabled!")
print("""
With autologging, PyTorch Lightning training runs automatically capture:
- Training loss and metrics
- Model architecture
- Optimizer parameters
- Model artifacts

Just call trainer.fit() as usual!
""")

In [None]:
# Example: automatic logging with the Hugging Face Trainer
# The Trainer reports to MLflow through its own callback (report_to="mlflow"),
# so mlflow.pytorch.autolog() is not what captures these runs.

demo_code = '''
from transformers import Trainer, TrainingArguments

# Your normal training code; report_to="mlflow" enables the Trainer's
# built-in MLflow callback, which logs hyperparameters and metrics.
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    report_to="mlflow",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Metrics are logged to MLflow during training
trainer.train()
'''

print("Example autologging code:")
print(demo_code)

---

## ✋ Try It Yourself: Exercise

**Task:** Create a complete experiment tracking workflow.

1. Create a new experiment called "my-first-experiment"
2. Run at least 5 training simulations with different hyperparameters
3. Log: parameters, metrics over time, and at least one artifact
4. Query to find the best run
5. Create a visualization comparing runs

<details>
<summary>💡 Hint</summary>

Use the hyperparameter sweep pattern from Part 5, but:
- Add more hyperparameters (dropout, warmup steps, etc.)
- Log a confusion matrix as an artifact
- Add tags to categorize runs

</details>
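
For the confusion-matrix idea in the hint, one possible starting point is to compute the matrix in plain Python and write it to a CSV file that can then be logged. This is only a sketch: `confusion_matrix`, `write_confusion_csv`, and the toy labels are hypothetical, and a real evaluation would use your model's predictions.

```python
import csv
import tempfile
from pathlib import Path


def confusion_matrix(y_true, y_pred, labels):
    """Count predictions; rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def write_confusion_csv(matrix, labels, path):
    """Write the matrix with a header row and one row per true label."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["true_label", *labels])
        for label, row in zip(labels, matrix):
            writer.writerow([label, *row])


labels = ["cat", "dog"]
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]
matrix = confusion_matrix(y_true, y_pred, labels)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "confusion_matrix.csv"
    write_confusion_csv(matrix, labels, csv_path)
    # Inside an active run: mlflow.log_artifact(str(csv_path), artifact_path="eval")
    csv_text = csv_path.read_text(encoding="utf-8")

print(matrix)
```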

In [None]:
# YOUR CODE HERE

# Step 1: Create experiment

# Step 2: Run training simulations

# Step 3: Query for best run

# Step 4: Visualize results

---

## ⚠️ Common Mistakes

### Mistake 1: Not Ending Runs Properly

In [None]:
# ❌ Wrong: Run never ends if code crashes
# mlflow.start_run()
# ... training code that might crash ...
# mlflow.end_run()  # Never reached!

# ✅ Right: Use context manager
# with mlflow.start_run():
#     ... training code ...
#     # Run automatically ends, even if code crashes

print("Always use 'with mlflow.start_run():' to ensure proper cleanup!")

### Mistake 2: Logging Too Frequently
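
A fixed logging interval can silently drop the last step of training. The sketch below shows one way to throttle logging while still recording the final value; `should_log` is a hypothetical helper, and the step counts are illustrative.

```python
def should_log(step, total_steps, every_n):
    """Log every every_n steps, plus the final step so the last value is kept."""
    return step % every_n == 0 or step == total_steps - 1


total_steps = 10_000
logged_steps = [
    step for step in range(total_steps)
    if should_log(step, total_steps, every_n=1_000)
]
print(len(logged_steps), logged_steps[:3], logged_steps[-1])
```

In a training loop, the same check would guard the call, for example `if should_log(step, total_steps, 1_000): mlflow.log_metric("loss", loss, step=step)`.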

In [None]:
# ❌ Wrong: Logging every step creates huge databases
# for step in range(1000000):
#     mlflow.log_metric("loss", loss, step=step)  # 1M log entries!

# ✅ Right: Log at reasonable intervals
# for step in range(1000000):
#     if step % 1000 == 0:  # Every 1000 steps
#         mlflow.log_metric("loss", loss, step=step)  # Only 1000 entries

print("Log metrics at reasonable intervals (every N steps/epochs).")

### Mistake 3: Not Setting Experiment

In [None]:
# ❌ Wrong: All runs go to "Default" experiment
# with mlflow.start_run():
#     ...  # Which project is this for??

# ✅ Right: Always set experiment first
# mlflow.set_experiment("my-project-name")
# with mlflow.start_run():
#     ...  # Clearly organized!

print("Always call mlflow.set_experiment() before starting runs!")

---

## 🎉 Checkpoint

You've learned:
- ✅ Setting up MLflow tracking
- ✅ Logging parameters, metrics, and artifacts
- ✅ Comparing multiple experiment runs
- ✅ Using the MLflow UI
- ✅ Autologging with PyTorch

---

## 🚀 Challenge (Optional)

**Set up a production-ready MLflow deployment:**

1. Run MLflow server with PostgreSQL backend
2. Use S3/MinIO for artifact storage
3. Set up authentication
4. Create a CI/CD pipeline that logs to MLflow

---

## 📖 Further Reading

- [MLflow Documentation](https://mlflow.org/docs/latest/)
- [MLflow Tracking Guide](https://mlflow.org/docs/latest/tracking.html)
- [MLflow with PyTorch](https://mlflow.org/docs/latest/python_api/mlflow.pytorch.html)
- [Weights & Biases](https://wandb.ai/) (alternative to MLflow)

---

## 🧹 Cleanup

In [None]:
# Clean up
import gc
import torch

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print(f"MLflow data saved to: {MLFLOW_DIR}")
print("To view results, run: mlflow ui --backend-store-uri " + MLFLOW_DIR)

---

## 📝 Summary

In this notebook, we:

1. **Set up** MLflow for local experiment tracking
2. **Created** experiments and logged runs
3. **Logged** parameters, metrics, and artifacts
4. **Ran** a hyperparameter sweep
5. **Compared** runs to find the best configuration
6. **Learned** about autologging and the MLflow UI

**Next up:** In notebook 04, we'll learn about model versioning and registry!