# Task 4.3.2: Weights & Biases Integration

**Module:** 4.3 - MLOps & Experiment Tracking  
**Time:** 2 hours  
**Difficulty:** ⭐⭐ (Beginner-Intermediate)

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Set up Weights & Biases for experiment tracking
- [ ] Create rich training dashboards with custom visualizations
- [ ] Configure and run hyperparameter sweeps
- [ ] Understand when to use W&B vs MLflow
- [ ] Integrate W&B with popular training frameworks

---

## Prerequisites

- Completed: Task 4.3.1 (MLflow Setup)
- Knowledge of: Python, basic ML training
- A free W&B account (we'll create one if needed)

---

## Real-World Context

Many AI research labs and ML teams rely on Weights & Biases for experiment tracking. Why?

1. **Beautiful dashboards** - Instantly visualize training curves, compare runs
2. **Hyperparameter sweeps** - Automated search with Bayesian optimization
3. **Team collaboration** - Share experiments with one link
4. **Integrations** - Works seamlessly with HuggingFace, PyTorch Lightning, etc.

Training efforts at the scale of Gemini or GPT-4 involve far too many experiments to manage by hand, which is why teams lean on tracking tools like W&B.

---

## ELI5: What is Weights & Biases?

> **Imagine you're training for a marathon.**
>
> You could scribble your times in a notebook... or you could use a fitness app that:
> - Automatically tracks your runs with GPS
> - Shows you beautiful charts of your progress
> - Compares your training to other runners
> - Suggests better training schedules
>
> Weights & Biases is that fitness app, but for training AI models. It automatically records everything, shows you pretty graphs, and helps you find the best "training schedule" (hyperparameters) for your model.
>
> **In AI terms:** W&B is a cloud-based experiment tracking platform that logs metrics, visualizes training progress, and helps optimize hyperparameters.

---

## MLflow vs W&B: When to Use Each

| Feature | MLflow | W&B |
|---------|--------|-----|
| **Hosting** | Self-hosted or Databricks | Cloud (free tier) |
| **Visualization** | Basic UI | Rich, interactive dashboards |
| **Sweeps** | Manual setup | Built-in with Bayesian optimization |
| **Collaboration** | Requires server setup | Instant link sharing |
| **Privacy** | Full control | Cloud-based (enterprise for air-gapped) |
| **Best for** | Production pipelines | Research & development |

**Recommendation:** Use W&B during development/research, MLflow for production deployment.

---

## Part 1: Setting Up W&B

First, let's install and configure Weights & Biases.

In [None]:
# Install W&B
!pip install wandb -q

import wandb
print(f"W&B version: {wandb.__version__}")

In [None]:
# Login to W&B
# Option 1: Interactive login (opens browser)
# wandb.login()

# Option 2: Use API key directly (get from https://wandb.ai/authorize)
# wandb.login(key="your-api-key")

# Option 3: For this tutorial, we'll use offline mode (no account needed)
import os
os.environ["WANDB_MODE"] = "offline"  # Remove this line to enable cloud sync

print("W&B configured in offline mode for this tutorial.")
print("To enable cloud sync, run: wandb.login()")

In [None]:
# Setup our training environment
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import numpy as np

print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

---

## Part 2: Your First W&B Experiment

Let's create a training run with rich logging.

In [None]:
# Create synthetic dataset (same as MLflow notebook)
torch.manual_seed(42)
np.random.seed(42)

n_samples = 1000
n_features = 20

X = torch.randn(n_samples, n_features)
true_weights = torch.randn(n_features)
y = (X @ true_weights > 0).float()

train_X, val_X = X[:800], X[800:]
train_y, val_y = y[:800], y[800:]

print(f"Dataset ready: {n_samples} samples, {n_features} features")

In [None]:
# Define our model
class SimpleClassifier(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, dropout: float = 0.1):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.network(x).squeeze(-1)

In [None]:
def train_with_wandb(
    hidden_dim: int = 64,
    learning_rate: float = 0.01,
    dropout: float = 0.1,
    epochs: int = 30,
    batch_size: int = 32,
    optimizer_type: str = "adam",
    project_name: str = "dgx-spark-classification"
):
    """
    Train a model with full W&B tracking.
    """
    # Initialize a new W&B run
    run = wandb.init(
        project=project_name,
        name=f"{optimizer_type}-h{hidden_dim}-lr{learning_rate}",
        config={
            "hidden_dim": hidden_dim,
            "learning_rate": learning_rate,
            "dropout": dropout,
            "epochs": epochs,
            "batch_size": batch_size,
            "optimizer": optimizer_type,
            "architecture": "SimpleClassifier",
            "dataset_size": n_samples,
            "features": n_features,
            "hardware": "DGX Spark (128GB)"
        },
        tags=["tutorial", "classification", optimizer_type],
        reinit=True  # Allow multiple runs in same script
    )
    
    # Access config through wandb.config for sweep compatibility
    config = wandb.config
    
    # Create model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = SimpleClassifier(n_features, config.hidden_dim, config.dropout).to(device)
    
    # Watch model to log gradients and parameters
    wandb.watch(model, log="all", log_freq=10)
    
    # Setup optimizer
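    if config.optimizer == "adam":
        optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
    elif config.optimizer == "sgd":
        optimizer = optim.SGD(model.parameters(), lr=config.learning_rate, momentum=0.9)
    elif config.optimizer == "adamw":
        optimizer = optim.AdamW(model.parameters(), lr=config.learning_rate)
    else:
        # Fail early instead of hitting a NameError on `optimizer` later
        wandb.finish(exit_code=1)
        raise ValueError(f"Unknown optimizer: {config.optimizer!r}")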
    
    criterion = nn.BCELoss()
    
    # Data loaders
    train_dataset = TensorDataset(train_X.to(device), train_y.to(device))
    train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
    
    # Training loop
    best_val_acc = 0.0
    
    for epoch in range(config.epochs):
        model.train()
        epoch_loss = 0.0
        correct = 0
        total = 0
        
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
            preds = (outputs > 0.5).float()
            correct += (preds == batch_y).sum().item()
            total += batch_y.size(0)
        
        train_acc = correct / total
        avg_loss = epoch_loss / len(train_loader)
        
        # Validation
        model.eval()
        with torch.no_grad():
            val_outputs = model(val_X.to(device))
            val_preds = (val_outputs > 0.5).float()
            val_acc = (val_preds == val_y.to(device)).float().mean().item()
            val_loss = criterion(val_outputs, val_y.to(device)).item()
        
        # Log metrics to W&B
        wandb.log({
            "epoch": epoch,
            "train/loss": avg_loss,
            "train/accuracy": train_acc,
            "val/loss": val_loss,
            "val/accuracy": val_acc,
            "learning_rate": optimizer.param_groups[0]['lr']
        })
        
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            # Save best model checkpoint
            torch.save(model.state_dict(), "best_model.pt")
            wandb.save("best_model.pt")
        
        if epoch % 10 == 0:
            print(f"Epoch {epoch:2d}: train_loss={avg_loss:.4f}, train_acc={train_acc:.4f}, val_acc={val_acc:.4f}")
    
    # Log final summary metrics
    wandb.run.summary["best_val_accuracy"] = best_val_acc
    wandb.run.summary["final_train_loss"] = avg_loss
    
    # Log confusion matrix
    model.eval()
    with torch.no_grad():
        val_outputs = model(val_X.to(device))
        # Integer class indices are the safest input for the confusion matrix
        val_preds = (val_outputs > 0.5).long().cpu().numpy()
        val_labels = val_y.long().numpy()
    
    wandb.log({
        "confusion_matrix": wandb.plot.confusion_matrix(
            probs=None,
            y_true=val_labels,
            preds=val_preds,
            class_names=["Negative", "Positive"]
        )
    })
    
    print(f"\nTraining complete! Best val accuracy: {best_val_acc:.4f}")
    
    wandb.finish()
    return model, best_val_acc

In [None]:
# Run our first W&B experiment
model, acc = train_with_wandb(
    hidden_dim=64,
    learning_rate=0.01,
    epochs=30,
    optimizer_type="adam"
)

### What Just Happened?

W&B automatically:
1. Created a new project called "dgx-spark-classification"
2. Logged all configuration parameters
3. Tracked loss and accuracy curves at each epoch
4. Recorded model gradients and parameters (via `wandb.watch`)
5. Saved your best model checkpoint
6. Generated a confusion matrix visualization

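In online mode, you'd get a link to view all this in a dashboard. In offline mode, the run is stored locally under `wandb/` and can be uploaded later with the `wandb sync` command.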
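
To see what the confusion matrix from step 6 contains, here is a minimal pure-Python sketch. It thresholds probabilities at 0.5, as the training loop does, and counts each (true, predicted) pair. The helper name `confusion_counts` and the sample values are made up for illustration; the chart W&B renders is based on the same kind of per-class counts.

```python
from collections import Counter

def confusion_counts(y_true, probs, threshold=0.5):
    """Count (true_label, predicted_label) pairs for a binary classifier."""
    preds = [1 if p > threshold else 0 for p in probs]
    return Counter(zip(y_true, preds))

# Hypothetical labels and sigmoid outputs, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

counts = confusion_counts(y_true, probs)
class_names = ["Negative", "Positive"]
for t in (0, 1):
    for p in (0, 1):
        print(f"true={class_names[t]:8s} pred={class_names[p]:8s} count={counts[(t, p)]}")
```

The off-diagonal cells (true and predicted labels differ) are the errors, so a good run concentrates its counts on the diagonal.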

---

## Part 3: Comparing Multiple Runs

Let's run a few more experiments to compare.

In [None]:
# Run multiple experiments
experiments = [
    {"hidden_dim": 32, "learning_rate": 0.001, "optimizer_type": "adam"},
    {"hidden_dim": 128, "learning_rate": 0.01, "optimizer_type": "sgd"},
    {"hidden_dim": 64, "learning_rate": 0.005, "optimizer_type": "adamw"},
]

results = []
for config in experiments:
    print(f"\n{'='*60}")
    print(f"Running: {config}")
    print(f"{'='*60}")
    _, acc = train_with_wandb(**config, epochs=30)
    results.append((config, acc))

# Summary
print("\n" + "="*60)
print("EXPERIMENT SUMMARY")
print("="*60)
for config, acc in sorted(results, key=lambda x: -x[1]):
    print(f"Val Acc: {acc:.4f} | optimizer={config['optimizer_type']}, "
          f"hidden={config['hidden_dim']}, lr={config['learning_rate']}")

---

## Part 4: Hyperparameter Sweeps

One of W&B's killer features is automated hyperparameter search!

### ELI5: What is a Hyperparameter Sweep?

> **Imagine you're making the perfect pizza.**
>
> You could try:
> - Every possible oven temperature (300°, 325°, 350°...)
> - Every baking time (10 min, 12 min, 15 min...)
> - Every cheese amount (1 cup, 1.5 cups, 2 cups...)
>
> That's called a **grid search** - trying every combination. But that's 100+ pizzas!
>
> A smart approach: Try a few pizzas, see which direction makes them better, and focus your testing there. That's **Bayesian optimization** - W&B learns from each experiment to suggest better ones.
>
> **In AI terms:** A sweep automatically tries different hyperparameter combinations and uses the results to intelligently explore the search space.
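
To make the pizza analogy concrete, the sketch below counts the combinations in a small grid and then draws a handful of random trials, sampling the learning rate uniformly in log space. The search space is hypothetical and chosen only for illustration. A Bayesian sweep would replace the random draw with a choice informed by earlier results, which this sketch does not attempt.

```python
import itertools
import math
import random

# Hypothetical search space, chosen only for illustration
grid = {
    "hidden_dim": [32, 64, 128, 256],
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "optimizer": ["adam", "sgd", "adamw"],
}

# Grid search: one trial per combination of values
grid_trials = list(itertools.product(*grid.values()))
print(f"Grid search needs {len(grid_trials)} trials")

def sample_trial(rng):
    """Draw one random configuration; the learning rate is log-uniform."""
    log_lr = rng.uniform(math.log10(0.0001), math.log10(0.1))
    return {
        "hidden_dim": rng.choice(grid["hidden_dim"]),
        "learning_rate": 10 ** log_lr,
        "dropout": rng.uniform(0.0, 0.5),
        "optimizer": rng.choice(grid["optimizer"]),
    }

rng = random.Random(0)  # Seeded so the draws are repeatable
random_trials = [sample_trial(rng) for _ in range(5)]
for trial in random_trials:
    print(trial)
```

Sampling the exponent rather than the value gives each order of magnitude (0.0001 to 0.001, 0.001 to 0.01, and so on) the same chance of being picked, which is usually what you want for learning rates.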

In [None]:
# Define a sweep configuration
sweep_config = {
    "method": "bayes",  # Options: grid, random, bayes
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "hidden_dim": {
            "values": [32, 64, 128, 256]
        },
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 0.0001,
            "max": 0.1
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.0,
            "max": 0.5
        },
        "optimizer": {
            "values": ["adam", "sgd", "adamw"]
        },
        "batch_size": {
            "values": [16, 32, 64]
        }
    },
    "early_terminate": {
        "type": "hyperband",
        "min_iter": 5,
        "s": 2
    }
}

print("Sweep Configuration:")
print(f"  Method: {sweep_config['method']} (learns from results)")
print(f"  Optimizing: {sweep_config['metric']['name']}")
print(f"  Parameters to search:")
for param, config in sweep_config['parameters'].items():
    print(f"    - {param}: {config}")

In [None]:
def sweep_train():
    """
    Training function for W&B sweep.
    Config is automatically injected by wandb.agent
    """
    run = wandb.init()
    config = wandb.config
    
    # Create model with sweep config
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = SimpleClassifier(
        n_features, 
        config.hidden_dim, 
        config.dropout
    ).to(device)
    
    # Setup optimizer based on config
    if config.optimizer == "adam":
        optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)
    elif config.optimizer == "sgd":
        optimizer = optim.SGD(model.parameters(), lr=config.learning_rate, momentum=0.9)
    else:
        optimizer = optim.AdamW(model.parameters(), lr=config.learning_rate)
    
    criterion = nn.BCELoss()
    train_dataset = TensorDataset(train_X.to(device), train_y.to(device))
    train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
    
    # Training
    for epoch in range(20):
        model.train()
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
        
        # Validation
        model.eval()
        with torch.no_grad():
            val_outputs = model(val_X.to(device))
            val_preds = (val_outputs > 0.5).float()
            val_acc = (val_preds == val_y.to(device)).float().mean().item()
            val_loss = criterion(val_outputs, val_y.to(device)).item()
        
        wandb.log({"val/accuracy": val_acc, "val/loss": val_loss})
    
    wandb.finish()

In [None]:
# Create and run sweep (limited runs for demo)
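# Note: in online mode the W&B server only coordinates the sweep by choosing
# the next hyperparameters; each training run still executes on your machine.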
print("Creating sweep...")
print("(In offline mode, sweeps have limited functionality)")
print("\nTo run a full sweep, enable online mode and run:")
print("")
print("  sweep_id = wandb.sweep(sweep_config, project='dgx-spark-sweep')")
print("  wandb.agent(sweep_id, sweep_train, count=20)")
print("")
print("This would run 20 experiments with Bayesian-optimized hyperparameters!")

In [None]:
# Demonstrate sweep with a simple manual version
import random

print("Running mini-sweep demonstration (5 random configurations)...\n")

sweep_results = []
for i in range(5):
    # Randomly sample hyperparameters
    config = {
        "hidden_dim": random.choice([32, 64, 128]),
        "learning_rate": 10 ** random.uniform(-4, -1),
        "dropout": random.uniform(0, 0.3),
        "optimizer_type": random.choice(["adam", "sgd", "adamw"])
    }
    
    print(f"Trial {i+1}: hidden={config['hidden_dim']}, lr={config['learning_rate']:.4f}, "
          f"dropout={config['dropout']:.2f}, opt={config['optimizer_type']}")
    
    _, acc = train_with_wandb(**config, epochs=15)
    sweep_results.append((config, acc))

# Find best
best_config, best_acc = max(sweep_results, key=lambda x: x[1])
print(f"\nBest configuration found:")
print(f"  Accuracy: {best_acc:.4f}")
print(f"  Config: {best_config}")

---

## Part 5: W&B with HuggingFace Transformers

W&B integrates seamlessly with HuggingFace Trainer!
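
One way to configure the integration without touching training code is through environment variables that the Hugging Face W&B callback reads. The sketch below sets three of them. The project name is a placeholder, and the variable names and accepted values are worth checking against the `transformers` version you have installed.

```python
import os

# Environment variables read by the Hugging Face W&B callback.
# Set them before the Trainer is created.
os.environ["WANDB_PROJECT"] = "llm-finetuning"  # project name (placeholder)
os.environ["WANDB_LOG_MODEL"] = "end"           # upload the final model as an artifact
os.environ["WANDB_WATCH"] = "false"             # skip gradient/parameter logging
```

Environment variables are handy on shared clusters, where the same training script runs under different projects without code changes.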

In [None]:
# Example: HuggingFace Trainer integration (code pattern, not executed)
huggingface_example = '''
from transformers import Trainer, TrainingArguments
import wandb

# Setting report_to="wandb" turns on W&B logging explicitly.
# (The default for report_to has differed between transformers versions.)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    
    # W&B Integration - that's it!
    report_to="wandb",
    run_name="llama-finetune-v1",
)

# Optionally initialize with more config
wandb.init(
    project="llm-finetuning",
    config={
        "model": "meta-llama/Llama-3.1-8B",
        "dataset": "custom-instructions",
        "lora_r": 16,
    }
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# All metrics automatically logged to W&B!
trainer.train()
'''

print("HuggingFace Trainer + W&B Integration:")
print("="*50)
print(huggingface_example)

---

## Part 6: Advanced W&B Features

### Custom Visualizations and Alerts

In [None]:
# Demonstrate logging different types of data
run = wandb.init(
    project="dgx-spark-demo",
    name="advanced-logging-demo",
    reinit=True
)

# Log images
import matplotlib.pyplot as plt

# Create a sample plot
fig, ax = plt.subplots(figsize=(8, 6))
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), label='sin(x)')
ax.plot(x, np.cos(x), label='cos(x)')
ax.legend()
ax.set_title('Sample Visualization')

# Log the figure
wandb.log({"sample_plot": wandb.Image(fig)})
plt.close(fig)  # Release the figure once it has been logged

# Log histograms
data = np.random.normal(0, 1, 1000)
wandb.log({"weight_distribution": wandb.Histogram(data)})

# Log tables
table = wandb.Table(
    columns=["model", "accuracy", "loss", "params"],
    data=[
        ["small", 0.85, 0.42, "1M"],
        ["medium", 0.91, 0.28, "10M"],
        ["large", 0.94, 0.18, "100M"],
    ]
)
wandb.log({"model_comparison": table})

# Log HTML
html_content = """
<h2>Training Summary</h2>
<ul>
  <li>Best Accuracy: 94%</li>
  <li>Training Time: 2 hours</li>
  <li>GPU: DGX Spark (128GB)</li>
</ul>
"""
wandb.log({"summary_html": wandb.Html(html_content)})

print("Logged various data types:")
print("  - Matplotlib figure")
print("  - Histogram")
print("  - Table")
print("  - HTML")

wandb.finish()

In [None]:
# Alerts - notify when something important happens
alert_example = '''
# In online mode, you can set up alerts:

# Alert when validation accuracy drops
if val_acc < 0.8:
    wandb.alert(
        title="Low Accuracy Warning",
        text=f"Validation accuracy dropped to {val_acc:.2f}",
        level=wandb.AlertLevel.WARN
    )

# Alert on training completion
wandb.alert(
    title="Training Complete",
    text=f"Model achieved {best_acc:.2f} accuracy",
    level=wandb.AlertLevel.INFO
)

# Alert on crash/error
try:
    train_model()
except Exception as e:
    wandb.alert(
        title="Training Failed",
        text=str(e),
        level=wandb.AlertLevel.ERROR
    )
    raise
'''

print("W&B Alerts (online mode only):")
print(alert_example)

---

## Try It Yourself

Create a training run that:
1. Uses a different model architecture (add more layers)
2. Logs training curves and a confusion matrix
3. Compares at least 3 different learning rate schedules

<details>
<summary>Hint</summary>

Try using `torch.optim.lr_scheduler` with different schedulers like:
- `StepLR` - decay LR by factor every N epochs
- `CosineAnnealingLR` - cosine decay
- `OneCycleLR` - learning rate warmup + decay

Log the learning rate at each step with `wandb.log({"lr": scheduler.get_last_lr()[0]})`
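
To predict what the logged `lr` curve should look like, the two simplest schedules can be written in closed form. This is a sketch of the formulas behind `StepLR` and `CosineAnnealingLR`, not a replacement for the PyTorch schedulers; the function names `step_lr` and `cosine_lr` are made up for illustration.

```python
import math

def step_lr(base_lr, epoch, step_size=10, gamma=0.5):
    """StepLR-style decay: multiply by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def cosine_lr(base_lr, epoch, t_max=30, eta_min=0.0):
    """Cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

base_lr = 0.01
for epoch in (0, 10, 20, 29):
    print(f"epoch {epoch:2d}: step={step_lr(base_lr, epoch):.5f}, "
          f"cosine={cosine_lr(base_lr, epoch):.5f}")
```

If the curve in your dashboard does not match these shapes, the scheduler is probably being stepped at the wrong point in the loop.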

</details>

In [None]:
# YOUR CODE HERE
# Experiment with learning rate schedules and W&B logging

# Starter code:
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR

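# Note: StepLR and CosineAnnealingLR are stepped once per epoch, while
# OneCycleLR must be stepped once per batch (800 samples / batch size 32 = 25 steps).
schedulers_to_try = {
    "step": lambda opt: StepLR(opt, step_size=10, gamma=0.5),
    "cosine": lambda opt: CosineAnnealingLR(opt, T_max=30),
    "onecycle": lambda opt: OneCycleLR(opt, max_lr=0.1, epochs=30, steps_per_epoch=25)
}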

# Your experiment code here...

---

## Common Mistakes

### Mistake 1: Forgetting to Finish Runs

```python
# Wrong - run stays open
wandb.init(project="my-project")
wandb.log({"loss": 0.5})
# In a notebook the kernel keeps running, so this run is never closed!

# Right - always finish
run = wandb.init(project="my-project")
wandb.log({"loss": 0.5})
wandb.finish()  # or wrap the run in `with wandb.init(...) as run:`
```
**Why:** A run that is never finished stays marked as running and can interfere with the next `wandb.init()` call in the same process.

### Mistake 2: Not Using Config for Sweeps

```python
# Wrong - hardcoded values don't work with sweeps
def train():
    lr = 0.01  # Ignores sweep config!
    model = create_model(lr=lr)

# Right - use wandb.config
def train():
    wandb.init()  # The sweep agent supplies the hyperparameters for this run
    config = wandb.config
    model = create_model(lr=config.learning_rate)
```
**Why:** Sweeps inject hyperparameters through `wandb.config`.

### Mistake 3: Logging Too Frequently

```python
# Wrong - logging every step creates huge overhead
for batch in dataloader:
    loss = train_step(batch)
    wandb.log({"batch_loss": loss})  # Every batch!

# Right - log less frequently
for i, batch in enumerate(dataloader):
    loss = train_step(batch)
    if i % 100 == 0:  # Every 100 batches
        wandb.log({"batch_loss": loss})
```
**Why:** Excessive logging slows down training and bloats storage.
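
Logging only every 100th batch throws away the other 99 values. A small helper can average over the window instead, so each logged point summarizes all batches since the last one. The class name `ThrottledLogger` is hypothetical, and a plain list stands in for `wandb.log` so the sketch runs without an active run.

```python
class ThrottledLogger:
    """Average a single metric over `every` steps and emit one value per window."""

    def __init__(self, log_fn, every=100):
        self.log_fn = log_fn  # e.g. wandb.log in a real run
        self.every = every
        self._total = 0.0
        self._count = 0

    def update(self, name, value, step):
        self._total += value
        self._count += 1
        if self._count == self.every:
            self.log_fn({name: self._total / self._count, "step": step})
            self._total, self._count = 0.0, 0

# Stand-in for wandb.log so the sketch runs without a W&B run
logged = []
logger = ThrottledLogger(logged.append, every=100)

for step in range(1, 251):
    fake_loss = 1.0 / step  # placeholder for a real batch loss
    logger.update("train/batch_loss", fake_loss, step)

print(f"{len(logged)} log calls for 250 batches")
```

The last 50 batches are not flushed here; a real loop would decide whether to emit a final partial window at the end of the epoch.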

---

## Checkpoint

You've learned:
- How to set up W&B for experiment tracking
- How to log metrics, plots, and tables
- How to configure hyperparameter sweeps
- How to integrate W&B with HuggingFace Transformers
- When to use W&B vs MLflow

---

## Challenge (Optional)

Set up a complete hyperparameter sweep that:
1. Searches over model size, learning rate, optimizer, and batch size
2. Uses Bayesian optimization
3. Has early stopping for poor runs
4. Creates a beautiful comparison dashboard
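
For requirement 3, it helps to know the idea behind Hyperband-style early termination. The sketch below shows a simplified successive-halving rule: at a checkpoint, only the best fraction of runs continues. It is not W&B's implementation, and the function name, run names, and scores are made up for illustration.

```python
def successive_halving(scores_by_run, eta=3):
    """Keep the top 1/eta of runs (at least one), ranked by validation score."""
    keep = max(1, len(scores_by_run) // eta)
    ranked = sorted(scores_by_run, key=scores_by_run.get, reverse=True)
    return ranked[:keep]

# Hypothetical validation accuracies after 5 epochs
rung_scores = {"run-a": 0.71, "run-b": 0.88, "run-c": 0.64,
               "run-d": 0.90, "run-e": 0.79, "run-f": 0.55}

survivors = successive_halving(rung_scores, eta=3)
print(survivors)
```

A larger `eta` prunes more aggressively, which saves compute but risks stopping runs that would have improved later.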

---

## Further Reading

- [W&B Documentation](https://docs.wandb.ai/)
- [W&B Sweeps Guide](https://docs.wandb.ai/guides/sweeps)
- [HuggingFace + W&B Integration](https://docs.wandb.ai/guides/integrations/huggingface)
- [W&B Reports](https://docs.wandb.ai/guides/reports) - Create shareable experiment summaries

---

## Cleanup

In [None]:
# Clear GPU memory
import torch
import gc

gc.collect()
torch.cuda.empty_cache()

# Clean up W&B offline files (optional)
# !rm -rf wandb/

print("Cleanup complete!")

---

## Next Steps

Now that you can track experiments, let's learn how to **evaluate** your models properly! The next notebook covers running standard LLM benchmarks to compare models objectively.

**Continue to:** [03-benchmark-suite.ipynb](03-benchmark-suite.ipynb)