
# **CHAPTER 21: MODEL TRAINING & EXPERIMENTATION**

*Scaling Training from Laptops to Clusters*

## **Chapter Overview**

Moving from experimental notebooks to production training requires rigorous experiment tracking, systematic hyperparameter optimization, and distributed computing strategies. This chapter covers the tools and techniques to train models reliably at scale, ensuring reproducibility and efficient resource utilization across GPUs and TPUs.

**Estimated Time:** 40-50 hours (3-4 weeks)  
**Prerequisites:** Chapter 11 (Deep Learning Frameworks), Chapter 20 (Data Engineering), access to GPU resources (cloud or local)

---

## **21.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Implement comprehensive experiment tracking (metrics, artifacts, lineage) using MLflow and Weights & Biases
2. Design distributed training strategies (data parallelism, model parallelism, pipeline parallelism) for large models
3. Execute efficient hyperparameter searches using Bayesian optimization and population-based training
4. Optimize training throughput with mixed precision, gradient accumulation, and efficient data loading
5. Ensure training reproducibility through deterministic operations and environment management
6. Implement fault-tolerant training with checkpointing and automatic recovery

---

## **21.1 Experiment Tracking**

#### **21.1.1 MLflow Architecture**

MLflow is an open-source platform for the ML lifecycle with four components:
- **Tracking:** Record parameters, metrics, artifacts
- **Projects:** Package code for reproducibility
- **Models:** Manage deployment artifacts
- **Registry:** Version and stage models

```python
# training_with_mlflow.py
import mlflow
import mlflow.pytorch

# Assumes model, train_loader, val_loader, train_epoch, and validate
# are defined elsewhere in the project.

# Set experiment
mlflow.set_tracking_uri("http://mlflow-server:5000")
mlflow.set_experiment("fraud-detection-cnn")

params = {
    "learning_rate": 0.001,
    "batch_size": 256,
    "epochs": 100,
    "optimizer": "AdamW",
    "model_architecture": "ResNet50",
}

with mlflow.start_run(run_name="experiment-1-baseline"):
    # Log parameters
    mlflow.log_params(params)

    # Training loop
    for epoch in range(params["epochs"]):
        train_loss = train_epoch(model, train_loader)
        val_loss, val_acc = validate(model, val_loader)
        
        # Log metrics
        mlflow.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_loss,
            "val_accuracy": val_acc
        }, step=epoch)
        
        # Log model checkpoint every 10 epochs
        if epoch % 10 == 0:
            mlflow.pytorch.log_model(model, f"checkpoints/epoch-{epoch}")
    
    # Log final artifact (model)
    mlflow.pytorch.log_model(model, "model", 
        registered_model_name="fraud-detection-model")
    
    # Log supplementary artifacts
    mlflow.log_artifact("confusion_matrix.png", "visualizations")
    mlflow.log_artifact("training_config.yaml", "config")
```

#### **21.1.2 Weights & Biases (W&B) for Deep Learning**

Weights & Biases is specialized for deep learning, with rich visualization and collaboration features.

```python
# wandb_training.py
import wandb

# Initialize run
wandb.init(
    project="nlp-sentiment-analysis",
    name="bert-fine-tuning-run-1",
    config={
        "model": "bert-base-uncased",
        "learning_rate": 2e-5,
        "batch_size": 16,
        "epochs": 3,
        "weight_decay": 0.01
    }
)

config = wandb.config

# Automatic gradient logging
wandb.watch(model, log="all", log_freq=100)

for epoch in range(config.epochs):
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()  # Clear gradients left over from the previous step
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        
        # Log every 50 steps
        if batch_idx % 50 == 0:
            wandb.log({
                "loss": loss.item(),
                "learning_rate": optimizer.param_groups[0]['lr'],
                "epoch": epoch
            })
    
    # Validation
    val_metrics = evaluate(model, val_loader)
    wandb.log({f"val_{k}": v for k, v in val_metrics.items()})

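# Save model with versioning
# Assumes model.pth was written earlier, e.g. torch.save(model.state_dict(), 'model.pth')
artifact = wandb.Artifact('model', type='model')
artifact.add_file('model.pth')
wandb.log_artifact(artifact)
wandb.finish()  # Flush pending data and close the run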
```

**Advanced Features:**
- **Sweeps:** Integrated hyperparameter optimization
- **Tables:** Visualize dataset samples and model predictions
- **Reports:** Shareable dashboards with embedded visualizations
- **Artifacts:** Dataset and model lineage tracking

---

## **21.2 Hyperparameter Optimization (HPO)**

#### **21.2.1 Bayesian Optimization with Optuna**

Optuna searches efficiently by fitting a probabilistic model to past trial results and using it to suggest the next hyperparameters to try.

```python
# optuna_hpo.py
import optuna
import torch

# create_model, make_loaders, train_epoch, and validate are project-specific
# helpers assumed to exist; make_loaders is a hypothetical name.

def objective(trial):
    # Suggest hyperparameters
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    optimizer_name = trial.suggest_categorical("optimizer", ["Adam", "SGD", "AdamW"])
    n_layers = trial.suggest_int("n_layers", 1, 3)

    # Build model and data loaders from the suggested values
    model = create_model(n_layers=n_layers, dropout=dropout)
    optimizer = getattr(torch.optim, optimizer_name)(model.parameters(), lr=lr)
    train_loader, val_loader = make_loaders(batch_size=batch_size)

    # Train, reporting validation loss each epoch so the pruner can stop early
    for epoch in range(100):
        train_loss = train_epoch(model, train_loader, optimizer)
        val_loss = validate(model, val_loader)
        
        # Report intermediate result for pruning
        trial.report(val_loss, epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()
    
    return val_loss

# Create study with TPE sampler (Tree-structured Parzen Estimator)
study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),
    pruner=optuna.pruners.MedianPruner()  # Prune unpromising trials early
)

study.optimize(objective, n_trials=100, n_jobs=4)

print(f"Best value: {study.best_value} (params: {study.best_params})")
```
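The pruning decision made by `MedianPruner` can be illustrated with a simplified rule: stop a trial whose intermediate value is worse than the median of what earlier trials reported at the same step. The sketch below assumes a minimization objective; the function name, its arguments, and the sample values are hypothetical, and Optuna's own implementation has further options (for example warm-up steps) that are not modeled here.

```python
from statistics import median


def should_prune(current_value, previous_values_at_step, n_startup_trials=5):
    """Simplified median rule for a minimization objective.

    previous_values_at_step holds the intermediate values that earlier
    trials reported at the same step (a hypothetical input for this sketch).
    """
    # Too little history: let the trial continue.
    if len(previous_values_at_step) < n_startup_trials:
        return False
    # Prune when the current trial is worse (higher loss) than the median.
    return current_value > median(previous_values_at_step)


# Made-up validation losses of five earlier trials at the same epoch.
history = [0.90, 0.72, 0.65, 0.80, 0.70]
print(should_prune(0.85, history))  # True: worse than the median of 0.72
print(should_prune(0.60, history))  # False: better than the median
```

The startup threshold keeps the first few trials from being pruned before there is enough history to make the median meaningful.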

**Multi-Objective Optimization:**
```python
# Optimize for both accuracy and inference speed
study = optuna.create_study(directions=["maximize", "minimize"])  # Accuracy, Latency

def objective(trial):
    accuracy = train_and_evaluate(trial)
    latency = measure_inference_time(trial)
    return accuracy, latency

study.optimize(objective, n_trials=50)
```

#### **21.2.2 Population Based Training (PBT)**

PBT is an evolutionary strategy in which poorly performing trials copy (exploit) the weights and hyperparameters of strong trials, then perturb (explore) those hyperparameters.

```python
# ray_tune_pbt.py
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt_scheduler = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=10,
    hyperparam_mutations={
        "lr": tune.loguniform(1e-5, 1e-1),
        "batch_size": [32, 64, 128]
    },
    quantile_fraction=0.25  # Bottom 25% exploit top 25%
)

analysis = tune.run(
    train_fn,
    config={
        "lr": tune.loguniform(1e-5, 1e-1),
        "batch_size": 64
    },
    num_samples=20,
    resources_per_trial={"gpu": 1, "cpu": 4},
    scheduler=pbt_scheduler,
    metric="val_accuracy",
    mode="max"
)
```
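To make the exploit/explore cycle concrete, here is a minimal pure-Python sketch of a single PBT step over an in-memory population. It is not Ray Tune's implementation: the dictionary keys, the `pbt_step` name, and the 0.8/1.2 perturbation factors are assumptions chosen for illustration.

```python
import copy
import random


def pbt_step(population, quantile_fraction=0.25, factors=(0.8, 1.2), rng=None):
    """One exploit/explore step. Higher 'score' is better.

    Each member is a dict with hypothetical keys: 'score', 'lr', 'weights'.
    """
    rng = rng or random.Random(0)
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    k = max(1, int(len(ranked) * quantile_fraction))
    top, bottom = ranked[:k], ranked[-k:]
    for member in bottom:
        donor = rng.choice(top)
        # Exploit: copy weights from a top performer.
        member["weights"] = copy.deepcopy(donor["weights"])
        # Explore: perturb the donor's learning rate.
        member["lr"] = donor["lr"] * rng.choice(factors)
    return population


population = [
    {"score": s, "lr": lr, "weights": [s]}
    for s, lr in [(0.91, 1e-3), (0.85, 3e-3), (0.60, 1e-2), (0.40, 1e-1)]
]
pbt_step(population)
print(population[-1])  # the weakest member now carries the best member's weights
```

The member's score stays stale until it trains again, which is why a scheduler waits for the next perturbation interval before ranking the population a second time.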

#### **21.2.3 Hyperparameter Search at Scale**

**ASHA Scheduling:** The Asynchronous Successive Halving Algorithm, a Hyperband-style scheduler, stops trials early based on intermediate performance.

```python
from ray.tune.schedulers import ASHAScheduler

scheduler = ASHAScheduler(
    max_t=100,  # Max epochs
    grace_period=10,  # Minimum epochs before pruning
    reduction_factor=2
)
```
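The effect of `grace_period` and `reduction_factor` is easier to see with a small calculation. The sketch below shows a synchronous simplification of successive halving, assuming comparison points spaced at `grace_period * reduction_factor**k`; ASHA itself makes promotion decisions asynchronously, and the helper names and scores here are hypothetical.

```python
def rung_milestones(max_t, grace_period, reduction_factor):
    """Epochs at which trials are compared: grace_period * reduction_factor**k."""
    milestones, t = [], grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def successive_halving(scores_by_trial, reduction_factor):
    """Keep the top 1/reduction_factor of trials (higher score is better)."""
    keep = max(1, len(scores_by_trial) // reduction_factor)
    ranked = sorted(scores_by_trial, key=scores_by_trial.get, reverse=True)
    return ranked[:keep]


print(rung_milestones(max_t=100, grace_period=10, reduction_factor=2))
# [10, 20, 40, 80]

scores = {"a": 0.71, "b": 0.64, "c": 0.80, "d": 0.55}  # made-up scores at one rung
print(successive_halving(scores, reduction_factor=2))
# ['c', 'a']
```

With these settings, half of the surviving trials are dropped at each milestone, so most of the compute goes to configurations that have already shown promise.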

---

## **21.3 Distributed Training**

#### **21.3.1 Data Parallelism**

Each GPU holds a full copy of the model and processes a different slice of every batch; gradients are averaged across GPUs before each update.

**PyTorch DDP (DistributedDataParallel):**
```python
# distributed_training.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# MyModel, dataset, and epochs are assumed to be defined elsewhere.

def setup():
    # Initialize process group (torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK)
    dist.init_process_group("nccl")  # NCCL for GPU, Gloo for CPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

def main():
    local_rank = setup()

    # Create model and move to GPU
    model = MyModel().to(local_rank)

    # Wrap with DDP
    model = DDP(model, device_ids=[local_rank],
                output_device=local_rank,
                find_unused_parameters=False)  # Set True if some params unused

    # Optimizer and loss are example choices, not requirements of DDP
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()

    # Distributed sampler ensures each GPU gets unique data
    train_sampler = DistributedSampler(dataset)
    train_loader = DataLoader(dataset, batch_size=64, sampler=train_sampler)  # 64 per GPU

    for epoch in range(epochs):
        train_sampler.set_epoch(epoch)  # Important for shuffling
        for batch in train_loader:
            inputs, labels = batch
            inputs = inputs.to(local_rank)
            labels = labels.to(local_rank)

            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            loss.backward()  # DDP all-reduces gradients during backward
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Launch with: torchrun --nproc_per_node=4 distributed_training.py
    main()
```
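What `DistributedSampler` and `set_epoch` accomplish in the script above can be sketched without PyTorch: every rank builds the same shuffled permutation from a shared seed plus the epoch number, pads it so it divides evenly, and takes a strided slice. The function below is a simplified illustration under those assumptions, not the library's code; `shard_indices` and its parameters are hypothetical names.

```python
import random


def shard_indices(dataset_len, rank, world_size, epoch, seed=0):
    """Return the sample indices one rank would see in a given epoch."""
    indices = list(range(dataset_len))
    # Same seed on every rank, so all ranks agree on the permutation.
    random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating samples so every rank gets the same count.
    remainder = len(indices) % world_size
    if remainder:
        indices += indices[: world_size - remainder]
    # Strided slice: rank r takes positions r, r + world_size, ...
    return indices[rank::world_size]


shards = [shard_indices(10, rank, world_size=4, epoch=0) for rank in range(4)]
print(shards)  # four lists of three indices that together cover 0..9
```

Because the permutation depends on the epoch, skipping the `set_epoch` call would hand each rank the same ordering in every epoch.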

#### **21.3.2 Model Parallelism & Pipeline Parallelism**

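For models too large to fit on a single GPU (e.g., GPT-3 scale). FSDP shards parameters, gradients, and optimizer state across data-parallel ranks, while pipeline parallelism places different layers on different GPUs.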

**Fully Sharded Data Parallel (FSDP):**
```python
# fsdp_training.py
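import functools

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

# TransformerBlock is a placeholder for the model's own transformer layer class.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={TransformerBlock}
)
# mixed_precision expects a MixedPrecision policy, not a bare dtype
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16, buffer_dtype=torch.bfloat16
)

model = FSDP(
    model,
    auto_wrap_policy=wrap_policy,
    mixed_precision=bf16_policy,
    device_id=torch.cuda.current_device(),
    limit_all_gathers=True,  # Rate-limit all-gathers to reduce memory pressure
)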
```

**Pipeline Parallelism (GPipe style):**
```python
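# Legacy GPipe-style API; newer PyTorch releases deprecate it in favor of torch.distributed.pipelining
import os
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe needs the RPC framework initialized, even in a single process
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

# Split model across stages (different GPUs); stage1..stage3 are placeholder nn.Modules
model = nn.Sequential(stage1.to("cuda:0"), stage2.to("cuda:1"), stage3.to("cuda:2"))
model = Pipe(model, chunks=4)  # Micro-batches for pipeline bubble reduction

output = model(input.to("cuda:0")).local_value()  # forward returns an RRef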
```
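The reason `chunks` matters is the pipeline bubble: while the first micro-batch works its way through the stages, later stages sit idle. A commonly used estimate for a GPipe-style schedule with equal-cost stages is `(stages - 1) / (micro_batches + stages - 1)`. The helper below is a sketch of that arithmetic under the equal-cost assumption; `bubble_fraction` is a hypothetical name.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle share of a GPipe-style schedule, assuming equal-cost stages."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for chunks in (1, 4, 16, 64):
    print(f"chunks={chunks:>2}  bubble={bubble_fraction(3, chunks):.2f}")
# chunks= 1  bubble=0.67
# chunks= 4  bubble=0.33
# chunks=16  bubble=0.11
# chunks=64  bubble=0.03
```

More micro-batches shrink the bubble, but each one is smaller, so per-kernel efficiency and activation memory set a practical limit.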

#### **21.3.3 ZeRO Optimization (DeepSpeed)**

Memory optimization stages for massive models:
- **ZeRO-1:** Shard optimizer states across GPUs
- **ZeRO-2:** Add gradient sharding
- **ZeRO-3:** Add parameter sharding (model state distributed)
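- **ZeRO-Offload / ZeRO-Infinity:** Offload optimizer state (and optionally parameters) to CPU memory, with ZeRO-Infinity extending offload to NVMe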

```python
# deepspeed_training.py
import deepspeed

# The same settings can live in deepspeed_config.json and be passed by path.
# train_batch_size = micro-batch per GPU x gradient_accumulation_steps x number of GPUs
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 0.0001}
    },
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "contiguous_gradients": True
    }
}

# Training (model is assumed to be an already constructed torch.nn.Module)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
```
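The memory effect of each ZeRO stage can be estimated with the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state. The sketch below applies that accounting to model states only (activations and temporary buffers are excluded); the function name is hypothetical and the 7.5B-parameter, 64-GPU setting is just an example.

```python
def model_state_gb(num_params, num_gpus, stage):
    """Per-GPU model-state memory in GB (1 GB = 1e9 bytes) for mixed-precision Adam.

    Bytes per parameter: 2 (FP16 weights) + 2 (FP16 grads) + 12 (FP32 master
    weights, momentum, variance). Activations and buffers are not included.
    """
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1: shard optimizer state
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3: also shard parameters
    return num_params * (params + grads + optim) / 1e9


for stage in range(4):  # stage 0 is plain data parallelism
    print(f"ZeRO-{stage}: {model_state_gb(7.5e9, 64, stage):6.1f} GB per GPU")
# ZeRO-0:  120.0 GB per GPU
# ZeRO-1:   31.4 GB per GPU
# ZeRO-2:   16.6 GB per GPU
# ZeRO-3:    1.9 GB per GPU
```

The numbers show why stage 2 is a common default: it removes most of the redundancy while leaving parameters replicated, which avoids the extra all-gathers that stage 3 requires.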

---

## **21.4 Training Optimization Techniques**

#### **21.4.1 Mixed Precision Training**

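Run forward and backward passes in FP16 or BF16 while keeping FP32 master weights for the optimizer update. FP16 additionally needs loss scaling to keep small gradients from underflowing.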

```python
# Automatic Mixed Precision (AMP)
import torch
from torch.cuda.amp import GradScaler

scaler = GradScaler()  # Dynamic loss scaling for FP16

for data, target in dataloader:
    optimizer.zero_grad()

    # Forward pass with autocast (torch.autocast accepts device_type; torch.cuda.amp.autocast does not)
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        output = model(data)
        loss = criterion(output, target)
    
    # Backward pass with scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

**BF16 vs FP16:**
- **FP16:** Narrow dynamic range, so it requires loss scaling; supported on V100+
- **BF16:** Same exponent range as FP32 but fewer mantissa bits, so no loss scaling is needed; supported on A100+
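The range difference can be demonstrated with the standard library alone. The sketch below rounds a small, made-up gradient value to FP16 using `struct`'s half-precision format, shows that it underflows to zero, and shows that multiplying by a loss scale first preserves it. The BF16 helper truncates a float32 to its top 16 bits, which is a simplification (real hardware typically rounds to nearest).

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_bf16(x):
    """Approximate BF16 by truncating a float32 to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


grad = 1e-8      # a small gradient value, made up for illustration
scale = 65536.0  # loss scale, 2**16

print(to_fp16(grad))            # 0.0: the value underflows in FP16
scaled = to_fp16(grad * scale)  # representable once scaled
print(scaled / scale)           # close to 1e-8 after unscaling in higher precision
print(to_bf16(grad))            # nonzero: BF16 shares FP32's exponent range
```

This is the mechanism `GradScaler` automates: scale the loss before backward, unscale the gradients before the optimizer step, and adjust the scale when overflows appear.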

#### **21.4.2 Gradient Accumulation**

Simulate large batch sizes with limited memory.

```python
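# Effective batch size = batch_size * accumulation_steps * num_gpus
import torch

accumulation_steps = 4

optimizer.zero_grad()
for i, (data, target) in enumerate(dataloader):
    # BF16 autocast needs no GradScaler; with FP16, scale the loss as in 21.4.1
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(data, target) / accumulation_steps  # model returns the loss here

    loss.backward()  # Gradients add up in .grad across micro-batches

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()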
```
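Dividing the loss by `accumulation_steps` is what makes the accumulated gradient match the gradient of one large batch. The pure-Python check below illustrates this for a one-parameter linear model with mean squared error; the data values are made up, and the equality assumes equal-size micro-batches and no batch-dependent layers such as BatchNorm.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5
accumulation_steps = 4
micro_batch_size = len(data) // accumulation_steps

# Each micro-batch gradient is divided by accumulation_steps,
# mirroring `loss / accumulation_steps` in the training loop.
accumulated = 0.0
for i in range(accumulation_steps):
    micro = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    accumulated += mse_grad(w, micro) / accumulation_steps

full = mse_grad(w, data)
print(accumulated, full)  # the two gradients agree up to floating-point rounding
```

Without the division, the accumulated gradient would be `accumulation_steps` times larger, which behaves like an unintended learning-rate increase.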

#### **21.4.3 Efficient Data Loading**

```python
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=8,        # Parallel data loading
    pin_memory=True,      # Faster CPU→GPU transfer
    persistent_workers=True,  # Keep workers alive between epochs
    prefetch_factor=2     # Batches prefetched per worker
)
```

---

## **21.5 Reproducibility**

#### **21.5.1 Deterministic Operations**

```python
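import os
import random

import numpy as np
import torch
from torch.utils.data import DataLoader

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    # cuBLAS needs this for deterministic matmuls; set it before the first CUDA call
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    # Deterministic algorithms (slower but reproducible)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # Disable auto-tuner
    torch.use_deterministic_algorithms(True)

def seed_worker(worker_id):
    # Each DataLoader worker derives its NumPy and random seeds from its torch seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# Usage (dataset is assumed to be defined elsewhere)
set_seed(42)
g = torch.Generator()
g.manual_seed(42)  # Fixes the shuffling order and the workers' base seed
DataLoader(dataset, num_workers=4, worker_init_fn=seed_worker, generator=g)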
```

#### **21.5.2 Environment Freezing**

```dockerfile
# Dockerfile.reproducible
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

# Pin ALL dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pin base image digest for true reproducibility
# FROM pytorch/pytorch@sha256:abc123...
```

```txt
# requirements.txt (pinned)
torch==2.0.1
numpy==1.24.3
pandas==2.0.2
transformers==4.30.0
```
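Pinned files describe the intended environment; it can also help to record the environment a run actually used. The sketch below collects the interpreter version, platform string, and installed package versions with the standard library so the result can be stored next to a run (for example via `mlflow.log_artifact`). The function name and the output layout are assumptions for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, OS, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        try:
            name, version = dist.metadata.get("Name"), dist.version
        except Exception:  # broken metadata: skip rather than fail the snapshot
            continue
        if name:
            packages[name] = version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
print(snapshot["python"], snapshot["platform"], len(snapshot["packages"]), "packages")
serialized = json.dumps(snapshot, indent=2)  # ready to write to a file and log
```

Comparing two such snapshots is a quick way to find the package drift behind a run that no longer reproduces.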

---

## **21.6 Workbook Labs**

### **Lab 1: Experiment Tracking Setup**
Set up MLflow or W&B for a training pipeline:

1. **Tracking:** Log hyperparameters, metrics (loss, accuracy), and artifacts (model checkpoints)
2. **Comparison:** Run 5 variants, compare learning curves in UI
3. **Model Registry:** Register best model, transition through Staging → Production stages
4. **Reproducibility:** Ensure another engineer can reproduce your best run using logged artifacts

**Deliverable:** Tracked experiment with report comparing runs and registered model.

### **Lab 2: Distributed Training**
Train ResNet-50 on ImageNet-scale data using multiple GPUs:

1. **Single GPU Baseline:** Measure samples/sec and memory usage
2. **DDP:** Convert to DistributedDataParallel, scale to 4 GPUs
3. **Optimization:** Add mixed precision (AMP) and gradient accumulation
4. **Analysis:** Calculate scaling efficiency (ideal: 4x speedup with 4 GPUs)

**Deliverable:** Scaling report showing throughput vs. GPU count and memory profiling.
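For the analysis step, scaling efficiency is the measured throughput divided by the ideal linear throughput (GPU count times the single-GPU baseline). A minimal sketch follows; the throughput numbers are hypothetical placeholders, not benchmark results.

```python
def scaling_efficiency(throughput_by_gpus):
    """Map GPU count -> efficiency relative to ideal linear scaling."""
    baseline = throughput_by_gpus[1]
    return {n: t / (n * baseline) for n, t in sorted(throughput_by_gpus.items())}


# Hypothetical measurements in samples/sec; replace with your own.
measured = {1: 400.0, 2: 780.0, 4: 1480.0}
efficiency = scaling_efficiency(measured)
for n, eff in efficiency.items():
    print(f"{n} GPU(s): speedup {measured[n] / measured[1]:.2f}x, efficiency {eff:.1%}")
```

Efficiency well below 100% usually points to communication overhead or a data-loading bottleneck, both of which the memory and throughput profiling in this lab should help locate.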

### **Lab 3: Hyperparameter Optimization**
Optimize a transformer fine-tuning task:

1. **Search Space:** Learning rate (1e-5 to 1e-3), batch size (16-128), dropout (0.1-0.5)
2. **Pruning:** Implement early stopping for unpromising trials
3. **Scheduling:** Use Bayesian optimization (Optuna) vs. Random search—compare efficiency
4. **Multi-objective:** Optimize for both accuracy and model cost (parameter count or FLOPs)

**Deliverable:** HPO study results with visualization of search space exploration.

### **Lab 4: Fault Tolerance**
Implement resilient training:

1. **Checkpointing:** Save model + optimizer state every epoch
2. **Simulation:** Kill training job mid-epoch, resume from checkpoint
3. **Automatic Recovery:** Use try/except with checkpoint reloading on failure
4. **Metrics Preservation:** Ensure experiment tracking continues seamlessly after restart

**Deliverable:** Training script with fault tolerance tested via simulated failures.
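The checkpoint-and-resume pattern for this lab can be prototyped before any GPU code is involved. The sketch below uses a JSON file as a stand-in for a real checkpoint (which would hold model and optimizer state), writes it atomically, simulates a crash, and verifies that the resumed run ends in the same state as an uninterrupted one. All function names and the toy update rule are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so a crash mid-write never corrupts the last good file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, crash_at=None):
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if step == crash_at:
            raise RuntimeError("simulated failure")
        state["weight"] += 0.1 * (1.0 - state["weight"])  # stand-in for an optimizer step
        state["step"] = step + 1
        save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as ckpt_dir:
    path = os.path.join(ckpt_dir, "ckpt.json")
    try:
        train(path, total_steps=10, crash_at=6)
    except RuntimeError:
        pass  # a real job would be restarted by the scheduler
    resumed = train(path, total_steps=10)

with tempfile.TemporaryDirectory() as ckpt_dir:
    reference = train(os.path.join(ckpt_dir, "ckpt.json"), total_steps=10)

print(resumed == reference)  # True: resuming reproduces the uninterrupted run
```

For a real model the same comparison only holds if the checkpoint also captures optimizer state, the learning-rate schedule position, and random number generator state.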

---

## **21.7 Common Pitfalls**

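1. **Non-Deterministic DataLoaders:** With multiple workers, shuffling order and per-worker random augmentations vary between runs unless they are seeded. **Solution:** Pass a seeded `generator` to the `DataLoader` and set `worker_init_fn` to seed NumPy and `random` in each worker.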

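2. **DDP Synchronization Bugs:** Calling `loss.item()` every step forces a GPU-to-CPU synchronization that slows training, and the value reflects only the local rank's batch. **Solution:** Log at intervals, aggregate metrics with `torch.distributed.all_reduce`, and make sure every rank executes the same collective calls, since a mismatch causes hangs.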

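3. **Gradient Accumulation with BatchNorm:** BatchNorm statistics are computed per micro-batch, not over the accumulated batch, so accumulation does not reproduce large-batch behavior and small micro-batches give noisy statistics. **Solution:** Use synchronized BatchNorm (`SyncBatchNorm`) to pool statistics across GPUs, switch to a batch-independent normalization such as GroupNorm or LayerNorm, or use gradient checkpointing to free memory for larger micro-batches.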

4. **Memory Fragmentation:** PyTorch caching allocator fragments memory over long training runs. **Solution:** Call `torch.cuda.empty_cache()` between epochs or use `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`.

5. **Ignoring Numerical Stability:** FP16 underflow with small gradients. **Solution:** Use `GradScaler` with dynamic loss scaling, or switch to BF16.

---

## **21.8 Interview Questions**

**Q1:** Explain the difference between Data Parallelism and Model Parallelism. When would you use each?
*A: Data Parallelism (DDP): Each GPU holds the full model and processes a different portion of the batch. Used when the model fits in a single GPU's memory (most common). Model Parallelism: The model is split across GPUs (different layers or tensor slices on different devices). Used when the model is too large for one GPU (e.g., GPT-3). Pipeline Parallelism is a form of model parallelism that assigns consecutive layers to different GPUs and pipelines micro-batches to keep all GPUs busy. Modern training uses hybrids: FSDP (sharded data parallel) for large models combines the simplicity of data parallelism with memory savings approaching those of model parallelism.*

**Q2:** How does ZeRO (Zero Redundancy Optimizer) reduce memory usage?
*A: ZeRO partitions optimizer states, gradients, and parameters across data parallel processes instead of replicating them. For mixed-precision Adam at large GPU counts, ZeRO-1 shards optimizer states (up to about 4x reduction in model-state memory), ZeRO-2 adds gradient sharding (up to about 8x), and ZeRO-3 adds parameter sharding (reduction linear in GPU count). ZeRO-Offload moves optimizer state and its update computation to CPU, and ZeRO-Infinity extends offloading to NVMe, enabling very large models on limited GPU memory. Trade-off: Increased communication, mainly in ZeRO-3, which must all-gather parameters during the forward and backward passes.*

**Q3:** What is the purpose of gradient accumulation, and what are its trade-offs?
*A: Gradient accumulation simulates large batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. Used when GPU memory limits physical batch size. Trade-offs: (1) Training slower (more forward passes per update), (2) BatchNorm statistics less accurate (computed per mini-batch), (3) Effective batch size affects convergence dynamics (larger batches need learning rate warmup/scaling). Benefit: Train with effectively unlimited batch sizes on limited hardware.*

**Q4:** How do you ensure reproducibility in distributed training across different hardware?
*A: (1) Set all random seeds (Python, NumPy, PyTorch, CUDA), (2) Use deterministic algorithms (`torch.use_deterministic_algorithms(True)`), (3) Disable `cudnn.benchmark` (its auto-tuner can pick different algorithms from run to run), (4) Fix DataLoader worker seeds, (5) Document exact software versions (Docker images with digests), (6) For distributed: keep the world size and reduction order fixed, since floating-point sums depend on the order of operations. Note: Some ops have no deterministic GPU implementation and raise an error in deterministic mode, so fall back to CPU or accept slight variance. Bitwise-identical results across different GPU models or library versions are generally not guaranteed.*

**Q5:** Design a hyperparameter search for a model with 10-day training time and limited compute budget.
*A: Use early stopping with ASHA (Asynchronous Successive Halving): allocate minimal resources to many configurations, promote promising ones. Start with coarse random search over wide ranges, then Bayesian optimization (TPE) on promising region. Use population-based training to transfer weights between trials (don't train from scratch each time). Prioritize important hyperparameters (learning rate, batch size) over minor ones (dropout). Use multi-fidelity optimization: validate on subset of data or fewer epochs for screening, full training for finalists.*

---

## **21.9 Further Reading**

**Books:**
- *Deep Learning with PyTorch* (Eli Stevens et al.) - Distributed training patterns
- *Designing Machine Learning Systems* (Chip Huyen) - Chapter on training

**Papers:**
- "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (Rajbhandari et al., 2020)
- "Mixed Precision Training" (Micikevicius et al., 2018)
- "Population Based Training of Neural Networks" (Jaderberg et al., 2017)

**Tools:**
- **Ray Tune:** Scalable HPO with early stopping
- **Weights & Biases:** Experiment tracking
- **DeepSpeed:** Microsoft library for large model training
- **FSDP:** PyTorch native sharded data parallel

---

## **21.10 Checkpoint Project: Distributed LLM Fine-Tuning**

Fine-tune a 7B parameter LLM (LLaMA-2 or Mistral) on a custom dataset using distributed training.

**Requirements:**

1. **Infrastructure:**
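   - Multi-GPU setup (4x A100 or 8x V100)
   - Use FSDP (recommended for 7B+ models), or DDP only if the model, gradients, and optimizer state fit on each GPU
   - Mixed precision (BF16 on A100; FP16 with loss scaling on V100, which lacks BF16 support)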

2. **Optimization:**
   - Gradient checkpointing (trade compute for memory)
   - Gradient accumulation (effective batch size 128)
   - DeepSpeed ZeRO-2 or ZeRO-3 integration

3. **Experiment Tracking:**
   - W&B integration logging perplexity, learning rate, GPU memory
   - Hyperparameter sweep over learning rates (1e-5 to 5e-5) and LoRA ranks

4. **Fault Tolerance:**
   - Checkpoint every 500 steps to shared storage (S3/NFS)
   - Resume capability from latest checkpoint
   - Validation evaluation every epoch

5. **Evaluation:**
   - Perplexity on held-out test set
   - Generation examples logged to W&B
   - Throughput measurement (tokens/sec/GPU)

**Deliverables:**
- `distributed_training/` directory with launch scripts
- W&B project link showing training curves
- Final model pushed to Hugging Face Hub or S3
- Performance report: memory usage per GPU, scaling efficiency, time-to-convergence

**Success Criteria:**
- Model trains without OOM errors on 7B parameters
- Checkpoint resume tested and verified
- Hyperparameter sweep identifies optimal configuration
- Final model achieves target perplexity < X on validation set

---

**End of Chapter 21**

*You can now train models at scale with full reproducibility. Chapter 22 covers Model Deployment & Serving—getting these trained models into production.*

---
