# PyTorch Tutorial: Distributed Training (DDP & FSDP)

When your model is too big for one GPU, or your data is so large that training takes weeks, you need **Distributed Training**. This is a standard requirement for FAANG AI roles working on Foundation Models.

## Learning Objectives
- Understand **Data Parallelism** vs **Model Parallelism**
- Learn the structure of **DDP (Distributed Data Parallel)**
- Introduction to **FSDP (Fully Sharded Data Parallel)** for massive models

## 1. Vocabulary First

- **Node**: A physical machine (server). A node can have multiple GPUs.
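- **Rank**: The ID of a process. With one process per GPU and 4 GPUs, ranks are 0, 1, 2, 3.
- **World Size**: Total number of processes (normally one per GPU) in the training job.
- **Master Address/Port**: The network address and port where the processes meet to form the group (usually the host of Rank 0), set through `MASTER_ADDR` and `MASTER_PORT`.
- **Scatter/Gather**: Scatter sends a different chunk of data from one process to each of the others; Gather collects their results back onto one process.
- **All-Reduce**: A synchronization step where all GPUs combine their gradients (sum, then divide to get the average) and every GPU receives the result.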
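
The relationship between nodes, ranks, and world size can be sketched in a few lines. This is a minimal illustration that assumes the common layout of one process per GPU with nodes numbered in order; the helper names `global_rank` and `world_size`, and the term "local rank" (a GPU's index within its node), are introduced here for illustration only.

```python
def global_rank(node_rank, local_rank, gpus_per_node):
    """Global rank, assuming one process per GPU and nodes numbered in order."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank


def world_size(num_nodes, gpus_per_node):
    """Total number of processes in the job (one per GPU)."""
    return num_nodes * gpus_per_node


# Two nodes with 4 GPUs each give 8 processes with ranks 0..7.
ranks = [global_rank(node, gpu, 4) for node in range(2) for gpu in range(4)]
print(ranks)             # [0, 1, 2, 3, 4, 5, 6, 7]
print(world_size(2, 4))  # 8
```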

### Why Distributed Training Matters

Single-GPU training simply cannot keep up with modern model sizes. Here is the math:

| Model | Parameters | FP32 Memory (Weights Only) | Training Memory (with gradients + optimizer) |
|-------|-----------|---------------------------|----------------------------------------------|
| ResNet-50 | 25M | 100 MB | ~400 MB |
| BERT-Large | 340M | 1.3 GB | ~5 GB |
| GPT-3 | 175B | 700 GB | ~2.8 TB |
| Llama-3 70B | 70B | 280 GB | ~1.1 TB |

**Training memory is roughly 4x the model size** in FP32 with Adam (before counting activations, which come on top) because you need to store:
1. **Weights** (the model parameters)
2. **Gradients** (same size as weights)
3. **Optimizer states** (Adam stores 2 extra copies: momentum + variance)

So a 70B model needs `70B × 4 bytes × 4 ≈ 1.1 TB`, which no single GPU can hold.
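
The 4x rule is easy to turn into a quick estimator. The sketch below assumes FP32 everywhere and Adam, ignores activations and framework overhead, and uses hypothetical helper names (`training_memory_bytes`, `to_gb`); treat the numbers as lower bounds rather than measurements.

```python
def training_memory_bytes(num_params, bytes_per_param=4, copies=4):
    """Weights + gradients + two Adam states = 4 copies of the parameters."""
    return num_params * bytes_per_param * copies


def to_gb(num_bytes):
    """Convert bytes to gigabytes (1 GB = 1e9 bytes, as in the table)."""
    return num_bytes / 1e9


# Parameter counts taken from the table above.
models = {
    "ResNet-50": 25e6,
    "BERT-Large": 340e6,
    "Llama-3 70B": 70e9,
    "GPT-3": 175e9,
}

for name, num_params in models.items():
    print(f"{name}: about {to_gb(training_memory_bytes(num_params)):,.1f} GB")
```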

### The Two Fundamental Strategies

**Data Parallelism** — Same model on every GPU, different data.
- Each GPU has a full copy of the model
- The dataset is split across GPUs
- Gradients are averaged after each step
- Works when the model fits on one GPU but training is slow

**Model Parallelism** — Different parts of the model on different GPUs.
- The model is split across GPUs (each GPU holds a piece)
- Data flows between GPUs during forward/backward pass
- Necessary when the model is too large for one GPU
- Adds communication overhead between GPUs

## 2. Distributed Data Parallel (DDP)

**How it works:**
1. Copy the model to every GPU.
2. Split the dataset (each GPU gets a different chunk).
3. Forward pass runs independently on each GPU.
4. Backward pass computes gradients.
5. **All-Reduce**: Gradients are averaged across all GPUs.
6. Optimizer updates weights (identical on all GPUs).
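
Steps 2 through 6 can be checked numerically on a toy problem. The sketch below simulates four ranks inside a single process, with a one-parameter linear model and made-up data; it involves no real DDP. It illustrates why averaging per-GPU gradients over equally sized shards matches the gradient of the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy dataset (y = 2x) and starting weight.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
num_ranks = 4

# Step 2: each rank gets an equally sized, disjoint shard of the data.
shards = [(xs[r::num_ranks], ys[r::num_ranks]) for r in range(num_ranks)]

# Steps 3-4: each rank computes a gradient on its own shard.
local_grads = [mse_grad(w, shard_x, shard_y) for shard_x, shard_y in shards]

# Step 5: all-reduce averages the local gradients across ranks.
avg_grad = sum(local_grads) / num_ranks

# Reference: the gradient a single GPU would compute on the full batch.
full_grad = mse_grad(w, xs, ys)

# Step 6: every rank applies the same update, so the replicas stay identical.
lr = 0.01
new_w = w - lr * avg_grad

print(avg_grad, full_grad)  # the two values agree
```

The equality relies on every shard having the same size; with uneven shards the plain average would weight samples differently.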

### The All-Reduce Operation (Core Concept)

All-Reduce is the heart of distributed training. It ensures every GPU ends up with the **same averaged gradients**.

**Naive Approach (Don't do this)**:
```
GPU 0 sends gradients → GPU Master
GPU 1 sends gradients → GPU Master
GPU 2 sends gradients → GPU Master
GPU Master computes average → sends back to all
```
This creates a bottleneck at the master and doesn't scale.

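**Ring All-Reduce (the classic scalable approach)**:
```
GPUs arranged in a ring: 0 → 1 → 2 → 3 → 0
Each GPU splits its gradients into world_size chunks
Step 1: Each GPU sends one chunk to its neighbor
Step 2: Each GPU adds the received chunk to its own copy of that chunk
Step 3: Repeat until each GPU holds the full sum for one chunk
Step 4: Pass the finished chunks around the ring until all GPUs have the full sum
Step 5: Divide by world_size to get average
```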

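Ring All-Reduce is efficient because **every GPU sends and receives simultaneously**, so there is no single bottleneck. Each GPU transfers about `2N(P-1)/P` values (where N is the gradient size and P is the number of GPUs), which stays below `2N` however many GPUs you add. The per-GPU communication volume is therefore `O(N)`; what grows with P is the number of ring steps, and with it the latency.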
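
One common way to realize the ring procedure is in two phases: a reduce-scatter that leaves each GPU with the full sum of one chunk, followed by an all-gather that circulates the finished chunks. The following pure-Python simulation is a sketch under simplifying assumptions (gradient length divisible by the number of ranks, everything in one process); `ring_all_reduce` is a hypothetical name and this is not how NCCL is implemented internally.

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce; returns (averaged grads per rank, elements sent per rank)."""
    p = len(grads)
    n = len(grads[0])
    assert n % p == 0, "sketch assumes the gradient length is divisible by the rank count"
    size = n // p
    # buffers[r][c] is rank r's copy of chunk c.
    buffers = [[list(g[c * size:(c + 1) * size]) for c in range(p)] for g in grads]
    sent = 0  # elements sent by one rank (identical for every rank)

    # Phase 1 (reduce-scatter): after p - 1 steps, rank r holds the full sum of chunk (r + 1) % p.
    for step in range(p - 1):
        # Build all messages first so that the sends happen "simultaneously".
        msgs = [(r, (r - step) % p, list(buffers[r][(r - step) % p])) for r in range(p)]
        for src, chunk, payload in msgs:
            dst = (src + 1) % p
            buffers[dst][chunk] = [a + b for a, b in zip(buffers[dst][chunk], payload)]
        sent += size

    # Phase 2 (all-gather): circulate the finished chunks until every rank has all of them.
    for step in range(p - 1):
        msgs = [(r, (r + 1 - step) % p, list(buffers[r][(r + 1 - step) % p])) for r in range(p)]
        for src, chunk, payload in msgs:
            buffers[(src + 1) % p][chunk] = payload
        sent += size

    # Final step: divide the sums by the number of ranks to get the average.
    averaged = [[x / p for chunk in buf for x in chunk] for buf in buffers]
    return averaged, sent


# Four simulated GPUs, each with 8 gradient values: rank r holds 10 * r + i at position i.
grads = [[float(10 * r + i) for i in range(8)] for r in range(4)]
averaged, sent = ring_all_reduce(grads)
print(averaged[0])  # [15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0]
print(sent)         # 12, i.e. 2 * 8 * (4 - 1) / 4
```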

### Effective Batch Size and Learning Rate

When you use 4 GPUs with batch_size=32 per GPU, your **effective batch size is 128** (4 × 32).

This matters because larger effective batches need adjusted learning rates:

**Linear Scaling Rule** (a heuristic from Facebook's paper on large-batch training, Goyal et al., 2017, "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour"):
```
new_lr = base_lr × (effective_batch_size / base_batch_size)
```

**Example**: If `lr=0.001` works with batch_size=32, then with 4 GPUs:
```
new_lr = 0.001 × (128 / 32) = 0.004
```

**Learning Rate Warmup**: When scaling up, the model can diverge if you jump straight to a high LR. Warmup gradually increases LR over the first few hundred steps to stabilize training.
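
The scaling rule and a warmup ramp fit into two small functions. This is a sketch only: the linear shape of the ramp, the 500-step warmup length, and the names `scaled_lr` and `lr_at_step` are assumptions chosen for illustration, not values prescribed by the text above.

```python
def scaled_lr(base_lr, base_batch_size, per_gpu_batch_size, num_gpus):
    """Linear scaling rule: the LR grows with the effective batch size."""
    effective_batch_size = per_gpu_batch_size * num_gpus
    return base_lr * effective_batch_size / base_batch_size


def lr_at_step(step, target_lr, warmup_steps):
    """Linear warmup from a small LR up to target_lr, then constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps


# 4 GPUs with batch_size=32 each, starting from lr=0.001 tuned for batch_size=32.
target = scaled_lr(base_lr=0.001, base_batch_size=32, per_gpu_batch_size=32, num_gpus=4)

for step in (0, 99, 249, 499, 1000):
    print(step, f"{lr_at_step(step, target, warmup_steps=500):.6f}")
```

In a PyTorch training loop, a function like `lr_at_step` would typically be wired in through a scheduler such as `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplier relative to the optimizer's LR rather than an absolute value.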

### The Code Structure
DDP is normally run as a script rather than in a notebook, because it launches one process per GPU. Here is the template you would use:

```python
# ddp_script.py (Template)

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Tell every process where Rank 0 listens (single-node defaults;
    # the port is an arbitrary free port chosen for this template)
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Bind this process to its own GPU
    torch.cuda.set_device(rank)
    # Initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # 1. Create Model and move to GPU (Rank)
    model = torch.nn.Linear(10, 10).to(rank)

    # 2. Wrap with DDP
    ddp_model = DDP(model, device_ids=[rank])

    # 3. Loss and Optimizer
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)

    # 4. Training Loop (a single step on random data)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    loss = loss_fn(outputs, torch.randn(20, 10).to(rank))
    loss.backward()  # DDP all-reduces the gradients during backward
    optimizer.step()

    cleanup()

def main():
    world_size = 2 # Number of GPUs
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)

if __name__ == "__main__":
    main()
```

## 3. Fully Sharded Data Parallel (FSDP)

DDP replicates the *entire model* on every GPU. If your model is 100GB and your GPU has 80GB, DDP fails.

**FSDP** solves this by **sharding** (splitting) the model parameters, gradients, and optimizer states across GPUs. Each GPU only holds a piece of the model.

### DDP vs FSDP: Memory Comparison

Consider training a **7B parameter model** in FP32 with the Adam optimizer on 4x A100 (80GB each), counting only model state (activations come on top):

**DDP (each GPU stores everything)**:
```
Weights:          7B × 4 bytes = 28 GB
Gradients:        7B × 4 bytes = 28 GB
Optimizer (Adam): 7B × 8 bytes = 56 GB  (momentum + variance)
─────────────────────────────────────────
Total per GPU:                   112 GB  ← Doesn't fit on 80GB A100!
```

**FSDP (sharded across 4 GPUs)**:
```
Weights:          7B × 4 bytes / 4 GPUs =  7 GB
Gradients:        7B × 4 bytes / 4 GPUs =  7 GB
Optimizer (Adam): 7B × 8 bytes / 4 GPUs = 14 GB
─────────────────────────────────────────
Total per GPU:                             28 GB  ← Fits easily!
```
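
The two breakdowns above reduce to one formula: 16 bytes of model state per parameter, divided by the number of GPUs when sharded. The sketch below reproduces the figures under the same assumptions (FP32, Adam, activations and temporary all-gather buffers ignored); `per_gpu_state_gb` is a hypothetical helper name.

```python
def per_gpu_state_gb(num_params, num_gpus, sharded,
                     bytes_weights=4, bytes_grads=4, bytes_adam=8):
    """Per-GPU memory for weights, gradients, and Adam states, in GB."""
    total_bytes = num_params * (bytes_weights + bytes_grads + bytes_adam)
    if sharded:
        total_bytes /= num_gpus  # FSDP: each GPU keeps only its shard
    return total_bytes / 1e9


gpu_capacity_gb = 80  # A100 80GB, as in the example above

ddp_gb = per_gpu_state_gb(7e9, num_gpus=4, sharded=False)
fsdp_gb = per_gpu_state_gb(7e9, num_gpus=4, sharded=True)

print(f"DDP:  {ddp_gb:.0f} GB per GPU, fits: {ddp_gb <= gpu_capacity_gb}")
print(f"FSDP: {fsdp_gb:.0f} GB per GPU, fits: {fsdp_gb <= gpu_capacity_gb}")
```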

### FSDP Sharding Strategies

FSDP offers three main levels of sharding. The more you shard, the more memory you save, but the more communication overhead you pay:

| Strategy | What's Sharded | Memory Savings | Communication Cost |
|----------|---------------|----------------|--------------------|
| `FULL_SHARD` | Weights + Gradients + Optimizer | Maximum (per-GPU model state shrinks roughly in proportion to the number of GPUs) | Highest (weights are all-gathered before forward and again before backward) |
| `SHARD_GRAD_OP` | Gradients + Optimizer (weights stay unsharded between forward and backward) | Moderate (the full weights still occupy each GPU during computation) | Moderate |
| `NO_SHARD` | Nothing (same as DDP) | None | Lowest |

### When to use FSDP?
- Training LLMs (Llama 3, GPT-4 class models).
- When model size > GPU memory.
- When you need to train (not just inference) large models.

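```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes dist.init_process_group(...) has already run in this process
# and the process has selected its GPU with torch.cuda.set_device(rank).
model = torch.nn.Linear(10, 10)  # placeholder model

# Wrap your model; its parameters are sharded across all ranks
fsdp_model = FSDP(model, device_id=torch.cuda.current_device())
```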

### Common Failure Modes in Distributed Training

These are the bugs that waste hours of expensive GPU time:

1. **Deadlocks**: One GPU crashes, others hang forever waiting for All-Reduce. Fix: Set `timeout` in `init_process_group` and use `NCCL_ASYNC_ERROR_HANDLING=1` (named `TORCH_NCCL_ASYNC_ERROR_HANDLING` in recent PyTorch releases).

2. **OOM on one GPU**: Uneven data distribution causes one GPU to get a larger batch. Fix: Use `DistributedSampler` to ensure equal splits.

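3. **Inconsistent models or data order**: Ranks that seed their samplers differently can draw overlapping shards, and forgetting `sampler.set_epoch(epoch)` replays the same shuffle every epoch. Fix: Use the same seed on every rank and call `sampler.set_epoch(epoch)` at the start of each epoch.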

4. **Slow communication**: Using `gloo` backend instead of `nccl` for GPU training. Fix: Always use `nccl` for GPU-to-GPU communication.

5. **Gradient accumulation bugs**: Not dividing loss by the number of accumulation steps. Fix: `loss = loss / accumulation_steps` before `backward()`.
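
Failure modes 2 and 3 both come down to how samples are assigned to ranks. The pure-Python sketch below imitates the idea behind `DistributedSampler` (shuffle with a seed shared by all ranks plus the epoch number, pad to an even split, then take a strided slice per rank). It is a simplified stand-in for illustration, not PyTorch's implementation, and `shard_indices` is a hypothetical name.

```python
import random


def shard_indices(dataset_len, num_ranks, rank, epoch, seed=0, shuffle=True):
    """Indices one rank would process in a given epoch (assumes dataset_len >= num_ranks)."""
    indices = list(range(dataset_len))
    if shuffle:
        # Same seed on every rank, so all ranks agree on the permutation;
        # adding the epoch changes the order from one epoch to the next.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating the first few indices so every rank gets the same count.
    per_rank = -(-dataset_len // num_ranks)  # ceiling division
    total = per_rank * num_ranks
    indices += indices[: total - len(indices)]
    # Strided slice: rank r takes positions r, r + num_ranks, r + 2 * num_ranks, ...
    return indices[rank:total:num_ranks]


for epoch in range(2):
    parts = [shard_indices(10, 4, rank, epoch) for rank in range(4)]
    print(f"epoch {epoch}: {parts}")
```

Every rank receives the same number of samples, which avoids one GPU getting a larger batch, and passing the epoch number plays the role of `sampler.set_epoch(epoch)`.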

## Key Takeaways

1. **DDP** is fast and standard for models that fit on one GPU — it replicates the model and averages gradients via Ring All-Reduce.
2. **FSDP** is necessary for giant models (LLMs) — it shards weights, gradients, and optimizer states across GPUs.
3. **`dist.init_process_group`** is the handshake that starts distributed training — always use `nccl` backend for GPUs.
4. **Effective batch size = per_GPU_batch × num_GPUs** — scale learning rate proportionally and use warmup.
5. **Ring All-Reduce** is the communication primitive that makes distributed training scale — no central bottleneck.

### Quick Decision Guide

```
Can your model fit on 1 GPU?
  ├─ Yes → Is training too slow?
  │         ├─ Yes → Use DDP (multiple GPUs, same model)
  │         └─ No  → Single GPU is fine
  └─ No  → Use FSDP (shard the model across GPUs)
              └─ Still doesn't fit? → Combine FSDP with model parallelism (tensor/pipeline parallel)
```