
# DeepSpeed Tutorial

Focus: **why DeepSpeed**, the core **ZeRO** idea.



## 🧠 Section 1 - Why DeepSpeed: The GPU Memory Problem in Training

### 1) Motivation
Training large models hits a **memory wall**. Even with big GPUs, the forward + backward pass stores many tensors simultaneously.

### 2) What consumes memory during training
| Component | What it is | Notes |
|---|---|---|
| **Model Parameters** | The trainable weights | ~1× model size |
| **Activations** | Intermediates saved for backward | scales with batch × seq × hidden |
| **Gradients** | Produced by backprop | ~1× model size |
| **Optimizer States** | For Adam: momentum (m), variance (v) | ≈ 2× model size |
| **Temporary Buffers** | Workspace for matmuls/communication | dynamic overhead |

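With Adam, the parameters, gradients, and optimizer states alone come to about 4× the model size (1 + 1 + 2). Activations and temporary buffers push the per-GPU total to roughly **4–6×** the model size.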
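To make the table concrete, here is a minimal sketch of the arithmetic. It assumes every value is stored in 4-byte fp32 and ignores activations and temporary buffers; the function name and the 1B-parameter model are hypothetical, chosen only for illustration.

```python
def training_state_bytes(num_params, bytes_per_value=4):
    """Rough per-GPU bytes for params, grads, and Adam states (no activations)."""
    params = num_params * bytes_per_value
    grads = num_params * bytes_per_value
    adam_states = 2 * num_params * bytes_per_value  # momentum m and variance v
    return {
        "params": params,
        "grads": grads,
        "adam_states": adam_states,
        "total": params + grads + adam_states,
    }

est = training_state_bytes(1_000_000_000)  # hypothetical 1B-parameter model
gib = 1024 ** 3
for name, size in est.items():
    print(f"{name:12s} {size / gib:6.1f} GiB")
```

The total is four times the parameter memory, which is where the 4× lower bound comes from.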

### 3) Why naïve data parallel wastes memory
Every GPU **replicates** params + grads + optimizer:
```
GPU0: params + grads + optimizer
GPU1: params + grads + optimizer
GPU2: params + grads + optimizer
...
```
And the workflow is:
1. Forward pass: each GPU computes loss on its data.
2. Backward pass: each GPU computes local gradients.
3. Gradients are **all-reduced** across GPUs → everyone gets the same averaged gradients.
4. Each GPU **updates** its own full parameter copy with its local optimizer.

So 4 GPUs do **not** make a single GPU need 1/4th the memory: each still holds everything.
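The four steps above can be simulated on plain Python lists. This is only a sketch: the "GPUs" are list entries, the gradient values are made up, and plain SGD stands in for the optimizer to keep the example short.

```python
def all_reduce_mean(per_rank_grads):
    """Simulate all-reduce: every rank ends up with the elementwise mean."""
    world_size = len(per_rank_grads)
    mean = [sum(vals) / world_size for vals in zip(*per_rank_grads)]
    return [list(mean) for _ in range(world_size)]

# Hypothetical local gradients for a 3-parameter model on 4 GPUs.
local_grads = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [0.0, 4.0, 2.0],
    [4.0, 0.0, 2.0],
]
synced = all_reduce_mean(local_grads)

lr = 0.5
params = [[1.0, 1.0, 1.0] for _ in range(4)]  # one full replica per GPU
for rank in range(4):
    # Every rank applies the same update to its own full copy (plain SGD here).
    params[rank] = [p - lr * g for p, g in zip(params[rank], synced[rank])]
```

Every rank does identical work on identical data in the update step, which is exactly the redundancy ZeRO removes.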

### 4) Why model parallel is NOT a good idea
Model parallelism splits the model itself across GPUs, so each GPU holds a different part of the model instead of a full copy. In the layer-wise form, each GPU **computes a subset of layers**. The drawbacks:
- **Manual partitioning** – Layers or tensors must be explicitly assigned to GPUs.
- **High communication cost** – GPUs must constantly exchange activations and gradients.
- **Sequential dependencies** – Layers depend on outputs from previous GPUs, creating idle "pipeline bubbles."
- **Backward complexity** – Gradients must flow across devices, increasing synchronization overhead.

In short: this reduces both memory *and* compute per GPU, but adds cross-GPU communication wherever the model is split.



### 5) DeepSpeed’s core idea
**ZeRO (Zero Redundancy Optimizer)** partitions the optimizer states, gradients, and parameters **across** GPUs instead of replicating them. With all three partitioned, per-GPU memory for these states shrinks roughly in proportion to the number of devices.



## ⚙️ Section 2 - ZeRO: The Heart of DeepSpeed

### Big idea
Don’t replicate all training states on each GPU; **shard** them. Communication reconstructs what’s needed on the fly.

### ZeRO’s three stages

#### 🧩 Stage 1 - Shard optimizer states

In optimizers like **Adam** or **AdamW**, at step $t$ each trainable parameter $\theta_i$ has its own **state variables** that keep track of its historical updates, for example:

$$
\begin{aligned}
m_i^t &\leftarrow \beta_1 m_i^{t-1} + (1 - \beta_1) g_i^t \\
v_i^t &\leftarrow \beta_2 v_i^{t-1} + (1 - \beta_2) (g_i^t)^2 \\
\theta_i^t &\leftarrow \theta_i^{t-1} - \alpha \frac{m_i^t / (1 - \beta_1^t)}{\sqrt{v_i^t / (1 - \beta_2^t)} + \epsilon}
\end{aligned}
$$

Where:
- $m_i$ = first moment (momentum term)
- $v_i$ = second moment (variance term)
- $g_i$ = gradient of that parameter
- $\theta_i$ = parameter value itself

Each parameter $\theta_i$ updates **only using its own** $m_i$, $v_i$, and $g_i$.

In vector form, Adam’s update is applied *elementwise*:
$$\theta \leftarrow \theta - \alpha \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}, \qquad \hat{m} = \frac{m}{1 - \beta_1^t}, \quad \hat{v} = \frac{v}{1 - \beta_2^t}$$

So if we split the vector of parameters into chunks, each chunk can be updated **entirely on its own**, as long as it has access to its local `m`, `v`, and `g`.

Because the optimizer states are independent:
- We can **shard** the `m` and `v` tensors across GPUs.
- GPU 0 stores `m,v` for parameters [0–25%], GPU 1 for [25–50%], etc.
- Each GPU updates *its subset* of parameters using its own local optimizer states.

```
GPU0: gets optimizer states[0–25%]
GPU1: gets optimizer states[25–50%]
GPU2: gets optimizer states[50–75%]
GPU3: gets optimizer states[75–100%]
```

The optimizer step itself needs no communication. Afterward, the GPUs all-gather their updated parameter shards so that every replica again holds the full, current parameters.
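The independence claim can be checked directly. The sketch below implements the Adam equations above on plain Python lists, then compares one full-vector step against the same step run chunk by chunk, as four hypothetical ranks would. The parameter and gradient values are made up for illustration.

```python
import math

def adam_step(theta, m, v, g, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step on plain Python lists; returns new (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for th_i, m_i, v_i, g_i in zip(theta, m, v, g):
        m_i = b1 * m_i + (1 - b1) * g_i
        v_i = b2 * v_i + (1 - b2) * g_i * g_i
        m_hat = m_i / (1 - b1 ** t)   # bias-corrected first moment
        v_hat = v_i / (1 - b2 ** t)   # bias-corrected second moment
        new_theta.append(th_i - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta = [0.5, -1.0, 2.0, 0.1, -0.3, 0.7, 1.5, -2.0]
grads = [0.1, -0.2, 0.3, 0.05, -0.4, 0.2, 0.0, 0.6]
zeros = [0.0] * len(theta)

# Reference: update the whole vector at once.
full_theta, _, _ = adam_step(theta, zeros, zeros, grads, t=1)

# Sharded: each simulated rank updates only its own slice.
world_size = 4
chunk = len(theta) // world_size
sharded_theta = []
for rank in range(world_size):
    lo, hi = rank * chunk, (rank + 1) * chunk
    th, _, _ = adam_step(theta[lo:hi], zeros[lo:hi], zeros[lo:hi], grads[lo:hi], t=1)
    sharded_theta.extend(th)  # stands in for gathering the updated shards

assert sharded_theta == full_theta
```

Because each element goes through the same arithmetic either way, the two results match exactly, not just approximately.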


##### Training Pipeline for ZeRO-1
```
🏭 Forward (parameters duplicated) → Backward (parameters, gradients duplicated) → All-Reduce (average gradients) → Local Update (each GPU updates ONLY its parameter shard, using its shard of optimizer states) → All-Gather (updated parameters to all GPUs)
```

***Summary***
- Shard Adam’s momentum/variance across GPUs.
- Full params + gradients remain replicated.
- **Save:** optimizer memory drops from ≈ 2× model size to ≈ 2×/N per GPU.
- **Comm:** about the same volume as standard data parallel.

#### 🧮 Stage 2 - Shard gradients (plus optimizer)
##### Idea: shard gradients

If each parameter has its own independent gradient, why should *every* GPU keep the *entire* gradient tensor?
ZeRO Stage 2 therefore **partitions gradients** just like optimizer states. Instead of everyone keeping the full `g_avg`, it **combines reduction and partitioning** in one step: each GPU contributes its local gradients, but receives only the **shard** of the averaged result it needs.
```
GPU0: gets g_avg[0–25%]
GPU1: gets g_avg[25–50%]
GPU2: gets g_avg[50–75%]
GPU3: gets g_avg[75–100%]
```
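A reduce-scatter can be simulated in a few lines. This sketch assumes the gradient length divides evenly by the number of ranks and uses made-up values; in a real run this is a collective communication call, not a Python loop.

```python
def reduce_scatter_mean(per_rank_grads):
    """Simulate reduce-scatter: rank r receives only shard r of the mean gradient."""
    world_size = len(per_rank_grads)
    n = len(per_rank_grads[0])
    assert n % world_size == 0, "sketch assumes the gradient splits evenly"
    shard = n // world_size
    out = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        out.append([
            sum(g[i] for g in per_rank_grads) / world_size
            for i in range(lo, hi)
        ])
    return out

# Hypothetical local gradients on 2 GPUs for a 4-parameter model.
local_grads = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
]
shards = reduce_scatter_mean(local_grads)
print(shards)  # rank 0 holds the first half of g_avg, rank 1 the second half
```

Each rank ends up holding `1/N` of the averaged gradient instead of the whole tensor.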
##### Why does this work? Independence of Gradients
Each parameter `θ_i` only needs its own gradient `g_i`.
Since each GPU updates only its own parameter shard, it has no need to store gradients of other parameters.

##### 💾 Memory benefit
Each GPU stores only 1/N of the gradients → **gradient memory reduced by a factor of N**.

##### Training Pipeline for ZeRO-2
```
🏭 Forward (parameters duplicated) → Backward (parameters duplicated; local gradients computed) → Reduce-Scatter (average + partition grads across GPUs) → Local Update (each GPU updates ONLY its parameter shard, using its shard of gradients and optimizer states) → All-Gather (updated parameters to all GPUs)
```
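The last three pipeline steps can be strung together in a small simulation. This is a sketch under simplifying assumptions: plain SGD replaces Adam, the parameter count divides evenly by the number of ranks, and the final parameter sync is simulated as an all-gather of the updated shards. The function name `zero2_step` is hypothetical.

```python
def zero2_step(replica_params, per_rank_grads, lr):
    """One simulated ZeRO-2 style step with plain SGD; returns updated replicas."""
    world_size = len(per_rank_grads)
    n = len(replica_params)
    shard = n // world_size  # sketch assumes n divides evenly
    updated_shards = []
    for rank in range(world_size):
        lo, hi = rank * shard, (rank + 1) * shard
        # Reduce-scatter: this rank receives only its slice of the mean gradient.
        g_shard = [sum(g[i] for g in per_rank_grads) / world_size
                   for i in range(lo, hi)]
        # Local update: this rank updates only the slice it owns.
        updated_shards.append(
            [p - lr * g for p, g in zip(replica_params[lo:hi], g_shard)]
        )
    # Parameter sync: every rank rebuilds the full updated parameter vector.
    full = [p for s in updated_shards for p in s]
    return [list(full) for _ in range(world_size)]

params = [1.0, 1.0, 1.0, 1.0]
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]  # made-up local gradients
replicas = zero2_step(params, grads, lr=0.5)
```

The result is the same as an all-reduce followed by a full update on every rank, but no rank ever stores the complete averaged gradient.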

***Summary***
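- Gradients are partitioned using **reduce-scatter**.
- **Save:** optimizer + gradient memory.
- **Comm:** about the same volume as standard data parallel (a reduce-scatter plus an all-gather replaces the all-reduce).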

#### 🧠 Stage 3 - Shard parameters (everything)
##### Idea: shard model parameters

```
GPU0: gets θ[0–25%]
GPU1: gets θ[25–50%]
GPU2: gets θ[50–75%]
GPU3: gets θ[75–100%]
```

At any given moment, no GPU has the full model in memory.  
Instead, parameters are **gathered just-in-time** when a layer needs them and **released** afterward.
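The gather-then-release pattern can be sketched with a toy model. Everything here is illustrative: the two "layers" just scale the input by the sum of their weights, the shards are made-up lists, and `all_gather` is a local stand-in for the real collective.

```python
def all_gather(shards):
    """Simulate all-gather: concatenate every rank's shard into the full tensor."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full

# Hypothetical 2-layer model; each layer's weights are split across 2 ranks.
layer_shards = {
    "layer0": [[1.0, 2.0], [3.0, 4.0]],  # [rank 0 shard, rank 1 shard]
    "layer1": [[0.5, 0.5], [0.5, 0.5]],
}

def forward(x, layer_shards):
    peak_resident = 0
    for name, shards in layer_shards.items():
        weights = all_gather(shards)      # materialize this layer's full weights
        peak_resident = max(peak_resident, len(weights))
        x = sum(w * x for w in weights)   # toy "layer": scale x by the weight sum
        del weights                       # release the gathered copy
    return x, peak_resident

y, peak = forward(1.0, layer_shards)
total_params = sum(len(s) for shards in layer_shards.values() for s in shards)
print(y, peak, total_params)
```

The peak number of gathered weights equals one layer's size, not the whole model's, which is the point of gathering just-in-time.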

##### Implementation Detail: Parameter Flattening

To avoid managing millions of small tensors, ZeRO groups parameters into **flat contiguous memory chunks** called *parameter buckets* (or *flat buffers*).
Sharding the parameters on top of the gradients and optimizer states eliminates almost all duplication: memory per GPU for these states ≈ 1/N of the replicated total.
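As a rough illustration of flattening, the sketch below concatenates several small "tensors" into one flat list and cuts it into equal shards. The zero-padding of the tail and the function name are assumptions made for this example, not a description of DeepSpeed's internal layout.

```python
def flatten_and_partition(tensors, world_size):
    """Flatten a list of 1-D 'tensors' and split into equal, zero-padded shards."""
    flat = [x for t in tensors for x in t]
    numel = len(flat)
    shard_size = -(-numel // world_size)                # ceiling division
    flat += [0.0] * (shard_size * world_size - numel)   # pad so shards are equal
    shards = [flat[r * shard_size:(r + 1) * shard_size]
              for r in range(world_size)]
    return shards, numel

tensors = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # hypothetical weights and biases
shards, numel = flatten_and_partition(tensors, world_size=4)
print(shards)
```

With one flat buffer, a shard is just an offset and a length, so a collective can move it in a single call instead of one call per tensor.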

##### Training Pipeline for ZeRO-3
```
🏭 Forward (for each layer, GPUs all-gather the parameter shards required for that layer; each GPU reconstructs the full layer weights temporarily in memory, then frees them) → Backward (all-gather each layer's parameters again, compute gradients, free) → Reduce-Scatter (average + partition grads across GPUs) → Local Update (each GPU updates ONLY its shard of parameters and optimizer states; no broadcast is needed, since the next forward pass re-gathers parameters)
```

***Summary***
- Parameters themselves are partitioned and **gathered just-in-time** for compute.
- **Save:** optimizer + gradients + parameters.
- **Comm:** highest (about 1.5× standard data parallel); enables training models larger than a single GPU’s memory.

### Memory scaling (rule of thumb)
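| ZeRO Stage | What’s sharded | Per-GPU memory for model states (N GPUs, in model sizes) | Reduction vs. replicated (4×) |
|---|---|---|---|
| 1 | Optimizer states | 2 + 2/N | approaches 2× |
| 2 | + Gradients | 1 + 3/N | approaches 4× |
| 3 | + Parameters | 4/N | N× |

These figures use the accounting from Section 1 (params 1×, gradients 1×, Adam states 2×) and exclude activations and temporary buffers.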
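The scaling can be worked out from the Section 1 accounting (params 1×, gradients 1×, Adam states 2×, all in units of model size). The helper below is a hypothetical sketch that ignores activations and temporary buffers; stage 0 means plain replicated data parallel.

```python
def zero_state_per_gpu(stage, world_size, params=1.0, grads=1.0, optim=2.0):
    """Per-GPU model-state memory in units of model size, ignoring activations."""
    if stage >= 1:
        optim /= world_size   # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= world_size   # Stage 2: also shard gradients
    if stage >= 3:
        params /= world_size  # Stage 3: also shard parameters
    return params + grads + optim

for n in (4, 8, 64):
    row = [zero_state_per_gpu(stage, n) for stage in range(4)]
    print(f"N={n:3d}", [round(x, 3) for x in row])
```

Stages 1 and 2 level off as N grows because the replicated parts remain, while Stage 3 keeps shrinking with every added GPU.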

### Illustration (conceptual)
```
Naive DP (replicated):   [full][full][full]
ZeRO-1 (opt sharded):    [P,G, O₁][P,G, O₂][P,G, O₃]
ZeRO-2 (opt+grad shard): [P, G₁,O₁][P, G₂,O₂][P, G₃,O₃]
ZeRO-3 (all sharded):    [P₁,G₁,O₁][P₂,G₂,O₂][P₃,G₃,O₃]
P=params, G=grads, O=optimizer shards
```
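In practice the stage is selected in the DeepSpeed configuration. The sketch below builds a minimal config as a Python dict and serializes it to JSON. The key names follow DeepSpeed's config schema as I understand it; the batch size, learning rate, and chosen stage are placeholder values, not recommendations.

```python
import json

# Placeholder values for illustration; adjust for a real training run.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},  # 1, 2, or 3, matching the stages above
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)
```

A dict like this is typically handed to `deepspeed.initialize` through its `config` argument, or saved as a JSON file and referenced from the launch command; check the DeepSpeed documentation for the options your version supports.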
