# Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

**Focus**: 
- Continue from 3.4 Microbatch Size
- Implement Ring All-Reduce

**References**: 
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM: https://arxiv.org/pdf/2104.04473
- Nvidia/Docs/Using NCCL (Nvidia Collective Communications Library)/Operations: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html#reducescatter

**Purpose**: 

**Approach**: 

**Result**: 


**Definitions**: 
- microbatch size $b$ = number of examples per data parallel rank (https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/batching.html). If data parallel size $d = 1$, the number of microbatches in a batch per pipeline is $m=\frac{B}{b}$, where $B$ is the global batch size.
- global batch size = micro_batch_size * data_parallel_size * gradient_accumulation_steps
- gradient accumulation = parameter that supports training with large batch sizes while maintaining a fixed memory footprint, though it requires additional compute.
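
The batch-size definitions above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names and the example values are made up for illustration and are not taken from Megatron-LM or NeMo.

```python
def global_batch_size(micro_batch_size: int, data_parallel_size: int,
                      gradient_accumulation_steps: int) -> int:
    """Global batch size B, following the definition above."""
    return micro_batch_size * data_parallel_size * gradient_accumulation_steps


def num_microbatches(global_batch: int, micro_batch_size: int,
                     data_parallel_size: int = 1) -> int:
    """Microbatches per pipeline (per model replica): m = B / (d * b)."""
    per_step = micro_batch_size * data_parallel_size
    if global_batch % per_step != 0:
        raise ValueError("global batch must be divisible by b * d")
    return global_batch // per_step


# Illustrative values only (assumptions, not from the paper).
B = global_batch_size(micro_batch_size=4, data_parallel_size=8,
                      gradient_accumulation_steps=16)
m = num_microbatches(B, micro_batch_size=4, data_parallel_size=8)
print(B, m)  # 512 16
```

With $d = 1$ the second function reduces to $m = B/b$, matching the microbatch definition.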

**Notes**:

3.4 Microbatch Size
- Q: How does microbatch size affect memory?
    - A: Increasing microbatch size will increase memory usage because you'll have more activations to store at a given time. 
- Seems to me that microbatching $\approx$ gradient accumulation (gradients from each microbatch are summed before a single optimizer step), but with PP the microbatches also help keep the pipeline stages busy
- Q: Why would increasing microbatch size $b$ increase per-GPU throughput by up to 1.3x?
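    - A: A larger microbatch processes more tokens per forward/backward pass, which raises the arithmetic intensity of the operations (larger matrix multiplications use the GPU more efficiently). It also means fewer microbatches have to be processed per GPU to cover the same global batch size.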
- Q: How are microbatches scheduled?
    - A: If you're not doing PP, just DP + grad accumulation, for each microbatch: 1) run forward to compute loss for that microbatch, 2) run backward to compute gradients for that microbatch and add them to the gradient buffers, 3) release activations. 
        - If you're doing PP with the GPipe schedule ("all-forward-all-backward"): 1) run the forward pass for all $m$ microbatches, stage by stage, 2) once all forwards are done, start the backward passes. This means you need to store the intermediate activations for all $m$ microbatches until backward begins (very memory-hungry).
        - (Megatron) 1F1B: as soon as the first microbatch finishes its forward on the last stage, start its backward pass, then alternate between a forward on a new microbatch and a backward on an older microbatch. This allows us to release activations earlier.
            - Note: nothing per-microbatch has to be kept once its backward pass finishes. Its activations are released, and its gradient contributions are accumulated directly into each parameter's gradient buffer (which is zeroed out after each training step).
- optimal microbatch size $b$ depends on the throughput and memory footprint characteristics of the model and pipeline depth $p$, data-parallel size $d$, and batch size $B$.
- total time spent computing a batch (ignoring communication cost) = $(b'/b + p - 1) \cdot (t_f(b) + t_b(b))$
    - $t_f(b)$ and $t_b(b)$ = forward and backward computation times for a single microbatch given microbatch size
    - $(t_f(b) + t_b(b))$ = computation time for a full forward and backward for a single microbatch given microbatch size
    - $b' = B/d$, global batch size / data parallel size = model replica batch size
    - $b$ = microbatch size
    - $b'/b$ = # of microbatches per replica (m)
    - $p$ = pipeline parallel size
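
The trade-off in the batch-time formula above can be explored numerically. The sketch below evaluates the formula for several microbatch sizes. The timing functions `t_f` and `t_b` are a hypothetical toy model (fixed overhead plus a per-example cost), not measurements, and the values of $b'$ and $p$ are made up.

```python
def batch_time(b_prime: int, b: int, p: int, t_f, t_b) -> float:
    """(b'/b + p - 1) * (t_f(b) + t_b(b)), with communication ignored."""
    if b_prime % b != 0:
        raise ValueError("b must divide b'")
    m = b_prime // b  # number of microbatches per replica
    return (m + p - 1) * (t_f(b) + t_b(b))


# Hypothetical timing model (NOT measured): a fixed per-microbatch overhead
# plus a per-example cost, so larger microbatches amortize the overhead.
def t_f(b: int) -> float:
    return 2.0 + 1.0 * b


def t_b(b: int) -> float:
    return 2.0 * t_f(b)  # assumption: backward costs about twice the forward


b_prime, p = 64, 8  # made-up replica batch size and pipeline depth
times = {b: batch_time(b_prime, b, p, t_f, t_b)
         for b in (1, 2, 4, 8, 16, 32, 64)}
best_b = min(times, key=times.get)

for b, t in times.items():
    print(f"b={b:2d}  m={b_prime // b:2d}  time={t:7.1f}")
print("best b under this toy model:", best_b)
```

Under this toy model the sweep picks $b = 4$. Smaller $b$ pays the fixed overhead too many times, while larger $b$ leaves too few microbatches to hide the $p - 1$ extra slots. In practice $t_f(b)$ and $t_b(b)$ would be measured empirically.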

3.5 Activation Recomputation
- Activation recomputation = optional technique that trades increased compute for decreased memory by running the forward pass a second time just before the backward pass, stashing only the input activations for a given pipeline stage instead of the entire set of intermediate activations. It is required to train reasonably large models with PP while keeping the memory footprint acceptably low.
- For most cases, checkpointing every 1 or 2 transformer layers is optimal to minimize the total memory footprint (generally $A^{\text{intermediate}} > A^{\text{input}}$).
- $l$ = # of layers in a model stage
- $c$ = # of checkpoints in a model stage
- $c \cdot A^{\text{input}} + \frac{l}{c} \cdot A^{\text{intermediate}}$ = total memory footprint for the model stage
- Note: this is computed per model stage because memory bottlenecks are *per device*, so that's why we care about memory footprint on a *stage*, not globally.
- The rule is simple to apply when checkpointing every 1 or 2 transformer layers is optimal: save the input activations of each (or every other) transformer layer of the stage, and then recompute the intermediate activations during the backward pass.
    - When checkpointing every transformer layer, $c = l$ and the total memory in the model stage = $l \cdot A^{\text{input}} + A^{\text{intermediate}}$. The footprint expression is minimized at $c = \sqrt{l \cdot A^{\text{intermediate}} / A^{\text{input}}}$, so $c = l$ being optimal suggests that typically in transformers, $A^{\text{intermediate}} \approx l \cdot A^{\text{input}}$
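
The footprint expression above can be evaluated for every candidate $c$ to see where the minimum falls. Setting the derivative with respect to $c$ to zero gives $c = \sqrt{l \cdot A^{\text{intermediate}} / A^{\text{input}}}$. The sketch below compares that continuous optimum with a brute-force search. The activation sizes are made-up values in arbitrary units, chosen so that $A^{\text{intermediate}} = l \cdot A^{\text{input}}$.

```python
import math


def stage_activation_memory(c: int, l: int, a_input: float,
                            a_intermediate: float) -> float:
    """c * A_input + (l / c) * A_intermediate for one model stage."""
    return c * a_input + (l / c) * a_intermediate


def continuous_optimum(l: int, a_input: float, a_intermediate: float) -> float:
    """Minimizer over real-valued c: sqrt(l * A_intermediate / A_input)."""
    return math.sqrt(l * a_intermediate / a_input)


# Made-up sizes (assumptions); only the ratio A_intermediate / A_input matters.
l, a_input, a_intermediate = 8, 1.0, 8.0

footprints = {c: stage_activation_memory(c, l, a_input, a_intermediate)
              for c in range(1, l + 1)}
best_c = min(footprints, key=footprints.get)

print(footprints)
print("best integer c:", best_c)
print("continuous optimum:", continuous_optimum(l, a_input, a_intermediate))
```

With these numbers both the search and the closed form land on $c = l = 8$, that is, one checkpoint per transformer layer.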

**FAQs**:
- Q1: I did not fully understand Equation 1 in 3.4 Microbatch Size.
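
One possible reading of Equation 1, as a sketch: write $m = b'/b$ for the number of microbatches per replica. With $p$ pipeline stages, a batch takes $m + p - 1$ microbatch-sized time slots, which splits into useful compute plus the pipeline bubble:

$$
(m + p - 1) \cdot (t_f(b) + t_b(b)) = \underbrace{m \cdot (t_f(b) + t_b(b))}_{\text{ideal compute time}} + \underbrace{(p - 1) \cdot (t_f(b) + t_b(b))}_{\text{pipeline bubble}}
$$

Increasing $b$ shrinks $m$ but makes each slot longer, so the bubble takes up a larger share of the total, while the per-example cost inside $t_f(b)$ and $t_b(b)$ may improve. The best $b$ balances the two effects.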

**Action items**:
- Continue reading from 4. Implementation