# Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

**Focus**: Compare against the 2019 Megatron paper to better understand what improved from 2019 -> 2021.

**References**: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021) (https://arxiv.org/pdf/2104.04473)

**Purpose**: 
- LLMs have led to SOTA accuracies across several tasks, but training these models efficiently is challenging because a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server and b) the number of compute operations required can result in unrealistically long training times.
- Naive usage of even methods like tensor and pipeline parallelism leads to scaling issues at thousands of GPUs.
- Core question: "How should parallelism techniques be combined to maximize the training throughput of large models given a batch size while retaining strict optimizer semantics?"

**Approach**: 
- Parallelism is required to keep training times practical; otherwise we'd have unrealistic training times on the order of hundreds of years.
- They propose a novel interleaved pipelining schedule that can improve throughput by 10+% with a memory footprint comparable to existing approaches.
- They show how to combine pipeline, tensor, and data parallelism (PTD-P) to train LLMs with 52% MFU (52% of peak device throughput) on 1000s of GPUs.
    - Uses PP across multi-GPU servers (so the TP all-reduce never has to cross the slower inter-server links), TP within a multi-GPU server (to take advantage of NVLink), and DP as normal.
    - Result: it becomes practical to train models with a trillion parameters with graceful scaling, given an optimized cluster environment with high-bandwidth links between GPUs on the same server and across servers.
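
A minimal sketch of the PTD-P layout idea above, not Megatron-LM's actual API. The function names, the rank ordering (tensor index varies fastest, then data, then pipeline), and the sizes `t=8, p=64, d=6` are my assumptions: the sizes are picked so that `t` equals the 8 GPUs in a server and `t * p * d = 3072` (I believe this matches the 1T config, but check the paper's Table 1). The point is only that with this ordering every tensor-parallel group sits inside a single server, so its all-reduce stays on NVLink.

```python
from collections import defaultdict

GPUS_PER_NODE = 8  # assumption: 8-GPU servers


def rank_to_coord(rank, t, p, d):
    """Map a global rank to (pipeline stage, data replica, tensor shard).

    Assumed ordering: tensor index varies fastest, then data, then pipeline.
    """
    if not 0 <= rank < t * p * d:
        raise ValueError("rank out of range")
    tp = rank % t
    dp = (rank // t) % d
    pp = rank // (t * d)
    return pp, dp, tp


def tensor_groups(t, p, d):
    """Group ranks that hold shards of the same layers on the same data replica."""
    groups = defaultdict(list)
    for rank in range(t * p * d):
        pp, dp, _ = rank_to_coord(rank, t, p, d)
        groups[(pp, dp)].append(rank)
    return dict(groups)


t, p, d = 8, 64, 6  # illustrative sizes, see the note above
groups = tensor_groups(t, p, d)

# Every tensor-parallel group should live on exactly one node.
single_node = all(
    len({rank // GPUS_PER_NODE for rank in ranks}) == 1
    for ranks in groups.values()
)
print("total GPUs:", t * p * d)
print("tensor-parallel groups:", len(groups))
print("each TP group on one node:", single_node)
```

If `t` were larger than `GPUS_PER_NODE`, the same check would fail, which is the multi-node TP case the Notes section describes as slow.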

**Result**: 
- This approach allows us to perform training iterations on a model with 1T parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak; MFU?)
    - Note: This is significant because the 2019 paper had ~30-35% MFU.
- They demonstrate close to linear scaling to 3072 A100 GPUs: E2E training throughput of 163 teraFLOP/s per GPU (incl. communication, data processing, and optimization) and an aggregate throughput of 502 petaFLOP/s on a GPT model with 1T parameters using mixed precision.
    - Mixed precision reduces precision for computational savings. I believe the flow is FP32 -> FP16 matmul -> FP32, i.e. parameters are kept in FP32 to avoid accumulation error, the forward pass casts weights to FP16 and activations are in FP16, gradients are computed in FP16/BF16, then they are accumulated in FP32 and the weights are updated in FP32.
- This approach outperforms ZeRO because PP and TP help with multi-server, multi-GPU training throughput.
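
A quick sanity check of the numbers above, as a sketch. The only value not taken from these notes is the A100 peak of 312 teraFLOP/s (dense FP16/BF16 tensor-core throughput), which I'm assuming is the "theoretical peak" the 52% refers to; the helper names are made up for illustration.

```python
A100_PEAK_TFLOPS = 312.0  # assumption: A100 dense FP16/BF16 tensor-core peak
PER_GPU_TFLOPS = 163.0    # reported end-to-end per-GPU throughput
NUM_GPUS = 3072


def fraction_of_peak(achieved_tflops, peak_tflops):
    """Achieved throughput as a fraction of the device's theoretical peak."""
    return achieved_tflops / peak_tflops


def aggregate_pflops(per_gpu_tflops, num_gpus):
    """Cluster-wide throughput in petaFLOP/s (1 petaFLOP/s = 1000 teraFLOP/s)."""
    return per_gpu_tflops * num_gpus / 1000.0


print(f"{fraction_of_peak(PER_GPU_TFLOPS, A100_PEAK_TFLOPS):.1%}")  # 52.2%
print(f"{aggregate_pflops(PER_GPU_TFLOPS, NUM_GPUS):.1f} petaFLOP/s")  # 500.7 petaFLOP/s
```

The aggregate comes out slightly under the reported 502 petaFLOP/s, which is consistent with 163 being a rounded per-GPU figure.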

**Definitions**: 

**Notes**:
- Data parallelism works well but suffers from two limits: a) beyond a point, the per-GPU batch size becomes too small (and the global batch size can't be grown indefinitely to compensate), which reduces GPU utilization and increases communication cost, and b) the max number of devices that can be used is the batch size, limiting the # of accelerators used during training.
- Tensor (intra-layer) model parallelism works well for models up to 20B parameters on A100 servers (8 80GB A100 GPUs), but breaks down for larger models. Larger models need to be split across multiple nodes, which means a) all-reduce communication has to go through inter-server links, which are slower than the high-bandwidth NVLink available within a multi-GPU server/node, and b) a high degree of model parallelism can create small matmuls (GEMMs, General Matrix-Matrix multiplications), potentially decreasing GPU utilization.
- Pipeline model parallelism splits the layers of a model over multiple GPUs. A batch is split into smaller microbatches (to shrink pipeline bubbles), and execution is pipelined across these microbatches.
    - To preserve strict optimizer semantics, optimizer steps need to be synchronized across devices, leading to a pipeline flush at the end of every batch.
    - Q: Does this mean that the entire batch needs to be computed and contribute to the optimizer step before a new microbatch is injected and processed? Otherwise more memory would be taken or an incomplete optimizer step / gradient and activation computation would occur?
        - A: This means that all microbatches in a batch must contribute their gradients to the same optimizer step and use the same weights. So with grad accumulation, you compute forward + backward for all microbatches in the batch, sum the gradients, then do one optimizer step. You can't process the next training batch's microbatches until the backward pass of the current batch's last microbatch has finished on all stages (relevant for pre-training). That wait is the flush.
            - So "flush" = stop injecting new microbatches once the current batch's microbatches are all in, let the pipeline "drain" (aka the remaining microbatches finish their forward + backward), synchronize gradients, do the optimizer step, start the next batch.
            - If you just kept injecting new microbatches forever, you'd hit 1) a memory problem, where you'd have to keep activations for every microbatch whose backward hasn't run yet -> memory blows up, and 2) a semantic problem, where microbatches would use partially updated weights if the optimizer steps happened mid-stream, or stale weights under asynchronous pipelining (PipeDream-style?).
            - tl;dr: the crux of the pipeline flush problem is that one batch needs to complete its optimizer step before the next batch can be processed.
        - "The larger the ratio of number of microbatches to the pipeline size, the smaller the time spent in the pipeline flush." The reason is that the idle time is roughly fixed per batch: the pipeline has to fill (the first microbatch's forward has to travel through all p stages before every stage is busy) and drain (the last microbatch's backward has to travel back through all p stages), which costs each device about p - 1 microbatch-times of idling. With more microbatches m per batch, that fixed cost is amortized over more compute, so a larger m/p ratio means more time computing relative to waiting in the flush.
    - This work: they introduce a new pipeline schedule that improves efficiency at small batch sizes, so you're not forced to use larger batch sizes. Large batch sizes are undesirable mainly because they can increase memory use and convergence risk.
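
The microbatch-to-pipeline-size ratio quoted above can be made concrete with the paper's bubble formula: bubble time divided by ideal compute time is `(p - 1) / m` for the default schedule, and `(p - 1) / (v * m)` for the interleaved schedule. Here `p` is the number of pipeline stages, `m` the number of microbatches per batch, and `v` the number of model chunks per device. A small sketch (the function name is mine, and it assumes every microbatch takes the same forward and backward time on every stage):

```python
def bubble_fraction(p, m, v=1):
    """Pipeline bubble time relative to ideal compute time.

    p: pipeline stages, m: microbatches per batch,
    v: model chunks per device (v=1 is the non-interleaved schedule).
    Note this is bubble / ideal time, not bubble / total time.
    """
    if p < 1 or m < 1 or v < 1:
        raise ValueError("p, m, and v must be positive integers")
    return (p - 1) / (v * m)


p = 8
for m in (8, 64):
    print(f"p={p}, m={m}: bubble fraction = {bubble_fraction(p, m)}")
print(f"p={p}, m=8, v=2: bubble fraction = {bubble_fraction(p, 8, v=2)}")
```

With `p = 8`, going from `m = 8` to `m = 64` drops the bubble from 0.875 to 0.109375 of the ideal compute time, while interleaving with `v = 2` at `m = 8` halves it to 0.4375 without growing the batch. That is the "efficient at small batch sizes" point, at the price of more pipeline communication.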



**FAQs**:

**Action items**:
- I mainly just read through the Abstract, Intro, and Conclusion (pass 1). While the first pass should've taken 5-10 mins, it really took 1 hour to carefully read and understand these sections. Based on the first pass, the rest of the paper is worth a deeper read to better understand the tradeoffs.
- I probably need to read GPipe because it seems like a precursor to this work on pipeline model parallelism.