# Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

**Focus**: 

**References**: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (2021) (https://arxiv.org/pdf/2104.04473)

**Purpose**: 
- LLMs have led to SOTA accuracies across several tasks, but training these models efficiently is challenging because a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server and b) the number of compute operations required can result in unrealistically long training times.
- Naive usage of even methods like tensor and pipeline parallelism leads to scaling issues at thousands of GPUs.
- Core question: "How should parallelism techniques be combined to maximize the training throughput of large models given a batch size while retaining strict optimizer semantics?"

**Approach**: 
- Parallelism is required to keep training times practical; otherwise we'd have unrealistic training times on the order of hundreds of years.
- They propose a novel interleaved pipelining schedule that can improve throughput by 10+% with a memory footprint comparable to existing approaches.
- They show how to combine pipeline, tensor, and data parallelism (PTD-P) to train LLMs at 52% MFU (52% of peak device throughput) on 1000s of GPUs.
    - TP is kept within a multi-GPU server (to take advantage of NVLink, so the all-reduce in TP is not a bottleneck), PP is used across multi-GPU servers, and DP is applied as normal on top. This makes it practical to train models with a trillion parameters with graceful scaling, given an optimized cluster environment with high-bandwidth links between GPUs on the same server and across servers.
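
To make the "TP inside a server, PP across servers, DP on top" layout concrete, here is a minimal sketch of one possible rank-to-coordinate mapping. The function name `rank_to_coord`, the ordering (tensor index varies fastest), the 8-GPU server size, and the small `t, p, d` values are all assumptions for illustration, not the paper's or Megatron-LM's actual code.

```python
from typing import NamedTuple


class Coord(NamedTuple):
    data: int      # data-parallel replica index
    pipeline: int  # pipeline stage index
    tensor: int    # tensor-parallel shard index


def rank_to_coord(rank: int, t: int, p: int, d: int) -> Coord:
    """Map a global rank to (data, pipeline, tensor) coordinates.

    Assumed layout: the tensor index varies fastest, so the t ranks of one
    tensor-parallel group are consecutive and can share a server.
    """
    if not 0 <= rank < t * p * d:
        raise ValueError("rank out of range")
    return Coord(data=rank // (t * p), pipeline=(rank // t) % p, tensor=rank % t)


GPUS_PER_SERVER = 8   # assumption: 8 GPUs per server
t, p, d = 8, 4, 2     # hypothetical small configuration
world_size = t * p * d

# Collect the set of servers touched by each tensor-parallel group.
servers_per_tp_group = {}
for rank in range(world_size):
    c = rank_to_coord(rank, t, p, d)
    servers_per_tp_group.setdefault((c.data, c.pipeline), set()).add(rank // GPUS_PER_SERVER)

# With t equal to the server size, every TP group stays on a single server.
tp_groups_on_one_server = all(len(s) == 1 for s in servers_per_tp_group.values())
```

With this ordering, consecutive pipeline stages land on different servers, so only point-to-point pipeline traffic and data-parallel gradient reductions cross the inter-server links.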

**Result**: 
- This approach allows us to perform training iterations on a model with 1T parameters at 502 pFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak - MFU?)
    - Note: This is significant because the 2019 paper had ~30-35% MFU.
- They demonstrate close to linear scaling to 3072 A100 GPUs, with an E2E training throughput of 163 tFLOP/s per GPU (incl. communication, data processing, and optimization) and an aggregate throughput of 502 pFLOP/s on a GPT model with 1T parameters using mixed precision.
    - Mixed precision reduces precision for computational savings. I believe the flow is FP32 -> FP16 matmul -> FP32, i.e. parameters are kept in FP32 to avoid accumulation error, the forward pass casts weights to FP16 and activations are in FP16, gradients are computed in FP16/BF16, and they are then accumulated in FP32 so the weight update happens in FP32.
- This approach outperforms ZeRO because combining PP and TP helps multi-server, multi-GPU training throughput.
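
The headline numbers above can be sanity-checked with a few lines of arithmetic. This is a rough sketch; the A100 peak of 312 tFLOP/s (dense FP16/BF16 tensor-core throughput) is an assumption I am supplying, and the helper names are made up for this note.

```python
A100_PEAK_TFLOPS = 312.0  # assumption: dense FP16/BF16 tensor-core peak of one A100
PER_GPU_TFLOPS = 163.0    # achieved per-GPU throughput from the notes above
NUM_GPUS = 3072


def fraction_of_peak(achieved_tflops: float, peak_tflops: float) -> float:
    """Achieved throughput as a fraction of the device's theoretical peak."""
    return achieved_tflops / peak_tflops


def aggregate_pflops(per_gpu_tflops: float, num_gpus: int) -> float:
    """Cluster-wide throughput in pFLOP/s (1 pFLOP/s = 1000 tFLOP/s)."""
    return per_gpu_tflops * num_gpus / 1000.0


frac = fraction_of_peak(PER_GPU_TFLOPS, A100_PEAK_TFLOPS)  # about 0.52
agg = aggregate_pflops(PER_GPU_TFLOPS, NUM_GPUS)           # about 501 pFLOP/s
```

The aggregate comes out at roughly 501 pFLOP/s rather than exactly 502, presumably because the per-GPU figure is rounded.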

**Definitions**: 
- PTD-P = Pipeline Tensor Data Parallelism (combining all 3)
- Data Parallelism = each worker has a copy of the full model. The input dataset is sharded, and workers aggregate their gradients periodically to ensure all workers have a consistent version of the weights. For models that don't fit on one worker, it can be used on smaller model shards.
- Pipeline Parallelism = layers of a model are sharded across multiple devices. Each batch is split into microbatches, and execution is pipelined across those microbatches.
    - Periodic pipeline flushes are retained so that optimizer steps are synchronized across devices. At the start and end of every batch, devices are idle (so the weights don't change while the batch is being processed).
    - Idle time = pipeline bubble -> want to make it as small as possible. Other (asynchronous) approaches do away with flushes completely, but relax weight update semantics.
- Tensor Model Parallelism - individual layers of the model are partitioned over multiple devices (Figure 5a).
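    - In the MLP block, the first weight matrix is split column-wise and the second row-wise. The row-wise split lets the second GEMM consume the column-sharded output of the first directly, which removes the need for an all-gather before it.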
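
A small pure-Python sketch of the tensor-parallel MLP described above, checking that the column-then-row split reproduces the unsharded result with a single all-reduce at the end. The helper names, the tiny shapes, and the simulated `all_reduce_sum` (a plain sum over per-device partial outputs) are assumptions for illustration; a real implementation would use GPU tensors and a collective library.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Elementwise GeLU, so it can be applied to each column shard independently."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def split_cols(m, parts):
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Simulated all-reduce: elementwise sum of same-shaped per-device matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def mlp_reference(x, a, b):
    return matmul(gelu(matmul(x, a)), b)


def mlp_tensor_parallel(x, a, b, t):
    a_shards = split_cols(a, t)  # first GEMM: weight split column-wise
    b_shards = split_rows(b, t)  # second GEMM: weight split row-wise
    partials = []
    for a_i, b_i in zip(a_shards, b_shards):
        y_i = gelu(matmul(x, a_i))         # column slice of Y, no communication needed
        partials.append(matmul(y_i, b_i))  # full-shape output, but only a partial sum
    return all_reduce_sum(partials)        # the one communication step in the forward pass


random.seed(0)
batch, hidden, ffn, t = 2, 4, 8, 2  # hypothetical tiny sizes
x = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(batch)]
a = [[random.uniform(-1, 1) for _ in range(ffn)] for _ in range(hidden)]
b = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(ffn)]

ref = mlp_reference(x, a, b)
tp = mlp_tensor_parallel(x, a, b, t)
max_err = max(abs(u - v) for r1, r2 in zip(ref, tp) for u, v in zip(r1, r2))
```

The key point is that the nonlinearity is elementwise, so it commutes with the column split. If the first matrix were split row-wise instead, the partial sums would have to be reduced before applying GeLU.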

**Notes**:
- They suggest heuristics that they find to work well in practice, and they do not automatically explore the search space of parallelism strategies.
- Pipeline Parallelism Scheduling
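    1. Default schedule - GPipe proposes a schedule where forward passes for all microbatches are executed first, followed by backward passes for all microbatches (Fig 3).
    - Bubble time fraction (pipeline bubble size) = t_{pb} / t_{id} = (p - 1) / m, where p is the number of pipeline stages and m is the number of microbatches in a batch. This means you can shrink the bubble by decreasing the number of pipeline stages or increasing the number of microbatches (m >> p). However, with this schedule a larger m increases activation memory, since stashed activations are kept for all m microbatches.
    2. PipeDream-Flush - after a warm-up phase, each worker alternates one forward pass and one backward pass (1F1B), which limits the number of in-flight microbatches to the pipeline depth p instead of m; the bubble fraction is unchanged. If I ever need to optimize my own schedules, this paper is worth reading more closely for ideas on how to do so.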
- Tensor model parallelism
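
A quick calculator for the pipeline bubble formula from the scheduling notes above. The base formula (p - 1) / m is from the notes; the division by v (number of model chunks per device in the interleaved schedule) and the in-flight microbatch counts follow my reading of the paper, and the function names are made up for this sketch.

```python
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Bubble time over ideal compute time: (p - 1) / (v * m).

    p: number of pipeline stages, m: number of microbatches per batch,
    v: model chunks per device (v = 1 means a non-interleaved schedule).
    """
    if p < 1 or m < 1 or v < 1:
        raise ValueError("p, m and v must be positive integers")
    return (p - 1) / (v * m)


def max_in_flight_microbatches(p: int, m: int, schedule: str) -> int:
    """Upper bound on microbatches whose activations are stashed at once."""
    if schedule == "gpipe":
        return m          # all forward passes run before any backward pass
    if schedule == "1f1b":
        return min(p, m)  # PipeDream-Flush caps this at the pipeline depth
    raise ValueError(f"unknown schedule: {schedule}")


# Hypothetical configuration: 8 stages, 64 microbatches.
plain = bubble_fraction(p=8, m=64)             # 7 / 64
interleaved = bubble_fraction(p=8, m=64, v=2)  # 7 / 128
gpipe_stash = max_in_flight_microbatches(8, 64, "gpipe")
one_f_one_b_stash = max_in_flight_microbatches(8, 64, "1f1b")
```

This makes the trade-off visible: raising m shrinks the bubble for both schedules, but only the GPipe-style schedule pays for it with activation memory proportional to m.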

**FAQs**:
- Q: Backward pass takes 2x longer than forward pass - why?
    - A: The backward pass requires 1) gradients w.r.t. inputs and 2) gradients w.r.t. weights, so 2 matmuls per operation, while the forward pass only requires 1 matmul.
- Q: Somewhat confused again on why we need a col -> row wise split for TP. 
    - A: A col-wise weight matrix split takes the full input and gives each device a disjoint column slice of the output. A row-wise weight matrix split gives every device an output of the full shape, but only a partial sum, so we then all-reduce.
    - The true answer is that multiplying by a row-sharded weight matrix ALWAYS requires the input matrix to be sharded col-wise, and after a col-sharded weight matrix multiply the output matrix is already sharded col-wise, so we can just use it as is.
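
For the first question above (backward taking about 2x the forward time), here is a minimal sketch that counts matmul FLOPs for a single linear layer Y = XW. The shapes are hypothetical, and a multiply-add is counted as 2 FLOPs, which is a common convention I am assuming here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul_flops(n: int, k: int, m: int) -> int:
    """FLOPs for an (n x k) @ (k x m) product, counting multiply-add as 2."""
    return 2 * n * k * m


def linear_backward(x, w, dy):
    dx = matmul(dy, transpose(w))  # gradient w.r.t. inputs:  dY @ W^T
    dw = matmul(transpose(x), dy)  # gradient w.r.t. weights: X^T @ dY
    return dx, dw


b, h_in, h_out = 4, 8, 16  # hypothetical batch and layer sizes
x = [[float(i + j) for j in range(h_in)] for i in range(b)]
w = [[float(i - j) for j in range(h_out)] for i in range(h_in)]
dy = [[1.0 for _ in range(h_out)] for _ in range(b)]

y = matmul(x, w)                   # forward: one matmul
dx, dw = linear_backward(x, w, dy) # backward: two matmuls

fwd_flops = matmul_flops(b, h_in, h_out)
bwd_flops = matmul_flops(b, h_out, h_in) + matmul_flops(h_in, b, h_out)
ratio = bwd_flops / fwd_flops
```

Both backward products have the same FLOP count as the forward product, so the ratio is exactly 2 for the matmul portion, regardless of the shapes chosen.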

**Action items**:
- Start from Section 3: Performance Analysis of Parallelization Configurations