# Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

**Focus**: Continue reading from "3.3 Data and Model Parallelism"
- Where did this come from: "communication time for a ring-based implementation scales with $\frac{d-1}{d}$"

**References**: 
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM: https://arxiv.org/pdf/2104.04473
- Nvidia/Docs/Using NCCL (Nvidia Collective Communications Library)/Operations: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html#reducescatter

**Purpose**: 

**Approach**: 

**Result**: 


**Definitions**: 

**Notes**:

3.3 Data and Model Parallelism
- Q1: Where did this come from: "communication time for a ring-based implementation scales with $\frac{d-1}{d}$"
    - A1: Ring all-reduce with $d$ GPUs runs in two stages, 1) reduce-scatter and 2) all-gather, and each stage takes $d-1$ steps. In every step each GPU sends and receives one chunk of size $\frac{N}{d}$, where $N$ is the total data size. So per stage each GPU communicates $(d-1) \cdot \frac{N}{d} = \frac{d-1}{d} \cdot N$, and $2 \cdot \frac{d-1}{d} \cdot N$ over the full all-reduce. For fixed $N$, communication time therefore scales with $\frac{d-1}{d}$, which approaches 1 as $d$ grows. This effectively means the bandwidth cost of data parallelism is nearly constant in $d$ (the latency term still grows with the $d-1$ steps per stage).
- Increasing batch size $B$ increases throughput in two ways: it shrinks the pipeline bubble (more microbatches per pipeline), and it makes the all-reduce required by data parallelism less frequent relative to compute (one all-reduce is amortized over more samples).
    - Note: this means we should generally increase batch size because there's no real added communication cost, but the key constraint on increasing batch size is memory (fitting activations + gradients in GPU memory; grad accum grows with $B$)
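
The $\frac{d-1}{d}$ factor from A1 can be checked with a small single-process simulation. This is a minimal sketch, not NCCL's implementation: `ring_all_reduce` is a hypothetical helper, ranks are plain Python lists, and it assumes $N$ is divisible by $d$. It counts how many elements each rank sends and compares that with $2 \cdot \frac{d-1}{d} \cdot N$.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over d ranks arranged in a ring.

    buffers: list of d equal-length lists (one per rank).
    Returns (reduced buffers, number of elements sent by each rank).
    """
    d = len(buffers)
    n = len(buffers[0])
    assert n % d == 0, "sketch assumes N is divisible by d"
    chunk = n // d
    data = [list(b) for b in buffers]  # copy so the inputs are not mutated
    sent_per_rank = 0

    def span(c):
        # Index range covered by chunk c.
        return slice(c * chunk, (c + 1) * chunk)

    # Stage 1, reduce-scatter: d-1 steps. In step s, rank r sends chunk
    # (r - s) mod d to rank r+1, which adds it into its own copy.
    for s in range(d - 1):
        # Snapshot outgoing chunks so all sends in a step are simultaneous.
        outgoing = [data[r][span((r - s) % d)] for r in range(d)]
        for r in range(d):
            dst, seg = (r + 1) % d, span((r - s) % d)
            data[dst][seg] = [x + y for x, y in zip(data[dst][seg], outgoing[r])]
        sent_per_rank += chunk

    # Rank r now holds the fully reduced chunk (r + 1) mod d.
    # Stage 2, all-gather: d-1 steps. In step s, rank r forwards chunk
    # (r + 1 - s) mod d to rank r+1, which overwrites its copy.
    for s in range(d - 1):
        outgoing = [data[r][span((r + 1 - s) % d)] for r in range(d)]
        for r in range(d):
            dst, seg = (r + 1) % d, span((r + 1 - s) % d)
            data[dst][seg] = outgoing[r]
        sent_per_rank += chunk

    return data, sent_per_rank


d, N = 4, 8
inputs = [[float(r + 1)] * N for r in range(d)]
reduced, sent = ring_all_reduce(inputs)
print(reduced[0])                    # every rank ends with the elementwise sum
print(sent, 2 * (d - 1) * N // d)    # measured vs. 2 * (d-1)/d * N
```

Each rank sends one chunk per step regardless of $d$, which is why the per-rank volume stays bounded by $2N$ as more GPUs join the ring.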

3.3.2 Data and Tensor Model Parallelism
- Biggest problem is that all-reduce needs to be performed for every microbatch (attn + ff layers end with an all-reduce in the forward AND backward pass). Primarily expensive across multi-GPU servers where the networking is slower.
- Data parallelism performs all-reduce once per batch (at the end of the backward pass, over the gradients of the entire model)
    - Note: DP communication volume per all-reduce is higher (gradients for all parameters) but less frequent (once per batch), whereas TP communication volume per all-reduce is smaller (partial activations in the forward pass, activation gradients in the backward pass) but much more frequent (every layer, every microbatch)
- Generally less efficient too because each TP rank performs only a subset of the computation in each model layer, so if the layers are small but still split, the per-GPU matmuls may be too small to run at peak efficiency.
- TP is better on fast intra-node NVLink/NVSwitch (low latency). Across nodes, DP scales better because communication is less frequent (just bandwidth bound)
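
To make the frequency gap between TP and DP concrete, here is a rough counting sketch. `all_reduce_calls_per_batch` is a hypothetical helper, and the default of 4 all-reduces per transformer layer (2 forward + 2 backward) is an assumption based on Megatron-style tensor parallelism. DP is counted as one logical gradient all-reduce per batch, even though real frameworks usually split it into buckets.

```python
def all_reduce_calls_per_batch(num_layers, num_microbatches, t, d,
                               tp_calls_per_layer=4):
    """Count logical all-reduce calls one GPU takes part in per batch.

    num_layers: transformer layers held by this GPU's pipeline stage.
    num_microbatches: microbatches processed per batch on this pipeline.
    t, d: tensor-parallel and data-parallel sizes.
    tp_calls_per_layer: assumed 2 forward + 2 backward all-reduces.
    """
    # TP: every layer communicates for every microbatch.
    tp_calls = tp_calls_per_layer * num_layers * num_microbatches if t > 1 else 0
    # DP: gradients are all-reduced once, after the last microbatch.
    dp_calls = 1 if d > 1 else 0
    return tp_calls, dp_calls


tp_calls, dp_calls = all_reduce_calls_per_batch(
    num_layers=24, num_microbatches=8, t=8, d=4)
print(f"TP all-reduces per batch: {tp_calls}, DP all-reduces per batch: {dp_calls}")
```

The counts ignore message size, so they only illustrate why TP is latency-sensitive and belongs on the fastest links.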


Takeaway #2: DP is generally the preferred way to scale (it avoids both the frequent communication that TP suffers from and the pipeline bubbles from PP), as long as the model fits on 1 GPU. If the model doesn't fit onto 1 GPU, you must introduce TP or PP.
- Note: I think general rule of thumb is TP within a node (8 GPUs so TP=8), and then PP until the model fits into memory across nodes.
- Recipe: 1) TP = node size (ex. 8 GPUs), 2) PP across nodes until the model fits in memory, and then 3) scale with DP to increase throughput (doesn't increase per-GPU efficiency though - although I think that, for a fixed total GPU count and fixed $B$, trading PP for DP does reduce the pipeline bubble size given a fixed microbatch size)
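
The interaction between DP and the pipeline bubble can be checked numerically. The sketch below assumes the bubble fraction $\frac{p-1}{m}$ with $m = \frac{B}{d \cdot b}$ microbatches per pipeline, as in the paper's pipeline + data parallelism analysis (Section 3.3.1). `bubble_fraction` and the example sizes are hypothetical, chosen so that everything divides evenly.

```python
from fractions import Fraction


def bubble_fraction(n, d, B, b, t=1):
    """Pipeline bubble fraction (p - 1) / m for a fixed GPU budget.

    n: total GPUs, d: data-parallel size, t: tensor-parallel size,
    B: global batch size, b: microbatch size.
    """
    assert n % (t * d) == 0, "n must equal p * t * d for an integer p"
    assert B % (d * b) == 0, "B must split evenly into microbatches"
    p = n // (t * d)   # pipeline-parallel size left over after t and d
    m = B // (d * b)   # microbatches each pipeline processes per batch
    return Fraction(p - 1, m)


# Fixed n = 64 GPUs, B = 512, b = 1, t = 1: sweep the data-parallel size.
for d in (1, 2, 4, 8, 16, 32, 64):
    print(d, bubble_fraction(n=64, d=d, B=512, b=1))
```

With $t = 1$ the result simplifies to $\frac{n-d}{b'}$ where $b' = B/b$, so the bubble shrinks as $d$ grows only because $p$ shrinks with it. Holding $p$ fixed and adding DP replicas at fixed $B$ leaves fewer microbatches per pipeline and the bubble grows instead.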


tl;dr:
1. TP is good intra-node to shard model layers to fit on GPUs, but suffers from frequent communication overhead and efficiency drops if the shard size is too small. (quite necessary if a single matmul doesn't fit on 1 GPU)
2. PP is good inter-node to shard model layers to fit on GPUs, but suffers from pipeline bubbles (dependent on PP, # of microbatches b', and DP)
3. DP is the simplest, but just requires the model to fit on each GPU.
- TP > PP within a server (no pipeline bubbles), across servers PP > TP (fewer communication operations)

**FAQs**:

**Action items**:
- Implement Ring All-Reduce
- Continue from 3.4 Microbatch Size