# Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Focus: MFU

References:
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019 (https://arxiv.org/pdf/1909.08053)

Purpose: Very large models are hard to train because of memory constraints: past a certain size they no longer fit in the memory of a single GPU.

Approach: Implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. The approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented by inserting a few communication operations in native PyTorch. (Sounds like tensor parallelism.)
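
To make "intra-layer model parallel" concrete, here is a minimal single-process sketch of the column/row split that the paper applies to the transformer MLP block (these notes have not reached that section yet, so treat it as a preview, not as the authors' code). It uses plain Python lists instead of PyTorch tensors, and the shard-wise sum stands in for the all-reduce communication op. All function names (`mlp_reference`, `mlp_tensor_parallel`, and the helpers) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(m):
    """Elementwise GeLU (exact erf form)."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]

def split_cols(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def add(a, b):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mlp_reference(x, w1, w2):
    """Unsharded MLP: GeLU(x @ w1) @ w2."""
    return matmul(gelu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, world_size):
    """Same MLP with w1 split by columns and w2 split by rows.

    Each (w1 shard, w2 shard) pair plays the role of one GPU. Because GeLU is
    elementwise, each simulated GPU can apply it to its own column block
    without talking to the others.
    """
    w1_shards = split_cols(w1, world_size)
    w2_shards = split_rows(w2, world_size)
    partials = [matmul(gelu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)  # stands in for the all-reduce across GPUs
    return out

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
x = rand_matrix(2, 4)    # (batch, hidden)
w1 = rand_matrix(4, 8)   # (hidden, 2 * hidden); sizes are arbitrary
w2 = rand_matrix(8, 4)   # (2 * hidden, hidden)

ref = mlp_reference(x, w1, w2)
par = mlp_tensor_parallel(x, w1, w2, world_size=2)
max_diff = max(abs(a - b) for ra, rb in zip(ref, par) for a, b in zip(ra, rb))
print("max abs difference:", max_diff)
```

The sharded and unsharded outputs agree up to floating-point rounding, and in the forward pass only the final sum requires communication. That matches the "few communication operations" claim above.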

Results: They converge transformer-based models of up to 8.3B parameters using 512 GPUs. They sustain 15.1 PetaFLOPs/s across the entire application with 76% scaling efficiency compared to a strong single-GPU baseline that sustains 39 TeraFLOPs/s, which is 30% of peak FLOPs. The 8.3B GPT-2-like model achieves SOTA on WikiText103 (perplexity 10.8 vs. 15.8) and LAMBADA (accuracy 66.5% vs. 63.2%), and the 3.9B BERT-like model achieves SOTA on RACE (accuracy 90.9% vs. 89.4%).

Notes:
- Other model-parallel frameworks, such as GPipe and Mesh-TensorFlow, require rewriting the model and rely on custom compilers and frameworks that are still under development. Tensor parallelism is orthogonal to the pipeline-based model parallelism advocated by approaches such as GPipe.
- MFU = model FLOPs utilization = a metric for how efficiently the hardware (GPUs) is used relative to its theoretical peak performance. MFU = model FLOPs actually executed per second / theoretical peak FLOPs per second of the hardware. As a rough rule of thumb, >40-50% is good utilization, while <20% suggests training is bottlenecked by things like I/O, data preprocessing, small batch sizes, or inefficient kernels. MFU measures the compute efficiency of the GPUs, not just the raw GPU utilization %, which only says whether the GPU is "busy".
- Here, they train a 1.2B-parameter model on one V100 32GB GPU, sustaining 39 TeraFLOPs/s, which is 30% of the theoretical peak FLOPs for a single GPU as configured in a DGX-2H server. So the theoretical peak is 39 / 0.30 = 130 TeraFLOPs/s per GPU, and the baseline MFU = 30%.
- The 76% scaling efficiency follows from 15.1 PetaFLOPs/s / 512 GPUs = 29.5 TeraFLOPs/s per GPU. Per-GPU MFU = 29.5 / 130 = 22.7%, and 22.7 / 30 = 76% of the single-GPU baseline (equivalently, 29.5 / 39 = 76%).
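
The arithmetic in the last two notes can be checked with a short script. This is only a sketch: the helper `mfu` and the variable names are my own, and the 130 TeraFLOPs/s peak is derived from the paper's "39 TeraFLOPs/s is 30% of peak" statement, not taken from a hardware datasheet.

```python
def mfu(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of the hardware's theoretical peak throughput actually sustained."""
    return achieved_tflops / peak_tflops

# Numbers quoted in the notes above.
single_gpu_tflops = 39.0          # sustained by the 1.2B baseline on one V100
baseline_fraction_of_peak = 0.30  # the paper's "30% of peak"
total_pflops = 15.1               # sustained by the 8.3B model on the cluster
num_gpus = 512

# Derived quantities (assumption: peak is inferred as 39 / 0.30).
peak_tflops = single_gpu_tflops / baseline_fraction_of_peak
per_gpu_tflops = total_pflops * 1000 / num_gpus
cluster_mfu = mfu(per_gpu_tflops, peak_tflops)
scaling_efficiency = per_gpu_tflops / single_gpu_tflops

print(f"peak per GPU:       {peak_tflops:.1f} TFLOPs/s")
print(f"sustained per GPU:  {per_gpu_tflops:.1f} TFLOPs/s")
print(f"per-GPU MFU:        {cluster_mfu:.1%}")
print(f"scaling efficiency: {scaling_efficiency:.1%}")
```

The last line comes out slightly under 76% (about 75.6%), so the paper's 76% is a rounded figure.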

Action items:
- Finished Abstract + Intro. Continue reading from Background and Challenges.
- Look into activation checkpointing (Chen et al., 2016).
- Read Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, 2021 (https://arxiv.org/pdf/2104.04473)