# Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Focus: Mainly just reading the full paper

References:
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019 (https://arxiv.org/pdf/1909.08053)

Purpose: very large models can be quite difficult to train due to memory constraints

Approach: implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. The approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented by inserting a few communication operations in native PyTorch. (sounds like tensor parallelism)
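To make "intra-layer model parallel" concrete for myself, here is a minimal NumPy sketch of how I understand the paper's MLP block: the first weight matrix is split by columns, the second by rows, and the per-worker partial outputs are summed (the sum stands in for the all-reduce). The function names, shapes, and the single-process loop over "workers" are my own assumptions for illustration, not the paper's code.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_reference(x, a, b):
    # Unsharded MLP: Z = GeLU(X A) B
    return gelu(x @ a) @ b

def mlp_tensor_parallel(x, a, b, world_size):
    # Hypothetical single-process simulation of `world_size` workers.
    a_shards = np.split(a, world_size, axis=1)  # column split of A
    b_shards = np.split(b, world_size, axis=0)  # row split of B
    # Each worker applies GeLU locally to its own columns, so no
    # communication is needed between the two GEMMs.
    partials = [gelu(x @ a_i) @ b_i for a_i, b_i in zip(a_shards, b_shards)]
    # Summing the partial outputs plays the role of the all-reduce.
    return sum(partials)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # (tokens, hidden)
a = rng.standard_normal((8, 32))   # hidden -> 4 * hidden
b = rng.standard_normal((32, 8))   # 4 * hidden -> hidden

z_ref = mlp_reference(x, a, b)
z_tp = mlp_tensor_parallel(x, a, b, world_size=4)
print(np.allclose(z_ref, z_tp))
```

The column split matters because GeLU is nonlinear: splitting A by rows would require summing the partial products before the activation, which means an extra synchronization point.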

Results: They converge transformer-based models of up to 8.3B parameters using 512 GPUs. They sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. The 8.3B GPT-2-like model achieves SOTA on WikiText103 (perplexity 10.8 vs. the previous SOTA of 15.8) and LAMBADA (accuracy 66.5% vs. 63.2%), and the 3.9B BERT-like model achieves SOTA on RACE (accuracy 90.9% vs. 89.4%).
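A quick sanity check of the 76% figure, using only the numbers quoted above. The variable names are mine, and "ideal" here assumes perfect linear scaling of the single-GPU baseline.

```python
single_gpu_tflops = 39.0   # single-GPU baseline
num_gpus = 512
sustained_pflops = 15.1    # measured across the whole application

# Perfect linear scaling of the single-GPU baseline, in PetaFLOPs.
ideal_pflops = single_gpu_tflops * num_gpus / 1000.0
scaling_efficiency = sustained_pflops / ideal_pflops
per_gpu_tflops = sustained_pflops * 1000.0 / num_gpus

print(f"ideal: {ideal_pflops:.2f} PFLOPs")
print(f"scaling efficiency: {scaling_efficiency:.1%}")
print(f"per-GPU throughput at 512 GPUs: {per_gpu_tflops:.1f} TFLOPs")
```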

Definitions:
- GEMM = General Matrix-Matrix multiplication aka matmul

Notes:
- Potentially important focus areas: 
    - locality sensitive hashing (LSH) to deduplicate content with a Jaccard similarity greater than 0.7
    - global gradient norm clipping
- Training: they use mixed precision training with dynamic loss scaling. Lots of configuration details are in the paper: activation checkpointing to decrease the memory footprint, dropout 0.1, global gradient norm clipping at 1.0, Adam with weight decay \lambda = 0.01, weights initialized from a normal distribution N(0, 0.02), and weights immediately before residual layers scaled by 1/sqrt(2N), where N is the number of transformer layers.
- n = seq_len = 1024, batch size = 512, for 300K iterations
- lr = 1.5e-4 with a warmup period of 3K iterations, followed by a single-cycle cosine decay over the remaining 297K iterations. The decay stops at a minimum lr of 1e-5
- 512 V100 GPUs
- batch size is fixed at 8 for the model parallel scaling study. For model + data parallel the global batch size = 512, so there are 64 replicas of the model (64-way data parallel), each processing a batch of 8 examples in parallel on a different shard of the data.
- "Table 1. Parameters used for scaling studies. Hidden size per attention head is kept constant at 96." - this means d_k = 96: for each head, W_q, W_k, and W_v project from the model hidden size (the input dimension) down to 96 (the output dimension).
- Training model sizes beyond BERT-large can lead to unexpected model degradation. The Megatron-LM team found that rearranging the order of the layer normalization and the residual connections is critical to scaling BERT-style models beyond BERT-large. Q: Why, other than the fact that empirically this seems to mitigate model degradation?
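
The learning-rate schedule from the notes above can be sketched in plain Python. Two things here are my assumptions rather than something the notes state: the warmup is linear, and "stop decay at 1e-5" is read as clamping the cosine curve at that floor. The constant and function names are mine.

```python
import math

PEAK_LR = 1.5e-4
MIN_LR = 1e-5
WARMUP_ITERS = 3_000
TOTAL_ITERS = 300_000

def learning_rate(step):
    # Assumed linear warmup from 0 to the peak learning rate.
    if step < WARMUP_ITERS:
        return PEAK_LR * step / WARMUP_ITERS
    # Single-cycle cosine decay over the remaining 297K iterations.
    progress = (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Assumed interpretation: the decay is clamped at the minimum lr.
    return max(PEAK_LR * cosine, MIN_LR)

for step in (0, 1_500, 3_000, 151_500, 300_000):
    print(step, f"{learning_rate(step):.2e}")
```

With the clamp, the last stretch of training runs at a constant 1e-5 instead of decaying all the way to zero.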

<div style="text-align:center;">
    <img src="2025-08-29_BERT_architecture_modification.png" style="width:25%">
</div>
Effectively, the main change is to the residual connection: the skip path now carries the raw (un-normalized) input embedding, and only the input to the sublayer is layer-normalized. I'm not sure why this makes sense though.

This means that instead of a post-LayerNorm residual connection, it's a pre-LayerNorm residual connection. My guess is that normalizing the residual stream itself, as post-LayerNorm does, is what destabilizes training. With pre-LayerNorm the skip connection bypasses every normalization, which should give more stable gradients across many layers, make optimization easier for large models, and avoid representation drift, where the residual path no longer carries the original information.
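
To check my reading of the two orderings, here is a small NumPy sketch. It is a simplification under my own assumptions: the layer norm has no learned gain or bias, the sublayer is an arbitrary function, and the function names are mine. With a sublayer that outputs zeros, the pre-LayerNorm block returns its input unchanged, while the post-LayerNorm block still rescales it.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified layer norm over the last axis (no learned gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original BERT ordering: normalize after the residual add,
    # so the residual stream itself is renormalized every layer.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Rearranged ordering: normalize only the sublayer input;
    # the skip path carries the raw x.
    return x + sublayer(layer_norm(x))

def zero_sublayer(h):
    # Toy sublayer that contributes nothing, to isolate the skip path.
    return np.zeros_like(h)

rng = np.random.default_rng(0)
x = 3.0 + 2.0 * rng.standard_normal((2, 16))

pre_out = pre_ln_block(x, zero_sublayer)
post_out = post_ln_block(x, zero_sublayer)
print(np.allclose(pre_out, x))   # skip path is an exact identity
print(np.allclose(post_out, x))  # residual stream has been rescaled
```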