# Megatron-LM Code

**Focus**: Reading through implementation snippets to understand the code, rather than running it
- Specifically, I will look at the core contribution of the 2019 Megatron-LM work, tensor model parallelism, and try to trace exactly how Megatron-LM implements it.
- E.g., how does Megatron-LM split a Linear layer across multiple GPUs and still produce the correct output?

**References**: 
- https://github.com/NVIDIA/Megatron-LM

**Purpose**: to understand the Megatron-LM implementation of tensor model parallelism

**Approach**: I'll go into its open-source GitHub repo and trace through the core bits of the code.

*Definitions*: 

*Notes*:

```plaintext
Megatron-LM/
├── megatron/                    
│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
```

It looks like `layers.py` holds the tensor-parallel Linear layers, and `mappings.py` is the place to see how the collectives (all-reduce, all-gather) are implemented.

```plaintext
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/tensor_parallel/__init__.py

__all__ = [
    # cross_entropy.py
    "vocab_parallel_cross_entropy",
    # data.py
    "broadcast_data",
    # layers.py
    "ColumnParallelLinear",
    "RowParallelLinear",
    "VocabParallelEmbedding",
    "set_tensor_model_parallel_attributes",
    "set_defaults_if_not_set_tensor_model_parallel_attributes",
    "copy_tensor_model_parallel_attributes",
    "param_is_not_tensor_parallel_duplicate",
    "linear_with_grad_accumulation_and_async_allreduce",
    # mappings.py
    "copy_to_tensor_model_parallel_region",
    "gather_from_tensor_model_parallel_region",
    "gather_from_sequence_parallel_region",
    "reduce_from_tensor_model_parallel_region",
    "reduce_scatter_to_sequence_parallel_region",
    "scatter_to_tensor_model_parallel_region",
    "scatter_to_sequence_parallel_region",
    # random.py
    "checkpoint",
    "get_cuda_rng_tracker",
    "model_parallel_cuda_manual_seed",
    "get_expert_parallel_rng_tracker_name",
    "CheckpointWithoutOutput",
    # utils.py
    "split_tensor_along_last_dim",
    "split_tensor_into_1d_equal_chunks",
    "gather_split_1d_tensor",
]
```

Two classes look important here:

1. class ColumnParallelLinear
- Linear layer with column parallelism. 
- Q: What's column parallelism? A: If the linear layer is Y = XA + b, A is split along its second dimension (columns) as A = [A_1, ..., A_p], so rank i computes Y_i = XA_i + b_i.
- Note: each rank only holds its slice Y_i of the output, so an all-gather is needed if you want the full Y.
- The forward pass looks largely the same as a regular Linear layer. At first I thought the weight splitting happened in `output_parallel = self._forward_impl(`, which for a normal forward pass with gradient computation is `linear_with_grad_accumulation_and_async_allreduce`.
    - Edit: the weight sharding appears to happen at init instead.
    - Specifically in the weight initialization section of `__init__` (see the key code lines for ColumnParallelLinear at the end of these notes).
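
To convince myself that sharding the weight once up front still gives the right answer, here is a minimal single-process sketch in plain Python (lists instead of tensors, a loop instead of GPUs). It is not Megatron code: `matmul`, `split_columns`, and `column_parallel_forward` are hypothetical helpers, the bias is left out, and the list concatenation stands in for the all-gather. It follows the Y = XA convention, whereas Megatron stores the weight as (out_features, in_features).

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(a, world_size):
    """Shard a (k x m) along its columns into world_size equal pieces."""
    assert len(a[0]) % world_size == 0, "columns must divide evenly"
    step = len(a[0]) // world_size
    return [[row[r * step:(r + 1) * step] for row in a] for r in range(world_size)]


def column_parallel_forward(x, shards):
    """Each simulated rank computes X @ A_i; concatenation plays the all-gather."""
    partial = [matmul(x, a_i) for a_i in shards]  # one (n x m/p) block per rank
    return [sum((p[i] for p in partial), []) for i in range(len(x))]


X = [[1, 2], [3, 4]]                     # 2 x 2 input, replicated on every rank
A = [[1, 0, 2, 1], [0, 1, 1, 3]]         # 2 x 4 full weight
shards = split_columns(A, world_size=2)  # done once, at "init" time
Y = column_parallel_forward(X, shards)

# The sharded result matches the unsharded matmul.
assert Y == matmul(X, A)
```

No communication is needed on the input side here because every rank sees the same X; only the outputs have to be stitched together.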


2. class RowParallelLinear
- When `RowParallelLinear` in tensor_parallel/layers.py is constructed, each rank gets the world_size and then computes `self.input_size_per_partition`, which is its shard of the input dimension. The weight parameter is then created at that sharded size.
- During the forward pass, `linear_with_grad_accumulation_and_async_allreduce()` (backed by `LinearWithGradAccumulationAndAsyncCommunication`) computes the local partial matmul. As far as I can tell, the all-reduce that sums the partial outputs across ranks happens right after it, via `reduce_from_tensor_model_parallel_region`.
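
The row-wise case can be sketched the same way. This is a single-process illustration under my own assumptions, not the Megatron implementation: `row_parallel_forward` is a hypothetical helper, the bias is omitted, and the elementwise sum over simulated ranks stands in for the all-reduce. With A split along its rows and X split along its columns, Y is the sum of the partial products X_i A_i.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def row_parallel_forward(x, a, world_size):
    """Simulate Y = sum_i X_i @ A_i with A sharded by rows and X by columns."""
    assert len(a) % world_size == 0, "rows must divide evenly"
    step = len(a) // world_size
    partials = []
    for rank in range(world_size):
        lo, hi = rank * step, (rank + 1) * step
        x_i = [row[lo:hi] for row in x]  # this rank's slice of the input columns
        a_i = a[lo:hi]                   # this rank's slice of the weight rows
        partials.append(matmul(x_i, a_i))
    # Elementwise sum across ranks stands in for the all-reduce.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


X = [[1, 2, 3, 4], [5, 6, 7, 8]]      # 2 x 4 input
A = [[1, 0], [0, 1], [1, 1], [2, 0]]  # 4 x 2 full weight
Y = row_parallel_forward(X, A, world_size=2)

# Summing the partial outputs recovers the unsharded result.
assert Y == matmul(X, A)
```

Each partial output already has the full output shape, which is why the row-parallel layer needs a reduction rather than a gather.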

Oh neat, `LinearWithGradAccumulationAndAsyncCommunication` contains both the forward AND the backward.
- The forward function does indeed save the input and weight for backward.
- Fusing gradient accumulation lets you defer the weight-gradient GEMM (matmul) for better efficiency.
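
To see why the forward has to save the input and weight, here is a hand-written forward/backward pair for a plain linear layer in pure Python. It is only a sketch of the idea: `LinearFunctionSketch`, `matmul`, and `transpose` are hypothetical names, the real class is a `torch.autograd.Function`, and none of the communication or fusion logic is modeled. The weight uses the (out_features, in_features) layout.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


class LinearFunctionSketch:
    """Forward and backward for Y = X @ W^T, with W stored as (out, in)."""

    def forward(self, x, weight):
        # Both operands are needed again in backward, so keep them around.
        self.saved = (x, weight)
        return matmul(x, transpose(weight))

    def backward(self, grad_output):
        x, weight = self.saved
        grad_input = matmul(grad_output, weight)         # dL/dX = dL/dY @ W
        grad_weight = matmul(transpose(grad_output), x)  # dL/dW = dL/dY^T @ X
        return grad_input, grad_weight


fn = LinearFunctionSketch()
x = [[1, 2]]                      # 1 x 2 input
w = [[3, 4], [5, 6], [7, 8]]      # 3 x 2 weight (out_features=3, in_features=2)
y = fn.forward(x, w)
grad_input, grad_weight = fn.backward([[1, 1, 1]])
```

The two GEMMs in `backward` are independent of each other: the input gradient needs the saved weight, and the weight gradient needs the saved input, so the second one can be scheduled later without blocking the first.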

reduce-scatter?
- Reduce-scatter means the values are reduced (e.g. summed) across ranks, but each rank only receives the chunk of the result that it will keep (so it's basically a sharded all-reduce).
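
A toy simulation makes the relationship between the collectives concrete. This sketch models each rank's buffer as a Python list and uses sum as the reduction; the function names mirror the collective names but are plain local helpers, not `torch.distributed` calls.

```python
def all_reduce(per_rank):
    """Every rank ends up with the full elementwise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Every rank ends up with only its own chunk of the elementwise sum."""
    world_size = len(per_rank)
    total = [sum(vals) for vals in zip(*per_rank)]
    assert len(total) % world_size == 0, "buffer must divide evenly"
    step = len(total) // world_size
    return [total[r * step:(r + 1) * step] for r in range(world_size)]


def all_gather(chunks):
    """Every rank ends up with the concatenation of all chunks."""
    full = [v for chunk in chunks for v in chunk]
    return [list(full) for _ in chunks]


buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]  # one buffer per simulated rank
scattered = reduce_scatter(buffers)

# A reduce-scatter followed by an all-gather gives the same result as an all-reduce.
assert all_gather(scattered) == all_reduce(buffers)
```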

sequence parallel: somewhat surprisingly, you can also shard activations along the sequence dimension across ranks. This helps with long-sequence training.
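
Sharding along the sequence dimension works for any operation that treats tokens independently. The sketch below illustrates that with a stand-in per-token op (subtracting the token's mean); it is an assumption-laden toy, not Megatron's code, and `layer_op`, `scatter_along_sequence`, and `gather_along_sequence` are hypothetical names.

```python
def layer_op(token):
    """Stand-in for a per-token op such as LayerNorm: subtract the token's mean."""
    mean = sum(token) / len(token)
    return [v - mean for v in token]


def scatter_along_sequence(activations, world_size):
    """Give each simulated rank a contiguous slice of the sequence."""
    assert len(activations) % world_size == 0, "sequence must divide evenly"
    step = len(activations) // world_size
    return [activations[r * step:(r + 1) * step] for r in range(world_size)]


def gather_along_sequence(shards):
    """Concatenate the per-rank slices back into the full sequence."""
    return [token for shard in shards for token in shard]


# 4 tokens, hidden size 3
activations = [[1, 2, 3], [4, 4, 4], [0, 3, 6], [2, 2, 5]]
shards = scatter_along_sequence(activations, world_size=2)
local = [[layer_op(t) for t in shard] for shard in shards]  # each rank works alone
full = gather_along_sequence(local)

# Processing shards independently matches processing the whole sequence.
assert full == [layer_op(t) for t in activations]
```

Each rank only ever holds its slice of the activations for these ops, which is where the memory saving comes from.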

**Result**: 
- These are all tricks to avoid running out of GPU memory, which was and still is a common bottleneck. Tensor parallelism and sequence parallelism are the techniques Megatron uses here for large-scale training.


**FAQs**:

**Action items**:


```python
# key code lines for ColumnParallelLinear (excerpt from the constructor, not runnable on its own)
# typically you set TP size via --tensor-model-parallel-size. Otherwise it defaults to 1.

self.tp_group = get_tensor_model_parallel_group_if_none(
    self.tp_group, is_expert=self.is_expert
)

# def get_pg_size(group=None):
#     """Get world size for a distributed group.

#     Args:
#         group: Process group to get world size for. If None, uses default group.

#     Returns:
#         int: World size (1 if distributed not initialized or group is None, else group.size())
#     """
#     if not torch.distributed.is_initialized() or group is None:
#         return 1
#     return group.size()
world_size = get_pg_size(self.tp_group)

# We divide the output_size by the world_size (tensor parallel group size)
# def ensure_divisibility(numerator, denominator):
#     """Ensure that numerator is divisible by the denominator."""
#     assert numerator % denominator == 0, "{} is not divisible by {}".format(numerator, denominator)
# def divide(numerator, denominator):
#     """Ensure that numerator is divisible by the denominator and return
#     the division value."""
#     ensure_divisibility(numerator, denominator)
#     return numerator // denominator
self.output_size_per_partition = divide(output_size, world_size)

# Create the parameter, which stores the tensor as (out_features, in_features)
self.weight = Parameter(
    torch.empty(
        self.output_size_per_partition, self.input_size, dtype=config.params_dtype
    )
)

# Initialize affine weight for model parallel on GPU.
_initialize_affine_weight_gpu(
    self.weight,
    init_method,
    partition_dim=0,
    stride=stride,
    is_expert=self.is_expert,
)
```
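
The shape arithmetic in those lines can be checked on its own. The sketch below reuses the two helpers quoted in the comments above; `column_parallel_weight_shape` is a hypothetical function I added for illustration, and the sizes are made-up example values. Because the weight is stored as (out_features, in_features), partitioning dimension 0 is what splits the output features across ranks.

```python
def ensure_divisibility(numerator, denominator):
    """Ensure that numerator is divisible by the denominator."""
    assert numerator % denominator == 0, "{} is not divisible by {}".format(
        numerator, denominator
    )


def divide(numerator, denominator):
    """Ensure divisibility and return the integer quotient."""
    ensure_divisibility(numerator, denominator)
    return numerator // denominator


def column_parallel_weight_shape(input_size, output_size, world_size):
    """Per-rank weight shape (out_features / world_size, in_features)."""
    return (divide(output_size, world_size), input_size)


# Example sizes only: a 1024 -> 4096 layer split over 8 tensor-parallel ranks.
shape = column_parallel_weight_shape(input_size=1024, output_size=4096, world_size=8)
assert shape == (512, 1024)
```

An output size that does not divide evenly by the tensor-parallel size fails the assertion up front rather than producing uneven shards.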