## Megatron-LM Tensor Parallelism

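Megatron is a large, powerful Transformer developed by NVIDIA's Applied Deep Learning Research team, aimed at research on training large Transformer language models at scale. Its main contribution is the idea of tensor parallelism: splitting the model horizontally, so that the weight tensors inside a layer are sharded across devices.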
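
To make the idea of a horizontal split concrete, here is a minimal pure-Python sketch of a column-wise split of a matrix multiplication `Y = X @ W`. It does not use Megatron or GPUs: the "devices" are just list slices, and the helper names (`matmul`, `split_columns`, `column_parallel_matmul`) are made up for this illustration.

```python
# Pure-Python sketch of column-wise tensor parallelism for Y = X @ W.
# No GPUs or Megatron involved; each "device" is just a slice of W.

def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split w into `parts` equal column blocks, one per simulated device."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

def column_parallel_matmul(x, w, parts=2):
    """Each simulated device multiplies x by its own column shard; the
    partial outputs are then concatenated along the column dimension."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

X = [[1, 2], [3, 4]]
W = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(X, W)                               # single-device result
sharded = column_parallel_matmul(X, W, parts=2)   # two-shard result
print(full == sharded)
```

Each shard only needs half of `W`, yet concatenating the partial outputs reproduces the full product. That is the property tensor parallelism relies on.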

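For Chinese-language explanations of tensor parallelism, see:

NVIDIA China: https://zhuanlan.zhihu.com/p/420908718

An illustrated walkthrough of the splitting scheme, with pseudocode, by a Zhihu author: https://zhuanlan.zhihu.com/p/366906920

Rossi's blog (罗西的思考) on cnblogs: https://www.cnblogs.com/rossiXYZ/p/15840803.html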

This notebook section shows how to use Megatron to quickly set up tensor parallelism for a relatively simple model.

First, install the Megatron library:

In [None]:
!git clone https://github.com/NVIDIA/Megatron-LM.git
# Use the %cd magic: `!cd` runs in a subshell and does not change the notebook's working directory
%cd Megatron-LM
%pip install -v -e .

Check that the installation works:

In [3]:
import megatron

print(megatron)

<module 'megatron' from '/mnt/configblob/users/ruizhe/Megatron-LM/megatron/__init__.py'>


The code below borrows in part from Megatron's unit tests: https://github.com/NVIDIA/Megatron-LM/tree/main/tests/unit_tests

We start by importing Megatron's APIs for initializing model parallelism:

In [None]:
import os
import torch
import megatron.core.parallel_state as ps
from megatron.core.tensor_parallel.data import broadcast_data


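# The Utils class provides helpers that initialize the distributed environment, initialize the model-parallel environment, and tear the model-parallel environment down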
class Utils:

    world_size = torch.cuda.device_count()
    # This must be launched with torchrun; otherwise LOCAL_RANK is missing from the environment and the next line raises a KeyError
    rank = int(os.environ['LOCAL_RANK'])

    @staticmethod
    def initialize_distributed():
        print(f'Initializing torch.distributed with rank: {Utils.rank}, world_size: {Utils.world_size}')
        torch.cuda.set_device(Utils.rank % torch.cuda.device_count())
        init_method = 'tcp://'
        master_ip = os.getenv('MASTER_ADDR', 'localhost')
        master_port = os.getenv('MASTER_PORT', '6000')
        init_method += master_ip + ':' + master_port
        torch.distributed.init_process_group(backend='nccl', world_size=Utils.world_size, rank=Utils.rank, init_method=init_method)
        
    @staticmethod
    def destroy_model_parallel():
        ps.destroy_model_parallel()
        torch.distributed.barrier()

    ''' initialize_model_parallel: set up the model-parallel environment.
    tensor_model_parallel_size: tensor-parallel size
    pipeline_model_parallel_size: pipeline-parallel size
    virtual_pipeline_model_parallel_size: virtual pipeline-parallel size
    pipeline_model_parallel_split_rank: rank at which the pipeline is split
    '''
    @staticmethod
    def initialize_model_parallel(tensor_model_parallel_size = 1, pipeline_model_parallel_size = 1, virtual_pipeline_model_parallel_size = None, pipeline_model_parallel_split_rank = None):
        ps.destroy_model_parallel()
        if not torch.distributed.is_initialized():
            Utils.initialize_distributed()
        ps.initialize_model_parallel(tensor_model_parallel_size, pipeline_model_parallel_size, virtual_pipeline_model_parallel_size, pipeline_model_parallel_split_rank)
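
The comment in `Utils` notes that `os.environ['LOCAL_RANK']` fails outside `torchrun`. As a hedged sketch, the helper below reads the variables that `torchrun` exports (`LOCAL_RANK`, `RANK`, `WORLD_SIZE`) and falls back to a single-process layout when they are missing. `read_torchrun_env` is a hypothetical name for this illustration and is not part of Megatron or PyTorch.

```python
import os

def read_torchrun_env(env=None):
    """Return (local_rank, rank, world_size) from torchrun-style variables.

    Hypothetical helper: falls back to a single-process layout (0, 0, 1)
    when the variables are absent, instead of raising KeyError.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", 0))
    rank = int(env.get("RANK", local_rank))
    world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, rank, world_size

# Simulated environments, so the sketch runs without torchrun.
print(read_torchrun_env({}))
print(read_torchrun_env({"LOCAL_RANK": "3", "RANK": "11", "WORLD_SIZE": "16"}))
```

On a single node, `LOCAL_RANK` and `RANK` coincide, which is what the `Utils` class above relies on when it uses `LOCAL_RANK` as the process rank.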


Next we can define some unit tests. For example, the code below tests Megatron's tensor broadcast:

In [None]:
# Broadcast test: set up a distributed environment with tensor-parallel size 2 and pipeline-parallel size 4, build some data, and check what broadcast_data does
def test_broadcast_data():
    Utils.initialize_model_parallel(2,4)
    input_data = {
        0 : torch.ones((8,8)).cuda() * 0.0,
        1 : torch.ones((8,8)).cuda() * 1.0,
        2 : torch.ones((8,8)).cuda() * 2.0,
        3 : torch.ones((8,8)).cuda() * 3.0,
        4 : torch.ones((8,8)).cuda() * 4.0,
        5 : torch.ones((8,8)).cuda() * 5.0,
        6 : torch.ones((8,8)).cuda() * 6.0,
        7 : torch.ones((8,8)).cuda() * 7.0
        }
    dtype = torch.float32
    # broadcast_data: broadcasts the data held by the first rank of each tensor-parallel group to the other ranks in that group
    actual_output = broadcast_data([0,1],input_data, dtype)
    assert(torch.equal(actual_output[0], input_data[0]))
    assert(torch.equal(actual_output[1], input_data[1]))
    
    if Utils.rank == 0:
        print("Broadcast assertion passed")
    Utils.destroy_model_parallel()
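
It helps to see which ranks end up in which group when 8 processes are initialized with tensor-parallel size 2 and pipeline-parallel size 4. The sketch below is pure Python and does not call Megatron. It assumes the default rank layout described in the docstring of `initialize_model_parallel` (consecutive ranks share a tensor-parallel group) and a data-parallel size of 1. Both function names are hypothetical.

```python
def tensor_parallel_groups(world_size, tp_size):
    """Consecutive ranks share a tensor-parallel group (assumed default layout)."""
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]

def pipeline_parallel_groups(world_size, pp_size):
    """Ranks a fixed stride apart form the stages of one pipeline."""
    num_groups = world_size // pp_size
    return [list(range(g, world_size, num_groups)) for g in range(num_groups)]

tp_groups = tensor_parallel_groups(8, 2)
pp_groups = pipeline_parallel_groups(8, 4)
print(tp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(pp_groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Under this layout, ranks 0 and 1 hold the two shards of the same pipeline stage, so a collective on the tensor-parallel group only ever involves such a pair.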

In [None]:
import megatron.core.tensor_parallel.utils as util

# Test gather_split_1d_tensor, which all-gathers across the tensor-parallel group
def test_gather_split_1d_tensor():
    rank = Utils.rank
    Utils.initialize_model_parallel(tensor_model_parallel_size=2, pipeline_model_parallel_size=4)
    input_tensor = torch.ones((2,4)).cuda() * rank
    actual_output_tensor = util.gather_split_1d_tensor(input_tensor)
    if rank %2 == 0:
        expected_output_tensor = torch.concat((input_tensor.flatten(), input_tensor.flatten() + 1))
    else : 
        expected_output_tensor = torch.concat((input_tensor.flatten() - 1, input_tensor.flatten()))
    assert(torch.equal(actual_output_tensor, expected_output_tensor))
    Utils.destroy_model_parallel()

Multi-GPU unit tests still have to be launched with torchrun:

In [10]:
!torchrun --nproc_per_node=8 test_megatron.py

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Initializing torch.distributed with rank: 3, world_size: 8
Initializing torch.distributed with rank: 1, world_size: 8
Initializing torch.distributed with rank: 5, world_size: 8
Initializing torch.distributed with rank: 2, world_size: 8
Initializing torch.distributed with rank: 7, world_size: 8
Initializing torch.distributed with rank: 4, world_size: 8
Initializing torch.distributed with rank: 0, world_size: 8
Initializing torch.distributed with rank: 6, world_size: 8
Broadcast assertion passed
rank: 0, input_tensor: tensor([[0., 0., 0., 0.],
        [0., 0., 0., 0.]], device='cuda:0'); output_tensor: tensor([0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1.],
       device='cuda:0')

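The all_gather result shows how the groups are laid out. With tensor-parallel size 2 and pipeline-parallel size 4, tensors from the same batch are split horizontally across two GPUs (tensor parallelism). The all-gather in `test_gather_split_1d_tensor()` therefore collects only across that horizontal pair: rank 0 receives its own eight zeros followed by rank 1's eight ones. The vertical pipeline-parallel dimension is not involved.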
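
The expected values in that test can be reproduced without any GPU. The following is a pure-Python simulation, not Megatron's implementation: it assumes tensor-parallel groups are made of consecutive ranks and that the gather concatenates the flattened per-rank tensors in rank order. `simulate_gather_split_1d` is a hypothetical name.

```python
def simulate_gather_split_1d(rank, tp_size=2, shape=(2, 4)):
    """Mimic the test: every rank holds a tensor filled with its own rank id;
    the gather concatenates the flattened tensors of the rank's
    tensor-parallel group in rank order."""
    numel = shape[0] * shape[1]
    group_start = (rank // tp_size) * tp_size  # first rank of this group
    gathered = []
    for member in range(group_start, group_start + tp_size):
        gathered.extend([float(member)] * numel)
    return gathered

print(simulate_gather_split_1d(0))
print(simulate_gather_split_1d(5))
```

For rank 0 this yields eight zeros followed by eight ones, matching the `output_tensor` printed above. Rank 1 gets the same result because it belongs to the same group.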