## Model Parallelism

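When a model is too large for a single GPU to hold all of its parameters, model parallelism is needed. Following the Megatron-LM and GPipe papers, model parallelism falls into two main kinds: tensor parallelism, which splits the parameters horizontally (within a layer), and pipeline parallelism, which splits them vertically (across layers).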
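
To make the "horizontal" split concrete, here is a minimal sketch of one common tensor-parallel scheme: the weight of a single linear layer is cut along its output features, each shard computes part of the output, and the parts are concatenated. Everything runs on the CPU here, and the two "workers" are hypothetical stand-ins for two GPUs; this is an illustration under those assumptions, not how Megatron-LM is implemented.

```python
import torch

torch.manual_seed(0)
x = torch.randn(20, 4)        # a batch of 20 inputs with 4 features
weight = torch.randn(6, 4)    # weight of a Linear(4, 6) layer: (out_features, in_features)

# Reference: the un-split layer computes x @ W^T (bias omitted for brevity).
full_out = x @ weight.t()

# Tensor-parallel style: split the output features across two hypothetical workers.
w_a, w_b = weight.chunk(2, dim=0)   # each shard has shape (3, 4)
out_a = x @ w_a.t()                 # would run on worker A
out_b = x @ w_b.t()                 # would run on worker B

# Concatenating the partial outputs along the feature dimension recovers the full result.
tp_out = torch.cat([out_a, out_b], dim=1)
print(torch.allclose(full_out, tp_out, atol=1e-6))
```

Pipeline parallelism, by contrast, leaves each layer whole and places different layers on different devices, which is what the rest of this section does.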

In this section we mainly follow the official PyTorch [tutorial](https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html) to give a simple example of model parallelism.

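First we define a simple network. We take the most naive approach: when defining the network, we place the first linear layer on the first GPU and the second linear layer on the second GPU: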

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

class ToyModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net1 = nn.Linear(4, 6).to('cuda:0')
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(6, 2).to('cuda:1')
        
    def forward(self, x):
        # In the forward pass, the input has to be moved to the device of each layer
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))

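Apart from this, the forward pass needs no further changes. `backward()` and `step()` handle the gradients on the different devices automatically, just as if the whole model were on a single card.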

The only thing to watch is that, when computing the loss, the model output and the label must be on the same card, which in this example is `cuda:1`.
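
One way to avoid hard-coding the label device is to move the labels to wherever the outputs ended up. The sketch below assumes a two-layer model like the one above; the `dev0`/`dev1` names and the CPU fallback are additions for illustration so that it also runs on a machine with fewer than two GPUs.

```python
import torch
import torch.nn as nn

# Fall back to the CPU so the sketch also runs without two GPUs (an assumption for illustration).
dev0, dev1 = ("cuda:0", "cuda:1") if torch.cuda.device_count() >= 2 else ("cpu", "cpu")

net1 = nn.Linear(4, 6).to(dev0)
net2 = nn.Linear(6, 2).to(dev1)

hidden = torch.relu(net1(torch.randn(20, 4).to(dev0)))
outputs = net2(hidden.to(dev1))

# Move the labels to the device of the outputs instead of naming that device by hand.
labels = torch.randn(20, 2).to(outputs.device)
loss = nn.MSELoss()(outputs, labels)
loss.backward()  # autograd routes gradients back to each layer's own device
```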

In [3]:
model = ToyModel()

print(f"net1: {model.net1.weight}")
print(f"net2: {model.net2.weight}")

loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr = 0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 4))
labels = torch.randn(20, 2).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()

print("---------After update---------")
print(f"net1: {model.net1.weight}")
print(f"net2: {model.net2.weight}")

net1: Parameter containing:
tensor([[-0.3877,  0.3361, -0.4700, -0.1861],
        [-0.2826,  0.2885,  0.3352, -0.2339],
        [ 0.1411,  0.2614,  0.0342,  0.3235],
        [-0.3217,  0.3397, -0.0691,  0.3553],
        [ 0.3835, -0.3737,  0.0907,  0.2470],
        [ 0.0396,  0.2315,  0.4344,  0.3013]], device='cuda:0',
       requires_grad=True)
net2: Parameter containing:
tensor([[-0.2344, -0.3099,  0.2224,  0.3760,  0.1694, -0.2281],
        [-0.2098, -0.1935,  0.3041,  0.1562, -0.3340, -0.2639]],
       device='cuda:1', requires_grad=True)
---------After update---------
net1: Parameter containing:
tensor([[-0.3876,  0.3361, -0.4700, -0.1861],
        [-0.2825,  0.2885,  0.3352, -0.2339],
        [ 0.1411,  0.2614,  0.0341,  0.3235],
        [-0.3217,  0.3397, -0.0692,  0.3553],
        [ 0.3834, -0.3736,  0.0907,  0.2471],
        [ 0.0397,  0.2315,  0.4344,  0.3012]], device='cuda:0',
       requires_grad=True)
net2: Parameter containing:
tensor([[-0.2343, -0.3098,  0.2224,  0.376

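Building on the EXP-2 example, we can apply the same method to the earlier CIFAR-10 training code: assign different parts of the model to different GPUs and then run the computation.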

In [4]:
# A ConvNet whose layers are manually assigned to different GPUs. Four GPUs are used here, with one block of layers on each.
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)).to('cuda:0')
        
        self.conv2 = nn.Sequential(
            nn.Conv2d(6, 16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)).to('cuda:1')
        
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU()).to('cuda:2')
        
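        # Note: nn.CrossEntropyLoss (used for training below) applies log-softmax itself,
        # so this Softmax is normally omitted; it is kept here to match the logs below.
        self.out = nn.Sequential(
            nn.Linear(84, num_classes),
            nn.Softmax(dim=1)).to('cuda:3')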

    def forward(self, x):
        x = self.conv1(x.to('cuda:0'))
        x = self.conv2(x.to('cuda:1'))
        x = self.fc(x.to('cuda:2'))
        return self.out(x.to('cuda:3'))

Check which device each layer of the model is on:

In [None]:
model = ConvNet()
print(f"conv1 in {model.conv1[0].weight.device}")
print(f"conv2 in {model.conv2[0].weight.device}")
print(f"fc in {model.fc[1].weight.device}")
print(f"out in {model.out[0].weight.device}")

conv1 in cuda:0
conv2 in cuda:1
fc in cuda:2
out in cuda:3
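
The same check can be generalized into a small helper that sums parameter bytes per device. The function name `params_per_device` is hypothetical, and the demo model is a CPU stand-in; with the `ConvNet` above one would pass `model` instead. Note that it counts only parameters, not activations, gradients, or optimizer state.

```python
from collections import defaultdict

import torch.nn as nn


def params_per_device(module: nn.Module) -> dict:
    """Return {device string: total parameter bytes} for a module."""
    usage = defaultdict(int)
    for p in module.parameters():
        usage[str(p.device)] += p.numel() * p.element_size()
    return dict(usage)


# Stand-in model on the CPU (an assumption so the sketch runs without GPUs).
demo = nn.Sequential(nn.Linear(4, 6), nn.ReLU(), nn.Linear(6, 2))
print(params_per_device(demo))
```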


Next we run a simple training loop to see whether it works. The code below is the same as in 2-DataParallel, with all data-parallel parts removed.

In [6]:
import torchvision
import os
import time

def get_dataset(path='./data'):
    DOWNLOAD = False
    if not(os.path.exists(path)) or not os.listdir(path):
        # the CIFAR directory does not exist or is empty
        DOWNLOAD = True
    else:
        print("Cifar dataset already exist in '{}', skip download".format(path))

    transform = torchvision.transforms.Compose([
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    trainset = torchvision.datasets.CIFAR10(
        root = path,
        train = True,
        transform = transform,
        download = DOWNLOAD
    )
    testset = torchvision.datasets.CIFAR10(
        root = path,
        train = False,
        transform = transform,
        download = DOWNLOAD
    )
    
    return trainset, testset

def main():
    net = ConvNet()

    trainset, testset = get_dataset("../2-DataParallel/data")
    train_loader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

    criteria = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=0.001)


    for epoch in range(10):
        t0 = time.time()
        net.train()
        
        loss_sum,acc_sum = 0,0
        for i, (inputs, labels) in enumerate(train_loader):
            
            #! Here the input tensor is on cuda:0 and the label tensor is on cuda:3
            inputs, labels = inputs.to('cuda:0'), labels.to('cuda:3')
            outputs = net(inputs)
            loss = criteria(outputs, labels)
            
            loss_sum += loss.item()
            predict = torch.argmax(outputs, dim=1)
            acc_sum += torch.sum(predict == labels).item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        print("Epoch: {}, Loss: {:.2f}, acc: {:.2f}, time cost: {:.2f}s".format(epoch, loss_sum/len(train_loader), acc_sum/len(trainset), time.time()-t0))
        
main()

Cifar dataset already exist in '../2-DataParallel/data', skip download
Epoch: 0, Loss: 2.13, acc: 0.33, time cost: 15.83s
Epoch: 1, Loss: 2.03, acc: 0.42, time cost: 15.11s
Epoch: 2, Loss: 2.00, acc: 0.46, time cost: 14.49s
Epoch: 3, Loss: 1.97, acc: 0.49, time cost: 14.94s
Epoch: 4, Loss: 1.95, acc: 0.51, time cost: 14.37s
Epoch: 5, Loss: 1.93, acc: 0.53, time cost: 14.98s
Epoch: 6, Loss: 1.91, acc: 0.55, time cost: 14.59s
Epoch: 7, Loss: 1.90, acc: 0.56, time cost: 15.24s
Epoch: 8, Loss: 1.89, acc: 0.57, time cost: 15.05s
Epoch: 9, Loss: 1.88, acc: 0.58, time cost: 15.04s


For comparison, here is the log of a single-card run on the same GPU and in the same environment:

```txt
Cifar dataset already exist in './data', skip download
Epoch: 0, Loss: 2.12, acc: 0.34, time cost: 15.32s
Epoch: 1, Loss: 2.03, acc: 0.43, time cost: 14.14s
Epoch: 2, Loss: 2.00, acc: 0.46, time cost: 14.19s
Epoch: 3, Loss: 1.97, acc: 0.48, time cost: 14.51s
Epoch: 4, Loss: 1.95, acc: 0.50, time cost: 14.23s
Epoch: 5, Loss: 1.94, acc: 0.52, time cost: 14.22s
Epoch: 6, Loss: 1.92, acc: 0.54, time cost: 14.27s
Epoch: 7, Loss: 1.91, acc: 0.55, time cost: 14.32s
Epoch: 8, Loss: 1.90, acc: 0.56, time cost: 13.78s
Epoch: 9, Loss: 1.88, acc: 0.58, time cost: 14.36s
```

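Compared with the single-card run, simply splitting the model into four pieces across four cards gives the same accuracy and reduces the memory used on each card, but brings no noticeable speedup. The reason is that naively placing different layers on different GPUs is inefficient. First, copying data between GPUs has a significant communication cost, which becomes more pronounced when training on larger data. Second, when the forward pass runs sequentially, only one GPU is working at any time while the others wait for data, wasting resources. Addressing this requires pipeline parallelism, which is covered in detail in Part 4.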
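
A back-of-the-envelope model makes both points visible. The stage times and the transfer time below are made-up numbers, and the model assumes stages run strictly one after another; it is only a sketch of the argument, not a measurement.

```python
def naive_model_parallel_time(stage_times, transfer_time):
    """Forward time when stages on different GPUs run strictly in sequence."""
    return sum(stage_times) + transfer_time * (len(stage_times) - 1)


stage_times = [1.0, 1.0, 1.0, 1.0]   # hypothetical compute time per stage
transfer_time = 0.1                  # hypothetical GPU-to-GPU copy time

single_gpu = sum(stage_times)
split = naive_model_parallel_time(stage_times, transfer_time)

# Each GPU is busy only during its own stage and idle for the rest of the pass.
utilization = [t / split for t in stage_times]

print(f"single GPU: {single_gpu:.2f}, four GPUs in sequence: {split:.2f}")
print("per-GPU utilization:", [f"{u:.2f}" for u in utilization])
```

Under these assumptions the split model can never be faster than the single card, and each of the four GPUs is busy less than a quarter of the time.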

For pipeline parallelism, the PyTorch tutorial also provides a simple implementation. We wrap the `ConvNet()` class above to run the forward pass as a pipeline:

In [21]:
class PipelineConvNet(ConvNet):
    
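    # On top of the parent class, add a split_size parameter that sets the mini-batch size used in the pipeline.
    # The CIFAR batch_size set above is 64. With split_size=8, 8 samples are processed at a time: once GPU0 has finished these 8 samples and passed the result to GPU1, it can start on the next mini-batch immediately instead of waiting.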
    def __init__(self, split_size=8, *args, **kwargs):
        super(PipelineConvNet, self).__init__(*args, **kwargs)
        self.split_size = split_size
        
    def forward(self, x):
        batch_size = x.shape[0]
        
        splits = x.split(self.split_size, dim=0)
        s_next, s_prev_1, s_prev_2, s_prev_3 = None, None, None, None
        ret = []
        
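        # With n cards in the pipeline and batch_size/split_size mini-batches to process, batch_size/split_size + (n-1) forward steps are needed
        assert len(splits) == -(-batch_size // self.split_size)  # ceiling division: split() keeps a smaller final chunk
        total_forward_len = len(splits) + (4 - 1)  # here n=4, so a full batch takes 64/8 + (4-1) = 11 forward steps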
        
        for index in range(total_forward_len):
            # Fetch the next mini-batch to process. Note: during the last (n-1) forward steps of the pipeline there is no new mini-batch to fetch
            if index < len(splits):
                s_next = splits[index]
            else:
                s_next = None
                
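            # A. On cuda:3, compute s_ret from s_prev_3. The stages must run in reverse order; otherwise the result for the new mini-batch would overwrite the result of the previous mini-batch before it is consumed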
            if s_prev_3 is not None:
                s_ret = self.out(s_prev_3.to('cuda:3'))
                ret.append(s_ret)
            
            # B. On cuda:2, compute s_prev_3 from s_prev_2
            if s_prev_2 is not None:
                s_prev_3 = self.fc(s_prev_2.to('cuda:2'))
                
            # C. On cuda:1, compute s_prev_2 from s_prev_1
            if s_prev_1 is not None:
                s_prev_2 = self.conv2(s_prev_1.to('cuda:1'))
            
            # D. On cuda:0, compute s_prev_1 from s_next
            if s_next is not None:
                s_prev_1 = self.conv1(s_next.to('cuda:0'))
     
        return torch.cat(ret, dim=0)

The code above is rather verbose, but its core is simply the pipelined forward pass, as shown in the figure below:

![pipeline](../assets/pipeline.png)

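Each small square is a mini-batch, and the mini-batches go through the forward computation one after another in pipeline fashion. The backward part will be added in a later update.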
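
The schedule in the figure can also be written out as a small table. The sketch below is plain Python with no GPU involved; the function name `pipeline_schedule` is hypothetical, and it assumes every stage takes one time step per mini-batch. It mirrors the forward loop of `PipelineConvNet` above, where mini-batch `k` reaches stage `s` at step `k + s`.

```python
def pipeline_schedule(num_stages: int, num_micro_batches: int):
    """For every time step, list the mini-batch index each stage works on (None = idle)."""
    total_steps = num_micro_batches + num_stages - 1
    schedule = []
    for step in range(total_steps):
        row = []
        for stage in range(num_stages):
            mb = step - stage  # mini-batch mb reaches stage s at step mb + s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(num_stages=4, num_micro_batches=8)
for step, row in enumerate(schedule):
    print(step, ["." if mb is None else mb for mb in row])

busy = sum(mb is not None for row in schedule for mb in row)
print(f"steps: {len(schedule)}, busy slots: {busy} of {len(schedule) * 4}")
```

The idle slots at the start and end of the table are the pipeline "bubble": with 4 stages and 8 mini-batches there are 11 steps in total, matching `total_forward_len` in the code above.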

Next we briefly test the correctness of the pipeline with code (its efficiency can't be tested for now, haha).

In [22]:
def main():
    net = PipelineConvNet()

    trainset, testset = get_dataset("../2-DataParallel/data")
    train_loader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

    criteria = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=0.001)


    for epoch in range(10):
        t0 = time.time()
        net.train()
        
        loss_sum,acc_sum = 0,0
        for i, (inputs, labels) in enumerate(train_loader):
            
            #! Here the input tensor is on cuda:0 and the label tensor is on cuda:3
            inputs, labels = inputs.to('cuda:0'), labels.to('cuda:3')
            outputs = net(inputs)
            loss = criteria(outputs, labels)
            
            loss_sum += loss.item()
            predict = torch.argmax(outputs, dim=1)
            acc_sum += torch.sum(predict == labels).item()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        print("Epoch: {}, Loss: {:.2f}, acc: {:.2f}, time cost: {:.2f}s".format(epoch, loss_sum/len(train_loader), acc_sum/len(trainset), time.time()-t0))
        
main()

Cifar dataset already exist in '../2-DataParallel/data', skip download
Epoch: 0, Loss: 2.12, acc: 0.33, time cost: 24.38s
Epoch: 1, Loss: 2.04, acc: 0.41, time cost: 25.27s
Epoch: 2, Loss: 2.00, acc: 0.45, time cost: 24.86s
Epoch: 3, Loss: 1.97, acc: 0.49, time cost: 27.58s
Epoch: 4, Loss: 1.95, acc: 0.51, time cost: 22.80s
Epoch: 5, Loss: 1.93, acc: 0.53, time cost: 26.39s
Epoch: 6, Loss: 1.92, acc: 0.54, time cost: 27.05s
Epoch: 7, Loss: 1.90, acc: 0.56, time cost: 28.28s
Epoch: 8, Loss: 1.89, acc: 0.57, time cost: 27.12s
Epoch: 9, Loss: 1.88, acc: 0.58, time cost: 25.86s
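
The training log shows that the loss falls as before; a more direct correctness check is to compare the pipelined forward pass with an ordinary one on the same input. The sketch below does this for a hypothetical two-stage model kept on the CPU, so it runs without GPUs; `stage1`, `stage2`, and `micro_batched_forward` are illustrative names, not part of the code above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two hypothetical stages; on real hardware each would live on its own GPU.
stage1 = nn.Sequential(nn.Linear(4, 6), nn.ReLU())
stage2 = nn.Linear(6, 2)


def full_forward(x):
    return stage2(stage1(x))


def micro_batched_forward(x, split_size):
    """Push mini-batches through the two stages in pipeline order."""
    prev, ret = None, []
    splits = list(x.split(split_size, dim=0))
    for s_next in splits + [None]:       # one extra step drains the pipeline
        if prev is not None:
            ret.append(stage2(prev))     # later stage first, so prev is consumed before it is replaced
        prev = stage1(s_next) if s_next is not None else None
    return torch.cat(ret, dim=0)


x = torch.randn(64, 4)
with torch.no_grad():
    reference = full_forward(x)
    pipelined = micro_batched_forward(x, split_size=8)
print(torch.allclose(reference, pipelined, atol=1e-6))
```

The two results agree because every layer here treats samples independently; a layer that uses batch statistics, such as BatchNorm in training mode, would behave differently once the batch is split.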


Ideally, of course, pipeline parallelism would be implemented efficiently with an existing library; we'll leave that for Part 4~