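GradScaler is a key component of automatic mixed precision (AMP) training in PyTorch. It multiplies the loss by a scale factor before the backward pass so that small gradients do not underflow to zero in half precision (float16), and it adjusts that factor dynamically during training. Writing test cases is a useful way to verify that GradScaler behaves as expected. The example below shows how to use GradScaler in a training loop and check that it works.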
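
To make the underflow problem concrete, here is a minimal sketch that needs no GPU. It uses the half-precision format of Python's standard `struct` module as a stand-in for a float16 tensor; the helper `to_float16` and the gradient value `1e-8` are illustrative assumptions, while `2.0 ** 16` is GradScaler's default initial scale.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


SCALE = 2.0 ** 16   # GradScaler's default initial scale factor
tiny_grad = 1e-8    # hypothetical gradient, below float16's smallest subnormal (~6e-8)

unscaled = to_float16(tiny_grad)        # underflows to 0.0, the gradient is lost
scaled = to_float16(tiny_grad * SCALE)  # large enough to be stored in float16
recovered = scaled / SCALE              # unscaling happens in higher precision

print("stored without scaling:", unscaled)
print("stored with scaling:   ", scaled)
print("after unscaling:       ", recovered)
```

The recovered value is close to the original gradient rather than exactly equal to it, because float16 keeps only about three decimal digits.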

# GradScaler Test Case

## 1. Test Purpose
Verify that GradScaler scales the loss and adjusts its scale factor correctly, and that the model still converges during training.
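
What "adjusting the scale factor" means can be sketched in plain Python. The class below is a hypothetical stand-in (`ToyScaleTracker` is not a PyTorch API) that mimics only the bookkeeping GradScaler documents for `update()`: back off when a step produced inf/NaN gradients, grow after `growth_interval` consecutive clean steps. The defaults mirror GradScaler's documented defaults; the short `growth_interval=3` is chosen only to keep the demo brief.

```python
from dataclasses import dataclass


@dataclass
class ToyScaleTracker:
    """Hypothetical helper that imitates GradScaler's scale-factor bookkeeping."""
    scale: float = 2.0 ** 16
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000
    good_steps: int = 0

    def update(self, found_inf: bool) -> None:
        if found_inf:
            # Overflow detected: shrink the scale and restart the streak.
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                # Enough clean steps in a row: try a larger scale.
                self.scale *= self.growth_factor
                self.good_steps = 0


tracker = ToyScaleTracker(growth_interval=3)
history = []
for found_inf in [True, False, False, False, True]:
    tracker.update(found_inf)
    history.append(tracker.scale)

print(history)
```

In a real test, the corresponding check would read `scaler.get_scale()` before and after a step that is known to overflow.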

## 2. Test Environment
- PyTorch version: 1.6 or later (`torch.cuda.amp` has been built in since 1.6)
- GPU: an NVIDIA GPU with CUDA support
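
These prerequisites can be checked before the test runs. The following is a sketch only; the helper names `parse_major_minor` and `environment_report` are made up for illustration, and `torch` is imported lazily so the script also runs on a machine where PyTorch is missing.

```python
import re

MIN_VERSION = (1, 6)  # first release that ships torch.cuda.amp


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '1.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def environment_report() -> dict:
    """Report which of the prerequisites listed above are satisfied."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False, "version_ok": False, "cuda_available": False}
    return {
        "torch_installed": True,
        "version_ok": parse_major_minor(torch.__version__) >= MIN_VERSION,
        "cuda_available": torch.cuda.is_available(),
    }


print(environment_report())
```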

## 3. Test Steps

### 3.1 Prepare the Test Data and Model
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Define a simple neural network: a single linear layer
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Create the model on the GPU and an SGD optimizer for its parameters
model = SimpleNet().cuda()
optimizer = optim.SGD(model.parameters(), lr=0.01)
```
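
One way to turn this setup into an actual check is sketched below. It is an illustration rather than the original author's test: the fixed regression batch, the 50-step loop, and the CPU fallback (`enabled=use_amp`) are assumptions added so the script runs with or without a GPU. It keeps the `torch.cuda.amp` imports used above; recent PyTorch releases may emit a deprecation warning pointing to `torch.amp` instead.

```python
import math

import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Assumption: fall back to plain float32 training when no CUDA device exists.
use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

torch.manual_seed(0)
model = nn.Linear(10, 1).to(device)  # same shape as SimpleNet's only layer
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
scaler = GradScaler(enabled=use_amp)

# Hypothetical fixed batch with a learnable target so the loss can decrease.
inputs = torch.randn(64, 10, device=device)
targets = inputs.sum(dim=1, keepdim=True)

losses = []
for step in range(50):
    optimizer.zero_grad()
    with autocast(enabled=use_amp):
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # skipped internally if grads contain inf/NaN
    scaler.update()                # adjust the scale factor for the next step
    losses.append(loss.item())

all_finite = all(math.isfinite(value) for value in losses)
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
print(f"all losses finite: {all_finite}, final scale: {scaler.get_scale()}")
```

Comparing the first and last loss over many steps, instead of asserting that a single step changed the weights, tolerates the early iterations that GradScaler may skip while it backs off from its large initial scale.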

The cell below applies GradScaler in a DistributedDataParallel (DDP) training loop with a toy model and random data.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import Dataset, DataLoader, DistributedSampler
from torch.cuda.amp import GradScaler, autocast

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 10)

    def forward(self, x):
        return self.linear(x)

class ToyDataset(Dataset):
    """100 samples; each access returns a fresh random (input, label) pair."""
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return torch.rand(10), torch.rand(10)

def setup(rank, world_size):
    # MASTER_ADDR and MASTER_PORT must already be set in the environment, e.g.:
    # os.environ['MASTER_ADDR'] = '172.17.0.2'
    # os.environ['MASTER_PORT'] = '50574'
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size):
    setup(rank, world_size)

    # One process per GPU: the rank doubles as the CUDA device index
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    dataset = ToyDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    dataloader = DataLoader(dataset, sampler=sampler, batch_size=10)

    criterion = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    scaler = GradScaler()

    for epoch in range(2):  # loop over the dataset multiple times
        sampler.set_epoch(epoch)  # reshuffle differently in each epoch
        for i, (inputs, labels) in enumerate(dataloader):
            inputs = inputs.to(rank)
            labels = labels.to(rank)

            optimizer.zero_grad()
            with autocast():
                # inputs are already on this rank's device; no extra .cuda() call
                outputs = ddp_model(inputs)
                loss = criterion(outputs, labels)

            # Scale the loss and backpropagate
            scaler.scale(loss).backward()

            # Unscale the gradients and step the optimizer
            # (the step is skipped if the gradients contain inf/NaN)
            scaler.step(optimizer)

            # Update GradScaler's scale factor for the next iteration
            scaler.update()
            print(f"rank: {rank}, epoch: {epoch}, iteration:{i}, loss: {loss.item():.3f}")

    cleanup()
```

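The next cell sets the rendezvous environment variables by hand and launches a single-process run (world size 1):

```python
os.environ['MASTER_ADDR'] = '172.17.0.2'
os.environ['MASTER_PORT'] = '50574'
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
rank = int(os.environ['RANK'])
world_size = int(os.environ['WORLD_SIZE'])
train(rank, world_size)
```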

Output:

```text
rank: 0, epoch: 0, iteration:0, loss: 0.553
rank: 0, epoch: 0, iteration:1, loss: 0.476
rank: 0, epoch: 0, iteration:2, loss: 0.507
rank: 0, epoch: 0, iteration:3, loss: 0.553
rank: 0, epoch: 0, iteration:4, loss: 0.479
rank: 0, epoch: 0, iteration:5, loss: 0.491
rank: 0, epoch: 0, iteration:6, loss: 0.501
rank: 0, epoch: 0, iteration:7, loss: 0.421
rank: 0, epoch: 0, iteration:8, loss: 0.542
rank: 0, epoch: 0, iteration:9, loss: 0.487
rank: 0, epoch: 1, iteration:0, loss: 0.420
rank: 0, epoch: 1, iteration:1, loss: 0.482
rank: 0, epoch: 1, iteration:2, loss: 0.434
rank: 0, epoch: 1, iteration:3, loss: 0.510
rank: 0, epoch: 1, iteration:4, loss: 0.545
rank: 0, epoch: 1, iteration:5, loss: 0.468
rank: 0, epoch: 1, iteration:6, loss: 0.414
rank: 0, epoch: 1, iteration:7, loss: 0.491
rank: 0, epoch: 1, iteration:8, loss: 0.444
rank: 0, epoch: 1, iteration:9, loss: 0.454
```

Because `ToyDataset` returns fresh random inputs and labels on every access, the loss fluctuates instead of converging. This run therefore confirms only that the scaled backward pass and the optimizer step execute without error and produce finite losses.
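
The log shows 10 iterations per epoch because a single rank receives all 100 samples and the batch size is 10. The sketch below reproduces that arithmetic with a simplified, unshuffled version of how `DistributedSampler` partitions indices when `drop_last` is left at its default of `False`. The function names `shard_indices` and `batches_per_epoch` are hypothetical helpers, not PyTorch APIs.

```python
import math


def shard_indices(dataset_len: int, world_size: int, rank: int) -> list:
    """Indices assigned to one rank (shuffling omitted for clarity)."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by repeating indices from the front so every rank gets the same count.
    indices += indices[: total - dataset_len]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total:world_size]


def batches_per_epoch(dataset_len: int, world_size: int, batch_size: int) -> int:
    """Number of DataLoader iterations per rank when the last batch is kept."""
    per_rank = math.ceil(dataset_len / world_size)
    return math.ceil(per_rank / batch_size)


print(batches_per_epoch(100, world_size=1, batch_size=10))  # the run above
print(batches_per_epoch(100, world_size=4, batch_size=10))  # a 4-GPU run
print(shard_indices(10, world_size=4, rank=2))
```

With four ranks, each process would see 25 samples and therefore print 3 iterations per epoch instead of 10.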
