
Mixed precision training slower than FP32 training #297

Open
miguelvr opened this issue May 9, 2019 · 8 comments

@miguelvr

miguelvr commented May 9, 2019

I've been doing some experiments on CIFAR10 with ResNets and decided to give APEX AMP a try.

However, I ran into some performance issues:

  1. AMP with PyTorch's torch.nn.parallel.DistributedDataParallel was extremely slow.
  2. AMP with apex.parallel.DistributedDataParallel was slower than the default training with torch.nn.parallel.DistributedDataParallel (no apex involved). For reference, normal training took about 15 minutes, while apex AMP training took 21 minutes (90 epochs on CIFAR-10 with ResNet-20).

I followed the installation instructions, but I couldn't install the C++ extensions because of my GCC/CUDA version. Could that explain the slowdown?
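(A quick way to check whether the optional compiled extensions made it into the install is to try importing them directly. The sketch below assumes the compiled module is named amp_C, as in apex's build scripts; treat the name as an assumption if your version differs.)

# Rough check for apex's optional C++/CUDA extensions.
# amp_C is an assumption based on apex's build scripts; a Python-only
# install (without --cuda_ext) will not have it, but amp still works.
try:
    import amp_C  # noqa: F401
    print("apex compiled extensions found")
except ImportError:
    print("apex is running in Python-only mode (no compiled extensions)")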

You can see the code here:
https://github.com/braincreators/octconv/blob/34440209c4b37fb5198f75e4e8c052e92e80e85d/benchmarks/train.py#L1-L498

And run it (2 GPUs):

Without APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1

With APEX AMP:
python -m torch.distributed.launch --nproc_per_node 2 train.py -c configs/cifar10/resnet20_small.yml --batch-size 128 --lr 0.1 --mixed-precision

@mcarilli
Contributor

mcarilli commented May 9, 2019

If your network or batch size is small, you may be underutilizing the device, in which case there's not much for Amp to accelerate. What kind of GPUs are you using? Also, is Amp slower than normal training within a single process as well?
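A minimal single-process comparison could look something like the sketch below; the model, batch size, and iteration counts are arbitrary placeholders, and it relies only on the amp.initialize / amp.scale_loss calls already used in this thread. Run it once with opt_level='O0' and once with 'O1' and compare the per-iteration time.

import time
import torch
import torchvision
from apex import amp

# Single-GPU timing sketch: placeholder model and synthetic data.
model = torchvision.models.resnet18(pretrained=False).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss().cuda()
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')  # or 'O0'

data = torch.randn(128, 3, 32, 32).cuda()
target = torch.randint(0, 10, (128,)).cuda()

for i in range(20):
    if i == 10:                      # time only the last 10 iterations (warm-up first)
        torch.cuda.synchronize()
        start = time.time()
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
torch.cuda.synchronize()
print("avg iteration time: {:.4f}s".format((time.time() - start) / 10))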

@miguelvr
Author

miguelvr commented May 9, 2019

I ran the tests with 2x GTX 1080 Ti and a batch size of 128 (so 64 per device).

I haven't tested with a single device yet. I'll let you know.

@zsef123

zsef123 commented May 10, 2019

The GTX 1080 Ti has very low FP16 throughput.

If you want better FP16 performance, you need a Volta-architecture GPU or an RTX-series card.

See this thread: https://devtalk.nvidia.com/default/topic/1023708/gpu-accelerated-libraries/fp16-support-on-gtx-1060-and-1080/
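As a quick sanity check, PyTorch can report the compute capability of the installed GPU; Tensor Cores only exist on devices with compute capability 7.0 or higher (Volta/Turing), while Pascal cards such as the 1080 Ti report 6.1.

import torch

# Pascal (GTX 10xx) reports capability 6.x and has very slow FP16 math;
# Volta/Turing (V100, RTX 20xx) report 7.x and have Tensor Cores.
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), "compute capability {}.{}".format(major, minor))
if major < 7:
    print("No Tensor Cores: mixed precision is unlikely to be faster on this GPU")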

@miguelvr
Author

Alright, I'll test it on a few V100s.

@mcarilli
Contributor

Yes, the 1080 Ti was intended for gaming, so it has very low compute throughput for FP16 math. You need a Tensor Core-enabled GPU (Turing or Volta) to get the best results with mixed precision.

@patrickpjiang

Hello, I ran into the same problem when trying to run experiments on a single RTX 2080: the performance with O1 is worse than with O0, with more time spent and more memory consumed.

The compute capability of the RTX 2080 is 7.5, so I think it should work with amp (see the docs: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions). Does anyone know why?

@patrickpjiang

Here is my code; my environment is an RTX 2080 with CUDA 10.1.

import os
from datetime import datetime
import argparse
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
from apex import amp


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of gpus per node')
    parser.add_argument('--epochs', default=2, type=int, metavar='N',
                        help='number of total epochs to run')
    args = parser.parse_args()
    args.gpu = 0
    train(args)


def train(args):
    gpu = args.gpu
    torch.manual_seed(0)
    model = torchvision.models.vgg19(pretrained=False)
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 200

    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)

    # wrap the model and optimizer for mixed precision
    model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

    # data loading code
    train_dataset = torchvision.datasets.CIFAR100(
        root='./data',
        train=True,
        transform=transforms.ToTensor(),
        download=True
    )
    train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=1,
        pin_memory=True,
        drop_last=True
    )

    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            model.train()
            outputs = model(images)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            # scale the loss so FP16 gradients don't underflow
            with amp.scale_loss(loss, optimizer) as scaled_loss:
                scaled_loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(
                    epoch + 1,
                    args.epochs,
                    i + 1,
                    total_step,
                    loss.item())
                )
        print("Training complete in: " + str(datetime.now() - start))


if __name__ == '__main__':
    main()
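To see whether half-precision kernels are actually being launched under O1, one option is to profile a single training iteration with PyTorch's built-in autograd profiler and inspect which CUDA kernels dominate the time. The snippet below assumes the same model / criterion / optimizer / amp setup as the script above; it is a diagnostic sketch, not part of the training loop.

# Profile one iteration and list kernels by total CUDA time.
# Under O1 you would expect half-precision GEMM/convolution kernels near the
# top; if everything is FP32, the FP16 path is not being exercised.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    outputs = model(images)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
print(prof.key_averages().table(sort_by="cuda_time_total"))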

@patrickpjiang

I noticed there was an "ImportError", so I reinstalled apex (against another PyTorch version, 1.4) and hit another problem, a "version mismatch". Following #323, I deleted the version-check code and it finally installed with no warnings!

However, when I ran my test code, the training time was still longer with O1 than with O0, while the memory cost did decrease slightly. Is that normal?

mode | memory  | time
O0   | 3855 MB | 26 s/epoch
O1   | 3557 MB | 33 s/epoch
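Two things can skew an epoch-level comparison like this: with num_workers=1 the measured time includes a lot of data loading and host-side work that mixed precision cannot speed up, and O1 adds per-op cast overhead. A rough way to isolate the GPU compute time, assuming the same model, loader, criterion, optimizer, and amp setup as the script above (cudnn.benchmark is optional but usually helps with fixed-size inputs):

import time
import torch

torch.backends.cudnn.benchmark = True  # let cuDNN pick the fastest algorithms

# Time only forward/backward/step, excluding data loading, with explicit
# synchronization so the queued GPU work is fully counted.
compute_time = 0.0
for images, labels in train_loader:
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    torch.cuda.synchronize()
    t0 = time.time()
    outputs = model(images)
    loss = criterion(outputs, labels)
    optimizer.zero_grad()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    compute_time += time.time() - t0
print("GPU compute time this epoch: {:.1f}s".format(compute_time))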
