# Automatic Mixed Precision

```torch.cuda.amp``` provides convenience methods for mixed precision, where some operations use the ```torch.float32``` datatype and other operations use ```torch.float16``` (half). Some ops, like linear and convolution layers, are much faster in ```torch.float16```, while other ops, like reductions, often require the dynamic range of ```torch.float32```.

`Mixed Precision` tries to match each op to its appropriate datatype, which can reduce the network's `runtime` and ```memory footprint```.
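
The range limits that make some ops unsafe in half precision can be seen without a GPU. The sketch below is only an illustration: it uses Python's standard `struct` module, whose `'e'` format is IEEE 754 half precision, as a stand-in for ```torch.float16```. The helper name `to_half` is made up for this example.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Ordinary values survive, but only with about three decimal digits.
print(to_half(0.1))

# 65504 is the largest finite half-precision value.
print(to_half(65504.0))

# Values below the smallest half-precision subnormal (about 6e-8) become zero.
print(to_half(1e-8))

# Values far above the maximum cannot be stored at all.
try:
    to_half(1e6)
except OverflowError as err:
    print("overflow:", err)
```

A long reduction (for example a sum over many activations) can easily leave this range, which is why such ops are usually kept in ```torch.float32```.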

```Automatic Mixed Precision Training``` uses ```torch.cuda.amp.autocast``` and ```torch.cuda.amp.GradScaler``` together.

In this notebook, we measure the performance of a simple network in ```default precision```, then add ```autocast``` and ```GradScaler``` to run the same network in `mixed precision` with improved performance.

In [1]:
# My GPU's Architecture (Single GPU RTX 2060 6GB)
!nvidia-smi

Sat Jul 31 04:42:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 466.77       Driver Version: 466.77       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   60C    P8     8W /  N/A |    255MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
import torch
import time
import gc

start_time = None

def start_timer():
    global start_time
    gc.collect()
    
    torch.cuda.empty_cache()
    # reset_max_memory_allocated() is deprecated in favor of reset_peak_memory_stats()
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    
    start_time = time.time()
    
def end_timer(local_msg):
    torch.cuda.synchronize()
    end_time = time.time()
    
    print("\n" + local_msg)
    print("Total execution time = {:.3f} sec".format(end_time - start_time))
    print("Max memory used by tensors = {} bytes".format(torch.cuda.max_memory_allocated()))

## A Simple Neural Network
Neural Net with Linear Layers and ReLU.

In [3]:
def make_model(in_size, out_size, num_layers):
    layers = []
    for _ in range(num_layers - 1):
        layers.append(torch.nn.Linear(in_size, in_size))
        layers.append(torch.nn.ReLU())
    layers.append(torch.nn.Linear(in_size, out_size))
    return torch.nn.Sequential(*tuple(layers)).cuda()

```batch_size```, ```in_size```, ```out_size```, and ```num_layers``` are chosen to be large enough to saturate the GPU with work. Typically, `mixed precision` provides the greatest speedup when the `GPU` is saturated. Small networks may be `CPU` bound, in which case mixed precision won’t improve performance.

Sizes are also chosen such that linear layers’ participating dimensions are multiples of 8, to permit `Tensor Core` usage on `Tensor Core-capable GPUs`.
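
When a layer size is not already a multiple of 8, one common approach is to pad it up to the next multiple. The helper below is a hypothetical illustration (the name `round_up` is not part of PyTorch), and whether padding actually helps depends on the GPU and library versions in use.

```python
def round_up(n, multiple=8):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

# 4096 (the size used in this notebook) is already aligned;
# the other sizes are arbitrary examples.
for size in (4096, 1000, 1001, 50257):
    print(size, "->", round_up(size))
```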

In [4]:
batch_size = 512 
in_size = 4096
out_size = 4096
num_layers = 3
num_batches = 50
epochs = 3

# Creates data in default precision.
# The same data is used for both default and mixed precision trials below.
# You don't need to manually change inputs' dtype when enabling mixed precision.
data = [torch.randn(batch_size, in_size, device='cuda') for _ in range(num_batches)]
targets = [torch.randn(batch_size, out_size, device='cuda') for _ in range(num_batches)]

loss_fn = torch.nn.MSELoss().cuda()
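
As a rough sanity check on GPU memory, the float32 footprint of these tensors can be estimated by hand. This is a back-of-the-envelope sketch that repeats the sizes chosen above; it ignores gradients, activations, and allocator overhead, so the figures reported by the GPU will be larger.

```python
BYTES_PER_FLOAT32 = 4
MIB = 2 ** 20

batch_size, in_size, out_size = 512, 4096, 4096
num_layers, num_batches = 3, 50

# Inputs and targets: num_batches tensors of shape (batch_size, size) each.
data_bytes = num_batches * batch_size * in_size * BYTES_PER_FLOAT32
target_bytes = num_batches * batch_size * out_size * BYTES_PER_FLOAT32

# Parameters: (num_layers - 1) hidden Linear layers plus one output layer,
# each with a weight matrix and a bias vector.
hidden_params = (num_layers - 1) * (in_size * in_size + in_size)
output_params = in_size * out_size + out_size
param_bytes = (hidden_params + output_params) * BYTES_PER_FLOAT32

print(f"data:    {data_bytes / MIB:.0f} MiB")
print(f"targets: {target_bytes / MIB:.0f} MiB")
print(f"params:  {param_bytes / MIB:.0f} MiB")
```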

In [5]:
data

[tensor([[-0.8316,  0.9260,  0.1173,  ...,  0.1361,  0.0983,  1.3280],
         [-1.3049,  0.0830,  0.9496,  ..., -1.2548, -0.9005, -0.7632],
         [-0.1109,  0.3389, -1.1462,  ..., -1.2532, -0.4346, -0.9692],
         ...,
         [-1.3842, -0.2880, -0.5669,  ...,  0.0857, -1.6083, -0.9542],
         [-0.1475,  2.0280,  0.3341,  ...,  1.0582,  1.4145, -0.6916],
         [-1.0510,  1.1050,  1.7701,  ..., -2.1392, -1.3736, -0.9380]],
        device='cuda:0'),
 tensor([[-0.1558, -0.6074,  0.6044,  ...,  1.6116,  0.6203,  1.5289],
         [ 0.6683, -1.3451, -1.6826,  ...,  1.6395, -1.0697, -0.0771],
         [-1.1532, -1.2533, -1.5787,  ...,  1.2253,  0.6631, -0.4423],
         ...,
         [ 0.0622,  0.7171,  0.4504,  ..., -0.3777, -0.6482, -0.4877],
         [ 0.1807,  0.6535, -1.3003,  ..., -0.3982,  0.3516,  0.7678],
         [ 0.4318,  0.7493, -0.2879,  ..., -0.6993,  0.0477,  1.7387]],
        device='cuda:0'),
 tensor([[ 0.2838,  0.1942,  0.7200,  ...,  0.0209, -1.1207,  1.27

In [6]:
# GPU state after allocating data and targets
!nvidia-smi

Sat Jul 31 04:52:03 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 466.77       Driver Version: 466.77       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   62C    P2    28W /  N/A |   1844MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Default Precision
Without ```torch.cuda.amp```, the net executes all ops in `default precision` (```torch.float32```).

In [7]:
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad()  # set_to_none=True can modestly improve performance (default in PyTorch 2.0+)
end_timer("Default Precision: ")




Default Precision: 
Total execution time = 8.063 sec
Max memory used by tensors = 1367458816 bytes


In [8]:
!nvidia-smi

Sat Jul 31 04:56:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 466.77       Driver Version: 466.77       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   64C    P2    29W /  N/A |   2536MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Adding Autocast

Instances of ```torch.cuda.amp.autocast``` serve as context managers that allow regions of your script to run in ```mixed precision```.

In these regions, `CUDA` ops run in a `dtype` chosen by ```autocast``` to improve performance while maintaining accuracy.
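
The dtype choice can be observed on a single layer. The sketch below is an assumption-laden variant of the pattern in this section: it uses the device-agnostic `torch.autocast` entry point (available in PyTorch 1.10 and later, so newer than the setup shown in this notebook) and falls back to `bfloat16` on CPU so that it also runs without a GPU. The names `layer`, `x`, and `y` are arbitrary.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 is the usual autocast dtype on CUDA; bfloat16 is used here on CPU.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

layer = torch.nn.Linear(8, 8).to(device)
x = torch.randn(4, 8, device=device)

with torch.autocast(device_type=device, dtype=amp_dtype):
    y = layer(x)

print(x.dtype)             # the input is still float32
print(y.dtype)             # the Linear output is in the autocast dtype
print(layer.weight.dtype)  # the stored parameters remain float32
```

Note that autocast changes the dtype in which ops execute; it does not convert the model's parameters or the input tensors in place.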

In [9]:
for epoch in range(0):  # 0 epochs: this cell only illustrates the pattern
    for input, target in zip(data, targets):
        # Run Forward Pass under Autocast
        with torch.cuda.amp.autocast():
            output = net(input)
            # output is torch.float16 because Linear layers autocast to float16
            assert output.dtype is torch.float16
            
            loss = loss_fn(output, target)
            # loss is torch.float32 because MSELoss layers autocast to float32
            assert loss.dtype is torch.float32
        
        # Exits autocast before backward().
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad()

## Adding GradScaler

```Gradient Scaling``` helps prevent gradients with small magnitudes from ```flushing to zero``` when training with ```mixed precision```.

```torch.cuda.amp.GradScaler``` performs the steps of ```gradient scaling``` conveniently.
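
The idea behind scaling can be sketched with plain numbers. This is a simplified illustration, not what `GradScaler` does internally: it again uses the `'e'` (half precision) format of Python's `struct` module as a stand-in for ```torch.float16```, and it multiplies a made-up gradient value directly, whereas in training the loss is scaled and the chain rule carries the factor into every gradient. The value `2.0 ** 16` matches the default initial scale of `GradScaler`.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

true_grad = 1e-8      # hypothetical gradient, too small for half precision
scale = 2.0 ** 16     # default initial scale of GradScaler

unscaled_half = to_half(true_grad)         # flushes to zero
scaled_half = to_half(true_grad * scale)   # now representable
recovered = scaled_half / scale            # unscale in higher precision

print(unscaled_half)
print(scaled_half)
print(recovered)
```

The recovered value agrees with the original gradient to within half-precision rounding error, while the unscaled one is lost entirely.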

In [10]:
# Constructs scaler once, at the beginning of the convergence run, using default args.
# The same GradScaler instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh GradScaler instance.  GradScaler instances are lightweight.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(0):  # 0 epochs: this cell only illustrates the pattern
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
            
        # Scales the loss, then calls backward() on the scaled loss to create scaled gradients.
        scaler.scale(loss).backward()
        
        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)
        
        # Updates the scale for the next iteration.
        scaler.update()
        opt.zero_grad()

## Automatic Mixed Precision (Combined Autocast and GradScaler)

```autocast``` and ```GradScaler``` both accept an optional ```enabled``` argument. If ```False```, `autocast` and `GradScaler`'s calls become no-ops. This allows switching between `default precision` and `mixed precision` without if/else statements.
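
The no-op behaviour can be checked on a toy model that runs on the CPU. This is a minimal sketch; the names `toy_model`, `toy_opt`, and `toy_scaler` are made up for the illustration, and recent PyTorch releases may print a deprecation warning for `torch.cuda.amp.GradScaler`.

```python
import torch

toy_model = torch.nn.Linear(4, 1)
toy_opt = torch.optim.SGD(toy_model.parameters(), lr=0.1)
toy_scaler = torch.cuda.amp.GradScaler(enabled=False)

x = torch.randn(8, 4)
target = torch.randn(8, 1)
toy_loss = torch.nn.functional.mse_loss(toy_model(x), target)

# With enabled=False, scale() hands back the very same loss tensor.
same_object = toy_scaler.scale(toy_loss) is toy_loss
print(same_object)

weights_before = toy_model.weight.detach().clone()
toy_scaler.scale(toy_loss).backward()
toy_scaler.step(toy_opt)   # simply calls toy_opt.step()
toy_scaler.update()        # does nothing
toy_opt.zero_grad()

print(toy_scaler.is_enabled())
# The optimizer step still happened, so the weights have changed.
print(not torch.equal(weights_before, toy_model.weight))
```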

In [12]:
use_amp = True  # Set to False to run the same loop in default precision

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()
        
end_timer("Automatic Mixed Precision: ")




Automatic Mixed Precision: 
Total execution time = 3.298 sec
Max memory used by tensors = 1803759616 bytes
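
The two measurements reported in this notebook can be compared directly. The snippet below only does arithmetic on the numbers printed above; they come from a single run on one GPU, so they are not a general benchmark.

```python
MIB = 2 ** 20

# Values copied from the two end_timer() outputs in this notebook.
default_sec, amp_sec = 8.063, 3.298
default_bytes, amp_bytes = 1367458816, 1803759616

speedup = default_sec / amp_sec
print(f"speedup: {speedup:.2f}x")
print(f"peak tensor memory, default precision: {default_bytes / MIB:.0f} MiB")
print(f"peak tensor memory, mixed precision:   {amp_bytes / MIB:.0f} MiB")
```

In this particular run the mixed precision loop is roughly 2.4 times faster, while the reported peak tensor memory is higher than in the default precision run.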


In [13]:
!nvidia-smi

Sat Jul 31 14:05:57 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 466.77       Driver Version: 466.77       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   52C    P2    26W /  N/A |   2957MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Inspecting/Modifying Gradients

All gradients produced by ```scaler.scale(loss).backward()``` are scaled. If you wish to ```inspect or modify``` the parameters' ```.grad``` attributes between ```backward()``` and ```scaler.step(optimizer)```, you should first `unscale` them using ```scaler.unscale_(optimizer)```.
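
Why the order matters can be shown with a small CPU-only experiment. This sketch does not use `GradScaler` itself: it multiplies a made-up gradient by an assumed scale factor of `2.0 ** 16` by hand, and the helper `clip` is hypothetical. It compares clipping after unscaling with clipping while the gradient is still scaled.

```python
import torch

scale = 2.0 ** 16                      # assumed scale factor
max_norm = 0.1
true_grad = torch.tensor([0.3, 0.4])   # L2 norm is 0.5

def clip(grad):
    """Clip a gradient tensor by attaching it to a throwaway parameter."""
    p = torch.nn.Parameter(torch.zeros_like(grad))
    p.grad = grad.clone()
    torch.nn.utils.clip_grad_norm_([p], max_norm=max_norm)
    return p.grad

# Correct order: unscale first, then clip.
clipped_after_unscale = clip(true_grad)

# Wrong order: clip the scaled gradient, then unscale.
clipped_while_scaled = clip(true_grad * scale) / scale

print(clipped_after_unscale.norm().item())   # close to max_norm
print(clipped_while_scaled.norm().item())    # far smaller than max_norm
```

Clipping the scaled gradient effectively divides the threshold by the scale factor, so nearly every update is shrunk to almost nothing.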

In [14]:
for epoch in range(0):  # 0 epochs: this cell only illustrates the pattern
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()

        # Unscales the gradients of optimizer's assigned params in-place
        scaler.unscale_(opt)

        # Since the gradients of optimizer's assigned params are now unscaled, clips as usual.
        # You may use the same value for max_norm here as you would without gradient scaling.
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)

        scaler.step(opt)
        scaler.update()
        opt.zero_grad() 