## Utilizing GPUs for Fast Inference and Training in PyTorch

One reason deep learning has flourished in recent years is the progress in computing hardware, which allows larger models to be trained faster. PyTorch offers an easy-to-use interface for running on GPUs, and it also makes it straightforward to parallelize work across multiple GPUs.

The most commonly used GPUs for this are from NVIDIA, programmed through the [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit). For the example, we are going to use a pretrained version of [EfficientNet](https://arxiv.org/abs/1905.11946) provided by PyTorch in Torchvision.

Generally, moving something onto the GPU or CPU is done via the `.to(device)` method, where `device` is an object of class `torch.device`. Using this, we can move models as well as tensors. A `state_dict` is a plain dictionary of tensors, so its entries can be moved the same way, one tensor at a time.
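To make the `.to(device)` call concrete, here is a minimal sketch. The toy `nn.Linear` layer and the variable names are assumptions chosen for illustration and are not part of the example that follows; the snippet also runs on a machine without a GPU, in which case everything simply stays on the CPU.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is usable, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensor.to returns a tensor on the target device
# (the same tensor if it already lives there).
x = torch.ones(2, 3)
x_dev = x.to(device)

# Module.to moves the parameters in place and returns the module itself.
layer = nn.Linear(3, 4)  # hypothetical toy layer, for illustration only
layer_dev = layer.to(device)

# A state_dict is a dict of tensors, so each entry is moved explicitly.
cpu_state = {name: t.to("cpu") for name, t in layer.state_dict().items()}

print(x_dev.device.type, layer_dev is layer, sorted(cpu_state))
```

Note the asymmetry: for a tensor the result of `.to` has to be assigned, while for a module the reassignment `model = model.to(device)` is optional but harmless.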

In [3]:
import torch 
import torch.nn as nn
import torchvision
import time

In [4]:
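# Load EfficientNet-B2 with pretrained ImageNet weights.
# The `weights` argument requires torchvision >= 0.13; older versions use `pretrained=True` instead.
model = torchvision.models.efficientnet_b2(weights=torchvision.models.EfficientNet_B2_Weights.DEFAULT)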



In [5]:
# Now let's see how we can move this model onto the GPU. This is often done via a variable called "device".

# Use CUDA if a compatible GPU, driver, and toolkit are available; otherwise we just work on the CPU.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model = model.to(device)
model.eval()  # switch dropout and batch norm layers to inference behavior

# Create a random image-shaped input tensor (batch of 1, 3 channels, 224x224) and move it to the device.
inputs = torch.randn(1, 3, 224, 224).to(device)

with torch.no_grad():  # gradients are not needed for inference
    output = model(inputs)  # inference on the selected device

# It is recommended to keep the number of moves between devices low, as these transfers are not cheap.
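One way to keep device transfers down, shown here as a rough sketch rather than a rule, is to allocate tensors directly on the target device and to copy results back to the host only once. The tensor shapes and the names `batch` and `running_sum` are assumptions made up for this illustration.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Allocate directly on the target device instead of creating the tensor
# on the CPU and copying it over afterwards.
batch = torch.randn(8, 3, 32, 32, device=device)

# Keep intermediate results on the device while looping...
running_sum = torch.zeros((), device=device)
for _ in range(5):
    running_sum += batch.mean()

# ...and transfer to the host a single time at the end.
result = running_sum.item()
print(batch.device.type, type(result).__name__)
```

Calling `.item()` or `.cpu()` inside the loop would instead force a transfer (and, on CUDA, a synchronization) on every iteration.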

In [None]:
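# Let's see how much faster inference is on the GPU compared to the CPU.
def measure_inference_time(model, input_tensor, iterations=100, device="cpu"):
    device = torch.device(device)
    model.to(device)
    model.eval()
    input_tensor = input_tensor.to(device)
    with torch.no_grad():
        _ = model(input_tensor)  # warm-up run, excluded from the timing
        if device.type == "cuda":
            torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait before starting the clock
        start_time = time.perf_counter()
        for _ in range(iterations):
            _ = model(input_tensor)
        if device.type == "cuda":
            torch.cuda.synchronize()  # wait for all queued kernels before stopping the clock
        end_time = time.perf_counter()

    avg_inference_time = (end_time - start_time) / iterations
    return avg_inference_time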

time_on_gpu = measure_inference_time(model, inputs, device=device)

# Repeat the measurement on the CPU (the function's default device); this also moves the model back to the CPU.
time_on_cpu = measure_inference_time(model, inputs)

# Compare results
print(f"GPU Inference is {time_on_cpu/time_on_gpu:.2f}x faster than the CPU.")
# In training, this difference is typically even more substantial, especially with large models that can exploit the parallelism of multiple GPUs.

GPU Inference is 2.60x faster than the CPU.


### Parallelization using DataParallel and DistributedDataParallel

With [`DataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html) we can split each input batch across multiple devices on a single machine. Note that this is not fully distributed training; for that, we need [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html). See the [distributed overview](https://pytorch.org/tutorials/beginner/dist_overview.html) for an introduction to the PyTorch APIs that support parallelism and the general `torch.distributed` package.
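As a minimal sketch of the `DataParallel` route, the snippet below wraps a model only when more than one GPU is visible. The small `nn.Sequential` network is a hypothetical stand-in for the EfficientNet used earlier, chosen so the example runs quickly and also works on a CPU-only machine (where the wrapper is simply skipped).

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical small model standing in for the EfficientNet used above.
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Wrap only when more than one GPU is visible. DataParallel then splits each
# batch along dim 0, runs the replicas in parallel, and gathers the outputs.
if torch.cuda.device_count() > 1:
    net = nn.DataParallel(net)
net = net.to(device)

batch = torch.randn(8, 16, device=device)
with torch.no_grad():
    out = net(batch)
print(out.shape)
```

The rest of the code does not change after wrapping, which is the main appeal of `DataParallel`. The PyTorch documentation nevertheless recommends `DistributedDataParallel` even on a single machine; it needs a process group and one process per GPU, so it is not sketched here.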