# Memory management and Multi-GPU Usage in PyTorch

In this notebook, we will cover the following topics:

1. How to use multiple GPUs for your network, using either data parallelism or model parallelism.
2. How to automate the selection of a GPU when creating new objects.
3. How to diagnose and analyse memory issues should they arise.

In [0]:
import torch
import torch.nn as nn





## Moving tensors around CPU / GPUs




We use the `to` or `cuda` function to move tensors between the CPU and GPUs. We pass the index of the GPU as the argument.

In [0]:
if torch.cuda.is_available():
	dev = "cuda:0"
else:
	dev = "cpu"

device = torch.device(dev)

a = torch.zeros(4,3)   
a = a.to(device)       # alternatively, a.cuda(0) or a.to(0) when a GPU is available

In a similar fashion, you can also move `nn.Module` objects to GPUs.

In [4]:
class myNetwork(nn.Module):
   def __init__(self):
      super().__init__()
      self.net = nn.Linear(5,1)
   
   def forward(self, x):
      return self.net(x)

clf = myNetwork()
clf.to(0)

myNetwork(
  (net): Linear(in_features=5, out_features=1, bias=True)
)

We can get the device index of a tensor with `get_device()`. This is only meaningful for GPU tensors; in recent PyTorch versions it returns -1 for CPU tensors.

In [0]:
dev = a.get_device() 
b = torch.tensor(a.shape).to(dev)
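As a small sketch of a device-agnostic alternative, the `device` attribute is defined for tensors on any device and can be passed directly when creating new tensors. The snippet falls back to the CPU when no GPU is present; the variable names are only illustrative.

```python
import torch

# Pick a GPU when one is present, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a = torch.zeros(4, 3, device=device)

# `.device` is defined for every tensor, CPU or GPU.
print(a.device)

# Create a second tensor on whatever device `a` lives on.
b = torch.tensor(a.shape, device=a.device)
print(b, b.device)
```

Writing code against `a.device` rather than a hard-coded index means the same cell runs unchanged on a machine without a GPU.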

We can also set the default device on which GPU tensors are created. 

In [6]:
torch.cuda.set_device(0)

tens = torch.Tensor(3,4).cuda()
tens.get_device()

0

## The new_* functions

One can also make use of the `new_*` functions that made their way to PyTorch in version 1.0. When a function like `new_ones` is called on a tensor, it returns a new tensor of the same data type, and on the same device, as the tensor on which `new_ones` was invoked.

In [0]:
ones = torch.ones((2,)).cuda(0)

# Create a tensor of ones of size (3,4) on same device as of "ones"
newOnes = ones.new_ones((3,4)) 

randTensor = torch.randn(2,4)



A detailed list of the `new_*` functions can be found in the PyTorch docs, linked below.
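The following sketch shows the dtype and device being inherited. It uses a `float64` CPU tensor as the base so that it runs without a GPU; on a GPU machine the same calls would also inherit the CUDA device. The variable names are illustrative.

```python
import torch

# A float64 base tensor; on a GPU machine you could add .cuda(0) here.
base = torch.ones((2,), dtype=torch.float64)

new_ones = base.new_ones((3, 4))      # same dtype and device as `base`
new_zeros = base.new_zeros((2, 2))    # likewise
new_full = base.new_full((2,), 7.0)   # filled with a constant value

# torch.randn, by contrast, uses the global default dtype and device.
rand_tensor = torch.randn(2, 4)

print(new_ones.dtype, new_ones.device)
print(rand_tensor.dtype, rand_tensor.device)
```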

## Using Multiple GPUs

There are two ways to make use of multiple GPUs.

1. **Data Parallelism**, where we divide batches into smaller batches and process these smaller batches in parallel on multiple GPUs.
2. **Model Parallelism**, where we break the neural network into smaller sub-networks and then execute these sub-networks on different GPUs.


### Data Parallelism

Data Parallelism in PyTorch is achieved through the `nn.DataParallel` class. You initialize an `nn.DataParallel` object with an `nn.Module` object representing your network and a list of GPU IDs across which the batches have to be parallelised.

There are a few things I want to shed light on. Even though our data is parallelised over multiple GPUs, we have to **initially** store it on a single GPU.

We also need to make sure the `DataParallel` object is on that particular GPU as well. The syntax remains similar to what we did earlier with `nn.Module`.

`DataParallel` takes the input, splits it into smaller batches, replicates the neural network across all the devices, executes the forward pass on each replica, and then collects the outputs back on the original GPU.

In [0]:
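myNet = myNetwork().to(0)           # the module must live on the first device in device_ids
parallel_net = nn.DataParallel(myNet, device_ids = [0])

inputs = torch.randn(8, 5)          # random inputs: a batch of 8 samples with 5 features each

inputs = inputs.to(0)               # the whole batch starts on a single GPU

predictions = parallel_net(inputs)  # the batch is split along dimension 0
loss = (1 - predictions).mean()
loss.backward()
#optimiser.step()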
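To make the scatter, replicate and gather steps concrete, here is a CPU-only sketch that mimics them by hand with `torch.chunk` and `torch.cat`. It illustrates the idea only and is not how `nn.DataParallel` is implemented internally; `TinyNet` and `num_replicas` are made-up names, and the replicas run one after another rather than in parallel.

```python
import copy

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(5, 1)

    def forward(self, x):
        return self.net(x)


torch.manual_seed(0)
model = TinyNet()
batch = torch.randn(8, 5)
num_replicas = 2  # stands in for the number of GPUs

# 1. Scatter: split the batch along dimension 0.
chunks = torch.chunk(batch, num_replicas, dim=0)
# 2. Replicate: one copy of the network per (pretend) device.
replicas = [copy.deepcopy(model) for _ in range(num_replicas)]
# 3. Apply: each replica processes its own chunk (sequentially here).
outputs = [replica(chunk) for replica, chunk in zip(replicas, chunks)]
# 4. Gather: concatenate the partial outputs back into one batch.
gathered = torch.cat(outputs, dim=0)

# Compare against running the whole batch through the original model.
print(torch.allclose(gathered, model(batch), atol=1e-6))
```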


One issue with `DataParallel` is that it can put an asymmetrical load on one GPU (the main node). There are generally two ways to circumvent this problem.
1. Compute the loss during the forward pass. This makes sure at least the loss calculation phase is parallelised.
2. Implement a parallel loss function layer. This is beyond the scope of this article. However, for those interested, I have given a link to a Medium article detailing the implementation of such a layer at the end of this article.
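A minimal sketch of the first option, assuming a hypothetical wrapper class `NetWithLoss` and a mean squared error loss chosen only for illustration: the wrapper's `forward` returns the loss, so each replica computes its own loss on its own GPU. The snippet wraps the module in `nn.DataParallel` only when more than one GPU is visible, and otherwise runs on a single device.

```python
import torch
import torch.nn as nn


class NetWithLoss(nn.Module):
    """Hypothetical wrapper whose forward() returns the loss, not predictions."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, inputs, targets):
        predictions = self.model(inputs)
        # Return a 1-element tensor so per-replica losses can be gathered.
        return nn.functional.mse_loss(predictions, targets).unsqueeze(0)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
wrapped = NetWithLoss(nn.Linear(5, 1)).to(device)
if torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped)

inputs = torch.randn(8, 5, device=device)
targets = torch.randn(8, 1, device=device)

# One loss value per replica comes back; average them into a scalar.
loss = wrapped(inputs, targets).mean()
loss.backward()
```

Note that averaging per-replica means matches the full-batch mean only when every replica receives the same number of samples.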

### Model Parallelism

Model parallelism means that you break your network into smaller subnetworks that you then put on different GPUs. The main motivation for doing such a thing is that your network might be too large to fit inside a single GPU.

Implementing model parallelism in PyTorch is pretty easy as long as you remember two things.
1. The input and the network should always be on the same device.
2. The `to` and `cuda` functions have autograd support, so your gradients can be copied from one GPU to another during the backward pass.

We will use the following piece of code to understand this better.

In [9]:
class model_parallel(nn.Module):
	def __init__(self):
		super().__init__()
		self.sub_network1 = nn.Linear(100,32)    # This part stays on the first GPU (cuda:0)
		self.sub_network2 = nn.Linear(32,10)     # This part stays on the second GPU (cuda:1)

		self.sub_network1.cuda(0)
		self.sub_network2.cuda(1)

	def forward(self, x):
		x = x.cuda(0)
		x = self.sub_network1(x)
		x = x.cuda(1)                            # copy the activations to the second GPU
		x = self.sub_network2(x)
		return x


x = torch.randn(100,)         # Random input
x = x.to(0)

net = model_parallel()        # No need to put it on GPUs as that has been taken care of in the 
                              # init function (requires at least two GPUs)

loss = (1 - net(x)).mean()
loss.backward()
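The cell above needs two visible GPUs. As a hedged variant, the sketch below chooses the devices at run time and falls back to a single device (or the CPU) when fewer GPUs are present, so the same structure can be tried anywhere. `TwoStageNet`, `dev0` and `dev1` are illustrative names, not part of the original code.

```python
import torch
import torch.nn as nn

# Use two GPUs when present; otherwise fall back so the sketch still runs.
n_gpus = torch.cuda.device_count()
dev0 = torch.device("cuda:0") if n_gpus >= 1 else torch.device("cpu")
dev1 = torch.device("cuda:1") if n_gpus >= 2 else dev0


class TwoStageNet(nn.Module):
    def __init__(self, first_device, second_device):
        super().__init__()
        self.first_device = first_device
        self.second_device = second_device
        self.stage1 = nn.Linear(100, 32).to(first_device)
        self.stage2 = nn.Linear(32, 10).to(second_device)

    def forward(self, x):
        x = self.stage1(x.to(self.first_device))
        # The device copy is tracked by autograd, so gradients flow back.
        x = self.stage2(x.to(self.second_device))
        return x


net = TwoStageNet(dev0, dev1)
x = torch.randn(4, 100)  # a batch of 4 random inputs

loss = (1 - net(x)).mean()
loss.backward()

# Each stage's gradients live on that stage's own device.
print(net.stage1.weight.grad.device, net.stage2.weight.grad.device)
```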


## Troubleshooting Out of Memory Errors

### Diagnosing GPU Usage

One can diagnose GPU usage with the GPUtil Python library, which can be installed by typing `pip install gputil` in a terminal. The following piece of code illustrates the use of the library.


In [10]:
!pip install gputil


Collecting gputil
  Downloading https://files.pythonhosted.org/packages/ed/0e/5c61eedde9f6c87713e89d794f01e378cfd9565847d4576fa627d758c554/GPUtil-1.4.0.tar.gz
Building wheels for collected packages: gputil
  Building wheel for gputil (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/3d/77/07/80562de4bb0786e5ea186911a2c831fdd0018bda69beab71fd
Successfully built gputil
Installing collected packages: gputil
Successfully installed gputil-1.4.0


In [11]:
from GPUtil import showUtilization as gpu_usage
gpu_usage()

| ID | GPU | MEM |
------------------
|  0 |  0% |  5% |



### Dealing with Memory Losses using del keyword

PyTorch frees memory quite aggressively, but it can release a tensor only when no Python reference to the object remains.

Keep in mind that Python doesn't enforce scoping rules as strongly as languages such as C/C++: a variable defined inside a loop remains visible after the loop ends. An object is freed only when no references to it remain. Consider the following case.

In [12]:
for x in range(10):
  tensor = torch.randn(1,4)

print(tensor)   



tensor([[-1.2010,  0.2314, -0.2265, -0.0234]])


We defined `tensor` inside the loop, yet it still exists after the loop ends. Similarly, tensors holding the network's inputs, losses and outputs continue to take up memory even after we exit the training loop.

A good practice to get rid of these variables is by using the `del` keyword.

In [13]:
net = myNetwork()
opt = torch.optim.SGD(net.parameters(),lr = 0.01)

inp = torch.randn(5,)

for x in range(10):
  out = net(inp)
  loss = (1 - out).mean()
  opt.zero_grad()
  loss.backward()
  opt.step()

print(out, loss)                      # these variables still exist
del out, loss                         # Free the memory taken by these variables


tensor([0.1282], grad_fn=<AddBackward0>) tensor(0.8718, grad_fn=<MeanBackward0>)
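To see that an object is released only once its last reference disappears, one can watch it through a weak reference, which does not keep the object alive. This is a small sketch that assumes CPython's reference counting; `probe` and `alias` are illustrative names.

```python
import weakref

import torch

tensor = torch.randn(1, 4)
probe = weakref.ref(tensor)  # a weak reference does not keep the tensor alive

alias = tensor               # a second strong reference to the same object
del tensor
# `alias` still points at the object, so it has not been freed yet.
still_alive = probe() is not None

del alias
# No strong references remain, so the object can now be released.
freed = probe() is None

print(still_alive, freed)
```

This is why `del out, loss` only helps when nothing else (a list, a closure, another variable) still refers to the same tensors.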


### Using Python Data Types Instead Of 1-D Tensors

Often, we aggregate values in our training loop to compute some metrics. The most common example is updating the running loss on each iteration. However, if not done carefully in PyTorch, this can use more memory than required.

Consider the following snippet of code. 

In [0]:
total_loss = 0

for x in range(10):
  # assume loss is computed 
  iter_loss = torch.randn(3,4).mean()
  iter_loss.requires_grad = True     # losses are supposed to be differentiable
  total_loss += iter_loss            # use total_loss += iter_loss.item() instead
  

We expect that in subsequent iterations the name `iter_loss` is reassigned to the new `iter_loss`, and the object representing `iter_loss` from the earlier iteration is freed. But this doesn't happen. Why?

Since `iter_loss` is differentiable, the line `total_loss += iter_loss` creates a computation graph with one `AddBackward` function node. During subsequent iterations, more `AddBackward` nodes are added to this graph, and no object holding the values of `iter_loss` is freed.

The solution is to add a Python data type, not a tensor, to `total_loss`, which prevents the creation of any computation graph.

We merely replace the line `total_loss += iter_loss` with `total_loss += iter_loss.item()`. `item` returns the Python number held by a tensor containing a single value.
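The difference can be observed directly. In this sketch, `weight` stands in for model parameters so that each `iter_loss` is genuinely differentiable; the names `graph_total` and `float_total` are illustrative.

```python
import torch

weight = torch.ones(3, 4, requires_grad=True)  # stands in for model parameters

graph_total = 0    # accumulates tensors, so it keeps a growing graph alive
float_total = 0.0  # accumulates plain Python floats

for _ in range(10):
    iter_loss = (weight * torch.randn(3, 4)).mean()  # a differentiable loss
    graph_total = graph_total + iter_loss            # adds a node to the graph
    float_total += iter_loss.item()                  # detached Python number

# The tensor total carries a grad_fn; the float total is just a float.
print(type(graph_total), graph_total.grad_fn is not None)
print(type(float_total))
```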

### Using torch.no_grad() for inference

Whenever you are doing inference with your network, or any operation that doesn't require backpropagation of gradients, you should put the code inside the `torch.no_grad()` context manager.

In [0]:
net = myNetwork()
inp = torch.randn(5,)

with torch.no_grad():
  out = net(inp)
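A quick way to confirm that nothing is being recorded is to inspect `requires_grad` and `grad_fn` on the outputs. The sketch below uses a plain `nn.Linear` layer in place of the network so that it is self-contained.

```python
import torch
import torch.nn as nn

net = nn.Linear(5, 1)
inp = torch.randn(5)

out_tracked = net(inp)        # autograd records this operation

with torch.no_grad():
    out_untracked = net(inp)  # no graph is built inside the context

print(out_tracked.requires_grad, out_tracked.grad_fn)
print(out_untracked.requires_grad, out_untracked.grad_fn)
```

Both outputs hold the same values; only the bookkeeping needed for the backward pass is skipped, which is where the memory saving comes from.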

### Emptying CUDA cache

While PyTorch aggressively frees up memory, a PyTorch process may not give the memory back to the OS even after you `del` your tensors. This memory is cached so that it can be quickly handed to newly allocated tensors without requesting extra memory from the OS.

This can be a problem when you are using more than one process in your workflow.

The first process can hold onto the GPU memory even if its work is done, causing an OOM error when the second process is launched. To remedy this, you can call `torch.cuda.empty_cache()` at the end of your code.



In [16]:

import torch
from GPUtil import showUtilization as gpu_usage

print("Initial GPU Usage")
gpu_usage()                             

tensorList = []
for x in range(10):
  tensorList.append(torch.randn(10000000,10).cuda())   # reduce the size of tensor if you are getting OOM
  
  

print("GPU Usage after allocating a bunch of Tensors")
gpu_usage()

del tensorList

print("GPU Usage after deleting the Tensors")
gpu_usage()  

print("GPU Usage after emptying the cache")
torch.cuda.empty_cache()
gpu_usage()



Initial GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  0% |  5% |
GPU Usage after allocating a bunch of Tensors
| ID | GPU | MEM |
------------------
|  0 |  3% | 30% |
GPU Usage after deleting the Tensors
| ID | GPU | MEM |
------------------
|  0 |  3% | 30% |
GPU Usage after emptying the cache
| ID | GPU | MEM |
------------------
|  0 |  3% |  5% |
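PyTorch can also report these numbers itself, without GPUtil. As a sketch, `torch.cuda.memory_allocated` counts memory held by live tensors, while `torch.cuda.memory_reserved` also includes the cache, so the gap between the two is what `empty_cache()` can hand back. The helper `cuda_memory_report` is a hypothetical name, and it returns `None` on machines without CUDA.

```python
import torch


def cuda_memory_report(device_index=0):
    """Return (allocated, reserved) in MiB, or None when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    allocated = torch.cuda.memory_allocated(device_index) / mib  # live tensors
    reserved = torch.cuda.memory_reserved(device_index) / mib    # live + cached
    return allocated, reserved


print("start:", cuda_memory_report())

if torch.cuda.is_available():
    blob = torch.randn(1024, 1024, device="cuda:0")
    print("after allocation:", cuda_memory_report())

    del blob  # allocated drops, reserved usually stays the same
    print("after del:", cuda_memory_report())

    torch.cuda.empty_cache()  # returns unused cached blocks to the driver
    print("after empty_cache:", cuda_memory_report())
```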


### Using CUDNN Backend

One can enable the cuDNN benchmark mode to let cuDNN pick optimised algorithms for your workload. This is especially beneficial if your input size is fixed (for example, you are not using RNNs).



In [0]:
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

### Using Half Precision Floats

One can use half precision floats if the GPU has FP16 support. It's simple enough to convert a normal model to its half precision variant.

In [18]:
inp = torch.randn(5,).cuda().half()

model = myNetwork().cuda().half()


model(inp)

tensor([0.1536], device='cuda:0', dtype=torch.float16, grad_fn=<AddBackward0>)

Batch Norm layers have been reported to have convergence issues with half precision floats so it's better to use full precision for them.

In [19]:
import torch
import torch.nn as nn
class myNetworkBN(nn.Module):
  def __init__(self):
    super().__init__()
    self.l1 = nn.Linear(10,5)
    self.bn = nn.BatchNorm1d(5)
    self.l2 = nn.Linear(5,1)
     
  def forward(self,x):
    x = self.l1(x)
    x = self.bn(x)
    x = self.l2(x)
    return x 

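inp = torch.randn(10,).cuda().half().unsqueeze(0)       # Unsqueeze op to add mini-batch dimension

model = myNetworkBN().cuda().half().eval()              # Eval mode = use population statistics in BN
for layer in model.modules():                           # Keep Batch Norm layers in full precision
  if isinstance(layer, nn.BatchNorm1d):
    layer.float()

model(inp)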

tensor([[-0.1593]], device='cuda:0', dtype=torch.float16,
       grad_fn=<AddmmBackward>)

One must always be careful with half precision floats when values may get too large for the format to represent. It is recommended to use the Nvidia `apex` extension for mixed precision training.
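The range limits are easy to demonstrate on the CPU. The largest finite `float16` value is 65504, so anything well above it becomes `inf`, and very small values are flushed to zero. The constants below are picked only for illustration.

```python
import torch

big = torch.tensor(70000.0)  # above the float16 maximum of 65504
tiny = torch.tensor(1e-8)    # below the smallest positive float16 value

print(torch.finfo(torch.float16).max)  # largest finite float16 value
print(big.half())                      # overflows to inf
print(tiny.half())                     # underflows to zero
```

Losses or gradients that drift outside this range are the usual reason mixed precision setups scale the loss before the backward pass.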