<a href="https://colab.research.google.com/github/ShaunakSen/Deep-Learning/blob/master/Copy_of_4_4_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part 4: Using GPU acceleration with PyTorch

In [1]:
# Execute this code block to install dependencies when running on colab
try:
    import torch
except ImportError:
    from os.path import exists
    from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag
    platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
    cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
    accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

    !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-1.0.0-{platform}-linux_x86_64.whl torchvision

try: 
    import torchbearer
except ImportError:
    !pip install torchbearer

Collecting torchbearer
  Downloading https://files.pythonhosted.org/packages/5a/62/79c45d98e22e87b44c9b354d1b050526de80ac8a4da777126b7c86c2bb3e/torchbearer-0.3.0.tar.gz (84kB)
Building wheels for collected packages: torchbearer
  Building wheel for torchbearer (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/6c/cb/69/466aef9cee879fb8f645bd602e34d45e754fb3dee2cb1a877a
Successfully built torchbearer
Installing collected packages: torchbearer
Successfully installed torchbearer-0.3.0


## Manual use of `.cuda()`

Now the magic of PyTorch comes in. So far, we've only been using the CPU to do computation. When we want to scale to a bigger problem, that won't be feasible for very long.

PyTorch makes it really easy to use the GPU for accelerating computation. Consider the following code that computes the element-wise product of two large matrices:

In [2]:
import torch

t1 = torch.randn(1000, 1000)
t2 = torch.randn(1000, 1000)
t3 = t1*t2
print(t3)

tensor([[-1.0801,  0.0131, -0.7778,  ..., -0.1991,  1.1845, -0.1576],
        [-0.1461, -0.0925, -1.0823,  ..., -0.4159, -0.2597,  0.3540],
        [-0.5988,  1.0804,  0.2496,  ...,  2.6953,  0.1726, -0.5365],
        ...,
        [-0.9053,  1.0892, -1.6346,  ..., -0.3574,  0.2655, -0.2569],
        [ 0.0534,  0.9225,  0.2201,  ...,  1.9246,  0.1290,  0.8775],
        [-1.0004, -1.6944,  0.2133,  ..., -0.0137,  0.5315, -0.2759]])


By sending all the tensors that we are using to the GPU, all the operations on them will also run on the GPU without having to change anything else. If you're running a version of PyTorch without CUDA support, the following will throw an error; if CUDA is available, it will create the input matrices, copy them to the GPU, and perform the multiplication on the GPU itself:

In [3]:
t1 = torch.randn(1000, 1000).cuda()
t2 = torch.randn(1000, 1000).cuda()
t3 = t1*t2
print(t3)

print(t3.device)

tensor([[ 1.6100, -0.0796, -0.0658,  ..., -0.1225, -0.6994,  0.0386],
        [-0.1560,  0.0362, -0.5935,  ...,  1.0441,  0.4122, -0.1792],
        [ 1.7499,  0.1702, -2.2297,  ...,  0.1659, -1.3387, -1.4362],
        ...,
        [-0.7714, -0.3160,  0.2048,  ...,  0.0502,  0.1226, -0.0235],
        [ 0.3930,  0.7626,  0.2685,  ...,  0.2665, -0.2146,  2.4714],
        [ 0.0910, -1.1168,  0.1299,  ...,  0.1488, -0.2555, -0.3139]],
       device='cuda:0')
cuda:0


If you're running this workbook in Colab, enable GPU acceleration now (`Runtime -> Change runtime type`, then select `GPU` in the hardware accelerator pull-down). You'll then need to re-run all cells up to this point.

If you were able to run the above with hardware acceleration, the print-out of the result tensor shows `device='cuda:0'`, meaning the tensor is stored on the first GPU (device 0). If you want to copy the tensor back to the CPU, use the `.cpu()` method.
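
As a minimal sketch of the round trip described above (the CPU fallback and the 3x3 tensor size are assumptions made so the snippet also runs without a GPU), the following moves a tensor to the device and copies it back:

```python
import torch

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

t_dev = torch.randn(3, 3).to(device)  # on the GPU if one is available
t_cpu = t_dev.cpu()                   # copy back to host memory (no copy if already on the CPU)

print(t_dev.device, "->", t_cpu.device)

# NumPy can only see host memory, so convert after calling .cpu()
array = t_cpu.numpy()
print(array.shape)
```

Calling `.numpy()` directly on a GPU tensor raises an error, which is one common reason to copy results back to the CPU.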

## Writing platform agnostic code

Most of the time you'd like to write code that is device agnostic; that is, it runs on a GPU if one is available and otherwise falls back to the CPU. The recommended way to do this is as follows:

In [4]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
t1 = torch.randn(1000, 1000).to(device)
t2 = torch.randn(1000, 1000).to(device)
t3 = t1*t2
print(t3)

tensor([[-0.6135,  0.3351,  0.7862,  ...,  3.3680, -0.0036,  1.2411],
        [ 0.0132,  0.6727, -0.2120,  ...,  0.1114, -0.2625,  1.4165],
        [ 0.2163,  0.0668, -1.3345,  ..., -0.0291, -0.0358,  0.8573],
        ...,
        [ 0.0475,  0.3305,  0.3017,  ...,  0.7037, -0.0046, -0.3717],
        [ 0.7401, -0.7087,  0.0668,  ..., -0.2299,  0.0632, -0.1292],
        [-0.1116,  1.2368,  0.0645,  ..., -2.9785, -0.5237, -0.1410]],
       device='cuda:0')
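
The pattern above creates each tensor on the CPU and then copies it across. A possible variation, sketched here rather than taken from the lab, is to pass `device=` to the factory function so the tensor is allocated on the target device directly; `torch.zeros_like` is used to show that new tensors can inherit the device of an existing one:

```python
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Allocate directly on the target device instead of creating on the CPU and copying
t1 = torch.randn(1000, 1000, device=device)
t2 = torch.randn(1000, 1000, device=device)
t3 = t1 * t2

# A tensor created with *_like inherits the device (and dtype) of its argument
t4 = torch.zeros_like(t3)

print(t3.device, t4.device)
```

This avoids one host-to-device copy per tensor, which may matter when many large tensors are created inside a loop.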


## Accelerating neural net training

If you wanted to accelerate the training of a neural net using raw PyTorch, you would have to copy both the model and the training data to the GPU. Unless you were using a really small dataset like MNIST, you would typically _stream_ the batches of training data to the GPU as you used them in the training loop:

```python
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = BaselineModel(784, 784, 10).to(device)

loss_function = ...
optimiser = ...

for epoch in range(10):
    for data in trainloader:
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

        optimiser.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, labels)
        loss.backward()
        optimiser.step()
```
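
The loop above leaves the model, loss and optimiser unspecified. The following is one self-contained way to fill it in so that it runs end to end; the synthetic data, the small `nn.Sequential` model, `CrossEntropyLoss` and SGD with a learning rate of 0.1 are all illustrative assumptions, not the lab's choices:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Assumption: synthetic stand-in for MNIST (256 flattened 28x28 inputs, 10 classes)
inputs = torch.randn(256, 784)
labels = torch.randint(0, 10, (256,))
trainloader = DataLoader(TensorDataset(inputs, labels), batch_size=64, shuffle=True)

# Assumption: a hypothetical small MLP; the parameters are moved to the device once
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)

loss_function = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    running_loss = 0.0
    for batch_inputs, batch_labels in trainloader:
        # Stream each batch to the device just before it is used
        batch_inputs = batch_inputs.to(device)
        batch_labels = batch_labels.to(device)

        optimiser.zero_grad()
        outputs = model(batch_inputs)
        loss = loss_function(outputs, batch_labels)
        loss.backward()
        optimiser.step()
        running_loss += loss.item()  # .item() copies the scalar back to the CPU

    epoch_losses.append(running_loss / len(trainloader))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

Only the model (once) and each batch (per iteration) are moved; the dataset itself stays in host memory.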

Using Torchbearer, this becomes much simpler: you just tell the `Trial` to run on the GPU and that's it!

```python
from torchbearer import Trial

model = BetterCNN()

loss_function = ...
optimiser = ...

device = "cuda:0" if torch.cuda.is_available() else "cpu"
trial = Trial(model, optimiser, loss_function, metrics=['loss', 'accuracy']).to(device)
trial.with_generators(trainloader)
trial.run(epochs=10)
```


## Multiple GPUs

Using multiple GPUs is beyond the scope of the lab, but if you have multiple CUDA devices, they can be referred to by index: `cuda:0`, `cuda:1`, `cuda:2`, etc. You have to be careful not to mix operations on different devices, and you would need to carefully orchestrate moving data between the devices (which can slow your code down to the point where using the CPU would actually be faster).
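
As a small illustration of device indices and explicit copies (the helper `available_devices` is a hypothetical name introduced here, not part of PyTorch), the sketch below lists the devices visible on the machine and moves one operand before combining two tensors. On a machine without GPUs both tensors simply live on the CPU:

```python
import torch

def available_devices():
    """Hypothetical helper: device strings PyTorch can use on this machine."""
    devices = ["cpu"]
    devices += [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    return devices

devices = available_devices()
print(devices)

a = torch.ones(2, 2, device=devices[0])   # always the CPU
b = torch.ones(2, 2, device=devices[-1])  # the last GPU if any, otherwise the CPU

# Operands must live on the same device, so copy b explicitly before adding
c = a + b.to(a.device)
print(c.device)
```

Each `.to(...)` between devices is a real data transfer, which is why frequent transfers can dominate the running time.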

## Questions

__Answer the following questions (enter the answer in the box below each one):__

__1.__ What features of GPUs allow them to perform computations faster than a typical CPU?

Computations that can easily be done in parallel run much faster on a GPU.

The number of subtasks that can run at the same time depends on the number of cores on a particular piece of hardware. Cores are the units that actually do the computation within a given processor; CPUs typically have four, eight, or sixteen cores, while GPUs can have thousands.

![](https://www.datascience.com/hs-fs/hubfs/gpu2.png?width=600&name=gpu2.png)

Neural network computations are largely [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel), so they can be broken up into smaller independent tasks that the GPU cores compute simultaneously.

Neural network training consists largely of simple matrix calculations, which can be sped up greatly when they are carried out in parallel.
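
One way to see this in practice is to time the same element-wise product on each available device. This is a rough sketch, not a rigorous benchmark: the helper name, tensor size and repeat count are arbitrary assumptions, and the result depends heavily on the hardware. The `torch.cuda.synchronize()` calls matter because GPU operations are launched asynchronously:

```python
import time
import torch

def time_elementwise_product(device, size=1000, repeats=10):
    """Hypothetical helper: mean seconds per element-wise product on `device`."""
    t1 = torch.randn(size, size, device=device)
    t2 = torch.randn(size, size, device=device)
    is_cuda = torch.device(device).type == "cuda"

    _ = t1 * t2  # warm-up run so one-off start-up costs are not timed
    if is_cuda:
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(repeats):
        t3 = t1 * t2
    if is_cuda:
        torch.cuda.synchronize()  # wait for the queued GPU work to finish
    return (time.perf_counter() - start) / repeats

timings = {"cpu": time_elementwise_product("cpu")}
if torch.cuda.is_available():
    timings["cuda:0"] = time_elementwise_product("cuda:0")

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per product")
```

For very small tensors the GPU may not come out ahead, since launch overhead can outweigh the benefit of parallelism.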


- [http://deeplizard.com/learn/video/6stDhEA0wFQ](http://deeplizard.com/learn/video/6stDhEA0wFQ)
- [https://www.datascience.com/blog/cpu-gpu-machine-learning](https://www.datascience.com/blog/cpu-gpu-machine-learning)
- [https://en.wikipedia.org/wiki/Embarrassingly_parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel)




__2.__ What is the biggest limiting factor for training large models with current generation GPUs?

The biggest limiting factor is GPU memory. Given the size of your model and the size of your batches, you can calculate how much GPU memory you need for training without actually running it. For example, training AlexNet with a batch size of 128 requires 1.1GB of global memory, and that is just 5 convolutional layers plus 3 fully-connected layers. If we look at a bigger model, say VGG-16, using a batch size of 128 requires about 14GB of global memory. The NVIDIA Titan X, state of the art at the time these figures were compiled, has a memory capacity of 12GB. VGG-16 has only 13 convolutional layers and 3 fully-connected layers, and is much shallower than a ResNet, which can contain about one hundred layers.
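
Part of that calculation can be sketched in a few lines. The helper below (a hypothetical name) counts only the memory taken by the model's weights, assuming PyTorch's default `float32` parameters; gradients, optimiser state and the activations stored for backpropagation, which grow with the batch size, come on top of this, so treat the result as a lower bound:

```python
import torch
from torch import nn

def parameter_memory_bytes(model):
    """Hypothetical helper: bytes needed to store the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())

# Assumption: a small illustrative MLP, not one of the models discussed above
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

n_params = sum(p.numel() for p in model.parameters())
n_bytes = parameter_memory_bytes(model)

# (784*128 + 128) + (128*10 + 10) = 101,770 parameters, 4 bytes each as float32
print(f"{n_params:,} parameters, {n_bytes / 1024**2:.2f} MiB of weights")
```

The same function can be applied to any `nn.Module` to get a first estimate before deciding on a batch size.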

Now, if you want to train a model larger than VGG-16, you have several options for working around the memory limit:

- Reduce your batch size, which might hinder both your training speed and accuracy.
- Distribute your model across multiple GPUs, which is a complicated process in itself.
- Reduce your model size, if you are unwilling to take either of the options above, or have already tried them without good results.