# Using [torch](https://pytorch.org/docs/stable/cuda.html) to compare CPU/GPU speeds
stough 202-

The [Graphics Processing Unit](https://www.extremetech.com/gaming/269335-how-graphics-cards-work) (GPU)
is a common [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) designed to do parallel floating-point
arithmetic. Historically that arithmetic served computer graphics, but the same massively parallel math is useful
across much of scientific computing.

We will also use JupyterLab [magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html) such as `%%timeit`.

In [1]:
import torch
import torch.cuda as cuda
import numpy as np

import time
from IPython.display import display, Markdown

In [2]:
cuda.is_available()

True

In [3]:
cuda.get_device_name(0)

'GeForce GTX 1070'

In [4]:
cuda.get_device_properties(0)

_CudaDeviceProperties(name='GeForce GTX 1070', major=6, minor=1, total_memory=8192MB, multi_processor_count=16)

&nbsp;

## We'll do a large matrix multiply operation
in numpy, torch, and torch on the GPU.

In [5]:
A = np.random.rand(400,1000,200)
B = np.random.rand(400,200,1000)

In [6]:
(8*A.size)/(1024**2)

610.3515625

In [7]:
C = np.matmul(A,B)
C.shape

(400, 1000, 1000)

In [8]:
8*C.size/(1024**2)

3051.7578125
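
The factor of 8 in the two size calculations is the number of bytes in a float64 element, which is what `np.random.rand` produces. A small sketch of the same arithmetic in plain Python, assuming dense arrays with no overhead (`array_mib` is a made-up helper name, not part of NumPy):

```python
from math import prod

def array_mib(shape, itemsize=8):
    """Size in MiB of a dense array with this shape and bytes per element."""
    return prod(shape) * itemsize / 1024**2

a_mib = array_mib((400, 1000, 200))    # A; B has the same number of elements
c_mib = array_mib((400, 1000, 1000))   # C = A @ B
total_mib = 2 * a_mib + c_mib          # A, B and C held in memory together
print(a_mib, c_mib, total_mib)         # 610.3515625 3051.7578125 4272.4609375
```

Together the three arrays need about 4272 MiB, which is worth comparing with the 8192MB that `cuda.get_device_properties(0)` reported before copying them to the GPU.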

In [9]:
%%timeit -n 5 -r 4
# C = np.matmul(A,B)
np.matmul(A,B, out=C)

1.3 s ± 152 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)


&nbsp;

### Test in Torch

In [10]:
At_cpu = torch.from_numpy(A)
Bt_cpu = torch.from_numpy(B)

In [11]:
At_cpu.is_cuda

False

In [12]:
Ct_cpu = torch.matmul(At_cpu, Bt_cpu)
Ct_cpu.shape

torch.Size([400, 1000, 1000])

In [13]:
%%timeit -n 5 -r 4
# Ct_cpu = torch.matmul(At_cpu, Bt_cpu)
torch.matmul(At_cpu, Bt_cpu, out = Ct_cpu)

963 ms ± 63.1 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)
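
One way to put the two CPU timings on a common scale is to convert them to arithmetic throughput. The sketch below assumes the usual convention of counting one multiply and one add per inner-product term, so 2·m·k·n operations per matrix product; `matmul_flops` and `gflops` are hypothetical helper names.

```python
def matmul_flops(batch, m, k, n):
    """Floating-point operations for `batch` products of (m x k) by (k x n) matrices."""
    return 2 * batch * m * k * n   # one multiply + one add per term (assumed convention)

def gflops(flops, seconds):
    """Throughput in GFLOP/s."""
    return flops / seconds / 1e9

flops = matmul_flops(400, 1000, 200, 1000)
print(f"numpy: {gflops(flops, 1.3):.1f} GFLOP/s")     # 1.3 s mean from %%timeit
print(f"torch: {gflops(flops, 0.963):.1f} GFLOP/s")   # 963 ms mean from %%timeit
```

With the means reported above this prints roughly 123.1 and 166.1 GFLOP/s. The run-to-run spread in the `%%timeit` output means these figures are only approximate.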


&nbsp;

### Now test in torch, on the GPU

In [14]:
# Copy the inputs to the GPU and allocate the output buffer there directly.
At_gpu = At_cpu.cuda()
Bt_gpu = Bt_cpu.cuda()
Ct_gpu = torch.zeros_like(Ct_cpu, device='cuda')

In [15]:
At_gpu.is_cuda

True

In [16]:
%%timeit -n 5 -r 4
torch.matmul(At_gpu, Bt_gpu, out=Ct_gpu)

The slowest run took 3273.01 times longer than the fastest. This could mean that an intermediate result is being cached.
25.9 ms ± 44.8 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)
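
The warning above is probably not about caching. PyTorch launches CUDA kernels asynchronously, so `torch.matmul` can return before the GPU has finished, and `%%timeit` may then be measuring mostly the time to queue the work. Below is a minimal sketch of a timing helper that waits for the device before reading the clock. The helper names `mean_seconds_per_call` and `gpu_matmul_seconds` are hypothetical; `torch.cuda.synchronize` is the real PyTorch call.

```python
import time

def mean_seconds_per_call(fn, sync=None, warmup=1, repeats=5):
    """Average wall-clock seconds per call of fn, calling sync() before each clock read."""
    sync = sync or (lambda: None)
    for _ in range(warmup):
        fn()                     # warm-up calls are not timed
    sync()                       # drain warm-up work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()                       # wait until all queued work has finished
    return (time.perf_counter() - start) / repeats

def gpu_matmul_seconds(a, b, out):
    """Hypothetical wrapper: time torch.matmul on CUDA tensors with synchronization."""
    import torch                 # imported here so the helper above stays torch-free
    return mean_seconds_per_call(
        lambda: torch.matmul(a, b, out=out),
        sync=torch.cuda.synchronize,
    )
```

In this notebook the call would be `gpu_matmul_seconds(At_gpu, Bt_gpu, Ct_gpu)`. The warm-up call is there because the first CUDA operation often carries one-time setup cost that would otherwise skew the mean.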


&nbsp;

### Try again, with our own timing.

- [math expressions in markdown](https://stackoverflow.com/questions/48422762/is-it-possible-to-show-print-output-as-latex-in-jupyter-notebook)

In [17]:
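times = []

st = time.time()

for i in range(100):
    torch.zeros(Ct_gpu.shape, out=Ct_gpu)
    torch.matmul(At_gpu, Bt_gpu, out=Ct_gpu)
    # Cumulative time since st, not the duration of this iteration.
    times.append(time.time() - st)

et = time.time()

# Note: CUDA kernels launch asynchronously and nothing here calls
# cuda.synchronize(), so this may mostly measure the time to queue the work.
# Why not be more complicated...
# print(f'100 iters took {1000000*(et-st):.2f}')
display(Markdown(rf'100 iters took {1000000*(et-st):.2f}$\mu$s in total.'))

100 iters took 22315.03$\mu$s in total.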
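
Each value appended to `times` is the elapsed time since `st`, so the list is cumulative rather than per iteration. Here is a small sketch, in plain Python, of recovering per-iteration durations from such a list (`per_iteration` is a hypothetical helper name):

```python
def per_iteration(cumulative):
    """Turn cumulative elapsed times into the duration of each iteration."""
    previous = 0.0
    durations = []
    for elapsed in cumulative:
        durations.append(elapsed - previous)
        previous = elapsed
    return durations

# Made-up cumulative readings, in seconds, for illustration only.
print(per_iteration([0.5, 1.5, 3.0]))   # [0.5, 1.0, 1.5]
```

Applied to this notebook the call would be `per_iteration(times)`. Without a `cuda.synchronize()` inside the loop, the durations still describe when work was queued rather than when it finished. If consecutive readings come out identical, the clock may be coarser than one iteration, in which case `time.perf_counter()` is usually a better choice than `time.time()` for short intervals.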

In [18]:
times

[0.01884293556213379,
 0.01884293556213379,
 0.01884293556213379,
 0.01884293556213379,
 0.01884293556213379,
 0.01933884620666504,
 0.01933884620666504,
 0.01933884620666504,
 0.01933884620666504,
 0.01933884620666504,
 0.01933884620666504,
 0.01933884620666504,
 0.01933884620666504,
 0.01933884620666504,
 0.01933884620666504,
 0.019671201705932617,
 0.019671201705932617,
 0.019671201705932617,
 0.019671201705932617,
 0.019671201705932617,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.019834518432617188,
 0.02033090591430664,
 0.02033090591430664,
 0.02033090591430664,
 0.02033090591430664,
 0.02033090591430664,
 0.02033090591430664,
 0.02033090591430664,
 0.02033090591430664,
 0.02033090