# Using [torch](https://pytorch.org/docs/stable/cuda.html) to compare CPU/GPU speeds
stough 202-

The [Graphics Processing Unit](https://www.extremetech.com/gaming/269335-how-graphics-cards-work)
is a common [coprocessor](https://en.wikipedia.org/wiki/Coprocessor) designed for parallel floating-point
arithmetic. It was originally built for computer graphics, but the same massively parallel math is useful
in much of scientific computation.

We'll also use JupyterLab [magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html), such as `%%timeit`, to time the cells.

In [1]:
import torch
import torch.cuda as cuda
import numpy as np

import time
from IPython.display import display, Markdown

In [2]:
cuda.is_available()

True

In [3]:
cuda.get_device_name(0)

'GeForce RTX 2080 Ti'

In [4]:
cuda.get_device_properties(0)

_CudaDeviceProperties(name='GeForce RTX 2080 Ti', major=7, minor=5, total_memory=11019MB, multi_processor_count=68)
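
The cells above assume a CUDA device is present. As a minimal sketch of the same queries written so they also run on a machine without a GPU (the names `device` and `x` are illustrative and not part of the original notebook):

```python
import torch

# Pick the GPU when one is visible to torch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors can be created directly on the chosen device.
x = torch.ones(2, 3, device=device)

if device.type == "cuda":
    # total_memory is reported in bytes; convert to MiB for readability.
    props = torch.cuda.get_device_properties(device)
    print(props.name, props.total_memory // 1024**2, "MiB")

print(x.device)
```

Writing against a `device` variable rather than calling `.cuda()` directly keeps later code usable on either kind of machine.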


## We'll do a large matrix multiply operation
We'll time the same batched matrix multiply three ways: in numpy, in torch on the CPU, and in torch on the GPU.

In [5]:
A = np.random.rand(400,1000,200)
B = np.random.rand(400,200,1000)

In [6]:
(8*A.size)/(1024**2)

610.3515625

In [7]:
C = np.matmul(A,B)
C.shape

(400, 1000, 1000)

In [8]:
8*C.size/(1024**2)

3051.7578125
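
The two size calculations above multiply the element count by 8 bytes (the arrays are float64) and divide by `1024**2` to get MiB. A small sketch of the same arithmetic follows; the helper `size_mib` is a hypothetical name introduced here, and the `small` array is only a stand-in so the check runs quickly:

```python
import math
import numpy as np

def size_mib(shape, itemsize=8):
    """Size in MiB of a dense array with the given shape and bytes per element."""
    return math.prod(shape) * itemsize / 1024**2

print(size_mib((400, 1000, 200)))   # shape of A, matches cell 6
print(size_mib((400, 1000, 1000)))  # shape of C, matches cell 8

# numpy exposes the same quantity in bytes as `nbytes`.
small = np.random.rand(4, 10, 2)
print(small.nbytes == small.size * small.itemsize)
```

So A and B take about 610 MiB each and C about 3 GiB, which all have to fit in the card's 11019 MB once they are copied to the GPU.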

In [9]:
%%timeit -n 5 -r 4
# C = np.matmul(A,B)
np.matmul(A,B, out=C)

369 ms ± 631 µs per loop (mean ± std. dev. of 4 runs, 5 loops each)
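
Outside a notebook, roughly the same measurement can be made with the standard `timeit` module. This is a sketch under the assumption that `-n 5 -r 4` corresponds to `number=5, repeat=4`; the arrays are much smaller than the notebook's so it finishes quickly, and the names `a`, `b`, `c` are illustrative:

```python
import timeit
import numpy as np

# Small stand-ins for the notebook's A, B and C.
a = np.random.rand(4, 100, 20)
b = np.random.rand(4, 20, 100)
c = np.empty((4, 100, 100))

# 4 runs of 5 loops each; each entry is the total time for one run.
runs = timeit.repeat(lambda: np.matmul(a, b, out=c), number=5, repeat=4)

# Divide by the loop count to get a per-loop time for each run.
per_loop = [r / 5 for r in runs]
mean = sum(per_loop) / len(per_loop)
print(f"best {min(per_loop) * 1e6:.1f} us, mean {mean * 1e6:.1f} us per loop")
```

Writing into a preallocated `out` array, as the notebook does, keeps the allocation of the 3 GiB result out of the timed region.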



### Test in Torch

In [10]:
At_cpu = torch.from_numpy(A)
Bt_cpu = torch.from_numpy(B)

In [11]:
At_cpu.is_cuda

False
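
`torch.from_numpy` does not copy the data: the tensor and the array share memory, and the float64 dtype carries over, so the torch timings here use the same precision as numpy. A small sketch with illustrative names:

```python
import numpy as np
import torch

arr = np.zeros(3)            # float64 by default
t = torch.from_numpy(arr)    # shares memory with arr, no copy

arr[0] = 5.0                 # the change is visible through the tensor
print(t)

snapshot = t.clone()         # clone() makes an independent copy
arr[1] = 7.0                 # visible in t, but not in snapshot
print(t, snapshot)
print(t.dtype)
```

By contrast, `.cuda()` in the next section does copy, since the data has to move into GPU memory.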

In [12]:
Ct_cpu = torch.matmul(At_cpu, Bt_cpu)
Ct_cpu.shape

torch.Size([400, 1000, 1000])

In [13]:
%%timeit -n 5 -r 4
# Ct_cpu = torch.matmul(At_cpu, Bt_cpu)
torch.matmul(At_cpu, Bt_cpu, out = Ct_cpu)

447 ms ± 14.5 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)



### Now test in torch, on the GPU

In [14]:
At_gpu = At_cpu.cuda()
Bt_gpu = Bt_cpu.cuda()
Ct_gpu = torch.zeros_like(Ct_cpu).cuda()

In [15]:
At_gpu.is_cuda

True

In [16]:
%%timeit -n 5 -r 4
torch.matmul(At_gpu, Bt_gpu, out=Ct_gpu)

The slowest run took 10121.34 times longer than the fastest. This could mean that an intermediate result is being cached.
37.2 ms ± 64.4 ms per loop (mean ± std. dev. of 4 runs, 5 loops each)
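
CUDA operations are queued asynchronously: a call such as `torch.matmul` on GPU tensors can return before the GPU has finished. That is one likely explanation for the huge spread in the warning above, since most loops only measure the launch while an occasional one waits for queued work. A minimal sketch of a synchronized measurement follows, assuming small stand-in tensors and hypothetical helper names (`sync`, `timed_matmul`); it falls back to the CPU when no GPU is present:

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Smaller than the notebook's (400, 1000, 200) batch to keep the demo quick.
a = torch.rand(8, 100, 50, dtype=torch.float64, device=device)
b = torch.rand(8, 50, 100, dtype=torch.float64, device=device)
out = torch.empty(8, 100, 100, dtype=torch.float64, device=device)

def sync():
    # Block until all queued GPU work has finished; a no-op on the CPU.
    if device.type == "cuda":
        torch.cuda.synchronize()

def timed_matmul(n_iters=10):
    torch.matmul(a, b, out=out)  # warm-up call, excluded from the timing
    sync()
    start = time.perf_counter()
    for _ in range(n_iters):
        torch.matmul(a, b, out=out)
    sync()  # wait for the last kernel before stopping the clock
    return (time.perf_counter() - start) / n_iters

per_iter = timed_matmul()
print(f"{1e3 * per_iter:.3f} ms per matmul on {device}")
```

Synchronizing once before starting the clock and once before stopping it means the interval covers the computation itself and not just the time taken to enqueue it.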



### Try again, with our own timing.

- [math expressions in markdown](https://stackoverflow.com/questions/48422762/is-it-possible-to-show-print-output-as-latex-in-jupyter-notebook)

In [17]:
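times = []

st = time.time()

for i in range(100):
    torch.zeros(Ct_gpu.shape, out=Ct_gpu)
    torch.matmul(At_gpu, Bt_gpu, out=Ct_gpu)
    # Elapsed time since st, so each entry is cumulative, not per-iteration.
    times.append(time.time() - st)

et = time.time()

# CUDA calls are queued asynchronously and nothing here calls
# torch.cuda.synchronize(), so et - st may mostly reflect launch overhead.
# Why not be more complicated...
# print(f'100 iters took {1000000*(et-st):.2f} microseconds in total.')
display(Markdown(rf'100 iters took {1000000*(et-st):.2f}$\mu$s in total.'))

100 iters took 6332.87$\mu$s in total.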
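
Because the loop appends `time.time() - st`, each entry of `times` is the elapsed time since the start, not the duration of one iteration. A sketch of recovering per-iteration durations with `np.diff`, using the first five values from the output listed below (the names `cumulative` and `per_iter` are illustrative):

```python
import numpy as np

# First five cumulative timestamps (seconds) from the notebook's `times` list.
cumulative = np.array([
    0.0035178661346435547,
    0.003555774688720703,
    0.0035858154296875,
    0.003614664077758789,
    0.003643512725830078,
])

# Differences between consecutive entries give the time spent per iteration.
per_iter = np.diff(cumulative)
print(per_iter * 1e6)  # microseconds per iteration, after the first

# The first entry includes everything up to the end of iteration 0.
print(cumulative[0] * 1e6)
```

Steps of a few tens of microseconds are far too short for a multiply of this size, which suggests the loop is timing how fast work is queued and not how fast the GPU completes it.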

In [18]:
times

[0.0035178661346435547,
 0.003555774688720703,
 0.0035858154296875,
 0.003614664077758789,
 0.003643512725830078,
 0.0036721229553222656,
 0.00370025634765625,
 0.003728151321411133,
 0.0037565231323242188,
 0.0037848949432373047,
 0.0038127899169921875,
 0.003840923309326172,
 0.003869295120239258,
 0.0038971900939941406,
 0.0039250850677490234,
 0.003952980041503906,
 0.003981113433837891,
 0.0040094852447509766,
 0.004037618637084961,
 0.004065752029418945,
 0.004093647003173828,
 0.004121541976928711,
 0.004153728485107422,
 0.004182100296020508,
 0.004210233688354492,
 0.004238128662109375,
 0.004266262054443359,
 0.004294395446777344,
 0.004322052001953125,
 0.004349946975708008,
 0.004377841949462891,
 0.004405498504638672,
 0.004433631896972656,
 0.0044612884521484375,
 0.00448918342590332,
 0.004517078399658203,
 0.004547119140625,
 0.004575014114379883,
 0.004602670669555664,
 0.004630327224731445,
 0.00465846061706543,
 0.004686117172241211,
 0.004714012145996094,
 0.0047419