# Chapter 3: GPU programming

This chapter and the next make extensive use of GPUs. Unfortunately, depending on your machine, it may be impossible to use your GPU from Python. For example, at the time of writing this tutorial, Radeon GPUs are not supported. Some laptops also have no GPU at all. For these reasons I decided to run the next chapter on Google Colab (https://colab.research.google.com/). If you still want to use your own GPU, here are some links that might help:

- https://towardsdatascience.com/installing-tensorflow-with-cuda-cudnn-and-gpu-support-on-windows-10-60693e46e781
- https://www.youtube.com/watch?v=hHWkvEcDBO0
- https://www.youtube.com/watch?v=KZFn0dvPZUQ
- https://towardsdatascience.com/installing-tensorflow-gpu-in-ubuntu-20-04-4ee3ca4cb75d
- https://medium.com/analytics-vidhya/install-tensorflow-2-for-amd-gpus-87e8d7aeb812

Why do we want to use GPUs?

GPU hardware is designed for data parallelism. Maximum throughput is achieved when you apply the same operation to many different elements at once.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Deep learning is among the tasks that benefit significantly from parallel processing. Other tasks cannot be done in parallel because each step needs the result of the previous one (e.g. computing a recurrence such as the Fibonacci sequence).
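To make the distinction concrete, here is a minimal pure-Python sketch. The function names `square_all` and `fibonacci` are illustrative only. In the first function every output depends only on its own input, so the work could be split across many GPU threads. In the second, each term needs the two previous ones, so the loop has to run in order.

```python
def square_all(values):
    """Data-parallel: each output depends only on its own input."""
    return [v * v for v in values]


def fibonacci(n):
    """Sequential: every term depends on the two terms before it."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


print(square_all([1, 2, 3, 4]))
print(fibonacci(10))
```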

It would be convenient to write the same code as usual (NumPy, pandas, ...) and simply have the computation run on a GPU. This would make it much easier to parallelize our work. Several companies, universities, and individuals are working on libraries of this kind, and those are what we are going to use in this chapter.

Structure:
- [Colab](#Collab)
- [CuPy](#CuPy)
- [CuDF and CuML](#CuDF)
- [Numba](#Numba)
- [TODO](#TODO)

<a name="Collab"></a>
## Google Colab

- Storage is limited to about 60 GB (shown in the left panel).
- RAM is limited to about 12 GB (shown at the top right).
- To use a GPU, go to Edit > Notebook settings, or to Runtime > Change runtime type, and select a GPU as the hardware accelerator.
- You can create text cells and code cells, and group them into sections.
- Colab resembles Jupyter Notebook and uses the same .ipynb format.
- You can change the color theme and the keyboard shortcuts from the Tools menu.
- The machine you are connected to runs Ubuntu. To run a command in the terminal, add "!" in front of it.
- Python is already installed.
- Sessions are limited in time.
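Once the runtime type is set, it can be handy to check from Python whether a GPU is actually visible. The sketch below is one possible way to do it, assuming an NVIDIA GPU and the `nvidia-smi` tool on the PATH. The helper name `gpu_visible` is hypothetical. It simply returns False when the tool is missing or fails.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi -L` runs and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # tool not installed, so no NVIDIA driver is visible
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```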

In [None]:
# Check Python Version
!python --version

Python 3.7.12


In [None]:
# Check Ubuntu Version
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.5 LTS
Release:	18.04
Codename:	bionic


In [None]:
# Check the CUDA compiler (nvcc) version and location
!nvcc -V && which nvcc

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
/usr/local/cuda/bin/nvcc


In [None]:
# Check GPU
!nvidia-smi

Tue Sep 21 08:35:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 63
model name	: Intel(R) Xeon(R) CPU @ 2.30GHz
stepping	: 0
microcode	: 0x1
cpu MHz		: 2299.998
cache size	: 46080 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips	: 4599.99
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	:

<a name="CuPy"></a>
## CuPy

CuPy is the GPU equivalent of NumPy. It exposes largely the same functions and methods as NumPy, so the cost of moving from NumPy to CuPy is low.

In [None]:
import cupy as cp
import numpy as np

In [None]:
z = cp.arange(6).reshape(2, 3).astype('f')
z

array([[0., 1., 2.],
       [3., 4., 5.]], dtype=float32)

In [None]:
z.mean(axis=0)

array([1.5, 2.5, 3.5], dtype=float32)

In [None]:
z.sum(axis=1)

array([ 3., 12.], dtype=float32)

In [None]:
z.dot(z.T).astype('int')

array([[ 5, 14],
       [14, 50]])

In [None]:
ary = cp.arange(10).reshape((2,5))
print(ary.dtype)
print(ary.shape)
print(ary.strides)
print(ary.device)

int64
(2, 5)
(40, 8)
<CUDA Device 0>


You can easily convert a NumPy array to a CuPy array, and back again:

In [None]:
ary_cpu = np.arange(10)
ary_gpu = cp.asarray(ary_cpu)
print('cpu:', ary_cpu)
print('gpu:', ary_gpu)

cpu: [0 1 2 3 4 5 6 7 8 9]
gpu: [0 1 2 3 4 5 6 7 8 9]


In [None]:
ary_cpu_returned = cp.asnumpy(ary_gpu)
print(repr(ary_cpu_returned))
print(type(ary_cpu_returned))

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
<class 'numpy.ndarray'>
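Because CuPy mirrors the NumPy API, a function can often be written once and run with either library. The sketch below illustrates the idea by passing the array module as a parameter. The function name `column_zscores` and the `xp` argument are illustrative choices, not part of either library. It runs on the CPU with NumPy as written.

```python
import numpy as np


def column_zscores(a, xp=np):
    """Standardize each column using whichever array module is passed in."""
    mean = xp.mean(a, axis=0)
    std = xp.std(a, axis=0)
    return (a - mean) / std


data = np.array([[1.0, 2.0], [3.0, 6.0]])
print(column_zscores(data))
```

On a machine with a GPU, the same function could be called as `column_zscores(cp.asarray(data), xp=cp)`. CuPy also provides `cp.get_array_module` to pick the module from the array itself.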


Universal functions (ufuncs) are also available in CuPy:

In [None]:
print(ary_gpu * 2)
print(cp.exp(-0.5 * ary_gpu**2))
print(cp.linalg.norm(ary_gpu))
print(cp.random.normal(loc=5, scale=2.0, size=10))

[ 0  2  4  6  8 10 12 14 16 18]
[1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02
 3.35462628e-04 3.72665317e-06 1.52299797e-08 2.28973485e-11
 1.26641655e-14 2.57675711e-18]
16.881943016134134
[ 5.06424225  8.06825008  4.90058205  7.76730818 -0.01053254  6.88729895
  3.58324392  7.93736922  6.83135433  4.12706551]


You may notice a slight pause the first time you run these functions. This is because CuPy has to compile the CUDA functions on the fly, and then caches them to disk for future reuse. Let's compare performance.

In [None]:
import pandas as pd
import cupy as cp
import numpy as np

Let's compare a simple multiplication:

In [None]:
%%timeit 
# small example taken from here https://giters.com/cupy/cupy/issues/4891?amp=1

a_cpu = np.ones((1000, 20000), dtype='float32')
b_cpu = np.ones((20000, 2000), dtype='float32')
z_cpu = np.matmul(a_cpu, b_cpu)

1 loop, best of 5: 1.18 s per loop


In [None]:
%%timeit 
# small example taken from here https://giters.com/cupy/cupy/issues/4891?amp=1

a_gpu = cp.ones((1000, 20000), dtype='float32')
b_gpu = cp.ones((20000, 2000), dtype='float32')
z_gpu = cp.matmul(a_gpu, b_gpu)

The slowest run took 1402.62 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 5: 209 µs per loop
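One caveat when reading the GPU number above: CuPy launches GPU work asynchronously, so a timer can stop before the GPU has finished, and a figure like 209 µs may mostly reflect launch time. A more careful measurement synchronizes the device before reading the clock. The sketch below is one way to do it. The helper name `time_matmul` is hypothetical, and the fallback to NumPy when no GPU is found is an assumption made only so that the example runs anywhere.

```python
import time

import numpy as np

try:
    import cupy as cp

    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp = cp
except Exception:
    cp = None
    xp = np  # assumption: fall back to NumPy when no GPU is available


def time_matmul(n=256, repeats=5):
    """Return the best wall-clock time in seconds for an n x n float32 matmul."""
    a = xp.ones((n, n), dtype="float32")
    b = xp.ones((n, n), dtype="float32")
    xp.matmul(a, b)  # warm-up run, excluded from the timing
    if cp is not None:
        cp.cuda.Stream.null.synchronize()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        xp.matmul(a, b)
        if cp is not None:
            # Wait for the GPU to finish; otherwise only the launch is timed.
            cp.cuda.Stream.null.synchronize()
        best = min(best, time.perf_counter() - start)
    return best


print(f"best time: {time_matmul() * 1e3:.3f} ms")
```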


Now the analytical solution of OLS:

In [None]:
X_cpu = np.random.rand(20000, 1000).astype('f')
Y_cpu = np.random.rand(20000, 1).astype('f')

In [None]:
X_gpu = cp.asarray(X_cpu,dtype='float32')
Y_gpu = cp.asarray(Y_cpu,dtype='float32')

In [None]:
%%timeit 
beta = np.matmul(np.linalg.inv(np.matmul(X_cpu.T,X_cpu)),np.matmul(X_cpu.T,Y_cpu))

1 loop, best of 5: 476 ms per loop


In [None]:
%%timeit 
beta = cp.matmul(cp.linalg.inv(cp.matmul(X_gpu.T, X_gpu)), cp.matmul(X_gpu.T, Y_gpu))

The slowest run took 80.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 5: 70.6 ms per loop
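For reference, the closed form timed above is beta = (XᵀX)⁻¹ Xᵀy. The sketch below checks it on a small synthetic problem with NumPy and compares it with `np.linalg.solve` and `np.linalg.lstsq`, which avoid forming the inverse explicitly. The data, the seed, and `true_beta` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 5))
true_beta = np.array([1.0, -2.0, 0.5, 3.0, 0.0])  # illustrative coefficients
y = X @ true_beta + 0.01 * rng.standard_normal(200)

# Textbook formula, as used in the cells above.
beta_inv = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Same normal equations, solved without computing the inverse.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver working directly on X.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_inv)
print(beta_solve)
print(beta_lstsq)
```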


This does not mean that GPUs are always faster. When are they worse?
Read more here: https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56

In [None]:
df = pd.read_csv("sample_data/california_housing_train.csv")
intercept = np.ones(len(df))

y = np.array(df["median_house_value"],dtype='float32')
X = np.array(df.drop(["median_house_value"],axis=1),dtype='float32')

In [None]:
y_gpu = cp.asarray(y,dtype='float32')
X_gpu = cp.asarray(X,dtype='float32')

In [None]:
%%timeit 
beta = np.matmul(np.linalg.inv(np.matmul(X.T,X)),np.matmul(X.T,y))

The slowest run took 20.02 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 407 µs per loop


In [None]:
%%timeit 
beta = cp.matmul(cp.linalg.inv(cp.matmul(X_gpu.T,X_gpu)),cp.matmul(X_gpu.T,y_gpu))

1000 loops, best of 5: 818 µs per loop
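A rough operation count helps explain why the GPU loses here. Computing XᵀX costs about 2·n·p² floating-point operations for an n × p matrix. The sketch below compares the synthetic problem (20000 × 1000) with the housing data, assuming the latter has 17000 rows and 8 feature columns. That shape is an assumption about the sample file, and `gram_matrix_flops` is a hypothetical helper.

```python
def gram_matrix_flops(n_rows, n_cols):
    """Approximate FLOPs for X.T @ X: one multiply and one add per term."""
    return 2 * n_rows * n_cols * n_cols


large = gram_matrix_flops(20_000, 1_000)
small = gram_matrix_flops(17_000, 8)  # assumed shape of the housing data

print(f"large problem: {large:.2e} FLOPs")
print(f"small problem: {small:.2e} FLOPs")
print(f"ratio: {large / small:.0f}x")
```

With so little arithmetic in the small problem, fixed costs such as kernel launches are likely to dominate the GPU run time.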


CuPy also works best with ufuncs but, as we saw in the introduction, not every operation can be expressed as a ufunc. To overcome this, you can write your own kernel (read more here: https://docs.cupy.dev/en/stable/user_guide/kernel.html).

In [None]:
x = cp.arange(10, dtype=np.float32).reshape(2, 5)
y = cp.ones(10, dtype=np.float32).reshape(2, 5)


squared_diff = cp.ElementwiseKernel(
   'float32 x, float32 y',
   'float32 z',
   'z = (x - y) * (x - y)',
   'squared_diff')

In [None]:
print("x:", x)
print("y:", y)

result = squared_diff(x,y)

print("result :", result)

x: [[0. 1. 2. 3. 4.]
 [5. 6. 7. 8. 9.]]
y: [[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]
result : [[ 1.  0.  1.  4.  9.]
 [16. 25. 36. 49. 64.]]


In [None]:
# Example taken from docs

x = cp.arange(10, dtype=np.float32).reshape(2, 5)
y = cp.arange(10, dtype=np.float32).reshape(2, 5)

add_reverse = cp.ElementwiseKernel(
    'T x, raw T y', 
    'T z',
    '''
    z = x + y[_ind.size() - i - 1];
    ''',
    'add_reverse')

In [None]:
print("x:", x)
print("y:", y)

result = add_reverse(x,y)

print("result :", result)

x: [[0. 1. 2. 3. 4.]
 [5. 6. 7. 8. 9.]]
y: [[0. 1. 2. 3. 4.]
 [5. 6. 7. 8. 9.]]
result : [[9. 9. 9. 9. 9.]
 [9. 9. 9. 9. 9.]]
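In `add_reverse`, the `raw T y` argument lets the kernel index `y` manually, and `_ind.size() - i - 1` selects the element mirrored from the end of the flattened array. A NumPy-only sketch of the same computation, given here as an illustrative CPU equivalent and not as CuPy's implementation, could look like this:

```python
import numpy as np

x = np.arange(10, dtype=np.float32).reshape(2, 5)
y = np.arange(10, dtype=np.float32).reshape(2, 5)

# Flatten y, reverse it, and restore the shape: element i is paired with
# element (size - i - 1), exactly like the index used in the kernel.
y_reversed = y.ravel()[::-1].reshape(x.shape)
result = x + y_reversed

print(result)
```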


You can also write even more complex custom CUDA kernels.

<a name="CuDF"></a>
## CuDF and CuML

CuDF is developed by RAPIDS (https://rapids.ai/) and, like CuPy, its goal is to provide the features of pandas on GPUs. CuML is also developed by RAPIDS, and this time the library being brought to GPUs is scikit-learn. Installing them on Google Colab is a bit complex, so we will directly use their Google Colab notebook: https://colab.research.google.com/drive/1rY7Ln6rEE1pOlfSHCYOVaqt8OvDO35J0#forceEdit=true&sandboxMode=true&scrollTo=JI7UTXbhaBon

<a name="Numba"></a>
## Numba

Numba is a just-in-time (https://en.wikipedia.org/wiki/Just-in-time_compilation), type-specializing, function compiler for accelerating numerically-focused Python. That's a long list, so let's break down those terms: 

- function compiler: Numba compiles Python functions, not entire applications, and not parts of functions. Numba does not replace your Python interpreter, but is just another Python module that can turn a function into a (usually) faster function.
- type-specializing: Numba speeds up your function by generating a specialized implementation for the specific data types you are using. Python functions are designed to operate on generic data types, which makes them very flexible, but also very slow. In practice, you only will call a function with a small number of argument types, so Numba will generate a fast implementation for each set of types.
- just-in-time: Numba translates functions when they are first called. This ensures the compiler knows what argument types you will be using. This also allows Numba to be used interactively in a Jupyter notebook just as easily as a traditional application.


In [4]:
from numba import jit
import numpy as np

x = np.random.rand(20000, 1000).astype('f')
y = np.random.rand(20000, 1).astype('f')

@jit
def ols(x, y):
    beta = np.dot(np.linalg.inv(np.dot(x.T,x)),np.dot(x.T,y))
    return beta[1]

In [12]:
%timeit ols(x,y)

1 loop, best of 5: 682 ms per loop


In [13]:
%timeit ols.py_func(x,y)

1 loop, best of 5: 468 ms per loop


As a reminder, NumPy is written in C, which means it is already extremely fast, so Numba has little to gain on a function that is mostly NumPy calls like this one. Read more on NumPy and Numba here: https://numba.pydata.org/numba-doc/dev/reference/numpysupported.html
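Where Numba tends to help is explicit Python loops over array elements, which are slow in the interpreter. The sketch below shows such a loop. The function name `sum_sq_diff` is illustrative, and the fallback decorator used when Numba is not installed is an assumption added only so that the example runs anywhere (without Numba it is plain Python and gets no speed-up).

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, fall back to an identity decorator.
    def njit(func):
        return func


@njit
def sum_sq_diff(a, b):
    """Sum of squared differences, written as an explicit loop."""
    total = 0.0
    for i in range(a.shape[0]):
        d = a[i] - b[i]
        total += d * d
    return total


a = np.arange(1000, dtype=np.float64)
b = np.ones(1000, dtype=np.float64)

loop_result = sum_sq_diff(a, b)
numpy_result = float(np.sum((a - b) ** 2))
print(loop_result, numpy_result)
```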

Numba can also be used to create new NumPy universal functions:
https://numba.pydata.org/numba-doc/dev/user/vectorize.html

Numba’s vectorize allows Python functions taking scalar input arguments to be used as NumPy ufuncs. Creating a traditional NumPy ufunc is not the most straightforward process and involves writing some C code. Numba makes this easy. Using the vectorize() decorator, Numba can compile a pure Python function into a ufunc that operates over NumPy arrays as fast as traditional ufuncs written in C.

Using vectorize(), you write your function as operating over input scalars, rather than arrays. Numba will generate the surrounding loop (or kernel) allowing efficient iteration over the actual inputs. 

In [None]:
%%timeit 

from numba import vectorize, float64
import numpy as np

@vectorize([float64(float64, float64)])
def f(x, y):
    x - y
    return x + y
N = 100000000
A = np.array(np.random.sample(N), dtype=np.float64)
B = np.array(np.random.sample(N), dtype=np.float64)
result = f(A,B)

In [None]:
%%timeit

from numba import vectorize, float64
import numpy as np

@vectorize([float64(float64, float64)], target = "parallel")
def f(x, y):
    x - y
    return x + y

N = 100000000
A = np.array(np.random.sample(N), dtype=np.float64)
B = np.array(np.random.sample(N), dtype=np.float64)
f(A,B)

In [None]:
%%timeit

import numpy as np

def f(x, y):
    x - y
    return x + y

N = 100000000
A = np.array(np.random.sample(N), dtype=np.float64)
B = np.array(np.random.sample(N), dtype=np.float64)
f(A,B)

Numba is not limited to NumPy code. It can also compile plain Python numerical functions:

In [9]:
from numba import jit
import math

@jit
def hypot(x, y):
    # Implementation from https://en.wikipedia.org/wiki/Hypot
    x = abs(x)
    y = abs(y)
    t = min(x, y)
    x = max(x, y)
    t = t / x
    return x * math.sqrt(1+t*t)

The first time we call hypot, the compiler is triggered and compiles a machine code implementation for the argument types it receives. Numba also saves the original Python implementation of the function in the .py_func attribute, so we can call the original Python code to compare speed and to make sure we get the same answer:

In [10]:
%timeit hypot(10,20)

The slowest run took 317085.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 5: 240 ns per loop


In [11]:
%timeit hypot.py_func(10,20)

The slowest run took 14.34 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 5: 887 ns per loop


How does Numba work? From https://towardsdatascience.com/speed-up-your-algorithms-part-2-numba-293e554c5cc1

![numba](img/numba.png)

Numba also supports GPU programming.

In [None]:
import numba
from numba import cuda

In [None]:
# list of devices
print(cuda.gpus)
# Select your device
numba.cuda.select_device(0)

In [14]:
from numba import vectorize
import numpy as np

@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
    return x + y

In [26]:
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
b_col = b.reshape(4,1)
c = np.arange(4*4).reshape((4,4))

print('a+b:\n', add_ufunc(a, b))
print('\n\n')
print('b_col + c:\n', add_ufunc(b_col, c))

a+b:
 [11 22 33 44]



b_col + c:
 [[10 11 12 13]
 [24 25 26 27]
 [38 39 40 41]
 [52 53 54 55]]


In [27]:
%timeit np.add(b_col, c)   # NumPy on CPU

The slowest run took 113.47 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 5: 1.17 µs per loop


In [28]:
%timeit add_ufunc(b_col, c) # Numba on GPU

1000 loops, best of 5: 1.69 ms per loop


Why is the GPU slower?
* Our inputs are too small: the GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 and 16 integers, respectively. We need a much larger array to even keep the GPU busy.
* Our calculation is too simple: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations (often called "arithmetic intensity"), then the GPU will spend most of its time waiting for data to move around.
* We copy the data to and from the GPU: While including the copy time can be realistic for a single function, often we want to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.
* Our data types are larger than necessary: Our example uses int64 when we probably don't need it. Scalar code using data types that are 32 and 64-bit run basically the same speed on the CPU, but 64-bit data types have a significant performance cost on the GPU. Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than 32-bit floats. NumPy defaults to 64-bit data types when creating arrays, so it is important to set the dtype attribute or use the ndarray.astype() method to pick 32-bit types when you need them.
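The last point is easy to check on the CPU side. The short NumPy sketch below shows the default 64-bit float type and the two ways of getting 32-bit data mentioned above. The array sizes are arbitrary.

```python
import numpy as np

a64 = np.random.rand(1000)                 # NumPy floats default to float64
a32 = a64.astype(np.float32)               # convert an existing array
ints32 = np.arange(1000, dtype=np.int32)   # or set dtype at creation time

print(a64.dtype, a64.nbytes)
print(a32.dtype, a32.nbytes)
print(ints32.dtype, ints32.nbytes)
```

Halving the element size also halves the amount of data that has to be copied to and from the GPU.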

Let's see a bigger example

In [3]:
from numba import vectorize
import numpy as np
import math  # Note that for the CUDA target, we need to use the scalar functions from the math module, not NumPy

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def gaussian_pdf(x, mean, sigma):
    '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''
    return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * np.float32((2*math.pi)**0.5))

In [4]:
x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)
mean = np.float32(0.0)
sigma = np.float32(1.0)

In [5]:
%timeit gaussian_pdf(x, mean, sigma)

The slowest run took 79.18 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 5: 4.36 ms per loop


In [6]:
import scipy.stats # for definition of gaussian distribution
norm_pdf = scipy.stats.norm
%timeit norm_pdf.pdf(x, loc=mean, scale=sigma)

10 loops, best of 5: 63.7 ms per loop


Of course, not everything can be vectorized, so you will sometimes need to write your own CUDA kernel. We won't go into the details here, but if you want to learn more about Numba, please look at the following links:

- https://www.youtube.com/watch?v=9bBsvpg-Xlk
- https://www.youtube.com/watch?v=CQDsT81GyS8&t
- https://colab.research.google.com/drive/15IDLiUMRJbKqZUZPccyigudINCD5uZ71?usp=sharing
- https://numba.pydata.org/numba-doc/latest/cuda/kernels.html
- https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- https://stackoverflow.com/questions/4391162/cuda-determining-threads-per-block-blocks-per-grid
- https://en.wikipedia.org/wiki/CUDA
- https://nyu-cds.github.io/python-numba/05-cuda/
- https://numba.pydata.org/numba-doc/latest/cuda/ufunc.html
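As a small taste of what these resources cover, a CUDA kernel is launched over a grid of blocks, each containing a fixed number of threads, and every thread computes its own global index to decide which element it handles. The sketch below emulates that indexing scheme on the CPU in pure Python. It is not a real kernel, and the names `launch_config` and `emulate_add_kernel` are hypothetical. In Numba, the index would typically come from `cuda.grid(1)` and the launch would be written `kernel[blocks, threads](...)`.

```python
import math


def launch_config(n_items, threads_per_block=256):
    """Number of blocks needed so that every item gets a thread."""
    blocks_per_grid = math.ceil(n_items / threads_per_block)
    return blocks_per_grid, threads_per_block


def emulate_add_kernel(x, y, threads_per_block=4):
    """CPU emulation of how a 1D kernel maps threads to elements."""
    n = len(x)
    blocks, threads = launch_config(n, threads_per_block)
    out = [0] * n
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            i = block_idx * threads + thread_idx  # global thread index
            if i < n:  # guard: the last block may have spare threads
                out[i] = x[i] + y[i]
    return out


print(launch_config(1_000_000))
print(emulate_add_kernel([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```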

## Numba limitations

Numba accelerates your code. So why shouldn't we use it for everything, if it is as simple as putting a decorator in front of a function?

Well, it's not that simple.

Numba is numerically-focused: Currently, Numba is focused on numerical data types, like int, float, and complex. There is very limited string processing support, and many string use cases are not going to work well on the GPU. To get best results with Numba, you will likely be using NumPy arrays. In the Numba version used here, when a function uses types that Numba cannot compile, such as a string or a dict, Numba issues a warning, falls back to a slower mode, and the function runs at roughly normal Python speed. More recent Numba releases raise an error by default instead.

In [None]:
@jit()
def cannot_compile(x):
    return x['key']

cannot_compile(dict(key='hey heres your value'))

To avoid this behavior (we want an error message and not just a warning), we add the argument nopython=True.

In [None]:
@jit(nopython=True)
def cannot_compile(x):
    return x['key']

cannot_compile(dict(key='hey heres your value'))

<a name="TODO"></a>
## TODO

code review: 
- https://www.programcreek.com/python/example/111769/cupy.ElementwiseKernel