# Chapter 3: GPU programming

Why do we want to use GPUs?

GPU hardware is designed for data parallelism. Maximum throughput is achieved when you apply the same operation to many different elements at once.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

Deep learning is among the tasks that benefit significantly from parallel processing. Other tasks cannot be parallelized, for example when every step needs the same object in memory (e.g. computing a series such as Fibonacci, where each term depends on the previous ones).
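To make the contrast concrete, here is a minimal CPU-only sketch using NumPy. The names `values`, `squared` and `fibonacci` are illustrative and not part of the original notebook.

```python
import numpy as np

# Data-parallel: each output element depends only on its own input element,
# so all elements could be computed at the same time.
values = np.arange(8, dtype=np.float64)
squared = values ** 2

# Sequential: each Fibonacci term needs the two previous terms,
# so the loop iterations cannot run independently.
def fibonacci(n):
    terms = [0, 1]
    for _ in range(n - 2):
        terms.append(terms[-1] + terms[-2])
    return terms[:n]

print(squared)
print(fibonacci(10))
```

The first computation is the kind of work a GPU handles well; the second gains nothing from thousands of cores because of its step-to-step dependency.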

It would be convenient to write the same code as usual (NumPy, pandas, ...) and simply run the computation on a GPU. This would make it much easier to parallelize processes. Several companies, universities, and individuals are working on libraries of this kind, and those are what we are going to use in this section.

Structure:
- [CuPy](#CuPy)
- [Numba](#Numba)
- [CuDF](#CuDF)
- [CuML](#CuML)
- [TODO](#TODO)

<a name="CuPy"></a>
## CuPy

CuPy is the GPU equivalent of NumPy. It exposes the same methods as NumPy, so the cost of moving from NumPy to CuPy is low.

In [None]:
import cupy as cp
import numpy as np

In [None]:
z = cp.arange(6).reshape(2, 3).astype('f')
z

In [None]:
z.mean(axis=0)

In [None]:
z.sum(axis=1)

In [None]:
z.dot(z.T).astype('int')

In [None]:
ary = cp.arange(10).reshape((2,5))
print(repr(ary))
print(ary.dtype)
print(ary.shape)
print(ary.strides)
print(ary.device)

In [None]:
ary_cpu = np.arange(10)
ary_gpu = cp.asarray(ary_cpu)
print('cpu:', ary_cpu)
print('gpu:', ary_gpu)

In [None]:
ary_cpu_returned = cp.asnumpy(ary_gpu)
print(repr(ary_cpu_returned))
print(type(ary_cpu_returned))

In [None]:
print(ary_gpu * 2)
print(cp.exp(-0.5 * ary_gpu**2))
print(cp.linalg.norm(ary_gpu))
print(cp.random.normal(loc=5, scale=2.0, size=10))

You may notice a slight pause the first time you run these functions. This is because CuPy has to compile the CUDA functions on the fly, and then caches them to disk for future reuse. (This is similar to Numba, where the first call converts your function into another "language".)
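One way to observe this pause is to time the same call twice. The sketch below is an assumption-laden illustration: the helpers `synchronize` and `timed` are hypothetical names, and the code falls back to NumPy when no usable GPU is found so that it still runs on a CPU-only machine. Because CuPy launches kernels asynchronously, the sketch waits for the GPU before reading the clock.

```python
import time
import numpy as np

try:
    import cupy as cp
    cp.arange(1)  # raises if no usable GPU is present
    xp = cp
except Exception:
    cp = None
    xp = np  # fallback (assumption): run the same code on the CPU

def synchronize():
    # Wait for pending GPU work; a no-op on the CPU fallback.
    if cp is not None:
        cp.cuda.Stream.null.synchronize()

def timed(fn, *args):
    # Return the result of fn(*args) and the elapsed wall-clock time in seconds.
    synchronize()
    start = time.perf_counter()
    result = fn(*args)
    synchronize()
    return result, time.perf_counter() - start

a = xp.arange(1_000_000, dtype=np.float32)
_, first_call = timed(xp.tanh, a)
_, second_call = timed(xp.tanh, a)
print(f"first call:  {first_call:.6f} s")
print(f"second call: {second_call:.6f} s")
```

On a GPU the first call is typically slower than the second, unless the kernel is already in the on-disk cache from an earlier session.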

In [None]:
import pandas as pd
import cupy as cp
import numpy as np

In [None]:
df = pd.read_csv("data/Chap3/california_housing_train.csv")
intercept = np.ones(len(df))

y = np.array(df["median_house_value"])
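# Prepend the column of ones so that the regression includes the intercept term
X = np.column_stack([intercept, df.drop(["median_house_value"], axis=1).to_numpy()])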

In [None]:
y_gpu = cp.asarray(y)
X_gpu = cp.asarray(X)

In [None]:
%%timeit 
beta = np.matmul(np.linalg.inv(np.matmul(X.T,X)),np.matmul(X.T,y))

In [None]:
%%timeit 
beta = cp.matmul(cp.linalg.inv(cp.matmul(X_gpu.T,X_gpu)),cp.matmul(X_gpu.T,y_gpu))
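Both cells evaluate the normal equations, $\beta = (X^\top X)^{-1} X^\top y$. As a hedged aside, the explicit inverse can usually be replaced by a linear solve, which is generally cheaper and numerically more stable. The sketch below uses NumPy and synthetic data (`X_demo`, `y_demo` and `true_beta` are made up for illustration); `cp.linalg.solve` is the CuPy counterpart for GPU arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_beta  # noise-free, so least squares recovers true_beta

# Explicit inverse, as in the cells above
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ (X_demo.T @ y_demo)

# Same normal equations, solved without forming the inverse
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

print(beta_inv)
print(beta_solve)
```

Both estimates agree with `true_beta` up to floating-point error.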

In [None]:
x = cp.arange(10, dtype=np.float32).reshape(2, 5)
# y is a raw argument indexed from 0 to x.size - 1, so it needs as many elements as x
y = cp.arange(10, dtype=np.float32)

add_reverse = cp.ElementwiseKernel(
    'T x, raw T y', 
    'T z',
    '''
    z = x + y[_ind.size() - i - 1];
    ''',
    'add_reverse')

In [None]:
add_reverse(x,y)
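To see what the kernel body computes, here is a NumPy reference sketch that spells out the indexing `z = x + y[_ind.size() - i - 1]` as a plain loop. It assumes `y` holds as many elements as `x`, because the index runs over every element of `x`; the names `x_ref`, `y_ref` and `z_ref` are illustrative.

```python
import numpy as np

x_ref = np.arange(10, dtype=np.float32).reshape(2, 5)
y_ref = np.arange(10, dtype=np.float32)  # assumption: same number of elements as x_ref

z_ref = np.empty_like(x_ref)
flat_x, flat_z = x_ref.ravel(), z_ref.ravel()  # views on the contiguous arrays
for i in range(x_ref.size):
    # i is the flat element index; y is read back to front
    flat_z[i] = flat_x[i] + y_ref[x_ref.size - i - 1]

# The same result without a loop
vectorized = x_ref + y_ref[::-1].reshape(x_ref.shape)

print(z_ref)
```

The loop and the vectorized expression produce the same array, so a custom kernel is not strictly needed for this operation; it mainly illustrates how `raw` arguments are indexed by hand.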

In [None]:
# Customized CUDA Kernel

_fit_calc_distances = cp.ElementwiseKernel(
    'S data, raw S centers, int32 n_clusters, int32 dim', 'raw S dist',
    '''
    for (int j = 0; j < n_clusters; j++){
        int cent_ind[] = {j, i % dim};
        int dist_ind[] = {i / dim, j};
        double diff = centers[cent_ind] - data;
        atomicAdd(&dist[dist_ind], diff * diff);
    }
    ''',
    'calc_distances'
)

_fit_calc_center = cp.ElementwiseKernel(
    'S data, T label, int32 dim', 'raw S centers, raw S group',
    '''
    int cent_ind[] = {label, i % dim};
    atomicAdd(&centers[cent_ind], data);
    atomicAdd(&group[label], 1);
    ''',
    'calc_center'
)
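The two kernels are easier to read next to a CPU reference. The sketch below is a NumPy approximation of what they compute, not the kernels themselves: `calc_distances_ref` and `calc_center_ref` are hypothetical names, and `np.add.at` stands in for `atomicAdd`. Note that the distance kernel accumulates squared distances (no square root), which leaves the `argmin` unchanged.

```python
import numpy as np

def calc_distances_ref(data, centers):
    # Squared Euclidean distance from every row of data to every center.
    n_samples, dim = data.shape
    dist = np.zeros((n_samples, len(centers)), dtype=data.dtype)
    flat = data.ravel()
    for i in range(flat.size):  # i plays the role of the kernel's element index
        for j in range(len(centers)):
            diff = centers[j, i % dim] - flat[i]
            dist[i // dim, j] += diff * diff  # atomicAdd in the kernel
    return dist

def calc_center_ref(data, pred, n_clusters):
    # Mean of the rows assigned to each cluster, built from scatter-adds.
    n_samples, dim = data.shape
    centers = np.zeros((n_clusters, dim), dtype=data.dtype)
    group = np.zeros(n_clusters, dtype=data.dtype)
    label = np.broadcast_to(pred[:, None], data.shape)
    cols = np.broadcast_to(np.arange(dim), data.shape)
    np.add.at(centers, (label, cols), data)  # per-cluster, per-column sums
    np.add.at(group, label, 1)               # counts each sample dim times
    group /= dim                             # back to one count per sample
    return centers / group[:, None]

data = np.array([[0, 0], [2, 0], [10, 10], [12, 10]], dtype=np.float32)
pred = np.array([0, 0, 1, 1], dtype=np.int32)
centers = calc_center_ref(data, pred, n_clusters=2)
dist = calc_distances_ref(data, centers)
print(centers)
print(dist.argmin(axis=1))
```

The division of `group` by the number of columns mirrors the kernel launch: the label is broadcast against every element of `data`, so each sample increments its cluster counter once per column.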

In [None]:
# CPU/GPU-agnostic fit function

def fit(X, n_clusters, max_iter, use_custom_kernel):
    # make sure that X is a matrix not a tensor
    assert X.ndim == 2

    #NumPy/CuPy generic function
    xp = cp.get_array_module(X)
    
    # init pred vector
    pred = xp.zeros(len(X), dtype=np.int32)

    # Choose n_clusters initial centers among the rows of X
    initial_indexes = np.random.choice(len(X), n_clusters,
                                       replace=False).astype(np.int32)
    centers = X[initial_indexes]

    # init n_obs and n_variables
    data_num = X.shape[0]
    data_dim = X.shape[1]

    # Repeat the process at most max_iter times
    for _ in range(max_iter):
        # calculate distances between centers and every observation
        if not use_custom_kernel or xp == np:
            # There are several ways to do this operation.
            # Here newaxis (None) broadcasts X against centers, so the norm is
            # computed for every (observation, center) pair.
            distances = xp.linalg.norm(X[:, None, :] - centers[None, :, :],
                                       axis=2)
        else:
            distances = xp.zeros((data_num, n_clusters), dtype=np.float32)
            _fit_calc_distances(X, centers, n_clusters, data_dim, distances)

        # assign points to the closest center
        new_pred = xp.argmin(distances, axis=1).astype(np.int32)

        # If no assignment changed, the algorithm can stop early
        if xp.all(new_pred == pred):
            break
        pred = new_pred

        # calculate centers
        if not use_custom_kernel or xp == np:
            centers = xp.stack([X[pred == i].mean(axis=0)
                                for i in range(n_clusters)])
        else:
            # init centers
            centers = xp.zeros((n_clusters, data_dim),
                               dtype=np.float32)
            # init group
            group = xp.zeros(n_clusters, dtype=np.float32)
            # label
            label = pred[:, None]
            _fit_calc_center(X, label, data_dim, centers, group)
            group /= data_dim
            centers /= group[:, None]

    return centers, pred
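The CPU/GPU-agnostic part of `fit` rests on `cp.get_array_module`, which returns `cupy` for CuPy arrays and `numpy` otherwise. Here is a minimal sketch of the same pattern in isolation. The helpers `array_module` and `standardize` are hypothetical names, and the `try`/`except` fallback is an assumption added so the example also runs where CuPy is not installed.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # CuPy not installed: everything below stays on the CPU

def array_module(a):
    # Return cupy for CuPy arrays and numpy otherwise.
    if cp is not None:
        return cp.get_array_module(a)
    return np

def standardize(a):
    # Works for NumPy and CuPy inputs alike, because xp follows the input type.
    xp = array_module(a)
    return (a - xp.mean(a, axis=0)) / xp.std(a, axis=0)

out = standardize(np.array([[1.0, 2.0], [3.0, 6.0]]))
print(type(out).__module__, out)
```

Passing `cp.asarray(...)` instead of a NumPy array would run the same function on the GPU without changing its body.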

In [None]:
# make_blobs lives in sklearn.datasets (the samples_generator module was removed)
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=100000, centers=10,
                       center_box=(-10.0, 10.0),
                       cluster_std=1.5, random_state=4)
X = X.astype(np.float32)

In [None]:
X_gpu = cp.asarray(X)

In [None]:
%timeit centers, pred = fit(X, n_clusters=10, max_iter=100, use_custom_kernel=False)

In [None]:
%timeit centers, pred = fit(X_gpu, n_clusters=10, max_iter=100, use_custom_kernel=False)

In [None]:
%timeit centers, pred = fit(X_gpu, n_clusters=10, max_iter=100, use_custom_kernel=True)

<a name="Numba"></a>
## Numba

<a name="CuDF"></a>
## CuDF

<a name="CuML"></a>
## CuML

<a name="TODO"></a>
## TODO