Comparing Stencil Computations on the CPU and GPU
=================================================

This notebook builds on a [recent blog post](https://blog.dask.org/2019/04/09/numba-stencil) that used Numba, Dask, and NumPy to build parallel, compiled computations easily.

At the end of that post I asked how we could run these computations on the GPU.  I was curious both about how much harder the code would be to write and about how much faster it would run.  This notebook explores that question using [numba.cuda.jit](https://numba.pydata.org/numba-doc/dev/cuda/index.html).

We learn that it is, in fact, much faster to use a GPU and that, if you're OK with copy-pasting a bit of code that you might not understand, it's also quite easy, even for someone unfamiliar with GPUs, like myself.

## Stencil computations on CPU

In [1]:
import numpy as np
import numba

@numba.stencil
def _smooth(x):
    return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
            x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
            x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9

@numba.njit
def smooth_cpu(x):
    return _smooth(x)

In [2]:
x_cpu = np.ones((10000, 10000), dtype='int8')

%timeit smooth_cpu(x_cpu)

621 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Stencil computations on GPU

Using the `numba.cuda` module I'm able to get about a 200x speedup (621 ms down to 2.87 ms) with a modest increase in code complexity.

In [3]:
from numba import cuda

@cuda.jit
def smooth_gpu(x, out):
    i, j = cuda.grid(2)
    n, m = x.shape
    if 1 <= i < n - 1 and 1 <= j < m - 1:
        out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                     x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                     x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9

In [4]:
import cupy, math

x_gpu = cupy.ones((10000, 10000), dtype='int8')
out_gpu = cupy.zeros((10000, 10000), dtype='int8')

# I copied the four lines below from the Numba docs
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(x_gpu.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x_gpu.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

%timeit smooth_gpu[blockspergrid, threadsperblock](x_gpu, out_gpu)

2.87 ms ± 90.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


*Note: the GPU solution here cheats a bit because it pre-allocates the output array outside the timed call, whereas the CPU version allocates its result inside it.*
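
For readers who, like me, copied the launch lines without fully understanding them, here is a rough pure-Python emulation of what the grid of blocks and threads is doing. It is only a sketch: `emulate_smooth_gpu` is a made-up name, it runs sequentially on the CPU rather than in parallel, and it uses tiny 2x2 blocks so the behaviour is easy to see. The index arithmetic mirrors what `cuda.grid(2)` computes for each thread.

```python
import math


def emulate_smooth_gpu(x, threadsperblock=(2, 2)):
    """Sequentially emulate the block/thread layout of the smooth_gpu launch.

    `x` is a list of lists of ints. This is an illustration only, not how
    Numba executes the kernel.
    """
    n, m = len(x), len(x[0])
    out = [[0] * m for _ in range(n)]

    # Same ceiling arithmetic as the four lines copied from the Numba docs
    blocks_x = math.ceil(n / threadsperblock[0])
    blocks_y = math.ceil(m / threadsperblock[1])

    for bx in range(blocks_x):                        # block index along axis 0
        for by in range(blocks_y):                    # block index along axis 1
            for tx in range(threadsperblock[0]):      # thread index within block
                for ty in range(threadsperblock[1]):
                    # The (i, j) pair that cuda.grid(2) would give this thread
                    i = bx * threadsperblock[0] + tx
                    j = by * threadsperblock[1] + ty
                    # The guard skips border cells and also any threads that
                    # overshoot the array when the shape is not a multiple
                    # of the block size
                    if 1 <= i < n - 1 and 1 <= j < m - 1:
                        out[i][j] = sum(x[i + di][j + dj]
                                        for di in (-1, 0, 1)
                                        for dj in (-1, 0, 1)) // 9
    return out


x = [[1] * 5 for _ in range(5)]
out = emulate_smooth_gpu(x)
for row in out:
    print(row)  # interior cells are 1, border cells stay 0
```

With a 5x5 input and 2x2 blocks, the launch covers a 6x6 grid of threads, so some threads land outside the array. The `if` guard in the kernel is what keeps those threads, and the threads on the border, from reading or writing out of bounds.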

## Final thoughts

GPUs are fast at doing local stencil computations.  This isn't surprising, but it's nice to see.  

What is surprisingly pleasant is how easy it is to write CUDA-like array computing code from Python with Numba and CuPy.

Actually that's not entirely true.  I still don't fully understand how I should choose the `blockspergrid` and `threadsperblock` launch parameters for the `smooth_gpu` kernel, and I suspect that many novice users share my uncertainty.  I am curious whether there is a way to make a decent choice here given information that I do know (like the shape of the grid I'd like to compute over) and other information that can be pulled automatically from the hardware.  I would much rather say, for example...

```python
@cuda.jit
def smooth_gpu(x, out):
    n, m = x.shape
    i, j = cuda.grid[1:n - 1, 1:m - 1]  # <-- define grid here?
    
    out[i, j] = (x[i - 1, j - 1] + x[i - 1, j] + x[i - 1, j + 1] +
                 x[i    , j - 1] + x[i    , j] + x[i    , j + 1] +
                 x[i + 1, j - 1] + x[i + 1, j] + x[i + 1, j + 1]) // 9
    

smooth_gpu(x_gpu, out_gpu)
```

... or something similar (I don't know the technical constraints here well enough to propose solutions with much probability of success).
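
In the meantime, the launch boilerplate can at least be tucked behind a small helper. The sketch below is an assumption on my part, not a Numba feature: `launch_config` and `launch` are hypothetical names, and the `(16, 16)` default is simply the value used earlier in this notebook, not a tuned choice. It derives `blockspergrid` from the array shape using the same ceiling arithmetic as the lines copied from the Numba docs.

```python
def launch_config(shape, threadsperblock=(16, 16)):
    """Return (blockspergrid, threadsperblock) covering every element of `shape`.

    Hypothetical helper; the default block size is the notebook's value,
    not a hardware-aware choice.
    """
    if len(shape) != len(threadsperblock):
        raise ValueError("shape and threadsperblock must have the same length")
    # Integer ceiling division: add a partial block whenever a dimension
    # is not an exact multiple of the block size
    blockspergrid = tuple((n + t - 1) // t
                          for n, t in zip(shape, threadsperblock))
    return blockspergrid, tuple(threadsperblock)


def launch(kernel, *arrays, threadsperblock=(16, 16)):
    """Launch `kernel` over the shape of the first array (hypothetical helper)."""
    blockspergrid, threadsperblock = launch_config(arrays[0].shape,
                                                   threadsperblock)
    kernel[blockspergrid, threadsperblock](*arrays)


print(launch_config((10000, 10000)))  # ((625, 625), (16, 16))
```

With this in place the call from earlier would read `launch(smooth_gpu, x_gpu, out_gpu)`. The kernel still needs its own bounds check, because the grid can overshoot the array whenever the shape is not a multiple of the block size.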