Numba’s CUDA support exposes facilities to declare and manage this hierarchy of threads. These facilities closely resemble those exposed by NVIDIA’s CUDA C language.

Numba also exposes three kinds of GPU memory: global device memory (the large, relatively slow off-chip memory that’s connected to the GPU itself), on-chip shared memory and local memory. For all but the simplest algorithms, it is important that you carefully consider how to use and access memory in order to minimize bandwidth requirements and contention.

# Kernels in Numba
A kernel function is a GPU function that is meant to be called from CPU code (*). This gives it two fundamental characteristics:

- kernels cannot explicitly return a value; all result data must be written to an array passed to the function (if computing a scalar, you will probably pass a one-element array);
- kernels explicitly declare their thread hierarchy when called, i.e. the number of thread blocks and the number of threads per block (note that while a kernel is compiled once, it can be called multiple times with different block sizes or grid sizes).

At first sight, writing a CUDA kernel with Numba looks very much like writing a JIT function for the CPU.

To invoke the kernel, we need to specify:
- a number of blocks ("blocks per grid")
- a number of threads per block (block size)

NOTE: the block size determines how many threads share a given area of shared memory.

The product of these two numbers gives the total number of threads launched for the kernel.
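
As a rough illustration of the launch arithmetic described above, the following CPU-only sketch computes how many blocks are needed to cover an array and how many threads the launch creates in total. The helper name `launch_config` is hypothetical and is not part of Numba; it only reproduces the arithmetic.

```python
def launch_config(n_elements, threads_per_block):
    """Return (blocks_per_grid, total_threads) for a 1D launch.

    Hypothetical helper for illustration only; not a Numba API.
    """
    # Ceiling division: enough blocks so that every element gets a thread.
    blocks_per_grid = (n_elements + threads_per_block - 1) // threads_per_block
    # The total thread count is the product of the two launch parameters.
    total_threads = blocks_per_grid * threads_per_block
    return blocks_per_grid, total_threads


blocks, total = launch_config(10_000, 512)
# Blocks, total threads, and threads that have no element to work on.
print(blocks, total, total - 10_000)
```

For 10,000 elements and 512 threads per block this gives 20 blocks and 10,240 threads, so 240 threads have no element to process. This is why a kernel normally checks its index against the array size before writing.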


NOTE: Kernels run asynchronously: launches queue their execution on the device and then return immediately. You can use cuda.synchronize() to wait for all previous kernel launches to finish executing.

In [None]:
import cupy as cp
import numba
import numpy as np
from numba import cuda

@cuda.jit
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    bx = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Compute flattened index inside the array
    pos = tx + bx * bw
    if pos < an_array.size:  # Check array boundaries
        an_array[pos] += 1

# threadIdx, blockIdx, blockDim and gridDim are special objects provided by the CUDA backend for the sole purpose of knowing the geometry of the thread hierarchy and the position of the current thread within that geometry.
# These objects can be 1D, 2D or 3D, depending on how the kernel was invoked. To access the value at each dimension, use the x, y and z attributes of these objects, respectively.


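# Kernel invocation

an_array = cp.ones(10000, dtype=cp.float32)
threadsperblock = 512
# Ceiling division: enough blocks to cover every element of the array
blockspergrid = (an_array.size + (threadsperblock - 1)) // threadsperblock
# The first call also triggers JIT compilation, and launches are asynchronous,
# so without cuda.synchronize() this may mostly measure launch overhead.
%timeit -n 10 increment_by_one[blockspergrid, threadsperblock](an_array)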


The slowest run took 42.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1.54 ms ± 3.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
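
To make the index arithmetic of the kernel above easier to check, here is a minimal CPU-only sketch that replays it with ordinary loops. It is an emulation under a simplifying assumption: the "threads" run one after another here, whereas on the GPU they run concurrently. The function name `simulate_increment` is hypothetical.

```python
def simulate_increment(values, blocks_per_grid, threads_per_block):
    """Sequential CPU emulation of the 1D increment kernel (illustration only)."""
    out = list(values)
    for block_idx in range(blocks_per_grid):         # stands in for cuda.blockIdx.x
        for thread_idx in range(threads_per_block):  # stands in for cuda.threadIdx.x
            # Flattened index, same formula as in the kernel.
            pos = thread_idx + block_idx * threads_per_block
            if pos < len(out):                       # same bounds check as the kernel
                out[pos] += 1
    return out


data = [1.0] * 10
# 3 blocks of 4 threads give 12 threads for 10 elements; 2 threads do nothing.
result = simulate_increment(data, blocks_per_grid=3, threads_per_block=4)
print(result)
```

Every element is incremented exactly once, and the two surplus threads are filtered out by the bounds check.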




Simple algorithms tend to use thread indices in the same way as in the example above, so Numba provides additional facilities to automate such calculations:

- numba.cuda.grid(ndim): returns the absolute position of the current thread in the entire grid of blocks. ndim should correspond to the number of dimensions declared when instantiating the kernel.
- numba.cuda.gridsize(ndim): returns the absolute size (or shape) in threads of the entire grid of blocks. ndim has the same meaning as in grid() above.
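
The two descriptions above can be sketched on the CPU for the 2D case. The functions `grid_2d` and `gridsize_2d` below are hypothetical stand-ins written for illustration, not Numba's implementation, and the array shape and launch dimensions are made-up small values.

```python
from itertools import product


def grid_2d(thread_idx, block_idx, block_dim):
    """Absolute (x, y) position of a thread, as described for cuda.grid(2)."""
    return (thread_idx[0] + block_idx[0] * block_dim[0],
            thread_idx[1] + block_idx[1] * block_dim[1])


def gridsize_2d(block_dim, grid_dim):
    """Total grid shape in threads, as described for cuda.gridsize(2)."""
    return (block_dim[0] * grid_dim[0], block_dim[1] * grid_dim[1])


shape = (5, 7)      # hypothetical array shape
block_dim = (2, 4)  # threads per block
grid_dim = (3, 2)   # blocks per grid: ceil(5 / 2) and ceil(7 / 4)

# Enumerate every thread of every block and compute its absolute position.
positions = [
    grid_2d((tx, ty), (bx, by), block_dim)
    for bx, by in product(range(grid_dim[0]), range(grid_dim[1]))
    for tx, ty in product(range(block_dim[0]), range(block_dim[1]))
]
# Keep only the threads that map to a valid array element.
in_bounds = [p for p in positions if p[0] < shape[0] and p[1] < shape[1]]
print(gridsize_2d(block_dim, grid_dim), len(positions), len(in_bounds))
```

Here the grid spans 6 x 8 = 48 threads, of which 35 fall inside the 5 x 7 array. The remaining 13 are the ones a bounds check has to reject.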

In [None]:
import math

@cuda.jit
def increment_by_one(an_array):
    pos = cuda.grid(1)
    if pos < an_array.size:
        an_array[pos] += 1

@cuda.jit
def increment_a_2D_array(an_array):
    x, y = cuda.grid(2)
    if x < an_array.shape[0] and y < an_array.shape[1]:
        an_array[x, y] += 1


an_array = cp.random.random((1000, 1000), dtype=cp.float32)
threadsperblock = (16, 16)
blockspergrid_x = math.ceil(an_array.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(an_array.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

increment_a_2D_array[blockspergrid, threadsperblock](an_array)
%timeit -n 100 increment_a_2D_array[blockspergrid, threadsperblock](an_array)

236 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Matmul with Numba kernel on GPU

This section implements matrix multiplication (matmul) as a Numba kernel that stages tiles of the input matrices in shared memory.
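
Before reading the kernel, it may help to see the tiling idea on the CPU. The sketch below is a plain-Python emulation, not the GPU code: the name `tiled_matmul` is hypothetical, the small lists `sA` and `sB` merely play the role of shared memory, and the matrix dimensions are assumed to be multiples of the tile size.

```python
def tiled_matmul(A, B, tile):
    """Multiply list-of-lists matrices A (n x k) and B (k x m) tile by tile.

    Illustration only; assumes n, k and m are multiples of `tile`.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for row0 in range(0, n, tile):        # one "block" per output tile
        for col0 in range(0, m, tile):
            for k0 in range(0, k, tile):  # walk along the shared dimension
                # Stage one tile of A and one of B (the role of shared memory).
                sA = [row[k0:k0 + tile] for row in A[row0:row0 + tile]]
                sB = [row[col0:col0 + tile] for row in B[k0:k0 + tile]]
                # Accumulate the partial dot products for this pair of tiles.
                for i in range(tile):
                    for j in range(tile):
                        C[row0 + i][col0 + j] += sum(
                            sA[i][p] * sB[p][j] for p in range(tile)
                        )
    return C


A = [[3.0] * 4 for _ in range(4)]  # 4 x 4 matrix containing all 3's
B = [[4.0] * 2 for _ in range(4)]  # 4 x 2 matrix containing all 4's
C = tiled_matmul(A, B, tile=2)
print(C[0])
```

Each output element is built up from one partial dot product per tile, which is the same accumulation pattern the kernel uses with its running sum.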

In [None]:
# Faster matmul using shared memory
from numba import cuda, float32
import numpy
import math

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    """
    Perform matrix multiplication of C = A * B
    Each thread computes one element of the result matrix C
    """

    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    if x >= C.shape[0] or y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary.
        # With dimensions that are multiples of TPB (as below), no thread exits here.
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(A.shape[1] // TPB):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

# The data array
A = numpy.full((TPB*2, TPB*3), 3, dtype=numpy.float32) # [32 x 48] matrix containing all 3's
B = numpy.full((TPB*3, TPB*1), 4, dtype=numpy.float32) # [48 x 16] matrix containing all 4's

A_global_mem = cuda.to_device(A)
B_global_mem = cuda.to_device(B)
C_global_mem = cuda.device_array((TPB*2, TPB*1), dtype=numpy.float32) # [32 x 16] matrix result

# Configure the blocks
threadsperblock = (TPB, TPB)
blockspergrid_x = int(math.ceil(A.shape[0] / threadsperblock[0]))
blockspergrid_y = int(math.ceil(B.shape[1] / threadsperblock[1]))
blockspergrid = (blockspergrid_x, blockspergrid_y)

# Start the kernel
fast_matmul[blockspergrid, threadsperblock](A_global_mem, B_global_mem, C_global_mem)
res = C_global_mem.copy_to_host()

res_test = numpy.matmul(A, B)

print('results of Numba matmul:', res)

print('results of numpy matmul:', res_test)

results of Numba matmul: [[576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576. 576.
  576. 576.]
 [576. 576. 576. 576. 576. 576. 576. 576

