# Chapter 5: CUDA Kernels with Numba

<img src="images/numba_title.png" style="width:442px;"/>

Numba is an open-source just-in-time (JIT) compiler that translates a subset of Python and NumPy code into fast machine code.

Numba supports CUDA GPU programming by compiling a restricted subset of Python directly into CUDA kernels and device functions that follow the CUDA execution model. Kernels written in Numba appear to have direct access to NumPy arrays, because Numba transfers NumPy arrays passed to a kernel between the CPU and the GPU automatically.

## Links to Handy References
Numba for CUDA GPUs: https://numba.pydata.org/numba-doc/latest/cuda/index.html

CuPy’s interoperability guide (includes Numba): https://docs.cupy.dev/en/stable/user_guide/interoperability.html 

Numba GitHub repository: https://github.com/numba/numba


# Examples

## Defining and Launching a Kernel Function

In [None]:
import numpy as np
from numba import cuda

input = np.asarray(range(10))
output = np.zeros(len(input))

@cuda.jit
def foo(input_array, output_array):
    # Thread id in a 1D block
    thread_id = cuda.threadIdx.x
    # Block id in a 1D grid
    block_id = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    block_width = cuda.blockDim.x
    # Compute flattened index inside the array
    i = thread_id + block_id * block_width
    if i < input_array.size:  # Check array boundaries
        output_array[i] = input_array[i]

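# Threads per block
block_threads = 32
# Number of blocks, rounded up so that every element gets a thread
grid_blocks = (input.size + (block_threads - 1)) // block_threads

# Launch the kernel as foo[blocks per grid, threads per block](arguments)
foo[grid_blocks, block_threads](input, output)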

output
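
The launch arithmetic above can be checked on the CPU, without a GPU. The following is a minimal sketch in plain Python, not Numba code. The helper names `blocks_needed` and `active_indices` are hypothetical and exist only for this illustration. They mirror the ceiling division used for `grid_blocks` and the bounds check inside the kernel.

```python
def blocks_needed(n_elements, block_threads):
    """Ceiling division: smallest number of blocks that covers n_elements."""
    return (n_elements + (block_threads - 1)) // block_threads


def active_indices(n_elements, grid_blocks, block_threads):
    """Emulate the kernel's index math on the CPU (hypothetical helper).

    Returns the flattened indices that pass the bounds check, in launch order.
    """
    indices = []
    for block_id in range(grid_blocks):
        for thread_id in range(block_threads):
            i = thread_id + block_id * block_threads
            if i < n_elements:  # Same guard as in the kernel
                indices.append(i)
    return indices


n = 10
block_threads = 32
grid_blocks = blocks_needed(n, block_threads)
launched = grid_blocks * block_threads
covered = active_indices(n, grid_blocks, block_threads)

print(grid_blocks)              # 1 block is enough for 10 elements
print(launched - len(covered))  # 22 threads fail the guard and do nothing
```

Because the grid size is rounded up, a launch usually contains more threads than array elements, which is why the kernel needs the boundary check.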

## Simplified Kernel Function Using grid()

In [None]:
import numpy as np
from numba import cuda

input = np.asarray(range(10))
output = np.zeros(len(input))

@cuda.jit
def foo(input_array, output_array):
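    # Absolute thread index: threadIdx.x + blockIdx.x * blockDim.x
    i = cuda.grid(1)
    if i < input_array.size:  # Check array boundaries
        output_array[i] = input_array[i]
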
foo[1, len(input)](input, output)

output

## Using Numba with CuPy

In [None]:
import cupy
from numba import cuda

@cuda.jit
def add(x_array, y_array, output_array):
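    # Absolute index of this thread in the grid
    start = cuda.grid(1)
    # Total number of threads in the grid
    stride = cuda.gridsize(1)
    # Each thread handles every stride-th element, starting at its own index
    for i in range(start, x_array.shape[0], stride):
        output_array[i] = x_array[i] + y_array[i]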

a = cupy.arange(10)
b = a * 2
out = cupy.zeros_like(a)

add[1, 32](a, b, out)

print(out)  # => [ 0  3  6  9 12 15 18 21 24 27]
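
The loop in `add` is commonly called a grid-stride loop: each thread starts at its own global index and advances by the total number of threads in the grid, so a launch of any size covers the whole array. Below is a plain-Python sketch, running on the CPU only, of which elements each thread would visit. `grid_stride_indices` is a hypothetical helper written for this illustration and is not part of Numba or CuPy.

```python
def grid_stride_indices(n_elements, grid_blocks, block_threads):
    """Map each global thread index to the array indices it would process."""
    # For a 1D launch, cuda.gridsize(1) equals blocks * threads per block
    stride = grid_blocks * block_threads
    return {
        start: list(range(start, n_elements, stride))
        for start in range(stride)  # start plays the role of cuda.grid(1)
    }


# A deliberately small launch: 2 blocks of 2 threads for 10 elements
work = grid_stride_indices(10, grid_blocks=2, block_threads=2)
for thread, indices in work.items():
    print(thread, indices)
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6]
# 3 [3, 7]
```

With the `[1, 32]` launch used above, the stride of 32 exceeds the 10 elements, so threads 0 through 9 each process one element and the remaining threads do nothing.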