# CUDA programming

Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels and device functions following the CUDA execution model.

One feature that significantly simplifies writing GPU kernels is that Numba makes it appear that the kernel has direct access to NumPy arrays. NumPy arrays that are supplied as arguments to the kernel are transferred between the CPU and the GPU automatically (although this convenience can also hurt performance, because Numba conservatively copies each array back to the host when the kernel finishes, whether or not it is needed there).

Numba does not yet implement the full CUDA API, so some features are not available. However, the features that are provided are enough to begin experimenting with writing GPU-enabled kernels. CUDA support in Numba is being actively developed, so eventually most of the features should be available.

## Terminology

Several important terms used in CUDA programming are listed here:

host : the CPU

device : the GPU

host memory : the system main memory

device memory : onboard memory on a GPU card

kernel : a GPU function launched by the host and executed on the device

device function : a GPU function executed on the device which can only be called from the device (i.e. from a kernel or another device function)

## Device Management

Connect to a GPU runtime and list the currently connected GPUs:

```python
from numba import cuda
print(cuda.gpus)
```

```
<Managed Device 0>
```


## Writing CUDA kernels

CUDA has an execution model unlike the traditional sequential model used for programming CPUs. In CUDA, the code you write will be executed by multiple threads at once (often hundreds or thousands). Your solution will be modeled by defining a thread hierarchy of grid, blocks, and threads.

Numba also exposes three kinds of GPU memory:

* global device memory
* shared memory
* local memory

For all but the simplest algorithms, it is important that you carefully consider how to use and access memory in order to minimize bandwidth requirements and contention.

NVIDIA recommends that programmers focus on the following recommendations to achieve the best performance:

* Find ways to parallelise sequential code
* Minimise data transfers between the host and the device
* Adjust kernel launch configuration to maximize device utilization
* Ensure global memory accesses are coalesced
* Minimise redundant accesses to global memory whenever possible
* Avoid different execution paths within the same warp
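For the launch-configuration item, a common starting point is to fix the block size and round the grid size up so that every element is covered. The helper `launch_config_2d` below is a hypothetical name, not part of Numba, and the default of 16 x 16 threads per block is only an assumed starting value to tune from.

```python
import math


def launch_config_2d(shape, threads_per_block=(16, 16)):
    """Return (blocks_per_grid, threads_per_block) covering a 2D array.

    Hypothetical helper: uses ceiling division so that the grid is never
    smaller than the data; surplus threads must be masked in the kernel.
    """
    blocks_per_grid = tuple(
        math.ceil(n / t) for n, t in zip(shape, threads_per_block)
    )
    return blocks_per_grid, threads_per_block


blocks, threads = launch_config_2d((100, 250))
print(blocks, threads)   # (7, 16) (16, 16)
```

The resulting tuples can be used directly in a launch such as `kernel[blocks, threads](...)`, provided the kernel guards against indices that fall outside the array.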
