# Vector addition

This example uses Numba to create on-device arrays and a vector addition kernel; it is a warmup for learning how to write GPU kernels using Numba. We’ll begin with some required imports:

In [2]:
import numpy as np
from numba import cuda

The following function is the kernel. Note that it is defined in terms of Python variables with unspecified types. When the kernel is launched, Numba will examine the types of the arguments that are passed at runtime and generate a CUDA kernel specialized for them.

Note that Numba kernels do not return values and must write any output into arrays passed in as parameters (this is similar to the requirement that CUDA C/C++ kernels have `void` return type). Here we pass in `c` for the results to be written into.

In [3]:
@cuda.jit
def f(a, b, c):
    # like threadIdx.x + (blockIdx.x * blockDim.x)
    tid = cuda.grid(1)
    size = len(c)

    if tid < size:
        c[tid] = a[tid] + b[tid]
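
To make the index arithmetic and the bounds check concrete, the sketch below emulates a 1D launch on the CPU with plain loops. It is not how Numba executes a kernel (real GPU threads run in parallel), and the names `emulate_vector_add`, `a_host`, `b_host` and `c_host` are hypothetical, introduced only for this illustration:

```python
import numpy as np

def emulate_vector_add(a, b, c, nblocks, nthreads):
    """CPU emulation of a 1D launch: visit every (block, thread) pair in turn."""
    size = len(c)
    for block_idx in range(nblocks):
        for thread_idx in range(nthreads):
            # Same arithmetic as threadIdx.x + (blockIdx.x * blockDim.x)
            tid = thread_idx + block_idx * nthreads
            # Guard: indices past the end of the array are skipped
            if tid < size:
                c[tid] = a[tid] + b[tid]

n = 1000
a_host = np.random.random(n)
b_host = np.random.random(n)
c_host = np.zeros_like(a_host)

# 4 blocks of 256 threads give 1024 indices, 24 more than the 1000 elements
emulate_vector_add(a_host, b_host, c_host, nblocks=4, nthreads=256)
matches = np.allclose(c_host, a_host + b_host)
```

Without the `if tid < size` guard, the last 24 indices in this emulation would fall outside the arrays, which is why the kernel checks `tid` before writing.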

`cuda.to_device()` can be used to create device-side copies of arrays. `cuda.device_array_like()` creates an uninitialized array of the same shape and type as an existing array. Here we transfer two vectors and create an empty vector to hold our results:

In [4]:
N = 100000
a = cuda.to_device(np.random.random(N))
b = cuda.to_device(np.random.random(N))
c = cuda.device_array_like(a)

A call to `forall()` generates an appropriate launch configuration with a 1D grid (see Kernel invocation) for a given data size and is often the simplest way of launching a kernel:

In [5]:
f.forall(len(a))(a, b, c)
print(c.copy_to_host())

[1.6755152  0.39443128 0.99926269 ... 0.49445301 0.67592047 0.22192938]


One can also configure the grid manually using the subscripting syntax. The following example launches a grid with sufficient threads to operate on every vector element:

In [6]:
# 256 threads per block gives several warps per block (8 warps of 32 threads)
nthreads = 256
# Enough blocks to cover the entire vector depending on its length
nblocks = (len(a) // nthreads) + 1
f[nblocks, nthreads](a, b, c)
print(c.copy_to_host())

[1.6755152  0.39443128 0.99926269 ... 0.49445301 0.67592047 0.22192938]
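
The block count in the manual launch can be checked with plain integer arithmetic. The sketch below compares the `(len(a) // nthreads) + 1` expression used above with ceiling division, a common alternative; `blocks_needed` is a hypothetical helper written for this illustration and is not part of Numba:

```python
def blocks_needed(n, nthreads):
    """Smallest number of blocks whose threads cover n elements (ceiling division)."""
    return (n + nthreads - 1) // nthreads

nthreads = 256
for n in (100000, 1024):
    floor_plus_one = (n // nthreads) + 1
    ceiling = blocks_needed(n, nthreads)
    print(n, floor_plus_one, ceiling)

# prints:
# 100000 391 391
# 1024 5 4
```

Both expressions always launch enough threads. They differ only when the length is an exact multiple of the block size, where the `+ 1` form launches one extra block whose threads are all rejected by the `tid < size` check in the kernel.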
