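This notebook demonstrates a Numba CUDA kernel. Numba kernels are written in Python, which makes them easier to write and debug. Because the code is not standard CUDA C, NVIDIA's profiling tools are not used here; the kernel is timed with `%timeit` instead.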

## Imports

In [1]:
import numpy as np
from numba import cuda, float32, int32

In [2]:
from common import load_numpy

## Parameters

In [3]:
N = 1024**2
BLOCK_SIZE = 512
NR_BLOCKS = (N + BLOCK_SIZE - 1) // BLOCK_SIZE
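
The block count above is a ceiling division, so the grid always holds at least `N` threads. As a small sketch (the helper name `blocks_needed` is hypothetical and not part of the notebook), the same arithmetic for a size that is not a multiple of the block size shows why a kernel launched this way needs a bounds check on its thread index:

```python
def blocks_needed(n, block_size):
    """Ceiling division: smallest block count with blocks * block_size >= n."""
    return (n + block_size - 1) // block_size


BLOCK_SIZE = 512
for n in (1024**2, 1000):
    blocks = blocks_needed(n, BLOCK_SIZE)
    # Threads in the last block that have no point to process.
    idle = blocks * BLOCK_SIZE - n
    print(n, blocks, idle)
```

With `N = 1024**2` the grid fits exactly (2048 blocks, no idle threads); with 1000 points, 2 blocks are launched and 24 threads must skip their work.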

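## Make Numba kernels

The basic nearest-neighbour pattern here is to compute each distance in parallel and then have every thread compare against, and possibly overwrite, a single location in global memory. This access pattern is far from ideal: all threads contend for the same address, and because the comparison and the write are not atomic, two threads can interleave and leave a value that is not the true minimum.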

In [4]:
@cuda.jit
def find_nearest_point_kernel(points, query, closest_point, min_distance):
    i = cuda.grid(1)

    if i < points.shape[0]:
        dx = points[i, 0] - query[0]
        dy = points[i, 1] - query[1]
        dz = points[i, 2] - query[2]
        dist = dx**2 + dy**2 + dz**2

        if dist < min_distance[0]:
            min_distance[0] = dist
            closest_point[0] = i

Printing is allowed inside a Numba kernel, which makes it a bit easier to debug than PyCUDA, but the normal Python debugger cannot be used on device code.

In [5]:
def find_nearest_point_gpu(points_device, query_device, closest_point, min_distance):
    """ Find closest point and copy to host using numba kernel """
    find_nearest_point_kernel[NR_BLOCKS, BLOCK_SIZE](points_device, query_device, closest_point, min_distance)
    return closest_point.copy_to_host()[0]

## Upload query and points to GPU

Load the points and query saved earlier

In [6]:
points = load_numpy("nearest_neighbour_points.npy")
query = load_numpy("nearest_neighbour_query.npy")
min_distance = np.array([1e20], dtype=np.float32)

Copy all arrays to device

In [7]:
points_device = cuda.to_device(points)
query_device = cuda.to_device(query)
min_distance_device = cuda.to_device(min_distance)
closest_point_device = cuda.device_array(1, dtype=np.int32)

Run the operation once

In [10]:
nearest_idx = find_nearest_point_gpu(points_device, query_device, closest_point_device, min_distance_device)
print("Nearest point:", points[nearest_idx])

Nearest point: [0.72436523 0.10307071 0.2642327 ]
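
Because the kernel's compare-and-write is unsynchronised, it can be worth checking the GPU answer against a CPU reference. A minimal sketch with NumPy follows; the function name `find_nearest_point_cpu` and the three demo points are assumptions made for illustration, not data from the notebook.

```python
import numpy as np


def find_nearest_point_cpu(points, query):
    """Brute-force reference: index of the smallest squared distance."""
    diff = points - query  # broadcasts the (3,) query over the (n, 3) points
    return int(np.argmin((diff * diff).sum(axis=1)))


# Made-up demo data, chosen so the answer is easy to verify by hand.
demo_points = np.array(
    [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]], dtype=np.float32)
demo_query = np.array([0.6, 0.5, 0.5], dtype=np.float32)

print(find_nearest_point_cpu(demo_points, demo_query))  # prints 2
```

In the notebook, comparing `find_nearest_point_cpu(points, query)` with `nearest_idx` is a quick way to spot a wrong index.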


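Get a timing measurement. `min_distance_device` is not reset between launches, so after the first run few or no threads pass the `dist < min_distance[0]` test; the figure mostly reflects the kernel launch, the distance computation, and the copy back to the host.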

In [9]:
%timeit points[find_nearest_point_gpu(points_device, query_device, closest_point_device, min_distance_device)]

421 µs ± 17.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
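
To see what reusing the same buffers does across repeated launches, the kernel body can be emulated one "thread" at a time in plain Python. This is only a sequential sketch with made-up demo data and a hypothetical helper `launch_sequential`; it cannot reproduce GPU interleavings, but it does show how many writes a second launch performs when `min_distance` is left as it was.

```python
def launch_sequential(points, query, closest_point, min_distance):
    """Run the kernel body once per point, in order; return the write count."""
    writes = 0
    for i, (x, y, z) in enumerate(points):
        dist = (x - query[0])**2 + (y - query[1])**2 + (z - query[2])**2
        if dist < min_distance[0]:
            min_distance[0] = dist
            closest_point[0] = i
            writes += 1
    return writes


# Made-up demo data; one-element lists stand in for the device buffers.
demo_points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.5, 0.5, 0.5)]
demo_query = (0.6, 0.5, 0.5)
closest_point, min_distance = [-1], [1e20]

first = launch_sequential(demo_points, demo_query, closest_point, min_distance)
second = launch_sequential(demo_points, demo_query, closest_point, min_distance)
print(first, second, closest_point[0])  # prints 3 0 2
```

The second launch writes nothing because no distance is strictly smaller than the stored minimum, so a fair benchmark of the write path would need to reset `min_distance` before each launch.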
