Day 1: Introduction to GPU Programming with Numba
Goal: Learn the basics of Numba and how to run simple GPU code.

Tasks:
Understand CUDA Basics:

Learn about CUDA architecture (threads, blocks, grids, memory hierarchy).
Numba exposes the CUDA programming model, so you write kernels in Python syntax instead of CUDA C/C++.

Install Required Libraries (on Colab):

!pip install numba
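
Before writing a kernel, it can help to confirm that Numba is importable and that a CUDA GPU is visible. On Colab this usually means a GPU runtime has to be selected first. The helper below is a minimal sketch; `gpu_status` is a hypothetical name introduced here, and it relies only on `numba.cuda.is_available()`.

```python
def gpu_status():
    """Report whether Numba is importable and whether it can see a CUDA GPU."""
    try:
        from numba import cuda
    except ImportError:
        return "numba is not installed"
    if cuda.is_available():
        return "CUDA GPU available"
    return "numba is installed, but no CUDA GPU is visible"

print(gpu_status())
```

If the last message appears on Colab, switching the runtime type to GPU and rerunning the cell is the likely fix.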

Write a Simple GPU Kernel: A kernel is a function executed on the GPU by many threads at once, and Numba lets you write one as a decorated Python function. The example below adds two arrays element-wise on the GPU.

import numpy as np
from numba import cuda

# Define the GPU kernel
@cuda.jit
def add_arrays_kernel(a, b, c):
    # Get the thread index
    idx = cuda.grid(1)
    if idx < a.size:  # Make sure the thread index is within bounds
        c[idx] = a[idx] + b[idx]

# Create the input arrays

N = 100000
a = np.ones(N, dtype=np.float32)
b = np.ones(N, dtype=np.float32)
c = np.zeros(N, dtype=np.float32)

# Transfer the data to the device (GPU memory)
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

# Define the block and grid size
threads_per_block = 256
blocks_per_grid = (a.size + (threads_per_block - 1)) // threads_per_block

# Launch the kernel
add_arrays_kernel[blocks_per_grid, threads_per_block](d_a, d_b, d_c)

# Copy the result back to the host (CPU memory)
d_c.copy_to_host(c)

# Verify the result
print("Result of first 10 elements:", c[:10])


Result of first 10 elements: [2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]


@cuda.jit: This decorator compiles a Python function into a GPU kernel, so the function runs on the GPU rather than the CPU. A kernel cannot return a value, so it writes its results into an output array passed as an argument (here, c).

cuda.grid(1): This function returns the calling thread's absolute position in a 1D grid, equivalent to cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x. Each thread uses this index to select the one array element it works on.
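
To make the index calculation concrete, here is a minimal CPU-only sketch of the arithmetic behind `cuda.grid(1)`. It does not call Numba; `global_index` is a hypothetical helper introduced purely for illustration, and its arguments stand in for `cuda.blockIdx.x`, `cuda.blockDim.x`, and `cuda.threadIdx.x`.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Pure-Python stand-in for the value cuda.grid(1) yields inside a kernel."""
    return block_idx * block_dim + thread_idx

threads_per_block = 256

# First thread of block 0, last thread of block 0, first thread of block 1
print(global_index(0, threads_per_block, 0))    # 0
print(global_index(0, threads_per_block, 255))  # 255
print(global_index(1, threads_per_block, 0))    # 256
```

Consecutive blocks therefore cover consecutive 256-element slices of the array.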

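cuda.to_device: This function copies a NumPy array from host (CPU) memory to device (GPU) memory and returns a device array. Passing device arrays to the kernel avoids the automatic host-to-device and device-to-host transfers that Numba otherwise performs when a kernel is called with NumPy arrays.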

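add_arrays_kernel[blocks_per_grid, threads_per_block]: This line launches the GPU kernel. The square brackets hold the launch configuration: the number of blocks in the 1D grid, then the number of threads per block. Because blocks_per_grid is rounded up, the grid can contain more threads than array elements, which is why the kernel checks idx < a.size before writing.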
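
The launch arithmetic can be checked on the CPU without a GPU. The sketch below uses a hypothetical helper, `launch_config`, that applies the same ceiling division as the example and reports how many launched threads fall outside the array.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks_per_grid, total_threads, idle_threads) for n elements."""
    # Ceiling division: enough blocks to cover every element
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    total_threads = blocks_per_grid * threads_per_block
    # Threads whose index is >= n; the kernel's bounds check makes them do nothing
    idle_threads = total_threads - n
    return blocks_per_grid, total_threads, idle_threads

print(launch_config(100_000))  # (391, 100096, 96)
```

For N = 100000 and 256 threads per block, 391 blocks are launched and the last 96 threads are skipped by the bounds check.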