# cuTile Python Intro - Vector Add

In this module, you'll be introduced to the fundamentals of the CUDA Tile programming model and write your first kernel with cuTile Python.

cuTile Python is a Python DSL for writing CUDA Tile programs; it lowers to Tile IR, a new MLIR-based intermediate representation.


## Installing cuTile Python

cuTile Python requires:

- NVIDIA Kernel Driver R580 or later.
- CUDA Toolkit 13.1 or later.
- A Blackwell GPU (this restriction will be lifted in later releases).

You can install cuTile Python with pip; the package is named `cuda-tile`.

In [None]:
import os

if not os.getenv("BREV_ENV_ID") and not os.path.exists("/accelerated-computing-hub-installed"): # If not running in brev
  print("Installing PIP packages.")
  !pip uninstall "cuda-python" --yes > /dev/null
  !pip install "cuda-tile" "cupy-cuda13x" > /dev/null 2>&1
  open("/accelerated-computing-hub-installed", "a").close()

cuTile Python is typically used in concert with an existing GPU programming framework like CuPy, PyTorch, or JAX, which provides an array type and handles GPU memory allocation. For this module, we'll use CuPy.

In [None]:
import cuda.tile as ct
from numba import cuda as cs
import cupy as cp
import cupyx as cpx

## Example: Vector Add

For our first example, we'll implement vector addition, i.e. `C[i] = A[i] + B[i]`.

To define a kernel, we write a Python function with the `@ct.kernel` decorator. Kernel functions can take `ct.Array`s as inputs, which could be any type of global array - a CuPy array, a JAX array, a PyTorch tensor, etc. Kernels can also take scalars like ints or floats as inputs.

Parameters annotated with `ct.Constant` will be embedded as literals into the compiled kernel. They behave like C++ template parameters. Changing the value of a constant parameter will require the kernel to be JIT compiled again. Some operations require constant parameters. For example, the size of a tile must be a constant.
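As a loose analogy only (this is not how cuTile is implemented), the recompilation behavior can be modeled in plain Python as a cache keyed by the constant's value. The names `specialize` and `compile_count` are hypothetical and exist only for this sketch.

```python
from functools import lru_cache

compile_count = 0  # hypothetical counter standing in for JIT compilations


@lru_cache(maxsize=None)
def specialize(tile_size: int):
    """Build an add function with tile_size baked in, once per distinct value."""
    global compile_count
    compile_count += 1

    def add_tile(a, b, tile_index):
        start = tile_index * tile_size  # tile_size is fixed for this variant
        stop = start + tile_size
        return [x + y for x, y in zip(a[start:stop], b[start:stop])]

    return add_tile


specialize(4)  # first use of 4: "compiles"
specialize(4)  # same constant: reuses the cached variant
specialize(8)  # new constant: "compiles" again
print(compile_count)  # 2
```

Ordinary (non-constant) arguments, such as `tile_index` here, can change between calls without producing a new variant.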

cuTile Python supports many of the common NumPy array operations, such as addition, which is highlighted here.

Two operations unique to cuTile Python are `load` and `store`:
- `ct.load(array, N, shape)` returns the `N`th tile of dimensions `shape` of `array`.
- `ct.store(array, N, tile)` stores `tile` as the `N`th tile of `array`; the region written has the dimensions of `tile`.
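To make the tile indexing concrete, here is a minimal CPU-only sketch that models the two operations on 1-D Python lists. The helpers `load_tile` and `store_tile` are hypothetical stand-ins, not the cuTile API; they only illustrate that the index counts tiles rather than elements.

```python
def load_tile(array, n, size):
    """Return the n-th tile of `size` elements from a 1-D list."""
    return array[n * size:(n + 1) * size]


def store_tile(array, n, tile):
    """Write `tile` into the n-th tile-sized slot of a 1-D list."""
    start = n * len(tile)
    array[start:start + len(tile)] = tile


A = list(range(8))
B = [10] * 8
C = [0] * 8
tile_size = 4

for block in range(len(A) // tile_size):  # one iteration per tile
    a_tile = load_tile(A, block, tile_size)
    b_tile = load_tile(B, block, tile_size)
    store_tile(C, block, [x + y for x, y in zip(a_tile, b_tile)])

print(C)  # [10, 11, 12, 13, 14, 15, 16, 17]
```

On the GPU, the loop over `block` disappears: each block of the launch grid handles one tile.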

Let's implement our vector addition kernel in both CUDA Tile and CUDA SIMT (using Numba CUDA, for comparison).


In [None]:
@ct.kernel
def vector_add_tile(A: ct.Array, B: ct.Array, C: ct.Array, t_shape: ct.Constant[int]):
  # Each block loads the tile selected by its block index along axis 0.
  a_tile = ct.load(A, index=(ct.bid(0),), shape=(t_shape,))
  b_tile = ct.load(B, index=(ct.bid(0),), shape=(t_shape,))
  # Elementwise addition over the whole tile.
  c_tile = a_tile + b_tile
  ct.store(C, index=(ct.bid(0),), tile=c_tile)

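@cs.jit
def vector_add_simt(A, B, C, items_per_thread):
  bd = cs.blockDim.x
  bx = cs.blockIdx.x
  tx = cs.threadIdx.x
  items_per_block = bd * items_per_thread

  # First element for this thread; later elements are bd apart.
  base = tx + bx * items_per_block
  # No bounds check: the array length must be a multiple of items_per_block.
  for i in range(0, items_per_block, bd):
    C[base + i] = A[base + i] + B[base + i]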
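The index arithmetic in the SIMT kernel can be checked on the CPU. The sketch below uses a hypothetical helper, `simt_indices`, that mirrors the kernel's loop, and confirms that every element is visited exactly once when the array length is a multiple of the per-block work.

```python
def simt_indices(block, thread, block_dim, items_per_thread):
    """Indices touched by one thread, mirroring the loop in vector_add_simt."""
    items_per_block = block_dim * items_per_thread
    base = thread + block * items_per_block
    return [base + i for i in range(0, items_per_block, block_dim)]


n, block_dim, items_per_thread = 32, 4, 2
blocks = n // (block_dim * items_per_thread)

touched = sorted(
    idx
    for block in range(blocks)
    for thread in range(block_dim)
    for idx in simt_indices(block, thread, block_dim, items_per_thread)
)

print(simt_indices(0, 1, block_dim, items_per_thread))  # [1, 5]
print(touched == list(range(n)))                        # True
```

On each loop iteration, consecutive threads touch consecutive elements, which is the access pattern generally preferred for coalesced global memory access.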

Next, let's validate that we've implemented everything correctly:

In [None]:
a_shape = 2 ** 24
t_shape = 1024

A = cp.random.uniform(-5, 5, a_shape)
B = cp.random.uniform(-5, 5, a_shape)
C = cp.zeros_like(A)

grid = (ct.cdiv(a_shape, t_shape), 1, 1)
ct.launch(cp.cuda.get_current_stream(), grid, vector_add_tile, (A, B, C, t_shape))

cp.testing.assert_array_almost_equal(C, A + B) # Verify the results

threads_per_block = 128
items_per_thread = 4
thread_blocks = a_shape // (threads_per_block * items_per_thread)

C = cp.zeros_like(A)

vector_add_simt[thread_blocks, threads_per_block](A, B, C, items_per_thread)

cp.testing.assert_array_almost_equal(C, A + B) # Verify the results
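Both launches cover the whole array because `2 ** 24` is an exact multiple of the tile size and of the per-block work. The sketch below shows the arithmetic, assuming `ct.cdiv` is a ceiling division as its name suggests; the local `cdiv` is a hypothetical stand-in and the size `1000` is made up for illustration.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


n = 1000             # hypothetical size, not a multiple of the block work
per_block = 128 * 4  # threads_per_block * items_per_thread

print(cdiv(2 ** 24, 1024))  # 16384 tiles, an exact fit
print(n // per_block)       # 1 block: covers only 512 of 1000 elements
print(cdiv(n, per_block))   # 2 blocks: covers everything, last block overruns
```

With truncating division the tail of the array is skipped. With a rounded-up grid the last block would index past the end, so the SIMT kernel as written would need a bounds check before being used on sizes like this.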

Next, let's benchmark the tile kernel against CuPy's built-in addition and the SIMT kernel.

In [None]:
from cuda.core.experimental import Device

def get_peak_memory_bandwidth(device_id: int = 0):
  dev = Device(device_id)
  dev.set_current() # Initialize CUDA for this thread

  props = dev.properties
  mem_clock_khz = props.memory_clock_rate        # Peak memory clock in kHz.
  bus_width_bits = props.global_memory_bus_width # DRAM bus width in bits.

  bytes_per_s = (mem_clock_khz * 1_000) * (bus_width_bits / 8) * 2 # DDR factor.
  return bytes_per_s / 2 ** 30

# Total memory traffic in GiB: two input reads plus one output write per element.
memory_read = (3 * a_shape * A.dtype.itemsize) / 2 ** 30

peak_memory_bandwidth = get_peak_memory_bandwidth(0)

def print_benchmark_results(framework, results):
  time = results.mean()
  time_unc = results.std() / time
  runs = results.size
  bw = memory_read / time
  percent_of_peak_bw = bw / peak_memory_bandwidth

  print(f"{framework}: {time:.3g} s ± {time_unc:.2%} (mean ± relative stdev of {runs} runs), {bw:.1f} GiB/s, {percent_of_peak_bw:.2%} of peak bandwidth")

In [None]:
cupy = cpx.profiler.benchmark(
  lambda: A + B,
  (), n_repeat=15, n_warmup=5
).gpu_times[0]
tile = cpx.profiler.benchmark(
  lambda: ct.launch(cp.cuda.get_current_stream(), grid, vector_add_tile, (A, B, C, t_shape)),
  (), n_repeat=15, n_warmup=5
).gpu_times[0]
simt = cpx.profiler.benchmark(
  lambda: vector_add_simt[thread_blocks, threads_per_block](A, B, C, items_per_thread),
  (), n_repeat=15, n_warmup=5
).gpu_times[0]

print_benchmark_results("CuPy", cupy)
print_benchmark_results("Tile", tile)
print_benchmark_results("SIMT", simt)
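To sanity-check the reported numbers, the bandwidth arithmetic can be reproduced by hand. This sketch assumes the arrays are `float64` (8 bytes per element, which should be the default for `cp.random.uniform`); the elapsed time, memory clock, and bus width are hypothetical values, not measurements from any real GPU.

```python
GIB = 2 ** 30


def bytes_moved_gib(n_elements: int, itemsize: int) -> float:
    """Two input reads plus one output write per element, in GiB."""
    return 3 * n_elements * itemsize / GIB


def peak_bandwidth_gib_s(mem_clock_khz: float, bus_width_bits: int) -> float:
    """Same formula as get_peak_memory_bandwidth: clock * bus bytes * 2."""
    return (mem_clock_khz * 1_000) * (bus_width_bits / 8) * 2 / GIB


moved = bytes_moved_gib(2 ** 24, 8)  # float64 elements
print(moved)                         # 0.375

# Hypothetical numbers for illustration only.
elapsed_s = 0.005
peak = peak_bandwidth_gib_s(mem_clock_khz=1_000_000, bus_width_bits=512)
achieved = moved / elapsed_s
print(f"{achieved:.1f} GiB/s, {achieved / peak:.2%} of peak")
```

Because vector addition does almost no arithmetic per byte, the fraction of peak bandwidth is the most useful figure when comparing the three implementations.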

Kernel performance often depends on tile sizes and execution parameters, so let's explore the parameter space to make sure our results are reasonable. First, let's tune our tile kernel over the tile size:

In [None]:
times = []
for t_shape in [16, 32, 64, 128, 256, 512, 1024]:
  grid = (ct.cdiv(a_shape, t_shape), 1, 1)
  time = cpx.profiler.benchmark(
    lambda: ct.launch(cp.cuda.get_current_stream(), grid, vector_add_tile, (A, B, C, t_shape)),
    (), n_repeat=15, n_warmup=5
  ).gpu_times[0].mean()

  times.append({'t_shape': t_shape, 'time': time})

for entry in times:
  print(f"t_shape={entry['t_shape']}, time={entry['time']:.3g} s")

# Report the fastest configuration.
print(min(times, key=lambda entry: entry['time']))

Next, let's tune our SIMT kernel over the number of items per thread and the number of threads per block:

In [None]:
times = []
for items_per_thread in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
  for threads_per_block in [32, 64, 128, 256, 512]:
    thread_blocks = a_shape // (threads_per_block * items_per_thread)
    time = cpx.profiler.benchmark(
      lambda: vector_add_simt[thread_blocks, threads_per_block](A, B, C, items_per_thread),
      (), n_repeat=15, n_warmup=5
    ).gpu_times[0].mean()

    times.append({'items_per_thread': items_per_thread, 'threads_per_block': threads_per_block, 'time': time})

for entry in times:
  print(f"items_per_thread={entry['items_per_thread']}, threads_per_block={entry['threads_per_block']}, time={entry['time']:.3g} s")

# Report the fastest configuration.
print(min(times, key=lambda entry: entry['time']))