# cuTile Python Intro - Vector Add

In this module, you'll be introduced to the fundamentals of the CUDA Tile programming model and write your first kernel with cuTile Python.

cuTile Python is a Python DSL for writing CUDA Tile programs; it lowers to Tile IR, a new MLIR-based intermediate representation.


## Installing cuTile Python

cuTile Python requires:

- NVIDIA Kernel Driver R580 or later.
- CUDA Toolkit 13.1 or later.
- A Blackwell GPU (this restriction will be lifted in later releases).

You can install cuTile Python with pip; the package is named `cuda-tile`.

In [None]:
import os

if not os.getenv("BREV_ENV_ID") and not os.path.exists("/accelerated-computing-hub-installed"): # Skip on Brev or if already installed
  print("Installing PIP packages.")
  !pip uninstall "cuda-python" --yes > /dev/null
  !pip install "cuda-tile" "cupy-cuda13x" > /dev/null 2>&1
  open("/accelerated-computing-hub-installed", "a").close()

cuTile Python is typically used alongside an existing GPU programming framework such as CuPy, PyTorch, or JAX, which provides an array type and handles GPU memory allocation. For this module, we'll use CuPy.

In [None]:
import cuda.tile as ct
import cupy as cp
import cupyx as cpx

## Example: Vector Add

For our first example, we'll implement vector addition, i.e. `C[i] = A[i] + B[i]`.

To define a kernel, we write a Python function with the `@ct.kernel` decorator. Kernel functions can take `ct.Array`s as inputs, which could be any type of global array - a CuPy array, a JAX array, a PyTorch tensor, etc. Kernels can also take scalars like ints or floats as inputs.

Parameters annotated with `ct.Constant` will be embedded as literals into the compiled kernel. They behave like C++ template parameters. Changing the value of a constant parameter will require the kernel to be JIT compiled again. Some operations require constant parameters. For example, the size of a tile must be a constant.
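
As a loose pure-Python analogy for this behavior (not a description of how cuTile's compiler works), the sketch below builds one specialized function per constant value and reuses it on later calls. The names `specialize_add` and `compile_log` are hypothetical and exist only for illustration.

```python
from functools import lru_cache

compile_log = []  # records each tile_size the pretend "compiler" was run for


@lru_cache(maxsize=None)
def specialize_add(tile_size):
    """Hypothetical stand-in for JIT compilation: one function per tile_size."""
    compile_log.append(tile_size)

    def add_tiles(a_tile, b_tile):
        # tile_size is fixed inside this closure, like a literal in generated code.
        assert len(a_tile) == len(b_tile) == tile_size
        return [x + y for x, y in zip(a_tile, b_tile)]

    return add_tiles


specialize_add(4)([1, 2, 3, 4], [10, 20, 30, 40])  # first use of 4: "compiles"
specialize_add(4)([0, 0, 0, 0], [1, 1, 1, 1])      # same constant: reused
specialize_add(2)([1, 2], [3, 4])                  # new constant: "compiles" again
```

After these three calls, `compile_log` holds one entry per distinct constant value rather than one per call.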

cuTile Python supports many of the common NumPy array operations, such as the elementwise addition used in this example.

Two operations unique to cuTile Python are `load` and `store`:
- `ct.load(array, index, shape)` returns the tile of dimensions `shape` at tile index `index` in `array`. The index counts tiles, not elements.
- `ct.store(array, index, tile)` writes `tile` to tile index `index` in `array`, with `array` partitioned into tiles of the same dimensions as `tile`.
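
To make the tile indexing concrete, here is a minimal pure-Python sketch that emulates it on 1D lists. The helpers `load_tile` and `store_tile` are hypothetical stand-ins and not part of the cuTile API. The sketch also assumes the array length is a multiple of the tile size.

```python
# Pure-Python emulation of 1D tile indexing. These helpers are hypothetical
# and are NOT cuTile functions.

def load_tile(array, n, tile_size):
    """Return the n-th tile of tile_size elements from a 1D list."""
    start = n * tile_size
    return array[start:start + tile_size]


def store_tile(array, n, tile):
    """Write tile into the n-th tile-sized slot of a 1D list, in place."""
    start = n * len(tile)
    array[start:start + len(tile)] = tile


A = list(range(8))        # [0, 1, ..., 7]
B = [10 * x for x in A]   # [0, 10, ..., 70]
C = [0] * len(A)
TILE = 4

# Each loop iteration plays the role of one block; n stands in for its index.
for n in range(len(A) // TILE):
    a_tile = load_tile(A, n, TILE)
    b_tile = load_tile(B, n, TILE)
    c_tile = [x + y for x, y in zip(a_tile, b_tile)]
    store_tile(C, n, c_tile)
```

Tile 1 of `A` covers elements 4 through 7, so the index selects a whole tile rather than a single element.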


In [None]:
@ct.kernel
def vector_add(A: ct.Array, B: ct.Array, C: ct.Array, t_shape: ct.Constant[int]):
  a_tile = ct.load(A, index=(ct.bid(0),), shape=(t_shape,))
  b_tile = ct.load(B, index=(ct.bid(0),), shape=(t_shape,))
  c_tile = a_tile + b_tile
  ct.store(C, index=(ct.bid(0),), tile=c_tile)

Next, let's validate that our kernel is correct:

In [None]:
a_shape = 10000000
t_shape = 64

A = cp.random.uniform(-5, 5, a_shape)
B = cp.random.uniform(-5, 5, a_shape)
C = cp.zeros_like(A)

grid = (ct.cdiv(a_shape, t_shape), 1, 1)
ct.launch(cp.cuda.get_current_stream(), grid, vector_add, (A, B, C, t_shape))

# Verify the results
cp.testing.assert_array_almost_equal(C, A + B)
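
The grid size above is the number of tiles needed to cover the array. Assuming `ct.cdiv` performs ceiling division, as its name and usage suggest, the arithmetic can be sketched in plain Python. The helper `ceil_div` is a hypothetical illustration and not the cuTile function.

```python
def ceil_div(n, d):
    """Ceiling division for positive integers: the smallest k with k * d >= n."""
    return (n + d - 1) // d


# Sizes from the cell above: 64 divides 10,000,000 evenly, so every tile is full.
num_blocks = ceil_div(10_000_000, 64)
assert num_blocks * 64 == 10_000_000

# A hypothetical length that is not a multiple of 64 needs one extra block,
# and that last block covers only part of a tile.
num_blocks_uneven = ceil_div(10_000_001, 64)
assert num_blocks_uneven == num_blocks + 1
```

Because the sizes in this module divide evenly, the kernel never touches a partial tile here.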

Finally, let's benchmark it against CuPy.

In [None]:
cupy = cpx.profiler.benchmark(
    lambda: A + B,
    (), n_repeat=15, n_warmup=4
).gpu_times[0]
tile = cpx.profiler.benchmark(
    lambda: ct.launch(cp.cuda.get_current_stream(), grid, vector_add, (A, B, C, t_shape)),
    (), n_repeat=15, n_warmup=4
).gpu_times[0]

print(f"{cupy.mean():.3g} s ± {(cupy.std() / cupy.mean()):.2%} (mean ± relative stdev of {cupy.size} runs)")
print(f"{tile.mean():.3g} s ± {(tile.std() / tile.mean()):.2%} (mean ± relative stdev of {tile.size} runs)")
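
The printed summary is the mean run time followed by the relative standard deviation (standard deviation divided by the mean). The sketch below reproduces that formatting with the standard library only. The helper `summarize` is hypothetical, and the sample timings are invented for illustration; they are not measurements. It uses the population standard deviation, which matches NumPy's default for `std()`.

```python
from statistics import fmean, pstdev


def summarize(times):
    """Format run times (in seconds) as mean ± relative standard deviation."""
    mean = fmean(times)
    rel_std = pstdev(times) / mean  # population std, like NumPy's default std()
    return f"{mean:.3g} s ± {rel_std:.2%} (mean ± relative stdev of {len(times)} runs)"


# Invented sample timings in seconds, for illustration only.
sample_times = [1.0e-3, 1.2e-3, 0.8e-3, 1.0e-3]
print(summarize(sample_times))
```

A small relative standard deviation indicates that the runs were consistent, which makes the two means easier to compare.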