# Exercise - `cuda.core` - Devices, Streams, and Memory

`cuda.core` provides a Pythonic interface to the CUDA runtime and other functionality, including:

* Compiling and launching CUDA kernels
* Asynchronous concurrent execution with CUDA graphs, streams and events
* Coordinating work across multiple CUDA devices
* Allocating, transferring, and managing device memory
* Runtime linking of device code with Link-Time Optimization (LTO)


### Example: Getting Device Properties

As a quick example of the `cuda.core` interface, let's look at how to query the properties of a GPU:

In [23]:
from cuda.core.experimental import Device

device = Device(0)
device.set_current()

# Get device properties
props = device.properties
print(f"Device name: {device.name}")
print(f"Compute capability: {props.compute_capability_major}.{props.compute_capability_minor}")
print(f"Multiprocessor count: {props.multiprocessor_count}")
print(f"Max threads per block: {props.max_threads_per_block}")
print(f"Max block dimensions: ({props.max_block_dim_x}, {props.max_block_dim_y}, {props.max_block_dim_z})")

Device name: NVIDIA A100 80GB PCIe
Compute capability: 8.0
Multiprocessor count: 108
Max threads per block: 1024
Max block dimensions: (1024, 1024, 64)


### Memory Management

The `Device.allocate()` method allocates a buffer of a given size in bytes on a specific CUDA device:

In [None]:
import numpy as np
from cuda.core.experimental import Device

# Initialize our GPU
device = Device(0)
device.set_current()

# Calculate how much memory we need
# We want to store 1000 float32 numbers
# Each float32 takes 4 bytes, so we need 1000 * 4 = 4000 bytes
size_bytes = 1000 * 4

# Allocate memory on the GPU
device_buffer = device.allocate(size_bytes)

print(f"Allocated {size_bytes} bytes on GPU")
print(f"Buffer memory address: {device_buffer.handle}")

Allocated 4000 bytes on GPU
Buffer memory address: <CUdeviceptr 13421785088>
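
The byte count above hard-codes 4 bytes per `float32`. As a small sketch, the size can instead be derived from the element type. The helper name `buffer_size_bytes` is hypothetical, and `ctypes.c_float` stands in for `float32` here so that the snippet needs only the standard library.

```python
import ctypes


def buffer_size_bytes(num_elements: int, ctype=ctypes.c_float) -> int:
    """Return the number of bytes needed for num_elements values of ctype.

    Hypothetical helper for illustration; not part of cuda.core.
    """
    if num_elements < 0:
        raise ValueError("num_elements must be non-negative")
    return num_elements * ctypes.sizeof(ctype)


# 1000 single-precision values, matching the allocation above
size_bytes = buffer_size_bytes(1000)
print(size_bytes)

# The same element count in double precision needs twice the space
print(buffer_size_bytes(1000, ctypes.c_double))
```

The resulting `size_bytes` is the value that would be passed to `device.allocate()`.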


### Compiling and Calling a CUDA C++ Kernel from Python

In [25]:
import cupy as cp
from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

First, we define a string containing the CUDA C++ kernel. Note that this is a templated kernel:

In [26]:
# compute c = a + b
code = """
template<typename T>
__global__ void vector_add(const T* A,
                           const T* B,
                           T* C,
                           size_t N) {
    const unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (size_t i=tid; i<N; i+=gridDim.x*blockDim.x) {
        C[i] = A[i] + B[i];
    }
}
"""

Next, we create a `Device` object and a corresponding `Stream`. Call `Device.set_current()` before creating the stream: the device must be made current before it can be used.

In [27]:
dev = Device()
dev.set_current()
s = dev.create_stream()

Next, we compile the CUDA C++ kernel from earlier using the `Program` class, targeting the architecture of the current device. The result of the compilation is a CUBIN. Because the kernel is a template, we pass the `name_expressions` parameter to `Program.compile()` to specify which template instantiations to compile:

In [35]:
arch = f"{dev.compute_capability.major}{dev.compute_capability.minor}"
program_options = ProgramOptions(std="c++17", arch=f"sm_{arch}")
prog = Program(code, code_type="c++", options=program_options)
mod = prog.compile("cubin", name_expressions=("vector_add<float>",))
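
Each entry in `name_expressions` is the C++ spelling of one template instantiation, and the same string is used later to look the kernel up. The sketch below builds those strings, along with the `sm_XY` architecture string, in plain Python. The helper names and the dtype-to-C++ mapping are assumptions made for illustration and are not part of `cuda.core`.

```python
# Assumed mapping from NumPy/CuPy dtype names to C++ type names
# (illustrative subset; not provided by cuda.core).
CPP_TYPE_NAMES = {
    "float32": "float",
    "float64": "double",
    "int32": "int",
}


def instantiation_name(kernel: str, dtype_name: str) -> str:
    """Build a name expression such as 'vector_add<float>'."""
    return f"{kernel}<{CPP_TYPE_NAMES[dtype_name]}>"


def arch_string(major: int, minor: int) -> str:
    """Build an architecture string such as 'sm_80' from a compute capability."""
    return f"sm_{major}{minor}"


name_expressions = tuple(
    instantiation_name("vector_add", d) for d in ("float32", "float64")
)
print(name_expressions)   # ('vector_add<float>', 'vector_add<double>')
print(arch_string(8, 0))  # sm_80
```

A tuple like this could be passed as `name_expressions` to `Program.compile()`, with one kernel lookup per instantiation.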

Next, we retrieve the compiled kernel from the CUBIN and prepare the arguments and kernel configuration. We’re using CuPy arrays as inputs for this example, but you can use PyTorch tensors too.

In [37]:
ker = mod.get_kernel("vector_add<float>")

# Prepare input/output arrays (using CuPy)
size = 50000
rng = cp.random.default_rng()
a = rng.random(size, dtype=cp.float32)
b = rng.random(size, dtype=cp.float32)
c = cp.empty_like(a)

# Configure launch parameters
block = 256
grid = (size + block - 1) // block
config = LaunchConfig(grid=grid, block=block)
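
The grid size is a ceiling division: enough blocks of 256 threads to cover all 50000 elements. The pure-Python sketch below reproduces that arithmetic and simulates a grid-stride loop that indexes with its loop variable `i`, checking that every element is visited exactly once even when the grid is smaller than the array. The function names are hypothetical, and nothing here runs on the GPU.

```python
def grid_size(n: int, block: int) -> int:
    """Number of blocks needed to cover n elements (ceiling division)."""
    return (n + block - 1) // block


def grid_stride_visits(n: int, grid: int, block: int) -> list[int]:
    """Count how often each index is visited by a simulated grid-stride loop."""
    visits = [0] * n
    stride = grid * block  # corresponds to gridDim.x * blockDim.x
    for tid in range(stride):  # one iteration per simulated thread
        for i in range(tid, n, stride):
            visits[i] += 1
    return visits


size, block = 50000, 256
grid = grid_size(size, block)
print(grid)                 # 196
print(grid * block - size)  # 176 threads have no element to process

# Full-size grid: each element is handled by exactly one thread
assert all(v == 1 for v in grid_stride_visits(size, grid, block))

# Undersized grid (8 blocks): threads loop to cover the remaining elements
assert all(v == 1 for v in grid_stride_visits(size, 8, block))
```

The last check illustrates the point of the grid-stride pattern: correctness does not depend on launching one thread per element.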

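Finally, we use the `launch()` function to execute our kernel on the specified stream with the given configuration and arguments. Note the use of `.data.ptr` to get the pointer to each CuPy array's data, and of `cp.uint64` to pass the element count so that it matches the kernel's `size_t` parameter. The launch is asynchronous, so we call `s.sync()` to wait for the kernel to finish before reading the result.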

In [38]:
launch(s, config, ker, a.data.ptr, b.data.ptr, c.data.ptr, cp.uint64(size))
s.sync()

In [41]:
print(a)
print(b)
print(c)

[0.423765   0.9974886  0.8906149  ... 0.98856455 0.75328386 0.12381048]
[0.6481802  0.05253503 0.05977066 ... 0.58469105 0.49861202 0.18568273]
[1.0719452 1.0500237 0.9503856 ... 1.5732555 1.2518959 0.3094932]
