## CUDA Core Tutorial - Low-Level GPU Programming
### Table of Contents

1. Introduction to cuda.core
2. Setting Up Your Environment
3. Understanding CUDA Concepts
4. Memory Management
5. Kernel Compilation and Execution
6. Error Handling
7. Exercise: Vector Operations

### 1. Introduction to cuda.core
The `cuda.core` module provides direct access to the CUDA driver API, giving you maximum control over GPU programming. Unlike high-level APIs, cuda.core requires you to manage everything manually:

* Context management: Creating and managing execution contexts
* Memory allocation: Explicitly allocating and freeing GPU memory
* Kernel compilation: Compiling CUDA C/C++ code at runtime
* Synchronization: Managing streams and events

#### When to use cuda.core:
**Great for:**
* Learning GPU programming in a Python-friendly way
* Prototyping GPU algorithms quickly
* Custom parallel algorithms that existing libraries don't provide
* When you need fine control over GPU resources

### 2. Setting Up Your Environment
#### Prerequisites

* NVIDIA GPU with CUDA capability
* CUDA driver version 12.2 or higher
* Python 3.8+

#### Installation

In [None]:
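!pip install "cuda-core[cu12]" numpy cupy-cuda12x

**What this does**:
* `cuda-core`: Installs the CUDA Core Python package
* `numpy`: Installs NumPy for array operations
* `cupy-cuda12x`: Installs CuPy, which the kernel examples in section 5 use to hold GPU arrays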

#### Verification

The following Python snippet checks whether everything is set up correctly:

In [None]:
from cuda.core.experimental import system, Device
import numpy as np

# Check CUDA driver version
print(f"CUDA driver version: {system.driver_version}")

# Get number of available devices
print(f"Number of CUDA devices: {system.num_devices}")

# Get device information
if system.num_devices > 0:
    device = Device(0)  # Get the first GPU
    device.set_current()  # Tell CUDA we want to use this GPU
    print(f"Device name: {device.name}")
    print(f"Device UUID: {device.uuid}")
    print(f"PCI Bus ID: {device.pci_bus_id}")
else:
    print("No CUDA devices found!")

This is a good first step before running any GPU workloads. If this fails, your drivers or installation may be incorrect.

### 3. Understanding CUDA Concepts
#### Key Terminology
**Host vs Device:**

* **Host**: Your CPU and system memory
* **Device**: Your GPU and video memory

**Execution Model**:

* **Kernel**: A function that runs on the GPU
* **Thread**: Individual execution unit
* **Block**: Group of threads that can cooperate
* **Grid**: Collection of blocks

**Memory Hierarchy**:

* **Global Memory**: Main GPU memory (slow but large)
* **Shared Memory**: Fast memory shared within a block
* **Registers**: Fastest memory, private to each thread

Think of registers as your desk (fast access, limited space), shared memory as your team's filing cabinet (fast for team members, more space), and global memory as the company warehouse (lots of space, but takes time to fetch things).

#### Device Management and Context
**What is a Device and Context?**
In CUDA, a **Device** represents your GPU hardware. A **Context** is like a workspace on that GPU where your programs can run. Think of the Device as the physical GPU card, and the Context as your personal workspace on that card.

Try starting with basic device management:

In [None]:
from cuda.core.experimental import Device

# Get the first GPU device (device 0)
device = Device(0)

# Set as current device (this creates and activates a context)
device.set_current()

print(f"Using device: {device.name}")
print(f"Device id: {device.device_id}")

**What this does**:
1. `Device(0)`: Creates a Device object representing the first GPU (GPU numbering starts at 0)
2. `device.set_current()`: Tells CUDA "I want to use this GPU for my operations"

If you have multiple GPUs, CUDA needs to know which one you want to use, which is why we need `set_current()`.

#### Getting Device Properties
Let's learn more about our GPU:

In [None]:
from cuda.core.experimental import Device

device = Device(0)
device.set_current()

# Get device properties
props = device.properties
print(f"Device name: {device.name}")
print(f"Compute capability: {props.compute_capability_major}.{props.compute_capability_minor}")
print(f"Multiprocessor count: {props.multiprocessor_count}")
print(f"Max threads per block: {props.max_threads_per_block}")
print(f"Max block dimensions: ({props.max_block_dim_x}, {props.max_block_dim_y}, {props.max_block_dim_z})")

**What these properties mean**:

* Compute capability: Like a GPU "version number", where higher numbers support more features
* Multiprocessor count: How many "processor groups" your GPU has (more = more parallel power)
* Max threads per block: Maximum number of threads you can put in one block
* Max block dimensions: How you can arrange threads in a block (1D, 2D, or 3D)

#### Device Synchronization
Sometimes you need to wait for the GPU to finish all its work:

In [None]:
# ...
# Wait for all previous operations on this device to complete
device.sync()
print("All GPU operations are now complete!")

**When do you need synchronization?**
* Before copying results back from GPU to CPU
* Before timing how long GPU operations took
* Before shutting down your program

Synchronizing is like saying: "don't continue until all workers finish their current tasks."

### 4. Memory Management

#### Understanding GPU Memory
GPU memory is separate from your computer's main memory (RAM). To use the GPU, you need to:
* Allocate space in GPU memory
* Copy your data from CPU to GPU
* Process the data on the GPU
* Copy results back from GPU to CPU

CPUs and GPUs have separate memory systems optimized for their different tasks.

In [None]:
import numpy as np
from cuda.core.experimental import Device

# Initialize our GPU
device = Device(0)
device.set_current()

# Calculate how much memory we need
# We want to store 1000 float32 numbers
# Each float32 takes 4 bytes, so we need 1000 * 4 = 4000 bytes
size_bytes = 1000 * 4

# Allocate memory on the GPU
device_buffer = device.allocate(size_bytes)

print(f"Allocated {size_bytes} bytes on GPU")
print(f"Buffer handle (device pointer): {device_buffer.handle}")

**What this does:**
1. Calculate size: We figure out how many bytes we need (1000 floats × 4 bytes each)
2. Allocate memory: `device.allocate()` reserves space on the GPU
3. Get a buffer: The returned `device_buffer` is like a "handle" to our GPU memory

**Important**: Allocating memory does not initialize it. The buffer is just reserved space, and its contents are undefined until you copy data into it.
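
Hard-coding 4 bytes per element is easy to get wrong when the data type changes. As a minimal sketch, the size can instead be derived from NumPy's dtype metadata. The names `num_elements` and `dtype` are illustrative and not part of `cuda.core`; this snippet runs on the CPU only.

```python
import numpy as np

num_elements = 1000           # illustrative element count (matches the example above)
dtype = np.dtype(np.float32)  # element type we plan to store

# itemsize is the number of bytes per element for this dtype
size_bytes = num_elements * dtype.itemsize
print(f"{num_elements} x {dtype.name} needs {size_bytes} bytes")

# An existing host array reports the same figure through .nbytes
host_array = np.zeros(num_elements, dtype=dtype)
assert host_array.nbytes == size_bytes
```

The resulting `size_bytes` could then be passed to `device.allocate()` in place of the hand-computed value.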

### 5. Kernel Compilation and Execution
#### First Kernel: Vector Addition

In [None]:
from cuda.core.experimental import Device, Program
import numpy as np

# CUDA C++ source code for our kernel
vector_add_source = """
extern "C" __global__ void vector_add(float *a, float *b, float *c, int n) {
    // Each thread calculates its unique index
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // Make sure we don't go beyond our array bounds
    if (i < n) {
        c[i] = a[i] + b[i];  // Add corresponding elements
    }
}
"""

# Initialize our GPU
device = Device(0)
device.set_current()

# Compile the CUDA code into a program
program = Program(vector_add_source, code_type='c++')
compiled_program = program.compile(target_type='cubin')

# Get the specific kernel function we want to use
kernel = compiled_program.get_kernel("vector_add")

print("Kernel compiled successfully!")

**Breakdown:**

1. `extern "C"`: Tells the compiler to use C-style function names
2. `__global__`: Marks this as a kernel function (runs on GPU)
3. `float *a, float *b, float *c`: Pointers to arrays in GPU memory
4. `int i = threadIdx.x + blockIdx.x * blockDim.x`: Calculates the unique global index of this thread
    * `threadIdx.x`: Position of this thread within its block
    * `blockIdx.x`: Which block this thread belongs to
    * `blockDim.x`: How many threads are in each block
5. `c[i] = a[i] + b[i]`: The actual computation each thread performs

**Why the complex index calculation?**
Imagine you have 1000 elements to process with blocks of 256 threads:
* Block 0: threads 0-255 handle elements 0-255
* Block 1: threads 0-255 handle elements 256-511
* Block 2: threads 0-255 handle elements 512-767
* Block 3: threads 0-231 handle elements 768-999 (threads 232-255 fail the `i < n` check and do nothing)

Each thread needs to know which element it should work on.
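
The mapping above can be checked on the CPU. The following is a rough sketch that replays the kernel's index formula in plain Python loops (no GPU involved); the variable names `covered` and `idle` are illustrative.

```python
N = 1000          # number of elements, as in the example above
block_size = 256  # threads per block
grid_size = (N + block_size - 1) // block_size  # blocks needed to cover N

covered = []  # element indices that pass the bounds check
idle = 0      # threads whose index falls outside the array

for block_idx in range(grid_size):
    for thread_idx in range(block_size):
        # Same formula as in the kernel
        i = thread_idx + block_idx * block_size
        if i < N:
            covered.append(i)
        else:
            idle += 1

print(f"Elements covered: {len(covered)}, idle threads: {idle}")
assert covered == list(range(N))  # every element handled exactly once
```

On a real GPU the threads run in parallel rather than in loop order, but each one still computes its index the same way.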

In [None]:
import cupy as cp
from cuda.core.experimental import launch, LaunchConfig

def execute_vector_add():
    # Initialize device and create a stream
    device = Device(0)
    device.set_current()
    s = device.create_stream()

    # Prepare our test data
    N = 1000  # Number of elements
    a = np.arange(N, dtype=np.float32)      # [0, 1, 2, ..., 999]
    b = np.arange(N, dtype=np.float32)      # [0, 1, 2, ..., 999]
    print(f"Input arrays have {N} elements each")

    # Copy input data from CPU to GPU and allocate the output array
    d_a = cp.asarray(a)
    d_b = cp.asarray(b)
    d_c = cp.empty(N, dtype=cp.float32)

    # Configure how to launch the kernel
    block_size = 256  # Number of threads per block
    grid_size = (N + block_size - 1) // block_size  # Number of blocks needed

    print(f"Launch configuration: {grid_size} blocks of {block_size} threads each")
    print(f"Total threads: {grid_size * block_size}")

    # Create the launch configuration
    config = LaunchConfig(grid=(grid_size,), block=(block_size,))

    # Launch the kernel
    launch(s, config, kernel, d_a.data.ptr, d_b.data.ptr, d_c.data.ptr, np.int32(N))  # n is declared as int in the kernel
    s.sync() # Wait for kernel to complete
    print("Kernel launched and executed")

    # Copy the result back from GPU to CPU
    c = cp.asnumpy(d_c)
    print("Results copied back to CPU")

    return c

# Execute our vector addition
result = execute_vector_add()

# Verify the result
expected = np.arange(1000, dtype=np.float32) * 2  # [0, 2, 4, ..., 1998]
success = np.allclose(result, expected)
print(f"Kernel execution successful: {success}")
print(f"First 10 results: {result[:10]}")
print(f"Expected first 10: {expected[:10]}")

**Launch Configuration Deep Dive:**
* Block size: 256 threads per block (a common choice; multiples of the 32-thread warp size work well)
* Grid size: `(N + block_size - 1) // block_size` ensures we have enough threads
    * For N=1000 and block_size=256: grid_size = (1000 + 255) // 256 = 4 blocks
    * Total threads = 4 × 256 = 1024 threads (more than our 1000 elements, which is fine)

**Why use this grid_size calculation?**

This formula ensures we always have enough threads:
* If N=1000 and block_size=256, we need at least 4 blocks
* If N=256 and block_size=256, we need exactly 1 block
* If N=257 and block_size=256, we need 2 blocks
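
The three cases listed here can be reproduced with a small helper. This is a sketch only: `blocks_needed` is a hypothetical name, and the comparison with plain floor division is added to show why the `+ block_size - 1` term matters.

```python
def blocks_needed(n, block_size):
    """Ceiling division: smallest number of blocks whose threads cover n elements."""
    return (n + block_size - 1) // block_size

for n in (1000, 256, 257):
    rounded_up = blocks_needed(n, 256)
    rounded_down = n // 256  # plain floor division can leave elements uncovered
    print(f"N={n}: ceiling gives {rounded_up} block(s), floor gives {rounded_down}")
```

Floor division would launch too few threads whenever `n` is not an exact multiple of the block size, which is why the kernel pairs the rounded-up grid with an `if (i < n)` bounds check.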

#### Advanced Kernel Example

We can now try multiplying two matrices:

In [None]:
# Templated matrix multiplication kernel
matmul_source = """
template<typename T>
__global__ void matrix_multiply(const T *A, const T *B, T *C, size_t N) {
    // Calculate which row and column this thread handles
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we're within the matrix bounds
    if (row < N && col < N) {
        T sum = T(0);

        // Compute dot product of row from A and column from B
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }

        // Store the result
        C[row * N + col] = sum;
    }
}
"""

from cuda.core.experimental import ProgramOptions

def matrix_multiply_gpu(A, B):
    # Initialize device and create a stream
    device = Device(0)
    device.set_current()
    s = device.create_stream()

    N = A.shape[0]
    assert A.shape == (N, N) and B.shape == (N, N), "Matrices must be square and same size"
    print(f"Multiplying {N}x{N} matrices")

    # Build the architecture suffix (e.g. "80" for compute capability 8.0)
    arch = "".join(str(part) for part in device.compute_capability)

    # Compile the templated matrix multiplication kernel with specific C++ compiler flags
    program_options = ProgramOptions(std="c++17", arch=f"sm_{arch}")
    program = Program(matmul_source, code_type='c++', options=program_options)
    compiled_program = program.compile(target_type='cubin', name_expressions=("matrix_multiply<float>",))
    kernel = compiled_program.get_kernel("matrix_multiply<float>")

    # Copy input matrices to GPU
    d_A = cp.asarray(A)
    d_B = cp.asarray(B)
    d_C = cp.empty((N, N), dtype=cp.float32)

    # Configure 2D launch (threads arranged in a 2D grid)
    block_size = 16  # 16x16 = 256 threads per block
    grid_size = (N + block_size - 1) // block_size

    print(f"Launch config: {grid_size}x{grid_size} blocks of {block_size}x{block_size} threads")

    # Create 2D launch configuration
    config = LaunchConfig(grid=(grid_size, grid_size), block=(block_size, block_size))

    # Launch the kernel
    launch(s, config, kernel, d_A.data.ptr, d_B.data.ptr, d_C.data.ptr, cp.uint64(N))
    s.sync()

    # Copy result back
    C = cp.asnumpy(d_C)

    print("Matrix multiplication completed on GPU")
    return C

print("Testing matrix multiplication...")
A = np.random.random((64, 64)).astype(np.float32)
B = np.random.random((64, 64)).astype(np.float32)

# Compare GPU result with CPU result
C_gpu = matrix_multiply_gpu(A, B)
C_cpu = np.dot(A, B) # NumPy's optimized matrix multiplication

# Check if results match (within floating-point precision)
matches = np.allclose(C_gpu, C_cpu, atol=1e-5)
print(f"GPU and CPU results match: {matches}")

if matches:
    print("Success! Matrix multiplication kernel works correctly.")
else:
    print("Results don't match - there might be a bug in the kernel.")

**Kernel Explanation:**

1. 2D Thread Layout: Each thread handles one element of the result matrix
    * `row = blockIdx.y * blockDim.y + threadIdx.y`: Which row this thread computes
    * `col = blockIdx.x * blockDim.x + threadIdx.x`: Which column this thread computes
2. Dot Product Calculation: For element C[row][col], we compute:
    * Sum of A[row][k] × B[k][col] for all k
    * This is the mathematical definition of matrix multiplication
3. 2D Launch Configuration:
    * Blocks are arranged in a 2D grid to match the 2D nature of matrices
    * Each block is 16×16 threads (256 total threads per block)

**Why 2D layout?**

Matrix multiplication naturally maps to 2D: each thread computes one output element, and output elements are arranged in a 2D grid (the result matrix).
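
To make the 2D mapping concrete, here is a CPU-only sketch that walks through every (block, thread) pair and applies the same row/column formulas and bounds check as the kernel. `emulate_matmul_kernel` is a hypothetical helper, and the 6×6 size with 4×4 blocks is chosen so that some threads fall outside the matrix.

```python
import numpy as np

def emulate_matmul_kernel(A, B, block_size=4):
    """Run the kernel's per-thread logic on the CPU, one (block, thread) pair at a time."""
    N = A.shape[0]
    C = np.zeros((N, N), dtype=A.dtype)
    grid_size = (N + block_size - 1) // block_size

    for block_y in range(grid_size):
        for block_x in range(grid_size):
            for thread_y in range(block_size):
                for thread_x in range(block_size):
                    row = block_y * block_size + thread_y
                    col = block_x * block_size + thread_x
                    if row < N and col < N:  # same bounds check as the kernel
                        C[row, col] = np.dot(A[row, :], B[:, col])
    return C

rng = np.random.default_rng(0)
A_small = rng.random((6, 6)).astype(np.float32)
B_small = rng.random((6, 6)).astype(np.float32)
C_small = emulate_matmul_kernel(A_small, B_small)
print("Matches NumPy:", np.allclose(C_small, A_small @ B_small, atol=1e-5))
```

The four nested loops stand in for what the GPU does in parallel: every iteration of the innermost body corresponds to one thread.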

### 6. Error Handling
When working at the level of the CUDA driver API, you are responsible for checking errors after every API call.

GPU programming can fail in many ways:
* Out of memory: Asking for more GPU memory than available
* Invalid kernels: Bugs in CUDA C code
* Device errors: Hardware problems or driver issues
* Launch failures: Invalid grid/block configurations

Good error handling helps you:
* Debug problems quickly
* Write robust applications
* Provide helpful error messages

In [None]:
def safe_cuda_operation():
    device = None
    try:
        print("Attempting CUDA operation...")

        # Initialize device (this can fail)
        device = Device(0)
        device.set_current()
        print("✓ Device initialized successfully")

        # Allocate memory (this can fail if requesting too much)
        buffer = device.allocate(1000 * 4)
        print("✓ Memory allocated successfully")

        # Your CUDA operations here
        print("✓ All CUDA operations completed successfully")

    except RuntimeError as e:
        print(f"✗ CUDA Runtime Error: {e}")
        print("This usually means a problem with GPU drivers or hardware")

    except MemoryError as e:
        print(f"✗ GPU Memory Error: {e}")
        print("Try reducing the size of your data or closing other GPU programs")

    except Exception as e:
        print(f"✗ Unexpected error occurred: {e}")
        print(f"Error type: {type(e).__name__}")

    finally:
        # Cleanup happens automatically in cuda.core
        # But you can add custom cleanup here if needed
        if device is not None:
            print("✓ Cleanup completed")

# Test error handling
safe_cuda_operation()

Error handling best practices:
1. Use try-except blocks: Wrap CUDA operations in try-except
2. Specific exceptions: Catch specific error types when possible
3. Helpful messages: Explain what went wrong and how to fix it
4. Cleanup: Use finally blocks for cleanup code
5. Don't ignore errors: Always handle or propagate exceptions
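
One possible way to combine practices 2, 3, and 5 is a small wrapper that attaches the name of the failing step before re-raising. This is a sketch under assumptions: `run_step` is a hypothetical helper, not part of `cuda.core`, and `failing_step` is a stand-in for a GPU call that fails.

```python
def run_step(description, operation):
    """Run one step and re-raise any failure with the step name attached.

    Hypothetical helper for illustration; not part of cuda.core.
    """
    try:
        result = operation()
    except Exception as exc:
        # Add context and keep the original exception as the cause
        raise RuntimeError(f"{description} failed: {exc}") from exc
    print(f"✓ {description}")
    return result

def failing_step():
    # Stand-in for a failing GPU call
    raise ValueError("requested 0 threads per block")

value = run_step("Compute buffer size", lambda: 1000 * 4)

try:
    run_step("Launch kernel", failing_step)
except RuntimeError as err:
    message = str(err)
    cause_type = type(err.__cause__).__name__
    print(f"✗ {message} (caused by {cause_type})")
```

Using `raise ... from exc` preserves the original traceback, so the added context does not hide where the failure actually occurred.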

#### Common Error Patterns

In [None]:
def handle_common_errors():
    errors_encountered = []

    # Error 1: Invalid device number
    print("Testing invalid device...")
    try:
        invalid_device = Device(999)  # Device 999 probably doesn't exist
        invalid_device.set_current()
    except Exception as e:
        error_msg = f"Invalid device error: {e}"
        errors_encountered.append(error_msg)
        print(f"✗ {error_msg}")

    # Error 2: Memory allocation failure
    print("\nTesting memory allocation failure...")
    try:
        device = Device(0)
        device.set_current()

        # Try to allocate an impossibly large amount of memory (1TB)
        huge_amount = 1024 * 1024 * 1024 * 1024  # 1TB in bytes
        huge_alloc = device.allocate(huge_amount)

    except Exception as e:
        error_msg = f"Memory allocation error: {e}"
        errors_encountered.append(error_msg)
        print(f"✗ {error_msg}")

    # Error 3: Invalid kernel compilation
    print("\nTesting kernel compilation failure...")
    try:
        # This CUDA code has syntax errors
        bad_source = """
        extern "C" __global__ void broken_kernel( {
            // Missing closing brace and parameter list
            int i = this_variable_doesnt_exist;
        """
        program = Program(bad_source, code_type='c++')
        program.compile(target_type='cubin')

    except Exception as e:
        error_msg = f"Kernel compilation error: {e}"
        errors_encountered.append(error_msg)
        print(f"✗ {error_msg}")

    # Error 4: Invalid launch configuration
    print("\nTesting invalid launch configuration...")
    try:
        device = Device(0)
        device.set_current()

        # Create a valid kernel first
        valid_source = """
        extern "C" __global__ void test_kernel(float *data, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] = i;
        }
        """
        program = Program(valid_source, code_type='c++')
        kernel = program.compile(target_type='cubin').get_kernel("test_kernel")

        # Try to launch with invalid configuration (0 threads per block)
        bad_config = LaunchConfig(grid=(1,), block=(0,))  # 0 threads is invalid

        # This should fail (launch expects the stream first, then config and kernel)
        stream = device.create_stream()
        data = device.allocate(100 * 4)
        launch(stream, bad_config, kernel, data, np.int32(100))

    except Exception as e:
        error_msg = f"Invalid launch configuration: {e}"
        errors_encountered.append(error_msg)
        print(f"✗ {error_msg}")

    print(f"\nSummary: Caught {len(errors_encountered)} expected errors")
    return errors_encountered

# Run error handling tests
errors = handle_common_errors()
print("\nError handling demonstration completed")

What this demonstrates:
1. **Device errors**: Wrong device numbers, missing GPUs
2. **Memory errors**: Requesting too much memory
3. **Compilation errors**: Syntax errors in CUDA C code
4. **Launch errors**: Invalid thread/block configurations
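
Some launch errors can be caught before touching the GPU. The sketch below validates grid and block shapes on the CPU. `validate_launch_config` is a hypothetical helper, and the default limit of 1024 threads per block is an assumption; the real value should come from `device.properties.max_threads_per_block`. It checks only a subset of the hardware limits (per-dimension limits, for example, are not covered).

```python
import math

def validate_launch_config(grid, block, max_threads_per_block=1024):
    """Raise ValueError for configurations that cannot launch.

    Hypothetical helper; the default limit of 1024 is an assumption.
    """
    for name, dims in (("grid", grid), ("block", block)):
        if not 1 <= len(dims) <= 3:
            raise ValueError(f"{name} must have 1 to 3 dimensions, got {len(dims)}")
        if any(d < 1 for d in dims):
            raise ValueError(f"{name} dimensions must be >= 1, got {dims}")

    threads = math.prod(block)  # total threads in one block
    if threads > max_threads_per_block:
        raise ValueError(
            f"block has {threads} threads, limit is {max_threads_per_block}"
        )
    return threads

print(validate_launch_config(grid=(4,), block=(256,)))

try:
    validate_launch_config(grid=(1,), block=(0,))
except ValueError as err:
    print(f"✗ {err}")
```

A check like this gives a clearer message than a failed launch, but it does not replace handling the exceptions raised by the library itself.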

### 7. Exercise: Vector Operations
Write a CUDA kernel that performs element-wise multiplication of two vectors.

In [None]:
# Compile your kernel here
from cuda.core.experimental import Device, Program

multiply_kernel_source = """
// TODO: Implement vector multiplication kernel
"""


In [None]:
# Launch your kernel here
import cupy as cp
from cuda.core.experimental import launch, LaunchConfig

def vector_multiply(a, b):
    # TODO: Implement the wrapper function
    pass
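
To test your solution without giving it away, you could compare its output against NumPy's element-wise product. This is a minimal sketch: `check_vector_multiply` is a hypothetical helper, and it is exercised here with `np.multiply` as a CPU stand-in so that it runs without a GPU.

```python
import numpy as np

def check_vector_multiply(multiply_fn, n=1000):
    """Compare multiply_fn(a, b) against NumPy's element-wise product.

    Hypothetical helper for testing the exercise solution.
    """
    rng = np.random.default_rng(42)
    a = rng.random(n).astype(np.float32)
    b = rng.random(n).astype(np.float32)

    result = multiply_fn(a, b)
    expected = a * b  # CPU reference

    # An unimplemented wrapper returns None, which counts as a failure
    return result is not None and bool(np.allclose(result, expected))

# Sanity check of the checker itself with a CPU stand-in;
# pass your vector_multiply here once it is implemented
print("Checker works:", check_vector_multiply(np.multiply))
```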


### Resources
Numba CUDA Reference: https://numba.pydata.org/numba-doc/dev/cuda-reference/

Repository: https://github.com/NVIDIA/cuda-python 