# CUDA Core Tutorial - Low-Level GPU Programming
## Table of Contents

1. Introduction to cuda.core
2. Setting Up Your Environment
3. Understanding CUDA Concepts
4. Memory Management
5. Kernel Compilation and Execution
6. Streams and Synchronization
7. Error Handling
8. Performance Optimization
9. Practical Examples
10. Lab

## 1. Introduction to cuda.core
The `cuda.core` module provides direct access to the CUDA driver API, giving you maximum control over GPU programming. Unlike high-level APIs, cuda.core requires you to manage everything manually:

* Context management: Creating and managing execution contexts
* Memory allocation: Explicitly allocating and freeing GPU memory
* Kernel compilation: Compiling CUDA C/C++ code at runtime
* Synchronization: Managing streams and events

### When to use cuda.core:
**Great for:**
* Learning GPU programming in a Python-friendly way
* Prototyping GPU algorithms quickly
* Custom parallel algorithms that existing libraries don't provide
* When you need fine control over GPU resources

## 2. Setting Up Your Environment
### Prerequisites

* NVIDIA GPU with CUDA capability
* CUDA driver version 12.2 or higher
* Python 3.8+

### Installation

In [None]:
%pip install cuda-core numpy

**What this does**:
* `cuda-core`: Installs the CUDA Core Python package
* `numpy`: Installs NumPy for array operations

### Verification

The following Python snippet checks whether everything is set up correctly:

In [None]:
from cuda.core.experimental import system, Device
import numpy as np

# Check CUDA driver version
print(f"CUDA driver version: {system.driver_version}")

# Get number of available devices  
print(f"Number of CUDA devices: {system.num_devices}")

# Get device information
if system.num_devices > 0:
    device = Device(0)  # Get the first GPU
    device.set_current()  # Tell CUDA we want to use this GPU
    print(f"Device name: {device.name}")
    print(f"Device UUID: {device.uuid}")
    print(f"PCI Bus ID: {device.pci_bus_id}")
else:
    print("No CUDA devices found!")

This is a good first step before running any GPU workloads. If this fails, your drivers or installation may be incorrect.

## 3. Understanding CUDA Concepts
### Key Terminology
**Host vs Device:**

* **Host**: Your CPU and system memory
* **Device**: Your GPU and video memory

**Execution Model**:

* **Kernel**: A function that runs on the GPU
* **Thread**: Individual execution unit
* **Block**: Group of threads that can cooperate
* **Grid**: Collection of blocks

**Memory Hierarchy**:

* **Global Memory**: Main GPU memory (slow but large)
* **Shared Memory**: Fast memory shared within a block
* **Registers**: Fastest memory, private to each thread

Think of registers as your desk (fast access, limited space), shared memory as your team's filing cabinet (fast for team members, more space), and global memory as the company warehouse (lots of space, but takes time to fetch things).

### Device Management and Context
**What is a Device and Context?**
In CUDA, a **Device** represents your GPU hardware. A **Context** is like a workspace on that GPU where your programs can run. Think of the Device as the physical GPU card, and the Context as your personal workspace on that card.

Try starting with basic device management:

In [None]:
from cuda.core.experimental import Device

# Get the first GPU device (device 0)
device = Device(0)

# Set as current device (this creates and activates a context)
device.set_current()

print(f"Using device: {device.name}")
print(f"Device ordinal: {device.device_id}")

**What this does**:
1. `Device(0)`: Creates a Device object representing the first GPU (GPU numbering starts at 0)
2. `device.set_current()`: Tells CUDA "I want to use this GPU for my operations"

If you have multiple GPUs, CUDA needs to know which one you want to use, which is why `set_current()` is required.

### Getting Device Properties
Let's learn more about our GPU:

In [None]:
from cuda.core.experimental import Device

device = Device(0)
device.set_current()

# Get device properties
props = device.properties
print(f"Device name: {device.name}")
print(f"Compute capability: {device.compute_capability.major}.{device.compute_capability.minor}")
print(f"Total global memory: {props.total_global_mem // (1024**3)} GB")
print(f"Multiprocessor count: {props.multiprocessor_count}")
print(f"Max threads per block: {props.max_threads_per_block}")
print(f"Max block dimensions: ({props.max_block_dim_x}, {props.max_block_dim_y}, {props.max_block_dim_z})")

**What these properties mean**:

* Compute capability: Like a GPU "version number", where higher numbers support more features
* Total global memory: How much video RAM your GPU has
* Multiprocessor count: How many "processor groups" your GPU has (more = more parallel power)
* Max threads per block: Maximum number of threads you can put in one block
* Max block dimensions: How you can arrange threads in a block (1D, 2D, or 3D)

### Device Synchronization
Sometimes you need to wait for the GPU to finish all its work:

In [None]:
# ...
# Wait for all previous operations on this device to complete
device.sync()
print("All GPU operations are now complete!")

**When do you need synchronization?**
* Before copying results back from GPU to CPU
* Before timing how long GPU operations took
* Before shutting down your program

It's like saying: "don't continue until all workers finish their current tasks."

## 4. Memory Management

### Understanding GPU Memory
GPU memory is separate from your computer's main memory (RAM). To use the GPU, you need to:
* Allocate space in GPU memory
* Copy your data from CPU to GPU
* Process the data on the GPU
* Copy results back from GPU to CPU

CPUs and GPUs have separate memory systems optimized for their different tasks.

In [None]:
import numpy as np
from cuda.core.experimental import Device

# Initialize our GPU
device = Device(0)
device.set_current()

# Calculate how much memory we need
# We want to store 1000 float32 numbers
# Each float32 takes 4 bytes, so we need 1000 * 4 = 4000 bytes
size_bytes = 1000 * 4  

# Allocate memory on the GPU
device_buffer = device.allocate(size_bytes)

print(f"Allocated {size_bytes} bytes on GPU")
print(f"Buffer memory address: {device_buffer.handle}")

**What this does:**
1. Calculate size: We figure out how many bytes we need (1000 floats × 4 bytes each)
2. Allocate memory: `device.allocate()` reserves space on the GPU
3. Get a buffer: The returned `device_buffer` is like a "handle" to our GPU memory

**Important**: Allocating memory doesn't put any meaningful data there yet. Like an array from `np.empty()`, the buffer is reserved but uninitialized space.
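
The size arithmetic above can be double-checked without a GPU. This is a minimal sketch using only the standard library; the helper name `buffer_size_bytes` is made up for illustration and is not part of `cuda.core`:

```python
import struct

def buffer_size_bytes(count, type_code="f"):
    """Return the bytes needed to hold `count` values of a struct type code.

    "f" is a 32-bit float, "d" a 64-bit float, "i" a 32-bit int.
    The "=" prefix forces standard sizes, independent of the platform.
    """
    return count * struct.calcsize("=" + type_code)

print(buffer_size_bytes(1000, "f"))  # 1000 float32 values -> 4000
print(buffer_size_bytes(1000, "d"))  # 1000 float64 values -> 8000
```

When the data already lives in a NumPy array, the array's `.nbytes` property gives the same number directly.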

### Memory Transfer

Below is a complete example of moving data between CPU and GPU:

In [None]:
import numpy as np
from cuda.core.experimental import Device

def demonstrate_memory_operations():
    # Step 1: Initialize device
    device = Device(0)
    device.set_current()
    
    # Step 2: Create some data on the CPU (host)
    host_data = np.arange(1000, dtype=np.float32)  # Creates [0, 1, 2, ..., 999]
    print(f"Created host data with shape: {host_data.shape}")
    
    # Step 3: Allocate memory on the GPU (device)
    device_buffer = device.allocate(host_data.nbytes)  # nbytes = number of bytes
    print(f"Allocated {host_data.nbytes} bytes on GPU")
    
    # Step 4: Copy data from CPU to GPU
    device_buffer.copy_from(host_data)
    print("Data copied from CPU to GPU")
    
    # Step 5: Allocate space for results
    result_buffer = device.allocate(host_data.nbytes)
    print("Allocated result buffer on GPU")
    
    # Step 6: Copy data back to CPU (after processing)
    result = np.zeros_like(host_data)  # Create empty array same size as input
    device_buffer.copy_to(result)  # Copy from GPU to CPU
    print("Data copied from GPU back to CPU")
    
    # Step 7: Verify the data made the round trip correctly
    print(f"Data matches: {np.array_equal(host_data, result)}")
    
    return result

# Run the demonstration
result = demonstrate_memory_operations()
print(f"Final result shape: {result.shape}")

**Step-by-step explanation:**
1. Create host data: We make a NumPy array with 1000 numbers
2. Allocate GPU memory: Reserve space on the GPU for our data
3. Copy to GPU: Move our data from CPU memory to GPU memory
4. Allocate result space: Reserve space for the results of our computation
5. Copy back: Move the processed data from GPU back to CPU
6. Verify: Check that our data survived the round trip

**Key Methods:**
* `.nbytes`: NumPy property that tells you how many bytes an array uses
* `.copy_from()`: Copies data FROM the CPU TO the GPU
* `.copy_to()`: Copies data FROM the GPU TO the CPU
* `np.zeros_like()`: Creates a zero-filled array with the same shape and type



## 5. Kernel Compilation and Execution
### First Kernel: Vector Addition

In [None]:
from cuda.core.experimental import Device, Program
import numpy as np

# CUDA C source code for our kernel
vector_add_source = """
extern "C" __global__ void vector_add(float *a, float *b, float *c, int n) {
    // Each thread calculates its unique index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Make sure we don't go beyond our array bounds
    if (i < n) {
        c[i] = a[i] + b[i];  // Add corresponding elements
    }
}
"""

# Initialize our GPU
device = Device(0)
device.set_current()

# Compile the CUDA code into a program
program = Program(vector_add_source, code_type="c++")
compiled_program = program.compile("cubin")

# Get the specific kernel function we want to use
kernel = compiled_program.get_kernel("vector_add")

print("Kernel compiled successfully!")

**Breakdown:**

1. `extern "C"`: Tells the compiler to use C-style function names
2. `__global__`: Marks this as a kernel function (runs on GPU)
3. `float *a, float *b, float *c`: Pointers to arrays in GPU memory
4. `int i = blockIdx.x * blockDim.x + threadIdx.x`: Calculates unique thread ID
    * `blockIdx.x`: Which block this thread belongs to
    * `blockDim.x`: How many threads are in each block
    * `threadIdx.x`: Position of this thread within its block
5. `if (i < n)`: Safety check to avoid accessing invalid memory
6. `c[i] = a[i] + b[i]`: The actual computation each thread performs

**Why the complex index calculation?**
Imagine you have 1000 elements to process with blocks of 256 threads:
* Block 0: threads 0-255 handle elements 0-255
* Block 1: threads 0-255 handle elements 256-511
* Block 2: threads 0-255 handle elements 512-767
* Block 3: threads 0-231 handle elements 768-999 (threads 232-255 fail the `i < n` check and do nothing)

Each thread needs to know which element it should work on.
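
The mapping from blocks to elements can be reproduced on the CPU. The sketch below is plain Python, not GPU code; the function name `global_indices` is invented for this illustration:

```python
def global_indices(n, block_size):
    """Emulate `blockIdx.x * blockDim.x + threadIdx.x` for a 1D launch.

    Returns a dict mapping each block index to the (first, last) element
    it actually processes, after applying the `i < n` bounds check.
    """
    grid_size = (n + block_size - 1) // block_size
    covered = {}
    for block_idx in range(grid_size):
        active = [
            block_idx * block_size + thread_idx
            for thread_idx in range(block_size)
            if block_idx * block_size + thread_idx < n
        ]
        covered[block_idx] = (active[0], active[-1])
    return covered

for block, (first, last) in global_indices(1000, 256).items():
    print(f"Block {block}: elements {first}-{last}")
```

For 1000 elements and 256 threads per block, the last block covers only elements 768-999; its remaining threads are filtered out by the bounds check.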

### Executing the Kernel
Now we can use our compiled kernel to actually add two vectors:

In [None]:
from cuda.core.experimental import launch, LaunchConfig

def execute_vector_add():
    # Step 1: Initialize device
    device = Device(0) 
    device.set_current()
    
    # Step 2: Prepare our test data
    N = 1000  # Number of elements
    a = np.arange(N, dtype=np.float32)      # [0, 1, 2, ..., 999]
    b = np.arange(N, dtype=np.float32)      # [0, 1, 2, ..., 999]
    print(f"Input arrays have {N} elements each")
    
    # Step 3: Allocate GPU memory for all our arrays
    d_a = device.allocate(a.nbytes)  # GPU memory for array 'a'
    d_b = device.allocate(b.nbytes)  # GPU memory for array 'b'  
    d_c = device.allocate(a.nbytes)  # GPU memory for result 'c'
    
    # Step 4: Copy input data from CPU to GPU
    d_a.copy_from(a)
    d_b.copy_from(b)
    print("Input data copied to GPU")
    
    # Step 5: Configure how to launch the kernel
    block_size = 256  # Number of threads per block
    grid_size = (N + block_size - 1) // block_size  # Number of blocks needed
    
    print(f"Launch configuration: {grid_size} blocks of {block_size} threads each")
    print(f"Total threads: {grid_size * block_size}")
    
    # Create the launch configuration
    config = LaunchConfig(grid=(grid_size,), block=(block_size,))
    
    # Step 6: Launch the kernel on the device's default stream
    launch(device.default_stream, config, kernel, d_a, d_b, d_c, np.int32(N))
    print("Kernel launched and executed")
    
    # Step 7: Copy the result back from GPU to CPU
    c = np.zeros(N, dtype=np.float32)  # Create empty result array
    d_c.copy_to(c)
    print("Results copied back to CPU")
    
    return c

# Execute our vector addition
result = execute_vector_add()

# Verify the result
expected = np.arange(1000, dtype=np.float32) * 2  # [0, 2, 4, ..., 1998]
success = np.allclose(result, expected)
print(f"Kernel execution successful: {success}")
print(f"First 10 results: {result[:10]}")
print(f"Expected first 10: {expected[:10]}")

**Launch Configuration Deep Dive:**
* Block size: 256 threads per block (common choice, powers of 2 work well)
* Grid size: `(N + block_size - 1) // block_size` ensures we have enough threads
    * For N=1000 and block_size=256: grid_size = (1000 + 255) // 256 = 4 blocks
    * Total threads = 4 × 256 = 1024 threads (more than our 1000 elements, which is fine)

**Why use the grid_size calculation?**

This formula ensures we always have enough threads:
* If N=1000 and block_size=256, we need at least 4 blocks
* If N=256 and block_size=256, we need exactly 1 block
* If N=257 and block_size=256, we need 2 blocks
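
The three cases above can be verified with a few lines of Python. This is a small sketch; `blocks_needed` is a name chosen for this example:

```python
def blocks_needed(n, block_size):
    """Ceiling division: the smallest block count with blocks * block_size >= n."""
    return (n + block_size - 1) // block_size

for n in (1000, 256, 257):
    blocks = blocks_needed(n, 256)
    idle = blocks * 256 - n  # threads that fail the `i < n` check
    print(f"N={n}: {blocks} block(s), {idle} idle thread(s)")
```

The N=257 case shows the cost of rounding up: a second block is launched for a single element, and its other 255 threads do nothing.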

### Advanced Kernel Example

We can now try multiplying two matrices:

In [None]:
# Matrix multiplication kernel
matmul_source = """
extern "C" __global__ void matrix_multiply(float *A, float *B, float *C, int N) {
    // Calculate which row and column this thread handles
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Make sure we're within the matrix bounds
    if (row < N && col < N) {
        float sum = 0.0f;
        
        // Compute dot product of row from A and column from B
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        
        // Store the result
        C[row * N + col] = sum;
    }
}
"""

def matrix_multiply_gpu(A, B):
    # Step 1: Initialize device and verify inputs
    device = Device(0)
    device.set_current() 
    
    N = A.shape[0]
    assert A.shape == (N, N) and B.shape == (N, N), "Matrices must be square and same size"
    print(f"Multiplying {N}x{N} matrices")
    
    # Step 2: Compile the matrix multiplication kernel
    program = Program(matmul_source, code_type="c++")
    kernel = program.compile("cubin").get_kernel("matrix_multiply")
    
    # Step 3: Allocate GPU memory
    d_A = device.allocate(A.nbytes)
    d_B = device.allocate(B.nbytes)
    d_C = device.allocate(A.nbytes)
    
    # Step 4: Copy input matrices to GPU
    d_A.copy_from(A)
    d_B.copy_from(B)
    
    # Step 5: Configure 2D launch (threads arranged in a 2D grid)
    block_size = 16  # 16x16 = 256 threads per block
    grid_size = (N + block_size - 1) // block_size
    
    print(f"Launch config: {grid_size}x{grid_size} blocks of {block_size}x{block_size} threads")
    
    # Create 2D launch configuration
    config = LaunchConfig(grid=(grid_size, grid_size), block=(block_size, block_size))
    
    # Step 6: Launch the kernel on the device's default stream
    launch(device.default_stream, config, kernel, d_A, d_B, d_C, np.int32(N))
    
    # Step 7: Copy result back
    C = np.zeros_like(A)
    d_C.copy_to(C)
    
    print("Matrix multiplication completed on GPU")
    return C

# Test matrix multiplication
print("Testing matrix multiplication...")
A = np.random.random((64, 64)).astype(np.float32)
B = np.random.random((64, 64)).astype(np.float32)

# Compare GPU result with CPU result
C_gpu = matrix_multiply_gpu(A, B)
C_cpu = np.dot(A, B)  # NumPy's optimized matrix multiplication

# Check if results match (within floating-point precision)
matches = np.allclose(C_gpu, C_cpu, atol=1e-5)
print(f"GPU and CPU results match: {matches}")

if matches:
    print("Success! Matrix multiplication kernel works correctly.")
else:
    print("Results don't match - there might be a bug in the kernel.")

**Kernel Explanation:**

1. 2D Thread Layout: Each thread handles one element of the result matrix
    * `row = blockIdx.y * blockDim.y + threadIdx.y`: Which row this thread computes
    * `col = blockIdx.x * blockDim.x + threadIdx.x`: Which column this thread computes
2. Dot Product Calculation: For element `C[row][col]`, we compute:
    * Sum of `A[row][k] × B[k][col]` for all `k`
    * This is the mathematical definition of matrix multiplication
3. 2D Launch Configuration:
    * Blocks are arranged in a 2D grid to match the 2D nature of matrices
    * Each block is 16×16 threads (256 total threads per block)

**Why 2D layout?**

Matrix multiplication naturally maps to 2D: each thread computes one output element, and output elements are arranged in a 2D grid (the result matrix).
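
To make the 2D mapping concrete, the sketch below replays the kernel's per-thread logic on the CPU, with one loop iteration standing in for one GPU thread. It is an illustration only (plain Python lists, a made-up function name, and a tiny block size so the bounds check is exercised), not how the GPU schedules work:

```python
def emulate_matmul_launch(A, B, block_size=2):
    """Run the per-thread matmul logic on the CPU, one iteration per thread."""
    n = len(A)
    grid_size = (n + block_size - 1) // block_size
    C = [[0.0] * n for _ in range(n)]
    for block_y in range(grid_size):
        for block_x in range(grid_size):
            for thread_y in range(block_size):
                for thread_x in range(block_size):
                    row = block_y * block_size + thread_y
                    col = block_x * block_size + thread_x
                    if row < n and col < n:  # same bounds check as the kernel
                        C[row][col] = sum(A[row][k] * B[k][col] for k in range(n))
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(emulate_matmul_launch(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

With a 3×3 input and `block_size=2`, the emulated grid is 2×2 blocks (16 threads), and 7 of them fall outside the matrix and are skipped by the bounds check.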

## 6. Streams and Synchronization
### Understanding Streams
A Stream in CUDA is like a queue of operations that execute in order on the GPU. Think of it as a to-do list that the GPU works through sequentially.

By creating your own streams, you can:
* Overlap operations: Copy data while computing on other data
* Multiple tasks: Run different kernels at the same time
* Better GPU utilization: Keep the GPU busy with work

Default behavior: If you don't specify a stream, CUDA uses a "default stream" where all operations happen one after another.

Let's see how to create and use streams:

In [None]:
from cuda.core.experimental import Device, Stream, launch, LaunchConfig

def demonstrate_streams():
    # Step 1: Initialize device
    device = Device(0)
    device.set_current()
    
    # Step 2: Create a custom stream
    stream = device.create_stream()  # Streams are created through the device
    print("Created custom stream")
    
    # Step 3: Prepare test data
    N = 1000
    a = np.arange(N, dtype=np.float32)
    b = np.arange(N, dtype=np.float32)
    
    # Step 4: Allocate GPU memory
    d_a = device.allocate(a.nbytes)
    d_b = device.allocate(b.nbytes)
    d_c = device.allocate(a.nbytes)
    
    # Step 5: Copy data to GPU (these operations go into the stream)
    d_a.copy_from(a)
    d_b.copy_from(b)
    print("Data copied to GPU")
    
    # Step 6: Launch kernel in our custom stream
    config = LaunchConfig(grid=(4,), block=(256,))
    launch(stream, config, kernel, d_a, d_b, d_c, np.int32(N))
    print("Kernel launched in custom stream")
    
    # Step 7: Copy result back (also goes into the stream)
    c = np.zeros(N, dtype=np.float32)
    d_c.copy_to(c)
    
    # Step 8: Wait for all operations in the stream to complete
    stream.sync()
    print("Stream operations completed")
    
    return c

result = demonstrate_streams()
print(f"Stream execution successful: {np.allclose(result, np.arange(1000) * 2)}")

**Explanation**:

1. Create stream: Creating a `Stream` gives you a new operation queue
2. Queue operations: Memory copies and kernel launches go into the stream
3. Asynchronous execution: Operations start immediately but may not finish right away
4. Synchronization: Synchronizing the stream waits for everything queued on it to complete

**Why synchronize?**

Without synchronization, your Python program might try to use results before the GPU finishes computing them.
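
The queue behaviour described above can be modelled with a toy class. `FakeStream` is purely hypothetical and exists only for this sketch. One simplification to keep in mind: here the queued work only runs inside `synchronize`, whereas a real GPU starts working as soon as operations are enqueued.

```python
from collections import deque

class FakeStream:
    """A toy stand-in for a stream: a FIFO queue of pending operations."""

    def __init__(self, name):
        self.name = name
        self.pending = deque()

    def enqueue(self, label):
        # Enqueueing returns immediately; the caller does not wait.
        self.pending.append(label)

    def synchronize(self, log):
        # Drain the queue in submission order, as a stream would.
        while self.pending:
            log.append(f"{self.name}:{self.pending.popleft()}")

log = []
stream = FakeStream("s0")
for op in ("copy a", "copy b", "kernel", "copy c"):
    stream.enqueue(op)

print(len(log))  # 0: results are not available before synchronizing
stream.synchronize(log)
print(log)       # operations completed in the order they were queued
```

Reading `log` before `synchronize` corresponds to reading a result array before the GPU has finished writing it.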

### Events for Timing

Events allow you to measure how long GPU operations take:

In [None]:
from cuda.core.experimental import Event

def time_kernel_execution():
    # Step 1: Setup
    device = Device(0)
    device.set_current()
    
    # Step 2: Create timing events
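    # Events are created through the device; timing must be enabled explicitly
    start_event = device.create_event(options={"enable_timing": True})
    end_event = device.create_event(options={"enable_timing": True})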
    print("Created timing events")
    
    # Step 3: Prepare larger dataset for timing
    N = 1000000  # 1 million elements
    a = np.random.random(N).astype(np.float32)
    b = np.random.random(N).astype(np.float32)
    print(f"Prepared data with {N} elements")
    
    # Step 4: Allocate and copy data
    d_a = device.allocate(a.nbytes)
    d_b = device.allocate(b.nbytes)
    d_c = device.allocate(a.nbytes)
    
    d_a.copy_from(a)
    d_b.copy_from(b)
    
    # Step 5: Record start time on the default stream
    stream = device.default_stream
    stream.record(start_event)
    
    # Step 6: Launch kernel (the operation we want to time)
    block_size = 256
    grid_size = (N + block_size - 1) // block_size
    config = LaunchConfig(grid=(grid_size,), block=(block_size,))
    launch(stream, config, kernel, d_a, d_b, d_c, np.int32(N))
    
    # Step 7: Record end time
    stream.record(end_event)
    
    # Step 8: Wait for the end event and calculate time
    end_event.sync()
    elapsed_time = end_event - start_event  # Elapsed time in milliseconds
    
    print(f"Kernel execution took: {elapsed_time:.2f} milliseconds")
    
    # Step 9: Calculate performance metrics
    elements_per_second = N / (elapsed_time / 1000.0)  # Convert ms to seconds
    print(f"Processed {elements_per_second:.0f} elements per second")
    
    return elapsed_time

# Time our kernel execution
execution_time = time_kernel_execution()

**Understanding the timing:**
1. Events as markers: Think of events as stopwatch clicks
2. Record timing: Recording the start event marks the beginning
3. Synchronize: Synchronizing on the end event waits for the operation to finish
4. Calculate: The elapsed time between the two events gives the duration in milliseconds

**Why use events instead of Python's time.time()?**
* GPU operations are asynchronous - they start but don't block Python
* Events measure actual GPU execution time, not Python overhead
* More accurate for performance analysis
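
Once an elapsed time in milliseconds is available, it can be turned into throughput figures. The sketch below is CPU-only; the function name, the 0.05 ms measurement, and the choice of three arrays touched per element (two reads and one write, as in vector addition) are assumptions made for illustration, not measured results:

```python
def throughput(n_elements, elapsed_ms, bytes_per_element=4, arrays_touched=3):
    """Convert an elapsed time in milliseconds into throughput figures.

    arrays_touched=3 models vector addition: read a, read b, write c.
    Returns (elements per second, gigabytes per second).
    """
    seconds = elapsed_ms / 1000.0
    elements_per_second = n_elements / seconds
    bytes_moved = n_elements * bytes_per_element * arrays_touched
    gigabytes_per_second = bytes_moved / seconds / 1e9
    return elements_per_second, gigabytes_per_second

# Hypothetical measurement: 1,000,000 float32 elements in 0.05 ms
eps, gbps = throughput(1_000_000, 0.05)
print(f"{eps:.3g} elements/s, {gbps:.0f} GB/s")
```

Comparing the GB/s figure with the memory bandwidth in your GPU's specification indicates how close a memory-bound kernel is to the hardware limit.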

### Multiple Streams for Parallelism

Advanced usage: multiple streams let independent operations overlap:

In [None]:
def demonstrate_multiple_streams():
    device = Device(0)
    device.set_current()
    
    # Create multiple streams
    stream1 = device.create_stream()
    stream2 = device.create_stream()
    print("Created two streams for parallel execution")
    
    # Prepare data for both streams
    N = 100000
    data1_a = np.random.random(N).astype(np.float32)
    data1_b = np.random.random(N).astype(np.float32)
    data2_a = np.random.random(N).astype(np.float32)
    data2_b = np.random.random(N).astype(np.float32)
    
    # Allocate memory for both computations
    d1_a = device.allocate(data1_a.nbytes)
    d1_b = device.allocate(data1_b.nbytes)
    d1_c = device.allocate(data1_a.nbytes)
    
    d2_a = device.allocate(data2_a.nbytes)
    d2_b = device.allocate(data2_b.nbytes)
    d2_c = device.allocate(data2_a.nbytes)
    
    # Launch operations in parallel streams
    print("Launching operations in parallel...")
    
    # Stream 1 operations
    d1_a.copy_from(data1_a)
    d1_b.copy_from(data1_b)
    config = LaunchConfig(grid=((N + 255) // 256,), block=(256,))
    launch(stream1, config, kernel, d1_a, d1_b, d1_c, np.int32(N))
    
    # Stream 2 operations (can run concurrently with stream 1)
    d2_a.copy_from(data2_a)
    d2_b.copy_from(data2_b)
    launch(stream2, config, kernel, d2_a, d2_b, d2_c, np.int32(N))
    
    # Wait for both streams to complete
    stream1.sync()
    stream2.sync()
    print("Both streams completed")
    
    # Copy results back
    result1 = np.zeros(N, dtype=np.float32)
    result2 = np.zeros(N, dtype=np.float32)
    d1_c.copy_to(result1)
    d2_c.copy_to(result2)
    
    return result1, result2

# Demonstrate parallel execution
result1, result2 = demonstrate_multiple_streams()
print("Multiple stream execution completed successfully")

**Benefits of multiple streams:**
1. Parallel execution: If your GPU has resources, both kernels can run simultaneously
2. Better utilization: Keeps more of the GPU busy
3. Overlapping operations: Data transfer in one stream while computing in another

**Important notes:**
* Not all operations can run truly in parallel (depends on GPU resources)
* Each stream maintains its own order of operations
* Synchronization ensures all work completes before proceeding
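
The example above gives each stream its own dataset. Another common way to create independent work is to split one large array into contiguous chunks, one per stream. A minimal sketch of that bookkeeping (the helper name `split_for_streams` is invented here, and no GPU is involved):

```python
def split_for_streams(n, num_streams):
    """Split range(n) into contiguous (start, stop) chunks, one per stream.

    Chunk sizes differ by at most one element.
    """
    base, extra = divmod(n, num_streams)
    chunks, start = [], 0
    for s in range(num_streams):
        stop = start + base + (1 if s < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks

print(split_for_streams(100000, 2))  # [(0, 50000), (50000, 100000)]
print(split_for_streams(10, 3))      # [(0, 4), (4, 7), (7, 10)]
```

Each `(start, stop)` pair would then get its own copy and kernel launch in its own stream, with the grid size computed from `stop - start`.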

## 7. Error Handling
When working at the level of the CUDA driver API, you are responsible for checking for errors after every API call.

GPU programming can fail in many ways:
* Out of memory: Asking for more GPU memory than available
* Invalid kernels: Bugs in CUDA C code
* Device errors: Hardware problems or driver issues
* Launch failures: Invalid grid/block configurations

Good error handling helps you:
* Debug problems quickly
* Write robust applications
* Provide helpful error messages

In [None]:
def safe_cuda_operation():
    device = None
    try:
        print("Attempting CUDA operation...")
        
        # Initialize device (this can fail)
        device = Device(0)
        device.set_current()
        print("✓ Device initialized successfully")
        
        # Allocate memory (this can fail if requesting too much)
        buffer = device.allocate(1000 * 4)
        print("✓ Memory allocated successfully")
        
        # Your CUDA operations here
        print("✓ All CUDA operations completed successfully")
        
    except RuntimeError as e:
        print(f"✗ CUDA Runtime Error: {e}")
        print("This usually means a problem with GPU drivers or hardware")
        
    except MemoryError as e:
        print(f"✗ GPU Memory Error: {e}")
        print("Try reducing the size of your data or closing other GPU programs")
        
    except Exception as e:
        print(f"✗ Unexpected error occurred: {e}")
        print(f"Error type: {type(e).__name__}")
        
    finally:
        # Cleanup happens automatically in cuda.core
        # But you can add custom cleanup here if needed
        if device is not None:
            print("✓ Cleanup completed")

# Test error handling
safe_cuda_operation()

**Error handling best practices:**
1. Use try-except blocks: Wrap CUDA operations in try-except
2. Specific exceptions: Catch specific error types when possible
3. Helpful messages: Explain what went wrong and how to fix it
4. Cleanup: Use finally blocks for cleanup code
5. Don't ignore errors: Always handle or propagate exceptions
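
One way to apply these practices to out-of-memory failures is to retry with a smaller request and propagate the error once retrying stops making sense. The sketch below is hypothetical throughout: `allocate_with_fallback` and `fake_allocate` are invented names, the fake allocator stands in for a real GPU allocation, and `MemoryError` is an assumption carried over from the example above (the exception a real allocation failure raises may differ).

```python
def allocate_with_fallback(allocate, nbytes, min_bytes=1):
    """Try `allocate(nbytes)`, halving the request after each MemoryError.

    `allocate` is any callable that raises MemoryError on failure.
    Returns (buffer, bytes_granted); re-raises if the next, smaller
    request would drop below `min_bytes`.
    """
    size = nbytes
    while True:
        try:
            return allocate(size), size
        except MemoryError:
            if size // 2 < min_bytes:
                raise  # propagate instead of silently ignoring the error
            size //= 2

def fake_allocate(size, capacity=1000):
    # Stand-in allocator for this sketch: pretends 1000 bytes are free.
    if size > capacity:
        raise MemoryError(f"requested {size} bytes, only {capacity} available")
    return bytearray(size)

buffer, granted = allocate_with_fallback(fake_allocate, 4000)
print(granted)  # 4000 -> 2000 -> 1000, so 1000 bytes are granted
```

The caller can then process the data in pieces of `granted` bytes instead of failing outright.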

### Common Error Patterns

In [None]:
def handle_common_errors():
    errors_encountered = []
    
    # Error 1: Invalid device number
    print("Testing invalid device...")
    try:
        invalid_device = Device(999)  # Device 999 probably doesn't exist
        invalid_device.set_current()
    except Exception as e:
        error_msg = f"Invalid device error: {e}"
        errors_encountered.append(error_msg)
        print(f"✗ {error_msg}")
    
    # Error 2: Memory allocation failure
    print("\nTesting memory allocation failure...")
    try:
        device = Device(0)
        device.set_current()
        
        # Try to allocate an impossibly large amount of memory (1TB)
        huge_amount = 1024 * 1024 * 1024 * 1024  # 1TB in bytes
        huge_alloc = device.allocate(huge_amount)
        
    except Exception as e:
        error_msg = f"Memory allocation error: {e}"
        errors_encountered.append(error_msg)
        print(f"✗ {error_msg}")
    
    # Error 3: Invalid kernel compilation
    print("\nTesting kernel compilation failure...")
    try:
        # This CUDA code has syntax errors
        bad_source = """
        extern "C" __global__ void broken_kernel( {
            // Missing closing brace and parameter list
            int i = this_variable_doesnt_exist;
        """
        program = Program(bad_source, code_type="c++")
        program.compile("cubin")
        
    except Exception as e:
        error_msg = f"Kernel compilation error: {e}"
        errors_encountered.append(error_msg)
        print(f"✗ {error_msg}")
    
    # Error 4: Invalid launch configuration
    print("\nTesting invalid launch configuration...")
    try:
        device = Device(0)
        device.set_current()
        
        # Create a valid kernel first
        valid_source = """
        extern "C" __global__ void test_kernel(float *data, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] = i;
        }
        """
        program = Program(valid_source, code_type="c++")
        kernel = program.compile("cubin").get_kernel("test_kernel")
        
        # Try to launch with invalid configuration (0 threads per block)
        bad_config = LaunchConfig(grid=(1,), block=(0,))  # 0 threads is invalid
        
        # This should fail
        data = device.allocate(100 * 4)
        launch(device.default_stream, bad_config, kernel, data, np.int32(100))
        
    except Exception as e:
        error_msg = f"Invalid launch configuration: {e}"
        errors_encountered.append(error_msg)
        print(f"✗ {error_msg}")
    
    print(f"\nSummary: Caught {len(errors_encountered)} expected errors")
    return errors_encountered

# Run error handling tests
errors = handle_common_errors()
print("\nError handling demonstration completed")

**What this demonstrates:**
1. **Device errors**: Wrong device numbers, missing GPUs
2. **Memory errors**: Requesting too much memory
3. **Compilation errors**: Syntax errors in CUDA C code
4. **Launch errors**: Invalid thread/block configurations
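
Launch errors in particular can often be caught before anything reaches the GPU. The following sketch checks a grid/block configuration in plain Python. `check_launch` is a made-up helper, and its default limits (1024 threads per block, block dimensions of at most 1024 × 1024 × 64) are typical values used here as assumptions; the real limits should come from the device properties shown earlier.

```python
def check_launch(grid, block, max_threads_per_block=1024,
                 max_block_dims=(1024, 1024, 64)):
    """Return a list of problems with a launch configuration (empty if none)."""
    problems = []
    for name, dims in (("grid", grid), ("block", block)):
        if not 1 <= len(dims) <= 3:
            problems.append(f"{name} must have 1 to 3 dimensions")
        if any(d < 1 for d in dims):
            problems.append(f"{name} dimensions must be >= 1, got {dims}")
    threads = 1
    for d in block:
        threads *= d
    if threads > max_threads_per_block:
        problems.append(
            f"{threads} threads per block exceeds {max_threads_per_block}"
        )
    for d, limit in zip(block, max_block_dims):
        if d > limit:
            problems.append(f"block dimension {d} exceeds limit {limit}")
    return problems

print(check_launch(grid=(1,), block=(0,)))      # flags the zero-sized block
print(check_launch(grid=(4,), block=(256,)))    # []
print(check_launch(grid=(1,), block=(32, 64)))  # 2048 threads: too many
```

Running such a check before a launch turns a driver error into a message that names the offending dimension.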

## 8. Performance Optimization
**Understanding GPU Performance**

GPUs achieve high performance through **massive parallelism**, but they have specific requirements:

**What makes GPUs fast:**
* Thousands of threads running simultaneously
* High memory bandwidth (when used correctly)
* Specialized hardware for parallel operations

**What makes GPUs slow:**
* Thread divergence (threads taking different paths)
* Poor memory access patterns
* Insufficient parallelism
* Frequent CPU-GPU data transfers


### Memory Coalescing
**Memory Coalescing** means arranging memory accesses so that neighboring threads access neighboring memory locations. This allows the GPU to fetch data efficiently.

Let's look at good vs bad memory access patterns:

In [None]:
# Good: Coalesced memory access pattern
coalesced_kernel = """
extern "C" __global__ void coalesced_access(float *data, int n) {
    // Each thread accesses consecutive memory locations
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f;  // Sequential access - GOOD!
    }
}
"""

# Bad: Strided memory access pattern  
strided_kernel = """
extern "C" __global__ void strided_access(float *data, int n, int stride) {
    // Each thread skips 'stride' elements
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) {
        data[i] = data[i] * 2.0f;  // Strided access - BAD!
    }
}
"""

def compare_memory_patterns():
    device = Device(0)
    device.set_current()
    
    # Compile both kernels
    coalesced_program = Program(coalesced_kernel, code_type="c++")
    coalesced_func = coalesced_program.compile("cubin").get_kernel("coalesced_access")
    
    strided_program = Program(strided_kernel, code_type="c++")
    strided_func = strided_program.compile("cubin").get_kernel("strided_access")
    
    # Prepare test data
    N = 1000000  # 1 million elements
    data = np.ones(N, dtype=np.float32)
    
    # Test coalesced access
    print("Testing coalesced memory access...")
    d_data = device.allocate(data.nbytes)
    d_data.copy_from(data)
    
    # Events are created through the device; timing must be enabled explicitly
    start_event = device.create_event(options={"enable_timing": True})
    end_event = device.create_event(options={"enable_timing": True})
    stream = device.default_stream
    
    stream.record(start_event)
    config = LaunchConfig(grid=((N + 255) // 256,), block=(256,))
    launch(stream, config, coalesced_func, d_data, np.int32(N))
    stream.record(end_event)
    end_event.sync()
    
    coalesced_time = end_event - start_event  # milliseconds
    print(f"Coalesced access time: {coalesced_time:.2f} ms")
    
    # Test strided access: every 4th element of a 4x larger buffer,
    # so both kernels update the same number of elements
    print("Testing strided memory access...")
    stride = 4
    d_strided = device.allocate(data.nbytes * stride)  # contents don't matter for timing
    
    stream.record(start_event)
    launch(stream, config, strided_func, d_strided, np.int32(N * stride), np.int32(stride))
    stream.record(end_event)
    end_event.sync()
    
    strided_time = end_event - start_event  # milliseconds
    print(f"Strided access time: {strided_time:.2f} ms")
    
    print(f"Speedup from coalescing: {strided_time / coalesced_time:.1f}x")

# Compare memory access patterns
compare_memory_patterns()

**Why coalescing matters:**
* Coalesced: Threads 0-31 access data[0-31] → GPU fetches in 1 transaction
* Strided: Threads 0-31 access data[0, 4, 8, 12, ...] → GPU needs many transactions

Rule of thumb: Arrange your data so that thread N accesses element N (or close to it).
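
The difference between the two patterns can be counted directly. The sketch below is a simplified model, not a description of any specific GPU: it assumes memory is fetched in 128-byte segments, that the array starts on a segment boundary, and that a warp has 32 threads. The function name `segments_touched` is made up for this example.

```python
def segments_touched(stride, warp_size=32, element_bytes=4, segment_bytes=128):
    """Count the distinct memory segments one warp touches.

    Thread t accesses element t * stride of a float32 array.
    """
    segments = {
        (thread * stride * element_bytes) // segment_bytes
        for thread in range(warp_size)
    }
    return len(segments)

for stride in (1, 2, 4, 32):
    print(f"stride {stride}: {segments_touched(stride)} segment(s)")
```

Under these assumptions a stride of 1 touches a single segment, a stride of 4 touches four, and a stride of 32 touches one segment per thread, so the same 32 loads cost up to 32 times the memory traffic.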

### Shared Memory Usage

**Shared memory** is fast, on-chip memory that all threads in a block can access. It's ideal for:
* Caching frequently accessed data
* Sharing intermediate results between threads
* Reducing global memory accesses

In [None]:
shared_memory_kernel = """
extern "C" __global__ void shared_memory_example(float *input, float *output, int n) {
    // Declare shared memory (allocated at kernel launch)
    extern __shared__ float shared_data[];
    
    int tid = threadIdx.x;  // Thread ID within block
    int i = blockIdx.x * blockDim.x + tid;  // Global thread ID
    
    // Step 1: Load data from global memory to shared memory
    if (i < n) {
        shared_data[tid] = input[i];
    } else {
        shared_data[tid] = 0.0f;  // Pad with zeros
    }
    
    // Step 2: Wait for all threads in block to finish loading
    __syncthreads();  // This is crucial!
    
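    // Step 3: Process data using shared memory (3-point moving average)
    // Only the first and last elements of the whole array are skipped
    if (i > 0 && i < n - 1) {
        // Neighbors inside the block come from shared memory - much faster!
        // At block edges the neighbor belongs to another block, so read it from global memory
        float left = (tid > 0) ? shared_data[tid-1] : input[i-1];
        float right = (tid < blockDim.x - 1) ? shared_data[tid+1] : input[i+1];
        output[i] = (left + shared_data[tid] + right) / 3.0f;
    }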
}
"""

def demonstrate_shared_memory():
    device = Device(0)
    device.set_current()
    
    print("Demonstrating shared memory usage...")
    
    # Prepare test data (some noise to smooth)
    N = 10000
    data = np.random.random(N).astype(np.float32)
    
    # Compile kernel
    program = Program(shared_memory_kernel)
    kernel = program.compile().get_kernel("shared_memory_example")
    
    # Allocate GPU memory
    d_input = device.allocate(data.nbytes)
    d_output = device.allocate(data.nbytes)
    
    d_input.copy_from(data)
    
    # Configure launch with shared memory
    block_size = 256
    shared_mem_size = block_size * 4  # 4 bytes per float32
    
    config = LaunchConfig(
        grid=((N + block_size - 1) // block_size,),
        block=(block_size,),
        shared_memory_size=shared_mem_size  # Allocate shared memory
    )
    
    print(f"Using {shared_mem_size} bytes of shared memory per block")
    
    # Launch kernel
    launch(kernel, config, d_input, d_output, np.int32(N))
    
    # Get results
    result = np.zeros_like(data)
    d_output.copy_to(result)
    
    print("Shared memory kernel completed successfully")
    print(f"Input data range: [{data.min():.3f}, {data.max():.3f}]")
    print(f"Smoothed data range: [{result[1:-1].min():.3f}, {result[1:-1].max():.3f}]")
    
    return result

# Demonstrate shared memory
smoothed_data = demonstrate_shared_memory()

**Key shared memory concepts:**
1. Declaration: `extern __shared__ float shared_data[]` declares a shared array whose size is set at launch time
2. Allocation: Specify `shared_memory_size` (in bytes, per block) in LaunchConfig
3. Synchronization: `__syncthreads()` makes every thread in the block wait until all of them have reached that point
4. Access: Much faster than global memory for repeated access
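
One way to sanity-check a kernel like this is to compare its output with a CPU reference. The sketch below is a plain NumPy version of the same 3-point moving average; the function name `moving_average_reference` and the small sample array are illustrative assumptions, not part of the tutorial's code.

```python
import numpy as np

def moving_average_reference(data):
    """3-point moving average on the CPU; the first and last elements are left at zero."""
    out = np.zeros_like(data)
    # Each interior element is the mean of itself and its two neighbors
    out[1:-1] = (data[:-2] + data[1:-1] + data[2:]) / 3.0
    return out

rng = np.random.default_rng(0)
sample = rng.random(10).astype(np.float32)
reference = moving_average_reference(sample)
print(reference[1:-1])
```

The GPU result can then be compared against this reference with `np.allclose`, restricted to the positions the kernel actually writes.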

**When to use shared memory:**
* When multiple threads need the same data
* For algorithms that reuse data (convolution, matrix multiplication)
* To implement custom caching strategies
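
To make the data-reuse argument concrete, the sketch below counts global-memory load instructions for a 3-point stencil with and without shared-memory staging. It is a back-of-the-envelope model under stated assumptions: hardware caches are ignored, each block is assumed to fetch at most two extra neighbor ("halo") elements, and `global_loads` is a hypothetical helper name.

```python
def global_loads(n, block_size=256):
    """Return (naive, tiled) counts of global-memory loads for a 3-point stencil."""
    num_blocks = (n + block_size - 1) // block_size
    # Naive: every output element reads left, center, and right from global memory
    naive = 3 * n
    # Tiled: one staged load per element, plus up to two halo loads per block
    tiled = n + 2 * num_blocks
    return naive, tiled

naive, tiled = global_loads(10000)
print(f"naive: {naive} loads, tiled: {tiled} loads, ratio: {naive / tiled:.2f}x")
```

The wider the stencil (for example, a larger convolution kernel), the more each staged element is reused and the larger this ratio becomes.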

## 9. Practical Examples
### Example 1: Image Convolution
Image convolution is a common operation in computer vision for tasks like blurring, edge detection, and sharpening.

In [None]:
convolution_kernel = """
extern "C" __global__ void convolution_2d(float *input, float *output, float *kernel, 
                                         int width, int height, int kernel_size) {
    // Calculate which pixel this thread processes
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    
    // Make sure we're within image bounds
    if (col < width && row < height) {
        float sum = 0.0f;
        int half_kernel = kernel_size / 2;
        
        // Apply convolution kernel
        for (int i = -half_kernel; i <= half_kernel; i++) {
            for (int j = -half_kernel; j <= half_kernel; j++) {
                int input_row = row + i;
                int input_col = col + j;
                
                // Handle image boundaries (clamp to edge)
                input_row = max(0, min(input_row, height - 1));
                input_col = max(0, min(input_col, width - 1));
                
                int input_idx = input_row * width + input_col;
                int kernel_idx = (i + half_kernel) * kernel_size + (j + half_kernel);
                
                sum += input[input_idx] * kernel[kernel_idx];
            }
        }
        
        output[row * width + col] = sum;
    }
}
"""

def gpu_convolution(image, conv_kernel):
    """
    Perform 2D convolution on an image using GPU
    
    Args:
        image: 2D numpy array (height, width) 
        conv_kernel: 2D convolution kernel (must be square, odd size)
    
    Returns:
        Convolved image as 2D numpy array
    """
    device = Device(0)
    device.set_current()
    
    # Validate inputs
    height, width = image.shape
    kernel_size = conv_kernel.shape[0]
    assert conv_kernel.shape == (kernel_size, kernel_size), "Kernel must be square"
    assert kernel_size % 2 == 1, "Kernel size must be odd"
    
    print(f"Convolving {width}x{height} image with {kernel_size}x{kernel_size} kernel")
    
    # Flatten arrays for GPU processing
    image_flat = image.flatten().astype(np.float32)
    kernel_flat = conv_kernel.flatten().astype(np.float32)
    
    # Compile kernel
    program = Program(convolution_kernel)
    kernel_func = program.compile().get_kernel("convolution_2d")
    
    # Allocate GPU memory
    d_input = device.allocate(image_flat.nbytes)
    d_output = device.allocate(image_flat.nbytes)
    d_kernel = device.allocate(kernel_flat.nbytes)
    
    # Copy data to GPU
    d_input.copy_from(image_flat)
    d_kernel.copy_from(kernel_flat)
    
    # Configure 2D launch (one thread per output pixel)
    block_size = 16  # 16x16 = 256 threads per block
    grid_x = (width + block_size - 1) // block_size
    grid_y = (height + block_size - 1) // block_size
    
    print(f"Launch configuration: {grid_x}x{grid_y} blocks of {block_size}x{block_size} threads")
    
    # Time the convolution
    start_event = Event()
    end_event = Event()
    
    start_event.record()
    config = LaunchConfig(grid=(grid_x, grid_y), block=(block_size, block_size))
    launch(kernel_func, config, d_input, d_output, d_kernel, 
           np.int32(width), np.int32(height), np.int32(kernel_size))
    end_event.record()
    end_event.synchronize()
    
    time_ms = Event.elapsed_time(start_event, end_event)
    
    # Copy result back and reshape
    output_flat = np.zeros_like(image_flat)
    d_output.copy_to(output_flat)
    
    print(f"Convolution completed in {time_ms:.2f} ms")
    
    return output_flat.reshape(height, width)

# Test convolution with different kernels
print("=== Image Convolution Demo ===")

# Create a test image (simple pattern)
test_image = np.zeros((100, 100), dtype=np.float32)
test_image[40:60, 40:60] = 1.0  # White square in center

# Edge detection kernel (Laplacian)
edge_kernel = np.array([
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1]
], dtype=np.float32)

# Blur kernel (Gaussian approximation)
blur_kernel = np.array([
    [1, 2, 1],
    [2, 4, 2], 
    [1, 2, 1]
], dtype=np.float32) / 16.0  # Normalize

# Apply convolutions
edges = gpu_convolution(test_image, edge_kernel)
blurred = gpu_convolution(test_image, blur_kernel)

print(f"Original image range: [{test_image.min():.1f}, {test_image.max():.1f}]")
print(f"Edge detection range: [{edges.min():.1f}, {edges.max():.1f}]")
print(f"Blurred image range: [{blurred.min():.1f}, {blurred.max():.1f}]")

Explanation:
1. 2D thread indexing: Each thread processes one pixel
2. Boundary handling: Clamping to avoid out-of-bounds access
3. Real-world application: Actual image processing algorithm
4. Performance timing: Measuring GPU execution time
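
A CPU reference is useful for validating the GPU output. The sketch below mirrors the kernel's indexing in NumPy: it clamps at the image border (via `np.pad` with `mode="edge"`) and, like the kernel above, applies the filter without flipping it, which makes no difference for symmetric kernels such as the two used here. The function name `cpu_convolution_reference` is an assumption for illustration.

```python
import numpy as np

def cpu_convolution_reference(image, conv_kernel):
    """Clamp-to-edge 2D filter matching the indexing of the GPU kernel (no kernel flip)."""
    height, width = image.shape
    k = conv_kernel.shape[0]
    half = k // 2
    # Replicating border pixels is equivalent to clamping the row/column indices
    padded = np.pad(image.astype(np.float32), half, mode="edge")
    out = np.zeros((height, width), dtype=np.float32)
    # Accumulate one shifted, weighted copy of the image per kernel tap
    for di in range(k):
        for dj in range(k):
            out += conv_kernel[di, dj] * padded[di:di + height, dj:dj + width]
    return out

image = np.zeros((100, 100), dtype=np.float32)
image[40:60, 40:60] = 1.0  # Same white square as the test image
laplacian = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]], dtype=np.float32)
edges_ref = cpu_convolution_reference(image, laplacian)
print(f"Reference edge range: [{edges_ref.min():.1f}, {edges_ref.max():.1f}]")
```

Comparing this array with the GPU result using `np.allclose` is a quick way to catch indexing mistakes in the kernel.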

## 10. Lab
### Exercise 1: Vector Operations
Write a CUDA kernel that performs element-wise multiplication of two vectors.

In [None]:
# Your solution here
multiply_kernel_source = """
// TODO: Implement vector multiplication kernel
"""

def vector_multiply(a, b):
    # TODO: Implement the wrapper function
    pass

### Exercise 2: Reduction Operation
Implement a parallel reduction to find the maximum value in an array.

In [None]:
# Your solution here
max_reduction_source = """
// TODO: Implement reduction kernel
"""

def find_max_gpu(arr):
    # TODO: Implement the wrapper function
    pass

### Exercise 3: Matrix Transpose
Write a kernel that transposes a matrix efficiently using shared memory.

In [None]:
# Your solution here
transpose_kernel_source = """
// TODO: Implement matrix transpose kernel with shared memory
"""

def matrix_transpose_gpu(matrix):
    # TODO: Implement the wrapper function
    pass

### Exercise 4: Performance Comparison
Compare the performance of your GPU implementations with their CPU counterparts.

In [None]:
def benchmark_operations():
    # TODO: Implement benchmarking code
    pass

### Best Practices Summary

1. **Always initialize CUDA properly** - select a device and call `set_current()` before allocating memory or launching kernels
2. **Manage memory carefully** - allocate, copy, free in proper order
3. **Handle errors gracefully** - wrap CUDA calls in `try`/`except` blocks
4. **Use appropriate block sizes** - typically 128, 256, or 512 threads
5. **Consider memory access patterns** - coalesced access is faster
6. **Use shared memory** for data reuse within blocks
7. **Profile your code** - use events for timing
8. **Clean up resources** - always free memory and destroy contexts
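
The block-size advice goes hand in hand with the grid-size arithmetic used throughout this tutorial. As a small sketch, the hypothetical helper `blocks_needed` below wraps the ceiling division `(n + block_size - 1) // block_size` that appears in the launch configurations:

```python
def blocks_needed(n, block_size):
    """Ceiling division: the smallest number of blocks whose threads cover n elements."""
    return (n + block_size - 1) // block_size

# 1D launch for one million elements with 256 threads per block
print(blocks_needed(1_000_000, 256))

# 2D launch for a 100x100 image with 16x16 thread blocks
print((blocks_needed(100, 16), blocks_needed(100, 16)))
```

Because the last block is usually only partly used, kernels still need a bounds check such as `if (i < n)`.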

This tutorial provides a solid foundation for using cuda.core effectively. Remember that low-level CUDA programming requires careful attention to detail, but it offers maximum performance and flexibility for GPU computing tasks.

## Resources
Numba CUDA Reference: https://numba.pydata.org/numba-doc/dev/cuda-reference/

Repository: https://github.com/NVIDIA/cuda-python 