# Module 07: GPU Acceleration with CUDA

**Difficulty**: ⭐⭐⭐

**Estimated Time**: 90 minutes

**Prerequisites**: 
- Module 05: Serial Implementation
- Module 06: Multi-Core Parallelization
- Basic understanding of GPU architecture

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand GPU architecture and CUDA execution model
2. Write CUDA kernels using Numba in Python
3. Manage GPU memory (global, shared, constant)
4. Optimize GPU code for maximum throughput
5. Compare CPU vs GPU performance on various problems
6. Identify when GPU acceleration is beneficial

## Note on GPU Availability

This notebook is designed to work whether you have a GPU or not:
- **With GPU**: Full CUDA acceleration
- **Without GPU**: Code examples and explanations (won't execute kernels)

## 1. Setup and GPU Detection

In [None]:
import numpy as np
import time
import matplotlib.pyplot as plt
import pandas as pd
from typing import Tuple

# Try to import CUDA support
GPU_AVAILABLE = False
try:
    from numba import cuda
    import numba
    
    # Check if GPU is actually available
    if cuda.is_available():
        GPU_AVAILABLE = True
        print("GPU DETECTED!")
        print(f"Numba version: {numba.__version__}")
        print(f"\nGPU Information:")
        gpu = cuda.get_current_device()
        print(f"  Name: {gpu.name.decode('utf-8')}")
        print(f"  Compute Capability: {gpu.compute_capability}")
        total_bytes = cuda.current_context().get_memory_info()[1]  # (free, total) in bytes
        print(f"  Total Memory: {total_bytes / 1e9:.2f} GB")
        print(f"  Multiprocessors: {gpu.MULTIPROCESSOR_COUNT}")
        print(f"  Max Threads per Block: {gpu.MAX_THREADS_PER_BLOCK}")
        print(f"  Max Block Dimensions: {gpu.MAX_BLOCK_DIM_X} x {gpu.MAX_BLOCK_DIM_Y} x {gpu.MAX_BLOCK_DIM_Z}")
        print(f"  Max Grid Dimensions: {gpu.MAX_GRID_DIM_X} x {gpu.MAX_GRID_DIM_Y} x {gpu.MAX_GRID_DIM_Z}")
    else:
        print("Numba installed but no GPU detected.")
        print("This notebook will show code examples but won't execute GPU kernels.")
except ImportError:
    print("Numba with CUDA support not installed.")
    print("To install: conda install numba cudatoolkit")
    print("or: pip install numba (if you have CUDA toolkit installed)")
    print("\nThis notebook will show code examples and concepts.")

# Set random seed
np.random.seed(42)

# Configure matplotlib
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

## 2. GPU Architecture Overview

### CUDA Execution Model

```
CPU (Host)                    GPU (Device)
┌─────────────┐              ┌──────────────────────────────┐
│  Main Code  │              │  Grid                        │
│             │              │  ┌────────────────────────┐  │
│  Launch ────┼──────────────┼─>│  Block 0  Block 1  ... │  │
│  Kernel     │              │  │  ┌─────┐  ┌─────┐      │  │
│             │              │  │  │T T T│  │T T T│      │  │
│  Wait  <────┼──────────────┼──│  │T T T│  │T T T│      │  │
│             │              │  │  │T T T│  │T T T│      │  │
└─────────────┘              │  │  └─────┘  └─────┘      │  │
                             │  └────────────────────────┘  │
                             └──────────────────────────────┘
                             T = Thread
```

### Memory Hierarchy

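| Memory Type | Scope | Speed | Typical Size | Approx. Latency |
|-------------|-------|-------|--------------|-----------------|
| **Registers** | Thread | Fastest | 64K 32-bit registers per SM (shared among resident threads) | ~1 cycle |
| **Shared Memory** | Block | Very Fast | ~48 KB per block | ~20-30 cycles |
| **L1/L2 Cache** | SM (L1) / Device (L2) | Fast | KB (L1) to MB (L2) | ~30-200 cycles |
| **Global Memory** | Device | Slow | GB | ~400 cycles or more |
| **Constant Memory** | Device | Fast* | 64 KB | a few cycles* |

*if the access hits the constant cache. All sizes and latencies are order-of-magnitude figures that vary by GPU architecture.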

### Key Concepts

1. **Grid**: Collection of blocks
2. **Block**: Collection of threads (up to 1024 threads)
3. **Thread**: Single execution unit
4. **Warp**: Group of 32 threads executing together (SIMT)
5. **Kernel**: Function that runs on GPU
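
The relationship between these terms can be checked on the CPU. The sketch below is plain Python, not CUDA: `global_index` and `warp_of` are illustrative helper names (not part of Numba), the block and grid sizes are arbitrary choices, and the warp size of 32 is taken from the list above.

```python
WARP_SIZE = 32  # threads per warp, as stated in the list above


def global_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Global 1D index of a thread, built from its block and thread indices."""
    return block_idx * block_dim + thread_idx


def warp_of(thread_idx: int) -> int:
    """Index of the warp (within its block) that a thread belongs to."""
    return thread_idx // WARP_SIZE


block_dim = 128  # threads per block (hypothetical choice)
grid_dim = 3     # blocks in the grid (hypothetical choice)

# Enumerate every (block, thread) pair such a launch would create.
indices = [global_index(b, block_dim, t)
           for b in range(grid_dim) for t in range(block_dim)]

print(f"total threads:   {len(indices)}")
print(f"warps per block: {block_dim // WARP_SIZE}")
print(f"thread 70 of block 2 -> global index {global_index(2, block_dim, 70)}, "
      f"warp {warp_of(70)}")
```

Every global index from 0 to 383 appears exactly once, which is why one thread can be assigned to one array element without coordination.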

## 3. Your First CUDA Kernel: Vector Addition

Let's start with a simple example to understand CUDA programming.
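
Before looking at the kernel, it helps to see why a boundary check is needed at all. A launch always creates whole blocks, so the thread count is rounded up to a multiple of the block size. The helper below is a CPU-only sketch; `launch_config` is a hypothetical name used for illustration, and 256 threads per block is an assumed default.

```python
from typing import Tuple


def launch_config(n: int, threads_per_block: int = 256) -> Tuple[int, int]:
    """Return (blocks_per_grid, idle_threads) for a 1D launch over n elements."""
    # Ceiling division: enough blocks to cover all n elements.
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    # Threads in the last block that have no element to process.
    idle_threads = blocks_per_grid * threads_per_block - n
    return blocks_per_grid, idle_threads


for n in (1_000_000, 1_048_576, 1000):
    blocks, idle = launch_config(n)
    print(f"n={n:>9,}: {blocks:>5} blocks, {idle:>3} idle threads")
```

Whenever `n` is not a multiple of the block size, the idle threads would index past the end of the array unless the kernel checks `idx < size` first.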

In [None]:
# CPU version (baseline): an interpreted Python loop, much slower than NumPy's vectorized a + b
def vector_add_cpu(a, b, c):
    """
    Add two vectors on CPU.
    
    Args:
        a, b: Input vectors
        c: Output vector (c = a + b)
    """
    for i in range(a.size):
        c[i] = a[i] + b[i]


if GPU_AVAILABLE:
    @cuda.jit
    def vector_add_gpu(a, b, c):
        """
        Add two vectors on GPU.
        
        Each thread computes one element of the result.
        
        CUDA Built-in Variables:
        - cuda.threadIdx.x: Thread index within block
        - cuda.blockIdx.x:  Block index within grid
        - cuda.blockDim.x:  Number of threads per block
        """
        # Calculate global thread index
        idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        
        # Boundary check (in case we have more threads than elements)
        if idx < c.size:
            c[idx] = a[idx] + b[idx]
    
    print("GPU kernel defined!")
else:
    print("GPU not available - showing kernel code for reference:")
    print("""
    @cuda.jit
    def vector_add_gpu(a, b, c):
        idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
        if idx < c.size:
            c[idx] = a[idx] + b[idx]
    """)

In [None]:
if GPU_AVAILABLE:
    # Test vector addition
    n = 1_000_000
    
    # Create test data on CPU
    a_cpu = np.random.rand(n).astype(np.float32)
    b_cpu = np.random.rand(n).astype(np.float32)
    c_cpu = np.zeros(n, dtype=np.float32)
    
    # CPU execution
    print(f"Testing with {n:,} elements...\n")
    print("CPU version...")
    start = time.time()
    vector_add_cpu(a_cpu, b_cpu, c_cpu)
    cpu_time = time.time() - start
    print(f"  Time: {cpu_time:.4f} seconds")
    
    # GPU execution
    print("\nGPU version...")
    
    # Step 1: Copy data to GPU
    a_gpu = cuda.to_device(a_cpu)
    b_gpu = cuda.to_device(b_cpu)
    c_gpu = cuda.device_array(n, dtype=np.float32)
    
    # Step 2: Configure kernel launch parameters
    threads_per_block = 256
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    
    print(f"  Launching kernel with:")
    print(f"    Threads per block: {threads_per_block}")
    print(f"    Blocks per grid:   {blocks_per_grid}")
    print(f"    Total threads:     {threads_per_block * blocks_per_grid:,}")
    
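    # Step 3: Launch kernel and measure time
    # Warm-up launch: the first call triggers JIT compilation, which would
    # otherwise dominate the measurement.
    vector_add_gpu[blocks_per_grid, threads_per_block](a_gpu, b_gpu, c_gpu)
    cuda.synchronize()
    
    start = time.time()
    vector_add_gpu[blocks_per_grid, threads_per_block](a_gpu, b_gpu, c_gpu)
    cuda.synchronize()  # Wait for GPU to finish
    gpu_time = time.time() - start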
    
    # Step 4: Copy result back to CPU
    c_gpu_result = c_gpu.copy_to_host()
    
    print(f"  Time: {gpu_time:.4f} seconds")
    
    # Verify correctness
    matches = np.allclose(c_cpu, c_gpu_result)
    print(f"\nResults match: {matches}")
    
    # Calculate speedup
    speedup = cpu_time / gpu_time
    print(f"GPU speedup: {speedup:.2f}x")
    
    if speedup < 1:
        print("\nNote: GPU might be slower for simple operations due to:")
        print("  - Kernel launch overhead")
        print("  - Low arithmetic intensity (one addition per 12 bytes of memory traffic)")
        print("  - Memory transfers (CPU ↔ GPU), which this timing does not even include")
else:
    print("GPU not available - skipping benchmark")

## 4. Image Processing on GPU

Image filtering is highly parallelizable - perfect for GPU!
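
The per-pixel work can be previewed on the CPU first. The sketch below is plain Python under simplifying assumptions: a 3x3 box filter stands in for the Gaussian kernel, the image is a tiny hand-written list, and `clamp` and `blur_pixel` are illustrative names. Out-of-range neighbours are clamped to the nearest edge pixel.

```python
def clamp(value: int, low: int, high: int) -> int:
    """Clamp an index into [low, high], i.e. replicate the edge pixel."""
    return min(max(value, low), high)


def blur_pixel(image, kernel, row, col):
    """Work done by ONE thread: the weighted sum of the neighbourhood of (row, col)."""
    height, width = len(image), len(image[0])
    k_half = len(kernel) // 2
    total = 0.0
    for ki, kernel_row in enumerate(kernel):
        for kj, weight in enumerate(kernel_row):
            r = clamp(row + ki - k_half, 0, height - 1)
            c = clamp(col + kj - k_half, 0, width - 1)
            total += image[r][c] * weight
    return total


# 3x3 box filter (weights sum to 1) applied to a 3x4 image with a vertical edge.
box = [[1 / 9] * 3 for _ in range(3)]
image = [[0, 0, 90, 90],
         [0, 0, 90, 90],
         [0, 0, 90, 90]]

# On a GPU each (row, col) pair would be a separate thread;
# here a nested comprehension plays the role of the 2D grid.
blurred = [[blur_pixel(image, box, r, c) for c in range(4)] for r in range(3)]
for row in blurred:
    print([round(v) for v in row])
```

Each output pixel depends only on the input image, never on another output pixel, so all pixels can be computed independently.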

In [None]:
def create_test_image(size=(1024, 1024)):
    """Create test image."""
    height, width = size
    image = np.zeros((height, width), dtype=np.uint8)
    
    for i in range(height):
        image[i, :] = int(255 * i / height)
    
    image[100:200, 100:300] = 200
    
    center_y, center_x = height // 2, width // 2
    radius = 80
    y, x = np.ogrid[:height, :width]
    mask = (x - center_x)**2 + (y - center_y)**2 <= radius**2
    image[mask] = 150
    
    noise = np.random.randint(-20, 20, size=(height, width))
    image = np.clip(image.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    
    return image


def create_gaussian_kernel(size=5, sigma=1.0):
    """Create Gaussian kernel."""
    if size % 2 == 0:
        size += 1
    ax = np.arange(-size // 2 + 1, size // 2 + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return kernel / np.sum(kernel)


# Create test data
test_image = create_test_image((1024, 1024))
gaussian_kernel = create_gaussian_kernel(size=5, sigma=1.5).astype(np.float32)

print(f"Test image: {test_image.shape}, dtype={test_image.dtype}")
print(f"Kernel: {gaussian_kernel.shape}, dtype={gaussian_kernel.dtype}")

In [None]:
if GPU_AVAILABLE:
    @cuda.jit
    def gaussian_blur_gpu(image, kernel, output):
        """
        Apply Gaussian blur on GPU.
        
        Each thread processes one output pixel.
        Uses 2D grid and block configuration.
        """
        # Get 2D thread coordinates
        row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        
        height, width = image.shape
        k_height, k_width = kernel.shape
        k_half = k_height // 2
        
        # Boundary check
        if row < height and col < width:
            # Apply convolution
            value = 0.0
            
            for ki in range(k_height):
                for kj in range(k_width):
                    # Image coordinates (with boundary clamping)
                    img_row = min(max(row + ki - k_half, 0), height - 1)
                    img_col = min(max(col + kj - k_half, 0), width - 1)
                    
                    value += image[img_row, img_col] * kernel[ki, kj]
            
            # Clamp and store result
            if value < 0:
                output[row, col] = 0
            elif value > 255:
                output[row, col] = 255
            else:
                output[row, col] = int(value)
    
    print("GPU image filtering kernel defined!")
else:
    print("GPU not available - kernel code for reference only")

In [None]:
if GPU_AVAILABLE:
    print("="*70)
    print("IMAGE FILTERING BENCHMARK: CPU vs GPU")
    print("="*70)
    
    # CPU baseline (SciPy's compiled convolution routine)
    def gaussian_blur_cpu(image, kernel):
        """CPU version using scipy."""
        from scipy.ndimage import convolve
        return convolve(image.astype(np.float32), kernel, mode='nearest').astype(np.uint8)
    
    print("\nCPU version (scipy.ndimage.convolve)...")
    start = time.time()
    result_cpu = gaussian_blur_cpu(test_image, gaussian_kernel)
    cpu_time = time.time() - start
    print(f"  Time: {cpu_time:.4f} seconds")
    
    # GPU version
    print("\nGPU version...")
    
    # Transfer to GPU
    image_gpu = cuda.to_device(test_image)
    kernel_gpu = cuda.to_device(gaussian_kernel)
    output_gpu = cuda.device_array(test_image.shape, dtype=np.uint8)
    
    # Configure 2D grid
    threads_per_block = (16, 16)  # 16x16 = 256 threads per block
    blocks_per_grid_x = (test_image.shape[1] + threads_per_block[0] - 1) // threads_per_block[0]
    blocks_per_grid_y = (test_image.shape[0] + threads_per_block[1] - 1) // threads_per_block[1]
    blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
    
    print(f"  Grid configuration:")
    print(f"    Threads per block: {threads_per_block[0]}x{threads_per_block[1]}")
    print(f"    Blocks per grid:   {blocks_per_grid[0]}x{blocks_per_grid[1]}")
    print(f"    Total threads:     {blocks_per_grid[0]*blocks_per_grid[1]*threads_per_block[0]*threads_per_block[1]:,}")
    
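    # Warm-up launch: the first call includes JIT compilation, so it is not timed
    gaussian_blur_gpu[blocks_per_grid, threads_per_block](image_gpu, kernel_gpu, output_gpu)
    cuda.synchronize()
    
    # Launch kernel
    start = time.time()
    gaussian_blur_gpu[blocks_per_grid, threads_per_block](image_gpu, kernel_gpu, output_gpu)
    cuda.synchronize()
    gpu_time = time.time() - start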
    
    # Copy back
    result_gpu = output_gpu.copy_to_host()
    
    print(f"  Time: {gpu_time:.4f} seconds")
    
    # Calculate speedup
    speedup = cpu_time / gpu_time
    print(f"\nGPU speedup: {speedup:.2f}x")
    
    # Visualize results
    plt.figure(figsize=(15, 5))
    
    plt.subplot(131)
    plt.imshow(test_image, cmap='gray')
    plt.title('Original Image')
    plt.axis('off')
    
    plt.subplot(132)
    plt.imshow(result_cpu, cmap='gray')
    plt.title(f'CPU Result ({cpu_time:.3f}s)')
    plt.axis('off')
    
    plt.subplot(133)
    plt.imshow(result_gpu, cmap='gray')
    plt.title(f'GPU Result ({gpu_time:.3f}s), {speedup:.1f}x speedup')
    plt.axis('off')
    
    plt.tight_layout()
    plt.show()
else:
    print("GPU not available - skipping benchmark")

## 5. Matrix Multiplication with Shared Memory

Shared memory is key to high GPU performance. Let's implement tiled matrix multiplication.
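
Why tiling helps can be shown by counting memory traffic on the CPU. The sketch below is a plain-Python simulation, not a GPU kernel: local lists stand in for shared memory, `loads` counts simulated global-memory reads, and the 8x8 size with 4x4 tiles is an arbitrary small example.

```python
def matmul_naive(A, B):
    """One 'thread' per output element; every operand is read from global memory."""
    n = len(A)
    loads = 0
    C = [[0.0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[row][k] * B[k][col]
                loads += 2  # one read of A, one read of B
            C[row][col] = acc
    return C, loads


def matmul_tiled(A, B, tile):
    """One 'block' per tile of C; each block stages a tile of A and of B once."""
    n = len(A)
    assert n % tile == 0, "sketch assumes n is a multiple of the tile size"
    loads = 0
    C = [[0.0] * n for _ in range(n)]
    for by in range(0, n, tile):          # block row
        for bx in range(0, n, tile):      # block column
            for t in range(0, n, tile):   # walk tiles along the shared dimension
                # Stand-in for shared memory: each element is fetched once per block.
                sA = [[A[by + i][t + j] for j in range(tile)] for i in range(tile)]
                sB = [[B[t + i][bx + j] for j in range(tile)] for i in range(tile)]
                loads += 2 * tile * tile
                for ty in range(tile):
                    for tx in range(tile):
                        acc = C[by + ty][bx + tx]
                        for k in range(tile):
                            acc += sA[ty][k] * sB[k][tx]
                        C[by + ty][bx + tx] = acc
    return C, loads


n, tile = 8, 4
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j % 5) for j in range(n)] for i in range(n)]

C_naive, loads_naive = matmul_naive(A, B)
C_tiled, loads_tiled = matmul_tiled(A, B, tile)

print("results equal:     ", C_naive == C_tiled)
print("global loads naive:", loads_naive)
print("global loads tiled:", loads_tiled)
```

In this model the number of global reads drops by a factor equal to the tile width, because every staged element is reused by a whole row or column of threads in the block.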

In [None]:
if GPU_AVAILABLE:
    @cuda.jit
    def matmul_gpu_naive(A, B, C):
        """
        Naive matrix multiplication on GPU.
        
        Each thread computes one element of C.
        Uses only global memory (slow).
        """
        row = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        col = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        
        if row < C.shape[0] and col < C.shape[1]:
            value = 0.0
            for k in range(A.shape[1]):
                value += A[row, k] * B[k, col]
            C[row, col] = value
    
    
    @cuda.jit
    def matmul_gpu_shared(A, B, C):
        """
        Tiled matrix multiplication using shared memory.
        
        Shared memory is on-chip; its latency is often quoted as roughly
        100x lower than that of uncached global memory.
        We load tiles of A and B into shared memory, then compute.
        
        Algorithm:
        1. Divide matrices into tiles (e.g., 16x16)
        2. Load one tile of A and B into shared memory
        3. Compute partial dot product
        4. Repeat for all tiles
        5. Write final result to global memory
        """
        # Thread indices
        tx = cuda.threadIdx.x
        ty = cuda.threadIdx.y
        
        # Block indices
        bx = cuda.blockIdx.x
        by = cuda.blockIdx.y
        
        # Tile size: must match the (16, 16) block size used at launch
        TILE_SIZE = 16
        
        # Allocate shared memory for tiles (the shape must be a compile-time constant)
        # Each block has its own shared memory
        sA = cuda.shared.array(shape=(16, 16), dtype=numba.float32)
        sB = cuda.shared.array(shape=(16, 16), dtype=numba.float32)
        
        # Global position of this thread's output element
        row = by * TILE_SIZE + ty
        col = bx * TILE_SIZE + tx
        
        # Accumulator for dot product
        value = 0.0
        
        # Loop over tiles
        num_tiles = (A.shape[1] + TILE_SIZE - 1) // TILE_SIZE
        
        for tile_idx in range(num_tiles):
            # Load tile of A into shared memory
            a_col = tile_idx * TILE_SIZE + tx
            if row < A.shape[0] and a_col < A.shape[1]:
                sA[ty, tx] = A[row, a_col]
            else:
                sA[ty, tx] = 0.0
            
            # Load tile of B into shared memory
            b_row = tile_idx * TILE_SIZE + ty
            if b_row < B.shape[0] and col < B.shape[1]:
                sB[ty, tx] = B[b_row, col]
            else:
                sB[ty, tx] = 0.0
            
            # Synchronize to ensure all threads loaded their data
            cuda.syncthreads()
            
            # Compute partial dot product using shared memory
            for k in range(TILE_SIZE):
                value += sA[ty, k] * sB[k, tx]
            
            # Synchronize before loading next tile
            cuda.syncthreads()
        
        # Write result to global memory
        if row < C.shape[0] and col < C.shape[1]:
            C[row, col] = value
    
    print("GPU matrix multiplication kernels defined!")
    print("  - matmul_gpu_naive: Uses only global memory")
    print("  - matmul_gpu_shared: Uses shared memory (optimized)")
else:
    print("GPU not available")

In [None]:
if GPU_AVAILABLE:
    print("="*70)
    print("MATRIX MULTIPLICATION BENCHMARK")
    print("="*70)
    
    # Test matrices
    n = 512
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    
    print(f"\nMatrix size: {n}x{n}")
    print(f"Total operations: {2*n**3:,} (2n³ for n×n matrices)\n")
    
    # CPU baseline (NumPy)
    print("CPU (NumPy)...")
    start = time.time()
    C_cpu = np.dot(A, B)
    cpu_time = time.time() - start
    gflops_cpu = (2 * n**3) / cpu_time / 1e9
    print(f"  Time: {cpu_time:.4f} seconds")
    print(f"  Performance: {gflops_cpu:.2f} GFLOPS")
    
    # GPU naive
    print("\nGPU (naive - global memory only)...")
    A_gpu = cuda.to_device(A)
    B_gpu = cuda.to_device(B)
    C_gpu_naive = cuda.device_array((n, n), dtype=np.float32)
    
    threads_per_block = (16, 16)
    blocks_per_grid_x = (n + threads_per_block[0] - 1) // threads_per_block[0]
    blocks_per_grid_y = (n + threads_per_block[1] - 1) // threads_per_block[1]
    blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
    
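    # Warm-up launch so JIT compilation is not included in the timing
    matmul_gpu_naive[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu_naive)
    cuda.synchronize()
    
    start = time.time()
    matmul_gpu_naive[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu_naive)
    cuda.synchronize()
    gpu_naive_time = time.time() - start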
    gflops_gpu_naive = (2 * n**3) / gpu_naive_time / 1e9
    
    print(f"  Time: {gpu_naive_time:.4f} seconds")
    print(f"  Performance: {gflops_gpu_naive:.2f} GFLOPS")
    print(f"  Speedup vs CPU: {cpu_time/gpu_naive_time:.2f}x")
    
    # GPU shared memory
    print("\nGPU (optimized - shared memory)...")
    C_gpu_shared = cuda.device_array((n, n), dtype=np.float32)
    
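    # Warm-up launch so JIT compilation is not included in the timing
    matmul_gpu_shared[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu_shared)
    cuda.synchronize()
    
    start = time.time()
    matmul_gpu_shared[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu_shared)
    cuda.synchronize()
    gpu_shared_time = time.time() - start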
    gflops_gpu_shared = (2 * n**3) / gpu_shared_time / 1e9
    
    print(f"  Time: {gpu_shared_time:.4f} seconds")
    print(f"  Performance: {gflops_gpu_shared:.2f} GFLOPS")
    print(f"  Speedup vs CPU: {cpu_time/gpu_shared_time:.2f}x")
    print(f"  Speedup vs GPU naive: {gpu_naive_time/gpu_shared_time:.2f}x")
    
    # Verify correctness
    C_gpu_result = C_gpu_shared.copy_to_host()
    matches = np.allclose(C_cpu, C_gpu_result, rtol=1e-4)
    print(f"\nResults match: {matches}")
    
    # Summary table
    results = pd.DataFrame([
        {'Method': 'CPU (NumPy)', 'Time (s)': cpu_time, 'GFLOPS': gflops_cpu, 'Speedup': 1.0},
        {'Method': 'GPU Naive', 'Time (s)': gpu_naive_time, 'GFLOPS': gflops_gpu_naive, 'Speedup': cpu_time/gpu_naive_time},
        {'Method': 'GPU Shared', 'Time (s)': gpu_shared_time, 'GFLOPS': gflops_gpu_shared, 'Speedup': cpu_time/gpu_shared_time}
    ])
    
    print("\n" + "="*70)
    print("SUMMARY")
    print("="*70)
    print(results.to_string(index=False))
    
    print("\nKey insight: Shared memory is crucial for GPU performance!")
    print(f"Shared memory gave {gpu_naive_time/gpu_shared_time:.1f}x speedup over naive GPU implementation.")
else:
    print("GPU not available")

## 6. Monte Carlo Simulation on GPU

Monte Carlo is embarrassingly parallel - ideal for GPU!
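
The structure of the computation (independent workers, one private random stream each, one final sum) can be sketched on the CPU. The code below uses only Python's standard `random` module; `count_inside`, the seed offset of 1000, and the small worker and sample counts are illustrative assumptions. Distinct seeds are a convenient stand-in here, not a formal guarantee of independent streams.

```python
import random


def count_inside(seed: int, samples: int) -> int:
    """Work of one 'thread': draw points with a private RNG and count hits."""
    rng = random.Random(seed)  # private generator, no shared state
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    return inside


num_threads = 64            # hypothetical; a real launch uses many thousands
samples_per_thread = 2_000

# Each entry plays the role of one slot in a per-thread results array.
results = [count_inside(seed=1000 + tid, samples=samples_per_thread)
           for tid in range(num_threads)]

total = num_threads * samples_per_thread
pi_estimate = 4.0 * sum(results) / total
print(f"{total:,} samples -> pi ~ {pi_estimate:.4f}")
```

No worker reads anything written by another, so the only communication is the final sum over `results`.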

In [None]:
if GPU_AVAILABLE:
    from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
    
    @cuda.jit
    def monte_carlo_pi_gpu(rng_states, results, samples_per_thread):
        """
        Estimate π using Monte Carlo on GPU.
        
        Each thread:
        1. Generates random points
        2. Counts points inside quarter circle
        3. Stores count in results array
        
        Args:
            rng_states: Random number generator states (one per thread)
            results: Output array (one element per thread)
            samples_per_thread: How many samples each thread should generate
        """
        thread_id = cuda.grid(1)
        
        if thread_id < results.size:
            inside = 0
            
            # Generate samples
            for _ in range(samples_per_thread):
                # Generate random point in [0,1] × [0,1]
                x = xoroshiro128p_uniform_float32(rng_states, thread_id)
                y = xoroshiro128p_uniform_float32(rng_states, thread_id)
                
                # Check if inside quarter circle
                if x*x + y*y <= 1.0:
                    inside += 1
            
            results[thread_id] = inside
    
    print("GPU Monte Carlo kernel defined!")
else:
    print("GPU not available")

In [None]:
if GPU_AVAILABLE:
    print("="*70)
    print("MONTE CARLO π ESTIMATION: CPU vs GPU")
    print("="*70)
    
    total_samples = 100_000_000  # 100 million
    
    # CPU baseline
    print(f"\nCPU version ({total_samples:,} samples)...")
    start = time.time()
    x = np.random.random(total_samples)
    y = np.random.random(total_samples)
    inside = np.sum(x*x + y*y <= 1.0)
    pi_cpu = 4.0 * inside / total_samples
    cpu_time = time.time() - start
    
    print(f"  Time: {cpu_time:.4f} seconds")
    print(f"  π estimate: {pi_cpu:.6f}")
    print(f"  Error: {abs(pi_cpu - np.pi):.6f}")
    
    # GPU version
    print(f"\nGPU version ({total_samples:,} samples)...")
    
    threads_per_block = 256
    blocks = 1024
    total_threads = threads_per_block * blocks
    samples_per_thread = total_samples // total_threads
    
    print(f"  Threads per block: {threads_per_block}")
    print(f"  Blocks: {blocks}")
    print(f"  Total threads: {total_threads:,}")
    print(f"  Samples per thread: {samples_per_thread:,}")
    
    # Initialize RNG states
    rng_states = create_xoroshiro128p_states(total_threads, seed=42)
    
    # Allocate result array
    results_gpu = cuda.device_array(total_threads, dtype=np.int32)
    
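    # Warm-up launch so JIT compilation is not timed (its results are overwritten below)
    monte_carlo_pi_gpu[blocks, threads_per_block](rng_states, results_gpu, 1)
    cuda.synchronize()
    
    # Launch kernel
    start = time.time()
    monte_carlo_pi_gpu[blocks, threads_per_block](rng_states, results_gpu, samples_per_thread)
    cuda.synchronize()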
    
    # Sum results
    results_cpu = results_gpu.copy_to_host()
    total_inside = np.sum(results_cpu)
    pi_gpu = 4.0 * total_inside / (total_threads * samples_per_thread)
    gpu_time = time.time() - start
    
    print(f"  Time: {gpu_time:.4f} seconds")
    print(f"  π estimate: {pi_gpu:.6f}")
    print(f"  Error: {abs(pi_gpu - np.pi):.6f}")
    
    # Calculate speedup
    speedup = cpu_time / gpu_time
    print(f"\nGPU speedup: {speedup:.2f}x")
    print(f"Samples per second:")
    print(f"  CPU: {total_samples/cpu_time/1e6:.2f} million/sec")
    print(f"  GPU: {total_threads * samples_per_thread/gpu_time/1e6:.2f} million/sec")
else:
    print("GPU not available")

## 7. When to Use GPU: Decision Guide
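
The trade-off at the heart of this section, transfer cost against compute savings, can be estimated with simple arithmetic before writing any kernel. The sketch below is a rough model under stated assumptions: a PCIe rate of 16 GB/s, a fixed launch overhead of 10 microseconds, and made-up CPU and kernel times. `gpu_total_time` and `worth_offloading` are hypothetical helper names; the timings should be replaced with measurements from your own machine.

```python
def gpu_total_time(bytes_in, bytes_out, kernel_s,
                   pcie_gb_s=16.0, launch_overhead_s=10e-6):
    """Estimated end-to-end GPU time: copy in + launch + kernel + copy out."""
    transfer_s = (bytes_in + bytes_out) / (pcie_gb_s * 1e9)
    return transfer_s + launch_overhead_s + kernel_s


def worth_offloading(cpu_s, bytes_in, bytes_out, kernel_s, **kwargs):
    """True when the estimated GPU total beats the given CPU time."""
    return gpu_total_time(bytes_in, bytes_out, kernel_s, **kwargs) < cpu_s


# Case 1 (assumed timings): adding two vectors of 1M float32 values.
# 8 MB in, 4 MB out; the kernel is fast but the copies are not.
vec = worth_offloading(cpu_s=0.5e-3, bytes_in=8_000_000,
                       bytes_out=4_000_000, kernel_s=0.05e-3)

# Case 2 (assumed timings): 512x512 float32 matrix multiply.
# Little data moves, but there is a lot of arithmetic per byte.
mat = worth_offloading(cpu_s=5e-3, bytes_in=2 * 512 * 512 * 4,
                       bytes_out=512 * 512 * 4, kernel_s=0.5e-3)

print("offload vector add:", vec)
print("offload matmul:    ", mat)
```

With these assumed numbers the copies alone cost more than the whole CPU vector addition, while the matrix multiply amortizes its transfers easily. That is the pattern the guide below describes.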

In [None]:
print("="*70)
print("GPU ACCELERATION DECISION GUIDE")
print("="*70)
print("""
✓ GOOD FIT FOR GPU:
  1. Data parallelism (same operation on many data elements)
     - Image/video processing
     - Matrix operations
     - Monte Carlo simulations
  
  2. High arithmetic intensity (compute/memory ratio)
     - Deep learning (convolutions, matrix multiplies)
     - Physics simulations
     - Computational geometry
  
  3. Large datasets (millions of elements)
     - GPU overhead is amortized
     - Memory bandwidth utilized
  
  4. Regular memory access patterns
     - Coalesced memory reads/writes
     - Predictable access patterns

✗ POOR FIT FOR GPU:
  1. Sequential algorithms
     - Dynamic programming
     - Recursive algorithms
     - Algorithms with data dependencies
  
  2. Small datasets (< 10K elements)
     - Transfer overhead dominates
     - Can't saturate GPU
  
  3. Irregular/random memory access
     - Sparse matrices
     - Graph algorithms
     - Tree traversals
  
  4. Frequent CPU ↔ GPU transfers
     - PCIe bandwidth is limited (~16 GB/s for PCIe 3.0 x16, ~32 GB/s for PCIe 4.0 x16)
     - Each transfer adds latency
  
  5. Heavy branching (if/else)
     - SIMT execution: the 32 threads of a warp execute in lockstep
     - Divergent branches within a warp are executed one after another

OPTIMIZATION PRIORITIES:
  1. Minimize CPU ↔ GPU transfers (biggest bottleneck)
  2. Maximize memory coalescing (aligned, sequential access)
  3. Use shared memory for reused data (far lower latency than global memory)
  4. Maximize occupancy (enough threads to hide latency)
  5. Avoid branch divergence within warps

PERFORMANCE EXPECTATIONS (rough rules of thumb, not guarantees):
  - Image processing: 5-20x speedup
  - Dense linear algebra: 10-50x speedup
  - Monte Carlo: 50-200x speedup
  - Deep learning: 10-100x speedup
  
  Note: Actual speedup depends on:
  - GPU model (consumer vs datacenter)
  - Problem size
  - Memory access patterns
  - Implementation quality
""")

## 8. Memory Bandwidth Analysis
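
The benchmark in this section reports effective bandwidth, which is the number of bytes moved divided by the elapsed time. The arithmetic is worth checking by hand first. In the sketch below, `effective_bandwidth_gb_s` is an illustrative helper name and the 1 ms elapsed time is a made-up figure, not a measurement.

```python
def effective_bandwidth_gb_s(n_elements: int, bytes_per_element: int,
                             elapsed_s: float,
                             accesses_per_element: int = 2) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes).

    accesses_per_element=2 models one read plus one write per element.
    """
    bytes_moved = accesses_per_element * n_elements * bytes_per_element
    return bytes_moved / elapsed_s / 1e9


# 100 MiB of float32 values processed in a hypothetical 1 ms.
n = 100 * 1024 * 1024 // 4
print(f"{effective_bandwidth_gb_s(n, 4, 1e-3):.1f} GB/s")
```

Because the kernel both reads and writes every element, the traffic is twice the array size. Counting only the input would understate the bandwidth by half.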

In [None]:
if GPU_AVAILABLE:
    @cuda.jit
    def memory_bandwidth_test(data, result):
        """
        Test memory bandwidth by reading and writing large arrays.
        """
        idx = cuda.grid(1)
        if idx < data.size:
            # Read from global memory, do minimal computation, write back
            result[idx] = data[idx] * 2.0
    
    print("Testing GPU memory bandwidth...\n")
    
    # Test different data sizes
    sizes_mb = [1, 10, 100, 500]
    
    for size_mb in sizes_mb:
        n_elements = size_mb * 1024 * 1024 // 4  # 4 bytes per float32
        
        # Create data
        data = np.random.rand(n_elements).astype(np.float32)
        
        # Transfer to GPU
        data_gpu = cuda.to_device(data)
        result_gpu = cuda.device_array(n_elements, dtype=np.float32)
        
        # Configure launch
        threads_per_block = 256
        blocks = (n_elements + threads_per_block - 1) // threads_per_block
        
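        # Warm-up launch so JIT compilation does not distort the first measurement
        memory_bandwidth_test[blocks, threads_per_block](data_gpu, result_gpu)
        cuda.synchronize()
        
        # Run kernel
        start = time.time()
        memory_bandwidth_test[blocks, threads_per_block](data_gpu, result_gpu)
        cuda.synchronize()
        elapsed = time.time() - start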
        
        # Calculate bandwidth (read + write)
        bytes_transferred = 2 * n_elements * 4  # 2 operations × 4 bytes
        bandwidth_gb_s = bytes_transferred / elapsed / 1e9
        
        print(f"Size: {size_mb:3d} MB, Time: {elapsed:.4f}s, Bandwidth: {bandwidth_gb_s:.2f} GB/s")
    
    print("\nNote: GPU memory bandwidth is typically a few hundred GB/s on consumer cards")
    print("and can exceed 1 TB/s on HBM-based datacenter GPUs.")
    print("Lower than theoretical? Check for:")
    print("  - Uncoalesced memory access")
    print("  - Small array sizes (launch overhead dominates)")
    print("  - PCIe bottleneck (if including CPU-GPU transfer)")
else:
    print("GPU not available")

## 9. Summary and Best Practices

In [None]:
print("="*70)
print("GPU PROGRAMMING BEST PRACTICES")
print("="*70)
print("""
1. MINIMIZE CPU ↔ GPU TRANSFERS:
   - Transfer data once at start
   - Keep data on GPU between kernel calls
   - Use pinned memory for faster transfers
   - Consider computing everything on GPU

2. MEMORY OPTIMIZATION:
   - Use shared memory for frequently accessed data
   - Coalesce global memory accesses
   - Minimize register usage per thread
   - Use constant memory for small read-only data (up to 64 KB)

3. THREAD CONFIGURATION:
   - Threads per block: 128-256 (multiples of 32)
   - Maximize occupancy (check with profiler)
   - Consider warp size (32) for efficiency
   - Use 2D/3D blocks for 2D/3D problems

4. ALGORITHM DESIGN:
   - Avoid branch divergence within warps
   - Design for regular access patterns
   - Use reduction for combining results
   - Consider tiling for better cache usage

5. DEBUGGING:
   - Start with small data and single block
   - Print from kernel (only for debugging!)
   - Check bounds carefully (cuda.grid(1) calculations)
   - Verify results against CPU version

6. PROFILING:
   - Mark regions with cuda.profile_start() / cuda.profile_stop()
   - Analyze with NVIDIA Nsight Systems or Nsight Compute
   - Check memory bandwidth utilization
   - Look for bottlenecks (memory vs compute)

7. COMMON PITFALLS:
   - Forgetting cuda.synchronize()
   - Not checking array bounds
   - Race conditions in shared memory
   - Bank conflicts in shared memory
   - Excessive register usage (reduces occupancy)
""")

if GPU_AVAILABLE:
    print("\n" + "="*70)
    print("YOUR GPU PERFORMANCE SUMMARY")
    print("="*70)
    gpu = cuda.get_current_device()
    total_bytes = cuda.current_context().get_memory_info()[1]  # (free, total) in bytes
    print(f"GPU: {gpu.name.decode('utf-8')}")
    print(f"Memory: {total_bytes / 1e9:.2f} GB")
    print(f"Compute Capability: {gpu.compute_capability}")
    print("\nRecommended configuration for this GPU:")
    print(f"  Suggested threads per block: {min(gpu.MAX_THREADS_PER_BLOCK, 256)}")
    print(f"  Optimal block size: 128-256 threads")
    print(f"  Shared memory per block: {gpu.MAX_SHARED_MEMORY_PER_BLOCK / 1024:.0f} KB")

print("\n" + "="*70)
print("NEXT STEPS")
print("="*70)
print("""
Module 08: Performance Benchmarking
  - Comprehensive comparison: Serial vs Parallel vs GPU
  - Scalability analysis (strong and weak scaling)
  - Statistical analysis of performance
  - Visualization of results

Module 09: Optimization Techniques
  - Cache optimization strategies
  - Load balancing improvements
  - Hybrid CPU-GPU approaches
  - Iterative refinement process
""")

## 10. Exercises

### Exercise 1: Parallel Reduction on GPU

Implement a parallel sum reduction using shared memory.

**Challenge**: How do you combine results from multiple blocks?
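
As a warm-up before writing the kernel, the index pattern can be traced on the CPU. The sketch below is plain Python, not a solution to the exercise: `block_reduce` and `reduce_in_blocks` are illustrative names, a list stands in for shared memory, and the block size of 8 is chosen only to keep the output readable. It uses the doubling stride pattern (1, 2, 4, ...).

```python
def block_reduce(values):
    """Simulate a tree reduction inside one block (length must be a power of two)."""
    shared = list(values)  # stand-in for the block's shared-memory array
    n = len(shared)
    stride = 1
    while stride < n:
        # On a GPU the active threads perform this step in parallel,
        # with a barrier between steps. Reads and writes do not overlap
        # within a step, so a sequential loop gives the same result.
        for tid in range(0, n, 2 * stride):
            shared[tid] += shared[tid + stride]
        stride *= 2
    return shared[0]  # element 0 now holds the block's sum


def reduce_in_blocks(data, block_size=8):
    """Stage 1: one partial sum per block. Stage 2: combine the partial sums."""
    block_sums = []
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        chunk += [0] * (block_size - len(chunk))  # pad the last block with zeros
        block_sums.append(block_reduce(chunk))
    return sum(block_sums), block_sums


data = list(range(1, 21))  # 1 + 2 + ... + 20 = 210
total, block_sums = reduce_in_blocks(data)
print("block sums:", block_sums)
print("total:", total)
```

The second stage here is an ordinary sum on the host, which is one possible answer to the challenge; launching the reduction again on the block sums is another.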

In [None]:
if GPU_AVAILABLE:
    @cuda.jit
    def parallel_sum_gpu(data, block_sums):
        """
        Sum array elements using parallel reduction.
        
        Strategy:
        1. Each thread loads one element into shared memory
        2. Perform tree-based reduction within block
        3. First thread writes block sum to global memory
        """
        # TODO: Implement parallel reduction
        # Hints:
        # - Use cuda.shared.array() for temporary storage
        # - Use cuda.syncthreads() between reduction steps
        # - Reduction pattern: stride = 1, 2, 4, 8, ...
        pass
    
    # Test your implementation
    # data = np.random.rand(1000000).astype(np.float32)
    # expected_sum = np.sum(data)
    # ... launch kernel ...
    # gpu_sum = np.sum(block_sums_cpu)  # Sum the block sums
    # print(f"CPU sum: {expected_sum:.6f}")
    # print(f"GPU sum: {gpu_sum:.6f}")
    # print(f"Match: {np.isclose(expected_sum, gpu_sum)}")
else:
    print("GPU not available - exercise skipped")

### Exercise 2: Image Histogram on GPU

Compute image histogram using GPU.

**Challenge**: Multiple threads may try to increment the same bin (atomic operations needed).

In [None]:
if GPU_AVAILABLE:
    @cuda.jit
    def histogram_gpu(image, histogram):
        """
        Compute histogram on GPU.
        
        Use cuda.atomic.add() to safely increment bins.
        """
        # TODO: Implement GPU histogram
        # Hint: cuda.atomic.add(array, index, value)
        pass
    
    # Test
    # test_img = create_test_image((1024, 1024))
    # histogram_cpu = np.bincount(test_img.ravel(), minlength=256)
    # ... GPU version ...
    # Compare CPU vs GPU results
else:
    print("GPU not available - exercise skipped")

### Exercise 3: Performance Analysis

Measure the overhead of CPU-GPU transfers.

**Task**: For different data sizes, measure:
1. CPU → GPU transfer time
2. Kernel execution time
3. GPU → CPU transfer time
4. Total time

Plot the results to see when GPU becomes worthwhile.

In [None]:
# TODO: Implement transfer overhead analysis
# Test with sizes: 1KB, 10KB, 100KB, 1MB, 10MB, 100MB
# Plot: time vs data size, showing transfer vs compute breakdown

# Your code here...

## Summary

In this module, you learned:

1. **GPU Architecture**
   - CUDA execution model (grid, blocks, threads)
   - Memory hierarchy (global, shared, constant, registers)
   - SIMT execution and warps

2. **CUDA Programming with Numba**
   - Writing GPU kernels in Python
   - Managing GPU memory
   - Thread indexing and synchronization

3. **Optimization Techniques**
   - Shared memory for data reuse
   - Memory coalescing
   - Minimizing CPU-GPU transfers
   - Proper thread configuration

4. **Performance Analysis**
   - CPU vs GPU benchmarking
   - When GPU acceleration is beneficial
   - Memory bandwidth considerations

**What's Next?**

- **Module 08**: Comprehensive benchmarking framework
- **Module 09**: Advanced optimization and hybrid approaches