# CUDA Level 101: A Beginner's Guide to GPU Programming

Welcome to this comprehensive introduction to CUDA programming!

## What You'll Learn
1. What is CUDA and why use it?
2. Setting up your environment
3. Understanding GPU architecture
4. Your first CUDA kernel
5. Memory management
6. Working with real data (cuda_sample_data.csv)
7. CPU vs GPU comparison
8. Summary

## 1. What is CUDA?

**CUDA** (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model for running general-purpose code on NVIDIA GPUs.

### Why Use CUDA?
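- **Massive Parallelism**: Thousands of lightweight threads run concurrently across the GPU's cores
- **Performance**: Often much faster than a CPU for large, data-parallel workloads; small or mostly sequential tasks may see little or no benefit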
- **Applications**: ML, scientific computing, image processing

| Feature | CPU | GPU |
|---------|-----|-----|
| Cores | Few (4-64) | Many (thousands) |
| Best For | Sequential tasks | Parallel tasks |

## 2. Setting Up Your Environment

In [None]:
# Check NVIDIA driver and CUDA
!nvidia-smi
!nvcc --version

In [None]:
# Install required packages
# Note: cupy-cuda12x targets CUDA 12.x; pick the CuPy wheel that matches your CUDA version
!pip install numba cupy-cuda12x numpy pandas matplotlib

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from numba import cuda
import math
import time

print(f"CUDA available: {cuda.is_available()}")
if cuda.is_available():
    print(f"CUDA Device: {cuda.get_current_device().name}")

## 3. Understanding GPU Architecture

### Key Concepts
- **Host**: CPU and its memory
- **Device**: GPU and its memory
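- **Kernel**: Function that runs on the GPU, executed by many threads in parallel
- **Thread**: Smallest execution unit; each thread runs one instance of the kernel
- **Block**: Group of threads that can synchronize and share memory
- **Grid**: All the blocks launched by a single kernel call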
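
To make the thread/block/grid hierarchy concrete, the pure-Python sketch below mimics how a one-dimensional global thread index is derived from the block index, the block size, and the thread index within the block. This is the same arithmetic that `cuda.grid(1)` performs in the kernels later in this notebook. The helper name `global_index` is made up for illustration and is not part of Numba.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Flat index of a thread in a 1D grid (the value cuda.grid(1) yields)."""
    return block_idx * block_dim + thread_idx

# Enumerate a tiny grid: 3 blocks of 4 threads each cover indices 0..11
block_dim, grid_dim = 4, 3
indices = [global_index(b, block_dim, t)
           for b in range(grid_dim)
           for t in range(block_dim)]
print(indices)
```

Each thread gets a unique index, which is why a kernel can use it directly to pick the array element it is responsible for.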

In [None]:
# Get GPU information
if cuda.is_available():
    device = cuda.get_current_device()
    print(f"Device: {device.name}")
    print(f"Compute Capability: {device.compute_capability}")
    print(f"Max Threads/Block: {device.MAX_THREADS_PER_BLOCK}")

## 4. Your First CUDA Kernel

In [None]:
# Simple vector addition kernel
@cuda.jit
def add_kernel(a, b, result):
    idx = cuda.grid(1)
    if idx < a.size:
        result[idx] = a[idx] + b[idx]

In [None]:
# Test the kernel
n = 1000000
a = np.ones(n, dtype=np.float32)
b = np.ones(n, dtype=np.float32) * 2

# Copy the inputs to the device
d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
# Allocate the output directly on the device; there is no need to copy zeros from the host
d_result = cuda.device_array_like(a)

# Launch kernel
threads = 256
blocks = (n + threads - 1) // threads
add_kernel[blocks, threads](d_a, d_b, d_result)

# Get result
result = d_result.copy_to_host()
print(f"Result: {result[:10]}")
print(f"All correct: {np.allclose(result, 3.0)}")
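
The launch configuration above rounds the block count up so that every element is covered, which is why the kernel needs its `if idx < a.size` guard. The sketch below spells out that arithmetic in plain Python; `launch_config` is a hypothetical helper, not a Numba function.

```python
def launch_config(n, threads_per_block=256):
    """Return (blocks, idle_threads) for a 1D launch covering n elements."""
    # Ceiling division: enough blocks so that blocks * threads_per_block >= n
    blocks = (n + threads_per_block - 1) // threads_per_block
    # Threads in the last block that fall past the end of the array
    idle = blocks * threads_per_block - n
    return blocks, idle

blocks, idle = launch_config(1_000_000)
print(blocks, idle)  # 3907 blocks, 192 idle threads
```

The idle threads are the ones the bounds check filters out; without it they would read and write past the end of the arrays.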

## 5. Memory Management

### Memory Types
1. **Global Memory**: Largest, slowest
2. **Shared Memory**: Fast, shared within block
3. **Registers**: Fastest, very limited
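
Because shared memory is a small per-block resource, it helps to estimate how much a kernel will request. The sketch below does that arithmetic for two square `float32` tiles per block, as in the matrix multiplication cell in this section. The 48 KB budget is an assumption (a common default on many NVIDIA GPUs), so check your own device's limit; `shared_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4
TILES_PER_BLOCK = 2            # one tile of A and one tile of B
ASSUMED_LIMIT = 48 * 1024      # assumed per-block budget in bytes; device-dependent

def shared_bytes(tile, n_tiles=TILES_PER_BLOCK, itemsize=BYTES_PER_FLOAT32):
    """Bytes of shared memory used by n_tiles square tiles of width `tile`."""
    return n_tiles * tile * tile * itemsize

for tile in (16, 32):
    used = shared_bytes(tile)
    print(f"tile={tile}: {used} bytes, fits={used <= ASSUMED_LIMIT}")
```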

In [None]:
# Matrix multiplication with shared memory
from numba import float32

TILE_SIZE = 16

@cuda.jit
def matmul_shared(A, B, C):
    sA = cuda.shared.array((TILE_SIZE, TILE_SIZE), dtype=float32)
    sB = cuda.shared.array((TILE_SIZE, TILE_SIZE), dtype=float32)
    
    x, y = cuda.grid(2)
    tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
    
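    # Accumulate in float32 to match the arrays' dtype
    tmp = float32(0.0)
    for i in range((A.shape[1] + TILE_SIZE - 1) // TILE_SIZE):
        # Every thread in the block must help load the tiles and reach both
        # barriers, so out-of-range threads do not return early; they load
        # zeros instead.
        if x < A.shape[0] and (i * TILE_SIZE + ty) < A.shape[1]:
            sA[tx, ty] = A[x, i * TILE_SIZE + ty]
        else:
            sA[tx, ty] = 0.0
        if y < B.shape[1] and (i * TILE_SIZE + tx) < B.shape[0]:
            sB[tx, ty] = B[i * TILE_SIZE + tx, y]
        else:
            sB[tx, ty] = 0.0
        cuda.syncthreads()  # wait until the whole tile is loaded
        for j in range(TILE_SIZE):
            tmp += sA[tx, j] * sB[j, ty]
        cuda.syncthreads()  # wait before the tiles are overwritten
    # Only threads that map to a valid element of C write a result
    if x < C.shape[0] and y < C.shape[1]:
        C[x, y] = tmp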
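
The tiling logic in `matmul_shared` can be hard to follow inside a kernel, so here is a minimal CPU sketch of the same idea in pure Python: the inner dimension is walked tile by tile, and positions past the edge contribute zero, just like the zero-padded shared tiles. The function names and the tile width of 2 are illustrative choices, not part of the notebook's kernel.

```python
TILE = 2  # small tile so the example is easy to trace; the kernel uses 16

def matmul_naive(A, B):
    """Reference product of two matrices given as lists of rows."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[r][k] * B[k][c] for k in range(inner))
             for c in range(cols)] for r in range(rows)]

def matmul_tiled(A, B, tile=TILE):
    """Same product, accumulated one tile of the inner dimension at a time."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]
    n_tiles = (inner + tile - 1) // tile  # ceiling division, as in the kernel
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for t in range(n_tiles):
                for j in range(tile):
                    k = t * tile + j
                    # Out-of-range positions act as zeros (the padded tile)
                    a = A[r][k] if k < inner else 0.0
                    b = B[k][c] if k < inner else 0.0
                    acc += a * b
            C[r][c] = acc
    return C

A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # 2 x 3
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]   # 3 x 2
print(matmul_tiled(A, B))  # [[58.0, 64.0], [139.0, 154.0]]
```

To launch the kernel itself, one option is a 2D configuration such as `matmul_shared[(bx, by), (TILE_SIZE, TILE_SIZE)](d_A, d_B, d_C)`, where `bx` and `by` are the row and column counts of `C` divided by `TILE_SIZE` and rounded up. The block shape must be `TILE_SIZE` by `TILE_SIZE` for the tile indexing to be valid.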

## 6. Working with Real Data: cuda_sample_data.csv

Let's process our 100,000-record dataset with GPU acceleration!

In [None]:
# Load the dataset
df = pd.read_csv('cuda_sample_data.csv')
print(f"Dataset shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()

In [None]:
# Extract vector data
vec_a_x = df['vector_a_x'].values.astype(np.float32)
vec_a_y = df['vector_a_y'].values.astype(np.float32)
vec_a_z = df['vector_a_z'].values.astype(np.float32)
vec_b_x = df['vector_b_x'].values.astype(np.float32)
vec_b_y = df['vector_b_y'].values.astype(np.float32)
vec_b_z = df['vector_b_z'].values.astype(np.float32)

print(f"Loaded {len(vec_a_x):,} vectors")

In [None]:
# CUDA kernels for vector operations
@cuda.jit
def dot_product_kernel(ax, ay, az, bx, by, bz, result):
    idx = cuda.grid(1)
    if idx < ax.size:
        result[idx] = ax[idx]*bx[idx] + ay[idx]*by[idx] + az[idx]*bz[idx]

@cuda.jit
def magnitude_kernel(x, y, z, result):
    idx = cuda.grid(1)
    if idx < x.size:
        result[idx] = math.sqrt(x[idx]**2 + y[idx]**2 + z[idx]**2)

@cuda.jit
def cross_product_kernel(ax, ay, az, bx, by, bz, rx, ry, rz):
    idx = cuda.grid(1)
    if idx < ax.size:
        rx[idx] = ay[idx]*bz[idx] - az[idx]*by[idx]
        ry[idx] = az[idx]*bx[idx] - ax[idx]*bz[idx]
        rz[idx] = ax[idx]*by[idx] - ay[idx]*bx[idx]
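
A quick way to sanity-check the formulas used in `cross_product_kernel` is a host-side reference: the cross product of two vectors should be orthogonal to both inputs. The sketch below uses plain Python tuples; the helper names `dot` and `cross` are illustrative.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    """Cross product, using the same component formulas as the kernel."""
    ax, ay, az = a
    bx, by, bz = b
    return (ay * bz - az * by,
            az * bx - ax * bz,
            ax * by - ay * bx)

a, b = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)
c = cross(a, b)
print(c)                     # (-3.0, 6.0, -3.0)
print(dot(c, a), dot(c, b))  # 0.0 0.0
```

The same orthogonality check can be applied to a few rows of the GPU output after copying `rx`, `ry`, and `rz` back to the host, allowing for small floating-point error.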

In [None]:
# Process on GPU
n = len(vec_a_x)

# Transfer to device
d_ax, d_ay, d_az = cuda.to_device(vec_a_x), cuda.to_device(vec_a_y), cuda.to_device(vec_a_z)
d_bx, d_by, d_bz = cuda.to_device(vec_b_x), cuda.to_device(vec_b_y), cuda.to_device(vec_b_z)

# Allocate results
d_dots = cuda.device_array(n, dtype=np.float32)
d_mags = cuda.device_array(n, dtype=np.float32)

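# Configure the launch
threads, blocks = 256, (n + 255) // 256

# Warm-up launches: the first call to each kernel triggers JIT compilation,
# which would otherwise dominate the measured time
dot_product_kernel[blocks, threads](d_ax, d_ay, d_az, d_bx, d_by, d_bz, d_dots)
magnitude_kernel[blocks, threads](d_ax, d_ay, d_az, d_mags)
cuda.synchronize()

# Time the kernels only (host/device transfers are not included)
start = time.perf_counter()
dot_product_kernel[blocks, threads](d_ax, d_ay, d_az, d_bx, d_by, d_bz, d_dots)
magnitude_kernel[blocks, threads](d_ax, d_ay, d_az, d_mags)
cuda.synchronize()  # kernel launches are asynchronous; wait for completion
gpu_time = time.perf_counter() - start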

# Get results
dots = d_dots.copy_to_host()
mags = d_mags.copy_to_host()

print(f"GPU Time: {gpu_time:.4f}s")
print(f"Dot products: {dots[:5]}")
print(f"Magnitudes: {mags[:5]}")

## 7. CPU vs GPU Comparison

In [None]:
# CPU implementation (vectorized NumPy)
start = time.perf_counter()
cpu_dots = vec_a_x*vec_b_x + vec_a_y*vec_b_y + vec_a_z*vec_b_z
cpu_mags = np.sqrt(vec_a_x**2 + vec_a_y**2 + vec_a_z**2)
cpu_time = time.perf_counter() - start

print(f"CPU Time: {cpu_time:.4f}s")
print(f"GPU Time: {gpu_time:.4f}s")
print(f"Speedup: {cpu_time/gpu_time:.2f}x")
print(f"Results match: {np.allclose(dots, cpu_dots) and np.allclose(mags, cpu_mags)}")
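
A single timed run is noisy, especially for work that finishes in well under a millisecond, so the speedup printed above should be read with caution. One common remedy is to repeat the measurement and keep the best time. The sketch below shows that pattern with the standard library only; `best_of` is a hypothetical helper and the lambda is a stand-in workload.

```python
import time

def best_of(fn, repeats=5):
    """Return the shortest wall-clock time over several calls to fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workload; replace with the CPU or GPU computation being measured
elapsed = best_of(lambda: sum(i * i for i in range(10_000)))
print(f"best of 5: {elapsed:.6f}s")
```

When timing GPU work this way, the function passed in should end with `cuda.synchronize()`, since kernel launches return before the kernel has finished. Keep in mind that with only 100,000 elements and simple arithmetic, the GPU may not beat NumPy once transfer costs are counted.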

## 8. Summary

### What We Covered
- CUDA fundamentals and GPU architecture
- Writing and launching kernels
- Memory management
- Processing real data with GPU acceleration

### Resources
- [NVIDIA CUDA Documentation](https://docs.nvidia.com/cuda/)
- [Numba CUDA Documentation](https://numba.pydata.org/numba-doc/latest/cuda/)

In [None]:
# Exercise: Implement vector normalization on GPU
@cuda.jit
def normalize_kernel(x, y, z, rx, ry, rz):
    # Your code here
    pass
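
To check your kernel once it is written, a host-side reference helps. The sketch below normalizes a single 3D vector in plain Python and returns a zero vector when the magnitude is near zero; that zero-vector convention and the `eps` threshold are assumptions made for this example, so adjust them to whatever behavior you want from your kernel.

```python
import math

def normalize(v, eps=1e-12):
    """Return v scaled to unit length, or a zero vector if v is (near) zero."""
    x, y, z = v
    mag = math.sqrt(x * x + y * y + z * z)
    if mag < eps:
        # Assumed convention: avoid dividing by zero
        return (0.0, 0.0, 0.0)
    return (x / mag, y / mag, z / mag)

print(normalize((3.0, 0.0, 4.0)))
print(normalize((0.0, 0.0, 0.0)))
```

A correct GPU result should have magnitude close to 1 for every non-zero input vector.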