# Accelerated Computing with CuPy

## Table of Contents
1. [Creating Arrays: CPU vs. GPU](#1.-Creating-Arrays:-CPU-vs.-GPU)
2. [Basic Operations](#2.-Basic-Operations)
   - [Sequential Operations & Memory](#Sequential-Operations-&-Memory)
3. [Complex Operations (Linear Algebra)](#3.-Complex-Operations-(Linear-Algebra))
   - [Agnostic Code (NumPy Dispatch)](#Agnostic-Code-(NumPy-Dispatch))
4. [Device Management](#4.-Device-Management)
5. [Exercise - NumPy to CuPy](#Exercise---NumPy-to-CuPy)
   - [Part 1](#Part-1)
   - [Part 2](#Part-2)

---

Let's shift gears to high-level array functionality using **[CuPy](https://cupy.dev/)**.

### What is CuPy?
CuPy is a library that implements the familiar **NumPy API** but runs on the GPU (using CUDA C++ in the backend). 

**Why use it?**
* **Zero Friction:** If you know NumPy, you already know CuPy.
* **Speed:** It provides out-of-the-box GPU acceleration for array operations.
* **Ease of use:** You can often port CPU code to GPU simply by changing `import numpy as np` to `import cupy as cp`.

In [None]:
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark

# Helper to display benchmark results concisely.
# We use CuPy's benchmark() throughout this notebook for accurate GPU timing.
def print_benchmark(result, device="gpu"):
    """Print benchmark result showing only the relevant time."""
    # Pick the timing array that matches where the work actually ran.
    times = result.gpu_times if device == "gpu" else result.cpu_times
    avg_ms = times.mean() * 1000
    std_ms = times.std() * 1000
    print(f"{result.name}: {avg_ms:.3f} ms +/- {std_ms:.3f} ms")

## 1. Creating Arrays: CPU vs. GPU

Let's compare the performance of creating a large 3D array (approx. 2GB in size) on the CPU versus the GPU.

We will use `np.ones` for the CPU and `cp.ones` for the GPU.

In [None]:
# CPU creation
print_benchmark(benchmark(np.ones, ((1000, 500, 500),), n_repeat=10), device="cpu")

In [None]:
# GPU creation
print_benchmark(benchmark(cp.ones, ((1000, 500, 500),), n_repeat=10), device="gpu")

We can see here that creating this array on the GPU is much faster than doing so on the CPU!
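
The "approx. 2GB" figure follows from the shape and the default dtype (`float64`, 8 bytes per element). Here is a small sketch of that arithmetic, which runs on the CPU without allocating the array. The helper name `array_nbytes` is ours for illustration and is not part of NumPy or CuPy.

```python
import math
import numpy as np

def array_nbytes(shape, dtype=np.float64):
    """Bytes needed for a dense array of the given shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize

shape = (1000, 500, 500)
print(f"float64: {array_nbytes(shape) / 1e9:.1f} GB")
print(f"float32: {array_nbytes(shape, np.float32) / 1e9:.1f} GB")
```

Switching to `float32` halves the footprint, which can matter when the array has to fit in GPU memory.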

**About `cupyx.profiler.benchmark`:**

We use CuPy's built-in `benchmark` utility for timing GPU operations. This is important because GPU operations are **asynchronous** - when you call a CuPy function, the CPU places a task in the GPU's "to-do list" (stream) and immediately moves on without waiting.

The `benchmark` function handles all the complexity of proper GPU timing for us:
- It automatically synchronizes GPU streams to get accurate measurements.
- It runs warm-up iterations to avoid cold-start overhead.
- It reports both CPU and GPU times separately.

This makes it the recommended way to time CuPy code, as it's both accurate and convenient.
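
To see what `benchmark` protects us from, here is a rough sketch that times one call with and without waiting for the GPU. It is not how `benchmark` is implemented. The helpers `load_array_module` and `wall_time_ms` are hypothetical names, and the sketch assumes it should fall back to NumPy when CuPy or a CUDA device is unavailable, so it also runs on a CPU-only machine.

```python
import time
import numpy as np

def load_array_module():
    """Return CuPy if it imports and a CUDA device responds, else NumPy."""
    try:
        import cupy as cp
        cp.zeros(1)  # forces CUDA initialisation; fails without a usable GPU
        return cp
    except Exception:
        return np

xp = load_array_module()
on_gpu = xp is not np

def wall_time_ms(func, *args, sync=True):
    """Wall-clock time of one call, optionally waiting for the GPU to finish."""
    start = time.perf_counter()
    func(*args)
    if sync and on_gpu:
        # Block until all work queued on the current stream has completed.
        xp.cuda.get_current_stream().synchronize()
    return (time.perf_counter() - start) * 1000

x = xp.ones((1000, 1000))
wall_time_ms(xp.matmul, x, x)  # warm-up call, result discarded

synced = wall_time_ms(xp.matmul, x, x, sync=True)
naive = wall_time_ms(xp.matmul, x, x, sync=False)
print(f"{xp.__name__}: naive {naive:.3f} ms, synchronized {synced:.3f} ms")
```

On a GPU, the naive number mostly reflects how long it took to queue the work, not to finish it. On the NumPy fallback the two numbers should roughly agree, because the call is synchronous.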

## 2. Basic Operations

The syntax for mathematical operations is identical. Let's multiply every value in our arrays by `5`.

In [None]:
# Create fresh arrays for the benchmark
x_cpu = np.ones((1000, 500, 500))
x_gpu = cp.ones((1000, 500, 500))

def multiply(x):
    return x * 5

# CPU Operation
print_benchmark(benchmark(multiply, (x_cpu,), n_repeat=10), device="cpu")

In [None]:
# GPU Operation
print_benchmark(benchmark(multiply, (x_gpu,), n_repeat=10), device="gpu")

The GPU completes this operation notably faster, with the code staying the same.

### Sequential Operations & Memory

Now let's chain a few operations. In the earlier Numba examples, a sequence like this paid for host-device transfers at each step unless we managed device memory explicitly.

In [None]:
def sequential_math(x):
    x = x * 5
    x = x * x
    x = x + x
    return x

# CPU: Sequential math
print_benchmark(benchmark(sequential_math, (x_cpu,), n_repeat=10), device="cpu")

In [None]:
# GPU: Sequential math
print_benchmark(benchmark(sequential_math, (x_gpu,), n_repeat=10), device="gpu")

The GPU ran that much faster even without us explicitly managing memory. CuPy keeps `x_gpu` and every intermediate result in GPU memory, so nothing is copied between host and device from one step to the next.
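
A minimal sketch of that data flow: one copy onto the device, a chain of operations that stay there, and one copy back. It uses `cp.asarray` and `cp.asnumpy` for the two explicit transfers. The `to_host` helper is a hypothetical name, and the NumPy fallback is an assumption so the sketch also runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails if no usable CUDA device is present
    xp = cp
except Exception:
    xp = np

def to_host(a):
    """Explicit device-to-host copy (a no-op conversion on the NumPy fallback)."""
    return cp.asnumpy(a) if xp is not np else np.asarray(a)

host_in = np.ones((1000, 500))

x = xp.asarray(host_in)  # host -> device, once
x = x * 5                # every intermediate below lives where x lives
x = x * x
x = x + x
host_out = to_host(x)    # device -> host, once

print(type(host_out).__name__, host_out[0, 0])
```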

## 3. Complex Operations (Linear Algebra)

GPUs excel at Linear Algebra. Let's look at **Singular Value Decomposition (SVD)**, a computationally heavy $O(N^3)$ operation.

In [None]:
# CPU SVD
x_cpu = np.random.random((1000, 1000))
print_benchmark(benchmark(np.linalg.svd, (x_cpu,), n_repeat=5), device="cpu")

In [None]:
# GPU SVD
x_gpu = cp.random.random((1000, 1000))
print_benchmark(benchmark(cp.linalg.svd, (x_gpu,), n_repeat=5), device="gpu")

The GPU outperforms the CPU again with exactly the same API!

### Agnostic Code (NumPy Dispatch)

A key feature of CuPy is that many **NumPy functions work on CuPy arrays without changing your code**.

When you pass a CuPy GPU array (`x_gpu`) into a NumPy function that supports the `__array_function__` protocol (e.g., `np.linalg.svd`), NumPy detects the CuPy input and **delegates the operation to CuPy's own implementation**, which runs on the GPU.

This allows you to write code using standard `np.*` syntax and have it run on either CPU or GPU seamlessly - **as long as CuPy implements an override for that function.**

CuPy also protects you from hidden performance penalties: **it forbids implicit GPU → CPU copies**, raising a `TypeError` when NumPy tries to convert a `cupy.ndarray` into a `numpy.ndarray` behind the scenes. This ensures all device-to-host transfers are **explicit and intentional**, never silent.

In [None]:
# We create the data on the GPU
x_gpu = cp.random.random((1000, 1000))

# BUT we call the standard NumPy function - NumPy hands it off to CuPy, which runs it on the GPU!
print_benchmark(benchmark(np.linalg.svd, (x_gpu,), n_repeat=5), device="gpu")
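
When a function has no CuPy override, one alternative is to pick the module explicitly from the input array with `cupy.get_array_module`. The sketch below assumes CuPy may not be installed and falls back to NumPy in that case. `array_module` and `normalize_rows` are hypothetical helper names.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # CPU-only environment

def array_module(x):
    """Return the module (numpy or cupy) that matches the array x."""
    if cp is not None:
        return cp.get_array_module(x)
    return np

def normalize_rows(x):
    """Divide each row by its sum, on whichever device x lives."""
    xp = array_module(x)
    return x / xp.sum(x, axis=1, keepdims=True)

m = np.array([[1.0, 3.0], [2.0, 2.0]])
print(normalize_rows(m))
```

Passing a `cupy.ndarray` to `normalize_rows` would select `cupy` instead, so the same function body serves both devices.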

## 4. Device Management

If you have multiple GPUs, CuPy uses the concept of a "Current Device" context. 

You can use a `with` statement to ensure specific arrays are created on specific cards (e.g., GPU 0 vs GPU 1).

In [None]:
with cp.cuda.Device(0):
    x_on_gpu0 = cp.random.random((100000, 1000))

print(f"Array is on device: {x_on_gpu0.device}")

**Note:** CuPy functions generally expect all input arrays to be on the **same** device. Passing an array stored on a non-current device may work depending on the hardware configuration but is generally discouraged as it may not be performant.
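
As a rough sketch of the "current device" idea, the loop below creates a small array on every visible GPU and records where each one ended up. It assumes that a machine without CuPy or without a CUDA device should simply report nothing. The `placements` dictionary is an illustrative name.

```python
try:
    import cupy as cp
    n_devices = cp.cuda.runtime.getDeviceCount()
except Exception:
    cp, n_devices = None, 0  # no CuPy or no CUDA device available

placements = {}
for dev_id in range(n_devices):
    # Arrays created inside the block are allocated on that device.
    with cp.cuda.Device(dev_id):
        a = cp.arange(4)
    placements[dev_id] = a.device.id

print(placements or "no CUDA device found")
```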

---

## Exercise - NumPy to CuPy

### Part 1
Let's put the "Drop-in Replacement" philosophy to the test with the same data pipeline as the previous notebook. Specifically, the single block of code below performs the following steps:
1) Generate a massive dataset (50 million elements).
2) Process it using a heavy operation (Sorting).
3) Manipulate the shape and normalize the data (Broadcasting).
4) Verify the integrity of the result.

**TODO:**
1. Run the cell below with `xp = np` (CPU Mode). Note the benchmark output.
2. Change the setup line to `xp = cp` (GPU Mode). Run it again.
3. Observe how the exact same logic runs significantly faster on the GPU with CuPy while behaving just as it does in NumPy.

Note: We use `cupyx.profiler.benchmark` for timing, which automatically handles GPU synchronization.

In [None]:
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark

# Re-defined here so this exercise cell is self-contained and can run independently.
def print_benchmark(result, device="gpu"):
    """Print benchmark result showing only the relevant time."""
    if device == "gpu":
        avg_ms = result.gpu_times.mean() * 1000
        std_ms = result.gpu_times.std() * 1000
    else:
        avg_ms = result.cpu_times.mean() * 1000
        std_ms = result.cpu_times.std() * 1000
    print(f"  -> {result.name}: {avg_ms:.3f} ms +/- {std_ms:.3f} ms")

# --- 1. SETUP: CHOOSE YOUR DEVICE ---
# SOLUTION: Changed from 'np' to 'cp' for GPU acceleration
xp = cp  # Toggle this to 'np' for CPU mode

print(f"Running on: {xp.__name__.upper()}")

# --- 2. DATA GENERATION ---
N = 50_000_000
print(f"Generating {N:,} random elements ({N*8/1e9:.2f} GB)...")
arr = xp.random.rand(N)

# --- 3. HEAVY COMPUTATION (TIMED) ---
print("Sorting data...")
# benchmark() handles GPU synchronization automatically
result = benchmark(xp.sort, (arr,), n_repeat=5)
print_benchmark(result, device="gpu" if xp is cp else "cpu")

# --- 4. MANIPULATION & BROADCASTING ---
# Purpose: Demonstrate that CuPy supports complex reshaping and broadcasting rules exactly like NumPy.
# This shows you don't need to rewrite your data processing logic.

# Reshape to a matrix with 5 columns
arr_new = arr.reshape((-1, 5))

# Normalize: Divide every row by its sum using broadcasting
row_sums = arr_new.sum(axis=1)
normalized_matrix = arr_new / row_sums[:, xp.newaxis]

# --- 5. VERIFICATION ---
# Purpose: Verify mathematical correctness/integrity of the result.
check_sums = xp.sum(normalized_matrix, axis=1)
xp.testing.assert_allclose(check_sums, 1.0)

print("  -> Verification: PASSED (All rows sum to 1.0)")

**TODO: When working with CuPy arrays, try changing `xp.testing.assert_allclose` to `np.testing.assert_allclose`. What happens and why?**

**SOLUTION:**

When you change `xp.testing.assert_allclose` to `np.testing.assert_allclose` while working with CuPy arrays (`xp = cp`), you will get a **`TypeError`**.

This happens because:

1. `np.testing.assert_allclose` internally tries to convert its inputs to NumPy arrays.
2. CuPy arrays live on the GPU, and CuPy **explicitly forbids implicit GPU → CPU transfers**.
3. When NumPy's `assert_allclose` attempts to call `np.asarray()` on the CuPy array, CuPy raises a `TypeError` to prevent a silent (and potentially slow) data copy from GPU to CPU memory.

This is a **safety feature** of CuPy! It ensures that all device-to-host transfers are **explicit and intentional**. 
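
If you do want NumPy's checker, make the transfer explicit first. The sketch below copies the small `check_sums` result to the host with `.get()` before calling `np.testing.assert_allclose`. It uses a tiny stand-in dataset rather than the exercise data, and assumes a NumPy fallback so it also runs without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails if no usable CUDA device is present
    xp = cp
except Exception:
    xp = np

# Tiny stand-in for the exercise data: 2 rows of 5 values.
rows = xp.arange(1.0, 11.0).reshape(-1, 5)
normalized = rows / rows.sum(axis=1)[:, xp.newaxis]
check_sums = normalized.sum(axis=1)

# Explicit device-to-host copy; on the NumPy fallback the array is already on the host.
check_sums_host = check_sums.get() if xp is not np else check_sums

np.testing.assert_allclose(check_sums_host, 1.0)
print("Verification passed on the host copy")
```

Only the per-row sums cross the bus here, not the full matrix, so the explicit copy stays cheap.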

### Part 2
We will now create a massive dataset (50 million points) representing a sine wave and see how fast the GPU can sort it compared to the CPU. 

**TODO:** 
1) **Generate Data:** Create a NumPy array (`y_cpu`) and a CuPy array (`y_gpu`) representing $\sin(x)$ from $0$ to $2\pi$ with `50,000,000` points.
2) **Benchmark CPU and GPU:** Use `benchmark()` from `cupyx.profiler` to measure both `np.sort` and `cp.sort`.

In [None]:
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark

# --- Step 1: Generate Data ---
N = 50_000_000
print(f"Generating {N:,} points...")

# SOLUTION: Create x_cpu using np.linspace from 0 to 2*pi
x_cpu = np.linspace(0, 2 * np.pi, N)
# SOLUTION: Create y_cpu by taking np.sin(x_cpu)
y_cpu = np.sin(x_cpu)

# SOLUTION: Create x_gpu using cp.linspace from 0 to 2*pi
x_gpu = cp.linspace(0, 2 * cp.pi, N)
# SOLUTION: Create y_gpu by taking cp.sin(x_gpu)
y_gpu = cp.sin(x_gpu)

print(f"  CPU array shape: {y_cpu.shape}, dtype: {y_cpu.dtype}")
print(f"  GPU array shape: {y_gpu.shape}, dtype: {y_gpu.dtype}")

# --- Step 2: Benchmark NumPy (CPU) ---
print("\nBenchmarking NumPy Sort (this may take a few seconds)...")
# SOLUTION: Use benchmark with np.sort
cpu_result = benchmark(np.sort, (y_cpu,), n_repeat=5)
cpu_avg_ms = cpu_result.cpu_times.mean() * 1000
cpu_std_ms = cpu_result.cpu_times.std() * 1000
print(f"  NumPy (CPU): {cpu_avg_ms:.3f} ms +/- {cpu_std_ms:.3f} ms")

# --- Step 3: Benchmark CuPy (GPU) ---
print("\nBenchmarking CuPy Sort...")
# SOLUTION: Use benchmark with cp.sort
gpu_result = benchmark(cp.sort, (y_gpu,), n_repeat=5)
gpu_avg_ms = gpu_result.gpu_times.mean() * 1000
gpu_std_ms = gpu_result.gpu_times.std() * 1000
print(f"  CuPy (GPU): {gpu_avg_ms:.3f} ms +/- {gpu_std_ms:.3f} ms")

# --- Summary ---
print(f"\n*** GPU Speedup: {cpu_avg_ms / gpu_avg_ms:.1f}x faster ***")

**EXTRA CREDIT: Benchmark with different array sizes and find the size at which CuPy and NumPy take the same amount of time. Try to extract the timing data from `cupyx.profiler.benchmark`'s return value and customize how the output is displayed. You could even make a graph.**

In [None]:
# SOLUTION: Extra Credit - Finding the crossover point between CPU and GPU performance

import numpy as np
import cupy as cp
from cupyx.profiler import benchmark

# Define array sizes to test (evenly spaced from 1K to 5K)
sizes = [1_000, 2_000, 3_000, 4_000, 5_000]

cpu_times = []
gpu_times = []

print("Benchmarking different array sizes...")
print("=" * 70)
print(f"{'Size':>15} | {'NumPy (CPU)':>15} | {'CuPy (GPU)':>15} | {'Winner':>10}")
print("-" * 70)

for N in sizes:
    # Generate sine wave data
    x_cpu = np.linspace(0, 2 * np.pi, N)
    y_cpu = np.sin(x_cpu)
    
    x_gpu = cp.linspace(0, 2 * cp.pi, N)
    y_gpu = cp.sin(x_gpu)
    
    # Benchmark CPU
    cpu_result = benchmark(np.sort, (y_cpu,), n_repeat=10)
    cpu_time_ms = cpu_result.cpu_times.mean() * 1000
    cpu_times.append(cpu_time_ms)
    
    # Benchmark GPU
    gpu_result = benchmark(cp.sort, (y_gpu,), n_repeat=10)
    gpu_time_ms = gpu_result.gpu_times.mean() * 1000
    gpu_times.append(gpu_time_ms)
    
    # Determine winner
    winner = "GPU" if gpu_time_ms < cpu_time_ms else "CPU"
    
    print(f"{N:>15,} | {cpu_time_ms:>12.3f} ms | {gpu_time_ms:>12.3f} ms | {winner:>10}")

print("=" * 70)

# Find approximate crossover point
crossover_idx = None
for i in range(len(sizes) - 1):
    # Check if GPU becomes faster between size[i] and size[i+1]
    if cpu_times[i] <= gpu_times[i] and cpu_times[i+1] > gpu_times[i+1]:
        crossover_idx = i
        break

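if crossover_idx is not None:
    print(f"\nCrossover point: GPU becomes faster between {sizes[crossover_idx]:,} and {sizes[crossover_idx+1]:,} elements")
elif all(g < c for g, c in zip(gpu_times, cpu_times)):
    print(f"\nGPU is faster for all tested sizes (even at {sizes[0]:,} elements)")
elif all(c <= g for g, c in zip(gpu_times, cpu_times)):
    print(f"\nCPU is faster for all tested sizes (even at {sizes[-1]:,} elements)")
else:
    # Timings can be noisy at small sizes, so the winner may flip more than once.
    print("\nNo single CPU-to-GPU crossover found; try more repeats or a wider size range")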

In [None]:
# SOLUTION: Extra Credit (continued) - Visualization

import matplotlib.pyplot as plt

# Create the plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Absolute times (log-log scale)
ax1.loglog(sizes, cpu_times, 'b-o', label='NumPy (CPU)', linewidth=2, markersize=8)
ax1.loglog(sizes, gpu_times, 'r-s', label='CuPy (GPU)', linewidth=2, markersize=8)
ax1.set_xlabel('Array Size (elements)', fontsize=12)
ax1.set_ylabel('Time (ms)', fontsize=12)
ax1.set_title('Sort Performance: CPU vs GPU', fontsize=14)
ax1.legend(fontsize=11)
ax1.grid(True, alpha=0.3)

# Plot 2: Speedup ratio
speedups = [cpu / gpu for cpu, gpu in zip(cpu_times, gpu_times)]
colors = ['green' if s > 1 else 'red' for s in speedups]
ax2.bar(range(len(sizes)), speedups, color=colors, alpha=0.7, edgecolor='black')
ax2.axhline(y=1.0, color='black', linestyle='--', linewidth=2, label='Break-even')
ax2.set_xticks(range(len(sizes)))
ax2.set_xticklabels([f'{s:,}' for s in sizes], rotation=45, ha='right', fontsize=9)
ax2.set_xlabel('Array Size (elements)', fontsize=12)
ax2.set_ylabel('GPU Speedup (CPU time / GPU time)', fontsize=12)
ax2.set_title('GPU Speedup Factor', fontsize=14)
ax2.legend(fontsize=11)
ax2.grid(True, axis='y', alpha=0.3)

# Add value labels on bars
for i, speedup in enumerate(speedups):
    ax2.annotate(f'{speedup:.1f}x', (i, speedup), textcoords="offset points",
                 xytext=(0, 5), ha='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n*** Analysis Complete ***")
print(f"Maximum GPU speedup: {max(speedups):.1f}x at {sizes[speedups.index(max(speedups))]:,} elements")