# Accelerated Computing with CuPy

## Table of Contents
1. [Creating Arrays: CPU vs. GPU](#1.-Creating-Arrays:-CPU-vs.-GPU)
2. [Basic Operations](#2.-Basic-Operations)
   - [Sequential Operations & Memory](#Sequential-Operations-&-Memory)
3. [Complex Operations (Linear Algebra)](#3.-Complex-Operations-(Linear-Algebra))
   - [Agnostic Code (NumPy Dispatch)](#Agnostic-Code-(NumPy-Dispatch))
4. [Device Management](#4.-Device-Management)
5. [Exercise - NumPy to CuPy](#Exercise---NumPy-to-CuPy)
   - [Part 1](#Part-1)
   - [Part 2](#Part-2)

---

Let's shift gears to high-level array functionality using **[CuPy](https://cupy.dev/)**.

### What is CuPy?
CuPy is a library that implements the familiar **NumPy API** but runs on the GPU (using CUDA C++ in the backend). 

**Why use it?**
* **Zero Friction:** If you know NumPy, you already know CuPy.
* **Speed:** It provides out-of-the-box GPU acceleration for array operations.
* **Ease of use:** You can often port CPU code to GPU simply by changing `import numpy as np` to `import cupy as cp`.

In [None]:
import numpy as np
import cupy as cp

# Wait for any pending work on the default stream to finish before we start timing
cp.cuda.Stream.null.synchronize()

---


## 1. Creating Arrays: CPU vs. GPU

Let's compare the performance of creating a large 3D array on the CPU versus the GPU. With the default `float64` dtype, 1000 × 500 × 500 elements at 8 bytes each come to about 2 GB.

We will use `np.ones` for the CPU and `cp.ones` for the GPU.


In [None]:
%%timeit -r 1 -n 10
# CPU creation
global x_cpu
x_cpu = np.ones((1000, 500, 500))

In [None]:
%%timeit -n 10
# GPU creation
global x_gpu
x_gpu = cp.ones((1000, 500, 500))

# Force the CPU to wait for the GPU to finish before stopping the timer
cp.cuda.Stream.null.synchronize()

We can see here that creating this array on the GPU is much faster than doing so on the CPU! You also likely noticed the line `cp.cuda.Stream.null.synchronize()` in the code above. This is vital for accurate timing.

**How CuPy works:**
1.  When you call a CuPy function, the CPU places a task in the GPU's "to-do list" (stream).
2.  The CPU immediately moves to the next line of code **without waiting** for the GPU to finish.
3.  This is called **Asynchronous Execution**.

If we didn't call `synchronize()`, the timer would stop as soon as the CPU issued the command. This would report a misleadingly fast time because it only measures how long it took to launch the task, not how long the GPU actually took to execute it. `synchronize()` forces the CPU to wait until the GPU has finished its work.
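
The effect of asynchronous execution on timing can be sketched as follows. This is a minimal illustration rather than a rigorous benchmark: the helper name `time_square` is hypothetical, and the snippet falls back to NumPy (where both timings measure the same thing) if CuPy or a CUDA device is unavailable.

```python
import time
import numpy as np

# Use CuPy when a CUDA device is usable; otherwise fall back to NumPy so the
# snippet still runs (a portability assumption, not a CuPy feature).
try:
    import cupy as cp
    cp.zeros(1)  # fails here if no working GPU is present
    xp, on_gpu = cp, True
except Exception:
    xp, on_gpu = np, False


def time_square(x, wait):
    """Time x * x, optionally waiting for the GPU before stopping the clock."""
    if on_gpu:
        cp.cuda.Stream.null.synchronize()  # drain earlier work first
    t0 = time.perf_counter()
    y = x * x
    if wait and on_gpu:
        cp.cuda.Stream.null.synchronize()  # block until the kernel has finished
    return time.perf_counter() - t0


x = xp.ones((2000, 2000))
time_square(x, wait=True)  # warm-up: the first GPU call includes kernel compilation
launch_only = time_square(x, wait=False)
launch_and_run = time_square(x, wait=True)
print(f"without synchronize: {launch_only * 1e3:.3f} ms")
print(f"with synchronize:    {launch_and_run * 1e3:.3f} ms")
```

On a GPU, the first number typically reflects only the cost of queuing the kernel, while the second includes the time the kernel took to run.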

## 2. Basic Operations

The syntax for mathematical operations is identical. Let's multiply every value in our arrays by `5`.

In [None]:
%%time
# CPU Operation
x_cpu *= 5

In [None]:
%%time
# GPU Operation
x_gpu *= 5

cp.cuda.Stream.null.synchronize()

The GPU completes this operation notably faster, with the code staying the same.

### Sequential Operations & Memory

Now let's chain several operations. In the Numba examples, a sequence like this would pay memory transfer costs at each step unless we managed device memory explicitly.

In [None]:
%%time
# CPU: Sequential math
x_cpu *= 5
x_cpu *= x_cpu
x_cpu += x_cpu

In [None]:
%%time
# GPU: Sequential math
x_gpu *= 5
x_gpu *= x_gpu
x_gpu += x_gpu

cp.cuda.Stream.null.synchronize()

The GPU ran that much faster even without us explicitly managing memory. CuPy handles this transparently: `x_gpu` stays in GPU memory between operations, so nothing is copied back to the host between steps.
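
What CuPy handles for us here can be made explicit. The sketch below assumes a small input array so the result is easy to inspect; it uses `cp.asarray` for the host-to-device copy and `cp.asnumpy` for the device-to-host copy, and falls back to NumPy when no CUDA device is usable (the fallback is an addition for portability).

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails here if no working GPU is present
    on_gpu = True
except Exception:
    cp, on_gpu = None, False

host_in = np.arange(6, dtype=np.float64)

# One explicit host -> device copy (skipped in the NumPy fallback).
x = cp.asarray(host_in) if on_gpu else host_in.copy()

# The same chain of in-place operations: on the GPU, each step reads and
# writes device memory only.
x *= 5
x *= x
x += x

# One explicit device -> host copy to inspect the result.
host_out = cp.asnumpy(x) if on_gpu else x
print(host_out)
```

Only two transfers happen in total, regardless of how many operations run in between.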

## 3. Complex Operations (Linear Algebra)

GPUs excel at linear algebra. Let's look at **Singular Value Decomposition (SVD)**, a computationally heavy operation that costs $O(N^3)$ for an $N \times N$ matrix.

In [None]:
%%time
# CPU SVD
x_cpu = np.random.random((1000, 1000))
u, s, v = np.linalg.svd(x_cpu)

In [None]:
%%time
# GPU SVD
x_gpu = cp.random.random((1000, 1000))
u, s, v = cp.linalg.svd(x_gpu)

# Wait for the GPU to finish so the timing is accurate
cp.cuda.Stream.null.synchronize()

The GPU outperforms the CPU again with exactly the same API!

### Agnostic Code (NumPy Dispatch)

A key feature of CuPy is that many **NumPy functions work on CuPy arrays without changing your code**.

When you pass a CuPy GPU array (`x_gpu`) into a NumPy function that supports the `__array_function__` protocol (e.g., `np.linalg.svd`), NumPy detects the CuPy input and **delegates the operation to CuPy’s own implementation**, which runs on the GPU.

This allows you to write code using standard `np.*` syntax and have it run on either CPU or GPU seamlessly - **as long as CuPy implements an override for that function.**

CuPy also protects you from hidden performance penalties: **it forbids implicit GPU → CPU copies**, raising a `TypeError` when NumPy tries to convert a CuPy array into a NumPy array behind the scenes. This ensures all device-to-host transfers are **explicit and intentional**, never silent.

In [None]:
%%time
# We create the data on the GPU
x_gpu = cp.random.random((1000, 1000))

# BUT we call the standard NumPy function
u, s, v = np.linalg.svd(x_gpu)  

cp.cuda.Stream.null.synchronize()
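
The dispatch and the "no implicit copies" rule can be seen together in a small sketch. The function name `normalize_rows` is hypothetical; `cp.get_array_module` is CuPy's helper for picking `numpy` or `cupy` based on the input. The NumPy fallback is an assumption added so the snippet also runs on a machine without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails here if no working GPU is present
    on_gpu = True
except Exception:
    cp, on_gpu = None, False


def normalize_rows(a):
    """Divide each row by its sum; works for NumPy and CuPy inputs."""
    # cp.get_array_module returns cupy for CuPy arrays and numpy otherwise.
    xp = cp.get_array_module(a) if on_gpu else np
    return a / xp.sum(a, axis=1, keepdims=True)


x = (cp if on_gpu else np).arange(1.0, 7.0).reshape(2, 3)
result = normalize_rows(x)
print(type(result))  # array type that came back

if on_gpu:
    # np.* call on a CuPy array: dispatched to CuPy, result stays on the GPU.
    total = np.sum(x)
    assert isinstance(total, cp.ndarray)

    # Implicit device -> host conversion is refused...
    try:
        np.asarray(x)
    except TypeError as err:
        print(f"TypeError: {err}")

    # ...so the copy has to be requested explicitly.
    result = result.get()

print(result.sum(axis=1))
```

Calling `.get()` (or `cp.asnumpy`) is the explicit, intentional transfer that the `TypeError` steers you toward.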

## 4. Device Management

If you have multiple GPUs, CuPy uses the concept of a "Current Device" context. 

You can use a `with` statement to ensure specific arrays are created on specific cards (e.g., GPU 0 vs GPU 1).


In [None]:
with cp.cuda.Device(0):
    x_on_gpu0 = cp.random.random((100000, 1000))

print(f"Array is on device: {x_on_gpu0.device}")

**Note:** CuPy functions generally expect all input arrays to be on the **same** device. Passing an array stored on a non-current device may work depending on the hardware configuration but is generally discouraged as it may not be performant.
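
To check which devices are visible and where arrays end up, something like the following may help. It is a sketch under the assumption that `cp.cuda.runtime.getDeviceCount()` reports the visible GPUs; the `placement` dictionary is purely illustrative, and on a machine without CuPy or a GPU the loop simply does not run.

```python
try:
    import cupy as cp
    n_devices = cp.cuda.runtime.getDeviceCount()
except Exception:
    cp, n_devices = None, 0

placement = {}
for dev_id in range(n_devices):
    # Arrays allocated inside the block live on device `dev_id`.
    with cp.cuda.Device(dev_id):
        a = cp.ones((4, 4))
        placement[dev_id] = (a.device.id, a.sum().item())

print(f"Visible CUDA devices: {n_devices}")
for dev_id, (array_device, total) in placement.items():
    print(f"  requested {dev_id} -> array on {array_device}, sum = {total}")
```

Keeping the computation on each array inside the matching `with` block avoids the cross-device access described in the note above.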


---

## Exercise - NumPy to CuPy

### Part 1
Let's put the "Drop-in Replacement" philosophy to the test with the same data pipeline as the previous notebook. Specifically, the single block of code below performs the following steps:
1) Generate a massive dataset (50 million elements).
2) Process it using a heavy operation (Sorting).
3) Manipulate the shape and normalize the data (Broadcasting).
4) Verify the integrity of the result.

**TODO:**
1. Run the cell below with `xp = np` (CPU mode). Note the "Sort Time".
2. Change the setup line to `xp = cp` (GPU mode) and run it again.
3. Observe how the exact same logic runs significantly faster on the GPU with CuPy, with no changes to the NumPy-style code.

In [None]:
import numpy as np
import cupy as cp
import time

# --- 1. SETUP: CHOOSE YOUR DEVICE ---
xp = np  # Toggle this to 'cp' for GPU acceleration

print(f"Running on: {xp.__name__.upper()}")

# --- 2. DATA GENERATION ---
N = 50_000_000
print(f"Generating {N:,} random elements ({N*8/1e9:.2f} GB)...")
arr = xp.random.rand(N)

# --- 3. HEAVY COMPUTATION (TIMED) ---
print("Sorting data...")
t0 = time.perf_counter()

arr = xp.sort(arr)  # sort returns a sorted copy, so keep the result

# Ensure GPU finishes before stopping timer
if xp is cp:
    cp.cuda.Stream.null.synchronize()

t1 = time.perf_counter()
print(f"  -> Sort Time: {t1 - t0:.4f} seconds")

# --- 4. MANIPULATION & BROADCASTING ---
# Purpose: Demonstrate that CuPy supports complex reshaping and broadcasting rules exactly like NumPy.
# This shows you don't need to rewrite your data processing logic.

# Reshape to a matrix with 5 columns
arr_new = arr.reshape((-1, 5))

# Normalize: Divide every row by its sum using broadcasting
row_sums = arr_new.sum(axis=1)
normalized_matrix = arr_new / row_sums[:, xp.newaxis]

# --- 5. VERIFICATION ---
# Purpose: Verify mathematical correctness/integrity of the result.
check_sums = xp.sum(normalized_matrix, axis=1)
xp.testing.assert_allclose(check_sums, 1.0)

print("  -> Verification: PASSED (All rows sum to 1.0)")

**TODO: When working with CuPy arrays, try changing `xp.testing.assert_allclose` to `np.testing.assert_allclose`. What happens and why?**

### Part 2
We will now create a massive dataset (50 million points) representing a sine wave and see how fast the GPU can sort it compared to the CPU. 

**TODO:** 
1) **Generate Data:** Create a NumPy array (`y_cpu`) and a CuPy array (`y_gpu`) representing $\sin(x)$ from $0$ to $2\pi$ with `50,000,000` points.
2) **Benchmark CPU and GPU:** Use `cupyx.profiler.benchmark` to measure both `np.sort` and `cp.sort`.
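
As a reminder of the call shape, here is a minimal sketch of `cupyx.profiler.benchmark` on an unrelated, hypothetical workload (`sum_of_squares`). Unlike a plain timer, `benchmark` performs warm-up runs and reports CPU-side and GPU-side times separately, so no manual `synchronize()` is needed. The snippet skips the measurement when no CUDA device is usable; that guard is an addition for portability.

```python
try:
    import cupy as cp
    from cupyx.profiler import benchmark
    cp.zeros(1)  # fails here if no working GPU is present
    on_gpu = True
except Exception:
    on_gpu = False


def sum_of_squares(a):
    """Hypothetical workload; any callable taking the given args would do."""
    return (a * a).sum()


result = None
if on_gpu:
    data = cp.arange(1_000_000, dtype=cp.float64)
    # args must be a tuple, hence the trailing comma in (data,).
    result = benchmark(sum_of_squares, (data,), n_repeat=5, n_warmup=2)
    print(result)  # summary of CPU-side and GPU-side times
else:
    print("No usable CUDA device; skipping benchmark.")
```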

In [None]:
import numpy as np
import cupy as cp
import cupyx.profiler

# --- Step 1: Generate Data ---
N = 50_000_000
print(f"Generating {N} points...")

# TODO: Create x_cpu using np.linspace from 0 to 2*pi
# TODO: Create y_cpu by taking np.sin(x_cpu)

# TODO: Create x_gpu using cp.linspace from 0 to 2*pi
# TODO: Create y_gpu by taking cp.sin(x_gpu)


# --- Step 2: Benchmark NumPy (CPU) ---
print("Benchmarking NumPy Sort (this may take a few seconds)...")
# TODO: Use cupyx.profiler.benchmark(function, (args,), n_repeat=5)
# Hint: Pass the function `np.sort` and the argument `(y_cpu,)`
# Note: The comma in (y_cpu,) is required to make it a tuple!


# --- Step 3: Benchmark CuPy (GPU) ---
print("Benchmarking CuPy Sort...")
# TODO: Use cupyx.profiler.benchmark(function, (args,), n_repeat=5)
# Hint: Pass the function `cp.sort` and the argument `(y_gpu,)`
# Note: The comma in (y_gpu,) is required to make it a tuple!

**EXTRA CREDIT: Benchmark with different array sizes and find the size at which CuPy and NumPy take the same amount of time. Try to extract the timing data from `cupyx.profiler.benchmark`'s return value and customize how the output is displayed. You could even make a graph.**

In [None]:
...