# 6. GPU Acceleration with CuPy

## CPU vs. GPU: An Analogy

Imagine you need to solve thousands of simple arithmetic problems. You have two choices:

1.  **A master mathematician (the CPU):** This person is brilliant and can solve any complex problem you give them, one at a time, very quickly.
2.  **An army of schoolchildren (the GPU):** None of them are individually as skilled as the master mathematician, but you have thousands of them. You can give each child one of the simple problems, and they can all solve them at the same time.

For complex, sequential tasks, the master mathematician is the best choice. But for a large number of simple, independent tasks, the army of schoolchildren will finish the job much faster. This is the core difference between a CPU and a GPU. CPUs have a few powerful cores for sequential tasks, while GPUs have thousands of simpler cores for parallel tasks.

Many operations in economic modeling, such as matrix multiplication, fit the army-of-schoolchildren pattern: they break down into many small, independent calculations. Offloading these tasks to a GPU can often yield a substantial speedup.

**CuPy** is a Python library that makes GPU computing accessible. It is designed as a **drop-in replacement for NumPy**: it implements a large subset of the NumPy API with the same function names and semantics. This allows you to accelerate your existing NumPy code on an NVIDIA GPU with minimal code changes.

---

**<font color='red'>IMPORTANT NOTE:</font>** To run this notebook, you **must** have a compatible **NVIDIA GPU** and the **NVIDIA CUDA Toolkit** installed. The CUDA Toolkit is a software layer that gives direct access to the GPU's virtual instruction set. Without it, your system cannot communicate with the GPU for general-purpose computing. If you lack the required hardware or software, the code in this notebook will fail.

---

In this notebook, you will learn:
- The fundamental differences between CPU and GPU architectures.
- How to use CuPy to create arrays on the GPU.
- How to move data between the CPU (NumPy) and the GPU (CuPy).
- How to benchmark the performance difference for common numerical operations.

### The CuPy API: NumPy on the GPU

The convention is to import NumPy as `np` and CuPy as `cp`. You can then create arrays on the GPU using familiar syntax.

In [None]:
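import timeit

import numpy as np

# Detect whether CuPy is installed and can reach at least one GPU
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()
    CUPY_AVAILABLE = True
    print("CuPy is available.")
except ImportError:
    # Importing CuPy itself failed (not installed or CUDA libraries missing)
    CUPY_AVAILABLE = False
    print("CuPy is not installed. GPU code in this notebook will not run.")
except cp.cuda.runtime.CUDARuntimeError:
    # CuPy imported, but the CUDA runtime could not find a usable device
    CUPY_AVAILABLE = False
    print("No usable GPU found. GPU code in this notebook will not run.")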

In [None]:
if CUPY_AVAILABLE:
    # Create a NumPy array on the CPU
    x_cpu = np.arange(10)
    print(f"NumPy array on CPU: {x_cpu}")

    # Create a CuPy array on the GPU
    x_gpu = cp.arange(10)
    print(f"CuPy array on GPU: {x_gpu}")
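Because the two APIs mirror each other, a function can often be written once and run on either library. The sketch below assumes a module handle named `xp` (a common but purely conventional name) that points to CuPy when a GPU is reachable and falls back to NumPy otherwise; `standardize` is a hypothetical helper used only for illustration.

```python
import numpy as np

# Pick the array module: CuPy if it imports and sees a GPU, else NumPy.
try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable device
    xp = cp
except Exception:
    xp = np  # fallback so the sketch still runs on a CPU-only machine


def standardize(x):
    """Subtract the column mean and divide by the column standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


data = xp.arange(12, dtype=xp.float32).reshape(4, 3)
z = standardize(data)

# The same function body ran on whichever library xp refers to.
print(f"Computed with: {type(z).__module__.split('.')[0]}")
print(f"Column means after standardizing: {z.mean(axis=0)}")
```

The function body never names NumPy or CuPy directly; it only uses array methods that both libraries provide.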

### Moving Data Between CPU and GPU

A critical concept in GPU computing is data transfer. For the GPU to operate on data, that data must first be moved from the host (CPU) memory to the device (GPU) memory. This transfer has a cost, so efficient GPU computing often involves minimizing CPU-GPU data transfers.

- To move a NumPy array to the GPU, use `cupy.asarray()`.
- To move a CuPy array back to the CPU, use `cupy.asnumpy()` or the `.get()` method.

In [None]:
if CUPY_AVAILABLE:
    # Create a NumPy array
    numpy_arr = np.random.rand(5)
    print(f"Original NumPy array: {numpy_arr}")

    # Move it to the GPU
    cupy_arr = cp.asarray(numpy_arr)
    print(f"CuPy array on GPU: {cupy_arr}")

    # Move it back to the CPU
    numpy_arr_back = cp.asnumpy(cupy_arr)
    print(f"Array back on CPU: {numpy_arr_back}")
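To keep transfers to a minimum, one common approach is to move the data to the device once, chain several operations there, and bring back only the final result. The following is a minimal sketch under that assumption. The `xp` handle and the `to_host` helper are hypothetical names introduced here, and the code falls back to NumPy when no GPU is available.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no usable device
    xp = cp
except Exception:
    xp = np  # fallback so the sketch still runs on a CPU-only machine


def to_host(x):
    """Return a NumPy array whether x lives on the CPU or on the GPU."""
    # CuPy arrays expose .get(); NumPy arrays and scalars do not.
    return x.get() if hasattr(x, "get") else np.asarray(x)


host_data = np.linspace(0.0, 1.0, 1_000_000, dtype=np.float32)

dev = xp.asarray(host_data)        # one transfer to the device
dev = xp.sqrt(dev) * 2.0 + 1.0     # intermediate results stay on the device
total = (dev ** 2).sum()           # still on the device

result = float(to_host(total))     # one small transfer back to the host
print(f"Sum of squares: {result:.1f}")
```

Only the million-element input and a single scalar cross the CPU-GPU boundary; none of the intermediate arrays do.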

### Benchmarking: CPU vs. GPU Performance

Let's demonstrate the performance difference with a classic example: multiplying two large matrices. This is a highly parallelizable task where GPUs excel. We use `astype(np.float32)` because single-precision floating-point arithmetic is often much faster than double precision on consumer GPUs.

In [None]:
if CUPY_AVAILABLE:
    # Define the size of the matrices
    size = 5000

    # Create two random matrices in NumPy (CPU)
    a_cpu = np.random.rand(size, size).astype(np.float32)
    b_cpu = np.random.rand(size, size).astype(np.float32)

    # Create two random matrices in CuPy (GPU)
    a_gpu = cp.random.rand(size, size).astype(cp.float32)
    b_gpu = cp.random.rand(size, size).astype(cp.float32)
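Single precision also halves the memory each matrix occupies, which matters because GPU memory is usually smaller than system RAM. As a rough back-of-the-envelope check (the helper `matrix_megabytes` is a hypothetical name used only here), the footprint of one matrix at this size can be computed directly:

```python
def matrix_megabytes(n, bytes_per_element):
    """Memory footprint of a dense n x n matrix in megabytes (10**6 bytes)."""
    return n * n * bytes_per_element / 1e6


size = 5000
mb32 = matrix_megabytes(size, 4)  # float32 uses 4 bytes per element
mb64 = matrix_megabytes(size, 8)  # float64 uses 8 bytes per element

print(f"float32: {mb32:.0f} MB per matrix")
print(f"float64: {mb64:.0f} MB per matrix")
```

A matrix product needs two inputs and one output, so the working set is about three times the per-matrix figure.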

In [None]:
if CUPY_AVAILABLE:
    # Time the CPU operation
    cpu_time = timeit.timeit(lambda: np.dot(a_cpu, b_cpu), number=10)
    print(f"CPU time: {cpu_time / 10:.4f} seconds")

    # Time the GPU operation.
    # GPU kernels launch asynchronously, so the timed function itself must
    # synchronize; otherwise we would measure little more than launch overhead.
    def gpu_dot():
        cp.dot(a_gpu, b_gpu)
        cp.cuda.Stream.null.synchronize()

    gpu_dot()  # Warm-up run keeps one-time initialization out of the timing
    gpu_time = timeit.timeit(gpu_dot, number=10)
    print(f"GPU time: {gpu_time / 10:.4f} seconds")

    print(f"Speedup: {cpu_time / gpu_time:.1f}x")
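GPU work is queued asynchronously, so a timer is only meaningful if it waits for the device before the clock stops. One way to package that pattern is a small helper that accepts an optional synchronization callback. This is a sketch, not part of CuPy: `time_call` and its parameters are hypothetical names, and the demonstration below runs on the CPU only.

```python
import time


def time_call(fn, sync=None, repeats=10, warmup=1):
    """Return the mean wall-clock seconds per call of fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    if sync is not None:
        sync()  # drain pending work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()  # wait for all queued work before stopping the clock
    return (time.perf_counter() - start) / repeats


# CPU-only demonstration. On a GPU, one might pass
# sync=cp.cuda.Stream.null.synchronize (assuming CuPy is imported as cp).
calls = []
mean_seconds = time_call(lambda: calls.append(sum(range(1000))), repeats=5, warmup=2)
print(f"{mean_seconds:.6f} s per call ({len(calls)} calls including warm-up)")
```

Synchronizing once after the loop measures the total time for all queued calls, which is sufficient when they all run on the same stream.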

### The Cost of Memory Transfer

The speedup from GPU computation is not free. Moving data between the CPU and GPU takes time. For the speedup to be worthwhile, the time saved by the GPU computation must be greater than the time spent on data transfer.

Let's see this in action with a small array.

In [None]:
if CUPY_AVAILABLE:
    size = 100
    a_cpu = np.random.rand(size, size).astype(np.float32)
    b_cpu = np.random.rand(size, size).astype(np.float32)

    # Time just the CPU computation
    cpu_time = timeit.timeit(lambda: np.dot(a_cpu, b_cpu), number=100)
    print(f"Small matrix, CPU only: {cpu_time / 100:.6f} seconds")

    # Time the full process: move to GPU, compute, move back.
    # cp.asnumpy() waits for the result, so no explicit synchronize is needed.
    def gpu_full_trip(a, b):
        a_gpu = cp.asarray(a)
        b_gpu = cp.asarray(b)
        c_gpu = cp.dot(a_gpu, b_gpu)
        return cp.asnumpy(c_gpu)

    gpu_time = timeit.timeit(lambda: gpu_full_trip(a_cpu, b_cpu), number=100)
    print(f"Small matrix, GPU full trip: {gpu_time / 100:.6f} seconds")

For the small matrix, the GPU version is likely slower because the overhead of transferring data to and from the GPU outweighs the computational speedup. This illustrates a key principle: **use the GPU for large, intensive computations where the data transfer cost is a small fraction of the total execution time.**
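A rough way to see why matrix size matters is to compare the arithmetic performed with the data that must cross the CPU-GPU boundary. Multiplying two n × n matrices takes roughly 2n³ floating-point operations but moves only about 3n² elements (two inputs in, one result out). The sketch below computes that ratio; `matmul_intensity` is a hypothetical helper, and the figures are order-of-magnitude estimates rather than measurements.

```python
def matmul_intensity(n, bytes_per_element=4):
    """Approximate floating-point operations per byte transferred for an n x n product."""
    flops = 2 * n**3                             # about n multiply-adds per output element
    bytes_moved = 3 * n * n * bytes_per_element  # two inputs in, one output back
    return flops / bytes_moved


for n in (100, 5000):
    print(f"n = {n:5d}: about {matmul_intensity(n):.1f} operations per byte moved")
```

The ratio grows linearly with n, so the transfer cost is spread over far more computation at n = 5000 than at n = 100.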

## Summary: When to Use CuPy

CuPy is a powerful tool, but it's not always the right choice. Consider using CuPy when:

1.  **You have an NVIDIA GPU:** CuPy is built primarily on CUDA and targets NVIDIA hardware. Support for AMD GPUs through ROCm exists but is experimental.
2.  **Your code is NumPy-heavy:** It is designed as a drop-in replacement for NumPy.
3.  **Your operations are highly parallelizable:** It excels at large-scale matrix algebra, element-wise operations on large arrays, and other vectorized computations.
4.  **The computation outweighs the transfer cost:** The overhead of moving data to and from the GPU is not trivial. For small arrays or simple operations, NumPy on the CPU may be faster. CuPy shines when the computation is heavy enough to make the data transfer cost worthwhile.

By providing a familiar API, CuPy lowers the barrier to entry for GPU computing, enabling economists to harness the parallel processing power of modern graphics cards for their research.