# GPU Acceleration with CuPy

## CPU vs. GPU: A Tale of Two Processors

Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are designed with fundamentally different architectures for different purposes.

- **CPUs** are composed of a few, very powerful cores optimized for **sequential, latency-sensitive tasks**. They are masters of executing complex instructions one after another very quickly.
- **GPUs**, on the other hand, are composed of thousands of smaller, simpler cores designed for **parallel, throughput-oriented tasks**. They excel at performing the same simple operation on thousands of data points simultaneously.

Many operations in scientific computing and economic modeling, such as matrix multiplication, vector operations, and simulations, are inherently parallel. By offloading these tasks to a GPU, we can often achieve massive performance gains over a CPU.

**CuPy** is a Python library that makes GPU computing accessible from Python. It is designed as a **drop-in replacement for NumPy**: its API mirrors a large subset of NumPy's, so existing NumPy code can often be accelerated on an NVIDIA GPU with minimal changes.

---

**<font color='red'>IMPORTANT NOTE:</font>** To run this notebook, you **must** have a compatible **NVIDIA GPU** and the correct version of the **NVIDIA CUDA Toolkit** installed on your system. If you do not have the required hardware and software, the code cells in this notebook will fail or be skipped. You can still read the content to understand the concepts, but you will not be able to execute the code.

---

In this notebook, you will learn:
- The fundamental differences between CPU and GPU architectures.
- How to use CuPy to create arrays on the GPU.
- How to move data between the CPU (NumPy) and the GPU (CuPy).
- How to benchmark the performance difference for common numerical operations.

### The CuPy API: NumPy on the GPU

The convention is to import NumPy as `np` and CuPy as `cp`. You can then create arrays on the GPU using familiar syntax.

In [None]:
import numpy as np
import timeit

try:
    import cupy as cp
    # Query the name of GPU 0; this raises if no usable CUDA device is present
    gpu_name = cp.cuda.runtime.getDeviceProperties(0)['name'].decode('utf-8')
    CUPY_AVAILABLE = True
    print(f"Found compatible GPU: {gpu_name}")
except Exception:
    # Catch broadly: if the import itself fails, `cp` is undefined, so the
    # except clause cannot refer to cp.cuda.runtime.CUDARuntimeError
    CUPY_AVAILABLE = False
    print("CuPy is not installed or no compatible NVIDIA GPU is found.")

if CUPY_AVAILABLE:
    # Create a NumPy array on the CPU
    x_cpu = np.arange(10)
    print(f"NumPy array on CPU: {x_cpu}")

    # Create a CuPy array on the GPU
    x_gpu = cp.arange(10)
    print(f"CuPy array on GPU: {x_gpu}")
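
Because CuPy mirrors much of NumPy's API, a function can often be written once against a generic array module and run on either device. The following is a minimal sketch of that idea, not part of the original notebook: the alias `xp` and the helper `standardize` are hypothetical names chosen for illustration, and the code falls back to NumPy when CuPy or a GPU is unavailable.

```python
import numpy as np

# Pick the array module: CuPy if it imports and a CUDA device is visible,
# otherwise NumPy. `xp` is just an illustrative alias.
try:
    import cupy as cp
    xp = cp if cp.cuda.runtime.getDeviceCount() > 0 else np
except Exception:
    xp = np


def standardize(x):
    """Center each column and scale it to unit standard deviation.

    Uses only array methods, so it works for NumPy and CuPy arrays alike.
    """
    return (x - x.mean(axis=0)) / x.std(axis=0)


# The same calls create the data on the CPU or the GPU, depending on `xp`
data = xp.arange(12, dtype=xp.float64).reshape(4, 3)
z = standardize(data)
print(type(z).__module__, z.shape)
```

The function body never mentions `np` or `cp`, which is what lets the same code run on both backends.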

### Moving Data Between CPU and GPU

A critical concept in GPU computing is data transfer. For the GPU to operate on data, that data must first be moved from the host (CPU) memory to the device (GPU) memory. This transfer has a cost, so efficient GPU computing often involves minimizing CPU-GPU data transfers.

- To move a NumPy array to the GPU, use `cupy.asarray()`.
- To move a CuPy array back to the CPU, use `cupy.asnumpy()` or the `.get()` method.

In [None]:
if CUPY_AVAILABLE:
    # Create a NumPy array
    numpy_arr = np.random.rand(5)
    print(f"Original NumPy array: {numpy_arr}")

    # Move it to the GPU
    cupy_arr = cp.asarray(numpy_arr)
    print(f"CuPy array on GPU: {cupy_arr}")

    # Move it back to the CPU
    numpy_arr_back = cp.asnumpy(cupy_arr)
    print(f"Array back on CPU: {numpy_arr_back}")
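
To illustrate why minimizing transfers matters, the sketch below copies the input to the device once, keeps every intermediate result there, and copies back only the final scalar. The helpers `to_device` and `to_host` are hypothetical wrappers introduced here so that the example also runs on a machine without CuPy; they are not part of CuPy itself.

```python
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp, HAVE_GPU = None, False


def to_device(a):
    """Copy a host array to the GPU if one is available (hypothetical helper)."""
    return cp.asarray(a) if HAVE_GPU else np.asarray(a)


def to_host(a):
    """Copy an array back to host memory (hypothetical helper)."""
    return cp.asnumpy(a) if HAVE_GPU else np.asarray(a)


rng = np.random.default_rng(0)
host_data = rng.random(1_000_000)

dev = to_device(host_data)          # one host-to-device copy of the full array
centered = dev - dev.mean()         # intermediate result stays on the device
variance = (centered ** 2).mean()   # still on the device
result = float(to_host(variance))   # one small device-to-host copy

print(f"Variance: {result:.6f}")
```

Calling `to_host` after every step would produce the same number but add a full-array copy each time, which is the pattern to avoid.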

### Benchmarking: CPU vs. GPU Performance

Let's demonstrate the performance difference with a classic example: multiplying two large matrices. This is a highly parallelizable task where GPUs excel.

In [None]:
if CUPY_AVAILABLE:
    # Define the size of the matrices
    size = 5000

    # Create two random matrices in NumPy (CPU)
    a_cpu = np.random.rand(size, size).astype(np.float32)
    b_cpu = np.random.rand(size, size).astype(np.float32)

    # Create two random matrices in CuPy (GPU)
    a_gpu = cp.random.rand(size, size).astype(cp.float32)
    b_gpu = cp.random.rand(size, size).astype(cp.float32)

Now, let's time the matrix multiplication on the CPU.

In [None]:
if CUPY_AVAILABLE:
    cpu_time = timeit.timeit(lambda: np.dot(a_cpu, b_cpu), number=10)
    print(f"CPU time: {cpu_time:.4f} seconds")

And now, let's time the same operation on the GPU. For a fair comparison, the data should already be on the GPU. CuPy launches GPU work asynchronously, so the call returns before the computation has finished. We therefore synchronize the device inside the timed function; synchronizing only after `timeit` returns would stop the timer too early.

In [None]:
if CUPY_AVAILABLE:
    def gpu_matmul():
        cp.dot(a_gpu, b_gpu)
        # Block until the GPU has finished so the timer covers the full computation
        cp.cuda.runtime.deviceSynchronize()

    gpu_matmul()  # warm-up run, so one-time initialization is not timed
    gpu_time = timeit.timeit(gpu_matmul, number=10)
    print(f"GPU time: {gpu_time:.4f} seconds")
    print(f"Speedup:  {cpu_time / gpu_time:.1f}x")

You should observe a substantial speedup, potentially an order of magnitude or more, though the exact factor depends on your CPU, its BLAS library, and your GPU. This is the benefit of offloading highly parallelizable work to the GPU.
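
The timing pattern used in this section (warm up, run, synchronize, stop the timer) can be wrapped in a small reusable helper. The sketch below is one possible way to do it; `time_call` is a hypothetical function written for this illustration, and the matrix size is kept small so it runs quickly. CuPy also provides `cupyx.profiler.benchmark` for GPU timing, which may be preferable in practice.

```python
import time
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp, HAVE_GPU = None, False


def time_call(fn, repeat=5, sync=None):
    """Return the best wall-clock time in seconds of `fn` over `repeat` runs.

    `sync` is an optional callable that blocks until queued device work is
    done; it is called inside the timed region so asynchronous GPU work is
    fully counted.
    """
    fn()  # warm-up run, not timed
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        best = min(best, time.perf_counter() - start)
    return best


n = 512  # small size chosen only to keep the example fast
a = np.random.rand(n, n).astype(np.float32)
cpu_best = time_call(lambda: a @ a)
print(f"CPU best of 5: {cpu_best:.6f} s")

if HAVE_GPU:
    a_dev = cp.asarray(a)
    gpu_best = time_call(lambda: a_dev @ a_dev,
                         sync=cp.cuda.runtime.deviceSynchronize)
    print(f"GPU best of 5: {gpu_best:.6f} s")
```

Reporting the best of several runs reduces noise from other processes; at a size this small the GPU may not be faster than the CPU.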

### A More Complex Example: Singular Value Decomposition (SVD)

The benefits extend to more complex linear algebra, which is at the heart of many econometric and statistical methods.

In [None]:
if CUPY_AVAILABLE:
    cpu_svd_time = timeit.timeit(lambda: np.linalg.svd(a_cpu), number=1)
    print(f"CPU SVD time: {cpu_svd_time:.4f} seconds")

In [None]:
if CUPY_AVAILABLE:
    def gpu_svd():
        cp.linalg.svd(a_gpu)
        # Wait for the GPU to finish so the timer covers the whole computation
        cp.cuda.runtime.deviceSynchronize()

    gpu_svd_time = timeit.timeit(gpu_svd, number=1)
    print(f"GPU SVD time: {gpu_svd_time:.4f} seconds")
    print(f"SVD Speedup:  {cpu_svd_time / gpu_svd_time:.1f}x")
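
As one illustration of how the SVD appears in econometric work, the sketch below solves an ordinary least squares problem through the thin SVD of the design matrix. It is an example under stated assumptions rather than anything from the original notebook: the data are synthetic and noiseless, the names `X_host`, `beta_true`, and `xp` are made up for this sketch, and the code falls back to NumPy if no GPU is usable.

```python
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp, HAVE_GPU = None, False

xp = cp if HAVE_GPU else np  # array module used for the linear algebra

# Synthetic, noiseless regression data (assumed for illustration only)
rng = np.random.default_rng(0)
X_host = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y_host = X_host @ beta_true

# Move the data to the chosen device once
X = xp.asarray(X_host)
y = xp.asarray(y_host)

# Thin SVD: X = U diag(s) Vt, so the least squares solution is
# beta = V diag(1/s) U^T y
U, s, Vt = xp.linalg.svd(X, full_matrices=False)
beta_hat = Vt.T @ ((U.T @ y) / s)

# Bring the small coefficient vector back to the host for inspection
beta_host = cp.asnumpy(beta_hat) if HAVE_GPU else beta_hat
print(beta_host)
```

Only the three estimated coefficients are copied back to the host; the large matrices stay on the device throughout.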

## Conclusion: When to Use CuPy

CuPy is a powerful tool, but it's not always the right choice. Consider using CuPy when:

1.  **You have an NVIDIA GPU:** CuPy primarily targets NVIDIA hardware through CUDA. Support for AMD GPUs through ROCm exists but is experimental.
2.  **Your code is NumPy-heavy:** It is designed as a drop-in replacement for NumPy.
3.  **Your operations are highly parallelizable:** It excels at large-scale matrix algebra, element-wise operations on large arrays, and other vectorized computations.
4.  **The performance bottleneck is significant:** The overhead of transferring data to the GPU is not trivial. For small arrays or simple operations, NumPy on the CPU may be faster. CuPy shines when the computation is heavy enough to make the data transfer cost worthwhile.

By providing a familiar API, CuPy dramatically lowers the barrier to entry for GPU computing, enabling economists to harness the massive parallel processing power of modern graphics cards for their research.