# CuPy GPU Computing Demo

This notebook demonstrates how to use CuPy for GPU-accelerated computing in Python. We'll run the same tasks on the CPU and on the GPU to show the speedups that GPU computing can deliver.

## What is CuPy?

CuPy is a NumPy-compatible library for GPU-accelerated computing with Python. It provides:
- **NumPy compatibility**: `cupy.ndarray` mirrors most of the `numpy.ndarray` API, so much NumPy code runs on the GPU with few changes
- **CUDA acceleration**: Leverages NVIDIA GPUs for massively parallel processing
- **Easy data transfer**: Explicit, one-line copies between CPU and GPU memory
- **Extensive library support**: GPU versions of many NumPy and SciPy functions
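
As a rough illustration of the NumPy-compatible API, the sketch below writes one function against a generic array module and runs it with CuPy when a GPU is usable, otherwise with NumPy. The names `xp` and `scaled_wave`, and the fallback logic, are assumptions made for this example and are not part of the notebook.

```python
import numpy as np

# Use CuPy when it is installed and a CUDA device is visible; otherwise NumPy.
try:
    import cupy as cp
    xp = cp if cp.cuda.runtime.getDeviceCount() > 0 else np
except Exception:
    xp = np


def scaled_wave(a, b):
    """Element-wise pi * sin(-a) * exp(b), written once for either module."""
    return xp.pi * xp.sin(-a) * xp.exp(b)


a = xp.linspace(0.0, 1.0, 5)
b = xp.zeros(5)
result = scaled_wave(a, b)
print(type(result), result.shape)
```

The function body is identical for both libraries; only the module bound to `xp` changes.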

## Prerequisites

- NVIDIA GPU with CUDA support
- A CuPy build that matches your installed CUDA version (for example, the `cupy-cuda12x` wheel for CUDA 12.x)
- Sufficient GPU memory for the arrays we'll create
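
One way to check these prerequisites before running the rest of the notebook is sketched below. The helper name `describe_gpu` is hypothetical, and the sketch assumes device 0 is the GPU of interest; it returns `None` instead of raising when CuPy or a CUDA device is unavailable.

```python
def describe_gpu():
    """Return basic facts about device 0, or None if CuPy/CUDA is unusable."""
    try:
        import cupy as cp

        if cp.cuda.runtime.getDeviceCount() < 1:
            return None
        free_bytes, total_bytes = cp.cuda.Device(0).mem_info
        return {
            "cupy_version": cp.__version__,
            # Encoded as 1000 * major + 10 * minor, e.g. 12040 for CUDA 12.4
            "cuda_runtime": cp.cuda.runtime.runtimeGetVersion(),
            "free_mib": free_bytes / 2**20,
            "total_mib": total_bytes / 2**20,
        }
    except Exception:
        return None


info = describe_gpu()
print(info if info is not None else "No usable CUDA device found")
```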

## Import Required Libraries

Let's start by importing the necessary libraries:
- `numpy`: Traditional CPU-based array operations for comparison
- `cupy`: GPU-accelerated array operations (NumPy-compatible API)
- `cupyx.profiler.benchmark`: CuPy's benchmarking tool for accurate GPU performance measurement

In [None]:
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark

## Define a Custom GPU Function

Let's create a mathematical function that will benefit from GPU parallelization. It combines several mathematical operations:
- Trigonometric functions (`sin`)
- Exponential functions (`exp`)
- Multiplication by a constant (`cp.pi`, i.e. π)

These element-wise operations on large arrays are perfect candidates for GPU acceleration since they can be performed in parallel across thousands of GPU cores.

In [None]:
def my_func(a, b):
    """Compute pi * sin(-a) * exp(b) element-wise on CuPy arrays."""
    return cp.pi * cp.sin(-a) * cp.exp(b)

## Benchmark GPU Performance

Now let's create large GPU arrays and benchmark our function's performance:

- **Array size**: 2560 × 1024 ≈ 2.6 million elements per array
- **Data type**: 64-bit floating point numbers  
- **Memory usage**: ~20 MB per array on GPU
- **Benchmark**: 100,000 iterations for statistical accuracy
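
The size figures above follow from simple arithmetic, which the short sketch below reproduces with NumPy's dtype metadata alone, so nothing is allocated on the GPU. The variable names are chosen for this example.

```python
import numpy as np

shape = (2560, 1024)
n_elements = shape[0] * shape[1]
bytes_per_array = n_elements * np.dtype(np.float64).itemsize  # 8 bytes per float64

print(f"{n_elements:,} elements")  # 2,621,440 elements
print(f"{bytes_per_array / 1e6:.1f} MB ({bytes_per_array / 2**20:.0f} MiB) per array")
# 21.0 MB (20 MiB) per array
```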

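The `benchmark` function times GPU code correctly by:
- Running warm-up calls first, so one-time costs such as kernel compilation are excluded
- Reporting CPU and GPU times separately, with GPU time measured using CUDA events
- Synchronizing with the device, so asynchronously launched kernels are fully counted

The arrays are created on the GPU before timing starts, so no CPU-to-GPU transfer is included in the measurement.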
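
To make the timing concerns concrete, here is a hand-rolled sketch of event-based timing for a single call. It is not how `benchmark` is implemented internally; the helper `time_call_ms`, its `n_warmup` parameter, and the CPU fallback are assumptions added for illustration.

```python
import time

import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp, HAVE_GPU = None, False

xp = cp if HAVE_GPU else np


def time_call_ms(func, *args, n_warmup=3):
    """Time one call of func(*args) in milliseconds."""
    for _ in range(n_warmup):      # warm-up: absorb one-time costs such as compilation
        func(*args)
    if HAVE_GPU:
        start, stop = cp.cuda.Event(), cp.cuda.Event()
        start.record()
        func(*args)                # enqueued asynchronously on the GPU
        stop.record()
        stop.synchronize()         # block until the GPU reaches `stop`
        return cp.cuda.get_elapsed_time(start, stop)
    t0 = time.perf_counter()       # CPU fallback when no GPU is available
    func(*args)
    return (time.perf_counter() - t0) * 1e3


x = xp.ones((256, 256))
elapsed_ms = time_call_ms(xp.sin, x)
print(f"{elapsed_ms:.3f} ms")
```

Without the `stop.synchronize()` call, the host could read the clock before the GPU has finished, which is the main pitfall that a dedicated benchmarking tool avoids.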

In [None]:
a = cp.random.random((2560, 1024))
b = cp.random.random((2560, 1024))
print(benchmark(my_func, (a, b), n_repeat=100000))

## CPU vs GPU Comparison: Linear Algebra

Let's compare CPU and GPU performance for a common scientific computing task: solving a system of linear equations.

### Problem Setup
We'll solve the equation **Ax = b** where:
- **A**: 1000×1000 coefficient matrix (~8 MB)
- **b**: 1000-element vector
- **Solution method**: LU decomposition with partial pivoting

### Data Preparation
1. Create arrays on CPU using NumPy
2. Transfer data to GPU using CuPy (`cp.array` copies the NumPy data into GPU memory)
3. Compare identical operations on both platforms

In [None]:
np_rng = np.random.default_rng()

A_cpu = np_rng.random((1000, 1000))
b_cpu = np_rng.random(1000)

A_gpu = cp.array(A_cpu)
b_gpu = cp.array(b_cpu)

### CPU Performance (NumPy)

First, let's measure how long it takes to solve the linear system using CPU-based NumPy:

- **Method**: `numpy.linalg.solve()` 
- **Implementation**: Uses optimized BLAS/LAPACK libraries (Intel MKL, OpenBLAS, etc.)
- **Parallelization**: Limited to available CPU cores
- **Expected time**: Several milliseconds to tens of milliseconds depending on CPU

In [None]:
%%timeit
np.linalg.solve(A_cpu, b_cpu)

### GPU Performance (CuPy)

Now let's measure the same operation using GPU-accelerated CuPy:

- **Method**: `cupy.linalg.solve()` 
- **Implementation**: Uses cuSOLVER (CUDA's dense linear algebra library)
- **Parallelization**: Leverages thousands of GPU cores
- **Expected speedup**: Depends strongly on the GPU model, the CPU, and the problem size; for a 1000×1000 system the gain may be modest

**Note**: The speedup generally grows with matrix size, because fixed costs such as kernel launch overhead shrink relative to the computation time.

In [None]:
%%timeit
cp.linalg.solve(A_gpu, b_gpu)
cp.cuda.Stream.null.synchronize()  # wait for the GPU to finish so the full run is timed
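
Timing aside, it is worth confirming that the solver returns a correct answer. The sketch below solves a smaller system and checks the relative residual ‖Ax − b‖ / ‖b‖. The 500×500 size, the fixed seed, and the NumPy fallback are assumptions chosen to keep the example quick and repeatable.

```python
import numpy as np

try:
    import cupy as cp
    xp = cp if cp.cuda.runtime.getDeviceCount() > 0 else np
except Exception:
    xp = np

rng = np.random.default_rng(0)   # fixed seed, chosen only for repeatability
A_host = rng.random((500, 500))
b_host = rng.random(500)

A = xp.asarray(A_host)           # copies to the GPU when xp is CuPy
b = xp.asarray(b_host)

x = xp.linalg.solve(A, b)
rel_residual = float(xp.linalg.norm(A @ x - b) / xp.linalg.norm(b))
print(f"relative residual: {rel_residual:.2e}")
```

A residual close to double-precision round-off suggests the CPU and GPU paths are solving the same problem to comparable accuracy.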

## Data Transfer Between GPU and CPU

An important aspect of GPU computing is managing data transfer between GPU and CPU memory:

### GPU to CPU Transfer
- **Method**: `.get()` (or equivalently `cp.asnumpy()`) converts a CuPy array to a NumPy array
- **Memory copy**: Data is copied from GPU memory to CPU memory
- **Performance consideration**: Minimize transfers as they can be a bottleneck
- **Use case**: When you need results on CPU for further processing or saving

Let's demonstrate creating data on GPU and transferring it back to CPU:

In [None]:
x_gpu = cp.ones((1000, 1000))
x_cpu = x_gpu.get()
print(f'Type of CPU array is: {type(x_cpu)}')
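
A full round trip (CPU to GPU, compute, GPU to CPU) might look like the following sketch. The helpers `to_device` and `to_host` are hypothetical wrappers written for this example; they fall back to plain NumPy when no GPU is usable.

```python
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp, HAVE_GPU = None, False


def to_device(x):
    """Copy a host array to the GPU if one is usable; otherwise keep it in NumPy."""
    return cp.asarray(x) if HAVE_GPU else np.asarray(x)


def to_host(x):
    """Return a NumPy array, copying from GPU memory when x is a CuPy array."""
    if HAVE_GPU and isinstance(x, cp.ndarray):
        return cp.asnumpy(x)     # same result as x.get()
    return np.asarray(x)


host_in = np.arange(6, dtype=np.float64).reshape(2, 3)
dev = to_device(host_in)         # one transfer in
dev_sq = dev * dev               # computed where the data lives: no transfer here
host_out = to_host(dev_sq)       # one transfer out
print(type(host_out), host_out.shape)
```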

## Key Takeaways and Best Practices

This demo illustrates several important concepts about GPU computing with CuPy:

### 1. **Massive Parallelism**
- GPUs excel at element-wise operations on large arrays
- Thousands of cores work simultaneously on different data elements
- Best suited for problems that can be parallelized across data

### 2. **Memory Management**
- Keep data on GPU as long as possible to avoid transfer overhead
- Use `.get()` only when you need data back on CPU
- Consider GPU memory limitations when working with large datasets

### 3. **When to Use GPU Computing**
**Good candidates:**
- Large matrix operations (linear algebra, FFTs)
- Element-wise mathematical functions
- Image processing and computer vision
- Machine learning training and inference

**Poor candidates:**
- Small arrays (overhead dominates)
- Highly sequential algorithms
- Frequent CPU-GPU data transfers

### 4. **Performance Optimization Tips**
- **Batch operations**: Combine multiple small operations into larger ones
- **Memory coalescing**: In custom kernels, access memory in contiguous patterns that make full use of GPU bandwidth
- **Data residency**: Keep data on the GPU across multiple operations
- **Reduced precision**: Use float32 instead of float64 when accuracy requirements allow
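
The precision trade-off can be sketched as follows: converting to float32 halves the memory footprint at the cost of a small numerical difference. The array size and the use of `sin` as the test function are arbitrary choices for this example.

```python
import numpy as np

try:
    import cupy as cp
    xp = cp if cp.cuda.runtime.getDeviceCount() > 0 else np
except Exception:
    xp = np

x64 = xp.linspace(0.0, 1.0, 1_000_000)   # float64 by default
x32 = x64.astype(xp.float32)

print(x64.nbytes, x32.nbytes)            # 8000000 4000000

# Compare the same computation at both precisions.
max_err = float(xp.max(xp.abs(xp.sin(x32) - xp.sin(x64))))
print(f"max abs difference in sin(x): {max_err:.1e}")
```

Whether a difference of this size is acceptable depends on the application, so it is worth measuring it on your own workload before switching.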

### 5. **CuPy Advantages**
- **Easy adoption**: NumPy-compatible API requires minimal code changes
- **Comprehensive library**: Many NumPy and SciPy functions have CuPy equivalents
- **Interoperability**: Exchanges arrays with other GPU libraries (PyTorch, TensorFlow) through DLPack
- **Custom kernels**: Allows writing custom CUDA code when needed
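
As a sketch of the custom-kernel route, the example below expresses the earlier `pi * sin(-a) * exp(b)` computation as a single `cp.ElementwiseKernel`. The kernel name `pi_sin_exp` and the NumPy fallback branch are assumptions added so the example also runs on a machine without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    HAVE_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:
    cp, HAVE_GPU = None, False

if HAVE_GPU:
    xp = cp
    # One CUDA kernel for the whole expression; compiled on first call.
    fused = cp.ElementwiseKernel(
        "float64 a, float64 b",                       # input parameters
        "float64 y",                                  # output parameter
        "y = 3.141592653589793 * sin(-a) * exp(b)",   # per-element CUDA C++ body
        "pi_sin_exp",                                 # kernel name (hypothetical)
    )
else:
    xp = np

    def fused(a, b):
        """NumPy fallback so the sketch still runs without a GPU."""
        return np.pi * np.sin(-a) * np.exp(b)


a = xp.full(4, 0.5)
b = xp.zeros(4)
y = fused(a, b)
reference = xp.pi * xp.sin(-a) * xp.exp(b)
matches = bool(xp.allclose(y, reference))
print(matches)
```

Writing the expression as one kernel avoids the temporary arrays that the step-by-step version creates for `sin(-a)` and `exp(b)`.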