# CuPy

* CuPy is functionally similar to Numba-CUDA but more closely mimics NumPy
* Many operations can be written using NumPy-like syntax
* CuPy provides drop-in equivalents of most NumPy operations for GPU arrays (`cp.ndarray`): ufuncs, broadcasting, and reductions.
* Unlike Numba-CUDA, it also ships implementations of many higher-level operations, in `cupy` itself and in `cupyx.scipy`
    * FFT
    * linear algebra
    * sparse matrices
    * signal processing
    * random numbers
* These can make CuPy easier to use than Numba-CUDA for many applications


## CuPy Array Basics (NumPy, but on GPU)
* In most cases you can replace `import numpy as np` with `import cupy as cp`, and most ufuncs and array operations just work


In [1]:

import cupy as cp
a = cp.arange(10**6, dtype=cp.float32)
b = cp.sin(a) + cp.cos(a)  # ufuncs on GPU
print(b.shape, b.dtype, type(b))
print("Host preview:", cp.asnumpy(b[:5]))


(1000000,) float32 <class 'cupy.ndarray'>
Host preview: [ 1.          1.3817732   0.49315065 -0.8488725  -1.4104462 ]


* `cp.arange` allocates the array in GPU device memory
* Ufuncs run on the GPU and return GPU arrays
* `cp.asnumpy` copies the result back to the host
* To minimize data transfer, call `cp.asarray` (host to device) and `cp.asnumpy` (device to host) only when needed
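
Because the two libraries share an API, one common pattern is to write array code against a module handle and pick NumPy or CuPy at run time. The sketch below is illustrative only: `rms_normalize` and the `xp` handle are hypothetical names, and the code falls back to NumPy when CuPy or a usable GPU is not available.

```python
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:  # also raises if the driver is unusable
        raise RuntimeError("no CUDA device")
    xp = cp
except Exception:
    xp = np  # CPU fallback so the sketch still runs without a GPU


def rms_normalize(a, xp, eps=1e-6):
    """Scale each row of `a` to unit root-mean-square (hypothetical helper)."""
    rms = xp.sqrt(xp.mean(a * a, axis=1, keepdims=True) + eps)
    return a / rms  # (rows, cols) / (rows, 1) broadcasts across columns


data = xp.arange(1, 13, dtype=xp.float32).reshape(3, 4)
out = rms_normalize(data, xp)
row_rms = xp.sqrt(xp.mean(out * out, axis=1))
print(xp.__name__, row_rms)  # every entry should be close to 1
```

The function body never names NumPy or CuPy directly, so the same code runs on either. CuPy also provides `cp.get_array_module(a)`, which returns the matching module for an array if you prefer not to pass `xp` around.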

### Broadcasting, reductions, and axis ops
* Broadcasting, indexing, and reductions follow the same rules as in NumPy

In [2]:

x = cp.random.random((4096, 4096), dtype=cp.float32)
w = cp.linspace(0, 1, x.shape[1], dtype=cp.float32)
y = x * w                       # broadcast
col_means = y.mean(axis=0)      # reduction on GPU
print(col_means.shape, col_means.dtype)


(4096,) float32



## Defining Kernels

* CuPy offers several ways to define kernels, which trade off simplicity against fine-grained control.

| Approach | Level of Control | Language you write | How you define it | How you launch it | Can use `threadIdx`/`blockIdx` & shared mem? | Best for | Pros | Cons |
|---|---|---|---|---|---|---|---|---|
| **Vectorized CuPy ops (ufuncs, broadcasting, reductions)** | Low | Python (NumPy-like) | Just write `cp.sin(x)`, `x*y`, `x.mean(axis=0)`, etc. | Regular function call | No, abstracted away | Elementwise/reduction math, linear algebra via `cp.linalg`, FFT via `cupyx.scipy.fft` | Fast to write; leverages vendor libs (cuBLAS/cuFFT); very readable | Limited low-level control; performance tuning mostly indirect |
| **`ElementwiseKernel`** | Medium | CUDA C *expression strings* | `ElementwiseKernel(in_params, out_params, operation, name)` | Call like a function: `k(x, ...)` | No explicit thread/block math; no shared memory | Custom elementwise transforms | Minimal boilerplate; in-place and broadcasting friendly | Logic must fit elementwise form; no shared memory |
| **`ReductionKernel`** | Medium | CUDA C *expression strings* | `ReductionKernel(in_params, out_params, map_expr, reduce_expr, post_map_expr, identity, name)` | Call like a function: `k(x, ...)` | No | Custom reductions (sum of f(x), argmax, etc.) | Concise custom reductions; good performance | Reduction pattern only; more complex to reason about |
| **`cupyx.jit` – `@jit.rawkernel()`** | High | Restricted Python with explicit launch math | Define a function with `@jit.rawkernel()` and compute the global index manually | CUDA-like launch: `f[(blocks,), (threads,)](args...)` | Yes: `jit.threadIdx`, `jit.blockIdx`, shared memory via `jit.shared_memory` | Hand-tuned kernels without leaving Python | Pythonic; no C strings; good balance of control and ergonomics | Restricted Python subset; not as feature-complete as raw CUDA C, so some CUDA features may require workarounds |
| **`RawKernel`** | Highest | CUDA C/C++ (as a string) | `cp.RawKernel(src, "func_name")` with CUDA code | CUDA-like launch: `k((blocks,), (threads,), (args...), shared_mem=...)` | Yes, full control | Performance-critical kernels; shared memory tiling; intrinsics | Full CUDA power; predictable | You write and maintain CUDA C strings; harder to prototype |
| **`RawModule`** | Highest | CUDA C/C++ (string or file), PTX | `cp.RawModule(code=...)` or `cp.RawModule(path=...)`, optionally with `options=...`; then `get_function("name")` | Same as `RawKernel` once you get the function | Yes, full control | Packaging multiple kernels, using headers, templates, linking | Organize many kernels; reuse across cells | Build/link errors are more complex; still CUDA C |

* The best way to start with CuPy is to use vectorized operations: ufuncs, broadcasting, and other NumPy-style features
* Branch out to custom kernels only where you need finer-grained control

> **Rule of thumb:** Start with **CuPy** vectorization; drop to `cupyx.jit`/RawKernel or **Numba** only for hotspots that need custom kernels.



## Elementwise Kernels
* Elementwise and reduction kernels can be a convenient alternative to writing full CUDA kernels
* You don't have to worry about launch configuration, such as the number of blocks or threads per block


In [6]:

from cupy import ElementwiseKernel, ReductionKernel
import cupy as cp

# Elementwise: y = a*x + b
axpb = ElementwiseKernel(
    in_params='float32 x, float32 a, float32 b',
    out_params='float32 y',
    operation='y = a * x + b;',
    name='axpb'
)

x = cp.linspace(0, 1, 8, dtype=cp.float32)
y = axpb(x, 2.0, 1.0)
print("Elementwise:", y)

Elementwise: [1.        1.2857143 1.5714285 1.8571429 2.142857  2.4285715 2.7142859
 3.       ]


## Reduction Kernels

* The `cupy.ReductionKernel` API lets you define your own reductions on the GPU (sum, mean, norm, min/max, and so on) from a few short expressions, with no need to write explicit CUDA kernels.

* It's roughly analogous to a CUDA kernel that:
    1. Maps each input element through some expression (`map_expr`),
    2. Reduces all mapped results using an associative binary operation (`reduce_expr`),
    3. Optionally transforms the final accumulator before writing output (`post_map_expr`),
    4. Starts with an identity (neutral) value for the accumulator (`identity`).
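
The four steps above can be mirrored on the CPU in plain Python, which is handy for checking what a reduction should return before writing the GPU version. This is only a sequential reference under assumed names (`reduction_reference` and its parameters are hypothetical, not part of CuPy). The real kernel combines partial results in parallel, which is why `reduce_expr` must be associative.

```python
import math
from functools import reduce


def reduction_reference(values, map_fn, reduce_fn, post_fn, identity):
    """Sequential CPU stand-in for map_expr / reduce_expr / post_map_expr."""
    mapped = (map_fn(v) for v in values)        # step 1: map each input element
    acc = reduce(reduce_fn, mapped, identity)   # steps 2 and 4: fold, starting from the identity
    return post_fn(acc)                         # step 3: transform the final accumulator


# Sum of squares of 0..9: identity 0, no post-processing.
sum_sq = reduction_reference(range(10), lambda x: x * x,
                             lambda a, b: a + b, lambda a: a, 0)

# L2 norm: same map and reduce, with a square root as the post-map step.
l2 = reduction_reference([3.0, 4.0], lambda x: x * x,
                         lambda a, b: a + b, math.sqrt, 0.0)

print(sum_sq, l2)  # 285 5.0
```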


* Here's an example of a reduction kernel

In [7]:

# Reduction: sum of squares
sum_squares = ReductionKernel(
    in_params='float32 x',
    out_params='float32 y',
    map_expr='x * x',
    reduce_expr='a + b',
    post_map_expr='y = a',
    identity='0',
    name='sum_squares'
)
print("Reduction:", sum_squares(cp.arange(10, dtype=cp.float32)))


Reduction: 285.0


## Raw Kernels
* A raw kernel is a CUDA C kernel written as a string
* `cp.RawKernel` compiles that string and lets you launch the kernel from Python
* This is useful for relatively simple kernels that are not easily expressed as NumPy-like vectorized operations
* As with Numba-CUDA and standard CUDA kernels, you must specify the number of blocks and threads per block at launch

In [9]:

import cupy as cp

# RawKernel (CUDA C)
saxpy_src = r'''
extern "C" __global__
void saxpy(const float a, const float* __restrict__ x,
           const float* __restrict__ y, float* __restrict__ out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}
'''
saxpy_kernel = cp.RawKernel(saxpy_src, "saxpy")

# test data
n = 1_000_000
x = cp.random.random(n, dtype=cp.float32)
y = cp.random.random(n, dtype=cp.float32)
out = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
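# Scalars must match the kernel's C types: a bare Python float is passed as a
# 64-bit double and a Python int as a 64-bit integer, so cast them explicitly.
saxpy_kernel((blocks,), (threads,), (cp.float32(2.0), x, y, out, cp.int32(n)))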
cp.cuda.Stream.null.synchronize()
print("RawKernel ok; out[0:5] =", out[:5])


RawKernel ok; out[0:5] = [0.29604462 0.523803   0.5467715  0.13893062 0.36670983]
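
The launch above rounds the block count up so that every element gets a thread, which is also why the kernel guards with `if (i < n)`. A small helper can generalize that ceiling division to multi-dimensional launches. The name `grid_size` is hypothetical, and this sketch covers only the arithmetic without touching the GPU.

```python
def grid_size(shape, block):
    """Blocks needed per axis so that blocks * threads covers `shape`."""
    if len(shape) != len(block):
        raise ValueError("shape and block must have the same number of axes")
    # Ceiling division per axis: (n + t - 1) // t
    return tuple((n + t - 1) // t for n, t in zip(shape, block))


blocks_1d = grid_size((1_000_000,), (256,))
blocks_2d = grid_size((4096, 1000), (16, 16))
idle = blocks_1d[0] * 256 - 1_000_000  # launched threads that have no element

print(blocks_1d, blocks_2d, idle)  # (3907,) (256, 63) 192
```

The 192 surplus threads in the 1D case are the ones the bounds check filters out.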



## `cupyx.jit`: CUDA-like kernels in Python
* `cupyx.jit` JIT-compiles a restricted subset of Python to CUDA
* It's a middle ground between high-level array operations and CUDA C strings
* Defining a kernel this way is very similar to defining a Numba-CUDA kernel


In [None]:

from cupyx import jit
import cupy as cp

@jit.rawkernel()
def saxpy_jit(a, x, y, out):
    i = jit.blockDim.x * jit.blockIdx.x + jit.threadIdx.x
    if i < x.size:
        out[i] = a * x[i] + y[i]

n = 1_000_000
x = cp.random.random(n, dtype=cp.float32)
y = cp.random.random(n, dtype=cp.float32)
out = cp.empty_like(x)
threads = 256
blocks = (n + threads - 1) // threads
saxpy_jit[(blocks,), (threads,)](2.0, x, y, out)
cp.cuda.Stream.null.synchronize()
print("cupyx.jit ok; out[0:5] =", out[:5])



## SciPy‑Compatible: FFT, Linear Algebra, Sparse
`cupyx.scipy` mirrors SciPy APIs and calls high‑performance GPU libs (cuBLAS, cuFFT, cuSPARSE, etc.).


In [None]:

import cupy as cp
from cupyx.scipy import fft, sparse  # scipy.fft-style API (fftpack is the legacy interface)

# FFT round‑trip
x = cp.random.random(1<<20).astype(cp.complex64)
X = fft.fft(x)
x_rec = fft.ifft(X)
print("FFT round‑trip error:", float(cp.max(cp.abs(x - x_rec))))

# Linear solve: Ax=b (the dense solver lives in cupy.linalg)
A = cp.random.random((1024, 1024), dtype=cp.float32)
b = cp.random.random(1024, dtype=cp.float32)
x = cp.linalg.solve(A, b)
r = cp.linalg.norm(A @ x - b) / cp.linalg.norm(b)
print("Solve relative residual:", float(r))

# Sparse example
rows = cp.array([0, 0, 1, 2, 2, 2])
cols = cp.array([0, 2, 2, 0, 1, 2])
vals = cp.array([1, 2, 3, 4, 5, 6], dtype=cp.float32)
S = sparse.coo_matrix((vals, (rows, cols)), shape=(3,3)).tocsr()
d = S.dot(cp.array([1,2,3], dtype=cp.float32))
print("Sparse dot:", d)



## Random Numbers & Streams/Events
The CuPy `Generator` API mirrors NumPy's. Depending on the CuPy version, the underlying bit generator can be XORWOW, MRG32k3a, or Philox. Streams and events enable concurrency.


In [None]:

import cupy as cp
rng = cp.random.default_rng(seed=123)
u = rng.random(5, dtype=cp.float32)
n = rng.standard_normal(5, dtype=cp.float32)  # mean 0, standard deviation 1
print("Uniform:", u)
print("Normal :", n)

# Streams demo (toy): overlapping ops
s1 = cp.cuda.Stream(non_blocking=True)
s2 = cp.cuda.Stream(non_blocking=True)
a = cp.empty((1<<20,), dtype=cp.float32)
b = cp.empty_like(a)
with s1: a.fill(1.0)
with s2: b.fill(2.0)
# Non-blocking streams do not synchronize with the null stream, so wait on each one
s1.synchronize()
s2.synchronize()
print("Streams ok; a[0], b[0] =", float(a[0]), float(b[0]))



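## Timing & Profiling
Rule: CUDA launches are asynchronous, so **synchronize** before starting and before stopping the clock; otherwise you measure only the launch overhead. Use a host timer with explicit synchronization, or CUDA events.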


In [None]:

import cupy as cp, time

def bench(func, *args, warmup=3, iters=10, synchronize=True, **kwargs):
    for _ in range(warmup):
        func(*args, **kwargs)
    if synchronize: cp.cuda.Stream.null.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        func(*args, **kwargs)
    if synchronize: cp.cuda.Stream.null.synchronize()
    t1 = time.perf_counter()
    return (t1 - t0) / iters

x = cp.random.random((1<<24,), dtype=cp.float32)
def op(x): return cp.tanh(cp.sin(x) + 0.1*x)

t_ms = bench(op, x) * 1e3
print(f"Avg time/op: {t_ms:.3f} ms")


In [None]:

# CUDA events
import cupy as cp
start, stop = cp.cuda.Event(), cp.cuda.Event()

x = cp.random.random((1<<24,), dtype=cp.float32)
start.record()
y = cp.tanh(cp.sin(x) + 0.1*x)
stop.record(); stop.synchronize()
print("Elapsed (ms):", cp.cuda.get_elapsed_time(start, stop))



## Pitfalls & Best Practices
- **Data transfer dominates**: batch device work; avoid frequent small host↔device copies.
- **Asynchrony**: kernel launches return before the GPU finishes, so call `synchronize()` before timing and before host reads of data produced on other streams.
- **Dtypes**: prefer `float32` unless double is necessary.
- **Memory**: reuse arrays; leverage the memory pool; avoid per‑iteration allocations.
- **Vectorize first**: use CuPy ufuncs/broadcasting before custom kernels.
- **Interop**: use `__cuda_array_interface__` / DLPack with Numba, PyTorch, JAX, etc.
- **Streams**: overlap transfers/compute to hide latency.



## Quick Decision Guide
- Maps to NumPy ops (elementwise, reductions, BLAS, FFT, sparse)? → **CuPy**.
- Needs custom control (warp‑level ops, shared memory tiling)? → **Numba‑CUDA** or CuPy RawKernel/`cupyx.jit`.
- Need SciPy‑like GPU ecosystem fast? → **CuPy (`cupyx.scipy`)**.
- Prototype high‑level in CuPy; specialize hotspots as needed.



## Hands‑On Exercises
1. **Vectorized Warm‑up (CuPy):** Implement `softmax` for a 2D batch using ufuncs/broadcasting and compare timings vs NumPy CPU.
2. **ElementwiseKernel:** Implement `leaky_relu(x, alpha=0.1)`.
3. **ReductionKernel:** Compute per‑row L2 norms for a large matrix.
4. **cupyx.jit or RawKernel:** Implement a tiled matrix multiply (e.g., 16×16). Compare to `cp.matmul`.
5. **Interop with Numba:** Pass a CuPy array to a Numba kernel via `__cuda_array_interface__` and modify in place.
6. **Profiling:** Use CUDA events to compare CuPy vectorized vs `cupyx.jit` vs RawKernel for the same op.



### Appendix: Zero‑copy Numba interop via `__cuda_array_interface__`
Numba understands CuPy device arrays directly (no copies).


In [None]:

import cupy as cp
from numba import cuda

# Create a CuPy array
cp_arr = cp.arange(16, dtype=cp.float32).reshape(4,4)

@cuda.jit
def scale_inplace(a, s):
    i, j = cuda.grid(2)
    if i < a.shape[0] and j < a.shape[1]:
        a[i, j] *= s

threads = (16, 16)
blocks = ((cp_arr.shape[0] + threads[0] - 1)//threads[0],
          (cp_arr.shape[1] + threads[1] - 1)//threads[1])

# Pass cp_arr directly (zero‑copy interface)
scale_inplace[blocks, threads](cp_arr, 3.0)
cuda.synchronize()
print("Scaled via Numba, back in CuPy:", cp_arr)



## References
- CuPy docs: https://docs.cupy.dev  
- Numba CUDA: https://numba.readthedocs.io/en/stable/cuda/index.html  
- CUDA Array Interface: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html  
- DLPack spec: https://dmlc.github.io/dlpack/latest/
