# GPU Computing with Julia

GPUs (Graphics Processing Units) contain **thousands of cores** designed for massively parallel computation.
While originally built for rendering graphics, they've become essential for scientific computing,
machine learning, and simulations.

**In this notebook we'll cover:**
- GPU vs CPU architecture
- Working with GPU arrays using CUDA.jl
- Benchmarking GPU performance
- Writing custom GPU kernels with KernelAbstractions.jl
- Portable code that runs on both CPU and GPU

---
## Setting Up on Google Colab

To run this notebook with GPU support, you can use Google Colab with a free GPU runtime.

### Steps:
1. Go to [Google Colab](https://colab.research.google.com/)
2. Upload this notebook or open it from GitHub
3. Go to **Runtime → Change runtime type** and select a GPU under **Hardware accelerator**
4. In the same dialog, set **Runtime type** to Julia (Julia should already be available on Colab)

Alternatively, use [JuliaHub](https://juliahub.com/) which provides free Julia notebooks with GPU access.

## GPU vs CPU Architecture

| | CPU | GPU |
|--|-----|-----|
| Cores | 4-64 | 1000s |
| Clock | ~4 GHz | ~1.5 GHz |
| Memory BW | ~50 GB/s | ~1000 GB/s |
| Strength | Complex logic | Parallel compute |

**Key insight:** GPUs trade single-thread performance for massive parallelism.
They excel when the same operation is applied to many data elements simultaneously.

To make it a bit simpler:
![Prof vs Army](images/professor-vs-army.jpeg)

**The CPU (The Professor):**
Think of it as a brilliant mathematician. It can solve complex, logic-heavy problems one by one, incredibly fast. It focuses on **Low Latency**.
- Best for: running your OS, opening apps, and sequential decision-making.

**The GPU (The Muscle):**
Think of it as an army of thousands of workers. Individually, they aren't Einstein, but together? They can move a mountain in seconds. It focuses on **High Throughput**.
- Best for: rendering millions of pixels, training AI models, and matrix multiplication.

**In short: you need the Professor to plan the work, and the Army to do the heavy lifting!**

---

Ok but let's get serious for a second. What's actually going on under the hood?

The figure below (from [NVIDIA's CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-programming-guide/)) shows the key architectural difference: a CPU dedicates most of its transistors to **control logic** and **caches**, giving each core maximum single-thread performance. A GPU instead packs the chip with **thousands of small arithmetic units (ALUs)**, sacrificing per-thread speed for raw throughput.

![CPU vs GPU Architecture](images/cpu-vs-gpu.png)

The important thing to notice here is that the GPU doesn't work alone — **the CPU is always in charge**. It's the one that tells the GPU what to do: it allocates memory, copies data over, launches the compute tasks (called *kernels*), and collects the results. The GPU is powerful, but it's essentially a workhorse waiting for instructions. Without the CPU orchestrating everything, the GPU just sits there doing nothing.

So the real workflow looks something like this:
1. The **CPU** sets up the problem (allocates arrays, prepares data)
2. The **CPU** sends the data to the **GPU**'s memory
3. The **CPU** launches a **kernel** — a function that runs on thousands of GPU threads simultaneously
4. The **GPU** crunches the numbers in parallel
5. The **CPU** pulls the results back when needed

This back-and-forth is why you'll often hear people talk about *host* (CPU) and *device* (GPU). The host is the boss, the device is the muscle.
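
The five steps above can be sketched with CUDA.jl as follows. This is a minimal illustration, not a recommended program structure: the function name `host_device_roundtrip` is made up for this example, and the sketch assumes CUDA.jl is installed and a CUDA-capable GPU is available (hence the `CUDA.functional()` guard).

```julia
using CUDA

# Hypothetical helper illustrating the host/device workflow
function host_device_roundtrip(x_host::Vector{Float32})
    x_dev = CuArray(x_host)        # 2. host copies the data into GPU memory
    y_dev = 2f0 .* x_dev .+ 1f0    # 3-4. the broadcast launches a kernel; the GPU computes
    return Array(y_dev)            # 5. host copies the result back
end

if CUDA.functional()
    x_host = rand(Float32, 1_000)  # 1. host sets up the problem
    y_host = host_device_roundtrip(x_host)
    # Compare against the same computation done entirely on the CPU
    println(y_host ≈ 2f0 .* x_host .+ 1f0)
end
```

Steps 2 and 5 move data between host and device memory, which is usually much slower than the computation itself, so it pays to keep data on the GPU between kernels whenever possible.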


## Setup

In [None]:
using CUDA

# Short aliases used in the rest of the notebook
GPU = CUDA
GPUArray = CuArray

# Check if CUDA is working (returns true when a usable GPU is found)
GPU.functional()

In [None]:
# Show the active GPU device (the printed summary includes its name)
GPU.device()

In [None]:
GPU.versioninfo()

## CuArrays: GPU Arrays

CUDA.jl provides `CuArray`, the GPU equivalent of Julia's `Array`.
Most array operations "just work" on CuArrays.
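
One notable exception to "just work" is reading or writing single elements (`a[1]`) from the CPU side: each such access is a separate host-device transfer, so CUDA.jl discourages it. The sketch below shows one way to make accidental scalar indexing fail loudly and how to opt in explicitly. It assumes a working GPU, and note that `CUDA.allowscalar(false)` changes a setting for the rest of the session.

```julia
using CUDA

if CUDA.functional()
    CUDA.allowscalar(false)               # turn accidental scalar indexing into an error
    v = CuArray(Float32[1, 2, 3])

    total = sum(v)                        # whole-array reduction: runs on the GPU
    first_elem = CUDA.@allowscalar v[1]   # explicit opt-in for a one-off element read

    # Plain scalar indexing is now rejected; catch the error to show it
    scalar_blocked = try
        v[1]
        false
    catch
        true
    end
    println((total, first_elem, scalar_blocked))
end
```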

In [None]:
# Create a CPU array
a_cpu = rand(Float32, 1000, 1000)

In [None]:
# Copy to GPU
a_gpu = GPUArray(a_cpu)

In [None]:
# Create directly on GPU (faster - no copy needed)
b_gpu = GPU.rand(Float32, 1000, 1000)

In [None]:
# Check the type
typeof(a_gpu)

## GPU Operations

Standard Julia array operations work seamlessly on CuArrays.
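
As a small sketch of what "seamlessly" means in practice (assuming a working GPU; the variable names are only illustrative): dotted expressions fuse into a single GPU kernel just as they fuse into a single loop on the CPU, and reductions such as `maximum` run on the device and return an ordinary scalar.

```julia
using CUDA

if CUDA.functional()
    x = CUDA.rand(Float32, 10_000)

    # All dotted operations are fused into one kernel, with no temporary arrays
    y = @. sin(x)^2 + cos(x)^2

    # Reduction on the GPU; the result is a plain Float32 on the host
    max_err = maximum(abs.(y .- 1f0))
    println(max_err)
end
```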

In [None]:
# Element-wise addition
c_gpu = a_gpu .+ b_gpu;

In [None]:
# Matrix multiplication
d_gpu = a_gpu * b_gpu;

In [None]:
# Element-wise functions (broadcasting)
e_gpu = sin.(a_gpu);

### Exercise 1: GPU Array Manipulation

Try the following on your own:

1. Create a GPU array of size `500 × 500` filled with ones (hint: `CUDA.ones`)
2. Create a second GPU array of size `500 × 500` with random values
3. Compute the element-wise product of the two arrays
4. Compute the sum of all elements (hint: `sum(...)` works on CuArrays)
5. Copy the result back to CPU using `Array(...)`

In [None]:
# Exercise 1: your code here


## Benchmark: CPU vs GPU

Let's compare matrix multiplication performance.

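**Note:** GPU operations are asynchronous by default: a call returns as soon as the work has been queued, not when it has finished.
`GPU.@sync` (our alias for `CUDA.@sync`) blocks until the GPU is done, so the measured time covers the actual computation.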
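
To see the effect of asynchronous execution directly, the following sketch times the same multiplication with and without synchronization. It assumes a working GPU; the variable names are illustrative, and the exact timings depend on the hardware, so no particular numbers are claimed. Typically the unsynchronized time is much smaller because it only measures queuing the work.

```julia
using CUDA
using LinearAlgebra

if CUDA.functional()
    n = 2000
    A = CUDA.rand(Float32, n, n)
    B = CUDA.rand(Float32, n, n)
    C = CUDA.zeros(Float32, n, n)

    CUDA.@sync mul!(C, A, B)                    # warm-up, so compilation is not timed

    t_sync  = @elapsed CUDA.@sync mul!(C, A, B) # waits until the GPU has finished
    t_queue = @elapsed mul!(C, A, B)            # may return once the work is queued
    CUDA.synchronize()                          # drain the queue before moving on

    println("with sync: ", t_sync, " s, without sync: ", t_queue, " s")
end
```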

In [None]:
using BenchmarkTools
using LinearAlgebra

N = 2000

# CPU arrays
C_cpu, A_cpu, B_cpu = zeros(Float32, N, N), rand(Float32, N, N), rand(Float32, N, N)

# GPU arrays
C_gpu, A_gpu, B_gpu = GPUArray(C_cpu), GPUArray(A_cpu), GPUArray(B_cpu)

In [None]:
# CPU benchmark
@btime LinearAlgebra.mul!($C_cpu, $A_cpu, $B_cpu);

In [None]:
# GPU benchmark (GPU.@sync, i.e. CUDA.@sync, waits for the GPU to finish)
@btime GPU.@sync LinearAlgebra.mul!($C_gpu, $A_gpu, $B_gpu);

### Exercise 2: Benchmarking

1. Compare CPU vs GPU performance for **element-wise operations** (e.g., `sin.(A)`) instead of matrix multiplication. Use `@btime` and `CUDA.@sync`.
2. Try different matrix sizes: `N = 100`, `N = 1000`, `N = 5000`. At what size does the GPU start to win?
3. Why might the GPU be slower for small arrays?

In [None]:
# Exercise 2: your code here


## Custom GPU Kernels

For full control over GPU computation, we write **kernels** - functions that run on each GPU thread.

**KernelAbstractions.jl** provides a portable way to write kernels that work on:
- NVIDIA GPUs (CUDA)
- AMD GPUs (ROCm)
- Intel GPUs (oneAPI)
- Apple GPUs (Metal)
- CPUs (for testing)

This is the approach used by Oceananigans.jl!

### Understanding GPU Execution: Threads and Blocks

GPUs organize computation into a hierarchy:

```
Grid (all blocks)
  └── Block (group of threads that can synchronize)
        └── Thread (single execution unit)
```

![Grid of Thread Blocks](images/grid_of_thread_blocks.png)
*(Image from [NVIDIA's CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-programming-guide/))*

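For a 256×256 computational grid, we might use:
- **Block size**: 16×16 = 256 threads per block
- **Grid size**: 16×16 = 256 blocks (256 / 16 = 16 blocks along each dimension)
- **Total threads**: 256 × 256 = 65,536 (one per grid point)

Each thread computes one grid point independently, running the same kernel code on different data. NVIDIA calls this execution model **SIMT** (Single Instruction, Multiple Threads); it is closely related to the **SIMD** (Single Instruction, Multiple Data) model of CPU vector units.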

In KernelAbstractions.jl, a block is called a *workgroup*. You can set its size when instantiating a kernel, while `ndrange` gives the total number of threads to launch:
```julia
kernel = my_kernel!(backend, (16, 16))  # 16×16 workgroup
kernel(args..., ndrange=(Nx, Ny))       # total grid size
```
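
The launch geometry of the 256×256 example can be checked with plain integer arithmetic. The sketch below uses only Base Julia; `blocks_per_dim` and `global_index` are helper names invented for this illustration and are not part of CUDA.jl or KernelAbstractions.jl.

```julia
# Blocks needed along one dimension: ceiling division, so a partial block still counts
blocks_per_dim(n, blocksize) = cld(n, blocksize)

# 1-based global index of a thread, from its 1-based block and in-block indices
global_index(block, thread, blocksize) = (block - 1) * blocksize + thread

domain    = (256, 256)   # total grid points (the ndrange)
workgroup = (16, 16)     # threads per block

nblocks           = blocks_per_dim.(domain, workgroup)   # blocks along each dimension
threads_per_block = prod(workgroup)
total_threads     = prod(nblocks) * threads_per_block

println((nblocks, threads_per_block, total_threads))
println(global_index(2, 1, 16))   # first thread of the second block
```

When the domain size is not a multiple of the block size, for example 1000 points with blocks of 256, `cld` rounds up to 4 blocks and the last block is only partly used; kernels must then avoid touching indices beyond the array.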

## Simple 1D Kernel: Vector Addition

In [None]:
using KernelAbstractions

# A kernel runs on each thread, indexed by @index(Global)
@kernel function add_kernel!(C, A, B)
    i = @index(Global)            # Get this thread's global index
    @inbounds C[i] = A[i] + B[i]  # Perform the addition
end

In [None]:
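# Function to launch the kernel
function gpu_add!(C, A, B)
    backend = get_backend(A)            # Detect CPU or GPU from the array type
    kernel = add_kernel!(backend)       # Instantiate the kernel for this backend
    kernel(C, A, B, ndrange=length(A))  # Launch one thread per element
    KernelAbstractions.synchronize(backend)  # Launches can be asynchronous: wait for completion
    return C
end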

In [None]:
# Test the kernel
N = 1024
A = GPU.ones(Float32, N)
B = GPU.ones(Float32, N) .* 2
C = GPU.zeros(Float32, N)

gpu_add!(C, A, B)

# Verify result (should be all 3s)
all(Array(C) .== 3.0f0)

### Exercise 3: Write Your Own 1D Kernel

Write a kernel `multiply_kernel!` that computes the element-wise product `C[i] = A[i] * B[i]`.

1. Define the `@kernel function multiply_kernel!(C, A, B)` following the pattern of `add_kernel!`
2. Write a launcher function `gpu_multiply!(C, A, B)` 
3. Test it: create `A = CuArray(Float32.(1:10))` and `B = CuArray(Float32.(1:10))`, then verify `C[i] = i²`

In [None]:
# Exercise 3: your code here


## 2D Kernels

For 2D arrays like images or grids, we use 2D indexing.
This is essential for PDEs on structured grids.
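
Before moving a stencil into a kernel, it can help to see it as plain nested loops on the CPU. The sketch below uses only Base Julia, and `laplacian_loops!` is a name invented for this illustration. Each `(i, j)` iteration corresponds to the work a single GPU thread would do, and the loop bounds skip the boundary because the stencil needs neighbors on all four sides.

```julia
# Reference implementation with plain loops (hypothetical helper, CPU only)
function laplacian_loops!(out, f, Δx, Δy)
    Nx, Ny = size(f)
    for j in 2:Ny-1, i in 2:Nx-1   # interior points only
        out[i, j] = (f[i+1, j] - 2f[i, j] + f[i-1, j]) / Δx^2 +
                    (f[i, j+1] - 2f[i, j] + f[i, j-1]) / Δy^2
    end
    return out
end

# f(x, y) = x² + y² has ∇²f = 4 everywhere
h = 0.1
f = [((i - 1) * h)^2 + ((j - 1) * h)^2 for i in 1:8, j in 1:8]
lap = laplacian_loops!(zeros(8, 8), f, h, h)

# Interior values should be ≈ 4; boundary values stay at their initial zero
println(all(isapprox.(lap[2:end-1, 2:end-1], 4.0; atol=1e-8)))
```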

In [None]:
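# 2D Laplacian kernel: ∇²f = ∂²f/∂x² + ∂²f/∂y²
# Uses second-order central finite differences on interior points
@kernel function laplacian_kernel!(∇²f, f, Δx, Δy)
    i, j = @index(Global, NTuple)  # Get 2D thread index
    Nx, Ny = size(f)
    # The stencil reads neighbors at i±1 and j±1, so skip boundary points
    # (with @inbounds, out-of-range reads would be undefined behavior)
    if 1 < i < Nx && 1 < j < Ny
        @inbounds ∇²f[i, j] = (f[i+1,j] - 2f[i,j] + f[i-1,j]) / Δx^2 + (f[i,j+1] - 2f[i,j] + f[i,j-1]) / Δy^2
    end
end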

In [None]:
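function compute_laplacian!(∇²f, f, Δx, Δy)
    backend = get_backend(f)                 # CPU or GPU, depending on the array type
    kernel = laplacian_kernel!(backend)      # Instantiate the kernel for this backend
    kernel(∇²f, f, Δx, Δy, ndrange=size(f))  # One thread per grid point
    KernelAbstractions.synchronize(backend)  # Wait for the kernel to finish
    return ∇²f
end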

In [None]:
# Test: f(x,y) = x² + y² has ∇²f = 4
Nx, Ny = 64, 64
Δx, Δy = 0.1f0, 0.1f0

# Create coordinate arrays
x = [(i-1) * Δx for i in 1:Nx]
y = [(j-1) * Δy for j in 1:Ny]

# f(x,y) = x² + y²
f_cpu = [x[i]^2 + y[j]^2 for i in 1:Nx, j in 1:Ny]
f_gpu = GPUArray(Float32.(f_cpu))

# Compute Laplacian
∇²f_gpu = GPU.zeros(Float32, Nx, Ny)
compute_laplacian!(∇²f_gpu, f_gpu, Δx, Δy)

# Check interior points (should be ≈ 4)
∇²f_cpu = Array(∇²f_gpu)
interior = ∇²f_cpu[2:end-1, 2:end-1]
println("Expected: 4.0")
println("Got (mean): ", sum(interior) / length(interior))

### Exercise 4: Gradient Kernel

Write a 2D kernel that computes the **gradient magnitude** of a scalar field using central differences:

$$|\nabla f| = \sqrt{\left(\frac{f_{i+1,j} - f_{i-1,j}}{2\Delta x}\right)^2 + \left(\frac{f_{i,j+1} - f_{i,j-1}}{2\Delta y}\right)^2}$$

1. Define `@kernel function gradient_magnitude_kernel!(∇f_mag, f, Δx, Δy)` using 2D indexing
2. Test with `f(x,y) = x + 2y` which should give `|∇f| = √(1² + 2²) = √5 ≈ 2.236`

**Hint:** use `sqrt(...)` inside the kernel. Remember to only check interior points (avoid boundary).

In [None]:
# Exercise 4: your code here


## Multiple Dispatch in Kernels

Julia's multiple dispatch also works inside GPU kernels! This means we can write a **single kernel** that behaves differently depending on a type parameter - no `if/else` branching needed.

This is powerful for extensible scientific code: define new physics or numerical schemes as types, and the kernel dispatches to the right method automatically.

In [None]:
# Define two singleton types, one per operation
struct Square end
struct Double end

# Dispatch: different methods for different types
@inline apply(x, ::Square) = x * x
@inline apply(x, ::Double) = x + x

# ONE kernel that works with ANY operation type
@kernel function apply_operation!(out, input, op)
    i = @index(Global)
    @inbounds out[i] = apply(input[i], op)
end

# Test on GPU with both operations
N = 1024
input_gpu = GPUArray(Float32.(1:N))
out_gpu   = similar(input_gpu)

backend = get_backend(input_gpu)

# Square all elements
apply_operation!(backend)(out_gpu, input_gpu, Square(), ndrange=N)
println("Square: input[3] = 3, output[3] = ", Array(out_gpu)[3])  # 9

# Double all elements — same kernel, different type!
apply_operation!(backend)(out_gpu, input_gpu, Double(), ndrange=N)
println("Double: input[3] = 3, output[3] = ", Array(out_gpu)[3])  # 6

### Exercise 5: Extend with Multiple Dispatch

Add a new operation `Cube` to the dispatch example above:

1. Define `struct Cube end`
2. Add a method `@inline apply(x, ::Cube) = x * x * x`
3. Test it using the *same* `apply_operation!` kernel -- no changes to the kernel needed!
4. Verify that `input[3] = 3` gives `output[3] = 27`

**Bonus:** Define `struct ScaleBy{T} val::T end` and `@inline apply(x, op::ScaleBy) = x * op.val`. Test with `ScaleBy(10.0f0)`.

In [None]:
# Exercise 5: your code here


## Portable Code: CPU and GPU

The same kernel works on CPU! This is great for:
- Testing without a GPU
- Debugging (CPU error messages are clearer)
- Running on machines without GPUs
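
One way to keep a whole routine backend-agnostic, not just the kernel, is to pass the backend in and let KernelAbstractions allocate the arrays. The sketch below is an illustration under that assumption: `scale_kernel!` and `scaled_ones` are names invented here, and it runs on the CPU backend so it needs no GPU. Passing `CUDABackend()` from CUDA.jl instead of `KernelAbstractions.CPU()` should run the same code on an NVIDIA GPU.

```julia
using KernelAbstractions

@kernel function scale_kernel!(y, x, a)
    i = @index(Global)
    @inbounds y[i] = a * x[i]
end

# Hypothetical helper: allocate, launch, and fetch on whichever backend is passed in
function scaled_ones(backend, n; a = 3f0)
    x = KernelAbstractions.ones(backend, Float32, n)   # allocated in the backend's memory
    y = KernelAbstractions.zeros(backend, Float32, n)
    scale_kernel!(backend)(y, x, a; ndrange = n)       # one work item per element
    KernelAbstractions.synchronize(backend)            # wait for the kernel to finish
    return Array(y)                                    # copy to ordinary host memory
end

result = scaled_ones(KernelAbstractions.CPU(), 8)
println(result)
```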

In [None]:
using KernelAbstractions: CPU  # CPU backend; get_backend returns CPU() for a plain Array

# Same test, but on CPU
f_cpu_array = Float32.(f_cpu)
∇²f_cpu_array = zeros(Float32, Nx, Ny)

# Uses the SAME kernel code!
compute_laplacian!(∇²f_cpu_array, f_cpu_array, Δx, Δy)

# Verify
interior_cpu = ∇²f_cpu_array[2:end-1, 2:end-1]
println("CPU result (mean): ", sum(interior_cpu) / length(interior_cpu))

## Next: 2D Turbulence

Now that we understand GPU computing, we'll apply these concepts to build
a 2D incompressible Navier-Stokes solver from scratch in the turbulence lecture notebook.