In [None]:
# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# How to use NVIDIA Warp to Build GPU-Accelerated Computational Physics Simulations

## Overview

In this notebook, we will build a fluid solver entirely in Python using the Warp framework. Specifically, we will simulate 2D turbulent flow in a box.

Through this implementation, you will learn:

* Writing SIMT (single instruction, multiple threads) kernels for building blocks of a solver
* Using Warp's tile-based programming primitives to accelerate matrix operations such as the FFT and inverse FFT
* Using CUDA graph capture to reduce launch overhead and make timestepping more efficient

By the end of this lab, we will have built a GPU-accelerated 2D Navier-Stokes solver while gaining hands-on experience applying Warp's features to computational physics problems.

---
## Setup

Before we start the lab, let's ensure we have all the necessary packages installed.

In [None]:
# Install Warp
%pip install warp-lang

# Install visualization package
%pip install matplotlib Pillow

Now let's import the necessary libraries and initialize Warp to check if GPU support is available:

In [None]:
import numpy as np
import warp as wp

if wp.get_cuda_device_count() > 0:
    print("GPU detected successfully")
else:
    print("No GPU detected!")

---
## What is a solver?

<div align="center">
<svg width="800" height="400" xmlns="http://www.w3.org/2000/svg">
  <defs>
    <style>
      .input-box { fill: #E8F4F8; stroke: #2C3E50; stroke-width: 2; }
      .solver-box { fill: #FFE5B4; stroke: #2C3E50; stroke-width: 3; }
      .output-box { fill: #E8F5E9; stroke: #2C3E50; stroke-width: 2; }
      .text-main { font-family: Arial, sans-serif; font-size: 16px; font-weight: bold; fill: #2C3E50; text-anchor: middle; }
      .text-large { font-family: Arial, sans-serif; font-size: 22px; font-weight: bold; fill: #2C3E50; text-anchor: middle; }
      .text-label { font-family: Arial, sans-serif; font-size: 18px; font-weight: bold; fill: #2C3E50; text-anchor: middle; font-style: italic; }
      .arrow { stroke: #34495E; stroke-width: 3; fill: none; marker-end: url(#arrowhead); }
    </style>
    <marker id="arrowhead" markerWidth="10" markerHeight="10" refX="9" refY="3" orient="auto">
      <polygon points="0 0, 10 3, 0 6" fill="#34495E" />
    </marker>
  </defs>
  <rect width="800" height="400" fill="white"/>
  <text x="120" y="30" class="text-label">INPUTS</text>
  <rect x="30" y="60" width="180" height="70" rx="10" class="input-box"/>
  <text x="120" y="90" class="text-main">Governing</text>
  <text x="120" y="110" class="text-main">Equations</text>
  <rect x="30" y="150" width="180" height="70" rx="10" class="input-box"/>
  <text x="120" y="180" class="text-main">Initial</text>
  <text x="120" y="200" class="text-main">Conditions</text>
  <rect x="30" y="240" width="180" height="70" rx="10" class="input-box"/>
  <text x="120" y="270" class="text-main">Boundary</text>
  <text x="120" y="290" class="text-main">Conditions</text>
  <path d="M 210 95 L 300 185" class="arrow"/>
  <path d="M 210 185 L 300 185" class="arrow"/>
  <path d="M 210 275 L 300 185" class="arrow"/>
  <rect x="300" y="130" width="200" height="110" rx="15" class="solver-box"/>
  <text x="400" y="195" class="text-large">SOLVER</text>
  <path d="M 500 185 L 570 185" class="arrow"/>
  <text x="660" y="30" class="text-label">OUTPUT</text>
  <rect x="570" y="150" width="180" height="70" rx="10" class="output-box"/>
  <text x="660" y="180" class="text-main">Fields of</text>
  <text x="660" y="200" class="text-main">Interest</text>
</svg>
</div>
 
At its core, a **solver** is a program that computes how something evolves over time, space, or both by discretizing governing equations on a grid. It takes three inputs and gives you the physical fields as a function of space and time.

1. **Governing equations** – the mathematical description of the physics (e.g., Newton's laws, heat equation, fluid dynamics)
2. **Initial conditions** – the starting state of the system (e.g., initial temperature distribution, initial velocity field)
3. **Boundary conditions** – what happens at the edges of the domain (e.g., walls, inlets, periodic boundaries)

<p align="center">
  <img src="images/chapter-12.2/structured_vs_unstructured_grid.png" alt="Structured vs Unstructured Grid" width="50%"><br>
  <em>Structured vs. unstructured grids</em>
</p>


**Computational Fluid Dynamics (CFD)** is a sub-field of computational physics where solvers almost always simulate the Navier-Stokes equations, at times in conjunction with other physics like heat transfer or plasma dynamics. These simulations are notoriously compute- and memory-intensive. A typical industry-grade CFD simulation involves:
 
- Millions (sometimes billions) of grid points, each storing velocity, pressure, temperature, etc.
- Thousands of timesteps, each requiring calculations at every grid point
- Stencil operations where each point depends on its neighbors
 
Here's the key insight: **updating each grid point is largely independent of the others within a single timestep**. This is a perfect match for GPUs, which excel at running thousands of threads in parallel.

### Where Does Warp Come Into the Picture? 

Warp bridges the gap between Python's ease of use and GPU performance. Instead of writing low-level CUDA, you write Python code decorated with `@wp.kernel`, and Warp compiles it to efficient GPU code. This lets us prototype and iterate quickly while still getting GPU acceleration where it matters.

---
## What are we building today?

![Turbulence simulation](images/chapter-12.2/turbulence_256x256_Re1000.gif)

In today's lab, we will **not** build an industry-grade solver! Instead, we'll tackle a classic academic problem: **decaying turbulence** governed by the 2-D Navier-Stokes equations, formulated **in terms of vorticity and stream function**, on a **periodic domain**.

- **Vorticity** $\omega$: Measures local rotation of the fluid.
- **Stream function** $\psi$: Defines velocity through its derivatives. This formulation ensures that the continuity equation is automatically satisfied.
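
To make the continuity claim concrete, here is a minimal NumPy sketch (not part of the Warp solver). It assumes the convention $u = \partial\psi/\partial y$, $v = -\partial\psi/\partial x$, which is inferred from the advection term of the vorticity transport equation rather than stated in this notebook. The test field `psi` and the helpers `ddx` and `ddy` are illustrative names.

```python
import numpy as np

n = 64
h = 2.0 * np.pi / n
x = np.arange(n) * h
X, Y = np.meshgrid(x, x, indexing="ij")  # axis 0 is x, axis 1 is y

# Arbitrary smooth periodic stream function (illustrative choice)
psi = np.sin(X) * np.cos(2.0 * Y)


def ddx(f):
    """Periodic central difference along x (axis 0)."""
    return (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / (2.0 * h)


def ddy(f):
    """Periodic central difference along y (axis 1)."""
    return (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2.0 * h)


u = ddy(psi)   # assumed convention: u = dpsi/dy
v = -ddx(psi)  # assumed convention: v = -dpsi/dx

# Discrete continuity equation: du/dx + dv/dy
divergence = ddx(u) + ddy(v)
max_div = np.abs(divergence).max()
print(f"max |div(u)| = {max_div:.2e}")
```

Because the two difference operators commute, the discrete divergence vanishes up to floating-point rounding for any choice of `psi`.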

By using $\psi$ and $\omega$, we reduce the problem to solving the **vorticity transport equation** with only **two unknowns, $\omega$ and $\psi$**:

$$\underbrace{\frac{\partial \omega}{\partial t}}_{\text{time evolution}} + \underbrace{\frac{\partial \psi}{\partial y}\frac{\partial \omega}{\partial x} - \frac{\partial \psi}{\partial x}\frac{\partial \omega}{\partial y}}_{\text{advection}} = \underbrace{\frac{1}{\text{Re}}\nabla^2 \omega}_{\text{diffusion}} \tag{1}.$$

The $\omega$ and $\psi$ satisfy the Poisson equation 
$$\nabla^2 \psi = -\omega \tag{2}.$$

<p align="center">
<img src="images/chapter-12.2/vorticity_grid_overlay.png" alt="Vorticity field with grid overlay" width="50%"> 
</p>

We solve these equations on a structured $N \times N$ grid with periodic boundary conditions. The red dots indicate where we store the values of $\omega$ and $\psi$.

- We solve **Eq 1** using finite differences (**SIMT kernels**)
- We solve **Eq 2** using FFT in Fourier space (**Tile kernels**)

The solver loop then takes the form shown below. What remains is to build the inner workings of our solver.

<p align="center">
   <img src="images/chapter-12.2/navier_stokes_timestep.svg" alt="Solver timestep loop" width="600">
</p>

---
## Building blocks of the solver

Let's now zoom in on the inner workings of our solver. Our simple solver has two main building blocks.

1. **Discretization + Time Marching**: Discretize the advection and diffusion terms to obtain the RHS $\mathcal{L}(\omega)$ of $\frac{\partial \omega}{\partial t} = \mathcal{L}(\omega)$ (equation 1), then advance $\omega(t)$ to $\omega(t+\Delta t)$ using the simple approximation

$$\frac{\partial \omega}{\partial t} \approx \frac{\omega^{(n+1)} - \omega^{(n)}}{\Delta t}.$$

2. **Poisson Solve**: Solve Poisson equation (equation 2) for $\psi$ given $\omega$ - **details when we get there**.

<p align="center">
  <img src="images/chapter-12.2/solver_pipeline_simple.svg" alt="Solver Pipeline: One Timestep" width="800">
</p>

In pseudo-code, one timestep of our solver looks like:

    def step(omega, psi):
        # === Block 1: Discretization + Time-stepping (SIMT) ===
        rhs = (1/Re) * diffusion(omega) - advection(omega, psi)
        omega_new = omega + dt * rhs

        # === Block 2: Poisson Solve (Tile) ===
        psi = poisson_solve(omega_new)    # details in Block 2

Here are the **6 kernels/functions** that you will implement in this lab:

**Building Block 1: Discretization + Time-stepping (SIMT)**

| Exercise                   | Type         | Warp Concept                 |
| -------------------------- | ------------ | ---------------------------- |
| `advection()`              | `@wp.func`   | Stencil access pattern       |
| `diffusion()`              | `@wp.func`   | Stencil access pattern       |
| `viscous_advection_kernel` | `@wp.kernel` | SIMT kernel, combining funcs |

**Building Block 2: FFT Poisson Solver (Tile)**

| Exercise     | Type         | Warp Concept                                   |
| ------------ | ------------ | ---------------------------------------------- |
| `fft_tiled`  | `@wp.kernel` | `wp.tile_load`, `wp.tile_fft`, `wp.tile_store` |
| `ifft_tiled` | `@wp.kernel` | `wp.tile_ifft`                                 |
| `transpose`  | `@wp.kernel` | SIMT kernel                                    |

*Helper kernels: `copy_float_to_vec2` and `multiply_k2_inverse` are pre-provided; `extract_real_and_scale` has a single line to fill in.*

In what follows, we will tackle Block 1 first, then Block 2. In the end, we will put them together to complete the full solver loop.

---
## Simulation parameters

First, let's define the global parameters for our simulation. We will solve the vorticity-transport equation on a square domain of size $2\pi$, discretized on a $1024 \times 1024$ grid. We take a timestep of $\Delta t = 5 \times 10^{-4}$.
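
As a rough sanity check on these parameters (not a full stability analysis), the sketch below evaluates two dimensionless numbers that are commonly inspected for explicit schemes. The peak velocity `u_max_assumed` is a hypothetical value chosen for illustration, since the notebook does not state one.

```python
import math

n_grid = 1024
length = 2.0 * math.pi
dt = 5.0e-4
re = 1000.0
h = length / n_grid

# Diffusion number for forward Euler with the 5-point Laplacian in 2-D.
# The classical bound for pure diffusion is dt / (Re * h^2) <= 1/4.
diffusion_number = dt / (re * h * h)

# CFL number for an ASSUMED peak velocity magnitude (hypothetical value).
u_max_assumed = 1.0
cfl_number = u_max_assumed * dt / h

print(f"h = {h:.6f}")
print(f"diffusion number = {diffusion_number:.4f}")
print(f"CFL number (u_max = {u_max_assumed}) = {cfl_number:.4f}")
```

Both numbers come out well below one for the values used here, which suggests the timestep is conservative for this grid under the stated velocity assumption.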

In [None]:
# Grid resolution
N_GRID = 1024

# Box size
LEN = 2 * np.pi

# Delta t for timestepping
DT = 0.0005

# Reynolds number
RE = 1000.0

# h = l/N (size of a computational cell)
H = LEN / N_GRID

In [None]:
# Get device object (same device as in the later memory check)
device = wp.get_device("cuda:0")

# Free memory (converted from bytes to MB)
free_mem_mb = device.free_memory / (1024 * 1024)
print(f"Free GPU memory: {free_mem_mb:.2f} MB")

Let's also allocate some helper Warp arrays that we will make use of throughout the notebook. 
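
Before running the allocation cell, it can help to estimate how much memory the arrays should take. The sketch below only counts the five real-valued $N \times N$ `float32` arrays created in the next cell. The change in free memory reported by the driver may differ, for example because of allocator granularity or memory pooling.

```python
n_grid = 1024
bytes_per_float32 = 4
array_names = ["omega_0", "omega_1", "psi", "rhs", "k2i"]

# Size of one N x N float32 array in MiB
mib_per_array = n_grid * n_grid * bytes_per_float32 / (1024 * 1024)
total_mib = len(array_names) * mib_per_array

print(f"{mib_per_array:.1f} MiB per array, {total_mib:.1f} MiB total")
```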

In [None]:
# allocate warp arrays for vorticity, stream-function, and RHS of NS equation

omega_0 = wp.zeros((N_GRID, N_GRID), dtype=wp.float32)
omega_1 = wp.zeros((N_GRID, N_GRID), dtype=wp.float32)
psi = wp.zeros((N_GRID, N_GRID), dtype=wp.float32)
rhs = wp.zeros((N_GRID, N_GRID), dtype=wp.float32)

# precompute 1/(kx^2 + ky^2) for spectral Poisson solver (avoid division by zero at k=0)
k = np.fft.fftfreq(N_GRID, d=1.0 / N_GRID)
kx, ky = np.meshgrid(k, k)
k2 = kx**2 + ky**2
k2i_np = np.zeros_like(k2)
nonzero = k2 != 0
k2i_np[nonzero] = 1.0 / k2[nonzero]

k2i = wp.array2d(k2i_np.astype(np.float32), dtype=wp.float32)

#### Exercise: Run the cell below and see how the memory footprint of our GPU has changed after the initial allocations.

In [None]:
# Get device object
device = wp.get_device("cuda:0")

# Free memory
free_mem_mb = device.free_memory / (1024 * 1024)
print(f"Free GPU memory: {free_mem_mb:.2f} MB")

Let's also initialize $\omega$ on the grid.

In [None]:
# Initialize and visualize the vorticity field for decaying turbulence
from utils import initialize_decaying_turbulence, plot_vorticity_field

omega_init = initialize_decaying_turbulence(n_grid=N_GRID, seed=42)

# Copy to Warp arrays
omega_0 = wp.array(omega_init, dtype=wp.float32)
omega_1 = wp.array(omega_init, dtype=wp.float32)

# Plot the initial vorticity field
fig, ax = plot_vorticity_field(
    omega_init, title="Initial Vorticity Field (Decaying Turbulence)"
)

---
## Building block 1: Discretization + Time Marching

<p align="center">
  <img src="images/chapter-12.2/solver_pipeline_simple_discretization.svg" alt="Solver pipeline highlighting discretization step" width="800">
</p>

---
### Building block 1.1: Compute RHS
Recall the vorticity transport equation (1) introduced earlier:

$$\frac{\partial \omega}{\partial t} + \underbrace{\frac{\partial \psi}{\partial y}\frac{\partial \omega}{\partial x} - \frac{\partial \psi}{\partial x}\frac{\partial \omega}{\partial y}}_{\text{advection}} = \underbrace{\frac{1}{\text{Re}}\left(\frac{\partial^2 \omega}{\partial x^2} + \frac{\partial^2 \omega}{\partial y^2}\right)}_{\text{diffusion}}. \tag{1}$$


We discretize the **advection** and **diffusion** terms on a **uniform grid** using **central finite difference schemes**. Discretization means approximating the derivatives using field values at the grid cells.

#### Advection term

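Using central differences for the first derivatives with grid spacing $h$, where index $i$ runs along $x$ and index $j$ runs along $y$:

$$\frac{\partial \omega}{\partial x} \approx \frac{\omega_{i+1,j} - \omega_{i-1,j}}{2h}, \qquad \frac{\partial \omega}{\partial y} \approx \frac{\omega_{i,j+1} - \omega_{i,j-1}}{2h},$$

and likewise for $\psi$. Substituting these approximations into the continuous expression gives the discrete advection term at grid point $(i,j)$.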

<p align="center">
  <img src="images/chapter-12.2/omega_differential.png" alt="Omega stencil" width="400">
  <img src="images/chapter-12.2/psi_differential.png" alt="Psi stencil" width="400">
</p>

$$\text{Advection}_{i,j} = \frac{\partial \omega}{\partial x} \cdot \frac{\partial \psi}{\partial y} - \frac{\partial \omega}{\partial y} \cdot \frac{\partial \psi}{\partial x}.$$

#### Diffusion (or Laplacian) term

$$\text{Diffusion}_{i,j} ={\frac{\omega_{i+1,j} + \omega_{i-1,j} + \omega_{i,j+1} + \omega_{i,j-1} - 4\omega_{i,j}}{h^2}}.$$

#### What do we need to implement?

Two Warp functions:
1. `advection(...)` - Warp function that computes the advection term at a grid point given the flow information on the left, right, top, and bottom neighbors.
2. `diffusion(...)` - Warp function that computes the Laplacian at a grid point given the flow information on the left, right, top, and bottom neighbors.

#### EXERCISE: Fill in the MISSING parts in `advection(...)` and `diffusion(...)` using the finite difference formulas described above.

NOTE: Please ignore the inner workings of `def compute_advection_diffusion_kernel(...)` for now. Treat it as a black box that validates the accuracy of the Warp functions that you will write.

In [None]:
@wp.func
def cyclic_index(idx: wp.int32, n: wp.int32) -> wp.int32:
    """Map any index to [0, n-1] for periodic boundary conditions.

    Args:
        idx: Input index that may be outside the valid range.
        n: Grid size defining the periodic domain.

    Returns:
        Index wrapped to the range [0, n-1].
    """
    ret_idx = idx % n
    if ret_idx < 0:
        ret_idx += n
    return ret_idx


@wp.func
def advection(
    omega_left: wp.float32,
    omega_right: wp.float32,
    omega_top: wp.float32,
    omega_down: wp.float32,
    psi_left: wp.float32,
    psi_right: wp.float32,
    psi_top: wp.float32,
    psi_down: wp.float32,
    h: wp.float32,
) -> wp.float32:
    """Calculate the advection term using central finite difference.

    Args:
        omega_left: Vorticity at (i-1, j).
        omega_right: Vorticity at (i+1, j).
        omega_top: Vorticity at (i, j+1).
        omega_down: Vorticity at (i, j-1).
        psi_left: Stream function at (i-1, j).
        psi_right: Stream function at (i+1, j).
        psi_top: Stream function at (i, j+1).
        psi_down: Stream function at (i, j-1).
        h: Grid spacing.

    Returns:
        Advection term value at grid point (i, j).
    """
    inv_2h = 1.0 / (2.0 * h)
    term_1 = ((omega_right - omega_left) * inv_2h) * (
        (psi_top - psi_down) * inv_2h
    )  # MISSING
    term_2 = ((omega_top - omega_down) * inv_2h) * (
        (psi_right - psi_left) * inv_2h
    )  # MISSING
    return term_1 - term_2


@wp.func
def diffusion(
    omega_left: wp.float32,
    omega_right: wp.float32,
    omega_center: wp.float32,
    omega_down: wp.float32,
    omega_top: wp.float32,
    h: wp.float32,
) -> wp.float32:
    """Calculate the Laplacian for viscous diffusion using central difference.

    Args:
        omega_left: Vorticity at (i-1, j).
        omega_right: Vorticity at (i+1, j).
        omega_center: Vorticity at (i, j).
        omega_down: Vorticity at (i, j-1).
        omega_top: Vorticity at (i, j+1).
        h: Grid spacing.

    Returns:
        Laplacian of vorticity at grid point (i, j).
    """
    inv_h2 = 1.0 / (h * h)
    # Combine both the diffusion terms in the x and y direction together
    laplacian = (
        omega_left + omega_right + omega_top + omega_down - 4.0 * omega_center
    ) * inv_h2  # MISSING
    return laplacian


# --- Validation kernel for advection and diffusion ---
@wp.kernel
def compute_advection_diffusion_kernel(
    omega: wp.array2d(dtype=wp.float32),
    psi: wp.array2d(dtype=wp.float32),
    advection_out: wp.array2d(dtype=wp.float32),
    diffusion_out: wp.array2d(dtype=wp.float32),
    h: wp.float32,
    n: wp.int32,
):
    """Compute advection and diffusion terms at each grid point."""
    i, j = wp.tid()

    # obtain the neighboring indices for the [i, j]th cell in a periodic square box
    i_left = cyclic_index(i - 1, n)
    i_right = cyclic_index(i + 1, n)
    j_down = cyclic_index(j - 1, n)
    j_top = cyclic_index(j + 1, n)

    # gather omega values at neighbors
    omega_center = omega[i, j]
    omega_left = omega[i_left, j]
    omega_right = omega[i_right, j]
    omega_down = omega[i, j_down]
    omega_top = omega[i, j_top]

    # gather psi values at neighbors
    psi_left = psi[i_left, j]
    psi_right = psi[i_right, j]
    psi_down = psi[i, j_down]
    psi_top = psi[i, j_top]

    # compute diffusion term (Laplacian)
    diffusion_out[i, j] = diffusion(
        omega_left,
        omega_right,
        omega_center,
        omega_down,
        omega_top,
        h,
    )

    # compute advection term J(psi, omega)
    advection_out[i, j] = advection(
        omega_left,
        omega_right,
        omega_top,
        omega_down,
        psi_left,
        psi_right,
        psi_top,
        psi_down,
        h,
    )

#### EXERCISE: Run the code below, analyze the results. You can also try some other combinations of `n_grid`, `kx`, `ky`.

Here we are testing the accuracy of our `advection(...)` and `diffusion(...)` Warp functions using 2-D functions with known analytical solutions. This allows us to compare our numerical approximations against the exact results.
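
The internals of `validate_advection_diffusion` are not shown in this notebook, so the following is an independent NumPy sketch of the same idea for the Laplacian only. It uses the test field $\sin(k_x x)\sin(k_y y)$, whose exact Laplacian is $-(k_x^2 + k_y^2)$ times the field, and checks how the error shrinks as the grid is refined. The function name `laplacian_error` is illustrative.

```python
import numpy as np


def laplacian_error(n_grid: int, kx: int = 1, ky: int = 1) -> float:
    """Max error of the periodic 5-point Laplacian on sin(kx*x) * sin(ky*y)."""
    h = 2.0 * np.pi / n_grid
    x = np.arange(n_grid) * h
    X, Y = np.meshgrid(x, x, indexing="ij")
    omega = np.sin(kx * X) * np.sin(ky * Y)
    exact = -(kx**2 + ky**2) * omega
    numeric = (
        np.roll(omega, -1, axis=0)
        + np.roll(omega, 1, axis=0)
        + np.roll(omega, -1, axis=1)
        + np.roll(omega, 1, axis=1)
        - 4.0 * omega
    ) / h**2
    return float(np.abs(numeric - exact).max())


errors = {n: laplacian_error(n) for n in (32, 64, 128)}
ratios = (errors[32] / errors[64], errors[64] / errors[128])
print(errors)
print(ratios)
```

For a second-order scheme, halving $h$ should reduce the error by a factor close to four, which is what the two ratios measure.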

In [None]:
# Run validation with manufactured solution
from utils import validate_advection_diffusion

(fig_diff, _), (fig_adv, _) = validate_advection_diffusion(
    compute_advection_diffusion_kernel, n_grid=512, kx=1, ky=1
)

### Building block 1.2: Time March

A simple way to handle the time evolution of $\omega$ is the *first-order* approximation below (also called the **forward Euler scheme**). The terms in $\color{green}{\text{green}}$ are the ones for which we have already written Warp functions in the previous cells.

$$\frac{\partial \omega}{\partial t} \approx \frac{\omega^{(n+1)} - \omega^{(n)}}{\Delta t} = \color{green}\frac{1}{\text{Re}}\text{Diffusion}_{i,j} - \text{Advection}_{i,j}.$$

Rearranging:

$$\omega^{(n+1)} = \omega^{(n)} + \Delta t \left( \frac{1}{\text{Re}}\text{Diffusion}_{i,j} - \text{Advection}_{i,j} \right).$$

Here, $\mathcal{L}(\omega) = \frac{1}{\text{Re}}\text{Diffusion} - \text{Advection}$ is the RHS term we just discretized above.
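
As a small illustration of the forward Euler update, the NumPy sketch below marches a pure-diffusion problem (advection omitted) on a coarse periodic grid and compares the result with the exact decay $e^{-2t/\text{Re}}$ of the mode $\sin(x)\sin(y)$. The grid size, Reynolds number, and timestep are illustrative values, not the ones used by the solver.

```python
import numpy as np

n, re, dt, n_steps = 64, 10.0, 1.0e-3, 1000
h = 2.0 * np.pi / n
x = np.arange(n) * h
X, Y = np.meshgrid(x, x, indexing="ij")
omega = np.sin(X) * np.sin(Y)


def laplacian(f):
    """Periodic 5-point Laplacian."""
    return (
        np.roll(f, -1, axis=0)
        + np.roll(f, 1, axis=0)
        + np.roll(f, -1, axis=1)
        + np.roll(f, 1, axis=1)
        - 4.0 * f
    ) / h**2


for _ in range(n_steps):
    # omega^(n+1) = omega^(n) + dt * L(omega^(n)), with L = (1/Re) * Laplacian
    omega = omega + dt * (1.0 / re) * laplacian(omega)

t_final = n_steps * dt
exact_amplitude = np.exp(-2.0 * t_final / re)
numerical_amplitude = np.abs(omega).max()
print(exact_amplitude, numerical_amplitude)
```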


#### Explicit time-integration and SIMT schemes

The forward Euler method for approximating $\partial \omega/\partial t$ belongs to a general class of **explicit time-integration schemes**, wherein the solution at timestep $n+1$ depends solely on quantities evaluated at timestep $n$. This property makes explicit schemes particularly well-suited for the **Single Instruction, Multiple Threads (SIMT)** execution model. 

With the entire field from the previous timestep residing in GPU memory, each thread can independently read the required stencil values and compute the updated field value at its assigned grid point—enabling simultaneous updates across all $N \times N$ grid points. 

The figure below illustrates the 5-point stencil pattern: to compute the updated value at cell $(i,j)$, we access its four cardinal neighbors and the center point—all from timestep $n$. The result is written to a separate output array representing timestep $n+1$.

<p align="center">
  <img src="images/chapter-12.2/stencil.png" alt="5-point stencil for advection and diffusion" width="700">
</p>
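
To see why the result goes to a separate output array, the sketch below compares an out-of-place update with an in-place one on a tiny 1-D periodic diffusion stencil. A sequential Python loop stands in for the GPU threads. On a GPU the in-place variant would additionally depend on thread scheduling. The function names and the coefficient `alpha` are illustrative.

```python
import numpy as np


def step_out_of_place(w_old, alpha):
    """Every read comes from the old array; writes go to a new array."""
    n = len(w_old)
    w_new = np.empty_like(w_old)
    for i in range(n):
        left, right = w_old[(i - 1) % n], w_old[(i + 1) % n]
        w_new[i] = w_old[i] + alpha * (left - 2.0 * w_old[i] + right)
    return w_new


def step_in_place(w, alpha):
    """Overwrites entries while neighbors still need the old values."""
    w = w.copy()
    n = len(w)
    for i in range(n):
        left, right = w[(i - 1) % n], w[(i + 1) % n]
        w[i] = w[i] + alpha * (left - 2.0 * w[i] + right)
    return w


w0 = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
separate = step_out_of_place(w0, 0.25)
in_place = step_in_place(w0, 0.25)
print(separate)
print(in_place)
```

The out-of-place result is symmetric about the initial spike, while the in-place result is skewed in the direction of the loop because later cells see already-updated neighbors.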


#### EXERCISE: Given that we have the correct `def advection(...)` and `def diffusion(...)` functions, let's assemble the timestepping kernel together.

Note below that `omega_0` corresponds to $\omega^{n}$ and `omega_1` corresponds to $\omega^{n+1}$.

In [None]:
@wp.kernel
def viscous_advection_kernel(
    n: int,
    h: float,
    re: float,
    dt: float,
    omega_0: wp.array2d(dtype=float),
    omega_1: wp.array2d(dtype=float),
    psi: wp.array2d(dtype=float),
    rhs: wp.array2d(dtype=float),
):
    """Compute advection + diffusion and advance vorticity using forward Euler.

    Args:
        n: Grid size.
        h: Grid spacing.
        re: Reynolds number.
        dt: Timestep size.
        omega_0: Vorticity field at the beginning of the timestep.
        omega_1: Vorticity field at the end of the timestep.
        psi: Stream function field.
        rhs: Temporarily stores diffusion + advection terms.
    """
    i, j = wp.tid()

    # obtain the neighboring indices for the [i, j]th cell in a periodic square box
    left_idx = cyclic_index(i - 1, n)
    right_idx = cyclic_index(i + 1, n)
    top_idx = cyclic_index(j + 1, n)
    down_idx = cyclic_index(j - 1, n)

    # compute viscous diffusion term from the timestep-n field (omega_0), so no
    # thread reads a value that another thread is overwriting in omega_1
    rhs[i, j] = (1.0 / re) * diffusion(
        omega_0[left_idx, j],
        omega_0[right_idx, j],
        omega_0[i, j],
        omega_0[i, down_idx],
        omega_0[i, top_idx],
        h,
    )

    # subtract advection term (also evaluated on omega_0)
    rhs[i, j] -= advection(
        omega_0[left_idx, j],
        omega_0[right_idx, j],
        omega_0[i, top_idx],
        omega_0[i, down_idx],
        psi[left_idx, j],
        psi[right_idx, j],
        psi[i, top_idx],
        psi[i, down_idx],
        h,
    )

    # forward Euler update: omega^(n+1) = omega^n + dt * L(omega^n)
    omega_1[i, j] = omega_0[i, j] + dt * rhs[i, j]  # MISSING

---
## Building block 2: Poisson Solver 

<p align="center">
  <img src="images/chapter-12.2/solver_pipeline_simple_poisson.svg" alt="Solver pipeline highlighting Poisson solve step" width="800">
</p>

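Taking the Fourier transform of the Poisson equation (2) replaces $\nabla^2$ with $-(k_x^2 + k_y^2)$, so $-(k_x^2 + k_y^2)\,\hat{\psi}_{m,n} = -\hat{\omega}_{m,n}$. The Poisson equation in Fourier space is therefore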

$$\hat{\psi}_{m,n} = \frac{\hat{\omega}_{m,n}}{k_x^2 + k_y^2}, \tag{3}$$

which we solve for each $[k_x, k_y]$ pair. The $k_x = k_y = 0$ mode is set to zero, matching the precomputed `k2i` array. To this end, we follow the pipeline below.
<p align="center">
  <img src="images/chapter-12.2/fourier_solver_pipeline.png" alt="Fourier Poisson Solver Pipeline" width="800">
</p>

### Warp's FFT primitives

Warp provides `wp.tile_fft()` and `wp.tile_ifft()`, which operate on tiles of `wp.vec2f` (or `wp.vec2d`) values and perform row-wise FFT/IFFT. Each `wp.vec2f` holds one complex number. For example, $m = 2 + 3i$ is stored as `m = wp.vec2f(2.0, 3.0)`, where:
- `m.x = 2.0`: **real part**
- `m.y = 3.0`: **imaginary part**
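
The same (real, imaginary) pairing can be illustrated on the host with NumPy, whose `complex64` type stores each number as two consecutive `float32` values. This sketch only demonstrates the memory layout in NumPy. It makes no claim about how a particular Warp version converts between the two representations.

```python
import numpy as np

m = np.array([2.0 + 3.0j, -1.0 + 0.5j], dtype=np.complex64)

# Reinterpret each complex64 as two float32 values: columns are (real, imaginary)
pairs = m.view(np.float32).reshape(-1, 2)
print(pairs)

# Going back: flatten the pairs and reinterpret them as complex64
restored = pairs.reshape(-1).view(np.complex64)
print(restored)
```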

### What do we need to implement?

1. **Data type conversion**: Kernels to move data from `wp.float32` ($\omega$, $\psi$) to `wp.vec2f` ($\hat{\omega}, \hat{\psi}$) and vice-versa.

2. **Tile-based FFT/IFFT kernels**: Kernels that will use `wp.tile_fft()` and `wp.tile_ifft()` to perform row-wise FFT/IFFT.

3. **Transpose kernel**: A matrix transpose kernel.

Since our physical fields $\omega$, $\psi$ are stored as `wp.float32`, we need a helper kernel to convert between data types. Broadly, the idea again is to leverage Warp kernels to accelerate our numerical operations of FFT/IFFT.

Let's build these step by step.

### Building Block 2.1: Data type conversion kernels

Since Warp's FFT operates on `wp.vec2f` (complex) arrays but our physical fields are `wp.float32` (real), we need:

- `copy_float_to_vec2`: Convert real array to complex with zero imaginary part (before FFT).
- `extract_real_and_scale`: Extract real part and apply scaling by dividing the real part by `scale` (after IFFT).
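
The need for the scaling step can be reproduced in NumPy. The sketch assumes that the inverse transform used later returns the unnormalized sum, which is what the $N \times N$ scale factor passed in the solver loop suggests. NumPy's `norm="forward"` option is used here to emulate such an unnormalized inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
field = rng.standard_normal((n, n))

field_hat = np.fft.fft2(field)

# With norm="forward", NumPy's inverse transform applies no 1/(n*n) factor
unnormalized = np.fft.ifft2(field_hat, norm="forward")

# Extract the real part and divide by n*n, as extract_real_and_scale does
recovered = unnormalized.real / (n * n)
print(np.abs(recovered - field).max())
```

Each 1-D inverse pass contributes a factor of $N$, so a 2-D round trip needs a division by $N^2$.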

#### EXERCISE: Complete the `def extract_real_and_scale(...)` kernel definition. `def copy_float_to_vec2(...)` is already completed for you for reference.

In [None]:
@wp.kernel
def copy_float_to_vec2(
    omega: wp.array2d(dtype=wp.float32), omega_complex: wp.array2d(dtype=wp.vec2f)
):
    """Copy real vorticity to a complex array with zero imaginary part.

    Args:
        omega: Input real-valued vorticity array.
        omega_complex: Output complex array where real part is omega, imaginary is 0.
    """
    i, j = wp.tid()
    omega_complex[i, j] = wp.vec2f(omega[i, j], 0.0)


@wp.kernel
def extract_real_and_scale(
    scale: wp.float32,
    complex_array: wp.array2d(dtype=wp.vec2f),
    real_array: wp.array2d(dtype=wp.float32),
):
    """Extract real part from complex array and scale in one pass.

    Args:
        scale: Normalization factor that each element is divided by.
        complex_array: Input complex array (vec2f where .x is real part).
        real_array: Output real array (scaled).
    """
    i, j = wp.tid()
    # real_array = Re(complex_array) divided by scale
    real_array[i, j] = complex_array[i, j].x / scale  # MISSING

### Building Block 2.2: Tile-based FFT and IFFT kernels

Warp's `wp.tile_fft()` and `wp.tile_ifft()` perform FFT on tiles loaded into registers. For row-wise transforms:

1. Each thread block loads one row (tile) of the 2D array
2. Performs FFT/IFFT using `wp.tile_fft()` / `wp.tile_ifft()`
3. Stores the result back to global memory


In [None]:
@wp.kernel
def fft_tiled(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    """Perform 1-D FFT on each row using wp.tile_fft().

    Args:
        x: Input complex array of shape (N, N).
        y: Output complex array of shape (N, N) storing FFT results.
    """
    i, _, _ = wp.tid()
    a = wp.tile_load(x, shape=(1, N_GRID), offset=(i * 1, 0))
    wp.tile_fft(a)
    wp.tile_store(y, a, offset=(i * 1, 0))


@wp.kernel
def ifft_tiled(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    """Perform 1-D inverse FFT on each row using wp.tile_ifft().

    Args:
        x: Input complex array of shape (N, N).
        y: Output complex array of shape (N, N) storing IFFT results.
    """
    i, _, _ = wp.tid()
    a = wp.tile_load(x, shape=(1, N_GRID), offset=(i * 1, 0))
    wp.tile_ifft(a)
    wp.tile_store(y, a, offset=(i * 1, 0))

#### Validating FFT/IFFT kernels 
To verify that `fft_tiled` and `ifft_tiled` work correctly, we test them with a known signal $f(x) = \sin(x)$ on the domain $[0, 2\pi)$. The Fourier transform of $\sin(x)$ has a simple analytical form. Since $\sin(x) = \frac{e^{ix} - e^{-ix}}{2i}$, the FFT should produce peaks only at wavenumbers $k = \pm 1$, with zero amplitude elsewhere. 

The `validate_fft_roundtrip(...)` function initializes the signal $f(x)$ on the $[0, 2\pi)$ domain and performs a full roundtrip, i.e., it applies the FFT to the signal, then applies the IFFT to the result. The function then plots the magnitude spectrum $|\hat{f}(k)|$ and compares the original signal with the reconstructed signal obtained from the FFT $\rightarrow$ IFFT roundtrip.
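
The expected spectrum can be previewed with NumPy's FFT as a reference. This is an independent sketch on a small grid and does not call the Warp kernels or `validate_fft_roundtrip`.

```python
import numpy as np

n = 64
x = np.arange(n) * (2.0 * np.pi / n)
signal = np.sin(x)

f_hat = np.fft.fft(signal)
k = np.fft.fftfreq(n, d=1.0 / n)  # integer wavenumbers
magnitude = np.abs(f_hat)

# Wavenumbers whose amplitude is above rounding noise
peaks = sorted(int(kk) for kk in k[magnitude > 1e-8])
print("peaks at k =", peaks)

# Round trip: the inverse FFT should reproduce the original signal
roundtrip_error = np.abs(np.fft.ifft(f_hat).real - signal).max()
print("round-trip error:", roundtrip_error)
```

With an unnormalized forward transform, each of the two peaks has magnitude $N/2$.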

In [None]:
# Validate FFT kernels with a simple sine wave test
from utils import validate_fft_roundtrip

fig, axes, max_error = validate_fft_roundtrip(
    fft_kernel=fft_tiled,
    ifft_kernel=ifft_tiled,
    n_grid=N_GRID,
    tile_m=1,
    tile_n=N_GRID,
    block_dim=N_GRID // 2,
)

### Building Block 2.3:  Transpose kernel

To compute 2D FFT using 1D row transforms, we need a transpose operation between passes:

$$\text{2D FFT} = \text{row FFT} \rightarrow \text{transpose} \rightarrow \text{row FFT}.$$
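
The identity can be checked against NumPy's `fft2` on a small random array. Note that the two-pass result comes out transposed relative to `fft2`.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))

pass_1 = np.fft.fft(a, axis=1)         # FFT along each row
pass_2 = np.fft.fft(pass_1.T, axis=1)  # transpose, then FFT along each row again
reference = np.fft.fft2(a)

print(np.abs(pass_2 - reference.T).max())
```

In this solver the transposed layout is harmless: the $1/|k|^2$ array is symmetric under swapping $k_x$ and $k_y$ on a square grid, and the inverse pipeline applies a second transpose that restores the original orientation.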



#### Exercise: Fill in the missing `def transpose(...)` kernel.

In [None]:
@wp.kernel
def transpose(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    """Transpose a 2-D array.

    Args:
        x: Input complex array.
        y: Output complex array storing the transpose of x.
    """
    i, j = wp.tid()
    y[i, j] = x[j, i]  # MISSING

#### Validating transpose kernel

We validate this using an upper triangular matrix, which should become lower triangular after transpose. For example, in case of a $4 \times 4$ upper triangular matrix:

$$
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\xrightarrow{\text{transpose}}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}.
$$

#### Exercise: Run the cell below and feel free to change `n_test` to confirm that your transpose kernel is doing what's intended.

In [None]:
from utils import validate_transpose

fig, axes = validate_transpose(transpose_kernel=transpose, n_test=64)

### Going from $\hat{\omega}$ to $\hat{\psi}$ in the Fourier space

After transforming $\omega$ to Fourier space, we solve equation (3)

$$\hat{\psi}_{m,n} = \frac{\hat{\omega}_{m,n}}{k_x^2 + k_y^2} = \frac{\hat{\omega}_{m,n}}{|k|^2} = \hat{\omega}_{m,n} \cdot |k|^{-2}.$$

We multiply the `omega_hat` array by the $1/|k|^2$ array (`k2i`) precomputed at the beginning of the lab to obtain the `psi_hat` array.

The kernel below is already filled in for you.
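
For reference, the same Fourier-space solve can be written in a few lines of NumPy and checked against a field with a known answer. Here $\psi = \sin(x)\sin(2y)$, so $\omega = -\nabla^2\psi = 5\psi$. This sketch is independent of the Warp kernels, and the name `k2_inv` is illustrative.

```python
import numpy as np

n = 64
x = np.arange(n) * (2.0 * np.pi / n)
X, Y = np.meshgrid(x, x, indexing="ij")

psi_exact = np.sin(X) * np.sin(2.0 * Y)
omega = 5.0 * psi_exact  # omega = -laplacian(psi) = (1 + 4) * psi

# 1/|k|^2 with the k = 0 mode set to zero, as in the precomputed array
k = np.fft.fftfreq(n, d=1.0 / n)
kx, ky = np.meshgrid(k, k)
k2 = kx**2 + ky**2
k2_inv = np.zeros_like(k2)
k2_inv[k2 != 0] = 1.0 / k2[k2 != 0]

# psi_hat = omega_hat / |k|^2, then transform back to physical space
psi = np.fft.ifft2(np.fft.fft2(omega) * k2_inv).real
print(np.abs(psi - psi_exact).max())
```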

In [None]:
@wp.kernel
def multiply_k2_inverse(
    k2i: wp.array2d(dtype=wp.float32),
    omega_hat: wp.array2d(dtype=wp.vec2f),
    psi_hat: wp.array2d(dtype=wp.vec2f),
):
    """Solve Poisson equation in Fourier space: psi_hat = omega_hat / |k|^2.

    Args:
        k2i: Precomputed 1/|k|^2 array (0 at k=0).
        omega_hat: Fourier transform of vorticity.
        psi_hat: Output Fourier transform of stream function.
    """
    i, j = wp.tid()
    psi_hat[i, j] = omega_hat[i, j] * k2i[i, j]

## Assembling all the components into one solver loop

<p align="center">
  <img src="images/chapter-12.2/solver_pipeline_simple.svg" alt="Solver Pipeline" width="800">
</p>

Now that we have all the kernel building blocks in place, we'll assemble them into the complete `step()` function below. One call to `step()` performs one iteration of the solver loop.

In [None]:
def step(omega_0, omega_1, psi, rhs, k2i):
    """Advance simulation by one timestep using forward Euler.

    Full pipeline: advection/diffusion + Poisson solve via FFT.

    Args:
        omega_0: Vorticity at the start of timestep on the 2-D grid.
        omega_1: Vorticity at the end of timestep on the 2-D grid.
        psi: Stream function on the 2-D grid.
        rhs: Temporary array for advection + diffusion terms.
        k2i: Precomputed 1/|k|^2 for Poisson solver.
    """

    # Allocate temporary buffers for FFT operations
    omega_complex = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)
    fft_temp_1 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)
    fft_temp_2 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)
    fft_temp_3 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)

    # ===== Fourier Poisson solver =====

    # Convert omega_1 (real) to wp.vec2f() data type
    wp.launch(
        copy_float_to_vec2,
        dim=(N_GRID, N_GRID),
        inputs=[omega_1],
        outputs=[omega_complex],
    )

    # Forward 2D FFT: row FFT -> transpose -> row FFT
    wp.launch_tiled(
        fft_tiled,
        dim=[N_GRID, 1],
        inputs=[omega_complex],
        outputs=[fft_temp_1],
        block_dim=N_GRID // 2,
    )
    wp.launch(
        transpose, dim=[N_GRID, N_GRID], inputs=[fft_temp_1], outputs=[fft_temp_2]
    )
    wp.launch_tiled(
        fft_tiled,
        dim=[N_GRID, 1],
        inputs=[fft_temp_2],
        outputs=[fft_temp_3],
        block_dim=N_GRID // 2,
    )

    # Solve in Fourier space: psi_hat = omega_hat / |k|^2
    wp.launch(
        multiply_k2_inverse,
        dim=(N_GRID, N_GRID),
        inputs=[k2i, fft_temp_3],
        outputs=[fft_temp_1],
    )

    # Inverse 2D FFT: row IFFT -> transpose -> row IFFT
    wp.launch_tiled(
        ifft_tiled,
        dim=[N_GRID, 1],
        inputs=[fft_temp_1],
        outputs=[fft_temp_2],
        block_dim=N_GRID // 2,
    )
    wp.launch(
        transpose, dim=[N_GRID, N_GRID], inputs=[fft_temp_2], outputs=[fft_temp_3]
    )
    wp.launch_tiled(
        ifft_tiled,
        dim=[N_GRID, 1],
        inputs=[fft_temp_3],
        outputs=[fft_temp_1],
        block_dim=N_GRID // 2,
    )
    # Keep the real part and apply the N_GRID * N_GRID FFT scaling to obtain psi
    wp.launch(
        extract_real_and_scale,
        dim=(N_GRID, N_GRID),
        inputs=[wp.float32(N_GRID * N_GRID), fft_temp_1],
        outputs=[psi],
    )

    rhs.zero_()

    # ===== Discretization + Time Marching =====
    wp.launch(
        viscous_advection_kernel,
        dim=(N_GRID, N_GRID),
        inputs=[N_GRID, H, RE, DT, omega_0, omega_1, psi, rhs],
    )

    # Copy omega_1 back to omega_0 for next timestep
    wp.copy(omega_0, omega_1)
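
The "row FFT -> transpose -> row FFT" pattern used in `step()` can be checked on the CPU. The NumPy sketch below is only an illustration of that pattern, not the Warp implementation; the function names `fft2_rows_transpose` and `ifft2_rows_transpose` are hypothetical.

```python
import numpy as np


def fft2_rows_transpose(a):
    """2D FFT built from 1D row FFTs and one transpose (output is transposed)."""
    rows = np.fft.fft(a, axis=1)        # FFT along each row
    return np.fft.fft(rows.T, axis=1)   # transpose, then FFT along each row


def ifft2_rows_transpose(a_hat_t):
    """Inverse of the above; its transpose restores the original orientation."""
    rows = np.fft.ifft(a_hat_t, axis=1)
    return np.fft.ifft(rows.T, axis=1)


n = 16
rng = np.random.default_rng(0)
omega = rng.standard_normal((n, n))

# Forward pass gives the transposed 2D spectrum
omega_hat_t = fft2_rows_transpose(omega)

# Inverse pass returns the field in its original layout
roundtrip = ifft2_rows_transpose(omega_hat_t).real
```

Because the spectrum sits in transposed layout between the two passes, a pointwise multiplier applied there must either be symmetric under transposition (as $1/|k|^2$ is on a square grid with the same wavenumbers in both directions) or be stored transposed. Note also that NumPy's `ifft` includes the $1/N$ normalization, whereas `step()` passes `N_GRID * N_GRID` to `extract_real_and_scale`, which suggests the scaling is applied separately there.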

### CUDA graph capture

CUDA graphs let you define a sequence of operations (kernel launches, data movement) and their dependencies once, then replay it repeatedly with minimal overhead. Normally, each kernel launch incurs CPU-side setup costs. For short-running kernels launched many times, this overhead can dominate the runtime. By capturing the entire workflow in a graph, most of these setup costs are paid once during graph creation, and each subsequent replay is launched with far less CPU overhead.

In Warp, you can capture a CUDA graph with `wp.ScopedCapture()`. In the code block below, a single call to `step(...)` is captured and can then be replayed as many times as we want to run the simulation.

In [None]:
# capture first step in a CUDA graph
with wp.ScopedCapture() as capture:
    step(omega_0, omega_1, psi, rhs, k2i)
step_graph = capture.graph

### Benchmarking CUDA Graph WIP

### Running the simulation and saving the results

The code below runs the simulation for 2000 frames of 10 timesteps each (20,000 timesteps in total) and stores the vorticity field after every 10th timestep. The stored fields are then color-mapped and saved as a GIF.

In [None]:
import time
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import os

NUM_FRAMES = 2000
STEPS_PER_FRAME = 10

# Colormap setup
cmap = plt.cm.twilight
norm = Normalize(vmin=-15, vmax=15)

frames = []
print(f"Running {NUM_FRAMES} frames ({STEPS_PER_FRAME} steps each)...")

start_time = time.perf_counter()
for frame in range(NUM_FRAMES):
    # Advance simulation
    for _ in range(STEPS_PER_FRAME):
        wp.capture_launch(step_graph)

    # Capture frame
    vorticity = omega_1.numpy().T  # Transpose for correct orientation
    colored = cmap(norm(vorticity))
    rgb = (colored[:, :, :3] * 255).astype(np.uint8)
    frames.append(rgb)

    if (frame + 1) % 50 == 0:
        print(f"  Frame {frame + 1}/{NUM_FRAMES}")

elapsed = time.perf_counter() - start_time
total_steps = NUM_FRAMES * STEPS_PER_FRAME
print(
    f"Completed {total_steps} steps in {elapsed:.2f}s ({total_steps / elapsed:.0f} steps/s)"
)

In [None]:
import IPython.display

# Create output directory
os.makedirs("./images/chapter-12.2", exist_ok=True)
output_file = f"./images/chapter-12.2/turbulence_{N_GRID}x{N_GRID}_Re{int(RE)}.gif"

# Save as GIF (resize to reduce file size)
GIF_SIZE = 256  # Target resolution for GIF
pil_images = [
    Image.fromarray(frame).resize((GIF_SIZE, GIF_SIZE), Image.LANCZOS)
    for frame in frames
]
pil_images[0].save(
    output_file,
    save_all=True,
    append_images=pil_images[1:],
    duration=50,
    loop=0,
)

print(f"Saved: {output_file}")
IPython.display.Image(output_file)

---
## Conclusion WIP


---
## References WIP

## Appendix A

### 1. The governing equations
In 2-D incompressible flow (density constant), the Navier-Stokes equations consist of three equations:
 
Momentum in $x$
$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} = -\frac{\partial p}{\partial x} + \frac{1}{Re}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right),$$
 
Momentum in $y$
$$\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} = -\frac{\partial p}{\partial y} + \frac{1}{Re}\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right),$$
 
Continuity
$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = 0.$$
 
**This means solving three coupled equations for three unknowns: the velocity components $(u, v)$ and the pressure $p$.** The Reynolds number $Re$ is a dimensionless parameter that dictates the large-scale behavior of the fluid in the box.

But there's a cleaner approach using two scalar fields defined in terms of $(u, v)$, namely:

- **Vorticity** $\omega$: Measures local rotation of the fluid. In 2-D, only the out-of-plane ($z$) component is non-zero; we denote it $\omega = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y}$ in subsequent sections.
 
$$\boldsymbol{\omega} = \nabla \times \mathbf{u} = \begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ \frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z}=0 \\ u & v & w=0 \end{vmatrix} = \begin{pmatrix} 0 \\ 0 \\ \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y} \end{pmatrix}.$$
  
- **Stream function** $\psi$: Defines the velocity through its derivatives, which ensures that the continuity equation is satisfied automatically:
$$u = \frac{\partial \psi}{\partial y}, \quad v = -\frac{\partial \psi}{\partial x}.$$
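
As a short worked check (assuming $\psi$ is smooth enough for its mixed partial derivatives to commute), substituting these definitions into the continuity equation and into the definition of vorticity gives

$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = \frac{\partial^2 \psi}{\partial x \, \partial y} - \frac{\partial^2 \psi}{\partial y \, \partial x} = 0, \qquad \omega = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y} = -\frac{\partial^2 \psi}{\partial x^2} - \frac{\partial^2 \psi}{\partial y^2} = -\nabla^2 \psi.$$

The first identity is the continuity equation; the second is the Poisson equation $\nabla^2 \psi = -\omega$ that links the two scalar fields.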

### 2. The boundary conditions
We choose to simulate the above equations in a square box with **periodic boundary conditions**. In simpler terms, *what exits from one side enters from the other*.

To illustrate this further, we tile the original $N \times N$ simulation side by side to form a $2N \times 2N$ GIF. Notice how structures repeat across tiles and vortices maintain continuity (and differentiability!) at the edges.
 
![4pi_simulation](images/chapter-12.2/periodicity_ns_tiled.gif)

Which mathematical functions have the property of repeating after regular intervals and maintaining continuity/differentiability at the edges?

**Sines and cosines!** These functions are the building blocks of periodic functions. A function $f(x)$ with period $L$ satisfies $f(x + L) = f(x)$, exactly like $\sin(x)$ and $\cos(x)$ with period $2\pi$.

<img src="images/chapter-12.2/periodicity_sin_cos.png" alt="Periodicity of sin and cos" width="60%">

**Periodic boundary conditions** and **Fourier methods** go hand in hand. Any periodic function can be decomposed into a sum of sines and cosines (a Fourier series), and operations like derivatives become simple multiplications in Fourier space. We'll use this when solving the Poisson equation. When $\omega$ and $\psi$ are represented using Fourier series as
 
$$\omega(x,y) = \sum_{m,n} \hat{\omega}_{m,n} e^{i(k_x x + k_y y)}, \quad \psi(x,y) = \sum_{m,n} \hat{\psi}_{m,n} e^{i(k_x x + k_y y)},$$ 

the Poisson equation transforms as 

$$\nabla^2 \psi = -\omega \quad \xrightarrow{\text{Fourier}} \quad -(k_x^2 + k_y^2)\hat{\psi} = -\hat{\omega} \quad \Rightarrow \quad \hat{\psi} = \frac{\hat{\omega}}{k_x^2 + k_y^2}. \tag{3}$$

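Thus, we have **transformed a PDE in physical space to a simple algebraic equation in Fourier space**. This allows us to use Warp's FFT functionality to solve our Equation (2) above. The $k_x = k_y = 0$ mode is excluded from the division (the precomputed `k2i` array is 0 there), which fixes the mean of $\psi$ to zero.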
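
The algebraic solve can be tried out on the CPU with NumPy's FFT. The sketch below is an illustration under stated assumptions rather than the notebook's Warp code: it uses a manufactured solution $\psi = \sin(x)\cos(2y)$, for which $\omega = -\nabla^2 \psi = 5\psi$, on a $2\pi \times 2\pi$ periodic grid, and all variable names are chosen for this example only.

```python
import numpy as np

n = 64
x = np.arange(n) * (2.0 * np.pi / n)  # periodic grid on [0, 2*pi)
X, Y = np.meshgrid(x, x, indexing="ij")

# Manufactured solution: psi = sin(x) cos(2y)  =>  omega = -laplacian(psi) = 5 psi
psi_exact = np.sin(X) * np.cos(2.0 * Y)
omega = 5.0 * psi_exact

# Integer wavenumbers in FFT ordering
k = np.fft.fftfreq(n, d=1.0 / n)
KX, KY = np.meshgrid(k, k, indexing="ij")
k2 = KX**2 + KY**2
k2[0, 0] = 1.0  # placeholder to avoid dividing by zero at k = 0

# psi_hat = omega_hat / |k|^2, with the k = 0 mode set to zero
psi_hat = np.fft.fft2(omega) / k2
psi_hat[0, 0] = 0.0
psi = np.fft.ifft2(psi_hat).real

max_error = np.abs(psi - psi_exact).max()
```

For a smooth, band-limited field like this one, `max_error` should be at the level of floating-point round-off, since the spectral solve introduces no discretization error for resolved modes.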

Following common convention in academic literature, **we set our simulation domain to $2\pi \times 2\pi$**. This choice naturally simplifies Fourier transform operations, as the wavenumbers become integers $(k_x, k_y = 0, \pm 1, \pm 2, \ldots)$.
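
To make the last two points concrete, the NumPy sketch below (an illustration only; the variable names are chosen for this example) shows that the wavenumbers $k = 2\pi m / L$ reduce to integers when $L = 2\pi$, and that a derivative becomes a multiplication by $ik$ in Fourier space.

```python
import numpy as np

n = 32
length = 2.0 * np.pi
x = np.arange(n) * (length / n)

# Wavenumbers k = 2*pi*m/L; with L = 2*pi they reduce to the integers m
k = 2.0 * np.pi * np.fft.fftfreq(n, d=length / n)

# Differentiate f(x) = sin(3x) by multiplying its Fourier coefficients by i*k
f = np.sin(3.0 * x)
dfdx = np.fft.ifft(1j * k * np.fft.fft(f)).real

# Compare with the analytical derivative 3 cos(3x)
deriv_error = np.abs(dfdx - 3.0 * np.cos(3.0 * x)).max()
```

With a different domain length, the same code still works, but `k` would no longer hold integers and the factor $2\pi / L$ would have to be carried through every spectral operation.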

### 3. The initial conditions

For the initial conditions, we seed the domain with an array of swirling vortices. The derivation of this particular initialization is beyond the scope of this lab; interested readers can refer to San and Staples (2012), *Computers & Fluids*, for further details.

<img src="images/chapter-12.2/initial_vorticity.png" alt="Initial vorticity field" width="40%">