In [None]:
# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Chapter 12.2: GPU-Accelerated 2-D Navier-Stokes Simulation using NVIDIA Warp

## Overview

In this notebook, we will implement a 2-D Navier-Stokes (N-S) solver in a periodic box using the vorticity-streamfunction formulation. The Poisson equation relating streamfunction to vorticity is solved spectrally using tile-based FFT in Warp.

Through this implementation, you will learn:

* Writing SIMT (single instruction, multiple threads) kernels for the Navier-Stokes operators and flow initialization
* Using Warp's tile-based programming primitives for efficient matrix operations
* Using CUDA graph capture for performance

By the end of the notebook, we will have a GPU-accelerated 2-D Navier-Stokes solver in a periodic box and hands-on experience in applying Warp's features to speed up a computational physics problem on the GPU.

---
## Setup

Before we begin implementing the 2-D N-S solver, let's ensure we have all the necessary packages installed.

In [None]:
# install Warp
%pip install warp-lang 

# install visualization package
%pip install matplotlib Pillow

Now let us import the necessary libraries and initialize Warp to check if GPU support is available:

In [None]:
import numpy as np
import warp as wp

if wp.get_cuda_device_count() > 0:
    print("GPU detected successfully")
else:
    print("No GPU detected!")

---
## Introduction

This seemingly simple 2-D N-S solver example combines multiple Warp features that can be leveraged to build industrial-grade solvers:
 - CUDA kernels for finite difference operators
 - Tile-based matrix operations for more performant code
 - CUDA graph capture to speed up repeated launches of the same sequence of kernels

Before diving into the code, let's understand a bit of physics behind the Navier-Stokes equations in 2-D. 

---
## Background: Equations, boundary conditions, and initial conditions

### 1. Vorticity-stream function formulation

For 2-D incompressible flow, we define:
- **Vorticity**: $\omega = \partial v/\partial x - \partial u/\partial y$
- **Streamfunction**: $\psi$ such that $u = \partial\psi/\partial y$, $v = -\partial\psi/\partial x$

This formulation automatically satisfies continuity and eliminates pressure from the Navier-Stokes equations.

### 2. Governing equation

In 2-D, the non-dimensional governing equations for incompressible flows can be written using the **vorticity-stream function transport equation** as

$$\underbrace{\frac{\partial \omega}{\partial t}}_{\text{unsteadiness}} + \underbrace{\frac{\partial \psi}{\partial y}\frac{\partial \omega}{\partial x} - \frac{\partial \psi}{\partial x}\frac{\partial \omega}{\partial y}}_{\text{advection}} = \underbrace{\frac{1}{Re}\left(\frac{\partial^2 \omega}{\partial x^2} + \frac{\partial^2 \omega}{\partial y^2}\right)}_{\text{diffusion}}, \tag{1}$$

along with the relationship between stream function and vorticity, given by a Poisson equation. 

Starting from the definition of vorticity (in 2-D)
 
$$\omega = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y}.$$
 
Substituting the streamfunction relations $u = \frac{\partial \psi}{\partial y}$ and $v = -\frac{\partial \psi}{\partial x}$ gives
 
$$\omega = \frac{\partial}{\partial x}\left(-\frac{\partial \psi}{\partial x}\right) - \frac{\partial}{\partial y}\left(\frac{\partial \psi}{\partial y}\right),$$
 
$$\omega = -\frac{\partial^2 \psi}{\partial x^2} - \frac{\partial^2 \psi}{\partial y^2}.$$
 
Rearranging gives the Poisson equation
 
$$\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} = -\omega. \tag{2}$$

When writing the solver, equations (1) and (2) will be solved at each time step to evolve the flow.

### 3. Spectral Poisson solver

For periodic domains, equation (2) can be directly solved in Fourier space. The wavenumbers are defined as $k_x = \frac{2\pi m}{L_x}, \quad k_y = \frac{2\pi n}{L_y}$, where $m$ and $n$ are the Fourier space indices. The vorticity field $\omega$ is converted to its Fourier representation $\hat{\omega}_{m,n}$ using a 2-D FFT. The Poisson equation then becomes algebraic as follows

$$\hat{\psi}_{m,n} = \frac{\hat{\omega}_{m,n}}{k_x^2 + k_y^2}. \tag{3}$$

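$\hat{\psi}_{m,n}$ is then converted back to physical space using an inverse FFT. The $(m, n) = (0, 0)$ mode, for which the denominator vanishes, is set to zero; this fixes the arbitrary additive constant in $\psi$.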
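
As a rough illustration of equation (3), the sketch below solves the periodic Poisson problem on a tiny grid, using a naive pure-Python DFT in place of an FFT. The helper names (`dft2`, `signed_wavenumber`, `solve_poisson_periodic`, `single_mode_fields`) are hypothetical and not part of the notebook, and the test field $\psi = \sin(2x)\cos(y)$ is chosen only because its vorticity, $\omega = 5\psi$, is known exactly.

```python
import cmath
import math


def dft2(f, sign):
    """Naive unnormalized 2-D DFT of a square list of lists (sign=-1 forward, +1 inverse)."""
    n = len(f)
    out = [[0j] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            acc = 0j
            for i in range(n):
                for j in range(n):
                    acc += f[i][j] * cmath.exp(sign * 2j * math.pi * (p * i + q * j) / n)
            out[p][q] = acc
    return out


def signed_wavenumber(p, n):
    """Map DFT index p to a signed integer wavenumber (box length 2*pi)."""
    return p if p < n // 2 else p - n


def solve_poisson_periodic(omega):
    """Return psi such that laplacian(psi) = -omega on a periodic 2*pi box."""
    n = len(omega)
    omega_hat = dft2(omega, -1.0)
    psi_hat = [[0j] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            k2 = signed_wavenumber(p, n) ** 2 + signed_wavenumber(q, n) ** 2
            if k2 != 0:  # leave the mean (k = 0) mode at zero
                psi_hat[p][q] = omega_hat[p][q] / k2
    psi = dft2(psi_hat, +1.0)
    # the inverse DFT above is unnormalized, so divide by n * n
    return [[psi[i][j].real / (n * n) for j in range(n)] for i in range(n)]


def single_mode_fields(n=8):
    """psi = sin(2x) cos(y), for which omega = -laplacian(psi) = 5 psi."""
    h = 2.0 * math.pi / n
    psi = [[math.sin(2 * i * h) * math.cos(j * h) for j in range(n)] for i in range(n)]
    omega = [[5.0 * value for value in row] for row in psi]
    return psi, omega


psi_exact, omega = single_mode_fields()
psi_num = solve_poisson_periodic(omega)
max_err = max(abs(a - b) for ra, rb in zip(psi_num, psi_exact) for a, b in zip(ra, rb))
print(f"max |psi_num - psi_exact| = {max_err:.2e}")
```

Because the test field contains only resolved Fourier modes, the spectral solve should recover $\psi$ to round-off. The naive DFT scales as $O(N^4)$ and is only meant for very small grids.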

### 4. Time integration

For time stepping, we use a strong-stability-preserving third-order Runge-Kutta (SSP-RK3) method:

$$\omega^{(1)} = \omega^{(n)} + \Delta t \mathcal{L}(\omega^{(n)}),$$

$$\omega^{(2)} = \frac{3}{4}\omega^{(n)} + \frac{1}{4}\omega^{(1)} + \frac{1}{4}\Delta t \mathcal{L}(\omega^{(1)}),$$

$$\omega^{(n+1)} = \frac{1}{3}\omega^{(n)} + \frac{2}{3}\omega^{(2)} + \frac{2}{3}\Delta t \mathcal{L}(\omega^{(2)}). \tag{4}$$

Here $\mathcal{L}(\omega)$ encapsulates the finite-difference forms of both advection and diffusion terms in equation (1). 

**NOTE**: Other similar time-stepping schemes can be implemented with minimal changes to the code, but we use the scheme above as a demonstration example.
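
To make the structure of equation (4) concrete, here is a minimal sketch that applies the same three substeps to the scalar test problem $dy/dt = -y$ instead of the vorticity field. Each substep is written as `new = c0 * y_n + c1 * old + c2 * dt * L(old)`. The names `SSP_RK3_COEFFS`, `ssp_rk3_step`, `integrate`, and `decay` are hypothetical, and the test problem is an assumption chosen because its exact solution, $e^{-t}$, is known.

```python
import math

# (c0, c1, c2) for each substep: new = c0 * y_n + c1 * old + c2 * dt * L(old)
SSP_RK3_COEFFS = [
    (1.0, 0.0, 1.0),
    (3.0 / 4.0, 1.0 / 4.0, 1.0 / 4.0),
    (1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0),
]


def ssp_rk3_step(rhs, y_n, dt):
    """Advance y_n by one time step using the three substeps of equation (4)."""
    y = y_n
    for c0, c1, c2 in SSP_RK3_COEFFS:
        y = c0 * y_n + c1 * y + c2 * dt * rhs(y)
    return y


def integrate(rhs, y0, dt, n_steps):
    """Apply ssp_rk3_step repeatedly, starting from y0."""
    y = y0
    for _ in range(n_steps):
        y = ssp_rk3_step(rhs, y, dt)
    return y


def decay(y):
    """Right-hand side of dy/dt = -y."""
    return -y


# integrate to t = 1 with two step sizes and compare against exp(-1)
err_coarse = abs(integrate(decay, 1.0, 0.1, 10) - math.exp(-1.0))
err_fine = abs(integrate(decay, 1.0, 0.05, 20) - math.exp(-1.0))
print(f"error ratio when dt is halved: {err_coarse / err_fine:.2f}")
```

For a third-order scheme, halving $\Delta t$ should reduce the error by a factor of roughly $2^3 = 8$, which is what the printed ratio is meant to check.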

### 5. Initial condition: 2-D decaying turbulence

In this problem, we solve the governing equations in a square box of dimensions $2\pi \times 2\pi$. The initial condition is generated using the energy spectrum from San and Staples (2012):

$$E(k) = \frac{a_s}{2 k_p}\left(\frac{k}{k_p}\right)^{2s+1} \exp\left[-\left(s + \frac{1}{2}\right)\left(\frac{k}{k_p}\right)^2\right] \tag{5}$$

where:
- $k = |\mathbf{k}| = \sqrt{k_x^2 + k_y^2}$,
- $k_p = 12$: peak wavenumber, 
- $s = 3$: shape parameter,
- $a_s = (2s+1)^{s+1}/(2^s s!)$.
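
As a quick sanity check of equation (5), the following plain-Python sketch evaluates the spectrum at integer wavenumbers and locates its maximum. The function names `spectrum_coefficient` and `energy_spectrum_ref` are hypothetical (the `_ref` suffix is only there to avoid clashing with the Warp function defined later), and the scan range of $k = 1, \dots, 64$ is an arbitrary choice.

```python
import math


def spectrum_coefficient(s):
    """a_s = (2s + 1)^(s + 1) / (2^s * s!)."""
    return (2 * s + 1) ** (s + 1) / (2**s * math.factorial(s))


def energy_spectrum_ref(k, s=3, kp=12.0):
    """Energy spectrum E(k) of equation (5)."""
    a_s = spectrum_coefficient(s)
    return a_s / (2.0 * kp) * (k / kp) ** (2 * s + 1) * math.exp(-(s + 0.5) * (k / kp) ** 2)


# scan integer wavenumbers and find where the spectrum is largest
peak_k = max(range(1, 65), key=energy_spectrum_ref)
print(f"a_s = {spectrum_coefficient(3):.4f}, spectrum peaks at k = {peak_k}")
```

Setting $dE/dk = 0$ in equation (5) gives $k = k_p$, so the scan is expected to return the peak wavenumber itself.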

The magnitude of Fourier coefficients related to the initial energy spectrum is given by 

$$|\hat{\omega}(\mathbf{k})| = \sqrt{\frac{k}{\pi}E(k)},$$

The phase is then randomized with a phase function $\zeta(\mathbf{k}) = \xi(\mathbf{k}) + \eta(\mathbf{k})$, where $\xi, \eta \in [0, 2\pi]$ are independently chosen random values at each wavenumber in the first quadrant of the $k_x - k_y$ plane:

$$\hat{\omega}(\mathbf{k}) = \sqrt{\frac{k}{\pi}E(k)} \, e^{i\zeta(\mathbf{k})}. \tag{6}$$
 
The phase function satisfies the following symmetry relations to ensure that the vorticity field in the physical space is real-valued:
 
$$\xi(-k_x, k_y) = -\xi(k_x, k_y),$$
$$\xi(-k_x, -k_y) = -\xi(k_x, k_y),$$
$$\xi(k_x, -k_y) = \xi(k_x, k_y),$$
 
$$\eta(-k_x, k_y) = \eta(k_x, k_y),$$
$$\eta(-k_x, -k_y) = -\eta(k_x, k_y),$$
$$\eta(k_x, -k_y) = -\eta(k_x, k_y).$$
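
The sketch below illustrates why these relations give a real-valued field: they imply $\zeta(-\mathbf{k}) = -\zeta(\mathbf{k})$, so each mode is paired with its complex conjugate. The names `sign`, `total_phase`, and `field_value` are hypothetical, and all modes are given unit amplitude for simplicity. The sketch takes the relations literally, so $\xi$ contributes nothing on the $k_x = 0$ axis and $\eta$ contributes nothing on the $k_y = 0$ axis; this is a choice made for the sketch and is not necessarily how the notebook's kernel treats those modes.

```python
import cmath
import math
import random


def sign(n):
    """Return -1, 0, or +1 according to the sign of the integer n."""
    return (n > 0) - (n < 0)


def total_phase(kx, ky, xi, eta):
    """zeta(kx, ky) built from first-quadrant tables indexed by (|kx|, |ky|)."""
    x = xi[abs(kx)][abs(ky)]
    e = eta[abs(kx)][abs(ky)]
    # xi is odd in kx and even in ky; eta is even in kx and odd in ky
    return sign(kx) * x + sign(ky) * e


def field_value(x, y, k_max, xi, eta):
    """Sum unit-amplitude modes exp(i * (zeta + kx * x + ky * y)) over |kx|, |ky| <= k_max."""
    total = 0j
    for kx in range(-k_max, k_max + 1):
        for ky in range(-k_max, k_max + 1):
            total += cmath.exp(1j * (total_phase(kx, ky, xi, eta) + kx * x + ky * y))
    return total


rng = random.Random(0)
K_MAX = 3
xi = [[rng.uniform(0.0, 2.0 * math.pi) for _ in range(K_MAX + 1)] for _ in range(K_MAX + 1)]
eta = [[rng.uniform(0.0, 2.0 * math.pi) for _ in range(K_MAX + 1)] for _ in range(K_MAX + 1)]

# evaluate the summed field at an arbitrary point in the box
value = field_value(0.7, 1.9, K_MAX, xi, eta)
print(f"imaginary part of the summed field: {value.imag:.2e}")
```

The imaginary part should vanish to round-off, since the contributions from $\mathbf{k}$ and $-\mathbf{k}$ are complex conjugates of each other.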

**NOTE**: Other similar flow initialization can be implemented, but we use the decaying turbulence above as a demonstration example. We encourage you to play with different flow initializations.

---
## Building blocks of the solver

$$
\boxed{
\begin{array}{c}
\textbf{Solver Pipeline} \\[10pt]
\text{Initialize } \hat{\omega} \;\longrightarrow\; \omega(t) \;\longrightarrow\; 
\underbrace{\text{Discretize} \to \text{RK step} \to \text{Poisson solve}}_{\times\, 3 \text{ sub-steps}} 
\;\longrightarrow\; \omega(t+\Delta t) \\[-3pt]
\hspace{6.5em} \nwarrow \hspace{15em} \swarrow \\[-6pt]
\hspace{7em} \xleftarrow{\hspace{4em} \text{time loop} \hspace{4em}}
\end{array}
}
$$

There are four main building blocks of the solver, which we will tackle in sequence:

- Initialize $\omega$ — Generate $\hat{\omega}$ in Fourier space using the energy spectrum specified above and inverse FFT to physical space.
- Discretize the advection and diffusion terms to obtain the RHS $\mathcal{L}(\omega)$ of $\frac{\partial \omega}{\partial t} = \mathcal{L}(\omega)$.
- RK substep to advance $\omega(t)$ to $\omega(t+\Delta t)$ using Runge-Kutta time stepping.
- Solve Poisson equation $\hat{\psi}_{m,n} = \hat{\omega}_{m,n}/(k_x^2 + k_y^2)$ and inverse FFT to $\psi$.  

---
## Simulation parameters

In [None]:
# grid resolution
N_GRID = 512
# box size
LEN = 2 * np.pi
# delta t for timestepping
DT = 0.001
# Reynolds number
RE = 1000.0

# parameters for Warp's tiled-FFT functionality
TILE_M = 1
TILE_N = N_GRID
TILE_TRANSPOSE_DIM = 16
BLOCK_DIM = TILE_N // 2

---
## Building block 1: Flow initialization in Warp

Recall from equation (6) in the Background section that the initial vorticity field $\hat{\omega}(\mathbf{k})$ in Fourier space is given by

$$\hat{\omega}(\mathbf{k}) = \underbrace{\sqrt{\frac{k}{\pi}E(k)}}_{\text{amplitude}} \cdot \underbrace{e^{i\zeta(\mathbf{k})}}_{\text{phase}},$$

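where $E(k)$ is the energy spectrum (equation 5) and $\zeta(\mathbf{k}) = \xi(\mathbf{k}) + \eta(\mathbf{k})$ is the phase function with symmetry constraints ensuring a real-valued physical field. Note that the code below stores the first-quadrant values of $\xi$ and $\eta$ in the arrays `zeta` and `eta`, respectively, so the array `zeta` holds $\xi$, not the total phase $\zeta$.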

**What do we need to implement?**

To initialize the flow, we need:

1. A kernel to populate $\hat{\omega}(\mathbf{k})$ on the entire 2-D grid using equation (6).

2. Two helper functions for that kernel (and equation 6):

     a. A function to compute the **energy spectrum** $E(k)$ at each wavenumber magnitude.

     b. A function to compute the **phase** $\zeta(\mathbf{k})$ with proper symmetry across quadrants.
  
The helper functions will be `@wp.func` device functions (callable from kernels), and the initialization will be a `@wp.kernel` that launches one thread per grid point. Let's build these step by step. 

**We don't want you to be bogged down by the details of the two helper functions. Assuming that the helper functions are correct, please populate the kernel `def decaying_turbulence_initializer(...)`.**


In [None]:
@wp.func
def factorial(n: wp.int32) -> wp.int32:
    """Compute factorial.

    Args:
        n: Input integer for which we want factorial.

    Returns:
        Factorial of input n.
    """
    result = wp.int32(1)
    for i in range(2, n + 1):
        result *= i
    return result


@wp.func
def energy_spectrum(k: wp.float32, s: wp.int32, kp: wp.float32) -> wp.float32:
    """Compute energy at wavenumber magnitude k.

    Follows San and Staples 2012 Computers and Fluids (page 49).
    https://www.sciencedirect.com/science/article/abs/pii/S0045793012001363.

    Args:
        k: Input wavenumber magnitude.
        s: Shape parameter of spectrum.
        kp: Wavenumber magnitude at which maximum of energy spectrum lies.

    Returns:
        Energy contained at wavenumber magnitude k.
    """
    s_factorial = wp.float32(factorial(s))
    s_float32 = wp.float32(s)
    a_s = (2.0 * s_float32 + 1.0) ** (s_float32 + 1.0) / (2.0**s_float32 * s_factorial)
    energy_k = (
        a_s
        / (2.0 * kp)
        * (k / kp) ** (2.0 * s_float32 + 1.0)
        * wp.exp(-(s_float32 + 0.5) * (k / kp) ** 2.0)
    )
    return energy_k


@wp.func
def phase_randomizer(
    n: int,
    zeta: wp.array2d(dtype=wp.float32),
    eta: wp.array2d(dtype=wp.float32),
    i: int,
    j: int,
) -> wp.float32:
    """Calculate value of the random phase at index (i, j).

    Follows San and Staples 2012 to return phase value in any quadrant based on
    the values of eta and zeta in the first quadrant.

    Args:
        n: Size of the simulation domain.
        zeta: First phase function.
        eta: Second phase function
        i: rowwise index on the 2-D simulation domain.
        j: columnwise index on the 2-D simulation domain

    Returns:
        Value of the random phase in any quadrant.
    """
    n_half = n // 2

    # first quadrant
    if i < n_half and j < n_half:
        return zeta[i, j] + eta[i, j]
    # second quadrant
    if i >= n_half and j < n_half:
        return -zeta[n - i, j] + eta[n - i, j]
    # third quadrant
    if i >= n_half and j >= n_half:
        return -zeta[n - i, n - j] - eta[n - i, n - j]
    # fourth quadrant
    return zeta[i, n - j] - eta[i, n - j]

@wp.kernel
def decaying_turbulence_initializer(
    n: int,
    k_cutoff: wp.float32,
    s: wp.int32,
    k_mag: wp.array2d(dtype=wp.float32),
    zeta: wp.array2d(dtype=wp.float32),
    eta: wp.array2d(dtype=wp.float32),
    omega_hat_init: wp.array2d(dtype=wp.vec2f), # In Warp, wp.vec2f/wp.vec2d are used to handle complex numbers
):
    """Initialize the vorticity field in Fourier space for decaying turbulence.

    Args:
        n: Size of the simulation domain.
        k_cutoff: Wavenumber magnitude at which maximum of energy spectrum lies.
        s: Shape parameter of the energy spectrum.
        k_mag: Wavenumber magnitude on the 2-D grid.
        zeta: First phase function for phase randomization.
        eta: Second phase function for phase randomization.
        omega_hat_init: Output vorticity field in Fourier space.
    """
    i, j = wp.tid()

    amplitude = ... # MISSING
    phase = ... # MISSING
    omega_hat_init[i, j] = ... # MISSING


---
## Building block 2: Discretization of advection and viscous diffusion terms for RHS

Recall the vorticity transport equation (1) from the Background section:

$$\frac{\partial \omega}{\partial t} + \underbrace{\frac{\partial \psi}{\partial y}\frac{\partial \omega}{\partial x} - \frac{\partial \psi}{\partial x}\frac{\partial \omega}{\partial y}}_{\text{advection}} = \underbrace{\frac{1}{Re}\left(\frac{\partial^2 \omega}{\partial x^2} + \frac{\partial^2 \omega}{\partial y^2}\right)}_{\text{diffusion}}$$

We discretize the **advection** and **diffusion** terms on a **uniform grid** using **central finite difference schemes**.

### Advection term

Using central differences for first derivatives with grid spacing $h$,

$$\frac{\partial \omega}{\partial x}\bigg|_{i,j} \approx \frac{\omega_{i+1,j} - \omega_{i-1,j}}{2h}, \quad \frac{\partial \omega}{\partial y}\bigg|_{i,j} \approx \frac{\omega_{i,j+1} - \omega_{i,j-1}}{2h}, \quad \frac{\partial \psi}{\partial x}\bigg|_{i,j} \approx \frac{\psi_{i+1,j} - \psi_{i-1,j}}{2h}, \quad \frac{\partial \psi}{\partial y}\bigg|_{i,j} \approx \frac{\psi_{i,j+1} - \psi_{i,j-1}}{2h}.$$

The advection term becomes

$$\text{Advection}_{i,j} = \underbrace{\frac{\omega_{i+1,j} - \omega_{i-1,j}}{2h} \cdot \frac{\psi_{i,j+1} - \psi_{i,j-1}}{2h}}_{\texttt{term\_1}} - \underbrace{\frac{\omega_{i,j+1} - \omega_{i,j-1}}{2h} \cdot \frac{\psi_{i+1,j} - \psi_{i-1,j}}{2h}}_{\texttt{term\_2}}.$$

### Diffusion (or Laplacian) term

Using the standard 5-point stencil for the Laplacian,

$$\text{Diffusion}_{i,j}  = \nabla^2 \omega\bigg|_{i,j} = \frac{\partial^2 \omega}{\partial x^2}\bigg|_{i,j} + \frac{\partial^2 \omega}{\partial y^2}\bigg|_{i,j} \approx \frac{\omega_{i+1,j} - 2\omega_{i,j} + \omega_{i-1,j}}{h^2} + \frac{\omega_{i,j+1} - 2\omega_{i,j} + \omega_{i,j-1}}{h^2} = \frac{\omega_{i+1,j} + \omega_{i-1,j} + \omega_{i,j+1} + \omega_{i,j-1} - 4\omega_{i,j}}{h^2}.$$
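
The two stencils above can be checked against fields with known derivatives. The following is a plain-Python reference sketch, not the Warp implementation: the names `central_first_derivatives`, `advection_ref`, `laplacian_ref`, and `sample` are hypothetical, the first index is taken as the $x$ direction, and periodic wrapping uses Python's `%` operator. The test fields $\omega = \sin x$, $\psi = \sin y$ (exact advection term $\cos x \cos y$) and $\sin x \cos 2y$ (exact Laplacian $-5 \sin x \cos 2y$) are assumptions chosen for convenience.

```python
import math


def central_first_derivatives(f, i, j, h):
    """Central differences of f in x (first index) and y (second index), periodic."""
    n = len(f)
    d_dx = (f[(i + 1) % n][j] - f[(i - 1) % n][j]) / (2.0 * h)
    d_dy = (f[i][(j + 1) % n] - f[i][(j - 1) % n]) / (2.0 * h)
    return d_dx, d_dy


def advection_ref(omega, psi, i, j, h):
    """term_1 - term_2 from the advection formula above."""
    domega_dx, domega_dy = central_first_derivatives(omega, i, j, h)
    dpsi_dx, dpsi_dy = central_first_derivatives(psi, i, j, h)
    return domega_dx * dpsi_dy - domega_dy * dpsi_dx


def laplacian_ref(omega, i, j, h):
    """5-point Laplacian on a periodic grid."""
    n = len(omega)
    neighbors = (
        omega[(i + 1) % n][j]
        + omega[(i - 1) % n][j]
        + omega[i][(j + 1) % n]
        + omega[i][(j - 1) % n]
    )
    return (neighbors - 4.0 * omega[i][j]) / (h * h)


def sample(func, n):
    """Sample func(x, y) on an n x n grid covering the 2*pi periodic box."""
    h = 2.0 * math.pi / n
    return [[func(i * h, j * h) for j in range(n)] for i in range(n)]


N = 64
H_REF = 2.0 * math.pi / N
omega = sample(lambda x, y: math.sin(x), N)
psi = sample(lambda x, y: math.sin(y), N)
wave = sample(lambda x, y: math.sin(x) * math.cos(2.0 * y), N)

# compare against the analytic advection term cos(x) * cos(y)
adv_err = max(
    abs(advection_ref(omega, psi, i, j, H_REF) - math.cos(i * H_REF) * math.cos(j * H_REF))
    for i in range(N)
    for j in range(N)
)
# compare against the analytic Laplacian -5 * sin(x) * cos(2y)
lap_err = max(
    abs(laplacian_ref(wave, i, j, H_REF) + 5.0 * wave[i][j]) for i in range(N) for j in range(N)
)
print(f"advection error = {adv_err:.2e}, Laplacian error = {lap_err:.2e}")
```

Both stencils are second-order accurate, so the errors should shrink by about a factor of four each time `N` is doubled.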

### Runge-Kutta timestepping

Recall the SSP-RK3 scheme from equation (4). Each sub-step has the general form:

$$\omega^{(\text{new})} = c_0 \cdot \omega^{(n)} + c_1 \cdot \omega^{(\text{old})} + c_2 \cdot \Delta t \cdot \mathcal{L}(\omega^{(\text{old})})$$

where $\mathcal{L}(\omega) = \frac{1}{Re}\text{Diffusion} - \text{Advection}$ is the RHS we just discretized above, $\omega^{(n)}$ is the vorticity field at the beginning of the time step (time $t$), and $\omega^{(\text{old})}$ and $\omega^{(\text{new})}$ are the vorticity fields at the beginning and the end of an RK substep, respectively.

**NOTE**: Only at the first substep is $\omega^{(\text{old})}$ equal to $\omega(t)$, and only at the last substep is $\omega^{(\text{new})}$ equal to $\omega(t+\Delta t)$. In between, $\omega^{(\text{old})}$ and $\omega^{(\text{new})}$ are the intermediate stage values $\omega^{(1)}$ and $\omega^{(2)}$.

In a Warp kernel, all of these operations would translate to:
1. Compute `rhs = (1/Re) * diffusion(...) - advection(...)`.
2. Update `omega_new = coeff0 * omega_n + coeff1 * omega_old + coeff2 * dt * rhs`, where `omega_n` is the vorticity field at the very beginning of the RK time step. By passing different coefficients on each call, the same kernel can handle all three RK substeps.

**What do we need to implement?**

One `@wp.kernel` and two `@wp.func` helper functions:
1. `viscous_advection_rk3_kernel(...)` - Warp kernel that calls the two Warp functions `advection(...)` and `diffusion(...)` to calculate the advection and diffusion terms, and then performs one RK update on the entire 2-D grid.
2. `advection(...)` - Warp function that computes the advection term at a grid point given the flow information at the left, right, top, and bottom neighbors.
3. `diffusion(...)` - Warp function that computes the Laplacian at a grid point given the flow information at the left, right, top, and bottom neighbors.



In [None]:
@wp.func
def cyclic_index(idx: wp.int32, n: wp.int32) -> wp.int32:
    """Map any index to [0, n-1] for periodic boundary conditions.

    Args:
        idx: Input index that may be outside the valid range.
        n: Grid size defining the periodic domain.

    Returns:
        Index wrapped to the range [0, n-1].
    """
    ret_idx = idx % n
    if ret_idx < 0:
        ret_idx += n
    return ret_idx
    
@wp.func
def advection(
    omega_left: wp.float32,
    omega_right: wp.float32,
    omega_top: wp.float32,
    omega_down: wp.float32,
    psi_left: wp.float32,
    psi_right: wp.float32,
    psi_top: wp.float32,
    psi_down: wp.float32,
    h: wp.float32,
) -> wp.float32:
    """Calculate the advection term using central finite difference.

    Args:
        omega_left: Vorticity at (i-1, j).
        omega_right: Vorticity at (i+1, j).
        omega_top: Vorticity at (i, j+1).
        omega_down: Vorticity at (i, j-1).
        psi_left: Stream function at (i-1, j).
        psi_right: Stream function at (i+1, j).
        psi_top: Stream function at (i, j+1).
        psi_down: Stream function at (i, j-1).
        h: Grid spacing.

    Returns:
        Advection term value at grid point (i, j).
    """
    inv_2h = 1.0 / (2.0 * h)
    term_1 = ... # MISSING
    term_2 = ... # MISSING
    return term_1 - term_2


@wp.func
def diffusion(
    omega_left: wp.float32,
    omega_right: wp.float32,
    omega_center: wp.float32,
    omega_down: wp.float32,
    omega_top: wp.float32,
    h: wp.float32,
) -> wp.float32:
    """Calculate the Laplacian for viscous diffusion using central difference.

    Args:
        omega_left: Vorticity at (i-1, j).
        omega_right: Vorticity at (i+1, j).
        omega_center: Vorticity at (i, j).
        omega_down: Vorticity at (i, j-1).
        omega_top: Vorticity at (i, j+1).
        h: Grid spacing.

    Returns:
        Laplacian of vorticity at grid point (i, j).
    """
    inv_h2 = 1.0 / (h * h)
    # combine both the diffusion terms in the x and y direction together
    laplacian = ... # MISSING
    return laplacian

    
@wp.kernel
def viscous_advection_rk3_kernel(
    n: int,
    h: float,
    re: float,
    dt: float,
    coeff0: float,
    coeff1: float,
    coeff2: float,
    omega_0: wp.array2d(dtype=float),
    omega_1: wp.array2d(dtype=float),
    psi: wp.array2d(dtype=float),
    rhs: wp.array2d(dtype=float),
):
    """Perform a single substep of SSP-RK3.

    Args:
        n: Grid size.
        h: Grid spacing.
        re: Reynolds number.
        dt: Time step size.
        coeff0: SSP-RK3 coefficient for omega_0.
        coeff1: SSP-RK3 coefficient for omega_1.
        coeff2: SSP-RK3 coefficient for RHS.
        omega_0: Vorticity field at the beginning of the time step.
        omega_1: Vorticity field at the start of the RK substep; overwritten with the updated field.
        psi: Stream function field.
        rhs: Temporarily stores diffusion + advection terms.
    """
    i, j = wp.tid()

    # obtain the neighboring indices for the [i, j]th cell in a periodic square box
    left_idx = cyclic_index(i - 1, n)
    right_idx = cyclic_index(i + 1, n)
    top_idx = cyclic_index(j + 1, n)
    down_idx = cyclic_index(j - 1, n)

    # compute viscous diffusion term
    rhs[i, j] = (1.0 / re) * diffusion(
        omega_1[left_idx, j],
        omega_1[right_idx, j],
        omega_1[i, j],
        omega_1[i, down_idx],
        omega_1[i, top_idx],
        h,
    )

    # add advection term
    rhs[i, j] -= advection(
        omega_1[left_idx, j],
        omega_1[right_idx, j],
        omega_1[i, top_idx],
        omega_1[i, down_idx],
        psi[left_idx, j],
        psi[right_idx, j],
        psi[i, top_idx],
        psi[i, down_idx],
        h,
    )

    # perform RK update
    omega_1[i, j] = ... # MISSING

---
## FFT Kernels for Spectral Poisson Solver

### Data Type Conversion

Convert between real (`wp.float32`) and complex (`wp.vec2f`) arrays.

In [None]:
@wp.kernel
def copy_float_to_vec2(
    omega: wp.array2d(dtype=wp.float32), omega_complex: wp.array2d(dtype=wp.vec2f)
):
    """Copy real vorticity to a complex array with zero imaginary part.

    Args:
        omega: Input real-valued vorticity array.
        omega_complex: Output complex array where real part is omega, imaginary is 0.
    """
    i, j = wp.tid()
    omega_complex[i, j] = wp.vec2f(omega[i, j], 0.0)


@wp.kernel
def extract_real(
    complex_array: wp.array2d(dtype=wp.vec2f), real_array: wp.array2d(dtype=wp.float32)
):
    """Extract real part from a complex array.

    Args:
        complex_array: Input complex array (vec2f where .x is real part).
        real_array: Output real array.
    """
    i, j = wp.tid()
    real_array[i, j] = complex_array[i, j].x


@wp.kernel
def scale_array(scale: wp.float32, arr: wp.array2d(dtype=wp.float32)):
    """Scale array by multiplying with a scale factor.

    Args:
        scale: Scale factor to multiply each element by.
        arr: Array to scale (modified in-place).
    """
    i, j = wp.tid()
    arr[i, j] = arr[i, j] * scale

### Tile-Based 1D FFT

Use `wp.tile_fft()` and `wp.tile_ifft()` for row-wise transforms.

In [None]:
@wp.kernel
def fft_tiled(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    """Perform 1-D FFT on each row using wp.tile_fft().

    Args:
        x: Input complex array of shape (N, N).
        y: Output complex array of shape (N, N) storing FFT results.
    """
    i, _, _ = wp.tid()
    a = wp.tile_load(x, shape=(TILE_M, TILE_N), offset=(i * TILE_M, 0))
    wp.tile_fft(a)
    wp.tile_store(y, a, offset=(i * TILE_M, 0))


@wp.kernel
def ifft_tiled(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    """Perform 1-D inverse FFT on each row using wp.tile_ifft().

    Args:
        x: Input complex array of shape (N, N).
        y: Output complex array of shape (N, N) storing IFFT results.
    """
    i, _, _ = wp.tid()
    a = wp.tile_load(x, shape=(TILE_M, TILE_N), offset=(i * TILE_M, 0))
    wp.tile_ifft(a)
    wp.tile_store(y, a, offset=(i * TILE_M, 0))

### Transpose Kernel

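A 2-D FFT is assembled as row FFT → transpose → row FFT. The output of this sequence is the transpose of the standard 2-D FFT; running the same sequence with the inverse transform transposes the data back.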
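
The row → transpose → row pattern can be verified on a tiny array with a naive DFT. This is a pure-Python sketch rather than the tiled Warp implementation, and all helper names (`dft_rows`, `transpose`, `dft2_row_transpose_row`, `dft2_direct`, `max_abs_diff`) are hypothetical. It checks two things: that the pattern yields the transpose of a directly computed 2-D DFT, and that a second pass with the inverse sign returns the data in its original orientation.

```python
import cmath
import math


def dft_rows(a, sign=-1.0):
    """Unnormalized 1-D DFT of every row (sign=-1 forward, sign=+1 inverse)."""
    n = len(a[0])
    return [
        [sum(row[j] * cmath.exp(sign * 2j * math.pi * j * q / n) for j in range(n)) for q in range(n)]
        for row in a
    ]


def transpose(a):
    """Transpose a list of lists."""
    return [list(col) for col in zip(*a)]


def dft2_row_transpose_row(a, sign=-1.0):
    """Row DFT, transpose, row DFT."""
    return dft_rows(transpose(dft_rows(a, sign)), sign)


def dft2_direct(a):
    """Forward 2-D DFT evaluated directly from its definition."""
    n = len(a)
    return [
        [
            sum(
                a[i][j] * cmath.exp(-2j * math.pi * (p * i + q * j) / n)
                for i in range(n)
                for j in range(n)
            )
            for q in range(n)
        ]
        for p in range(n)
    ]


def max_abs_diff(a, b):
    """Largest element-wise difference between two arrays of equal shape."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


N = 4
field = [[complex(i * N + j, 0.0) for j in range(N)] for i in range(N)]
forward = dft2_row_transpose_row(field)

# the row-transpose-row result equals the transposed 2-D DFT
transposed_err = max_abs_diff(forward, transpose(dft2_direct(field)))

# a second pass with the inverse sign undoes the transpose; divide by N * N to normalize
round_trip = [[v / (N * N) for v in row] for row in dft2_row_transpose_row(forward, +1.0)]
round_trip_err = max_abs_diff(round_trip, field)
print(f"transposed_err = {transposed_err:.2e}, round_trip_err = {round_trip_err:.2e}")
```

Any operation applied between the forward and inverse passes acts on the transposed spectrum. A factor such as $1/(k_x^2 + k_y^2)$ is symmetric under swapping $k_x$ and $k_y$, so it is unaffected by the transposed layout.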

In [None]:
@wp.kernel
def tiled_transpose(x: wp.array2d(dtype=wp.vec2f), y: wp.array2d(dtype=wp.vec2f)):
    """Transpose a 2-D array.

    Args:
        x: Input complex array.
        y: Output complex array storing the transpose of x.
    """
    i, j = wp.tid()
    t = wp.tile_load(
        x,
        shape=(TILE_TRANSPOSE_DIM, TILE_TRANSPOSE_DIM),
        offset=(i * TILE_TRANSPOSE_DIM, j * TILE_TRANSPOSE_DIM),
        storage="shared",
    )
    t_transposed = wp.tile_transpose(t)
    wp.tile_store(
        y, t_transposed, offset=(j * TILE_TRANSPOSE_DIM, i * TILE_TRANSPOSE_DIM)
    )

### Poisson Solve in k-Space

Division by $|k|^2$ in Fourier space to solve Eq. (2).

In [None]:
@wp.kernel
def multiply_k2_inverse(
    k2i: wp.array2d(dtype=wp.float32),
    omega_hat: wp.array2d(dtype=wp.vec2f),
    psi_hat: wp.array2d(dtype=wp.vec2f),
):
    """Solve Poisson equation in Fourier space.

    Args:
        k2i: Precomputed 1/|k|^2 array.
        omega_hat: Fourier transform of vorticity.
        psi_hat: Output Fourier transform of stream function.
    """
    i, j = wp.tid()
    psi_hat[i, j] = omega_hat[i, j] * k2i[i, j]

---
## Assembling the Solver

### Array Allocation

Allocate all Warp arrays needed for the simulation.

In [None]:
H = LEN / N_GRID

# define SSP-RK3 coefficients
RK3_COEFFS = [
    [1.0, 0.0, 1.0],
    [3.0 / 4.0, 1.0 / 4.0, 1.0 / 4.0],
    [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0],
]

# allocate warp arrays for vorticity, stream-function, and RHS of NS equation
omega_0 = wp.zeros((N_GRID, N_GRID), dtype=wp.float32)
omega_1 = wp.zeros((N_GRID, N_GRID), dtype=wp.float32)
psi = wp.zeros((N_GRID, N_GRID), dtype=wp.float32)
rhs = wp.zeros((N_GRID, N_GRID), dtype=wp.float32)

# precompute 1/k^2 for spectral Poisson solver (avoid division by zero at k=0)
k = np.fft.fftfreq(N_GRID, d=1.0 / N_GRID)
kx, ky = np.meshgrid(k, k)
k2 = kx**2 + ky**2
k2i_np = np.zeros_like(k2)
nonzero = k2 != 0
k2i_np[nonzero] = 1.0 / k2[nonzero]
k2i = wp.array2d(k2i_np.astype(np.float32), dtype=wp.float32)

# allocate temporary warp arrays for spectral Poisson solver
omega_complex = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)
fft_temp_1 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)
fft_temp_2 = wp.zeros((N_GRID, N_GRID), dtype=wp.vec2f)

print(f"Arrays allocated for {N_GRID}x{N_GRID} grid")

### Initial Condition Setup

Generate initial vorticity field from the energy spectrum.

In [None]:
# compute initial vorticity distribution for decaying turbulence
k_mag_np = np.sqrt(k**2 + k[:, np.newaxis] ** 2)
k_mag = wp.array2d(k_mag_np.astype(np.float32), dtype=wp.float32)

rng = np.random.default_rng(42)
zeta_np = 2 * np.pi * rng.random((N_GRID // 2 + 1, N_GRID // 2 + 1))
eta_np = 2 * np.pi * rng.random((N_GRID // 2 + 1, N_GRID // 2 + 1))
zeta = wp.array2d(zeta_np.astype(np.float32), dtype=wp.float32)
eta = wp.array2d(eta_np.astype(np.float32), dtype=wp.float32)

# set parameters for energy spectrum
K_CUTOFF = 12.0
S = 3

wp.launch(
    decaying_turbulence_initializer,
    dim=(N_GRID, N_GRID),
    inputs=[N_GRID, K_CUTOFF, S, k_mag, zeta, eta],
    outputs=[omega_complex],
)

print("Initial vorticity set in Fourier space")

### The 2D FFT Function

Assembles row FFT + transpose + row FFT for full 2D transform.

In [None]:
def fft_2d(fft_kernel, input_arr, output_arr):
    """Perform 2-D FFT or IFFT using row-wise transform + transpose pattern.

    Args:
        fft_kernel: Either fft_tiled or ifft_tiled.
        input_arr: Input complex array.
        output_arr: Output complex array.
    """
    # perform rowwise FFT/IFFT
    wp.launch_tiled(
        fft_kernel,
        dim=[N_GRID, 1],
        inputs=[input_arr],
        outputs=[fft_temp_1],
        block_dim=BLOCK_DIM,
    )

    wp.launch_tiled(
        tiled_transpose,
        dim=(N_GRID // TILE_TRANSPOSE_DIM, N_GRID // TILE_TRANSPOSE_DIM),
        inputs=[fft_temp_1],
        outputs=[fft_temp_2],
        block_dim=TILE_TRANSPOSE_DIM * TILE_TRANSPOSE_DIM,
    )

    # perform columnwise FFT/IFFT
    wp.launch_tiled(
        fft_kernel,
        dim=[N_GRID, 1],
        inputs=[fft_temp_2],
        outputs=[output_arr],
        block_dim=BLOCK_DIM,
    )

### The Poisson Solver Function

Full spectral solve: FFT → divide by $|k|^2$ → IFFT.

In [None]:
def solve_poisson():
    """Solve the Poisson equation using FFT.

    Solve (del^2/del x^2 + del^2/del y^2)(psi) = -omega_1.
    psi_hat(kx, ky) = omega_hat(kx, ky) / ||k||^2 for periodic 2-D domain.
    2-D FFT is computed as sequence of 1-D FFT along rows, transpose, and 1-D FFT along rows.
    """
    # convert updated vorticity at any RK step from wp.float32 to wp.vec2f
    wp.launch(
        copy_float_to_vec2,
        dim=(N_GRID, N_GRID),
        inputs=[omega_1],
        outputs=[omega_complex],
    )

    # perform forward FFT
    fft_2d(fft_tiled, omega_complex, fft_temp_1)

    # multiply by 1/k^2 to solve Poisson in Fourier space
    wp.launch(
        multiply_k2_inverse,
        dim=(N_GRID, N_GRID),
        inputs=[k2i, fft_temp_1],
        outputs=[fft_temp_2],
    )

    # perform inverse FFT
    fft_2d(ifft_tiled, fft_temp_2, fft_temp_1)

    # extract real part and normalize
    wp.launch(
        extract_real,
        dim=(N_GRID, N_GRID),
        inputs=[fft_temp_1],
        outputs=[psi],
    )
    wp.launch(
        scale_array,
        dim=(N_GRID, N_GRID),
        inputs=[wp.float32(1.0 / (N_GRID * N_GRID)), psi],
    )

### The Timestep Function

One complete SSP-RK3 timestep with 3 stages.

In [None]:
def step():
    """Advance simulation by one timestep using SSP-RK3."""
    for c0, c1, c2 in RK3_COEFFS:
        # zero the RHS array
        rhs.zero_()

        # compute RHS and update omega_1
        wp.launch(
            viscous_advection_rk3_kernel,
            dim=(N_GRID, N_GRID),
            inputs=[N_GRID, H, RE, DT, c0, c1, c2, omega_0, omega_1, psi, rhs],
        )

        # update streamfunction from new vorticity (in omega_1)
        solve_poisson()

    # copy omega_1 to omega_0 for next timestep
    wp.copy(omega_0, omega_1)

### Complete Initial Condition

Transform initial vorticity from Fourier space and solve initial Poisson.

In [None]:
# compute IFFT of omega_complex field
fft_2d(ifft_tiled, omega_complex, fft_temp_1)

# extract real part to get the initial vorticity field
wp.launch(
    extract_real,
    dim=(N_GRID, N_GRID),
    inputs=[fft_temp_1],
    outputs=[omega_0],
)

# for initial distribution, set both omega_1 and omega_0 to be the same
wp.copy(omega_1, omega_0)

# solve initial Poisson equation to get psi from initial vorticity field
solve_poisson()

print(f"Initial vorticity range: [{omega_0.numpy().min():.2f}, {omega_0.numpy().max():.2f}]")

---
## Running the Simulation

### CUDA Graph Capture

Capture the timestep function into a CUDA graph for efficient repeated execution.

In [None]:
# capture first step in a CUDA graph
with wp.ScopedCapture() as capture:
    step()
step_graph = capture.graph

print("CUDA graph captured")

### Run Simulation and Visualize

In [None]:
import time
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
import os

NUM_FRAMES = 200
STEPS_PER_FRAME = 10

# Colormap setup
cmap = plt.cm.twilight
norm = Normalize(vmin=-15, vmax=15)

frames = []
print(f"Running {NUM_FRAMES} frames ({STEPS_PER_FRAME} steps each)...")

start_time = time.perf_counter()
for frame in range(NUM_FRAMES):
    # Advance simulation
    for _ in range(STEPS_PER_FRAME):
        wp.capture_launch(step_graph)

    # Capture frame
    vorticity = omega_1.numpy().T  # Transpose for correct orientation
    colored = cmap(norm(vorticity))
    rgb = (colored[:, :, :3] * 255).astype(np.uint8)
    frames.append(rgb)

    if (frame + 1) % 50 == 0:
        print(f"  Frame {frame + 1}/{NUM_FRAMES}")

elapsed = time.perf_counter() - start_time
total_steps = NUM_FRAMES * STEPS_PER_FRAME
print(f"Completed {total_steps} steps in {elapsed:.2f}s ({total_steps/elapsed:.0f} steps/s)")

### Create Animation

In [None]:
import IPython.display

# Create output directory
# os.makedirs("./images/chapter-12.2", exist_ok=True)
# output_file = f"./images/chapter-12.2/turbulence_{N_GRID}x{N_GRID}_Re{int(RE)}.gif"

# Save as GIF
# pil_images = [Image.fromarray(frame) for frame in frames]
# pil_images[0].save(
#     output_file,
#     save_all=True,
#     append_images=pil_images[1:],
#     duration=50,
#     loop=0,
# )

# print(f"Saved: {output_file}")
# IPython.display.Image(output_file)

---
## Validation

*Placeholder: Future sections may include energy decay analysis, enstrophy evolution, and conservation checks.*

---
## Conclusion

This notebook demonstrated:
- Vorticity-streamfunction formulation for 2D incompressible Navier-Stokes
- Spectral Poisson solver using Warp's tile-based FFT
- SSP-RK3 time integration with custom Warp kernels
- CUDA graph capture for high-performance time-stepping

The combination of spectral methods (for the Poisson equation) and finite differences (for advection/diffusion) is a common pattern in CFD codes.

---
## References

- [NVIDIA Warp GitHub](https://github.com/NVIDIA/warp)
- [Warp Documentation](https://nvidia.github.io/warp)
- San, O. and Staples, A.E. (2012). "High-order methods for decaying two-dimensional homogeneous isotropic turbulence." *Computers & Fluids*, 63, 105-127.