# Chapter 12.2: N-body Simulation using Single Instruction, Multiple Threads (SIMT) and Tile-based Approach in Warp

![Example output](images/chapter-12.2/nbody_animation.gif)

## Overview

In the previous two notebooks, we introduced NVIDIA Warp as a framework for writing high-performance GPU code in Python and implemented a GPU-accelerated 2-D Ising model simulation. Through this parallel GPU implementation, we learned the importance of developing correct parallelization algorithms to faithfully reproduce the underlying physics of any physical system that we are modeling.

Following the theme of simulating physical systems to learn the basics of NVIDIA Warp for parallel computing, in this chapter we will implement an N-body simulation in NVIDIA Warp, first using the single instruction, multiple threads (SIMT) model and then using the tile-based model (both in Warp).

Through this implementation, you will learn the following:

* The simulation of another physical system using Warp, building on the previous notebook on the 2-D Ising model
* How to characterize code performance and use some code profiling tools available in Warp
* An introduction to the tile-based programming method in Warp

By the end of this chapter, you will have two variants of code for N-body simulation, one based on the SIMT paradigm and another based on the tile-based programming model. You will also have learned about tools available in Python and Warp to benchmark your code's time performance in greater detail.

It is strongly recommended to work through the [notebook on introduction to NVIDIA Warp](Chapter_12_Intro_to_NVIDIA_Warp.ipynb) and [GPU-Accelerated Ising Model Simulation in NVIDIA Warp](Chapter_12.1_IsingModel_In_Warp.ipynb) before proceeding with this tutorial.

---
## Setup

Before we begin this tutorial, let us ensure we have all the necessary packages installed.

First, we will install NVIDIA Warp if it is not already available:


In [None]:
# Install NVIDIA Warp
%pip install warp-lang

# Install Matplotlib for plotting/animation
%pip install matplotlib

Now let us import the necessary libraries and initialize Warp to check if GPU support is available:


In [None]:
# Core imports
import numpy as np
import warp as wp

# Check for GPU availability
if wp.get_cuda_device_count() > 0:
    print("✓ GPU detected successfully")
else:
    print("No GPU detected!")

---
## Introduction: Exploring avenues for further speedup of already GPU-accelerated code

In the last notebook, we went through an example of how to correctly model a physical system (2-D Ising model) using Warp.

Now, we will go a step beyond the correctness of the model and explore how to profile our code and accelerate an already GPU-accelerated implementation.

As with the 2-D Ising model, we choose the N-body simulation because it is simple to state, and because its underlying physics makes it amenable to a tile-based programming model.

Like the previous notebook, let us begin by understanding the physics behind the N-body simulation, then progressively build our GPU implementation, profile it, analyze potential avenues for further speedup, and finally explore the tile-based programming model.

---
## Background: The N-body simulation in physics and its numerical implementation 

### What is an N-body simulation?

An [N-body simulation](https://en.wikipedia.org/wiki/N-body_simulation) models the temporal evolution of a system of particles, typically under the influence of physical forces such as gravity. N-body simulations are widely used in astrophysics to study the dynamics of systems of planets, stars, and galaxies. Readers are strongly encouraged to go through section 31.1 of the [Fast N-Body Simulation with CUDA](https://developer.nvidia.com/gpugems/gpugems3/part-v-physics-simulation/chapter-31-fast-n-body-simulation-cuda) webpage for additional context on why we care about speeding up the N-body simulation.

### Numerical setup

The N-body simulation in this tutorial evolves a system of particles interacting with each other through gravity. Each particle exerts an attractive gravitational force on every other particle in the system. For a particle given by index $i$ and mass $m_i$, we approximate the net force on the particle $\mathbf{F}_i$ as:

$$
\mathbf{F}_i \approx G m_i \sum_{j=1}^{N} \frac{ m_j \mathbf{r}_{ij} }{ \left(\mathbf{r}_{ij} \cdot \mathbf{r}_{ij}  + \epsilon^2 \right)^{3/2} },
$$
where $\mathbf{r}_{ij} = \mathbf{x}_j - \mathbf{x}_i$, and $\mathbf{x}_j$ and $\mathbf{x}_i$ are the position vectors of particles $j$ and $i$, respectively. $\epsilon$ is a *softening factor* that prevents numerical instabilities when two particles are very close to each other. Because of the softening, the $j = i$ term is well defined and contributes zero (since $\mathbf{r}_{ii} = \mathbf{0}$), so the sum can run over all $N$ particles. For convenience and without any loss of generality, we set $G=1$ and do not explicitly use it in the code.
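
To get a feel for what the softening term does, the short sketch below evaluates the magnitude of a single pairwise contribution, $m_j r / (r^2 + \epsilon^2)^{3/2}$, with and without softening as two unit-mass particles approach each other. The helper name `pair_accel_magnitude` is hypothetical and used only for illustration; the value $\epsilon^2 = 0.1^2$ is the one used for `SOFTENING_SQ` in this chapter.

```python
def pair_accel_magnitude(separation, mass_j=1.0, softening_sq=0.0):
    """Magnitude of m_j * r / (r^2 + eps^2)^(3/2), with G = 1.

    Hypothetical helper for illustration; not part of the Warp code.
    """
    dist_sq = separation * separation + softening_sq
    return mass_j * separation / dist_sq**1.5


for separation in (1.0, 0.1, 0.01, 0.001):
    unsoftened = pair_accel_magnitude(separation)
    softened = pair_accel_magnitude(separation, softening_sq=0.1**2)
    print(
        f"r = {separation:6.3f}   unsoftened = {unsoftened:12.1f}   "
        f"softened = {softened:8.3f}"
    )
```

Without softening the contribution grows like $1/r^2$ as the particles approach. With softening it stays bounded and goes to zero as $r \to 0$, which is also why a particle's own term in the sum is harmless.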


### How to calculate the position of any particle in the system as a function of time?

1. For any particle $i$, we can calculate its acceleration at a given time $t_n$ as:
$$
\mathbf{a}_i^{n} = \frac{\mathbf{F}_i^{n}}{m_{i}} = G \sum_{j=1}^{N} \frac{ m_j \mathbf{r}_{ij} }{ \left(\mathbf{r}_{ij} \cdot \mathbf{r}_{ij}  + \epsilon^2 \right)^{3/2} }.
$$

In the formula above and subsequent formulae, superscript $n$ refers to the values of variables (forces, positions, accelerations, velocities) at time $t_n$.

2. We obtain the velocity of the given particle at time $t_{n+1}$ as:

$$
\mathbf{v}_i^{n+1} = \mathbf{v}_i^{n} + \mathbf{a}_i^{n}\Delta t.
$$

3. Thereafter, the position of the particle is updated at time $t_{n+1}$ as follows:

$$
\mathbf{x}_i^{n+1} = \mathbf{x}_i^{n} + \mathbf{v}_i^{n+1}\Delta t.
$$
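
The three steps above can be sketched as a plain-Python reference (no Warp or GPU required). This is a minimal sketch for very small systems, useful mainly as something to compare a GPU kernel against. The function names `acceleration` and `step` are hypothetical, and the code assumes `softening_sq > 0` so that a particle's own term in the sum is well defined.

```python
def acceleration(i, positions, masses, softening_sq):
    """Step 1: softened gravitational acceleration on particle i (G = 1)."""
    xi, yi, zi = positions[i]
    ax = ay = az = 0.0
    for (xj, yj, zj), m_j in zip(positions, masses):
        rx, ry, rz = xj - xi, yj - yi, zj - zi
        # (r.r + eps^2)^(-3/2); requires softening_sq > 0 for the j == i term
        inv_dist_cubed = (rx * rx + ry * ry + rz * rz + softening_sq) ** -1.5
        ax += m_j * rx * inv_dist_cubed
        ay += m_j * ry * inv_dist_cubed
        az += m_j * rz * inv_dist_cubed
    return (ax, ay, az)


def step(positions, velocities, masses, dt, softening_sq):
    """Advance all particles by one timestep using steps 1 to 3."""
    accels = [
        acceleration(i, positions, masses, softening_sq)
        for i in range(len(positions))
    ]
    # Step 2: v^{n+1} = v^n + a^n * dt
    new_velocities = [
        tuple(v + a * dt for v, a in zip(vel, acc))
        for vel, acc in zip(velocities, accels)
    ]
    # Step 3: x^{n+1} = x^n + v^{n+1} * dt (uses the updated velocity)
    new_positions = [
        tuple(x + v * dt for x, v in zip(pos, vel))
        for pos, vel in zip(positions, new_velocities)
    ]
    return new_positions, new_velocities


# Two unit-mass particles at rest, two length units apart on the x-axis
positions = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
velocities = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
masses = [1.0, 1.0]

positions, velocities = step(positions, velocities, masses, dt=0.01, softening_sq=0.1**2)
print(positions)
print(velocities)
```

After one step the two particles have moved slightly toward each other with equal and opposite velocities, as expected for a symmetric two-body configuration.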

Interested readers can also take a look at the appendix of [Swope et al. 1981](https://apps.dtic.mil/sti/tr/pdf/ADA103095.pdf) (page 53 onward).

**Exercise**: Starting from steps 1 to 3 above, show that the method can equivalently be expressed as
$$
\frac{d^2 \mathbf{x}_i}{d t^{2}} = \mathbf{a}_i \approx  \frac{\mathbf{x}_i^{n+1} -2 \mathbf{x}_{i}^n + \mathbf{x}_{i}^{n-1}}{\Delta t^{2}} = \mathbf{a}_i^{n}
$$
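
One way to sanity-check your derivation numerically, without giving the algebra away, is to run the update rule for a single particle and confirm that the central-difference quotient reproduces $\mathbf{a}^n$ up to round-off. The sketch below assumes a hypothetical 1-D test acceleration $a(x) = -x$ (a harmonic oscillator rather than gravity) purely to keep the example short.

```python
def accel(x):
    """Hypothetical 1-D test acceleration (harmonic oscillator), not gravity."""
    return -x


dt = 0.01
x, v = 1.0, 0.0
history = [x]  # positions x^0, x^1, x^2, ...

for _ in range(100):
    v = v + accel(x) * dt  # step 2: velocity update with a^n
    x = x + v * dt  # step 3: position update with v^{n+1}
    history.append(x)

# Compare (x^{n+1} - 2 x^n + x^{n-1}) / dt^2 with a^n at every interior time level
max_error = max(
    abs((history[n + 1] - 2.0 * history[n] + history[n - 1]) / dt**2 - accel(history[n]))
    for n in range(1, len(history) - 1)
)
print(f"max |central difference - a^n| = {max_error:.2e}")
```

The reported difference should be at the level of floating-point round-off, not at the level of the truncation error of the scheme.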

In the next sections, we will first implement a baseline simulation in Warp using a single instruction, multiple threads (SIMT) approach. We will then benchmark the SIMT code and delve into a tile-based paradigm for an even more performant code.

---
## A baseline SIMT implementation

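The baseline implementation consists of the following steps:

1. Initialize the N-body simulation parameters and arrays using the `def init_problem(num_bodies)` function
2. Write a Warp kernel that advances every particle by one timestep
3. Launch the kernel repeatedly in a time-stepping loop, swapping the position arrays after each step
4. Animate the evolution of the system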

### Initializing the simulation using NumPy and Warp

The cell below implements the `def init_problem(num_bodies)` function to initialize positions and velocities of particles using NumPy. The following arrays are allocated on the GPU:

- **Position arrays**: We use two position arrays to avoid race conditions in the Warp kernel. These arrays are swapped at each time step.
  - Data type: Three-component float vector (`wp.vec3`)
  - Shape: `(N,)`
  - Initial values: Random positions uniformly distributed in space

- **Velocity array**: Stores the velocity of each particle.
  - Data type: Three-component float vector (`wp.vec3`)
  - Shape: `(N,)`
  - Initial values: Random velocities uniformly distributed in [-1, 1]

- **Mass array**: While not strictly necessary for this basic example, including masses makes it easier to extend the code to handle variable particle masses.
  - Data type: Float (`wp.float32`)
  - Shape: `(N,)`
  - Initial values: 1.0 (unit mass for all particles)

For convenience, we first initialize the arrays on the CPU using NumPy, then transfer them to Warp arrays on the GPU.

In [None]:
def init_problem(num_bodies):
    """Initialize N-body simulation parameters and arrays.

    Creates initial positions, velocities, and masses for a system of particles
    that will be used in the N-body gravitational simulation. Positions are
    uniformly distributed in a cube with size scaled by particle count to
    maintain constant density.

    Args:
        num_bodies (int): Number of particles in the simulation.

    Returns:
        tuple: A tuple containing:
            - pos_array_0 (wp.array): Initial position array on GPU (dtype: wp.vec3, shape: [num_bodies]).
            - pos_array_1 (wp.array): Uninitialized position array for double buffering.
            - vel_array (wp.array): Initial velocity array on GPU (dtype: wp.vec3, shape: [num_bodies]).
            - mass_array (wp.array): Mass array on GPU (all particles have unit mass).
            - scale (float): Scale factor for the simulation domain.
            - init_pos_np (np.ndarray): Initial positions on CPU for visualization.

    Note:
        Uses a fixed random seed (42) for reproducible results.
    """
    # Initialize random number generator with fixed seed for reproducibility.
    rng = np.random.default_rng(42)

    # Calculate scale factor to maintain constant particle density as N increases.
    # The cube root scaling ensures constant density in 3D space.
    scale = 10.0 * (num_bodies / 1024) ** (1 / 3)

    # Generate initial conditions on CPU using NumPy.
    # Positions are uniformly distributed within a cube of size [-scale, scale]^3.
    init_pos_np = rng.uniform(low=-scale, high=scale, size=(num_bodies, 3))

    # Velocities are uniformly distributed in [-1, 1]^3 for initial random motion.
    init_vel_np = rng.uniform(low=-1.0, high=1.0, size=(num_bodies, 3))

    # All particles have unit mass for simplicity.
    mass_array_np = np.ones(num_bodies)

    # Transfer data from CPU (NumPy) to GPU (Warp).
    pos_array_0 = wp.array(init_pos_np, dtype=wp.vec3)

    # Create second position array for double buffering technique.
    # This avoids race conditions when updating positions in parallel.
    pos_array_1 = wp.empty_like(pos_array_0)

    # Transfer velocity and mass data to GPU.
    vel_array = wp.array(init_vel_np, dtype=wp.vec3)
    mass_array = wp.array(mass_array_np, dtype=wp.float32)

    return pos_array_0, pos_array_1, vel_array, mass_array, scale, init_pos_np
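
The cube-root scaling used in `init_problem` can be checked with a few lines of arithmetic. The particles fill a cube of side `2 * scale`, so the number density is `num_bodies / (2 * scale)**3`. The sketch below uses a hypothetical helper, `number_density`, that repeats the `scale` formula from the function above and evaluates the density for several particle counts.

```python
def number_density(num_bodies):
    """Particles per unit volume for the cube [-scale, scale]^3.

    Hypothetical helper; the scale formula is copied from init_problem.
    """
    scale = 10.0 * (num_bodies / 1024) ** (1 / 3)
    return num_bodies / (2.0 * scale) ** 3


for n in (256, 1024, 8192, 65536):
    print(f"num_bodies = {n:6d}   density = {number_density(n):.6f}")
```

The density comes out the same for every particle count (1024 / 8000 = 0.128), which is what "constant density" means here: larger systems occupy a proportionally larger box.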

Let us now set the key simulation parameters for the problem at hand.

- `dt`: Timestep for the numerical setup
- `num_bodies`: Total number of bodies we are interested in simulating
- `SOFTENING_SQ`: Softening term to avoid blow-up in $\mathbf{F}_i$ if two particles are very close to each other

In [None]:
dt = 0.01
num_bodies = 1024
SOFTENING_SQ = 0.1**2

Let us now initialize the simulation and visualize the initial distribution of particles in the 3-D Cartesian coordinate system.

In [None]:
import numpy as np
import warp as wp

# Call to initialize the position, velocity and mass arrays
pos_array_0, pos_array_1, vel_array, mass_array, scale, init_pos_np = init_problem(
    num_bodies
)

The cell above initializes the position, velocity, and mass arrays for `num_bodies` particles.

With the initial positions set, let us visualize the system before moving on to the numerical scheme. Given how the initial positions are set in `def init_problem(num_bodies)`, we should see a uniform distribution of particles in a cubical box centered at the origin, with side length `2 * scale` (20 for `num_bodies = 1024`).

In [None]:
import matplotlib.pyplot as plt

# Create a figure with a dark background
plt.style.use("dark_background")
fig = plt.figure(figsize=(8, 6), dpi=150)
ax = fig.add_subplot(111, projection="3d")

# Calculate distance of all particles from origin for color mapping
distances = np.linalg.norm(init_pos_np, axis=1)

# Create a scatter plot
scatter_plot = ax.scatter(
    init_pos_np[:, 0],
    init_pos_np[:, 1],
    init_pos_np[:, 2],
    c=distances,  # Color by distance from origin
    cmap="plasma",  # Set colormap to plasma
    s=50,  # Set point size
    edgecolors="white",  # White edges for contrast
    linewidth=0.5,  # Thin edge lines
    marker="o",  # Round markers
)

# Add a colorbar with label
cbar = plt.colorbar(scatter_plot, ax=ax, pad=0.1, shrink=0.8)
cbar.set_label("Distance from Origin", rotation=270, labelpad=20, fontsize=12)

# Set axis labels with custom fonts
ax.set_xlabel("X Position", fontsize=12, fontweight="bold", labelpad=10)
ax.set_ylabel("Y Position", fontsize=12, fontweight="bold", labelpad=10)
ax.set_zlabel("Z Position", fontsize=12, fontweight="bold", labelpad=10)

# Set axis limits with some padding
padding = scale * 0.1
ax.set_xlim(-scale - padding, scale + padding)
ax.set_ylim(-scale - padding, scale + padding)
ax.set_zlim(-scale - padding, scale + padding)

# Customize the grid
ax.grid(True, alpha=0.3, linestyle="--")

# Set a viewing angle
ax.view_init(elev=20, azim=45)

# Turn off the pane fill so the dark background shows through
ax.xaxis.pane.fill = False
ax.yaxis.pane.fill = False
ax.zaxis.pane.fill = False

# Customize tick labels
ax.tick_params(axis="both", which="major", labelsize=10)

# Use an equal aspect ratio so the cubical domain is not distorted
ax.set_box_aspect([1, 1, 1])

plt.tight_layout()
plt.show()

#### Warp kernel for timestepping and updating the positions of the particles

Since we are working with the SIMT paradigm, it makes sense to launch a Warp kernel with a total number of threads equal to `num_bodies`. Each thread then takes care of updating the position of a single particle for a single timestep. We launch this kernel once per timestep: starting from $t_0 = 0$, the $n$-th launch advances the system from $t_{n-1}$ to $t_n = n\Delta t$. Note that the number of timesteps is independent of the number of particles $N$.

The Warp kernel we are about to implement needs to perform the following steps on each thread for a single particle, given the positions and velocities at time $t_n$:

- Calculate the acceleration of the particle $\mathbf{a}_i^{n}$ at time $t_n$ using the formula described in the Numerical setup section
- Update the velocity of the particle: $\mathbf{v}_{i}^{n+1} = \mathbf{v}_{i}^{n} + \mathbf{a}_{i}^{n}\Delta t$
- Calculate the updated position: $\mathbf{x}_{i}^{n+1} = \mathbf{x}_{i}^{n} + \mathbf{v}_{i}^{n+1}\Delta t$

**Exercise**: Stop here and try to implement this Warp kernel and all the corresponding steps yourself. You can then compare your implementation with ours to see similarities and differences. Remember that the same algorithm can be implemented in different ways. In the final time-stepping loop, you can replace our kernel implementation with yours and observe whether the results change or remain similar.

The cell below implements two key components: (a) a Warp function named `def body_body_interaction(...)` that calculates the net gravitational acceleration of a particle, and (b) a Warp kernel named `def integrate_bodies(...)` that implements the time-stepping algorithm described above.

In [None]:
@wp.func
def body_body_interaction(
    num_bodies: int,
    masses: wp.array(dtype=float),
    body_position: wp.vec3,
    positions: wp.array(dtype=wp.vec3),
):
    """Calculate gravitational acceleration on a particle due to all other particles.

    Computes the net gravitational acceleration on a particle at the given position
    by summing the gravitational forces from all other particles in the system.
    Uses a softening factor to prevent numerical instabilities when particles are
    very close together.

    Args:
        num_bodies: Total number of particles in the system.
        masses: Array of particle masses indexed by particle ID.
        body_position: Position vector of the particle for which acceleration is computed.
        positions: Array of position vectors for all particles in the system.

    Returns:
        wp.vec3: Net acceleration vector acting on the particle.

    Note:
        The acceleration is computed using Newton's law of gravitation with G=1:
        a_i = Σ_j (m_j * r_ij / (|r_ij|² + ε²)^(3/2))
        where r_ij is the vector from particle i to particle j, and ε is the
        softening factor.
    """

    # Initialize acceleration vector to zero.
    acc = wp.vec3(0.0, 0.0, 0.0)

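    # Sum gravitational contributions from all particles. The particle's own
    # entry contributes zero because r = 0 there and the softening term keeps
    # the denominator finite.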
    for body_index in range(num_bodies):
        # Calculate displacement vector from current particle to other particle.
        r = positions[body_index] - body_position

        # Add softening factor to squared distance to avoid singularities.
        # This prevents blow-up when two particles get close to each other.
        dist_sq = wp.length_sq(r) + SOFTENING_SQ

        # Calculate inverse distance cubed for gravitational force formula.
        inv_dist = 1.0 / wp.sqrt(dist_sq)
        inv_dist_cubed = inv_dist * inv_dist * inv_dist

        # Accumulate acceleration contribution from this particle.
        # Force is proportional to mass and inverse square of distance.
        acc = acc + masses[body_index] * inv_dist_cubed * r

    return acc


@wp.kernel
def integrate_bodies(
    num_bodies: int,
    dt: float,
    masses: wp.array(dtype=float),
    old_position: wp.array(dtype=wp.vec3),
    velocity: wp.array(dtype=wp.vec3),
    new_position: wp.array(dtype=wp.vec3),
):
    """Integrate N-body system forward one timestep.

    Updates particle velocities and positions for one timestep using the
    gravitational forces between all particles. Each thread handles one particle,
    computing its acceleration, updating its velocity, and calculating its new
    position.

    Args:
        num_bodies: Total number of particles in the system.
        dt: Timestep size for integration.
        masses: Array of particle masses (shape: [num_bodies]).
        old_position: Current particle positions (shape: [num_bodies]).
        velocity: Current particle velocities, updated in-place (shape: [num_bodies]).
        new_position: Output array for updated positions (shape: [num_bodies]).

    Note:
        This kernel uses the SIMT paradigm where each thread processes one particle.
        The kernel should be launched with num_bodies threads.

        Integration scheme:
        1. a_i(t) = F_i(t) / m_i  (compute acceleration)
        2. v_i(t+dt) = v_i(t) + a_i(t) * dt  (update velocity)
        3. x_i(t+dt) = x_i(t) + v_i(t+dt) * dt  (update position)

        Double buffering is used for positions to avoid race conditions during
        parallel updates.
    """

    # Get thread ID - each thread handles one particle.
    i = wp.tid()

    # Calculate gravitational acceleration on this particle from all others.
    accel = body_body_interaction(num_bodies, masses, old_position[i], old_position)

    # Update velocity using the acceleration at time t (the velocity half of
    # a semi-implicit Euler step).
    # v(t+dt) = v(t) + a(t) * dt
    velocity[i] = velocity[i] + accel * dt

    # Update position using the newly computed velocity.
    # x(t+dt) = x(t) + v(t+dt) * dt
    # Note: We use new_position array to avoid race conditions.
    new_position[i] = old_position[i] + dt * velocity[i]

#### Warp kernel launch and timestepping

At this point, we have finished writing the only Warp kernel we need for this simulation.

The remaining task is to write a loop that repeatedly launches the Warp kernel and swaps the position arrays for the next iteration so that the updated positions become the current positions.

For the latter task, we can use this trick:

```python
(pos_array_0, pos_array_1) = (pos_array_1, pos_array_0)
```
If you worked through the notebook on the 2-D Ising model in Warp, you may remember that the same trick was used there to update the lattice site spins as the Monte Carlo simulation progressed.
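
Note that the swap exchanges only the Python names bound to the two buffers; no particle data is copied. The sketch below illustrates this with plain lists standing in for the Warp arrays. The names `current` and `scratch`, and the "add one" update, are hypothetical placeholders for `pos_array_0`, `pos_array_1`, and the kernel.

```python
current = [1.0, 2.0, 3.0]  # stand-in for pos_array_0 (read by the "kernel")
scratch = [0.0, 0.0, 0.0]  # stand-in for pos_array_1 (written by the "kernel")
id_current, id_scratch = id(current), id(scratch)

for _ in range(3):
    # Stand-in for the kernel: read from `current`, write to `scratch`
    for i, x in enumerate(current):
        scratch[i] = x + 1.0

    # Swap the names so the freshly written buffer becomes the current one
    current, scratch = scratch, current

print(current)
print(id(current) == id_scratch)
```

After the loop, the name `current` always refers to the most recent state, regardless of which of the two underlying buffers happens to hold it. After an odd number of steps, that is the buffer that started out as the scratch array.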

The cell below performs the steps mentioned above. We re-initialize the initial state of the system using `def init_problem(...)`.

In [None]:
# Simulation parameters
dt = 0.01  # Time step for numerical integration
num_bodies = 1024  # Number of particles in the N-body system
# Softening parameter squared; prevents division by a very small number
# when two particles come close to each other
SOFTENING_SQ = 0.1**2

# Initialize problem with particle positions, velocities, and masses
pos_array_0, pos_array_1, vel_array, mass_array, scale, init_pos_np = init_problem(
    num_bodies
)

# Main simulation loop that integrates particle motion over time
for _step_index in range(1000):
    # Launch GPU kernel with one thread per particle
    wp.launch(
        integrate_bodies,
        dim=(num_bodies,),  # Launch num_bodies threads
        inputs=[num_bodies, dt, mass_array, pos_array_0, vel_array, pos_array_1],
    )
    # Swap position buffers for next iteration (double buffering)
    pos_array_0, pos_array_1 = pos_array_1, pos_array_0

#### Animating the simulation

The cell below mainly contains the boilerplate code for animating the N-body simulation built so far. Let's see how the simulation evolves.

In [None]:
%matplotlib widget

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation


# Initialize problem with particle positions, velocities, and masses
pos_array_0, pos_array_1, vel_array, mass_array, scale, init_pos_np = init_problem(
    num_bodies
)

# Create a figure with a dark background
plt.style.use("dark_background")
fig = plt.figure(figsize=(8, 6), dpi=150)
ax = fig.add_subplot(111, projection="3d")

# Calculate distance of all particles from origin for color mapping
distances = np.linalg.norm(init_pos_np, axis=1)

# Create a scatter plot
scatter_plot = ax.scatter(
    init_pos_np[:, 0],
    init_pos_np[:, 1],
    init_pos_np[:, 2],
    c=distances,  # Color by distance from origin
    cmap="plasma",  # Set colormap to plasma
    s=50,  # Set point size
    edgecolors="white",  # White edges for contrast
    linewidth=0.5,  # Thin edge lines
    marker="o",  # Round markers
)

# Add a colorbar with label
cbar = plt.colorbar(scatter_plot, ax=ax, pad=0.1, shrink=0.8)
cbar.set_label("Distance from Origin", rotation=270, labelpad=20, fontsize=12)

# Set axis labels with custom fonts
ax.set_xlabel("X Position", fontsize=12, fontweight="bold", labelpad=10)
ax.set_ylabel("Y Position", fontsize=12, fontweight="bold", labelpad=10)
ax.set_zlabel("Z Position", fontsize=12, fontweight="bold", labelpad=10)

# Set axis limits with some padding
padding = scale * 0.1
ax.set_xlim(-scale - padding, scale + padding)
ax.set_ylim(-scale - padding, scale + padding)
ax.set_zlim(-scale - padding, scale + padding)

# Customize the grid
ax.grid(True, alpha=0.3, linestyle="--")

# Set a viewing angle
ax.view_init(elev=20, azim=45)

# Turn off the pane fill so the dark background shows through
ax.xaxis.pane.fill = False
ax.yaxis.pane.fill = False
ax.zaxis.pane.fill = False

# Customize tick labels
ax.tick_params(axis="both", which="major", labelsize=10)

# Use an equal aspect ratio so the cubical domain is not distorted
ax.set_box_aspect([1, 1, 1])


def update_plot(frame):
    """Update function for animation with proper time stepping."""
    global pos_array_0, pos_array_1

    # Perform integration step using direct kernel launch
    wp.launch(
        integrate_bodies,
        dim=(num_bodies,),
        inputs=[num_bodies, dt, mass_array, pos_array_0, vel_array, pos_array_1],
    )

    # Swap position arrays (double buffering)
    pos_array_0, pos_array_1 = pos_array_1, pos_array_0

    # Copy updated positions to CPU for visualization
    positions_cpu = pos_array_0.numpy()

    # Calculate new distances for color mapping
    distances = np.linalg.norm(positions_cpu, axis=1)

    # Update scatter plot positions
    scatter_plot._offsets3d = (
        positions_cpu[:, 0],
        positions_cpu[:, 1],
        positions_cpu[:, 2],
    )

    # Update colors based on new distances
    scatter_plot.set_array(distances)

    # With blit=True, FuncAnimation expects a sequence of artists
    return (scatter_plot,)


# Create animation with optimized settings
anim = FuncAnimation(
    fig,
    update_plot,
    frames=1000,
    interval=25,
    blit=True,
    repeat=True,
)

plt.tight_layout()
plt.show()



---
## Benchmarking SIMT paradigm

Now that we have established our baseline CPU implementation, let us leverage NVIDIA Warp to accelerate our simulation.

### Parallel Monte Carlo strategy

Our first approach uses a straightforward parallelization strategy:

1. Launch a GPU kernel across a $N \times N$ grid where each thread processes one lattice site
2. Use a double-buffer approach to avoid race conditions:
   - `lattice_0`: Contains the current state (read-only during update)
   - `lattice_1`: Stores the updated state after parallel processing
3. After each Monte Carlo step, swap the `lattice_0` and `lattice_1` arrays before proceeding to the next Monte Carlo step

### Main update logic

The main update logic would then look something like this:

```python
def update_sites_in_parallel(i, j, lattice_0, lattice_1, L, beta):
    # Read current spin value; current values are stored in lattice_0
    spin_ij = lattice_0[i, j]
    
    # Calculate nearest neighbor sum with periodic boundary conditions
    # Use the current state of neighboring spins, i.e., spins taken from lattice_0
    nn_sum = (
        lattice_0[(i - 1 + L) % L, j] + # Top neighbor
        lattice_0[(i + 1) % L, j] +     # Bottom neighbor  
        lattice_0[i, (j - 1 + L) % L] + # Left neighbor
        lattice_0[i, (j + 1) % L]       # Right neighbor
    )
    
    # Calculate acceptance probability using Metropolis criterion
    acceptance_probability = wp.exp(-2.0 * beta * spin_ij * nn_sum)
    
    # Generate random number and decide whether to flip
    if wp.randf() < acceptance_probability:
        lattice_1[i, j] = -lattice_0[i, j] # Flip the spin and save to lattice_1
```


Let us implement this approach and examine its performance and results. We will first write the different Warp kernels needed for this approach. Thereafter, we will assemble all these kernels to simulate the Metropolis algorithm and visualize the results.

### GPU memory allocation for the 2-D lattice

To implement this on a GPU, we need to allocate memory storage for our 2-D lattice. We will use a 2-D Warp array to store the spin configuration. We use `wp.int8` as our data type since each spin only needs to store values of +1 or -1.

### Random number generation in Warp

For the Monte Carlo acceptance decisions, each GPU thread needs to generate independent random numbers. Warp provides a built-in random number generator that can be called directly from within kernels.

The key components of Warp's random number generator (RNG) are:

1. **Initialization**: Create an RNG state using `state = wp.rand_init(seed, offset)`
   - `seed`: Common value shared across all threads
   - `offset`: Unique value per thread

2. **Generation**: Use `wp.randf(state, 0, 1)` to generate random floats in [0,1)

This approach ensures that:
- Each thread generates independent random numbers
- The simulation is reproducible when using the same seed
- Random number generation happens efficiently on the GPU without CPU-GPU transfers

Now let us implement the lattice initialization using Warp in the cell below.

**Exercise:** In the cell above, change `LATTICE_SIZE` to $128, 64, 32, 16, 8$, then re-run the cell. For the smaller `LATTICE_SIZE` values of $16$ and $8$, verify that the pattern remains consistent when using the same `rng_seed` but changes when you modify the seed value.

### Implementing the Monte Carlo update kernel on GPU

Now we will implement the core of our GPU-accelerated Ising model simulation: a Warp kernel that performs the Metropolis-Hastings algorithm across the entire lattice in parallel.

#### Kernel input parameters

Our `update_lattice` kernel requires four key inputs:

1. **`lattice_in`**: The 2-D spin configuration array that serves as input
2. **`lattice_out`**: The 2-D spin configuration array that serves as output and will be modified in place
3. **`rng_seed`**: Random number seed for generating acceptance probabilities  
4. **`beta`**: Inverse temperature $\beta = 1/(k_B T)$, we assume $k_B = 1$

#### Parallel execution strategy

Each GPU thread will:
1. Calculate the energy change $\Delta E$ from flipping its assigned spin
2. Compute the acceptance ratio
3. Generate a random number and accept/reject the flip accordingly
4. Update the lattice site in `lattice_out` array

This approach allows us to evaluate $N^2$ spin flip proposals in parallel, which yields a speedup over sequential CPU implementations, as you will see shortly.

The `update_lattice(...)` kernel is called at the beginning of every Monte Carlo step. Once the kernel executes, `lattice_1` contains the updated state of spins. Right after the execution of the `update_lattice(...)` kernel, we swap `lattice_0` and `lattice_1` for the next Monte Carlo step. Hence, a single Monte Carlo step for the current GPU implementation would look something like this:

```python
# Execute update_lattice kernel to update all lattice sites in parallel
# lattice_1 contains the most updated lattice state
wp.launch(update_lattice, 
          dim=(lattice_size, lattice_size),
          inputs=[beta, random_seed, lattice_0, lattice_1])
    
# Swap lattice_1 and lattice_0 for next Monte Carlo step
lattice_0, lattice_1 = lattice_1, lattice_0     
```

### First set of results

In the cell below, we put together all the building blocks from the previous cells to examine the performance and behavior of our parallelization approach. 

The implementation we have created mirrors our earlier CPU-based Python version, with the key difference being the use of a buffer to avoid race conditions while trying to update spins in parallel.

#### Performance gains but unexpected results

Running the same lattice size and number of Monte Carlo steps as in the baseline Python implementation reveals significant performance improvements:

- **CPU implementation**: ~5.0 seconds for 200 Monte Carlo steps on a 256×256 lattice
- **GPU implementation**: ~0.2 seconds for the same workload

This represents a **25x speedup** - a significant improvement that demonstrates the power of GPU parallelization for this simulation. The acceleration comes from processing all 65,536 lattice sites in parallel rather than sequentially.

However, the results are not correct! At the low temperature $T = 0.02$, the 2-D Ising model should exhibit strong ordering - we should see large domains of aligned spins with well-defined boundaries. Even at high temperatures, the results appear incorrect and differ significantly from our baseline Python implementation. This highlights the importance of maintaining a reference implementation when developing complex algorithms — it provides a reliable benchmark for validating correctness.

The lattice configuration appears unusually fragmented and disordered. This unexpected behavior suggests that our seemingly straightforward parallelization approach may have introduced subtle but significant algorithmic issues.

#### Investigating the discrepancy

The dramatic speedup is certainly encouraging, but the incorrect physics behavior indicates that raw performance gains mean nothing if the underlying algorithm produces wrong results. This observation leads us to an important question:

**What could be causing this discrepancy between the expected and the observed behavior? Try thinking about what could be the fundamental differences between the sequential Python implementation and the GPU implementation.**

In the next section, we will explore why our naive parallelization fails and introduce the checkerboard approach that correctly preserves the physics while maintaining GPU acceleration.


---
## Tile-based programming approach

### Why did our parallel trick fail?

The problem with our double-buffered approach is that it breaks a core rule of the Metropolis algorithm: **the system must evolve sequentially**. Each update needs to depend on the result of the step that came immediately before it. This one-at-a-time process is what guarantees that the physics comes out right.

In our baseline sequential Python implementation, this rule is followed correctly: when a spin at a randomly selected site was flipped, its four neighbors immediately saw the new value for any subsequent calculations. This ensures information flows through the lattice in a sequential fashion.

The parallel double-buffered approach breaks this flow. Instead of using the most current information, it forces every spin update to be based on a single, old snapshot of the lattice taken at the beginning of the Monte Carlo step.

Because it lacks this essential sequential dependency, the double-buffered method is no longer running the Metropolis algorithm. It is executing a fundamentally different process, which is why it is fast but fails to produce the correct physical results.
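
One way to see the failure concretely is a zero-temperature toy case. The NumPy sketch below is not the notebook's Warp kernel. It assumes $J = 1$, periodic boundaries, and a simplified acceptance rule that flips a spin only when the flip strictly lowers the energy. The helper names (`neighbor_sum`, `energy`, `synchronous_step`, `sequential_sweep`) are introduced here purely for illustration. Starting from a perfectly alternating lattice, the snapshot-based update flips every spin at once and leaves the energy unchanged, while a sequential sweep lowers it.

```python
import numpy as np

def neighbor_sum(spins):
    """Sum of the four nearest neighbors with periodic boundaries."""
    return (np.roll(spins, 1, axis=0) + np.roll(spins, -1, axis=0)
            + np.roll(spins, 1, axis=1) + np.roll(spins, -1, axis=1))

def energy(spins, J=1.0):
    """Total energy, counting each nearest-neighbor bond once."""
    return -J * np.sum(spins * (np.roll(spins, 1, axis=0) + np.roll(spins, 1, axis=1)))

def synchronous_step(spins, J=1.0):
    """Update every spin from the same old snapshot (double-buffered style)."""
    delta_e = 2.0 * J * spins * neighbor_sum(spins)
    return np.where(delta_e < 0, -spins, spins)

def sequential_sweep(spins, J=1.0):
    """Visit sites one at a time; each update sees all earlier flips."""
    spins = spins.copy()
    n_rows, n_cols = spins.shape
    for i in range(n_rows):
        for j in range(n_cols):
            nn = (spins[(i + 1) % n_rows, j] + spins[(i - 1) % n_rows, j]
                  + spins[i, (j + 1) % n_cols] + spins[i, (j - 1) % n_cols])
            if 2.0 * J * spins[i, j] * nn < 0:
                spins[i, j] = -spins[i, j]
    return spins

n = 8  # even, so the alternating pattern is consistent across the periodic boundary
rows, cols = np.indices((n, n))
alternating = np.where((rows + cols) % 2 == 0, 1, -1)

after_sync = synchronous_step(alternating)
after_seq = sequential_sweep(alternating)

print("initial energy:     ", energy(alternating))
print("synchronous energy: ", energy(after_sync))
print("sequential energy:  ", energy(after_seq))
```

The synchronous update simply inverts the pattern, so repeating it oscillates between two equally unfavorable states. The sequential sweep uses raster order here rather than random site selection, which is enough to show that fresh neighbor values let the energy decrease.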

### The checkerboard approach

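The checkerboard algorithm resolves this inconsistency within a single Monte Carlo step. The lattice sites are colored like a checkerboard, so that the four nearest neighbors of every black site are white and vice versa. Because spins of the same color never interact directly, all black spins can be updated in parallel using the current white spins, and then all white spins can be updated using the freshly updated black spins.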

We now allocate separate arrays for the black and white sublattices. Each array is of size $N$ by $N/2$, where we assume $N$ is even.

We also allocate an $N$ by $N$ array for the combined lattice. This array is not needed for the computation, but it does help with visualization.

Let us see how this works. We will set all values of `lattice_w` to 1 and all values of `lattice_b` to -1, and then combine the two arrays by running the `combine_lattices` kernel.

**Exercise**: Re-run the above cell with different values of `LATTICE_SIZE`, such as 16, 32, 64, and so on. Verify that the black and white spins maintain their proper checkerboard orientation for all lattice sizes.
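
The sublattice layout described above can be sketched in plain NumPy. This is an illustrative CPU stand-in, not the notebook's `combine_lattices` Warp kernel. The function name `combine_lattices_numpy` and the coloring convention (site $(i, j)$ is black when $i + j$ is even, stored at compact column $j // 2$) are assumptions made for this example; the notebook's kernel may use a different convention.

```python
import numpy as np

def combine_lattices_numpy(lattice_b, lattice_w):
    """Interleave two N x N/2 sublattices into one N x N lattice.

    Assumed convention: site (i, j) is black when (i + j) is even, and its
    column in the compact sublattice array is j // 2.
    """
    n_rows, half_cols = lattice_b.shape
    full = np.empty((n_rows, 2 * half_cols), dtype=lattice_b.dtype)
    i, j = np.indices(full.shape)
    is_black = (i + j) % 2 == 0
    full[is_black] = lattice_b[i[is_black], j[is_black] // 2]
    full[~is_black] = lattice_w[i[~is_black], j[~is_black] // 2]
    return full

LATTICE_SIZE = 8  # must be even
lattice_b = -np.ones((LATTICE_SIZE, LATTICE_SIZE // 2), dtype=np.int8)
lattice_w = np.ones((LATTICE_SIZE, LATTICE_SIZE // 2), dtype=np.int8)

combined = combine_lattices_numpy(lattice_b, lattice_w)
print(combined)
```

With all black spins set to -1 and all white spins set to 1, every site in the printed array should differ in sign from each of its four neighbors, which is a quick visual check that the indexing is right.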

Based on the checkerboard decomposition approach, we will now write an updated version of `update_lattice` kernel. Pay close attention to the inputs of the kernel and the update algorithm.
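
As a CPU reference for the idea, here is a minimal NumPy sketch of one checkerboard Monte Carlo step. It is not the notebook's `update_lattice` kernel: it works on the full $N$ by $N$ lattice with a color mask rather than on two compact sublattices, and it assumes $J = 1$, $k_B = 1$, periodic boundaries, and an even lattice size. The function name `checkerboard_step` is hypothetical.

```python
import numpy as np

def checkerboard_step(spins, temperature, rng, J=1.0):
    """One Monte Carlo step: update all black sites, then all white sites."""
    i, j = np.indices(spins.shape)
    for color in (0, 1):
        mask = (i + j) % 2 == color
        # Neighbor sums are recomputed for each color, so the second color
        # sees the spins that the first color just updated.
        nn = (np.roll(spins, 1, axis=0) + np.roll(spins, -1, axis=0)
              + np.roll(spins, 1, axis=1) + np.roll(spins, -1, axis=1))
        delta_e = 2.0 * J * spins * nn
        # Metropolis acceptance probability min(1, exp(-dE / T)), with k_B = 1.
        p_accept = np.exp(np.minimum(0.0, -delta_e / temperature))
        accept = rng.random(spins.shape) < p_accept
        spins = np.where(mask & accept, -spins, spins)
    return spins

rng = np.random.default_rng(seed=0)
spins = rng.choice(np.array([-1, 1]), size=(16, 16))
for _ in range(100):
    spins = checkerboard_step(spins, temperature=2.0, rng=rng)
print("magnetization per spin:", spins.mean())
```

Within each half-step, the sites being updated only read spins of the other color, so updating them simultaneously gives the same result as updating them one at a time.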

Let us run and benchmark the checkerboard algorithm. The code snippet below is similar to the baseline Python implementation, except the lattice updates are performed using Warp.

The output GIFs from the cell above at different `TEMPERATURE` values look visually similar to those generated by the sequential Python implementation at the beginning of the tutorial. The code still runs almost $25\times$ faster than the sequential Python implementation, even though each Monte Carlo step requires an extra kernel launch to combine the black and white lattices and the black and white updates must run one after the other.

While the visual similarity with the sequential Python implementation is reassuring, we need to quantitatively ensure that the underlying physics is faithfully reproduced. Fortunately, as mentioned at the beginning of this tutorial, the 2-D Ising model in zero external magnetic field has an analytical solution when $T < T_{\mathrm{crit}}$. In the final section of this tutorial, we will compare the results obtained from the Warp-accelerated simulation of the 2-D Ising model with the exact solution to verify the accuracy of our final Warp implementation.

---
## Comparison with analytical solution

Finally, we can write some code that plots the steady-state absolute value of magnetization across a range of temperatures.

For $T < T_{\mathrm{crit}}$, there is an analytic solution, discovered by Onsager in 1944, for the spontaneous magnetization of a 2-D Ising model in zero external magnetic field:

\begin{align}
    M = \left[1 - \frac{1}{\left(\sinh \left( \frac{2 J}{k_B T} \right) \right)^4}\right]^{1/8}
\end{align}

For $T > T_{\mathrm{crit}}$, the system is in a paramagnetic state with no net magnetization ($M = 0$).
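
The expression above is straightforward to evaluate. The sketch below assumes units in which $J = k_B = 1$ and uses the well-known critical temperature of the square-lattice model, $T_{\mathrm{crit}} = 2 / \ln(1 + \sqrt{2}) \approx 2.269$. The names `T_CRIT` and `onsager_magnetization` are chosen for this example and are not taken from the notebook.

```python
import math

# Critical temperature of the 2-D square-lattice Ising model for J = k_B = 1.
T_CRIT = 2.0 / math.log(1.0 + math.sqrt(2.0))

def onsager_magnetization(temperature):
    """Exact spontaneous magnetization per spin for J = k_B = 1 (temperature > 0)."""
    if temperature >= T_CRIT:
        return 0.0
    x = 2.0 / temperature
    if x > 50.0:
        # sinh(x)**-4 is far below double precision here; this also avoids overflow.
        return 1.0
    # max() guards against tiny negative values from rounding just below T_CRIT.
    return max(0.0, 1.0 - math.sinh(x) ** -4) ** 0.125

for t in (0.02, 1.0, 2.0, 2.2, 2.5):
    print(f"T = {t:4.2f}  M = {onsager_magnetization(t):.4f}")
```

The magnetization stays close to 1 over most of the ordered phase and drops steeply only near $T_{\mathrm{crit}}$, which is why simulations need long runs and larger lattices to match the curve in that region.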

Let us write a kernel that calculates the total magnetization of the system.

In the cell below, we build the `calculate_steady_state_magnetization` function. This function executes the following steps for any given temperature and lattice size:

1. Initialize the black and white sublattices as in previous cells
2. Run the Monte Carlo simulation for 3000 steps to reach a statistical steady state
3. Run the Monte Carlo simulation for another 1000 steps to obtain the mean and standard deviation of magnetization for the given temperature
4. Return the mean and standard deviation obtained
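
The burn-in and measurement phases listed above can be sketched independently of how a single Monte Carlo step is implemented. In the NumPy sketch below, `step_fn` is a hypothetical callable that advances the lattice by one Monte Carlo step and returns the new lattice. The function name `steady_state_magnetization` and its parameters are assumptions for illustration, not the notebook's `calculate_steady_state_magnetization`.

```python
import numpy as np

def steady_state_magnetization(spins, step_fn, n_burn_in=3000, n_measure=1000):
    """Return mean and standard deviation of |M| per spin after burn-in."""
    # Phase 1: discard the transient so that sampling starts near steady state.
    for _ in range(n_burn_in):
        spins = step_fn(spins)
    # Phase 2: sample the absolute magnetization per spin after every step.
    samples = np.empty(n_measure)
    for k in range(n_measure):
        spins = step_fn(spins)
        samples[k] = abs(spins.mean())
    return samples.mean(), samples.std()

# Placeholder step that leaves the lattice unchanged, used only to exercise the loop.
def frozen_step(spins):
    return spins

all_up = np.ones((16, 16), dtype=np.int8)
mean_m, std_m = steady_state_magnetization(all_up, frozen_step, n_burn_in=10, n_measure=5)
print(mean_m, std_m)
```

In practice, `step_fn` would wrap one full lattice update at a fixed temperature. Taking the absolute value before averaging prevents the two equivalent ordered states, all up and all down, from cancelling.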

---
## Conclusion

In this notebook, we examined different attempts to simulate the Ising model in two dimensions.

The main goal of this exercise was to illuminate some considerations when writing algorithms that run **correctly** and **efficiently** on a GPU.

Starting from a sequential Python version, we encountered a fundamental challenge when parallelizing it: while separating the lattice into "current" and "updated" grids eliminates race conditions, this approach inadvertently alters the underlying physics, producing an algorithm that no longer faithfully reproduces the statistical mechanics of the original sequential implementation.

Finally, we saw that a checkerboard algorithm allows us to recover the intended physics while addressing the data dependency issues that hindered the initial approaches.

---
## References

You can find the Warp GitHub repository and documentation below:
* "NVIDIA/warp: A Python framework for accelerated simulation, data generation and spatial computing.", GitHub, https://github.com/NVIDIA/warp.
* Warp Developers, "NVIDIA Warp Documentation," GitHub Pages, https://nvidia.github.io/warp.

Note: This tutorial was inspired by [Romero et al.'s](https://www.sciencedirect.com/science/article/abs/pii/S0010465520302228) work on GPU-accelerated 2-D Ising model simulations.