In [None]:
# SPDX-License-Identifier: Apache-2.0 AND CC-BY-NC-4.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Chapter 12: Introduction to NVIDIA Warp

## Overview

[NVIDIA Warp](https://github.com/NVIDIA/warp) is an open-source Python framework for developing high-performance simulation and AI workloads.

Warp offers coders a clear and expressive programming model to write GPU-accelerated, kernel-based programs for simulation, AI, robotics, and machine learning (ML).

Some of the main features of Warp include:

* **Performance** on par with native CUDA C++ code through a combination of **just-in-time** (JIT) compilation, integration with **CUDA-X** libraries, and transparent kernel fusion
* **Ease of use** through a high-level programming model in Python, built-in data structures and algorithms for spatial computing, and support for tile-based programming
* **Support for advanced simulation and AI workloads** through **automatic differentiation** and interoperability with other ML and accelerated Python frameworks

This notebook provides readers with an overview of the main features of the library.

Topics covered:

* Warp fundamentals: Data model, execution model, basic syntax
* Authoring basic kernels in Warp
* Using Warp efficiently with other Python-based frameworks such as NumPy, PyTorch, and JAX
* Automatic differentiation in Warp

---
## Introduction

Warp is a framework from NVIDIA for writing high-performance simulation and graphics code in Python. Central to Warp is a kernel-based programming model in which Python functions are just-in-time (JIT) compiled into efficient code that can run on CPUs and NVIDIA GPUs using C++/CUDA as an intermediate representation. Warp also features a **reverse-mode automatic differentiation** system, which allows researchers to write differentiable simulators that can optionally be incorporated into machine-learning pipelines to train neural networks using PyTorch or JAX. Researchers have applied Warp in areas like physics simulation, perception, robotics, and geometry processing.

---
## Setup

Warp is a lightweight library whose only required dependency is [NumPy](https://numpy.org/). In contrast to many other accelerated Python libraries, Warp comes prepackaged with the necessary compilers to compile code for the CPU or GPU rather than requiring
additional libraries to be installed in the development environment.

Most users install Warp from the Python Package Index (PyPI), where it is available as [warp-lang](https://pypi.org/project/warp-lang/).

In [None]:
# Install the latest version of Warp from PyPI
!pip install warp-lang

After installing the `warp-lang` package, we can `import warp` in a Python script to begin using it.

Typically, the import alias `wp` is used for Warp.

We will explicitly initialize Warp using `wp.init()` to ensure
that an NVIDIA GPU is detected in the notebook environment since the rest of the notebook assumes that at least one is present.

In [None]:
import warp as wp

wp.init()

The output of the above cell should list `"cpu"` and `"cuda:0"` devices under the `Devices:` section.

Calling `wp.init()` explicitly is not required in a user's script. Warp is initialized implicitly the first time a function that requires it is called.

### Additional ways to obtain a pre-built Warp installation

Community-maintained Conda packages for Warp are available on the [conda-forge](https://anaconda.org/conda-forge/warp-lang) channel:

```
# Install warp-lang specifically built against CUDA Toolkit 12.6
$ conda install conda-forge::warp-lang=*=*cuda126*

# Install warp-lang specifically built against CUDA Toolkit 11.8
$ conda install conda-forge::warp-lang=*=*cuda118*
```

Bleeding-edge nightly packages are published on the NVIDIA Python Package Index. This can be a way to get the latest features from the `main` branch prior to a new release, but these packages have not been tested as thoroughly as the releases published on PyPI:

```
$ pip install -U --pre warp-lang --extra-index-url=https://pypi.nvidia.com/
```

### Installing additional dependencies for this notebook

This introductory notebook makes use of some other Python packages. Before proceeding, please ensure that these dependencies are installed in the Python environment:

In [None]:
!pip install matplotlib ipympl

---
## A particle-simulation example in Warp

We will begin by looking at a basic Warp program that solves for the motion of a group of particles
under the influence of gravity $g$ and non-gravitational forces $f_n$.

This example is meant to give us an overall sense of what a Warp program looks like.
In subsequent sections, we will look at the basic concepts in more detail.

The update equations are:

\begin{align*}
a_{n} &=  f_n / m + g \\
v_{n+1} &= v_n + a_n \Delta t \\
x_{n+1} &= x_n + v_{n+1} \Delta t.
\end{align*}

We will assume a simple drag force $f_n = -b v_n$.

The particle positions and velocities will be initialized to random values.

The program then updates the positions and velocities of each particle for each time step of size $\Delta t$ for 100 steps using semi-implicit Euler integration.
At the end of the program, the final positions of the particles are printed out.

In [None]:
import numpy as np
import warp as wp

num_particles = 10_000_000  # Number of particles
num_steps = 100

mass = 0.1  # Mass per particle [kg]
g = 9.81  # Gravitational acceleration [m/s^2]
b = 0.05  # Drag Coefficient [kg/s]

dt = 0.01 * (2 * mass / b)

gravity = wp.vec3([0.0, 0.0, -g])

# Initial positions: random values between -1.0 and 1.0 for x, y, and z
rng = np.random.default_rng(12345)
positions_np = rng.uniform(low=-1.0, high=1.0, size=(num_particles, 3))
positions = wp.array(positions_np, dtype=wp.vec3)

# Initial velocities: random values between -0.5 and 0.5 m/s for vx, vy, and vz
velocities_np = rng.uniform(low=-0.5, high=0.5, size=(num_particles, 3))
velocities = wp.array(velocities_np, dtype=wp.vec3)


@wp.kernel
def integrate(positions: wp.array(dtype=wp.vec3), velocities: wp.array(dtype=wp.vec3)):
    i = wp.tid()

    acceleration = (-b * velocities[i]) / mass + gravity
    velocities[i] += acceleration * dt
    positions[i] += velocities[i] * dt


for step in range(num_steps):
    wp.launch(integrate, dim=(num_particles,), inputs=[positions, velocities])

print(f"Final positions: {positions}")

Let's look at the main sections of the program in more detail.

### Defining program constants

The first section of code defines some constants for the program:

```python
num_particles = 10_000_000  # Number of particles
num_steps = 100

mass = 0.1  # Mass per particle [kg]
g = 9.81  # Gravitational acceleration [m/s^2]
b = 0.05  # Drag Coefficient [kg/s]

dt = 0.01 * (2 * mass / b)

gravity = wp.vec3([0.0, 0.0, -g])
```

Everything but the final line is plain Python. On the final line, we see that gravity was defined as `gravity = wp.vec3([0.0, 0.0, -g])`.

`wp.vec3` is a **built-in** data type provided by Warp that represents a vector made of three 32-bit floating point values.

### Allocating arrays for the particle positions and velocities

Next, we see that we allocate some **arrays** to represent the positions and velocities of the particles:

```python
# Initial positions: random values between -1.0 and 1.0 for x, y, and z
rng = np.random.default_rng(12345)
positions_np = rng.uniform(low=-1.0, high=1.0, size=(num_particles, 3))
positions = wp.array(positions_np, dtype=wp.vec3)

# Initial velocities: random values between -0.5 and 0.5 m/s for vx, vy, and vz
velocities_np = rng.uniform(low=-0.5, high=0.5, size=(num_particles, 3))
velocities = wp.array(velocities_np, dtype=wp.vec3)
```

Arrays in Warp are the fundamental way to represent data and can be created using the `wp.array()` constructor.
Like NumPy, arrays can be multi-dimensional and all elements of an array must be of the same **data type**.

Note that for both the positions and velocities, we first created NumPy arrays containing the data we wanted to initialize the particles with and then passed these NumPy arrays into the `wp.array()` constructor along with the data type of `wp.vec3` (same type as the `gravity` variable). This is one of the common ways that Warp can **interoperate** with NumPy.

The memory allocations that back the `positions` and `velocities` arrays end up on the GPU in our example, which is important because we want to use the GPU to update the particles in parallel.

### Defining a kernel

With arrays allocated for the particle data, we then define the computation on that data by writing a **kernel**, which is essentially a function that gets compiled and executed across many threads on the GPU. Kernels in Warp can also be compiled and executed on the CPU, but CPU execution is currently limited to a single thread.

Kernels in Warp are defined by decorating a Python function with `@wp.kernel`. Valid kernels in Warp must also obey additional restrictions, such as:

- Use a subset of the Python language
- Arguments must be typed
- Cannot `return` anything

When a kernel is *launched*, the body of the kernel is executed a certain number of times in parallel as specified through the **kernel launch dimensions** passed to `wp.launch()`.

In contrast, ordinary Python functions only get executed once when called.

We define our `integrate` kernel to accept the `positions` and `velocities` arrays as inputs. Inside the kernel, we use `wp.tid()` to get the current thread ID, which tells us which particle (array element) this particular thread should process. This allows each thread to work on a different particle simultaneously.

```python
@wp.kernel
def integrate(positions: wp.array(dtype=wp.vec3), velocities: wp.array(dtype=wp.vec3)):
    i = wp.tid()

    acceleration = (-b * velocities[i]) / mass + gravity
    velocities[i] += acceleration * dt
    positions[i] += velocities[i] * dt
```
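
As a cross-check on the kernel arithmetic, the following plain-Python sketch applies the same semi-implicit Euler update to the $z$ component of a single particle, using the constants from this example. The helper name `integrate_z` is hypothetical, and the loop runs serially on the CPU. It only illustrates what one thread computes and is not a substitute for the Warp kernel.

```python
mass = 0.1  # Mass per particle [kg]
g = 9.81  # Gravitational acceleration [m/s^2]
b = 0.05  # Drag coefficient [kg/s]
dt = 0.01 * (2 * mass / b)  # Time step [s]


def integrate_z(z, vz, num_steps):
    """Advance one particle's z position and velocity (hypothetical helper)."""
    for _ in range(num_steps):
        acceleration = (-b * vz) / mass - g
        vz += acceleration * dt  # Update the velocity first...
        z += vz * dt  # ...then use the new velocity for the position
    return z, vz


z_final, vz_final = integrate_z(z=0.0, vz=0.0, num_steps=100)
terminal_velocity = -mass * g / b  # Velocity at which drag balances gravity

print(f"vz after 100 steps: {vz_final:.3f} m/s")
print(f"terminal velocity:  {terminal_velocity:.3f} m/s")
```

Because the drag force grows with speed, the velocity approaches the terminal value $-mg/b$ from above and, for this choice of $\Delta t$, does not overshoot it.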

### Launching the `integrate` kernel

We use the `wp.launch()` function to run our kernel on the GPU. This function takes three main arguments: the kernel function to execute, the number of parallel threads to launch (specified by `dim`), and the input arguments that match the kernel's function signature.

```python
for step in range(num_steps):
    wp.launch(integrate, dim=(num_particles,), inputs=[positions, velocities])
```

The `integrate` kernel is launched inside a for loop to simulate the particle motion over time. Each iteration of the loop represents one time step, updating the positions and velocities of all particles based on the physics equations. This process repeats for `num_steps` iterations to simulate the full duration of the particle system.

### Printing out the results

After completing all the simulation time steps, we can print out the final particle positions:

```python
print(f"Final positions: {positions}")
```

This simple print statement works seamlessly because Warp automatically handles the data transfer from GPU memory back to CPU memory.
The `positions` array, which was stored on the GPU during computation, is transparently copied to the CPU so we can display its values.

### Summary

In this basic example, we saw the fundamental building blocks of a Warp program:

- **Data management**: Using `wp.array()` to create and store data on the GPU
- **Computation**: Defining kernels as Python functions decorated with `@wp.kernel` to perform parallel operations
- **Execution**: Launching kernels with `wp.launch()` to run computations across multiple GPU threads

Together, these components enable high-performance parallel computing on GPUs while maintaining Python's ease of use.
In the following sections, we will explore each of these concepts in more detail.

---
## Kernels

In Warp, computational kernels are defined as Python functions and annotated with the `@wp.kernel` decorator.

Python functions that define Warp kernels must obey some additional restrictions, such as:

- Use a subset of the Python language
- Arguments must be typed
- Cannot `return` anything

Conceptually, Warp kernels are similar to CUDA kernels. When a kernel is *launched*, the body of the kernel
is executed a certain number of times in parallel as specified through the **kernel launch dimensions**.

In contrast, ordinary Python functions only get executed once when called.

Like CUDA kernels, Warp kernels do not return a value. Instead, each thread in a kernel can modify data in **global memory**
wrapped by the Warp arrays passed as arguments to the kernel.
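
One way to picture a kernel launch is as a loop over thread indices in which every iteration writes its result into an array rather than returning it. The sketch below emulates that idea in plain Python. The names `launch_serial` and `fill` are hypothetical, and the thread index is passed in as an argument here, whereas a real Warp kernel obtains it from `wp.tid()`. Unlike this serial loop, a real launch runs the threads in parallel with no guaranteed ordering.

```python
def launch_serial(kernel, dim, inputs):
    """Call `kernel` once per thread index, one after another (hypothetical helper)."""
    for tid in range(dim):
        kernel(tid, *inputs)


def fill(tid, x):
    # Each "thread" writes only to the element it owns
    x[tid] = tid


x = [0] * 10
launch_serial(fill, dim=10, inputs=[x])
print(x)
```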

The following notebook cell contains one of the simplest Warp kernels possible. It fills out an array with the thread index.

In [None]:
@wp.kernel
def fill_kernel(x: wp.array(dtype=int)):
    i = wp.tid()  # Get the thread index
    x[i] = i


x = wp.zeros(10, dtype=int)

print(f"Initial x: {x.numpy()}")
wp.launch(fill_kernel, dim=[10], inputs=[x])
print(f"Final x: {x.numpy()}")

### Execution of kernels on the GPU

At this point, it is worth understanding how kernels in Warp map to hardware units on the GPU.

- **Warp** maps the grid dimensions provided in `wp.launch()` into a **one-dimensional CUDA kernel grid**
- The **CUDA grid** is decomposed into individual **thread blocks** of uniform size, which are executed independently from each other
  - Warp defaults the CUDA grid to 256 threads per block. While you can adjust this up to 1024, whether a different number improves performance is specific to your kernel's workload and the GPU architecture.
- Each **thread block** is assigned for execution on a **streaming multiprocessor (SM)**
  - Each block can be scheduled on any available SM in any order
  - Once a thread block begins executing on an SM, it will run to completion on that same SM
  - An SM can often run multiple thread blocks **concurrently**
- An SM processes a thread block by dividing it into **hardware warps**, each consisting of 32 threads that execute instructions in lockstep (SIMT - Single Instruction, Multiple Thread).
  - This is not an important detail for the purposes of using **NVIDIA Warp**, but it's worth mentioning this potential source of naming confusion.
- Individual threads within a hardware warp execute their instructions on **CUDA cores** within the SM

![Execution Hierarchy on a GPU](images/chapter-02/gpu-kernel-exec.png)

*Image credit: [NVIDIA Developer Blog](https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/)*
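
To get a feel for the numbers involved, the following sketch estimates how a launch of ten million threads decomposes into thread blocks and hardware warps. It uses a simple ceiling-division model with the 256-thread block size mentioned above. This is an assumption made for illustration, and the grid that Warp actually launches may be configured differently.

```python
import math

num_threads = 10_000_000  # One thread per particle, as in the earlier example
block_dim = 256  # Threads per block (the default stated above)
warp_size = 32  # Threads per hardware warp

num_blocks = math.ceil(num_threads / block_dim)
warps_per_block = block_dim // warp_size
threads_in_last_block = num_threads - (num_blocks - 1) * block_dim

print(f"{num_blocks=}, {warps_per_block=}, {threads_in_last_block=}")
```

Under this model the last block is only partially filled, which is why GPU kernels commonly need to guard against thread indices that fall outside the requested launch dimensions.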

### The compilation pipeline and the kernel cache

When a kernel is launched for the first time, all kernels in the module that have been defined so far will be translated to native C++/CUDA code and **just-in-time compiled**.

Both the C++/CUDA source and compiled objects are stored as files in the *kernel cache*, which is located in `wp.config.kernel_cache_dir` so that subsequent launches do not have to incur a code generation and compilation overhead.

In [None]:
wp.config.kernel_cache_dir

After launching `fill_kernel`, two files will appear in the kernel cache: a `.cu` file containing the CUDA source generated by Warp,
and a `.ptx` or `.cubin` file containing the compiler output from the NVIDIA Runtime Compilation (NVRTC) library.

The following image shows the Warp compilation pipeline:

<div style="background-color:white;">
    <img src="./images/chapter-12/warp-compilation-pipeline.svg" width="800">
</div>

Let's take a closer look at the generated code.

We will use a separate script that runs a finite-difference kernel. The kernel cache will be changed from the default location so we can look at the cache from the sidebar.

In [None]:
%%writefile Chapter_12_finite_difference.py

import warp as wp
import numpy as np
import os

wp.config.kernel_cache_dir = os.path.join(
    os.path.dirname(os.path.realpath(__file__)), "Chapter_12_finite_difference_example_cache"
)


@wp.kernel
def finite_difference(dx: float, u: wp.array(dtype=float), u_out: wp.array(dtype=float)):
    i = wp.tid()
    total_points = u.shape[0]
    u_out[i] = (u[(i + 1) % total_points] - u[(i - 1 + total_points) % total_points]) / (2.0 * dx)


# Exclude the endpoint so the grid spacing equals dx and the data is periodic
sin_array_np = np.sin(np.linspace(0, 2 * np.pi, 1024, endpoint=False))
sin_array_wp = wp.array(sin_array_np, dtype=float)

u_out_wp = wp.empty_like(sin_array_wp)

dx = 2 * np.pi / 1024

wp.launch(finite_difference, sin_array_wp.shape, inputs=[dx, sin_array_wp], outputs=[u_out_wp])

print(f"{sin_array_wp.numpy()=}")
print(f"{u_out_wp.numpy()=}")

In [None]:
!python Chapter_12_finite_difference.py

Explore the files that are saved in the kernel cache located in the `Chapter_12_finite_difference_example_cache` directory.

Note that by default, Warp generates both a **forward** and **backward** version of each kernel.

The **forward** version is what you are probably accustomed to seeing. The **backward** version is used in automatic differentiation (more on this later).

Now, we will make a small modification to the file so that `u_out` holds a central finite-difference approximation of the second derivative instead of the first derivative.

Everything else will be the same (filename, the name of the `finite_difference` kernel, problem resolution, etc.)
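
The two stencils can be sanity-checked without a GPU. The sketch below evaluates both of them on one period of a sine wave in plain Python (double precision, periodic wrap-around as in the kernels) and compares the results with the analytical derivatives $\cos x$ and $-\sin x$. It is an independent illustration and is not part of the script written by the notebook cells.

```python
import math

n = 1024
dx = 2.0 * math.pi / n
u = [math.sin(i * dx) for i in range(n)]  # One full period, endpoint excluded

# First derivative: (u[i+1] - u[i-1]) / (2 dx)
du = [(u[(i + 1) % n] - u[(i - 1 + n) % n]) / (2.0 * dx) for i in range(n)]

# Second derivative: (u[i+1] - 2 u[i] + u[i-1]) / dx^2
d2u = [(u[(i + 1) % n] - 2.0 * u[i] + u[(i - 1 + n) % n]) / (dx * dx) for i in range(n)]

max_err_du = max(abs(du[i] - math.cos(i * dx)) for i in range(n))
max_err_d2u = max(abs(d2u[i] + math.sin(i * dx)) for i in range(n))

print(f"max error, first derivative:  {max_err_du:.2e}")
print(f"max error, second derivative: {max_err_d2u:.2e}")
```

Both errors shrink in proportion to `dx * dx` when the grid is refined. Results computed in single precision, as in the Warp kernels, can be expected to be noisier, especially for the second derivative.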

In [None]:
%%writefile Chapter_12_finite_difference.py

import warp as wp
import numpy as np
import os

wp.config.kernel_cache_dir = os.path.join(
    os.path.dirname(os.path.realpath(__file__)), "Chapter_12_finite_difference_example_cache"
)


@wp.kernel
def finite_difference(dx: float, u: wp.array(dtype=float), u_out: wp.array(dtype=float)):
    i = wp.tid()
    total_points = u.shape[0]
    u_out[i] = (
        u[(i + 1) % total_points] - 2.0 * u[i] + u[(i - 1 + total_points) % total_points]
    ) / (dx * dx)


# Exclude the endpoint so the grid spacing equals dx and the data is periodic
sin_array_np = np.sin(np.linspace(0, 2 * np.pi, 1024, endpoint=False))
sin_array_wp = wp.array(sin_array_np, dtype=float)

u_out_wp = wp.empty_like(sin_array_wp)

dx = 2 * np.pi / 1024

wp.launch(finite_difference, sin_array_wp.shape, inputs=[dx, sin_array_wp], outputs=[u_out_wp])

print(f"{sin_array_wp.numpy()=}")
print(f"{u_out_wp.numpy()=}")

In [None]:
!python Chapter_12_finite_difference.py

In the log from the previous cell, we should see a line like:

```
Module __main__ ea6bc0d load on device 'cuda:0' took 305.15 ms  (compiled)
```

Changing the body of the `finite_difference` kernel resulted in a new hash (e.g. `ea6bc0d`) for the module.

Since the kernel cache did not already contain compiled code for the `__main__` module with that hash,
the Python code was translated to CUDA C++ and compiled.
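
The idea of a content-addressed cache can be sketched in a few lines of plain Python. Everything in the block below is hypothetical: the helper names, the use of SHA-256 over the source text alone, and the seven-character key. Warp's real module hash is computed differently and is not reproduced here. The sketch only shows why unchanged code hits the cache while edited code triggers a new compilation.

```python
import hashlib

compiled_cache = {}  # Maps a content hash to a "compiled" artifact
compile_count = 0


def content_hash(source):
    """Short hexadecimal digest of the kernel source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()[:7]


def load_module(source):
    """Return the cache key for `source`, compiling only on a cache miss."""
    global compile_count
    key = content_hash(source)
    if key not in compiled_cache:
        compile_count += 1  # Stand-in for code generation and compilation
        compiled_cache[key] = f"binary-for-{key}"
    return key


first = load_module("u_out[i] = (u[i + 1] - u[i - 1]) / (2.0 * dx)")
again = load_module("u_out[i] = (u[i + 1] - u[i - 1]) / (2.0 * dx)")
edited = load_module("u_out[i] = (u[i + 1] - 2.0 * u[i] + u[i - 1]) / (dx * dx)")

print(first == again, first == edited, compile_count)
```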

### Type conversions inside Warp kernels

Because Warp kernels are compiled to native C++/CUDA code, all the function input arguments should be *typed*.

This allows Warp to generate fast code that executes at essentially native speeds.

An exception will be raised if there are type mismatches, as Warp does not automatically perform type conversions for the user.

Let's see what happens if `fill_kernel` expects an `int` array, but we give it a `float` array.

In [None]:
@wp.kernel
def fill_kernel(x: wp.array(dtype=int)):
    i = wp.tid()
    x[i] = i


x = wp.array(shape=10, dtype=float)

wp.launch(fill_kernel, dim=[10], inputs=[x])

Sometimes you will need to cast variables to a different type inside Warp kernels, e.g. to multiply a `float` by an `int`:

```python
    prod[i] = float(int_array[i])*float_array[i]
```

or equivalently,

```python
    prod[i] = wp.float32(int_array[i])*float_array[i]
```

As an exercise, fix the type issue in the following kernel:

In [None]:
@wp.kernel
def sin_kernel(dx: float, result: wp.array(dtype=float)):
    i = wp.tid()

    # Convert to a position
    x_pos = i * dx

    result[i] = wp.sin(2.0 * wp.PI * x_pos)


dx = 2.0
result_array = wp.empty(10, dtype=float)

wp.launch(sin_kernel, result_array.shape, inputs=[dx, result_array])

print(result_array.numpy())

### Using generics to create more flexible kernels

For convenience and to improve code reusability, Warp supports the use of `typing.Any` instead of concrete types.

See the [Generics documentation](https://nvidia.github.io/warp/modules/generics.html) for more information.

The below example defines a single generic kernel and launches it three times on different data types.

Note the use of `type()` in type conversions.

In [None]:
from typing import Any


@wp.kernel
def sin_kernel(dx: Any, result: wp.array(dtype=Any)):
    i = wp.tid()

    # Convert to a position
    x_pos = type(dx)(i) * dx

    result[i] = wp.sin(type(result[i])(2.0 * wp.PI) * x_pos)


# wp.float16
result_array = wp.empty(10, dtype=wp.float16)
wp.launch(sin_kernel, result_array.shape, inputs=[wp.float16(2.0), result_array])
print(f"wp.float16: {result_array.numpy()}")

# float or wp.float32
result_array = wp.empty(10, dtype=float)
wp.launch(sin_kernel, result_array.shape, inputs=[float(2.0), result_array])
print(f"wp.float32: {result_array.numpy()}")

# wp.float64
result_array = wp.empty(10, dtype=wp.float64)
wp.launch(sin_kernel, result_array.shape, inputs=[wp.float64(2.0), result_array])
print(f"wp.float64: {result_array.numpy()}")

### Debug printing from inside Warp kernels

We can use `wp.printf()` to print C-style formatted strings from inside Warp kernels.

To print composite types like vectors and matrices, use `print()`.

In [None]:
@wp.kernel
def print_tid():
    i = wp.tid()
    wp.printf("Thread Index: %d\n", i)


wp.launch(print_tid, (10,))

### Multi-dimensional kernels

So far, we have been launching kernels on 1-D grids, but we can use up to 4-D grids (matching the maximum dimensionality of Warp arrays).

To get multi-dimensional thread indices, we make use of tuple unpacking:

```python
i = wp.tid()
i, j = wp.tid()
i, j, k = wp.tid()
i, j, k, l = wp.tid()
```
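
A multi-dimensional thread index can be thought of as a flat thread index that has been unravelled against the launch dimensions. The sketch below shows one such mapping in plain Python, assuming a row-major convention in which the last dimension varies fastest. The helper `unravel` is hypothetical, and the convention is an assumption of this sketch rather than a documented guarantee.

```python
def unravel(linear_index, shape):
    """Convert a flat index into one index per dimension (last dimension fastest)."""
    indices = []
    for extent in reversed(shape):
        indices.append(linear_index % extent)
        linear_index //= extent
    return tuple(reversed(indices))


shape = (2, 2, 2, 2)
for linear_index in (0, 5, 15):
    print(linear_index, unravel(linear_index, shape))
```

Every flat index from 0 to 15 maps to a distinct 4-D index, so a launch with dimensions `(2, 2, 2, 2)` covers each grid point exactly once.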

The following example launches a kernel on a 4-D grid on which each thread retrieves and prints out its thread index.

In [None]:
@wp.kernel
def basic_4d_kernel():
    i, j, k, w = wp.tid()
    wp.printf("(%d,%d,%d,%d)\n", i, j, k, w)


wp.launch(basic_4d_kernel, (2, 2, 2, 2))

### Using the `device` keyword

The `device` keyword may be used to target a specific device for a kernel launch.

Recall that at the beginning of this notebook, we saw the `"cpu"` and `"cuda:0"` devices printed out in the `Devices:` section when we called `wp.init()`. These aliases may be used to launch kernels and allocate arrays with the `device` keyword.

All arrays must reside on the same device as the kernel launch.

We get an error if we try to launch a kernel on `"cpu"` with arrays that are on `"cuda:0"`:

In [None]:
@wp.kernel
def sum_kernel(a: wp.array(dtype=float), sum: wp.array(dtype=float)):
    i = wp.tid()
    wp.atomic_add(sum, 0, a[i])


inputs = wp.ones(10, dtype=float, device="cuda:0")  # ERROR: Needs to be on "cpu"
sum = wp.zeros(1, dtype=float, device="cpu")  # Accumulator must start at zero

wp.launch(sum_kernel, inputs.shape, inputs=[inputs, sum], device="cpu")

If a Warp API call that accepts a `device` argument (typically an array allocation or a kernel launch) is not given one, the **default device** is used.

The default device will be `'cuda:0'` if an NVIDIA GPU was detected by Warp, otherwise it will be `'cpu'`.

We can get the default device by calling `wp.get_device()` without an argument.

In [None]:
wp.get_device()

---
## Arrays

Memory allocations are exposed via the `wp.array` type. They are important because **kernels** must write their results to memory instead of directly returning values.

Arrays wrap an underlying memory allocation that may live in either host (CPU) or device (GPU) memory.

All arrays have an associated data type, which can be a scalar data type (e.g. `float`, `int`) or a composite data type (e.g. `vec3`, `mat33`).

We can set the data type using the `dtype` parameter, or we can let Warp infer it when creating an array from existing data (e.g. a Python list or NumPy array).

The full list of **scalar data types** that can be used for Warp arrays is currently:

| Name       | Description            |
|------------|------------------------|
| `bool`     | Boolean                |
| `int8`     | Signed byte            |
| `uint8`    | Unsigned byte          |
| `int16`    | Signed short           |
| `uint16`   | Unsigned short         |
| `int32`    | Signed integer         |
| `uint32`   | Unsigned integer       |
| `int64`    | Signed long integer    |
| `uint64`   | Unsigned long integer  |
| `float16`  | Half-precision float   |
| `float32`  | Single-precision float |
| `float64`  | Double-precision float |
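
The scalar type determines how much memory an array occupies. The sketch below estimates the payload size of an array from its shape, scalar type, and number of scalars per element. The helper `array_nbytes` and the lookup table are hypothetical, and the sketch assumes that `bool` occupies one byte and that elements are tightly packed.

```python
import math

# Size in bytes of each scalar type in the table above
SCALAR_BYTES = {
    "bool": 1, "int8": 1, "uint8": 1,
    "int16": 2, "uint16": 2, "float16": 2,
    "int32": 4, "uint32": 4, "float32": 4,
    "int64": 8, "uint64": 8, "float64": 8,
}


def array_nbytes(shape, scalar_type, components=1):
    """Estimate the payload size of an array (hypothetical helper).

    `components` is the number of scalars per element, e.g. 3 for a
    three-component vector.
    """
    return math.prod(shape) * components * SCALAR_BYTES[scalar_type]


# 10 million three-component float32 vectors, as in the particle example
print(array_nbytes((10_000_000,), "float32", components=3) / 1e6, "MB")
```

By this estimate, the `positions` and `velocities` arrays from the particle example each take roughly 120 MB, which is worth keeping in mind when choosing between `float32` and `float64`.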

The following code shows the construction of an under-specified `wp.array()` (both a shape and a data type are required).

In [None]:
# TypeError: A concrete type is required
missing_dtype = wp.array(shape=10)

In [None]:
# Works since both shape and dtype are specified. Values are uninitialized.
empty_array = wp.array(shape=10, dtype=wp.float32)

print(empty_array)

In practice, the following convenience functions are commonly used to construct Warp arrays with a basic initialization choice:

In [None]:
array_b = wp.empty((5, 5), dtype=wp.float32)  # 5 x 5 array, uninitialized
print(f"array_b = {array_b}")

array_a = wp.zeros((5, 5), dtype=wp.float32)  # 5 x 5 array, zero-initialized
print(f"array_a = {array_a}")

array_c = wp.ones((5, 5), dtype=wp.float32)  # 5 x 5 array, one-initialized
print(f"array_c = {array_c}")

array_d = wp.full(shape=25, value=10, dtype=int)  # 25-element array initialized to 10
print(f"array_d = {array_d}")

Note that arrays we allocate without explicitly specifying a target `device` will be allocated on the default device, which is an NVIDIA GPU if one was detected on the system.

We can verify this by checking the `device.is_cuda` property of the array.

In [None]:
test_array = wp.zeros(10, dtype=wp.int32)
print(f"{test_array.shape=}")
print(f"{test_array.dtype=}")
print(f"{test_array.device=}")
print(f"{test_array.device.is_cuda=}")

Let's explicitly allocate the same array on the `'cpu'` device:

In [None]:
test_array = wp.zeros(10, dtype=wp.int32, device="cpu")
print(f"{test_array.shape=}")
print(f"{test_array.dtype=}")
print(f"{test_array.device=}")
print(f"{test_array.device.is_cuda=}")

### Using NumPy arrays to initialize Warp arrays

One of the libraries that Warp **interoperates** with is NumPy. This means that some Warp functions can accept NumPy arrays.

As we saw in the particle-simulation example, it is often convenient to initialize a Warp array from a NumPy array that has the desired values.

Simply pass the NumPy array in the first position of the `wp.array()` constructor.

In [None]:
import numpy as np

mu, sigma = 0, 0.1  # mean and standard deviation
rng = np.random.default_rng()
random_numbers_np = rng.normal(mu, sigma, 1000)

random_numbers_wp = wp.array(
    random_numbers_np
)  # Create a Warp array on the GPU from the NumPy array (Warp type is inferred)

print(f"{random_numbers_wp.device=}")
print(f"{random_numbers_wp.dtype=}")
print(f"{random_numbers_wp.shape=}")

If we want to access the values of a **GPU** array from Python code running on the **CPU**, the data must first be copied back to **CPU** memory.

The `numpy()` method is useful for getting a **temporary NumPy view** of a Warp array.

If the Warp array is on the **GPU**, a new array on the **CPU** will first be created, and then the contents of the **GPU** array will be copied into it before returning a NumPy view.

If the Warp array is already on the **CPU**, then a zero-copy NumPy view is returned.

The `__str__` method of Warp arrays also automatically calls `numpy()` on the array.

In [None]:
# Get a NumPy view of the array (a GPU array is first copied to the host)
test_array_cpu = test_array.numpy()

print(test_array_cpu)

# Automatically calls `numpy()`
print(test_array)

If we want to allocate an array on the CPU, we can explicitly pass the argument `"cpu"` to the `device` parameter.

Recall that without specifying a `device` argument, the array will be allocated on the *default device*, which prefers a GPU on the system over the CPU.

In [None]:
empty_array_cpu = wp.array(shape=(4, 4), dtype=wp.float64, device="cpu")
empty_array_cpu.device

### Creating multi-dimensional arrays

Multi-dimensional arrays with up to four dimensions are currently supported in Warp.

In [None]:
array_4d = wp.array(shape=[2, 2, 2, 2], dtype=wp.float32)

print(f"array_4d.shape = {array_4d.shape}")

print(array_4d)

### Composite types

So far, we have been mostly creating Warp arrays based on scalar data types, but *composite* data types are also supported for convenience.

Here, we allocate a 10-element array with the data type `wp.vec3`, which is a three-component `wp.float32` vector. We used arrays of `wp.vec3` in the initial particle-simulation example.

In [None]:
v = wp.ones(10, dtype=wp.vec3, device="cuda")

print(f"v.shape = {v.shape}")
print(v)

### Copying arrays between devices

The values in an array can also be copied directly into another array (including between arrays residing on different GPUs) using `wp.copy()`.

In [None]:
src_array = wp.full(shape=10, value=10.0, dtype=float, device="cpu")
dest_array = wp.zeros_like(src_array, device="cuda:0")

# Print dest_array before the copy
print(f"dest_array (before copy): {dest_array}")

# copy from src_array (CPU) to dest_array (GPU)
wp.copy(dest_array, src_array)

print(f"dest_array (after copy): {dest_array}")

---
## Python-scope API vs. kernel-scope API

Some parts of the Warp API can only be called from the **Python scope** (i.e. outside of Warp user functions and kernels),
while other parts can only be called from the **kernel scope** (i.e. inside Warp kernels and functions).

The Python-scope API is documented in the [Python Reference](https://nvidia.github.io/warp/modules/runtime.html),
while the functions available for use inside Warp kernels are documented in the [Built-Ins Reference](https://nvidia.github.io/warp/modules/functions.html).

Generally, the kernel-scope API can also be used in the Python scope. These functions are annotated with a `Python` tag in the Built-Ins Reference.

Not all of the Python language is supported inside the kernel scope. Some features haven't been implemented yet, while
other features do not map well to the GPU from a performance perspective.

See the [Limitations](https://nvidia.github.io/warp/limitations.html) documentation for more details.

In [None]:
@wp.kernel
def invalid_kernel_api_example(a: wp.array(dtype=float), b: wp.array(dtype=float)):
    wp.copy(a, b)  # Invalid: wp.copy() cannot be used in a Warp kernel


a = wp.ones(10, dtype=float)
b = wp.zeros(10, dtype=float)
wp.launch(invalid_kernel_api_example, (1,), inputs=[a, b])

---
## User functions

Users can write their own reusable functions which can be called from kernels using the `@wp.func` decorator, for example:

In [None]:
@wp.func
def square(x: float):
    return x * x


@wp.kernel
def test_kernel(x: wp.array(dtype=float), y: wp.array(dtype=float)):
    i = wp.tid()

    y[i] = square(x[i])


x = wp.full((10,), value=10.0, dtype=float)
y = wp.empty_like(x)

wp.launch(test_kernel, x.shape, inputs=[x], outputs=[y])

print(y.numpy())

Kernels can call user functions that are defined in the same module or in a different module. As the example shows, return type hints for user functions are **optional**.

Anything that can be done in a Warp kernel can also be done in a user function with the exception of `wp.tid()`.

The thread index can be passed in through the arguments of a user function if it is required.

---
## Structs

Users can define their own structures using the `@wp.struct` decorator.

Structs may be passed as arguments to kernels, e.g. to simplify kernel signatures when many arguments are required.

Structs may also be used as a data type for Warp arrays.

The following example shows how a struct is created for the simulation parameters required in the `update` kernel.

In [None]:
%matplotlib widget

import matplotlib.pyplot as plt
import numpy as np
import warp as wp


@wp.struct
class SimParameters:
    nx: int
    dx: float
    dt: float
    alpha: float


sim_params = SimParameters()
sim_params.nx = 1024
sim_params.dx = 0.1
sim_params.alpha = 0.01
sim_params.dt = 0.5 * (sim_params.dx * sim_params.dx) / (2 * sim_params.alpha)


u_np = np.zeros(sim_params.nx)

# Set initial condition: heat in the middle
u_np[sim_params.nx // 2] = 100

u = wp.array(u_np, dtype=float)
u_new = wp.zeros(sim_params.nx, dtype=float)


@wp.kernel
def update(sim_params: SimParameters, u: wp.array(dtype=float), u_new: wp.array(dtype=float)):
    i = wp.tid()

    if (i > 0) and (i < sim_params.nx - 1):
        u_new[i] = u[i] + sim_params.alpha * sim_params.dt * (u[i + 1] - 2.0 * u[i] + u[i - 1]) / (
            sim_params.dx * sim_params.dx
        )


fig, ax = plt.subplots()

for step_index in range(10000):
    wp.launch(update, (sim_params.nx,), inputs=[sim_params, u, u_new])
    (u, u_new) = (u_new, u)

    if (step_index + 1) % 1000 == 0:
        ax.plot(u.numpy(), label=f"step {step_index + 1}", alpha=0.5)

# Show the labels assigned to each plotted snapshot
ax.legend()

---
## Automatic differentiation

If we want to compute a derivative in a computer program, our main options are:

1. Compute derivatives by hand and then code up the derivative formulas
2. Compute derivatives using finite-difference approximations
3. Implement the formula in a computer algebra system like Mathematica, then take derivatives using symbolic differentiation
4. Use automatic differentiation to obtain exact numerical derivatives

As we saw in the earlier finite-difference example, Warp by default generates a forward and backward (adjoint) version of each kernel definition.

The backward version of a kernel can be used to compute gradients of loss functions using **reverse-mode automatic differentiation**.

Arrays that participate in the chain of computation and require gradients should be created with `requires_grad=True`, for example:

```python
a = wp.zeros(1024, dtype=wp.vec3, requires_grad=True)
```

The `wp.Tape` class can then be used to record kernel launches, and replay them to compute the gradient of a scalar loss function with respect to the kernel inputs:

```python
# forward pass
with wp.Tape() as tape:
    wp.launch(kernel=compute1, inputs=[a, b])
    wp.launch(kernel=compute2, inputs=[c, d])
    wp.launch(kernel=loss, inputs=[d, l])

# reverse pass
tape.backward(l)
```

After the backward pass has completed, the gradients with respect to the inputs are available from the `array.grad` attribute:

```python
# gradient of loss with respect to input a
print(a.grad)
```

In practice, Warp's automatic differentiation capabilities are applied to more complicated algorithms involving branching logic, loops, and function calls, but we will look at a closed-form expression because we can easily compare the results to different methods of obtaining numerical derivatives.

Let's consider evaluating the following closed-form function at `x = 0.5`.

$$
f(x) = \sin \left(x^2\right) \cdot \ln(x) + \frac{x^3}{\sqrt{1 - x^2}}
$$

We could have found an analytical expression for the function derivative by hand and implemented a function in Python:

In [None]:
import math


def f_grad(x):
    return (
        x * x * x * x / (1 - x * x) ** (3 / 2)
        + 3 * x * x / math.sqrt(1 - x * x)
        + 2 * x * math.log(x) * math.cos(x * x)
        + math.sin(x * x) / x
    )


print(f_grad(0.5))

We could also have used a finite-difference approximation to evaluate the derivative, which introduces both truncation and round-off error:

In [None]:
def f(x):
    return math.sin(x * x) * math.log(x) + x * x * x / math.sqrt(1.0 - x * x)


h = 1e-8

f_deriv = (f(0.5 + h) - f(0.5)) / h

print(f_deriv)

The analytic approach doesn't scale well as the quantity being calculated grows in complexity, while
the numerical approach doesn't scale well with the number of inputs (in addition to the step size `h` being difficult to choose).
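
To illustrate why the step size is hard to choose, the following sketch compares the forward-difference estimate against the hand-derived derivative for several values of `h`. The particular step sizes are arbitrary choices for illustration.

```python
import math


def f(x):
    return math.sin(x * x) * math.log(x) + x * x * x / math.sqrt(1.0 - x * x)


def f_grad(x):
    # Hand-derived derivative, used here as the reference value
    return (
        x**4 / (1 - x * x) ** 1.5
        + 3 * x * x / math.sqrt(1 - x * x)
        + 2 * x * math.log(x) * math.cos(x * x)
        + math.sin(x * x) / x
    )


x0 = 0.5
exact = f_grad(x0)

errors = {}
for h in (1e-2, 1e-5, 1e-8, 1e-12, 1e-15):
    approx = (f(x0 + h) - f(x0)) / h  # forward difference
    errors[h] = abs(approx - exact)
    print(f"h = {h:.0e}   |error| = {errors[h]:.3e}")
```

The error should shrink at first as `h` decreases, because truncation error falls, and then grow again once round-off error dominates.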

If we want to evaluate the derivative using Warp's automatic differentiation feature,
we would implement a kernel that performs the function evaluation and writes the result into an array.

In [None]:
@wp.func
def f(x: wp.float64):
    return wp.sin(x * x) * wp.log(x) + x * x * x / wp.sqrt(wp.float64(1.0) - x * x)


@wp.kernel
def compute(x: wp.array(dtype=wp.float64), out: wp.array(dtype=wp.float64)):
    i = wp.tid()
    out[i] = f(x[i])


x = wp.full((1,), value=0.5, dtype=wp.float64, requires_grad=True)
out = wp.empty(1, dtype=wp.float64, requires_grad=True)

with wp.Tape() as tape:
    wp.launch(compute, (1,), inputs=[x, out])

tape.backward(loss=out)

print(x.grad.numpy()[0])

Note that this result is **not** obtained using numerical differentiation. There is no step size.

Instead, the program has been run two times:

- Once as a *forward pass* when `wp.launch()` is called
- Once as a *reverse pass* (adjoints are propagated in reverse, from outputs to inputs) when `tape.backward()` is called

**How does Warp know how to evaluate derivatives exactly?** AD systems implement the known derivatives for a finite set of elementary operations. The chain rule is used to combine the elementary derivatives together to get the overall derivative.
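
To make the chain-rule idea concrete, here is a hand-written pure-Python sketch of a forward and reverse pass for the same `f(x)`. It is only an illustration of the principle, not the code Warp generates, and the intermediate variable names are made up for this example.

```python
import math


def f_and_grad(x):
    # Forward pass: split f into elementary operations and keep intermediates
    a = x * x          # x^2
    b = math.sin(a)
    c = math.log(x)
    d = b * c          # first term: sin(x^2) * ln(x)
    e = a * x          # x^3
    g = 1.0 - a
    s = math.sqrt(g)
    t = e / s          # second term: x^3 / sqrt(1 - x^2)
    y = d + t

    # Reverse pass: propagate adjoints (dy/d<variable>) from the output to x
    adj_d = 1.0
    adj_t = 1.0
    adj_e = adj_t / s                # t = e / s
    adj_s = -adj_t * e / (s * s)     # t = e / s
    adj_g = adj_s * 0.5 / s          # s = sqrt(g)
    adj_a = -adj_g                   # g = 1 - a
    adj_a += adj_e * x               # e = a * x
    adj_x = adj_e * a                # e = a * x
    adj_b = adj_d * c                # d = b * c
    adj_c = adj_d * b                # d = b * c
    adj_x += adj_c / x               # c = log(x)
    adj_a += adj_b * math.cos(a)     # b = sin(a)
    adj_x += adj_a * 2.0 * x         # a = x * x
    return y, adj_x


def f_grad(x):
    # Hand-derived derivative for comparison
    return (
        x**4 / (1 - x * x) ** 1.5
        + 3 * x * x / math.sqrt(1 - x * x)
        + 2 * x * math.log(x) * math.cos(x * x)
        + math.sin(x * x) / x
    )


value, grad = f_and_grad(0.5)
analytic = f_grad(0.5)
print(grad, analytic)
```

Each line of the reverse pass applies the known derivative of one elementary operation, and the accumulated adjoint of `x` should agree with the analytic derivative up to floating-point round-off.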

---
## Conclusion

This notebook provided an introduction to the core components of Warp. For more examples, see the [Warp example gallery](https://github.com/NVIDIA/warp?tab=readme-ov-file#running-examples) on GitHub.

The repository at https://github.com/shi-eric/warp-lanl-tutorial-2025-05 also contains a set of tutorials for Warp.

---
## References

For more information about Warp:

* "NVIDIA/warp: A Python framework for accelerated simulation, data generation and spatial computing.", GitHub, https://github.com/NVIDIA/warp, Accessed: July 2, 2025.
* Warp Developers, "NVIDIA Warp Documentation," GitHub Pages, https://nvidia.github.io/warp/, Accessed: July 2, 2025.
* Miles Macklin, Leopold Cambier, Eric Shi, "Introducing Tile-Based Programming in Warp 1.5.0", NVIDIA Developer, https://developer.nvidia.com/blog/introducing-tile-based-programming-in-warp-1-5-0/, Accessed: July 2, 2025.
* "Warp: Differentiable Spatial Computing for Python", ACM Digital Library, https://dl.acm.org/doi/10.1145/3664475.3664543, Accessed: July 2, 2025.
* Miles Macklin, "Warp: Advancing Simulation AI with Differentiable GPU Computing in Python", NVIDIA On-Demand, https://www.nvidia.com/en-us/on-demand/session/gtc24-s63345/, Accessed: July 2, 2025.
* Miles Macklin, "Warp: A High-performance Python Framework for GPU Simulation and Graphics", NVIDIA On-Demand, https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41599/, Accessed: July 2, 2025.
* Miles Macklin, "Differentiable Physics Simulation for Learning and Robotics", NVIDIA On-Demand, https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s31838/, Accessed: July 2, 2025.

For more information about projects using Warp:

* "nvidia-warp · GitHub Topics", GitHub, https://github.com/topics/nvidia-warp, Accessed: July 2, 2025.
* Warp Developers, "Publications using Warp," GitHub, https://github.com/NVIDIA/warp/blob/main/PUBLICATIONS.md, Accessed: July 2, 2025.

For more information about automatic differentiation:

- Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, Jeffrey Mark Siskind, "[Automatic differentiation in machine learning: a survey](https://arxiv.org/abs/1502.05767)", The Journal of Machine Learning Research, 18(153), 1-43, 2018.
- Andreas Griewank and Andrea Walther, "[Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation](https://books.google.com/books?id=qMLUIsgCwvUC)", 2nd Edition, SIAM, 2008.
- Stelian Coros, Miles Macklin, Bernhard Thomaszewski, Nils Thürey, "[Differentiable simulation](https://dl.acm.org/doi/abs/10.1145/3476117.3483433)", SA '21: SIGGRAPH Asia 2021 Courses, 1-142, 2021.