# NVSHMEM4Py

## Introduction

This notebook introduces NVSHMEM4Py: a Python library for multi-GPU "shared memory" programming. It's a part of the [NVSHMEM project](https://developer.nvidia.com/nvshmem).

Some useful NVSHMEM resources are available here:
- [Source Code on GitHub](https://github.com/nvidia/nvshmem)
- [NVSHMEM API Docs](https://docs.nvidia.com/nvshmem/api/)
- [NVSHMEM4Py API Docs](https://docs.nvidia.com/nvshmem/api/api/language_bindings/python/)
- [Install/Quickstart Guide](https://docs.nvidia.com/nvshmem/release-notes-install-guide/install-guide/abstract.html)
- [Project Homepage](https://developer.nvidia.com/nvshmem)

## Environment

This tutorial has several prerequisites:
- A working MPI installation and `mpi4py`. MPI is used to bootstrap the NVSHMEM runtime.
- The `cuda-core` package from the `cuda-python` project, which NVSHMEM4Py uses to interact with the CUDA stack.
- CuPy, which is used to construct arrays to pass into NVSHMEM.
- The `numba-cuda` package, which is required for the Device APIs.

Before continuing, make sure the notebook has been launched in a manner that allows you to use MPI and mpi4py. The [ipyparallel MPI guide](https://jupyter-tutorial.readthedocs.io/en/24.1.0/hub/ipyparallel/mpi.html) describes the setup, and the [ipyparallel configuration page](https://jupyter-tutorial.readthedocs.io/en/24.1.0/hub/ipyparallel/config.html) has further resources for launching notebooks this way.

After following the setup instructions, you should be able to run a command similar to `pipenv run ipcluster start -n <Number of GPUs> --profile=mpi`.

Run the following cell to install the requirements.

In [None]:
!pip install mpi4py nvshmem4py-cu12 cupy-cuda12x numba-cuda==0.20.1 cuda-core

## NVSHMEM introduction

![NVSHMEM Diagram](https://d29g4g2dyqv443.cloudfront.net/sites/default/files/akamai/nvshmem/mpi-nvshmem-explainer-diagram.svg)

NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.

MPI communication is two-sided and host-initiated: each transfer requires a matching send and receive, initiated and coordinated by the CPU. Data movement between GPUs must be orchestrated by the host, adding latency and synchronization overhead.

NVSHMEM, by contrast, provides one-sided, GPU-centric communication: a GPU can directly perform put and get operations into remote GPU memory using a global address space (the symmetric heap). This removes the need for CPU involvement and explicit message matching, enabling lower-latency, fine-grained GPU-to-GPU communication.

### The Symmetric Heap

![sym_heap](images/chapter-nvshmem4py/sym_heap.png)

An NVSHMEM program consists of data objects that are private to each PE (rank) and data objects that are remotely accessible by all PEs. Private data objects are stored in the local memory of each PE and can only be accessed by the PE itself; these data objects cannot be accessed by other PEs via NVSHMEM routines. Private data objects follow the memory model of C. Remotely accessible objects, however, can be accessed by remote PEs using NVSHMEM routines. Remotely accessible data objects are called Symmetric Data Objects. Each Symmetric Data Object has a corresponding object with the same name, type, and size on all PEs where that object is accessible via the NVSHMEM API.

In NVSHMEM, GPU memory allocated by NVSHMEM memory management routines is symmetric. See [Memory Management](https://docs.nvidia.com/nvshmem/api/gen/api/memory.html#sec-memory-management) for information on allocating symmetric memory.

NVSHMEM dynamic memory allocation routines (e.g., `nvshmem_malloc`) allow collective allocation of Symmetric Data Objects on a special memory region called the Symmetric Heap. The Symmetric Heap is created during the execution of a program at a memory location determined by the NVSHMEM library. The Symmetric Heap may reside in different memory regions on different PEs. The figure above shows an example NVSHMEM memory layout, illustrating the location of remotely accessible symmetric objects and private data objects.
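
To make the memory model concrete, here is a minimal pure-Python toy of a symmetric heap. It does not use NVSHMEM or a GPU, and `ToyPE`, `ToySymmetricWorld`, and their methods are hypothetical names invented for this sketch. The assumption it illustrates is that a collective allocation reserves the same offset on every PE, so a single handle identifies "the same object" everywhere, while each PE still holds its own copy of the data.

```python
class ToyPE:
    """One processing element: private data plus a fixed-size symmetric heap."""

    def __init__(self, pe_id, heap_size):
        self.pe_id = pe_id
        self.heap = [0] * heap_size  # same size on every PE
        self.private = {}            # never reachable from other PEs


class ToySymmetricWorld:
    """Toy stand-in for the runtime; not the NVSHMEM4Py API."""

    def __init__(self, n_pes, heap_size):
        self.heap_size = heap_size
        self.pes = [ToyPE(i, heap_size) for i in range(n_pes)]
        self.next_offset = 0

    def malloc(self, count):
        """Collective allocation: every PE reserves the same offset range."""
        if self.next_offset + count > self.heap_size:
            raise MemoryError("toy symmetric heap exhausted")
        offset = self.next_offset
        self.next_offset += count
        return offset

    def put(self, offset, values, target_pe):
        """One-sided write into the target PE's copy of the symmetric object."""
        heap = self.pes[target_pe].heap
        heap[offset:offset + len(values)] = values

    def get(self, offset, count, source_pe):
        """One-sided read from the source PE's copy of the symmetric object."""
        return list(self.pes[source_pe].heap[offset:offset + count])


world = ToySymmetricWorld(n_pes=4, heap_size=16)
a = world.malloc(4)  # same offset on all four PEs
b = world.malloc(2)
world.put(a, [1, 2, 3, 4], target_pe=2)
print(world.get(a, 4, source_pe=2))  # [1, 2, 3, 4]
print(world.get(a, 4, source_pe=0))  # [0, 0, 0, 0]
```

Only PE 2's copy changes: a symmetric object shares its name, type, and size across PEs, not its contents.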

## Initializing NVSHMEM

Initializing NVSHMEM requires a collective operation through which all participating processes (processing elements, or PEs) coordinate to establish a common communication context and allocate a symmetric heap across them.

Initialization is typically done by calling `nvshmem_init()` or `nvshmemx_init_attr()`. During this phase, every PE joins a collective startup routine that exchanges addressing and connection information, often using MPI as the bootstrap mechanism. This ensures that each PE’s symmetric heap—a region of memory with identical size and layout across all ranks—is accessible to every other PE through a consistent global address space. Once initialization completes, GPUs can directly perform one-sided put and get operations into any peer’s symmetric memory without further CPU coordination.

The `nvshmem.core` module of NVSHMEM4Py provides initialization and finalization routines for the NVSHMEM runtime. These must be called before and after using any NVSHMEM features in Python. This notebook uses MPI to bootstrap NVSHMEM and NVSHMEM4Py. That can be accomplished with the following cell.

In [None]:
from mpi4py import MPI
from cuda.core.experimental import Device, system
import nvshmem.core as nvshmem

rank = MPI.COMM_WORLD.Get_rank()
dev = Device(rank % system.num_devices)
dev.set_current()

nvshmem.init(device=dev, mpi_comm=MPI.COMM_WORLD, initializer_method="mpi")

# To finalize NVSHMEM, simply call nvshmem.finalize() (we'll hold off until the end of the notebook)

After this cell is run, NVSHMEM’s runtime is initialized using the `MPI.COMM_WORLD` communicator. Each rank in the MPI communicator participates in a collective initialization, during which NVSHMEM sets up the symmetric heap and communication channels between all participating GPUs. The `device=dev` argument ensures that each MPI rank initializes NVSHMEM on a specific GPU (selected based on its rank), while `initializer_method="mpi"` tells NVSHMEM to use MPI for bootstrapping. Once complete, all ranks share a common NVSHMEM context that enables GPU-initiated, one-sided communication across the entire MPI world.
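
The `rank % system.num_devices` expression in the cell above can be checked without a GPU. The sketch below uses a hypothetical helper, `device_index_for_rank`, and assumes that ranks are placed on nodes in contiguous blocks and that every node has the same number of GPUs. If your launcher places ranks differently, a node-local rank would be the safer input.

```python
def device_index_for_rank(rank, devices_per_node):
    """Map a global MPI rank to a local GPU index (hypothetical helper)."""
    if devices_per_node <= 0:
        raise ValueError("devices_per_node must be positive")
    return rank % devices_per_node


# Eight ranks spread over two nodes with four GPUs each
for rank in range(8):
    print(f"rank {rank} -> GPU {device_index_for_rank(rank, 4)}")

mapping = [device_index_for_rank(r, 4) for r in range(8)]
print(mapping)  # [0, 1, 2, 3, 0, 1, 2, 3]
```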

## Querying and Status

NVSHMEM4Py offers a number of functions for querying information about the NVSHMEM runtime. The next cell demonstrates some of them.


In [None]:
# This returns a Python class encapsulating the NVSHMEM version, NVSHMEM4Py version,
# and the version of the OpenSHMEM Spec that NVSHMEM was written against.
print(nvshmem.get_version())

# These functions return the PE of the calling process and the number of PEs in the runtime
print(nvshmem.my_pe())
print(nvshmem.n_pes())

## Team Management

In NVSHMEM, a team is a subgroup of processing elements (PEs) that can participate in collective operations independently of the full program. Teams allow you to divide the global set of PEs (defined at initialization) into smaller, logical groups, for example one team per node or per GPU partition.

The PEs in an NVSHMEM program communicate using either point-to-point routines—such as RMA and AMO routines—that specify the PE number of the target PE, or collective routines that operate over a set of PEs. NVSHMEM teams allow programs to group a set of PEs for communication. Team-based collective operations include all PEs in a valid team. Point-to-point communication can make use of team-relative PE numbering through PE number translation.

An NVSHMEM team may be predefined (i.e., provided by the NVSHMEM library) or defined by the NVSHMEM application. An application-defined team is created by “splitting” a parent team into one or more new teams—each with some subset of PEs of the parent team—via one of the `nvshmem_team_split_*` routines.

Some of the built-in teams are listed below:
- `TEAM_WORLD` contains all PEs in the NVSHMEM runtime
- `TEAM_NODE` contains all PEs on the same node as the calling PE
- `TEAM_SHARED` contains all PEs which are in the same load-store domain as the calling PE
- `TEAM_SAME_MYPE_NODE` contains all PEs which have the same per-node PE as the calling PE.

Teams can also be created by three methods:
- `team_split_strided` creates a team based on a strided group of PEs in the runtime
- `team_split_2d` creates a team based on a 2-dimensionally strided group of PEs
- `team_init` can be used to initialize teams from arbitrary groups of PEs 


In [None]:
# Using team_split_strided to create a team with every other PE, starting with PE0
team_strided = nvshmem.team_split_strided(
    nvshmem.Teams.TEAM_WORLD, start=0, stride=2, size=min(2, nvshmem.n_pes()), config=nvshmem.TeamConfig(),
    config_mask=0, new_team_name="STRIDED_TEAM"
)
print(f"Created team with handle {team_strided} using team_split_strided")

# Create a pair of 2-d strided teams (requires >=4 PEs)
if nvshmem.n_pes() >= 4:
    x_team, y_team = nvshmem.team_split_2d(
        nvshmem.Teams.TEAM_WORLD, xrange=2,
        xaxis_config=nvshmem.TeamConfig(), xaxis_mask=0,
        yaxis_config=nvshmem.TeamConfig(), yaxis_mask=0,
        new_team_name="TWO_D_TEAM"
    )
    print(f"Created team_split_2d teams. X-team: {x_team}, Y-team: {y_team}")
    
# Create an arbitrary team
config_init = nvshmem.TeamConfig(version=2, num_contexts=1, uniqueid=1004)
my_pe = nvshmem.my_pe()
n_pes = nvshmem.n_pes()

if n_pes >= 4:
    # Let's say we want to make a team of PEs 2,3:
    if my_pe in (2,3):
        config = nvshmem.TeamConfig()
        team_init = nvshmem.team_init(
            config, config_mask=0, n_pes=n_pes, my_pe=my_pe,
            new_team_name="USER_CREATED_TEAM"
        )
        print(f"Created team_init team. {team_init}")
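
To see which PEs end up in each team, the following pure-Python sketch computes team membership for the strided and 2D splits. It does not call NVSHMEM. The helper names are hypothetical, and the 2D mapping (`x = index % xrange`, `y = index // xrange`, with the x-axis team sharing a `y` coordinate) is the rule from the OpenSHMEM team-split definition, which NVSHMEM is assumed to follow here.

```python
def strided_members(start, stride, size):
    """Parent-team PE numbers selected by a strided split."""
    return [start + i * stride for i in range(size)]


def split_2d_members(parent_pes, xrange, my_index):
    """Return (x_team, y_team) for the PE at position my_index in the parent team."""
    x = my_index % xrange
    y = my_index // xrange
    # The x-axis team holds PEs in the same row (same y coordinate)
    x_team = [pe for i, pe in enumerate(parent_pes) if i // xrange == y]
    # The y-axis team holds PEs in the same column (same x coordinate)
    y_team = [pe for i, pe in enumerate(parent_pes) if i % xrange == x]
    return x_team, y_team


# Every other PE starting at PE 0, two members in total
print(strided_members(start=0, stride=2, size=2))  # [0, 2]

# Four PEs arranged as a 2x2 grid, seen from PE 3
print(split_2d_members(list(range(4)), xrange=2, my_index=3))  # ([2, 3], [1, 3])
```

With six PEs and `xrange=2`, the last row is still full, so PE 0 would get an x-team of `[0, 1]` and a y-team of `[0, 2, 4]`.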


## Memory Management

All memory used by the NVSHMEM communications runtime must be allocated from or registered onto the symmetric heap, a globally accessible memory region that is identically sized and aligned across all processing elements (PEs), enabling one-sided remote memory operations such as `put` and `get`, as well as collectives like `reduce` and `fcollect`.

In NVSHMEM4Py, symmetric memory must be explicitly managed due to Python’s garbage collection mechanism. Unlike in C/C++, where memory can be automatically freed when it goes out of scope, Python’s garbage collector may not immediately release memory resources, which can lead to resource leaks in distributed environments.

NVSHMEM4Py makes use of the NVIDIA cuda.core Python project’s memory interface to expose symmetric memory. (See the [cuda.core](https://nvidia.github.io/cuda-python/cuda-core/latest/index.html) documentation for more details.)

NVSHMEM4Py uses `cuda.core.Buffer` objects to represent symmetric memory. These objects provide a DLPack-compatible interface that allows for seamless integration with other Python CUDA libraries such as CuPy and PyTorch. Additionally, NVSHMEM4Py provides convenience functions for converting symmetric `Buffers` into objects from popular libraries, such as Torch `Tensors` or CuPy `NDArrays`.

Even though Python has reference counting, it’s important to explicitly free NVSHMEM symmetric memory to ensure proper cleanup of distributed resources. Relying on Python’s garbage collection alone may lead to resource leaks or undefined behavior in a distributed environment.

Note that all of NVSHMEM's buffer creation, destruction, and registration functions are collective across `TEAM_WORLD`. All PEs must call these functions together.

In [None]:
# Allocate a symmetric Buffer of size 4 KiB
# Important! nvshmem.core.buffer is a collective. All PEs in the NVSHMEM runtime must call this function together.
buf = nvshmem.buffer(4096)
# Buffers are typeless blocks of memory characterized by a pointer and size.
# Buffers created from the same call to buffer() on each PE will have the same symmetric address
# NVSHMEM4Py allocated Buffers are always Device (GPU Memory) buffers
print(buf.handle, buf.size)
# When you are done using the buffer, release it back to the NVSHMEM symmetric heap
nvshmem.free(buf)

In [None]:
# You can also allocate CuPy arrays on the symmetric heap.
# CuPy arrays are wrappers around Buffers that carry shape and data type information.
shape = (1,2,3,4,5)
dtype = "float32"
arr = nvshmem.array(shape, dtype=dtype)
# Once you allocate your array, you can print it or perform operations on it just like you would for any other CuPy array:
print(arr)
arr[:] = nvshmem.my_pe() + 1
print(arr)
# Free it when you are done
nvshmem.free_array(arr)
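
One way to make the explicit free hard to forget is to pair allocation and release in a context manager. The sketch below is a generic pattern, not part of NVSHMEM4Py: `symmetric_allocation` is a hypothetical helper that takes the allocate and free callables as arguments, and `fake_alloc` and `fake_free` are stand-ins so the example runs without a GPU. With NVSHMEM4Py you would presumably pass `nvshmem.array` and `nvshmem.free_array` from the cell above.

```python
from contextlib import contextmanager


@contextmanager
def symmetric_allocation(alloc, free, *args, **kwargs):
    """Allocate with `alloc`, hand the object to the caller, and always call `free`."""
    obj = alloc(*args, **kwargs)
    try:
        yield obj
    finally:
        free(obj)


# Stand-in allocator that records which blocks are still live (illustration only)
live_blocks = set()


def fake_alloc(nbytes):
    block = bytearray(nbytes)
    live_blocks.add(id(block))
    return block


def fake_free(block):
    live_blocks.discard(id(block))


with symmetric_allocation(fake_alloc, fake_free, 4096) as buf:
    buf[0] = 1
    print("live inside the block:", len(live_blocks))  # 1
print("live after the block:", len(live_blocks))  # 0
```

Because allocation and free are collective, every PE has to enter and leave such a block together. An exception raised on only one PE would still leave the other PEs waiting in their own collective calls.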

If you have allocated memory objects (such as arrays or Tensors) external to the NVSHMEM runtime, you can also register them against the NVSHMEM symmetric heap instead of allocating them directly via `nvshmem.buffer`. This allows for a zero-copy scheme where communication uses the same buffers as computation, eliminating redundant data movement and enabling direct GPU-to-GPU or NIC-to-GPU access through the unified virtual address space.

Note that memory registered to NVSHMEM has certain limitations. NVSHMEM only supports registration of buffers created with the [CUDA VMM APIs](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html). The VMM APIs are accessible from Python via the [VirtualMemoryResource in `cuda.core`](https://nvidia.github.io/cuda-python/cuda-core/latest/generated/cuda.core.experimental.VirtualMemoryResource.html).

In [None]:
import cupy
from cuda.core.experimental import Device, VirtualMemoryResource

# Get a VirtualMemoryResource tied to our current GPU
dev = Device()
resource = VirtualMemoryResource(dev)
# This buffer is allocated by the CUDA VMM APIs (536870912 bytes = 512 MiB)
buffer = resource.allocate(536870912)
# Take that buffer and wrap it in a CuPy array
arr = cupy.from_dlpack(buffer, copy=False).view("float32")

# Register it with the NVSHMEM runtime
# NOTE! This is a collective across TEAM_WORLD
arr_reg = nvshmem.register_external_array(arr)
# Use the array however you'd like
# When you're done, unregister the array
# NOTE! this is also a collective across TEAM_WORLD
nvshmem.unregister_external_array(arr)

## Collective operations

A collective operation in NVSHMEM is an operation that applies across a whole team. Each PE in the calling team must call the collective operation together, or the collective may hang or produce other undefined behavior.

NVSHMEM4Py provides collective communication operations that allow for efficient data exchange patterns across all processing elements (PEs). These operations include broadcasts, reductions, and synchronization primitives.

Collective operations in NVSHMEM4Py follow the same semantics as their C/C++ counterparts but with a Pythonic interface. They operate on symmetric memory allocated through NVSHMEM.

Collective operations have to be called for each PE (which is associated with a specific CUDA device), using the same count and the same datatype, to form a complete collective operation. Failure to do so will result in undefined behavior, including hangs, crashes, or data corruption.

For this tutorial, we will introduce the Reduce collective operation.

NVSHMEM’s reduce operation is similar to the AllReduce operation in other libraries such as NCCL or MPI. It performs a reduction operation (`sum`, `min`, or `max`) across all PEs and distributes the result to all PEs.

In a sum allreduce operation between `k` PEs, each PE will provide an array `in` of `N` values, and receive identical results in array `out` of `N` values, where `out[i] = in0[i]+in1[i]+…+in(k-1)[i]`.

![](https://docs.nvidia.com/nvshmem/api/_images/reduce.png)
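
The formula above can be checked with a small pure-Python simulation. This is only a model of the semantics, not the NVSHMEM4Py call: `allreduce_sum` is a hypothetical helper that takes one input list per PE and returns one output list per PE.

```python
def allreduce_sum(inputs):
    """Simulate a sum allreduce: inputs[p] is the array contributed by PE p."""
    n = len(inputs[0])
    if any(len(values) != n for values in inputs):
        raise ValueError("every PE must contribute the same number of elements")
    total = [sum(values[i] for values in inputs) for i in range(n)]
    # Every PE receives its own copy of the same result
    return [list(total) for _ in inputs]


k, n = 4, 3
# Each PE fills its input with (PE ID + 1)
inputs = [[float(pe + 1)] * n for pe in range(k)]
outputs = allreduce_sum(inputs)
print(outputs[0])  # [10.0, 10.0, 10.0]
print(all(out == outputs[0] for out in outputs))  # True
```

With `k` PEs each contributing `PE ID + 1`, every output element equals `k * (k + 1) / 2`, which is 10 for four PEs.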

All host-initiated collective operations in NVSHMEM4Py require a CUDA stream from which they will be launched. The following cell shows how to run an NVSHMEM reduction on a CUDA stream using `cuda.core` and `nvshmem4py`.

In [None]:

# Get current device and create a stream
device = Device()
stream = device.create_stream()

# Allocate symmetric memory using nvshmem.core.array
size = 10
src_array = nvshmem.array((size,), dtype="float32")
dest_array = nvshmem.array((size,), dtype="float32")

# Set values on each PE
my_pe = nvshmem.my_pe()
n_pes = nvshmem.n_pes()

# Fill source array with PE ID + 1
src_array[:] = my_pe + 1

print("Array before reduction:", src_array, dest_array)

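# Perform the reduction from src_array into dest_array across TEAM_WORLD
nvshmem.reduce(nvshmem.Teams.TEAM_WORLD, dest_array, src_array, "sum", stream=stream)

# The reduction is enqueued on the stream; wait for it to finish before reading the result
stream.sync()

print("Array after reduction:", src_array, dest_array)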
nvshmem.free_array(src_array)
nvshmem.free_array(dest_array)

## Point-to-Point Remote Memory Access

NVSHMEM4Py provides a Pythonic interface to the RMA operations defined in the NVSHMEM specification. These operations allow for the transfer of data between processing elements (PEs) in a distributed environment.

NVSHMEM4Py supports two primary types of RMA operations:
- Explicit operations (`put`/`get`) where a PE calls a function to move data explicitly from one symmetric address to another
- Mapping-based operations (supported by `nvshmem.get_peer_buffer` or `nvshmem.get_mc_buffer`) where one PE retrieves a pointer to a symmetric object on another PE, and can issue direct loads/stores to that address.

Explicit operations are supported over network or P2P transports, while mapping-based operations are only supported via P2P (NVLink or PCI Express) transports.

RMA operations in NVSHMEM follow the one-sided communication model, where the initiating PE specifies both the source and destination of the data transfer without requiring active participation from the remote PE.

NVSHMEM RMA operations work on the same kinds of objects that collectives work on.

Similar to collective operations, all RMA operations in NVSHMEM4Py require a CUDA stream to be provided. The stream is used for proper synchronization of GPU operations. See the [cuda.core.Stream docs](https://nvidia.github.io/cuda-python/cuda-core/latest/generated/cuda.core.experimental.Stream.html#cuda.core.experimental.Stream) for more details.

The following example demonstrates copying data from one PE to another using `put`.

In [None]:
import cupy as cp

# Get current device and create a stream
device = Device()
stream = device.create_stream()

# Allocate symmetric memory
size = 10
src_array = nvshmem.array((size,), dtype=cp.float32)
dest_array = nvshmem.array((size,), dtype=cp.float32)

# Set values on local PE
my_pe = nvshmem.my_pe()
n_pes = nvshmem.n_pes()

# Fill source array with PE ID
src_array[:] = cp.ones(size, dtype=cp.float32) * my_pe

# Target PE (circular next PE)
target_pe = (my_pe + 1) % n_pes

# Put data to the target PE
nvshmem.put(dest_array, src_array, remote_pe=target_pe, stream=stream)

# Ensure operation is complete
stream.sync()

# Clean up - explicit free is required
nvshmem.free_array(src_array)
nvshmem.free_array(dest_array)
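
To see what each PE should find in `dest_array` after the ring put, the pattern can be simulated in plain Python. `ring_put` is a hypothetical helper that models every PE's arrays as lists. It reproduces only the data flow, not the NVSHMEM4Py call or its stream semantics.

```python
def ring_put(n_pes, size):
    """Simulate each PE putting its source array to its right-hand neighbor."""
    src = [[float(pe)] * size for pe in range(n_pes)]
    dest = [[0.0] * size for _ in range(n_pes)]
    for pe in range(n_pes):
        target_pe = (pe + 1) % n_pes  # circular next PE
        dest[target_pe][:] = src[pe]  # one-sided write into the target's dest
    return dest


dest = ring_put(n_pes=4, size=3)
for pe, values in enumerate(dest):
    print(f"PE {pe} dest: {values}")
print(dest[0])  # [3.0, 3.0, 3.0]
```

Each PE ends up holding the ID of its left-hand neighbor, so PE 0 receives the values written by the last PE.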

The following cell demonstrates using a simple Numba kernel and the `get_peer_array` function to directly access memory on a different PE via loads/stores.

In [None]:
from numba import cuda

@cuda.jit
def simple_shift(arr, value):
    # This line evaluates to NVLink stores when passed an array retrieved by `get_peer_array`
    arr[0] = value

dev = Device()
stream = dev.create_stream()

array = nvshmem.array((1,), dtype="int32")

my_pe = nvshmem.my_pe()
# A unidirectional ring - always write to the neighbor to the right
dst_pe = (my_pe + 1) % nvshmem.n_pes()

# This function returns an array which can be directly load/store'd to over NVLink
# dst_pe must be in the same NVLink domain as the PE calling this function, otherwise it will raise an Exception
dev_dst = nvshmem.get_peer_array(array, dst_pe)

# One thread is enough to write a single element
grid = 1
block = 1
# Launch the kernel on the peer array (Numba's launch configuration order is [grid, block])
simple_shift[grid, block](dev_dst, my_pe)
# Wait for the kernel, then run a global barrier across the whole NVSHMEM runtime
cuda.synchronize()
nvshmem.barrier(nvshmem.Teams.TEAM_WORLD, stream=stream)
stream.sync()
# Each PE wrote its own ID into its right neighbor's array, so this prints the left neighbor's PE
print(f"From PE {my_pe}, array contains {array}")

nvshmem.free_array(array)


## Signalling

NVSHMEM4Py lets you pair data movement with a lightweight flag update so peers can synchronize without host round-trips. A signal is a 1-element uint64 array allocated on the NVSHMEM symmetric heap. One PE updates a remote signal (optionally together with a data transfer), while the remote PE waits or polls on that signal.

Core pieces

- The `SignalOp` enum controls how the remote flag is updated. Typical choices include setting the flag to a value or performing an arithmetic update. Supported operations are:
    - `SIGNAL_SET`: publish “done” by setting the flag to a specific value (e.g., 1).
    - `SIGNAL_ADD` (and similar ops, if enabled): build counters or doorbells from multiple senders.

- The `put_signal(dst, src, sig, val, signal_op, remote_pe, stream=...)` function performs a put from `src` to `dst` on `remote_pe` and then atomically updates `sig` on that same remote PE using `signal_op(val)`. The signal update is ordered after the data transfer on the given stream, so a waiter that observes the signal is guaranteed to see the data.

- The `signal_op` function behaves like the `put_signal` function but doesn't perform any data movement, only signalling.
- The `signal_wait(sig, val, cmp, stream=...)` function blocks the calling PE’s work on stream until the comparison on the local signal condition succeeds (e.g., equals val, greater-or-equal, etc.). This integrates cleanly with CUDA streams: the wait is stream-ordered and won’t stall unrelated streams.

- The `ComparisonType` enum specifies which comparison operand the wait function should use (e.g. `CMP_EQ` or `CMP_GT`).

The following example shows a program in which one PE performs a put-with-signal, and another PE performs a wait on that signal.

In [None]:
dev = Device()
stream = dev.create_stream()

buf_src = nvshmem.array((4,4), dtype="float32")
buf_src[:] = nvshmem.my_pe() + 1
buf_dst = nvshmem.array((4,4), dtype="float32")
buf_dst[:] = 0

# NVSHMEM signals are always 1-element arrays of type uint64
signal = nvshmem.array((1,), dtype="uint64")
signal[:] = 0
# Avoid shadowing the built-in `type` when unpacking the result
buf_sig, sig_size, sig_dtype = nvshmem.array_get_buffer(signal)

if nvshmem.my_pe() == 0:
    # The SIGNAL_SET operator sets the signal value to `val` (in this case 1) when the put operation is complete
    nvshmem.put_signal(buf_dst, buf_src, buf_sig, 1, nvshmem.SignalOp.SIGNAL_SET, remote_pe=1, stream=stream)
    print(f"From PE {nvshmem.my_pe()} sent buf to remote PE 1 and set signal")

if nvshmem.my_pe() == 1:
    # This signal_wait will wait until the signal value is equal to 1.
    nvshmem.signal_wait(buf_sig, 1, nvshmem.ComparisonType.CMP_EQ, stream=stream)
    print(f"From PE {nvshmem.my_pe()} waited for signal")

stream.sync()
print(f"From PE {nvshmem.my_pe()} src={buf_src}, dst={buf_dst} signal={signal}")

nvshmem.free_array(buf_src)
nvshmem.free_array(buf_dst)
nvshmem.free_array(signal)
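
The ordering guarantee behind put-with-signal (data first, flag second, and the waiter reads only after seeing the flag) can be modeled on the CPU with two threads. This is an analogy only: `ToySignal` is a hypothetical class built on `threading.Condition`, and the threads stand in for PEs. No NVSHMEM calls are involved.

```python
import threading


class ToySignal:
    """A flag a waiter can block on, standing in for a 1-element uint64 signal."""

    def __init__(self):
        self._value = 0
        self._cond = threading.Condition()

    def set(self, value):
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def wait_until(self, predicate, timeout=5.0):
        with self._cond:
            if not self._cond.wait_for(lambda: predicate(self._value), timeout=timeout):
                raise TimeoutError("signal condition was never satisfied")
            return self._value


data = [0.0] * 4
signal = ToySignal()
seen = []


def sender():
    data[:] = [1.0] * 4  # the "put": write the payload first
    signal.set(1)        # then publish the flag


def receiver():
    signal.wait_until(lambda value: value == 1)  # like a CMP_EQ wait
    seen.append(list(data))                      # safe to read after the flag


threads = [threading.Thread(target=receiver), threading.Thread(target=sender)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(seen[0])  # [1.0, 1.0, 1.0, 1.0]
```

The receiver never polls the payload itself. It relies entirely on the flag, which is the same contract `signal_wait` gives the waiting PE.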

When you're done using NVSHMEM, finalize the runtime:

In [None]:
nvshmem.finalize()