NCCL4py v0.3.1 Release (NVIDIA/nccl)

Highlights

- Added nccl.ep, a Pythonic interface to libnccl_ep.so for expert
  parallel dispatch/combine workflows. The package exposes Group, Handle,
  Tensor, typed config dataclasses, Algorithm, Layout, PassDir, and the
  named input/output structs used by the NCCL EP API.
- Added nccl.core.device.cute, enabling CuTeDSL kernels to call NCCL device
  APIs.
- Added top-level stack diagnostics with nccl.get_version() and
  nccl.show_versions(), reporting nccl4py, libnccl.so, and
  libnccl_ep.so versions, CUDA build variants, and loaded shared-library
  paths.
- Added free-threaded CPython support.

New Features

NCCL EP Python API

- The new nccl.ep package provides Pythonic access to the NCCL EP extension
  library.
- Group.create() creates EP groups from a Communicator and GroupConfig;
  Group.create_handle() creates handles with an explicit Layout.
- Handle supports update(), dispatch(), combine(), complete(), and
  destroy().
- DispatchInputs, DispatchOutputs, CombineInputs, CombineOutputs, and
  LayoutInfo provide named containers for the tensors and metadata used by
  dispatch, combine, and handle setup.
- Tensor resolves Python buffers into ncclEpTensor_t descriptors.
- GroupConfig, HandleConfig, DispatchConfig, CombineConfig, and
  AllocConfig expose typed configuration objects.
- AllocFn and FreeFn expose caller-controlled EP allocation hooks.
- nccl.ep.interop.torch.get_nccl_comm_from_group() provides PyTorch interop
  for creating an NCCL communicator from a PyTorch process group's rank and
  world-size information.
- Importing nccl.ep sets a default NCCL_EP_HOME when bundled EP JIT headers
  are present, and a default NCCL_HOME when NCCL public headers are available
  from the installed nvidia.nccl package.
- nccl.ep checks that the loaded libnccl.so and libnccl_ep.so were built
  with the same CUDA major version. CUDA minor differences are accepted.
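
The CUDA compatibility rule above (same major version required, minor
differences accepted) can be illustrated with a minimal sketch. The function
names and the "major.minor" string format are assumptions made for this
example; this is not the check that nccl.ep itself runs.

```python
def cuda_major(version: str) -> int:
    """Return the major component of a 'major.minor' CUDA version string.

    The string format is an assumption made for this illustration.
    """
    return int(version.split(".")[0])


def same_cuda_major(nccl_cuda: str, ep_cuda: str) -> bool:
    """Hypothetical helper: True when both libraries share a CUDA major version."""
    return cuda_major(nccl_cuda) == cuda_major(ep_cuda)


if __name__ == "__main__":
    # Minor versions differ, major versions match: accepted.
    print(same_cuda_major("12.4", "12.9"))
    # Major versions differ: rejected.
    print(same_cuda_major("12.9", "13.0"))
```

Under this rule, a CUDA 12 libnccl.so paired with the bundled CUDA 13
libnccl_ep.so would count as a mismatch.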

Communicator Configuration

Added graph_stream_ordering to NCCLConfig.

Device API and CuTe DSL

- New nccl.core.device.cute module exposes the NCCL device API to CuTeDSL
  kernels, including communicator/window access, GIN primitives, barrier
  operations, and typed structs.
- Added bindings/nccl4py/examples/cute/main.py, a GIN put/wait example with
  host-side validation.
- Added gin_strong_signals_required and gin_va_signals_required to
  NCCLDevCommRequirements for configuring device communicator requirements.
- Added NcclGinType.GPI for the GPU-Push Interface transport.

Version and Diagnostics API

- Top-level nccl.get_version() returns a VersionInfo dataclass containing
  the nccl4py package version plus LibraryInfo entries for the loaded
  libnccl.so and, when available, libnccl_ep.so.
- Top-level nccl.show_versions() prints the same stack information in a
  human-readable version block.
- Direct library probes are available for each native library:
  nccl.core.get_lib_version() and nccl.core.get_lib_path() report the
  loaded libnccl.so; nccl.ep.get_lib_version() and
  nccl.ep.get_lib_path() report the loaded libnccl_ep.so.
- Each LibraryInfo includes release version, CUDA build variant, and loaded
  shared-library path.
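
A hedged sketch of collecting this information for a bug report follows. It
relies only on what is stated above: nccl.get_version() exists at the top
level and returns a dataclass. The helper name nccl_stack_info is
hypothetical, and no field names are assumed; dataclasses.asdict() is used
so the sketch does not depend on the exact layout of VersionInfo.

```python
import dataclasses
import importlib
from typing import Optional


def nccl_stack_info() -> Optional[dict]:
    """Hypothetical helper: return the nccl4py version info as a plain dict.

    Returns None when nccl4py cannot be imported or predates the
    top-level get_version() described in these notes.
    """
    try:
        nccl = importlib.import_module("nccl")
    except (ImportError, OSError):
        return None  # package or native library not available
    get_version = getattr(nccl, "get_version", None)
    if get_version is None:
        return None  # release without the top-level API
    info = get_version()
    if dataclasses.is_dataclass(info) and not isinstance(info, type):
        # asdict() recurses into nested dataclasses such as LibraryInfo.
        return dataclasses.asdict(info)
    return {"version": info}


if __name__ == "__main__":
    print(nccl_stack_info())
```

For interactive use, nccl.show_versions() prints the same information
directly.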

Installation and Packaging

- CuTeDSL support can be installed through the CUDA-specific extras:
  nccl4py[cu12] installs nvidia-cutlass-dsl>=4.5.2,<5.0, and
  nccl4py[cu13] installs nvidia-cutlass-dsl[cu13]>=4.5.2,<5.0.
- Wheels include package data for nccl/ep/lib/libnccl_ep.so plus EP JIT
  headers. The bundled libnccl_ep.so is built with CUDA 13, regardless of
  whether the cu12 or cu13 extra is installed. Users who want to use a
  CUDA 12 build of libnccl_ep.so must provide that library themselves, for
  example through LD_PRELOAD or LD_LIBRARY_PATH.
- Wheels are available for free-threaded CPython 3.14t.
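
To confirm that an interpreter is a free-threaded build before relying on the
free-threaded wheels, a small standard-library check can help. This sketch
uses only CPython facilities (sysconfig and sys) and nothing from nccl4py;
the function names are illustrative.

```python
import sys
import sysconfig


def is_free_threaded_build() -> bool:
    """True when this CPython was built with the GIL made optional."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_currently_enabled() -> bool:
    """True when the GIL is active in this process.

    sys._is_gil_enabled() exists on CPython 3.13 and later; older
    interpreters always run with the GIL.
    """
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


if __name__ == "__main__":
    print("free-threaded build:", is_free_threaded_build())
    print("GIL enabled:", gil_currently_enabled())
```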

Examples and Documentation

Added Python examples for:
- multiple devices in one process:
  docs/examples/01_communicators/01_multiple_devices_single_process/python/;
- one device per MPI process:
  docs/examples/01_communicators/03_one_device_per_process_mpi/python/;
- point-to-point ring pattern:
  docs/examples/02_point_to_point/01_ring_pattern/python/;
- allreduce: docs/examples/03_collectives/01_allreduce/python/;
- user-buffer allreduce:
  docs/examples/04_user_buffer_registration/01_allreduce/python/;
- symmetric-memory allreduce:
  docs/examples/05_symmetric_memory/01_allreduce/python/;
- symmetric-memory allgather:
  docs/examples/05_symmetric_memory/02_allgather/python/.
Added nccl4py documentation under docs/userguide/source/nccl4py/, with the
main entry point at docs/userguide/source/nccl4py.rst.

Breaking Changes

Removed APIs

nccl.core.group_simulate_end() has been removed. Use
nccl.core.group_end(simulate=True):

from nccl.core import group_end, group_start

group_start()
# enqueue operations
info = group_end(simulate=True)

NCCL_SPLIT_NOCOLOR has been removed from the public constants. Use
color=None when a rank should opt out of Communicator.split().
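
As a migration sketch, the helper below chooses a split color per rank and
returns None for ranks that opt out, where earlier code would have passed
NCCL_SPLIT_NOCOLOR. The helper name split_color and the grouping rule are
illustrative assumptions; only color=None comes from these notes, and the
commented call omits any other arguments because their signature is not
given here.

```python
from typing import FrozenSet, Optional


def split_color(
    rank: int,
    ranks_per_group: int,
    excluded: FrozenSet[int] = frozenset(),
) -> Optional[int]:
    """Hypothetical helper: group consecutive ranks, or opt out with None."""
    if rank in excluded:
        return None  # replaces the removed NCCL_SPLIT_NOCOLOR constant
    return rank // ranks_per_group


# Assumed usage with an existing Communicator named comm:
#   new_comm = comm.split(color=split_color(rank, 4, frozenset({7})))
```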

Deprecated APIs

nccl.core.get_version() remains available, but is deprecated. Use top-level
nccl.get_version() for structured version information, or
nccl.show_versions() for human-readable output.

Other Compatibility Notes

- Public NCCL enum wrappers are pure-Python IntEnum or IntFlag classes.
  Integer compatibility is preserved, and dtype conversion remains supported.
  Code that depends on binding-backed enum class identity from earlier
  releases may need updates.
- Enum members now follow the Python enum convention of UPPER_SNAKE_CASE
  names, such as CTAPolicy.DEFAULT, CommShrinkFlag.ABORT,
  WindowFlag.COLL_SYMMETRIC, and NcclCommMemStat.GPU_MEM_TOTAL. The
  previous PascalCase/camelCase aliases, such as CTAPolicy.Default and
  NcclCommMemStat.GpuMemTotal, still work in 0.3.1 for compatibility, but
  will be removed in a future release. New code should use the uppercase
  names.
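
The behavior described above can be illustrated with a stand-in IntEnum. The
class DemoPolicy, its member OTHER, and all numeric values are hypothetical;
this shows how standard-library enum aliases behave and is not necessarily
how nccl4py implements its compatibility names.

```python
from enum import IntEnum


class DemoPolicy(IntEnum):
    """Stand-in enum; names and values are illustrative only."""

    DEFAULT = 0
    OTHER = 1
    # Legacy spellings declared as aliases of the canonical members.
    Default = 0
    Other = 1


# An alias resolves to the same member object as the canonical name.
assert DemoPolicy.Default is DemoPolicy.DEFAULT
# IntEnum members still compare equal to plain integers.
assert DemoPolicy.DEFAULT == 0
# Lookup by value reports the canonical UPPER_SNAKE_CASE name.
assert DemoPolicy(1).name == "OTHER"
# Iteration skips aliases, so only canonical members are listed.
assert list(DemoPolicy) == [DemoPolicy.DEFAULT, DemoPolicy.OTHER]
```

Code that compares enum values against integers keeps working, while code
that checks the enum's class identity against a type from an earlier release
is the kind that may need updating.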

Fixes and Enhancements

- Fixed pointer lifetime handling for non-blocking communicator and window
  initialization.
- Torch interop covers torch.uint32 and torch.uint64 when those dtypes are
  available.

API Stability

nccl.ep and nccl.core.device.cute ship as initial API support. Their public
interfaces may change in future releases as the NCCL EP and CuTeDSL device
API integration matures.
