Highlights
Added nccl.ep, a Pythonic interface to libnccl_ep.so for expert
parallel dispatch/combine workflows. The package exposes Group, Handle, Tensor, typed config dataclasses, Algorithm, Layout, PassDir, and the
named input/output structs used by the NCCL EP API.
Added nccl.core.device.cute, enabling CuTeDSL kernels to call NCCL device
APIs.
Added top-level stack diagnostics with nccl.get_version() and nccl.show_versions(), reporting nccl4py, libnccl.so, and libnccl_ep.so versions, CUDA build variants, and loaded shared-library
paths.
Added free-threaded CPython support.
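To confirm that a process is actually running on a free-threaded interpreter before relying on this support, a small standard-library check along the following lines may help. The helper name `free_threading_status` is hypothetical and is not part of nccl4py. It only uses `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()`, the latter being available on CPython 3.13 and newer.

```python
import sys
import sysconfig


def free_threading_status() -> dict:
    """Report whether this interpreter is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds and 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; older interpreters always use the GIL.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = bool(probe()) if callable(probe) else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


if __name__ == "__main__":
    print(free_threading_status())
```

On a free-threaded build the GIL can still be re-enabled at runtime, which is why the sketch reports the build flag and the runtime state separately.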
New Features
NCCL EP Python API
New nccl.ep package provides Pythonic access to the NCCL EP extension
library.
Group.create() creates EP groups from a Communicator and GroupConfig; Group.create_handle() creates handles with an explicit Layout.
Handle supports update(), dispatch(), combine(), complete(), and destroy().
DispatchInputs, DispatchOutputs, CombineInputs, CombineOutputs, and LayoutInfo provide named containers for the tensors and metadata used by
dispatch, combine, and handle setup.
Tensor resolves Python buffers into ncclEpTensor_t descriptors.
GroupConfig, HandleConfig, DispatchConfig, CombineConfig, and AllocConfig expose typed configuration objects.
AllocFn and FreeFn expose caller-controlled EP allocation hooks.
nccl.ep.interop.torch.get_nccl_comm_from_group() provides PyTorch interop
for creating an NCCL communicator from a PyTorch process group's rank and
world-size information.
Importing nccl.ep sets a default NCCL_EP_HOME when bundled EP JIT headers
are present, and a default NCCL_HOME when NCCL public headers are available
from the installed nvidia.nccl package.
nccl.ep checks that the loaded libnccl.so and libnccl_ep.so were built
with the same CUDA major version. CUDA minor differences are accepted.
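The compatibility rule in the last item can be illustrated with a minimal sketch. The helpers `cuda_major` and `same_cuda_major` are hypothetical and are not the check that nccl.ep performs internally. The sketch also assumes CUDA versions are given as dotted strings such as "12.9".

```python
def cuda_major(version: str) -> int:
    """Extract the major component from a dotted CUDA version string (assumed format)."""
    return int(version.strip().split(".")[0])


def same_cuda_major(nccl_cuda: str, ep_cuda: str) -> bool:
    """Return True when both libraries were built against the same CUDA major version."""
    # Minor versions are deliberately ignored, mirroring the rule described above.
    return cuda_major(nccl_cuda) == cuda_major(ep_cuda)


if __name__ == "__main__":
    print(same_cuda_major("12.4", "12.9"))  # same major, different minor
    print(same_cuda_major("12.9", "13.0"))  # different major
```

Under this rule a CUDA 12.4 libnccl.so pairs with a CUDA 12.9 libnccl_ep.so, while a CUDA 12 and CUDA 13 pair does not.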
Communicator Configuration
Added graph_stream_ordering to NCCLConfig.
Device API and CuTe DSL
New nccl.core.device.cute module exposes the NCCL device API to CuTeDSL
kernels, including communicator/window access, GIN primitives, barrier
operations, and typed structs.
Added bindings/nccl4py/examples/cute/main.py, a GIN put/wait example with
host-side validation.
Added gin_strong_signals_required and gin_va_signals_required to NCCLDevCommRequirements for configuring device communicator requirements.
Added NcclGinType.GPI for the GPU-Push Interface transport.
Version and Diagnostics API
Top-level nccl.get_version() returns a VersionInfo dataclass containing
the nccl4py package version plus LibraryInfo entries for the loaded libnccl.so and, when available, libnccl_ep.so.
Top-level nccl.show_versions() prints the same stack information in a
human-readable version block.
Direct library probes are available for each native library: nccl.core.get_lib_version() and nccl.core.get_lib_path() report the
loaded libnccl.so; nccl.ep.get_lib_version() and nccl.ep.get_lib_path() report the loaded libnccl_ep.so.
Each LibraryInfo includes release version, CUDA build variant, and loaded
shared-library path.
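A guarded diagnostic along these lines could be dropped into a bug-report script. The wrapper `report_nccl_stack` is a hypothetical helper. It relies only on the top-level nccl.get_version() described above and makes no assumption about the field names of VersionInfo, so it simply returns the object's repr.

```python
def report_nccl_stack() -> str:
    """Return a one-line description of the NCCL Python stack, never raising."""
    try:
        import nccl  # nccl4py's top-level package
    except Exception as exc:  # missing package or missing native libraries
        return f"nccl4py could not be imported: {exc!r}"

    get_version = getattr(nccl, "get_version", None)
    if get_version is None:
        # Older releases do not provide the top-level diagnostics API.
        return "this nccl installation has no top-level get_version()"

    try:
        return repr(get_version())
    except Exception as exc:
        return f"nccl.get_version() failed: {exc!r}"


if __name__ == "__main__":
    print(report_nccl_stack())
```

For interactive use, calling nccl.show_versions() directly is simpler. The wrapper is mainly useful where an import failure should be reported rather than raised.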
Installation and Packaging
CuTeDSL support can be installed through the CUDA-specific extras: nccl4py[cu12] installs nvidia-cutlass-dsl>=4.5.2,<5.0, and nccl4py[cu13] installs nvidia-cutlass-dsl[cu13]>=4.5.2,<5.0.
Wheels include package data for nccl/ep/lib/libnccl_ep.so plus EP JIT
headers. The bundled libnccl_ep.so is built with CUDA 13, regardless of
whether the cu12 or cu13 extra is installed. Users who want to use a
CUDA 12 build of libnccl_ep.so must provide that library themselves, for
example through LD_PRELOAD or LD_LIBRARY_PATH.
Wheels are available for free-threaded CPython 3.14t.
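For the CUDA 12 libnccl_ep.so case above, one option is to launch the workload from a small wrapper that prepends the library directory to LD_LIBRARY_PATH. This is a sketch only: `env_with_library_dir` is a hypothetical helper, and the directory and script names in the comment are placeholders. The variable has to be set before the child process starts, because the dynamic loader reads it at process startup.

```python
import os


def env_with_library_dir(lib_dir: str, base_env=None) -> dict:
    """Return a copy of the environment with lib_dir prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the user-provided library is found before any other copy on the path.
    env["LD_LIBRARY_PATH"] = lib_dir if not existing else lib_dir + os.pathsep + existing
    return env


# Hypothetical usage (paths are placeholders):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"],
#                  env=env_with_library_dir("/opt/nccl_ep_cu12/lib"), check=True)
```

Afterwards, nccl.ep.get_lib_path() can be used to check which libnccl_ep.so was actually loaded.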
Examples and Documentation
Added Python examples for:
multiple devices in one process: docs/examples/01_communicators/01_multiple_devices_single_process/python/;
one device per MPI process: docs/examples/01_communicators/03_one_device_per_process_mpi/python/;
point-to-point ring pattern: docs/examples/02_point_to_point/01_ring_pattern/python/;
NCCL_SPLIT_NOCOLOR has been removed from the public constants. Use color=None when a rank should opt out of Communicator.split().
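Code that previously passed NCCL_SPLIT_NOCOLOR could be migrated with a thin wrapper like the following sketch. Only the `color` keyword of Communicator.split() is taken from the note above. The helper name `split_or_opt_out` is hypothetical, and any other arguments are forwarded unchanged because the rest of the signature is not assumed here.

```python
def split_or_opt_out(comm, participates: bool, color: int, **split_kwargs):
    """Call comm.split(), passing color=None for ranks that opt out of the split."""
    # color=None replaces the removed NCCL_SPLIT_NOCOLOR constant.
    effective_color = color if participates else None
    # Remaining keyword arguments are passed through untouched (signature not assumed).
    return comm.split(color=effective_color, **split_kwargs)
```

Every rank of the parent communicator still has to make the call, since the underlying split is a collective operation. Opting out only changes the color that the rank passes.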
Deprecated APIs
nccl.core.get_version() remains available, but is deprecated. Use top-level nccl.get_version() for structured version information, or nccl.show_versions() for human-readable output.
Other Compatibility Notes
Public NCCL enum wrappers are pure-Python IntEnum or IntFlag classes.
Integer compatibility is preserved, and dtype conversion remains supported.
Code that depends on binding-backed enum class identity from earlier releases
may need updates.
Enum members now follow the Python enum convention of UPPER_SNAKE_CASE
names, such as CTAPolicy.DEFAULT, CommShrinkFlag.ABORT, WindowFlag.COLL_SYMMETRIC, and NcclCommMemStat.GPU_MEM_TOTAL. The
previous PascalCase/camelCase aliases, such as CTAPolicy.Default and NcclCommMemStat.GpuMemTotal, still work in 0.3.1 for compatibility, but
will be removed in a future release. New code should use the uppercase names.
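The behavior described in these notes can be illustrated with a plain standard-library IntEnum. `DemoPolicy` and its values are stand-ins invented for this sketch, not the real nccl4py classes, and the way nccl4py implements its compatibility aliases is not stated here. Standard enum aliasing is shown only as one way such names can coexist.

```python
from enum import IntEnum


class DemoPolicy(IntEnum):
    """Illustrative stand-in for a pure-Python NCCL enum wrapper (values are made up)."""

    DEFAULT = 0
    EFFICIENCY = 1
    # A second name bound to an existing value becomes an alias of the first member.
    Default = 0


if __name__ == "__main__":
    print(DemoPolicy.DEFAULT == 0)                     # integer compatibility
    print(DemoPolicy.Default is DemoPolicy.DEFAULT)    # legacy name resolves to the same member
    print([member.name for member in DemoPolicy])      # aliases are skipped during iteration
```

Because an alias resolves to the canonical member, code written against the uppercase names and code still using a legacy name compare equal, but only the uppercase name appears in `name` and in iteration.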
Fixes and Enhancements
Fixed pointer lifetime handling for non-blocking communicator and window
initialization.
Torch interop covers torch.uint32 and torch.uint64 when those dtypes are
available.
API Stability
nccl.ep and nccl.core.device.cute are initial versions of their APIs. Their public
interfaces may change in future releases as the NCCL EP and CuTeDSL device
API integration matures.