NCCL4py v0.3.1 Release

bhramesh-nvidia released this 11 Jun 18:45
5067397

Highlights

  • Added nccl.ep, a Pythonic interface to libnccl_ep.so for
    expert-parallel dispatch/combine workflows. The package exposes Group,
    Handle, Tensor, typed config dataclasses, Algorithm, Layout, PassDir, and
    the named input/output structs used by the NCCL EP API.
  • Added nccl.core.device.cute, enabling CuTeDSL kernels to call NCCL device
    APIs.
  • Added top-level stack diagnostics with nccl.get_version() and
    nccl.show_versions(), reporting nccl4py, libnccl.so, and
    libnccl_ep.so versions, CUDA build variants, and loaded shared-library
    paths.
  • Added free-threaded CPython support.

New Features

NCCL EP Python API

  • New nccl.ep package provides Pythonic access to the NCCL EP extension
    library.
  • Group.create() creates EP groups from a Communicator and GroupConfig;
    Group.create_handle() creates handles with an explicit Layout.
  • Handle supports update(), dispatch(), combine(), complete(), and
    destroy().
  • DispatchInputs, DispatchOutputs, CombineInputs, CombineOutputs, and
    LayoutInfo provide named containers for the tensors and metadata used by
    dispatch, combine, and handle setup.
  • Tensor resolves Python buffers into ncclEpTensor_t descriptors.
  • GroupConfig, HandleConfig, DispatchConfig, CombineConfig, and
    AllocConfig expose typed configuration objects.
  • AllocFn and FreeFn expose caller-controlled EP allocation hooks.
  • nccl.ep.interop.torch.get_nccl_comm_from_group() provides PyTorch interop
    for creating an NCCL communicator from a PyTorch process group's rank and
    world-size information.
  • Importing nccl.ep sets a default NCCL_EP_HOME when bundled EP JIT headers
    are present, and a default NCCL_HOME when NCCL public headers are
    available from the installed nvidia.nccl package.
  • nccl.ep checks that the loaded libnccl.so and libnccl_ep.so were built
    with the same CUDA major version. CUDA minor differences are accepted.
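
The CUDA compatibility rule in the last bullet can be illustrated in plain Python. This is only a sketch of the stated rule, not the check nccl.ep actually performs: the helper names cuda_major and same_cuda_major, and the "major.minor" string format, are assumptions made for the example.

```python
def cuda_major(version: str) -> int:
    """Return the major component of a 'major.minor' CUDA version string."""
    return int(version.split(".")[0])


def same_cuda_major(nccl_cuda: str, ep_cuda: str) -> bool:
    """Hypothetical helper: True when both libraries share a CUDA major version."""
    return cuda_major(nccl_cuda) == cuda_major(ep_cuda)


# Minor differences are accepted, major differences are not.
print(same_cuda_major("13.0", "13.1"))  # True
print(same_cuda_major("12.9", "13.0"))  # False
```

Under this rule, a CUDA 12 build of libnccl.so paired with a CUDA 13 build of libnccl_ep.so would be rejected, while two CUDA 13 builds with different minor versions would be accepted.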

Communicator Configuration

  • Added graph_stream_ordering to NCCLConfig.

Device API and CuTe DSL

  • New nccl.core.device.cute module exposes the NCCL device API to CuTeDSL
    kernels, including communicator/window access, GIN primitives, barrier
    operations, and typed structs.
  • Added bindings/nccl4py/examples/cute/main.py, a GIN put/wait example with
    host-side validation.
  • Added gin_strong_signals_required and gin_va_signals_required to
    NCCLDevCommRequirements for configuring device communicator requirements.
  • Added NcclGinType.GPI for the GPU-Push Interface transport.

Version and Diagnostics API

  • Top-level nccl.get_version() returns a VersionInfo dataclass containing
    the nccl4py package version plus LibraryInfo entries for the loaded
    libnccl.so and, when available, libnccl_ep.so.
  • Top-level nccl.show_versions() prints the same stack information in a
    human-readable version block.
  • Direct library probes are available for each native library:
    nccl.core.get_lib_version() and nccl.core.get_lib_path() report the
    loaded libnccl.so; nccl.ep.get_lib_version() and
    nccl.ep.get_lib_path() report the loaded libnccl_ep.so.
  • Each LibraryInfo includes release version, CUDA build variant, and loaded
    shared-library path.
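
A minimal sketch of reading the structured version information follows. The helper name describe_stack is hypothetical. Because these notes do not list the field names of VersionInfo, the sketch enumerates them with dataclasses.fields() instead of assuming any, and it degrades gracefully when nccl4py is not installed.

```python
import dataclasses


def describe_stack() -> str:
    """Hypothetical helper: summarize the nccl4py stack as text."""
    try:
        import nccl
    except (ImportError, OSError):
        return "nccl4py is not available in this environment"

    get_version = getattr(nccl, "get_version", None)
    if get_version is None:
        return "this nccl module predates the top-level get_version()"

    info = get_version()
    if dataclasses.is_dataclass(info):
        # List whatever fields the dataclass defines, without assuming names.
        return "\n".join(
            f"{field.name}: {getattr(info, field.name)!r}"
            for field in dataclasses.fields(info)
        )
    return repr(info)


print(describe_stack())
```

For a ready-made human-readable report, nccl.show_versions() prints the same stack information directly.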

Installation and Packaging

  • CuTeDSL support can be installed through the CUDA-specific extras:
    nccl4py[cu12] installs nvidia-cutlass-dsl>=4.5.2,<5.0, and
    nccl4py[cu13] installs nvidia-cutlass-dsl[cu13]>=4.5.2,<5.0.
  • Wheels include package data for nccl/ep/lib/libnccl_ep.so plus EP JIT
    headers. The bundled libnccl_ep.so is built with CUDA 13, regardless of
    whether the cu12 or cu13 extra is installed. Users who want to use a
    CUDA 12 build of libnccl_ep.so must provide that library themselves, for
    example through LD_PRELOAD or LD_LIBRARY_PATH.
  • Wheels are available for free-threaded CPython 3.14t.
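
To decide whether the free-threaded wheel applies to a given interpreter, the standard library can report the build type. The sketch below uses only sysconfig and sys and does not touch nccl4py; the function names are illustrative.

```python
import sys
import sysconfig


def is_free_threaded_build() -> bool:
    """True when the interpreter was built with the GIL disabled (e.g. 3.14t)."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_currently_enabled() -> bool:
    """True when the GIL is active at runtime.

    sys._is_gil_enabled() exists on Python 3.13 and later; older
    interpreters always run with the GIL.
    """
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


print("free-threaded build:", is_free_threaded_build())
print("GIL enabled now:", gil_currently_enabled())
```

A free-threaded build can still re-enable the GIL at runtime, for example when an extension module does not declare free-threading support, so checking both values gives a fuller picture.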

Examples and Documentation

  • Added Python examples for:
    • multiple devices in one process:
      docs/examples/01_communicators/01_multiple_devices_single_process/python/;
    • one device per MPI process:
      docs/examples/01_communicators/03_one_device_per_process_mpi/python/;
    • point-to-point ring pattern:
      docs/examples/02_point_to_point/01_ring_pattern/python/;
    • allreduce: docs/examples/03_collectives/01_allreduce/python/;
    • user-buffer allreduce:
      docs/examples/04_user_buffer_registration/01_allreduce/python/;
    • symmetric-memory allreduce:
      docs/examples/05_symmetric_memory/01_allreduce/python/;
    • symmetric-memory allgather:
      docs/examples/05_symmetric_memory/02_allgather/python/.
  • Added nccl4py documentation under docs/userguide/source/nccl4py/, with the
    main entry point at docs/userguide/source/nccl4py.rst.

Breaking Changes

Removed APIs

  • nccl.core.group_simulate_end() has been removed. Use
    nccl.core.group_end(simulate=True):

    from nccl.core import group_end, group_start
    
    group_start()
    # enqueue operations
    info = group_end(simulate=True)
  • NCCL_SPLIT_NOCOLOR has been removed from the public constants. Use
    color=None when a rank should opt out of Communicator.split().
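
For the NCCL_SPLIT_NOCOLOR migration, the color value can be computed up front and passed to Communicator.split(). The sketch below shows only that computation; the helper name split_color and its parameters are hypothetical, and the remaining arguments of Communicator.split() are not shown because these notes do not describe them.

```python
from typing import Optional


def split_color(rank: int, ranks_per_group: int, participates: bool) -> Optional[int]:
    """Hypothetical helper: choose the color a rank passes to Communicator.split().

    Returns None for ranks that opt out, which replaces the removed
    NCCL_SPLIT_NOCOLOR constant.
    """
    if not participates:
        return None
    return rank // ranks_per_group


# Rank 5 joins the second group of four; a non-participating rank opts out.
print(split_color(5, 4, participates=True))   # 1
print(split_color(5, 4, participates=False))  # None
```

The returned value would then be passed as the color argument, i.e. color=split_color(...), in place of any earlier use of NCCL_SPLIT_NOCOLOR.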

Deprecated APIs

  • nccl.core.get_version() remains available, but is deprecated. Use top-level
    nccl.get_version() for structured version information, or
    nccl.show_versions() for human-readable output.

Other Compatibility Notes

  • Public NCCL enum wrappers are now pure-Python IntEnum or IntFlag classes.
    Integer compatibility is preserved, and dtype conversion remains supported.
    Code that depends on the binding-backed enum class identity of earlier
    releases may need updates.
  • Enum members now follow the Python enum convention of UPPER_SNAKE_CASE
    names, such as CTAPolicy.DEFAULT, CommShrinkFlag.ABORT,
    WindowFlag.COLL_SYMMETRIC, and NcclCommMemStat.GPU_MEM_TOTAL. The
    previous PascalCase/camelCase aliases, such as CTAPolicy.Default and
    NcclCommMemStat.GpuMemTotal, still work in 0.3.1 for compatibility, but
    will be removed in a future release. New code should use the uppercase names.
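
When migrating call sites to the uppercase names, a mechanical rename covers the examples listed above. The sketch below uses a stand-in enum, DemoPolicy, rather than a real nccl4py class, and its member values are illustrative. The helpers to_upper_snake and resolve are hypothetical, and names containing acronyms may not convert mechanically, so the result should be checked against the actual enum.

```python
import re
from enum import IntEnum


class DemoPolicy(IntEnum):
    """Stand-in enum for illustration only; not part of nccl4py."""
    DEFAULT = 0
    EFFICIENCY = 1


def to_upper_snake(name: str) -> str:
    """Convert a PascalCase or camelCase member name to UPPER_SNAKE_CASE."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).upper()


def resolve(enum_cls, old_name: str):
    """Look up a member by its old-style name using the new naming convention."""
    return enum_cls[to_upper_snake(old_name)]


print(to_upper_snake("GpuMemTotal"))          # GPU_MEM_TOTAL
print(to_upper_snake("CollSymmetric"))        # COLL_SYMMETRIC
print(resolve(DemoPolicy, "Default") == 0)    # True: integer comparison still works
```

Because the members are IntEnum values, comparisons against plain integers keep working after the rename, which matches the integer compatibility described above.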

Fixes and Enhancements

  • Fixed pointer lifetime handling for non-blocking communicator and window
    initialization.
  • Torch interop covers torch.uint32 and torch.uint64 when those dtypes are
    available.
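
Whether those dtypes are available depends on the installed PyTorch build. A small check, written here as a hypothetical helper that does not call nccl4py, is:

```python
def available_unsigned_dtypes() -> list:
    """Hypothetical helper: names of the wide unsigned dtypes this PyTorch exposes."""
    try:
        import torch
    except ImportError:
        return []
    return [name for name in ("uint32", "uint64") if hasattr(torch, name)]


print(available_unsigned_dtypes())
```

An empty list means either PyTorch is not installed or the installed version predates these dtypes.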

API Stability

  • nccl.ep and nccl.core.device.cute are initial versions of their APIs.
    Their public interfaces may change in future releases as the NCCL EP and
    CuTeDSL device API integration matures.