NCCL 2.29 – LLVM Intermediate Representation (IR) #2010
gab9talavera announced in Announcements
LLVM Intermediate Representation (IR)
With the 2.29 release, the NCCL Device API is no longer restricted to CUDA kernels. NCCL now exposes LLVM IR for the NCCL Device APIs so that emerging compiler technologies, high-level languages, and domain-specific systems can consume them directly. Instead of being gated behind C++ templates and CUDA-only build paths, the device APIs become accessible to any toolchain that can ingest LLVM bitcode.
C-Compatible Device API
The base NCCL host API has long been callable through `extern "C"` as a stable C interface. This change brings the device API to that same model: it exposes a C-compatible, ABI-stable surface for device-side primitives that were previously available only as C++ template APIs.
Integration Benefits
At a high level, this turns NCCL’s device primitives into a language-agnostic interface. A JIT compiler, a DSL runtime, or a custom compiler backend can link against the bitcode and call NCCL Device APIs as regular functions. That opens the door to new forms of integration, including dynamic code generation and fine-grained composition that would otherwise be impractical.
Practically, this means you can build fused computation–communication kernels, experiment with custom collective patterns, and control communication–computation overlap from a higher-level environment. It also makes it easier to prototype new distributed algorithms without having to ship large, monolithic, pre-compiled CUDA kernels for every variant. Note that some convenience helpers and specialized primitives remain C++‑only for now.
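As a rough illustration of what "linking against the bitcode" can look like in a toolchain, the sketch below assembles an `llvm-link` command that merges a caller's own bitcode module with the NCCL device bitcode library. The file names `my_kernel.bc` and `fused.bc` are hypothetical, the library path is the build output named later in this post, and using `llvm-link` at all is an assumption: a JIT or DSL runtime may instead link the module through the LLVM API.

```python
import shlex

# Build output path as stated in this post; adjust for your checkout.
NCCL_BITCODE = "build/lib/libnccl_device.bc"


def link_command(user_bitcode, nccl_bitcode=NCCL_BITCODE,
                 output="fused.bc", only_needed=True):
    """Return an llvm-link argv merging user code with the NCCL bitcode.

    `user_bitcode` and `output` are hypothetical file names. With
    --only-needed, llvm-link imports only the symbols that the first
    module actually references from the later ones.
    """
    cmd = ["llvm-link"]
    if only_needed:
        cmd.append("--only-needed")
    cmd += [user_bitcode, nccl_bitcode, "-o", output]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # without an LLVM installation.
    print(shlex.join(link_command("my_kernel.bc")))
```

The merged module would then go through the toolchain's usual lowering to the target GPU; that step depends on the compiler in use and is not shown here.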
Build and Tests
To generate the LLVM IR bitcode, compile NCCL with the `EMIT_LLVM_IR=1` flag.
This build requires Clang 21 and CUDA 12. By default, the resulting bitcode library targets `sm_90` when built with CUDA 12. You can override the target architecture using `BITCODE_LIB_ARCH=sm_xx` if your toolchain needs a different GPU target.
The bitcode library can be found at `build/lib/libnccl_device.bc` in the build directory. If you want to inspect the available APIs, you can either search the source for `NCCL_IR_EXTERN_C` or disassemble the bitcode with a tool such as `llvm-dis`.
We encourage compiler developers, DSL authors, and framework builders to integrate this interface and explore what becomes possible when fine‑grained distributed GPU communication is just a function call away.
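The build and inspection steps above can be scripted. The sketch below is a minimal helper, not the project's own tooling: it assembles the `make` invocation and lists function names found in disassembled IR. The `src.build` target is an assumption based on NCCL's usual build instructions, `sm_80` is only an example value for `BITCODE_LIB_ARCH`, and the function name in `SAMPLE_IR` is hypothetical and does not come from the NCCL bitcode.

```python
import re
import shlex
import shutil
import subprocess

BITCODE_PATH = "build/lib/libnccl_device.bc"  # path as stated in this post


def build_command(arch=None, jobs=8):
    """Return a make argv that requests the LLVM IR bitcode build."""
    cmd = ["make", f"-j{jobs}", "src.build", "EMIT_LLVM_IR=1"]  # target assumed
    if arch is not None:
        cmd.append(f"BITCODE_LIB_ARCH={arch}")
    return cmd


# Captures the symbol name on textual-IR lines such as
#   define dso_local void @foo(ptr %x) {
#   declare i32 @bar(i32)
_FUNC_RE = re.compile(r'^(define|declare)\b[^@]*@"?([\w.$]+)"?\(')


def list_functions(ir_text, defined_only=True):
    """Return function names in textual LLVM IR, in order of appearance."""
    names = []
    for line in ir_text.splitlines():
        match = _FUNC_RE.match(line)
        if match and (match.group(1) == "define" or not defined_only):
            names.append(match.group(2))
    return names


def disassemble(path=BITCODE_PATH):
    """Run llvm-dis on a bitcode file and return the textual IR."""
    if shutil.which("llvm-dis") is None:
        raise RuntimeError("llvm-dis not found on PATH")
    result = subprocess.run(["llvm-dis", path, "-o", "-"],
                            check=True, capture_output=True, text=True)
    return result.stdout


# Tiny hand-written IR sample; the function name is hypothetical.
SAMPLE_IR = """\
define dso_local void @hypothetical_nccl_device_fn(ptr noundef %0) #0 {
  ret void
}
declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
"""

if __name__ == "__main__":
    print(shlex.join(build_command(arch="sm_80")))
    print(list_functions(SAMPLE_IR))
```

Once the library is built, `list_functions(disassemble())` should give a quick overview of the defined entry points, assuming `llvm-dis` from a matching LLVM version is on the `PATH`.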
—
Authored by Subhadeep Bhattacharya (@sb17v)