[Common] Fix: IMA in register_user_buffer_collective on non-SM90 GPUs #2859

phu0ngng merged 5 commits into NVIDIA:main from cgemm_ipc_fix

Conversation
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary

This PR fixes an illegal memory access (IMA) crash in register_user_buffer_collective on non-SM90 GPUs.

Confidence Score: 5/5

This PR is safe to merge: the fix is targeted, correct, and the RAII guards properly handle all exit paths, including exception unwinds. The root cause (pageable host memory passed to NCCL DMA) is correctly addressed with cudaMallocHost, and the RAII unique_ptr guards handle cleanup on every exit path. No regressions are introduced and no P0/P1 issues remain. The prior concern about pinned memory leaks on exception paths is resolved by the guards introduced in this PR. No files require special attention.
| Filename | Overview |
|---|---|
| transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp | Replaces malloc/stack allocation of IPC handle buffers with cudaMallocHost + RAII unique_ptr guards; core fix is correct and cleanup is properly handled on all paths. |
Sequence Diagram

```mermaid
sequenceDiagram
    participant Host as Host CPU
    participant CUDA as CUDA Runtime
    participant NCCL as NCCL (allgather)
    participant GPU as GPU (DMA)
    Host->>CUDA: cudaMallocHost(&memhndl) (pinned memory)
    CUDA-->>Host: memhndl [pinned]
    Host->>CUDA: cudaIpcGetMemHandle(memhndl, *gpubuff)
    CUDA-->>Host: IPC handle written to pinned memhndl
    Host->>CUDA: cudaMallocHost(&tmp) (pinned memory, nvsize slots)
    CUDA-->>Host: tmp [pinned]
    Host->>NCCL: _allgather(tmp, memhndl) (both buffers are pinned)
    NCCL->>GPU: DMA read from pinned memhndl
    NCCL->>GPU: DMA write to pinned tmp
    GPU-->>NCCL: done
    NCCL-->>Host: all handles gathered in tmp[]
    loop for each peer i
        Host->>CUDA: cudaIpcOpenMemHandle(tmp[i])
        CUDA-->>Host: peer_ptr[hndl][i] mapped
    end
    Host->>CUDA: cudaDeviceSynchronize()
    Host->>CUDA: cudaFreeHost(memhndl) via RAII guard
    Host->>CUDA: cudaFreeHost(tmp) via RAII guard
```
Reviews (3): Last reviewed commit: "Merge branch 'main' into cgemm_ipc_fix"
transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp
/te-ci JAX L0

/te-ci L1

Pipeline #48064708 passed except that an encoder test failed in the …
Description
On Ampere (SM80) and older GPUs, `collective_gemm_bootstrap` crashes with an illegal memory access. The IPC handle exchange uses malloc/stack memory for `tmp` and `memhndl`, then passes them to the `_allgather` callback. When the callback is backed by `ncclAllGather`, NCCL tries to DMA from these pageable host addresses, which the GPU cannot access, causing the illegal memory access.

Type of change
Changes
Replace malloc/stack allocation of `tmp` and `memhndl` with `cudaMallocHost` (pinned host memory). Pinned memory is both CPU-addressable and GPU DMA-accessible, so `ncclAllGather` can use the buffers directly without any staging copies.

Checklist: