
[Common] Fix: IMA in register_user_buffer_collective on non-SM90 GPUs#2859

Merged
phu0ngng merged 5 commits into NVIDIA:main from phu0ngng:cgemm_ipc_fix
Apr 9, 2026

Conversation

@phu0ngng
Collaborator

@phu0ngng commented Apr 8, 2026

Description

On Ampere (SM80) and older GPUs, collective_gemm_bootstrap crashes with:

CUDA Error: an illegal memory access was encountered in register_user_buffer_collective

The IPC handle exchange uses malloc/stack memory for tmp and memhndl, then passes them to the _allgather callback. When the callback is backed by ncclAllGather, NCCL tries to DMA from these pageable host addresses — which the GPU cannot access — causing the illegal memory access.
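For context, the failing pattern looks roughly like this (an illustrative sketch, not the actual source; the ncclAllGather call stands in for what the _allgather callback does under the hood, and the surrounding variable types are assumptions):

```cuda
// Illustrative sketch of the pre-fix pattern (not the actual TE code).
cudaIpcMemHandle_t memhndl;  // stack memory: pageable from the GPU's view
cudaIpcGetMemHandle(&memhndl, *gpubuff);

cudaIpcMemHandle_t *tmp =
    (cudaIpcMemHandle_t *)malloc(nvsize * sizeof(cudaIpcMemHandle_t));  // pageable heap

// When the _allgather callback is backed by NCCL, this becomes roughly:
ncclAllGather(&memhndl, tmp, sizeof(cudaIpcMemHandle_t), ncclChar, comm, stream);
// NCCL issues GPU DMA against both host addresses; pageable host memory
// is not mapped for device access, so the DMA faults:
// "CUDA Error: an illegal memory access was encountered"
```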

Type of change

  • Documentation change (changes only the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Replace malloc/stack allocation of tmp and memhndl with cudaMallocHost (pinned host memory). Pinned memory is both CPU-addressable and GPU DMA-accessible, so ncclAllGather can use the buffers directly without any staging copies.
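A sketch of the fixed allocation pattern (assumed shape, not the exact diff; tmp, memhndl, nvsize, gpubuff, and _allgather are names taken from the PR discussion, and the RAII guards follow the review summary below):

```cuda
// Sketch only: assumed shape of the fix, not the exact TE diff.
cudaIpcMemHandle_t *memhndl = nullptr;
cudaIpcMemHandle_t *tmp = nullptr;

// Pinned (page-locked) host memory: CPU-addressable AND mapped for
// GPU DMA, so ncclAllGather can read/write it directly.
cudaMallocHost(&memhndl, sizeof(cudaIpcMemHandle_t));
cudaMallocHost(&tmp, nvsize * sizeof(cudaIpcMemHandle_t));

// RAII guards: cudaFreeHost runs on every exit path,
// including exception unwinds.
auto host_free = [](cudaIpcMemHandle_t *p) { cudaFreeHost(p); };
std::unique_ptr<cudaIpcMemHandle_t, decltype(host_free)> memhndl_guard(memhndl, host_free);
std::unique_ptr<cudaIpcMemHandle_t, decltype(host_free)> tmp_guard(tmp, host_free);

cudaIpcGetMemHandle(memhndl, *gpubuff);
_allgather(tmp, memhndl, sizeof(cudaIpcMemHandle_t));  // buffers are now DMA-safe
```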

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

phu0ngng and others added 2 commits April 8, 2026 21:49
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes an illegal memory access (IMA) crash in register_user_buffer_collective on non-SM90 (Ampere and older) GPUs by replacing malloc/stack allocations for memhndl and tmp with cudaMallocHost (pinned host memory), which is both CPU- and GPU DMA-accessible. As a bonus, std::unique_ptr RAII guards are introduced to ensure the pinned pages are freed on all exit paths, including exception unwinds.

Confidence Score: 5/5

This PR is safe to merge — the fix is targeted, correct, and the RAII guards properly handle all exit paths including exception unwinds.

The root cause (pageable host memory passed to NCCL DMA) is correctly addressed with cudaMallocHost, and the RAII unique_ptr guards handle cleanup on every exit path. No regressions are introduced and no P0/P1 issues remain. The prior concern about pinned memory leaks on exception paths is resolved by the guards introduced in this PR.

No files require special attention.

Vulnerabilities

No security concerns identified. The change uses cudaMallocHost for IPC handle exchange buffers — this is a standard CUDA pattern and does not introduce new attack surfaces. Memory is properly freed via RAII guards on all exit paths.

Important Files Changed

Filename: transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp
Overview: Replaces malloc/stack allocation of IPC handle buffers with cudaMallocHost + RAII unique_ptr guards; the core fix is correct and cleanup is properly handled on all paths.

Sequence Diagram

sequenceDiagram
    participant Host as Host CPU
    participant CUDA as CUDA Runtime
    participant NCCL as NCCL (allgather)
    participant GPU as GPU (DMA)

    Host->>CUDA: cudaMallocHost(&memhndl) (pinned memory)
    CUDA-->>Host: memhndl [pinned]
    Host->>CUDA: cudaIpcGetMemHandle(memhndl, *gpubuff)
    CUDA-->>Host: IPC handle written to pinned memhndl

    Host->>CUDA: cudaMallocHost(&tmp) (pinned memory, nvsize slots)
    CUDA-->>Host: tmp [pinned]

    Host->>NCCL: _allgather(tmp, memhndl) (both buffers are pinned)
    NCCL->>GPU: DMA read from pinned memhndl
    NCCL->>GPU: DMA write to pinned tmp
    GPU-->>NCCL: done
    NCCL-->>Host: all handles gathered in tmp[]

    loop for each peer i
        Host->>CUDA: cudaIpcOpenMemHandle(tmp[i])
        CUDA-->>Host: peer_ptr[hndl][i] mapped
    end

    Host->>CUDA: cudaDeviceSynchronize()
    Host->>CUDA: cudaFreeHost(memhndl) via RAII guard
    Host->>CUDA: cudaFreeHost(tmp) via RAII guard

Reviews (3): Last reviewed commit: "Merge branch 'main' into cgemm_ipc_fix"

phu0ngng and others added 2 commits April 8, 2026 22:08
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

@phu0ngng
Collaborator Author

phu0ngng commented Apr 8, 2026

/te-ci JAX L0

Collaborator

@timmoon10 left a comment


LGTM

@phu0ngng
Collaborator Author

phu0ngng commented Apr 8, 2026

/te-ci L1

@phu0ngng
Collaborator Author

phu0ngng commented Apr 9, 2026

Pipeline #48064708 passed, except for one encoder test failure in L0_jax_unittest--H100_1GPU caused by an HF rate limit, so this PR is good to go.

phu0ngng merged commit 0aea85f into NVIDIA:main on Apr 9, 2026
47 of 52 checks passed
phu0ngng deleted the cgemm_ipc_fix branch on April 9, 2026 at 15:59