
[Common] Fix: IMA in register_user_buffer_collective on non-SM90 GPUs#2859

Merged
phu0ngng merged 5 commits into NVIDIA:main from phu0ngng:cgemm_ipc_fix
Apr 9, 2026

Conversation

@phu0ngng
Collaborator

@phu0ngng commented Apr 8, 2026

Description

On Ampere (SM80) and older GPUs, collective_gemm_bootstrap crashes with:

CUDA Error: an illegal memory access was encountered in register_user_buffer_collective

The IPC handle exchange uses malloc/stack memory for tmp and memhndl, then passes them to the _allgather callback. When the callback is backed by ncclAllGather, NCCL tries to DMA from these pageable host addresses — which the GPU cannot access — causing the illegal memory access.
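For context, the failing pattern looks roughly like this (an illustrative sketch, not the actual source; the ncclAllGather call stands in for what the _allgather callback does under the hood, and the surrounding variable types are assumptions):

```cuda
// Illustrative sketch of the pre-fix pattern (not the actual TE code).
cudaIpcMemHandle_t memhndl;  // stack memory: pageable from the GPU's view
cudaIpcGetMemHandle(&memhndl, *gpubuff);

cudaIpcMemHandle_t *tmp =
    (cudaIpcMemHandle_t *)malloc(nvsize * sizeof(cudaIpcMemHandle_t));  // pageable heap

// When the _allgather callback is backed by NCCL, this becomes roughly:
ncclAllGather(&memhndl, tmp, sizeof(cudaIpcMemHandle_t), ncclChar, comm, stream);
// NCCL issues GPU DMA against both host addresses; pageable host memory
// is not mapped for device access, so the DMA faults:
// "CUDA Error: an illegal memory access was encountered"
```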

Type of change

  • Documentation change (changes only the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Replace malloc/stack allocation of tmp and memhndl with cudaMallocHost (pinned host memory). Pinned memory is both CPU-addressable and GPU DMA-accessible, so ncclAllGather can use the buffers directly without any staging copies.
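A sketch of the fixed allocation pattern (assumed shape, not the exact diff; tmp, memhndl, nvsize, gpubuff, and _allgather are names taken from the PR discussion, and the RAII guards follow the review summary below):

```cuda
// Sketch only: assumed shape of the fix, not the exact TE diff.
cudaIpcMemHandle_t *memhndl = nullptr;
cudaIpcMemHandle_t *tmp = nullptr;

// Pinned (page-locked) host memory: CPU-addressable AND mapped for
// GPU DMA, so ncclAllGather can read/write it directly.
cudaMallocHost(&memhndl, sizeof(cudaIpcMemHandle_t));
cudaMallocHost(&tmp, nvsize * sizeof(cudaIpcMemHandle_t));

// RAII guards: cudaFreeHost runs on every exit path,
// including exception unwinds.
auto host_free = [](cudaIpcMemHandle_t *p) { cudaFreeHost(p); };
std::unique_ptr<cudaIpcMemHandle_t, decltype(host_free)> memhndl_guard(memhndl, host_free);
std::unique_ptr<cudaIpcMemHandle_t, decltype(host_free)> tmp_guard(tmp, host_free);

cudaIpcGetMemHandle(memhndl, *gpubuff);
_allgather(tmp, memhndl, sizeof(cudaIpcMemHandle_t));  // buffers are now DMA-safe
```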

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

phu0ngng and others added 2 commits April 8, 2026 21:49
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@greptile-apps
Contributor

greptile-apps bot commented Apr 8, 2026

Greptile Summary

This PR fixes an illegal memory access (IMA) crash in register_user_buffer_collective on non-SM90 (Ampere and older) GPUs by replacing malloc/stack allocations for memhndl and tmp with cudaMallocHost (pinned host memory), which is both CPU- and GPU DMA-accessible. As a bonus, std::unique_ptr RAII guards are introduced to ensure the pinned pages are freed on all exit paths, including exception unwinds.

Confidence Score: 5/5

This PR is safe to merge — the fix is targeted, correct, and the RAII guards properly handle all exit paths including exception unwinds.

The root cause (pageable host memory passed to NCCL DMA) is correctly addressed with cudaMallocHost, and the RAII unique_ptr guards handle cleanup on every exit path. No regressions are introduced and no P0/P1 issues remain. The prior concern about pinned memory leaks on exception paths is resolved by the guards introduced in this PR.

No files require special attention.

Vulnerabilities

No security concerns identified. The change uses cudaMallocHost for IPC handle exchange buffers — this is a standard CUDA pattern and does not introduce new attack surfaces. Memory is properly freed via RAII guards on all exit paths.

Important Files Changed

Filename: transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp
Overview: Replaces malloc/stack allocation of IPC handle buffers with cudaMallocHost + RAII unique_ptr guards; the core fix is correct and cleanup is properly handled on all paths.

Sequence Diagram

sequenceDiagram
    participant Host as Host CPU
    participant CUDA as CUDA Runtime
    participant NCCL as NCCL (allgather)
    participant GPU as GPU (DMA)

    Host->>CUDA: cudaMallocHost(&memhndl) (pinned memory)
    CUDA-->>Host: memhndl [pinned]
    Host->>CUDA: cudaIpcGetMemHandle(memhndl, *gpubuff)
    CUDA-->>Host: IPC handle written to pinned memhndl

    Host->>CUDA: cudaMallocHost(&tmp) (pinned memory, nvsize slots)
    CUDA-->>Host: tmp [pinned]

    Host->>NCCL: _allgather(tmp, memhndl) (both buffers are pinned)
    NCCL->>GPU: DMA read from pinned memhndl
    NCCL->>GPU: DMA write to pinned tmp
    GPU-->>NCCL: done
    NCCL-->>Host: all handles gathered in tmp[]

    loop for each peer i
        Host->>CUDA: cudaIpcOpenMemHandle(tmp[i])
        CUDA-->>Host: peer_ptr[hndl][i] mapped
    end

    Host->>CUDA: cudaDeviceSynchronize()
    Host->>CUDA: cudaFreeHost(memhndl) via RAII guard
    Host->>CUDA: cudaFreeHost(tmp) via RAII guard

Reviews (3): Last reviewed commit: "Merge branch 'main' into cgemm_ipc_fix"

phu0ngng and others added 2 commits April 8, 2026 22:08
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

@phu0ngng
Collaborator Author

phu0ngng commented Apr 8, 2026

/te-ci JAX L0

Collaborator

@timmoon10 left a comment


LGTM

@phu0ngng
Collaborator Author

phu0ngng commented Apr 8, 2026

/te-ci L1

@phu0ngng
Collaborator Author

phu0ngng commented Apr 9, 2026

Pipeline #48064708 passed, except for one encoder test failure in L0_jax_unittest--H100_1GPU caused by an HF rate limit, so this PR is good to go.

phu0ngng merged commit 0aea85f into NVIDIA:main on Apr 9, 2026
47 of 52 checks passed
phu0ngng deleted the cgemm_ipc_fix branch on April 9, 2026 at 15:59