
Fixed 'Memory access fault' error for fused dense on MI350 #325

Merged
jithunnair-amd merged 1 commit into ROCm:master from albmalamd:fix_fused_dense_memory_access_fault on Apr 20, 2026

Conversation

@albmalamd commented Apr 15, 2026

Motivation

The error reported by QA: a memory access fault when running the fused_dense_cuda extension on MI350.

Technical Details

In gemm_lt (the ROCm / USE_ROCM path in csrc/fused_dense_cuda.cu), replace at::cuda::getCurrentCUDABlasHandle() with at::cuda::getCurrentCUDABlasLtHandle() for the handle passed to hipblasGetStream and hipblasLtMatmul (and the rest of the hipBLASLt setup in that function).
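In sketch form, the change amounts to swapping the handle getter at the top of the Lt path. This is a minimal illustration, not the PR's exact diff: the surrounding lines, the variable name, and the cast on the old line are assumptions.

```diff
-  hipblasLtHandle_t handle =
-      (hipblasLtHandle_t)at::cuda::getCurrentCUDABlasHandle();
+  hipblasLtHandle_t handle = at::cuda::getCurrentCUDABlasLtHandle();
```

The point of the fix is that the plain hipBLAS handle and the hipBLASLt handle are distinct objects in the ROCm stack, so feeding the former to hipBLASLt calls can leave those calls operating on state they do not own, which is consistent with the write-to-read-only-page fault QA observed.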

Test Plan

  • Build the fused_dense_cuda extension on ROCm (e.g. a JIT or full Apex build targeting your arch, such as gfx950).
  • Run the ROCm Apex test driver, e.g. APEX_TEST_WITH_ROCM=1 APEX_SKIP_FLAKY_TEST=1 python run_test.py (or your repo's run_rocm.sh wrapper).

Test Result

  • Before: the extension could load, but the process then aborted with a GPU memory access fault during the tests. From the JIRA ticket:

```
Time to load fused_dense_cuda op: 19.15865182876587 seconds
Memory access fault by GPU node-2 (Agent handle: 0x30623ae0) on address 0x7cdff11f0000. Reason: Write access to a read-only page.
GPU coredump: execvp failed: No such file or directory
Failed to write segment data to pipe: Broken pipe
GPU coredump: handler exited with error (status: 1)
GPU core dump failed
./run_rocm.sh: line 2:    49 Aborted                 (core dumped) APEX_TEST_WITH_ROCM=1 APEX_SKIP_FLAKY_TEST=1 python run_test.py
```

  • After: the tests run to completion without the fault (local-run confirmation is attached in the comments).

albmalamd self-assigned this Apr 15, 2026
@jaglinux

LGTM

@jaglinux

Maybe in another PR: we should use the hipBLASLt API and pointers consistently, e.g.

```cpp
hipblasLtHandle_t handle = at::cuda::getCurrentCUDABlasLtHandle();
hipStream_t stream = at::cuda::getCurrentCUDAStream();
void* workspace = at::cuda::getCUDABlasLtWorkspace();
size_t workspace_size = at::cuda::getCUDABlasLtWorkspaceSize();
```
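For context, those handles and the workspace would presumably feed the Lt matmul call roughly as follows. This is a sketch only: descriptor setup is elided, and all variable names besides the four getters above are illustrative, not code from this repo.

```cpp
// Sketch: matmul_desc, a_desc/b_desc/c_desc, algo, and the device
// pointers a_ptr/b_ptr/c_ptr are assumed to have been created with
// the corresponding hipblasLt descriptor APIs.
hipblasLtMatmul(handle, matmul_desc,
                &alpha, a_ptr, a_desc,
                        b_ptr, b_desc,
                &beta,  c_ptr, c_desc,
                        c_ptr, c_desc,
                &algo, workspace, workspace_size, stream);
```

Routing the stream and workspace through the dedicated hipBLASLt getters, rather than the plain hipBLAS handle, is the consistency this comment is asking for.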

@albmalamd (Author) commented Apr 20, 2026

The CI doesn't check MI350, so as confirmation I attach a screenshot of the local execution:

(screenshot: local test run on MI350)

Also, the original L0 CI remained green: https://github.com/ROCm/apex/actions/runs/24464892682/job/71499106039?pr=325#logs

@jithunnair-amd jithunnair-amd merged commit af25af4 into ROCm:master Apr 20, 2026
1 of 2 checks passed