
Fixed 'Memory access fault' error for fused dense on MI350 #325

Merged
jithunnair-amd merged 1 commit into ROCm:master from albmalamd:fix_fused_dense_memory_access_fault on Apr 20, 2026

Conversation

@albmalamd commented Apr 15, 2026

Motivation

The error reported by QA: a memory access fault when running the fused_dense_cuda extension on MI350.

Technical Details

In gemm_lt (the ROCm / USE_ROCM path in csrc/fused_dense_cuda.cu), replace at::cuda::getCurrentCUDABlasHandle() with at::cuda::getCurrentCUDABlasLtHandle() for the handle passed to hipblasGetStream and hipblasLtMatmul (and the rest of the hipBLASLt setup in that function).
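In sketch form, the change amounts to swapping the handle getter at the top of the Lt path. This is a minimal illustration, not the PR's exact diff: the surrounding lines, the variable name, and the cast on the old line are assumptions.

```diff
-  hipblasLtHandle_t handle =
-      (hipblasLtHandle_t)at::cuda::getCurrentCUDABlasHandle();
+  hipblasLtHandle_t handle = at::cuda::getCurrentCUDABlasLtHandle();
```

The point of the fix is that the plain hipBLAS handle and the hipBLASLt handle are distinct objects in the ROCm stack, so feeding the former to hipBLASLt calls can leave those calls operating on state they do not own, which is consistent with the write-to-read-only-page fault QA observed.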

Test Plan

  • Build the fused_dense_cuda extension on ROCm (e.g. a JIT or full Apex build targeting your arch, such as gfx950).
  • Run the ROCm Apex test driver, e.g. APEX_TEST_WITH_ROCM=1 APEX_SKIP_FLAKY_TEST=1 python run_test.py (or your repo's run_rocm.sh wrapper).

Test Result

  • Before: the extension could load, but the process then aborted with a GPU memory access fault during the tests. From the JIRA ticket:

```
Time to load fused_dense_cuda op: 19.15865182876587 seconds
Memory access fault by GPU node-2 (Agent handle: 0x30623ae0) on address 0x7cdff11f0000. Reason: Write access to a read-only page.
GPU coredump: execvp failed: No such file or directory
Failed to write segment data to pipe: Broken pipe
GPU coredump: handler exited with error (status: 1)
GPU core dump failed
./run_rocm.sh: line 2:    49 Aborted                 (core dumped) APEX_TEST_WITH_ROCM=1 APEX_SKIP_FLAKY_TEST=1 python run_test.py
```

  • After: the tests run to completion without the fault (local-run confirmation is attached in the comments).

albmalamd self-assigned this Apr 15, 2026
@jaglinux

LGTM

@jaglinux

Maybe in another PR: we should use the hipBLASLt API and pointers consistently, e.g.

```cpp
hipblasLtHandle_t handle = at::cuda::getCurrentCUDABlasLtHandle();
hipStream_t stream = at::cuda::getCurrentCUDAStream();
void* workspace = at::cuda::getCUDABlasLtWorkspace();
size_t workspace_size = at::cuda::getCUDABlasLtWorkspaceSize();
```
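For context, those handles and the workspace would presumably feed the Lt matmul call roughly as follows. This is a sketch only: descriptor setup is elided, and all variable names besides the four getters above are illustrative, not code from this repo.

```cpp
// Sketch: matmul_desc, a_desc/b_desc/c_desc, algo, and the device
// pointers a_ptr/b_ptr/c_ptr are assumed to have been created with
// the corresponding hipblasLt descriptor APIs.
hipblasLtMatmul(handle, matmul_desc,
                &alpha, a_ptr, a_desc,
                        b_ptr, b_desc,
                &beta,  c_ptr, c_desc,
                        c_ptr, c_desc,
                &algo, workspace, workspace_size, stream);
```

Routing the stream and workspace through the dedicated hipBLASLt getters, rather than the plain hipBLAS handle, is the consistency this comment is asking for.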

@albmalamd (Author) commented Apr 20, 2026

The CI doesn't check MI350, so as confirmation I attach a screenshot of the local execution:

(screenshot: local test run on MI350)

Also, the original L0 CI remained green: https://github.com/ROCm/apex/actions/runs/24464892682/job/71499106039?pr=325#logs

@jithunnair-amd jithunnair-amd merged commit af25af4 into ROCm:master Apr 20, 2026
1 of 2 checks passed