[AMDGPU] Fix gradient computation. by duburcqa · Pull Request #486 · Genesis-Embodied-AI/quadrants

duburcqa · 2026-04-15T17:07:10Z

The AMDGPU kernel launcher never unwrapped the grad DeviceAllocation when passing ndarray-with-grad parameters. CUDA does this correctly; AMDGPU copy-pasted the path but forgot the grad side.

The diff, side by side

CUDA launcher — quadrants/runtime/cuda/kernel_launcher.cpp:123-139 (Ndarray branch, DevAllocType != kNone):

DeviceAllocation *ptr = static_cast<DeviceAllocation *>(data_ptr);
device_ptrs[data_ptr_idx] = executor->get_device_alloc_info_ptr(*ptr);

if (grad_ptr != nullptr) {
  ptr = static_cast<DeviceAllocation *>(grad_ptr);
  device_ptrs[grad_ptr_idx] = executor->get_device_alloc_info_ptr(*ptr);  // unwrap grad
} else {
  device_ptrs[grad_ptr_idx] = nullptr;
}

ctx.set_ndarray_ptrs(arg_id, (uint64)device_ptrs[data_ptr_idx], (uint64)device_ptrs[grad_ptr_idx]);

AMDGPU launcher — quadrants/runtime/amdgpu/kernel_launcher.cpp:94-104 (same branch, pre-fix):

DeviceAllocation *ptr = static_cast<DeviceAllocation *>(data_ptr);
device_ptrs[data_ptr_idx] = executor->get_device_alloc_info_ptr(*ptr);

ctx.set_ndarray_ptrs(arg_id, (uint64)device_ptrs[data_ptr_idx], (uint64)ctx.array_ptrs[grad_ptr_idx]);
//                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
//                              ctx.array_ptrs[grad_ptr_idx] is a pointer to a host-side
//                              DeviceAllocation struct, NOT the GPU buffer address.

The kernel then treats that host pointer as a GPU address. Reading/writing it from the GPU → hipErrorIllegalAddress. The error is async, so it surfaces at the next sync point — the synchronous mem_free(device_result_buffer) at kernel_launcher.cpp:153, exactly what the stack trace on the genesis test_differentiable_rigid[gpu] repro reports.

Why the test exposes it precisely at loss.backward()

Forward scene.step() kernels only touch primal data — .grad accessors are never referenced → no crash.
kernel_get_state_grad at genesis/.../abd/accessor.py:185 reads rigid_global_info.qpos.grad, dofs_state.vel.grad, links_state.pos.grad, links_state.quat.grad — all go through the grad-ptr arg.
rigid_global_info, dofs_state, links_state are array_class.* dataclasses holding qd ndarrays (hit DevAllocType != kNone — the buggy branch).
The AcceleratorError on the torch tensors in the stack trace repr is the same deferred error re-surfacing through torch's own cached sync.

Secondary bug in the same file

quadrants/runtime/amdgpu/kernel_launcher.cpp:80-93 (external-array branch, on-host case) also fails to transfer/allocate a device copy of the grad buffer like CUDA does at lines 107-116. Less likely to fire in practice (most external grads arrive as torch GPU tensors), but symmetrical breakage — fixed here as well.

The fix

Mirror CUDA's grad handling in quadrants/runtime/amdgpu/kernel_launcher.cpp:

Read grad_ptr into a local at the top of the parameter.is_array block.
Ndarray branch: unwrap grad_ptr as DeviceAllocation* and use get_device_alloc_info_ptr.
External-array on-host branch: allocate a device buffer for grad, memcpy_host_to_device, register a transfer for writeback.
External-array on-device branch: store device_ptrs[grad_ptr_idx] = grad_ptr for symmetry (no behavior change, but avoids relying on the map-default read at set_ndarray_ptrs time).

Why unit tests didn't catch this

Three layered exclusions kept the buggy path out of CI:

tests/python/test_ad_ndarray.py and test_ad_ndarray_torch.py — the two files that exercise exactly this code path — declared archs_support_ndarray_ad = [qd.cpu, qd.cuda] at the top, so every decorated test skipped on AMDGPU.
tests/test_utils.py::get_archs() hardcoded {cpu, cuda, metal, vulkan}. Even if a test allowed AMDGPU, the harness never parametrized on it without QD_WANTED_ARCHS=amdgpu.
quadrants/program/extension.cpp:19 listed only Extension::assertion for AMDGPU. Tests gated on require=qd.extension.adstack skipped AMDGPU on that check alone. The codegen (codegen_llvm.cpp:2106-2153) and runtime (runtime.cpp:1889-1911) for adstack are both
arch-agnostic — the declaration was stale, not a capability statement.

All three are addressed:

archs_support_ndarray_ad now includes qd.amdgpu in both files.
get_archs() now includes amdgpu. is_arch_supported(amdgpu) still filters it out on hosts without ROCm, so non-HIP CI is unaffected.
extension.cpp now declares adstack for AMDGPU. All 86 AMDGPU-parametrized tests pass except one noted below.

One test loosened on AMDGPU only

test_ad_ndarray_torch.py::test_tensor_shape asserts a.grad[i] == 1.0 exactly on an fp32 reverse-mode adjoint sum. Analytically 1.0; on AMDGPU fp32 it lands at ~0.99999988 (CUDA happens to hit exactly 1.0). Loosened to torch.allclose on AMDGPU only — CPU/CUDA keep the
strict == 1.0 check.

Test plan

Genesis tests/test_grad.py::test_differentiable_rigid[gpu] — the original repro, now passes.
QD_WANTED_ARCHS=amdgpu pytest tests/python/test_ad_ndarray.py tests/python/test_ad_ndarray_torch.py — all 86 AMDGPU-parametrized tests pass. Covers the launcher fix and the adstack enablement.
Re-run on a CUDA host to confirm no regression on the CUDA launcher (CUDA path unchanged; new parametrizations add AMDGPU rows without removing CUDA rows).

hughperkins · 2026-04-15T17:22:18Z

Are there any other tests that have cuda, but not amdgpu?

hughperkins · 2026-04-15T17:23:22Z

@claude review

duburcqa · 2026-04-15T20:46:49Z

Within the AD ndarray scope of this PR, no — the two files that gated ndarray-AD coverage (test_ad_ndarray.py, test_ad_ndarray_torch.py) both went through archs_support_ndarray_ad, so flipping the list covers every test in them.

The rest of the AD suite (test_ad_for.py, test_ad_if.py, test_loop_grad.py, ...) uses require=qd.extension.adstack with no explicit arch=, so they take the default from expected_archs()/get_archs(). Since this PR also adds amdgpu to get_archs() and declares adstack for AMDGPU in extension.cpp, those now parametrize on AMDGPU automatically.

There is one AD-adjacent test still pinned to CUDA: test_torch_ad.py::test_torch_cuda_context (arch=qd.cuda). It is intentionally CUDA-specific — it asserts torch._C._cuda_hasPrimaryContext(0), which is a CUDA-only torch internal — so it should stay CUDA-only.

Outside AD, plenty of tests still use arch=[qd.cpu, qd.cuda] (test_print.py, test_function.py, test_atomic.py, test_return.py, test_overflow.py, test_matrix.py, test_ndarray.py, ...). Expanding those to AMDGPU is out of scope for this PR — it would need separate validation on the AMD runner and is unrelated to the grad-ptr unwrap bug.

hughperkins

Thanks! 🙌

duburcqa · 2026-04-16T10:23:56Z

Update: the broader arch expansion (adding amdgpu/metal/vulkan to tests outside this PR's AD scope) is now tracked in #488. That PR declares all unit-tested extensions for AMDGPU, expands 121+ test decorators to include amdgpu, and adds metal/vulkan where SPIR-V can handle them. Validated on MI250X hardware.

…mbodied-AI#486 against pei.zhang DeviceScratchBuffer) Adapts upstream commit 7274da9 ("[AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)") onto the current amd-integration kernel_launcher.cpp, which has diverged from the upstream baseline due to pei.zhang's per-handle DeviceScratchBuffer caching + the exp12_diag branch counters. Without this fix, KernelLauncher::launch_llvm_kernel passes the raw host grad_ptr straight into ctx.set_ndarray_ptrs(). The generated kernel then dereferences a host address as if it were a device pointer, producing silently-wrong gradients (and occasionally faults) for every AMDGPU autodiff workload that uses ndarrays. What the merge does: - Pulls grad_ptr out of ctx.array_ptrs alongside data_ptr. - kNone + on-device path: routes grad_ptr through device_ptrs[grad_ptr_idx] with no extra copy (caller already put it on the device). - kNone + host-copy path: allocates a second device buffer, H2Ds the host grad into it, registers the (host_grad, devalloc) pair in `transfers` so the existing post-launch loop handles D2H writeback + free automatically. Reuses the cached device_result_buffer for the second allocate_memory_on_device call -- safe because that helper is stream-synchronous from the host's perspective. - Ndarray path: unwraps grad_ptr's DeviceAllocation* exactly the same way data_ptr is unwrapped, so set_ndarray_ptrs gets a real device address. - All three branches now pass device_ptrs[grad_ptr_idx] (was ctx.array_ptrs[grad_ptr_idx], a host pointer) into set_ndarray_ptrs. Companion changes mirror upstream: - quadrants/program/extension.cpp: declare Extension::adstack supported on AMDGPU so the autodiff lowering pass actually runs. - tests/python/test_ad_ndarray*.py: enable the ndarray-AD tests on amdgpu. - tests/python/test_ad_ndarray_torch.py::test_tensor_shape: relax exact fp32 equality to torch.allclose() on AMDGPU (the per-element adjoint sum loses bit-exactness; CUDA happens to round to exactly 1.0). Co-Authored-By: Alexis DUBURCQ <alexis.duburcq@gmail.com>

duburcqa changed the title ~~[AMDGPU ] Fix gradient computation.~~ [AMDGPU] Fix gradient computation. Apr 15, 2026

claude Bot reviewed Apr 15, 2026

View reviewed changes

Comment thread tests/python/test_ad_ndarray_torch.py

hughperkins approved these changes Apr 16, 2026

View reviewed changes

Fix gradient computation on AMDGPU.

bfa3cb8

duburcqa force-pushed the duburcqa/fix_amdgpu_grad branch from 977a125 to bfa3cb8 Compare April 16, 2026 09:26

duburcqa enabled auto-merge (squash) April 16, 2026 09:26

claude Bot reviewed Apr 16, 2026

View reviewed changes

Comment thread quadrants/program/extension.cpp

duburcqa merged commit 7274da9 into main Apr 16, 2026
71 of 72 checks passed

duburcqa deleted the duburcqa/fix_amdgpu_grad branch April 16, 2026 10:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMDGPU] Fix gradient computation.#486

[AMDGPU] Fix gradient computation.#486
duburcqa merged 1 commit into
mainfrom
duburcqa/fix_amdgpu_grad

duburcqa commented Apr 15, 2026 •

edited

Loading

Uh oh!

hughperkins commented Apr 15, 2026

Uh oh!

hughperkins commented Apr 15, 2026

Uh oh!

Uh oh!

duburcqa commented Apr 15, 2026 •

edited

Loading

Uh oh!

hughperkins left a comment

Uh oh!

Uh oh!

duburcqa commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

duburcqa commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The diff, side by side

The fix

Why unit tests didn't catch this

Test plan

Uh oh!

hughperkins commented Apr 15, 2026

Uh oh!

hughperkins commented Apr 15, 2026

Uh oh!

Uh oh!

duburcqa commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hughperkins left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

duburcqa commented Apr 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

duburcqa commented Apr 15, 2026 •

edited

Loading

duburcqa commented Apr 15, 2026 •

edited

Loading