Skip to content

[AMDGPU] Fix gradient computation.#486

Merged
duburcqa merged 1 commit into
mainfrom
duburcqa/fix_amdgpu_grad
Apr 16, 2026
Merged

[AMDGPU] Fix gradient computation.#486
duburcqa merged 1 commit into
mainfrom
duburcqa/fix_amdgpu_grad

Conversation

@duburcqa
Copy link
Copy Markdown
Contributor

@duburcqa duburcqa commented Apr 15, 2026

The AMDGPU kernel launcher never unwrapped the grad DeviceAllocation when passing ndarray-with-grad parameters. CUDA does this correctly; AMDGPU copy-pasted the path but forgot the grad side.

The diff, side by side

CUDA launcherquadrants/runtime/cuda/kernel_launcher.cpp:123-139 (Ndarray branch, DevAllocType != kNone):

DeviceAllocation *ptr = static_cast<DeviceAllocation *>(data_ptr);
device_ptrs[data_ptr_idx] = executor->get_device_alloc_info_ptr(*ptr);

if (grad_ptr != nullptr) {
  ptr = static_cast<DeviceAllocation *>(grad_ptr);
  device_ptrs[grad_ptr_idx] = executor->get_device_alloc_info_ptr(*ptr);  // unwrap grad
} else {
  device_ptrs[grad_ptr_idx] = nullptr;
}

ctx.set_ndarray_ptrs(arg_id, (uint64)device_ptrs[data_ptr_idx], (uint64)device_ptrs[grad_ptr_idx]);

AMDGPU launcher — quadrants/runtime/amdgpu/kernel_launcher.cpp:94-104 (same branch, pre-fix):

DeviceAllocation *ptr = static_cast<DeviceAllocation *>(data_ptr);
device_ptrs[data_ptr_idx] = executor->get_device_alloc_info_ptr(*ptr);

ctx.set_ndarray_ptrs(arg_id, (uint64)device_ptrs[data_ptr_idx], (uint64)ctx.array_ptrs[grad_ptr_idx]);
//                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
//                              ctx.array_ptrs[grad_ptr_idx] is a pointer to a host-side
//                              DeviceAllocation struct, NOT the GPU buffer address.

The kernel then treats that host pointer as a GPU address. Reading/writing it from the GPU → hipErrorIllegalAddress. The error is async, so it surfaces at the next sync point — the synchronous mem_free(device_result_buffer) at kernel_launcher.cpp:153, exactly what the stack trace on the genesis test_differentiable_rigid[gpu] repro reports.

Why the test exposes it precisely at loss.backward()

  • Forward scene.step() kernels only touch primal data — .grad accessors are never referenced → no crash.
  • kernel_get_state_grad at genesis/.../abd/accessor.py:185 reads rigid_global_info.qpos.grad, dofs_state.vel.grad, links_state.pos.grad, links_state.quat.grad — all go through the grad-ptr arg.
  • rigid_global_info, dofs_state, links_state are array_class.* dataclasses holding qd ndarrays (hit DevAllocType != kNone — the buggy branch).
  • The AcceleratorError on the torch tensors in the stack trace repr is the same deferred error re-surfacing through torch's own cached sync.

Secondary bug in the same file

quadrants/runtime/amdgpu/kernel_launcher.cpp:80-93 (external-array branch, on-host case) also fails to transfer/allocate a device copy of the grad buffer like CUDA does at lines 107-116. Less likely to fire in practice (most external grads arrive as torch GPU tensors), but symmetrical breakage — fixed here as well.

The fix

Mirror CUDA's grad handling in quadrants/runtime/amdgpu/kernel_launcher.cpp:

  1. Read grad_ptr into a local at the top of the parameter.is_array block.
  2. Ndarray branch: unwrap grad_ptr as DeviceAllocation* and use get_device_alloc_info_ptr.
  3. External-array on-host branch: allocate a device buffer for grad, memcpy_host_to_device, register a transfer for writeback.
  4. External-array on-device branch: store device_ptrs[grad_ptr_idx] = grad_ptr for symmetry (no behavior change, but avoids relying on the map-default read at set_ndarray_ptrs time).

Why unit tests didn't catch this

Three layered exclusions kept the buggy path out of CI:

  1. tests/python/test_ad_ndarray.py and test_ad_ndarray_torch.py — the two files that exercise exactly this code path — declared archs_support_ndarray_ad = [qd.cpu, qd.cuda] at the top, so every decorated test skipped on AMDGPU.
  2. tests/test_utils.py::get_archs() hardcoded {cpu, cuda, metal, vulkan}. Even if a test allowed AMDGPU, the harness never parametrized on it without QD_WANTED_ARCHS=amdgpu.
  3. quadrants/program/extension.cpp:19 listed only Extension::assertion for AMDGPU. Tests gated on require=qd.extension.adstack skipped AMDGPU on that check alone. The codegen (codegen_llvm.cpp:2106-2153) and runtime (runtime.cpp:1889-1911) for adstack are both
    arch-agnostic — the declaration was stale, not a capability statement.

All three are addressed:

  • archs_support_ndarray_ad now includes qd.amdgpu in both files.
  • get_archs() now includes amdgpu. is_arch_supported(amdgpu) still filters it out on hosts without ROCm, so non-HIP CI is unaffected.
  • extension.cpp now declares adstack for AMDGPU. All 86 AMDGPU-parametrized tests pass except one noted below.

One test loosened on AMDGPU only

test_ad_ndarray_torch.py::test_tensor_shape asserts a.grad[i] == 1.0 exactly on an fp32 reverse-mode adjoint sum. Analytically 1.0; on AMDGPU fp32 it lands at ~0.99999988 (CUDA happens to hit exactly 1.0). Loosened to torch.allclose on AMDGPU only — CPU/CUDA keep the
strict == 1.0 check.

Test plan

  • Genesis tests/test_grad.py::test_differentiable_rigid[gpu] — the original repro, now passes.
  • QD_WANTED_ARCHS=amdgpu pytest tests/python/test_ad_ndarray.py tests/python/test_ad_ndarray_torch.py — all 86 AMDGPU-parametrized tests pass. Covers the launcher fix and the adstack enablement.
  • Re-run on a CUDA host to confirm no regression on the CUDA launcher (CUDA path unchanged; new parametrizations add AMDGPU rows without removing CUDA rows).

@duburcqa duburcqa changed the title [AMDGPU ] Fix gradient computation. [AMDGPU] Fix gradient computation. Apr 15, 2026
@hughperkins
Copy link
Copy Markdown
Collaborator

Are there any other tests that have cuda, but not amdgpu?

@hughperkins
Copy link
Copy Markdown
Collaborator

@claude review

Comment thread tests/python/test_ad_ndarray_torch.py
@duburcqa
Copy link
Copy Markdown
Contributor Author

duburcqa commented Apr 15, 2026

Within the AD ndarray scope of this PR, no — the two files that gated ndarray-AD coverage (test_ad_ndarray.py, test_ad_ndarray_torch.py) both went through archs_support_ndarray_ad, so flipping the list covers every test in them.

The rest of the AD suite (test_ad_for.py, test_ad_if.py, test_loop_grad.py, ...) uses require=qd.extension.adstack with no explicit arch=, so they take the default from expected_archs()/get_archs(). Since this PR also adds amdgpu to get_archs() and declares adstack for AMDGPU in extension.cpp, those now parametrize on AMDGPU automatically.

There is one AD-adjacent test still pinned to CUDA: test_torch_ad.py::test_torch_cuda_context (arch=qd.cuda). It is intentionally CUDA-specific — it asserts torch._C._cuda_hasPrimaryContext(0), which is a CUDA-only torch internal — so it should stay CUDA-only.

Outside AD, plenty of tests still use arch=[qd.cpu, qd.cuda] (test_print.py, test_function.py, test_atomic.py, test_return.py, test_overflow.py, test_matrix.py, test_ndarray.py, ...). Expanding those to AMDGPU is out of scope for this PR — it would need separate validation on the AMD runner and is unrelated to the grad-ptr unwrap bug.

Copy link
Copy Markdown
Collaborator

@hughperkins hughperkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 🙌

@duburcqa duburcqa force-pushed the duburcqa/fix_amdgpu_grad branch from 977a125 to bfa3cb8 Compare April 16, 2026 09:26
@duburcqa duburcqa enabled auto-merge (squash) April 16, 2026 09:26
Comment thread quadrants/program/extension.cpp
@duburcqa
Copy link
Copy Markdown
Contributor Author

Update: the broader arch expansion (adding amdgpu/metal/vulkan to tests outside this PR's AD scope) is now tracked in #488. That PR declares all unit-tested extensions for AMDGPU, expands 121+ test decorators to include amdgpu, and adds metal/vulkan where SPIR-V can handle them. Validated on MI250X hardware.

@duburcqa duburcqa merged commit 7274da9 into main Apr 16, 2026
71 of 72 checks passed
@duburcqa duburcqa deleted the duburcqa/fix_amdgpu_grad branch April 16, 2026 10:47
gpinkert added a commit to ROCm/quadrants that referenced this pull request Apr 28, 2026
…mbodied-AI#486 against pei.zhang DeviceScratchBuffer)

Adapts upstream commit 7274da9 ("[AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)")
onto the current amd-integration kernel_launcher.cpp, which has diverged
from the upstream baseline due to pei.zhang's per-handle DeviceScratchBuffer
caching + the exp12_diag branch counters.

Without this fix, KernelLauncher::launch_llvm_kernel passes the raw host
grad_ptr straight into ctx.set_ndarray_ptrs(). The generated kernel then
dereferences a host address as if it were a device pointer, producing
silently-wrong gradients (and occasionally faults) for every AMDGPU
autodiff workload that uses ndarrays.

What the merge does:
- Pulls grad_ptr out of ctx.array_ptrs alongside data_ptr.
- kNone + on-device path: routes grad_ptr through device_ptrs[grad_ptr_idx]
  with no extra copy (caller already put it on the device).
- kNone + host-copy path: allocates a second device buffer, H2Ds the host
  grad into it, registers the (host_grad, devalloc) pair in `transfers`
  so the existing post-launch loop handles D2H writeback + free
  automatically. Reuses the cached device_result_buffer for the second
  allocate_memory_on_device call -- safe because that helper is
  stream-synchronous from the host's perspective.
- Ndarray path: unwraps grad_ptr's DeviceAllocation* exactly the same way
  data_ptr is unwrapped, so set_ndarray_ptrs gets a real device address.
- All three branches now pass device_ptrs[grad_ptr_idx] (was
  ctx.array_ptrs[grad_ptr_idx], a host pointer) into set_ndarray_ptrs.

Companion changes mirror upstream:
- quadrants/program/extension.cpp: declare Extension::adstack supported
  on AMDGPU so the autodiff lowering pass actually runs.
- tests/python/test_ad_ndarray*.py: enable the ndarray-AD tests on amdgpu.
- tests/python/test_ad_ndarray_torch.py::test_tensor_shape: relax exact
  fp32 equality to torch.allclose() on AMDGPU (the per-element adjoint sum
  loses bit-exactness; CUDA happens to round to exactly 1.0).

Co-Authored-By: Alexis DUBURCQ <alexis.duburcq@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants