[AMDGPU] Fix gradient computation. #486
Are there any other tests that have cuda, but not amdgpu?

@claude review

Within the AD ndarray scope of this PR, no — the two files that gated ndarray-AD coverage ( The rest of the AD suite ( There is one AD-adjacent test still pinned to CUDA: Outside AD, plenty of tests still use |
977a125 to bfa3cb8
Update: the broader arch expansion (adding amdgpu/metal/vulkan to tests outside this PR's AD scope) is now tracked in #488. That PR declares all unit-tested extensions for AMDGPU, expands 121+ test decorators to include amdgpu, and adds metal/vulkan where SPIR-V can handle them. Validated on MI250X hardware.
…mbodied-AI#486 against pei.zhang DeviceScratchBuffer)

Adapts upstream commit 7274da9 ("[AMDGPU] Fix gradient computation. (Genesis-Embodied-AI#486)") onto the current amd-integration kernel_launcher.cpp, which has diverged from the upstream baseline due to pei.zhang's per-handle DeviceScratchBuffer caching + the exp12_diag branch counters.

Without this fix, KernelLauncher::launch_llvm_kernel passes the raw host grad_ptr straight into ctx.set_ndarray_ptrs(). The generated kernel then dereferences a host address as if it were a device pointer, producing silently-wrong gradients (and occasionally faults) for every AMDGPU autodiff workload that uses ndarrays.

What the merge does:
- Pulls grad_ptr out of ctx.array_ptrs alongside data_ptr.
- kNone + on-device path: routes grad_ptr through device_ptrs[grad_ptr_idx] with no extra copy (caller already put it on the device).
- kNone + host-copy path: allocates a second device buffer, H2Ds the host grad into it, registers the (host_grad, devalloc) pair in `transfers` so the existing post-launch loop handles D2H writeback + free automatically. Reuses the cached device_result_buffer for the second allocate_memory_on_device call -- safe because that helper is stream-synchronous from the host's perspective.
- Ndarray path: unwraps grad_ptr's DeviceAllocation* exactly the same way data_ptr is unwrapped, so set_ndarray_ptrs gets a real device address.
- All three branches now pass device_ptrs[grad_ptr_idx] (was ctx.array_ptrs[grad_ptr_idx], a host pointer) into set_ndarray_ptrs.

Companion changes mirror upstream:
- quadrants/program/extension.cpp: declare Extension::adstack supported on AMDGPU so the autodiff lowering pass actually runs.
- tests/python/test_ad_ndarray*.py: enable the ndarray-AD tests on amdgpu.
- tests/python/test_ad_ndarray_torch.py::test_tensor_shape: relax exact fp32 equality to torch.allclose() on AMDGPU (the per-element adjoint sum loses bit-exactness; CUDA happens to round to exactly 1.0).

Co-Authored-By: Alexis DUBURCQ <alexis.duburcq@gmail.com>
The AMDGPU kernel launcher never unwrapped the grad DeviceAllocation when passing ndarray-with-grad parameters. CUDA does this correctly; AMDGPU copy-pasted the path but forgot the grad side.

The diff, side by side

CUDA launcher, quadrants/runtime/cuda/kernel_launcher.cpp:123-139 (Ndarray branch, DevAllocType != kNone):

AMDGPU launcher, quadrants/runtime/amdgpu/kernel_launcher.cpp:94-104 (same branch, pre-fix):
The kernel then treats that host pointer as a GPU address. Reading or writing it from the GPU raises hipErrorIllegalAddress. The error is asynchronous, so it surfaces at the next sync point: the synchronous mem_free(device_result_buffer) at kernel_launcher.cpp:153, which is exactly where the stack trace from the genesis test_differentiable_rigid[gpu] repro points.
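The difference between forwarding the grad handle and unwrapping it can be illustrated with a toy Python model. This is only a sketch and not the launcher code: `DeviceAllocation`, `DEVICE_BASE`, `is_device_address`, `resolve_pre_fix` and `resolve_post_fix` are hypothetical names invented for the illustration, and the address range is made up.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical address range, used only to tell device addresses apart.
DEVICE_BASE = 0x7000_0000_0000


@dataclass(frozen=True)
class DeviceAllocation:
    """Hypothetical host-side handle that wraps a device address."""
    device_addr: int


def is_device_address(ptr) -> bool:
    """True only for plain integers inside the toy device range."""
    return isinstance(ptr, int) and ptr >= DEVICE_BASE


def resolve_pre_fix(data: DeviceAllocation, grad: Optional[DeviceAllocation]):
    # Buggy shape: data is unwrapped, grad is forwarded as the host-side handle.
    return data.device_addr, grad


def resolve_post_fix(data: DeviceAllocation, grad: Optional[DeviceAllocation]):
    # Fixed shape: grad is unwrapped the same way as data when it is present.
    grad_addr = grad.device_addr if grad is not None else None
    return data.device_addr, grad_addr
```

In this model, `resolve_pre_fix` hands the kernel something that is not a device address for the grad slot, while `resolve_post_fix` returns a device address for both slots.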
Why the test exposes it precisely at loss.backward()
Secondary bug in the same file
quadrants/runtime/amdgpu/kernel_launcher.cpp:80-93 (external-array branch, on-host case) also fails to allocate a device copy of the grad buffer and transfer the host data into it, as CUDA does at lines 107-116. This is less likely to fire in practice (most external grads arrive as torch GPU tensors), but the breakage is symmetrical, so it is fixed here as well.
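The staging pattern the on-host case needs (allocate a device buffer, copy the host grad in, run the kernel, copy the result back, free) can be sketched with a toy device in Python. Everything here is an assumption made for illustration: `ToyDevice`, its methods, and `launch_with_staged_grad` are hypothetical and do not correspond to functions in the launcher; only the idea of a `transfers` list of (host buffer, device allocation) pairs comes from the commit message.

```python
class ToyDevice:
    """Hypothetical device: integer handles map to separate byte buffers."""

    def __init__(self):
        self._buffers = {}
        self._next_handle = 1

    def alloc(self, nbytes):
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = bytearray(nbytes)
        return handle

    def h2d(self, handle, host_buf):
        # Host-to-device copy.
        self._buffers[handle][:] = host_buf

    def d2h(self, handle, host_buf):
        # Device-to-host copy.
        host_buf[:] = self._buffers[handle]

    def free(self, handle):
        del self._buffers[handle]

    def view(self, handle):
        return self._buffers[handle]

    def live_allocations(self):
        return len(self._buffers)


def launch_with_staged_grad(device, host_grad, kernel):
    """Stage a host grad buffer on the device around one kernel call."""
    transfers = []  # (host buffer, device handle) pairs to write back later
    dev_grad = device.alloc(len(host_grad))
    device.h2d(dev_grad, host_grad)
    transfers.append((host_grad, dev_grad))

    kernel(device.view(dev_grad))  # the kernel only ever sees device memory

    # Post-launch loop: write results back to the host, then release.
    for host_buf, handle in transfers:
        device.d2h(handle, host_buf)
        device.free(handle)


def add_one(buf):
    """Stand-in kernel that accumulates 1 into every byte."""
    for i in range(len(buf)):
        buf[i] = (buf[i] + 1) % 256
```

Calling `launch_with_staged_grad(ToyDevice(), host, add_one)` on a `bytearray` leaves the accumulated values in `host` and no allocation alive on the toy device.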
The fix
Mirror CUDA's grad handling in quadrants/runtime/amdgpu/kernel_launcher.cpp:
Why unit tests didn't catch this
Three layered exclusions kept the buggy path out of CI:
arch-agnostic — the declaration was stale, not a capability statement.
All three are addressed:
One test loosened on AMDGPU only
test_ad_ndarray_torch.py::test_tensor_shape asserts a.grad[i] == 1.0 exactly on an fp32 reverse-mode adjoint sum. The result is analytically 1.0; on AMDGPU fp32 it lands at ~0.99999988 (CUDA happens to hit exactly 1.0). The check is loosened to torch.allclose on AMDGPU only; CPU and CUDA keep the strict == 1.0 check.
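A dependency-free sketch of why the relaxed check passes where strict equality fails. It reimplements the closeness criterion documented for torch.allclose, |input - other| <= atol + rtol * |other| with defaults rtol=1e-05 and atol=1e-08, in plain Python rather than calling torch. The helper names `to_fp32`, `allclose` and `grads_ok`, and the `"amdgpu"` arch string, are hypothetical and not taken from the test file.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def allclose(values, expected, rtol=1e-5, atol=1e-8):
    """Same criterion as torch.allclose, applied to a list of floats."""
    return all(abs(v - expected) <= atol + rtol * abs(expected) for v in values)


def grads_ok(grads, arch):
    """Tolerance check on AMDGPU only; strict equality everywhere else."""
    if arch == "amdgpu":
        return allclose(grads, 1.0)
    return all(g == 1.0 for g in grads)


# The value reported above, rounded to fp32: it sits just below 1.0.
amd_like = [to_fp32(0.99999988)] * 4
```

With these helpers, `grads_ok(amd_like, "amdgpu")` holds while `grads_ok(amd_like, "cuda")` does not, which mirrors the arch-specific relaxation: the deviation is about 1.2e-7, well inside the default tolerance of roughly 1e-5.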
Test plan