Bug Description
The CUDA tile backend is missing a same-shape assignment path from a register-backed tile into an existing shared-memory-backed tile. Kernels that keep tile operands live across wp.tile_matmul() and then assign the next register-loaded operands back to those variables require this storage migration.
One example is a pipelined tile matmul kernel:
import warp as wp

@wp.kernel
def matmul_pipelined(A: wp.array2d(dtype=float), B: wp.array2d(dtype=float), C: wp.array2d(dtype=float)):
    i, j = wp.tid()
    acc = wp.tile_zeros(shape=(8, 4), dtype=wp.float32)
    # Load the first pair of operands into registers.
    a = wp.tile_load(A, shape=(8, 8), offset=(i * 8, 0), storage="register")
    b = wp.tile_load(B, shape=(8, 4), offset=(0, j * 4), storage="register")
    count = int(A.shape[1] / 8)
    for k in range(1, count):
        # Prefetch the next operands while the current ones are still live.
        a_next = wp.tile_load(A, shape=(8, 8), offset=(i * 8, k * 8), storage="register")
        b_next = wp.tile_load(B, shape=(8, 4), offset=(k * 8, j * 4), storage="register")
        wp.tile_matmul(a, b, acc)
        # These reassignments require the register-to-shared assignment path.
        a = a_next
        b = b_next
    wp.tile_matmul(a, b, acc)
    wp.tile_store(C, acc, offset=(i * 8, j * 4))
Expected behavior:
- Forward assignment copies the register tile values into the shared tile destination.
- Backward assignment accumulates the shared destination adjoint into the register source adjoint.
- The overwritten shared destination adjoint is cleared after propagation.
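The three points above can be sketched as reference semantics in plain Python. This is only an illustration, not Warp code: flat lists stand in for same-shape tiles, and the function names assign_forward and assign_backward are hypothetical, chosen for this sketch rather than taken from the Warp sources.

```python
# Reference semantics only: plain Python lists stand in for tiles.
# assign_forward / assign_backward are hypothetical names, not Warp APIs.

def assign_forward(dest_shared, src_register):
    """Copy every element of the register tile into the existing shared tile."""
    assert len(dest_shared) == len(src_register), "same-shape assignment only"
    dest_shared[:] = src_register


def assign_backward(adj_dest_shared, adj_src_register):
    """Propagate the destination adjoint to the source, then clear it."""
    assert len(adj_dest_shared) == len(adj_src_register), "same-shape assignment only"
    for idx, grad in enumerate(adj_dest_shared):
        adj_src_register[idx] += grad  # accumulate, do not overwrite
        adj_dest_shared[idx] = 0.0     # destination value was overwritten in forward


# Forward: the shared destination takes the register values.
dest = [0.0, 0.0, 0.0, 0.0]
src = [1.0, 2.0, 3.0, 4.0]
assign_forward(dest, src)

# Backward: the source adjoint may already hold gradient from other uses.
adj_dest = [0.5, 0.5, 0.5, 0.5]
adj_src = [1.0, 0.0, 0.0, 0.0]
assign_backward(adj_dest, adj_src)
```

Accumulating into the source adjoint matters because the register tile may also feed other operations. Clearing the destination adjoint keeps the overwritten value from contributing gradient a second time.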
Actual behavior:
The native tile assignment overloads do not cover tile_shared_t <- tile_register_t, so this reassignment pattern is not handled correctly in forward execution or adjoint propagation.
System Information
- Warp version: main / development branch
- Backend: CUDA tile backend