Reduce and Allreduce NVLS implementations for the cuda backend #6038

nsarka merged 33 commits into NVIDIA:main
Conversation
Greptile Summary: This PR extends the CUDA (NVLS) backend with Reduce and Allreduce collective implementations, completing the set alongside the existing Broadcast and Allgather. Both collectives share the same ld_reduce kernel and synchronize through a symmetric semaphore tensor.

Confidence Score: 4/5
Sequence Diagram:

```mermaid
sequenceDiagram
    participant A as Rank A (all ranks)
    participant Sym as Symmetric Buffer (NVLS MC VA)
    participant B as Rank B (all ranks)
    Note over A,B: postAllreduceWithCudaBackend
    A->>Sym: cudaMemcpyAsync(inputBuffer, input)
    B->>Sym: cudaMemcpyAsync(inputBuffer, input)
    A->>A: cuStreamWriteValue32(local_sem[B]=kInProgress)
    B->>B: cuStreamWriteValue32(local_sem[A]=kInProgress)
    A->>B: cuStreamWaitValue32(B.sem[A]==kInProgress)
    B->>A: cuStreamWaitValue32(A.sem[B]==kInProgress)
    A->>A: launchMulticastReduceKernel(mc_ptr → output)
    B->>B: launchMulticastReduceKernel(mc_ptr → output)
    A->>B: cuStreamWriteValue32(B.sem[A]=kIdle)
    B->>A: cuStreamWriteValue32(A.sem[B]=kIdle)
    Note over A,B: waitAllreduceWithCudaBackend
    A->>A: cuStreamWaitValue32(local_sem[B]==kIdle)
    B->>B: cuStreamWaitValue32(local_sem[A]==kIdle)
    Note over A,B: postReduceWithCudaBackend (root=R)
    A->>Sym: cudaMemcpyAsync(inputBuffer, input)
    A->>A: cuStreamWriteValue32(own_sem=kInProgress)
    alt rank == root
        A->>B: cuStreamWaitValue32(B.sem==kInProgress)
        A->>Sym: launchMulticastReduceKernel(mc_ptr → output)
        A->>B: cuStreamWriteValue32(B.sem=kIdle)
    end
    Note over A,B: waitReduceWithCudaBackend
    B->>B: cuStreamWaitValue32(own_sem==kIdle)
```
Reviews (11): Last reviewed commit: "Add barrier"
```cuda
size_t n_vec = n_bytes / 16;

for (size_t i = idx; i < n_vec; i += stride) {
  float r0, r1, r2, r3;
  const void* addr = mc_src_c + i * 16;
  asm volatile(
      "multimem.ld_reduce.global.add.v4.f32 {%0,%1,%2,%3}, [%4];"
      : "=f"(r0), "=f"(r1), "=f"(r2), "=f"(r3)
      : "l"(addr)
      : "memory");
  float4 out;
  out.x = r0;
  out.y = r1;
  out.z = r2;
  out.w = r3;
  ((float4*)dst_c)[i] = out;
}
```
Tail elements silently dropped when n_bytes is not a 16-byte multiple
The kernel computes n_vec = n_bytes / 16 using integer division, so any bytes in the range [n_vec*16, n_bytes) are silently ignored and never reduced. The caller (launchMulticastReduceKernelImpl) does check size % 16 == 0 via NVF_CHECK, so this cannot be triggered through the normal code path. However, the public launchMulticastReduceKernel wrapper (used by tests, per the header comment) does not perform that check, leaving callers free to pass a size that is not a multiple of 16 and silently get wrong results.
Consider adding the alignment assertion inside the kernel or inside launchMulticastReduceKernel itself so test callers also get a clear error.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
!test

!test

!test

!test
```cpp
// NOLINTNEXTLINE(performance-enum-size)
enum class IpcSemaphore : cuuint32_t { kIdle, kInProgress };

// Basic IPC handle for legacy P2P communication using cudaIpc* APIs
```
@samnordmann I remember you said some IPC handles can be removed in favor of symmetric memory. Is this one example of that?
Yes, or more precisely: this class could stay untouched as an interface, but its members and implementation could replace all the legacy cudaIpc* APIs with cuMem* etc.
```cpp
// Copy input to symmetric buffer
NVFUSER_CUDA_RT_SAFE_CALL(cudaMemcpyAsync(
    reduce_handle->inputBuffer().data_ptr(),
    input.data_ptr(),
    size,
    cudaMemcpyDeviceToDevice,
    stream));
```
It would be more efficient to operate directly on the user's input buffer, so we get a true zero-copy, one-shot algorithm. We would need to assume the input buffer comes from symmetric memory.
Thanks! I'll make a new PR for this just so it can be smaller.
Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
!test
Built on top of #5620. Adds Reduce and Allreduce NVLS implementations. Both use the same ld_reduce kernel and synchronize using a symmetric integer tensor as a semaphore.