Question: No threadfence_system in LLGenericOp in all_reduce ring implementation #765
Comments
LL was designed to not need `threadfence_system`; SendRecv, by comparison, does use it.

The reason prims_simple uses a threadfence is so it can reach 100% bandwidth; it just has to suffer the cost of the latency hit the threadfence incurs. The LL protocols avoid the latency hit of the fence but also can't reach 100% bandwidth.
Thanks, closing the issue now.
I was trying to understand the code flow of all_reduce (when called from nccl_tests/all_reduce_perf) and wanted to confirm a few things. I have run the code on a Google Cloud instance with 2 A100 GPUs.

For the AllReduce implementation, `ncclKernel_AllReduce_RING_LL_Sum_float` is the kernel being called. I believe this kernel is constructed using the macros in common.h. The stitched-together kernel calls the runRing function from all_reduce.h, which then uses the primitives from prims_ll.h (correct me here if I am mistaken).

From my understanding, the communication happens across GPUs inside the kernel. That being the case, there should be some synchronization among the threads across GPUs using `threadfence_system`, particularly in the LLGenericOp function. I see the waitSend and postRecv operations, which seem to be synchronizing (some sort of acquire and release operation, respectively), but they do not seem to use memory fences. Compared to that, the SendRecv primitives do use the fence. Am I missing something here?