Question: No threadfence_system in LLGenericOp in all_reduce ring implementation #765

nayakajay · 2022-12-28T14:04:36Z

I was trying to understand the code flow of all_reduce (when called from nccl_tests/all_reduce_perf) and wanted to confirm a few things. I ran the code on a Google Cloud instance with 2 A100 GPUs.
For the AllReduce implementation, ncclKernel_AllReduce_RING_LL_Sum_float is the kernel being called. I believe this kernel is constructed using the macros in common.h.

The stitched-together kernel calls the runRing function from all_reduce.h, which then uses the primitives from prims_ll.h (correct me if I am mistaken).

From my understanding, the communication happens across GPUs inside the kernel. That being the case, there should be some synchronization among threads across GPUs using threadfence_system, particularly in the LLGenericOp function. I see the waitSend and postRecv operations, which seem to synchronize (some sort of acquire and release operation, respectively), but they do not seem to use memory fences.

By comparison, the SendRecv primitives do use the fence. Am I missing something here?


jbachan · 2022-12-28T18:51:32Z

LL was designed to not need threadfence_system; that is one of the major reasons it can achieve such low latencies. threadfence_system is required to order updates to memory targeting different addresses, for instance when you don't want signal=1 to become visible until after payload=<data>. LL works by putting the signal bits and payload bits into the same 64-bit word. Since 64-bit words are handled atomically by all components of the architecture, it is impossible for just the signal bits to change before the payload bits. LL splits each 64-bit word into 32 bits of payload and 32 bits of signal, so it can only achieve 50% of the available bandwidth. LL128 follows the same principle but relies on the "riskier" assumption of 128-byte atomicity (which is not honored by all archs, so we actively disable it from the host side when we don't detect supported hardware), and can achieve 120 bytes of payload per 128 bytes sent (93.75% bandwidth efficiency).
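To make the packing idea concrete, here is a minimal host-side sketch in C++. It is an analogy built on std::atomic, not NCCL's device code, and every name in it (packLL, sendLL, recvLL, and so on) is hypothetical. It assumes the flag occupies the upper 32 bits and that a flag value of 0 means "slot empty"; the actual layout in prims_ll.h may differ.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helpers: one LL-style slot is a single 64-bit word.
// Assumed layout: upper 32 bits = flag (step number), lower 32 bits = payload.
inline uint64_t packLL(uint32_t payload, uint32_t flag) {
    return (static_cast<uint64_t>(flag) << 32) | payload;
}
inline uint32_t flagOf(uint64_t word)    { return static_cast<uint32_t>(word >> 32); }
inline uint32_t payloadOf(uint64_t word) { return static_cast<uint32_t>(word); }

// Sender: a single 64-bit store publishes payload and flag together,
// so no fence is needed to order "payload before flag".
inline void sendLL(std::atomic<uint64_t>& slot, uint32_t payload, uint32_t flag) {
    assert(flag != 0);  // 0 is reserved here to mean "slot empty"
    slot.store(packLL(payload, flag), std::memory_order_relaxed);
}

// Receiver: spin until the expected flag shows up. Because flag and payload
// live in the same atomically written word, the payload read alongside a
// matching flag necessarily comes from that same store.
inline uint32_t recvLL(const std::atomic<uint64_t>& slot, uint32_t flag) {
    uint64_t word;
    do {
        word = slot.load(std::memory_order_relaxed);
    } while (flagOf(word) != flag);
    return payloadOf(word);
}
```

In use, the sender would call sendLL(slot, value, step) and a receiver running elsewhere would call recvLL(slot, step) with the same step number. Only relaxed ordering is used, which reflects the point above: the atomicity of the single word does the work that a fence would otherwise do.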

nayakajay · 2023-01-03T05:36:50Z

Compared to that, SendRecv does use __threadfence_system. Is there any reason why the same method was not used in prims_simple.h?

jbachan · 2023-01-03T17:36:56Z

The reason prims_simple uses threadfence is so that it can reach 100% bandwidth; it just has to pay the latency cost that the threadfence incurs. The LL protocols avoid the latency hit of the fence but also can't reach 100% bandwidth.
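For contrast, the fence-based pattern can be sketched on the host in the same way. This is again a C++ analogy with hypothetical names (SimpleChannel, sendSimple, recvSimple, payloadFraction), where std::atomic_thread_fence stands in for the role __threadfence_system plays on the device; it is not how prims_simple.h is actually written.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical channel: the payload buffer and the ready flag live at
// different addresses, so their relative order must be enforced explicitly.
struct SimpleChannel {
    static constexpr std::size_t kWords = 8;
    uint64_t payload[kWords] = {};    // every byte here is useful data
    std::atomic<uint32_t> ready{0};   // separate signal word, 0 = nothing posted
};

// Sender: write the whole payload, fence, then raise the flag.
inline void sendSimple(SimpleChannel& ch, const uint64_t* src, uint32_t step) {
    assert(step != 0);
    for (std::size_t i = 0; i < SimpleChannel::kWords; ++i) ch.payload[i] = src[i];
    std::atomic_thread_fence(std::memory_order_release);  // payload before flag
    ch.ready.store(step, std::memory_order_relaxed);
}

// Receiver: wait for the flag, fence, then read the payload.
inline void recvSimple(SimpleChannel& ch, uint64_t* dst, uint32_t step) {
    while (ch.ready.load(std::memory_order_relaxed) != step) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire);  // flag before payload
    for (std::size_t i = 0; i < SimpleChannel::kWords; ++i) dst[i] = ch.payload[i];
}

// Fraction of transmitted bytes that carry payload rather than signalling.
constexpr double payloadFraction(double payloadBytes, double totalBytes) {
    return payloadBytes / totalBytes;
}
```

With the figures quoted in this thread, payloadFraction(4, 8) gives 0.5 for LL and payloadFraction(120, 128) gives 0.9375 for LL128, while the fence-based buffer carries payload in every word and pays for that with one fence per batch.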

nayakajay · 2023-01-03T18:04:46Z

Thanks, closing the issue now.

nayakajay closed this as completed Jan 3, 2023

Zha0q1 mentioned this issue Jul 1, 2023

Memory consistency across GPUs #903

Closed
