Question: No threadfence_system in LLGenericOp in all_reduce ring implementation #765

Closed
nayakajay opened this issue Dec 28, 2022 · 4 comments

Comments

@nayakajay

I was trying to understand the code flow of all_reduce (when called from nccl_tests/all_reduce_perf) and wanted to confirm a few things. I ran the code on a Google Cloud instance with 2 A100 GPUs.
For the AllReduce implementation, ncclKernel_AllReduce_RING_LL_Sum_float is the kernel being called. I believe this kernel is constructed using the macros in common.h.

The stitched-together kernel calls the runRing function from all_reduce.h, which then uses the primitives from prims_ll.h (correct me here if I am mistaken).

From my understanding, the communication happens across GPUs inside the kernel. That being the case, there should be some synchronization among the threads across GPUs using threadfence_system, particularly in the LLGenericOp function. I see the waitSend and postRecv operations, which seem to synchronize (some sort of acquire and release operation, respectively), but they do not seem to use memory fences.

Compared to that, the SendRecv primitives use the fence. Am I missing something here?

@jbachan
Collaborator

jbachan commented Dec 28, 2022

LL was designed not to need threadfence_system, which is one of the major reasons it can achieve such low latencies. threadfence_system is required to order updates to memory targeting different addresses, for instance when you don't want signal=1 to become visible until after payload=<data>. LL works by putting the signal bits and payload bits into the same 64-bit word. Since 64-bit words are handled atomically by all components of the architecture, it would be impossible for just the signal bits to change before the payload bits. LL splits each 64-bit word into 32 bits of payload and 32 bits of signal, so it can only achieve 50% of the available bandwidth. LL128 follows the same principle but relies on the "riskier" assumption of 128-byte atomicity (which is not honored by all archs, so we actively disable it from the host side when we don't detect supported hardware), and can achieve 120 bytes of payload per 128 bytes sent (93.75% bandwidth efficiency).
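The same-word idea described above can be sketched on the host side in C++. This is only an analogy under stated assumptions, not NCCL's device code: the names `ll_pack`, `ll_send`, `ll_recv`, and `efficiency` are hypothetical, `std::atomic<uint64_t>` stands in for the atomically handled 64-bit word, and the flag is assumed to be nonzero so that it differs from a zero-initialized slot.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical host-side analogy of the LL idea; not taken from NCCL.

// Pack a 32-bit payload and a 32-bit flag into a single 64-bit word.
inline uint64_t ll_pack(uint32_t payload, uint32_t flag) {
    return (static_cast<uint64_t>(flag) << 32) | payload;
}
inline uint32_t ll_payload(uint64_t word) { return static_cast<uint32_t>(word); }
inline uint32_t ll_flag(uint64_t word) { return static_cast<uint32_t>(word >> 32); }

// Sender: one relaxed 64-bit store publishes payload and flag together,
// so no fence is needed to order "payload" before "flag".
inline void ll_send(std::atomic<uint64_t>& slot, uint32_t payload, uint32_t flag) {
    slot.store(ll_pack(payload, flag), std::memory_order_relaxed);
}

// Receiver: spin until the flag half matches, then take the payload
// from the very same loaded word (never from a second load).
inline uint32_t ll_recv(const std::atomic<uint64_t>& slot, uint32_t expected_flag) {
    uint64_t word;
    do {
        word = slot.load(std::memory_order_relaxed);
    } while (ll_flag(word) != expected_flag);
    return ll_payload(word);
}

// Fraction of transferred bytes that carry payload.
constexpr double efficiency(double payload_bytes, double total_bytes) {
    return payload_bytes / total_bytes;
}
```

With these helpers, `efficiency(4, 8)` gives the 50% figure for LL and `efficiency(120, 128)` gives 0.9375 for LL128, matching the payload-to-total ratios quoted in the comment.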

@nayakajay
Author

Compared to that, SendRecv does use __threadfence_system. Is there any reason why the same method as in LL was not used in prims_simple.h?

@jbachan
Collaborator

jbachan commented Jan 3, 2023

prims_simple uses threadfence so that it can reach 100% bandwidth; it just has to pay the latency cost that the threadfence incurs. The LL protocols avoid the latency hit of the fence but cannot reach 100% bandwidth.
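The fence-based alternative can be sketched the same way, again as a host-side C++ analogy rather than the actual prims_simple code. The `FencedChannel` type and the `fenced_send` / `fenced_try_recv` functions are hypothetical, and `std::atomic_thread_fence` is assumed here to play a role similar to `__threadfence_system()`: the payload buffer carries only data (full bandwidth), while a separate flag is published after a fence.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side analogy of a fence-based protocol; not taken from NCCL.
struct FencedChannel {
    uint32_t payload[4] = {};        // every byte is data, no flag bits interleaved
    std::atomic<uint32_t> ready{0};  // separate signal at a different address
};

// Sender: write the whole payload, fence, then publish the step number.
inline void fenced_send(FencedChannel& ch, const uint32_t (&src)[4], uint32_t step) {
    for (std::size_t i = 0; i < 4; ++i) ch.payload[i] = src[i];
    // Orders the payload writes before the flag write (the cost LL avoids).
    std::atomic_thread_fence(std::memory_order_release);
    ch.ready.store(step, std::memory_order_relaxed);
}

// Receiver: returns false if the expected step has not been published yet.
inline bool fenced_try_recv(FencedChannel& ch, uint32_t (&dst)[4], uint32_t step) {
    if (ch.ready.load(std::memory_order_relaxed) != step) return false;
    // Orders the flag read before the payload reads.
    std::atomic_thread_fence(std::memory_order_acquire);
    for (std::size_t i = 0; i < 4; ++i) dst[i] = ch.payload[i];
    return true;
}
```

The trade-off in the comment shows up directly: here all 16 payload bytes are data, but each hand-off pays for a fence, whereas the LL style spends half of every word on the flag and needs no fence.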

@nayakajay
Author

Thanks, closing the issue now.
