How is tree reduction implemented? #545
|
Thank you very much for your reply! That's very helpful. Just to make sure that I understand correctly:
This also makes me wonder what would happen if the reduction is performed on a subset of devices. Say we have devices 0-7, where 0-3 are connected via one NIC and 4-7 are connected via another NIC. How many trees are constructed for an AllReduce on [0,2,4,6], [1,3,5,7]? To make it hopefully more concrete:
It seems to me that in this case we will need to construct one two-tree per device group. Then because of "one two-tree per NIC", we will construct:
I could be totally wrong. I think my major confusion is the idea of "one two-tree per NIC". It would be great if you could elaborate a bit more on that. Thank you! |
To answer your question about dual NIC, it really depends on whether we have NVLink or not between the GPUs.
|
Update: Ah, I think I'm starting to get it. Initially I thought there would be a tree connecting all GPUs, but no: there are only inter-node trees, and the GPUs within a node are connected using "chains". Could you explain more about how chains work? I dug through past issues, and the only impression I have so far is that "chains" are not really "rings", and that the chain converges on the "GPUs close to the NIC". What does "GPUs close to the NIC" mean? I am trying to summarize what I have understood so far; please correct me if I am wrong. If the network topology has 4 nodes, where each node has 2 NICs and each NIC has 4 GPUs, e.g., a node like:
I guess this is the
But since there are two trees, how should the roots talk to each other? It now seems like the first tree will collect reduce results [0-3, 8-11, ...] and the second tree will collect reduce results [4-7, 12-15, ...]. |
So considering NIC0, for the reduction, data would go from e.g. GPU 7 to 6, 5, 4, 3, 2, 1, then 0, then follow the inter-node tree up to GPU 0 of node 0 for the first tree or GPU 0 of node 3 for the second tree. Then it would go down in the opposite direction for the broadcast. So, for the reduction on 4 nodes (up the tree), the intra node path would be:
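As a rough illustration of the intra-node part only, here is a minimal sketch of the chain described above. It assumes 4 nodes with 8 GPUs each and global ranks numbered as node * 8 + gpu (so GPU 0 of each node is rank 0, 8, 16, 24). The function name `chain_up` is hypothetical and is not part of NCCL; the real chain order is derived by NCCL from the detected topology.

```python
NUM_NODES = 4        # assumption: 4 nodes, as in the example
GPUS_PER_NODE = 8    # assumption: 8 GPUs per node

def chain_up(node):
    """Global ranks in the order data flows during the reduction for NIC0.

    Data starts at the last GPU of the node and ends at GPU 0,
    which is the GPU that talks to the inter-node tree.
    """
    first = node * GPUS_PER_NODE
    return list(range(first + GPUS_PER_NODE - 1, first - 1, -1))

for node in range(NUM_NODES):
    print(f"node {node}: " + " -> ".join(str(r) for r in chain_up(node)))
```

For the broadcast (down the tree), the same chain would be walked in the opposite direction, i.e. `reversed(chain_up(node))`.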
And inter-node:
So in total for the basic tree (GPU 0 does all the communication with the tree down ranks):
and
For split or balanced tree, it's no longer GPU 0 of each node (0, 8, 16, 24) doing all the work, but also another GPU, e.g. GPU 1 (ranks 1, 9, 17, 25), doing part of the work, leveraging the intra-node chain. Split tree (GPU 1 does the communication with the down ranks):
and
Balanced tree (GPUs 0 and 1 do the communication with one down rank each):
and
|
Thank you very much for this detailed explanation!! That's super helpful.
If I understand "one tree pair per NIC" correctly, at the same time for NIC1, for the reduction, data would go through e.g. GPUs 3, 2, 1, 0, 7, 6, 5, 4 (because we choose 4 as the GPU close to the NIC).
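To make the two orderings concrete, here is a small sketch that produces the reduction order for a chain ending at a given GPU. It assumes the chain is simply the same GPU sequence rotated so that it ends at the GPU close to the NIC; that matches the two orders quoted in this thread, but the function `chain_order` is hypothetical and NCCL's actual ordering depends on the detected topology.

```python
def chain_order(nic_gpu, gpus_per_node=8):
    """Local GPU indices in the order data flows during the reduction.

    The chain walks downward through the GPU indices (wrapping around)
    and ends at nic_gpu, the GPU assumed to be closest to the NIC.
    """
    return [(nic_gpu - 1 - i) % gpus_per_node for i in range(gpus_per_node)]

print("NIC0:", chain_order(0))  # chain ending at GPU 0
print("NIC1:", chain_order(4))  # chain ending at GPU 4
```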
|
Yes, that's all correct. |
Great, thank you so much!! Your help is greatly appreciated! |
Hi @sjeaugey, by a simple chain for the intra-node connection, do you mean the intra-node dataflow is actually linearly pipelined? Does this hold for PCIe-only traffic (with NVLink for intra-node)? Thanks. |
Yes I guess that's what I mean. Chain/pipeline is the same to me.
I'm a bit confused. To me, a system is either PCI-only or uses NVLink. What do you mean? |
@sjeaugey Sorry for the typo. I meant systems without NVLink. Are all GPUs on one node still chained, or are they organized as a tree? And for what reason? Thanks. |
Yes, even without NVLink we still have a chain intra-node. Now on PCI platforms, the Tree algorithm cannot achieve full bandwidth (only 2/3) because the intra-node chain uses the same PCI link as the inter-node communication. So the tuning will adjust to switch to rings earlier. Using a tree intra-node may result in even better latency but bandwidth-wise it's trickier to get good performance, hence we would win only for a very low size range which is usually not what we target. |
Hello, I see that in the new NCCL version the split tree isn't used any more. Compared with the balanced tree, does it have any drawbacks? |
No, we moved to balanced tree everywhere because it was always better than split tree. |
Thanks, man. Big help! |
We are trying to get a better understanding of the tree reduction used in NCCL, in particular when we specify
os.environ['NCCL_ALGO'] = 'Tree'
We have read this developer blog and still have the following questions:
Apologies if questions are too many; any reply to any of the questions would be greatly appreciated. Thank you!
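For context, a minimal sketch of how the setting above is typically applied. NCCL reads `NCCL_ALGO` from the environment when it initializes, so the variable has to be set before the first communicator is created (for example, before the framework's distributed initialization call). The helper name `request_tree_algorithm` is hypothetical; `NCCL_DEBUG=INFO` is optional and only makes NCCL log more detail during initialization.

```python
import os

def request_tree_algorithm(debug=False):
    """Ask NCCL to use the Tree algorithm; call before NCCL is initialized."""
    os.environ["NCCL_ALGO"] = "Tree"
    if debug:
        # Optional: more verbose NCCL logging during initialization.
        os.environ["NCCL_DEBUG"] = "INFO"

request_tree_algorithm(debug=True)
print(os.environ["NCCL_ALGO"])
```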