
How is tree reduction implemented? #545

Closed
xnning opened this issue Aug 9, 2021 · 15 comments

@xnning

xnning commented Aug 9, 2021

We are trying to get a better understanding of the tree reduction used in NCCL, in particular when we specify os.environ['NCCL_ALGO'] = 'Tree'.

We have read this developer blog and still have the following questions:

  1. Is AllReduce always implemented using 2-tree, or does NCCL implement n-tree where n is decided using heuristics?
  2. Will AllReduce divide and pipeline the data, and if so, how is the number of chunks picked?
  3. It seems that the reduce and the broadcast phases can also be pipelined; is that implemented in NCCL?
  4. What would the n-tree algorithm be for AllGather/ReduceScatter, or do they just fall back to rings?

Apologies if there are too many questions; any reply to any of them would be greatly appreciated. Thank you!

@sjeaugey
Member

  1. The two complementary trees are all we need to extract the full bandwidth from one NIC. However, if we have more than one NIC, there will be a tree pair per NIC, each converging to its NIC intra-node and connecting NIC n of every node with two trees.
  2. Tree allreduce is indeed a reduce-broadcast, so data is pipelined to flow up to the root, then down to the leaves of the tree. Choosing the number and size of chunks is a pretty complex thing, as we need enough chunks for good pipelining, but not chunks so small that per-chunk overhead dominates. You can find a part of that progressive logic here (see also the toy sketch after this list).
  3. Yes, in fact most of the time we have a constant flow of reduce and broadcast, splitting each SM into two parts, which lets us give more threads to reduction and fewer to broadcast. There are still a few cases where we do the reduce, then the broadcast, as seen here.
  4. Only allreduce has a 2-tree algorithm. Others use rings. We could use it for broadcast/reduce, but that would require recomputing and connecting the trees for each root, which we do not do currently. I don't see how the 2-tree algorithm could be applied to AllGather and ReduceScatter.
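
To make the chunk-size trade-off in point 2 concrete, here is a toy cost model (a sketch only; this is not NCCL's tuning logic, and the bandwidth/overhead constants below are made up purely for illustration):

def pipeline_time(total_bytes, chunk_bytes, tree_depth, link_bw, per_chunk_overhead):
    # First chunk traverses 2*tree_depth hops (reduce up, then broadcast down);
    # every additional chunk finishes one pipeline step later.
    nchunks = max(1, total_bytes // chunk_bytes)
    step = chunk_bytes / link_bw + per_chunk_overhead
    return (2 * tree_depth + nchunks - 1) * step

# Sweep the chunk size for a 64 MiB buffer: more chunks improve pipelining,
# but very small chunks make the fixed per-chunk overhead dominate.
for chunk_kib in (64, 256, 1024, 4096, 16384):
    t = pipeline_time(64 << 20, chunk_kib << 10, tree_depth=4,
                      link_bw=12e9, per_chunk_overhead=5e-6)
    print(f"chunk {chunk_kib:>6} KiB -> {t * 1e3:.2f} ms")

With these made-up numbers the sweep bottoms out around 1 MiB chunks: larger chunks lose pipelining, smaller chunks pay too much per-chunk overhead.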

@xnning
Author

xnning commented Aug 13, 2021

Thank you very much for your reply! That's very helpful. Just to make sure that I understand correctly:

  1. Does that mean there will be one two-tree per NIC that connects all GPUs intra-node, and another two-tree that connects all the NICs?
  2. Could you explain why in some cases we do reduce/broadcast separately? Looking at the code, one example I could imagine falling into runTreeUpDown is running on A100 GPUs with CUDA version 11.3. Is there a reason why that case cannot be pipelined?

This also makes me wonder what would happen if the reduction is performed on a subset of devices, say we have devices 0-7, where 0-3 are connected via one NIC and 4-7 via another NIC, respectively. How many trees are constructed for an AllReduce on [0,2,4,6], [1,3,5,7]?

To make it hopefully more concrete:

  ------------------ dc
     |          |
    NIC0       NIC1
     |          |
---------   ----------
|  |  |  |  |  |  |  |
0  1  2  3  4  5  6  7

It seems to me that in this case we will need to construct one two-tree per device group. Then because of "one two-tree per NIC", we will construct:

  1. one two-tree A for NIC0 that connects [0, 2], one two-tree B for NIC1 that connects [4, 6], and one two-tree that connects the roots of A and B
  2. one two-tree C for NIC0 that connects [1, 3], one two-tree D for NIC1 that connects [5, 7], and one two-tree that connects the roots of C and D

I could be totally wrong. I think my major confusion is the idea of "one two-tree per NIC". It would be great if you could elaborate a bit more on that. Thank you!

@sjeaugey
Member

  1. There is no intra-node tree; it's a simple chain which connects to the inter-node tree.
  2. It's usually a matter of performance tuning, not a functional issue.

To answer your question about dual NIC, it really depends on whether we have NVLink or not between the GPUs.

  1. If we have NVLink, then we'll have two two-trees between nodes, and two intra-node chains using NVLink, each converging on a GPU close to its NIC.
  2. If we don't have NVLink, NCCL will not use the two NICs (since intra-node communication would be the bottleneck).

@xnning
Author

xnning commented Aug 13, 2021

Update: Ah, I think I'm starting to get it. Initially I thought there would be a tree connecting all GPUs, but no: there are only inter-node trees, and intra-node the GPUs are connected using "chains".

Could you explain more about how chains work? I dug a bit into past issues and the only impression I have now is that "chains" are not really "rings", and that a chain converges on a "GPU close to the NIC". What does "GPU close to the NIC" mean?

I am trying to summarize what I have understood so far, please correct me if I am wrong. Say the network topology has 4 nodes, where each node has 2 NICs and each NIC has 4 GPUs, e.g., a node like:

  ------------------ dc
     |          |
    NIC0       NIC1
     |          |
---------   ----------
|  |  |  |  |  |  |  |
0  1  2  3  4  5  6  7

  1. There are two trees for inter-node communication, because there are two NICs. Each tree has 4 nodes.
  2. Inside each node, we form a chain. A chain might be something like (suppose gpu0 is the "gpu close to NIC")
reduce: gpu0 -> 1 -> 2 -> 3 -> NIC0 -> node1 ... (the inter-node two-tree part)
broadcast: (the inter-node two-tree part) ... -> node1 -> NIC0 -> gpu0 -> 1 -> 2 -> 3

I guess this is the Tree reduction, and there are also SplitTree and BalancedTree, which seem to differ in which GPU talks to the NIC? The Tree case does look like a unidirectional ring? It seems BalancedTree is the default, from init.cc?

But since there are two trees, how should the roots talk to each other? Because it now seems like the first tree will collect reduce results [0-3, 8-11, ...] and the second tree will collect reduce results [4-7, 12-15, ...].

@sjeaugey
Member

  1. There are two network-side tree pairs, one tree pair per NIC. They connect the nodes in this pattern: 0<->2<->{1,3} and 3<->1<->{0,2} ... for each NIC.
  2. Let's start with the simplest: the TREE. Inside the node we have a chain starting from a GPU close to the NIC, for example 0 for NIC 0 and 4 for NIC 1, and connecting ALL GPUs through NVLink.

So considering NIC0, for the reduction, data would go from e.g. GPU 7 to 6, 5, 4, 3, 2, 1, then 0, then follow the inter-node tree up to GPU 0 of node 0 for the first tree or GPU 0 of node 3 for the second tree. Then it would go down in the opposite direction for the broadcast.

So, for the reduction on 4 nodes (up the tree), the intra node path would be:

7->6->5->4->3->2->1->0
15->14->13->12->11->10->9->8
23->22->21->20->19->18->17->16
31->30->29->28->27->26->25->24

And inter-node:

{8,24}->16->0
{0,16}->8->24

So in total for the basic tree (GPU 0 does all the communication with the tree down ranks):

            7->6->5->4->3->2->1->0
                                 ^
15->14->13->12->11->10->9->8---. |
                               v |
  23->22->21->20->19->18->17->16-'
                               ^
31->30->29->28->27->26->25->24-'

and

        7->6->5->4->3->2->1->0-.
                               v
    15->14->13->12->11->10->9->8-.
                               ^ |
23->22->21->20->19->18->17->16-' |
                                 v
    31->30->29->28->27->26->25->24

For split or balanced tree, GPU 0 of each node (ranks 0, 8, 16, 24) no longer does all the work; another GPU, e.g. GPU 1 (ranks 1, 9, 17, 25), does part of it, leveraging the intra-node chain.

Split tree (GPU 1 does the communication with the down ranks):

                   7->6->5->4->3->2->1->0
                                     ^
15->14->13->12->11->10->9->8---.     |
                               v     |
      23->22->21->20->19->18->17->16-'
                               ^
31->30->29->28->27->26->25->24-'

and

        7->6->5->4->3->2->1->0-.
                               v
       15->14->13->12->11->10->9->8-.
                               ^    |
23->22->21->20->19->18->17->16-'    |
                                    v
           31->30->29->28->27->26->25->24

Balanced tree (GPUs 0 and 1 do the communication with one down rank each):

                7->6->5->4->3->2->1->0
                                     ^
    15->14->13->12->11->10->9->8---. |
                                   v |
      23->22->21->20->19->18->17->16-'
                               ^
31->30->29->28->27->26->25->24-'

and

           7->6->5->4->3->2->1->0-.
                                  v
       15->14->13->12->11->10->9->8-.
                               ^    |
23->22->21->20->19->18->17->16-'    |
                                    v
       31->30->29->28->27->26->25->24
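
For reference, here is a small script that just reconstructs the "basic tree" diagrams above for NIC0's tree pair (4 nodes x 8 GPUs); it only illustrates the rank pattern and does not mirror NCCL's internal data structures:

NODES, GPUS_PER_NODE = 4, 8

def intra_node_chain(node):
    # Chain ending at rank 0 of the node (the GPU "close to" NIC0): 7->6->...->0, etc.
    base = node * GPUS_PER_NODE
    return list(range(base + GPUS_PER_NODE - 1, base - 1, -1))

# Inter-node tree pair from the comment above ({8,24}->16->0 and {0,16}->8->24),
# stored as child -> parent over rank 0 of each node.
tree0_parent = {8: 16, 24: 16, 16: 0}   # root: rank 0 (node 0)
tree1_parent = {0: 8, 16: 8, 8: 24}     # root: rank 24 (node 3)

def reduce_path(node, tree_parent):
    # Up-the-tree path for one node's data: intra-node chain, then inter-node hops.
    path = intra_node_chain(node)
    while path[-1] in tree_parent:
        path.append(tree_parent[path[-1]])
    return path

for name, parents in (("tree 0", tree0_parent), ("tree 1", tree1_parent)):
    print(name)
    for node in range(NODES):
        print("  " + "->".join(map(str, reduce_path(node, parents))))

The split and balanced variants shown above only change which intra-node GPU (0, 1, or both) handles the inter-node connections to the down ranks; the intra-node chains stay the same.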

@xnning
Author

xnning commented Aug 16, 2021

Thank you very much for this detailed explanation!! That's super helpful.

  1. There are two network-side tree pairs, one tree pair per NIC. They connect the nodes in this pattern: 0<->2<->{1,3} and 3<->1<->{0,2} ... for each NIC.
  2. Let's start with the simplest: the TREE. Inside the node we have a chain starting from a GPU close to the NIC, for example 0 for NIC 0 and 4 for NIC 1, and connecting ALL GPUs through NVLink.

So considering NIC0, for the reduction, data would go from e.g. GPU 7 to 6, 5, 4, 3, 2, 1, then 0, then follow the inter-node tree up to GPU 0 of node 0 for the first tree or GPU 0 of node 3 for the second tree. Then it would go down in the opposite direction for the broadcast.

If I understand "one tree pair per NIC" correctly, at the same time for NIC1, for the reduction, data would go through e.g. GPUs 3, 2, 1, 0, 7, 6, 5, then 4 (because we choose 4 as the GPU close to that NIC).
So in total for the basic tree, we have the following four trees pipelined. Is that correct?
(Apologies that this thread has been getting long. I believe this will be my last question.)

(0) from NIC0, tree 0 in the tree pair

            7->6->5->4->3->2->1->0
                                 ^
15->14->13->12->11->10->9->8---. |
                               v |
  23->22->21->20->19->18->17->16-'
                               ^
31->30->29->28->27->26->25->24-'

(1) from NIC1, tree 0 in the tree pair

            3->2->1->0->7->6->5->4
                                 ^
11->10->9->8->15->14->13->12---. |
                               v |
  19->18->17->16->23->22->21->20-'
                               ^
27->26->25->24->31->30->29->28-'

(2) from NIC0, tree 1 in the tree pair

        7->6->5->4->3->2->1->0-.
                               v
    15->14->13->12->11->10->9->8-.
                               ^ |
23->22->21->20->19->18->17->16-' |
                                 v
    31->30->29->28->27->26->25->24

(3) from NIC1, tree 1 in the tree pair

        3->2->1->0->7->6->5->4-.
                               v
    11->10->9->8->15->14->13->12-.
                               ^ |
19->18->17->16->23->22->21->20-' |
                                 v
    27->26->25->24->31->30->29->28
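
For completeness, NIC1's intra-node chain from diagrams (1) and (3) is a small variation of the NIC0 sketch earlier in the thread: it converges on GPU 4 of each node instead of GPU 0 (again, just an illustration of the rank pattern):

def nic1_chain(node, gpus_per_node=8):
    # Chain converging on GPU 4 of each node, matching 3->2->1->0->7->6->5->4
    # (plus the node's rank offset) in diagrams (1) and (3).
    base = node * gpus_per_node
    return [base + g for g in (3, 2, 1, 0, 7, 6, 5, 4)]

print("->".join(map(str, nic1_chain(0))))   # 3->2->1->0->7->6->5->4
print("->".join(map(str, nic1_chain(1))))   # 11->10->9->8->15->14->13->12

The corresponding inter-node tree pair then runs over rank 4 of each node ({12,28}->20->4 and {4,20}->12->28), as in the diagrams.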

@sjeaugey
Member

Yes, that's all correct.

@xnning
Author

xnning commented Aug 16, 2021

Great, thank you so much!! Your help is greatly appreciated!

@xnning closed this as completed Aug 16, 2021
@zegao96

zegao96 commented Dec 29, 2021

Hi @sjeaugey, by a simple chain for the intra-node connection, do you mean the intra-node dataflow is actually linearly pipelined? Does this hold for pcie-only traffic (with nvlink for intra nodes)? Thanks.

@sjeaugey
Member

sjeaugey commented Jan 3, 2022

do you mean the intra-node dataflow is actually linearly pipelined?

Yes I guess that's what I mean. Chain/pipeline is the same to me.

Does this hold for pcie-only traffic (with nvlink for intra nodes)

I'm a bit confused. To me, a system is either PCI-only or uses NVLink. What do you mean?

@zegao96

zegao96 commented Jan 6, 2022

@sjeaugey Sorry for the typo, I meant systems without NVLink. Are all GPUs on one node still chained, or organized as a tree? And for what reason? Thanks.

@sjeaugey
Member

sjeaugey commented Jan 6, 2022

Yes, even without NVLink we still have a chain intra-node. Now on PCI platforms, the Tree algorithm cannot achieve full bandwidth (only 2/3) because the intra-node chain uses the same PCI link as the inter-node communication. So the tuning will adjust to switch to rings earlier.

Using a tree intra-node may result in even better latency, but bandwidth-wise it's trickier to get good performance, so we would only win for a very small size range, which is usually not what we target.

@kanghui0204

Hello, I see that in the new NCCL version, split tree isn't used any more. Compared with balanced tree, does it have any drawbacks?

@sjeaugey
Member

No, we moved to balanced tree everywhere because it was always better than split tree.

@orion-iluvatar

Thanks, man. Big help!
