
How is tree reduction implemented? #545

Closed
xnning opened this issue Aug 9, 2021 · 15 comments

@xnning

xnning commented Aug 9, 2021

We are trying to get a better understanding of the tree reduction used in NCCL, in particular when we specify os.environ['NCCL_ALGO'] = 'Tree'.

We have read this developer blog and still have the following questions:

  1. Is AllReduce always implemented using 2-tree, or does NCCL implement n-tree where n is decided using heuristics?
  2. Will AllReduce divide and pipeline the data, and if so, how is the number of chunks picked?
  3. It seems that the reduce and the broadcast phases can also be pipelined; is that implemented in NCCL?
  4. What would the n-tree algorithm be for AllGather/ReduceScatter, or do they just fall back to rings?

Apologies if there are too many questions; any reply to any of them would be greatly appreciated. Thank you!

@sjeaugey
Member

  1. The two complementary trees are all we need to extract the full bandwidth from one NIC. However, if we have more than one NIC, there will be a tree pair per NIC, each converging to its NIC intra-node and connecting NIC n of every node with two trees.
  2. Tree allreduce is indeed a reduce-broadcast, so data is pipelined to flow up to the root, then down to the leaves of the tree. Choosing the number and size of chunks is a pretty complex thing, as we need enough chunks for good pipelining, but not chunks so small that per-chunk overhead dominates. You can find a part of that progressive logic here (see also the toy sketch after this list).
  3. Yes, in fact most of the time we have a constant flow of reduce and broadcast, splitting each SM into two parts, which lets us give more threads to reduction and fewer to broadcast. There are still a few cases where we do the reduce, then the broadcast, as seen here.
  4. Only allreduce has a 2-tree algorithm. Others use rings. We could use it for broadcast/reduce, but that would require recomputing and connecting the trees for each root, which we do not do currently. I don't see how the 2-tree algorithm could be applied to AllGather and ReduceScatter.
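
To make the chunk-size trade-off in point 2 concrete, here is a toy cost model (a sketch only; this is not NCCL's tuning logic, and the bandwidth/overhead constants below are made up purely for illustration):

def pipeline_time(total_bytes, chunk_bytes, tree_depth, link_bw, per_chunk_overhead):
    # First chunk traverses 2*tree_depth hops (reduce up, then broadcast down);
    # every additional chunk finishes one pipeline step later.
    nchunks = max(1, total_bytes // chunk_bytes)
    step = chunk_bytes / link_bw + per_chunk_overhead
    return (2 * tree_depth + nchunks - 1) * step

# Sweep the chunk size for a 64 MiB buffer: more chunks improve pipelining,
# but very small chunks make the fixed per-chunk overhead dominate.
for chunk_kib in (64, 256, 1024, 4096, 16384):
    t = pipeline_time(64 << 20, chunk_kib << 10, tree_depth=4,
                      link_bw=12e9, per_chunk_overhead=5e-6)
    print(f"chunk {chunk_kib:>6} KiB -> {t * 1e3:.2f} ms")

With these made-up numbers the sweep bottoms out around 1 MiB chunks: larger chunks lose pipelining, smaller chunks pay too much per-chunk overhead.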

@xnning
Author

xnning commented Aug 13, 2021

Thank you very much for your reply! That's very helpful. Just to make sure that I understand correctly:

  1. Does that mean there will be one two-tree per NIC that connects all GPUs intra-node, and another two-tree that connects all the NICs?
  2. Could you explain why in some cases we do reduce/broadcast separately? Looking at the code, one example I could imagine falling into runTreeUpDown is running on A100 GPUs with CUDA version 11.3. Is there a reason why that case cannot be pipelined?

This also makes me wonder what would happen if the reduction is performed on a subset of devices, say we have devices 0-7, where 0-3 are connected via one NIC and 4-7 via another NIC, respectively. How many trees are constructed for an AllReduce on [0,2,4,6], [1,3,5,7]?

To make it hopefully more concrete:

  ------------------ dc
     |          |
    NIC0       NIC1
     |          |
---------   ----------
|  |  |  |  |  |  |  |
0  1  2  3  4  5  6  7

It seems to me that in this case we will need to construct one two-tree per device group. Then because of "one two-tree per NIC", we will construct:

  1. one two-tree A for NIC0 that connects [0, 2], one two-tree B for NIC1 that connects [4, 6], and one two-tree that connects the roots of A and B
  2. one two-tree C for NIC0 that connects [1, 3], one two-tree D for NIC1 that connects [5, 7], and one two-tree that connects the roots of C and D

I could be totally wrong. I think my major confusion is the idea of "one two-tree per NIC". It would be great if you could elaborate a bit more on that. Thank you!

@sjeaugey
Member

  1. There is no intra-node tree; it's a simple chain which connects to the inter-node tree.
  2. It's usually a matter of performance tuning, not a functional issue.

To answer your question about dual NIC, it really depends on whether we have NVLink or not between the GPUs.

  1. If we have NVLink, then we'll have two two-trees between nodes, and two intra-node chains using NVLink, each converging on a GPU close to its NIC.
  2. If we don't have NVLink, NCCL will not use the two NICs (since intra-node communication would be the bottleneck).

@xnning
Author

xnning commented Aug 13, 2021

Update: Ah, I think I'm starting to get it. Initially I thought there would be a tree connecting all GPUs, but no: there are only inter-node trees, and intra-node the GPUs are connected using "chains".

Could you explain more about how chains work? I dug a bit into past issues and the only impression I have now is that "chains" are not really "rings", and that a chain converges on a "GPU close to the NIC". What does "GPU close to the NIC" mean?

I am trying to summarize what I have understood so far, please correct me if I am wrong. Say the network topology has 4 nodes, where each node has 2 NICs and each NIC has 4 GPUs, e.g., a node like:

  ------------------ dc
     |          |
    NIC0       NIC1
     |          |
---------   ----------
|  |  |  |  |  |  |  |
0  1  2  3  4  5  6  7

  1. There are two trees for inter-node communication, because there are two NICs. Each tree has 4 nodes.
  2. Inside each node, we form a chain. A chain might be something like (suppose gpu0 is the "gpu close to NIC")
reduce: gpu0 -> 1 -> 2 -> 3 -> NIC0 -> node1 ... (the inter-node two-tree part)
broadcast: (the inter-node two-tree part) ... -> node1 -> NIC0 -> gpu0 -> 1 -> 2 -> 3

I guess this is the Tree reduction, and there are also SplitTree and BalancedTree, which seem to differ in which GPU talks to the NIC? The Tree case does look like a unidirectional ring? It seems BalancedTree is the default, from init.cc?

But since there are two trees, how should the roots talk to each other? Because it now seems like the first tree will collect reduce results [0-3, 8-11, ...] and the second tree will collect reduce results [4-7, 12-15, ...].

@sjeaugey
Member

  1. There are two network-side tree pairs, one tree pair per NIC. They connect the nodes in this pattern: 0<->2<->{1,3} and 3<->1<->{0,2} ... for each NIC.
  2. Let's start with the simplest: the TREE. Inside the node we have a chain starting from a GPU close to the NIC, for example 0 for NIC 0 and 4 for NIC 1, and connecting ALL GPUs through NVLink.

So considering NIC0, for the reduction, data would go from e.g. GPU 7 to 6, 5, 4, 3, 2, 1, then 0, then follow the inter-node tree up to GPU 0 of node 0 for the first tree or GPU 0 of node 3 for the second tree. Then it would go down in the opposite direction for the broadcast.

So, for the reduction on 4 nodes (up the tree), the intra node path would be:

7->6->5->4->3->2->1->0
15->14->13->12->11->10->9->8
23->22->21->20->19->18->17->16
31->30->29->28->27->26->25->24

And inter-node:

{8,24}->16->0
{0,16}->8->24

So in total for the basic tree (GPU 0 does all the communication with the tree down ranks):

            7->6->5->4->3->2->1->0
                                 ^
15->14->13->12->11->10->9->8---. |
                               v |
  23->22->21->20->19->18->17->16-'
                               ^
31->30->29->28->27->26->25->24-'

and

        7->6->5->4->3->2->1->0-.
                               v
    15->14->13->12->11->10->9->8-.
                               ^ |
23->22->21->20->19->18->17->16-' |
                                 v
    31->30->29->28->27->26->25->24

For split or balanced tree, GPU 0 of each node (ranks 0, 8, 16, 24) no longer does all the work; another GPU, e.g. GPU 1 (ranks 1, 9, 17, 25), does part of it, leveraging the intra-node chain.

Split tree (GPU 1 does the communication with the down ranks):

                   7->6->5->4->3->2->1->0
                                     ^
15->14->13->12->11->10->9->8---.     |
                               v     |
      23->22->21->20->19->18->17->16-'
                               ^
31->30->29->28->27->26->25->24-'

and

        7->6->5->4->3->2->1->0-.
                               v
       15->14->13->12->11->10->9->8-.
                               ^    |
23->22->21->20->19->18->17->16-'    |
                                    v
           31->30->29->28->27->26->25->24

Balanced tree (GPUs 0 and 1 do the communication with one down rank each):

                7->6->5->4->3->2->1->0
                                     ^
    15->14->13->12->11->10->9->8---. |
                                   v |
      23->22->21->20->19->18->17->16-'
                               ^
31->30->29->28->27->26->25->24-'

and

           7->6->5->4->3->2->1->0-.
                                  v
       15->14->13->12->11->10->9->8-.
                               ^    |
23->22->21->20->19->18->17->16-'    |
                                    v
       31->30->29->28->27->26->25->24
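
For reference, here is a small script that just reconstructs the "basic tree" diagrams above for NIC0's tree pair (4 nodes x 8 GPUs); it only illustrates the rank pattern and does not mirror NCCL's internal data structures:

NODES, GPUS_PER_NODE = 4, 8

def intra_node_chain(node):
    # Chain ending at rank 0 of the node (the GPU "close to" NIC0): 7->6->...->0, etc.
    base = node * GPUS_PER_NODE
    return list(range(base + GPUS_PER_NODE - 1, base - 1, -1))

# Inter-node tree pair from the comment above ({8,24}->16->0 and {0,16}->8->24),
# stored as child -> parent over rank 0 of each node.
tree0_parent = {8: 16, 24: 16, 16: 0}   # root: rank 0 (node 0)
tree1_parent = {0: 8, 16: 8, 8: 24}     # root: rank 24 (node 3)

def reduce_path(node, tree_parent):
    # Up-the-tree path for one node's data: intra-node chain, then inter-node hops.
    path = intra_node_chain(node)
    while path[-1] in tree_parent:
        path.append(tree_parent[path[-1]])
    return path

for name, parents in (("tree 0", tree0_parent), ("tree 1", tree1_parent)):
    print(name)
    for node in range(NODES):
        print("  " + "->".join(map(str, reduce_path(node, parents))))

The split and balanced variants shown above only change which intra-node GPU (0, 1, or both) handles the inter-node connections to the down ranks; the intra-node chains stay the same.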

@xnning
Author

xnning commented Aug 16, 2021

Thank you very much for this detailed explanation!! That's super helpful.

  1. There are two network-side tree pairs, one tree pair per NIC. They connect the nodes in this pattern: 0<->2<->{1,3} and 3<->1<->{0,2} ... for each NIC.
  2. Let's start with the simplest: the TREE. Inside the node we have a chain starting from a GPU close to the NIC, for example 0 for NIC 0 and 4 for NIC 1, and connecting ALL GPUs through NVLink.

So considering NIC0, for the reduction, data would go from e.g. GPU 7 to 6, 5, 4, 3, 2, 1, then 0, then follow the inter-node tree up to GPU 0 of node 0 for the first tree or GPU 0 of node 3 for the second tree. Then it would go down in the opposite direction for the broadcast.

If I understand "one tree pair per NIC" correctly, at the same time for NIC1, for the reduction, data would go through e.g. GPUs 3, 2, 1, 0, 7, 6, 5, then 4 (because we choose 4 as the GPU close to that NIC).
So in total for the basic tree, we have the following four trees pipelined. Is that correct?
(Apologies that this thread has been getting long. I believe this will be my last question.)

(0) from NIC0, tree 0 in the tree pair

            7->6->5->4->3->2->1->0
                                 ^
15->14->13->12->11->10->9->8---. |
                               v |
  23->22->21->20->19->18->17->16-'
                               ^
31->30->29->28->27->26->25->24-'

(1) from NIC1, tree 0 in the tree pair

            3->2->1->0->7->6->5->4
                                 ^
11->10->9->8->15->14->13->12---. |
                               v |
  19->18->17->16->23->22->21->20-'
                               ^
27->26->25->24->31->30->29->28-'

(2) from NIC0, tree 1 in the tree pair

        7->6->5->4->3->2->1->0-.
                               v
    15->14->13->12->11->10->9->8-.
                               ^ |
23->22->21->20->19->18->17->16-' |
                                 v
    31->30->29->28->27->26->25->24

(3) from NIC1, tree 1 in the tree pair

        3->2->1->0->7->6->5->4-.
                               v
    11->10->9->8->15->14->13->12-.
                               ^ |
19->18->17->16->23->22->21->20-' |
                                 v
    27->26->25->24->31->30->29->28
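
For completeness, NIC1's intra-node chain from diagrams (1) and (3) is a small variation of the NIC0 sketch earlier in the thread: it converges on GPU 4 of each node instead of GPU 0 (again, just an illustration of the rank pattern):

def nic1_chain(node, gpus_per_node=8):
    # Chain converging on GPU 4 of each node, matching 3->2->1->0->7->6->5->4
    # (plus the node's rank offset) in diagrams (1) and (3).
    base = node * gpus_per_node
    return [base + g for g in (3, 2, 1, 0, 7, 6, 5, 4)]

print("->".join(map(str, nic1_chain(0))))   # 3->2->1->0->7->6->5->4
print("->".join(map(str, nic1_chain(1))))   # 11->10->9->8->15->14->13->12

The corresponding inter-node tree pair then runs over rank 4 of each node ({12,28}->20->4 and {4,20}->12->28), as in the diagrams.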

@sjeaugey
Member

Yes, that's all correct.

@xnning
Author

xnning commented Aug 16, 2021

Great, thank you so much!! Your help is greatly appreciated!

@xnning closed this as completed Aug 16, 2021
@zegao96

zegao96 commented Dec 29, 2021

Hi @sjeaugey, by a simple chain for the intra-node connection, do you mean the intra-node dataflow is actually linearly pipelined? Does this hold for pcie-only traffic (with nvlink for intra nodes)? Thanks.

@sjeaugey
Member

sjeaugey commented Jan 3, 2022

do you mean the intra-node dataflow is actually linearly pipelined?

Yes I guess that's what I mean. Chain/pipeline is the same to me.

Does this hold for pcie-only traffic (with nvlink for intra nodes)

I'm a bit confused. To me, a system is either PCI-only or uses NVLink. What do you mean?

@zegao96

zegao96 commented Jan 6, 2022

@sjeaugey Sorry for the typo, I meant systems without NVLink. Are all GPUs on one node still chained, or organized as a tree? And for what reason? Thanks.

@sjeaugey
Member

sjeaugey commented Jan 6, 2022

Yes, even without NVLink we still have a chain intra-node. Now on PCI platforms, the Tree algorithm cannot achieve full bandwidth (only 2/3) because the intra-node chain uses the same PCI link as the inter-node communication. So the tuning will adjust to switch to rings earlier.

Using a tree intra-node may result in even better latency, but bandwidth-wise it's trickier to get good performance, so we would only win for a very small size range, which is usually not what we target.

@kanghui0204

Hello, I see that in the new NCCL version, split tree isn't used any more. Compared with balanced tree, does it have any drawbacks?

@sjeaugey
Member

No, we moved to balanced tree everywhere because it was always better than split tree.

@orion-iluvatar

Thanks, man. Big help!
