
Is there any indicator or flag showing which algorithm a communication is using? #754

Closed
mozizhao opened this issue Dec 12, 2022 · 14 comments

Comments

@mozizhao

Hi,

I know there are three types of algorithms that NCCL could use to communicate among ranks in the all-reduce scenario (ring, tree, and another), depending on the cluster's environment settings. But in my case I cannot tell which algorithm is actually being used. Is there any indicator or flag in the log that tells the user which algorithm is in use?

@mozizhao
Author

@sjeaugey Could you please help answer this question?

@sjeaugey
Member

Short answer is no, there is no such log. You can uncomment this line to see it:
https://github.com/NVIDIA/nccl/blob/master/src/enqueue.cc#L1158

As for why: we can't print a line on every call (that would have a performance impact), and we also can't easily deduce which algorithm we're using.
There are 4 algorithms now (Ring, Tree, CollnetChain and CollnetDirect) × 3 protocols (LL, LL128, Simple). Restricting to Ring and Tree, that's still 6 combinations.

The internal tuning model will switch from one to the other depending on the number of nodes, the number of GPUs per node, and the size of the operation, but also on the intra-node and inter-node bandwidth. NCCL will move from one combination to another depending on the size; for example, it could do tree/LL -> ring/LL -> tree/LL128 -> ring/LL128 -> tree/simple -> ring/simple, or any other sequence. So there is no easy rule to determine which algorithm/protocol is used.
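To make that crossover behavior concrete, here is a toy sketch (this is not NCCL's actual tuning code, and the latency/bandwidth numbers below are invented for illustration) that picks the combination minimizing latency + size/bandwidth at each message size:

```python
# Toy illustration of size-dependent algorithm/protocol selection.
# The (latency_us, bandwidth_GBps) values are made up; NCCL's real tuning
# model derives its numbers from the hardware and applies extra corrections.
COMBOS = {
    "tree/LL":     (20.0, 1.0),
    "ring/LL":     (35.0, 1.5),
    "tree/simple": (120.0, 8.0),
    "ring/simple": (200.0, 12.0),
}

def predicted_time_us(size_bytes, lat_us, bw_gbps):
    # 1 GB/s = 1e9 bytes/s, so bytes / (bw * 1e3) yields microseconds
    return lat_us + size_bytes / (bw_gbps * 1e3)

for size in (1 << 10, 1 << 16, 1 << 20, 1 << 26):
    best = min(COMBOS, key=lambda c: predicted_time_us(size, *COMBOS[c]))
    print(f"{size:>10} bytes -> {best}")
```

With these made-up numbers the winner shifts from tree/LL to ring/LL to tree/simple to ring/simple as the size grows, which is the kind of sequence described above.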

@mozizhao
Author

mozizhao commented Dec 12, 2022 via email

@sjeaugey
Member

I'm not sure what you mean. Yes, for each size, the algorithm/protocol is deterministic. But it's still dependent on the size.

@mozizhao
Author


Hi Sylvain Jeaugey,

Sorry for my unclear expression; let me try to state my thought more clearly. In the code https://github.com/NVIDIA/nccl/blob/master/src/graph/tuning.cc, the algorithm (ring, tree, etc.) is determined during initialization, based on the environment and data, and never changed afterwards. Would it be reasonable for me to add some logs in this file to determine which algorithm is selected for a given environment and data? To my knowledge, the code in this file runs only once, so it should not incur much performance degradation.

The reason I am curious about the selected algorithm is that I want to build a prediction model that forecasts the mini-batch time of multi-GPU, multi-node training, for a research idea. Sometimes the communication overhead is linear in the number of workers/nodes and sometimes it is not; I suspect this depends on the algorithm selection, so awareness of the algorithm may be important for my prediction. If you find I have gone astray, could you please give me some pointers?

Thanks a lot.

@sjeaugey
Member

Sure, you could add traces to extract the settings.

You can also set NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=TUNING to get the numbers, and then run ncclTopoGetAlgoTime() on the size you care about.
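For anyone launching from a Python training script: NCCL reads these variables when the communicator is created, so they must be in the environment before that point. A minimal sketch (the `torch.distributed` reference is just an example of where NCCL initialization typically happens):

```python
import os

# Must be set before the process creates its first NCCL communicator
# (e.g. before torch.distributed.init_process_group("nccl")).
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "TUNING"

# ... then initialize the framework and run a collective; the tuning
# tables (latency/bandwidth per algorithm x protocol) appear in the log.
```

Setting the variables in the shell before launching the job works equally well.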

@mozizhao
Author

Hi, I downloaded the source code from GitHub and uncommented the line https://github.com/NVIDIA/nccl/blob/master/src/enqueue.cc#L1178 to check which algorithm it uses. Then I built and reinstalled NCCL following https://github.com/NVIDIA/nccl/blob/master/README.md. However, it didn't output any log as expected. Should I remove the original NCCL on my host first?

@sjeaugey
Member

You may want to modify makefiles/version.mk to set the suffix to e.g. debug. Then rebuild, install and when you run, set NCCL_DEBUG=VERSION to print which version you're using. You should see 2.16.2debug.

@mozizhao
Author


Hi Sylvain Jeaugey,

Thanks for your quick reply. Following your comments, I found that my current version is not labeled debug, meaning that I didn't actually replace NCCL with my customized version. This is what I ran after modifying the source code:

make -j src.build NVCC_GENCODE="-gencode=arch=compute_86,code=sm_86"

and then:

make pkg.debian.build

Neither command threw any error to the console. Am I going about customizing the NCCL code the wrong way? I am new to programs written in C, so maybe this is a silly question. Thanks for your reply!

@sjeaugey
Member

Did you install the deb packages with dpkg -i build/pkg/deb/*.deb?

Now you may just want to try that version without installing it. In that case you can just set:

export LD_LIBRARY_PATH=$NCCL_HOME/build/lib:$LD_LIBRARY_PATH

And then run again.

@mozizhao
Author


Hi,

Since I am testing NCCL under PyTorch, I found that I may need to rebuild PyTorch with a customized NCCL path, as discussed in https://discuss.pytorch.org/t/how-can-i-change-nccl-version-in-pytorch/143942. I will try rebuilding to see whether it works. Anyway, thank you for your advice.

@mozizhao
Author

mozizhao commented Dec 13, 2022


Hi, I want to know whether NCCL always picks the combination yielding the minimum predicted time in the last row when running AllReduce. In that case, is CollNet/LL selected as the runtime algorithm?

autodl-container-035011a13c-c17b1392:9373:9456 [0] NCCL INFO Latency/AlgBw |    Tree/    LL |    Tree/ LL128 |    Tree/Simple |    Ring/    LL |    Ring/ LL128 |    Ring/Simple | CollNet/    LL | CollNet/ LL128 | CollNet/Simple |
autodl-container-035011a13c-c17b1392:9373:9456 [0] NCCL INFO  Max NThreads |            512 |            640 |            512 |            512 |            640 |            256 |            512 |            640 |            512 |
autodl-container-035011a13c-c17b1392:9373:9456 [0] NCCL INFO     Broadcast |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     6.3/   0.8 |    14.0/   0.0 |    18.0/   3.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
autodl-container-035011a13c-c17b1392:9373:9456 [0] NCCL INFO        Reduce |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     6.3/   0.8 |    14.0/   0.0 |    18.0/   3.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
autodl-container-035011a13c-c17b1392:9373:9456 [0] NCCL INFO     AllGather |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    33.3/   0.8 |    52.0/   0.0 |   105.6/   3.2 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
autodl-container-035011a13c-c17b1392:9373:9456 [0] NCCL INFO ReduceScatter |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |    33.3/   0.8 |    52.0/   0.0 |   105.6/   3.2 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
autodl-container-035011a13c-c17b1392:9373:9456 [0] NCCL INFO     AllReduce |    35.2/   0.4 |    49.8/   0.0 |   280.0/   1.4 |    63.0/   0.4 |    94.0/   0.0 |   202.8/   1.6 |    14.2/   0.0 |    14.8/   0.0 |    29.7/   0.0 |

@sjeaugey
Member

The first number is the latency and the second is the bandwidth. Anything with 0 bandwidth will be dismissed (infinite time). LL has low latency but low bandwidth; LL128 has more latency and more bandwidth; Simple has even more latency and even more bandwidth.

The formula is not always just latency + size/bandwidth; for Tree we have a correction factor to apply (it's ugly, but that's what we have at the moment).
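Applying the base latency + size/bandwidth formula to the AllReduce row of the tuning table above (latency in µs, bandwidth in GB/s) makes the selection logic concrete. Note this sketch ignores NCCL's Tree correction factor, so it is only an approximation of the actual choice; the zero-bandwidth CollNet/LL128 and CollNet/Simple columns are omitted since they behave like CollNet/LL:

```python
# AllReduce (latency_us, bandwidth_GBps) pairs copied from the log above.
ALLREDUCE = {
    "Tree/LL":     (35.2, 0.4),
    "Tree/LL128":  (49.8, 0.0),
    "Tree/Simple": (280.0, 1.4),
    "Ring/LL":     (63.0, 0.4),
    "Ring/LL128":  (94.0, 0.0),
    "Ring/Simple": (202.8, 1.6),
    "CollNet/LL":  (14.2, 0.0),
}

def time_us(size_bytes, lat_us, bw_gbps):
    if bw_gbps == 0.0:
        return float("inf")  # zero bandwidth => dismissed (infinite time)
    return lat_us + size_bytes / (bw_gbps * 1e3)

for size in (4096, 1 << 20):
    best = min(ALLREDUCE, key=lambda c: time_us(size, *ALLREDUCE[c]))
    print(f"{size:>8} bytes -> {best}")
```

With these numbers, CollNet/LL is never selected despite its low latency, because its measured bandwidth is zero; at small sizes Tree/LL wins, and around 1 MiB Ring/Simple takes over.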

@mozizhao
Author


Thanks!
