NCCL ALGOs aren't quite documented #790

stas00 · 2023-02-23T05:34:04Z

I have been trying to understand primarily:

When is the tree vs. ring algorithm used?
Which of them is superior, and when?

The official docs are scarce and unclear.

Perhaps this issue could be used to update the state of things with the recent versions of NCCL, e.g. 2.14+.

I read a bunch of older issues on this topic here, but they seem to be talking about older nccl versions. So I found no definitive answers.

So my questions if you don't mind answering are:

1. When is the tree vs. ring algorithm used?

In earlier NCCL versions, one used threshold env vars to control when NCCL switched between tree and ring.

The only thing I discovered from the current docs is that
NCCL_ALGO controls this behavior.

The default is Tree,Ring,CollnetDirect,CollnetChain.

That's all the info, with no explanation. What does it mean?

1a. Does it mean that it'll consider these algos?
1b. Is the order important or does it just use it to know which algos to consider?
1c. When does NCCL switch from one algo to another.

2. Which of them is superior, and when?

It'd be awesome to have some sort of brain/experience dump on this question which could help others to build upon such a starter.

Here is what I found that was useful, yet confusing as the information isn't quite consistent (written by different people).

I found this blog, which says that tree is superior to ring in both latency and bandwidth
here it says:

So basically we have 2 algorithms: a flat ring (high latency, best bandwidth) and a tree (low latency, ok bandwidth). On top of that we have collnet, but that's only if you have a plugin that supports network-based collectives (like SHARP).

but I'm not sure if this is still the case with the latest versions and the blog post I linked just above contradicts this. (it says tree has better latency than ring)
this comment looked very useful but doesn't go into bottom line latency/bw discussion

3. benchmarking

Since there is no clarity about when each algo is engaged, I'm not sure how to do benchmarking correctly. For example, when can a 4-node bandwidth measurement be extrapolated to estimate the performance of a much larger cluster, given that each algo will scale differently with many more nodes? Did anybody at NVIDIA publish some numbers like this blog comparing the latest tree to ring algo at various scales?

I don't know if perhaps I have missed some important questions if you are willing to meet my challenge and perhaps you could kindly add your own notes from your experience that would help others deploy NCCL better.

Thank you very much!


sjeaugey · 2023-02-23T14:10:34Z

Hi Stas,

There used to be a static threshold, but it's been replaced by a more complex tuning system. The new system builds a model of the latency and bandwidth of each algorithm/protocol combination (that's many, many combinations) and decides which one should perform best depending on the size. So there is no longer an env var and a static value, which is good because the performance of each algorithm depends on the number of nodes and the number of GPUs per node, and therefore we need to navigate a 2D space of algo/protocols, which isn't easy. You can always force one algorithm with NCCL_ALGO=TREE or NCCL_ALGO=RING and see what performance you get and whether NCCL switches at the right point. I know it's hard to understand, but it's also the best solution we found to have the best performance across all platforms and users without users having to manually tune the switch points. Downside is, if you want to manually tune things, you can't.
1a. Yes, NCCL considers all the algorithms/protocols it can use and decides on the best one for a particular operation.
1b. The order doesn't matter.
1c. It's complicated.
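To make the "force one algorithm and compare" suggestion concrete, here is a minimal sketch that prepares three benchmark runs: one auto-tuned, one with NCCL_ALGO=TREE and one with NCCL_ALGO=RING. It only prints the commands instead of launching them. The binary path and arguments are assumptions based on a typical nccl-tests build (`all_reduce_perf`), and `build_runs` is a hypothetical helper, not part of NCCL.

```python
import os
import shlex

# Assumed path and arguments for an nccl-tests binary; adjust for your system.
BENCH_CMD = ["./build/all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "1"]


def build_runs(base_env, algos=("TREE", "RING")):
    """Return (label, env) pairs: one auto-tuned run plus one run per forced algo."""
    auto_env = dict(base_env)
    auto_env.pop("NCCL_ALGO", None)  # let NCCL's own tuning pick the algorithm
    runs = [("auto", auto_env)]
    for algo in algos:
        env = dict(base_env)
        env["NCCL_ALGO"] = algo  # debug-only override, not meant for production
        runs.append((algo, env))
    return runs


if __name__ == "__main__":
    # Print the commands rather than running them, so this is safe anywhere.
    for label, env in build_runs(os.environ):
        prefix = f"NCCL_ALGO={env['NCCL_ALGO']} " if "NCCL_ALGO" in env else ""
        print(f"# {label}")
        print(prefix + shlex.join(BENCH_CMD))
```

Comparing the per-size results of the forced runs against the auto-tuned run shows whether NCCL switches algorithms at a sensible point on a given system.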
Roughly speaking, ring is superior in terms of peak bandwidth (except on 2 nodes), tree is superior in terms of base latency (especially as we scale). Bandwidth = Size / Time, so whether you look at the time or the bandwidth for a given size, it will be a combination of both the peak bandwidth and the base latency. For a fixed size, as you scale, the base latency of ring will become dominant and tree will be better.
Extrapolating at scale is not that hard for ring and tree (we have a function in tuning.cc predicting it, based on the ring linear latency and the tree log latency with reduced BW). Now as you scale, there are many factors which may cause your real performance to be very far off the prediction, like routing. Also note that on an IB network you'll be able to use SHARP; that way your latency stays mostly constant as you scale, your bandwidth doesn't degrade much either, and you're always better than both ring and tree.
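The trade-off described above can be illustrated with a toy model in which time = base latency + size / peak bandwidth, ring latency grows linearly with the number of ranks, and tree latency grows logarithmically with somewhat reduced bandwidth. This is only a sketch: every constant below (per-hop latency, peak bandwidth, the tree bandwidth factor) is a made-up assumption, and the formulas are not the ones in NCCL's tuning.cc.

```python
import math


def ring_time_us(size_bytes, n_ranks, hop_latency_us=5.0, peak_bw_gbps=100.0):
    """Toy ring model: base latency grows linearly with the number of ranks."""
    latency = hop_latency_us * 2 * (n_ranks - 1)
    return latency + size_bytes / (peak_bw_gbps * 1e3)  # 1 GB/s = 1e3 bytes/us


def tree_time_us(size_bytes, n_ranks, hop_latency_us=5.0, peak_bw_gbps=100.0,
                 bw_factor=0.9):
    """Toy tree model: base latency grows with log2(ranks), bandwidth is reduced."""
    latency = hop_latency_us * 2 * math.ceil(math.log2(n_ranks))
    return latency + size_bytes / (peak_bw_gbps * bw_factor * 1e3)


def pick_algo(size_bytes, n_ranks):
    """Return whichever toy model predicts the shorter time."""
    ring = ring_time_us(size_bytes, n_ranks)
    tree = tree_time_us(size_bytes, n_ranks)
    return "ring" if ring < tree else "tree"


if __name__ == "__main__":
    size = 64 * 2**20  # fixed 64 MiB message
    for n in (2, 4, 8, 16, 64, 256):
        t_ring = ring_time_us(size, n)
        t_tree = tree_time_us(size, n)
        print(f"ranks={n:4d} ring={t_ring:9.1f}us tree={t_tree:9.1f}us "
              f"-> {pick_algo(size, n)}")
```

With these assumed constants, the fixed-size sweep shows ring winning at small scale and tree taking over as the linear latency term grows, which is the qualitative behavior described in the comment. The toy model does not capture the 2-node exception or effects such as routing.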

stas00 · 2023-02-24T05:28:33Z

Thank you very much for your detailed, rich, and to-the-point answers, Sylvain. You're awesome!

If at all possible, could these be put in the docs? You could even use a FAQ format, so you can just copy my questions and your answers without needing to polish anything to a high standard. Thank you!

sjeaugey · 2023-02-24T13:51:54Z

We try to limit the documentation to how to use NCCL and not delve into how NCCL is implemented. This is because the implementation changes all the time and we don't have the bandwidth to maintain implementation documentation.

So, for the moment, everything which is related to the implementation (not the usage) ends up being documented in our GTC talks each year, and here, answering questions.

Also, keep in mind there are two sorts of environment variables: the ones which users may legitimately have to set, for example NCCL_SOCKET_IFNAME or NCCL_CROSS_NIC, because they are settings tied to the system, and the ones that are here mostly for debugging and workarounds (NCCL_ALGO is one of those). Those env vars are mainly for us to help users debug or diagnose issues, not for users to set on their own. They're minimally documented because they're not supposed to be used in production. NCCL is supposed to detect which ALGO and PROTO are ok to use and which are best to use. Setting NCCL_PROTO=LL128, for example, on a platform that doesn't support it can cause silent data corruption.
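One way to act on this split is a small pre-launch check that flags debug-category variables left in the environment. The sketch below uses only the four variables named in the comment above; the two sets are not an official or complete classification, and `find_debug_overrides` is a hypothetical helper.

```python
import os

# Classification taken from the comment above; not an exhaustive or official list.
SYSTEM_VARS = {"NCCL_SOCKET_IFNAME", "NCCL_CROSS_NIC"}  # tied to the system
DEBUG_VARS = {"NCCL_ALGO", "NCCL_PROTO"}  # debugging and workarounds only


def find_debug_overrides(env):
    """Return the sorted names of debug-category NCCL variables set in env."""
    return sorted(name for name in env if name in DEBUG_VARS)


if __name__ == "__main__":
    for name in find_debug_overrides(os.environ):
        print(f"warning: {name}={os.environ[name]} is a debug override; "
              "consider unsetting it for production runs")
```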

stas00 · 2023-02-24T21:36:48Z

I again find this explanation very helpful, Sylvain.

I think even adding a short para like your comment above will go a long way, since when there are no docs and questions arise users like myself try to find the answers. e.g. I didn't know I should be looking for GTC talks to find out about the latest state of NCCL - it's surely obvious to you, but it wasn't for me. I tried to google answers and found very little out there. None of your videos came up.

As I'm making my own notes while dealing with various NCCL problems, I'm finding, exactly as your comment says, that I need to separate the env vars into at least 2 groups: operational ones and debug ones.

You completely answered my queries. Thank you.

sjeaugey · 2023-02-27T08:43:23Z

Yes, separating env vars based on those two categories has been on our todolist for some time, but we haven't done it yet. Sorry about that.

stas00 closed this as completed Feb 24, 2023

samiwilf mentioned this issue Sep 15, 2023

torchrec: Super slow allreduce in multi-node multi-gpu setting facebookresearch/dlrm#356

Closed

zigzagcai mentioned this issue Nov 13, 2023

[Bug] NCCL all_reduce failed with A800 when NCCL_ALGO uses Ring #1055

Open
