NCCL ALGOs aren't quite documented #790

Closed
stas00 opened this issue Feb 23, 2023 · 5 comments

@stas00

stas00 commented Feb 23, 2023

I have been trying to understand primarily:

  1. When is the tree vs. ring algorithm used?
  2. Which of them is superior, and when?

The official docs are scarce and unclear.

Perhaps this issue could be used to update the state of things with the recent versions of NCCL, e.g. 2.14+.

I read a bunch of older issues on this topic here, but they seem to be talking about older nccl versions. So I found no definitive answers.

So my questions if you don't mind answering are:

1. When is the tree vs. ring algorithm used?

In earlier NCCL versions, one used threshold env vars to control when NCCL switched between tree and ring.

The only thing I discovered from the current docs is that
NCCL_ALGO controls this behavior.

The default is Tree,Ring,CollnetDirect,CollnetChain.

That's all the info, with no explanation. What does it mean?

1a. Does it mean that it'll consider these algos?
1b. Is the order important or does it just use it to know which algos to consider?
1c. When does NCCL switch from one algo to another?

2. Which of them is superior, and when?

It'd be awesome to have some sort of brain/experience dump on this question which could help others to build upon such a starter.

Here is what I found that was useful, yet confusing as the information isn't quite consistent (written by different people).

  • I found this blog, which says that tree is superior to ring in both latency and bandwidth

  • here it says:

    So basically we have 2 algorithms: a flat ring (high latency, best bandwidth) and a tree (low latency, ok bandwidth). On top of that we have collnet, but that's only if you have a plugin that supports network-based collectives (like SHARP).

    but I'm not sure if this is still the case with the latest versions and the blog post I linked just above contradicts this. (it says tree has better latency than ring)

  • this comment looked very useful but doesn't go into bottom line latency/bw discussion

3. Benchmarking

Since there is no clarity about which algo is engaged when, I'm not sure how to do benchmarking correctly. For example, when can a 4-node bandwidth measurement be extrapolated to estimate the performance of a much larger cluster, given that each algo will scale differently with many more nodes? Did anybody at NVIDIA publish some numbers like this blog comparing the latest tree to ring algo at various scales?

I may have missed some important questions. If you are willing to take up my challenge, perhaps you could kindly add notes from your own experience that would help others deploy NCCL better.

Thank you very much!

@sjeaugey
Member

Hi Stas,

  1. There used to be a static threshold, but it's been replaced by a more complex tuning system. The new system builds a model of the latency and bandwidth of each algorithm/protocol combination (that's many, many combinations) and decides which one should perform best depending on the size. So there is no longer an env var and a static value, which is good because the performance of each algorithm depends on the number of nodes and the number of GPUs per node, and therefore we need to navigate a 2D space of algo/protocols, which isn't easy. You can always force one algorithm with NCCL_ALGO=TREE or NCCL_ALGO=RING and see what performance you get and whether NCCL switches at the right point. I know it's hard to understand, but it's also the best solution we found to have the best performance across all platforms and users without users having to manually tune the switch points. The downside is that if you want to manually tune things, you can't.
    1a. Yes, NCCL considers all algorithms/protocols it can use and decides on the best one for a particular operation.
    1b. The order doesn't matter.
    1c. It's complicated.
  2. Roughly speaking, ring is superior in terms of peak bandwidth (except on 2 nodes), and tree is superior in terms of base latency (especially as we scale). Bandwidth = Size / Time, so whether you look at the time or the bandwidth for a given size, it will be a combination of both the peak bandwidth and the base latency. For a fixed size, as you scale, the base latency of ring will become dominant and tree will be better.
  3. Extrapolating at scale is not that hard for ring and tree (we have a function in tuning.cc predicting it, based on the ring linear latency and the tree log latency with reduced BW). Now as you scale, there are many factors which may cause your real performance to be very far off the prediction, like routing. Also note that on an IB network you'll be able to use SHARP; that way your latency stays mostly constant as you scale, your bandwidth doesn't degrade much either, and you're always better than both ring and tree.
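
To make the ring vs. tree trade-off above concrete, here is a toy sketch of that kind of model: ring latency grows linearly with the number of ranks, tree latency grows logarithmically, and tree runs at a reduced bandwidth. This is not the code in tuning.cc. The function names, the per-hop latency, the peak bandwidth, and the tree bandwidth factor are all made-up assumptions for illustration, and the sketch ignores protocols, topology, routing, and the 2-node exception.

```python
import math


def predict_time(algo, size_bytes, n_ranks,
                 hop_latency_s=5e-6,    # assumed per-hop latency (hypothetical)
                 peak_bw=10e9,          # assumed peak bandwidth in bytes/s (hypothetical)
                 tree_bw_factor=0.8):   # assumed bandwidth reduction for tree (hypothetical)
    """Toy estimate: Time = base latency + Size / Bandwidth."""
    if algo == "ring":
        latency = hop_latency_s * n_ranks             # linear in the number of ranks
        bandwidth = peak_bw
    elif algo == "tree":
        latency = hop_latency_s * math.log2(n_ranks)  # logarithmic in the number of ranks
        bandwidth = peak_bw * tree_bw_factor          # reduced bandwidth
    else:
        raise ValueError(f"unknown algo: {algo}")
    return latency + size_bytes / bandwidth


def pick_algo(size_bytes, n_ranks, **model_params):
    """Return whichever of ring/tree the toy model predicts to be faster."""
    return min(("ring", "tree"),
               key=lambda algo: predict_time(algo, size_bytes, n_ranks, **model_params))


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # fixed message size of 64 MiB
    for n_ranks in (8, 64, 512, 4096):
        print(n_ranks, pick_algo(size, n_ranks))
```

With these made-up constants, the model picks ring for a fixed 64 MiB message at small rank counts and tree at large ones, which matches the qualitative statement above. Real switch points should be checked by measuring with NCCL_ALGO forced to each algorithm.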

@stas00
Author

stas00 commented Feb 24, 2023

Thank you very much for your detailed, rich, and to-the-point answers, Sylvain. You're awesome!

If at all possible, could these be put in the docs? You could even use a FAQ format, so you can just copy my questions and your answers without needing to polish anything to a high standard. Thank you!

@sjeaugey
Member

We try to limit the documentation to how to use NCCL and not delve into how NCCL is implemented. This is because the implementation changes all the time and we don't have the bandwidth to maintain implementation documentation.

So, for the moment, everything which is related to the implementation (not the usage) ends up being documented in our GTC talks each year, and here, answering questions.

Also, keep in mind there are two sorts of environment variables: the ones which users may legitimately have to set, for example NCCL_SOCKET_IFNAME or NCCL_CROSS_NIC, because they are settings tied to the system, and the ones that are here mostly for debugging and workarounds (NCCL_ALGO is one of those). Those env vars are mainly for us to help users debug or diagnose issues, not for users to set on their own. They're minimally documented because they're not supposed to be used in production. NCCL is supposed to detect which ALGO and PROTO are OK to use and which are best. Setting NCCL_PROTO=LL128, for example, on a platform that doesn't support it can cause silent data corruption.
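
One way to act on this split is a small pre-launch check that flags debug-only variables left set in a job's environment. The sketch below is an assumption-laden illustration: the two groups contain only the variables named in this comment, they are not an official or exhaustive classification, and the helper name `find_debug_overrides` is hypothetical.

```python
import os

# Grouping based only on the variables named in the comment above;
# illustrative and not exhaustive.
OPERATIONAL_VARS = {"NCCL_SOCKET_IFNAME", "NCCL_CROSS_NIC"}  # tied to the system
DEBUG_VARS = {"NCCL_ALGO", "NCCL_PROTO"}                     # debugging and workarounds


def find_debug_overrides(env=None):
    """Return the debug-only NCCL variables set in env, sorted by name."""
    env = os.environ if env is None else env
    return sorted(name for name in DEBUG_VARS if name in env)


if __name__ == "__main__":
    for name in find_debug_overrides():
        print(f"warning: {name}={os.environ[name]} is a debug override; "
              "consider unsetting it for production runs")
```

Running this before launching a job prints a warning for each debug override found and prints nothing when only operational variables are set.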

@stas00
Author

stas00 commented Feb 24, 2023

I again find this explanation very helpful, Sylvain.

I think even adding a short para like your comment above will go a long way, since when there are no docs and questions arise users like myself try to find the answers. e.g. I didn't know I should be looking for GTC talks to find out about the latest state of NCCL - it's surely obvious to you, but it wasn't for me. I tried to google answers and found very little out there. None of your videos came up.

As I make my own notes while dealing with various NCCL problems, I'm finding, exactly as your comment says, that I need to separate the env vars into at least two groups: operational ones and debug ones.

You completely answered my queries. Thank you.

@stas00 stas00 closed this as completed Feb 24, 2023
@sjeaugey
Member

Yes, separating env vars based on those two categories has been on our to-do list for some time, but we haven't done it yet. Sorry about that.
