NCCL ALGOs aren't quite documented #790
Comments
Hi Stas,
Thank you very much for your detailed, rich, and to-the-point answers, Sylvain. You're awesome! If at all possible, could these be put in the docs? You could even use a FAQ format, so you can just copy my questions and your answers without needing to polish anything to a high standard. Thank you!
We try to limit the documentation to how to use NCCL and not delve into how NCCL is implemented. This is because the implementation changes all the time and we don't have the bandwidth to maintain implementation documentation. So, for the moment, everything related to the implementation (not the usage) ends up being documented in our GTC talks each year, and here, in answers to questions. Also, keep in mind there are two sorts of environment variables: the ones which users may legitimately have to set, for example
I again find this explanation very helpful, Sylvain. I think even adding a short paragraph like your comment above would go a long way, since when there are no docs and questions arise, users like myself try to find the answers on their own. E.g., I didn't know I should be looking at GTC talks to find out about the latest state of NCCL. It's surely obvious to you, but it wasn't to me. I tried to google answers and found very little out there. None of your videos came up. As I'm making my own notes while dealing with various NCCL problems, I'm finding, exactly as in your comment, that I need to separate the env vars into at least 2 groups: operational ones and debug ones. You completely answered my queries. Thank you.
Yes, separating env vars based on those two categories has been on our to-do list for some time, but we haven't done it yet. Sorry about that.
I have been trying to understand primarily:
The official docs are scarce and unclear.
Perhaps this issue could be used to update the state of things with the recent versions of NCCL, e.g. 2.14+.
I read a bunch of older issues on this topic here, but they seem to be about older NCCL versions, so I found no definitive answers.
So my questions, if you don't mind answering, are:
1. When is the tree vs. the ring algorithm used?
In earlier NCCL versions, one used the threshold env vars to control when NCCL switched between tree and ring.
The only thing I discovered from the current docs is that NCCL_ALGO controls this behavior. The default is Tree,Ring,CollnetDirect,CollnetChain.
That's all the info, with no explanation. What does it mean?
1a. Does it mean that it'll consider these algos?
1b. Is the order important or does it just use it to know which algos to consider?
1c. When does NCCL switch from one algo to another?
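One way to probe questions 1a to 1c empirically is to pin NCCL to a single algorithm and compare the logs against a run with the default setting. The following is only a minimal sketch: `nccl_env` is a hypothetical helper (not part of NCCL), the algorithm names are copied from the documented default list, and using `NCCL_DEBUG_SUBSYS=TUNING` to surface algorithm selection in the logs is an assumption that may not hold for every NCCL version.

```python
import os

# Names copied from the documented default value of NCCL_ALGO.
KNOWN_ALGOS = ("Tree", "Ring", "CollnetDirect", "CollnetChain")


def nccl_env(algos, base=None):
    """Return a copy of the environment with NCCL_ALGO restricted to `algos`.

    `nccl_env` is a hypothetical helper for illustration only.
    """
    for algo in algos:
        if algo not in KNOWN_ALGOS:
            raise ValueError(f"unknown algorithm name: {algo!r}")
    env = dict(os.environ if base is None else base)
    env["NCCL_ALGO"] = ",".join(algos)
    # Assumption: INFO-level logs with the TUNING subsystem report
    # which algorithm/protocol NCCL considers and selects.
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "TUNING"
    return env


if __name__ == "__main__":
    # Pass the result as `env=` to subprocess.run(...) when launching a job.
    print(nccl_env(["Ring"], base={}))
```

Running the same workload once with `["Ring"]`, once with `["Tree"]`, and once without `NCCL_ALGO` set should at least show whether the variable restricts the candidates or forces a choice.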
2. Which of them is superior, and when?
It'd be awesome to have some sort of brain/experience dump on this question which could help others to build upon such a starter.
Here is what I found that was useful, yet confusing as the information isn't quite consistent (written by different people).
I found this blog, which says that tree is superior to ring in both latency and bandwidth
here it says:
but I'm not sure if this is still the case with the latest versions, and the blog post I linked just above contradicts this (it says tree has better latency than ring).
This comment looked very useful, but it doesn't get to a bottom-line latency/bandwidth comparison.
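To make the latency vs. bandwidth question concrete, here is a rough sketch based on the textbook alpha-beta cost model for allreduce. It is not NCCL's actual tuning logic, and the function names, the per-step latency `alpha`, and the link `bandwidth` are all illustrative assumptions. In this simplified model the ring pays a latency term that grows linearly with the number of ranks, while a pipelined tree pays one that grows logarithmically; the bandwidth terms are similar, with the ring slightly ahead.

```python
import math


def ring_allreduce_time(n, size_bytes, alpha, bandwidth):
    """Idealized ring allreduce: 2*(n-1) steps, each moving size/n bytes."""
    if n < 2:
        raise ValueError("need at least 2 ranks")
    steps = 2 * (n - 1)
    return steps * alpha + steps * (size_bytes / n) / bandwidth


def tree_allreduce_time(n, size_bytes, alpha, bandwidth):
    """Idealized pipelined tree allreduce: reduce up, then broadcast down.

    Simplification: the latency term covers 2*ceil(log2(n)) hops, and the
    bandwidth term is two full passes over the data.
    """
    if n < 2:
        raise ValueError("need at least 2 ranks")
    depth = math.ceil(math.log2(n))
    return 2 * depth * alpha + 2 * size_bytes / bandwidth


if __name__ == "__main__":
    alpha, bandwidth = 5e-6, 10e9  # assumed: 5 us per step, 10 GB/s links
    for n, size in [(4, 4_000_000), (1024, 1024)]:
        ring = ring_allreduce_time(n, size, alpha, bandwidth)
        tree = tree_allreduce_time(n, size, alpha, bandwidth)
        print(f"n={n} size={size}B ring={ring:.2e}s tree={tree:.2e}s")
```

Under these assumptions, small messages on many ranks favor the tree and large messages on few ranks favor the ring, which is one way the seemingly contradictory statements could both be true in different regimes.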
3. Benchmarking
Since it is unclear which algo is engaged when, I'm not sure how to benchmark correctly. For example, when can a 4-node bandwidth measurement be extrapolated to estimate the performance of a much larger cluster, given that each algo scales differently with many more nodes? Did anybody at NVIDIA publish numbers, like this blog does, comparing the latest tree algo to the ring algo at various scales?
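Until such numbers exist, one option is to measure each algorithm separately across message sizes at every scale that is available. Below is a minimal sketch that only builds the commands and environment overrides; it assumes the `all_reduce_perf` binary from nccl-tests is built at `./build/all_reduce_perf` (the path is an assumption), and `all_reduce_perf_cmd` and `sweep_jobs` are hypothetical helper names. Multi-node launching (e.g. through MPI) is deliberately left out.

```python
def all_reduce_perf_cmd(binary="./build/all_reduce_perf",
                        min_bytes="8", max_bytes="1G",
                        step_factor=2, gpus_per_proc=1):
    """Build an nccl-tests command line sweeping message sizes.

    -b/-e set the min/max message size, -f the multiplicative step,
    -g the number of GPUs per process. The binary path is an assumption.
    """
    return [binary,
            "-b", str(min_bytes),
            "-e", str(max_bytes),
            "-f", str(step_factor),
            "-g", str(gpus_per_proc)]


def sweep_jobs(algos=("Ring", "Tree")):
    """Return one (env_overrides, argv) pair per forced algorithm."""
    return [({"NCCL_ALGO": algo}, all_reduce_perf_cmd()) for algo in algos]


if __name__ == "__main__":
    for env_overrides, argv in sweep_jobs():
        # To actually run: subprocess.run(argv, env={**os.environ, **env_overrides})
        print(env_overrides, " ".join(argv))
```

Comparing the per-size results for each forced algorithm at 1, 2, and 4 nodes gives at least a trend line for each one, rather than a single blended number whose underlying algorithm is unknown.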
I may have missed some important questions. If you are willing to take up my challenge, perhaps you could kindly add your own notes from experience that would help others deploy NCCL better.
Thank you very much!