Questions about profiling NCCL ring-reduce #768
Comments
I'm not sure why you care about the value of `nelem`. It's an implementation detail where we compute the size of each chunk. It can be negative if it's past the end of the buffer, in which case we'll consider it as zero.
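To illustrate, here is a minimal host-side sketch of the kind of clamping described above, assuming the rough shape of the chunk loop in `all_reduce.h`; the function and variable names are approximations for illustration, not the exact NCCL implementation:

```c
/* Hypothetical sketch: the per-step element count is "whatever remains of
 * the buffer, capped at one chunk". Once the chunk offset has run past the
 * end of the buffer, the subtraction goes negative, and the primitives then
 * treat the step as carrying zero elements. */
static inline int compute_nelem(long size, long chunkOffset, int chunkSize) {
  long remaining = size - chunkOffset;  /* negative past the buffer end */
  if (remaining < 0) remaining = 0;     /* negative counts behave as zero */
  return (int)(remaining < (long)chunkSize ? remaining : (long)chunkSize);
}
```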
That's a good question but I don't know the answer. That'd be something to ask to the Megatron-LM project.
`Count` is the argument passed to NCCL. `opCount` is an internal value NCCL uses to track operations on a given NCCL communicator (group).
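For reference, a small sketch of how `count` appears in NCCL's public API; the wrapper function and buffer names here are illustrative, not part of NCCL:

```c
#include <nccl.h>

/* count is the number of elements (not bytes) the caller hands to NCCL.
 * opCount, by contrast, never appears in the API: it is the communicator's
 * internal operation counter, which is what shows up in the debug logs. */
ncclResult_t allreduce_gradients(float* d_grads, size_t numElems,
                                 ncclComm_t comm, cudaStream_t stream) {
  /* In-place all-reduce: NCCL allows sendbuff == recvbuff. */
  return ncclAllReduce(d_grads, d_grads, numElems, ncclFloat, ncclSum,
                       comm, stream);
}
```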
Hello @sjeaugey, I was trying to fully understand NCCL's ring-reduce operation by studying the code and some of the materials provided by the NCCL team.
I see. I know that in the primitive functions (e.g. LL, LL128), each transfer has an 'actual data' part and a 'flag' part, where the flag is used as a validity check. So I expect that if the size of the chunk passed to the primitive send function is 1 or 2, the system is checking the validity of the data to send. Am I thinking about this properly? If there's anything I misunderstood, please point it out.
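For context, a paraphrase of the LL FIFO line layout from the NCCL device primitives (the exact definition lives in the source, e.g. `prims_ll.h`; treat the type and field names here as approximate):

```c
#include <stdint.h>

/* In the LL (low-latency) protocol, each 16-byte line interleaves 8 bytes
 * of payload with 8 bytes of flags. The receiver polls the flags; once they
 * equal the expected step counter, the adjacent data words are known to be
 * valid, so no separate synchronization is needed. */
union LLFifoLine {
  struct {
    uint32_t data1;
    uint32_t flag1;
    uint32_t data2;
    uint32_t flag2;
  };
  uint64_t v[2];
};
```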
I don't think I totally understood your advice... If `Count` is the argument passed to NCCL, can I take it to be the argument passed to the NCCL function? (More specifically, in the all-reduce case, does `count` mean the size of the data?) I am interested in why the count of each operation is different.
When I looked at the NCCL debug log, the `Count` is also changing dynamically. I am trying to interpret this phenomenon, but I have been struggling with it for a while. Thank you!
Dear NCCL team,
I'm new to this library and am trying to study all-reduce behavior by profiling DL training.
For the model, I'm training Megatron-LM's BERT-base via Distributed Data Parallel (DDP).
As for the environment, I'm using a machine with 2 x A5000 GPUs connected via PCIe.
When I analyzed the code in `all_reduce.h`, I found that the system determines the data size for each step (based on the chunk size), which is stored in `nelem`. However, there were some interesting observations while profiling the all-reduce.
When I checked the size of the data passed to the send function, some values were `-511`, some were `1`, and some were `1048576`. Can I get some advice on how to interpret these sizes, and why I could get such values?
My second point of interest is interpreting the training result using `NCCL_DEBUG=INFO`. Here is the snippet of my log.
Because the BERT-base model has 12 layers, I thought that the all-reduce call would be launched 12 times when the layer-by-layer synchronization happened. However, I was quite confused to see that it was launched 14 times.
Can I get some advice for interpreting this log?
(Additionally, I couldn't find an answer on the meaning of `opCount` and `count`, so I'm having a hard time interpreting the data.) Thank you!