
Questions about profiling NCCL ring-reduce #768

Open
tjdgh0715 opened this issue Jan 6, 2023 · 2 comments
tjdgh0715 commented Jan 6, 2023

Dear NCCL team,

I'm new to this library and trying to study all-reduce systems by profiling DL training.
For the model, I'm training Megatron-LM's BERT-base with Distributed Data Parallel (DDP).
For the environment, I'm using a machine with 2 x A5000 GPUs connected over PCIe.

When I analyzed the code in all_reduce.h, I found that the system determines the data size (based on the chunk size), which is referred to as nelem.
However, I noticed some interesting things while profiling the all-reduce.
When I checked the size of the data used by the send function, some values were -511, some were 1, and some were 1048576. Could I get some advice on how to interpret these sizes and why I'm seeing these values?

My second question is about interpreting the training output with NCCL_DEBUG=INFO.
Here is a snippet of my log.

 iteration        6/      10 | consumed samples:         1536 | consumed tokens:       185228 | elapsed time per iteration (ms): 306.5 | learning rate: 0.000E+00 | global batch size:   256 | loss scale: 134217728.0 | number of skipped iterations:   1 | number of nan iterations:   0 | samples per second: 835.199 | TFLOPs: 33.65 |
time (ms) | forward-compute: 105.94 | backward-compute: 155.04 | backward-params-all-reduce: 40.91 | backward-embedding-all-reduce: 0.02 | optimizer-copy-to-main-grad: 1.67 | optimizer-unscale-and-check-inf: 1.62 | optimizer: 3.36 | batch-generator: 1.13
jungfrau:11104:11104 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f69019ffe00 recvbuff 0x7f69019ffe00 count 240 datatype 0 op 0 root 0 comm 0x7f6a38002f70 [nranks=1] stream 0x55fe4921ac30
jungfrau:11103:11103 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fb6c79ffe00 recvbuff 0x7fb6c79ffe00 count 240 datatype 0 op 0 root 0 comm 0x7fb800002f70 [nranks=1] stream 0x563cb9e1a0d0
jungfrau:11103:11103 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fb6c7853200 recvbuff 0x7fb6c7853200 count 656384 datatype 0 op 0 root 0 comm 0x7fb800002f70 [nranks=1] stream 0x563cb9e1a0d0
jungfrau:11104:11104 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f6901853200 recvbuff 0x7f6901853200 count 656384 datatype 0 op 0 root 0 comm 0x7f6a38002f70 [nranks=1] stream 0x55fe4921ac30
jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 15 sendbuff 0x7fb6c79ffe00 recvbuff 0x7fb6c79ffe00 count 1 datatype 4 op 0 root 0 comm 0x7fb814002f70 [nranks=2] stream 0x563cb9e19e50
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 15 sendbuff 0x7f69019ffe00 recvbuff 0x7f69019ffe00 count 1 datatype 4 op 0 root 0 comm 0x7f6a99dab510 [nranks=2] stream 0x55fe4921a9b0
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f6a37d40000 recvbuff 0x7f6a37d40000 count 16384 datatype 7 op 2 root 0 comm 0x7f6a38002f70 [nranks=1] stream 0x55fe4921ac30
jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fb7ffd40000 recvbuff 0x7fb7ffd40000 count 16384 datatype 7 op 2 root 0 comm 0x7fb800002f70 [nranks=1] stream 0x563cb9e1a0d0
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f6733820000 recvbuff 0x7f6733820000 count 16384 datatype 7 op 0 root 0 comm 0x7f6a38002f70 [nranks=1] stream 0x55fe4921ac30
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f6a37da0000 recvbuff 0x7f6a37da0000 count 16384 datatype 7 op 0 root 0 comm 0x7f6a38002f70 [nranks=1] stream 0x55fe4921ac30
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 16 sendbuff 0x7f6acd55a600 recvbuff 0x7f6acd55a600 count 2 datatype 7 op 0 root 0 comm 0x7f6a99dab510 [nranks=2] stream 0x55fe4921a9b0
jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fb4fd820000 recvbuff 0x7fb4fd820000 count 16384 datatype 7 op 0 root 0 comm 0x7fb800002f70 [nranks=1] stream 0x563cb9e1a0d0
jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fb7ffda0000 recvbuff 0x7fb7ffda0000 count 16384 datatype 7 op 0 root 0 comm 0x7fb800002f70 [nranks=1] stream 0x563cb9e1a0d0
jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 16 sendbuff 0x7fb86875a600 recvbuff 0x7fb86875a600 count 2 datatype 7 op 0 root 0 comm 0x7fb814002f70 [nranks=2] stream 0x563cb9e19e50
jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 17 sendbuff 0x7fb58c000000 recvbuff 0x7fb58c000000 count 110160258 datatype 6 op 0 root 0 comm 0x7fb814002f70 [nranks=2] stream 0x563cb9e19e50
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 17 sendbuff 0x7f67c6000000 recvbuff 0x7f67c6000000 count 110160258 datatype 6 op 0 root 0 comm 0x7f6a99dab510 [nranks=2] stream 0x55fe4921a9b0
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f6acd44d600 recvbuff 0x7f6acd44d600 count 1 datatype 7 op 2 root 0 comm 0x7f673c002f70 [nranks=1] stream 0x55fe4921ad70
jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fb86864d600 recvbuff 0x7fb86864d600 count 1 datatype 7 op 2 root 0 comm 0x7fb508002f70 [nranks=1] stream 0x563cb9e1a210

Because the BERT-base model has 12 layers, I expected all-reduce to be launched 12 times when layer-by-layer synchronization happened. However, I was confused to see the function call launched 14 times.
Could I get some advice on interpreting this log?
(Additionally, I couldn't find an explanation of the meaning of opCount and count, so I'm having a hard time interpreting the data.)

Thank you!

@tjdgh0715 tjdgh0715 changed the title Questions about debugging & profiling NCCL ring-reduce Questions about NCCL ring-reduce Jan 6, 2023
@tjdgh0715 tjdgh0715 changed the title Questions about NCCL ring-reduce Questions about profiling NCCL ring-reduce Jan 6, 2023
@sjeaugey
Member

When I analyzed the code in all_reduce.h, I found that the system determines the data size (based on the chunk size), which is referred to as nelem.
However, I noticed some interesting things while profiling the all-reduce.
When I checked the size of the data used by the send function, some values were -511, some were 1, and some were 1048576. Could I get some advice on how to interpret these sizes and why I'm seeing these values?

I'm not sure why you care about the value of nelem. It's an implementation detail, where we compute the size of each chunk. It can be negative if it's past the end of the buffer in which case we'll consider it as zero.
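(A minimal, illustrative sketch of how such a per-chunk size can go negative. This is not NCCL's actual code, only the arithmetic described above with made-up sizes: with a fixed chunk size, a step whose offset starts past the end of the buffer produces a negative remainder and is simply treated as zero elements.)

```c
#include <stdio.h>

int main(void) {
  const long count     = 1048577;   // total elements in the buffer (made-up)
  const long chunkSize = 1048576;   // elements per chunk (made-up)

  for (int step = 0; step < 3; step++) {
    long offset = (long)step * chunkSize;
    long nelem  = count - offset;             // elements remaining at this step
    if (nelem > chunkSize) nelem = chunkSize;
    // Past the end of the buffer nelem goes negative; such a step sends nothing.
    printf("offset=%ld nelem=%ld -> send %ld elements\n",
           offset, nelem, nelem > 0 ? nelem : 0L);
  }
  return 0;
}
```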

Because the BERT-base model has 12 layers, I expected all-reduce to be launched 12 times when layer-by-layer synchronization happened. However, I was confused to see the function call launched 14 times.

That's a good question but I don't know the answer. That'd be something to ask the Megatron-LM project.

I couldn't find an explanation of the meaning of opCount and count, so I'm having a hard time interpreting the data.

Count is the argument passed to NCCL. opCount is an internal value for NCCL to track operations on a given NCCL communicator (group).
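(For reference, a sketch of how "count" in the log maps onto the collective call itself. The assumption here is that the large gradient bucket is fp16; in the ncclDataType_t/ncclRedOp_t enums, datatype 6 is ncclFloat16 and op 0 is ncclSum.)

```c
#include <nccl.h>
#include <cuda_runtime.h>

// "count" in NCCL_DEBUG output is the count argument of the collective:
// the number of elements of the given datatype, not bytes and not a chunk size.
// A log line like "count 110160258 datatype 6 op 0" would come from a call
// shaped roughly like this.
void allreduce_gradients(void* grads, size_t count,
                         ncclComm_t comm, cudaStream_t stream) {
  // In-place all-reduce (sendbuff == recvbuff), summing fp16 gradients.
  ncclAllReduce(grads, grads, count, ncclFloat16, ncclSum, comm, stream);
}
```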

@tjdgh0715
Author

Hello @sjeaugey,
Thank you for the detailed reply!

I was trying to fully understand NCCL's ring-reduce operation by studying the code and some materials provided by the NCCL team.

I'm not sure why you care about the value of nelem. It's an implementation detail, where we compute the size of each chunk. It can be negative if it's past the end of the buffer in which case we'll consider it zero.

I see. I know that in the primitive functions (e.g. LL, LL128) there is an 'actual data' part and a 'flag' part, which is used for validity checking. So I am expecting that if the size of the chunk passed to the primitive send function is 1 or 2, the system is checking the validity of the data to send. Is my understanding correct? If I have misunderstood anything, please point it out.
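For reference, this is roughly the LL line layout I have in mind (reconstructed from my reading of the source, so the exact names may differ):

```c
#include <stdint.h>

// Rough sketch of an LL protocol line as I understand it: every 8 bytes of
// payload is sent as 16 bytes, with a flag word after each data word so the
// receiver can poll the flags to know when the data words are valid.
union LLFifoLine {
  struct {
    uint32_t data1;
    uint32_t flag1;   // validity flag for data1
    uint32_t data2;
    uint32_t flag2;   // validity flag for data2
  };
  uint64_t v[2];
};
```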

Count is the argument passed to NCCL. opCount is an internal value for NCCL to track operations on a given NCCL communicator (group).

I don't think I fully understood your explanation... If Count is the argument passed to NCCL, can I think of it as the argument passed to the NCCL function call? (More specifically, in the all-reduce case, does count mean the size of the data?) I am curious why the count of each operation is different.

jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 15 sendbuff 0x7fb6c79ffe00 recvbuff 0x7fb6c79ffe00 count 1 datatype 4 op 0 root 0 comm 0x7fb814002f70 [nranks=2] stream 0x563cb9e19e50
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 15 sendbuff 0x7f69019ffe00 recvbuff 0x7f69019ffe00 count 1 datatype 4 op 0 root 0 comm 0x7f6a99dab510 [nranks=2] stream 0x55fe4921a9b0
jungfrau:11104:11104 [1] NCCL INFO AllReduce: opCount 0 sendbuff 0x7f6a37d40000 recvbuff 0x7f6a37d40000 count 16384 datatype 7 op 2 root 0 comm 0x7f6a38002f70 [nranks=1] stream 0x55fe4921ac30
jungfrau:11103:11103 [0] NCCL INFO AllReduce: opCount 0 sendbuff 0x7fb7ffd40000 recvbuff 0x7fb7ffd40000 count 16384 datatype 7 op 2 root 0 comm 0x7fb800002f70 [nranks=1] stream 0x563cb9e1a0d0

When I looked at the NCCL debug log, the count also changes dynamically (from 1 to 16384).
I also understand that opCount is an internal value to track operations.
But as we can see in the log, there are entries with opCount 0.
I am wondering why opCount is 0 even though AllReduce is launched; the two don't seem to match.

I am trying to interpret this behavior but have been struggling for a while.
Could I get some advice on these additional questions too?

Thank you!
