Why is allgather's busbw a little worse than allreduce/reducescatter for the same environment variables (e.g. same number of channels)?
For example, in nccl-tests results on H100, reducescatter's busbw is 360 GB/s while allgather's busbw is 350 GB/s.
Does this have anything to do with the processing efficiency of the kernel functions? Intuitively, allgather only needs to copy, while reducescatter needs to copy and compute (reduce), so shouldn't allgather be faster?
I'm not good at kernel performance analysis, so I hope you can point out where I'm wrong. Thanks.
There isn't a known reason for this. The CUDA compiler may make better choices when building reduce_scatter than allgather. We frequently see innocuous code changes regress some ops while improving others, so we treat it as compiler noise and move on.
The only interesting difference between the two is that allgather is compiled only for byte elements, while reduce_scatter has a version compiled for every datatype (and reduction op). Even though the hot path for well-aligned (16-byte) data should be equivalent between the two, the slow path for allgather will be much worse than the slow path for reduce_scatter f32, since the latter can assume 4-byte alignment while the former cannot. The slow path isn't taken in nccl-tests, but its extra code might still be bloating the icache or something similar.
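To make the alignment point concrete, here is a minimal host-side sketch. It is not NCCL code: `CopyStats`, `copy_with_paths`, and the `elem_size` parameter are hypothetical names used only for illustration, and real kernels handle misalignment in more sophisticated ways (for example by peeling up to an aligned boundary). It only shows why a fallback that is limited to byte-sized accesses issues more memory operations than one that can assume 4-byte elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical illustration only; none of these names come from NCCL.
struct CopyStats {
    std::size_t wide_ops = 0;    // 16-byte accesses (the "hot path")
    std::size_t narrow_ops = 0;  // element-sized accesses (the "slow path")
};

// Copies `bytes` bytes from src to dst and counts the accesses issued.
// `elem_size` is the alignment the element type guarantees:
// 1 for a byte-typed kernel, 4 for an f32-typed kernel.
inline CopyStats copy_with_paths(void* dst, const void* src,
                                 std::size_t bytes, std::size_t elem_size) {
    assert(elem_size > 0 && bytes % elem_size == 0);
    CopyStats stats;
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);

    // The hot path is only usable when both pointers are 16-byte aligned.
    const bool aligned16 =
        reinterpret_cast<std::uintptr_t>(d) % 16 == 0 &&
        reinterpret_cast<std::uintptr_t>(s) % 16 == 0;

    std::size_t done = 0;
    if (aligned16) {
        for (; done + 16 <= bytes; done += 16) {
            std::memcpy(d + done, s + done, 16);
            ++stats.wide_ops;
        }
    }
    // Slow path (also handles any tail): one access per element.
    for (; done < bytes; done += elem_size) {
        std::memcpy(d + done, s + done, elem_size);
        ++stats.narrow_ops;
    }
    return stats;
}
```

In this sketch, a 64-byte copy between 16-byte-aligned buffers takes 4 wide accesses regardless of `elem_size`. If both pointers are offset by 4 bytes, the same copy falls back to 64 narrow accesses with `elem_size = 1` but only 16 with `elem_size = 4`, which mirrors the byte versus f32 slow-path difference described above.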