Why is allgather's busbw a little worse than allreduce/reducescatter for the same nccl environment variables #1281

pkuleo · 2024-05-10T13:28:54Z

Why is allgather's busbw a little worse than allreduce/reducescatter for the same environment variables (e.g. same number of channels)?

For example, running nccl-tests on H100, reducescatter's busbw is 360 GB/s, while allgather's busbw is 350 GB/s.
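For context, here is a minimal sketch of how nccl-tests derives busbw from the measured algorithm bandwidth (algbw), following the correction factors described in the nccl-tests performance notes. The function names and the 8-rank count are assumptions made for illustration; the rank count is not stated in this thread. The sketch shows that allgather and reducescatter share the same factor, so a 360 vs 350 GB/s gap reflects a difference in measured time rather than in normalization.

```python
def busbw_factor(collective: str, nranks: int) -> float:
    """Factor applied to algbw to obtain busbw (names here are illustrative)."""
    if collective == "all_reduce":
        # Allreduce moves data twice around the ring: reduce-scatter + allgather.
        return 2.0 * (nranks - 1) / nranks
    if collective in ("all_gather", "reduce_scatter"):
        # Both collectives use the same (n - 1) / n correction.
        return (nranks - 1) / nranks
    raise ValueError(f"unsupported collective: {collective}")


def busbw(collective: str, nranks: int, algbw_gb_per_s: float) -> float:
    """Bus bandwidth in GB/s from algorithm bandwidth in GB/s."""
    return algbw_gb_per_s * busbw_factor(collective, nranks)


if __name__ == "__main__":
    nranks = 8  # assumed rank count, not taken from the thread
    for name in ("all_reduce", "all_gather", "reduce_scatter"):
        print(name, busbw_factor(name, nranks))
```

Because the two factors are identical, comparing the busbw of allgather and reducescatter at the same message size is equivalent to comparing their measured times.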

Does this have anything to do with the efficiency of the kernels? Intuitively, allgather only needs to copy, whereas reducescatter needs to copy and compute (reduce), so shouldn't allgather be faster?

I'm not good at kernel performance analysis, so I hope you can point out where I'm wrong. Thanks.

jbachan · 2024-05-10T17:22:23Z

There isn't a known reason for this. The CUDA compiler may make better choices when building reduce_scatter than allgather. We frequently deal with innocuous changes to the code regressing some ops but improving others, so we just see it as noise in the compiler and move on.

The only interesting difference between the two is that allgather is only compiled for byte elements, while reduce_scatter has a version compiled for every datatype (and reduction op). Even though the hot path for well-aligned (16-byte) data should be equivalent between the two, the slow path for allgather will be much worse than the slow path for reduce_scatter f32, since the latter can assume 4-byte alignment while the former cannot. While the slow path isn't taken in nccl-tests, it might be bloating the icache or something.
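To make the alignment argument concrete, here is a hypothetical sketch (not NCCL's actual code) of how a kernel might pick the widest memory access it can use for a buffer. The function `widest_access_bytes` and its parameters are invented for illustration. It assumes the buffer is always aligned to its element size, which is why a 4-byte element type never falls below 4-byte accesses while a byte-typed kernel can fall all the way to 1.

```python
def widest_access_bytes(address: int, nbytes: int, elem_size: int) -> int:
    """Widest power-of-two access (at most 16 bytes) valid for the whole buffer.

    Hypothetical helper: assumes `address` and `nbytes` are multiples of
    `elem_size`, so the width never needs to drop below the element size.
    """
    width = 16
    # Halve the width until both the start address and the length fit it,
    # or until we reach the element size guaranteed by the datatype.
    while width > elem_size and (address % width or nbytes % width):
        width //= 2
    return width


if __name__ == "__main__":
    # Hot path: 16-byte aligned data gets the same width for both kernels.
    print(widest_access_bytes(0x1000, 4096, elem_size=1))
    print(widest_access_bytes(0x1000, 4096, elem_size=4))
    # Slow path: the byte-typed kernel may have to fall back much further
    # than the f32 kernel, which can rely on 4-byte alignment.
    print(widest_access_bytes(0x1001, 4095, elem_size=1))
    print(widest_access_bytes(0x1004, 4096, elem_size=4))
```

In this sketch the aligned cases are indistinguishable, which matches the point that the hot path should be equivalent; only the misaligned fallback differs, and the byte-typed kernel has to carry the more general (and larger) fallback code.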
