
busbw exceeds network bandwidth (2 nodes, 16 GPUs, 100 Gbps Intel NIC, no NVSwitch) - what algorithm is used? #196

Closed
ofilip opened this issue Feb 5, 2024 · 5 comments


@ofilip

ofilip commented Feb 5, 2024

We are running all-reduce tests with our GPU nodes.

HW setup:

  • 8× H100 PCIe per node, 4× NVLink, no NVSwitch within the node
  • a single Intel 100 Gbps NIC per node

When running the all_reduce test on 2 nodes, the reported busbw exceeds 100 Gbps. After a little thought, we concluded that all_reduce must be done in some hierarchical fashion, so that the inter-node bandwidth is only algbw * (2*(n_nodes-1)/n_nodes) = algbw instead of algbw * (2*(n-1)/n); basically, it seems each chunk is first reduced inside the node to spare network bandwidth. After a little more research we learned about the CollNet/SHARP algorithm, which seems related but is only implemented with NVSwitch, which is not present in our setup.
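A back-of-envelope sketch of that reasoning (all numbers are illustrative assumptions taken from the HW setup above, not measurements; the busbw formula is the one used by nccl-tests):

```python
# Back-of-envelope check: why reported busbw can exceed NIC speed on 2 nodes.
# All values are assumptions from the HW setup above, not measured data.
n_gpus = 16       # 2 nodes x 8 GPUs
n_nodes = 2
nic_gbps = 100    # one 100 Gbps NIC per node

# nccl-tests reports busbw = algbw * 2*(n-1)/n for all_reduce
ring_factor = 2 * (n_gpus - 1) / n_gpus            # 1.875

# Flat 16-GPU ring: each inter-node link carries the full busbw,
# so busbw is capped at the NIC speed.
flat_ring_busbw_cap = nic_gbps                     # 100 Gbps

# Hierarchical scheme (reduce inside each node first): the NIC only has
# to carry algbw * 2*(n_nodes-1)/n_nodes = algbw, so algbw itself can
# approach NIC speed and the reported busbw goes well beyond it.
hier_algbw_cap = nic_gbps / (2 * (n_nodes - 1) / n_nodes)  # 100 Gbps
hier_busbw_cap = hier_algbw_cap * ring_factor              # 187.5 Gbps

print(f"flat ring busbw cap:    {flat_ring_busbw_cap:.1f} Gbps")
print(f"hierarchical busbw cap: {hier_busbw_cap:.1f} Gbps")
```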

Any hints on what exactly is going on with all_reduce in our setup? How can I find out the details of the algorithm used for all_reduce?

@sjeaugey
Member

sjeaugey commented Feb 5, 2024

On two nodes, NCCL uses the tree (or nvlstree) algorithm which can indeed go beyond the network bottleneck (provided the intra-node bandwidth can sustain that higher speed). If you want to benchmark your network performance through NCCL on 2 nodes, you may want to force NCCL_ALGO=RING.
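For completeness, a minimal sketch of forcing the ring algorithm from a PyTorch job (an assumption-laden illustration, not from this thread: it assumes a standard torchrun launch with one process per GPU; with nccl-tests you would simply export NCCL_ALGO=RING in the environment before launching):

```python
import os

# NCCL reads NCCL_ALGO at initialization, so set it before the first
# collective triggers NCCL init.
os.environ["NCCL_ALGO"] = "RING"

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# 1 GiB float32 buffer, roughly mirroring one all_reduce_perf data point
x = torch.ones(256 * 1024 * 1024, device="cuda")
dist.all_reduce(x)
torch.cuda.synchronize()
dist.destroy_process_group()
```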

@ofilip
Author

ofilip commented Feb 5, 2024

Thanks!

@ofilip ofilip closed this as completed Feb 5, 2024
@skalyan

skalyan commented Apr 8, 2024

> On two nodes, NCCL uses the tree (or nvlstree) algorithm which can indeed go beyond the network bottleneck (provided the intra-node bandwidth can sustain that higher speed). If you want to benchmark your network performance through NCCL on 2 nodes, you may want to force NCCL_ALGO=RING.

Is this a change in recent NCCL versions? For 2 A100 nodes with HDR-generation InfiniBand, we didn't have to set NCCL_ALGO=RING to benchmark network performance.

@sjeaugey
Member

sjeaugey commented Apr 8, 2024

It's been there since NCCL 2.4. But depending on the platform and the number of GPUs per node, NCCL may or may not select the Tree algorithm on 2 nodes, depending on its performance. In particular, if you have 8 NICs, Tree will struggle to get better performance than Ring without using 32 SMs (which we disallow). But if you have fewer NICs you'll probably see the "Tree effect".

On H100 we have NVLink SHARP, so we can use the NVLSTree algorithm between 2 nodes, which will give you higher BW and doesn't require as many SMs since the reductions are offloaded.
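One way to confirm which algorithm a given run actually selected is to enable NCCL's debug logging before initialization. A sketch, assuming the documented NCCL_DEBUG / NCCL_DEBUG_SUBSYS variables (the exact log lines vary across NCCL versions):

```python
import os

# Ask NCCL to log its init and tuning decisions; the chosen algorithm
# (RING / TREE / NVLSTREE ...) shows up in the INFO output on stderr.
# Exact log formats vary across NCCL versions.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,TUNING"

# ... then launch the job (e.g. the torch.distributed sketch above) and
# inspect rank 0's stderr for the algorithm/protocol selection lines.
```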

@skalyan

skalyan commented Apr 8, 2024 via email
