How could I carefully control which NIC to use when running a ring-based collective operation? #687
I want to run multiple broadcasts concurrently. They will all send data into one host that has multiple NICs. I do not want a single NIC to become the bottleneck, so I would prefer that these broadcasts use different NICs. How can I carefully control this?
Besides, is the order in which a broadcast sends data decided when the communicator is created, or on the fly while it runs? This also matters for me when configuring which NIC is used.
Thanks!
Comments
I'm not sure I understand the problem. Are all GPUs part of the communicator?
Each broadcast has its own communicator that involves all devices used in that broadcast.
One simple example: we have 2 hosts, each host has two devices, and each device has its own NIC. Broadcast A sends data from (host0, device0) to (host1, device0) and (host1, device1), whereas broadcast B sends data from (host0, device1) to (host1, device0) and (host1, device1). If both broadcasts use the first NIC, the one attached to the first device on host1, then they cannot run concurrently, which will be slow. So I want the two broadcasts to use different NICs when entering host1.
I see. With a recent NCCL, if both GPUs are at the same distance from both NICs, I think each GPU would use a different NIC. Could you provide the node topology using
Edit: that might work on the "source" node, but on the "destination" node it would not. One trick would be to force all communicators to use both NICs. For that, you could edit the topology to set the network speed to half of what it is.
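As an aside, one way the "edit the topology" suggestion can be wired up (a sketch, not something stated in the thread) is the usual two-step workflow: dump the topology NCCL detected with NCCL_TOPO_DUMP_FILE, hand-edit the NIC speed in a copy of that file, and load the edited copy on later runs with NCCL_TOPO_FILE. The file paths and the helper function below are placeholders.

```c
/* Hypothetical helper (paths and function name are placeholders).
 * Both environment variables must be set before NCCL initializes,
 * i.e. before the first ncclCommInitRank()/ncclCommInitAll() call. */
#include <stdlib.h>

void configure_nccl_topology(int dumpOnly) {
  if (dumpOnly) {
    /* First run: ask NCCL to write the topology it detected to a file. */
    setenv("NCCL_TOPO_DUMP_FILE", "/tmp/nccl_topo.xml", 1);
  } else {
    /* Later runs: load a hand-edited copy, e.g. with the NIC speed halved,
     * to push the search toward using more than one NIC. */
    setenv("NCCL_TOPO_FILE", "/tmp/nccl_topo_edited.xml", 1);
  }
}
```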
Sorry, I do not understand this trick. What does "force all communicators to use both NICs" mean? In a single broadcast we will only have one ring and thus use only one NIC. Also, what does "edit the topology" mean? How could I edit it? I am designing the algorithm and have not started coding or running experiments yet, so I do not have a topology to export.
I thought you wanted to create two communicators, each having 3 ranks: one GPU on the "source" node and the two GPUs on the "destination" node, then use ncclBroadcast. Is that right, or did I misunderstand?
Yes, you understand my example exactly. My general question is how to finely control which NIC is used in an NCCL collective operation.
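For concreteness, here is a minimal sketch (not from the thread) of the setup being discussed: two 3-rank communicators, A = {host0/GPU0, host1/GPU0, host1/GPU1} and B = {host0/GPU1, host1/GPU0, host1/GPU1}, each running its own ncclBroadcast. It assumes one MPI process per GPU with ranks 0,1 on host0 and ranks 2,3 on host1; the helper name and buffer size are made up, and error handling is minimal. Note that this only sets up the communicators; which NIC each ring ends up on is still decided by NCCL's topology search.

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  printf("NCCL error %s:%d '%s'\n", __FILE__, __LINE__, ncclGetErrorString(r)); exit(1); } } while (0)

/* Create an NCCL communicator over the MPI ranks for which 'member' is non-zero. */
static ncclComm_t createSubComm(int member, MPI_Comm world) {
  int rank; MPI_Comm_rank(world, &rank);
  MPI_Comm sub;
  MPI_Comm_split(world, member ? 0 : MPI_UNDEFINED, rank, &sub);
  if (!member) return NULL;

  int subRank, subSize;
  MPI_Comm_rank(sub, &subRank);
  MPI_Comm_size(sub, &subSize);

  ncclUniqueId id;
  if (subRank == 0) CHECK_NCCL(ncclGetUniqueId(&id));
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, sub);

  ncclComm_t comm;
  CHECK_NCCL(ncclCommInitRank(&comm, subSize, id, subRank));
  MPI_Comm_free(&sub);
  return comm;
}

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Assumption: one process per GPU, ranks 0,1 on host0 and ranks 2,3 on host1. */
  cudaSetDevice(rank % 2);

  /* Ranks 2 and 3 (host1) belong to both communicators; create them one at a time. */
  ncclComm_t commA = createSubComm(rank == 0 || rank >= 2, MPI_COMM_WORLD);
  ncclComm_t commB = createSubComm(rank == 1 || rank >= 2, MPI_COMM_WORLD);

  size_t count = 32 * 1024 * 1024;
  float *bufA = NULL, *bufB = NULL;
  cudaStream_t sA, sB;
  if (commA) { cudaMalloc(&bufA, count * sizeof(float)); cudaStreamCreate(&sA); }
  if (commB) { cudaMalloc(&bufB, count * sizeof(float)); cudaStreamCreate(&sB); }

  /* Root is rank 0 *within each sub-communicator*: global rank 0 for A, global rank 1 for B. */
  if (commA) CHECK_NCCL(ncclBroadcast(bufA, bufA, count, ncclFloat, 0, commA, sA));
  if (commB) CHECK_NCCL(ncclBroadcast(bufB, bufB, count, ncclFloat, 0, commB, sB));

  if (commA) { cudaStreamSynchronize(sA); ncclCommDestroy(commA); cudaFree(bufA); }
  if (commB) { cudaStreamSynchronize(sB); ncclCommDestroy(commB); cudaFree(bufB); }

  MPI_Finalize();
  return 0;
}
```

On host1, the two broadcasts are enqueued on different CUDA streams so they can overlap, and the two communicators are created one after the other so the two initializations are not interleaved within the same process.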
You can't. But we may be able to find tricks to still get good performance. Can you run the all_reduce_perf test on all 4 GPUs setting
Thanks! This is just a hypothetical example.
@sjeaugey Hello! But when I tested this with nccl-tests, it did not work like that, so let me describe my test. First, the test environment:
For the test, I enabled SR-IOV and added 4 VFs for 1 PF. This is the lspci result:
And this is the nvidia-smi topo -m result:
If I run nccl-tests (the command is below)...
I expect the 4 processes on each node to each use one of the InfiniBand HCAs mlx5_10, 11, 12, 13 (with no duplication, e.g. process A uses mlx5_10, process B uses mlx5_11, and so on), but only mlx5_10 is used!
And this is the NCCL graph log:
The purpose of this test is to check whether, when multiple processes share one InfiniBand device (PF), creating VFs via SR-IOV and assigning one VF to each process gives better performance than simply sharing the PF. Thank you, I hope for your answer.
I'm not sure how your experiment relates to my comment. You have 2 NICs for 2 GPUs, so each GPU would use a different NIC by default, e.g. GPU 0 would use
I am sorry, my long question was confusing. Let me explain again.
And I will run nccl-tests with this command. If I run this command, 4 processes run on each node, each process uses 1 GPU (not shared with another process; each process gets its own GPU, e.g. process A gets GPU1, process B gets GPU2, ...), and all processes do the all_reduce operation via the 4 InfiniBand VF devices mlx5_[10:13] (not using the PFs, only the VFs).
I cannot understand the above nccl-tests result, because all processes use only one net device, mlx5_10. From your comment "I see. With a recent NCCL, if both GPUs are at the same distance from both NICs, I think each GPU would use a different NIC," I thought the processes would use all of the net devices mlx5_[10:13], but they use only mlx5_10. I have tried to explain my test as simply as I can; I hope you understand my situation and can give me some suggestions. Thank you!
I think you are running a single allreduce here, so we create a single ring; hence, only one interface will be used. If you were to run 4 concurrent allreduce operations (across GPUs 0 of each node, GPUs 1 of each node, ...), then maybe each GPU would pick a different VF. But when NCCL tries to maximize the bandwidth within a node, it can see that all the VFs are actually the same NIC, so it knows there is no point in using all of them, because they map to the same port in the end. So once we've found a path using the first port, we know there is no bandwidth left for the other ports, and we stop there. Your log indicates a single ring.
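To illustrate the "4 concurrent allreduce operations" variant mentioned above, here is a minimal sketch, assuming 2 nodes with 4 GPUs each, one MPI process per GPU, and ranks packed per node (assumptions, not facts from the thread). MPI_Comm_split groups together the processes that use the same local GPU index, giving four 2-rank NCCL communicators that all run an allreduce at the same time. Whether each of them ends up on a different VF is still up to NCCL's topology search, for the reason explained in the comment above.

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int gpusPerNode = 4;                 /* assumption: 4 GPUs / 4 processes per node */
  int localRank = rank % gpusPerNode;  /* assumption: ranks are packed per node */
  cudaSetDevice(localRank);

  /* All processes with the same local GPU index form one 2-rank group. */
  MPI_Comm sub;
  MPI_Comm_split(MPI_COMM_WORLD, localRank, rank, &sub);
  int subRank, subSize;
  MPI_Comm_rank(sub, &subRank);
  MPI_Comm_size(sub, &subSize);

  ncclUniqueId id;
  if (subRank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, sub);

  ncclComm_t comm;
  ncclCommInitRank(&comm, subSize, id, subRank);

  size_t count = 64 * 1024 * 1024;
  float* buf; cudaMalloc(&buf, count * sizeof(float));
  cudaStream_t s; cudaStreamCreate(&s);

  /* Each of the 4 groups runs its own allreduce at the same time. */
  ncclAllReduce(buf, buf, count, ncclFloat, ncclSum, comm, s);
  cudaStreamSynchronize(s);

  ncclCommDestroy(comm);
  cudaFree(buf);
  MPI_Comm_free(&sub);
  MPI_Finalize();
  return 0;
}
```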
Could you please tell me more about the single ring? How do I configure NCCL to use concurrent rings? And how did you tell that NCCL used a single ring? In the log above there are lines like "Connected all trees"... What does this mean? (I am a complete beginner with NCCL, sorry.)
It would use a single ring because more rings would not give better performance, since they all map to the same port. NCCL has pretty advanced topology detection and figures out the GPU, PCI, NIC, and port topology -- then searches for the most optimized path between GPUs and NICs. In the logs, I see you have only 2 channels, and we use 2 channels per ring. Also, this log:
shows that for pattern 4 (ring), the best solution we found was 1 channel, going from NET 0 to GPUs 2, 1, 0, 3 and then back to NET 0.
I really appreciate your answer! Thank you very much, I have learned a lot! I have some more questions. Regarding what you said, I understand it to mean that if NCCL detects only one InfiniBand device and all processes (GPUs) have to share it, only one process uses the InfiniBand device at a time and the other processes wait until that process finishes its job. Am I right? (As far as I know, when multiple plain processes (not using NCCL) share an InfiniBand device, each process creates its own QP (Queue Pair) and they send data concurrently, rather than waiting for one another; but NCCL-based processes wait. Am I right?) I ran the nccl-tests alltoall operation like this (mlx5_0 is an InfiniBand HDR device, not a VF, it is a PF):
But the bandwidth is very, very low. (As far as I know, InfiniBand HDR can provide bandwidth up to 200 Gb/s = 25 GB/s.)
I'm not sure I'd agree with that. Ring Allreduce requires data to go through each GPU and enter/exit the node once. There is no point in having all GPUs communicate between nodes, we just need to do it once in each direction. In the example above, GPU 2 is receiving data and GPU 3 is sending data. Everything is pipelined; there is a constant flow of data entering the NIC going to GPU 2, then being processed by all GPUs and exiting the node. Feel free to watch my GTC talk this year (2022) for a graphical depiction of the ring algorithm and how the rings map to the hardware.
Alltoall would have each GPU use the NIC because there is no way to fuse data (hence no ring), just direct communication, and they may actually use the different VFs (not that it makes any difference). The expected performance, given that they share one NIC, would be 24 GB/s / 8 GPUs = 3 GB/s. 1 GB/s is indeed much lower than it should be, but that is likely because most GPUs have to use a remote NIC, through the CPU, and that path is slow.
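For reference, an alltoall on top of NCCL is typically written as grouped point-to-point calls (this is roughly what the nccl-tests alltoall does), which is why every GPU talks to the NIC directly rather than funnelling traffic through a ring. A minimal sketch; comm, stream, nranks and the buffer layout are assumed to come from the usual setup:

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Each rank exchanges one chunk of 'count' floats with every peer,
 * including itself (which becomes a local copy). */
void alltoall_f32(const float* sendbuf, float* recvbuf, size_t count,
                  int nranks, ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int peer = 0; peer < nranks; peer++) {
    ncclSend(sendbuf + (size_t)peer * count, count, ncclFloat, peer, comm, stream);
    ncclRecv(recvbuf + (size_t)peer * count, count, ncclFloat, peer, comm, stream);
  }
  ncclGroupEnd();
}
```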