Does NCCL support multi-NIC on Ethernet? #601

Open
Dounm opened this issue Nov 24, 2021 · 9 comments

Comments

@Dounm

Dounm commented Nov 24, 2021

I was testing NCCL between two nodes, each with 4 GPUs and 4 Ethernet NICs.

But I found that NCCL only makes use of one Ethernet NIC, even though I have set NCCL_SOCKET_IFNAME=nic0,nic1,nic2,nic3.

The NCCL debug output shows that it has detected all 4 NICs (NCCL INFO Bootstrap : Using xxx).

I saw #452 and know that NCCL supports multi-NIC on RDMA automatically, so I wonder whether NCCL also supports multi-NIC on Ethernet.

@sjeaugey
Member

We do support multi-NIC on Ethernet. Can you run the NCCL perf tests, and also post the log with NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH? What is the performance you get vs what you'd expect with all 4 NICs?
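
One way to read such a GRAPH log is to look at the per-channel lines NCCL prints after its topology search, which in the NCCL 2.7.8 log posted later in this thread look like `0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0`. The sketch below is a hypothetical helper (not part of NCCL or nccl-tests) that collects the distinct NET devices named on those lines for each host. It assumes each such line describes one channel and the NIC it enters and leaves through; the line format is taken from that log and may differ in other NCCL versions.

```python
import re
from collections import defaultdict

# Matches channel lines such as
#   gpu23:1513:1528 [2] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
# The format is assumed from an NCCL 2.7.8 log; other versions may differ.
CHANNEL_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO\s+(?P<chan>\d+) : "
    r"(?P<net_in>NET/\d+) (?:GPU/\d+ )+(?P<net_out>NET/\d+)\s*$"
)


def nets_per_host(log_text):
    """Return {hostname: sorted list of NET devices named on channel lines}."""
    nets = defaultdict(set)
    for line in log_text.splitlines():
        match = CHANNEL_RE.match(line.strip())
        if match:
            nets[match["host"]].update((match["net_in"], match["net_out"]))
    return {host: sorted(devs) for host, devs in nets.items()}


# Sample lines copied from the log in this thread.
sample = """\
gpu23:1513:1528 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1528 [2] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
gpu23:1513:1528 [2] NCCL INFO  1 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1528 [2] NCCL INFO  2 : NET/2 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
gpu23:1513:1528 [2] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
"""
print(nets_per_host(sample))
```

If every NET device appears for a host, the graph search has at least planned channels over each NIC. Whether the traffic then saturates them is a separate question that only the perf test numbers can answer.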

If NCCL is not using all NICs, it's probably because doing so would not improve performance. Ethernet has no GPU Direct RDMA, so all traffic has to go through the CPU and is therefore bottlenecked by the TCP/IP stack in the Linux kernel. On top of that, multi-NIC aggregation works well when GPUs are connected through NVLink (because we can use NVLink to split the operations into N parts and then balance the traffic across the N NICs), but without NVLink it would not help much unless the NICs have low bandwidth, e.g. 10Gb/s or 25Gb/s.
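
To put numbers on "low bandwidth", here is a minimal sketch of the line-rate arithmetic. It is only an upper bound: it assumes 8 bits per byte and ignores protocol overhead and the CPU/TCP bottleneck described above, so measured bandwidth will be lower. The function names are illustrative and not part of any NCCL API.

```python
def nic_gbytes_per_s(gbits_per_s):
    """Convert a NIC line rate from Gb/s to GB/s (8 bits per byte)."""
    return gbits_per_s / 8.0


def aggregate_gbytes_per_s(gbits_per_s, n_nics):
    """Ideal combined line rate of n_nics identical NICs, in GB/s."""
    return nic_gbytes_per_s(gbits_per_s) * n_nics


for rate in (10, 25):
    print(rate, "Gb/s NIC:",
          nic_gbytes_per_s(rate), "GB/s each,",
          aggregate_gbytes_per_s(rate, 4), "GB/s across 4 NICs")
```

A 10Gb/s NIC therefore tops out at 1.25 GB/s, which is the same figure NCCL prints for NET links in a topology dump, and four of them give at most 5 GB/s in aggregate.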

@Dounm
Author

Dounm commented Nov 25, 2021

Each node has 4x RTX 2080 Ti GPUs and 4x 10Gb/s Ethernet NIC ports.

I ran the following command with export NCCL_SOCKET_IFNAME=ens10f0,ens10f1,ens11f0,ens11f1:

mpirun --allow-run-as-root -np 2 --mca btl_tcp_if_include ens11f0 -H gpu23,gpu24 /root/work/nccl-tests/build/sendrecv_perf -b 1M -e 256M -f 2 -g 4

The log is below.

# nThread 1 nGpus 4 minBytes 1048576 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   1513 on      gpu23 device  0 [0x02] NVIDIA GeForce RTX 2080 Ti
#   Rank  1 Pid   1513 on      gpu23 device  1 [0x03] NVIDIA GeForce RTX 2080 Ti
#   Rank  2 Pid   1513 on      gpu23 device  2 [0x83] NVIDIA GeForce RTX 2080 Ti
#   Rank  3 Pid   1513 on      gpu23 device  3 [0x84] NVIDIA GeForce RTX 2080 Ti
#   Rank  4 Pid   1188 on      gpu24 device  0 [0x02] NVIDIA GeForce RTX 2080 Ti
#   Rank  5 Pid   1188 on      gpu24 device  1 [0x03] NVIDIA GeForce RTX 2080 Ti
#   Rank  6 Pid   1188 on      gpu24 device  2 [0x83] NVIDIA GeForce RTX 2080 Ti
#   Rank  7 Pid   1188 on      gpu24 device  3 [0x84] NVIDIA GeForce RTX 2080 Ti
NCCL version 2.7.8+cuda11.1
gpu23:1513:1527 [1] NCCL INFO Attribute coll of node net not found
gpu23:1513:1527 [1] NCCL INFO Attribute coll of node net not found
gpu23:1513:1527 [1] NCCL INFO Attribute coll of node net not found
gpu23:1513:1527 [1] NCCL INFO Attribute coll of node net not found
gpu23:1513:1527 [1] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
gpu23:1513:1527 [1] NCCL INFO === System : maxWidth 1.2 ===
gpu23:1513:1527 [1] NCCL INFO CPU/0 (1/1/1)
gpu23:1513:1527 [1] NCCL INFO + PCI[12.0] - GPU/2000 (0)
gpu23:1513:1527 [1] NCCL INFO + PCI[12.0] - GPU/3000 (1)
gpu23:1513:1527 [1] NCCL INFO + SYS[6.0] - CPU/1
gpu23:1513:1527 [1] NCCL INFO + PCI[3.0] - NIC/1000
gpu23:1513:1527 [1] NCCL INFO              + NET[1.2] - NET/0 (0/0/1.250000)
gpu23:1513:1527 [1] NCCL INFO              + NET[1.2] - NET/1 (1/0/1.250000)
gpu23:1513:1527 [1] NCCL INFO CPU/1 (1/1/1)
gpu23:1513:1527 [1] NCCL INFO + PCI[12.0] - GPU/83000 (2)
gpu23:1513:1527 [1] NCCL INFO + PCI[12.0] - GPU/84000 (3)
gpu23:1513:1527 [1] NCCL INFO + SYS[6.0] - CPU/0
gpu23:1513:1527 [1] NCCL INFO + PCI[3.0] - NIC/82000
gpu23:1513:1527 [1] NCCL INFO              + NET[1.2] - NET/2 (2/0/1.250000)
gpu23:1513:1527 [1] NCCL INFO              + NET[1.2] - NET/3 (3/0/1.250000)
gpu23:1513:1527 [1] NCCL INFO ==========================================
gpu23:1513:1527 [1] NCCL INFO GPU/2000 :GPU/2000 (0/5000.000000/LOC) GPU/3000 (2/12.000000/PHB) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu23:1513:1527 [1] NCCL INFO GPU/3000 :GPU/2000 (2/12.000000/PHB) GPU/3000 (0/5000.000000/LOC) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu23:1513:1527 [1] NCCL INFO GPU/83000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (0/5000.000000/LOC) GPU/84000 (2/12.000000/PHB) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu23:1513:1527 [1] NCCL INFO GPU/84000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (2/12.000000/PHB) GPU/84000 (0/5000.000000/LOC) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu23:1513:1527 [1] NCCL INFO NET/0 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/1.250000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu23:1513:1527 [1] NCCL INFO NET/1 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (2/1.250000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu23:1513:1527 [1] NCCL INFO NET/2 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/1.250000/LOC) 
gpu23:1513:1527 [1] NCCL INFO NET/3 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (2/1.250000/LOC) NET/3 (0/5000.000000/LOC) 
gpu23:1513:1529 [3] NCCL INFO Attribute coll of node net not found
gpu23:1513:1529 [3] NCCL INFO Attribute coll of node net not found
gpu23:1513:1529 [3] NCCL INFO Attribute coll of node net not found
gpu23:1513:1529 [3] NCCL INFO Attribute coll of node net not found
gpu23:1513:1529 [3] NCCL INFO === System : maxWidth 1.2 ===
gpu23:1513:1529 [3] NCCL INFO CPU/0 (1/1/1)
gpu23:1513:1529 [3] NCCL INFO + PCI[12.0] - GPU/2000 (0)
gpu23:1513:1529 [3] NCCL INFO + PCI[12.0] - GPU/3000 (1)
gpu23:1513:1529 [3] NCCL INFO + SYS[6.0] - CPU/1
gpu23:1513:1529 [3] NCCL INFO + PCI[3.0] - NIC/1000
gpu23:1513:1529 [3] NCCL INFO              + NET[1.2] - NET/0 (0/0/1.250000)
gpu23:1513:1529 [3] NCCL INFO              + NET[1.2] - NET/1 (1/0/1.250000)
gpu23:1513:1529 [3] NCCL INFO CPU/1 (1/1/1)
gpu23:1513:1529 [3] NCCL INFO + PCI[12.0] - GPU/83000 (2)
gpu23:1513:1529 [3] NCCL INFO + PCI[12.0] - GPU/84000 (3)
gpu23:1513:1529 [3] NCCL INFO + SYS[6.0] - CPU/0
gpu23:1513:1529 [3] NCCL INFO + PCI[3.0] - NIC/82000
gpu23:1513:1529 [3] NCCL INFO              + NET[1.2] - NET/2 (2/0/1.250000)
gpu23:1513:1529 [3] NCCL INFO              + NET[1.2] - NET/3 (3/0/1.250000)
gpu23:1513:1529 [3] NCCL INFO ==========================================
gpu23:1513:1529 [3] NCCL INFO GPU/2000 :GPU/2000 (0/5000.000000/LOC) GPU/3000 (2/12.000000/PHB) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu23:1513:1529 [3] NCCL INFO GPU/3000 :GPU/2000 (2/12.000000/PHB) GPU/3000 (0/5000.000000/LOC) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu23:1513:1529 [3] NCCL INFO GPU/83000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (0/5000.000000/LOC) GPU/84000 (2/12.000000/PHB) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu23:1513:1529 [3] NCCL INFO GPU/84000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (2/12.000000/PHB) GPU/84000 (0/5000.000000/LOC) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu23:1513:1529 [3] NCCL INFO NET/0 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/1.250000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu23:1513:1529 [3] NCCL INFO NET/1 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (2/1.250000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu23:1513:1529 [3] NCCL INFO NET/2 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/1.250000/LOC) 
gpu23:1513:1529 [3] NCCL INFO NET/3 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (2/1.250000/LOC) NET/3 (0/5000.000000/LOC) 
gpu24:1188:1203 [3] NCCL INFO Attribute coll of node net not found
gpu23:1513:1528 [2] NCCL INFO Attribute coll of node net not found
gpu23:1513:1526 [0] NCCL INFO Attribute coll of node net not found
gpu23:1513:1526 [0] NCCL INFO Attribute coll of node net not found
gpu23:1513:1526 [0] NCCL INFO Attribute coll of node net not found
gpu23:1513:1526 [0] NCCL INFO Attribute coll of node net not found
gpu23:1513:1526 [0] NCCL INFO === System : maxWidth 1.2 ===
gpu23:1513:1526 [0] NCCL INFO CPU/0 (1/1/1)
gpu23:1513:1526 [0] NCCL INFO + PCI[12.0] - GPU/2000 (0)
gpu23:1513:1526 [0] NCCL INFO + PCI[12.0] - GPU/3000 (1)
gpu23:1513:1526 [0] NCCL INFO + SYS[6.0] - CPU/1
gpu23:1513:1526 [0] NCCL INFO + PCI[3.0] - NIC/1000
gpu23:1513:1526 [0] NCCL INFO              + NET[1.2] - NET/0 (0/0/1.250000)
gpu23:1513:1526 [0] NCCL INFO              + NET[1.2] - NET/1 (1/0/1.250000)
gpu23:1513:1526 [0] NCCL INFO CPU/1 (1/1/1)
gpu23:1513:1526 [0] NCCL INFO + PCI[12.0] - GPU/83000 (2)
gpu23:1513:1526 [0] NCCL INFO + PCI[12.0] - GPU/84000 (3)
gpu23:1513:1526 [0] NCCL INFO + SYS[6.0] - CPU/0
gpu23:1513:1526 [0] NCCL INFO + PCI[3.0] - NIC/82000
gpu23:1513:1526 [0] NCCL INFO              + NET[1.2] - NET/2 (2/0/1.250000)
gpu23:1513:1526 [0] NCCL INFO              + NET[1.2] - NET/3 (3/0/1.250000)
gpu23:1513:1526 [0] NCCL INFO ==========================================
gpu23:1513:1526 [0] NCCL INFO GPU/2000 :GPU/2000 (0/5000.000000/LOC) GPU/3000 (2/12.000000/PHB) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu23:1513:1526 [0] NCCL INFO GPU/3000 :GPU/2000 (2/12.000000/PHB) GPU/3000 (0/5000.000000/LOC) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu23:1513:1526 [0] NCCL INFO GPU/83000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (0/5000.000000/LOC) GPU/84000 (2/12.000000/PHB) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu23:1513:1526 [0] NCCL INFO GPU/84000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (2/12.000000/PHB) GPU/84000 (0/5000.000000/LOC) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu23:1513:1526 [0] NCCL INFO NET/0 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/1.250000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu23:1513:1526 [0] NCCL INFO NET/1 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (2/1.250000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu23:1513:1526 [0] NCCL INFO NET/2 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/1.250000/LOC) 
gpu23:1513:1526 [0] NCCL INFO NET/3 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (2/1.250000/LOC) NET/3 (0/5000.000000/LOC) 
gpu23:1513:1528 [2] NCCL INFO Attribute coll of node net not found
gpu23:1513:1528 [2] NCCL INFO Attribute coll of node net not found
gpu23:1513:1528 [2] NCCL INFO Attribute coll of node net not found
gpu23:1513:1528 [2] NCCL INFO === System : maxWidth 1.2 ===
gpu23:1513:1528 [2] NCCL INFO CPU/0 (1/1/1)
gpu23:1513:1528 [2] NCCL INFO + PCI[12.0] - GPU/2000 (0)
gpu23:1513:1528 [2] NCCL INFO + PCI[12.0] - GPU/3000 (1)
gpu23:1513:1528 [2] NCCL INFO + SYS[6.0] - CPU/1
gpu23:1513:1528 [2] NCCL INFO + PCI[3.0] - NIC/1000
gpu23:1513:1528 [2] NCCL INFO              + NET[1.2] - NET/0 (0/0/1.250000)
gpu23:1513:1528 [2] NCCL INFO              + NET[1.2] - NET/1 (1/0/1.250000)
gpu23:1513:1528 [2] NCCL INFO CPU/1 (1/1/1)
gpu23:1513:1528 [2] NCCL INFO + PCI[12.0] - GPU/83000 (2)
gpu23:1513:1528 [2] NCCL INFO + PCI[12.0] - GPU/84000 (3)
gpu23:1513:1528 [2] NCCL INFO + SYS[6.0] - CPU/0
gpu23:1513:1528 [2] NCCL INFO + PCI[3.0] - NIC/82000
gpu23:1513:1528 [2] NCCL INFO              + NET[1.2] - NET/2 (2/0/1.250000)
gpu23:1513:1528 [2] NCCL INFO              + NET[1.2] - NET/3 (3/0/1.250000)
gpu23:1513:1528 [2] NCCL INFO ==========================================
gpu23:1513:1528 [2] NCCL INFO GPU/2000 :GPU/2000 (0/5000.000000/LOC) GPU/3000 (2/12.000000/PHB) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu23:1513:1528 [2] NCCL INFO GPU/3000 :GPU/2000 (2/12.000000/PHB) GPU/3000 (0/5000.000000/LOC) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu23:1513:1528 [2] NCCL INFO GPU/83000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (0/5000.000000/LOC) GPU/84000 (2/12.000000/PHB) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu23:1513:1528 [2] NCCL INFO GPU/84000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (2/12.000000/PHB) GPU/84000 (0/5000.000000/LOC) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu23:1513:1528 [2] NCCL INFO NET/0 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/1.250000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu23:1513:1528 [2] NCCL INFO NET/1 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (2/1.250000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu23:1513:1528 [2] NCCL INFO NET/2 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/1.250000/LOC) 
gpu23:1513:1528 [2] NCCL INFO NET/3 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (2/1.250000/LOC) NET/3 (0/5000.000000/LOC) 
gpu24:1188:1203 [3] NCCL INFO Attribute coll of node net not found
gpu24:1188:1203 [3] NCCL INFO Attribute coll of node net not found
gpu24:1188:1203 [3] NCCL INFO Attribute coll of node net not found
gpu24:1188:1203 [3] NCCL INFO NCCL_SHM_DISABLE set by environment to 0.
gpu24:1188:1203 [3] NCCL INFO === System : maxWidth 1.2 ===
gpu24:1188:1203 [3] NCCL INFO CPU/0 (1/1/1)
gpu24:1188:1203 [3] NCCL INFO + PCI[12.0] - GPU/2000 (4)
gpu24:1188:1203 [3] NCCL INFO + PCI[12.0] - GPU/3000 (5)
gpu24:1188:1203 [3] NCCL INFO + SYS[6.0] - CPU/1
gpu24:1188:1203 [3] NCCL INFO + PCI[3.0] - NIC/1000
gpu24:1188:1203 [3] NCCL INFO              + NET[1.2] - NET/0 (0/0/1.250000)
gpu24:1188:1203 [3] NCCL INFO              + NET[1.2] - NET/1 (1/0/1.250000)
gpu24:1188:1203 [3] NCCL INFO CPU/1 (1/1/1)
gpu24:1188:1203 [3] NCCL INFO + PCI[12.0] - GPU/83000 (6)
gpu24:1188:1203 [3] NCCL INFO + PCI[12.0] - GPU/84000 (7)
gpu24:1188:1203 [3] NCCL INFO + SYS[6.0] - CPU/0
gpu24:1188:1203 [3] NCCL INFO + PCI[3.0] - NIC/82000
gpu24:1188:1203 [3] NCCL INFO              + NET[1.2] - NET/2 (2/0/1.250000)
gpu24:1188:1203 [3] NCCL INFO              + NET[1.2] - NET/3 (3/0/1.250000)
gpu24:1188:1203 [3] NCCL INFO ==========================================
gpu24:1188:1203 [3] NCCL INFO GPU/2000 :GPU/2000 (0/5000.000000/LOC) GPU/3000 (2/12.000000/PHB) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu24:1188:1203 [3] NCCL INFO GPU/3000 :GPU/2000 (2/12.000000/PHB) GPU/3000 (0/5000.000000/LOC) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu24:1188:1203 [3] NCCL INFO GPU/83000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (0/5000.000000/LOC) GPU/84000 (2/12.000000/PHB) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu24:1188:1203 [3] NCCL INFO GPU/84000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (2/12.000000/PHB) GPU/84000 (0/5000.000000/LOC) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu24:1188:1203 [3] NCCL INFO NET/0 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/1.250000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu24:1188:1203 [3] NCCL INFO NET/1 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (2/1.250000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu24:1188:1203 [3] NCCL INFO NET/2 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/1.250000/LOC) 
gpu24:1188:1203 [3] NCCL INFO NET/3 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (2/1.250000/LOC) NET/3 (0/5000.000000/LOC) 
gpu24:1188:1202 [2] NCCL INFO Attribute coll of node net not found
gpu24:1188:1202 [2] NCCL INFO Attribute coll of node net not found
gpu24:1188:1202 [2] NCCL INFO Attribute coll of node net not found
gpu24:1188:1202 [2] NCCL INFO Attribute coll of node net not found
gpu24:1188:1202 [2] NCCL INFO === System : maxWidth 1.2 ===
gpu24:1188:1202 [2] NCCL INFO CPU/0 (1/1/1)
gpu24:1188:1202 [2] NCCL INFO + PCI[12.0] - GPU/2000 (4)
gpu24:1188:1202 [2] NCCL INFO + PCI[12.0] - GPU/3000 (5)
gpu24:1188:1202 [2] NCCL INFO + SYS[6.0] - CPU/1
gpu24:1188:1202 [2] NCCL INFO + PCI[3.0] - NIC/1000
gpu24:1188:1202 [2] NCCL INFO              + NET[1.2] - NET/0 (0/0/1.250000)
gpu24:1188:1202 [2] NCCL INFO              + NET[1.2] - NET/1 (1/0/1.250000)
gpu24:1188:1202 [2] NCCL INFO CPU/1 (1/1/1)
gpu24:1188:1202 [2] NCCL INFO + PCI[12.0] - GPU/83000 (6)
gpu24:1188:1202 [2] NCCL INFO + PCI[12.0] - GPU/84000 (7)
gpu24:1188:1202 [2] NCCL INFO + SYS[6.0] - CPU/0
gpu24:1188:1202 [2] NCCL INFO + PCI[3.0] - NIC/82000
gpu24:1188:1202 [2] NCCL INFO              + NET[1.2] - NET/2 (2/0/1.250000)
gpu24:1188:1202 [2] NCCL INFO              + NET[1.2] - NET/3 (3/0/1.250000)
gpu24:1188:1202 [2] NCCL INFO ==========================================
gpu24:1188:1202 [2] NCCL INFO GPU/2000 :GPU/2000 (0/5000.000000/LOC) GPU/3000 (2/12.000000/PHB) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu24:1188:1202 [2] NCCL INFO GPU/3000 :GPU/2000 (2/12.000000/PHB) GPU/3000 (0/5000.000000/LOC) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu24:1188:1202 [2] NCCL INFO GPU/83000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (0/5000.000000/LOC) GPU/84000 (2/12.000000/PHB) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu24:1188:1202 [2] NCCL INFO GPU/84000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (2/12.000000/PHB) GPU/84000 (0/5000.000000/LOC) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu24:1188:1202 [2] NCCL INFO NET/0 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/1.250000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu24:1188:1202 [2] NCCL INFO NET/1 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (2/1.250000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu24:1188:1202 [2] NCCL INFO NET/2 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/1.250000/LOC) 
gpu24:1188:1202 [2] NCCL INFO NET/3 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (2/1.250000/LOC) NET/3 (0/5000.000000/LOC) 
gpu24:1188:1200 [0] NCCL INFO Attribute coll of node net not found
gpu24:1188:1200 [0] NCCL INFO Attribute coll of node net not found
gpu24:1188:1200 [0] NCCL INFO Attribute coll of node net not found
gpu24:1188:1200 [0] NCCL INFO Attribute coll of node net not found
gpu24:1188:1200 [0] NCCL INFO === System : maxWidth 1.2 ===
gpu24:1188:1201 [1] NCCL INFO Attribute coll of node net not found
gpu24:1188:1201 [1] NCCL INFO Attribute coll of node net not found
gpu24:1188:1201 [1] NCCL INFO Attribute coll of node net not found
gpu24:1188:1201 [1] NCCL INFO Attribute coll of node net not found
gpu24:1188:1201 [1] NCCL INFO === System : maxWidth 1.2 ===
gpu24:1188:1201 [1] NCCL INFO CPU/0 (1/1/1)
gpu24:1188:1201 [1] NCCL INFO + PCI[12.0] - GPU/2000 (4)
gpu24:1188:1201 [1] NCCL INFO + PCI[12.0] - GPU/3000 (5)
gpu24:1188:1201 [1] NCCL INFO + SYS[6.0] - CPU/1
gpu24:1188:1201 [1] NCCL INFO + PCI[3.0] - NIC/1000
gpu24:1188:1201 [1] NCCL INFO              + NET[1.2] - NET/0 (0/0/1.250000)
gpu24:1188:1201 [1] NCCL INFO              + NET[1.2] - NET/1 (1/0/1.250000)
gpu24:1188:1201 [1] NCCL INFO CPU/1 (1/1/1)
gpu24:1188:1201 [1] NCCL INFO + PCI[12.0] - GPU/83000 (6)
gpu24:1188:1201 [1] NCCL INFO + PCI[12.0] - GPU/84000 (7)
gpu24:1188:1201 [1] NCCL INFO + SYS[6.0] - CPU/0
gpu24:1188:1201 [1] NCCL INFO + PCI[3.0] - NIC/82000
gpu24:1188:1201 [1] NCCL INFO              + NET[1.2] - NET/2 (2/0/1.250000)
gpu24:1188:1201 [1] NCCL INFO              + NET[1.2] - NET/3 (3/0/1.250000)
gpu24:1188:1201 [1] NCCL INFO ==========================================
gpu24:1188:1201 [1] NCCL INFO GPU/2000 :GPU/2000 (0/5000.000000/LOC) GPU/3000 (2/12.000000/PHB) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu24:1188:1201 [1] NCCL INFO GPU/3000 :GPU/2000 (2/12.000000/PHB) GPU/3000 (0/5000.000000/LOC) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu24:1188:1201 [1] NCCL INFO GPU/83000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (0/5000.000000/LOC) GPU/84000 (2/12.000000/PHB) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu24:1188:1201 [1] NCCL INFO GPU/84000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (2/12.000000/PHB) GPU/84000 (0/5000.000000/LOC) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu24:1188:1201 [1] NCCL INFO NET/0 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/1.250000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu24:1188:1201 [1] NCCL INFO NET/1 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (2/1.250000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu24:1188:1201 [1] NCCL INFO NET/2 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/1.250000/LOC) 
gpu24:1188:1201 [1] NCCL INFO NET/3 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (2/1.250000/LOC) NET/3 (0/5000.000000/LOC) 
gpu24:1188:1200 [0] NCCL INFO CPU/0 (1/1/1)
gpu24:1188:1200 [0] NCCL INFO + PCI[12.0] - GPU/2000 (4)
gpu24:1188:1200 [0] NCCL INFO + PCI[12.0] - GPU/3000 (5)
gpu24:1188:1200 [0] NCCL INFO + SYS[6.0] - CPU/1
gpu24:1188:1200 [0] NCCL INFO + PCI[3.0] - NIC/1000
gpu24:1188:1200 [0] NCCL INFO              + NET[1.2] - NET/0 (0/0/1.250000)
gpu24:1188:1200 [0] NCCL INFO              + NET[1.2] - NET/1 (1/0/1.250000)
gpu24:1188:1200 [0] NCCL INFO CPU/1 (1/1/1)
gpu24:1188:1200 [0] NCCL INFO + PCI[12.0] - GPU/83000 (6)
gpu24:1188:1200 [0] NCCL INFO + PCI[12.0] - GPU/84000 (7)
gpu24:1188:1200 [0] NCCL INFO + SYS[6.0] - CPU/0
gpu24:1188:1200 [0] NCCL INFO + PCI[3.0] - NIC/82000
gpu24:1188:1200 [0] NCCL INFO              + NET[1.2] - NET/2 (2/0/1.250000)
gpu24:1188:1200 [0] NCCL INFO              + NET[1.2] - NET/3 (3/0/1.250000)
gpu24:1188:1200 [0] NCCL INFO ==========================================
gpu24:1188:1200 [0] NCCL INFO GPU/2000 :GPU/2000 (0/5000.000000/LOC) GPU/3000 (2/12.000000/PHB) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu24:1188:1200 [0] NCCL INFO GPU/3000 :GPU/2000 (2/12.000000/PHB) GPU/3000 (0/5000.000000/LOC) GPU/83000 (3/6.000000/SYS) GPU/84000 (3/6.000000/SYS) CPU/0 (1/12.000000/PHB) CPU/1 (2/6.000000/SYS) NET/0 (3/1.250000/PHB) NET/1 (3/1.250000/PHB) NET/2 (4/1.250000/SYS) NET/3 (4/1.250000/SYS) 
gpu24:1188:1200 [0] NCCL INFO GPU/83000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (0/5000.000000/LOC) GPU/84000 (2/12.000000/PHB) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu24:1188:1200 [0] NCCL INFO GPU/84000 :GPU/2000 (3/6.000000/SYS) GPU/3000 (3/6.000000/SYS) GPU/83000 (2/12.000000/PHB) GPU/84000 (0/5000.000000/LOC) CPU/0 (2/6.000000/SYS) CPU/1 (1/12.000000/PHB) NET/0 (4/1.250000/SYS) NET/1 (4/1.250000/SYS) NET/2 (3/1.250000/PHB) NET/3 (3/1.250000/PHB) 
gpu24:1188:1200 [0] NCCL INFO NET/0 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (0/5000.000000/LOC) NET/1 (2/1.250000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu24:1188:1200 [0] NCCL INFO NET/1 :GPU/2000 (3/1.250000/PHB) GPU/3000 (3/1.250000/PHB) GPU/83000 (4/1.250000/SYS) GPU/84000 (4/1.250000/SYS) CPU/0 (2/1.250000/PHB) CPU/1 (3/1.250000/SYS) NET/0 (2/1.250000/LOC) NET/1 (0/5000.000000/LOC) NET/2 (5/1.250000/SYS) NET/3 (5/1.250000/SYS) 
gpu24:1188:1200 [0] NCCL INFO NET/2 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (0/5000.000000/LOC) NET/3 (2/1.250000/LOC) 
gpu24:1188:1200 [0] NCCL INFO NET/3 :GPU/2000 (4/1.250000/SYS) GPU/3000 (4/1.250000/SYS) GPU/83000 (3/1.250000/PHB) GPU/84000 (3/1.250000/PHB) CPU/0 (3/1.250000/SYS) CPU/1 (2/1.250000/PHB) NET/0 (5/1.250000/SYS) NET/1 (5/1.250000/SYS) NET/2 (2/1.250000/LOC) NET/3 (0/5000.000000/LOC) 
gpu23:1513:1528 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu24:1188:1201 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1528 [2] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
gpu23:1513:1528 [2] NCCL INFO  1 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1528 [2] NCCL INFO  2 : NET/2 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
gpu23:1513:1528 [2] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
gpu24:1188:1201 [1] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
gpu24:1188:1201 [1] NCCL INFO  1 : NET/1 GPU/4 GPU/5 GPU/6 GPU/7 NET/1
gpu24:1188:1201 [1] NCCL INFO  2 : NET/2 GPU/4 GPU/5 GPU/6 GPU/7 NET/2
gpu24:1188:1201 [1] NCCL INFO  3 : NET/3 GPU/6 GPU/7 GPU/4 GPU/5 NET/3
gpu23:1513:1526 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1526 [0] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
gpu23:1513:1526 [0] NCCL INFO  1 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1526 [0] NCCL INFO  2 : NET/2 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
gpu23:1513:1526 [0] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
gpu23:1513:1529 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1529 [3] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
gpu23:1513:1529 [3] NCCL INFO  1 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1529 [3] NCCL INFO  2 : NET/2 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
gpu23:1513:1529 [3] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
gpu23:1513:1527 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1527 [1] NCCL INFO  0 : NET/0 GPU/0 GPU/1 GPU/2 GPU/3 NET/0
gpu23:1513:1527 [1] NCCL INFO  1 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1527 [1] NCCL INFO  2 : NET/2 GPU/0 GPU/1 GPU/2 GPU/3 NET/2
gpu23:1513:1527 [1] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
gpu24:1188:1200 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu24:1188:1200 [0] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
gpu24:1188:1200 [0] NCCL INFO  1 : NET/1 GPU/4 GPU/5 GPU/6 GPU/7 NET/1
gpu24:1188:1200 [0] NCCL INFO  2 : NET/2 GPU/4 GPU/5 GPU/6 GPU/7 NET/2
gpu24:1188:1200 [0] NCCL INFO  3 : NET/3 GPU/6 GPU/7 GPU/4 GPU/5 NET/3
gpu24:1188:1203 [3] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu24:1188:1203 [3] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
gpu24:1188:1203 [3] NCCL INFO  1 : NET/1 GPU/4 GPU/5 GPU/6 GPU/7 NET/1
gpu24:1188:1203 [3] NCCL INFO  2 : NET/2 GPU/4 GPU/5 GPU/6 GPU/7 NET/2
gpu24:1188:1203 [3] NCCL INFO  3 : NET/3 GPU/6 GPU/7 GPU/4 GPU/5 NET/3
gpu24:1188:1202 [2] NCCL INFO Pattern 4, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu24:1188:1202 [2] NCCL INFO  0 : NET/0 GPU/4 GPU/5 GPU/6 GPU/7 NET/0
gpu24:1188:1202 [2] NCCL INFO  1 : NET/1 GPU/4 GPU/5 GPU/6 GPU/7 NET/1
gpu24:1188:1202 [2] NCCL INFO  2 : NET/2 GPU/4 GPU/5 GPU/6 GPU/7 NET/2
gpu24:1188:1202 [2] NCCL INFO  3 : NET/3 GPU/6 GPU/7 GPU/4 GPU/5 NET/3
gpu23:1513:1526 [0] NCCL INFO Pattern 2, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1529 [3] NCCL INFO Pattern 2, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1527 [1] NCCL INFO Pattern 2, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1527 [1] NCCL INFO  0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1529 [3] NCCL INFO  0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1527 [1] NCCL INFO  1 : NET/0 GPU/0 GPU/2 GPU/3 GPU/1 NET/0
gpu23:1513:1529 [3] NCCL INFO  1 : NET/0 GPU/0 GPU/2 GPU/3 GPU/1 NET/0
gpu23:1513:1527 [1] NCCL INFO  2 : NET/2 GPU/2 GPU/1 GPU/0 GPU/3 NET/2
gpu23:1513:1529 [3] NCCL INFO  2 : NET/2 GPU/2 GPU/1 GPU/0 GPU/3 NET/2
gpu23:1513:1529 [3] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
gpu23:1513:1529 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
gpu23:1513:1527 [1] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
gpu23:1513:1527 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
gpu23:1513:1526 [0] NCCL INFO  0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1526 [0] NCCL INFO  1 : NET/0 GPU/0 GPU/2 GPU/3 GPU/1 NET/0
gpu23:1513:1526 [0] NCCL INFO  2 : NET/2 GPU/2 GPU/1 GPU/0 GPU/3 NET/2
gpu23:1513:1526 [0] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
gpu23:1513:1526 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
gpu23:1513:1528 [2] NCCL INFO Pattern 2, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu23:1513:1528 [2] NCCL INFO  0 : NET/1 GPU/0 GPU/1 GPU/2 GPU/3 NET/1
gpu23:1513:1528 [2] NCCL INFO  1 : NET/0 GPU/0 GPU/2 GPU/3 GPU/1 NET/0
gpu23:1513:1528 [2] NCCL INFO  2 : NET/2 GPU/2 GPU/1 GPU/0 GPU/3 NET/2
gpu23:1513:1528 [2] NCCL INFO  3 : NET/3 GPU/2 GPU/3 GPU/0 GPU/1 NET/3
gpu23:1513:1528 [2] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
gpu24:1188:1203 [3] NCCL INFO Pattern 2, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu24:1188:1201 [1] NCCL INFO Pattern 2, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu24:1188:1201 [1] NCCL INFO  0 : NET/1 GPU/4 GPU/5 GPU/6 GPU/7 NET/1
gpu24:1188:1201 [1] NCCL INFO  1 : NET/0 GPU/4 GPU/6 GPU/7 GPU/5 NET/0
gpu24:1188:1201 [1] NCCL INFO  2 : NET/2 GPU/6 GPU/5 GPU/4 GPU/7 NET/2
gpu24:1188:1201 [1] NCCL INFO  3 : NET/3 GPU/6 GPU/7 GPU/4 GPU/5 NET/3
gpu24:1188:1201 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
gpu24:1188:1200 [0] NCCL INFO Pattern 2, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu24:1188:1203 [3] NCCL INFO  0 : NET/1 GPU/4 GPU/5 GPU/6 GPU/7 NET/1
gpu24:1188:1203 [3] NCCL INFO  1 : NET/0 GPU/4 GPU/6 GPU/7 GPU/5 NET/0
gpu24:1188:1203 [3] NCCL INFO  2 : NET/2 GPU/6 GPU/5 GPU/4 GPU/7 NET/2
gpu24:1188:1203 [3] NCCL INFO  3 : NET/3 GPU/6 GPU/7 GPU/4 GPU/5 NET/3
gpu24:1188:1203 [3] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
gpu24:1188:1200 [0] NCCL INFO  0 : NET/1 GPU/4 GPU/5 GPU/6 GPU/7 NET/1
gpu24:1188:1200 [0] NCCL INFO  1 : NET/0 GPU/4 GPU/6 GPU/7 GPU/5 NET/0
gpu24:1188:1200 [0] NCCL INFO  2 : NET/2 GPU/6 GPU/5 GPU/4 GPU/7 NET/2
gpu24:1188:1200 [0] NCCL INFO  3 : NET/3 GPU/6 GPU/7 GPU/4 GPU/5 NET/3
gpu24:1188:1200 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
gpu24:1188:1202 [2] NCCL INFO Pattern 2, crossNic 0, nChannels 4, speed 1.200000/1.200000, type SYS/SYS, sameChannels 0
gpu24:1188:1202 [2] NCCL INFO  0 : NET/1 GPU/4 GPU/5 GPU/6 GPU/7 NET/1
gpu24:1188:1202 [2] NCCL INFO  1 : NET/0 GPU/4 GPU/6 GPU/7 GPU/5 NET/0
gpu24:1188:1202 [2] NCCL INFO  2 : NET/2 GPU/6 GPU/5 GPU/4 GPU/7 NET/2
gpu24:1188:1202 [2] NCCL INFO  3 : NET/3 GPU/6 GPU/7 GPU/4 GPU/5 NET/3
gpu24:1188:1202 [2] NCCL INFO Pattern 3, crossNic 0, nChannels 0, speed 0.000000/0.000000, type NVL/PIX, sameChannels 1
#
#                                               out-of-place                       in-place          
#       size         count      type     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576        262144     float    962.8    1.09    1.09  0e+00    952.7    1.10    1.10  0e+00
     2097152        524288     float   1841.4    1.14    1.14  0e+00   1878.1    1.12    1.12  1e+00
     4194304       1048576     float   3626.6    1.16    1.16  0e+00   3625.7    1.16    1.16  1e+00
     8388608       2097152     float   7451.3    1.13    1.13  0e+00   7398.5    1.13    1.13  1e+00
    16777216       4194304     float    14334    1.17    1.17  0e+00    14527    1.15    1.15  1e+00
    33554432       8388608     float    29235    1.15    1.15  0e+00    28666    1.17    1.17  1e+00
    67108864      16777216     float    57194    1.17    1.17  0e+00    57242    1.17    1.17  1e+00
   134217728      33554432     float   114299    1.17    1.17  0e+00   114367    1.17    1.17  1e+00
   268435456      67108864     float   229481    1.17    1.17  0e+00   228908    1.17    1.17  1e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.14988 
#

The final busbw is 1.17 GB/s, which means NCCL is only making use of a single 10 Gb/s NIC port.
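
As a rough sanity check of that reading, the sketch below converts the 10 Gb/s line rate to GB/s and recomputes the bandwidth from the last row of the table above (message size and out-of-place time). The function names are illustrative and not part of nccl-tests, and the calculation ignores Ethernet and TCP/IP protocol overhead.

```python
def line_rate_gbytes(gbits_per_s: float) -> float:
    """Convert a link speed in Gb/s to GB/s (8 bits per byte)."""
    return gbits_per_s / 8.0


def measured_gbytes(size_bytes: int, time_us: float) -> float:
    """Bandwidth in GB/s from a message size in bytes and a time in microseconds."""
    # bytes per microsecond equals MB/s; divide by 1e3 to get GB/s
    return size_bytes / time_us / 1e3


one_nic = line_rate_gbytes(10)                 # 1.25 GB/s per 10 Gb/s port
four_nics = 4 * one_nic                        # 5.0 GB/s if all four ports were busy
observed = measured_gbytes(268435456, 229481)  # last row, out-of-place

print(f"one NIC: {one_nic:.2f} GB/s, four NICs: {four_nics:.2f} GB/s")
print(f"observed: {observed:.2f} GB/s ({observed / one_nic:.0%} of one NIC)")
```

The observed figure sits just under the single-port line rate and far below the four-port aggregate, which is consistent with traffic flowing through one port.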

I also confirmed this conclusion with ifstat:
[Image: ifstat output]

Another thing that may be useful: NCCL only makes use of one NIC attached to CPU0, which is either ens10f0 or ens10f1, even when these two NICs are not specified in NCCL_SOCKET_IFNAME.

@sjeaugey
Member

I can confirm that NCCL detects the 4 NICs correctly and creates 4 rings, each using a different NIC. If you run with just NCCL_DEBUG=INFO (not setting NCCL_DEBUG_SUBSYS) you should see all NICs being used for the connections.

Now, for some reason, you end up seeing traffic on only one NIC. Did you define a different subnet for each interface, or did you put all interfaces in the same subnet? The latter could cause the Linux kernel to route all TCP/IP traffic through a single interface.

@Dounm
Author

Dounm commented Nov 25, 2021

[Image: NIC configuration]

This is the configuration of our NICs. As you can see, ens10f0, ens11f0, and ens11f1 are in the 11.1.2.x subnet, and ens10f1 is in the 192.168.2.x subnet.

So, what is the recommended configuration? Should all the interfaces be put into the same subnet?
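
To make the subnet layout explicit, here is a minimal sketch that groups interfaces by network using Python's standard `ipaddress` module and flags any subnet shared by more than one interface. Only the two subnets (11.1.2.x and 192.168.2.x) come from this thread; the host addresses and the /24 prefix length are assumptions made for illustration.

```python
import ipaddress
from collections import defaultdict

# Hypothetical addresses: only the subnets (11.1.2.x and 192.168.2.x) come
# from this thread; the host parts and the /24 prefix length are assumed.
interfaces = {
    "ens10f0": "11.1.2.23/24",
    "ens10f1": "192.168.2.23/24",
    "ens11f0": "11.1.2.123/24",
    "ens11f1": "11.1.2.223/24",
}


def group_by_subnet(ifaces):
    """Map each network to the sorted list of interfaces configured in it."""
    groups = defaultdict(list)
    for name, cidr in ifaces.items():
        network = ipaddress.ip_interface(cidr).network
        groups[str(network)].append(name)
    return {net: sorted(names) for net, names in groups.items()}


subnets = group_by_subnet(interfaces)
for net, names in subnets.items():
    note = "  <- shared subnet" if len(names) > 1 else ""
    print(f"{net}: {', '.join(names)}{note}")
```

With these assumed addresses, three interfaces land in the same network, which is the situation where the kernel may pick a single outgoing interface for all of them.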

@sjeaugey
Member

I think it's possible to have all NICs in the same subnet, but that might require complex ARP and IP routing configuration to make sure all traffic uses the right NIC. So to understand what's happening here, the easiest approach is to put them in different subnets, even if later, to run alltoall-like patterns, you might need to either set up routing between subnets or put them back in the same subnet.

And even if all 3 NICs in net 11.1.2.x were to only use a single NIC, I'd think we should see 2x1.2 = 2.4 GB/s, yet it seems that all traffic goes through the 192.168.2.x network. I'd still think this is an IP configuration issue, but to be sure, can you double check that the NCCL log shows NET/TCP/1, NET/TCP/0, NET/TCP/2 and NET/TCP/3 being used?

@Dounm
Author

Dounm commented Nov 26, 2021

I ran the NCCL perf test with NCCL_DEBUG=INFO and got the log; some parts are listed below.

gpu23:2625:2640 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
gpu23:2625:2640 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/4/-1->2->0|0->2->3/4/-1 [2] 1/-1/-1->2->-1|-1->2->1/-1/-1 [3] 3/-1/-1->2->-1|-1->2->3/-1/-1 [4] 3/-1/-1->2->1|1->2->3/-1/-1 [5] 3/-1/-1->2->0|0->2->3/-1/-1 [6] 1/-1/-1->2->5|5->2->1/-1/-1 [7] 3/-1/-1->2->7|7->2->3/-1/-1
gpu24:2321:2333 [0] NCCL INFO Channel 00 : 3[84000] -> 4[2000] [receive] via NET/Socket/0
gpu23:2625:2638 [0] NCCL INFO Channel 00/08 :    0   1   2   3   4   5   6   7
gpu23:2625:2639 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
gpu23:2625:2639 [1] NCCL INFO Trees [0] 2/4/-1->1->0|0->1->2/4/-1 [1] -1/-1/-1->1->3|3->1->-1/-1/-1 [2] 0/6/-1->1->2|2->1->0/6/-1 [3] -1/-1/-1->1->0|0->1->-1/-1/-1 [4] 2/-1/-1->1->0|0->1->2/-1/-1 [5] -1/-1/-1->1->3|3->1->-1/-1/-1 [6] 0/-1/-1->1->2|2->1->0/-1/-1 [7] -1/-1/-1->1->0|0->1->-1/-1/-1
gpu23:2625:2639 [1] NCCL INFO Setting affinity for GPU 1 to 10,00000001
gpu23:2625:2638 [0] NCCL INFO Channel 01/08 :    0   1   2   3   4   5   6   7
gpu23:2625:2638 [0] NCCL INFO Channel 02/08 :    0   1   2   3   4   5   6   7
gpu23:2625:2638 [0] NCCL INFO Channel 03/08 :    0   1   6   7   4   5   2   3
gpu23:2625:2638 [0] NCCL INFO Channel 04/08 :    0   1   2   3   4   5   6   7
gpu23:2625:2638 [0] NCCL INFO Channel 05/08 :    0   1   2   3   4   5   6   7
gpu23:2625:2638 [0] NCCL INFO Channel 06/08 :    0   1   2   3   4   5   6   7
gpu23:2625:2638 [0] NCCL INFO Channel 07/08 :    0   1   6   7   4   5   2   3
gpu23:2625:2638 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 8/8/64
gpu23:2625:2638 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 2/-1/-1->0->-1|-1->0->2/-1/-1 [2] 3/-1/-1->0->1|1->0->3/-1/-1 [3] 1/-1/-1->0->3|3->0->1/-1/-1 [4] 1/-1/-1->0->5|5->0->1/-1/-1 [5] 2/-1/-1->0->6|6->0->2/-1/-1 [6] 3/-1/-1->0->1|1->0->3/-1/-1 [7] 1/-1/-1->0->3|3->0->1/-1/-1
gpu23:2625:2638 [0] NCCL INFO Setting affinity for GPU 0 to 10,00000001
gpu24:2321:2333 [0] NCCL INFO Channel 00 : 4[2000] -> 5[3000] via direct shared memory
gpu24:2321:2334 [1] NCCL INFO Channel 00 : 5[3000] -> 6[83000] via direct shared memory
gpu23:2625:2638 [0] NCCL INFO Channel 00 : 7[84000] -> 0[2000] [receive] via NET/Socket/0
gpu24:2321:2336 [3] NCCL INFO Channel 00 : 7[84000] -> 0[2000] [send] via NET/Socket/0
gpu23:2625:2639 [1] NCCL INFO Channel 00 : 1[3000] -> 2[83000] via direct shared memory
gpu23:2625:2640 [2] NCCL INFO Channel 00 : 2[83000] -> 3[84000] via direct shared memory
gpu23:2625:2641 [3] NCCL INFO Channel 00 : 3[84000] -> 2[83000] via direct shared memory
gpu23:2625:2638 [0] NCCL INFO Channel 00 : 0[2000] -> 1[3000] via direct shared memory
gpu24:2321:2335 [2] NCCL INFO Channel 00 : 6[83000] -> 7[84000] via direct shared memory
gpu24:2321:2333 [0] NCCL INFO Channel 00 : 4[2000] -> 1[3000] [send] via NET/Socket/1
gpu24:2321:2336 [3] NCCL INFO Channel 00 : 7[84000] -> 6[83000] via direct shared memory
gpu24:2321:2334 [1] NCCL INFO Channel 00 : 5[3000] -> 4[2000] via direct shared memory
gpu23:2625:2640 [2] NCCL INFO Channel 00 : 2[83000] -> 1[3000] via direct shared memory

You can see from the log that there are 8 channels in total, that inter-node communication goes through NET/Socket/x, and that intra-node communication goes through direct shared memory.

> can you double check that the NCCL log shows NET/TCP/1, NET/TCP/0, NET/TCP/2 and NET/TCP/3 being used?

I didn't find NET/TCP/x in the log; I only found NET/Socket/0, NET/Socket/1, NET/Socket/2, and NET/Socket/3. I'm not sure whether these are the same thing.
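
One way to check which network devices appear in a log like this is to count the `via NET/Socket/N` connection lines per device index. The sketch below does that over a few lines copied from the excerpt above; the regular expression is an assumption based on the line format seen in this log, not a documented NCCL format.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt in this comment.
log = """\
gpu24:2321:2333 [0] NCCL INFO Channel 00 : 3[84000] -> 4[2000] [receive] via NET/Socket/0
gpu23:2625:2638 [0] NCCL INFO Channel 00 : 7[84000] -> 0[2000] [receive] via NET/Socket/0
gpu24:2321:2336 [3] NCCL INFO Channel 00 : 7[84000] -> 0[2000] [send] via NET/Socket/0
gpu24:2321:2333 [0] NCCL INFO Channel 00 : 4[2000] -> 1[3000] [send] via NET/Socket/1
gpu23:2625:2639 [1] NCCL INFO Channel 00 : 1[3000] -> 2[83000] via direct shared memory
"""

# Matches only inter-node connections; shared-memory lines are skipped.
PATTERN = re.compile(r"Channel (\d+) : .*\[(send|receive)\] via NET/Socket/(\d+)")


def count_socket_devices(text):
    """Count send/receive connections per NET/Socket device index."""
    counts = Counter()
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts


usage = count_socket_devices(log)
for dev in sorted(usage):
    print(f"NET/Socket/{dev}: {usage[dev]} connection(s)")
```

Run over the full log rather than this short sample, the same count would show whether all four device indices are selected for connections. It says nothing about which physical interface the kernel then routes the traffic through.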

Also, here is another piece of information that may be useful: I changed the perf test from sendrecv to alltoall and found that NCCL makes use of two NICs (900 MB/s on ens10f0 and 300 MB/s on ens10f1, both attached to CPU0).

@sjeaugey
Member

Yes, Net/Socket/x is what I meant by NET/TCP/x. Sorry about that, should have checked.

OK, so you do indeed seem to have a routing issue: with 3 NICs in the same subnet, ens10f0 gets 3/4 of the traffic while ens10f1 gets 1/4 (which is expected).
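
The 3/4 to 1/4 split can be checked against the reported alltoall rates with a small calculation. In this sketch the ring-to-interface mapping is an assumption for illustration: the three interfaces in 11.1.2.x are taken to send through ens10f0, while ens10f1 carries only its own ring.

```python
# Rates reported above for the alltoall run, in MB/s.
observed = {"ens10f0": 900, "ens10f1": 300}

# Assumption for illustration: the three interfaces in 11.1.2.x all end up
# sending through ens10f0, and ens10f1 (alone in 192.168.2.x) carries its own.
rings_per_interface = {"ens10f0": 3, "ens10f1": 1}


def shares(values):
    """Normalize a mapping of non-negative numbers so the values sum to 1."""
    total = sum(values.values())
    return {name: value / total for name, value in values.items()}


observed_share = shares(observed)
expected_share = shares(rings_per_interface)
for name in observed:
    print(f"{name}: observed {observed_share[name]:.0%}, expected {expected_share[name]:.0%}")
```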

I still can't quite explain why allreduce is causing only one NIC to see traffic and not 2 (with the 3/4-1/4 balance you're seeing on alltoall) but it would be good to put all 3 NICs in different subnets and try again alltoall and allreduce. That will probably give us more insight as to what's happening.

@Dounm
Author

Dounm commented Nov 28, 2021

Thanks a lot for your reply. I will change the subnet config later (it's a bit complex) to see if it makes any difference.

I also found one more thing: whether I set NCCL_SOCKET_NTHREADS to 8 or to 64, only CPU cores 1 and 33 (both on NUMA node CPU0) are used, each at 100% usage.

So, could you explain how NCCL supports NUMA affinity? Does NCCL always perform the socket operations on a single NUMA node?

I'm also curious about how NCCL routes its packets between different GPUs. Is there any documentation or related code that I should read? I read some of the code in src/graph but couldn't get the whole picture.

@sjeaugey
Member

Ah, sorry about missing that earlier. By default mpirun binds every task to a single core. Can you try again adding --bind-to numa to the mpirun arguments?
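
To see what the launcher's binding actually allows, a small script can print the CPU affinity mask from inside a launched process. This is a minimal sketch using Python's `os.sched_getaffinity`, which is only available on some platforms such as Linux; the helper name `allowed_cpus` is illustrative.

```python
import os


def allowed_cpus():
    """Return the sorted CPU ids this process may run on (Linux only)."""
    # os.sched_getaffinity is not available on every platform (e.g. macOS).
    if not hasattr(os, "sched_getaffinity"):
        raise RuntimeError("os.sched_getaffinity is not available on this platform")
    return sorted(os.sched_getaffinity(0))


if __name__ == "__main__":
    cpus = allowed_cpus()
    print(f"pid {os.getpid()} may run on {len(cpus)} CPU(s): {cpus}")
```

Launching this script through mpirun with and without --bind-to numa should show whether each task is confined to a very small set of cores or to a whole NUMA node.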
