Description
OS: Ubuntu 20.04
NCCL version: 2.14.3+cuda11.7
PyTorch: 1.13.1+cu117
Python 3.8.10
device0 command:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=enp129s0f0
export NCCL_DEBUG_SUBSYS=ENV
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="10.0.62.153" --master_port "9678" test.py
device0 output:
Model loaded succeed
No protocol specified
0
28433
ubuntu-SYS-4028GR-TR:25203:25203 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp129s0f0
NCCL version 2.14.3+cuda11.7
ubuntu-SYS-4028GR-TR:25203:25332 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp129s0f0
ubuntu-SYS-4028GR-TR:25203:25332 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp129s0f0
device1 command:
export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=enp129s0f0
export NCCL_DEBUG_SUBSYS=ENV
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="10.0.62.153" --master_port "9678" test.py
device1 output:
Model loaded succeed
No protocol specified
0
28433
ubuntu1-NF5588M4S:11567:11567 [0] NCCL INFO Bootstrap : Using enp129s0f0:10.0.119.255<0>
ubuntu1-NF5588M4S:11567:11567 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
ubuntu1-NF5588M4S:11567:11567 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.14.3+cuda11.7
ubuntu1-NF5588M4S:11567:11697 [0] NCCL INFO NET/IB : No device found.
ubuntu1-NF5588M4S:11567:11697 [0] NCCL INFO NET/Socket : Using [0]enp129s0f0:10.0.119.255<0>
ubuntu1-NF5588M4S:11567:11697 [0] NCCL INFO Using network Socket
NCCL makes no further progress after device1 prints: NCCL INFO Using network Socket
I waited a long time, but the program produced no further output.
Firewalls are not enabled on either device.
The two devices can ping each other.
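Ping only exercises ICMP, so it may also be worth confirming that a TCP connection to the master address and port from the commands above (10.0.62.153:9678) succeeds from the second machine. Below is a minimal sketch using only Python's standard socket module. The helper name can_connect is hypothetical and is not part of NCCL or PyTorch. It checks only the rendezvous port, so a successful result does not rule out problems on other ports that NCCL may open later.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example (run on the node_rank=1 machine while node_rank=0 is waiting):
# print(can_connect("10.0.62.153", 9678))
```

If this returns False while ping works, the TCP path (routing, interface choice, or filtering) would be the next thing to look at rather than NCCL itself.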
How can I solve this problem? Thanks.