probability init hang with pytorch #581
After looking at a specific run's log, I found that there is a rank4->rank3 connection that has not been established, and its intermediateRank is rank2. The gdb backtraces of rank4 and rank2 show that rank4 is waiting for the remote mem alloc on rank2 to finish, while rank2 is stuck on a CUDA operation. At this point rank2 is already executing the communication kernel, which I guess is why subsequent CUDA operations on the same stream cannot finish. (gdb screenshots for rank 4 and rank 2 were attached as images.)
To test this idea, I added a global barrier (using PyTorch's TCPStore) immediately after PyTorch initializes the NCCL comm in https://github.com/pytorch/pytorch/blob/lts/release/1.8/torch/lib/c10d/ProcessGroupNCCL.cpp#L826. This seems to solve the hang. But I think this is an implementation mistake in NCCL, or in how PyTorch uses NCCL, and I'd like to know your opinion.
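The workaround boils down to a CPU-side counting barrier. Here is a minimal, self-contained sketch of the idea — the `KVStore` class is a hypothetical stand-in for `torch.distributed.TCPStore`, and threads stand in for ranks; it issues no CUDA/NCCL call, so a rank whose GPU stream is already occupied can still participate:

```python
# Sketch of a CPU-side counting barrier, the idea behind the TCPStore
# workaround. KVStore is a hypothetical stand-in for
# torch.distributed.TCPStore; threads stand in for ranks.
import threading
import time

class KVStore:
    """Thread-safe counter store (stand-in for a TCP key-value store)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def add(self, key, amount):
        # Atomically increment `key`, mirroring TCPStore.add() semantics.
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)

def store_barrier(store, world_size, key="post_nccl_init"):
    # Each rank announces arrival, then spins on the CPU until every rank
    # has arrived. No GPU work is queued, so ranks cannot block each other
    # through a busy CUDA stream.
    store.add(key, 1)
    while store.get(key) < world_size:
        time.sleep(0.001)

if __name__ == "__main__":
    world_size = 8
    store = KVStore()
    arrived = []

    def rank_main(rank):
        time.sleep(rank * 0.005)          # ranks finish init at different times
        store_barrier(store, world_size)  # sync here before any collective
        arrived.append(rank)

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(arrived))  # 8: every rank passed the barrier
```

In a real ProcessGroupNCCL patch the same pattern would use the store the process group already holds; the point is only that the barrier is a pure CPU rendezvous placed between comm init and the first collective.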
Your analysis and workaround are correct. This was fixed in 2.10.3 by adding a barrier after the NVB preconnect phase.
@sjeaugey, thanks for your reply! Which NCCL version introduced this problem? I'm considering downgrading to an appropriate NCCL version in my cluster.
The full story:

This improves alltoall performance on DGX-1-like servers, but it was prone to deadlocks later on when using send/recv.

2.9.9-1: This fixes the later hang but may hang during init, even for codes not using send/recv. It also adds a …

2.10.3-1: This adds a barrier after the NVB Preconnect phase, hopefully fixing the issue once and for all.
Hi, I'm using PyTorch 1.8 + NCCL 2.9.9 for distributed training. I found that there is a high probability of an init hang in a particular hardware environment and configuration. After some investigation, I found that this happens because the device hosting the intermediateRank completes its own init and enters a communication primitive call, while the remote mem alloc requests from other GPUs that rely on this intermediateRank are never processed, causing the hang.
A hang example
Phase 1:
RANK 0: init done
RANK 1: init done
RANK 2: init done
RANK 3: wait for RANK4
RANK 4: try to connect to RANK3 (with intermediateRank 2)
RANK 5: init done
RANK 6: init done
RANK 7: init done
Phase 2:
RANK 0: allreduce
RANK 1: allreduce
RANK 2: allreduce
RANK 3: wait for RANK4
RANK 4: ask RANK2 to alloc cuda mem
RANK 5: allreduce
RANK 6: allreduce
RANK 7: allreduce
Phase 3:
RANK 0: allreduce
RANK 1: allreduce
RANK 2: allreduce + try to alloc cuda mem (stuck)
RANK 3: wait for RANK4
RANK 4: wait for RANK2 alloc ready
RANK 5: allreduce
RANK 6: allreduce
RANK 7: allreduce
Phase 4: hang
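The circular wait in phase 3 can be made explicit as a wait-for graph. The toy sketch below uses illustrative names (not NCCL internals) and encodes the two assumptions from the analysis above: ops on one CUDA stream complete in order, and a collective completes only once every rank has joined it. A simple DFS finds the cycle; removing the same-stream edge — which is effectively what barriering all ranks between init and the first collective does — breaks it:

```python
# Toy wait-for graph for phase 3 above. Names are illustrative, not NCCL
# internals. An edge "a -> b" means: a cannot complete until b completes.
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in deps}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in deps.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:   # back edge: cycle found
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for n in list(deps):
        if color.get(n, WHITE) == WHITE:
            found = dfs(n)
            if found:
                return found
    return None

deps = {
    # Collective semantics: rank2's allreduce finishes only when rank4 joins.
    "rank2.allreduce": ["rank4.allreduce"],
    # Same-stream ordering: rank2's alloc is queued behind its allreduce kernel.
    "rank2.mem_alloc": ["rank2.allreduce"],
    # NVB: rank4's connection to rank3 goes through memory allocated on rank2.
    "rank4.connect":   ["rank2.mem_alloc"],
    # rank4 only reaches the allreduce after its connections are established.
    "rank4.allreduce": ["rank4.connect"],
}

print(find_cycle(deps) is not None)  # True: the four edges form a cycle

# A barrier after init means every connection (and its remote alloc)
# completes before any collective kernel is queued, so the same-stream
# edge disappears:
deps_with_barrier = dict(deps)
deps_with_barrier["rank2.mem_alloc"] = []
print(find_cycle(deps_with_barrier) is None)  # True: no cycle, no hang
```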
NCCL-related environment config:
NCCL_IB_GID_INDEX=3
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_SOCKET_NTHREADS=2
NCCL_VERSION=2
NCCL_MAX_NCHANNELS=2
NCCL_MIN_NCHANNELS=2
NCCL_NSOCKS_PERTHREAD=1
NCCL_LAUNCH_MODE=PARALLEL
NCCL_DEBUG=INFO
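For reference, the settings above could be applied at launch time roughly like this (a sketch only: `train.py` and the launcher invocation are placeholders, not from the original report):

```shell
# Sketch: exporting the reported NCCL settings before launching training.
export NCCL_IB_GID_INDEX=3
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_SOCKET_NTHREADS=2
export NCCL_MAX_NCHANNELS=2
export NCCL_MIN_NCHANNELS=2
export NCCL_NSOCKS_PERTHREAD=1
export NCCL_LAUNCH_MODE=PARALLEL
export NCCL_DEBUG=INFO

# PyTorch 1.8-era launcher, one process per GPU on an 8-GPU node
# (train.py is a hypothetical script name):
python -m torch.distributed.launch --nproc_per_node=8 train.py
```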
machine info:
PyTorch version: 1.8.2+PAI2108
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.4
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB
Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A
hardware topo: (attached as an image in the original issue)