nccl test on multiple gpu hang all the time #199
Hi Xiangyu,
Thanks, I find
Is there any problem visible in the output?
And after upgrading to CUDA 10, the NCCL test still hangs when testing on multiple GPUs.
Thanks for running the test. From your p2pBandwidthLatencyTest results, the bandwidth in the P2P-enabled case is very low. That explains why you saw the NCCL test "hanging": it is not actually hanging, it is just running at a very low speed. You can try disabling ACS in your PCIe switch settings. Or, if you are using AMD CPUs, you can try turning off Virtual Technology (VT) in the BIOS. If neither solution works for you and you want to get things going ASAP, you can set NCCL_P2P_DISABLE=1 to use shared memory (SHM) instead of CUDA P2P. The SHM performance could be lower than the P2P performance, though. But in your case, GPUs that are NODE away would use SHM by default anyway, so you might not lose much.
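A minimal sketch of the two checks suggested above (the `-g 2` argument and binary path are assumptions taken from this thread; the `lspci` grep pattern assumes standard PCIe capability output):

```shell
# Inspect PCIe bridges for ACS: any "ACSCtl" line with enabled flags such
# as "SrcValid+" means ACS is active and may redirect P2P traffic,
# crippling GPU-to-GPU bandwidth.
sudo lspci -vvv | grep -i acsctl

# Workaround without BIOS/switch changes: force NCCL to use shared memory
# instead of CUDA P2P (may be slower than properly working P2P).
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
```

If the grep shows `SrcValid+` (or other `+` flags) on the switch ports above the GPUs, disabling ACS there is the proper fix; the environment variable is only a stopgap.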
Disabling ACS took effect, thank you very much.
Hi NCCL team:
I downloaded the NCCL test code from
https://github.com/NVIDIA/nccl-tests.git
and ran it on our deep learning workstation. Running the test on one GPU works as expected, but running it on two GPUs hangs every time. The command is
./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
and the stack when it hangs is: (attachment not captured)
Our deep learning workstation has 4 NVIDIA V100 GPUs, and the CUDA toolkit version is: (attachment not captured)
The output of nvidia-smi is: (attachment not captured)
The GPU topology is: (attachment not captured)
Is there a problem with the workstation hardware, or is it just a software error? Thank you.
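For reference, the topology and P2P measurements discussed in this thread can be reproduced as follows (the p2pBandwidthLatencyTest path is an assumption; it ships with the CUDA samples and must be built first):

```shell
# Show how each GPU pair is connected: NV# for NVLink, PIX/PXB for PCIe
# switches, NODE/SYS for crossing NUMA or socket boundaries.
nvidia-smi topo -m

# Measure bandwidth and latency between GPU pairs; very low numbers in
# the P2P-enabled matrix point at ACS or IOMMU interference.
./p2pBandwidthLatencyTest
```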