NCCL test on multiple GPUs hangs all the time #199

Closed
Keepmoving-ZXY opened this issue Mar 28, 2019 · 5 comments

Comments

@Keepmoving-ZXY

Hi NCCL team,
I downloaded the NCCL test code from https://github.com/NVIDIA/nccl-tests.git and ran it on our deep learning workstation. The test on one GPU runs normally as expected, but the test on two GPUs hangs every time.
The command is ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2, and the stack trace during the hang is:

 84	../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007f2e766c9827 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1  0x000000000040890f in TimeTest(threadArgs_t*, ncclDataType_t, char const*, ncclRedOp_t, char const*, int, int) ()
#2  0x0000000000403875 in RunTest(threadArgs_t*, int, ncclDataType_t, char const*, ncclRedOp_t, char const*) ()
#3  0x000000000040392d in threadRunTests(void*) ()
#4  0x0000000000402959 in main ()
(gdb) 
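
For reference, a backtrace like the one above can be captured by attaching gdb to the hanging process. A minimal sketch, with the PID taken from the nvidia-smi output shown further below:

sudo gdb -p 11199            # attach to the hanging all_reduce_perf process
(gdb) bt                     # backtrace of the current thread
(gdb) thread apply all bt    # optionally, backtraces of all threads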

Our deep learning workstation has 4 NVIDIA V100 GPUs, and the CUDA toolkit version is:

 ➜  ~ sudo dpkg -l | grep cuda
ii  cuda                                             9.0.176-1                                     amd64        CUDA meta-package
ii  cuda-9-0                                         9.0.176-1                                     amd64        CUDA 9.0 meta-package
ii  cuda-command-line-tools-9-0                      9.0.176-1                                     amd64        CUDA command-line tools
ii  cuda-core-9-0                                    9.0.176-1                                     amd64        CUDA core tools
ii  cuda-cublas-9-0                                  9.0.176-1                                     amd64        CUBLAS native runtime libraries
ii  cuda-cublas-dev-9-0                              9.0.176-1                                     amd64        CUBLAS native dev links, headers
ii  cuda-cudart-9-0                                  9.0.176-1                                     amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-9-0                              9.0.176-1                                     amd64        CUDA Runtime native dev links, headers
ii  cuda-cufft-9-0                                   9.0.176-1                                     amd64        CUFFT native runtime libraries
ii  cuda-cufft-dev-9-0                               9.0.176-1                                     amd64        CUFFT native dev links, headers
ii  cuda-curand-9-0                                  9.0.176-1                                     amd64        CURAND native runtime libraries
ii  cuda-curand-dev-9-0                              9.0.176-1                                     amd64        CURAND native dev links, headers
ii  cuda-cusolver-9-0                                9.0.176-1                                     amd64        CUDA solver native runtime libraries
ii  cuda-cusolver-dev-9-0                            9.0.176-1                                     amd64        CUDA solver native dev links, headers
ii  cuda-cusparse-9-0                                9.0.176-1                                     amd64        CUSPARSE native runtime libraries
ii  cuda-cusparse-dev-9-0                            9.0.176-1                                     amd64        CUSPARSE native dev links, headers
ii  cuda-demo-suite-9-0                              9.0.176-1                                     amd64        Demo suite for CUDA
ii  cuda-documentation-9-0                           9.0.176-1                                     amd64        CUDA documentation
ii  cuda-driver-dev-9-0                              9.0.176-1                                     amd64        CUDA Driver native dev stub library
ii  cuda-drivers                                     384.145-1                                     amd64        CUDA Driver meta-package
ii  cuda-libraries-9-0                               9.0.176-1                                     amd64        CUDA Libraries 9.0 meta-package
ii  cuda-libraries-dev-9-0                           9.0.176-1                                     amd64        CUDA Libraries 9.0 development meta-package
ii  cuda-license-9-0                                 9.0.176-1                                     amd64        CUDA licenses
ii  cuda-misc-headers-9-0                            9.0.176-1                                     amd64        CUDA miscellaneous headers
ii  cuda-npp-9-0                                     9.0.176-1                                     amd64        NPP native runtime libraries
ii  cuda-npp-dev-9-0                                 9.0.176-1                                     amd64        NPP native dev links, headers
ii  cuda-nvgraph-9-0                                 9.0.176-1                                     amd64        NVGRAPH native runtime libraries
ii  cuda-nvgraph-dev-9-0                             9.0.176-1                                     amd64        NVGRAPH native dev links, headers
ii  cuda-nvml-dev-9-0                                9.0.176-1                                     amd64        NVML native dev links, headers
ii  cuda-nvrtc-9-0                                   9.0.176-1                                     amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-9-0                               9.0.176-1                                     amd64        NVRTC native dev links, headers
ii  cuda-repo-ubuntu1604-9-0-local                   9.0.176-1                                     amd64        cuda repository configuration files
ii  cuda-runtime-9-0                                 9.0.176-1                                     amd64        CUDA Runtime 9.0 meta-package
ii  cuda-samples-9-0                                 9.0.176-1                                     amd64        CUDA example applications
ii  cuda-toolkit-9-0                                 9.0.176-1                                     amd64        CUDA Toolkit 9.0 meta-package
ii  cuda-visual-tools-9-0                            9.0.176-1                                     amd64        CUDA visual tools
ii  libcuda1-384                                     384.145-0ubuntu1                              amd64        NVIDIA CUDA runtime library
ii  libcudnn7                                        7.2.1.38-1+cuda9.0                            amd64        cuDNN runtime libraries
ii  libcudnn7-dev                                    7.2.1.38-1+cuda9.0                            amd64        cuDNN development libraries and headers
ii  libnccl-dev                                      2.2.13-1+cuda9.0                              amd64        NVIDIA Collectives Communication Library (NCCL) Development Files
ii  libnccl2                                         2.2.13-1+cuda9.0                              amd64        NVIDIA Collectives Communication Library (NCCL) Runtime
ii  nccl-repo-ubuntu1604-2.2.13-ga-cuda9.0           1-1                                           amd64        nccl repository configuration files

The output of nvidia-smi is:

➜  ~ nvidia-smi  
Thu Mar 28 12:49:52 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   48C    P0    44W / 250W |    966MiB / 32502MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:1E:00.0 Off |                    0 |
| N/A   45C    P0    44W / 250W |    966MiB / 32502MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   36C    P0    26W / 250W |     10MiB / 32502MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   37C    P0    27W / 250W |     10MiB / 32502MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11199      C   ./build/all_reduce_perf                      956MiB |
|    1     11199      C   ./build/all_reduce_perf                      956MiB |
+-----------------------------------------------------------------------------+
➜  ~ 

The GPU topology is:

➜  ~ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	CPU Affinity
GPU0	 X 	PIX	NODE	NODE	0-9,20-29
GPU1	PIX	 X 	NODE	NODE	0-9,20-29
GPU2	NODE	NODE	 X 	PIX	0-9,20-29
GPU3	NODE	NODE	PIX	 X 	0-9,20-29

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
➜  ~ 

Is there a hardware problem with the workstation, or is it just a software error? Thank you.

@kwen2501
Contributor

Hi Xiangyu,
Have you tried the p2pBandwidthLatencyTest from the CUDA Utilities samples?
It would also be good to check whether ACS is enabled on your machine. On PCIe platforms, bandwidth can be really low if ACS is enabled.
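
For reference, a minimal sketch of how the ACS status can be checked, assuming pciutils is installed (run as root):

sudo lspci -vvv | grep -i "ACSCtl"
# Lines with ACS bits set to '+' (e.g. SrcValid+) mean ACS is active on that
# PCIe device and may redirect P2P traffic through the root complex.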

@Keepmoving-ZXY
Author

Thanks. I found that p2pBandwidthLatencyTest can only be built with CUDA 10 or a higher version, so I upgraded CUDA from 9.0 to 10.0.
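
For reference, a rough sketch of one way to build and run the sample; the samples path is an assumption and may differ from the actual setup here:

cd /usr/local/cuda-10.0/samples/1_Utilities/p2pBandwidthLatencyTest   # path assumed
make
./p2pBandwidthLatencyTest

The output of p2pBandwidthLatencyTest is: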

➜  release git:(91dc60d) ✗ ./p2pBandwidthLatencyTest 
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, Tesla V100-PCIE-32GB, pciBusID: 1b, pciDeviceID: 0, pciDomainID:0
Device: 1, Tesla V100-PCIE-32GB, pciBusID: 1e, pciDeviceID: 0, pciDomainID:0
Device: 2, Tesla V100-PCIE-32GB, pciBusID: 3d, pciDeviceID: 0, pciDomainID:0
Device: 3, Tesla V100-PCIE-32GB, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1     2     3
     0       1     1     1     1
     1       1     1     1     1
     2       1     1     1     1
     3       1     1     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 728.78   6.95   6.85   7.19 
     1   6.92 737.03   6.97   7.21 
     2   6.99   6.86 739.82   6.32 
     3   7.29   7.24   6.43 739.82 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3 
     0 730.14   0.14   0.15   0.15 
     1   0.18 741.22   0.19   0.18 
     2   0.18   0.18 738.42   0.18 
     3   0.18   0.18   0.18 741.22 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 743.34   9.14  13.26  12.77 
     1   9.05 743.34  12.83  13.27 
     2  13.57  12.84 741.22   9.15 
     3  12.91  13.23   9.15 747.61 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3 
     0 741.93   0.24   0.30   0.30 
     1   0.24 753.38   0.36   0.35 
     2   0.30   0.36 745.47   0.28 
     3   0.30   0.35   0.28 746.18 
P2P=Disabled Latency Matrix (us)
   GPU     0      1      2      3 
     0   1.96  16.56  16.62  16.41 
     1  16.89   1.97  16.49  16.55 
     2  16.57  16.55   1.91  16.55 
     3  16.41  16.41  16.40   1.98 

   CPU     0      1      2      3 
     0   4.85  12.14  12.15  11.70 
     1  11.92   4.22  11.61  11.64 
     2  11.76  11.56   4.21  11.48 
     3  11.88  11.64  11.54   4.20 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1      2      3 
     0   2.12 49252.94 49252.84 49252.85 
     1 49253.04   1.98 49253.02 49253.04 
     2 49253.05 49252.95   1.93 49252.93 
     3 49258.15 49252.89 49252.92   1.98 

   CPU     0      1      2      3 
     0   4.38   3.24   3.29   3.19 
     1   3.29   4.13   2.98   3.05 
     2   3.48   3.19   4.50   3.22 
     3   3.19   3.42   3.24   4.35 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Is there any problem visible in this output?

@Keepmoving-ZXY
Author

And after upgrading to CUDA 10, the NCCL test still hangs in the multi-GPU test.

@kwen2501
Contributor

Thanks for running the test. From your p2pBandwidthLatencyTest results, the bandwidth in the P2P-enabled case is very low. That explains why you saw the NCCL test "hanging": it is not actually hanging, it is just running at a very low speed.

You can try to disable ACS in your PCIe switch settings. Or, if you are using AMD CPUs, you can try turning off virtualization technology (VT) in the BIOS.

If neither solution works for you and you want to get things going ASAP, you can set NCCL_P2P_DISABLE=1 to use shared memory (SHM) instead of CUDA P2P. The SHM performance could be lower than the P2P performance, but in your case the GPUs that are a NODE hop apart would use SHM by default anyway, so you might not lose much.
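
For reference, a rough sketch of both workarounds; the BDF value is a placeholder, ECAP_ACS support in setpci is assumed, and the setpci change does not persist across reboots (run as root):

# 1) Disable ACS on the PCIe bridge/switch ports that show it enabled:
sudo lspci -vvv | grep -i "ACSCtl"            # find ports with ACS bits set to '+'
sudo setpci -s 1a:00.0 ECAP_ACS+0x6.w=0000    # 1a:00.0 is a placeholder BDF

# 2) Or fall back to shared memory instead of CUDA P2P:
NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2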

@Keepmoving-ZXY
Author

Disabling ACS worked; thank you very much.
