Testing nccl with a difficult topology #19

Closed · wme7 opened this issue Apr 19, 2016 · 8 comments

@wme7 commented Apr 19, 2016

Dear NCCL team,
First of all, thanks a lot for such a nice open-source project.
I just got to know about it through the Parallel Forall blog.
Currently, I'm testing your examples on a small production machine, and I noticed that the topology I'm using is a bit complex, namely:

[r1bsl@supermicro single]$ nvidia-smi topo --matrix
        GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 CPU Affinity
GPU0     X  PIX SOC SOC SOC SOC 0-7,16-23
GPU1    PIX  X  SOC SOC SOC SOC 0-7,16-23
GPU2    SOC SOC  X  PIX PHB PHB 8-15,24-31
GPU3    SOC SOC PIX  X  PHB PHB 8-15,24-31
GPU4    SOC SOC PHB PHB  X  PIX 8-15,24-31
GPU5    SOC SOC PHB PHB PIX  X  8-15,24-31


Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch

As you can see, I'm working with K80 GPUs in this machine.
I've noticed that I have no problem running your tests when I use only one of the two internal GPUs of each K80 board, e.g.:

[r1bsl@supermicro single]$ ./all_gather_test 10000000 3 1 3 5 
# Using devices
#   Rank  0 uses device  1 [0x06] Tesla K80
#   Rank  1 uses device  3 [0x85] Tesla K80
#   Rank  2 uses device  5 [0x89] Tesla K80

#      bytes             N    type     time  algbw  busbw    delta
    10000000      10000000    char    5.247   3.81   3.81    0e+00
    10000000       2500000     int    4.872   4.11   4.11    0e+00
    10000000       5000000    half    4.802   4.16   4.16    0e+00
    10000000       2500000   float    4.816   4.15   4.15    0e+00
    10000000       1250000  double    4.793   4.17   4.17    0e+00
    10000000       1250000   int64    4.766   4.20   4.20    0e+00
    10000000       1250000  uint64    4.731   4.23   4.23    0e+00

However, if I want to run the test using both internal GPUs of a single K80 card, I get into trouble:

[r1bsl@supermicro single]$ ./all_gather_test 100000 2 2 3
# Using devices
#   Rank  0 uses device  2 [0x84] Tesla K80
#   Rank  1 uses device  3 [0x85] Tesla K80

#      bytes             N    type     time  algbw  busbw    delta
[code stalls]
^C

The execution stalls and I have no option other than to kill it.
My question is: can NCCL handle such a complex topology? And if so, how can I modify the examples so that I can run them with all 6 of my GPUs?
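
For reference, this is roughly the single-process pattern I'd like to scale up to all six devices (a minimal sketch modelled on the NCCL examples, not the actual test source; the buffer size, the choice of ncclAllReduce, and the omitted error checking are my own simplifications):

/* Minimal single-process sketch: one communicator, stream and buffer pair per
 * GPU, created with ncclCommInitAll, then a collective across all six devices.
 * Error checking is omitted for brevity. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  const int ndev = 6;
  int devs[6] = {0, 1, 2, 3, 4, 5};
  const int count = 1 << 20;            /* illustrative element count */

  ncclComm_t comms[6];
  cudaStream_t streams[6];
  float* sendbuff[6];
  float* recvbuff[6];

  /* One NCCL communicator per device, all inside this process. */
  ncclCommInitAll(comms, ndev, devs);

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc((void**)&sendbuff[i], count * sizeof(float));
    cudaMalloc((void**)&recvbuff[i], count * sizeof(float));
    cudaMemset(sendbuff[i], 0, count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  /* Enqueue the collective on every device's stream first... */
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(devs[i]);
    ncclAllReduce(sendbuff[i], recvbuff[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  }

  /* ...and only then wait for completion on each stream. */
  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
  }

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(devs[i]);
    cudaFree(sendbuff[i]);
    cudaFree(recvbuff[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("done\n");
  return 0;
}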

@sjeaugey (Member) commented:

Hi Manuel,

I cannot reproduce the stall on a similar machine. Could you get a stack trace so that we get an idea of where it stalls?

Also, can you look at nvidia-smi to see if one (or two) of the GPUs are busy?

@wme7 (Author) commented Apr 21, 2016

Hi sjeaugey,

I've been running and testing my configuration of the toolkit (CUDA SDK 7.5), the NCCL library, and the MPI library (OpenMPI 1.10.2) to make sure they are all correctly installed on both my workstation and the Supermicro PC, and to rule out any bug in the libraries. Having made sure that both computers have identical configurations and having tested the nccl examples on them, I concluded that this problem has to be related to my topology, as I don't get any problems when executing on my workstation (which for now has two different GPUs).

To put it in perspective, my test hardware configurations as reported by the lspci command are as follows:

[manuel@nhri]$ lspci -tv | grep NVIDIA
           +-01.1-[02]----00.0  NVIDIA Corporation GK110GL [Tesla K20c]
           +-1c.4-[10]--+-00.0  NVIDIA Corporation GM107GL [Quadro K620]
           |            \-00.1  NVIDIA Corporation Device 0fbc
[r1bsl@supermicro]$ lspci -tv | grep NVIDIA
 |           +-02.0-[82-85]----00.0-[83-85]--+-08.0-[84]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           |                               \-10.0-[85]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           +-03.0-[86-89]----00.0-[87-89]--+-08.0-[88]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
 |           |                               \-10.0-[89]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             +-03.0-[03-06]----00.0-[04-06]--+-08.0-[05]----00.0  NVIDIA Corporation GK210GL [Tesla K80]
             |                               \-10.0-[06]----00.0  NVIDIA Corporation GK210GL [Tesla K80]

In my workstation there is no PCIe switch, and therefore the topology doesn't allow me to use P2P communication. But the Supermicro configuration is more complex: I found out that to reach each GPU I have to go through two PCIe switches. To put it in a schematic way:

CPU(0) -- switch -- K80 internal switch -- K80(0)
    |       |                         \ -- K80(1)
    |       \ ----- K80 internal switch -- K80(2)
    |                                 \ -- K80(3)
CPU(1) -- switch -- K80 internal switch -- K80(4)
                                      \ -- K80(5)

I have tested the MPI example and all the examples in the 'single' folder. When running the nccl tests, I made sure no GPU was busy by using the nvidia-smi command as indicated. I have no problems executing the single and MPI tests using 3 GPUs simultaneously, with either GPUs (0,2,4) or (1,3,5); however, using all GPUs or contiguous GPUs (i.e. both GPUs of one K80 board), e.g. (0,1,5) or (0,4,5), the execution stalls. I have traced the problem a little using the MPI example and noticed that the program always stalls after the ncclAllReduce stage: the cudaStreamSynchronize call never completes.
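
As a side note, the peer (P2P) access reported for each device pair can be queried directly from CUDA with cudaDeviceCanAccessPeer; a minimal diagnostic sketch (my own addition, not part of the NCCL tests) to compare against the topology matrix above:

/* Query reported peer access for every ordered pair of devices. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  for (int i = 0; i < ndev; ++i) {
    for (int j = 0; j < ndev; ++j) {
      if (i == j) continue;
      int canAccess = 0;
      cudaDeviceCanAccessPeer(&canAccess, i, j);
      printf("GPU %d -> GPU %d : peer access %s\n", i, j,
             canAccess ? "supported" : "not supported");
    }
  }
  return 0;
}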

If you have a similar machine, do you have any special configuration on it?

@lukeyeager (Member) commented:

Could this be related to NVIDIA/caffe#10?

@sjeaugey (Member) commented Apr 21, 2016

@lukeyeager It seems this problem is about a deadlock rather than a P2P bandwidth issue.

@wme7 your topology doesn't seem strange or difficult to me; that's in fact a very common configuration here. Could you please run a test which stalls, and while it is stalled (do not hit Ctrl-C), run:
nvidia-smi
and check whether one, two, or three GPUs show 100% usage (or just report the GPU usage).

Then, can you provide us with the output of gstack <pid>, where <pid> is the PID of the test? If you don't have gstack installed, you can alternatively run gdb -p <pid> and then type bt.
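In commands, roughly (a sketch; <pid> stands for the actual PID of the stalled test process):

# from a second terminal, while the test is stalled (don't hit Ctrl-C):
nvidia-smi            # check the GPU-Util column
gstack <pid>          # full stack trace of the stalled process
# or, if gstack is not installed:
gdb -p <pid>
(gdb) bt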

That would help me understand the problem and hopefully reproduce it and fix it.

@wme7 (Author) commented Apr 22, 2016

@lukeyeager I don't know about that but I'll check my BIOS just in case.

@sjeaugey I'm following your instructions now. I ran the test:

[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 0
GPU 0000:05:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 3
GPU 0000:85:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 1
GPU 0000:06:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 4
GPU 0000:88:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 5
GPU 0000:89:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ sudo nvidia-smi -r -i 2
GPU 0000:84:00.0 was successfully reset.
All done.
[r1bsl@supermicro simpleMPITest]$ ~/openMPI/bin/mpirun -np 3 test.run 0 2 3
MPI initialized
rank 0 has device 0
rank 1 has device 2
rank 2 has device 3
nccl communicator created!
CUDA streams created!
Input values set. Starting Test:
Reduction complete:
[stall]

From another terminal window I call nvidia-smi and get the gstack output of the running processes:

[r1bsl@supermicro simpleMPITest]$ nvidia-smi
Fri Apr 22 10:48:50 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.79     Driver Version: 352.79         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:05:00.0     Off |                    0 |
| N/A   51C    P0    71W / 149W |    127MiB / 11519MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:06:00.0     Off |                    0 |
| N/A   30C    P8    30W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:84:00.0     Off |                    0 |
| N/A   49C    P0    73W / 149W |    194MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:85:00.0     Off |                    0 |
| N/A   35C    P0    85W / 149W |    194MiB / 11519MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:88:00.0     Off |                    0 |
| N/A   33C    P8    26W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:89:00.0     Off |                    0 |
| N/A   28C    P8    30W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       696    C   test.run                                        71MiB |
|    2       697    C   test.run                                        71MiB |
|    2       698    C   test.run                                        64MiB |
|    3       697    C   test.run                                        64MiB |
|    3       698    C   test.run                                        71MiB |
+-----------------------------------------------------------------------------+
[r1bsl@supermicro simpleMPITest]$ gstack 696
Thread 3 (Thread 0x7fec7072f700 (LWP 699)):
#0  0x00007fec725b0c3d in poll () from /lib64/libc.so.6
#1  0x00007fec71fd6a96 in poll_dispatch (base=0x1c4ad70, tv=0x7fec7072ee90) at ../../../../../../opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007fec71fce8c4 in opal_libevent2021_event_base_loop (base=0x1c4ad70, flags=1) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1633
#3  0x00007fec7228131e in orte_progress_thread_engine () from /home/r1bsl/openMPI/lib/libopen-rte.so.12
#4  0x00007fec732b1dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fec725bb28d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fec57bfc700 (LWP 710)):
#0  0x00007fec725b0c3d in poll () from /lib64/libc.so.6
#1  0x00007fec779b885b in ?? () from /lib64/libcuda.so.1
#2  0x00007fec7737e651 in ?? () from /lib64/libcuda.so.1
#3  0x00007fec779b91a8 in ?? () from /lib64/libcuda.so.1
#4  0x00007fec732b1dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007fec725bb28d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fec79754740 (LWP 696)):
#0  0x00007ffd8e97b7c2 in clock_gettime ()
#1  0x00007fec725ceedd in clock_gettime () from /lib64/libc.so.6
#2  0x00007fec779b81de in ?? () from /lib64/libcuda.so.1
#3  0x00007fec7736d7ab in ?? () from /lib64/libcuda.so.1
#4  0x00007fec7734ae33 in ?? () from /lib64/libcuda.so.1
#5  0x00007fec7734af89 in ?? () from /lib64/libcuda.so.1
#6  0x00007fec772bec87 in ?? () from /lib64/libcuda.so.1
#7  0x00007fec772970c2 in cuStreamSynchronize () from /lib64/libcuda.so.1
#8  0x00007fec780fed90 in ?? () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#9  0x00007fec781361fd in cudaStreamSynchronize () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#10 0x0000000000401870 in main ()
[r1bsl@supermicro simpleMPITest]$ gstack 697
Thread 4 (Thread 0x7ff42bd39700 (LWP 700)):
#0  0x00007ff42dbbac3d in poll () from /lib64/libc.so.6
#1  0x00007ff42d5e0a96 in poll_dispatch (base=0x2127d70, tv=0x7ff42bd38e90) at ../../../../../../opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007ff42d5d88c4 in opal_libevent2021_event_base_loop (base=0x2127d70, flags=1) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1633
#3  0x00007ff42d88b31e in orte_progress_thread_engine () from /home/r1bsl/openMPI/lib/libopen-rte.so.12
#4  0x00007ff42e8bbdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ff42dbc528d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7ff4130a5700 (LWP 712)):
#0  0x00007ff42dbbac3d in poll () from /lib64/libc.so.6
#1  0x00007ff432fc285b in ?? () from /lib64/libcuda.so.1
#2  0x00007ff432988651 in ?? () from /lib64/libcuda.so.1
#3  0x00007ff432fc31a8 in ?? () from /lib64/libcuda.so.1
#4  0x00007ff42e8bbdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ff42dbc528d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7ff409dfd700 (LWP 714)):
#0  0x00007ff42dbbac3d in poll () from /lib64/libc.so.6
#1  0x00007ff432fc285b in ?? () from /lib64/libcuda.so.1
#2  0x00007ff432988651 in ?? () from /lib64/libcuda.so.1
#3  0x00007ff432fc31a8 in ?? () from /lib64/libcuda.so.1
#4  0x00007ff42e8bbdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007ff42dbc528d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7ff434d5e740 (LWP 697)):
#0  0x00007ffd38bb77c2 in clock_gettime ()
#1  0x00007ff42dbd8edd in clock_gettime () from /lib64/libc.so.6
#2  0x00007ff432fc21de in ?? () from /lib64/libcuda.so.1
#3  0x00007ff4329777ab in ?? () from /lib64/libcuda.so.1
#4  0x00007ff432954e33 in ?? () from /lib64/libcuda.so.1
#5  0x00007ff432954f89 in ?? () from /lib64/libcuda.so.1
#6  0x00007ff4328c8c87 in ?? () from /lib64/libcuda.so.1
#7  0x00007ff4328a10c2 in cuStreamSynchronize () from /lib64/libcuda.so.1
#8  0x00007ff433708d90 in ?? () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#9  0x00007ff4337401fd in cudaStreamSynchronize () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#10 0x0000000000401870 in main ()
[r1bsl@supermicro simpleMPITest]$ gstack 698
Thread 4 (Thread 0x7f5016a85700 (LWP 701)):
#0  0x00007f5018906c3d in poll () from /lib64/libc.so.6
#1  0x00007f501832ca96 in poll_dispatch (base=0xf23d70, tv=0x7f5016a84e90) at ../../../../../../opal/mca/event/libevent2021/libevent/poll.c:165
#2  0x00007f50183248c4 in opal_libevent2021_event_base_loop (base=0xf23d70, flags=1) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1633
#3  0x00007f50185d731e in orte_progress_thread_engine () from /home/r1bsl/openMPI/lib/libopen-rte.so.12
#4  0x00007f5019607dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f501891128d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f4ffde06700 (LWP 711)):
#0  0x00007f5018906c3d in poll () from /lib64/libc.so.6
#1  0x00007f501dd0e85b in ?? () from /lib64/libcuda.so.1
#2  0x00007f501d6d4651 in ?? () from /lib64/libcuda.so.1
#3  0x00007f501dd0f1a8 in ?? () from /lib64/libcuda.so.1
#4  0x00007f5019607dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f501891128d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f4ff4dff700 (LWP 713)):
#0  0x00007f5018906c3d in poll () from /lib64/libc.so.6
#1  0x00007f501dd0e85b in ?? () from /lib64/libcuda.so.1
#2  0x00007f501d6d4651 in ?? () from /lib64/libcuda.so.1
#3  0x00007f501dd0f1a8 in ?? () from /lib64/libcuda.so.1
#4  0x00007f5019607dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f501891128d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f501faaa740 (LWP 698)):
#0  0x00007ffea4bfd7c2 in clock_gettime ()
#1  0x00007f5018924edd in clock_gettime () from /lib64/libc.so.6
#2  0x00007f501dd0e1de in ?? () from /lib64/libcuda.so.1
#3  0x00007f501d6c37ab in ?? () from /lib64/libcuda.so.1
#4  0x00007f501d6a0e33 in ?? () from /lib64/libcuda.so.1
#5  0x00007f501d6a0f89 in ?? () from /lib64/libcuda.so.1
#6  0x00007f501d614c87 in ?? () from /lib64/libcuda.so.1
#7  0x00007f501d5ed0c2 in cuStreamSynchronize () from /lib64/libcuda.so.1
#8  0x00007f501e454d90 in ?? () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#9  0x00007f501e48c1fd in cudaStreamSynchronize () from /usr/local/cuda-7.5/lib64/libcudart.so.7.5
#10 0x0000000000401870 in main ()

@wme7 (Author) commented Apr 22, 2016

@lukeyeager I'm checking whether ACS is disabled for the PLX PCIe switches on my motherboard (Supermicro SuperServer 7048GR-TR). I just found the following recommendation in a question on the Supermicro forum. Using lspci, we check for the PLX switches and their ACSCtl settings:

[root@supermicro manuel]# lspci | grep PLX
03:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
04:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
04:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
82:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
86:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
[root@supermicro manuel]# lspci -vvv | grep ACSCtl
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-

Here I notice that some (though not all) of the SrcValid flags are set, i.e. ACS is still enabled on some of the bridges. Therefore, we clear the ACS control registers:

[root@supermicro manuel]# sudo setpci -s 03:00.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 04:08.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 04:10.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 82:00.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 83:08.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 83:10.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 86:00.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 87:08.0 f2a.w=0000
[root@supermicro manuel]# sudo setpci -s 87:10.0 f2a.w=0000

Now when I check ACSCtl again, I get:

[root@supermicro manuel]# sudo lspci -vvv | grep ACSCtl
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

and when I run the MPI example again, I find that ... it works!

[r1bsl@supermicro simpleMPITest]$ sh run.sh 
rm *.run
Building  mpi_test.cu               > test.run                
nvcc -I/home/r1bsl/nccl/include -L/home/r1bsl/nccl/lib  -I/home/r1bsl/openMPI/include -L/home/r1bsl/openMPI/lib -lmpi -gencode=arch=compute_35,code=sm_35 -O3 -lineinfo -std=c++11 -maxrregcount 96 --compiler-options "-O3 -fPIC -fvisibility=hidden" -o test.run mpi_test.cu   -lnccl -L/usr/local/cuda-7.5/lib64 -lcudart -lcuda -lcurand -lnvToolsExt
MPI initialized
rank 1 has device 1
rank 2 has device 2
rank 3 has device 3
rank 4 has device 4
rank 5 has device 5
rank 0 has device 0
nccl communicator created!
CUDA streams created!
Input values set. Starting Test:
Reduction complete:
streams synchronization complete:
streams synchronization complete:
Checking results:
streams synchronization complete:
streams synchronization complete:
streams synchronization complete:
streams synchronization complete:
Test PASSED.

@wme7 (Author) commented Apr 22, 2016

Just for the record, my machine has the following BIOS version:

[r1bsl@supermicro]# sudo dmidecode | less

SMBIOS 2.8 present.
132 structures occupying 6109 bytes.
Table at 0x000ED8A0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 1.0b
        Release Date: 01/07/2015
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 16384 kB
        Characteristics:
                PCI is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                Boot from CD is supported
                Selectable boot is supported
                BIOS ROM is socketed
                EDD is supported
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                3.5"/2.88 MB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Targeted content distribution is supported
                UEFI is supported
        BIOS Revision: 5.6

I'll try to update my computer's BIOS, but from what I'm reading, Linux can re-enable ACS every time I reboot. Is there a simple way to make sure it stays disabled at boot time?

wme7 closed this as completed Apr 22, 2016
@lukeyeager (Member) commented:

I'm happy to hear that helped! This issue has been reported as a deadlock before, even though it's really just a [super dramatic] reduction in communication speed.

as I'm reading, Linux can re-enable ACS every time I reboot. Is there a simple way to make sure it stays disabled at boot time?

It seems like you've already seen this comment:

Or one can disable the ACS directly in the BIOS of the server.
— @juliebernauer in NVIDIA/caffe#10 (comment)

If you can't get that to work, you could always set up a script to run after boot, I guess.
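
For example, a minimal sketch of such a script (assuming the ACS control register of every PLX bridge sits at the same f2a.w offset used in the setpci commands above); it could be run from rc.local or a one-shot systemd unit:

#!/bin/sh
# clear-acs.sh (hypothetical name): clear the ACS control register on every
# PLX (vendor ID 10b5) PCIe bridge, reusing the f2a.w offset from above.
for bdf in $(lspci -d 10b5: | awk '{print $1}'); do
    setpci -s "$bdf" f2a.w=0000
done

Whether f2a.w is the right offset for other PLX parts should be double-checked with lspci -vvv before relying on it.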
