
Probabilistic init hang with PyTorch #581

Open · cicirori opened this issue Oct 12, 2021 · 5 comments
cicirori commented Oct 12, 2021

Hi, I'm using PyTorch 1.8 + NCCL 2.9.9 for distributed training. In a particular hardware environment and configuration, initialization hangs with high probability. After some investigation, I found the cause: the rank acting as intermediateRank finishes its own init and enters a communication primitive (allreduce), while the remote mem alloc requests from other GPUs that depend on this intermediateRank are never serviced, which causes the hang.

A hang example, step by step (a rough repro sketch follows the timeline):

Step 1:

RANK 0: init done
RANK 1: init done
RANK 2: init done
RANK 3: wait for RANK4
RANK 4: try to connect to RANK3(with intermediateRank 2)
RANK 5: init done
RANK 6: init done
RANK 7: init done

Step 2:

RANK 0: allreduce
RANK 1: allreduce
RANK 2: allreduce
RANK 3: wait for RANK4
RANK 4: ask RANK2 to alloc cuda mem
RANK 5: allreduce
RANK 6: allreduce
RANK 7: allreduce

Step 3:

RANK 0: allreduce
RANK 1: allreduce
RANK 2: allreduce + try to alloc cuda mem(stuck)
RANK 3: wait for RANK4
RANK 4: wait for RANK2 alloc ready
RANK 5: allreduce
RANK 6: allreduce
RANK 7: allreduce

Step 4:

hang
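
For concreteness, here is a minimal sketch of the init-then-allreduce pattern in which the hang shows up. It is an illustration rather than the actual training job: the launch command, tensor size and script name are assumptions, and it presumes one process per GPU with LOCAL_RANK set by the launcher (e.g. torch.distributed.launch --use_env).

# repro_sketch.py -- hypothetical minimal reproducer of the pattern above.
# Launch (illustrative): python -m torch.distributed.launch --use_env --nproc_per_node=8 repro_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # In PyTorch 1.8 the NCCL communicator is created lazily at the first
    # collective, so the per-rank "init" in the timeline happens there.
    x = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(x)        # ranks whose init completed enter the allreduce kernel...
    torch.cuda.synchronize()  # ...while a rank stuck on an NVB connection never gets past this point
    print(f"rank {dist.get_rank()}: allreduce done")

if __name__ == "__main__":
    main()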

NCCL-related environment configuration:

NCCL_IB_GID_INDEX=3
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_SOCKET_NTHREADS=2
NCCL_VERSION=2
NCCL_MAX_NCHANNELS=2
NCCL_MIN_NCHANNELS=2
NCCL_NSOCKS_PERTHREAD=1
NCCL_LAUNCH_MODE=PARALLEL
NCCL_DEBUG=INFO

machine info:

PyTorch version: 1.8.2+PAI2108
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.4

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

hardware topo:

<system version="1">
  <cpu numaid="0" affinity="00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:18:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="" link_width="0">
      <pci busid="0000:1b:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="0" sm="70" rank="0" gdr="0">
          <nvlink target="0000:1c:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:3e:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:3d:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:1c:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="1" sm="70" rank="1" gdr="0">
          <nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:3d:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:1b:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:3e:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:3b:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="" link_width="0">
      <pci busid="0000:3d:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="2" sm="70" rank="2" gdr="0">
          <nvlink target="0000:3e:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:1c:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:1b:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:3e:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="3" sm="70" rank="3" gdr="0">
          <nvlink target="0000:1b:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:1c:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:3d:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <nic>
      <net name="bond0" dev="0" speed="20000" port="0" guid="0x0" maxconn="65536" gdr="0"/>
    </nic>
  </cpu>
  <cpu numaid="-1" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:af:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="" link_width="0">
      <pci busid="0000:b1:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="4" sm="70" rank="4" gdr="0">
          <nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:3d:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:b2:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="5" sm="70" rank="5" gdr="0">
          <nvlink target="0000:3e:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:d8:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="" link_width="0">
      <pci busid="0000:db:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="6" sm="70" rank="6" gdr="0">
          <nvlink target="0000:1b:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:dc:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="7" sm="70" rank="7" gdr="0">
          <nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:1c:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
  </cpu>
</system>

cicirori commented Oct 12, 2021

Looking at the log of a specific run, I found a rank4->rank3 connection that was never established; its intermediateRank is rank2. The gdb backtraces of rank4 and rank2 below show that rank4 is waiting for rank2's remote mem alloc to finish, while rank2 is stuck on a CUDA operation. At this point rank2 is already executing the communication kernel, which I guess is why subsequent CUDA operations on the same stream cannot complete.

rank 4:
(gdb backtrace screenshot)

rank 2:
(gdb backtrace screenshot)


cicirori commented Oct 12, 2021

To test this idea, I added a global barrier (using PyTorch's TCP store) immediately after PyTorch initializes the NCCL communicator, at https://github.com/pytorch/pytorch/blob/lts/release/1.8/torch/lib/c10d/ProcessGroupNCCL.cpp#L826. This appears to fix the hang. But I think the underlying problem is an implementation mistake in NCCL or in the PyTorch+NCCL integration, and I'd like to know your opinion.
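
For reference, here is a rough Python-level sketch of what such a store-based global barrier looks like. It only illustrates the mechanism; the actual change is a few lines of C++ inside ProcessGroupNCCL.cpp at the link above, and the host, port, key name and timeout below are placeholders.

# Sketch only: a TCPStore-based global barrier. The real change lives inside
# ProcessGroupNCCL.cpp; host, port, key name and timeout here are placeholders.
import datetime
from torch.distributed import TCPStore

def store_barrier(store, world_size, tag="nccl_comm_init_barrier"):
    # Every rank increments a shared counter; the last rank to arrive
    # publishes a "done" key that all ranks then block on.
    arrived = store.add(tag, 1)
    if arrived == world_size:
        store.set(tag + "/done", "ok")
    store.wait([tag + "/done"], datetime.timedelta(seconds=300))

# Usage sketch (rank 0 hosts the store):
# store = TCPStore("master-host", 29501, 8, rank == 0)
# store_barrier(store, world_size=8)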

sjeaugey (Member) commented

Your analysis and workaround are correct.

This was fixed in 2.10.3 by adding a call to bootstrapBarrier() after the NVB preconnect section inside nccl init.

cicirori (Author) commented

@sjeaugey, thanks for your reply! Which NCCL version introduced this problem? I'm considering downgrading to an appropriate NCCL version on my cluster.
Also, I'm wondering whether NVIDIA has internal unit tests for NCCL, and if not, whether you have any recommendations for ensuring stability when updating NCCL versions in complex cluster environments.

sjeaugey (Member) commented

The full story:
2.8.3-1:

Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.

This improves alltoall performance on DGX-1-like servers. But it was prone to deadlocks later on when using send/recv.

2.9.9-1:

Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.

This fixes the later hang but can itself hang during init, even for codes not using send/recv. It also adds an NCCL_NVB_DISABLE parameter to disable NVB. @cicirori this might be useful for you if you don't actually use alltoall, which would explain why you were subject to the hang in 2.9 but not in 2.8.
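
For example, one way to set it (placement is a sketch and depends on your launcher; it just has to be in the environment before NCCL creates the communicator):

# Illustrative only: disable NVB routing via the environment variable.
# It must be set before NCCL initializes, i.e. before the first collective runs.
import os
os.environ["NCCL_NVB_DISABLE"] = "1"

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # rendezvous via launcher-provided env vars assumed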

2.10.3-1:

Fix hang in cubemesh during NVB connections.

This adds a barrier after the NVB Preconnect phase, hopefully fixing the issue once and for all.
