
Probabilistic init hang with PyTorch #581

Open · cicirori opened this issue Oct 12, 2021 · 5 comments
cicirori commented Oct 12, 2021

Hi, I'm using PyTorch 1.8 + NCCL 2.9.9 for distributed training. In a particular hardware environment and configuration, initialization hangs with high probability. After some investigation, I found the cause: the rank acting as intermediateRank finishes its own init and enters a communication primitive (allreduce), while the remote mem alloc requests from other GPUs that depend on this intermediateRank are never serviced, which causes the hang.

A hang example, step by step (a rough repro sketch follows the timeline):

Step 1:

RANK 0: init done
RANK 1: init done
RANK 2: init done
RANK 3: wait for RANK4
RANK 4: try to connect to RANK3(with intermediateRank 2)
RANK 5: init done
RANK 6: init done
RANK 7: init done

Step 2:

RANK 0: allreduce
RANK 1: allreduce
RANK 2: allreduce
RANK 3: wait for RANK4
RANK 4: ask RANK2 to alloc cuda mem
RANK 5: allreduce
RANK 6: allreduce
RANK 7: allreduce

Step 3:

RANK 0: allreduce
RANK 1: allreduce
RANK 2: allreduce + try to alloc cuda mem(stuck)
RANK 3: wait for RANK4
RANK 4: wait for RANK2 alloc ready
RANK 5: allreduce
RANK 6: allreduce
RANK 7: allreduce

Step 4:

hang
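
For concreteness, here is a minimal sketch of the init-then-allreduce pattern in which the hang shows up. It is an illustration rather than the actual training job: the launch command, tensor size and script name are assumptions, and it presumes one process per GPU with LOCAL_RANK set by the launcher (e.g. torch.distributed.launch --use_env).

# repro_sketch.py -- hypothetical minimal reproducer of the pattern above.
# Launch (illustrative): python -m torch.distributed.launch --use_env --nproc_per_node=8 repro_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # In PyTorch 1.8 the NCCL communicator is created lazily at the first
    # collective, so the per-rank "init" in the timeline happens there.
    x = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(x)        # ranks whose init completed enter the allreduce kernel...
    torch.cuda.synchronize()  # ...while a rank stuck on an NVB connection never gets past this point
    print(f"rank {dist.get_rank()}: allreduce done")

if __name__ == "__main__":
    main()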

NCCL-related environment configuration:

NCCL_IB_GID_INDEX=3
NCCL_ASYNC_ERROR_HANDLING=1
NCCL_SOCKET_NTHREADS=2
NCCL_VERSION=2
NCCL_MAX_NCHANNELS=2
NCCL_MIN_NCHANNELS=2
NCCL_NSOCKS_PERTHREAD=1
NCCL_LAUNCH_MODE=PARALLEL
NCCL_DEBUG=INFO

machine info:

PyTorch version: 1.8.2+PAI2108
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.19.4

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 418.87.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

hardware topo:

<system version="1">
  <cpu numaid="0" affinity="00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:18:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="" link_width="0">
      <pci busid="0000:1b:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="0" sm="70" rank="0" gdr="0">
          <nvlink target="0000:1c:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:3e:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:3d:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:1c:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="1" sm="70" rank="1" gdr="0">
          <nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:3d:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:1b:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:3e:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:3b:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="" link_width="0">
      <pci busid="0000:3d:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="2" sm="70" rank="2" gdr="0">
          <nvlink target="0000:3e:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:1c:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:1b:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:3e:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="3" sm="70" rank="3" gdr="0">
          <nvlink target="0000:1b:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:1c:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:3d:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <nic>
      <net name="bond0" dev="0" speed="20000" port="0" guid="0x0" maxconn="65536" gdr="0"/>
    </nic>
  </cpu>
  <cpu numaid="-1" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="85">
    <pci busid="0000:af:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="" link_width="0">
      <pci busid="0000:b1:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="4" sm="70" rank="4" gdr="0">
          <nvlink target="0000:dc:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:3d:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:b2:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="5" sm="70" rank="5" gdr="0">
          <nvlink target="0000:3e:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:d8:00.0" class="0x060400" vendor="0x10b5" device="0x8764" subsystem_vendor="0x10b5" subsystem_device="0x8764" link_speed="" link_width="0">
      <pci busid="0000:db:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="6" sm="70" rank="6" gdr="0">
          <nvlink target="0000:1b:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:dc:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:b1:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
      <pci busid="0000:dc:00.0" class="0x030200" vendor="0x10de" device="0x1db5" subsystem_vendor="0x10de" subsystem_device="0x1249" link_speed="" link_width="0">
        <gpu dev="7" sm="70" rank="7" gdr="0">
          <nvlink target="0000:b1:00.0" count="2" tclass="0x030200"/>
          <nvlink target="0000:b2:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:db:00.0" count="1" tclass="0x030200"/>
          <nvlink target="0000:1c:00.0" count="2" tclass="0x030200"/>
        </gpu>
      </pci>
    </pci>
  </cpu>
</system>

cicirori commented Oct 12, 2021

Looking at the log of a specific run, I found a rank4->rank3 connection that was never established; its intermediateRank is rank2. The gdb backtraces of rank4 and rank2 below show that rank4 is waiting for rank2's remote mem alloc to finish, while rank2 is stuck on a CUDA operation. At this point rank2 is already executing the communication kernel, which I guess is why subsequent CUDA operations on the same stream cannot complete.

rank 4:
(gdb backtrace screenshot)

rank 2:
(gdb backtrace screenshot)


cicirori commented Oct 12, 2021

To test this idea, I added a global barrier (using PyTorch's TCP store) immediately after PyTorch initializes the NCCL communicator, at https://github.com/pytorch/pytorch/blob/lts/release/1.8/torch/lib/c10d/ProcessGroupNCCL.cpp#L826. This appears to fix the hang. But I think the underlying problem is an implementation mistake in NCCL or in the PyTorch+NCCL integration, and I'd like to know your opinion.
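
For reference, here is a rough Python-level sketch of what such a store-based global barrier looks like. It only illustrates the mechanism; the actual change is a few lines of C++ inside ProcessGroupNCCL.cpp at the link above, and the host, port, key name and timeout below are placeholders.

# Sketch only: a TCPStore-based global barrier. The real change lives inside
# ProcessGroupNCCL.cpp; host, port, key name and timeout here are placeholders.
import datetime
from torch.distributed import TCPStore

def store_barrier(store, world_size, tag="nccl_comm_init_barrier"):
    # Every rank increments a shared counter; the last rank to arrive
    # publishes a "done" key that all ranks then block on.
    arrived = store.add(tag, 1)
    if arrived == world_size:
        store.set(tag + "/done", "ok")
    store.wait([tag + "/done"], datetime.timedelta(seconds=300))

# Usage sketch (rank 0 hosts the store):
# store = TCPStore("master-host", 29501, 8, rank == 0)
# store_barrier(store, world_size=8)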

sjeaugey (Member) commented

Your analysis and workaround are correct.

This was fixed in 2.10.3 by adding a call to bootstrapBarrier() after the NVB preconnect section inside nccl init.

cicirori (Author) commented

@sjeaugey, thanks for your reply! Which NCCL version introduced this problem? I'm considering downgrading to an appropriate NCCL version on my cluster.
Also, I'm wondering whether NVIDIA has internal unit tests for NCCL, and if not, whether you have any recommendations for ensuring stability when updating NCCL versions in complex cluster environments.

sjeaugey (Member) commented

The full story:
2.8.3-1:

Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.

This improves alltoall performance on DGX-1-like servers. But it was prone to deadlocks later on when using send/recv.

2.9.9-1:

Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.

This fixes the later hang but can itself hang during init, even for codes not using send/recv. It also adds an NCCL_NVB_DISABLE parameter to disable NVB. @cicirori this might be useful for you if you don't actually use alltoall, which would explain why you were subject to the hang in 2.9 but not in 2.8.
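
For example, one way to set it (placement is a sketch and depends on your launcher; it just has to be in the environment before NCCL creates the communicator):

# Illustrative only: disable NVB routing via the environment variable.
# It must be set before NCCL initializes, i.e. before the first collective runs.
import os
os.environ["NCCL_NVB_DISABLE"] = "1"

import torch.distributed as dist
dist.init_process_group(backend="nccl")  # rendezvous via launcher-provided env vars assumed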

2.10.3-1:

Fix hang in cubemesh during NVB connections.

This adds a barrier after the NVB Preconnect phase, hopefully fixing the issue once and for all.
