
GPUDirect RDMA is not working inside the horovod-docker #41

Open
vilmara opened this issue Jun 8, 2018 · 3 comments

vilmara commented Jun 8, 2018

Hi all, I am running the TensorFlow benchmarks inside the horovod-docker container to evaluate the models in distributed mode. I have installed the Mellanox driver and the GPUDirect RDMA API, and loaded the GPUDirect kernel module on each server. When I checked its status to make sure GPUDirect RDMA is active, I found that it is not recognized inside the horovod docker container; see below:

Outside the docker:
service nv_peer_mem status
Output:
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
Loaded: loaded (/etc/init.d/nv_peer_mem; bad; vendor preset: enabled)
Active: active (exited) since Thu 2018-06-07 16:02:45 CDT; 16h ago
Docs: man:systemd-sysv-generator(8)
Process: 303965 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CPU: 0

Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jun 07 16:02:45 C4140-V100-1 nv_peer_mem[303965]: starting... OK

Inside the docker:
service nv_peer_mem status
Output:
nv_peer_mem: unrecognized service

Also, when I run the benchmarks inside the docker container, the scaling efficiency drops from ~90% to ~77%, and the system reports this warning:
host-1-V100:24:203 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
host-1-V100:24:203 [0] INFO Using internal Network Socket

Can you help me figure out how to fix this? Also, what are the mpirun flags to enable RDMA (InfiniBand) and to make sure the network communication goes over RDMA (InfiniBand) instead of the socket?
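
For reference, the kind of invocation I have been experimenting with looks roughly like the sketch below; the NCCL environment variables and MPI options are my own guesses pieced together from the NCCL and Horovod documentation, so please correct anything that is wrong:

# Hypothetical 2-node, 8-GPU run; hostnames, script path and batch size are placeholders.
# NCCL_DEBUG=INFO        -> print which transport NCCL actually selects
# NCCL_IB_DISABLE=0      -> allow NCCL to use InfiniBand verbs
# NCCL_IB_HCA=mlx5_0     -> restrict NCCL to the Mellanox HCA
# NCCL_SOCKET_IFNAME=ib0 -> interface for NCCL's out-of-band/socket traffic
mpirun -np 8 -H host-1-V100:4,host-2-V100:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_HCA=mlx5_0 -x NCCL_SOCKET_IFNAME=ib0 \
    -x LD_LIBRARY_PATH -x PATH \
    python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update horovod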


haggaie commented Jun 10, 2018

I'm not sure about the efficiency inside docker, but regarding nv_peer_mem, it only needs to be loaded once, on the host. You don't need to load it inside the container as well.
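
If you want to double-check on the host, something like this is enough (standard Linux tooling, nothing Horovod-specific):

# Run on the host, not inside the container:
lsmod | grep nv_peer_mem    # lists the nv_peer_mem module if it is loaded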

@boriskovalev

@vilmara As Haggai commented, you only need to load nv_peer_mem on the host.
Please make sure that the connection between the GPUs and the Mellanox card traverses a single PCIe switch (PIX) by running nvidia-smi topo -m.
Are you using IB or RoCE?
For IB you can use my community document https://community.mellanox.com/docs/DOC-3083 (without GPUDirect).
I will add the GPUDirect part next week.
For RoCE, please run the container with the host network.
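
As a rough sketch, exposing the IB devices to the container and using the host network looks something like this (the image name is a placeholder, and GPU runtime flags such as nvidia-docker / --runtime=nvidia are omitted; adapt to your setup):

# Host network stack plus the InfiniBand character devices passed through;
# IPC_LOCK is needed so the verbs stack can register pinned memory.
docker run -it --network=host \
    --cap-add=IPC_LOCK \
    --device=/dev/infiniband \
    horovod:latest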


vilmara commented Jun 10, 2018

Hi @haggaie / @boriskovalev, thanks for your replies.

@boriskovalev, I am using IB; here is the output of nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV2     NV2     NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU1    NV2      X      NV2     NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU2    NV2     NV2      X      NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU3    NV2     NV2     NV2      X      SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
mlx5_0  SYS     SYS     SYS     SYS      X

I was able to modify the Horovod Dockerfile and rebuild the image with MLNX_OFED included. When I run the benchmarks, the system hangs, and the logs show both socket and InfiniBand connections:
Outputs:
c4140v1001:97640:97815 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:97640:97815 [0] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:97640:97815 [0] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:97640:97815 [0] INFO Using internal Network IB
c4140v1001:97640:97815 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:97640:97815 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:97640:97815 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0

C4140-V100-2:375816:376026 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375816:376026 [1] INFO Using internal Network Socket
C4140-V100-2:375816:376026 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375817:376020 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375817:376020 [2] INFO Using internal Network Socket
C4140-V100-2:375817:376020 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375815:376019 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375815:376019 [0] INFO Using internal Network Socket
C4140-V100-2:375815:376019 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375818:376025 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375818:376025 [3] INFO Using internal Network Socket
C4140-V100-2:375818:376025 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
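
The "Failed to open libibverbs.so[.1]" warnings on C4140-V100-2 suggest the image running on that host is still missing the user-space verbs libraries, which would explain why only that host falls back to sockets. The change I made to the Dockerfile was along these lines (illustrative only; the exact package names and installer flags below are not the precise lines I used):

# Install the user-space verbs libraries inside the image so NCCL can dlopen
# libibverbs.so.1 (the nv_peer_mem/kernel drivers stay on the host). Either the
# Ubuntu packages:
apt-get update && apt-get install -y libibverbs1 libibverbs-dev libmlx5-1 ibverbs-utils
# ...or the MLNX_OFED bundle installed in user-space-only mode:
./mlnxofedinstall --user-space-only --without-fw-update --force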

Could you please share the commands you used in Appendix A: TensorFlow Benchmarks and TCP vs. RDMA comparison (https://community.mellanox.com/docs/DOC-3083)?
