
GPUDirect RDMA is not working inside the horovod-docker #41

Open
vilmara opened this issue Jun 8, 2018 · 3 comments

vilmara commented Jun 8, 2018

Hi all, I am running the TensorFlow benchmarks inside the horovod-docker container to evaluate the models in distributed mode. I have installed the Mellanox driver and the GPUDirect RDMA API, and loaded the GPUDirect kernel module on each server. When I checked its status to make sure GPUDirect RDMA is active, I found that it is not recognized inside the horovod docker container; see below:

Outside the docker:
service nv_peer_mem status
Output:
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
Loaded: loaded (/etc/init.d/nv_peer_mem; bad; vendor preset: enabled)
Active: active (exited) since Thu 2018-06-07 16:02:45 CDT; 16h ago
Docs: man:systemd-sysv-generator(8)
Process: 303965 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CPU: 0

Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jun 07 16:02:45 C4140-V100-1 nv_peer_mem[303965]: starting... OK

Inside the docker:
service nv_peer_mem status
Output:
nv_peer_mem: unrecognized service

Also, when I run the benchmarks inside the docker container, the scaling efficiency drops from ~90% to ~77%, and the system reports this warning:
host-1-V100:24:203 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
host-1-V100:24:203 [0] INFO Using internal Network Socket

Can you help me figure out how to fix this? Also, what are the mpirun flags to enable RDMA (InfiniBand) and to make sure the network communication goes over RDMA (InfiniBand) instead of the socket?
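
For reference, the kind of invocation I have been experimenting with looks roughly like the sketch below; the NCCL environment variables and MPI options are my own guesses pieced together from the NCCL and Horovod documentation, so please correct anything that is wrong:

# Hypothetical 2-node, 8-GPU run; hostnames, script path and batch size are placeholders.
# NCCL_DEBUG=INFO        -> print which transport NCCL actually selects
# NCCL_IB_DISABLE=0      -> allow NCCL to use InfiniBand verbs
# NCCL_IB_HCA=mlx5_0     -> restrict NCCL to the Mellanox HCA
# NCCL_SOCKET_IFNAME=ib0 -> interface for NCCL's out-of-band/socket traffic
mpirun -np 8 -H host-1-V100:4,host-2-V100:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 \
    -x NCCL_IB_HCA=mlx5_0 -x NCCL_SOCKET_IFNAME=ib0 \
    -x LD_LIBRARY_PATH -x PATH \
    python tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update horovod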


haggaie commented Jun 10, 2018

I'm not sure about the efficiency inside docker, but regarding nv_peer_mem, it only needs to be loaded once, on the host. You don't need to load it inside the container as well.
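
If you want to double-check on the host, something like this is enough (standard Linux tooling, nothing Horovod-specific):

# Run on the host, not inside the container:
lsmod | grep nv_peer_mem    # lists the nv_peer_mem module if it is loaded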

@boriskovalev

@vilmara As Haggai commented, you only need to load nv_peer_mem on the host.
Please make sure that the connection between the GPUs and the Mellanox card traverses a single PCIe switch (PIX) by running nvidia-smi topo -m.
Are you using IB or RoCE?
For IB you can use my community document https://community.mellanox.com/docs/DOC-3083 (without GPUDirect).
I will add the GPUDirect part next week.
For RoCE, please run the container with the host network.
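
As a rough sketch, exposing the IB devices to the container and using the host network looks something like this (the image name is a placeholder, and GPU runtime flags such as nvidia-docker / --runtime=nvidia are omitted; adapt to your setup):

# Host network stack plus the InfiniBand character devices passed through;
# IPC_LOCK is needed so the verbs stack can register pinned memory.
docker run -it --network=host \
    --cap-add=IPC_LOCK \
    --device=/dev/infiniband \
    horovod:latest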


vilmara commented Jun 10, 2018

Hi @haggaie / @boriskovalev, thanks for your replies.

@boriskovalev, I am using IB; here is the output of nvidia-smi topo -m:

        GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV2     NV2     NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU1    NV2      X      NV2     NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU2    NV2     NV2      X      NV2     SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU3    NV2     NV2     NV2      X      SYS     0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
mlx5_0  SYS     SYS     SYS     SYS      X

I was able to modify the Horovod Dockerfile and rebuild the image with MLNX_OFED included. When I run the benchmarks, the system hangs, and the logs show both socket and InfiniBand connections:
Outputs:
c4140v1001:97640:97815 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:97640:97815 [0] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:97640:97815 [0] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:97640:97815 [0] INFO Using internal Network IB
c4140v1001:97640:97815 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:97640:97815 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:97640:97815 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0

C4140-V100-2:375816:376026 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375816:376026 [1] INFO Using internal Network Socket
C4140-V100-2:375816:376026 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375817:376020 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375817:376020 [2] INFO Using internal Network Socket
C4140-V100-2:375817:376020 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375815:376019 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375815:376019 [0] INFO Using internal Network Socket
C4140-V100-2:375815:376019 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:375818:376025 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375818:376025 [3] INFO Using internal Network Socket
C4140-V100-2:375818:376025 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
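
The "Failed to open libibverbs.so[.1]" warnings on C4140-V100-2 suggest the image running on that host is still missing the user-space verbs libraries, which would explain why only that host falls back to sockets. The change I made to the Dockerfile was along these lines (illustrative only; the exact package names and installer flags below are not the precise lines I used):

# Install the user-space verbs libraries inside the image so NCCL can dlopen
# libibverbs.so.1 (the nv_peer_mem/kernel drivers stay on the host). Either the
# Ubuntu packages:
apt-get update && apt-get install -y libibverbs1 libibverbs-dev libmlx5-1 ibverbs-utils
# ...or the MLNX_OFED bundle installed in user-space-only mode:
./mlnxofedinstall --user-space-only --without-fw-update --force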

Could you please share the commands you used in Appendix A: TensorFlow Benchmarks and TCP vs. RDMA comparison (https://community.mellanox.com/docs/DOC-3083)?
