GPUDirect RDMA is not working inside the horovod-docker #41
Comments
I'm not sure about the efficiency inside docker, but regarding nv_peer_mem, it is only required to be loaded once, on the host. You don't need to load it inside a container too.
@vilmara As Haggai commented, you need to load nv_peer_mem only on the host.
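For reference, a quick way to confirm the module is active on the host (a minimal sketch; the modprobe step assumes the nvidia-peer-memory package from the GPUDirect RDMA installation is present):
# On the host: check whether the nv_peer_mem kernel module is loaded
lsmod | grep nv_peer_mem
# If nothing is listed, load it (requires the nvidia-peer-memory package)
sudo modprobe nv_peer_mem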
Hi @haggaie / @boriskovalev, thanks for your reply. @boriskovalev, I am using IB; here is the output when running:
GPU0   X    NV2  NV2  NV2  SYS  0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
I was able to modify the horovod Dockerfile and build it with MLNX_OFED included. I have run the benchmarks, but the system hangs and is also showing socket and InfiniBand connections:
C4140-V100-2:375816:376026 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375817:376020 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375815:376019 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:375818:376025 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
Could you please share the commands you used in Appendix A: TensorFlow Benchmarks and TCP vs. RDMA comparison (https://community.mellanox.com/docs/DOC-3083)?
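For reference, the "Failed to open libibverbs.so" warnings typically appear when the container cannot see the host's RDMA devices or the user-space verbs library. A sketch of a docker invocation that passes them through (the image tag is a placeholder; adjust device paths and capabilities to your setup):
# Expose the host's InfiniBand devices to the container.
# --cap-add=IPC_LOCK lets libibverbs pin memory for RDMA;
# --net=host shares the host's network namespace.
nvidia-docker run -it --net=host \
    --cap-add=IPC_LOCK \
    --device=/dev/infiniband \
    horovod:latest-mlnx-ofed /bin/bash   # placeholder image tag
# Inside the container, the HCAs should now be visible:
ibv_devinfo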
Hi all, I am running the TensorFlow benchmarks inside horovod-docker to evaluate the models in distributed mode. I have installed the Mellanox driver and the GPUDirect RDMA API and loaded the GPUDirect kernel module on each server. I have also checked its status to make sure GPUDirect RDMA is active, and I realized it is not recognized inside the horovod docker, see below:
Outside the docker:
service nv_peer_mem status
Output
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
Loaded: loaded (/etc/init.d/nv_peer_mem; bad; vendor preset: enabled)
Active: active (exited) since Thu 2018-06-07 16:02:45 CDT; 16h ago
Docs: man:systemd-sysv-generator(8)
Process: 303965 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CPU: 0
Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jun 07 16:02:45 C4140-V100-1 nv_peer_mem[303965]: starting... OK
Inside the docker:
service nv_peer_mem status
Output:
nv_peer_mem: unrecognized service
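This is expected: the init script behind "service nv_peer_mem status" is only installed on the host, while kernel modules are global to the host kernel that the container shares. A minimal check from inside the container (assuming the module was loaded on the host):
# Inside the container: kernel modules are not namespaced, so the
# host's module list is visible through /proc
grep nv_peer_mem /proc/modules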
Also, when I run the benchmarks inside the docker, the scaling efficiency drops from ~90% to ~77%. The system raises this warning:
host-1-V100:24:203 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
host-1-V100:24:203 [0] INFO Using internal Network Socket
Can you help me figure out how to fix it? Also, what are the mpirun flags to enable RDMA (InfiniBand) and make sure the network communication goes over RDMA (InfiniBand) instead of the socket?
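For what it's worth, with Horovod the GPU traffic goes through NCCL rather than MPI, so whether RDMA is used is decided by NCCL's InfiniBand transport. A sketch of an mpirun invocation that surfaces this in the logs (hostnames, slot counts, the HCA name and the benchmark script are placeholders):
# NCCL_DEBUG=INFO makes NCCL log whether it selected "NET/IB" or
# "NET/Socket"; NCCL_IB_DISABLE=0 keeps the InfiniBand transport enabled
# and NCCL_IB_HCA picks the adapter (names come from ibv_devinfo).
# "-mca pml ob1 -mca btl ^openib" keeps Open MPI off the verbs path so
# that NCCL handles the RDMA traffic itself.
mpirun -np 8 -H host1:4,host2:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_IB_DISABLE=0 -x NCCL_IB_HCA=mlx5_0 \
    -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python tf_cnn_benchmarks.py --model resnet50 --variable_update horovod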