Getting "help-mpi-btl-base.txt / btl:no-nics" when trying to run on Ethernet network #21

Closed
yaroslavvb opened this issue May 18, 2019 · 9 comments


@yaroslavvb

Is there a special way to build the test if I want to run it on Ethernet? I'm trying to run it on the AWS 100 Gbps network and getting the following error:

ubuntu@ip-172-31-15-234:~$ /usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 2 -N 1 -x LD_LIBRARY_PATH=~/nccl/nccl-2.3.7/nccl/build/lib:$LD_LIBRARY_PATH ~/nccl/nccl-2.3.7/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 
--------------------------------------------------------------------------
[[33448,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ip-172-31-3-83

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
[ip-172-31-15-234:73547] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-172-31-15-234:73547] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
@sjeaugey
Member

There is no InfiniBand on AWS; you can add -mca btl ^openib to mpirun to suppress this warning.

Problems are usually due to MPI trying to use docker0 to communicate, so I also add -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 to force the use of the ENA card.
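
Putting both suggestions together, a minimal sketch of the adjusted launch, assuming the same two hosts as in the command above and that ens5 is the name of the ENA interface on the instances:

/usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 2 -N 1 \
    -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 \
    -x LD_LIBRARY_PATH=~/nccl/nccl-2.3.7/nccl/build/lib:$LD_LIBRARY_PATH \
    ~/nccl/nccl-2.3.7/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8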

@yaroslavvb
Author

yaroslavvb commented May 18, 2019

Thanks, that fixed it!
I'm now seeing the following from a 1280 MB allreduce across 2 nodes.

/usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 4 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH -oversubscribe ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2 -g 4

# nThread 1 nGpus 4 minBytes 1342177280 maxBytes 1342177280 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   8932 on ip-172-31-15-234 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid   8932 on ip-172-31-15-234 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid   8932 on ip-172-31-15-234 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid   8932 on ip-172-31-15-234 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid   8933 on ip-172-31-15-234 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid   8933 on ip-172-31-15-234 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid   8933 on ip-172-31-15-234 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid   8933 on ip-172-31-15-234 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid   8299 on ip-172-31-3-83 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid   8299 on ip-172-31-3-83 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid   8299 on ip-172-31-3-83 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid   8299 on ip-172-31-3-83 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid   8300 on ip-172-31-3-83 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid   8300 on ip-172-31-3-83 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid   8300 on ip-172-31-3-83 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid   8300 on ip-172-31-3-83 device  7 [0x00] Tesla V100-SXM2-32GB
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1342177280     335544320   float     sum   396609    3.38    6.35  4e-07   396519    3.38    6.35  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.34596 
#

Some questions remain:

  • specifying -g 4 and mpirun -np 4 gives me 8 GPUs per node; I expected 4 GPUs per node
  • sudo nload shows my utilization to be about 25 Gbps, while I'm getting 93 Gbps using iperf3 with 5 processes and 10 connections each. AWS throttles each connection to 10 Gbps, so I need at least 10 connections between these two machines. Is there some setting I can add to increase the number of connections? I've been running NCCL from this branch

Setting NCCL_MIN_NRINGS=16 didn't have an effect.

@AddyLaddy
Collaborator

Have you tried this PR against NCCL: NVIDIA/nccl#223?

@yaroslavvb
Author

Yes, that's the version I'm testing above.

@AddyLaddy
Collaborator

OK, let's continue this discussion over on the NCCL project.

@AddyLaddy
Collaborator

Typically we have been running with 1 process per GPU, so you can use mpirun -n $GPUS -N 8 for an 8-GPU-per-node job. You would then need to use -g 1 on the all_reduce_perf command line.
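
A minimal sketch of that layout, assuming two 8-GPU nodes (16 ranks total), the same hosts as in the first command, and the MCA flags suggested above; as discussed further down, Open MPI will also want matching slot counts (or -oversubscribe) to place this many ranks:

GPUS=16
/usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np $GPUS -N 8 \
    -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 \
    -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH \
    ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1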

@sjeaugey
Member

sjeaugey commented May 22, 2019

When running with MPI, you would usually not set -g or -t (keep both equal to one) and just vary the -np argument of mpirun to match the total number of GPUs.

This matches what most frameworks do.

@yaroslavvb
Author

yaroslavvb commented May 31, 2019

Thanks, it works without the -g arg now. However, I'm still confused about why I need the --oversubscribe argument; I'm wondering if I'm missing some slots configuration somewhere.

i.e., running:

/usr/local/mpi/bin/mpirun --host 172.31.34.74,172.31.45.216 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2

There are not enough slots available in the system to satisfy the 16 slots
that were requested by the application:
  /home/ubuntu/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf

Either request fewer slots for your application, or make more slots available
for use.

When I add --oversubscribe, it works:

ubuntu@ip-172-31-34-74:~$ /usr/local/mpi/bin/mpirun --host 172.31.34.74,172.31.45.216 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH -oversubscribe ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2;

# nThread 1 nGpus 1 minBytes 1342177280 maxBytes 1342177280 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  84278 on ip-172-31-34-74 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid  84279 on ip-172-31-34-74 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid  84280 on ip-172-31-34-74 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid  84281 on ip-172-31-34-74 device  3 [0x00] Tesla V100-SXM2-32GB
...

@yaroslavvb
Author

Closing since the issue was solved.
The solution to the oversubscription problem was to use a hostfile with slot counts (hosts.slots) instead of a --host string for mpirun.
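
For reference, a minimal sketch of that setup, assuming the hostfile is named hosts.slots, the two hosts from the last command, and 8 slots per node (the file name and slot counts are illustrative):

hosts.slots:
172.31.34.74 slots=8
172.31.45.216 slots=8

/usr/local/mpi/bin/mpirun --hostfile hosts.slots -np 16 \
    -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 \
    -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH \
    ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2

With slots=8 declared for each host, mpirun sees 16 available slots and -oversubscribe is no longer needed.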
