Getting "help-mpi-btl-base.txt / btl:no-nics" when trying to run on Ethernet network #21

Closed
yaroslavvb opened this issue May 18, 2019 · 9 comments


@yaroslavvb

Is there a special way to build the test if I want to run it on Ethernet? I'm trying to run it on the AWS 100 Gbps network and getting the following error:

ubuntu@ip-172-31-15-234:~$ /usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 2 -N 1 -x LD_LIBRARY_PATH=~/nccl/nccl-2.3.7/nccl/build/lib:$LD_LIBRARY_PATH ~/nccl/nccl-2.3.7/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8 
--------------------------------------------------------------------------
[[33448,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: ip-172-31-3-83

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
[ip-172-31-15-234:73547] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[ip-172-31-15-234:73547] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
@sjeaugey
Member

There is no InfiniBand on AWS; you can add -mca btl ^openib to mpirun to suppress this warning.

Problems are usually due to MPI trying to use docker0 to communicate, so I also add -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 to force the use of the ENA card.
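
Putting both suggestions together, a minimal sketch of the adjusted launch, assuming the same two hosts as in the command above and that ens5 is the name of the ENA interface on the instances:

/usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 2 -N 1 \
    -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 \
    -x LD_LIBRARY_PATH=~/nccl/nccl-2.3.7/nccl/build/lib:$LD_LIBRARY_PATH \
    ~/nccl/nccl-2.3.7/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8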

@yaroslavvb
Author

yaroslavvb commented May 18, 2019

Thanks, that fixed it!
I'm now seeing the following from a 1280 MB allreduce across 2 nodes.

/usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np 4 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH -oversubscribe ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2 -g 4

# nThread 1 nGpus 4 minBytes 1342177280 maxBytes 1342177280 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   8932 on ip-172-31-15-234 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid   8932 on ip-172-31-15-234 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid   8932 on ip-172-31-15-234 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid   8932 on ip-172-31-15-234 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid   8933 on ip-172-31-15-234 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid   8933 on ip-172-31-15-234 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid   8933 on ip-172-31-15-234 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid   8933 on ip-172-31-15-234 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid   8299 on ip-172-31-3-83 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid   8299 on ip-172-31-3-83 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid   8299 on ip-172-31-3-83 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid   8299 on ip-172-31-3-83 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid   8300 on ip-172-31-3-83 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid   8300 on ip-172-31-3-83 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid   8300 on ip-172-31-3-83 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid   8300 on ip-172-31-3-83 device  7 [0x00] Tesla V100-SXM2-32GB
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1342177280     335544320   float     sum   396609    3.38    6.35  4e-07   396519    3.38    6.35  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.34596 
#

Some questions remain:

  • specifying -g 4 and mpirun -np 4 gives me 8 GPUs per node; I expected 4 GPUs per node
  • sudo nload shows my utilization to be about 25 Gbps, while I'm getting 93 Gbps using iperf3 with 5 processes and 10 connections each. AWS throttles each connection to 10 Gbps, so I need at least 10 connections between these two machines. Is there some setting I can add to increase the number of connections? I've been running NCCL from this branch

Setting NCCL_MIN_NRINGS=16 didn't have an effect.

@AddyLaddy
Collaborator

Have you tried this PR against NCCL: NVIDIA/nccl#223?

@yaroslavvb
Author

Yes, that's the version I'm testing above.

@AddyLaddy
Collaborator

OK, let's continue this discussion over on the NCCL project.

@AddyLaddy
Collaborator

Typically we have been running with 1 process per GPU, so you can use mpirun -n $GPUS -N 8 for an 8-GPU-per-node job. You would then need to use -g 1 on the all_reduce_perf command line.
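
A minimal sketch of that layout, assuming two 8-GPU nodes (16 ranks total), the same hosts as in the first command, and the MCA flags suggested above; as discussed further down, Open MPI will also want matching slot counts (or -oversubscribe) to place this many ranks:

GPUS=16
/usr/local/mpi/bin/mpirun --host 172.31.15.234,172.31.3.83 -np $GPUS -N 8 \
    -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 \
    -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH \
    ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 1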

@sjeaugey
Member

sjeaugey commented May 22, 2019

When running with MPI, you would usually not set -g or -t (keep both equal to one) and just vary the -np argument of mpirun to match the total number of GPUs.

This matches what most frameworks do.

@yaroslavvb
Author

yaroslavvb commented May 31, 2019

Thanks, it works without the -g arg now. However, I'm still confused about why I need the --oversubscribe argument; I'm wondering if I'm missing some slots configuration somewhere.

i.e., running:

/usr/local/mpi/bin/mpirun --host 172.31.34.74,172.31.45.216 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2

There are not enough slots available in the system to satisfy the 16 slots
that were requested by the application:
  /home/ubuntu/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf

Either request fewer slots for your application, or make more slots available
for use.

When I add --oversubscribe, it works:

ubuntu@ip-172-31-34-74:~$ /usr/local/mpi/bin/mpirun --host 172.31.34.74,172.31.45.216 -np 16 -N 8 -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 -mca orte_base_help_aggregate 0 -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH -oversubscribe ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2;

# nThread 1 nGpus 1 minBytes 1342177280 maxBytes 1342177280 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank  0 Pid  84278 on ip-172-31-34-74 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid  84279 on ip-172-31-34-74 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid  84280 on ip-172-31-34-74 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid  84281 on ip-172-31-34-74 device  3 [0x00] Tesla V100-SXM2-32GB
...

@yaroslavvb
Author

Closing since the issue was solved.
The solution to the oversubscription problem was to use a hostfile with slot counts (hosts.slots) instead of a --host string for mpirun.
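
For reference, a minimal sketch of that setup, assuming the hostfile is named hosts.slots, the two hosts from the last command, and 8 slots per node (the file name and slot counts are illustrative):

hosts.slots:
172.31.34.74 slots=8
172.31.45.216 slots=8

/usr/local/mpi/bin/mpirun --hostfile hosts.slots -np 16 \
    -mca btl ^openib -mca oob_tcp_if_include ens5 -mca btl_tcp_if_include ens5 \
    -x LD_LIBRARY_PATH=~/nccl/nccl-2.4.7ms0/nccl/build/lib:$LD_LIBRARY_PATH \
    ~/nccl/nccl-2.4.7ms0/nccl-tests/build/all_reduce_perf -b 1280M -e 1280M -f 2

With slots=8 declared for each host, mpirun sees 16 available slots and -oversubscribe is no longer needed.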
