Getting "help-mpi-btl-base.txt / btl:no-nics" when trying to run on Ethernet network #21
Comments
There is no InfiniBand on AWS, you can add Usually problems are due to MPI trying to use
Thanks, that fixed it!
Some questions remain:
Setting |
Have you tried this PR against NCCL: NVIDIA/nccl#223?
Yes, that's the version I'm testing above.
OK, let's continue this discussion over on the NCCL project.
Typically we have been running with 1 process per GPU, so you can use mpirun -n $GPUS -N 8 for a job with 8 GPUs per node. You would then need to pass -g 1 on the all_reduce_perf command line.
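As a rough illustration of the one-process-per-GPU launch described above, here is a minimal dry-run sketch. It assumes Open MPI's mpirun (where -N sets processes per node), a two-node job (a hypothetical count), and the ./build/all_reduce_perf path with the -b/-e/-f size range taken from the nccl-tests README example; none of these come from this thread. The script only prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Dry-run sketch: build the launch line for one MPI rank per GPU.
# NODES and the message-size flags are assumptions for illustration.

NODES=2                      # hypothetical node count
GPUS_PER_NODE=8              # 8 GPUs per node, as in the comment above
GPUS=$((NODES * GPUS_PER_NODE))

# -n: total ranks, -N: ranks per node, -g 1: one GPU per rank
CMD="mpirun -n ${GPUS} -N ${GPUS_PER_NODE} ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1"

# Print only; run the printed line yourself on a real cluster.
echo "${CMD}"
```

With these values the total rank count equals the total GPU count (16), which is why each rank is given exactly one GPU.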
When running with MPI, you would usually want to not set This would match what most frameworks do.
Thanks, it works without the -g arg now. However, I'm still confused about why I need the --oversubscribe argument. Basically, I'm wondering if I'm missing some slots configuration somewhere, i.e.
When I add --oversubscribe, it works.
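One possible explanation for the slots question above, assuming Open MPI: mpirun refuses to start more ranks on a host than the host has slots unless --oversubscribe is given, and the slot count can be declared per host in a hostfile. The sketch below writes such a hostfile and prints a matching launch line. The host names node1 and node2 are hypothetical, and whether this was the actual cause here is not confirmed in the thread.

```shell
#!/usr/bin/env bash
# Sketch: declare 8 slots per host in an Open MPI hostfile so that
# 8 ranks per node fit without --oversubscribe.
# node1/node2 are hypothetical host names.

GPUS_PER_NODE=8
HOSTS="node1 node2"
HOSTFILE="$(mktemp)"

# One line per host, in Open MPI's "hostname slots=N" hostfile format.
for h in ${HOSTS}; do
  echo "${h} slots=${GPUS_PER_NODE}"
done > "${HOSTFILE}"

cat "${HOSTFILE}"

# Dry run: print the launch line that would use the hostfile.
echo "mpirun --hostfile ${HOSTFILE} -n 16 -N ${GPUS_PER_NODE} ./build/all_reduce_perf -g 1"
```

If the slot count is already correct and mpirun still complains, the cause lies elsewhere and this sketch does not apply.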
Closing since the issue was solved.
Is there a special way of building the test if I want to run it on Ethernet? I'm trying to run it on an AWS 100 Gbps network and getting the following error: