NCCL Unable to register memory #563
Comments
It looks like you are using EFA because [...]. If that is the case, consider opening an issue at https://github.com/aws/aws-ofi-nccl.
Also try to run [...]
When world_size = 64, is this setting too big?
NCCL_NSOCKS_PERTHREAD=8 -x NCCL_SOCKET_NTHREADS=8 in your command line is causing each connection to create 64 sockets (each socket using one file), hence 64 sockets per connection * [...]
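The arithmetic in the comment above can be sketched as follows. This is only an illustration, assuming the socket transport is actually in use: the helper names `sockets_per_connection` and `fits_open_file_limit` are hypothetical, and the number of connections is left for the reader to supply because the comment is cut off before stating it.

```python
import resource


def sockets_per_connection(nsocks_perthread: int, socket_nthreads: int) -> int:
    """Sockets opened per connection: sockets per thread times thread count."""
    return nsocks_perthread * socket_nthreads


def fits_open_file_limit(total_sockets: int) -> bool:
    """Check a socket count against this process's soft open-file limit.

    Each socket consumes one file descriptor, so the total must stay
    below RLIMIT_NOFILE (unless the limit is unlimited).
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft == resource.RLIM_INFINITY or total_sockets < soft


# Values taken from the command line quoted in the thread.
per_conn = sockets_per_connection(nsocks_perthread=8, socket_nthreads=8)
print("sockets per connection:", per_conn)
print("soft open-file limit:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

To estimate the total, multiply `per_conn` by the number of connections your job opens and pass the result to `fits_open_file_limit`.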
Hi, I am a little confused about how sockets come into play in this scenario. If you are using EFA, then you are not using sockets. Or am I missing something?
I see. I checked this by setting them to 1, and the same error happened.
@wzamazon I also created an issue at EFA's GitHub: [...] If you are at Amazon, can we create a channel to debug? I know an EFA engineer. What's your internal ID at Amazon?
This issue was finally resolved by setting [...]
I was stuck with this problem for many days. May I know how to debug such issues?
Currently, I am using all_to_all NCCL MPI (world_size=64). The error is as follows.