NCCL WARN NET/Socket : message truncated #268
When using NCCL for training, I get:

NCCL version 2.4.7+cudaCUDA_MAJOR.CUDA_MINOR
hvd1:2918:3333 [0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/Socket : message truncated : receiving 696320 bytes instead of 32768

Comments
Thanks for the report. It seems to be running NCCL 2.4.7 compiled in a weird way, such that CUDA_MAJOR/CUDA_MINOR were not replaced. Can you tell us more about the environment in which you are running? For example, where does this TensorFlow version come from, how was NCCL compiled, and on which platform are you executing? That would help us try to reproduce and understand the issue. It might also help us figure out how to try a newer version, such as 2.4.8 or 2.5.6.
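To make those environment details easier to collect, here is a minimal sketch (not code from this thread) that prints the TensorFlow version and the CUDA toolchain it was built against from a TF 2.x Python environment; tf.sysconfig.get_build_info() only exists in newer releases, hence the fallback:

```python
# Hedged sketch: report the version details asked for above.
import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
try:
    # Newer TF releases expose the CUDA/cuDNN versions the wheel was built with.
    print("Build info:", tf.sysconfig.get_build_info())
except AttributeError:
    print("tf.sysconfig.get_build_info() is not available in this TF version")
```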
I ran into the same problem with the official TensorFlow 2.0.1 Docker image.
This looks like a setup issue, e.g. different ranks not calling NCCL consistently. I would suggest filing the issue against TensorFlow and seeing if they have advice on what could be wrong.
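As a rough illustration of what calling NCCL inconsistently can look like (a sketch under assumptions, not code from this issue; Horovod is assumed because of the hvd prefix in the log), the tensor sizes below mirror the 32768-byte vs 696320-byte mismatch in the warning, i.e. 8192 vs 174080 float32 values:

```python
# Illustration only: ranks that disagree on the size (or order) of a
# collective are calling NCCL inconsistently. 8192 float32 values = 32768
# bytes and 174080 float32 values = 696320 bytes, as in the warning above.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Consistent usage: every rank reduces a tensor of the same shape.
ok = hvd.allreduce(tf.ones([8192]))

# Inconsistent usage: the shape depends on the rank (for example an
# unevenly split dataset or an uneven last batch), so sender and receiver
# disagree about how many bytes the message should contain.
n = 8192 if hvd.rank() == 0 else 174080
bad = hvd.allreduce(tf.ones([n]))
```

Horovod may reject the shape mismatch itself before NCCL runs; at the NCCL level, the same disagreement is what the "message truncated : receiving X bytes instead of Y" warning reports.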
I also hit the same issue. My compute environment is as follows:
I don't know how to solve it.
Can you set
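Assuming the variable being suggested here is NCCL's documented NCCL_DEBUG (the standard way to get more NCCL diagnostics), it has to be set before NCCL initializes, for example:

```python
# Assumption: the suggestion refers to the documented NCCL_DEBUG variable.
# NCCL reads it at initialization, so set it before the first collective runs.
import os

os.environ["NCCL_DEBUG"] = "INFO"        # verbose init/transport logging
os.environ["NCCL_DEBUG_SUBSYS"] = "NET"  # optional: focus on the network path

import tensorflow as tf  # imported after the variables are set
# ... run the distributed job as usual; NCCL then prints its version and the
# transports it selects to stderr.
```

Exporting the same variables in the shell that launches each worker works equally well.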
@sjeaugey, thanks for your reply. By the way, I found that this case works; the case is as follows:
However, when the data is large (e.g. 150 hours), the distributed training does not work. I hope you can help me. Thanks a lot.
Sorry for the delay. This could be due mostly to two things:
Hope this helps.