
NCCL WARN NET/Socket : message truncated #268

Open · zwqjoy opened this issue Nov 28, 2019 · 7 comments

zwqjoy commented Nov 28, 2019

When using NCCL for training:

NCCL version 2.4.7+cudaCUDA_MAJOR.CUDA_MINOR
hvd1:2918:3333 [0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/Socket : message truncated : receiving 696320 bytes instead of 32768

WARNING:tensorflow:`eval_strategy` is not passed in. No distribution strategy will be used for evaluation.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
WARNING:tensorflow:ModelCheckpoint callback is not provided. Workers will need to restart training if any fails.
Train for 937 steps
Epoch 1/3
2019-11-28 16:59:29.962581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-11-28 16:59:31.771855: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
**NCCL version 2.4.7+cudaCUDA_MAJOR.CUDA_MINOR**
hvd1:2918:3332 [0] NCCL INFO Setting affinity for GPU 0 to ffffff,ffffffff
hvd1:2918:3332 [0] NCCL INFO Could not find real path of /sys/class/net/eth0/device
hvd1:2918:3332 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:24 -> 2
hvd1:2918:3332 [0] NCCL INFO CUDA Dev 0[3], Socket NIC distance :  SYS
hvd1:2918:3332 [0] NCCL INFO Channel 00 :    0   1
hvd1:2918:3332 [0] NCCL INFO Could not find real path of /sys/class/net/eth0/device
hvd1:2918:3332 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:24 -> 2
hvd1:2918:3332 [0] NCCL INFO Ring 00 : 1 -> 0 [receive] via NET/Socket/0
hvd1:2918:3332 [0] NCCL INFO Ring 00 : 0 -> 1 [send] via NET/Socket/0
hvd1:2918:3332 [0] NCCL INFO Using 256 threads, Min Comp Cap 6, Trees disabled
hvd1:2918:3332 [0] NCCL INFO comm 0x7fbf380021e0 rank 0 nranks 2 cudaDev 0 nvmlDev 3 - Init COMPLETE
hvd1:2918:3331 [0] NCCL INFO Launch mode Parallel
 17/937 [..............................] - ETA: 6:57 - loss: 5.9124 - sparse_categorical_accuracy: 0.0358   
**hvd1:2918:3333 [0] external/nccl_archive/src/transport/net_socket.cc:200 NCCL WARN NET/Socket : message truncated : receiving 696320 bytes instead of 32768**
hvd1:2918:3333 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:34 -> 3
hvd1:2918:3333 [0] NCCL INFO external/nccl_archive/src/transport/net.cc:533 -> 3
hvd1:2918:3333 [0] NCCL INFO external/nccl_archive/src/transport.cc:163 -> 3 [Proxy Thread]

sjeaugey (Member) commented Dec 2, 2019

Thanks for the report. You seem to be running NCCL 2.4.7 compiled in an unusual way, such that CUDA_MAJOR/CUDA_MINOR were not substituted in the version string.

Can you tell us more about the environment you are running in? For example, where this TensorFlow build comes from, how NCCL was compiled, and on which platform you are executing?

That would help us reproduce and understand the issue, and figure out how you could try a newer version like 2.4.8 or 2.5.6.

372046933 commented:

I ran into the same problem with the official TensorFlow 2.0.1 Docker image.

sjeaugey (Member) commented:

This looks like a setup issue, e.g. different ranks not calling NCCL consistently.

I would suggest submitting the issue to TensorFlow to see if they have advice on what could be wrong.

shanguanma commented:

I also hit the same issue. My compute environment is as follows:
NCCL version 2.10.3+cuda11.1, Ubuntu 20.04. I am training a model with PyTorch across multiple machines with multiple GPUs.
The error is as follows:

transport/net_socket.cc:424 NCCL WARN NET/Socket : peer 192.168.161.30<59210> message truncated :
receiving 1048576 bytes instead of 65536

I don't know how to solve it.

sjeaugey (Member) commented:

Can you set NCCL_PROTO=SIMPLE and see if the problem still happens?
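
For reference, a minimal sketch of applying this in a PyTorch job like the one described above (assumption: the process group uses the NCCL backend; exporting NCCL_PROTO=SIMPLE in each node's launch shell is equivalent):

```python
# Sketch only: NCCL environment variables must be set on every rank before the
# NCCL communicator is created, otherwise they are not picked up.
import os

os.environ["NCCL_PROTO"] = "SIMPLE"  # force the Simple protocol on all ranks
os.environ["NCCL_DEBUG"] = "INFO"    # optional: confirms the settings in the log

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # variables take effect at communicator creation
```

The important part is that the variable is set identically on every node, not just one of them.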

shanguanma commented:

@sjeaugey, thanks for your reply.
Unfortunately, your suggestion did not solve the issue.

By the way, I found one case that works: if I reduce the training data (from 150 hours to 2 minutes), distributed training runs fine.
My NCCL environment is as follows:

export NCCL_SOCKET_IFNAME="eno1np0"
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1  # because I don't have InfiniBand
export NCCL_DEBUG_SUBSYS=ENV

However, when the data is large (e.g. 150 hours), distributed training does not work.

I hope you can help me. Thanks a lot.

sjeaugey (Member) commented:

Sorry for the delay.

This could be due to one of two things:

  • A protocol mismatch, which is what I wanted to check by setting NCCL_PROTO=SIMPLE. This can happen if different environment variables are set on different nodes, causing NCCL to choose the LL protocol (which has a 64K chunk size) on one node and the Simple protocol (which has a 1MB chunk size) on another. Given that those are exactly the sizes reported in your log, this was the most probable cause. Are you sure NCCL_PROTO was set on all ranks? Are you getting the exact same error message, with the same sizes (65536 and 1048576)?
  • An application bug, where different ranks call NCCL allreduce with different sizes. This is also quite common. You can detect it by setting NCCL_DEBUG_SUBSYS=COLL and checking that all ranks always call NCCL with the same sizes, but that generates a lot of output and can be tedious if the failure only appears after a long time (see the sketch below for an application-side cross-check). Also, if setting NCCL_PROTO=SIMPLE causes the error to still happen but with different symptoms, it is likely such an issue, where one rank runs out of data before the others and that generates an error.

Hope this helps.
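
To illustrate the second case, here is a minimal sketch of an application-side check (PyTorch assumed, as in the setup above; checked_all_reduce is an illustrative helper, not a PyTorch or NCCL API): gather the element count from every rank before reducing, so a mismatch fails with an explicit error instead of a truncated-message warning deep inside NCCL.

```python
# Sketch only: assumes torch.distributed is already initialized with the NCCL backend.
import torch
import torch.distributed as dist


def checked_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """All-reduce `tensor`, but first verify every rank passes the same element count."""
    local = torch.tensor([tensor.numel()], dtype=torch.int64, device=tensor.device)
    counts = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(counts, local)  # cheap: one int64 per rank
    sizes = [int(c.item()) for c in counts]
    if len(set(sizes)) != 1:
        raise RuntimeError(
            f"rank {dist.get_rank()}: all_reduce called with mismatched sizes {sizes}"
        )
    dist.all_reduce(tensor)
    return tensor
```

A data pipeline where one worker runs out of batches before the others (plausible with 150 hours of data but not with a 2-minute subset) is a typical way such a mismatch shows up.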
