NCCL tests don't work on WSL #442

Open
PolKul opened this issue Dec 18, 2020 · 18 comments

@PolKul

PolKul commented Dec 18, 2020

I've installed NCCL and its tests on WSL. When trying to run a test like this:

NCCL_ALGO=Ring NCCL_PROTO=Simple NCCL_DEBUG_FILE=debug.%h.%p NCCL_DEBUG=INFO ./build/all_reduce_perf -b 128M -e 128M -g 1 -n 1 -w 0 -c 0 -m 0

I get the following error message:

nThread 1 nGpus 1 minBytes 134217728 maxBytes 134217728 step: 1048576(bytes) warmup iters: 0 iters: 1 validation: 0

Using devices
Rank 0 Pid 36629 on DESKTOP device 0 [0x21] TITAN RTX
NCCL version 2.8.3+cuda11.1
DESKTOP: Test NCCL failure common.cu:777 'unhandled system error'

The debug log shows this:

DESKTOP:36629:36629 [0] NCCL INFO Bootstrap : Using eth0:192.168.143.185<0>
DESKTOP:36629:36629 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

DESKTOP:36629:36629 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
DESKTOP:36629:36629 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.143.185<0>
DESKTOP:36629:36629 [0] NCCL INFO Using network Socket
DESKTOP:36629:36629 [0] NCCL INFO NCCL version 2.8.3+cuda11.1

DESKTOP:36629:36635 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:21/../../0000:21:00.0
DESKTOP:36629:36635 [0] NCCL INFO graph/xml.cc:469 -> 2
DESKTOP:36629:36635 [0] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP:36629:36635 [0] NCCL INFO graph/topo.cc:522 -> 2
DESKTOP:36629:36635 [0] NCCL INFO init.cc:627 -> 2
DESKTOP:36629:36635 [0] NCCL INFO init.cc:878 -> 2
DESKTOP:36629:36635 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
DESKTOP:36629:36629 [0] NCCL INFO init.cc:946 -> 2

NCCL version: 2.8.3
CUDA version: 11.1
Windows: 10.0.20277
WSL: Ubuntu 20.04

@Dango233

I have exactly the same problem...

@AddyLaddy
Collaborator

Thanks for these reports. Currently NCCL is not supported on WSL2 installations but we are working on validating it.

@PolKul
Author

PolKul commented Jan 16, 2021

I think this is also the reason why I cannot use multi-GPU training with PyTorch: when I use PyTorch DataParallel, it gives me a similar NCCL error.
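
A minimal way to check this outside a full training script is a single DataParallel forward pass. This is only a hypothetical sketch; it assumes a CUDA build of PyTorch and at least two visible GPUs, since the replicate step in DataParallel is what exercises NCCL broadcast:

# Hypothetical repro: a DataParallel forward pass goes through the NCCL broadcast path
NCCL_DEBUG=INFO python -c "import torch; m = torch.nn.DataParallel(torch.nn.Linear(8, 8).cuda()); print(m(torch.randn(4, 8).cuda()).shape)"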

@amannm

amannm commented Jan 21, 2021

Thanks for these reports. Currently NCCL is not supported on WSL2 installations but we are working on validating it.

I also ran into the issue of NCCL simply not supporting WSL environments. It would have helped to have the lack of support documented right here https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations

This might be the only place on the net a dev has said anything on the topic.

@monotaro3

I may have the same error. I'm trying to run multi-GPU training across two nodes, one of which is a WSL2 environment, but the NCCL communicator seems to hang, showing "cupy.cuda.nccl.NcclError: NCCL_ERROR_SYSTEM_ERROR: unhandled system error" only on the WSL2 side. Looking forward to the fix.
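
As a quick check on the WSL2 side, CuPy can report which NCCL build it actually loaded (a one-liner sketch, assuming CuPy was installed with NCCL support):

# Prints the NCCL version CuPy is using on this node (e.g. 2708 for NCCL 2.7.8)
python -c "from cupy.cuda import nccl; print(nccl.get_version())"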

@jogiji

jogiji commented Jul 10, 2021

Any update on this issue? NCCL support for WSL2 is needed so that I can use Transfer Learning Toolkit 3 on my Windows desktop via WSL2.

@AddyLaddy
Collaborator

NCCL 2.10.3 was released last week and it should support WSL2 with a single GPU. Multi-GPU has not been validated yet.
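
For anyone verifying the single-GPU path after upgrading, re-running the nccl-tests command from the original report with -g 1 should be enough. A sketch, assuming nccl-tests was rebuilt against the upgraded NCCL:

# Single-GPU sanity check against NCCL >= 2.10.3
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8M -e 128M -f 2 -g 1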

@jogiji

jogiji commented Sep 3, 2021

It still doesn't work with the latest upgrades to TAO on WSL2 and the newest driver, 510.06; the output follows.
FYI, I am trying to run the latest TAO Toolkit from NGC in Docker on WSL2, with an RTX 3090 GPU.

Epoch 1/80
  1/238 [..............................] - ETA: 35:21 - loss: 3.4665 - acc: 0.0938WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2021-09-03 03:34:33,108 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

  2/238 [..............................] - ETA: 19:56 - loss: 3.6723 - acc: 0.0625/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.548347). Check your callbacks.
  % delta_t_median)
238/238 [==============================] - 40s 169ms/step - loss: 2.1327 - acc: 0.4371 - val_loss: 1.5816 - val_acc: 0.5542
96d216ed9f8a:127:179 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.18.0.2<0>
96d216ed9f8a:127:179 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
96d216ed9f8a:127:179 [0] NCCL INFO NET/IB : No device found.
96d216ed9f8a:127:179 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.18.0.2<0>
96d216ed9f8a:127:179 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

96d216ed9f8a:127:179 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0a/../../0000:0a:00.0
96d216ed9f8a:127:179 [0] NCCL INFO graph/xml.cc:469 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO graph/xml.cc:660 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO graph/topo.cc:523 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:581 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:840 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:876 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:887 -> 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
	 [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 500, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 494, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 495, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 468, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 77, in _average_metrics_in_place
    self.backend.get_session().run(self.allreduce_ops[metric])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 500, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 495, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 468, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 73, in _average_metrics_in_place
    self._make_variable(metric, value)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 58, in _make_variable
    allreduce_op = hvd.allreduce(var, device_dense=self.device)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 80, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

2021-09-03 09:05:06,420 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

@sjeaugey
Member

sjeaugey commented Sep 3, 2021

From your log:

NCCL version 2.7.8+cuda11.1

Note that NCCL might have been compiled statically into TensorFlow, so upgrading the system NCCL might not be enough to get the newest version.
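
Two quick ways to confirm which NCCL a framework actually loads at runtime, sketched with placeholders ("python train.py" stands in for whatever command launches the training, and the Horovod extension path is only an example):

# NCCL prints the version it initializes once debug output is enabled
NCCL_DEBUG=INFO python train.py 2>&1 | grep "NCCL version"

# If libnccl does not show up here, NCCL was most likely linked in statically
ldd /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib*.so | grep -i nccl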

@tanzhenyu

So the current status is that NCCL isn't supported (with multiple GPUs) on WSL.

@softmatic

Same issue here with WSL2 (Windows 11), driver 510.06 and torch 1.9.1.cu111 with 2x 2080 Super.

@AddyLaddy
Collaborator

NCCL 2.11.4 has been tested on multi-GPU Win11 systems. I don't know what drivers and OS level are required though. You need to make sure that your pytorch/tensorflow subsystem hasn't been statically linked against an older NCCL version.
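
For PyTorch, one quick way to see which NCCL version a given build links against is from Python (assuming a recent PyTorch build with CUDA support):

# Prints the NCCL version compiled into / loaded by this PyTorch build
python -c "import torch; print(torch.cuda.nccl.version())"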

@softmatic

@AddyLaddy Thanks for getting back to me. I checked and Torch 1.9.1.cu111 apparently uses NCCL 2.7.8. Will have to see what our options are now.

@cascgu

cascgu commented Oct 28, 2021

NCCL 2.11.4 has been tested on multi-GPU Win11 systems. I don't know what drivers and OS level are required though. You need to make sure that your pytorch/tensorflow subsystem hasn't been statically linked against an older NCCL version.

@AddyLaddy How can I unlink the old NCCL from PyTorch and update PyTorch's NCCL to version 2.11.4? I have installed version 2.11.4 in WSL2 and it passes nccl-tests. However, when training a model, PyTorch 1.7.1 still calls NCCL 2.7.8.

@AddyLaddy
Collaborator

I'm not a PyTorch expert, but I believe you need to configure and rebuild it using the USE_SYSTEM_NCCL=1 option. Perhaps ask in a PyTorch forum for help?
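
For reference, a rough sketch of that rebuild, assuming NCCL 2.11.4 headers and libraries are already installed in standard system locations; the exact steps are best confirmed against the PyTorch build documentation:

# Build PyTorch from source against the system NCCL instead of the bundled copy
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
USE_SYSTEM_NCCL=1 python setup.py install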

@cascgu

cascgu commented Oct 31, 2021

@AddyLaddy Thank you very much. I'll try to recompile PyTorch.

@Chan0081

@AddyLaddy Thank you very much. I'll try to recompile PyTorch.

Hi, I've run into the same issue recently. Did recompiling PyTorch work?
