File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 281, in init_processes
fn(args)
File "train.py", line 92, in main
train_nelbo, global_step = train(train_queue, model, cnn_optimizer, grad_scalar, global_step, warmup_iters, writer, logging)
File "train.py", line 160, in train
utils.average_params(model.parameters(), args.distributed)
File "/home/dsi/eyalbetzalel/NVAE/utils.py", line 274, in average_params
dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8
Am I doing something wrong?
Thanks,
Eyal
Are you running under WSL? WSL does not yet support NCCL: NVIDIA/nccl#442
If you are on WSL, you can try changing the backend at train.py:280,
"dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size)"
from "nccl" to "gloo".
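If you would rather not hard-code the change, here is a minimal sketch of choosing the backend at runtime. The helper name pick_backend and the WSL check are assumptions for illustration and are not part of NVAE: the check relies on WSL kernels reporting "microsoft" in their release string and on WSL sessions exporting WSL_DISTRO_NAME, which may not hold on every setup.

```python
import os
import platform


def pick_backend(requested="nccl", release=None, environ=None):
    """Return "gloo" when the host looks like WSL, otherwise `requested`.

    `release` and `environ` default to the live system values; they are
    parameters only so the check can be exercised without a WSL machine.
    """
    release = platform.uname().release if release is None else release
    environ = os.environ if environ is None else environ
    # Assumed heuristic: WSL kernels report "microsoft" in their release
    # string, and WSL sessions usually export WSL_DISTRO_NAME.
    on_wsl = "microsoft" in release.lower() or "WSL_DISTRO_NAME" in environ
    return "gloo" if on_wsl else requested


# Hypothetical use at the call quoted above (needs torch, so left as a comment):
# dist.init_process_group(backend=pick_backend(), init_method='env://',
#                         rank=rank, world_size=size)
```

On a non-WSL machine this returns "nccl" unchanged, so it would not hide an NCCL failure that has some other cause.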
Hi,
I am trying to run NVAE on my machine with your command line for CIFAR10 (changing only the .. from 8 to 4 because I have 4 GPUs):
and get this error:
Am I doing something wrong?
Thanks,
Eyal