RuntimeError: NCCL error #9

eyalbetzalel · 2020-10-22T18:51:26Z

Hi,

I am trying to run NVAE on my machine using your command line for CIFAR10 (changing only the .. from 8 to 4 because I have 4 GPUs):

export EXPR_ID=/home/dsi/eyalbetzalel/NVAE/logs  
export DATA_DIR=/home/dsi/eyalbetzalel/NVAE/data 
export CHECKPOINT_DIR=/home/dsi/eyalbetzalel/NVAE/cpt  
export CODE_DIR=/home/dsi/eyalbetzalel/NVAE  
cd $CODE_DIR

nohup python train.py --data $DATA_DIR/cifar10 --root $CHECKPOINT_DIR --save $EXPR_ID --dataset cifar10 \
        --num_channels_enc 128 --num_channels_dec 128 --epochs 400 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 1 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --num_groups_per_scale 30 --batch_size 32 \
        --weight_decay_norm 1e-2 --num_nf 1 --num_process_per_node 4 --use_se --res_dist --fast_adamax &> NVAE_DSIGPU13_test_2_22102020.out &

and get this error:

  File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 281, in init_processes
    fn(args)
  File "train.py", line 92, in main
    train_nelbo, global_step = train(train_queue, model, cnn_optimizer, grad_scalar, global_step, warmup_iters, writer, logging)
  File "train.py", line 160, in train
    utils.average_params(model.parameters(), args.distributed)
  File "/home/dsi/eyalbetzalel/NVAE/utils.py", line 274, in average_params
    dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
  File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
    work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8

Am I doing something wrong?

Thanks,
Eyal

kaushik333 · 2020-11-02T15:43:16Z

Could this be a version mismatch between PyTorch, CUDA, and NCCL? Which versions are you using?
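
For anyone wanting to report those versions, here is a minimal sketch of one way to collect them from inside the training environment. The helper name `collect_versions` is hypothetical and not part of NVAE. It assumes PyTorch exposes `torch.__version__`, `torch.version.cuda`, and `torch.cuda.nccl.version()`, and it returns `None` for anything it cannot read.

```python
import importlib


def collect_versions():
    """Return the PyTorch, CUDA, and NCCL versions visible to this environment.

    Hypothetical helper for bug reports; values stay None when unavailable.
    """
    info = {"torch": None, "cuda": None, "nccl": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # PyTorch is not installed in this environment

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    try:
        nccl = importlib.import_module("torch.cuda.nccl")
        info["nccl"] = nccl.version()  # format differs between PyTorch releases
    except Exception:
        pass  # NCCL is not available in this build
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Running it in the same conda environment used for train.py and pasting the three lines into the thread should be enough to compare against the NCCL version shown in the error message.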

ImanHosseini · 2021-02-02T11:12:50Z

Are you running under WSL? WSL does not yet support NCCL: NVIDIA/nccl#442
If you are on WSL, you can try changing the backend at train.py:280,
"dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size)",
from "nccl" to "gloo".
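
If you would rather not hard-code the change, a rough sketch of choosing the backend at runtime could look like this. The helpers `running_under_wsl` and `pick_backend` are hypothetical names, not part of NVAE, and the check relies on the assumption that WSL kernels include "microsoft" in their release string, which is a common heuristic rather than a guarantee.

```python
import platform


def running_under_wsl(release=None):
    """Heuristic WSL check: WSL kernels usually report 'microsoft' in the release string."""
    if release is None:
        release = platform.uname().release
    return "microsoft" in release.lower()


def pick_backend(release=None):
    """Return 'gloo' under WSL (no NCCL support there yet), otherwise 'nccl'."""
    return "gloo" if running_under_wsl(release) else "nccl"


# Possible use in train.py, replacing the hard-coded backend string:
# dist.init_process_group(backend=pick_backend(), init_method='env://',
#                         rank=rank, world_size=size)
```

On a native Linux machine this keeps "nccl", so the change should only affect WSL users.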

ImanHosseini mentioned this issue Feb 2, 2021

Adding 'WSL' support issue #22

Open
