
NCCL version #12

Closed
cdhx opened this issue Mar 14, 2022 · 12 comments

cdhx commented Mar 14, 2022

Hi,

I have installed the environment from the YAML file and installed torch 1.8 following the settings in the README.

My CUDA version is 11.4; it looks like a version conflict between NCCL, PyTorch, and CUDA.

Is my CUDA version too high?

ssh://xh@210.28.134.34:22/home2/xh/.conda/envs/skg/bin/python -u -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_prefix_compwebq.cfg --run_name T5_base_prefix_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_prefix_compwebq --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
INFO:filelock:Lock 140211123887800 acquired on .lock
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/__init__.py:422: UserWarning: torch.set_deterministic is deprecated and will be removed in a future release. Please use torch.use_deterministic_algorithms instead
  "torch.set_deterministic is deprecated and will be removed in a future "
[W Context.cpp:70] Warning: torch.use_deterministic_algorithms is in beta, and its design and functionality may change in the future. (function operator())
INFO:filelock:Lock 140211123887800 released on .lock
INFO:filelock:Lock 140144150953768 acquired on .lock
INFO:filelock:Lock 140144150953768 released on .lock
INFO:filelock:Lock 139898741587640 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139898741587640 released on .lock
INFO:filelock:Lock 139711655354096 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139711655354096 released on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Killing subprocess 30316
Killing subprocess 30320
Killing subprocess 30321
Killing subprocess 30322
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home2/xh/.conda/envs/skg/bin/python', '-u', 'train.py', '--local_rank=3', '--seed', '2', '--cfg', 'Salesforce/T5_base_prefix_compwebq.cfg', '--run_name', 'T5_base_prefix_compwebq', '--logging_strategy', 'steps', '--logging_first_step', 'true', '--logging_steps', '4', '--evaluation_strategy', 'steps', '--eval_steps', '500', '--metric_for_best_model', 'avr', '--greater_is_better', 'true', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--load_best_model_at_end', '--gradient_accumulation_steps', '2', '--num_train_epochs', '400', '--adafactor', 'true', '--learning_rate', '5e-5', '--do_train', '--do_eval', '--do_predict', '--predict_with_generate', '--output_dir', 'output/T5_base_prefix_compwebq', '--overwrite_output_dir', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '4', '--generation_num_beams', '4', '--generation_max_length', '128', '--input_max_length', '1024', '--ddp_find_unused_parameters', 'true']' returned non-zero exit status 1.

Process finished with exit code 1

I also tried torch 1.11 + cu113 and got another error:


(skg) xh@4210GPU:~/PycharmProject/UnifiedSKG$ python -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_compwebq.cfg --run_name T5_base_finetune_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_compwebq --overwrite_output_dir --per_device_train_batch_size 1 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 29, in main
    torch.set_deterministic(True)
AttributeError: module 'torch' has no attribute 'set_deterministic'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17913) of binary: /home2/xh/.conda/envs/skg/bin/python
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 17914)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 17915)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 17916)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-14_21:36:50
  host      : 4210GPU
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 17913)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Looking forward to your reply.
Thank you.


cdhx commented Mar 14, 2022

One more piece of information: I only use one GPU.

Timothyxxx (Contributor) commented

Indeed, we ran into a similar situation during our experiments on some machines (we used a lot of GPUs on different kinds of HPC clusters). I remember we fixed that issue by installing the proper PyTorch build for your CUDA version.
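
For example, on this machine something like the following should fetch the matching build (assuming cu111 is the closest official 1.8.1 build for a CUDA 11.4 driver, since drivers are backward compatible with older toolkit builds):

pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html

Then a quick sanity check that the build can actually see the GPUs; if this prints False or 0, NCCL initialization will fail exactly as in your log:

import torch

print(torch.__version__, torch.version.cuda)                 # build versions
print(torch.cuda.is_available(), torch.cuda.device_count())  # runtime GPU visibility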


cdhx commented Mar 14, 2022

I will try it, thanks

cdhx closed this as completed Mar 14, 2022

cdhx commented Mar 16, 2022

Sorry to bother you again, but I still cannot get it to run.

If I do not choose a GPU, it seems to work fine (not sure, since it eventually runs out of memory).

But if I choose one GPU, I get the error below. My torch version is 1.8.1+cu111, the same as the environment in the README.

INFO:filelock:Lock 140672744878032 acquired on .lock
INFO:filelock:Lock 140672744878032 released on .lock
INFO:filelock:Lock 139725435957376 acquired on .lock
INFO:filelock:Lock 139725435957376 released on .lock
INFO:filelock:Lock 140361970860216 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 140361970860216 released on .lock
INFO:filelock:Lock 139971805774008 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 139971805774008 released on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 510, in init_process_group
    timeout=timeout))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 603, in _new_process_group_helper
    timeout)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Killing subprocess 4345
Killing subprocess 4346
Killing subprocess 4347
Killing subprocess 4348
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home2/xh/.conda/envs/skg/bin/python', '-u', 'train.py', '--local_rank=3', '--seed', '2', '--cfg', 'Salesforce/T5_base_finetune_compwebq.cfg', '--run_name', 'T5_base_finetune_compwebq', '--logging_strategy', 'steps', '--logging_first_step', 'true', '--logging_steps', '4', '--evaluation_strategy', 'steps', '--eval_steps', '500', '--metric_for_best_model', 'avr', '--greater_is_better', 'true', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--load_best_model_at_end', '--gradient_accumulation_steps', '2', '--num_train_epochs', '400', '--adafactor', 'true', '--learning_rate', '5e-5', '--do_train', '--do_eval', '--do_predict', '--predict_with_generate', '--output_dir', 'output/T5_base_finetune_compwebq', '--overwrite_output_dir', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '4', '--generation_num_beams', '4', '--generation_max_length', '128', '--input_max_length', '1024', '--ddp_find_unused_parameters', 'true']' returned non-zero exit status 1.

Process finished with exit code 1

Another question: does it support torch 1.11? I got the error AttributeError: module 'torch' has no attribute 'set_deterministic' when using torch 1.11+cu113, but when I checked the source code I found that it does have that attribute.

Thanks

ChenWu98 (Contributor) commented

I think it is a PyTorch version issue. My personal experience is that removing the following lines works for other PyTorch versions.
https://github.com/HKUNLP/UnifiedSKG/blob/fab45fea3a349c9dbda4ed34482df227920272db/train.py#L27-L29
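
For reference, rather than deleting those lines outright, a version-portable variant could look like this (a sketch, assuming the three lines are the torch.set_deterministic(True) setup that the tracebacks above point at):

import torch

# torch.set_deterministic was deprecated in torch 1.8/1.9 and removed by 1.11;
# torch.use_deterministic_algorithms is the long-term replacement.
if hasattr(torch, "use_deterministic_algorithms"):
    torch.use_deterministic_algorithms(True)
else:
    torch.set_deterministic(True)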

ChenWu98 (Contributor) commented

Note that this may sacrifice reproducibility, if that is not your main concern.


cdhx commented Mar 16, 2022

Thanks for your reply, but it still does not work.
This error occurs only when I choose a single GPU, and the error message makes it look like the failure comes from distributed training?

ChenWu98 (Contributor) commented

Ohh the above answer is for your second question. After removing the three lines, does torch 1.11+cu113 work?
For your first question, we are still exploring.
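
For the single-GPU case, one more thing that might be worth trying (a guess, assuming the GPU is selected via CUDA_VISIBLE_DEVICES): launch a single process so that only one NCCL rank is created, e.g.

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node 1 --master_port 1234 train.py <same arguments as above>

or skip the launcher entirely with a plain python train.py <same arguments>.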


cdhx commented Mar 16, 2022

> Ohh the above answer is for your second question. After removing the three lines, does torch 1.11+cu113 work? For your first question, we are still exploring.

It does not work. This is the error log with a single GPU:

ssh://xh@210.28.134.34:22/home2/xh/.conda/envs/skg/bin/python -u -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_compwebq.cfg --run_name T5_base_finetune_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_compwebq --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:filelock:Lock 140395078486112 acquired on .lock
INFO:filelock:Lock 140395078486112 released on .lock
INFO:filelock:Lock 140049714288456 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 610, in init_process_group
    timeout=timeout,
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 738, in _new_process_group_helper
    pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
INFO:filelock:Lock 140049714288456 released on .lock
INFO:filelock:Lock 140359993235328 acquired on .lock
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 41, in main
    training_args, = parser.parse_args_into_dataclasses()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/hf_argparser.py", line 191, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 83, in __init__
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 702, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and (self.fp16 or self.fp16_full_eval):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 873, in device
    return self._setup_devices
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1717, in __get__
    cached = self.fget(obj)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/file_utils.py", line 1727, in wrapper
    return func(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/training_args.py", line 858, in _setup_devices
    torch.distributed.init_process_group(backend="nccl")
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 610, in init_process_group
    timeout=timeout,
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 738, in _new_process_group_helper
    pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 7061 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 7062 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 7067 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 7066) of binary: /home2/xh/.conda/envs/skg/bin/python
Traceback (most recent call last):
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-03-16_11:27:44
  host      : 4210GPU
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 7066)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Process finished with exit code 1


cdhx commented Mar 16, 2022

Here is the log when I do not choose a GPU; it seems to work fine (until it eventually runs out of memory):

ssh://xh@210.28.134.34:22/home2/xh/.conda/envs/skg/bin/python -u -m torch.distributed.launch --nproc_per_node 4 --master_port 1234 train.py --seed 2 --cfg Salesforce/T5_base_finetune_compwebq.cfg --run_name T5_base_finetune_compwebq --logging_strategy steps --logging_first_step true --logging_steps 4 --evaluation_strategy steps --eval_steps 500 --metric_for_best_model avr --greater_is_better true --save_strategy steps --save_steps 500 --save_total_limit 1 --load_best_model_at_end --gradient_accumulation_steps 2 --num_train_epochs 400 --adafactor true --learning_rate 5e-5 --do_train --do_eval --do_predict --predict_with_generate --output_dir output/T5_base_finetune_compwebq --overwrite_output_dir --per_device_train_batch_size 2 --per_device_eval_batch_size 4 --generation_num_beams 4 --generation_max_length 128 --input_max_length 1024 --ddp_find_unused_parameters true
/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  FutureWarning,
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:filelock:Lock 140516411049744 acquired on .lock
INFO:filelock:Lock 140516411049744 released on .lock
INFO:filelock:Lock 140004243490688 acquired on .lock
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:filelock:Lock 140004243490688 released on .lock
INFO:filelock:Lock 139957069264600 acquired on .lock
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:filelock:Lock 139957069264600 released on .lock
INFO:filelock:Lock 140293807781760 acquired on .lock
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:filelock:Lock 140293807781760 released on .lock
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
task_args.bert.location: t5-base
task_args.bert.location: t5-base
task_args.bert.location: t5-base
WARNING:datasets.builder:Reusing dataset complex_web_questions (./data/complex_web_questions/compwebq/1.0.0/99dbaa17d7f00c56fd6810e977c673ffdd1e4f645fb01e302e6f33bd8de8556b)
  0%|                                                     | 0/3 [00:00<?, ?it/s]WARNING:datasets.builder:Reusing dataset complex_web_questions (./data/complex_web_questions/compwebq/1.0.0/99dbaa17d7f00c56fd6810e977c673ffdd1e4f645fb01e302e6f33bd8de8556b)
WARNING:datasets.builder:Reusing dataset complex_web_questions (./data/complex_web_questions/compwebq/1.0.0/99dbaa17d7f00c56fd6810e977c673ffdd1e4f645fb01e302e6f33bd8de8556b)
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 782.23it/s]
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 814.48it/s]
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 596.74it/s]
wandb: Currently logged in as: myproject (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.11 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.12.1
wandb: Syncing run T5_base_finetune_compwebq
wandb: ⭐️ View project at https://wandb.ai/myproject/skg
wandb: 🚀 View run at https://wandb.ai/myproject/skg/runs/2n7lsqyq
wandb: Run data is saved locally in /home2/xh/PycharmProject/UnifiedSKG/wandb/run-20220316_114350-2n7lsqyq
wandb: Run `wandb offline` to turn off syncing.

task_args.bert.location: t5-base
WARNING:datasets.builder:Reusing dataset complex_web_questions (./data/complex_web_questions/compwebq/1.0.0/99dbaa17d7f00c56fd6810e977c673ffdd1e4f645fb01e302e6f33bd8de8556b)
100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 923.04it/s]
Before upsampling {'META_TUNING/compwebq.cfg': 27639}
Upsampling weights {'META_TUNING/compwebq.cfg': 1.0}
After upsampling {'META_TUNING/compwebq.cfg': 27639}
Before upsampling {'META_TUNING/compwebq.cfg': 27639}
Upsampling weights {'META_TUNING/compwebq.cfg': 1.0}
After upsampling {'META_TUNING/compwebq.cfg': 27639}
Before upsampling {'META_TUNING/compwebq.cfg': 27639}
Upsampling weights {'META_TUNING/compwebq.cfg': 1.0}
After upsampling {'META_TUNING/compwebq.cfg': 27639}
Before upsampling {'META_TUNING/compwebq.cfg': 27639}
Upsampling weights {'META_TUNING/compwebq.cfg': 1.0}
After upsampling {'META_TUNING/compwebq.cfg': 27639}
Trainer build successfully.
Trainer build successfully.
Trainer build successfully.
Traceback (most recent call last):
  File "train.py", line 225, in <module>
    main()
  File "train.py", line 146, in main
    callbacks=[early_stopping_callback],
  File "/home2/xh/PycharmProject/UnifiedSKG/utils/trainer.py", line 50, in __init__
    super().__init__(*args, **kwargs)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/trainer.py", line 367, in __init__
    self._move_model_to_device(model, args.device)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/transformers/trainer.py", line 509, in _move_model_to_device
    model = model.to(device)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 907, in to
    return self._apply(convert)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/home2/xh/.conda/envs/skg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 905, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

ChenWu98 (Contributor) commented

Here is a minimal example for distributed training:
https://towardsdatascience.com/distributed-neural-network-training-in-pytorch-5e766e2a9e62
Could you verify if it works on your machine?
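
The core of that example is along these lines (a minimal sketch in the spirit of the article, not the exact code from it; it assumes a torch 1.8-style launch where --local_rank is passed as a command-line argument):

import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    # This is the call that fails in your logs when PyTorch sees no GPUs.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    # Tiny model wrapped in DDP, one process per GPU.
    model = nn.Linear(10, 1).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(5):
        x = torch.randn(8, 10).cuda(args.local_rank)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Run it with python -m torch.distributed.launch --nproc_per_node <number of GPUs> ddp_check.py (ddp_check.py being whatever you name the file). If even this fails with "no GPUs found", the problem is in the PyTorch/CUDA installation rather than in UnifiedSKG.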


neel04 commented Apr 3, 2023

Resetting the port and the rendezvous id (RDZV_ID) works for me. I think multiple runs with those same parameters collide? I'm not really sure here.
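
Something like the following, picking a fresh port and a run-unique rendezvous id each time (a sketch using torchrun's rendezvous flags; the id can be any string that is unique per run):

torchrun --nproc_per_node 4 --rdzv_backend c10d --rdzv_endpoint localhost:29501 --rdzv_id run_$RANDOM train.py <same arguments as above>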
