exits with return code = -9 #279

Closed

FmLiu opened this issue Apr 21, 2023 · 3 comments


FmLiu commented Apr 21, 2023

[2023-04-21 22:17:06,284] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-21 22:17:06,284] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-21 22:17:06,284] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-21 22:17:06,284] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-21 22:17:06,284] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-21 22:17:11,840] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/21/2023 22:17:12 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
04/21/2023 22:17:14 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-cc7d8860227c3483/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 33/33 [00:50<00:00, 1.55s/it]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
[2023-04-21 22:20:31,527] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 19501
[2023-04-21 22:20:31,529] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3.8', '-u', '/hy-tmp/LMFlow-main/examples/finetune.py', '--local_rank=0', '--model_name_or_path', '/hy-tmp/models/llama-7b-hf', '--dataset_path', '/hy-tmp/LMFlow-main/data/alpaca/train', '--output_dir', '/hy-tmp/models/new_7b', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--block_size', '16', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', '/hy-tmp/LMFlow-main/configs/ds_config_zero2.json', '--bf16', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9

The model is only about 13 GB, so why is 62 GB of memory not enough? I watched the python3 process's memory usage climb to 62 GB before it was killed by the OS. By the way, GPU memory usage stays at 1 MiB / 24576 MiB the whole time.
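Return code -9 means the process received SIGKILL, which on Linux typically comes from the kernel OOM killer once host RAM is exhausted; the kill should also appear in `dmesg` as an "Out of memory: Killed process" entry. A minimal sketch (not LMFlow code, assuming a Linux host) to watch available host memory while the job runs:

```python
# Minimal sketch (assuming a Linux host): print available host RAM every few
# seconds while the training job runs, to confirm it is exhausted right
# before the -9 (SIGKILL) exit.
import time

def available_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # kB -> GiB
    return 0.0

if __name__ == "__main__":
    while True:
        print(f"MemAvailable: {available_gib():.1f} GiB", flush=True)
        time.sleep(5)
```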

@pengxiao-song

Sad... I'm hitting the same problem and have no idea how to fix it.

shizhediao (Contributor) commented Apr 21, 2023

Usually, fine-tuning requires about 200 GB of RAM; see #179.
I think it is related to the DeepSpeed strategy.
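For a sense of scale, here is a rough back-of-the-envelope estimate (an illustration, not LMFlow code): with mixed-precision Adam, full-parameter fine-tuning keeps roughly 16 bytes of state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), and if the ZeRO-2 config offloads optimizer state to the CPU, most of this lands in host RAM rather than on the GPU. LoRA shrinks the trainable share, but peak host memory during checkpoint loading and DeepSpeed initialization can still exceed a 62 GB machine, which is why the recommendation is far larger than the 13 GB checkpoint itself.

```python
# Back-of-the-envelope estimate for full-parameter fine-tuning with
# mixed-precision Adam (the scenario behind the ~200 GB figure in #179).
params = 6_742_609_920  # "all params" reported in the log above

bytes_per_param = (
    2    # fp16 weights
    + 2  # fp16 gradients
    + 4  # fp32 master weights
    + 4  # Adam momentum (fp32)
    + 4  # Adam variance (fp32)
)

total_gib = params * bytes_per_param / 2**30
print(f"~{total_gib:.0f} GiB of model/optimizer state")  # ~100 GiB
```

Activations, dataloader workers, and temporary copies made while loading the sharded checkpoint come on top of this estimate.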

shizhediao (Contributor) commented

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed, please feel free to reopen it. Thanks.
