exits with return code = -9 #279

Closed

FmLiu opened this issue Apr 21, 2023 · 3 comments


FmLiu commented Apr 21, 2023

[2023-04-21 22:17:06,284] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-21 22:17:06,284] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-21 22:17:06,284] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-21 22:17:06,284] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-21 22:17:06,284] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-21 22:17:11,840] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/21/2023 22:17:12 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
04/21/2023 22:17:14 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-cc7d8860227c3483/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 33/33 [00:50<00:00, 1.55s/it]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
[2023-04-21 22:20:31,527] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 19501
[2023-04-21 22:20:31,529] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3.8', '-u', '/hy-tmp/LMFlow-main/examples/finetune.py', '--local_rank=0', '--model_name_or_path', '/hy-tmp/models/llama-7b-hf', '--dataset_path', '/hy-tmp/LMFlow-main/data/alpaca/train', '--output_dir', '/hy-tmp/models/new_7b', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--block_size', '16', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', '/hy-tmp/LMFlow-main/configs/ds_config_zero2.json', '--bf16', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9

The model is only about 13 GB, so why is 62 GB of memory not enough? I watched the python3 process's memory usage climb to 62 GB before it was killed by the OS. By the way, GPU memory usage stays at 1 MiB / 24576 MiB the whole time.
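Return code -9 means the process received SIGKILL, which on Linux typically comes from the kernel OOM killer once host RAM is exhausted; the kill should also appear in `dmesg` as an "Out of memory: Killed process" entry. A minimal sketch (not LMFlow code, assuming a Linux host) to watch available host memory while the job runs:

```python
# Minimal sketch (assuming a Linux host): print available host RAM every few
# seconds while the training job runs, to confirm it is exhausted right
# before the -9 (SIGKILL) exit.
import time

def available_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # kB -> GiB
    return 0.0

if __name__ == "__main__":
    while True:
        print(f"MemAvailable: {available_gib():.1f} GiB", flush=True)
        time.sleep(5)
```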

@pengxiao-song

Sad... I'm hitting the same problem and have no idea how to fix it.

shizhediao (Contributor) commented Apr 21, 2023

Usually, fine-tuning requires about 200 GB of RAM; see #179.
I think it is related to the DeepSpeed strategy.
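For a sense of scale, here is a rough back-of-the-envelope estimate (an illustration, not LMFlow code): with mixed-precision Adam, full-parameter fine-tuning keeps roughly 16 bytes of state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), and if the ZeRO-2 config offloads optimizer state to the CPU, most of this lands in host RAM rather than on the GPU. LoRA shrinks the trainable share, but peak host memory during checkpoint loading and DeepSpeed initialization can still exceed a 62 GB machine, which is why the recommendation is far larger than the 13 GB checkpoint itself.

```python
# Back-of-the-envelope estimate for full-parameter fine-tuning with
# mixed-precision Adam (the scenario behind the ~200 GB figure in #179).
params = 6_742_609_920  # "all params" reported in the log above

bytes_per_param = (
    2    # fp16 weights
    + 2  # fp16 gradients
    + 4  # fp32 master weights
    + 4  # Adam momentum (fp32)
    + 4  # Adam variance (fp32)
)

total_gib = params * bytes_per_param / 2**30
print(f"~{total_gib:.0f} GiB of model/optimizer state")  # ~100 GiB
```

Activations, dataloader workers, and temporary copies made while loading the sharded checkpoint come on top of this estimate.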

shizhediao (Contributor) commented

This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed, please feel free to reopen it. Thanks.
