The model is only about 13 GB on disk, so why isn't 62 GB of RAM enough? I watched the python3 process's memory grow to 62 GB before the OS killed it. Meanwhile, GPU memory usage stayed at 1MiB / 24576MiB the whole time.
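For scale, here is a rough back-of-the-envelope estimate of where the host RAM could go. The assumptions are mine, not something the logs confirm: the ~13 GB figure is the fp16 checkpoint on disk, transformers materializes the weights in fp32 by default, and an extra transient copy of the weights exists at peak during loading or DeepSpeed initialization.

```python
# Hypothetical host-RAM estimate for llama-7b-hf (assumptions noted above).
n_params = 6_742_609_920                        # "all params" reported in the log below

fp16_on_disk_gib = n_params * 2 / 1024**3       # ~12.6 GiB: matches the "about 13 GB" checkpoint
fp32_in_ram_gib  = n_params * 4 / 1024**3       # ~25.1 GiB once materialized in float32
peak_two_copies  = fp32_in_ram_gib * 2          # ~50.2 GiB if a second copy exists transiently

print(f"fp16 checkpoint on disk:  {fp16_on_disk_gib:5.1f} GiB")
print(f"fp32 weights in RAM:      {fp32_in_ram_gib:5.1f} GiB")
print(f"peak with one extra copy: {peak_two_copies:5.1f} GiB")
```

With other processes on the box, a ~50 GiB transient peak can plausibly push a 62 GB machine into the OOM killer, even though the LoRA optimizer states themselves are tiny (4.2M trainable params is only ~32 MiB of Adam state).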
This issue has been marked as stale because it has not had recent activity. If you think this still needs to be addressed, please feel free to reopen this issue. Thanks.
[2023-04-21 22:17:06,284] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-21 22:17:06,284] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-21 22:17:06,284] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-21 22:17:06,284] [INFO] [launch.py:162:main] dist_world_size=1
[2023-04-21 22:17:06,284] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-21 22:17:11,840] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
04/21/2023 22:17:12 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
04/21/2023 22:17:14 - WARNING - datasets.builder - Found cached dataset json (/root/.cache/huggingface/datasets/json/default-cc7d8860227c3483/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 33/33 [00:50<00:00, 1.55s/it]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
[2023-04-21 22:20:31,527] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 19501
[2023-04-21 22:20:31,529] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3.8', '-u', '/hy-tmp/LMFlow-main/examples/finetune.py', '--local_rank=0', '--model_name_or_path', '/hy-tmp/models/llama-7b-hf', '--dataset_path', '/hy-tmp/LMFlow-main/data/alpaca/train', '--output_dir', '/hy-tmp/models/new_7b', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '1e-4', '--block_size', '16', '--per_device_train_batch_size', '1', '--use_lora', '1', '--lora_r', '8', '--save_aggregated_lora', '0', '--deepspeed', '/hy-tmp/LMFlow-main/configs/ds_config_zero2.json', '--bf16', '--run_name', 'finetune_with_lora', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9
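Return code -9 means the subprocess was killed with SIGKILL, which is what the kernel's OOM killer sends when host RAM is exhausted. Below is a minimal sketch of loading the base model with a smaller host-memory footprint using plain transformers rather than LMFlow's finetune.py; the dtype and loading flags here are my suggestion, not options the LMFlow script necessarily exposes.

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: load the 7B checkpoint in fp16 and stream the shards instead of
# building a full fp32 model and then copying weights into it, which keeps peak
# host RAM close to the on-disk size (~13 GiB) instead of 25+ GiB.
model = AutoModelForCausalLM.from_pretrained(
    "/hy-tmp/models/llama-7b-hf",
    torch_dtype=torch.float16,   # keep weights in half precision at load time
    low_cpu_mem_usage=True,      # avoid a second full copy of the weights in RAM
)
```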