
Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU with Custom Data. #240

Closed
krishna0125 opened this issue Jul 7, 2023 · 4 comments · Fixed by #275
Labels
question Further information is requested

Comments

@krishna0125

No description provided.

@krishna0125 krishna0125 changed the title Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU. Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU with Custom Data. Jul 7, 2023
@thanhnew2001

Yes, that's true. I had the same problem. It is weird, however, that you can run the Alpaca 52k dataset without any problem.

@keurcien

keurcien commented Jul 9, 2023

Had the same issue today. I tried to run finetune/adapter_v2 for Falcon 7B on an NVIDIA L4 (24 GB VRAM):

Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'org': 'tiiuae', 'name': 'falcon-7b', 'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'intermediate_size': 18176, 'condense_ratio': 1, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
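As a sanity check on the config printed above, the parameter count can be estimated directly from it. A back-of-envelope sketch (my own arithmetic, not from the thread; `bias=False` and `n_query_groups=1`, i.e. multi-query attention, per the printed config):

```python
# Rough parameter count from the printed falcon-7b config.
n_layer, n_head, n_embd = 32, 71, 4544
padded_vocab_size, intermediate_size, n_query_groups = 65024, 18176, 1

head_dim = n_embd // n_head  # 64
embedding = padded_vocab_size * n_embd
qkv = n_embd * (n_embd + 2 * n_query_groups * head_dim)  # fused q,k,v projection (MQA)
attn_out = n_embd * n_embd
mlp = 2 * n_embd * intermediate_size                     # GptNeoxMLP: fc + proj, no bias
norms = n_layer * 2 * n_embd + 2 * n_embd                # one shared LayerNorm per block + final
lm_head = padded_vocab_size * n_embd                     # untied output head

total = embedding + n_layer * (qkv + attn_out + mlp) + norms + lm_head
print(f"{total / 1e9:.2f}B parameters")  # ≈ 7.2B
```

So "7B" here really means roughly 7.2 billion parameters, which matters for the memory arithmetic below.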

but it exited at the training step:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacty of 21.96 GiB of which 70.88 MiB is free. Including non-PyTorch memory, this process has 21.88 GiB memory in use. Of the allocated memory 21.62 GiB is allocated by PyTorch, and 58.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
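The traceback's own hint (`max_split_size_mb`) can be applied before relaunching. A minimal sketch; the 128 MiB value is an illustrative choice, not one taken from the thread:

```python
# Sketch: apply the allocator hint from the traceback above.
# PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized,
# i.e. before the first torch CUDA allocation in the process.
# max_split_size_mb caps cached block sizes to reduce fragmentation;
# 128 is an illustrative value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# ...then import torch and run the finetuning script as usual.
```

Note that this only helps when the "reserved by PyTorch but unallocated" figure is large; in the traceback above it is only ~58 MiB, so the process is genuinely at capacity and reducing actual memory use is more likely to help.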

@thanhnew2001

thanhnew2001 commented Jul 9, 2023 via email

@carmocca
Contributor

You can try the suggestions described in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md
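To see why those suggestions (lower precision, parameter-efficient finetuning, smaller micro-batches) are the right lever here, a hedged back-of-envelope memory estimate (my own arithmetic, assuming ~7.2B parameters and ignoring gradients and activations):

```python
# Back-of-envelope memory for Falcon-7B finetuning.
params = 7.2e9
GiB = 1024**3

fp32_weights = params * 4 / GiB          # ~27 GiB: weights alone exceed a 24 GB L4
bf16_weights = params * 2 / GiB          # ~13 GiB: fits, leaving room for activations
# Full finetuning with AdamW adds two fp32 state tensors per parameter:
adamw_full = params * (4 + 4 + 4) / GiB  # weights + exp_avg + exp_avg_sq
print(f"fp32 weights: {fp32_weights:.1f} GiB")
print(f"bf16 weights: {bf16_weights:.1f} GiB")
print(f"fp32 AdamW full finetune: {adamw_full:.1f} GiB")
```

The full-finetune figure is already at the limit of an 80 GB A100 before gradients and activations are counted, which matches the original report; adapter-style finetuning sidesteps this by keeping optimizer state only for the small adapter weights.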

@carmocca carmocca added the question Further information is requested label Jul 12, 2023