Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU with Custom Data. #240
Labels: question (further information is requested)
krishna0125 changed the title from "Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU." to "Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU with Custom Data." on Jul 7, 2023
Yes, that's true. I had the same problem. It is weird, though, that you can run the Alpaca 52k dataset without problems.
Had the same issue today: tried to run finetune/adapter_v2 for Falcon 7B on an NVIDIA L4 (24 GB VRAM), but it exited at the training step with torch.cuda.OutOfMemoryError.
Indeed, no luck with 24 GB of VRAM. Some big datasets even failed on an 80 GB machine; I was only successful with Alpaca 52k on an 80 GB machine.
Two options: the first is to reduce the dataset size and the number of epochs. The other is to use QLoRA, which supports a 4-bit data type. Unfortunately, lit-parrot does not support it. I hope to have time to write a detailed tutorial about this in a few days.
Thanh
On Sun, 9 Jul 2023 at 15:11, keurcien ***@***.***> wrote:
> Had the same issue today, tried to run finetune/adapter_v2 for Falcon 7B on a NVIDIA L4 (24 GB RAM)
> Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'org': 'tiiuae', 'name': 'falcon-7b', 'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'intermediate_size': 18176, 'condense_ratio': 1, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
> but it exited at the training step:
> torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacty of 21.96 GiB of which 70.88 MiB is free. Including non-PyTorch memory, this process has 21.88 GiB memory in use. Of the allocated memory 21.62 GiB is allocated by PyTorch, and 58.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Best regards,
Thanh
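Since lit-parrot did not support QLoRA at the time of this thread, one alternative route (an assumption for illustration, not part of this repo) is the Hugging Face stack, where 4-bit loading is configured via `BitsAndBytesConfig` (requires `pip install transformers accelerate bitsandbytes`):

```python
# Hypothetical sketch: load Falcon-7B with 4-bit (QLoRA-style) quantization
# through Hugging Face Transformers; this is NOT the lit-parrot workflow.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the base weights to 4 bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

With the base weights held in 4 bit, only the small trainable adapter and the activations need higher-precision memory, which is what makes 7B fine-tuning feasible on 24 GB-class GPUs.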
You can try the suggestions described in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md
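Independent of those tutorial suggestions, the traceback itself points at one knob: `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch (the 128 MiB value is an arbitrary example to tune, not a recommendation):

```shell
# The OOM message suggests capping the allocator's split size to reduce
# fragmentation when reserved-but-unallocated memory is large.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"   # prints: max_split_size_mb:128
```

This only helps with fragmentation; it cannot fix a genuine shortfall in total VRAM.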