
Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU with Custom Data. #240

Closed
krishna0125 opened this issue Jul 7, 2023 · 4 comments · Fixed by #275
Labels
question Further information is requested

Comments

@krishna0125

No description provided.

@krishna0125 krishna0125 changed the title Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU. Getting OOM Error for finetuning Falcon-7b Model in 80GB A100 GPU with Custom Data. Jul 7, 2023
@thanhnew2001

Yes, that's true. I had the same problem. It is weird, however, that you can run the Alpaca 52k dataset without any problem.

@keurcien

keurcien commented Jul 9, 2023

Had the same issue today. I tried to run finetune/adapter_v2 for Falcon 7B on an NVIDIA L4 (24 GB VRAM):

Loading model 'checkpoints/tiiuae/falcon-7b/lit_model.pth' with {'org': 'tiiuae', 'name': 'falcon-7b', 'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True, '_norm_class': 'LayerNorm', 'norm_eps': 1e-05, '_mlp_class': 'GptNeoxMLP', 'intermediate_size': 18176, 'condense_ratio': 1, 'adapter_prompt_length': 10, 'adapter_start_layer': 2}
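As a sanity check on the config printed above, the parameter count can be estimated directly from it. A back-of-envelope sketch (my own arithmetic, not from the thread; `bias=False` and `n_query_groups=1`, i.e. multi-query attention, per the printed config):

```python
# Rough parameter count from the printed falcon-7b config.
n_layer, n_head, n_embd = 32, 71, 4544
padded_vocab_size, intermediate_size, n_query_groups = 65024, 18176, 1

head_dim = n_embd // n_head  # 64
embedding = padded_vocab_size * n_embd
qkv = n_embd * (n_embd + 2 * n_query_groups * head_dim)  # fused q,k,v projection (MQA)
attn_out = n_embd * n_embd
mlp = 2 * n_embd * intermediate_size                     # GptNeoxMLP: fc + proj, no bias
norms = n_layer * 2 * n_embd + 2 * n_embd                # one shared LayerNorm per block + final
lm_head = padded_vocab_size * n_embd                     # untied output head

total = embedding + n_layer * (qkv + attn_out + mlp) + norms + lm_head
print(f"{total / 1e9:.2f}B parameters")  # ≈ 7.2B
```

So "7B" here really means roughly 7.2 billion parameters, which matters for the memory arithmetic below.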

but it exited at the training step:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacty of 21.96 GiB of which 70.88 MiB is free. Including non-PyTorch memory, this process has 21.88 GiB memory in use. Of the allocated memory 21.62 GiB is allocated by PyTorch, and 58.35 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
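The traceback's own hint (`max_split_size_mb`) can be applied before relaunching. A minimal sketch; the 128 MiB value is an illustrative choice, not one taken from the thread:

```python
# Sketch: apply the allocator hint from the traceback above.
# PYTORCH_CUDA_ALLOC_CONF must be set before CUDA is initialized,
# i.e. before the first torch CUDA allocation in the process.
# max_split_size_mb caps cached block sizes to reduce fragmentation;
# 128 is an illustrative value.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
# ...then import torch and run the finetuning script as usual.
```

Note that this only helps when the "reserved by PyTorch but unallocated" figure is large; in the traceback above it is only ~58 MiB, so the process is genuinely at capacity and reducing actual memory use is more likely to help.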

@thanhnew2001

thanhnew2001 commented Jul 9, 2023 via email

@carmocca
Contributor

You can try the suggestions described in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md
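To see why those suggestions (lower precision, parameter-efficient finetuning, smaller micro-batches) are the right lever here, a hedged back-of-envelope memory estimate (my own arithmetic, assuming ~7.2B parameters and ignoring gradients and activations):

```python
# Back-of-envelope memory for Falcon-7B finetuning.
params = 7.2e9
GiB = 1024**3

fp32_weights = params * 4 / GiB          # ~27 GiB: weights alone exceed a 24 GB L4
bf16_weights = params * 2 / GiB          # ~13 GiB: fits, leaving room for activations
# Full finetuning with AdamW adds two fp32 state tensors per parameter:
adamw_full = params * (4 + 4 + 4) / GiB  # weights + exp_avg + exp_avg_sq
print(f"fp32 weights: {fp32_weights:.1f} GiB")
print(f"bf16 weights: {bf16_weights:.1f} GiB")
print(f"fp32 AdamW full finetune: {adamw_full:.1f} GiB")
```

The full-finetune figure is already at the limit of an 80 GB A100 before gradients and activations are counted, which matches the original report; adapter-style finetuning sidesteps this by keeping optimizer state only for the small adapter weights.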

@carmocca carmocca added the question Further information is requested label Jul 12, 2023