
model saving error #81

Closed
imrankh46 opened this issue Apr 17, 2023 · 10 comments

@imrankh46

The trainer does not save the model weights; it gives me the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.75 
GiB total capacity; 12.97 GiB already allocated; 6.81 MiB free; 13.69 GiB 
reserved in total by PyTorch) If reserved memory is >> allocated memory try 
setting max_split_size_mb to avoid fragmentation.  See documentation for Memory 
Management and PYTORCH_CUDA_ALLOC_CONF
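
A side note on the message itself: the max_split_size_mb hint it mentions only takes effect if PYTORCH_CUDA_ALLOC_CONF is set before CUDA is initialized. A minimal sketch, with an illustrative 128 MiB value:

import os

# The caching allocator reads this variable when CUDA is first used,
# so set it before any tensor is placed on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the setting so the allocator picks it up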

@Facico (Owner) commented Apr 18, 2023

Your error is exceeding the GPU memory limit, which should be unrelated to model saving. Did your program train properly while it was running?

@imrankh46 (Author)

> Your error is exceeding the GPU memory limit, which should be unrelated to model saving. Did your program train properly while it was running?

No, training runs through all the epochs fine; the error only shows up afterwards. We cannot save the LLaMA weights the way we save other models, using the trainer.save_model() or model.save_pretrained() methods.

@SunnyMarkLiu

Same error for me!

@Facico (Owner) commented Apr 19, 2023

What version of transformers are you using?

@imrankh46 (Author)

> Same error for me!

I solved the error. Just add this line before saving:

model.cpu()

and then save the model.
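
A minimal sketch of that workaround, assuming a Hugging Face model object (the output path is illustrative):

# Move the weights to host memory so saving does not allocate on the GPU.
model = model.cpu()

# Then save with the usual Hugging Face API.
model.save_pretrained("./saved_model")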

@imrankh46 (Author)

> What version of transformers are you using?

Same as yours.

@Facico (Owner) commented Apr 19, 2023

@imrankh46 Our transformers is pulled directly from GitHub, so there may be a slight difference. The commit hash of our transformers at the time was roughly ff20f9cf3615a8638023bc82925573cb9d0f3560. You may be able to solve the problem by uninstalling transformers and reinstalling it as "git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560".
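
Spelled out as commands, the reinstall from that commit would be:

pip uninstall -y transformers
pip install "git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560"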

@imrankh46 (Author)

> @imrankh46 Our transformers is pulled directly from GitHub, so there may be a slight difference. The commit hash of our transformers at the time was roughly ff20f9cf3615a8638023bc82925573cb9d0f3560. You may be able to solve the problem by uninstalling transformers and reinstalling it as "git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560".

I tried that, but it did not work. I think the LLaMA model code or tokenizer is written in C++. The model trains successfully.

After saving, it gives the CUDA out-of-memory error.

I will also try your approach.

@Facico (Owner) commented Apr 19, 2023

There is the same issue in other repos. You can also refer to their method and downgrade the version of bitsandbytes.
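
For example (the exact target version is an assumption; the 0.37.x series was the common downgrade suggested at the time):

pip uninstall -y bitsandbytes
pip install bitsandbytes==0.37.2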

@imrankh46 (Author)

> There is the same issue in other repos. You can also refer to their method and downgrade the version of bitsandbytes.

Thank you.
