Ability to set the number of layers to offload to GPU #51
Comments
TIL that we can do that in llama.cpp, haha. So off the top of my head, I don't know how to do this. But let's look into it together. Can you point me to the llama.cpp docs that explain how to do it? I also encourage you to dig through the code and docs of Ollama if you're up for it (and then show us, lol).
Oh wait, I just saw the relevant option. Ollama allows us to customize this through its model options. We would need to pass this value to Ollama when making API requests.
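For illustration, here is a minimal sketch of what forwarding that value to Ollama's HTTP API could look like. It assumes Ollama's `/api/generate` endpoint and its `num_gpu` option (the number of layers to offload to the GPU); the surrounding function and model names are illustrative, not code from this repository.

```typescript
// Minimal sketch: forward a configured GPU layer count to Ollama.
// Assumes Ollama is listening on localhost:11434 and that the `num_gpu`
// option controls how many layers are offloaded to the GPU.

interface GenerateRequest {
  model: string;
  prompt: string;
  stream: boolean;
  options?: { num_gpu?: number };
}

async function generate(prompt: string, numGpuLayers?: number): Promise<string> {
  const body: GenerateRequest = {
    model: "llama2",
    prompt,
    stream: false,
    // Only send the option when it is explicitly configured, so Ollama
    // keeps its default auto-detection otherwise.
    ...(numGpuLayers !== undefined ? { options: { num_gpu: numGpuLayers } } : {}),
  };

  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const json = await res.json();
  return json.response as string;
}

// Example: offload 17 of the model's layers to the GPU.
generate("Hello!", 17).then(console.log);
```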
Probably the best way to do this would be to let the system admin specify these as part of the instance configuration. I'll document the steps in case you (or someone else) want to contribute this feature.
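A rough sketch of what the app-side wiring could look like, assuming the value comes from an admin-supplied environment variable; `OLLAMA_NUM_GPU_LAYERS` below is a hypothetical name, not an actual SecureAI-Tools setting:

```typescript
// Hypothetical wiring: read an admin-supplied setting and translate it into
// Ollama request options. OLLAMA_NUM_GPU_LAYERS is an illustrative name,
// not an actual SecureAI-Tools environment variable.

function getOllamaOptions(): { num_gpu?: number } {
  const raw = process.env.OLLAMA_NUM_GPU_LAYERS;
  if (!raw) {
    // Not configured: let Ollama decide how many layers to offload.
    return {};
  }
  const numGpu = Number.parseInt(raw, 10);
  if (Number.isNaN(numGpu) || numGpu < 0) {
    throw new Error(`Invalid OLLAMA_NUM_GPU_LAYERS value: ${raw}`);
  }
  return { num_gpu: numGpu };
}

// These options would then be merged into each Ollama API call,
// e.g. the `options` field of the /api/generate request shown above.
```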
@rahulvk007 With the latest release (v0.0.3), you can now set the number of layers to offload: https://github.com/SecureAI-Tools/SecureAI-Tools#customize-llm-provider-specific-options
Please try it out and let me know if you run into any issues.
Since this is using llama.cpp in the backend, is there any way to customise the number of layers to offload to the GPU?
Right now I am using localGPT, and I get great performance by offloading 17/35 layers to the GPU without any CUDA out-of-memory crashes.
But here I can see that it automatically offloads 21 layers, which causes it to crash with a CUDA out-of-memory error and fall back to the CPU, resulting in extremely slow performance.
Nevertheless, this is a great product with a very friendly UI.