Ok, forget about it. From the Ollama docs I can see it uses llama.cpp in the background, and it supports manually adding custom GGUF models with various parameters.
Are there any plans for llama.cpp integration?
It uses quantized models, which allows running bigger models with less VRAM (https://github.com/ggerganov/llama.cpp#quantization), and you can put part of the computation on the CPU and part on the GPU if the whole model cannot fit into GPU memory.
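As a minimal sketch of the CPU/GPU split: llama.cpp's `-ngl` (`--n-gpu-layers`) flag offloads a given number of transformer layers to the GPU while the rest run on the CPU. The model path and layer count below are placeholders, adjust them for your hardware and model:

```shell
# Run a quantized GGUF model with partial GPU offload.
# -m   path to the quantized model file (placeholder name here)
# -ngl number of layers to offload to the GPU; remaining layers run on the CPU
# -p   prompt, -n max tokens to generate
./main \
  -m ./models/llama-2-7b.Q4_K_M.gguf \
  -ngl 20 \
  -p "Hello, world" \
  -n 64
```

Raising `-ngl` until you run out of VRAM is the usual way to find the best split for a given card.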
It supports a server mode too: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#testing-with-curl
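A quick sketch of server mode, following the linked README (the model path, context size, and port are placeholders): start the server, then send a completion request with curl.

```shell
# Start the llama.cpp server on port 8080 (placeholder model path and context size).
./server -m ./models/llama-2-7b.Q4_K_M.gguf -c 2048 --port 8080 &

# Query the /completion endpoint with a JSON payload, as shown in the server README.
curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:", "n_predict": 128}'
```

The response comes back as JSON with the generated text in the `content` field.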