-
Hi, spec-wise I run a single 3090 with 64GB of system RAM and a Ryzen 5 3600. I recently switched to using llama-server as a backend to get closer to the prompt-building process, especially with special tokens, for an app I am working on. Previously I was using Ooba's TextGen WebUI as my backend (in other words, llama-cpp-python). I know it's mostly all the same under the hood, but I'm wondering whether something in my own compilation process may be contributing to what appear to be somewhat lower inference speeds with some models since making the switch, and whether there is anything I can do at compile time to help. On Ooba I believe I was using the "CUDA_USE_TENSOR_CORES" option, and I was wondering if that is something specific to llama-cpp-python, or if there is a way to make sure it is used at compile time or run-time. Here is some of the relevant output I get when I run llama-server:
When I looked under "Issues", it seemed that people with similar RTX cards had been advised to build for "all" architectures, if I understood correctly, so I'm guessing I probably would not get much benefit from specifying anything beyond turning the CUDA flag on when running CMake? FWIW, I ran "cmake -B build -DGGML_CUDA=ON" for my compile. Thank you in advance!
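For reference, this is roughly what my build looks like. The second variant with an explicit architecture is just my guess at what targeting the 3090 directly would look like (86 being its compute capability); I haven't confirmed it makes any difference:

```sh
# What I actually ran:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Guess: pinning the CUDA architecture to the 3090 (compute capability 8.6)
# instead of building for all architectures -- unverified whether this
# changes inference speed at all.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j
```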
-
@wwoodsTM If you offload all layers to the GPU you will get maximum performance. From your log, -ngl should be set to 41; this will give you the best you can get for now.
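For example, something like this (the model path, context size, and port are just placeholders for your setup):

```sh
# Offload all 41 layers to the 3090; adjust paths and context size as needed.
./llama-server -m ./models/your-model.gguf -ngl 41 -c 4096 --port 8080
```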
-
Try to quantize the KV cache and enable Flash Attention. This should give you some room for extra layers on the GPU.
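A rough example, assuming a reasonably recent build (exact flag syntax can vary between versions, and q8_0 is just a common choice for the cache types):

```sh
# Enable Flash Attention and quantize the K/V cache to q8_0;
# quantizing the V cache requires Flash Attention to be enabled.
./llama-server -m ./models/your-model.gguf -ngl 41 -c 4096 \
  -fa -ctk q8_0 -ctv q8_0
```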