-
Hi, spec-wise I run a single 3090 with 64GB of system RAM and a Ryzen 5 3600. I recently switched to using llama-server as a backend to get closer to the prompt-building process, especially with special tokens, for an app I am working on. Previously I was using Ooba's TextGen WebUI as my backend (in other words, llama-cpp-python). I know it's mostly all the same under the hood, but I'm wondering whether something in my own compilation process may be contributing to what appear to be somewhat lower inference speeds with some models since making the switch, and whether there is anything I can do at compile time to help. On Ooba I believe I was using the "CUDA_USE_TENSOR_CORES" option, and I was wondering if that is something specific to llama-cpp-python, or if there is a way to make sure it is used at compile time or run-time. Here is some of the relevant output I get when I run llama-server:
When I looked under "Issues", it seemed that people with similar RTX cards had been advised to build for "all" architectures, if I understood correctly, so I'm guessing I probably would not get much benefit from specifying anything beyond turning the CUDA flag on when running CMake? FWIW, I ran "cmake -B build -DGGML_CUDA=ON" for my compile. Thank you in advance!
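For reference, this is roughly what my build looks like. The second variant with an explicit architecture is just my guess at what targeting the 3090 directly would look like (86 being its compute capability); I haven't confirmed it makes any difference:

```sh
# What I actually ran:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Guess: pinning the CUDA architecture to the 3090 (compute capability 8.6)
# instead of building for all architectures -- unverified whether this
# changes inference speed at all.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --config Release -j
```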
-
@wwoodsTM If you offload all layers to the GPU you will get maximum performance. From your log, -ngl should be set to 41; this will give you the best you can get for now.
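For example, something like this (the model path, context size, and port are just placeholders for your setup):

```sh
# Offload all 41 layers to the 3090; adjust paths and context size as needed.
./llama-server -m ./models/your-model.gguf -ngl 41 -c 4096 --port 8080
```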
-
Try to quantize the KV cache and enable Flash Attention. This should give you some room for extra layers on the GPU.
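A rough example, assuming a reasonably recent build (exact flag syntax can vary between versions, and q8_0 is just a common choice for the cache types):

```sh
# Enable Flash Attention and quantize the K/V cache to q8_0;
# quantizing the V cache requires Flash Attention to be enabled.
./llama-server -m ./models/your-model.gguf -ngl 41 -c 4096 \
  -fa -ctk q8_0 -ctv q8_0
```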