Speed regression on multi-Pascal-GPU with 1.56 #642

Open
candre23 opened this issue Jan 27, 2024 · 64 comments

@candre23

I'm seeing some significant increases in ms/T when running 1.56 across multiple Pascal GPUs. It works out to about a 33% speed reduction overall. 103b split across three P40s, identical 6k prompt:

1.55.1: Processing:99.62s (14.6ms/T), Generation:65.22s (324.5ms/T)

1.56: Processing:136.17s (20.0ms/T), Generation:214.71s (419.3ms/T)

I mentioned this on Discord and the answer seemed to be "that's just how it is now". I wasn't particularly satisfied with that answer, so I wanted to make an actual issue. Are we sure that's just how it is now, or is it possible that something isn't working correctly?

I get that Pascal is pretty old, but a lot of folks are still using these cards and this is a substantial speed hit. If this is an inevitable consequence of "something" having changed in how inferencing is done, would it be possible to revert back to the old method with a command line arg or something?

@Vladonai

Although I don't have Pascal hardware and this may be off topic, I'll note that initialization of 1.56 takes twice as long as 1.55...

@LostRuins
Owner

By initialization you mean loading the model?

@Vladonai

By initialization you mean loading the model?

Tried running the program now and got the usual initialization speed. I guess yesterday the computer was busy with something else :) So no, this problem is not confirmed.

But since I want to buy 3 Tesla P40s myself, please pay close attention to the problem in the opening post.

@LostRuins
Owner

LostRuins commented Jan 28, 2024

Yeah, I did run a few tests myself, but unfortunately I don't have a multi-GPU setup. For a single GPU it is as fast as ever:

1.56:
ContextLimit: 2048/2048, Processing:4.64s (2.3ms/T), Generation:1.60s (32.0ms/T), Total:6.25s (124.9ms/T = 8.01T/s)
ContextLimit: 2048/2048, Processing:4.61s (2.3ms/T), Generation:1.61s (32.3ms/T), Total:6.22s (124.4ms/T = 8.04T/s)

1.54:
ContextLimit: 2048/2048, Processing:4.82s (2.4ms/T), Generation:1.66s (33.1ms/T), Total:6.48s (7.72T/s)
ContextLimit: 2048/2048, Processing:4.72s (2.4ms/T), Generation:1.66s (33.2ms/T), Total:6.38s (7.84T/s)

Note that this is with mmq, lowvram set to off and full offload.
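For reference, a launch roughly matching that configuration (full offload, MMQ on, lowvram off) would look something like the line below; the model name and layer count are placeholders, not the exact values used in these tests.

koboldcpp.exe --usecublas mmq --gpulayers 99 --contextsize 2048 --model yourmodel.gguf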

@candre23
Author

Yes, I tried it with just a single P40, and the speed was basically the same from 1.55 to 1.56. It's just in multi-GPU that the new version slows down.

And just to confirm, the multi-GPU tests up top were for a full offload without lowvram enabled.

@Vladonai

Yes, I tried it with just a single P40, and the speed was basically the same from 1.55 to 1.56. It's just in multi-GPU that the new version slows down.

Try asking this question in the llama.cpp repository. One of the developers there also has 3xP40; he will probably want to figure it out.

@candre23
Author

I went to run some benchmarks on llama.cpp and the results are confusing. Obviously something is not like-for-like, but I have no way of determining what. The fact that the llama.cpp folks release multiple revisions per day makes it really tough to pick an "equivalent" version of LCPP to compare to a given version of KCPP. But here's the TL;DR chart for an identical 1k prompt on a 103b model split across three P40s.

Version         PP ms/t   Gen ms/t
KCPP 1.56       17.9      272.2
KCPP 1.55.1     12.8      177.9
llama 1993      16.9      271.7
llama 1886      17.0      268.1
llama 1721      32.0      731.9

As you can see, I can't go complaining about a regression on the LCPP GitHub when there isn't a regression on their end. On the flip side, it's kind of hard to complain here when the latest KCPP is more or less on par with the latest LCPP. The weird outlier is 1.55.1, which is significantly faster than current KCPP, current LCPP, and LCPP from about the same timeframe.

I cannot explain this, or even suggest a "fix" for this regression that wouldn't make things worse for everybody outside my (admittedly niche) use-case. But whatever the cause, this is the behavior I'm seeing.

@LostRuins
Owner

Yeah a lot of stuff has changed under the hood with the ggml backend rework, much of it is opaque to me.

I'll keep an eye on it but I don't think I have a solution right now - the timings being the same as llama.cpp now probably means that whatever KCPP was doing differently from llama.cpp before the backend refactor is now back in sync with it. If you can pinpoint what that is - I can look into changing it again.

Are you able to compile from source yourself?

@candre23
Author

Unfortunately, no. Maybe if it bugs me enough and I have enough downtime I'll try to figure that out, but it's not something I'm set up to do or have any experience with.

@LostRuins
Owner

Alright. Well let me know if you figure something out.
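For anyone following along who does want to try a from-source build, the flow at the time was roughly the following (Linux shown; the repository also provides Windows build scripts, and the exact make flag can vary between versions):

git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make clean && make LLAMA_CUBLAS=1
python koboldcpp.py --usecublas mmq --gpulayers 99 --model yourmodel.gguf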

@GF-110

GF-110 commented Jan 30, 2024

Just adding on that this significant speed regression also happens in my setup as well:
Format: .gguf with a Q5_K_M quant
Single GPU with load split between GPU and CPU: RTX 4090 & i9-13900K

1.55.1
Processing Prompt [BLAS] (1547 / 1547 tokens)
Generating (176 / 301 tokens)
(Stop sequence triggered: \n#)
ContextLimit: 1723/8192, Processing:19.34s (12.5ms/T), Generation:25.85s (146.9ms/T), Total:45.20s (3.89T/s)

1.56

Processing Prompt [BLAS] (1547 / 1547 tokens)
Generating (174 / 301 tokens)
(Stop sequence triggered: \n#)
ContextLimit: 1721/8192, Processing:8.42s (5.4ms/T), Generation:64.39s (370.1ms/T), Total:72.81s (418.5ms/T = 2.39T/s)

@ZavaruKitsu

ZavaruKitsu commented Jan 30, 2024

Confirming @GF-110 comment, I have the same speed regression.
Model: dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
Specs: RTX 4060, i7-12700.

1.55.1

dry:
Processing Prompt [BLAS] (1728 / 1728 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:89.49s (51.8ms/T), Generation:24.70s (164.6ms/T), Total:114.19s (1.31T/s)

second call:
Processing Prompt (1 / 1 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:0.15s (150.0ms/T), Generation:21.82s (145.5ms/T), Total:21.97s (6.83T/s)

1.56

dry:
Processing Prompt [BLAS] (1728 / 1728 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:75.67s (43.8ms/T), Generation:99.67s (664.5ms/T), Total:175.35s (1169.0ms/T = 0.86T/s)

second call:
Processing Prompt (1 / 1 tokens)
Generating (150 / 150 tokens)
ContextLimit: 1878/16384, Processing:0.51s (509.0ms/T), Generation:110.86s (739.1ms/T), Total:111.37s (742.5ms/T = 1.35T/s)

@LostRuins
Owner

LostRuins commented Jan 30, 2024

Just for the record, what models are you all running?

Also, try to provide more complete specs: system and GPU info, layers offloaded, mmq on/off, lowvram on/off, model name and quant.

@ZavaruKitsu

Windows 11, RTX 4060, i7-12700, 32GB RAM
Use CuBLAS
mmq on
lowvram off
offloaded 7 GPU layers (same for 4)
model dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
16k context size

@candre23
Author

My tests were using KitchenSink 103b fully offloaded (no lowvram) onto three P40s. Windows 10, latest drivers and CUDA as of about a week ago.

@Nexesenex

Nexesenex commented Jan 31, 2024

I confirm this TG speed regression on the experimental 1.57 (yesterday evening) as well, with a Llama 2 70B run in CuBLAS mode on a 3090+3060 setup.

So I used the koboldcpp_cublas.dll of a late 1.55.1 (27/01/2024) to compile KoboldCPP.exe, and everything went back to normal.

I don't remember if it's allowed to share such files here, but here comes the .dll.

Edit: the file is useless; I removed it.

@LostRuins
Owner

LostRuins commented Jan 31, 2024

That won't help; the .dll is the C++ inference program itself. The Python file is only the server. If you replace it with an older dll, then you lose the updated functionalities anyway.

@Nexesenex, when you tried experimental 1.57, did you try after this commit:
Commit: 21ab727e83c550fdb777f386b417bbcb54f59da1 [21ab727] (change split mode to rows)

@Nexesenex

Nexesenex commented Jan 31, 2024

I compiled a version including this commit, and it is still affected by the problem.

https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.57_b2022

Nexesenex/croco.cpp@v1.55.1_b1971...v1.57_b2022

And after noticing that, I reverted to an older koboldcpp_cublas.dll which predated 1.56, because I saw people complaining about 1.56's slow speed.

And thanks for explaining to me what is what. I'll recompile the .dll from the appropriate ggml-cuda.cu, since the problem most often comes from there.

@Nexesenex

Nexesenex commented Jan 31, 2024

I got a potential culprit 👍

cuda : fix tensor size calculation for non-split buffer (#5145)

I checked out this commit, and recompiled kobold_cublas.dll with everything else, including "change split mode to rows".

And the newly compiled KCPP works; speed is back on my setup. Q3_K_M works very well (+15% speed compared to v1.55.1!), and IQ3_XXS also works and is blazing fast on my 3090+3060 (8.5 T/s TG at 3k context on a 70B Miqu model quantized in IQ3_XXS).

I am so happy!!! :D

@LostRuins
Owner

LostRuins commented Jan 31, 2024

@Nexesenex cool! Can you pinpoint which lines of code I should change, or better yet, send me a PR with the changes.

Or did you just revert that entire commit?

@Nexesenex

Oh man, it's way beyond my paygrade to edit such technical stuff. I just reverted the commit!
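For the record, reverting a single upstream commit before rebuilding is roughly the two steps below; the hash is a placeholder for the "cuda : fix tensor size calculation for non-split buffer (#5145)" commit, whose full hash isn't spelled out in this thread, and the make flag may differ by version.

git revert --no-edit <hash-of-the-5145-commit>
make clean && make LLAMA_CUBLAS=1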

@LostRuins
Owner

Hmm okay, I'll take a closer look then.

@LostRuins
Owner

@Nexesenex that specific commit has a bugfix for Mixtral that may be necessary.

Can you confirm again, for my current latest concedo_experimental, whether the slowdown is still present as of the latest commit in experimental branch: Checkpoint to test for speed

Commit: d229150d28a035bcef815b0e7455894d443d3c2a [d229150]
Parents: 15deabd200
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date: Wednesday, January 31, 2024 10:26:33 PM

Try a clean build at this point. Then, check if the slowdown exists first...

If it still does, I'll try reverting parts of that commit. Reverting the whole commit might break stuff.
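A clean rebuild at that checkpoint looks roughly like this, assuming the concedo_experimental branch is already cloned and that the CUDA make flag matches your version:

git fetch origin
git checkout d229150d28a035bcef815b0e7455894d443d3c2a
make clean && make LLAMA_CUBLAS=1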

@Nexesenex

Lol. Ok, I'm doing it right now.

@Nexesenex

Nexesenex commented Jan 31, 2024

U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 4096 --launch


Welcome to KoboldCpp - Version 1.57
For command line arguments, please refer to --help


Setting process to Higher Priority - Use Caution
Error, Could not change process priority: No module named 'psutil'
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(model=None, model_param='X:/text-generation-webui/models/miqu-1-70b-Requant-b2007-iMat-c32_ch400-IQ3_XXS.gguf', port=5001, port_param=5001, host='', launch=True, lora=None, config=None, threads=1, blasthreads=1, highpriority=True, contextsize=4096, blasbatchsize=128, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['mmq'], usevulkan=None, gpulayers=99, tensor_split=None, onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)

Loading model: X:\text-generation-webui\models\miqu-1-70b-Requant-b2007-iMat-c32_ch400-IQ3_XXS.gguf
[Threads: 1, BlasThreads: 1, SmartContext: False, ContextShift: True]

The reported GGUF Arch is: llama


Identified as GGUF model: (ver 6)
Attempting to Load...

Using automatic RoPE scaling. If the model has customized RoPE settings, they will be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
ggml_init_cublas: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 23 key-value pairs and 723 tensors from X:\text-generation-webui\models\miqu-1-70b-Requant-b2007-iMat-c32_ch400-IQ3_XXS.gguf
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 32764
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32764
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = unknown, may not work
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 25.17 GiB (3.13 BPW)
llm_load_print_meta: general.name = D:\HF
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.83 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: CUDA_Split buffer size = 25630.08 MiB
llm_load_tensors: CPU buffer size = 140.62 MiB
llm_load_tensors: CUDA0 buffer size = 5.03 MiB
....................................................................................................
Automatic RoPE Scaling: Using model internal value.
llama_new_context_with_model: n_ctx = 4176
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1305.00 MiB
llama_new_context_with_model: KV self size = 1305.00 MiB, K (f16): 652.50 MiB, V (f16): 652.50 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 6.06 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 158.99 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 4.40 MiB
llama_new_context_with_model: graph splits (measure): 3
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Please connect to custom endpoint at http://localhost:5001

Prompt: 2855 tokens

SillyTavern used.

My last release:

ContextLimit: 3124/5888, Processing:18.42s (6.5ms/T = 155.03T/s), Generation:31.97s (118.9ms/T = 8.41T/s), Total:50.39s (187.3ms/T = 5.34T/s)

Your experimental with the removed line in koboldcpp.py :

ContextLimit: 3060/4096, Processing:43.98s (15.4ms/T = 64.92T/s), Generation:39.56s (193.0ms/T = 5.18T/s), Total:83.54s (407.5ms/T = 2.45T/s)

My affected releases (I deleted them on the repo) :

ContextLimit: 3090/5376, Processing:44.19s (15.5ms/T = 64.61T/s), Generation:45.70s (194.5ms/T = 5.14T/s), Total:89.89s (382.5ms/T = 2.61T/s)

ContextLimit: 2994/5888, Processing:43.56s (15.3ms/T = 65.55T/s), Generation:26.20s (188.5ms/T = 5.31T/s), Total:69.75s (501.8ms/T = 1.99T/s)

Aside from the unlocked context size, I used the same parameters everywhere.

@LostRuins
Owner

So that single commit really affected the speeds, huh... hmm... not sure what to do.

@Nexesenex

Nexesenex commented Jan 31, 2024

My thoughts :

  • Do they have such problems upstream?
  • Could it be leftovers of your MMQ implementation colliding with that PR?
  • Is the backward compatibility you offer in KCPP starting to become too big a burden to maintain if it's not fully "fossilized"?
  • Maybe Slaren can help you, considering that KCPP contributes to the mainstream popularity of LCPP and the GGUF format?

@LostRuins
Owner

@Nexesenex yes, I would think they would have the same issue. But replicating it will be tricky. I cannot even test it myself as I don't see any issues.

I changed some more code. Can you try building at this new commit and see if it solves the speed issue: Commit: 8929d34b04a26b88ee57d78e72ed24eb769bffc3 [8929d34] (try with async memset)

@Nexesenex

U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 4096 --launch


Welcome to KoboldCpp - Version 1.57
For command line arguments, please refer to --help


Setting process to Higher Priority - Use Caution
Error, Could not change process priority: No module named 'psutil'
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll
Traceback (most recent call last):
File "koboldcpp.py", line 2597, in
File "koboldcpp.py", line 2408, in main
File "koboldcpp.py", line 242, in init_library
File "ctypes_init_.py", line 392, in getattr
File "ctypes_init_.py", line 397, in getitem
AttributeError: function 'get_last_seed' not found
[23440] Failed to execute script 'koboldcpp' due to unhandled exception!

That's what I get when I try to launch the same model with your last experimental with async memset.

@LostRuins
Owner

Something is wrong with your setup.

Nothing else has changed except one line with the async memset. Are you still trying to use 1.55 dlls for your build? You cannot do that. Do not try to use a different .dll for an intended version; they cannot be mixed and matched, ever.

Now I am not sure about the results we got yesterday anymore.

Can you try:

  1. Clean and rebuild from the Checkpoint to test for speed commit
  2. Clean and rebuild from the try with async memset commit

Do not mix and match any dlls other than the one for that version!

@LostRuins
Owner

Don't worry about it, I just wanna be thorough.

Hmm, so the memset alone didn't change anything. But if you revert the entire commit of "cuda : fix tensor size calculation for non-split buffer", then it's fast again, correct?

@Nexesenex

Nexesenex commented Feb 1, 2024

Correct. That's the only revert I did in my last release.
And the edit you did is the one I'd have tried myself if I wanted to actually find the problem.
Beyond that, the code of ggml-cuda.cu has been simplified in the problematic commit, maybe too much, I don't know.
It's damn frustrating, I know.

And look, even if Slaren can't help directly, he may have already offered an alternative workaround 👍

"As a workaround, increasing the alignment to 4096 in ggml_backend_cuda_buffer_type_get_alignment seems to fix it."

ggerganov#5137 (comment)

I know it's not best to fork this kind of stuff, but whatever works is better than whatever doesn't, no matter what, including dumping a non-working commit, right?

Else, the problem happens on partial Mixtral offload between 30 and 31 layers (I suppose 32 too? I don't know).

So, at worst, cap the max layers offloaded on GPU for Mixtral models at 29 for the time being, and dump the non-working commit without forking the LlamaCPP files themselves any further.

Also, I highlight once again the differences between your ggml-cuda.cu and the LlamaCPP one. It serves a purpose, but maybe it needs to be reviewed?
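To make the quoted workaround concrete: it amounts to raising the return value of that one function in ggml-cuda.cu. A minimal sketch, based only on the quote above (the exact signature and the original default value may differ between llama.cpp revisions):

// Sketch of Slaren's suggested workaround, not the exact upstream source.
static size_t ggml_backend_cuda_buffer_type_get_alignment(ggml_backend_buffer_type_t buft) {
    (void) buft;   // parameter unused here
    return 4096;   // raised from the smaller default alignment
}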

@LostRuins
Owner

The good news is I managed to get my hands on a Pascal device and it seems like I can repro the speed reduction. So hopefully I can narrow down the cause.

@LostRuins
Owner

LostRuins commented Feb 1, 2024

The bad news is that reverting the commit @Nexesenex mentioned did not fully solve the performance issue. I reverted the whole commit, and my speeds are still much slower than 1.55, though maybe slightly faster than with the commit in place.

@Nexesenex

Nexesenex commented Feb 1, 2024

Well, that's what I have on my side:

U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --tensor_split 49 25 --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 7168 --launch


Welcome to KoboldCpp - Version (varies)
For command line arguments, please refer to --help


Setting process to Higher Priority - Use Caution
Error, Could not change process priority: No module named 'psutil'
Attempting to use CuBLAS library for faster prompt ingestion. A compatible CuBLAS will be required.
Initializing dynamic library: koboldcpp_cublas.dll

Namespace(model=None, model_param='X:/text-generation-webui/models/MiquMaid-v1-70B.q3_k_m.gguf', port=5001, port_param=5001, host='', launch=True, lora=None, config=None, threads=1, blasthreads=1, highpriority=True, contextsize=7168, blasbatchsize=128, ropeconfig=[0.0, 10000.0], smartcontext=False, noshift=False, bantokens=None, forceversion=0, nommap=False, usemlock=False, noavx2=False, debugmode=0, skiplauncher=False, hordeconfig=None, noblas=False, useclblast=None, usecublas=['mmq'], usevulkan=None, gpulayers=99, tensor_split=[49.0, 25.0], onready='', multiuser=0, remotetunnel=False, foreground=False, preloadstory='', quiet=False, ssl=None)

Loading model: X:\text-generation-webui\models\MiquMaid-v1-70B.q3_k_m.gguf
[Threads: 1, BlasThreads: 1, SmartContext: False, ContextShift: True]

Prompt: 852 tokens.

Your experimental build for testing (31/01/2024) (with PR5145):

ContextLimit: 964/4096, Processing:9.03s (10.6ms/T = 94.31T/s), Generation:22.37s (199.7ms/T = 5.01T/s), Total:31.40s (280.4ms/T = 3.57T/s)

1.57 b2030 :

ContextLimit: 980/4096, Processing:6.67s (7.8ms/T = 127.68T/s), Generation:18.69s (146.0ms/T = 6.85T/s), Total:25.37s (198.2ms/T = 5.05T/s)

1.56 b1971 :

ContextLimit: 939/4096, Processing:7.10s (8.3ms/T), Generation:12.87s (147.9ms/T), Total:19.96s (229.5ms/T = 4.36T/s)

1.56 b1963 :

ContextLimit: 939/4096, Processing:7.09s (8.3ms/T), Generation:14.00s (160.9ms/T), Total:21.09s (242.4ms/T = 4.13T/s)

1.56 b1953 :

ContextLimit: 1037/4096, Processing:7.12s (8.4ms/T), Generation:28.20s (152.4ms/T), Total:35.32s (190.9ms/T = 5.24T/s)

1.56 b1933 :

ContextLimit: 926/4096, Processing:7.30s (8.6ms/T), Generation:10.89s (147.2ms/T), Total:18.19s (245.8ms/T = 4.07T/s)

1.56 b1841 :

ContextLimit: 936/4096, Processing:9.54s (11.2ms/T), Generation:15.91s (189.4ms/T), Total:25.44s (3.30T/s)

1.55.1 b1828 :

ContextLimit: 908/4096, Processing:9.90s (11.6ms/T), Generation:10.62s (189.7ms/T), Total:20.53s (2.73T/s)

@LostRuins
Owner

LostRuins commented Feb 2, 2024

I spent half a day going through the commits one by one and I cannot figure out what caused it. So unless someone else is able to troubleshoot, I'm afraid we are out of luck.

If someone else can replicate Nexesenex's results on reverting the "cuda : fix tensor size calculation for non-split buffer" commit, then please note it here. For me, it is not making any difference at all. Ever since the backend integration it has been significantly slower, I think.

@Nexesenex

Well, sorry for that waste of time, man.

And even worse:

1.57 b2030, new experimental (with PR5238, but without PR5145) :

CtxLimit: 892/4096, Process:9.36s (11.0ms/T = 91.05T/s), Generate:8.01s (200.2ms/T = 4.99T/s), Total:17.37s (2.30T/s)

Tested 2 times, and... same problem. No further comment; I can't remotely figure out what's up.

If it's me who isn't handling GitHub properly, you have all my apologies, sincerely. I really hate it when people waste my time, and even more wasting the time of others.

Otherwise, we'll see others reporting soon as well.

@DaveYognaught

DaveYognaught commented Feb 2, 2024

Did some testing today in the KoboldCPP Discord as I was upgrading from 1.52 to the latest version, 1.56.
I always test performance when I do this, and noticed roughly a 200% increase in generation time per token (about a 3x slowdown).

I usually launch through this bat:
koboldcpp.exe --usecublas mmq --gpulayers 35 --threads 4 --contextsize 8192 --blasbatchsize 256 --highpriority

This is with the same fully offloaded 7B Q4_K_S Mistral-based model on 6GB of VRAM.
(synatra-7b-v0.3-rp.Q4_K_S)

For context, compiled test results:
KoboldCPP 1.52: 32.7ms/T ~ 54.5ms/T (AVG: 44ms/T)
KoboldCPP 1.56: 64.6ms/T ~ 224ms/T (AVG: 131.35ms/T)

With further debugging and brainstorming, I found the generation was arguably even worse in 1.55.1.
So I would point towards that as being the culprit rather than anything in the 1.56 update. Copy of the Discord summary I made:

So just to summarise, I set context to 2048. I tested 128 BLAS and then 512 BLAS.
Once on 1.55.1 and then 1.56. (Then a control test with 1.52 again, with only 512 BLAS)

On 1.55.1
First of all, I'm also getting the same, if not worse, generation speeds on this version, much to my surprise. I'm well, weeeellll within my VRAM limits now that I lowered my context substantially. Not sure what else would possibly butcher my speeds so much. So something in this version appears to be the cause of at least my particular issues, rather than 1.56. Additionally, there's no notable difference in generation speeds when swapping BLAS size. Does anyone have, or can anyone compile, the original 1.55 rather than the 1.55.1 hotfix?

On 1.56
Regardless of what BLAS size I use, there's still a 300-400MB chunk of VRAM reduction that's unaccounted for. Not sure if that's relevant given the previous observation; this might genuinely just be an optimisation of the buffers. If so, that'd be great. Once you factor in the performance degradation of 1.55.1, this is actually a slight upgrade now (possibly? It looks kinda the same; in hindsight, hard to tell). Generation speeds seem 'about' the same too regardless of BLAS size.

Need to test 1.55 to confirm 1.55.1 is the cause, I suppose.
I'm on an NVIDIA GeForce GTX 1660 Ti, if relevant.

Copy of tests attached.
KoboldTests.txt

@DaveYognaught

Ok, addendum of shame. 😞

I downloaded 1.54 and it has the exact same performance issues as 1.55.1 and 1.56...
So what I said above still stands, but whatever the issue is on my end goes even further back than I ever imagined. So apologies.
1.53 works fine. I have confirmed this much at least, or I'd have lost my mind.

At this point, I've gone an entire month back in versions.
So, I'm not even convinced my issues are related to this one anymore... but food for thought.
The same issues I have on 1.54, I have on 1.55.1 and 1.56. If there is a separate single-GPU speed regression within 1.55.1 or 1.56, it's not reflected in my tests at all from what I can see, as they all seem to fall roughly within the regression range that starts at 1.54.

Soo... is it possible it's the same issue from 1.54 in that case?
Just copy-pasting fresh test notes on 1.54 and 1.53...

512 BLAS Size, on 1.54

Initial:
ContextLimit: 1035/2048, Processing:0.22s (222.0ms/T), Generation:37.90s (74.0ms/T), Total:38.12s (13.43T/s)
ContextLimit: 1035/2048, Processing:0.06s (61.0ms/T), Generation:38.04s (74.3ms/T), Total:38.10s (13.44T/s)
ContextLimit: 1035/2048, Processing:1.81s (3.5ms/T), Generation:38.17s (74.6ms/T), Total:39.98s (12.81T/s)

Subsequent:
ContextLimit: 2048/2048, Processing:0.42s (422.0ms/T), Generation:66.48s (129.8ms/T), Total:66.90s (7.65T/s)
ContextLimit: 2048/2048, Processing:2.40s (4.6ms/T), Generation:65.62s (128.2ms/T), Total:68.02s (7.53T/s)
ContextLimit: 1664/2048, Processing:2.50s (4.8ms/T), Generation:15.60s (121.9ms/T), Total:18.10s (7.07T/s)
ContextLimit: 1667/2048, Processing:2.50s (4.8ms/T), Generation:15.49s (118.3ms/T), Total:17.99s (7.28T/s)
ContextLimit: 1668/2048, Processing:2.59s (5.0ms/T), Generation:16.11s (122.0ms/T), Total:18.69s (7.06T/s)
ContextLimit: 1556/2048, Processing:3.75s (3.6ms/T), Generation:52.08s (101.7ms/T), Total:55.84s (9.17T/s)
No "High Priority" - Seems to do nothing
ContextLimit: 1922/2048, Processing:0.30s (301.0ms/T), Generation:47.95s (124.2ms/T), Total:48.25s (8.00T/s)
ContextLimit: 1577/2048, Processing:5.38s (3.5ms/T), Generation:4.66s (113.6ms/T), Total:10.04s (4.08T/s)


Control Test 2:
512 BLAS size, on 1.53

Initial:
ContextLimit: 1035/2048, Processing:0.10s (101.0ms/T), Generation:16.60s (32.4ms/T), Total:16.70s (30.66T/s)
ContextLimit: 2048/2048, Processing:5.70s (3.7ms/T), Generation:19.38s (37.9ms/T), Total:25.08s (20.41T/s)

Subsequent:
ContextLimit: 2048/2048, Processing:0.32s (318.0ms/T), Generation:19.61s (38.3ms/T), Total:19.93s (25.69T/s)
ContextLimit: 1879/2048, Processing:0.24s (242.0ms/T), Generation:13.04s (38.0ms/T), Total:13.28s (25.83T/s)
ContextLimit: 1909/2048, Processing:2.75s (5.3ms/T), Generation:14.48s (38.8ms/T), Total:17.23s (21.65T/s)
ContextLimit: 2048/2048, Processing:2.68s (5.2ms/T), Generation:20.27s (39.6ms/T), Total:22.96s (22.30T/s)
ContextLimit: 2048/2048, Processing:2.83s (5.4ms/T), Generation:20.81s (40.6ms/T), Total:23.64s (21.66T/s)

@LostRuins
Owner

Okay I've done some tweaking and hopefully v1.57 should have better performance. Please try to use the mmq option and check if speeds are adequate.

@candre23
Author

candre23 commented Feb 8, 2024

Just updating the speed tests to include 1.57. It seems the performance is now slightly faster than 1.55 levels!

Version         PP ms/t   Gen ms/t
KCPP 1.57       11.6      159.3
KCPP 1.56       17.9      272.2
KCPP 1.55.1     12.8      177.9
llama 1993      16.9      271.7
llama 1886      17.0      268.1
llama 1721      32.0      731.9

There is a tradeoff though. With 1.55 and 1.56 I was able to load the 103b model with 12k context. With 1.57, it goes OOM on load. I have to drop down to 8k to get the model to successfully load. Not ideal, but I'll take it.

Further observations: The memory/layer allocation between GPUs is clearly different now compared to 1.56. Previously, there was only a couple hundred MB of difference in VRAM usage between the cards. Now with 8k context, GPU0 is full to the brim while GPUs 1 and 2 have a little over 4GB free. I tried doing a manual split, and after some experimentation I conclude that A) manual layer split disables per-layer KV, and B) in this mode of operation, speeds are identical to 1.55.

So it seems that, intentional or not, you now have "options". You can let KCPP split the layers automatically, and you get a bit of a speed boost in exchange for slightly-suboptimal splitting which can limit your max context in edge cases. Or you can manually specify a split, getting the absolute most out of all your VRAM but at a slightly slower PP and gen speed.

Honestly, at this point, I'm not sure it's even an "issue" that needs resolving. I mean it would be great to get the max theoretical context at the fastest possible speed without any manual effort, but I'm more than OK with the current situation. I kinda suspect that the tradeoff is inherent to how per-layer KV works, so it may not even be "resolvable".

@Nexesenex

Nexesenex commented Feb 9, 2024

I confirm @candre23's observations, at least on token generation speed.
1.57.1, last experimental with commit 0ec0055

  • Only my CTX/BBS unlocking, frag-cache zero, and RoPE tweaks are added.
    Compiled with VS 2019 (MSVC x64_x64 compiler) on Windows 11 with CUDA 12.2.0, with a cache clean beforehand.

U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --tensor_split 49 25 --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 7168 --launch

Generating (128 / 128 tokens) / 821 tokens)
CtxLimit: 950/4096, Process:9.06s (11.0ms/T = 90.64T/s), Generate:15.18s (118.6ms/T = 8.43T/s), Total:24.23s (5.28T/s)

Compared to my last well-working Frankenstein version ( https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.57_b2030 ), I get around a 15% TG speed increase. Also, -30% PP speed. But I can live with that; TG matters much more to me.

KoboldCPP Bench 👍

Timestamp Backend Layers Model MaxCtx GenAmount ProcessingTime ProcessingSpeed GenerationTime GenerationSpeed TotalTime Coherent Output
2024-02-09 19:40:00.084778+00:00 koboldcpp_cublas.dll Frankenstein 1.57.1_b2106 99 Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M 2048 100 19.87 98.05 12.78 7.82 32.65 True 11111
2024-02-09 20:23:49.732334+00:00 koboldcpp_cublas.dll Release 1.57.1 99 Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M 2048 100 27.09 71.9 19.25 5.2 46.34 True  

The difference between your Windows release and my frankenfork now boils down to its compilation.

Congratulations, @LostRuins !

@LostRuins
Owner

In the next version I will add a new toggle to switch between CUDA row split and layer split modes, since Pascal cards in particular seem to do better with row split, whereas some other cards prefer layer split.

@mattbbx1

In the next version I will add a new toggle to switch between CUDA row split and layer split modes, since Pascal cards in particular seem to do better with row split, whereas some other cards prefer layer split.

Awesome, thank you for this. I have had the opposite: inference speeds increased considerably for me in 1.56 and have returned to their old speeds in 1.57. I am running on Debian Linux with an RTX 4090 and a P40 in tandem.

@Nexesenex

Nexesenex commented Feb 11, 2024

@candre23 : you can try to revert commit 15b4538 to shrink the CUDA buffer a bit and regain a bit of context. Also, BLAS batch size 128 is (on a 3090 at least) the best speed/buffer-size compromise for prompt processing (it might be smaller for a smaller GPU, I don't know).

@mattbbx1 : you can try to revert commit acb7928 to see if LostRuins' attempt to fix the CUDA slowdown is actually doing the opposite on your configuration.

Also, either revert 21ab727 or add 35111ce.

Row split mode is slower on Ampere.

For a 3090-3060 bi-GPU config under Windows 11, that worked for me.

Timestamp Backend Layers Model MaxCtx GenAmount ProcessingTime ProcessingSpeed GenerationTime GenerationSpeed TotalTime Coherent Output
2024-02-10 02:54:15.366616+00:00 koboldcpp_cublas.dll Frankenstein 1.57.1_b2106 – Split rows 99 Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M 2048 100 25.59 76.12 17.04 5.87 42.63 True 11111
2024-02-11 00:25:01.896050+00:00 koboldcpp_cublas.dll F1.57.1 b2112 - No Split Rows 99 Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M 2048 100 19.99 97.44 12.5 8 32.49 True 11111
2024-02-11 00:36:16.137143+00:00 koboldcpp_cublas.dll F1.57.1 b2112 No Split Rows and minus Cuda Slowdown fix attempt 99 Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M 2048 100 15.34 126.98 10.49 9.53 25.83 True 11111

@mattbbx1

mattbbx1 commented Feb 11, 2024

@candre23 : you can try to revert commit 15b4538 to shrink the CUDA buffer a bit and regain a bit of context.

@mattbbx1 : you can try to revert commit acb7928 to see if LostRuins' attempt to fix the CUDA slowdown is actually doing the opposite on your configuration.

Also, either revert 21ab727 or add 35111ce.

Row split mode is slower on Ampere.

For a 3090-3060 bi-GPU config under Windows 11, that worked for me.

Timestamp Backend Layers Model MaxCtx GenAmount ProcessingTime ProcessingSpeed GenerationTime GenerationSpeed TotalTime Coherent Output
2024-02-10 02:54:15.366616+00:00 koboldcpp_cublas.dll Frankenstein 1.57.1_b2106 – Split rows 99 Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M 2048 100 25.59 76.12 17.04 5.87 42.63 True 11111


2024-02-11 00:25:01.896050+00:00 koboldcpp_cublas.dll F1.57.1 b2112 - No Split Rows 99 Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M 2048 100 19.99 97.44 12.5 8 32.49 True 11111


2024-02-11 00:36:16.137143+00:00 koboldcpp_cublas.dll F1.57.1 b2112 No Split Rows and minus Cuda Slowdown fix attempt 99 Undi95_Miqu-70B-Alpaca-DPO-b2101-iMat-c32_ch1000-Q3_K_M 2048 100 15.34 126.98 10.49 9.53 25.83 True 11111

Thanks for the reply! Reverting commit 21ab727 restored the speed increase. Prompt processing is significantly faster with my current build with that commit reverted.

Build details just in case someone sees something similar:
Ryzen 5800x3d
RTX4090 and P40 (p40 is on a 1x lane, external mining riser)
64GB DDR4@3600
Asus X570 Chipset
Debian Linux

Edit: acb7928 did not seem to change much for my particular issue, but reverting 21ab727 did.

@Nexesenex

@mattbbx1 Glad 1/2 worked out!
@LostRuins Thanks for the benchmark, man, one less reason for me to be messy, one more to be tidy.

@mattbbx1

@mattbbx1 Glad 1/2 worked out! @LostRuins Thanks for the benchmark, man, one less reason for me to be messy, one more to be tidy.

If I am correct, then @LostRuins including the toggle feature in the next update should resolve this, as that's essentially what changed?

@LostRuins
Owner

Yes, in the next version split mode will be configurable. So you can try both layer and row split and see which works better for you.

@LostRuins
Owner

Just a reminder that in 1.58 the split mode for multi-GPU is now selectable. You can toggle it between layer or row split. So please try both and see which is faster for Pascal.

@Vladonai

Suppose I have two identical Pascal graphics cards and the model fits completely in their video memory. What should the command line be like?

"koboldcpp.exe --usecublas mmq rowsplit normal --contextsize 4096 --blasbatchsize 512 --threads 9 --highpriority --model 70B.q2_k.gguf" ? --gpulayers, --tensor_split are not needed in this case?

@candre23
Author

Updated for 1.58. Still 103b, three P40s, 1k prompt.

Version                PP ms/t   Gen ms/t
KCPP 1.58 (row)         12.3      161.1
KCPP 1.58 (layer)       18.9      274.3
KCPP 1.57               11.6      159.3
KCPP 1.56               17.9      272.2
KCPP 1.55.1             12.8      177.9
llama 1993              16.9      271.7
llama 1886              17.0      268.1
llama 1721              32.0      731.9

Rowsplit is slightly slower than it was in 1.57, but it's damn close. Layersplit is slightly slower than 1.56, but again it's within a couple of percent. As before, rowsplit demands manual layer splitting since it keeps all the context on GPU0. Pretty sure this is intended (or inevitable) behavior. Not really a problem, just worth mentioning.

@LostRuins
Owner

Alright should be good enough then. Thanks for helping to test. Hopefully these toggles allow Pascal users to enjoy decent speeds while allowing other cards to perform well too.

@LostRuins
Owner

@Vladonai depending on your context size and split mode, tensor_split may still be needed. If you are using row split, then the KV will only be stored on one of the cards, not both, so it may feel lopsided.

@DaveYognaught

DaveYognaught commented Feb 18, 2024

Alright should be good enough then. Thanks for helping to test. Hopefully these toggles allow Pascal users to enjoy decent speeds while allowing other cards to perform well too.

For what it's worth, I can also confirm that the weirdly positioned NVIDIA GeForce 1660 Ti is improved in 1.58 as well.
We're getting damn near identical speeds to 1.53 (or to a manually fixed 1.56), by default. I didn't have to tweak the launch script or anything. I tried enabling row split out of curiosity, but it didn't seem to have any noticeable impact in my situation; since we're back to pre-change speeds, I'm happy either way :)

@Vladonai

Updated for 1.58. Still 103b, three P40s, 1k prompt.

Version                PP ms/t   Gen ms/t
KCPP 1.58 (row)         12.3      161.1

Can you give a sample of your Koboldcpp command line (with rowsplit) for your hardware configuration?

@candre23
Author

Can you give a sample of your Koboldcpp command line (with rowsplit) for your hardware configuration?

Nothing fancy, just a standard manual split for 103b. The only change is adding rowsplit.

set CUDA_VISIBLE_DEVICES=0,1,2
koboldcpp --threads 14 --usecublas rowsplit --highpriority --nommap --gpulayers 121 --tensor_split 24 48 48 --contextsize 16384

@brokofankone

brokofankone commented Mar 21, 2024

Hi, any idea if cublas v12.4 has fixed the problem?

@LostRuins
Owner

It should already be resolved, just try the toggle on/off and see which works better.
