Speed regression on multi-Pascal-GPU with 1.56 #642
Comments
Although I don't have Pascal and this may be off topic, I'll note in passing that initialization of 1.56 takes twice as long as 1.55. |
By initialization you mean loading the model? |
Tried running the program just now and got the usual initialization speed. I guess yesterday the computer was busy with something else :) So no, I can't confirm that problem. But since I want to buy 3 Tesla P40s myself, please pay close attention to the problem in the start post. |
Yeah, I did run a few tests myself but unfortunately I don't have a multi-gpu setup. For single GPU it is as fast as ever
Note that this is with mmq, lowvram set to off and full offload. |
Yes, I tried it with just a single P40, and the speed was basically the same from 1.55 to 1.56. It's just in multi-GPU that the new version slows down. And just to confirm, the multi-GPU tests up top were for a full offload without lowvram enabled. |
Try asking this question in the llamacpp repository. One of the developers there also has 3xP40, he will probably want to figure it out. |
I went to run some benchmarks on llama.cpp and the results are confusing. Obviously something is not like-for-like, but I have no way of determining what. The fact that the llama folks release multiple revisions per day makes it really tough to pick an "equivalent" version of LCPP to compare to a given version of KCPP. But here's the TL;DR chart for an identical 1k prompt on a 103b model split across three P40s.
As you can see, I can't go complaining about a regression on the LCPP github when there isn't a regression on their end. On the flip side, it's kind of hard to complain here when the latest KCPP is more or less on par with the latest LCPP. The weird outlier is 1.55.1, which is significantly faster than current KCPP, current LCPP, and LCPP from about the same timeframe. I cannot explain this, or even suggest a "fix" for this regression that wouldn't make things worse for everybody outside my (admittedly niche) use-case. But whatever the cause, this is the behavior I'm seeing. |
Yeah a lot of stuff has changed under the hood with the ggml backend rework, much of it is opaque to me. I'll keep an eye on it but I don't think I have a solution right now - the timings being the same as llama.cpp now probably means that whatever KCPP was doing differently from llama.cpp before the backend refactor is now back in sync with it. If you can pinpoint what that is - I can look into changing it again. Are you able to compile from source yourself? |
Unfortunately, no. Maybe if it bugs me enough and I have enough downtime I'll try to figure that out, but it's not something I'm set up to do or have any experience with. |
Alright. Well let me know if you figure something out. |
Just adding on that this significant speed regression also happens in my context as well:
|
Confirming @GF-110's comment, I have the same speed regression.
|
Just for the record, what models are you all running? Also try to provide more complete specs: system and gpu info, layers offloaded, mmq on/off, lowvram on/off, model name and quant |
Windows 11, RTX 4060, i7-12700, 32GB RAM |
My tests were using KitchenSink 103b fully offloaded (no lowvram) onto three P40s. Windows 10, latest drivers and cuda as of like a week ago. |
I confirm this TG speed regression on the experimental 1.57 (yesterday evening) as well, with a Llama 2 70b run in CuBLAS mode on a 3090+3060 setup. So I used the koboldcpp_cublas.dll of a late 1.55.1 (27/01/2024) to build KoboldCPP.exe, and everything went back to normal. I don't remember if it's allowed to share such files here, but here comes the .dll. Edit: the file is useless, I removed it. |
That won't help. @Nexesenex, when you tried experimental 1.57, did you try after this commit:
I compiled a version including this commit, and it's still affected by the problem.
https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.57_b2022
Nexesenex/croco.cpp@v1.55.1_b1971...v1.57_b2022
And after noticing, I reverted to an older koboldcpp_cublas.dll which predated 1.56, because I saw people complaining about 1.56's slow speed. And thanks for explaining to me what is what. I'll recompile the .dll from the appropriate ggml-cuda.cu, considering that the problem most often comes from there. |
I got a potential culprit: cuda : fix tensor size calculation for non-split buffer (#5145). I checked out this commit, and recompiled kobold_cublas.dll with everything else, including "change split mode to rows". And the newly compiled KCPP works, speed is back on my setup. Q3_K_M works veryyy well (+15% speed compared to v1.55.1!). IQ3_XXS also works and is blazing fast on my 3090-3060 (8.5 t/s TG at 3k context on a 70b Miqu model quantized in IQ3_XXS). I am so happy!!! :D |
@Nexesenex cool! Can you pinpoint which lines of code I should change, or better yet, send me a PR with the changes. Or did you just revert that entire commit? |
Oh man, it's way beyond my paygrade to edit such technical stuff. I just reverted the commit! |
hmm okay i'll take a closer look then |
@Nexesenex that specific commit has a bugfix for Mixtral that may be necessary. Can you confirm again, for my current latest concedo_experimental, whether the slowdown is still present as of the latest commit in experimental branch:
Try a clean build at this point. Then, check if the slowdown exists first... If it still does, i'll try reverting parts of that commit. Reverting the whole commit might break stuff. |
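For anyone unfamiliar with building from source, a clean build of the concedo_experimental branch looks roughly like this on Linux (an illustrative sketch only; Windows builds go through the repo's own build scripts instead, and the exact make flags can differ between versions):

git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
git checkout concedo_experimental
make clean
make LLAMA_CUBLAS=1 -j    # builds the CUDA-enabled library; launch afterwards with: python koboldcpp.py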
Lol. Ok, I'm doing it right now. |
U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 4096 --launch
Welcome to KoboldCpp - Version 1.57
Setting process to Higher Priority - Use Caution
|
So that single commit really affected the speeds huh.. hmmm... not sure what to do |
My thoughts:
|
@Nexesenex yes, I would think they would have the same issue. But replicating it will be tricky. I cannot even test it myself as I don't see any issues. I changed some more code. Can you try building from this new commit and see if it solves the speed issue: |
U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 4096 --launch
Welcome to KoboldCpp - Version 1.57
Setting process to Higher Priority - Use Caution
That's what I get when I try to launch the same model with your last experimental with async memset.
Something is wrong with your setup. Nothing else has changed except one line with the async memset. Are you still trying to use 1.55 dlls for your build? You cannot do that. Do not use a .dll from a different version than the one it was built for; they can never be mixed and matched. Now I am not sure about the results we got yesterday anymore. Can you try:
do not mix and match any dlls other than the one for that version! |
Don't worry about it, I just wanna be thorough. Hmm, so the memset alone didn't change anything. But if you revert the entire commit of |
Correct. That's the only revert I did in my last release. And look, even if Slaren can't help further, he has already offered an alternative workaround: "As a workaround, increasing the alignment to 4096 in ggml_backend_cuda_buffer_type_get_alignment seems to fix it." I know it's not best to fork this kind of stuff, but whatever works is better than whatever doesn't, no matter what, including dumping a non-working commit, right? Otherwise, the problem happens on partial Mixtral offload between 30 and 31 layers (I suppose 32 too? I don't know). So, at worst, cap the max layers offloaded on GPU for Mixtral models at 29 for the time being, and dump the non-working commit without forking the LlamaCPP files themselves any further. Also, I highlight once again the differences between your ggml-cuda.cu and the LlamaCPP one. It serves a purpose, but maybe it needs to be reviewed? |
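For context, the workaround slaren describes amounts to a one-line change in llama.cpp's ggml-cuda.cu. A minimal sketch of what it might look like, assuming the function signature used by the CUDA backend around that time (treat this as illustrative, not a verified patch):

static size_t ggml_backend_cuda_buffer_type_get_alignment(ggml_backend_buffer_type_t buft) {
    (void) buft; // the buffer type is not needed to pick an alignment
    // The CUDA backend reports its required buffer alignment here. The default
    // was 128; the suggested workaround is to report a page-sized alignment,
    // which reportedly avoids the tensor-size calculation problem.
    return 4096;
}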
The good news is I managed to get my hands on a Pascal device and it seems like I can repro the speed reduction. So hopefully I can narrow down the cause. |
The bad news is that reverting the commit @Nexesenex mentioned did not fully solve the performance issue. I reverted the whole commit, and my speeds are still much slower than 1.55, though maybe slightly faster than with the commit applied. |
Well, that's what I have on my side:
U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --tensor_split 49 25 --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 7168 --launch
Welcome to KoboldCpp - Version (varies)
Setting process to Higher Priority - Use Caution
|
I spent half a day going through the commits one by one and I cannot figure out what caused it. So unless someone else is able to troubleshoot, I'm afraid we are out of luck. If someone else can replicate Nexesenex results on reverting the |
Well, sorry for that waste of time, man. And even worse:
1.57 b2030, new experimental (with PR5238, but without PR5145):
CtxLimit: 892/4096, Process:9.36s (11.0ms/T = 91.05T/s), Generate:8.01s (200.2ms/T = 4.99T/s), Total:17.37s (2.30T/s)
Tested 2 times, and.. same problem. No further comment, I can't remotely figure out what's up. If it's me who isn't handling GitHub properly (that much), you have all my apologies, sincerely. I really hate when people waste my time, and even more wasting the time of others. Otherwise, we'll see others reporting soon as well. |
Did some testing today in the KoboldCPP Discord as I was upgrading from 1.52 to the latest version, 1.56. I usually launch through this bat:
This is with the same fully offloaded setup, 6GB VRAM and a 7B Q4_K_S Mistral-based model. For context, compiled test results:
With further debugging and brainstorming, I found the generation was arguably even worse in 1.55.1.
So just to summarise, I set context to 2048. I tested 128 BLAS and then 512 BLAS.
On 1.55.1:
On 1.56:
Need to test 1.55 to confirm 1.55.1 is the cause, I suppose. Copy of tests attached. |
Ok, addendum of shame. 😞 I downloaded 1.54 and it has the exact same performance issues as 1.55.1 and 1.56... At this point, I've gone an entire month back in versions. Soo... is it possible it's the same issue from 1.54 in that case?
|
Okay I've done some tweaking and hopefully v1.57 should have better performance. Please try to use the |
Just updating the speed tests to include 1.57. It seems the performance is now slightly faster than 1.55 levels!
There is a tradeoff though. With 1.55 and 1.56 I was able to load the 103b model with 12k context. With 1.57, it goes OOM on load. I have to drop down to 8k to get the model to successfully load. Not ideal, but I'll take it. Further observations: The memory/layer allocation between GPUs is clearly different now compared to 1.56. Previously, there was only a couple hundred MB of difference in VRAM usage between the cards. Now with 8k context, GPU0 is full to the brim while GPUs 1 and 2 have a little over 4GB free. I tried doing a manual split, and after some experimentation I conclude that A) manual layer split disables per-layer KV, and B) in this mode of operation, speeds are identical to 1.55. So it seems that, intentional or not, you now have "options". You can let KCPP split the layers automatically, and you get a bit of a speed boost in exchange for slightly-suboptimal splitting which can limit your max context in edge cases. Or you can manually specify a split, getting the absolute most out of all your VRAM but at a slightly slower PP and gen speed. Honestly, at this point, I'm not sure it's even an "issue" that needs resolving. I mean it would be great to get the max theoretical context at the fastest possible speed without any manual effort, but I'm more than OK with the current situation. I kinda suspect that the tradeoff is inherent to how per-layer KV works, so it may not even be "resolvable". |
I confirm @candre23 's observations, at least on the Token Generation speed.
U:\Kob\KoboldNew\Dist>koboldcpp_cuda.exe --usecublas mmq --tensor_split 49 25 --port 5001 --threads 1 --gpulayers 99 --highpriority --blasbatchsize 128 --contextsize 7168 --launch
Generating (128 / 128 tokens)
Compared to my last well-working Frankenstein version ( https://github.com/Nexesenex/kobold.cpp/releases/tag/v1.57_b2030 ), I get around a 15% TG speed increase. Also, -30% PP speed. But I can live with that; TG matters much more to me. KoboldCPP Bench:
The difference between your Windows release and my frankenfork now boils down to its compilation. Congratulations, @LostRuins ! |
In the next version I will add a new toggle to switch between cuda row split and layer split modes. Since Pascal cards in particular seem to do better on Row split, whereas some others prefer layer. |
Awesome. Thank you for this. I have had the opposite: inference speeds increased considerably for me in 1.56 and have returned to their old speeds in 1.57. I am running on Debian Linux with an RTX 4090 and a P40 in tandem. |
@candre23 : you can try to revert commit 15b4538 to shrink the CUDA buffer a bit and regain a bit of context. Also, BLAS Batch Size 128 is (on a 3090 at least) the best speed/buffer-size compromise for prompt processing (it might be smaller for a smaller GPU, I don't know).
@mattbbx1 : you can try to revert commit acb7928 to see if LostRuins' attempt to fix the CUDA slowdown is actually doing the opposite on your configuration. Also, you can revert 21ab727, because row split mode is slower on Ampere. For a 3090-3060 bi-GPU config under Windows 11, that worked for me.
|
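For anyone who wants to try those reverts without much Git experience, the flow in a local koboldcpp checkout is roughly the following (short hashes as listed above; as I read the suggestions, 21ab727 is the row-split change; reverts may conflict depending on the branch state, and you rebuild afterwards as usual):

git checkout concedo_experimental   # or whichever branch you normally build from
git revert --no-edit 15b4538        # per the first suggestion: regain some CUDA buffer / context
git revert --no-edit acb7928        # undo the attempted CUDA slowdown fix
git revert --no-edit 21ab727        # undo the row-split change (reportedly slower on Ampere)
# rebuild, then re-run the same benchmark prompt to compare speeds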
Thanks for the reply! Reverting commit 21ab727 restored the speed increase. Prompt processing is significantly faster with my current build with that commit reverted. Build details, just in case someone sees something similar:
Edit: acb7928 did not seem to change much for my particular issue, but reverting 21ab727 did. |
@mattbbx1 Glad 1/2 worked out! |
If I am correct, then @LostRuins including the toggle feature in the next update should resolve this, as that's essentially what changed? |
Yes, in the next version split mode will be configurable. So you can try both layer and row split and see which works better for you. |
Just a reminder that in 1.58 the split mode for multi-GPU is now selectable. You can toggle it between Layer and Row split. So please try both and see which is faster for Pascal. |
Suppose I have two identical Pascal graphics cards and the model fits completely in their video memory. What should the command line be like? "koboldcpp.exe --usecublas mmq rowsplit normal --contextsize 4096 --blasbatchsize 512 --threads 9 --highpriority --model 70B.q2_k.gguf" ? --gpulayers, --tensor_split are not needed in this case? |
Updated for 1.58. Still 103b, three P40s, 1k prompt.
Rowsplit is slightly slower than it was in 1.57, but it's damn close. Layersplit is slightly slower than 1.56, but again it's within a couple percent. As before, rowsplit demands manual layer splitting since it keeps all the context in GPU0. Pretty sure this is intended (or inevitable) operation. Not really a problem, just worth mentioning. |
Alright should be good enough then. Thanks for helping to test. Hopefully these toggles allow Pascal users to enjoy decent speeds while allowing other cards to perform well too. |
@Vladonai depending on your context size and split mode, tensor_split may still be needed. If you are using row split, then the KV will only be stored on one of the cards not both, so it may feel lopsided. |
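Purely to illustrate the flags involved (not a tested configuration), a launch for two identical Pascal cards with row split and an explicit even tensor split might look like:

koboldcpp.exe --usecublas mmq rowsplit --gpulayers 99 --tensor_split 1 1 --contextsize 4096 --blasbatchsize 512 --model 70B.q2_k.gguf

Since row split keeps the KV cache on one card, you may need to skew the ratio (for example --tensor_split 4 6) so the card holding the KV cache has some headroom.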
For what it's worth, I can also confirm that the weirdly positioned Nvidia GeForce GTX 1660 Ti is also improved in 1.58. |
Can you give a sample of your Koboldcpp command line (with rowsplit) for your hardware configuration? |
Nothing fancy, just a standard manual split for 103b. The only change is adding rowsplit.
|
Hi, any idea if cublas v12.4 has fixed the problem? |
It should already be resolved, just try the toggle on/off and see which works better. |
I'm seeing some significant increases in ms/t when running 1.56 across multiple Pascal GPUs. It works out to about a 33% speed reduction overall. 103b split across three P40s, identical 6k prompt:
1.55.1: Processing:99.62s (14.6ms/T), Generation:65.22s (324.5ms/T)
1.56: Processing:136.17s (20.0ms/T), Generation:214.71s (419.3ms/T)
I mentioned this on discord and the answer seemed to be "that's just how it is now". I wasn't particularly satisfied with that answer, so I wanted to make an actual issue. Are we sure that's just how it is now, or is it possible that something isn't working correctly?
I get that Pascal is pretty old, but a lot of folks are still using these cards and this is a substantial speed hit. If this is an inevitable consequence of "something" having changed in how inferencing is done, would it be possible to revert back to the old method with a command-line arg or something?