I currently have an RTX 3050, and the latest releases of koboldcpp have really sped up prompt processing. Obviously, a more powerful graphics card will speed this up even more. But what about generation? I might buy an RTX 4090 if it would make token generation significantly faster, but I suspect it won't. Can't the CUDA and RTX cores be used for at least some of the computation to speed up generation? Because 70B models on a CPU would be very slow...