
koboldcpp-1.35

@LostRuins released this 12 Jul 06:28
· 3226 commits to concedo since this release


Note: This build includes significant changes for CUDA and may be less stable than usual - please report any performance regressions or bugs you encounter. It may also be slower than before; if so, please use the previous version for now.

  • Enabled the CUDA 8-bit MMV mode (see ggerganov#2067), now that it seems stable and works correctly. This approach uses quantized dot products instead of the traditional DMMV approach for the formats q4_0, q4_1, q5_0 and q5_1. If you're able to do a full GPU offload, CUDA for such models will likely be significantly faster than before. K-quants and CL are not affected.
  • Exposed performance information (prompt processing and generation timing) through the API; access it with /api/extra/perf (see the sketch after this list).
  • Added support for linear RoPE as an alternative to NTK-Aware RoPE (similar to 1.33, but using 2048 as the base). This is triggered by the launcher parameter --linearrope. The RoPE scale is determined by the --contextsize parameter, so for best results on SuperHOT models you should launch with --linearrope --contextsize 8192, which gives a linear scale of 2048/8192 = 0.25, as the SuperHOT finetune suggests. If --linearrope is not specified, NTK-aware RoPE is used by default.
  • Added a Save and Load settings option to the GUI launcher.
  • Added the ability to select "All Devices" in the GUI for CUDA. Selecting a specific device is still recommended, as splitting across GPUs is usually slower.
  • Added a warning that is displayed when a poor sampler order is used, as the default configuration gives much better results.
  • Updated Kobold Lite, pulled other upstream fixes and optimizations.
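
For reference, here is a minimal sketch of reading the new perf endpoint from Python (assuming koboldcpp is running locally on the default port 5001; the exact JSON fields may differ between versions):

```python
# Minimal sketch (stdlib only): query the new /api/extra/perf endpoint
# of a locally running koboldcpp instance and pretty-print the result.
# Assumes the default port 5001; adjust if you launched with --port.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:5001/api/extra/perf") as resp:
    perf = json.load(resp)

# Expected to contain prompt processing and generation timing information.
print(json.dumps(perf, indent=2))
```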

1.35.H Henk-Cuda Hotfix: This is an alternative version from Henk that you can try if you encounter speed reductions. Please let me know if it's better for you.

Henk may have newer versions at https://github.com/henk717/koboldcpp/releases/tag/1.35; please check there for now. I will only be able to upstream any fixes in a few days.

To use, download and run koboldcpp.exe, which is a one-file pyinstaller build.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once the model is loaded, you can connect with your browser (or use the full koboldai client) at:
http://localhost:5001
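
You can also talk to the running instance programmatically. Below is a minimal sketch of a generation request from Python; it assumes the default port 5001 and the standard KoboldAI-compatible /api/v1/generate endpoint, so adjust if your setup differs.

```python
# Minimal sketch (stdlib only): send a prompt to a running koboldcpp
# instance via the KoboldAI-compatible API and print the completion.
# Endpoint and field names follow the standard KoboldAI API and are
# assumptions here; check --help and the docs for your version.
import json
import urllib.request

payload = json.dumps({
    "prompt": "Once upon a time,",
    "max_length": 50,  # number of tokens to generate
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["results"][0]["text"])
```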

For more information, be sure to run the program from the command line with the --help flag.