02 Aug 10:29

804d352

koboldcpp-1.72 Latest

Latest

koboldcpp-1.72

NEW: GPU accelerated Stable Diffusion Image Generation is now possible on Vulkan, huge thanks to @0cc4m
Fixed an issue with mismatched CUDA device ID order.
Incomplete SSE response for short sequences fixed (thanks @pi6am)
SSE streaming fix for unicode heavy languages, which should hopefully mitigate characters going missing due to failed decoding.
GPU layers now defaults to -1 when running in GUI mode, instead of overwriting the existing layer count. The predicted layers is now shown as an overlay label text instead, allowing you to see total layers as well as estimation changes when you adjust launcher settings.
Auto GPU Layer estimation takes into account loading image and whisper models.
Updated Kobold Lite: Now supports SSE streaming over OpenAI API as well, should you choose to use a different backend.
Merged fixes and improvements from upstream, including Gemma2 2B support.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

Contributors

0cc4m and pi6am

Assets 10

25 Jul 06:09

LostRuins

v1.71.1

e47477f

koboldcpp-1.71.1

oh boy, another extra 30MB just for me? you shouldn't have!

Updated Kobold Lite:
- Corpo UI Theme is now available for chat mode as well.
- More accessibility label for screen readers.
- Enabling inject chatnames in Corpo UI now replaces the AI's displayed name if enabled.
- Added setting for TTS narration speed.
- Allow selecting the greeting message in Character Cards with multiple greetings
NEW: Automatic GPU layer selection has been improved, thanks to the efforts of @henk717 and @Pyroserenus. You can also now set --gpulayers to -1 to have KoboldCpp guess how many layers to be used. Note that this is still experimental, and the estimation may not be fully accurate, so you will still get better results manually selecting the GPU layers to use.
NEW: Added KoboldCpp Launch Templates. These are sharable .kcppt files that contain the setup necessary for other users to easily load and use your models. You can embed everything necessary to use a model within one file, including URLs to the desired model files, a preloaded story, and a chatcompletions adapter. Then anyone using that template can immediately get a properly configured model setup, with correct backend, threads, GPU layers, and formats ready to use on their own machine.
- For a demo, to run Llama3.1-8B, try this koboldcpp.exe --config https://huggingface.co/koboldcpp/kcppt/resolve/main/Llama-3.1-8B.kcppt , everything needed will be automatically downloaded and configured.
Fixed a crash when running a model with llava and debug mode enabled.
iq4_nl format support in Vulkan by @0cc4m
Updated embedded winclinfo for windows, other minor fixes
--unpack now does not include .pyd files as they were causing version conflicts.
Merged fixes and improvements from upstream, including Mistral Nemo support.

Hotfix 1.71.1 - Fix for llama3 rope_factors, fixed loading older Phi3 models without SWA, other minor fixes.

For more information, be sure to run the program from command line with the --help flag.

Contributors

henk717, 0cc4m, and Pyroserenus

Assets 9

15 Jul 02:15

LostRuins

v1.70.1

a441c27

koboldcpp-1.70.1

mom: we have ChatGPT at home edition

Updated Kobold Lite:
- Introducting Corpo Mode: A new beginner friendly UI theme that aims to emulate the ChatGPT look and feel closely, providing a clean, simple and minimalistic interface. It has a limited feature set compared to other UI themes, but should feel very familiar and intuitive for new users. Now available for instruct mode!
- Settings Menu Rework: The settings menu has also been completely overhauled into 4 distinct panels, and should feel a lot less cramped now, especially on desktop.
- Sampler Presets and Instruct Presets have been updated and modernized.
- Added support for importing character cards from aicharactercards.com
- Added copy for code blocks
- Added support for dedicated System Tag and System Prompt (you are still encouraged to use the Memory feature instead)
- Improved accessibility, keyboard tab navigation and screen reader support
NEW: Official releases now provide windows binaries with included AVX1 CUDA support, download koboldcpp_oldcpu.exe
NEW: DRY dynamic N-gram anti-repetition sampler support has been added (credits @pi6am)
Added --unpack, a new self-extraction feature that allows KoboldCpp binary releases to be unpacked into an empty directory. This allows easy modification and access to the files and contents embedded inside the PyInstaller. Can also be used in the GUI launcher.
Fix for a Vulkan regression in Q4_K_S mistral models when offloading to GPU (thanks @0cc4m).
Experimental support for OpenAI tools and function calling API (credits @teddybear082)
Added a workaround for Deepseek crashing due to unicode decoding issues.
--chatcompletionsadapter can now be selected on included pre-bundled templates by filename, e.g. Llama-3.json, pre-bundled templates have also been updated for correctness (thanks @xzuyn).
Default --contextsize is finally increased to 4096, default Chat Completions API output length is also increased.
Merged fixes and improvements from upstream, including multiple Gemma fixes.

1.70.1: Fixed a bug with --unpack not including the py files, with the oldcpu binary missing some options, and swapped the cu11 linux binary to not use avx2 for best compatibility. The cu12 linux binary still uses avx2 for max performance.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

For more information, be sure to run the program from command line with the --help flag.

Contributors

0cc4m, xzuyn, and 2 other contributors

Assets 9

01 Jul 06:44

LostRuins

v1.69.1

0fc18d2

koboldcpp-1.69.1

Fixed an issue when selecting ubatch, which should now correctly match the blasbatchsize
Added separator tokens when selecting multiple images with LLaVA. Unfortunately, the model still tends to get mixed up and confused when working with multiple images in the same request.
Added a set of premade Chat Completions adapters selectable in the GUI launcher (thanks @henk717) which provide an easy instruct templates for various models and formats, should you want to use third party OpenAI based (chat completion) frontends along with KoboldCpp. This can help you override the instruct format even if the frontend does not directly support it. For more information on --chatcompletionsadapter see the wiki.
Allow inserting an extra added forced positive or forced negative prompt for stable diffusion (set add_sd_prompt and add_sd_negative_prompt in a loaded adapter).
Switched over the KoboldCpp Colab to use precompiled linux binaries, it starts and run much faster now. The Huggingface Tiefighter Space example has also been updated likewise (thanks @henk717) . Lastly, added information about using KoboldCpp in RunPod at https://koboldai.org/runpodcpp/
Fixed some utf decode errors.
Added tensor split GUI launcher input field for Vulkan.
Merged fixes and improvements from upstream, including the improved mmq with int8 tensor core support and gemma 2 features have been merged.
Updated Kobold Lite chatnames stopper for instruct mode. Also, Kobold Lite can now fall back to an alternative API or endpoint URL if the connection fails, you may attempt to reconnect using the OpenAI API instead, or to use a different URL.

1.69.1 - Merged the fixes for gemma 2 and IQ mmvq

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

For more information, be sure to run the program from command line with the --help flag.

Contributors

henk717

Assets 9

19 Jun 08:47

LostRuins

v1.68

c9c050f

koboldcpp-1.68

Added GradientAI Automatic RoPE calculation, thanks to @askmyteapot , this should provide a better automatic RoPE scaling values for large context sizes.
CLBlast support has been preserved, although it is now removed upstream. For now, I still intend to retain it as long as feasible.
Multi GPU is now made easy in Vulkan, with an All GPU option in the GUI launcher added similar to cuda. Also, vulkan now defaults to the first dedicated GPU if --usevulkan is run without any other parameters, instead of just the first GPU on the list (thanks @0cc4m )
The tokenize endpoint at /api/extra/tokencount now has an option to skip BOS tokens, by setting special to false.
Running a KCPP horde worker now automatically sets whisper and SD to quiet mode.
Allow the SD StableUI to be run even when no SD model is loaded.
Allow --sdclamped to provide a custom clamp size
Additional benchmark flags are saved (thanks @Nexesenex)
Merged fixes and improvements from upstream
Updated Kobold Lite:
- Fixed Whisper not working in some versions of Firefox
- Allow PTT to trigger a 'Generate More' if tapped, and still function as PTT if held.
- Fixed PWA functionality, now KoboldAI Lite can be installed as a web app even when running from KoboldCpp.
- Added a plaintext export option
- Increase retry history stack to 3.
- Increased default non-highres image size slightly.

Q: Why does Koboldcpp seem to constantly increase in filesize every single version?
A: Basically the upstream llama.cpp cuda maintainers believe that performance should always be prioritized over code size. Indeed, even the official llama.cpp libraries are now well over 130mb compressed without cublas runtimes, and continuing to grow in size at a geometric rate. Unfortunately, there is very little I can personally do about this.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

For more information, be sure to run the program from command line with the --help flag.

Contributors

0cc4m, askmyteapot, and Nexesenex

Assets 8

04 Jun 10:50

LostRuins

v1.67

5789417

koboldcpp-1.67

Hands free edition

KoboldLostAgain.mp4

NEW: Integrated Whisper.cpp into KoboldCpp. This can be used from Kobold Lite for speech-to-text (see below). You can obtain a whisper model from the whisper.cpp repo links or download one mirrored here
- Two new endpoints are added, /api/extra/transcribe used by KoboldCpp and the OpenAI compatible drop-in /v1/audio/transcriptions. Both endpoints accept payloads as .wav files (max 32MB), or base64 encoded wave data, please check KoboldCpp API docs for more info.
- Can be used in Kobold Lite. Uses microphone when enabled in settings panel. You can use Push-To-Talk (PTT) or automatic Voice Activity Detection (VAD) aka Hands Free Mode, everything runs locally within your browser including resampling and wav format conversion, and interfaces directly with the KoboldCpp transcription endpoint.
- Special thanks to @ggerganov and all the developers of whisper.cpp, without which none of this would have been possible.
NEW: You can now utilize the Quantized KV Cache feature in KoboldCpp with --quantkv [level], where level 0=f16, 1=q8, 2=q4. Note that quantized KV cache is only available if --flashattention is used, and is NOT compatible with Context Shifting, which will be disabled if --quantkv is used.
Merged improvements and fixes from upstream, including new MOE support for Vulkan by @0cc4m
Fixed a bug with stable diffusion generating blank images in CPU mode.
Updated Kobold Lite:
- Speech-To-Text features have been added, see above.
- Tavern Cards can now be imported in Instruct mode. Enable "Show Advanced Load" for this option.
- Logit Bias editor now has a built-in tokenizer for strings when using with koboldcpp.
- Fixed world info trigger probability, added escape button to close popups, fixed Cohere preamble dialog, fixed password input field sizes, various other bugfixes.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

For more information, be sure to run the program from command line with the --help flag.

Contributors

ggerganov and 0cc4m

Assets 8

24 May 10:33

LostRuins

v1.66

6530501

koboldcpp-1.66.1

Phi guess that's the way the cookie crumbles edition

NEW: Added custom SD LoRA support! Specify it with --sdlora and set the LoRA multiplier with --sdloramult. Note that SD LoRAs can only be used when loading in 16bit (e.g. with the .safetensors model) and will not work on quantized models (so incompatible with --sdquant)
NEW: Added custom SD VAE support, which can be specified in the Image Gen tab of the GUI launcher, or using --sdvae [vae_file.safetensors]
NEW: Added in-built support for TAE SD for SD1.5 and SDXL. This is a very small VAE replacement that can be used if a model has a broken VAE, it also works faster than regular VAE. To use it, select "Fix Bad VAE" checkbox or use the flag --sdvaeauto
- Note: Do not use the above new flags with --sdconfig, which is a deprecated flag and not to be used.
NEW: Added experimental support for Rep Pen Slope. This is not a true slope, but the end result is it applies a slightly reduced rep pen for older tokens within the rep pen range, scaled by the slope value. Setting rep pen slope to 1 negates this effect. For compatibility reasons, rep pen slope defaults to 1 if unspecified (same behavior as before).
NEW: You can now specify a http/https URL to a GGUF file when passing the --model parameter, or in the model selector UI. KoboldCpp will attempt to download the model file into your current working directory, and automatically load it when the download is done.
Disable UI launcher scaling on MacOS due to display issues. Please report any further scaling issues.
Improved EOT token handling, fixed a bug in token speed calculations.
Default thread count will not exceed 8 unless overridden, this helps mitigate e-core issues.
Merged improvements and fixes from upstream, including new Phi support and Vulkan fixes from @0cc4m
Updated Kobold Lite:
- Now attempts to function correctly if hosted on a subdirectory URL path (e.g. using a reverse proxy), if that fails it defaults back to the root URL.
- Changed default chatmode player name from "You" to "User", which solves some wonky phrasing issues.
- Added viewport width controls in settings, including horizontal fullscreen.
- Minor bugfixes for markdown

Fix for 1.66.1 - Fixed quant tools makefile, fixed sd seed parsing, updated lite

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

For more information, be sure to run the program from command line with the --help flag.

Contributors

0cc4m

Assets 9

11 May 02:44

LostRuins

v1.65

702be65

koboldcpp-1.65

at least we have a shovel edition

NEW: Added a new standalone UI for Image Generation, thanks to @ayunami2000 for porting StableUI (original by @aqualxx) to KoboldCpp! Now you have a powerful dedicated A1111 compatible GUI for generating images locally, with a similar look and feel to Automatic1111. And it runs in your browser, launching straight from KoboldCpp, simply load a Stable Diffusion model and visit http://localhost:5001/sdui/
NEW: Added official CUDA 12 binaries. If you have a newer NVIDIA GPU and don't mind larger files, you may get increased speeds by using the CUDA 12 build koboldcpp_cuda12.exe
Added a new API field bypass_eos to skip EOS tokens while still allowing them to be generated.
Hopefully fixed tk window resizing issues
Increased interrogate mode token amount by 30%, and increased default chat completions token amount by 250%
Merged improvements and fixes from upstream
Updated Kobold Lite:
- Added option to insert Instruct System Prompt
- Added option to bypass (skip) EOS
- Added toggle to return special tokens
- Added Chat Names insertion for instruct mode
- Added button to launch StableUI
- Various minor fixes, support importing cards from CharacterHub urls.

Important Deprecation Notice: The flags --smartcontext, --hordeconfig and --sdconfig are being deprecated.

--smartcontext is no longer as useful nowadays with context shifting, and just adds clutter and confusion. With it's removal, if contextshift is enabled, smartcontext will be used as a fallback if contextshift is unavailable, such as with old models. --noshift can still be used to turn both behaviors off.

--hordeconfig and --sdconfig are being replaced, as the number of configurations for these arguments grow, the order of these positional arguments confuses people, and makes it very difficult to add new flags and toggles as well, since a misplaced new parameter breaks existing parameters. Additionally, it also prevented me from properly validating each input for data type and range.

As this is a large change, these deprecated flags will remain functional for now. However, you are strongly advised to switch over to the new replacement flags below:

Replacement Flags:

--hordemodelname  Sets your AI Horde display model name.
--hordeworkername Sets your AI Horde worker name.
--hordekey        Sets your AI Horde API key.
--hordemaxctx     Sets the maximum context length your worker will accept.
--hordegenlen     Sets the maximum number of tokens your worker will generate.

--sdmodel     Specify a stable diffusion model to enable image generation.
--sdthreads   Use a different number of threads for image generation if specified. 
--sdquant     If specified, loads the model quantized to save memory.
--sdclamped   If specified, limit generation steps and resolution settings for shared use.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

For more information, be sure to run the program from command line with the --help flag.

Contributors

ayunami2000 and aqualxx

Assets 8

01 May 12:55

LostRuins

v1.64.1

a3718c6

koboldcpp-1.64.1

Added fixes for Llama 3 tokenization: Support updated Llama 3 GGUFs with pre-tokenizations.
- Note: In order to benefit from the tokenizer fix, the GGUF models need to be reconverted after this commit. A warning will be displayed if the model was created before this fix.
Automatically support and apply both EOS and EOT tokens. EOT tokens are also correctly biased when EOS is banned.
finish_reason is now correctly communicated in both sync and SSE streamed modes responses when token generation is stopped by EOS/EOT. Also, Kobold Lite no longer trims sentences if a EOS/EOT is detected as the stop reason in instruct mode.
Added proper support for trim_stop in SSE streaming modes. Stop sequences will no longer be exposed even during streaming when trim_stop is enabled. Additionally, using the Chat Completions endpoint automatically applies trim stop to the instruct tag format used. This allows better out-of-box compatibility with third party clients like LibreChat.
--bantokens flag has been removed. Instead, you can now submit banned_tokens dynamically via the generate API, for each specific generation, and all matching tokens will be banned for that generation.
Added render_special to the generate API, which allows you to enable rendering of special tokens like <|start_header_id|> or <|eot_id|> if enabled.
Added new experimental flag --flashattention to enable Flash Attention for compatible models.
Added support for resizing the GUI launcher, all GUI elements will auto-scale to fit. This can be useful for high DPI screens.
Improved speed of rep pen sampler.
Added additional debug information in --debugmode.
Added a button for starting the benchmark feature in GUI launcher mode.
Fixed slow clip processing speed issue on Colab
Fixed quantization tool compilation again
Updated Kobold Lite:
- Improved stop sequence and EOS handling
- Fixed instruct tag dropdown
- Added token filter feature
- Added enhanced regex replacement (now also allowed for submitted text)
- Support custom {{placeholder}} tags.
- Better max context handling when used in Kcpp
- Support for Inverted world info secondary keys (triggers when NOT present)
- Language customization for XTTS

Hotfix 1.64.1: Fixed LLAVA being incoherent on the second generation onwards. Also, the gui launcher has been tidied up, lowvram is now removed from quick launch tab and only in hardware tab. --benchmark includes version and gives clearer exit instructions in console output now. Fixed some tkinter error outputs on quit.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

For more information, be sure to run the program from command line with the --help flag.

Assets 6

20 Apr 03:00

LostRuins

v1.63

593f08b

koboldcpp-1.63

Enable Sound, Press Play

kobo_gif.mp4

Added support for special tokens in stop_sequences. Thus, if you set <|eot_id|> as a stop sequence and it can be tokenized into a single token, it will just work and function like the EOS token, allowing multiple EOS-like tokens.
Reworked the Automatic RoPE scaling calculations to support Llama3 (just specify the desired --contextsize and it will trigger automatically).
Added a console warning if another program is already using the desired port.
Improved server handling for bad or empty requests, which fixes a potential flooding vulnerability.
Fixed a scenario where the BOS token could get lost, potentially resulting in lower quality especially during context-shifting.
Pulled and merged new model support, improvements and fixes from upstream.
Updated Kobold Lite: Fixed markdown, reworked memory layout, added a regex replacer feature, added aesthetic background color settings, added more save slots, added usermod saving, added Llama3 prompt template

Edit: Something seems to be flagging the CI built binary on windows defender. Replaced it with a locally built one until I can figure it out.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

For more information, be sure to run the program from command line with the --help flag.

Assets 6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

koboldcpp-1.72

Contributors

koboldcpp-1.71.1

Contributors

koboldcpp-1.70.1

Contributors

koboldcpp-1.69.1

Contributors

koboldcpp-1.68

Contributors

koboldcpp-1.67

Contributors

koboldcpp-1.66.1

Contributors

koboldcpp-1.65

Contributors

koboldcpp-1.64.1

koboldcpp-1.63

Releases: LostRuins/koboldcpp

koboldcpp-1.72

koboldcpp-1.72

Contributors

koboldcpp-1.71.1

koboldcpp-1.71.1

Contributors

koboldcpp-1.70.1

koboldcpp-1.70.1

Contributors

koboldcpp-1.69.1

koboldcpp-1.69.1

Contributors

koboldcpp-1.68

koboldcpp-1.68

Contributors

koboldcpp-1.67

koboldcpp-1.67

Contributors

koboldcpp-1.66.1

koboldcpp-1.66.1

Contributors

koboldcpp-1.65

koboldcpp-1.65

Contributors

koboldcpp-1.64.1

koboldcpp-1.64.1

koboldcpp-1.63

koboldcpp-1.63