
Releases: LostRuins/koboldcpp

koboldcpp-1.91

10 May 15:15

koboldcpp-1.91

Entering search mode edition


  • NEW: Huggingface Model Search Tool - Grabbing a model has never been easier! KoboldCpp now comes with an HF model browser, so you can search for and find the GGUF models you like directly from Hugging Face. Simply search for and select the model, and it will be downloaded before launch.
  • Embedded aria2c downloader for Windows builds - this provides extremely fast downloads and is automatically used when downloading models via provided URLs.
  • Added a CUDA target for compute capability 3.5. This may allow KoboldCpp to be used with the K6000, GTX 780, and K80. I have received some success stories - if you do try, share your experiences on the discussions page!
  • Reduced CUDA binary sizes by switching most CUDA compute capability targets to virtual, thanks to a good suggestion from Johannes at ggml-org#13135
  • Improved ComfyUI emulation; it can now adapt to any kind of workflow, so long as there is a KSampler node connected to a text prompt somewhere in it.
  • Fixed GLM-4 prompt handling even for quants with incorrect BOS set.
  • Added support for Classifier-Free Guidance (CFG). At long last I have finally added CFG, but I don't really like it - results are not great. Anyway, if you wish to use it, simply check Enable Guidance or use --enableguidance, then set a negative prompt and CFG scale from the Lite Tokens menu. Note that guidance doubles KV usage and halves generation speed. Overall, it was a disappointing addition and not really worth the effort.
  • StableUI now clears the queue when cancelling a generation
  • Further fixes for Zenity/YAD in multilingual environments
  • Removed flash attention limits and warnings for Vulkan
  • Updated Kobold Lite, multiple fixes and improvements
    • Important Change: KoboldCppAuto is now the default instruct preset. This will let the KoboldCpp backend automatically choose the correct instruct tags to use at runtime, based on the model loaded. This is done transparently in the backend and not visible to the user. If it doesn't work properly, you can always still switch to your preferred instruct format (e.g. Alpaca).
    • NEW: Corpo mode now supports Text mode and Adventure mode as well, making it usable in all 4 modes.
    • Added quick save and delete buttons for corpo mode.
    • Added Pollinations.ai as an option for TTS and Image Gen (optional online service)
    • Instruct placeholders are now always used (but you can change what they map to, including themselves)
    • Added confirmation box for loading from slots
    • Improved think tag handling and output formatting.
    • Added a new scenario: Nemesis
    • The "Chat Match Any Name" option is no longer enabled by default
    • Fixed autoscroll jumping on edit in corpo mode
    • Fix char spec v2 embedded WI import by @Cohee1207
  • Merged fixes and improvements from upstream

To use, download and run koboldcpp.exe, which is a single-file PyInstaller build.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
If you have an Nvidia GPU but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe.
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on modern macOS (M1, M2, M3), you can try the koboldcpp-mac-arm64 macOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI. Once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).

For more information, be sure to run the program from the command line with the --help flag. You can also refer to the readme and the wiki.
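
As a quick sanity check once a model is loaded, a minimal Python sketch of a generation request might look like the following (it assumes the default port 5001 and the standard /api/v1/generate endpoint; field names follow the usual KoboldAI API and can be adjusted to taste):

    # Minimal generation request against a locally running KoboldCpp instance.
    import json
    import urllib.request

    payload = {
        "prompt": "Once upon a time,",
        "max_length": 80,      # number of tokens to generate
        "temperature": 0.7,
    }
    req = urllib.request.Request(
        "http://localhost:5001/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    print(result["results"][0]["text"])  # generated text

The same request works against a remote host or tunnel URL; only the base address changes.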

koboldcpp-1.90.2

29 Apr 15:28

koboldcpp-1.90.2

Qwen of the line edition

  • NEW: Android Termux Auto-Installer - You can now set up KoboldCpp via Termux on Android with a single command, which triggers an automated installation script. Check it out here. Install Termux from F-Droid, then run the command with internet access, and everything will be set up, downloaded, compiled and configured for instant use with a Gemma3-1B model.

  • Merged support for Qwen3. Now also triggers --nobostoken automatically if the model metadata explicitly indicates no_bos_token; it can still be enabled manually for other models.

  • Fixes for THUDM GLM-4, note that this model enforces --blasbatchsize 16 or smaller in order to get coherent output.

  • Merged overhaul to Qwen2.5vl projector. Both old (HimariO version) and new (ngxson version) mmprojs should work, retaining backwards compatibility. However, you should update to the new projectors.

  • Merged functioning Pixtral support. Note that Pixtral is very token-heavy (about 4000 tokens for a 1024px image); you can try increasing --contextsize or lowering --visionmaxres.

  • Added support for OpenAI Structured Outputs in the chat completions API; the schema is also accepted when sent as a stringified JSON object in the "grammar" field. You can use this to enforce JSON outputs with a specific schema (a short sketch follows these release notes).

  • --blasbatchsize -1 now exclusively uses a batch size of 1 when processing the prompt. --blasbatchsize 16 is also permitted, which replicates the old behavior (a batch of 16 does not trigger GEMM).

  • KCPP API server now correctly handles explicitly set nulled fields.

  • Fixed Zenity/YAD detection not working correctly in the previous version.

  • Improved input sanitization when launching and passing a URL as a model parameter. Also, for better security, --onready shell commands can still be used as a CLI parameter, but can no longer be embedded into a .kcppt or .kcpps file.

  • More robust checks for system glslc when building vulkan shaders.

  • Improved automatic GPU layer selection when loading multi-part GGUF models (on a single GPU); also slightly tightened memory estimation, which now accounts for quantized KV when guessing layers.

  • Added new flag --mmprojcpu that allows you to load and run the projector on CPU while keeping the main model on GPU.

  • noscript mode randomizes generated image names to prevent browser caching.

  • Updated Kobold Lite, multiple fixes and improvements

    • Increased default tokens generated and slider limits (can be overridden)
    • ChatGLM-4 and Qwen3 (chatml think/nothinking) presets added. You can disable thinking in Qwen3 by swapping between ChatML (No Thinking) and normal ChatML.
    • Added toggle to disable LaTeX while leaving markdown enabled
  • Merged fixes and improvements from upstream

  • Hotfix 1.90.1:

    • Reworked thinking tag handling. ChatML (No Thinking) is removed; instead, thinking can be forced or prevented for all instruct formats (Settings > Tokens > CoT).
    • More GLM-4 fixes; it now works fine with larger batches on CUDA, though on Vulkan the GLM-4 ubatch size is still limited to 16.
    • Some chat completions parsing fixes.
    • Updated Lite with a new scenario
  • Hotfix 1.90.2:

    • Pulled further upstream updates. Massive file size increase caused by ggml-org#13199; I can't do anything about it. Don't ask me.
    • NEW: Added a Hugging Face model search tool! Now you can find, browse and download models straight from Hugging Face.
    • Increased --defaultgenamount range
    • Try to fix YAD GUI launcher
    • Added a rudimentary websocket spoof for ComfyUI, increasing ComfyUI compatibility.
    • Fixed a few parsing issues for nulled chat completions params
    • Automatically handle multipart file downloading, up to 9 parts.
    • Fixed rope config not saving correctly to kcpps sometimes
    • Merged fixes for Plamo models, thanks to @CISC
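
A minimal sketch of the new structured output path, using the "grammar" field described above to pass a stringified JSON schema (the schema itself is a made-up example; the endpoint and port are the KoboldCpp defaults):

    # Enforce a JSON schema on a chat completion via the "grammar" field.
    import json
    import urllib.request

    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
    }
    payload = {
        "messages": [{"role": "user", "content": "Describe a fantasy character."}],
        "max_tokens": 200,
        "grammar": json.dumps(schema),  # schema sent as a stringified JSON object
    }
    req = urllib.request.Request(
        "http://localhost:5001/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])  # should be schema-conforming JSON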

koboldcpp-1.89

21 Apr 03:15

koboldcpp-1.89

  • NEW: Improved NoScript mode - NoScript mode now has chat mode and image generation support entirely without JavaScript! Access it by default at http://localhost:5001/noscript in your browser. Tested to work on Internet Explorer 5, Netscape Navigator 4, NetSurf, Lynx, Dillo, and basically any browser made after 1999.
  • Added new launcher flags --overridekv and --overridetensors, which work the same way as llama.cpp's flags (see the launch sketch after this list).
    • --overridekv allows you to specify a single metadata property to be overwritten. Input format is keyname=type:value
    • --overridetensors allows you to place tensors matching a pattern onto a specific backend. Input format is tensornamepattern=buffertype
  • Enabled @jeffbolznv coopmat2 support for Vulkan (supports flash attention, overall slightly faster). CM2 is only enabled if you have the latest Nvidia Game Ready Driver (576.02) and should provide all-round speedups. However, the (OldCPU) Vulkan binaries will now exclude coopmat, coopmat2 and DP4A, so please use OldCPU mode if you encounter issues.
  • Display available GPU memory when estimating layers
  • Fixed RWKV model loading
  • Added more sanity checks for Zenity, made YAD the default filepicker instead. If you still encounter issues, please select Legacy TK filepicker in the extras page, and report the issue.
  • Minor fixes for stable UI inpainting brush selection.
  • Enabled usage of Image Generation LoRAs even with a quantized diffusion model (the LoRA should still be unquantized)
  • Fixed a crash when using certain image LoRAs due to graph size limits. Also reverted CLIP quant to f32 changes.
  • CLI mode fixes
  • Updated Kobold Lite, multiple fixes and improvements
    • IMPORTANT: Relocated Tokens Tab and WebSearch Tab into Settings Panel (from context panel). Likewise, the regex and token sequence configs are now stored in settings rather than story (and will persist even on a new story).
    • Fixed URLs not opening in a new tab
    • Reworked thinking tag handling - now separates display and submit regex behaviors (3 modes each)
    • Added Retain History toggle for WebSearch to retain some old search results on subsequent queries.
    • Added an editable template for the character creator (by @PeterPeet)
    • Increased to 10 local and 10 remote save slots.
    • Removed aetherroom club (dead site)
  • Merged fixes and improvements from upstream
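
An illustrative launch sketch for the two new flags (the metadata key, value, and tensor pattern below are hypothetical placeholders; only the argument formats keyname=type:value and tensornamepattern=buffertype are from the notes above):

    # Launch KoboldCpp with a metadata override and a tensor placement override.
    import subprocess

    subprocess.run([
        "./koboldcpp",                    # path to your KoboldCpp binary
        "--model", "mymodel.gguf",        # placeholder model file
        # keyname=type:value - override a single metadata property
        "--overridekv", "tokenizer.ggml.add_bos_token=bool:false",
        # tensornamepattern=buffertype - keep matching tensors on a given backend
        "--overridetensors", "ffn_.*=CPU",
    ])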

koboldcpp-1.88

13 Apr 14:42

koboldcpp-1.88


  • NEW: Added Image Inpainting support to StableUI, and merged inpainting support from stable-diffusion.cpp (by @stduhpf)
    • You can use the built-in StableUI to mask out areas to inpaint when editing with Img2Img (Similar to A1111). API docs for this are updated.
    • Added slider for setting clip-skip in StableUI.
    • Other improvements from stable-diffusion.cpp are also merged.
  • Added Zenity and YAD support for displaying file picker dialogs on Linux (by @henk717); if they are installed on your system, they will be used. To continue using the previous TKinter file picker, you can select "Use Classic FilePicker" in the extras tab.
  • Added a new API endpoint /api/extra/json_to_grammar which can be used to convert a JSON schema into GBNF grammar (check the API docs for an example; a short sketch also follows this list).
  • Added a --maxrequestsize flag; you can configure the maximum server payload size before an HTTP request is dropped (default 32 MB).
  • Can now perform GPU memory estimation using vulkaninfo too (if nvidia-smi is not available).
  • Merged Llama 4 support from upstream llama.cpp. Qwen3 is technically included too, but until it releases officially we won't know if it actually works.
  • Fixed backend and layers not being auto-set when swapping to a new model in admin mode using a template.
  • Added additional warnings in the GUI and terminal when you try to use FlashAttention on the Vulkan backend - generally this is discouraged due to performance issues.
  • Fixed the system prompt in the Gemma3 template
  • Updated Kobold Lite, multiple fixes and improvements
    • Added Llama4 prompt format
    • Consolidated vision dropdown when selecting a vision provider
    • Fixed think tokens formatting issue with markdown
  • Merged fixes and improvements from upstream
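
A minimal sketch of the schema-to-grammar conversion mentioned above. The exact request shape is an assumption here (the schema POSTed directly as the JSON body); consult the API docs for the authoritative format:

    # Convert a JSON schema into GBNF grammar via /api/extra/json_to_grammar.
    import json
    import urllib.request

    schema = {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    }
    req = urllib.request.Request(
        "http://localhost:5001/api/extra/json_to_grammar",
        data=json.dumps(schema).encode("utf-8"),  # assumption: schema is the request body
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode("utf-8"))  # GBNF grammar, printed as returned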

koboldcpp-1.87.4

01 Apr 14:36

koboldcpp-1.87.4

April Fools Edition

  • NEW: Embeddings endpoint added - GGUF embedding models can now be loaded with --embeddingsmodel and accessed from /v1/embeddings or /api/extra/embeddings; this can be used to encode text for search or storage within a vector database.
  • NEW: Added OuteTTS Voice Cloning Support! - Now you can upload Speaker JSONs over the TTS API which represent a cloned voice when generating TTS. Read more here and try some sample speakers, or make your own.
  • NEW: Merged Qwen2.5VL support from @HimariO's fork - also fixed issues with Qwen2VL when multiple images are used.
    • You can get the models and mmproj projectors here: 7B, 32B; the 3B doesn't work well.
  • NEW: Automatic function (tool) calling - Improved tool calling support, thanks to help from @henk717. KoboldCpp can now work with tool calls from frontends such as OpenWebUI. Additionally, auto mode is also supported, allowing the model to decide for itself whether a function call is needed and which tool to use (though manually selecting the desired tool with tool_choice still provides better results). Note that tool calling requires a relatively intelligent modern model to work correctly (recommended model: Gemma3). For more info on function calling, see here. The tool call detection template can be customized by setting custom_tools_prompt in the chat completions adapter. A short sketch follows this list.
  • NEW: Added Command Line Chat Mode - KoboldCpp has come full circle! Now you can use it fully without a GUI, just like good old llama.cpp. Simply run it with --cli to enter terminal mode, where you can chat interactively using the command line shell!
  • Improved AMD rocwmma build detection, also changed Vulkan build process (now requires compiling shaders)
  • Merged DP4A Vulkan enhancements by @0cc4m for greater performance on legacy quants in AMD and Intel, please report if you encounter any issues.
  • --quantkv can now be used without flash attention; when this is done, it only applies quantized K without quantized V. Not really advised, as performance can suffer.
  • Truncated base64 image printouts in console (they were too long)
  • Added a timeout for vulkaninfo in case it hangs.
  • Fixed --savedatafile with relative paths
  • Fixed llama3 template AutoGuess detection.
  • Added localtunnel as an alternative fallback option in the Colab, in case Cloudflare tunnels happen to be blocked.
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: Added World Info Groups - You can now categorize your world info entries into groups (e.g. for a character/location/event) and easily toggle them on and off in a single click.
      • You can also toggle each entry on/off individually without removing it from WI.
      • You can easily Import and Export each world info group as JSON to use within another story or chat.
    • Added a menu to upload a cloned speaker JSON for use in voice cloning. Read the section for OuteTTS voice cloning above.
    • Multiplayer mode UI streamlining.
    • Add toggle to allow uploading images as a new turn.
    • Increased max resolution of uploaded images used with vision models.
    • Switching a model in admin mode now auto refreshes Lite when completed
  • Merged fixes and improvements from upstream
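
A minimal sketch of an automatic tool call through the chat completions endpoint, using the standard OpenAI "tools" format (the get_weather function is a made-up example; "auto" lets the model decide whether to call it):

    # Ask a question and let the model decide whether to call the tool.
    import json
    import urllib.request

    payload = {
        "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",  # or name a specific tool for better results
        "max_tokens": 200,
    }
    req = urllib.request.Request(
        "http://localhost:5001/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    print(json.dumps(reply["choices"][0]["message"], indent=2))  # may contain tool_calls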

Hotfix 1.87.1 - Fixed embeddings endpoint, fixed gemma3 autoguess system tag.
Hotfix 1.87.2 - Fixed broken DP4A vulkan and savedatafile bug.
Hotfix 1.87.3 - Fixed a regression in savedatafile, also added queueing for SDUI.
Hotfix 1.87.4 - Revert gemma3 system role template as it was not working correctly. Increased warmup batch size. Merged further DP4A improvements.

This month marks KoboldCpp entering its third year! Somehow it has survived, against all odds. Thanks to everyone who has provided valuable feedback, contributions and support over the past months. Enjoy!

koboldcpp-1.86.2

14 Mar 13:58

koboldcpp-1.86.2


  • Integrated Gemma3 support. To use it, grab the GGUF model and the vision mmproj such as this one, and load both of them in KoboldCpp, similar to earlier vision models. Everything else should work out of the box in Lite (click Add Img to paste or upload an image). Vision will also work in SillyTavern via the custom Chat Completions API (enabling inline images); a short sketch follows this list.
  • Fixed OpenAI API finish_reason value and tool calling behaviors.
  • Re-enabled support for CUDA compute capability 3.7 (K80)
  • Added an option to save stories to Google Drive when used in Colab
  • Added speculative success rate information in /api/extra/perf/
  • Allow downloading Image Generation LoRAs from URL launch arguments
  • Added image generation parameter metadata to generated images (thanks @wbruna)
  • CI builds now also rebuild Vulkan shaders.
  • Replaced winclinfo.exe with a simpler version (see simpleclinfo.cpp) that only fetches GPU names.
  • Admin mode can now swap between GGUF model files at runtime as well, in addition to swapping between .kcpps configs. When swapping models in this way, default GPU layers and selections will be picked.
  • Updated Kobold Lite, multiple fixes and improvements
    • Added a new instruct preset "KoboldCppAutomatic" which automatically obtains the instruct template from KoboldCpp.
    • Improvements and fixes for side panel mode
  • Merged fixes and improvements from upstream
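
A minimal sketch of sending an inline image to a vision-enabled model (e.g. Gemma3 loaded with its mmproj) through the chat completions API, using the usual OpenAI-style image_url content part with a base64 data URI (the image path is a placeholder):

    # Send a local image plus a question to a vision-capable model.
    import base64
    import json
    import urllib.request

    with open("photo.png", "rb") as f:  # placeholder image path
        image_b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this picture?"},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + image_b64}},
            ],
        }],
        "max_tokens": 200,
    }
    req = urllib.request.Request(
        "http://localhost:5001/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])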

Hotfix 1.86.1:

  • Added a new option --defaultgenamount, which controls the maximum number of tokens generated by default (e.g. for a third-party client using chat completions) if not otherwise specified
  • Added a new option --nobostoken, which prevents the BOS token from being added automatically at the start. Not recommended unless you know what you're doing.
  • Fixed bugs with CL VRAM detection, Gemma3 vision on macOS, and rescaling issues.

Hotfix 1.86.2:

  • NEW: Allowed using quantized KV (--quantkv) together with context shift, as it's been enabled upstream. This means the only requirement for using quantized KV now is to enable flash attention. Report any issues you face.
  • Fixed chat completions function (tool) calling again, which now works in all modes except "auto" (set tool_choice to "forced" to let the AI choose which tool to use).
  • Improved mmproj memory estimation, fix for Image Gen metadata and input fields sanitization, fixed output image metadata.

koboldcpp-1.85.1

01 Mar 09:46

koboldcpp-1.85.1

Now with 5% more kobo edition


  • NEW: Added server-side (networked) save slots! You can now specify a database file when launching KoboldCpp using --savedatafile. You will then be able to save and load persistent stories over the network to that KoboldCpp server, and access them from any other browser or device connected to it over the network. This can also be combined with --password to require an API key to save/load the stories.
  • Added Top-N Sigma sampler (credit @EquinoxPsychosis). Note that this sampler can only be combined with Top-K, Temperature, and XTC.
  • Added --exportconfig and --exporttemplate, allowing users to export any set of launch arguments as a .kcpps or .kcppt config file from the command line. This file can also be used subsequently for model switching in admin mode.
  • Minor refactors for TFS and rep pen by @Reithan
  • Fixed .kcppt templates backend override not working
  • Updated clinfo binary for windows.
  • Updated Kobold Lite, multiple fixes and improvements
    • Added improved thinking support: display thoughts, force-inject <think> tokens into AI replies, or filter out old thoughts in subsequent generations.
    • Reworked and improved load/save UI, added 2 extra local slots and 8 extra remote save slots.
    • Top-N sigma support
    • Added customization options for assistant jailbreak prompt
    • Refactored 3rd party scenario loader (thanks @Desaroll)
  • Merged fixes and improvements from upstream (includes Vulkan and CUDA enhancements and Granite support)

Hotfix 1.85.1 - Rolled back ggml-org#12015 to fix q8 on AMD Vulkan. Also added Side Panel Mode for KoboldAI Lite.

koboldcpp-1.84.2

15 Feb 16:03

koboldcpp-1.84.2

This is mostly a bugfix release as 1.83.1 had some issues, but there were too many changes for another patch release.

  • Added support for using aria2c and wget for model downloading if detected on system. (credits @henk717).
  • It's also now possible to specify multiple URLs when loading multipart models online with --model [url1] [url2]... (CLI only), allowing KoboldCpp to download all the parts.
  • Added automatic recovery in admin mode: if switching to a faulty config fails, it will attempt to roll back to the original known-good config.
  • Fixed MoE experts override not working for Deepseek
  • Fixed multiple loader bugs when using the AutoGuess adapter.
  • Fixed images failing to generate when using the AutoGuess adapter.
  • Removed TTS caching as it was not very good.
  • Updated Kobold Lite, multiple fixes and improvements
    • Fix websearch button visibility
    • Improved instruct formatting in classic UI
    • Fixed some LaTeX and markdown edge cases
    • Upped max length slider to 1024 if detected context is larger than 4096.
  • Merged fixes and improvements from upstream

Hotfix 1.84.1 - Vulkan IQ1 support and fixed Lite instruct icon display
Hotfix 1.84.2 - Fixed AutoGuess errors; fixed an incoherency issue due to flash attention on the RTX 4090 with Mistral Small.

This build may still have minor issues - if you have problems please use 1.82.4 for now, I am working on a fix.

koboldcpp-1.83.1

08 Feb 02:12

koboldcpp-1.83.1

The Kobo :.|:; Edition


  • NEW: Added the ability to switch models, settings and configs at runtime! This also allows for remote model swapping. Credits to @esolithe for original reference implementation.
    • Launch with --admin to enable this feature, and also provide --admindir pointing to a directory containing .kcpps launch configs (see the launch sketch after this list).
    • Optionally, provide --adminpassword to secure admin functions
    • You will be able to swap between any model's config at runtime from the Admin panel in Lite. You can prepare .kcpps configs for different layers, backends, models, etc.
    • KoboldCpp will then terminate the current instance and relaunch to a new config.
  • Added a new backend option for CLBlast (Regular, OldCPU and OlderCPU), for AVX2, AVX and no-AVX respectively, ensuring a usable GPU alternative for all ranges and ages of CPUs.
  • CLIP vision embeddings can now be reused between multiple requests, so they won't have to be reprocessed if the images don't change.
  • Context shifting is disabled when using mrope (used in Qwen2VL), as it does not work correctly.
  • Now defaults to AutoGuess for the chat completions adapter. Set it to "Alpaca" for the old behavior instead.
  • You can now set the maximum resolution accepted by vision mmprojs with --visionmaxres. Images larger than that will be downscaled before processing.
  • You can now set a length limit for TTS using --ttsmaxlen when launching; this limits the number of TTS tokens allowed to be generated (range 512 to 4096). Each second of audio is about 75 tokens.
  • Fixed a bug with TTS that could cause a crash.
  • Added cloudflared tunnel download for aarch64 (thanks @FlippFuzz). Also, allowed SSL combined with remote tunnels.
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: Added deepseek instruct template, and added support for reasoning/thinking template tags. You can configure thinking rendering behavior from Context > Tokens > Thinking
    • NEW: Finally allows specifying individual start and end instruct tags instead of combining them. Toggle this in Settings > Toggle End Tags.
    • NEW: Multi-pass websearch added. This allows you to specify a template that is used to generate the search query.
    • Added a websearch toggle button
    • TTS now allows downloading the audio output as a file when testing it, instead of just playing the sound.
    • Some regex parsing fixes
    • Added admin panel
  • Merged fixes and improvements from upstream
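
An illustrative launch sketch for the new admin mode (paths, the model name, and the password are placeholders; the flags --admin, --admindir and --adminpassword are the ones introduced in this release):

    # Launch KoboldCpp with remote model/config switching enabled.
    import subprocess

    subprocess.run([
        "./koboldcpp",                     # path to your KoboldCpp binary
        "--model", "mymodel.gguf",         # placeholder initial model
        "--admin",                         # enable the admin panel
        "--admindir", "/path/to/configs",  # folder of .kcpps configs to swap between
        "--adminpassword", "changeme",     # optional: secure admin functions
    ])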

Hotfix 1.83.1 - Fixed crashes in non-gguf models due to autoguess adapter. Also reverts to single process only when not in admin mode.

Note: This version is known to have bugs - you should avoid it

koboldcpp-1.82.4

18 Jan 08:20

koboldcpp-1.82.4

Old kobo yells at cloud edition


  • NEW: Added OuteTTS for Text-To-Speech: OuteTTS is a text-to-speech model that can be used for narration by generating audio from KoboldCpp.
    • You need two models, an OuteTTS GGUF and a WavTokenizer GGUF which you can find here.
    • Once downloaded, load them in the Audio tab or using --ttsmodel and --ttswavtokenizer. You can also use --ttsgpu to load them on the GPU instead, and --ttsthreads to set a custom thread count.
    • When enabled, this sets up OpenAI Speech API and XTTS API compatibility endpoints, allowing you to easily hook KoboldCpp TTS into existing TTS frontends (a short sketch follows this list).
    • Comes with a set of included voices, as well as New Speaker Synthesis, allowing you to create hundreds of new unique voices just by entering a random name. Read more here.
    • All OuteTTS GGUF v0.2 and the NEW v0.3 models are supported, including both the 500M and 1B models.
    • Credits to @ggerganov and @edwko for the original upstream implementation
  • NEW: Bundled GGUF file analyzer: In the GUI Extras tab, or with --analyze, you can now analyze any GGUF file, which will display the metadata and tensor names, dimensions and types within that file.
  • TAESD is now also available for SD3 and Flux! Enable it with --sdvaeauto or "AutoFix VAE" in the GUI. TAESD is now compressed to fp8, making this VAE only about 3 MB in size.
  • VAE tiling for image generation can now be disabled with --sdnotile; this fixes the bleeding graphical artifacts on some cards.
  • Adjusted compatibility build targets: CLBlast (Older CPU) mode no longer requires AVX, providing a good option for very old/cheap systems to still have some level of GPU support. For users with AVX but not AVX2, you can use the Vulkan (Old CPU) mode instead.
  • mmap is no longer the default option. To enable it, you now need --usemmap or set it in the GUI.
  • Fix for save file GUI prompt not working
  • Fix for web browser not launching with --launch in Linux GUI.
  • Added more GUI slider options for context sizes.
  • Max supported images per API request for Multimodal Vision is now increased to 8.
  • Enabled multilingual support for Whisper (voice recognition) by setting specific language codes.
  • KoboldCpp now displays which capabilities and endpoints are enabled on launch.
    • Available Modules: TextGeneration ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer WebSearchProxy TextToSpeech ApiKeyPassword
    • Available APIs: KoboldCppApi OpenAiApi OllamaApi A1111ForgeApi ComfyUiApi WhisperTranscribeApi XttsApi OpenAiSpeechApi
  • Updated Kobold Lite, multiple fixes and improvements
    • Added Whisper language selection: Instead of automatically detecting the speaker language, you can now optionally specify it with a 2 character language code (e.g. ja for Japanese, fr for French). This ensures the output is in the right language.
    • Added Text-To-Speech support for KoboldCpp backend
  • Merged fixes and improvements from upstream
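
A minimal sketch of generating narration through the OpenAI Speech compatibility endpoint once the TTS models are loaded. The path /v1/audio/speech follows the OpenAI convention, and the model and voice values are placeholders; the returned bytes are saved directly to a file (check the docs for the exact audio format):

    # Request narration audio from the TTS compatibility endpoint.
    import json
    import urllib.request

    payload = {
        "model": "outetts",                # placeholder model identifier
        "input": "Hello from KoboldCpp!",  # text to narrate
        "voice": "kobo",                   # placeholder voice / speaker name
    }
    req = urllib.request.Request(
        "http://localhost:5001/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        with open("narration.wav", "wb") as f:
            f.write(resp.read())           # save the returned audio bytes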

Hotfix 1.82.1: Fixed --analyze, which should be working correctly now. Minor fixes to OuteTTS v0.3 handling and updated Lite UI. Whisper now accepts 8-bit and 32-bit WAV files, and form data input.
Hotfix 1.82.2: Added support for Deepseek R1 Qwen Distill
Hotfix 1.82.3: Fixed a TTS crash, CLBlast mislabeling, quiet now overrides debug
Hotfix 1.82.4: Fixed deepseek adapter, draft decoding now accepts slightly different vocabs
