
Releases: Anbeeld/beellama.cpp

v0.1.1

11 May 02:50


  • Improved agentic tool-call reliability with lazy grammars. DFlash now remains enabled before a lazy grammar trigger, but stops speculating once grammar-constrained output or reasoning-budget forcing requires normal token-by-token sampling.
  • Fixed DFlash accept bookkeeping at grammar and tool-call boundaries. The server now distinguishes accepted draft tokens from bonus-token-shaped results, updates DFlash hidden-state rows with the root plus accepted draft tokens, and uses the same keep count for rollback.
  • Added a DFlash suppression guard for raw tool-call markers. When a tool marker appears while lazy grammar is enabled, the server suppresses DFlash for the rest of that response without steering sampler state; fenced code and embedded marker-like strings are excluded from the guard.
  • Made partial OpenAI-compatible tool-call streaming safer. The server can stream a stable tool name/id early so clients can show a pending tool call, while withholding partial arguments until the parser sees a complete call.
  • Quarantined malformed raw tool-call text in tool-parsing streams. Unfinished or malformed tool-looking text no longer leaks into visible assistant content or hidden reasoning deltas before the parser can classify it.
  • Accepted direct tag-style function starts for Qwen-style tool calls. Lazy grammar triggers now include structural function markers such as <function=, and the tag parser can parse valid direct function calls without the outer <tool_call> wrapper.
  • Added regression coverage for Kimi and Qwen tool-call streaming, malformed raw marker quarantine, fenced-code false positives, direct Qwen function calls, lazy grammar triggers, and DFlash speculative boundary plumbing.
  • Fixed small build issues found after 0.1.0: the DFlash callback setup now uses an explicit callback type for GCC 15, and tests/server code include the required standard headers for INT_MAX and FLT_MAX.
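
The partial tool-call streaming above can be illustrated from the client side: a UI can render a pending call as soon as the stable name/id delta arrives, then attach the arguments once they stream in. Below is a minimal Python sketch over hypothetical OpenAI-style delta dicts; the chunk shapes are assumptions for illustration, not this server's exact wire format:

```python
def accumulate_tool_calls(deltas):
    """Merge a stream of OpenAI-style tool_call deltas into final calls.

    Each delta may carry a stable id/name early (enough to render a
    pending tool call) while the arguments only arrive once the server's
    parser has seen a complete call.
    """
    calls = {}  # tool-call index -> partial call being assembled
    for delta in deltas:
        for tc in delta.get("tool_calls", []):
            slot = calls.setdefault(
                tc["index"], {"id": "", "name": "", "arguments": ""}
            )
            if "id" in tc:
                slot["id"] = tc["id"]
            fn = tc.get("function", {})
            slot["name"] += fn.get("name", "")
            slot["arguments"] += fn.get("arguments", "")
    return [calls[i] for i in sorted(calls)]


# Simulated stream: the name/id delta arrives first, arguments later.
stream = [
    {"tool_calls": [{"index": 0, "id": "call_0",
                     "function": {"name": "get_weather"}}]},
    {"tool_calls": [{"index": 0,
                     "function": {"arguments": '{"city": "Oslo"}'}}]},
]
print(accumulate_tool_calls(stream))
```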


v0.1.0

09 May 16:31


  • DFlash speculative decoding: --spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a per-layer 4096-slot ring buffer; the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for target verification.
  • TurboQuant / TCQ KV-cache compression: Five cache types (turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) spanning from 4x to 7.5x compression, with higher-bit options being practically lossless in many cases. Set independently with --cache-type-k and --cache-type-v.
  • Adaptive draft-max control: The server adjusts the active draft horizon at runtime instead of using a fixed --spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline; the fringe alternative maps acceptance-rate bands to draft depth. Use --no-spec-dm-adaptive for a static horizon.
  • Full multimodal support: When --mmproj is active, the server keeps flat DFlash available for text generation, and the model can be fully offloaded to CPU to reduce VRAM pressure.
  • Reasoning-loop protection: The server detects repeated hidden reasoning output and intervenes. The default mode is force-close, with --reasoning-loop-window and --reasoning-loop-max-period available for tuning.
  • Sampled DFlash verification: --spec-draft-temp enables rejection-sampling drafter behavior. Activates when both draft and target temperature exceed zero. Draft log probabilities must be available for rejection sampling to produce correct output.
  • DDTree branch verification: optional --spec-branch-budget adds branch nodes beyond the main draft path with GPU parent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much work in progress!
  • Request-level speculative overrides: Draft-max and branch budget can be overridden per-request through JSON fields without restarting the server.
  • CopySpec model-free speculation: --spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model. Results must be benchmarked per workload.
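
Putting a few of the flags above together: a hypothetical launch line, assuming the binary keeps upstream llama.cpp's llama-server name and -m model flag. The flag pointing at the DFlash draft GGUF is not shown, and the numeric values are illustrative rather than recommended defaults:

```shell
# Illustrative only: binary name, model path, and values are assumptions.
llama-server \
  -m target-model.gguf \
  --spec-type dflash \
  --spec-dflash-cross-ctx 2048 \
  --cache-type-k turbo3_tcq \
  --cache-type-v turbo3
```

The adaptive draft-max controller stays on by default; add --no-spec-dm-adaptive to pin a static horizon via --spec-draft-n-max.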
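
The CopySpec idea can be sketched without any model: find an earlier occurrence of the current token suffix and propose the tokens that followed it. The toy Python version below (function name and parameters invented for illustration) rescans the context directly, where a real implementation would index positions with a rolling hash:

```python
def copyspec_draft(tokens, n_draft=4, min_match=2):
    """Propose draft tokens by copying from earlier context.

    If the last `min_match` tokens also occurred earlier, speculate that
    the tokens which followed that earlier occurrence will repeat.
    """
    if len(tokens) < min_match:
        return []
    suffix = tuple(tokens[-min_match:])
    # Scan earlier start positions, most recent first (a real
    # implementation would use a rolling-hash index, not a rescan).
    for start in range(len(tokens) - min_match - 1, -1, -1):
        if tuple(tokens[start:start + min_match]) == suffix:
            return tokens[start + min_match:start + min_match + n_draft]
    return []


# Repetitive context: the suffix "the quick" was seen before, so the
# tokens that followed it are proposed as the draft.
ctx = ["the", "quick", "brown", "fox", "jumps", "over", "the", "quick"]
print(copyspec_draft(ctx))  # -> ['brown', 'fox', 'jumps', 'over']
```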
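
The reasoning-loop check amounts to period detection over recent output. A minimal character-level sketch, with invented parameter names that merely echo the --reasoning-loop-window and --reasoning-loop-max-period flags (the real detector presumably works on tokens, not characters):

```python
def reasoning_loop_period(text, window=32, max_period=8):
    """Return the smallest period p <= max_period with which the last
    `window` characters repeat, or 0 if no such loop is found."""
    if len(text) < window:
        return 0
    tail = text[-window:]
    for p in range(1, max_period + 1):
        if all(tail[i] == tail[i % p] for i in range(window)):
            return p
    return 0


print(reasoning_loop_period("ha" * 20))  # -> 2: "haha..." loops with period 2
```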
