v0.8.2
What's Changed
- fix(gemma4): tool calling improvements and attention mask overhaul by @EricLBuehler in #2059
- feat(tools): mid-stream grammar enforcement for tool calls by @EricLBuehler in #2060
- feat(tools): support harmony tool call grammars by @EricLBuehler in #2061
- feat(tools): add tool call strict mode by @EricLBuehler in #2062
- feat(tools): add tool dispatch and agentic docs by @EricLBuehler in #2063
- feat(docs): improved docs, guides, correctness by @EricLBuehler in #2065
- fix(tools): fixes and cleanup for tool agentic by @EricLBuehler in #2069
- fix(gemma4): tool call and masking fix by @EricLBuehler in #2073
- feat(gemma4): 3.5-5.5x faster moe prefill for quantized cuda case by @EricLBuehler in #2077
- feat(gemma4): ~10% faster moe decode through fused moe decode kernels by @EricLBuehler in #2080
- feat(gemma4,cuda): optimized fused moe decode path by @EricLBuehler in #2090
- fix(gemma4): no paged-attn cache cases by @EricLBuehler in #2091
- Add fast CUDA MMVQ GGUF kernels by @EricLBuehler in #2104
- Add fast CUDA MMQ GGUF kernels by @EricLBuehler in #2109
- feat(core): code execution, file outputs and `/v1/files`, strict tool calling, new docs and ui by @EricLBuehler in #2130
- Bump quinn-proto from 0.11.13 to 0.11.14 by @dependabot[bot] in #2012
- chore(deps): bump tar from 0.4.44 to 0.4.45 by @dependabot[bot] in #2014
- chore(deps): bump devalue from 5.7.1 to 5.8.1 in /docs by @dependabot[bot] in #2133
- chore(deps): bump rustls-webpki from 0.103.9 to 0.103.13 by @dependabot[bot] in #2132
- chore(deps-dev): bump svelte from 5.55.4 to 5.55.7 in /mistralrs-cli/webui by @dependabot[bot] in #2134
- chore(deps): bump astro from 6.1.7 to 6.3.3 in /docs by @dependabot[bot] in #2136
- chore(deps): bump rand from 0.9.2 to 0.9.3 by @dependabot[bot] in #2135
- Bump candle to use new Metal input/output encoder tracking by @EricLBuehler in #2131
- fix(cuda): support AFQ BF16 on sm_75 (#2092) by @atzenhofer in #2126
- fix(gemma4): accept both `expert_intermediate_size` and `moe_intermediate_size` by @EricLBuehler in #2137
- fix(quant): fail missing dummy layers outside uqff by @EricLBuehler in #2138
- fix(install): run metal/xcode toolchain checks outside build_features by @EricLBuehler in #2139
- feat(core, docs): remove automatic downsampling for videos and add install docs for ffmpeg by @EricLBuehler in #2140
- feat(core): support HF_HUB_OFFLINE for loading pre-downloaded models fully offline by @EricLBuehler in #2141
- fix(qwen3_embedding): attention mask handling for flash attn by @EricLBuehler in #2142
- refactor(core): memory usage to handle discrete/unified systems better by @EricLBuehler in #2143
- feat(agentic): remove latex autocorrect from python exec by @EricLBuehler in #2144
- feat(agentic): add sandboxing for agentic code execution by @EricLBuehler in #2145
- docs(sandbox): tweak for clarity in design page by @EricLBuehler in #2146
- fix(core): use `from_env` for sandboxed apps by @setoelkahfi in #2064
- feat(cli): add smart quantization and agentic presets by @EricLBuehler in #2152
- feat(cli): add verbosity-controlled logging by @EricLBuehler in #2154
- feat(agent): add app-driven tool approvals by @EricLBuehler in #2155
- fix(agentic): add prefix to read and list file tools by @EricLBuehler in #2158
- feat(gemma4): support MTP speculative decoding! by @EricLBuehler in #2159
- feat(gemma4): optimize CUDA prompt and decode performance by @EricLBuehler in #2161
- fix(cuda): support bf16 indexed moe input quantization by @EricLBuehler in #2162
- feat(bench): improve benchmark sweeps by @EricLBuehler in #2163
- feat(gemma4): further optimize CUDA MoE prefill and decode by @EricLBuehler in #2165
- feat(metal): optimize Gemma 4 prefill and decode on Apple Silicon by @EricLBuehler in #2166
- feat(gemma4): optimize metal MoE perf by @EricLBuehler in #2179
- feat(cuda): implement cuda graphs and various optimizations by @EricLBuehler in #2180
New Contributors
- @atzenhofer made their first contribution in #2126
Full Changelog: v0.8.0...v0.8.2