v0.14.0a2
Pre-release
What's Changed
🚀 Features
- FP8 kv cache quantization by @CUHKSZzxy in #4563
- Support Qwen3 Omni by @CUHKSZzxy in #4411
- support qwen3.5(vit) inference in turbomind backend by @irexyc in #4602
- Add OpenAI Responses-compatible endpoint by @CUHKSZzxy in #4582
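For readers who want to try the Responses-compatible endpoint from #4582, the following is a minimal sketch of building such a request with the Python standard library only. It assumes the server mirrors OpenAI's `/v1/responses` path and the `model` and `input` body fields; these notes do not confirm either. The base URL and model name are placeholders, and `build_responses_request` is a hypothetical helper, not part of any library.

```python
import json
import urllib.request


def build_responses_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-style Responses endpoint."""
    # Assumption: the path and field names follow OpenAI's Responses API.
    url = base_url.rstrip("/") + "/v1/responses"
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address and model name; adjust to your deployment.
req = build_responses_request("http://localhost:23333", "my-model", "Hello")
```

Once a server is running, the request can be sent with `urllib.request.urlopen(req)` and the JSON body read from the response.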
💥 Improvements
- Update turbomind modeling infrastructure by @lzhangzz in #4557
- refactor(turbomind): consolidate CUDA error handling and add manual stacktracing by @lzhangzz in #4565
- Add Qwen3.5 Moe lite awq by @43758726 in #4561
- [Improve]: Drain queues when putting the engine to sleep by @RunningLeon in #4577
- Extend chat completions by introducing token-in/out and returning routed experts by @lvhan028 in #4593
- Follow OpenAI's spec to add "AllowedToolChoice" and report 400 when request parsing fails by @lvhan028 in #4585
- Improve health endpoint by @lvhan028 in #4615
- Remove state init by @grimoire in #4604
- Include spec stats in metrics by @RunningLeon in #4625
- Add raw chat completion logprob output by @lvhan028 in #4637
- fix(pytorch): offload guided decoding CPU ops to thread pool to prevent event loop blocking by @windreamer in #4590
- update gated delta rule state layout by @grimoire in #4636
- Improve kernel dispatch for dp>1 by @RunningLeon in #4653
- Extend v1/messages by introducing token-in/out and returning routed experts by @lvhan028 in #4642
- Fuse gdr preprocess by @grimoire in #4656
- refactor: simplify multimodal preprocessing expansion by @CUHKSZzxy in #4663
🐞 Bug fixes
- fix the anthropic adapter by @lvhan028 in #4578
- Fix Structured Output for GPT-OSS Models by @windreamer in #4386
- Allow W8A8Linear to accept dtype during initialization instead of hard code by @43758726 in #4586
- fix: compact split multimodal tensors by @CUHKSZzxy in #4583
- Fix legacy VLM preprocessors for normalized image data by @CUHKSZzxy in #4584
- fix Dockerfile that was missing common.txt by @lvhan028 in #4608
- fix: enable FA3 for SM80+ GPUs and fix CUDA version comparison by @windreamer in #4591
- flatten_kv_cache zero padding by @grimoire in #4613
- align streaming usage chunks with OpenAI spec by @lvhan028 in #4616
- fix(vl): reduce multimodal feature memory use by @CUHKSZzxy in #4603
- fix memory leak when input contains large image data by @grimoire in #4610
- fix(turbomind): map Intern-S1 HF checkpoint keys by @lvhan028 in #4617
- fix(serve): emit all stream_chunk deltas to fix concurrent tool-call streaming by @lvhan028 in #4622
- fix cp inference by @irexyc in #4619
- refactor(serve): avoid per-request tokenizer work in parsers by @lvhan028 in #4633
- Bring MixtralForCausalLM back to Turbomind by @43758726 in #4623
- fix model loading on windows by @irexyc in #4626
- Fix mtp cudagraph when no warmup in RL by @RunningLeon in #4641
- fix: remove hard CUDA_PATH assert on Windows, search DLL paths from multiple sources by @windreamer in #4628
- Fix unit test by removing latest-transformers-unsupported models by @lvhan028 in #4649
- Fix qwen3.5 mtp by @RunningLeon in #4652
- fix gdr kernel for tilelang>=0.1.9 by @grimoire in #4660
- [Fix]: Revert the reuse of cudagraph buffer for mtp by @RunningLeon in #4661
- Fix client-disconnect session leaks in PyTorch MP engine by @grimoire in #4655
- fix cancel stopped seq by @RunningLeon in #4654
- feat: support num_experts_per_tok=10 in turbomind backend by @irexyc in #4665
- fix batched seqs with different stop words by @RunningLeon in #4671
- Move warmup inside wakeup by @RunningLeon in #4667
- Fix dequant_mixed by @irexyc in #4657
- Improve engine health monitoring by @lvhan028 in #4645
- fix qwen3.5 27b gdr preprocess by @grimoire in #4676
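Client code that reads streamed chat completions may need a small adjustment after the usage-chunk alignment in #4616. In the OpenAI spec, when usage reporting is requested through `stream_options.include_usage`, one extra chunk arrives before `data: [DONE]` with an empty `choices` list and a populated `usage` object, while the other chunks carry `usage: null`. The sketch below parses such a stream defensively. The sample lines are hand-written for illustration and are not captured server output, and `collect_stream` is a hypothetical helper.

```python
import json


def collect_stream(lines):
    """Join content deltas and pick up the usage-only chunk from SSE lines."""
    text_parts, usage = [], None
    for line in lines:
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # The usage chunk has an empty `choices` list, so iterate
        # instead of indexing choices[0] directly.
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                text_parts.append(content)
        if chunk.get("usage"):
            usage = chunk["usage"]
    return "".join(text_parts), usage


# Hand-written sample stream (illustrative only).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"content": "Hi"}}], "usage": null}',
    'data: {"choices": [{"index": 0, "delta": {"content": " there"}}], "usage": null}',
    'data: {"choices": [], "usage": {"prompt_tokens": 3, "completion_tokens": 2, "total_tokens": 5}}',
    "data: [DONE]",
]
text, usage = collect_stream(sample)
```

Iterating over `choices` rather than indexing the first element keeps the parser working for both content chunks and the usage-only chunk.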
📚 Documentation
- docs: update multimodal model support docs by @CUHKSZzxy in #4643
🌐 Other
- chore: gate request logs behind request level by @CUHKSZzxy in #4581
- add missing rdkit for intern-s models by @lvhan028 in #4587
- extract common deps into requirements/common.txt by @lvhan028 in #4595
- Remove stale CLI arg in vlmevalkit docs by @CUHKSZzxy in #4598
- log response for debugging by @lvhan028 in #4592
- cancel in-progress runs when PR is updated or merged by @lvhan028 in #4609
- TEST: update qwen3.5 397b test by @littlegy in #4607
- TEST: update video test by @littlegy in #4606
- Validate final chat response structure by @lvhan028 in #4621
- Support dp for qwen35 mtp by @RunningLeon in #4611
- [ci] refactor testcoverage config by @zhulinJulia24 in #4630
- TEST: update ascend and mtp test config by @littlegy in #4659
- TEST: update FP8 processing logic and remove duplicate MTP tests by @littlegy in #4668
- freeze tilelang version by @grimoire in #4669
- fix windows ci by @irexyc in #4672
- [ci] add mtp test config in pr_test by @zhulinJulia24 in #4651
Full Changelog: v0.13.0...0.14.0a2