Release v0.14.0a2 · InternLM/lmdeploy

What's Changed

🚀 Features

FP8 kv cache quantization by @CUHKSZzxy in #4563
Support Qwen3 Omni by @CUHKSZzxy in #4411
support qwen3.5(vit) inference in turbomind backend by @irexyc in #4602
Add OpenAI Responses-compatible endpoint by @CUHKSZzxy in #4582

💥 Improvements

Update turbomind modeling infrastructure by @lzhangzz in #4557
refactor(turbomind): consolidate CUDA error handling and add manual stacktracing by @lzhangzz in #4565
Add Qwen3.5 Moe lite awq by @43758726 in #4561
[Improve]: Drain queues when sleep engine by @RunningLeon in #4577
Extend chat completions by introducing token-in/out and returning routed experts by @lvhan028 in #4593
Follow openai's spec to add "AllowedToolChoice" and report 400 when parsing request failed by @lvhan028 in #4585
Improve health endpoint by @lvhan028 in #4615
Remove state init by @grimoire in #4604
Include spec stats in metrics by @RunningLeon in #4625
Add raw chat completion logprob output by @lvhan028 in #4637
fix(pytorch): offload guided decoding CPU ops to thread pool to prevent event loop blocking by @windreamer in #4590
update gated delta rule state layout by @grimoire in #4636
Improve kernel dispatch for dp>1 by @RunningLeon in #4653
Extend v1/messages by introducing token-in/out and returning routed experts by @lvhan028 in #4642
Fuse gdr preprocess by @grimoire in #4656
refactor: simplify multimodal preprocessing expansion by @CUHKSZzxy in #4663

🐞 Bug fixes

fix the anthropic adapter by @lvhan028 in #4578
Fix Structured Output for GPT-OSS Models by @windreamer in #4386
Allow W8A8Linear to accept dtype during initialization instead of hard code by @43758726 in #4586
fix: compact split multimodal tensors by @CUHKSZzxy in #4583
Fix legacy VLM preprocessors for normalized image data by @CUHKSZzxy in #4584
fix dockerfile which is missing common.txt by @lvhan028 in #4608
fix: enable FA3 for SM80+ GPUs and fix CUDA version comparison by @windreamer in #4591
flatten_kv_cache zero padding by @grimoire in #4613
align streaming usage chunks with OpenAI spec by @lvhan028 in #4616
fix(vl): reduce multimodal feature memory use by @CUHKSZzxy in #4603
fix memleak when input contains large image data by @grimoire in #4610
fix(turbomind): map Intern-S1 HF checkpoint keys by @lvhan028 in #4617
fix(serve): emit all stream_chunk deltas to fix concurrent tool-call streaming by @lvhan028 in #4622
fix cp inference by @irexyc in #4619
refactor(serve): avoid per-request tokenizer work in parsers by @lvhan028 in #4633
Bring MixtralForCausalLM back to Turbomind by @43758726 in #4623
fix model loading on windows by @irexyc in #4626
Fix mtp cudagraph when no warmup in RL by @RunningLeon in #4641
fix: remove hard CUDA_PATH assert on Windows, search DLL paths from multiple sources by @windreamer in #4628
Fix unit test by removing latest-transformers-unsupported models by @lvhan028 in #4649
Fix qwen3.5 mtp by @RunningLeon in #4652
fix gdr kernel for tilelang>=0.1.9 by @grimoire in #4660
[Fix]: Revert the reuse of cudagraph buffer for mtp by @RunningLeon in #4661
Fix client-disconnect session leaks in PyTorch MP engine by @grimoire in #4655
fix cancel stopped seq by @RunningLeon in #4654
feat: support num_experts_per_tok=10 in turbomind backend by @irexyc in #4665
fix batched seqs with different stop words by @RunningLeon in #4671
Move warmup inside wakeup by @RunningLeon in #4667
Fix dequant_mixed by @irexyc in #4657
Improve engine health monitoring by @lvhan028 in #4645
fix qwen3.5 27b gdr preprocess by @grimoire in #4676

📚 Documentation

docs: update multimodal model support docs by @CUHKSZzxy in #4643

🌐 Other

chore: gate request logs behind request level by @CUHKSZzxy in #4581
miss rdkit for intern-s models by @lvhan028 in #4587
extract common deps into requirements/common.txt by @lvhan028 in #4595
Remove stale cli arg in vlmevalkit docs by @CUHKSZzxy in #4598
log response for debugging by @lvhan028 in #4592
cancel in-progress runs when PR is updated or merged by @lvhan028 in #4609
TEST: update qwen3.5 397b test by @littlegy in #4607
TEST: update video test by @littlegy in #4606
Validate final chat response structure by @lvhan028 in #4621
Support dp for qwen35 mtp by @RunningLeon in #4611
[ci] refactor testcoverage config by @zhulinJulia24 in #4630
TEST: update ascend and mtp test config by @littlegy in #4659
TEST: update FP8 processing logic and remove duplicate MTP tests by @littlegy in #4668
freeze tilelang version by @grimoire in #4669
fix windows ci by @irexyc in #4672
[ci] add mtp test config in pr_test by @zhulinJulia24 in #4651

Full Changelog: v0.13.0...0.14.0a2
