Releases: InternLM/lmdeploy
v0.14.0a1
What's Changed
🚀 Features
- FP8 kv cache quantization by @CUHKSZzxy in #4563
💥 Improvements
- Update turbomind modeling infrastructure by @lzhangzz in #4557
- refactor(turbomind): consolidate CUDA error handling and add manual stacktracing by @lzhangzz in #4565
- Add Qwen3.5 Moe lite awq by @43758726 in #4561
- [Improve]: Drain queues when sleep engine by @RunningLeon in #4577
- Extend chat completions by introducing token-in/out and returning routed experts by @lvhan028 in #4593
- Follow openai's spec to add "AllowedToolChoice" and report 400 when parsing request failed by @lvhan028 in #4585
- Improve health endpoint by @lvhan028 in #4615
- Remove state init by @grimoire in #4604
- Include spec stats in metrics by @RunningLeon in #4625
🐞 Bug fixes
- fix the anthropic adapter by @lvhan028 in #4578
- Fix Structured Output for GPT-OSS Models by @windreamer in #4386
- Allow W8A8Linear to accept dtype during initialization instead of hard code by @43758726 in #4586
- fix: compact split multimodal tensors by @CUHKSZzxy in #4583
- Fix legacy VLM preprocessors for normalized image data by @CUHKSZzxy in #4584
- fix dockerfile which missing common.txt by @lvhan028 in #4608
- fix: enable FA3 for SM80+ GPUs and fix CUDA version comparison by @windreamer in #4591
- flatten_kv_cache zero padding by @grimoire in #4613
- align streaming usage chunks with OpenAI spec by @lvhan028 in #4616
- fix(vl): reduce multimodal feature memory use by @CUHKSZzxy in #4603
- fix memleak when input contain large image data by @grimoire in #4610
- fix(turbomind): map Intern-S1 HF checkpoint keys by @lvhan028 in #4617
- fix(serve): emit all stream_chunk deltas to fix concurrent tool-call streaming by @lvhan028 in #4622
- fix cp inference by @irexyc in #4619
- refactor(serve): avoid per-request tokenizer work in parsers by @lvhan028 in #4633
- Bring MixtralForCausalLM back to Turbomind by @43758726 in #4623
- fix model loading on windows by @irexyc in #4626
🌐 Other
- chore: gate request logs behind request level by @CUHKSZzxy in #4581
- miss rdkit for intern-s models by @lvhan028 in #4587
- extract common deps into requirements/common.txt by @lvhan028 in #4595
- Remove staled cli arg in vlmevalkit docs by @CUHKSZzxy in #4598
- log response for debugging by @lvhan028 in #4592
- cancel in-progress runs when PR is updated or merged by @lvhan028 in #4609
- TEST: update qwen3.5 397b test by @littlegy in #4607
- TEST: update video test by @littlegy in #4606
Full Changelog: v0.13.0...0.14.0a1
v0.13.0
What's Changed
🚀 Features
- [Ascend] support qwen3.5 35BA3B by @wanfengcxz in #4485
- feat: Add TurboQuant (quant_policy=42) support for KV Cache Quantization by @windreamer in #4510
- [refactor] [api_server] [2/N] improve tool parsers by abstracting xml parser by @lvhan028 in #4548
- feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE inference on Blackwell GPUs with memory copy optimizations by @hd9568 in #4490
- feat: add Anthropic-compatible serving endpoints by @lvhan028 in #4538
- Support InternS2 Preview by @CUHKSZzxy in #4575
💥 Improvements
- lmdeploy support kernel block size by @Tsundoku958 in #4421
- Reject requests on stale session or sleeping engine by @lvhan028 in #4496
- Add modern logging utils by @lzhangzz in #4486
- refine dlinfer update_weights by @yao-fengchen in #4519
- feat(serve): expose repetition n-gram params on OpenAI routes by @lvhan028 in #4522
- Refactor step inputs by @grimoire in #4504
- fix lite module for transformers>=5.0 by @43758726 in #4488
- [refactor] [api_server] [1/N] Improve reasoning and tool-call parsers by @lvhan028 in #4468
- fix: prevent prefill starvation under high decode load by @grimoire in #4532
- Mixed modality by @CUHKSZzxy in #4531
- optimize get_sorted_idx in moe by @grimoire in #4529
- Map user-input session_id to internal session_id to maintain session identity by @lvhan028 in #4523
- support more message item types by @CUHKSZzxy in #4501
- add explicit trust_remote_code controls to resolve the security issue by @lvhan028 in #4511
🐞 Bug fixes
- [ascend] fix prefix caching by @yao-fengchen in #4448
- fix update params by @CUHKSZzxy in #4514
- fix ray mem leak by @grimoire in #4487
- Fix mtp by @RunningLeon in #4517
- fix kernel-block-size by @grimoire in #4521
- fix: use `is not None` check for seed to prevent seed=0 being silently ignored by @kuishou68 in #4526
- Fix qwen35 dp by @grimoire in #4535
- Fix mtp for rl by @RunningLeon in #4520
- cancel request and block new inputs when sleeping by @grimoire in #4541
- Fix mp engine by @RunningLeon in #4540
- Fix cache sizing and cache block layout edge cases by @grimoire in #4552
- Fix qwen3.5-moe mtp with tp>1 by @RunningLeon in #4568
- block_offsets padding 0 by @grimoire in #4569
- hotfix: resolve test issues for v0.13.0 by @lvhan028 in #4571
- ResponseParser forget to strip tag in non-stream mode by @lvhan028 in #4576
- yield error when prompt processing suffers exception by @lvhan028 in #4574
- Fix the reprefill of evicted seqs with invalid draft tokens by @RunningLeon in #4564
- Support mtp fp8 by @RunningLeon in #4572
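The seed fix listed above (#4526) addresses a common Python pitfall: a truthiness test treats `0` the same as "no seed given". The sketch below illustrates the general pattern only. The helper names `make_rng_buggy` and `make_rng_fixed` are hypothetical and are not taken from the lmdeploy code base.

```python
import random
from typing import Optional


def make_rng_buggy(seed: Optional[int] = None) -> random.Random:
    # Truthiness check: seed=0 is falsy, so it is handled like a missing seed
    # and the generator is seeded from system entropy instead.
    if seed:
        return random.Random(seed)
    return random.Random()


def make_rng_fixed(seed: Optional[int] = None) -> random.Random:
    # Identity check: only a missing seed (None) falls back to system entropy,
    # so seed=0 is honored like any other integer.
    if seed is not None:
        return random.Random(seed)
    return random.Random()
```

With the identity check, two generators built with `seed=0` produce the same sequence, which is what a caller asking for reproducible sampling would expect.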
🌐 Other
- Use env LMDEPLOY_FP32_MAMBA_SSM_DTYPE to control the dtype of recurrent state by @lvhan028 in #4518
- add tool and reasoning test by @littlegy in #4388
- update h config and add glm4.7 mtp test by @littlegy in #4424
- [ci] change test whl into python 312 and use test images by @zhulinJulia24 in #4513
- [Misc] fix typos in turbomind.py and model.py by @ZhijunLStudio in #4543
- [Misc] fix mutable default arguments by @ZhijunLStudio in #4544
- Add docker/Dockerfile_patch; minor tweaks in messages.py and setup.py. by @lvhan028 in #4546
- remove barely used skills and checkin docker-build skill by @lvhan028 in #4560
- bump version to v0.13.0 by @lvhan028 in #4549
New Contributors
- @kuishou68 made their first contribution in #4526
- @ZhijunLStudio made their first contribution in #4543
- @hd9568 made their first contribution in #4490
Full Changelog: v0.12.3...v0.13.0
v0.12.3
What's Changed
🚀 Features
- Support video inputs by @CUHKSZzxy in #4360
- feat: fully implement compressed-tensors gs32 support in TurboMind by @lapy in #4429
- Draft model update params by @CUHKSZzxy in #4452
💥 Improvements
- support qwen3.5 on volta by @grimoire in #4405
- Optimize Qwen3.5 by @lzhangzz in #4434
- Builtin mrope by @grimoire in #4393
- delete ray remote function return value by @grimoire in #4422
- support cache_seqlen on recurrent-gdr and causal-conv1d-update by @grimoire in #4417
- safe ray api by @CUHKSZzxy in #4455
- add R3 for qwen3-vl-moe models by @lvhan028 in #4457
- Align rope init in lmdeploy by @RangiLyu in #4466
- Make tilelang a Linux-only dependency (like triton) by @Copilot in #4469
- prepare chunk indices before cache initialize by @grimoire in #4458
- unify rope device by @CUHKSZzxy in #4467
- custom processor args by @CUHKSZzxy in #4472
- Assign sequential api_server ports when proxy_url is unset by @lvhan028 in #4416
- disable fla intracard_backend by @grimoire in #4482
- [Fix][Feat] Fix worker sorting with external pg bundles & Support persistent buffer for update_params by @CyCle1024 in #4397
- simplify interns1 pro codes by @CUHKSZzxy in #4480
🐞 Bug fixes
- fix test_hf_overrides for transformers>5 by @grimoire in #4418
- fix qwen3.5 pytorch multimodal inference by @CUHKSZzxy in #4430
- fix `generate` endpoint by @CUHKSZzxy in #4432
- Make Intern-S1-Pro compatible with Transformers 5.0+ by @lvhan028 in #4435
- fix multiround chat by @CUHKSZzxy in #4438
- fix(async_engine): make safe_run cancellation cleanup reliable with shield and SafeRunException by @lvhan028 in #4439
- release state cache by @CUHKSZzxy in #4462
- Split/tool call args json for qwen3coder tool calls (Qwen3.5) by @lapy in #4433
- fix(turbomind): fix dimension mismatch in ApplyTokenBitmaskInplace by @windreamer in #4456
- fix metrics by @CUHKSZzxy in #4410
- fix security issues by @CUHKSZzxy in #4447
- fix qwen3.5 fp8 support by @grimoire in #4470
- fix image / video resize function by @CUHKSZzxy in #4478
- fix dynamic ntk device by @CUHKSZzxy in #4483
- fix pagedattention pointer range by @grimoire in #4494
- fix glm4.7-flash by @grimoire in #4500
- Fix torch awq by @grimoire in #4503
🌐 Other
- [ci] add legacy test workflow and test config by @zhulinJulia24 in #4387
- chore: add CLAUDE.md and Claude Code skills by @CUHKSZzxy in #4413
- Fix CI errors including linting error and unit test error by @lvhan028 in #4431
- Use pyupgrade and ruff to modernize LMDeploy Python Code by @windreamer in #4392
- reduce ci memory by @irexyc in #4471
- fix: add safe.directory for git in docker workflows by @windreamer in #4474
- [ci] add nightly docker build workflow by @zhulinJulia24 in #4406
- split docker wheel preparation into staged build steps and use python 3.12 as the default version by @lvhan028 in #4476
- [Feat]: Support qwen35 with mtp by @RunningLeon in #4437
- bump version to v0.12.3 by @lvhan028 in #4493
Full Changelog: v0.12.2...v0.12.3
v0.12.2
What's Changed
🚀 Features
- support glm5 by @grimoire in #4355
- Qwen/Internlm/Llama Dense/Moe model fp8 quant online by @43758726 in #4324
- Qwen3.5 by @grimoire in #4351
- GLM-4.7-Flash Turbomind support by @lapy in #4362
- Support router replay and ignore quant layer for qwen3.5 by @RunningLeon in #4394
- [Feature] Add TurboMind support for Qwen3.5 models (dense + MoE) by @lapy in #4389
- support repetition ngram logits processor by @grimoire in #4288
💥 Improvements
- Compatible with transformers 5.0 at TurboMind side by @lvhan028 in #4304
- Support fp32 head for qwen and internlm models by @RunningLeon in #4160
- Reduce MLA kv-cache memory by @lzhangzz in #4373
- add recurrent_gated_delta_rule kernel by @grimoire in #4376
- [ascend]adapt for s1-pro dp*tp+ep by @yao-fengchen in #4380
- Support glm4.7 with mtp by @RunningLeon in #4346
- Faster MLA kernels by @lzhangzz in #4391
- Attention kernel self-registration and decoupled dispatching by @lzhangzz in #4396
🐞 Bug fixes
- fix: change debug log from ERROR to DEBUG in RepetitionPenaltyKernel by @murray-macdonald in #4363
- Fix quant config parsing for internvl awq model by @RunningLeon in #4369
- Fix XGrammar bitmask initialization and add null check for gen_config in generate method by @windreamer in #4349
- fix the logic of closing session by @lvhan028 in #4370
- Fix authorization by @lvhan028 in #4338
- Fix some minor issues and provide tests for Pipeline by @windreamer in #4365
- fix dllm mask on set_step by @grimoire in #4278
- fix models for transformers>=5 by @grimoire in #4381
- fix exception when aborting a request by @lvhan028 in #4403
- fix inference crashed on v100 with qwen3.5-0.8b by @lvhan028 in #4420
🌐 Other
- ci(lint): skip flaky deadlink test for python wiki page by @windreamer in #4357
- fix fa3 install by @irexyc in #4361
- fix lint by @windreamer in #4375
- upgrade triton and torch by @grimoire in #4379
- Add speculative decoding test by @littlegy in #4377
- ci: integrate clang-format lint into pre-commit hooks by @windreamer in #4390
- Update dockerfile by removing cu11 and changing cu12.4 to cu12.6 by @lvhan028 in #4398
- manually build dev image instead of publishing it every version by @lvhan028 in #4409
- bump version to v0.12.2 by @lvhan028 in #4378
New Contributors
- @murray-macdonald made their first contribution in #4363
- @lapy made their first contribution in #4362
Full Changelog: v0.12.1...v0.12.2
v0.12.1
What's Changed
🚀 Features
- support glm-4.7-flash by @RunningLeon in #4320
- [ascend] support ep by @yao-fengchen in #3696
💥 Improvements
- fix rotary embedding for transformers v5 by @grimoire in #4303
- Improve metrics log by @CUHKSZzxy in #4297
- Support ignore layers in quant config for qwen3 models by @RunningLeon in #4293
- add custom noaux kernel by @grimoire in #4345
- fix qwen3vl with transformers5 by @grimoire in #4348
🐞 Bug fixes
- fix tool call parser's streaming cursor by @lvhan028 in #4333
- Fix data race for guided decoding in TP mode by @lzhangzz in #4341
- fa3 check by @grimoire in #4340
- Fix time series preprocess by @CUHKSZzxy in #4339
- Negative KV sequence length error in Attention op by @jinminxi104 in #4316
- fix qwen3-vl-moe long context by @grimoire in #4342
- fix: move quantized norm to CPU instead of stale q_linear reference in smooth_quant by @Mr-Neutr0n in #4352
- update noaux-kernel check by @grimoire in #4358
🌐 Other
- change INPUT_CUDA_VERSION to 12.6.2 by @lvhan028 in #4322
- add Qwen3-8B accuracy evaluation in llm_compressor.md by @43758726 in #4319
- [ci] refactor ete testcase by @zhulinJulia24 in #4274
- Set alias interns1_1 for interns1_pro by @lvhan028 in #4334
- build(docker): skip FA2 when use cu13 by @windreamer in #4356
- bump version to v0.12.1 by @lvhan028 in #4350
New Contributors
- @Mr-Neutr0n made their first contribution in #4352
Full Changelog: v0.12.0...v0.12.1
v0.12.0
What's Changed
🚀 Features
- Add Gloo communication to turbomind by @irexyc in #3362
- [Feat] Support llm-compressor AWQ models in TurboMind by @43758726 in #4290
- Router replay for gpt oss by @RunningLeon in #4298
- Support llm-compressor symmetric quantized model inference in TurboMind by @43758726 in #4305
- Support Intern-S1-Pro by @CUHKSZzxy in #4318
💥 Improvements
- Configurable max CTAs and NVLS usage for CUDA IPC communicator by @lzhangzz in #4227
- Improve aborting all sessions by @lvhan028 in #4215
- Moe Reduce kernel by @grimoire in #4228
- Refactor attn by @grimoire in #4238
- Optimize exception raising and error process by @grimoire in #4236
- [AsyncEngine Refactor 1/N] define MultimodalProcessor to handle multimodal data processing by @lvhan028 in #4250
- [AsyncEngine Refactor 2/N] Remove deprecates from chat template by @lvhan028 in #4252
- Configurable uvicorn timeout by @CUHKSZzxy in #4255
- Adapt to dlsime v0.0.2 by @JimyMa in #4242
- [Fix] fix quant calibration dataset by @43758726 in #4256
- lmdeploy support parallel embedding by @Tsundoku958 in #4192
- Refactor turbomind engine by @lzhangzz in #4223
- Refactor Engine & ModelAgent interact by @grimoire in #4265
- Support sleep and destroy deepep buffer by @RunningLeon in #4246
- add yarn truncate by @grimoire in #4301
- [AsyncEngine Refactor 3/N] Introduce Session and SessionManager by @lvhan028 in #4253
- Add warning about NCCL 2.27 memory leaks by @lzhangzz in #4313
🐞 Bug fixes
- Fix fope cos/sin coef device type by @CUHKSZzxy in #4240
- Fix include_stop_str_in_output with output_logits Exception by @windreamer in #4244
- fix logit softcapping is None by @grimoire in #4247
- Fix performance regression for prefix caching by @lzhangzz in #4270
- convert float16 weight to bfloat16 for FP8 models by @lvhan028 in #4276
- [ascend] fix dp multinode rank_table mapping by @tangzhiyi11 in #4268
- [Fix] move calibrate load dataset location by @43758726 in #4289
- fix ignore-eos by @grimoire in #4282
- fix MPEngine poll by @grimoire in #4287
- Fix prefix caching by @lzhangzz in #4292
- Fix gemma chat template by @lvhan028 in #4280
- Fix scheduler metrics by @lzhangzz in #4294
- Fix NVLS init for mixed DP+TP by @lzhangzz in #4296
- [side-effect] The tool message dump is incomplete by @lvhan028 in #4299
- Fix mla with spec tokens by @RunningLeon in #4302
- fix stop long context by @grimoire in #4309
- fix crash on client disconnect (Ctrl+C) by @lvhan028 in #4308
- Ensure the pipe benchmark uses kwargs when calling `pipe.stream_infer` by @lvhan028 in #4312
- fix get_ppl for long context by @lvhan028 in #4314
- fix sleep engine for dp=1 by @RunningLeon in #4315
🌐 Other
- [ci] fix fail testcase and add generate testcase in pr test by @zhulinJulia24 in #4231
- Pin nvshmem version by @CUHKSZzxy in #4257
- fix: Pin `timm` version to avoid failed tests by @windreamer in #4258
- docs: add generated openapi spec documentation by @windreamer in #4251
- fix: get rid of buggy timm-1.0.23 by @windreamer in #4260
- [ascend] fix paged prefill by @tangzhiyi11 in #4254
- Fix ascend/maca/camb runtime_requirements by @jinminxi104 in #4262
- docs: refine the documents by @windreamer in #4259
- docs: add cli docs by @windreamer in #4264
- Drop support for Python 3.9 as it has reached end-of-life by @lvhan028 in #4281
- bump version to v0.12.0 by @lvhan028 in #4300
Full Changelog: v0.11.1...v0.12.0
v0.11.1
What's Changed
🚀 Features
- [ascend] support dptp by @tangzhiyi11 in #4218
- Support Deepseek v32 by @grimoire in #4026
💥 Improvements
- Improve metrics by @CUHKSZzxy in #4178
- reserve blocks for dummy inputs by @grimoire in #4157
- Add vision id for Qwen3-VL by @CUHKSZzxy in #4183
- [Enhance]: Return routed experts when request canceled by @RunningLeon in #4197
- Add mm processor args for Qwen3-VL by @CUHKSZzxy in #4196
- support chat_template_kwargs in v1/chat/completions by @lvhan028 in #4201
- Refactor scheduler and engine.py by @grimoire in #4163
- update dp timeout by @grimoire in #4204
- Improve Qwen3-VL by @CUHKSZzxy in #4207
🐞 Bug fixes
- [Fix]: Split routed experts with query lens by @RunningLeon in #4180
- [Maca] fix ray and memory sync by @wanfengcxz in #4164
- Build block trie in prefill and add hit rate by @RunningLeon in #4184
- fix fope by @CUHKSZzxy in #4191
- fix hf modules read/write conflicts by multi processors by @lvhan028 in #4188
- Some Minor fix by @windreamer in #4185
- fix insecure deserialization when calling torch.load() by @lvhan028 in #4202
- Fix processor args by @CUHKSZzxy in #4200
- remove get_model_config to avoid pickle hf_config error in rpc calling by @lvhan028 in #4217
- Fix quant scale-fmt by @grimoire in #4212
- Fix requests of mix return_logprobs by @RunningLeon in #4222
- fix fillkv quant8 by @grimoire in #4229
- fix scale-fmt by @grimoire in #4230
📚 Documentations
- [Docs]: Add guide for VLMEvalKit by @CUHKSZzxy in #4156
🌐 Other
- Add FA3 by @CUHKSZzxy in #4166
- Add distributed test cases by @littlegy in #4161
- Add generate test by @littlegy in #4181
- [ci] add mllm eval by @zhulinJulia24 in #4194
- [ascend] refactor code by @yao-fengchen in #4176
- install serve.txt when building the docker image by @lvhan028 in #4219
- bump version to v0.11.1 by @lvhan028 in #4221
Full Changelog: v0.11.0...v0.11.1
v0.11.0
What's Changed
🚀 Features
- add endpoint /abort_request by @lvhan028 in #4092
- Qwen3 next by @grimoire in #4039
- Support Qwen3-VL by @CUHKSZzxy in #4093
- Support sync weights with flattened bucket tensor by @RunningLeon in #4109
- Support group router for moe models by @RunningLeon in #4120
- [Feature]: return routed experts to reuse by @RunningLeon in #4090
- support context parallel by @irexyc in #3951
- fope by @grimoire in #4043
- [Feature]: Support speculative decoding by @RunningLeon in #3945
- Moe bf16 ep by @grimoire in #4144
💥 Improvements
- Enlarge gc threshold by @grimoire in #4076
- remove num_tokens from EngineOutput by @lvhan028 in #4088
- revert masking vocab_size by @lvhan028 in #4089
- feat: add json_object support in response_format by @windreamer in #4080
- support image_data input to /generate endpoint by @irexyc in #4086
- [Fix] all RayEngineWorker actors created at node 0 in RL training by @CyCle1024 in #4107
- Optimize sleep level=1 for turbomind backend by @irexyc in #4074
- [Feat] enable ascend update_params by @CyCle1024 in #4111
- Enhance request checker by @lvhan028 in #4104
- Refactor dp tp by @grimoire in #4004
- fix kernel numerical error by @grimoire in #4133
- free ray put by @grimoire in #4137
- Reduce experts cache when resize by @RunningLeon in #4138
- support interleave text and image in messages by @lvhan028 in #4141
- optimize rms norm by @grimoire in #4153
- fix evict policy by @Tsundoku958 in #4127
🐞 Bug fixes
- fix type hint by @grimoire in #4078
- Fix inputs split by @RunningLeon in #4083
- add missing update_model_meta by @jinminxi104 in #4099
- Fix update_params for pytorch backend when loading vl model by @irexyc in #4101
- workaround for issue "TypeError argument 'tokens': 'NoneType' object cannot be converted to 'PyString" by @lvhan028 in #4103
- fix bug: schedule ratio support prefix-caching by @Tsundoku958 in #4100
- remove prefill free ratio threshold by @grimoire in #4110
- fix key error: api_server node might be removed by @lvhan028 in #4112
- Incorrectly judging the request as a bad request by @lvhan028 in #4121
- fix dist config keys by @grimoire in #4125
- proxy server miss media_type in streaming mode by @lvhan028 in #4130
- Fix logprobs to_tensor by @RunningLeon in #4132
- Fix cli help by @RunningLeon in #4139
- fix and optimize fill_kv_cache_quant by @grimoire in #4140
- fix: fix package deprecation introduced by CUDA 13 by @windreamer in #4117
- yield empty list for token_ids when it runs out of tokens by @lvhan028 in #4148
- Fix interns1 routed experts outputs by @RunningLeon in #4149
- fix qwen3-30-a3b lcb-code score by @yao-fengchen in #4142
- Fix ep deployment issues by @CUHKSZzxy in #4084
- Fix dllm to not use fa3 decoding by @RunningLeon in #4159
- fix: handle non-tuple decoder outputs during Qwen-2.5 quantization by @chengyuma in #4158
- fix cu11 docker build by @CUHKSZzxy in #4165
- Fix model config by @CUHKSZzxy in #4170
- fix lora by @grimoire in #4172
- fix cmake logic detect sm70, sm75 by @tuilakhanh in #4175
📚 Documentations
- Update model evaluation guide by @lvhan028 in #4094
- [Docs]: Add guide for update weights by @RunningLeon in #4151
🌐 Other
- add dockerfile to build dev image by @lvhan028 in #4091
- add ascend_a3 Dockerfile by @yao-fengchen in #4097
- [ci] refactor longtext benchmark by @zhulinJulia24 in #4087
- enable metrics by default by @lvhan028 in #4108
- Replace pynvml with nvidia-ml-py in requirements by @myhloli in #4118
- [ci] add free disk before build test whl package and add session_len args in benchmark script by @zhulinJulia24 in #4136
- Add prefixcache functionality and performance testing by @littlegy in #4119
- [ci] modify pipeline.close and add more case into pr_test by @zhulinJulia24 in #4150
- bump version to v0.11.0 by @lvhan028 in #4155
New Contributors
- @myhloli made their first contribution in #4118
- @tuilakhanh made their first contribution in #4175
Full Changelog: v0.10.2...v0.11.0
v0.10.2
What's Changed
🚀 Features
- add /generate api by @irexyc in #4019
- Guided decoding with xgrammar for TurboMind by @windreamer in #3965
- Reimplement guided decoding with xgrammar for PyTorch Engine by @windreamer in #4028
💥 Improvements
- [ascend] support aclgraph by @yao-fengchen in #4063
- Leverage incremental output between the inference and async engines to improve performance by @lvhan028 in #4054
- Optimize multinomial sampling by @grimoire in #4056
🐞 Bug fixes
- zmqrpc localhost only by @grimoire in #4017
- fix bug: dp+tp warmup by @Tsundoku958 in #3991
- fix dllm long-context by @grimoire in #4012
- Fix GPT-OSS streaming tool call parsing by @QwertyJack in #4023
- move releasing resource from async_engine to inference engine by @lvhan028 in #4041
- fix: fix tokenizer parsing bug for guided decoding by @windreamer in #4044
- Fix message content field handling for tool calls and multimodal input by @QwertyJack in #4029
- fix builder for kimi-k2 by @CUHKSZzxy in #4069
- Skip unnecessary sampling and fix the random offset by @grimoire in #4068
- fix duplicated stop_token_string when ignore_special_tokens is False by @irexyc in #4077
🌐 Other
- Drop CUDA 11.8 build support, upgrade CI/CD to CUDA 12.6/12.8 by @windreamer in #4013
- remove profile_generation.py and its testcases by @lvhan028 in #4027
- [ci] refactor eval into api eval and add h800 eval workflow by @zhulinJulia24 in #4008
- Add Docker image for NVIDIA Jetson by @windreamer in #3834
- [ci] refactor api evaluate test into llm judger evaluation by @littlegy in #4046
- Check color logger by @grimoire in #4060
- Update API testing with HLE and LCB datasets by @littlegy in #4061
- update ascend requirements by @yao-fengchen in #4066
- bump version to v0.10.2 by @lvhan028 in #4062
Full Changelog: v0.10.1...v0.10.2
v0.10.1
What's Changed
🚀 Features
- Add ROCm support: installation guide and FlashAttention compatibility for AMD GPUs by @Vivicai1005 in #3925
- support gpt-oss basic output by @irexyc in #3956
- Add FP8*(B)F16 GEMM by @lzhangzz in #3960
- Support GLM-4.5 by @CUHKSZzxy in #3863
- [Refactor]: Remove tokenizer when building engine by @RunningLeon in #3978
- Support InternVL3.5-Flash by @CUHKSZzxy in #3952
- support gpt-oss function/reasoning in /v1/chat/completions by @irexyc in #3962
- support returning stop_str in output by @lvhan028 in #3984
- Support SDAR by @grimoire in #3922
💥 Improvements
- specify installation on GeForce RTX 50 series by @lvhan028 in #3947
- cherry pick PR-3708 to return token_id by @lvhan028 in #3976
- Optimize AsyncEngine generation method by @shell-nlp in #3982
- Use blocking sync when TP engine is idling by @lzhangzz in #3974
- add openai_harmony to requirements by @irexyc in #4006
🐞 Bug fixes
- fix bugs with triton3.4.0 by @grimoire in #3946
- fix longrope by @grimoire in #3968
- Fix tm rl usage in xtuner by @irexyc in #3912
- Disable prefix caching when serving a VLM model by @lvhan028 in #3990
- remove NCCL_LAUNCH_MODE by @irexyc in #3994
- return the last token's logprobs, logits and last_hidden_states if include_stop_str_in_output is requested by @lvhan028 in #4000
- [Fix] device args in chat cli when using pytorch engine by @CyCle1024 in #3999
- fix internvl by @CUHKSZzxy in #3997
- fix not-returned iterator in SequenceManager::Erase by @irexyc in #4001
- fix cudagraph without warmup by @grimoire in #4005
- fix internvl flash long context acc by @CUHKSZzxy in #4003
🌐 Other
- [ci] update daily testcase by @zhulinJulia24 in #3944
- [maca] change kv layout from pagedattn to flashattn by @yuchiwang in #3958
- remove cudnn by @irexyc in #3969
- build(pypi): add cuda 12.8 support for wheels by @windreamer in #3948
- [CI] add ascend test by @littlegy in #3959
- update serve requirement by @RunningLeon in #3986
- [ci] add h800 function test workflow by @zhulinJulia24 in #3985
- bump version to v0.10.1 by @lvhan028 in #3989
New Contributors
- @Vivicai1005 made their first contribution in #3925
- @shell-nlp made their first contribution in #3982
- @littlegy made their first contribution in #3959
Full Changelog: v0.10.0...v0.10.1