Releases · InternLM/lmdeploy

01 Jun 08:46

lvhan028

0.14.0a1

a969cc9

v0.14.0a1 Pre-release


What's Changed

🚀 Features

FP8 kv cache quantization by @CUHKSZzxy in #4563

💥 Improvements

Update turbomind modeling infrastructure by @lzhangzz in #4557
refactor(turbomind): consolidate CUDA error handling and add manual stacktracing by @lzhangzz in #4565
Add Qwen3.5 Moe lite awq by @43758726 in #4561
[Improve]: Drain queues when sleep engine by @RunningLeon in #4577
Extend chat completions by introducing token-in/out and returning routed experts by @lvhan028 in #4593
Follow openai's spec to add "AllowedToolChoice" and report 400 when parsing request failed by @lvhan028 in #4585
Improve health endpoint by @lvhan028 in #4615
Remove state init by @grimoire in #4604
Include spec stats in metrics by @RunningLeon in #4625

🐞 Bug fixes

fix the anthropic adapter by @lvhan028 in #4578
Fix Structured Output for GPT-OSS Models by @windreamer in #4386
Allow W8A8Linear to accept dtype during initialization instead of hard code by @43758726 in #4586
fix: compact split multimodal tensors by @CUHKSZzxy in #4583
Fix legacy VLM preprocessors for normalized image data by @CUHKSZzxy in #4584
fix dockerfile which is missing common.txt by @lvhan028 in #4608
fix: enable FA3 for SM80+ GPUs and fix CUDA version comparison by @windreamer in #4591
flatten_kv_cache zero padding by @grimoire in #4613
align streaming usage chunks with OpenAI spec by @lvhan028 in #4616
fix(vl): reduce multimodal feature memory use by @CUHKSZzxy in #4603
fix memleak when input contains large image data by @grimoire in #4610
fix(turbomind): map Intern-S1 HF checkpoint keys by @lvhan028 in #4617
fix(serve): emit all stream_chunk deltas to fix concurrent tool-call streaming by @lvhan028 in #4622
fix cp inference by @irexyc in #4619
refactor(serve): avoid per-request tokenizer work in parsers by @lvhan028 in #4633
Bring MixtralForCausalLM back to Turbomind by @43758726 in #4623
fix model loading on windows by @irexyc in #4626

🌐 Other

chore: gate request logs behind request level by @CUHKSZzxy in #4581
miss rdkit for intern-s models by @lvhan028 in #4587
extract common deps into requirements/common.txt by @lvhan028 in #4595
Remove stale cli arg in vlmevalkit docs by @CUHKSZzxy in #4598
log response for debugging by @lvhan028 in #4592
cancel in-progress runs when PR is updated or merged by @lvhan028 in #4609
TEST: update qwen3.5 397b test by @littlegy in #4607
TEST: update video test by @littlegy in #4606

Full Changelog: v0.13.0...0.14.0a1

Contributors

windreamer, grimoire, and 7 other contributors


12 May 03:46

lvhan028

v0.13.0

e6948c1

v0.13.0 Latest


What's Changed

🚀 Features

[Ascend] support qwen3.5 35BA3B by @wanfengcxz in #4485
feat: Add TurboQuant (quant_policy=42) support for KV Cache Quantization by @windreamer in #4510
[refactor] [api_server] [2/N] improve tool parsers by abstracting xml parser by @lvhan028 in #4548
feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE inference on Blackwell GPUs with memory copy optimizations by @hd9568 in #4490
feat: add Anthropic-compatible serving endpoints by @lvhan028 in #4538
Support InternS2 Preview by @CUHKSZzxy in #4575

💥 Improvements

lmdeploy support kernel block size by @Tsundoku958 in #4421
Reject requests on stale session or sleeping engine by @lvhan028 in #4496
Add modern logging utils by @lzhangzz in #4486
refine dlinfer update_weights by @yao-fengchen in #4519
feat(serve): expose repetition n-gram params on OpenAI routes by @lvhan028 in #4522
Refactor step inputs by @grimoire in #4504
fix lite module for transformers>=5.0 by @43758726 in #4488
[refactor] [api_server] [1/N] Improve reasoning and tool-call parsers by @lvhan028 in #4468
fix: prevent prefill starvation under high decode load by @grimoire in #4532
Mixed modality by @CUHKSZzxy in #4531
optimize get_sorted_idx in moe by @grimoire in #4529
Map user-input session_id to internal session_id to maintain session identity by @lvhan028 in #4523
support more message item types by @CUHKSZzxy in #4501
add explicit trust_remote_code controls to resolve the security issue by @lvhan028 in #4511

🐞 Bug fixes

[ascend] fix prefix caching by @yao-fengchen in #4448
fix update params by @CUHKSZzxy in #4514
fix ray mem leak by @grimoire in #4487
Fix mtp by @RunningLeon in #4517
fix kernel-block-size by @grimoire in #4521
fix: use is not None check for seed to prevent seed=0 being silently ignored by @kuishou68 in #4526
Fix qwen35 dp by @grimoire in #4535
Fix mtp for rl by @RunningLeon in #4520
cancel request and block new inputs when sleeping by @grimoire in #4541
Fix mp engine by @RunningLeon in #4540
Fix cache sizing and cache block layout edge cases by @grimoire in #4552
Fix qwen3.5-moe mtp with tp>1 by @RunningLeon in #4568
block_offsets padding 0 by @grimoire in #4569
hotfix: resolve test issues for v0.13.0 by @lvhan028 in #4571
ResponseParser forgets to strip tag in non-stream mode by @lvhan028 in #4576
yield error when prompt processing suffers exception by @lvhan028 in #4574
Fix the reprefill of evicted seqs with invalid draft tokens by @RunningLeon in #4564
Support mtp fp8 by @RunningLeon in #4572
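
The seed fix listed above (#4526) concerns a general Python pitfall: a truthiness check treats `seed=0` the same as "no seed given". The sketch below illustrates the pattern only; the function names and the fallback value are hypothetical and are not taken from the lmdeploy code base.

```python
from typing import Optional


def pick_seed_truthy(seed: Optional[int], fallback: int) -> int:
    # Truthiness check: 0 is falsy, so an explicit seed of 0 is discarded.
    return seed if seed else fallback


def pick_seed_explicit(seed: Optional[int], fallback: int) -> int:
    # Identity check: only a missing seed (None) triggers the fallback.
    return seed if seed is not None else fallback
```

With a hypothetical fallback of `1234`, `pick_seed_truthy(0, 1234)` returns the fallback, while `pick_seed_explicit(0, 1234)` keeps the requested seed `0`.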

🌐 Other

Use env LMDEPLOY_FP32_MAMBA_SSM_DTYPE to control the dtype of recurrent state by @lvhan028 in #4518
add tool and reasoning test by @littlegy in #4388
update h config and add glm4.7 mtp test by @littlegy in #4424
[ci] change test whl into python 312 and use test images by @zhulinJulia24 in #4513
[Misc] fix typos in turbomind.py and model.py by @ZhijunLStudio in #4543
[Misc] fix mutable default arguments by @ZhijunLStudio in #4544
Add docker/Dockerfile_patch; minor tweaks in messages.py and setup.py. by @lvhan028 in #4546
remove barely used skills and checkin docker-build skill by @lvhan028 in #4560
bump version to v0.13.0 by @lvhan028 in #4549
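
The "fix mutable default arguments" entry above (#4544) refers to a well-known Python behavior: a default value such as `[]` is created once, when the function is defined, and is then shared across calls. The following is a minimal sketch of the problem and the usual remedy; the function names are hypothetical and do not come from lmdeploy.

```python
from typing import List, Optional


def append_shared(item: int, bucket: List[int] = []) -> List[int]:
    # The default list is created once and reused by every call
    # that omits `bucket`, so items accumulate across calls.
    bucket.append(item)
    return bucket


def append_fresh(item: int, bucket: Optional[List[int]] = None) -> List[int]:
    # Using None as a sentinel gives each call its own new list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

Calling `append_shared(1)` and then `append_shared(2)` yields a list holding both items on the second call, whereas `append_fresh` returns a one-element list each time.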

New Contributors

@kuishou68 made their first contribution in #4526
@ZhijunLStudio made their first contribution in #4543
@hd9568 made their first contribution in #4490

Full Changelog: v0.12.3...v0.13.0

Contributors

windreamer, grimoire, and 13 other contributors


08 Apr 03:37

lvhan028

v0.12.3

8ea459f

v0.12.3

What's Changed

🚀 Features

Support video inputs by @CUHKSZzxy in #4360
feat: fully implement compressed-tensors gs32 support in TurboMind by @lapy in #4429
Draft model update params by @CUHKSZzxy in #4452

💥 Improvements

support qwen3.5 on volta by @grimoire in #4405
Optimize Qwen3.5 by @lzhangzz in #4434
Builtin mrope by @grimoire in #4393
delete ray remote function return value by @grimoire in #4422
support cache_seqlen on recurrent-gdr and causal-conv1d-update by @grimoire in #4417
safe ray api by @CUHKSZzxy in #4455
add R3 for qwen3-vl-moe models by @lvhan028 in #4457
Align rope init in lmdeploy by @RangiLyu in #4466
Make tilelang a Linux-only dependency (like triton) by @Copilot in #4469
prepare chunk indices before cache initialize by @grimoire in #4458
unify rope device by @CUHKSZzxy in #4467
custom processor args by @CUHKSZzxy in #4472
Assign sequential api_server ports when proxy_url is unset by @lvhan028 in #4416
disable fla intracard_backend by @grimoire in #4482
[Fix][Feat] Fix worker sorting with external pg bundles & Support persistent buffer for update_params by @CyCle1024 in #4397
simplify interns1 pro codes by @CUHKSZzxy in #4480

🐞 Bug fixes

fix test_hf_overrides for transformers>5 by @grimoire in #4418
fix qwen3.5 pytorch multimodal inference by @CUHKSZzxy in #4430
fix generate endpoint by @CUHKSZzxy in #4432
Make Intern-S1-Pro compatible with Transformers 5.0+ by @lvhan028 in #4435
fix multiround chat by @CUHKSZzxy in #4438
fix(async_engine): make safe_run cancellation cleanup reliable with shield and SafeRunException by @lvhan028 in #4439
release state cache by @CUHKSZzxy in #4462
Split/tool call args json for qwen3coder tool calls (Qwen3.5) by @lapy in #4433
fix(turbomind): fix dimension mismatch in ApplyTokenBitmaskInplace by @windreamer in #4456
fix metrics by @CUHKSZzxy in #4410
fix security issues by @CUHKSZzxy in #4447
fix qwen3.5 fp8 support by @grimoire in #4470
fix image / video resize function by @CUHKSZzxy in #4478
fix dynamic ntk device by @CUHKSZzxy in #4483
fix pagedattention pointer range by @grimoire in #4494
fix glm4.7-flash by @grimoire in #4500
Fix torch awq by @grimoire in #4503

🌐 Other

[ci] add legacy test workflow and test config by @zhulinJulia24 in #4387
chore: add CLAUDE.md and Claude Code skills by @CUHKSZzxy in #4413
Fix CI errors including linting error and unit test error by @lvhan028 in #4431
Use pyupgrade and ruff to modernize LMDeploy Python Code by @windreamer in #4392
reduce ci memory by @irexyc in #4471
fix: add safe.directory for git in docker workflows by @windreamer in #4474
[ci] add nightly docker build workflow by @zhulinJulia24 in #4406
split docker wheel preparation into staged build steps and use python 3.12 as the default version by @lvhan028 in #4476
[Feat]: Support qwen35 with mtp by @RunningLeon in #4437
bump version to v0.12.3 by @lvhan028 in #4493

New Contributors

@RangiLyu made their first contribution in #4466
@Copilot made their first contribution in #4469

Full Changelog: v0.12.2...v0.12.3

Contributors

windreamer, grimoire, and 9 other contributors


18 Mar 03:13

lvhan028

v0.12.2

9a50f1f

v0.12.2

What's Changed

🚀 Features

support glm5 by @grimoire in #4355
Qwen/Internlm/Llama Dense/Moe model fp8 quant online by @43758726 in #4324
Qwen3.5 by @grimoire in #4351
GLM-4.7-Flash Turbomind support by @lapy in #4362
Support router replay and ignore quant layer for qwen3.5 by @RunningLeon in #4394
[Feature] Add TurboMind support for Qwen3.5 models (dense + MoE) by @lapy in #4389
support repetition ngram logits processor by @grimoire in #4288

💥 Improvements

Compatible with transformers 5.0 at TurboMind side by @lvhan028 in #4304
Support fp32 head for qwen and internlm models by @RunningLeon in #4160
Reduce MLA kv-cache memory by @lzhangzz in #4373
add recurrent_gated_delta_rule kernel by @grimoire in #4376
[ascend]adapt for s1-pro dp*tp+ep by @yao-fengchen in #4380
Support glm4.7 with mtp by @RunningLeon in #4346
Faster MLA kernels by @lzhangzz in #4391
Attention kernel self-registration and decoupled dispatching by @lzhangzz in #4396

🐞 Bug fixes

fix: change debug log from ERROR to DEBUG in RepetitionPenaltyKernel by @murray-macdonald in #4363
Fix quant config parsing for internvl awq model by @RunningLeon in #4369
Fix XGrammar bitmask initialization and add null check for gen_config in generate method by @windreamer in #4349
fix the logic of closing session by @lvhan028 in #4370
Fix authorization by @lvhan028 in #4338
Fix some minor issues and provide tests for Pipeline by @windreamer in #4365
fix dllm mask on set_step by @grimoire in #4278
fix models for transformers>=5 by @grimoire in #4381
fix exception when aborting a request by @lvhan028 in #4403
fix inference crashed on v100 with qwen3.5-0.8b by @lvhan028 in #4420

🌐 Other

ci(lint): skip flaky deadlink test for python wiki page by @windreamer in #4357
fix fa3 install by @irexyc in #4361
fix lint by @windreamer in #4375
upgrade triton and torch by @grimoire in #4379
Add speculative decoding test by @littlegy in #4377
ci: integrate clang-format lint into pre-commit hooks by @windreamer in #4390
Update dockerfile by removing cu11 and changing cu12.4 to cu12.6 by @lvhan028 in #4398
manually build dev image instead of publishing it every version by @lvhan028 in #4409
bump version to v0.12.2 by @lvhan028 in #4378

New Contributors

@murray-macdonald made their first contribution in #4363
@lapy made their first contribution in #4362

Full Changelog: v0.12.1...v0.12.2

Contributors

windreamer, grimoire, and 9 other contributors


13 Feb 09:02

lvhan028

v0.12.1

e5df4e8

v0.12.1

What's Changed

🚀 Features

support glm-4.7-flash by @RunningLeon in #4320
[ascend] support ep by @yao-fengchen in #3696

💥 Improvements

fix rotary embedding for transformers v5 by @grimoire in #4303
Improve metrics log by @CUHKSZzxy in #4297
Support ignore layers in quant config for qwen3 models by @RunningLeon in #4293
add custom noaux kernel by @grimoire in #4345
fix qwen3vl with transformers5 by @grimoire in #4348

🐞 Bug fixes

fix tool call parser's streaming cursor by @lvhan028 in #4333
Fix data race for guided decoding in TP mode by @lzhangzz in #4341
fa3 check by @grimoire in #4340
Fix time series preprocess by @CUHKSZzxy in #4339
Negative KV sequence length error in Attention op by @jinminxi104 in #4316
fix qwen3-vl-moe long context by @grimoire in #4342
fix: move quantized norm to CPU instead of stale q_linear reference in smooth_quant by @Mr-Neutr0n in #4352
update noaux-kernel check by @grimoire in #4358

🌐 Other

change INPUT_CUDA_VERSION to 12.6.2 by @lvhan028 in #4322
add Qwen3-8B accuracy evaluation in llm_compressor.md by @43758726 in #4319
[ci] refactor ete testcase by @zhulinJulia24 in #4274
Set alias interns1_1 for interns1_pro by @lvhan028 in #4334
build(docker): skip FA2 when use cu13 by @windreamer in #4356
bump version to v0.12.1 by @lvhan028 in #4350

New Contributors

@Mr-Neutr0n made their first contribution in #4352

Full Changelog: v0.12.0...v0.12.1

Contributors

windreamer, grimoire, and 9 other contributors


04 Feb 06:28

lvhan028

v0.12.0

9564876

v0.12.0

What's Changed

🚀 Features

Add Gloo communication to turbomind by @irexyc in #3362
[Feat] Support llm-compressor AWQ models in TurboMind by @43758726 in #4290
Router replay for gpt oss by @RunningLeon in #4298
Support llm-compressor symmetric quantized model inference in TurboMind by @43758726 in #4305
Support Intern-S1-Pro by @CUHKSZzxy in #4318

💥 Improvements

Configurable max CTAs and NVLS usage for CUDA IPC communicator by @lzhangzz in #4227
Improve aborting all sessions by @lvhan028 in #4215
Moe Reduce kernel by @grimoire in #4228
Refactor attn by @grimoire in #4238
Optimize exception raising and error process by @grimoire in #4236
[AsyncEngine Refactor 1/N] define MultimodalProcessor to handle multimodal data processing by @lvhan028 in #4250
[AsyncEngine Refactor 2/N] Remove deprecates from chat template by @lvhan028 in #4252
Configurable uvicorn timeout by @CUHKSZzxy in #4255
Adapt to dlsime v0.0.2 by @JimyMa in #4242
[Fix] fix quant calibration dataset by @43758726 in #4256
lmdeploy support parallel embedding by @Tsundoku958 in #4192
Refactor turbomind engine by @lzhangzz in #4223
Refactor Engine & ModelAgent interact by @grimoire in #4265
Support sleep and destroy deepep buffer by @RunningLeon in #4246
add yarn truncate by @grimoire in #4301
[AsyncEngine Refactor 3/N] Introduce Session and SessionManager by @lvhan028 in #4253
Add warning about NCCL 2.27 memory leaks by @lzhangzz in #4313

🐞 Bug fixes

Fix fope cos/sin coef device type by @CUHKSZzxy in #4240
Fix include_stop_str_in_output with output_logits Exception by @windreamer in #4244
fix logit softcapping is None by @grimoire in #4247
Fix performance regression for prefix caching by @lzhangzz in #4270
convert float16 weight to bfloat16 for FP8 models by @lvhan028 in #4276
[ascend] fix dp multinode rank_table mapping by @tangzhiyi11 in #4268
[Fix] move calibrate load dataset location by @43758726 in #4289
fix ignore-eos by @grimoire in #4282
fix MPEngine poll by @grimoire in #4287
Fix prefix caching by @lzhangzz in #4292
Fix gemma chat template by @lvhan028 in #4280
Fix scheduler metrics by @lzhangzz in #4294
Fix NVLS init for mixed DP+TP by @lzhangzz in #4296
[side-effect] The tool message dump is incomplete by @lvhan028 in #4299
Fix mla with spec tokens by @RunningLeon in #4302
fix stop long context by @grimoire in #4309
fix crash on client disconnect (Ctrl+C) by @lvhan028 in #4308
Ensure the pipe benchmark uses kwargs when calling pipe.stream_infer by @lvhan028 in #4312
fix get_ppl for long context by @lvhan028 in #4314
fix sleep engine for dp=1 by @RunningLeon in #4315

🌐 Other

[ci] fix fail testcase and add generate testcase in pr test by @zhulinJulia24 in #4231
Pin nvshmem version by @CUHKSZzxy in #4257
fix: Pin timm version to avoid failed tests by @windreamer in #4258
docs: add generated openapi spec documentation by @windreamer in #4251
fix: get rid of buggy timm-1.0.23 by @windreamer in #4260
[ascend] fix paged prefill by @tangzhiyi11 in #4254
Fix ascend/maca/camb runtime_requirements by @jinminxi104 in #4262
docs: refine the documents by @windreamer in #4259
docs: add cli docs by @windreamer in #4264
Drop support for Python 3.9 as it has reached end-of-life by @lvhan028 in #4281
bump version to v0.12.0 by @lvhan028 in #4300

New Contributors

@43758726 made their first contribution in #4256

Full Changelog: v0.11.1...v0.12.0

Contributors

windreamer, grimoire, and 11 other contributors


24 Dec 13:27

lvhan028

v0.11.1

cdff769

v0.11.1

What's Changed

🚀 Features

[ascend] support dptp by @tangzhiyi11 in #4218
Support Deepseek v32 by @grimoire in #4026

💥 Improvements

Improve metrics by @CUHKSZzxy in #4178
reserve blocks for dummy inputs by @grimoire in #4157
Add vision id for Qwen3-VL by @CUHKSZzxy in #4183
[Enhance]: Return routed experts when request canceled by @RunningLeon in #4197
Add mm processor args for Qwen3-VL by @CUHKSZzxy in #4196
support chat_template_kwargs in v1/chat/completions by @lvhan028 in #4201
Refactor scheduler and engine.py by @grimoire in #4163
update dp timeout by @grimoire in #4204
Improve Qwen3-VL by @CUHKSZzxy in #4207

🐞 Bug fixes

[Fix]: Split routed experts with query lens by @RunningLeon in #4180
[Maca] fix ray and memory sync by @wanfengcxz in #4164
Build block trie in prefill and add hit rate by @RunningLeon in #4184
fix fope by @CUHKSZzxy in #4191
fix hf modules read/write conflicts by multi processors by @lvhan028 in #4188
Some Minor fix by @windreamer in #4185
fix insecure deserialization when calling torch.load() by @lvhan028 in #4202
Fix processor args by @CUHKSZzxy in #4200
remove get_model_config to avoid pickle hf_config error in rpc calling by @lvhan028 in #4217
Fix quant scale-fmt by @grimoire in #4212
Fix requests of mix return_logprobs by @RunningLeon in #4222
fix fillkv quant8 by @grimoire in #4229
fix scale-fmt by @grimoire in #4230

📚 Documentations

[Docs]: Add guide for VLMEvalKit by @CUHKSZzxy in #4156

🌐 Other

Add FA3 by @CUHKSZzxy in #4166
Add distributed test cases by @littlegy in #4161
Add generate test by @littlegy in #4181
[ci] add mllm eval by @zhulinJulia24 in #4194
[ascend] refactor code by @yao-fengchen in #4176
install serve.txt when building the docker image by @lvhan028 in #4219
bump version to v0.11.1 by @lvhan028 in #4221

Full Changelog: v0.11.0...v0.11.1

Contributors

windreamer, grimoire, and 8 other contributors


04 Dec 06:20

lvhan028

v0.11.0

4abccaf

v0.11.0

What's Changed

🚀 Features

add endpoint /abort_request by @lvhan028 in #4092
Qwen3 next by @grimoire in #4039
Support Qwen3-VL by @CUHKSZzxy in #4093
Support sync weights with flattened bucket tensor by @RunningLeon in #4109
Support group router for moe models by @RunningLeon in #4120
[Feature]: return routed experts to reuse by @RunningLeon in #4090
support context parallel by @irexyc in #3951
fope by @grimoire in #4043
[Feature]: Support speculative decoding by @RunningLeon in #3945
Moe bf16 ep by @grimoire in #4144

💥 Improvements

Enlarge gc threshold by @grimoire in #4076
remove num_tokens from EngineOutput by @lvhan028 in #4088
revert masking vocab_size by @lvhan028 in #4089
feat: add json_object support in response_format by @windreamer in #4080
support image_data input to /generate endpoint by @irexyc in #4086
[Fix] all RayEngineWorker actors created at node 0 in RL training by @CyCle1024 in #4107
Optimize sleep level=1 for turbomind backend by @irexyc in #4074
[Feat] enable ascend update_params by @CyCle1024 in #4111
Enhance request checker by @lvhan028 in #4104
Refactor dp tp by @grimoire in #4004
fix kernel numerical error by @grimoire in #4133
free ray put by @grimoire in #4137
Reduce experts cache when resize by @RunningLeon in #4138
support interleave text and image in messages by @lvhan028 in #4141
optimize rms norm by @grimoire in #4153
fix evict policy by @Tsundoku958 in #4127

🐞 Bug fixes

fix type hint by @grimoire in #4078
Fix inputs split by @RunningLeon in #4083
add missing update_model_meta by @jinminxi104 in #4099
Fix update_params for pytorch backend when loading vl model by @irexyc in #4101
workaround for issue "TypeError: argument 'tokens': 'NoneType' object cannot be converted to 'PyString'" by @lvhan028 in #4103
fix bug: schedule ratio support prefix-caching by @Tsundoku958 in #4100
remove prefill free ratio threshold by @grimoire in #4110
fix key error: api_server node might be removed by @lvhan028 in #4112
Incorrectly judging the request as a bad request by @lvhan028 in #4121
fix dist config keys by @grimoire in #4125
proxy server miss media_type in streaming mode by @lvhan028 in #4130
Fix logprobs to_tensor by @RunningLeon in #4132
Fix cli help by @RunningLeon in #4139
fix and optimize fill_kv_cache_quant by @grimoire in #4140
fix: fix package deprecation introduced by CUDA 13 by @windreamer in #4117
yield empty list for token_ids when it runs out of tokens by @lvhan028 in #4148
Fix interns1 routed experts outputs by @RunningLeon in #4149
fix qwen3-30-a3b lcb-code score by @yao-fengchen in #4142
Fix ep deployment issues by @CUHKSZzxy in #4084
Fix dllm to not use fa3 decoding by @RunningLeon in #4159
fix: handle non-tuple decoder outputs during Qwen-2.5 quantization by @chengyuma in #4158
fix cu11 docker build by @CUHKSZzxy in #4165
Fix model config by @CUHKSZzxy in #4170
fix lora by @grimoire in #4172
fix cmake logic detect sm70, sm75 by @tuilakhanh in #4175

📚 Documentations

Update model evaluation guide by @lvhan028 in #4094
[Docs]: Add guide for update weights by @RunningLeon in #4151

🌐 Other

add dockerfile to build dev image by @lvhan028 in #4091
add ascend_a3 Dockerfile by @yao-fengchen in #4097
[ci] refactor longtext benchmark by @zhulinJulia24 in #4087
enable metrics by default by @lvhan028 in #4108
Replace pynvml with nvidia-ml-py in requirements by @myhloli in #4118
[ci] add free disk before build test whl package and add session_len args in benchmark script by @zhulinJulia24 in #4136
Add prefixcache functionality and performance testing by @littlegy in #4119
[ci] modify pipeline.close and add more case into pr_test by @zhulinJulia24 in #4150
bump version to v0.11.0 by @lvhan028 in #4155

New Contributors

@myhloli made their first contribution in #4118
@tuilakhanh made their first contribution in #4175

Full Changelog: v0.10.2...v0.11.0

Contributors

windreamer, grimoire, and 13 other contributors


28 Oct 11:32

lvhan028

v0.10.2

f36aa71

v0.10.2

What's Changed

🚀 Features

add /generate api by @irexyc in #4019
Guided decoding with xgrammar for TurboMind by @windreamer in #3965
Reimplement guided decoding with xgrammar for PyTorch Engine by @windreamer in #4028

💥 Improvements

[ascend] support aclgraph by @yao-fengchen in #4063
Leverage incremental output between the inference and async engines to improve performance by @lvhan028 in #4054
Optimize multinomial sampling by @grimoire in #4056

🐞 Bug fixes

zmqrpc localhost only by @grimoire in #4017
fix bug: dp+tp warmup by @Tsundoku958 in #3991
fix dllm long-context by @grimoire in #4012
Fix GPT-OSS streaming tool call parsing by @QwertyJack in #4023
move releasing resource from async_engine to inference engine by @lvhan028 in #4041
fix: fix tokenizer parsing bug for guided decoding by @windreamer in #4044
Fix message content field handling for tool calls and multimodal input by @QwertyJack in #4029
fix builder for kimi-k2 by @CUHKSZzxy in #4069
Skip unnecessary sampling and fix the random offset by @grimoire in #4068
fix duplicated stop_token_string when ignore_special_tokens is False by @irexyc in #4077

🌐 Other

Drop CUDA 11.8 build support, upgrade CI/CD to CUDA 12.6/12.8 by @windreamer in #4013
remove profile_generation.py and its testcases by @lvhan028 in #4027
[ci] refactor eval into api eval and add h800 eval workflow by @zhulinJulia24 in #4008
Add Docker image for NVIDIA Jetson by @windreamer in #3834
[ci] refactor api evaluate test into llm judger evaluation by @littlegy in #4046
Check color logger by @grimoire in #4060
Update API testing with HLE and LCB datasets by @littlegy in #4061
update ascend requirements by @yao-fengchen in #4066
bump version to v0.10.2 by @lvhan028 in #4062

Full Changelog: v0.10.1...v0.10.2

Contributors

windreamer, grimoire, and 8 other contributors


26 Sep 02:45

lvhan028

v0.10.1

1984a5d

v0.10.1

What's Changed

🚀 Features

Add ROCm support: installation guide and FlashAttention compatibility for AMD GPUs by @Vivicai1005 in #3925
support gpt-oss basic output by @irexyc in #3956
Add FP8*(B)F16 GEMM by @lzhangzz in #3960
Support GLM-4.5 by @CUHKSZzxy in #3863
[Refactor]: Remove tokenizer when building engine by @RunningLeon in #3978
Support InternVL3.5-Flash by @CUHKSZzxy in #3952
support gpt-oss function/reasoning in /v1/chat/completions by @irexyc in #3962
support returning stop_str in output by @lvhan028 in #3984
Support SDAR by @grimoire in #3922

💥 Improvements

specify installation on GeForce RTX 50 series by @lvhan028 in #3947
cherry pick PR-3708 to return token_id by @lvhan028 in #3976
Optimize AsyncEngine generation method by @shell-nlp in #3982
Use blocking sync when TP engine is idling by @lzhangzz in #3974
add openai_harmony to requirements by @irexyc in #4006

🐞 Bug fixes

fix bugs with triton3.4.0 by @grimoire in #3946
fix longrope by @grimoire in #3968
Fix tm rl usage in xtuner by @irexyc in #3912
Disable prefix caching when serving a VLM model by @lvhan028 in #3990
remove NCCL_LAUNCH_MODE by @irexyc in #3994
return the last token's logprobs, logits and last_hidden_states if include_stop_str_in_output is requested by @lvhan028 in #4000
[Fix] device args in chat cli when using pytorch engine by @CyCle1024 in #3999
fix internvl by @CUHKSZzxy in #3997
fix not-returned iterator in SequenceManager::Erase by @irexyc in #4001
fix cudagraph without warmup by @grimoire in #4005
fix internvl flash long context acc by @CUHKSZzxy in #4003

🌐 Other

[ci] update daily testcase by @zhulinJulia24 in #3944
[maca] change kv layout from pagedattn to flashattn by @yuchiwang in #3958
remove cudnn by @irexyc in #3969
build(pypi): add cuda 12.8 support for wheels by @windreamer in #3948
[CI] add ascend test by @littlegy in #3959
update serve requirement by @RunningLeon in #3986
[ci] add h800 function test workflow by @zhulinJulia24 in #3985
bump version to v0.10.1 by @lvhan028 in #3989

New Contributors

@Vivicai1005 made their first contribution in #3925
@shell-nlp made their first contribution in #3982
@littlegy made their first contribution in #3959

Full Changelog: v0.10.0...v0.10.1

Contributors

windreamer, grimoire, and 11 other contributors

