
Releases: InternLM/lmdeploy

LMDeploy Release V0.6.0a0

26 Aug 09:12
97b880b

Highlight

  • Optimize W4A16 quantized model inference by implementing GEMM kernels in the TurboMind Engine
    • Add GPTQ-INT4 inference (a usage sketch follows the Before/After example below)
    • Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
  • Optimize the prefill stage of PyTorchEngine inference
  • Distinguish between the name of the deployed model and the name of the model's chat template

Before:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
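
As a rough illustration of the new GPTQ-INT4 path, the snippet below loads a GPTQ-quantized checkpoint with the TurboMind backend. The model path is a placeholder, and passing model_format='gptq' is an assumption based on this release's highlight; check the quantization docs for the exact option name.

from lmdeploy import pipeline, TurbomindEngineConfig

# Placeholder path to a GPTQ-INT4 checkpoint (e.g. one exported with AutoGPTQ)
model_path = '/the/path/of/your/gptq-int4/model'

# model_format='gptq' is assumed to select the GPTQ-INT4 weight layout in TurboMind;
# per this release, any SM70+ GPU (V100 and newer) can run it
pipe = pipeline(model_path,
                backend_config=TurbomindEngineConfig(model_format='gptq'))
print(pipe('Hello, please introduce yourself.'))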

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix the issue missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of update engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0a0

LMDeploy Release V0.5.3

07 Aug 03:38
a129a14

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.2...v0.5.3

LMDeploy Release V0.5.2.post1

26 Jul 12:22
fb6f8ea

What's Changed

🐞 Bug fixes

  • [Hotfix] missing parentheses when calculating the coef of llama3 rope, which caused the needle-in-a-haystack experiment to fail by @lvhan028 in #2157

🌐 Other

Full Changelog: v0.5.2...v0.5.2.post1

LMDeploy Release V0.5.2

26 Jul 08:07
7199b4e

Highlight

  • LMDeploy supports Llama 3.1 and its tool calling. An example of calling "Wolfram Alpha" to perform complex mathematical calculations can be found here
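
A minimal sketch of exercising tool calling through the OpenAI-compatible api_server. The server address, tool schema, and prompt are illustrative, and the actual Wolfram Alpha integration in the linked example may differ.

from openai import OpenAI

# Assumes `lmdeploy serve api_server <llama3.1 model>` is running locally on the default port
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')

# Illustrative tool schema; the real example wires this to the actual Wolfram Alpha API
tools = [{
    'type': 'function',
    'function': {
        'name': 'wolfram_alpha',
        'description': 'Evaluate a mathematical expression or query with Wolfram Alpha',
        'parameters': {
            'type': 'object',
            'properties': {'query': {'type': 'string'}},
            'required': ['query'],
        },
    },
}]

model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Compute the integral of x^2 from 0 to 3'}],
    tools=tools,
)
# The model is expected to emit a tool call rather than a plain answer
print(response.choices[0].message.tool_calls)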

What's Changed

🚀 Features

💥 Improvements

  • Remove the triton inference server backend "turbomind_backend" by @lvhan028 in #1986
  • Remove kv cache offline quantization by @AllentDan in #2097
  • Remove session_len and deprecated short names of the chat templates by @lvhan028 in #2105
  • clarify "n>1" in GenerationConfig hasn't been supported yet by @lvhan028 in #2108

🐞 Bug fixes

🌐 Other

Full Changelog: v0.5.1...v0.5.2

LMDeploy Release V0.5.1

16 Jul 10:05
9cdce39

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.0...v0.5.1

LMDeploy Release V0.5.0

01 Jul 07:22
4cb3854

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.2...v0.5.0

LMDeploy Release V0.4.2

27 May 08:56
54b7230

Highlight

  • Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA, and InternLMXComposer2

Quantization

lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ

Inference with quantized model

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Load the AWQ-quantized model with the TurboMind backend; tp=1 runs on a single GPU
pipe = pipeline('./InternVL-Chat-V1-5-AWQ', backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)

  • Balance the vision model when deploying VLMs on multiple GPUs

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# tp=2 shards the model across two GPUs; the vision part is balanced between them as well
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=2))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.1...v0.4.2

LMDeploy Release V0.4.1

07 May 08:20
14e9953

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • fix local variable 'response' referenced before assignment in async_engine.generate by @irexyc in #1513
  • Fix turbomind import in windows by @irexyc in #1533
  • Fix convert qwen2 to turbomind by @AllentDan in #1546
  • Adding api_key and model_name parameters to the restful benchmark by @NiuBlibing in #1478

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.0...v0.4.1

LMDeploy Release V0.4.0

23 Apr 11:18
04ba0ff

Highlights

Support for Llama3 and additional Vision-Language Models (VLMs):

  • We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, MiniGemini, and InternLMXComposer2.
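
For reference, running one of the newly supported models follows the usual pipeline pattern. The model ID below is illustrative; any supported Llama3 chat checkpoint works the same way.

from lmdeploy import pipeline

# Illustrative Hugging Face model ID
pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct')
# The pipeline accepts a single prompt or a batch of prompts
print(pipe(['Hi, please introduce yourself', 'Shanghai is']))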

Introduce online int4/int8 KV quantization and inference

  • Data-free online quantization
  • Supports all NVIDIA GPUs with the Volta architecture (SM70) and above
  • KV int8 quantization is nearly lossless in accuracy, and KV int4 quantization accuracy is within an acceptable range
  • Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16 (a configuration sketch follows this list)
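
A minimal sketch of switching on the online KV quantization through the engine config, assuming quant_policy=8 selects kv int8 and quant_policy=4 selects kv int4; the model ID is illustrative.

from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 -> online kv int8; quant_policy=4 -> online kv int4 (assumed mapping).
# The quantization is data-free, so no calibration set is required.
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe('Hello!'))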

The following table shows the evaluation results of three LLMs with different KV cache precisions:

|  |  |  | llama2-7b-chat |  |  | internlm2-chat-7b |  |  | qwen1.5-7b-chat |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |

The table below presents LMDeploy's inference performance with quantized KV cache.

| model | kv type | test settings | RPS | vs. kv fp16 |
| --- | --- | --- | --- | --- |
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.3.0...v0.4.0

LMDeploy Release V0.3.0

03 Apr 01:55
4822fba

Highlight

  • Refactor attention and optimize GQA (#1258, #1307, #1116), achieving 22+ and 16+ RPS for internlm2-7b and internlm2-20b respectively, about 1.8x faster than vLLM
  • Support new models, including Qwen1.5-MoE (#1372), DBRX (#1367), and DeepSeek-VL (#1335)

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

Full Changelog: v0.2.6...v0.3.0