
Releases: InternLM/lmdeploy

LMDeploy Release V0.6.0a0

26 Aug 09:12
97b880b

Highlight

  • Optimize W4A16 quantized model inference by implementing GEMM kernels in the TurboMind Engine
    • Add GPTQ-INT4 inference (a usage sketch follows the Before/After example below)
    • Support CUDA architectures SM70 and above, i.e., V100 and newer GPUs
  • Optimize the prefill stage of PyTorchEngine inference
  • Distinguish between the name of the deployed model and the name of the model's chat template

Before:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name customized_chat_template.json

After:

lmdeploy serve api_server /the/path/of/your/awesome/model \
    --model-name "the served model name" \
    --chat-template customized_chat_template.json
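
As a rough illustration of the new GPTQ-INT4 path, the snippet below loads a GPTQ-quantized checkpoint with the TurboMind backend. The model path is a placeholder, and passing model_format='gptq' is an assumption based on this release's highlight; check the quantization docs for the exact option name.

from lmdeploy import pipeline, TurbomindEngineConfig

# Placeholder path to a GPTQ-INT4 checkpoint (e.g. one exported with AutoGPTQ)
model_path = '/the/path/of/your/gptq-int4/model'

# model_format='gptq' is assumed to select the GPTQ-INT4 weight layout in TurboMind;
# per this release, any SM70+ GPU (V100 and newer) can run it
pipe = pipeline(model_path,
                backend_config=TurbomindEngineConfig(model_format='gptq'))
print(pipe('Hello, please introduce yourself.'))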

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
  • fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
  • Fix internvl2 template and update docs by @irexyc in #2292
  • fix the issue missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
  • Fix the way to get "quantization_config" from model's configuration by @lvhan028 in #2325
  • fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
  • Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
  • Fix the logic of update engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.3...v0.6.0a0

LMDeploy Release V0.5.3

07 Aug 03:38
a129a14

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.2...v0.5.3

LMDeploy Release V0.5.2.post1

26 Jul 12:22
fb6f8ea

What's Changed

🐞 Bug fixes

  • [Hotfix] missing parentheses when calculating the coef of llama3 rope, which caused the needle-in-a-haystack experiment to fail by @lvhan028 in #2157

🌐 Other

Full Changelog: v0.5.2...v0.5.2.post1

LMDeploy Release V0.5.2

26 Jul 08:07
7199b4e

Highlight

  • LMDeploy supports Llama 3.1 and its tool calling. An example of calling "Wolfram Alpha" to perform complex mathematical calculations can be found here
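
A minimal sketch of exercising tool calling through the OpenAI-compatible api_server. The server address, tool schema, and prompt are illustrative, and the actual Wolfram Alpha integration in the linked example may differ.

from openai import OpenAI

# Assumes `lmdeploy serve api_server <llama3.1 model>` is running locally on the default port
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')

# Illustrative tool schema; the real example wires this to the actual Wolfram Alpha API
tools = [{
    'type': 'function',
    'function': {
        'name': 'wolfram_alpha',
        'description': 'Evaluate a mathematical expression or query with Wolfram Alpha',
        'parameters': {
            'type': 'object',
            'properties': {'query': {'type': 'string'}},
            'required': ['query'],
        },
    },
}]

model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{'role': 'user', 'content': 'Compute the integral of x^2 from 0 to 3'}],
    tools=tools,
)
# The model is expected to emit a tool call rather than a plain answer
print(response.choices[0].message.tool_calls)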

What's Changed

🚀 Features

💥 Improvements

  • Remove the triton inference server backend "turbomind_backend" by @lvhan028 in #1986
  • Remove kv cache offline quantization by @AllentDan in #2097
  • Remove session_len and deprecated short names of the chat templates by @lvhan028 in #2105
  • clarify "n>1" in GenerationConfig hasn't been supported yet by @lvhan028 in #2108

🐞 Bug fixes

🌐 Other

Full Changelog: v0.5.1...v0.5.2

LMDeploy Release V0.5.1

16 Jul 10:05
9cdce39

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.5.0...v0.5.1

LMDeploy Release V0.5.0

01 Jul 07:22
4cb3854

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.2...v0.5.0

LMDeploy Release V0.4.2

27 May 08:56
54b7230

Highlight

  • Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA, and InternLMXComposer2

Quantization

lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ

Inference with quantized model

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Load the AWQ-quantized model with the TurboMind backend; tp=1 runs on a single GPU
pipe = pipeline('./InternVL-Chat-V1-5-AWQ', backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)

  • Balance the vision model when deploying VLMs on multiple GPUs

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# tp=2 shards the model across two GPUs; the vision part is balanced between them as well
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=2))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.1...v0.4.2

LMDeploy Release V0.4.1

07 May 08:20
14e9953

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • fix local variable 'response' referenced before assignment in async_engine.generate by @irexyc in #1513
  • Fix turbomind import in windows by @irexyc in #1533
  • Fix convert qwen2 to turbomind by @AllentDan in #1546
  • Adding api_key and model_name parameters to the restful benchmark by @NiuBlibing in #1478

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.4.0...v0.4.1

LMDeploy Release V0.4.0

23 Apr 11:18
04ba0ff

Highlights

Support for Llama3 and additional Vision-Language Models (VLMs):

  • We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, MiniGemini, and InternLMXComposer2.
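
For reference, running one of the newly supported models follows the usual pipeline pattern. The model ID below is illustrative; any supported Llama3 chat checkpoint works the same way.

from lmdeploy import pipeline

# Illustrative Hugging Face model ID
pipe = pipeline('meta-llama/Meta-Llama-3-8B-Instruct')
# The pipeline accepts a single prompt or a batch of prompts
print(pipe(['Hi, please introduce yourself', 'Shanghai is']))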

Introduce online int4/int8 KV quantization and inference

  • Data-free online quantization
  • Supports all NVIDIA GPUs with the Volta architecture (SM70) and above
  • KV int8 quantization is nearly lossless in accuracy, and KV int4 quantization accuracy is within an acceptable range
  • Efficient inference: with int8/int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16 (a configuration sketch follows this list)
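
A minimal sketch of switching on the online KV quantization through the engine config, assuming quant_policy=8 selects kv int8 and quant_policy=4 selects kv int4; the model ID is illustrative.

from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 -> online kv int8; quant_policy=4 -> online kv int4 (assumed mapping).
# The quantization is data-free, so no calibration set is required.
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)
print(pipe('Hello!'))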

The following table shows the evaluation results of three LLMs with different KV cache precisions:

|  |  |  | llama2-7b-chat |  |  | internlm2-chat-7b |  |  | qwen1.5-7b-chat |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dataset | version | metric | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 | kv fp16 | kv int8 | kv int4 |
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |

The table below presents LMDeploy's inference performance with quantized KV cache.

| model | kv type | test settings | RPS | vs. kv fp16 |
| --- | --- | --- | --- | --- |
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

New Contributors

Full Changelog: v0.3.0...v0.4.0

LMDeploy Release V0.3.0

03 Apr 01:55
4822fba

Highlight

  • Refactor attention and optimize GQA (#1258, #1307, #1116), achieving 22+ and 16+ RPS for internlm2-7b and internlm2-20b respectively, about 1.8x faster than vLLM
  • Support new models, including Qwen1.5-MoE (#1372), DBRX (#1367), and DeepSeek-VL (#1335)

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

📚 Documentations

🌐 Other

Full Changelog: v0.2.6...v0.3.0