better cache allocation in pytorch engine #1272

Merged: 6 commits into InternLM:main on Mar 14, 2024

Conversation

grimoire (Collaborator) commented on Mar 11, 2024:

Runtime intermediate memory usage might lead to OOM if too many cache blocks are in use. Since most of the intermediate cache is consumed by lm_head in ModelForCausalLM, we now reserve enough space for that cache and allocate the paged cache from the rest of the memory (a rough sketch follows the test script below).

Tested on a V100 with prompts of 10k+ input ids.

from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline


def main():
    model_path = 'Qwen1.5-7B-Chat'
    backend_config = PytorchEngineConfig(
        session_len=200000,
        cache_max_entry_count=0.99)
    pipe = pipeline(model_path, backend_config=backend_config)

    # placeholder: any prompt with 10k+ input ids
    prompt = '...'

    print('processing...')
    gen_config = GenerationConfig(max_new_tokens=1000)
    result = pipe([prompt], gen_config=gen_config)
    print(result[0].text)


if __name__ == '__main__':
    main()
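
The allocation strategy described above can be sketched as follows. This is a minimal illustration with made-up names (estimate_num_cache_blocks, runtime_reserve_bytes, block_bytes), not the engine's actual code: a runtime budget is set aside first, and only the remainder is carved into paged KV-cache blocks.

import torch


def estimate_num_cache_blocks(cache_max_entry_count: float,
                              runtime_reserve_bytes: int,
                              block_bytes: int) -> int:
    # Illustration only: take a fraction of the free GPU memory, subtract
    # the reserved runtime budget (dominated by the lm_head logits), and
    # split what is left into fixed-size paged cache blocks.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    budget = int(free_bytes * cache_max_entry_count)
    usable = max(budget - runtime_reserve_bytes, 0)
    return usable // block_bytes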

grimoire (Collaborator, Author) commented on Mar 13, 2024:

The estimate is based on vocab_size and max_prefill_token_num.
A small max_prefill_token_num leaves more room for the cache at the cost of prefill time, so we had better expose max_prefill_token_num in the API.
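
For concreteness, a rough sketch of such an estimate is below; the overhead factor, default byte width, and function name are my own assumptions rather than the engine's formula. The dominant term is the logits tensor that lm_head produces during prefill, which has max_prefill_token_num x vocab_size elements, so a smaller max_prefill_token_num shrinks the reserve and frees memory for the paged cache, at the price of more prefill iterations.

def estimate_runtime_reserve(vocab_size: int,
                             max_prefill_token_num: int,
                             bytes_per_element: int = 2,
                             overhead_factor: int = 4) -> int:
    # Logits from lm_head during prefill: (max_prefill_token_num, vocab_size).
    # overhead_factor is a guessed allowance for temporary copies (casts,
    # softmax, sampling buffers), not a value taken from the engine.
    return (vocab_size * max_prefill_token_num
            * bytes_per_element * overhead_factor)

# e.g. vocab_size=152000, max_prefill_token_num=4096, fp16:
# 152000 * 4096 * 2 * 4 bytes ~= 4.6 GiB under these assumptions.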

lvhan028 (Collaborator) commented:
max_prefill_token_num is already defined in PytorchEngineConfig and TurbomindEngineConfig.
What does exposing it in the API mean?
Please also define what counts as a small max_prefill_token_num.

grimoire (Collaborator, Author) commented:
lmdeploy serve api_server cannot recognize --max-prefill-token-num.

@@ -56,6 +57,7 @@ class ModelConfig:
     sliding_window: int = -1
     dtype: torch.dtype = torch.float16
     multi_query_attention: bool = False
+    vocab_size: int = 40000
A collaborator commented on this line:
Shouldn't this be left without a default value?

grimoire (Collaborator, Author) replied on Mar 14, 2024:
Yes. The main concern is that some models may not have a vocab_size field; keeping a default lets us still reserve some runtime memory as a fallback. 40000 x 4096 x 7 ≈ 1 GB.
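
A quick check of that figure, treating the 7 from the comment as an opaque bytes-per-element factor (an assumption on my part):

# fallback vocab_size * prefill token count * assumed 7 bytes per element
reserve_bytes = 40000 * 4096 * 7
print(f'{reserve_bytes / 1024**3:.2f} GiB')  # -> 1.07 GiB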

RunningLeon (Collaborator) left a comment:
LGTM

lvhan028 changed the title from "torch engine better cache allocation" to "better cache allocation in pytorch engine" on Mar 14, 2024.
lvhan028 merged commit 5682efe into InternLM:main on Mar 14, 2024.
5 checks passed.