better cache allocation in pytorch engine #1272

Merged: 6 commits into InternLM:main on Mar 14, 2024

Conversation

grimoire (Collaborator) commented on Mar 11, 2024:

Runtime intermediate memory usage might lead to OOM if too many cache blocks are in use. Since most of the intermediate cache is consumed by lm_head in ModelForCausalLM, we now reserve enough space for that cache and allocate the paged cache from the rest of the memory (a rough sketch follows the test script below).

Tested on a V100 with prompts of 10k+ input ids.

from lmdeploy import GenerationConfig, PytorchEngineConfig, pipeline


def main():
    model_path = 'Qwen1.5-7B-Chat'
    backend_config = PytorchEngineConfig(
        session_len=200000,
        cache_max_entry_count=0.99)
    pipe = pipeline(model_path, backend_config=backend_config)

    # placeholder: any prompt with 10k+ input ids
    prompt = '...'

    print('processing...')
    gen_config = GenerationConfig(max_new_tokens=1000)
    result = pipe([prompt], gen_config=gen_config)
    print(result[0].text)


if __name__ == '__main__':
    main()
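
The allocation strategy described above can be sketched as follows. This is a minimal illustration with made-up names (estimate_num_cache_blocks, runtime_reserve_bytes, block_bytes), not the engine's actual code: a runtime budget is set aside first, and only the remainder is carved into paged KV-cache blocks.

import torch


def estimate_num_cache_blocks(cache_max_entry_count: float,
                              runtime_reserve_bytes: int,
                              block_bytes: int) -> int:
    # Illustration only: take a fraction of the free GPU memory, subtract
    # the reserved runtime budget (dominated by the lm_head logits), and
    # split what is left into fixed-size paged cache blocks.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    budget = int(free_bytes * cache_max_entry_count)
    usable = max(budget - runtime_reserve_bytes, 0)
    return usable // block_bytes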

grimoire (Collaborator, Author) commented on Mar 13, 2024:

The estimate is based on vocab_size and max_prefill_token_num.
A small max_prefill_token_num leaves more room for the cache at the cost of prefill time, so we had better expose max_prefill_token_num in the API.
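
For concreteness, a rough sketch of such an estimate is below; the overhead factor, default byte width, and function name are my own assumptions rather than the engine's formula. The dominant term is the logits tensor that lm_head produces during prefill, which has max_prefill_token_num x vocab_size elements, so a smaller max_prefill_token_num shrinks the reserve and frees memory for the paged cache, at the price of more prefill iterations.

def estimate_runtime_reserve(vocab_size: int,
                             max_prefill_token_num: int,
                             bytes_per_element: int = 2,
                             overhead_factor: int = 4) -> int:
    # Logits from lm_head during prefill: (max_prefill_token_num, vocab_size).
    # overhead_factor is a guessed allowance for temporary copies (casts,
    # softmax, sampling buffers), not a value taken from the engine.
    return (vocab_size * max_prefill_token_num
            * bytes_per_element * overhead_factor)

# e.g. vocab_size=152000, max_prefill_token_num=4096, fp16:
# 152000 * 4096 * 2 * 4 bytes ~= 4.6 GiB under these assumptions.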

lvhan028 (Collaborator) commented:
max_prefill_token_num is already defined in PytorchEngineConfig and TurbomindEngineConfig.
What does exposing it in the API mean?
Please also define what counts as a small max_prefill_token_num.

grimoire (Collaborator, Author) commented:
lmdeploy serve api_server cannot recognize --max-prefill-token-num.

@@ -56,6 +57,7 @@ class ModelConfig:
     sliding_window: int = -1
     dtype: torch.dtype = torch.float16
     multi_query_attention: bool = False
+    vocab_size: int = 40000
A collaborator commented on this line:
Shouldn't this be left without a default value?

grimoire (Collaborator, Author) replied on Mar 14, 2024:
Yes. The main concern is that some models may not have a vocab_size field; keeping a default lets us still reserve some runtime memory as a fallback. 40000 x 4096 x 7 ≈ 1 GB.
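
A quick check of that figure, treating the 7 from the comment as an opaque bytes-per-element factor (an assumption on my part):

# fallback vocab_size * prefill token count * assumed 7 bytes per element
reserve_bytes = 40000 * 4096 * 7
print(f'{reserve_bytes / 1024**3:.2f} GiB')  # -> 1.07 GiB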

RunningLeon (Collaborator) left a comment:
LGTM

lvhan028 changed the title from "torch engine better cache allocation" to "better cache allocation in pytorch engine" on Mar 14, 2024.
lvhan028 merged commit 5682efe into InternLM:main on Mar 14, 2024.
5 checks passed.