better cache allocation in pytorch engine #1272
Conversation
The estimate is based on:

@@ -56,6 +57,7 @@ class ModelConfig:
     sliding_window: int = -1
     dtype: torch.dtype = torch.float16
     multi_query_attention: bool = False
+    vocab_size: int = 40000
Should this really have a default value?
Yes. The main concern is that some models may not have a vocab_size field; the default guarantees that a baseline amount of runtime memory is still reserved. 40000 x 4096 x 7 ≈ 1 G.
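For scale, here is a back-of-the-envelope version of that estimate. The factor names (`num_tokens`, `bytes_per_element`) are assumptions for illustration, not engine constants:

```python
# Rough sketch of the fallback reservation discussed above.
vocab_size = 40000        # fallback default when the model config lacks vocab_size
num_tokens = 4096         # assumed number of tokens in one forward pass
bytes_per_element = 7     # assumed per-element cost, as in the comment above

reserved_bytes = vocab_size * num_tokens * bytes_per_element
print(f"{reserved_bytes / 2**30:.2f} GiB")  # ~1.07 GiB, i.e. roughly 1 G
```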
LGTM
Runtime intermediate memory usage might lead to OOM if too many cache blocks are in use. Since most of the intermediate memory is consumed by lm_head in ModelForCausalLM, we leave enough space for that buffer and allocate the paged cache with the rest of the memory. Tested on V100 with prompts of 10k+ input ids.
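A minimal sketch of that allocation order, assuming a PyTorch runtime; the function name, `cache_block_bytes`, `max_tokens`, and the 2x headroom factor are illustrative assumptions, not LMDeploy's actual API:

```python
import torch

def estimate_num_cache_blocks(model_config, cache_block_bytes, max_tokens=4096):
    """Sketch: reserve space for the lm_head intermediates first, then fill
    the remaining GPU memory with paged cache blocks."""
    free_bytes, _total = torch.cuda.mem_get_info()

    # lm_head produces one logit per (token, vocab entry); this is the
    # intermediate buffer the PR reserves room for.
    dtype_bytes = torch.finfo(model_config.dtype).bits // 8
    lm_head_bytes = max_tokens * model_config.vocab_size * dtype_bytes
    reserved = 2 * lm_head_bytes  # assumed headroom for other runtime buffers

    usable = max(free_bytes - reserved, 0)
    return usable // cache_block_bytes
```

Reserving before sizing the paged cache is what prevents the OOM described above: the KV-cache block count can no longer grow into the memory that lm_head needs at prefill time.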