Similar to what was described in huggingface/candle#2108:
"When prompts get longer than trivial sizes, the memory usage spikes as the prompt is thrown into one Tensor and sent off to a forward pass in the model at whatever length it comes in as. These spikes can be reduced by processing the batch in chunks."
There's a candle implementation here: huggingface/candle#2111.
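To make the idea concrete, here is a minimal sketch of what chunked prefill could look like. The `Model` trait, `forward_chunk`, `prefill_chunked`, and `CHUNK_SIZE` are illustrative names only, not mistral.rs or candle APIs:

```rust
const CHUNK_SIZE: usize = 512;

/// Hypothetical model interface: each call consumes a chunk of prompt tokens
/// starting at `offset`, appends the resulting keys/values to an internal
/// KV cache, and returns the logits for the chunk's last position.
trait Model {
    fn forward_chunk(&mut self, tokens: &[u32], offset: usize) -> Vec<f32>;
}

/// Run prefill in fixed-size chunks instead of one big forward pass, so peak
/// activation memory is bounded by CHUNK_SIZE rather than the prompt length.
/// Only the logits of the final chunk are needed to sample the first token.
fn prefill_chunked<M: Model>(model: &mut M, prompt: &[u32]) -> Vec<f32> {
    let mut last_logits = Vec::new();
    for (i, chunk) in prompt.chunks(CHUNK_SIZE).enumerate() {
        // `offset` tells the model where this chunk starts in the sequence,
        // so positional embeddings and the causal mask stay aligned with the
        // tokens already in the KV cache.
        let offset = i * CHUNK_SIZE;
        last_logits = model.forward_chunk(chunk, offset);
    }
    last_logits
}
```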
Let's say we configure a setting batch_size = 512.
The scheduler would need to be aware of it, and only schedule two prompts together if they total fewer than 512 tokens combined.
And the engine should be aware of it too: if a single sequence is longer than 512 tokens, split its prefill into chunks, as in the sketch below.
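As a rough sketch of that scheduling policy: split every prompt into chunks of at most `BATCH_SIZE` tokens, then pack chunks into batches whose combined length stays within the budget. The `Seq`, `Chunk`, and `schedule` names below are hypothetical, not the actual mistral.rs scheduler types:

```rust
const BATCH_SIZE: usize = 512;

struct Seq {
    id: usize,
    tokens: Vec<u32>,
}

/// A unit of prefill work: a slice of one sequence's prompt.
struct Chunk {
    seq_id: usize,
    start: usize,
    len: usize,
}

/// Split every sequence into chunks of at most BATCH_SIZE tokens, then
/// greedily pack chunks into batches whose total stays within BATCH_SIZE.
fn schedule(seqs: &[Seq]) -> Vec<Vec<Chunk>> {
    let mut batches: Vec<Vec<Chunk>> = vec![Vec::new()];
    let mut used = 0;
    for seq in seqs {
        let mut start = 0;
        while start < seq.tokens.len() {
            let len = (seq.tokens.len() - start).min(BATCH_SIZE);
            // Start a new batch if this chunk would push it past the budget.
            if used + len > BATCH_SIZE && used > 0 {
                batches.push(Vec::new());
                used = 0;
            }
            batches
                .last_mut()
                .unwrap()
                .push(Chunk { seq_id: seq.id, start, len });
            used += len;
            start += len;
        }
    }
    batches
}
```

Because every non-final chunk of a sequence is exactly `BATCH_SIZE` tokens, two chunks of the same sequence never land in the same batch, and executing batches in order guarantees a chunk's KV-cache prefix has already been computed by the time it runs.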
To reproduce it locally, run the benchmark with a high enough -p and you get an OOM. Generating the same number of tokens works, however.
@lucasavila00, this looks great. It will require modifying the attention mask calculation of every model, so it may be helpful to factor those out into a layers.rs in mistralrs-core.
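If the mask computation is factored out like that, the shared helper mainly has to account for the chunk's offset into the sequence. A minimal sketch under that assumption; the name and signature are hypothetical, and a real helper would presumably build a candle Tensor rather than a Vec:

```rust
/// For a chunk of `q_len` new tokens starting at `offset` in the sequence,
/// query `i` may attend to every cached position and to new positions
/// `j <= offset + i` (causal within the chunk). `true` means "may attend".
fn causal_mask_for_chunk(q_len: usize, offset: usize) -> Vec<Vec<bool>> {
    let kv_len = offset + q_len;
    (0..q_len)
        .map(|i| (0..kv_len).map(|j| j <= offset + i).collect())
        .collect()
}
```

In the decode phase, `q_len` is 1 and the single row comes out all `true`, which matches the usual "no mask needed" fast path for single-token steps.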