Optimizing Model Inference: Strategies for Efficient Memory Management and Storage Utilization #8
Replies: 3 comments
-
Layer-wise Inference: as this article said, during inference, layers are executed sequentially. The output of the previous layer is the input to the next, and only one layer executes at a time. Therefore, only the layer currently being executed needs to be resident in memory; earlier layers can be released and later ones loaded on demand, so peak memory is governed by the largest single layer rather than by the whole model.
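As a minimal illustration of that execution model (a sketch only, assuming each layer has been saved as its own checkpoint file holding a full module; `layer_paths` and the save format are hypothetical, not a particular library's API):

```python
import torch

def layerwise_forward(layer_paths, x):
    """Run the network one layer at a time so that only a single layer's
    weights are resident on the GPU at any moment."""
    for path in layer_paths:
        layer = torch.load(path, map_location="cuda")  # bring in this layer only
        with torch.no_grad():
            x = layer(x)          # previous layer's output feeds the next layer
        del layer                 # drop the weights...
        torch.cuda.empty_cache()  # ...and return the memory to the allocator
    return x
```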
-
THE SHARED OBJECTS APPROACH
The Shared Objects approach is a method of memory management where each memory buffer (shared object) is exclusively assigned to an intermediate tensor during computation. The unique-assignment rule, along with limitations on shared-object usage, aims to minimize the total memory size occupied by these shared objects. This approach is emphasized for its suitability in the context of GPU textures.
[Figure: memory footprint of Shared Objects strategies]

THE OFFSET CALCULATION APPROACH
The Offset Calculation approach involves carefully managing memory allocation by assigning offsets to intermediate tensors within a pre-allocated memory block. This approach aims to minimize the overall size of the allocated memory, making it particularly useful for CPU memory or GPU buffers. The analogy to the 2D strip packing problem helps visualize the optimization objective: efficiently packing rectangular items into a container minimizes the container's size along one dimension.
[Figure: memory footprint of Offset Calculation strategies]
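To make both strategies concrete, here is a small hypothetical sketch (not the actual MediaPipe/TFLite implementation): a greedy Shared Objects assigner and a greedy Offset Calculation assigner (a first-fit "strip packing" heuristic) over intermediate tensors described by their size and lifetime. The `Tensor` fields and the tiny example graph at the end are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int        # bytes
    first_op: int    # index of the op that produces this tensor
    last_op: int     # index of the last op that consumes it

def overlaps(a, b):
    """Two tensors conflict if their [first_op, last_op] lifetimes intersect."""
    return not (a.last_op < b.first_op or a.first_op > b.last_op)

def assign_shared_objects(tensors):
    """Greedy Shared Objects assignment: every tensor is bound to exactly one
    shared object, and an object is reused only by tensors whose lifetimes do
    not overlap; each object grows to the largest tensor it ever hosts."""
    objects = []                      # each: {"size": int, "users": [Tensor]}
    assignment = {}
    for t in sorted(tensors, key=lambda t: t.size, reverse=True):
        obj = next((o for o in objects
                    if all(not overlaps(u, t) for u in o["users"])), None)
        if obj is None:
            obj = {"size": 0, "users": []}
            objects.append(obj)
        obj["size"] = max(obj["size"], t.size)
        obj["users"].append(t)
        assignment[t.name] = objects.index(obj)
    return assignment, sum(o["size"] for o in objects)

def assign_offsets(tensors):
    """Greedy Offset Calculation: place tensors largest-first at the lowest
    byte offset in one arena that does not collide with any already-placed
    tensor whose lifetime overlaps."""
    placed, offsets = [], {}
    for t in sorted(tensors, key=lambda t: t.size, reverse=True):
        busy = sorted((off, off + p.size) for p, off in placed if overlaps(p, t))
        offset = 0
        for lo, hi in busy:
            if offset + t.size <= lo:
                break                 # the gap before this range fits the tensor
            offset = max(offset, hi)
        offsets[t.name] = offset
        placed.append((t, offset))
    arena = max((off + t.size for t, off in placed), default=0)
    return offsets, arena

# Toy graph: four intermediate tensors with partially overlapping lifetimes.
ts = [Tensor("a", 64, 0, 1), Tensor("b", 32, 1, 2),
      Tensor("c", 64, 2, 3), Tensor("d", 16, 1, 3)]
print(assign_shared_objects(ts))   # ({'a': 0, 'c': 0, 'b': 1, 'd': 2}, 112)
print(assign_offsets(ts))          # ({'a': 0, 'c': 0, 'b': 64, 'd': 96}, 112)
```

On this toy graph both heuristics reach the same footprint; on larger graphs the offset approach can typically pack tighter because a byte range, unlike a whole shared object, can be reused partially.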
-
Consider a language model with 70 billion parameters: stored in 16-bit precision, its parameters alone take up about 130 GB. Merely initializing the model on GPUs already demands two 80 GB A100s. When the model processes input sequences during inference, memory consumption escalates dramatically because of the attention computation, and the memory needed for attention grows quadratically as the input sequence lengthens. So, on top of the model's ~130 GB, an ample amount of additional space is indispensable. Now the question emerges: how can we economize on memory without compromising the model's performance?
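For a rough sense of the numbers, here is a back-of-the-envelope sketch; the head count and sequence lengths below are illustrative assumptions, not the specification of any particular 70B model:

```python
# Weights: 70 billion parameters at 2 bytes each (fp16/bf16).
params = 70e9
weight_gib = params * 2 / 2**30
print(f"weights alone: ~{weight_gib:.0f} GiB")           # ~130 GiB

# Naive attention materializes an (n_heads, seq, seq) score matrix per layer,
# so its memory grows quadratically with the sequence length.
n_heads, bytes_per_el = 64, 2                             # assumed values
for seq in (2_048, 8_192, 32_768):
    per_layer_gib = n_heads * seq * seq * bytes_per_el / 2**30
    print(f"seq={seq:>6}: one layer's attention scores ~ {per_layer_gib:.1f} GiB")
```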
Read more at the article