Optimizing Model Inference: Strategies for Efficient Memory Management and Storage Utilization #8
Replies: 3 comments
-
Layer-wise Inference: as this article said, during inference, layers are executed sequentially. The output of the previous layer is the input to the next, and only one layer executes at a time. Therefore, only the layer currently being executed needs to be resident in memory; earlier layers can be released and later ones loaded on demand, so peak memory is governed by the largest single layer rather than by the whole model.
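As a minimal illustration of that execution model (a sketch only, assuming each layer has been saved as its own checkpoint file holding a full module; `layer_paths` and the save format are hypothetical, not a particular library's API):

```python
import torch

def layerwise_forward(layer_paths, x):
    """Run the network one layer at a time so that only a single layer's
    weights are resident on the GPU at any moment."""
    for path in layer_paths:
        layer = torch.load(path, map_location="cuda")  # bring in this layer only
        with torch.no_grad():
            x = layer(x)          # previous layer's output feeds the next layer
        del layer                 # drop the weights...
        torch.cuda.empty_cache()  # ...and return the memory to the allocator
    return x
```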
-
THE SHARED OBJECTS APPROACH
The Shared Objects approach is a method of memory management where each memory buffer (shared object) is exclusively assigned to an intermediate tensor during computation. The unique-assignment rule, along with limitations on shared-object usage, aims to minimize the total memory size occupied by these shared objects. This approach is emphasized for its suitability in the context of GPU textures.
[Figure: memory footprint of Shared Objects strategies]

THE OFFSET CALCULATION APPROACH
The Offset Calculation approach involves carefully managing memory allocation by assigning offsets to intermediate tensors within a pre-allocated memory block. This approach aims to minimize the overall size of the allocated memory, making it particularly useful for CPU memory or GPU buffers. The analogy to the 2D strip packing problem helps visualize the optimization objective: efficiently packing rectangular items into a container minimizes the container's size along one dimension.
[Figure: memory footprint of Offset Calculation strategies]
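To make both strategies concrete, here is a small hypothetical sketch (not the actual MediaPipe/TFLite implementation): a greedy Shared Objects assigner and a greedy Offset Calculation assigner (a first-fit "strip packing" heuristic) over intermediate tensors described by their size and lifetime. The `Tensor` fields and the tiny example graph at the end are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int        # bytes
    first_op: int    # index of the op that produces this tensor
    last_op: int     # index of the last op that consumes it

def overlaps(a, b):
    """Two tensors conflict if their [first_op, last_op] lifetimes intersect."""
    return not (a.last_op < b.first_op or a.first_op > b.last_op)

def assign_shared_objects(tensors):
    """Greedy Shared Objects assignment: every tensor is bound to exactly one
    shared object, and an object is reused only by tensors whose lifetimes do
    not overlap; each object grows to the largest tensor it ever hosts."""
    objects = []                      # each: {"size": int, "users": [Tensor]}
    assignment = {}
    for t in sorted(tensors, key=lambda t: t.size, reverse=True):
        obj = next((o for o in objects
                    if all(not overlaps(u, t) for u in o["users"])), None)
        if obj is None:
            obj = {"size": 0, "users": []}
            objects.append(obj)
        obj["size"] = max(obj["size"], t.size)
        obj["users"].append(t)
        assignment[t.name] = objects.index(obj)
    return assignment, sum(o["size"] for o in objects)

def assign_offsets(tensors):
    """Greedy Offset Calculation: place tensors largest-first at the lowest
    byte offset in one arena that does not collide with any already-placed
    tensor whose lifetime overlaps."""
    placed, offsets = [], {}
    for t in sorted(tensors, key=lambda t: t.size, reverse=True):
        busy = sorted((off, off + p.size) for p, off in placed if overlaps(p, t))
        offset = 0
        for lo, hi in busy:
            if offset + t.size <= lo:
                break                 # the gap before this range fits the tensor
            offset = max(offset, hi)
        offsets[t.name] = offset
        placed.append((t, offset))
    arena = max((off + t.size for t, off in placed), default=0)
    return offsets, arena

# Toy graph: four intermediate tensors with partially overlapping lifetimes.
ts = [Tensor("a", 64, 0, 1), Tensor("b", 32, 1, 2),
      Tensor("c", 64, 2, 3), Tensor("d", 16, 1, 3)]
print(assign_shared_objects(ts))   # ({'a': 0, 'c': 0, 'b': 1, 'd': 2}, 112)
print(assign_offsets(ts))          # ({'a': 0, 'c': 0, 'b': 64, 'd': 96}, 112)
```

On this toy graph both heuristics reach the same footprint; on larger graphs the offset approach can typically pack tighter because a byte range, unlike a whole shared object, can be reused partially.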
-
Consider a language model with 70 billion parameters: stored in 16-bit precision, its parameters alone take up about 130 GB. Merely initializing the model on GPUs already demands two 80 GB A100s. When the model processes input sequences during inference, memory consumption escalates dramatically because of the attention computation, and the memory needed for attention grows quadratically as the input sequence lengthens. So, on top of the model's ~130 GB, an ample amount of additional space is indispensable. Now the question emerges: how can we economize on memory without compromising the model's performance?
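For a rough sense of the numbers, here is a back-of-the-envelope sketch; the head count and sequence lengths below are illustrative assumptions, not the specification of any particular 70B model:

```python
# Weights: 70 billion parameters at 2 bytes each (fp16/bf16).
params = 70e9
weight_gib = params * 2 / 2**30
print(f"weights alone: ~{weight_gib:.0f} GiB")           # ~130 GiB

# Naive attention materializes an (n_heads, seq, seq) score matrix per layer,
# so its memory grows quadratically with the sequence length.
n_heads, bytes_per_el = 64, 2                             # assumed values
for seq in (2_048, 8_192, 32_768):
    per_layer_gib = n_heads * seq * seq * bytes_per_el / 2**30
    print(f"seq={seq:>6}: one layer's attention scores ~ {per_layer_gib:.1f} GiB")
```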
Read more at the article