Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, Piotr Nawrot+, N/A, arXiv'24 #1270

AkihikoWatanabe · 2024-04-07T10:29:11Z

URL

https://arxiv.org/abs/2403.09636

Affiliations

Piotr Nawrot, N/A
Adrian Łańcucki, N/A
Marcin Chochowski, N/A
David Tarjan, N/A
Edoardo M. Ponti, N/A

Abstract

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key-value cache compression at inference time. Most importantly, the model learns to apply different compression rates in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to ~3.7x throughput increase in auto-regressive inference on a NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4x cache compression, outperforming up-trained grouped-query attention (GQA). GQA and DMC can be even combined to obtain compounded gains. As a result DMC fits longer contexts and larger batches within any given memory budget.
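The linear scaling mentioned in the abstract can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the paper: the function name `kv_cache_bytes` is hypothetical, and the example shape (32 layers, 32 key-value heads, head dimension 128, 16-bit values) is an assumption chosen to resemble a 7B-class model. `compression_ratio` simply divides the total, which treats compression as uniform, whereas DMC learns different rates per head and layer.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2, compression_ratio=1.0):
    """Estimate key-value cache size in bytes (illustrative only).

    All parameter names are assumptions for this sketch. The factor 2
    accounts for storing one key and one value vector per token, per
    head, per layer.
    """
    uncompressed = (2 * num_layers * num_kv_heads * head_dim
                    * seq_len * batch_size * bytes_per_value)
    # Uniform average compression; real per-head rates may differ.
    return uncompressed / compression_ratio


GIB = 1024 ** 3

# Assumed 7B-like shape: 32 layers, 32 KV heads, head_dim 128, fp16.
base = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
compressed = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1,
                            compression_ratio=4.0)

print(f"uncompressed: {base / GIB:.2f} GiB")
print(f"4x compressed: {compressed / GIB:.2f} GiB")
```

Doubling either `seq_len` or `batch_size` doubles the estimate, which is why a fixed memory budget caps context length and batch size together.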

Translation (by gpt-3.5-turbo)

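Transformers have emerged as the backbone of large language models (LLMs).
However, generation is inefficient because a cache of key-value representations for past tokens must be kept in memory. The size of this cache scales linearly with the input sequence length and the batch size.
As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time.
Most importantly, the model learns to apply different compression rates in different heads and layers.
We retrofit pre-trained LLMs such as Llama 2 (7B, 13B, and 70B) into DMC Transformers and achieve a throughput increase of up to about 3.7x in auto-regressive inference on an NVIDIA H100 GPU.
DMC is applied via continued pre-training on a small fraction of the original data, without adding any extra parameters. We find that it preserves the original downstream performance with up to 4x cache compression and outperforms up-trained grouped-query attention (GQA).
GQA and DMC can also be combined to obtain compounded gains.
As a result, DMC fits longer contexts and larger batches within any given memory budget.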

Summary (by gpt-3.5-turbo)

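Dynamic Memory Compression (DMC) is proposed to improve the generation efficiency of Transformers. DMC learns to apply different compression rates in different heads and layers, and is applied to pre-trained LLMs. It improves throughput while preserving the original downstream performance with up to 4x cache compression. Combining DMC with GQA can yield further gains, which is useful when handling long contexts and large batches.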

AkihikoWatanabe · 2024-04-07T10:29:50Z

Reference: https://x.com/hillbig/status/1776755029581676943?s=46&t=Y6UuIHB0Lv0IpmFAjlc2-Q

AkihikoWatanabe · 2024-04-07T13:19:33Z

Figure 1 in the paper is very easy to follow.

AkihikoWatanabe · 2024-04-07T13:42:54Z

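Compared with GQA #1271, DMC compresses the cache by 2-4x while achieving higher performance. For the 70B model, the cache was first compressed 8x with GQA and then a further 2x with DMC, and performance stayed equivalent.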
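To see what stacking the two methods buys, here is a rough sketch of the arithmetic, assuming the two ratios simply multiply (8x from GQA times 2x from DMC gives 16x overall). The helper names `combined_ratio` and `tokens_that_fit`, the 10 GiB budget, and the per-token size (80 layers, 64 heads, head dimension 128, 16-bit values, before any compression) are all assumptions for illustration, not figures from the paper.

```python
import math


def combined_ratio(*ratios):
    """Overall compression when independent methods are stacked.

    Assumes the ratios multiply, e.g. GQA 8x followed by DMC 2x.
    """
    return math.prod(ratios)


def tokens_that_fit(budget_bytes, bytes_per_token, ratio=1):
    """Number of cached tokens that fit in a memory budget (hypothetical helper)."""
    return (budget_bytes * ratio) // bytes_per_token


# Assumed uncompressed per-token cache size for a 70B-like shape:
# keys + values (2) * 80 layers * 64 heads * head_dim 128 * 2 bytes.
per_token = 2 * 80 * 64 * 128 * 2
budget = 10 * 1024 ** 3  # assumed 10 GiB reserved for the cache

total = combined_ratio(8, 2)
print("combined ratio:", total)
print("tokens without compression:", tokens_that_fit(budget, per_token))
print("tokens with GQA + DMC:", tokens_that_fit(budget, per_token, total))
```

The token count covers all sequences in the batch combined, so the same gain can be spent on longer contexts, larger batches, or a mix of both.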

AkihikoWatanabe added the Pocket label Apr 7, 2024

AkihikoWatanabe changed the title from "あ" to "Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference, Piotr Nawrot+, N/A, arXiv'24" Apr 7, 2024

AkihikoWatanabe added Efficiency/SpeedUp NLP LanguageModel Transformer Attention labels Apr 7, 2024
