URL
Affiliations
Abstract
Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
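To make the difference from standard multi-head attention concrete, here is a minimal NumPy sketch of one incremental decoding step of multi-query attention. The parameter names and shapes (P_q, P_k, P_v, P_o, and the dimensions d, k, v, h) are illustrative assumptions, not code from the paper: queries keep a head axis, while the key and value projections do not, so the cached "keys" and "values" tensors carry no head dimension.

```python
import numpy as np

def multi_query_attention_step(x, prev_K, prev_V, P_q, P_k, P_v, P_o):
    """One incremental decoding step of multi-query attention (illustrative sketch).

    x:      (d,)       activations for the current position
    prev_K: (t, k)     cached keys, shared by all heads (no head axis)
    prev_V: (t, v)     cached values, shared by all heads (no head axis)
    P_q:    (h, d, k)  per-head query projections
    P_k:    (d, k)     single key projection
    P_v:    (d, v)     single value projection
    P_o:    (h, d, v)  per-head output projections
    """
    # Queries are still per-head; keys and values are computed once and shared.
    q = np.einsum("d,hdk->hk", x, P_q)                                   # (h, k)
    K = np.concatenate([prev_K, np.einsum("d,dk->k", x, P_k)[None]], 0)  # (t+1, k)
    V = np.concatenate([prev_V, np.einsum("d,dv->v", x, P_v)[None]], 0)  # (t+1, v)

    # Scaled dot-product attention of each head's query against the shared cache.
    logits = np.einsum("hk,tk->ht", q, K) / np.sqrt(K.shape[-1])         # (h, t+1)
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    o = np.einsum("ht,tv->hv", weights, V)                               # (h, v)
    y = np.einsum("hv,hdv->d", o, P_o)                                   # (d,)
    return y, K, V  # K and V become the cache for the next step
```

Because the cached K and V must be reloaded from memory at every decoding step, dropping their head dimension makes them roughly h times smaller than in standard multi-head attention, which is the source of the memory-bandwidth saving the abstract describes.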
Translation (by gpt-3.5-turbo)
Training these layers is generally fast and simple because they can be parallelized along the length of the sequence, but incremental inference (where such parallelization is impossible) is often slow because of the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors.
We propose a variant called multi-query attention, in which the keys and values are shared across all of the different attention "heads". This greatly reduces the size of these tensors and hence the memory-bandwidth requirements of incremental decoding.
We verify experimentally that the resulting models are indeed much faster to decode and suffer only a minor degradation in quality from the baseline.
Summary (by gpt-3.5-turbo)