URL
Affiliations
Abstract
Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inference (where such parallelization is impossible) is often slow, due to the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors. We propose a variant called multi-query attention, where the keys and values are shared across all of the different attention "heads", greatly reducing the size of these tensors and hence the memory bandwidth requirements of incremental decoding. We verify experimentally that the resulting models can indeed be much faster to decode, and incur only minor quality degradation from the baseline.
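To make the difference from standard multi-head attention concrete, here is a minimal NumPy sketch of one incremental decoding step of multi-query attention. The parameter names and shapes (P_q, P_k, P_v, P_o, and the dimensions d, k, v, h) are illustrative assumptions, not code from the paper: queries keep a head axis, while the key and value projections do not, so the cached "keys" and "values" tensors carry no head dimension.

```python
import numpy as np

def multi_query_attention_step(x, prev_K, prev_V, P_q, P_k, P_v, P_o):
    """One incremental decoding step of multi-query attention (illustrative sketch).

    x:      (d,)       activations for the current position
    prev_K: (t, k)     cached keys, shared by all heads (no head axis)
    prev_V: (t, v)     cached values, shared by all heads (no head axis)
    P_q:    (h, d, k)  per-head query projections
    P_k:    (d, k)     single key projection
    P_v:    (d, v)     single value projection
    P_o:    (h, d, v)  per-head output projections
    """
    # Queries are still per-head; keys and values are computed once and shared.
    q = np.einsum("d,hdk->hk", x, P_q)                                   # (h, k)
    K = np.concatenate([prev_K, np.einsum("d,dk->k", x, P_k)[None]], 0)  # (t+1, k)
    V = np.concatenate([prev_V, np.einsum("d,dv->v", x, P_v)[None]], 0)  # (t+1, v)

    # Scaled dot-product attention of each head's query against the shared cache.
    logits = np.einsum("hk,tk->ht", q, K) / np.sqrt(K.shape[-1])         # (h, t+1)
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    o = np.einsum("ht,tv->hv", weights, V)                               # (h, v)
    y = np.einsum("hv,hdv->d", o, P_o)                                   # (d,)
    return y, K, V  # K and V become the cache for the next step
```

Because the cached K and V must be reloaded from memory at every decoding step, dropping their head dimension makes them roughly h times smaller than in standard multi-head attention, which is the source of the memory-bandwidth saving the abstract describes.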
Translation (by gpt-3.5-turbo)
Training these layers is generally fast and simple because they can be parallelized along the length of the sequence, but incremental inference (where such parallelization is impossible) is often slow because of the memory-bandwidth cost of repeatedly loading the large "keys" and "values" tensors.
We propose a variant called multi-query attention, in which the keys and values are shared across all of the different attention "heads". This greatly reduces the size of these tensors and hence the memory-bandwidth requirements of incremental decoding.
We verify experimentally that the resulting models are indeed much faster to decode and suffer only a minor degradation in quality from the baseline.
Summary (by gpt-3.5-turbo)