Efficiently Scaling Transformer Inference, Reiner Pope+, N/A, arXiv'22 #601

AkihikoWatanabe · 2023-04-30T13:56:32Z

URL

https://arxiv.org/abs/2211.05102

Affiliations

Reiner Pope, N/A
Sholto Douglas, N/A
Aakanksha Chowdhery, N/A
Jacob Devlin, N/A
James Bradbury, N/A
Anselm Levskaya, N/A
Jonathan Heek, N/A
Kefan Xiao, N/A
Shivani Agrawal, N/A
Jeff Dean, N/A

Abstract

We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e. multiple query heads share single key/value head) enables scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.

Translation (by gpt-3.5-turbo)

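This work studies the problem of efficient generative inference for Transformer models in one of its most demanding settings: large deep models, tight latency targets, and long sequence lengths. Because use cases for these models are growing rapidly across application areas, it is important to better understand the engineering tradeoffs of inference for large Transformer-based models. We developed a simple analytical model of inference efficiency for selecting the best multi-dimensional partitioning techniques, optimized for TPU v4 slices, based on application requirements. Combining these with low-level optimizations, we achieved a new Pareto frontier on the tradeoff between latency and model FLOPS utilization for 500B+ parameter models, outperforming the FasterTransformer benchmark suite. We also showed that, with appropriate partitioning, the lower memory requirements of multiquery attention (multiple query heads share a single key/value head) make it possible to scale to 32x larger context lengths. Finally, low-batch-size latency during generation with int8 weight quantization is 29 ms per token, and large-batch-size processing of input tokens reaches 76% MFU, while supporting a long 2048-token context length on the PaLM 540B parameter model.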

Summary (by gpt-3.5-turbo)

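- To understand the engineering tradeoffs of inference for large Transformer-based models, the authors develop a simple analytical model for selecting the best multi-dimensional partitioning techniques.
- Combined with low-level optimizations, this achieves a new Pareto frontier on the tradeoff between latency and model FLOPS utilization for 500B+ parameter models, outperforming the FasterTransformer benchmark suite.
- With appropriate partitioning, the lower memory requirements of multiquery attention allow scaling to 32x larger context lengths.
- With int8 weight quantization, low-batch-size latency during generation is 29 ms per token; large-batch-size processing of input tokens reaches 76% MFU, while supporting a long 2048-token context length on the PaLM 540B parameter model.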
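
As a reading aid for the MFU figure above, here is a minimal sketch of how model FLOPS utilization can be computed for a forward pass. It assumes the common approximation of about 2 FLOPs per parameter per token for a decoder-only model and ignores attention FLOPs. The function name, its arguments, and the numbers in the usage lines are hypothetical illustrations, not values from the paper.

```python
def model_flops_utilization(n_params, tokens_per_second, n_chips, peak_flops_per_chip):
    """Ratio of achieved model FLOPs/s to the hardware's peak FLOPs/s.

    Assumption: a forward pass costs roughly 2 * n_params FLOPs per token
    (one multiply and one add per weight); attention FLOPs are ignored.
    """
    flops_per_token = 2 * n_params
    achieved_flops_per_second = flops_per_token * tokens_per_second
    peak_flops_per_second = n_chips * peak_flops_per_chip
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical numbers chosen only to make the arithmetic easy to follow.
mfu = model_flops_utilization(
    n_params=1e9,             # 1B-parameter model
    tokens_per_second=1000,   # measured throughput
    n_chips=1,
    peak_flops_per_chip=4e12, # 4 TFLOP/s peak
)
print(f"MFU = {mfu:.0%}")
```

Under this definition, a higher MFU means less of the accelerator's peak compute is lost to memory traffic, communication, or idle time.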

AkihikoWatanabe · 2023-04-30T13:57:29Z

In particular, a technique called Multiquery Attention appears to be effective at reducing the cost of Transformer inference.
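
To make the memory argument concrete, here is a rough sketch comparing the key/value cache size of multihead attention with multiquery attention, where all query heads share a single key/value head as described in the abstract. The function name and all dimensions below are hypothetical and deliberately small; they are not the PaLM configuration, and the sketch does not model how the cache is partitioned across chips.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Total size of the cached keys and values during generation.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes a 16-bit format such as bfloat16.
    """
    return 2 * batch * seq_len * n_layers * n_kv_heads * d_head * bytes_per_value


# Hypothetical toy configuration.
n_heads = 8
shape = dict(batch=2, seq_len=128, n_layers=4, d_head=64)

multihead = kv_cache_bytes(n_kv_heads=n_heads, **shape)  # one K/V head per query head
multiquery = kv_cache_bytes(n_kv_heads=1, **shape)       # a single shared K/V head

print(multihead, multiquery, multihead // multiquery)
```

In this simple count the cache shrinks by a factor equal to the number of query heads. How much of that saving is realized per chip depends on the partitioning, which is why the abstract ties the longer context lengths to "appropriate partitioning".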

AkihikoWatanabe changed the title a Efficiently Scaling Transformer Inference, Reiner Pope+, N/A, arXiv'22 Apr 30, 2023

AkihikoWatanabe added the Pocket label May 5, 2023
