Abstract

We study the problem of efficient generative inference for Transformer models in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. A better understanding of the engineering tradeoffs for inference on large Transformer-based models is important, as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e., multiple query heads share a single key/value head) enable scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.
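For reference, here is a minimal sketch of the multiquery attention pattern the abstract describes, written in JAX-style Python. The function name and tensor shapes are illustrative assumptions, not taken from the paper's implementation:

```python
import jax
import jax.numpy as jnp

def multiquery_attention(q, k, v):
    """Multiquery attention: all query heads share one key/value head.

    Shapes (illustrative, not from the paper's codebase):
      q: [batch, q_len, n_heads, d_head]  -- separate queries per head
      k: [batch, kv_len, d_head]          -- single shared key head
      v: [batch, kv_len, d_head]          -- single shared value head
    """
    d_head = q.shape[-1]
    # Every query head attends over the same shared keys, so K and V
    # carry no n_heads dimension.
    scores = jnp.einsum('bqhd,bkd->bhqk', q, k) / jnp.sqrt(d_head)
    weights = jax.nn.softmax(scores, axis=-1)
    # Weighted sum over the single shared value head; output regains
    # the per-head layout [batch, q_len, n_heads, d_head].
    return jnp.einsum('bhqk,bkd->bqhd', weights, v)
```

In standard multihead attention, `k` and `v` would each carry an extra `n_heads` dimension; dropping it shrinks the per-token key/value cache by a factor of `n_heads`, which is the memory saving the abstract credits for scaling to up to 32x larger context lengths.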