FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, Tri Dao+, N/A, arXiv'22 #688

AkihikoWatanabe · 2023-05-20T10:43:43Z

URL

https://arxiv.org/abs/2205.14135

Affiliations

Tri Dao, N/A
Daniel Y. Fu, N/A
Stefano Ermon, N/A
Atri Rudra, N/A
Christopher Ré, N/A

Abstract

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
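The tiling idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' CUDA kernel and says nothing about actual HBM/SRAM traffic; it only shows, under simplifying assumptions, how exact softmax attention can be accumulated one key/value block at a time without ever holding the full N x N score matrix. The function names, the `block_size` parameter, and the choice to tile only over keys/values (not over queries) are assumptions made for illustration.

```python
import numpy as np


def standard_attention(q, k, v):
    """Reference implementation: materializes the full (N, N) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v


def tiled_attention(q, k, v, block_size=64):
    """Exact attention accumulated over key/value blocks (illustrative sketch).

    Keeps a running row-wise max and softmax denominator so that partial
    results from earlier blocks can be rescaled when a larger score appears.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max of scores per query
    row_sum = np.zeros((n, 1))           # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        scores = (q @ k_blk.T) * scale   # only an (n, block) slice of scores
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(row_max - new_max)  # rescales earlier partial sums
        p = np.exp(scores - new_max)

        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((100, 16))
    k = rng.standard_normal((130, 16))
    v = rng.standard_normal((130, 8))
    print(np.allclose(tiled_attention(q, k, v, block_size=32),
                      standard_attention(q, k, v)))
```

Because the rescaling is exact, the blockwise result matches the reference up to floating-point error, which is the sense in which the abstract calls the method "exact" rather than approximate.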

Translation (by gpt-3.5-turbo)

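Transformers are slow and memory-hungry on long sequences, because the time and memory complexity of self-attention are quadratic in sequence length.
Approximate attention methods have tried to address this problem by sacrificing model quality to reduce compute complexity, but they have not achieved wall-clock speedup.
We argue that the missing principle is making attention algorithms IO-aware, that is, accounting for reads and writes between levels of GPU memory.
We propose FlashAttention, which uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.
We analyze the IO complexity of FlashAttention and show that it requires fewer HBM accesses than standard attention and is optimal for a range of SRAM sizes.
We also extend FlashAttention to block-sparse attention, providing an approximate attention algorithm that is faster than existing approximate attention methods.
FlashAttention trains Transformers faster than existing baselines: a 15% wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, a 3x speedup on GPT-2 (seq. length 1K), and a 2.4x speedup on long-range arena (seq. length 1K-4K).
FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (a 0.7 perplexity improvement on GPT-2 and a 6.4 point gain on long-document classification) and entirely new capabilities: the first Transformers to perform better than chance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).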

Summary (by gpt-3.5-turbo)

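Transformers are slow and memory-hungry on long sequences, so the attention algorithm needs to be improved. FlashAttention uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM, which lets Transformers train faster. FlashAttention also enables longer context in Transformers, yielding higher quality models and entirely new capabilities.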
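To make the "memory-hungry on long sequences" point concrete, here is a rough back-of-the-envelope sketch of how large the full attention score matrix gets at the sequence lengths quoted in the abstract. It assumes a single head, a single batch element, and 2 bytes per element (fp16); these are illustrative assumptions, not figures from the paper, and `score_matrix_bytes` is a hypothetical helper.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2):
    """Size of one full (seq_len x seq_len) attention score matrix.

    Assumes one head, one batch element, and fp16 storage by default.
    """
    return seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for n in (512, 1024, 16 * 1024, 64 * 1024):
        mib = score_matrix_bytes(n) / 2**20
        print(f"seq_len={n:>6}: {mib:,.1f} MiB per head")
```

Under these assumptions the matrix grows from 0.5 MiB at length 512 to 8 GiB at length 64K, which is why avoiding its materialization matters for the long-context results.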

AkihikoWatanabe · 2023-05-20T10:44:51Z

Proposes FlashAttention, a more computationally efficient attention algorithm.

AkihikoWatanabe added the Pocket label May 20, 2023

AkihikoWatanabe changed the title あ FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, Tri Dao+, N/A, arXiv'22 May 20, 2023

AkihikoWatanabe added Efficiency/SpeedUp MachineLearning Attention and removed Pocket labels Jul 11, 2023
