URL

Affiliations

Abstract

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
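The abstract describes computing self-attention block by block and applying the feedforward network to each block's output immediately ("fusion"), so full-sequence activations are never materialized at once. Below is a minimal sketch of that idea, not the authors' implementation: it assumes single-head attention, hypothetical parameter names (`w1`, `w2`), and the standard streaming-softmax recurrence for combining key/value blocks; layer norm, masking, and dropout are omitted.

```python
import jax
import jax.numpy as jnp

def blockwise_attention_ffn(q, k, v, w1, w2, block_size):
    """q, k, v: (seq_len, d_model); w1: (d_model, d_ff); w2: (d_ff, d_model).
    Hypothetical single-head sketch of blockwise attention with a fused FFN."""
    seq_len, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    outputs = []
    for qs in range(0, seq_len, block_size):
        q_blk = q[qs:qs + block_size]                        # one query block
        # Running statistics for a numerically stable streaming softmax.
        acc = jnp.zeros_like(q_blk)                          # weighted value sum
        row_sum = jnp.zeros((q_blk.shape[0], 1))             # softmax denominator
        row_max = jnp.full((q_blk.shape[0], 1), -jnp.inf)    # running max logit
        for ks in range(0, seq_len, block_size):
            k_blk = k[ks:ks + block_size]
            v_blk = v[ks:ks + block_size]
            logits = (q_blk @ k_blk.T) * scale               # (bq, bk) block scores
            new_max = jnp.maximum(row_max, logits.max(axis=-1, keepdims=True))
            correction = jnp.exp(row_max - new_max)          # rescale previous stats
            p = jnp.exp(logits - new_max)
            acc = acc * correction + p @ v_blk
            row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
            row_max = new_max
        attn_out = acc / row_sum                             # normalized attention output
        # Feedforward applied to this block right away, so the full-sequence
        # FFN activation (seq_len x d_ff) is never stored.
        ffn_out = jax.nn.gelu(attn_out @ w1) @ w2
        outputs.append(attn_out + ffn_out)                   # residual connection
    return jnp.concatenate(outputs, axis=0)
```

Because only one (block_size x block_size) score matrix and one (block_size x d_ff) FFN activation exist at a time, peak activation memory scales with the block size rather than the full sequence length, which is the property the abstract attributes to BPT.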
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)