URL

Affiliations

Abstract

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
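The abstract describes computing self-attention block by block and applying the feedforward network to each block's output immediately ("fusion"), so full-sequence activations are never materialized at once. Below is a minimal sketch of that idea, not the authors' implementation: it assumes single-head attention, hypothetical parameter names (`w1`, `w2`), and the standard streaming-softmax recurrence for combining key/value blocks; layer norm, masking, and dropout are omitted.

```python
import jax
import jax.numpy as jnp

def blockwise_attention_ffn(q, k, v, w1, w2, block_size):
    """q, k, v: (seq_len, d_model); w1: (d_model, d_ff); w2: (d_ff, d_model).
    Hypothetical single-head sketch of blockwise attention with a fused FFN."""
    seq_len, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    outputs = []
    for qs in range(0, seq_len, block_size):
        q_blk = q[qs:qs + block_size]                        # one query block
        # Running statistics for a numerically stable streaming softmax.
        acc = jnp.zeros_like(q_blk)                          # weighted value sum
        row_sum = jnp.zeros((q_blk.shape[0], 1))             # softmax denominator
        row_max = jnp.full((q_blk.shape[0], 1), -jnp.inf)    # running max logit
        for ks in range(0, seq_len, block_size):
            k_blk = k[ks:ks + block_size]
            v_blk = v[ks:ks + block_size]
            logits = (q_blk @ k_blk.T) * scale               # (bq, bk) block scores
            new_max = jnp.maximum(row_max, logits.max(axis=-1, keepdims=True))
            correction = jnp.exp(row_max - new_max)          # rescale previous stats
            p = jnp.exp(logits - new_max)
            acc = acc * correction + p @ v_blk
            row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
            row_max = new_max
        attn_out = acc / row_sum                             # normalized attention output
        # Feedforward applied to this block right away, so the full-sequence
        # FFN activation (seq_len x d_ff) is never stored.
        ffn_out = jax.nn.gelu(attn_out @ w1) @ w2
        outputs.append(attn_out + ffn_out)                   # residual connection
    return jnp.concatenate(outputs, axis=0)
```

Because only one (block_size x block_size) score matrix and one (block_size x d_ff) FFN activation exist at a time, peak activation memory scales with the block size rather than the full sequence length, which is the property the abstract attributes to BPT.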
Translation (by gpt-3.5-turbo)
Summary (by gpt-3.5-turbo)