Attention Is All You Need #1

shnakazawa · 2022-11-21T07:03:22Z

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1706.03762.

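Work reported by Google Brain in 2017.
Presented as a model for machine translation.
- Tackles machine translation with an architecture that replaces the previously dominant recurrent layers with attention
- Its selling point is "fast training and inference"
After the paper was published, the architecture turned out to be applicable to many tasks beyond translation and became hugely popular
- BERT, GPT, DALL-E, Vision Transformer, etc.
A fun paper that shows just how powerful matrices are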

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.


Code

https://paperswithcode.com/paper/attention-is-all-you-need#code

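Problem addressed / comparison with prior work

Before this paper, machine translation was dominated by recurrent language models such as long short-term memory (ref. 13) and gated recurrent neural networks (ref. 7) → because their computation is sequential, it cannot be parallelized, and training takes a very long time.
The attention mechanism had recently been reported to give good results on a variety of tasks (refs. 2, 19).
However, most attention mechanisms were used in combination with recurrent networks.
This paper proposes the "Transformer", an architecture that makes the attention mechanism independent of recurrent networks.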

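Key technical points and methods

Encoder-decoder model
Implemented with attention alone, without the recurrent layers that were standard in earlier LSTM and RNN translation models.
The words (strictly speaking, tokens) up to just before the position being predicted are given as input, and the output is the probability of each word appearing at that position.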
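
The "previous tokens in, next-token probabilities out" loop can be sketched roughly as follows. This is only an illustration: `toy_next_token_probs` is a hypothetical stand-in for a trained decoder, and greedy selection is an assumption made here for simplicity, not the decoding strategy used in the paper.

```python
import numpy as np

def toy_next_token_probs(prefix, vocab_size=5):
    """Hypothetical stand-in for a trained decoder.

    Returns a probability distribution over the vocabulary given the
    tokens generated so far. The toy rule favours "last token + 1".
    """
    logits = np.zeros(vocab_size)
    logits[(prefix[-1] + 1) % vocab_size] = 5.0
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def greedy_decode(start_token, eos_token, max_len=10):
    """Repeatedly feed the prefix back in and append the most likely token."""
    tokens = [start_token]
    for _ in range(max_len):
        probs = toy_next_token_probs(tokens)   # probability of each token at the next position
        next_token = int(np.argmax(probs))     # greedy choice (an assumption of this sketch)
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens

print(greedy_decode(start_token=0, eos_token=3))
```

The point is only the shape of the interface: the model sees everything generated so far and returns one distribution per step.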
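The attention mechanism
- A combination of Query (Q), Key (K), and Value (V).
- Q is the input; V is (the basis of) the output.
- The output is V weighted according to the similarity (= dot product) between Q and K. In the paper, the dot products are scaled by 1/sqrt(d_k) and passed through a softmax to obtain the weights.
- The image on the linked page makes the idea easy to grasp.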
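
As a minimal NumPy sketch of the Q/K/V weighting described above, assuming single, unbatched matrices and no masking (the function and variable names are chosen here for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # similarity of each query to each key
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights                     # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 2) (3, 5)
```

Everything is a pair of matrix multiplications with a softmax in between, so all query positions are computed at once rather than one step at a time.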
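Multi-head attention
- "Multi-head" attention is how the projections that produce Q, K, and V are learned: several attention heads run in parallel, each with its own learned linear projections.
- Aicia's explanation is probably the easiest way to get a feel for this part.
- It is used in a structure like the one in the figure below.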
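
A rough sketch of the multi-head idea in NumPy, under some simplifying assumptions: the weight matrices are random rather than learned, there is no batching, masking, or bias, and each of `W_q`, `W_k`, `W_v` packs the per-head projections into one `(d_model, d_model)` matrix. The names are illustrative, not taken from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X_q, X_kv, W_q, W_k, W_v, W_o, num_heads):
    """X_q: (n_q, d_model), X_kv: (n_k, d_model) -> (n_q, d_model)."""
    d_model = W_q.shape[1]
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split(x):
        # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X_q @ W_q), split(X_kv @ W_k), split(X_kv @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n_q, n_k)
    heads = softmax(scores, axis=-1) @ V                 # (heads, n_q, d_head)
    # Concatenate the heads back to (n_q, d_model) and mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(X_q.shape[0], d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, num_heads = 8, 2
X = rng.normal(size=(4, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, X, W_q, W_k, W_v, W_o, num_heads)  # self-attention
print(out.shape)  # (4, 8)
```

Passing the same `X` for queries and keys/values gives self-attention; passing decoder states as `X_q` and encoder outputs as `X_kv` gives the encoder-decoder attention.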

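Evaluation metrics

BLEU: state of the art (SOTA) on English→German and English→French translation
Constituency parsing (WSJ 23 F1): a score close to SOTA
In addition, the computational cost of training is 4 to 100 times lower than that of existing models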

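Remaining issues and discussion

Does not generalize well to sentences of unseen length
- Methods for representing the positions of words (strictly speaking, tokens), such as SHAPE (Kiyono et al., EMNLP 2021), have been proposed
  Reference: より良いTransformerをつくる ("Building a Better Transformer")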
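
For context, the original paper injects position information with fixed sinusoidal encodings added to the token embeddings. A minimal sketch is below. The `offset` argument is an addition made here purely to illustrate the idea of shifting absolute positions; it is not the SHAPE authors' implementation, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model, offset=0):
    """Sinusoidal position encodings of shape (seq_len, d_model).

    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    `offset` shifts the starting position (illustrative assumption only).
    """
    pos = np.arange(offset, offset + seq_len)[:, None]  # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]       # the "2i" values
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=5, d_model=8)
shifted = sinusoidal_positions(seq_len=3, d_model=8, offset=2)
print(pe.shape, np.allclose(shifted, pe[2:]))  # (5, 8) True
```

Positions beyond those seen in training produce encodings the model has never been trained on, which is one way to see why length generalization is hard.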

Key citations

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1409.0473.
- The paper that first introduced the attention mechanism
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. “Sequence to Sequence Learning with Neural Networks.” arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1409.3215.
Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. 2015. “Effective Approaches to Attention-Based Neural Machine Translation.” arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1508.04025.
- A paper that combines attention with RNNs

Reference information

