## Title
[Attention is all you need](https://arxiv.org/abs/1706.03762)

## Authors and Year
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin (2017)

## Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

## Model Type
Transformer (eschewing recurrence & relying entirely on an attention mechanism)

## Background knowledge on language models
#### Recurrent Neural Network (RNN)

- **RNN** : The result of the activation function at a hidden-layer node is sent toward the output layer and is also fed back as an input to that hidden node's next computation.

<p align="center">
    <img src="Images/RNN.jpg" alt="drawing" width="800"/>
</p>

$$
    h_{t} = \tanh(W_{x}\;x_{t} + W_{h}\;h_{t-1} + b) \\
    y_{t} = f(W_{y}\;h_{t} + b)\quad\quad\quad\quad\quad\quad
$$
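The recurrence can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the sizes and random weights are hypothetical, the output activation $f$ is taken to be the identity, the output is computed from the hidden state $h_{t}$, and separate biases `b_h` and `b_y` are assumed for the hidden and output layers.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, W_y, b_h, b_y):
    """One time step of a vanilla RNN (output activation f taken as identity)."""
    # New hidden state from the current input and the previous hidden state.
    h_t = np.tanh(W_x @ x_t + W_h @ h_prev + b_h)
    # Output computed from the new hidden state.
    y_t = W_y @ h_t + b_y
    return h_t, y_t

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 4, 3, 2, 5
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_y = rng.normal(size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

h = np.zeros(hidden_dim)  # initial hidden state
outputs = []
for x_t in rng.normal(size=(seq_len, input_dim)):
    # Each step needs the hidden state of the previous step,
    # so the loop over time cannot be parallelized.
    h, y = rnn_step(x_t, h, W_x, W_h, W_y, b_h, b_y)
    outputs.append(y)
outputs = np.stack(outputs)  # shape: (seq_len, output_dim)
```

The explicit loop over time steps is the point of the sketch: every step waits for the previous one, which is the sequential bottleneck the Transformer removes.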

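- **The problem of long-term dependencies** : As the number of time steps in a vanilla RNN grows, **information from early steps is not carried far enough forward**. Lacking sufficient memory, the model can predict a wrong next word.
    - Models such as LSTM and GRU improved performance considerably, but the recurrent structure itself still imposes a fundamental limitation.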

#### Attention mechanism

<p align="center">
    <img src="Images/attention_mechanism.jpg" alt="drawing" width="800"/>
</p>

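- A technique introduced to compensate for the drop in output-sequence accuracy as the input sequence gets longer.
    - Idea : at every time step where the decoder predicts an output word, it looks back at the entire input sentence in the encoder. Rather than weighting the whole input sentence equally, it **attends more closely to the input words that are relevant to the word being predicted at that step**.
    - Attention score ($e^{t}$) : dot product of each encoder hidden state with the decoder hidden state at the current time step.
    - Attention distribution ($\alpha^{t}$) : the distribution over input positions at each time step, obtained by applying the softmax function to the attention scores.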
    
    $$
    \alpha^{t} = softmax(e^{t})
    $$

    - Attention value ($a_{t}$) : weighted sum of the encoder hidden states, using the attention weights.
    - Concatenate : join the attention value with the decoder hidden state at time step $t$.
    - $\bar{s_{t}}$ : input to the output-layer computation, $\hat{y_{t}}$ : prediction vector

    $$
    \bar{s_{t}} = tanh \left(W_{c} \left[a_{t};s_{t} \right] + b_{c} \right) \\
    \hat{y_{t}} = softmax(W_{y}\;\bar{s_{t}} + b_{y})
    $$
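The steps above (score, distribution, value, concatenation) can be sketched for a single decoder time step as follows. This is a minimal NumPy illustration under assumptions: the dimensions, the random encoder states, and the weights `W_c` and `b_c` are hypothetical, and the final output projection to $\hat{y_{t}}$ is left out.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def attention_step(encoder_states, s_t, W_c, b_c):
    """Dot-product attention for one decoder time step.

    encoder_states: (num_inputs, d) encoder hidden states
    s_t:            (d,) decoder hidden state at the current step
    """
    e_t = encoder_states @ s_t        # attention scores, one per input position
    alpha_t = softmax(e_t)            # attention distribution
    a_t = alpha_t @ encoder_states    # attention value: weighted sum of encoder states
    # Concatenate [a_t; s_t] and project to get the input of the output layer.
    s_bar_t = np.tanh(W_c @ np.concatenate([a_t, s_t]) + b_c)
    return alpha_t, a_t, s_bar_t

# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(0)
num_inputs, d = 6, 4
encoder_states = rng.normal(size=(num_inputs, d))
s_t = rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))
b_c = np.zeros(d)

alpha_t, a_t, s_bar_t = attention_step(encoder_states, s_t, W_c, b_c)
```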

#### BLEU Score(Bilingual Evaluation Understudy Score)

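A metric that measures translation quality by comparing how similar a machine translation is to a translation produced by a human.

A detailed explanation is omitted here; see [this reference](https://wikidocs.net/31695).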

## Model architecture

#### Scaled Dot-Product Attention & Multi-head Attention

1. Scaled Dot-Product Attention

<p align="center">
    <img src="Images/Attention.jpg" alt="drawing" width="600"/>
</p>

$$
    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}({QK^{T} \over \sqrt{d_{k}}})V
$$

- Additive attention vs. Dot-product attention
    - The two are similar in theoretical complexity, but dot-product attention is faster and more space-efficient in practice, so the authors chose it.
    - For large values of $d_{k}$, the dot products grow large in magnitude, **pushing the softmax function into regions where it has extremely small gradients**. To counteract this effect, we **scale the dot products by $\frac{1}{\sqrt{d_{k}}}$**.
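A minimal NumPy sketch of the attention formula above, assuming unbatched 2-D inputs and no masking. The shapes and random inputs are hypothetical and only serve to show the computation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    # Scale by 1 / sqrt(d_k) so the scores do not grow with the key dimension.
    scores = Q @ K.T / np.sqrt(d_k)   # (n_q, n_k)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights       # output: (n_q, d_v)

# Hypothetical inputs: 3 queries, 5 key/value pairs, d_k = d_v = 64.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 64))
output, weights = scaled_dot_product_attention(Q, K, V)
```

With unit-variance components, an unscaled dot product of two $d_{k}$-dimensional vectors has variance $d_{k}$, so dividing by $\sqrt{d_{k}}$ brings the scores back to roughly unit variance before the softmax.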

2. Multi-head Attention
- Instead of performing a single attention function with $d_{model}$-dimensional keys, values, and queries, we found it beneficial to **linearly project** the queries, keys and values **$h$ times** with **different, learned linear projections** to $d_{k},\;d_{k}$ and $d_{v}$ dimensions.
    - In this work, we employ $h$ = 8 parallel attention layers, or heads. For each of these we use $d_{k}\;=d_{v}\;=d_{model}/h\;=64$.
- We then perform the **attention function in parallel**, yielding $d_{v}$-dimensional output values. These are **concatenated and once again projected, resulting in the final values**.

$$
    \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_{1}, ...,head_{h})W^{O} \\
    \quad\quad\quad\quad \mathrm{where} \ head_{i} = \mathrm{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})
$$
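The multi-head formula can be sketched as below, using $h = 8$ and $d_{k} = d_{v} = 64$ from the text. This is an illustrative NumPy version under assumptions: the projection matrices are random rather than learned, their stacked layout (`W_q[i]` for head `i`) is a choice made for this sketch, and the heads are computed in a Python loop instead of in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for 2-D inputs."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Concat(head_1, ..., head_h) W^O with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

    W_q, W_k: (h, d_model, d_k), W_v: (h, d_model, d_v), W_o: (h * d_v, d_model)
    """
    heads = [
        attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
        for i in range(W_q.shape[0])
    ]
    # Concatenate the h outputs of size d_v and project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o

# Hyperparameters from the text; weights are random placeholders.
d_model, h = 512, 8
d_k = d_v = d_model // h  # 64
rng = np.random.default_rng(0)
W_q = rng.normal(size=(h, d_model, d_k)) * d_model ** -0.5
W_k = rng.normal(size=(h, d_model, d_k)) * d_model ** -0.5
W_v = rng.normal(size=(h, d_model, d_v)) * d_model ** -0.5
W_o = rng.normal(size=(h * d_v, d_model)) * d_model ** -0.5

x = rng.normal(size=(10, d_model))  # hypothetical sequence of 10 positions
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)  # self-attention
```

Because each head works in a 64-dimensional subspace, the total cost stays close to that of a single head with full $d_{model}$ dimensionality.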

#### Encoder and Decoder Stacks

<p align="center">
    <img src="Images/Transformer_architecture.jpg" alt="drawing" width="400"/>
</p>

##### Encoder
- Stack of $N$ = 6 identical layers. Each layer has 2 sub-layers.
    - First : **Multi-head attention mechanism**.
    - Second : simple, position-wise fully connected **feed-forward network**.
- **Residual connection** around each of the two sub-layers, followed by **layer normalization**.
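
The "residual connection followed by layer normalization" pattern, i.e. $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, can be sketched with the feed-forward sub-layer as an example. This is a simplified NumPy illustration under assumptions: the learned gain and bias of layer normalization and dropout are left out, the weights are random placeholders, and the inner dimension of 2048 and the ReLU form of the feed-forward network follow the paper rather than these notes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position over the feature axis (no learned gain/bias here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def residual_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# Hypothetical input and random weights, for illustration only.
d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * d_model ** -0.5
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * d_ff ** -0.5
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))
y = residual_sublayer(x, lambda t: feed_forward(t, W1, b1, W2, b2))
```

The same wrapper would apply to the attention sub-layer; only the `sublayer` function changes.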


##### Decoder
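- Stack of $N$ = 6 identical layers. Each layer has 3 sub-layers.
    - First : **Masked multi-head self-attention**, which prevents a position from attending to subsequent positions.
    - Second : **Multi-head attention** over the output of the encoder stack.
    - Third : position-wise fully connected **feed-forward network**.
- As in the encoder, a **residual connection** surrounds each sub-layer, followed by **layer normalization**.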

##### Positional Encoding
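- Unlike RNNs or LSTMs, the Transformer processes the whole input in parallel at once, which makes it fast, but the order in which the data is fed in no longer carries word-order information. In other words, because the sequence is fed in all at once, information about word order is lost.
- The meaning of a sentence can change with the position of its words, so position information matters; the paper addresses this with positional encoding.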

For a detailed explanation, see [Minkyung's post](https://www.blossominkyung.com/deeplearning/transfomer-positional-encoding#4d058603-db0f-4d62-bb49-d85ea6dcbfc6).
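
As a brief illustration, the sinusoidal encoding used in the paper is $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The formula comes from the paper, not from these notes, and the sketch below assumes an even $d_{model}$; the function name and the sequence length are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings.

    Assumes d_model is even.
    """
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices
    pe[:, 1::2] = np.cos(angles)  # odd feature indices
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
# The encoding is added to the token embeddings before the first layer:
# x = token_embeddings + pe[:sequence_length]
```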

## Training

#### Training Data and Batching
- WMT 2014 English-German dataset (about 4.5 million sentence pairs); sentences were encoded with byte-pair encoding, using a shared source-target vocabulary of about 37000 tokens.
- WMT 2014 English-French dataset (36 million sentences); tokens were split into a 32000 word-piece vocabulary.

#### Hardware & Schedule
- 8 NVIDIA P100 GPUs
- Base model
    - About 0.4 sec per training step -> trained for a total of 100,000 steps (12 hours).
- Big model
    - About 1.0 sec per training step -> trained for a total of 300,000 steps (3.5 days).
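
As a rough sanity check, multiplying the step time by the number of steps approximately reproduces the reported durations (the step times are themselves rounded, so only approximate agreement is expected):

$$
    100{,}000 \times 0.4\,\mathrm{s} = 40{,}000\,\mathrm{s} \approx 11.1\,\mathrm{h} \\
    300{,}000 \times 1.0\,\mathrm{s} = 300{,}000\,\mathrm{s} \approx 83.3\,\mathrm{h} \approx 3.5\,\mathrm{days}
$$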

#### Optimizer
- Adam optimizer with $\beta_{1} = 0.9,\; \beta_{2} = 0.98$ and $\epsilon = 10^{-9}$
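
A minimal sketch of one Adam update with these hyperparameters is shown below. The learning rate schedule is not covered in these notes; it follows the warmup formula in the paper, $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\; step \cdot warmup\_steps^{-1.5})$ with 4000 warmup steps. The function names, parameters, and gradient values are hypothetical.

```python
import numpy as np

def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then decay proportional to step ** -0.5."""
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.98, eps=1e-9):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Hypothetical parameters and gradient, for illustration only.
param = np.array([1.0, -2.0, 3.0])
grad = np.array([0.5, -0.1, 2.0])
m = np.zeros_like(param)
v = np.zeros_like(param)

lr = transformer_lr(1)
new_param, m, v = adam_step(param, grad, m, v, t=1, lr=lr)
```

On the very first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by roughly `lr` in the direction opposite to the sign of its gradient.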

#### Regularization
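- The paper announces three types of regularization and describes two of them:
    - Residual dropout : applied to the output of each sub-layer before it is added to the sub-layer input and normalized, and to the sums of the embeddings and positional encodings ($P_{drop} = 0.1$ for the base model).
    - Label smoothing : $\epsilon_{ls} = 0.1$.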
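
Label smoothing can be sketched as follows. This is a minimal NumPy illustration that assumes one common formulation, in which the smoothing mass $\epsilon$ is spread uniformly over all classes including the true one; the paper does not spell out the exact variant, and the vocabulary size and target ids are hypothetical. The default of 0.1 is the value used in the paper.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, epsilon=0.1):
    """Turn class ids into smoothed target distributions.

    Each row is (1 - epsilon) * one_hot + epsilon / vocab_size.
    """
    one_hot = np.eye(vocab_size)[target_ids]
    return one_hot * (1.0 - epsilon) + epsilon / vocab_size

def cross_entropy(log_probs, target_dist):
    """Mean cross-entropy between predicted log-probabilities and soft targets."""
    return float(-(target_dist * log_probs).sum(axis=-1).mean())

# Hypothetical toy setup: 2 target tokens, vocabulary of 5 tokens.
vocab_size = 5
target_ids = np.array([0, 3])
targets = smooth_labels(target_ids, vocab_size)

# Uniform model predictions, used only to show the loss call.
log_probs = np.log(np.full((2, vocab_size), 1.0 / vocab_size))
loss = cross_entropy(log_probs, targets)
```

Training against the smoothed targets keeps the model from becoming fully confident in a single token.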