# Attention
## Why Attention
Attention solves the problem of long-range information transfer. LSTM and GRU help with this problem, but a single connection between Encoder and Decoder creates a bottleneck: the last computation of the Encoder must contain all the information in the sentence so that the Decoder can process it properly. Attention instead lets the Decoder directly access the activations of every word.

P.S. Attention only changes how the output is generated, so an LSTM or GRU can still be used as the Encoder alongside the Attention model.

## Computational Process
For each word generated by the Decoder, the model takes a weighted mean across all Encoder activations and "attends to" the most relevant ones. The weights are $\alpha^{<t, t'>}$, where $t$ indexes the outputs and $t'$ indexes the activations.\
__$\alpha^{<t,t'>}$ is the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$__ \
For the first output, with $T_x$ the number of input words, the formula is
$$c^{<1>} = \alpha^{<1,1>}a^{<1>} + \alpha^{<1,2>}a^{<2>} +\dots+ \alpha^{<1,T_x>}a^{<T_x>}$$
$$S^{<0>} + c^{<1>} \to S^{<1>} \to word^{<1>}$$
Similarly, for a general output step $t$
$$c^{<t>} = \sum_{t'}\alpha^{<t,t'>}a^{<t'>}$$
$$S^{<t-1>} + c^{<t>} \to S^{<t>} \to word^{<t>}$$
The process continues until the Decoder predicts \<EOS>.

Notably, for every output step $t$, the weights $\alpha^{<t,t'>}$ should sum to one over $t'$.
$$\sum_{t'}\alpha^{<t,t'>}=1$$
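
The weighted sum above can be sketched in a few lines of NumPy. This is only a minimal illustration: the function name `context_vector`, the toy activations `a`, and the weights `alpha_1` are made up for the example and are not taken from any particular library or model.

```python
import numpy as np

def context_vector(alpha_t, activations):
    """Compute c^{<t>} = sum over t' of alpha^{<t,t'>} * a^{<t'>}.

    alpha_t:     shape (T_x,), attention weights for one output step t
    activations: shape (T_x, n_a), one Encoder activation per input word
    """
    alpha_t = np.asarray(alpha_t, dtype=float)
    activations = np.asarray(activations, dtype=float)
    # The weights for a single output step must sum to one.
    if not np.isclose(alpha_t.sum(), 1.0):
        raise ValueError("attention weights must sum to 1")
    # Vector-matrix product = weighted sum of the rows of `activations`.
    return alpha_t @ activations

# Toy example (assumed values): T_x = 3 input words, n_a = 2 features each.
a = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
alpha_1 = np.array([0.7, 0.2, 0.1])  # attention paid by y^{<1>} to each a^{<t'>}

c_1 = context_vector(alpha_1, a)
print(c_1)
```

Here most of the weight falls on $a^{<1>}$, so $c^{<1>}$ stays close to that activation.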

## How to Compute $\alpha^{<t,t'>}$
To ensure the sum of $\alpha^{<t,t'>}$ over $t'$ equals 1, we use a softmax function
$$\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x}\exp(e^{<t,t'>})}$$
where $e^{<t,t'>}$ represents how relevant $a^{<t'>}$ is to word $y^{<t>}$. It is computed by a small neural network with $a^{<t'>}$ and $S^{<t-1>}$ as inputs, so $e^{<t,t'>}$ contains information from both the activation and the previous Decoder state.
$$
S^{<t-1>}\\
\downarrow\\
\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ 
\begin{bmatrix}
    \Box \\
    \Box \\
    \Box
\end{bmatrix}\to e^{<t,t'>}\\
\uparrow\\
a^{<t'>}
$$
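
As a rough sketch of this step, the snippet below scores every activation against $S^{<t-1>}$ and normalizes the scores with a softmax. The notes only say "a small neural network", so the single tanh hidden layer is an assumption (an additive-style scorer), and the names `W_s`, `W_a`, `v`, and `attention_weights` are hypothetical. The weights are random stand-ins for parameters that would normally be learned.

```python
import numpy as np

def softmax(e):
    """Numerically stable softmax over a 1-D array of scores."""
    w = np.exp(e - np.max(e))
    return w / w.sum()

def attention_weights(s_prev, activations, W_s, W_a, v):
    """Return alpha^{<t,t'>} for all t' at one output step t.

    s_prev:      shape (n_s,), previous Decoder state S^{<t-1>}
    activations: shape (T_x, n_a), Encoder activations a^{<t'>}
    W_s, W_a, v: parameters of the assumed one-hidden-layer scorer
    """
    # Hidden layer: s_prev is broadcast across all T_x activations.
    hidden = np.tanh(s_prev @ W_s + activations @ W_a)  # (T_x, n_h)
    e = hidden @ v                                       # (T_x,) scores e^{<t,t'>}
    return softmax(e)

# Assumed toy sizes; random weights stand in for trained ones.
rng = np.random.default_rng(0)
T_x, n_a, n_s, n_h = 4, 6, 5, 8
activations = rng.normal(size=(T_x, n_a))
s_prev = rng.normal(size=n_s)
W_s = rng.normal(size=(n_s, n_h))
W_a = rng.normal(size=(n_a, n_h))
v = rng.normal(size=n_h)

alpha_t = attention_weights(s_prev, activations, W_s, W_a, v)
c_t = alpha_t @ activations  # context vector c^{<t>}
print(alpha_t, alpha_t.sum())
```

Because the softmax is applied over $t'$, the weights are positive and sum to one regardless of the scorer's parameters.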

But __why__ does this neural network work anyway? __There is no closed-form answer: its weights are learned jointly with the rest of the model by backpropagation, so just trust the neural network.__

## Papers in History
The first paper to introduce Attention, applied to neural machine translation (NLP).\
[Bahdanau et al., 2014. Neural machine translation by jointly learning to align and translate]\
An early paper applying Attention to images, on top of CNN features.\
[Xu et al., 2015. Show, attend and tell: Neural image caption generation with visual attention]