# Introduction to Attention
<!-- estimated time: 4hours -->

This section will cover:

1. Sequence to sequence recap
2. Attention overview - Encoding
3. Attention overview - Decoding
4. Bahdanau and Luong Attention
* Multiplicative attention
* Additive attention
* Computer vision applications
* NLP application: Google neural machine translation
* Other attention methods
* The transformer and self-attention
* Lab: Attention basics

Attention started out in the field of computer vision as an attempt to mimic human perception:
> "One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead, humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making"
- [Recurrent Models of Visual Attention](https://arxiv.org/abs/1406.6247)

Note that we do not need to process the entire image to know it is a picture of a bird: it is enough to ignore the background and focus on the item of interest. Further, if we can direct attention to components of the image rather than to the image as a whole, we can describe the image in a more complete and nuanced manner:
<img src="assets/images/06/img_001.png" width=700 align='center'>

# 1. Seq2Seq Recap

Classic Seq2Seq models (those without attention) read the original sentence to be translated once and must then rely on that *entire* encoded input to produce every single output term.

A sequence to sequence model takes in an input that is a sequence of items and then produces another sequence of items as an output.

* In machine translation, the input sequence is a series of words in one language and the output is a translation in another language.

* In text summarization, the input is a long sequence of words and the output is a short one.

The seq2seq model usually consists of an encoder and a decoder. The encoder first processes all of the inputs and turns them into a single representation, typically a single vector known as the **context** vector. The *context* vector contains whatever information the encoder was able to capture from the input sequence.
<img src="assets/images/06/img_002.png" width=700 align='center'>

The context vector is then sent to the decoder, which uses it to formulate an output sequence. In machine translation scenarios, the encoder and decoder are both recurrent neural networks (RNNs), usually built from long short-term memory (LSTM) cells.
<img src="assets/images/06/img_003.png" width=700 align='center'>

In this scenario, the context vector is a vector of numbers encoding the information that the encoder captured from the input sequence. In real-world scenarios, the length of this vector is typically a power of two ($2^{n}$), such as 256 or 512.
<img src="assets/images/06/img_004.png" width=700 align='center'>

If we look at the previous example, translating *comment allez vous* to *how are you*, we can see how the hidden state develops:

1. Take the first word and develop the first hidden state:
<img src="assets/images/06/img_005.png" width=700 align='center'>

2. In the second step, we take the second word AND the first hidden state as inputs to the RNN and produce the second hidden state:
<img src="assets/images/06/img_006.png" width=700 align='center'>

3. In the third step, we repeat the process from the second step: we take the third (and last) word AND the second hidden state as inputs and generate the third hidden state:
<img src="assets/images/06/img_007.png" width=700 align='center'>

The third hidden state is the context vector that will be passed to the decoder. **This highlights a limitation of seq2seq models!**
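
The three steps above can be sketched in NumPy. This is only a minimal sketch under stated assumptions: it uses a vanilla RNN cell rather than an LSTM, the weights are random instead of trained, the word embeddings are random stand-ins, and the names `rnn_step` and `encode` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

def encode(inputs, W_xh, W_hh, b_h):
    """Run the encoder over the sequence and return the hidden state after every word."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state (assumed to be all zeros)
    hidden_states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 4  # toy sizes, far smaller than a real model

# Random (untrained) parameters, for illustration only
W_xh = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Random stand-ins for the embeddings of "comment", "allez", "vous"
inputs = [rng.normal(size=embed_dim) for _ in range(3)]

hidden_states = encode(inputs, W_xh, W_hh, b_h)
context = hidden_states[-1]  # without attention, only this vector reaches the decoder

print(len(hidden_states), context.shape)
```

However many words the encoder reads, `context` keeps the same shape, which is the bottleneck discussed next.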

The encoder is confined to sending a single vector, no matter how long or short the input sequence is. With a reasonably sized vector, the model struggles with long input sequences. If you instead make the hidden state (and therefore the context vector) very large, the model overfits on short sequences and performance suffers as the number of parameters grows. **Attention in neural nets solves this issue.**

# 2. Attention overview - Encoding

A seq2seq model with attention works like this:

1. The encoder processes the input sequence, just like the model without attention, one word at a time. It produces a hidden state for each of these inputs and uses that hidden state in the next step.
<img src="assets/images/06/img_008.png" width=700 align='center'>

2. Then, the model passes the context to the decoder. However, unlike the context vector in the model WITHOUT attention, this one is not just the final hidden state; it is all of the hidden states.
<img src="assets/images/06/img_009.png" width=700 align='center'>

The benefit of passing all the hidden input states is that it gives us flexibility in the context size. Longer sequences can have longer context vectors that better capture the information from the input sequence.
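
The difference in context size can be illustrated with a small sketch. The helper names below (`toy_encoder_states`, `context_without_attention`, `context_with_attention`) are hypothetical, and the hidden states are random stand-ins rather than the output of a trained encoder.

```python
import numpy as np

def toy_encoder_states(num_words, hidden_dim=4, seed=0):
    """Hypothetical stand-in for an encoder: one random hidden state per input word."""
    rng = np.random.default_rng(seed)
    return [rng.normal(size=hidden_dim) for _ in range(num_words)]

def context_without_attention(hidden_states):
    """Classic seq2seq: only the final hidden state is passed to the decoder."""
    return hidden_states[-1]

def context_with_attention(hidden_states):
    """Attention seq2seq: every hidden state is passed on, stacked into a matrix."""
    return np.stack(hidden_states)

short_states = toy_encoder_states(3)
long_states = toy_encoder_states(10)

# Fixed size regardless of input length
print(context_without_attention(short_states).shape)  # (4,)
print(context_without_attention(long_states).shape)   # (4,)

# One row per input word, so the context grows with the input
print(context_with_attention(short_states).shape)     # (3, 4)
print(context_with_attention(long_states).shape)      # (10, 4)
```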

Intuitively, each hidden state is (likely) most associated with the part of the input sequence that was processed just before it was generated. For example, the first hidden state was produced after encoding the first word, so of all the hidden states it captures the essence of the first input the most.

So, when we **focus** on the first hidden state, we **focus** on the first input. And likewise when we focus on the second hidden state, we are focusing on the second input, and so on.
<img src="assets/images/06/img_010.png" width=700 align='center'>

## Attention overview - Decoding (preview)

At every time step, the attention decoder pays attention to the appropriate part of the input sequence using the context vector. The process for the decoder to know which aspects of the input sequence are best to pay attention to is learned during the training phase.

<img src="assets/images/06/img_011.png" width=700 align='center'>

The process learned is not as simple as going from the first to the last hidden vector; it's not just associating the current hidden vector with the thing to be predicted. It is more sophisticated.

Consider the example of translating a French sentence into an English one, and assume we have a trained attention model. Let's take the first 4 words of the sentence on the left of the picture:

<img src="assets/images/06/img_012.png" width=700 align='center'>

The words in the top portion of the picture line up pretty well with their French counterparts. But note that the next few words to translate are "zone économique européenne". Here something different happens:

<img src="assets/images/06/img_013.png" width=700 align='center'>

If we take the shaded blocks to mark the input words that are **focused** on when producing the next word of the output, you can see that the three words "zone économique européenne" are not attended to in sequence when producing "european economic area". The output is not "area economic european", as it would be if the model simply followed the French word order. The model was able to learn this reordering from the training data set.

The model then goes on to produce the remaining terms more or less in order:
<img src="assets/images/06/img_014.png" width=700 align='center'>

This shows how the attention mechanism lets the model focus on the right parts of the sequence at the right time.

## Check on Learning:

True or False: A sequence-to-sequence model processes the input sequence all in one step
> * False: a seq2seq model works by feeding one element of the input sequence at a time to the encoder

What are two limitations of seq2seq models that are solved by attention methods?
1. The fixed size of the context vector passed from the encoder to the decoder is a bottleneck
2. The difficulty of encoding long sequences and recalling long-term dependencies

How large is the context matrix in an *attention* seq2seq model?
> * It depends on the length of the input sequence. With attention, the context matrix holds one hidden state per input element, whereas the context vector in a plain seq2seq model has a fixed length.

# 3. Attention overview - Decoding

In models without attention, we'd only feed the context vector (the encoder's last hidden state) to the decoder RNN, in addition to the embedding of the end token, and the decoder would then generate one element of the output sequence at each time step.

<img src="assets/images/06/img_015.png" width=700 align='center'>

The case is different in an attention decoder.

<img src="assets/images/06/img_016.png" width=700 align='center'>

The attention decoder has the ability to look at the inputted words and the decoder's own hidden state:

<img src="assets/images/06/img_017.png" width=700 align='center'>

and then do the following:
1. Use a scoring function to score each hidden state in the context matrix.
2. Pass those scores into a softmax function so that all values are between zero and one and sum to one. These values determine how much each hidden state will be expressed in the attention vector that the decoder looks at before producing an output.

<img src="assets/images/06/img_018.png" width=700 align='center'>

3. Multiply each hidden state by its softmax score and sum the resulting vectors to produce an attention context vector. This is a basic weighted sum operation.

<img src="assets/images/06/img_019.png" width=700 align='center'>

Note: The context vector is an important milestone in this process but it is not the end goal

4. The decoder has now looked at the input word and the attention context vector, which focuses its attention on the appropriate part of the input sequence. From these it produces a hidden state and the first word in the output sequence.

<img src="assets/images/06/img_020.png" width=700 align='center'>

5. Next, the decoder takes the previous output and the previous hidden state as inputs and generates an attention context vector for that time step, which it uses to produce a new hidden state and the next word in the output sequence.

<img src="assets/images/06/img_021.png" width=700 align='center'>

6. This continues until the output sequence is completed

<img src="assets/images/06/img_022.png" width=700 align='center'>
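
The scoring, softmax, and weighted-sum steps above can be sketched in NumPy. This is a minimal sketch, not a definitive implementation: the names `softmax` and `attention_step` are hypothetical, the hidden states are random stand-ins, and a plain dot product is assumed as the scoring function purely for illustration, since the steps above do not fix a particular one.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into positive weights that sum to one."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def attention_step(decoder_state, encoder_states):
    """One decoder time step of attention over the context matrix.

    decoder_state:  shape (hidden_dim,)
    encoder_states: shape (T_x, hidden_dim), one row per input word
    """
    # Step 1: score each encoder hidden state (assumed dot-product scoring)
    scores = encoder_states @ decoder_state   # shape (T_x,)
    # Step 2: softmax the scores
    weights = softmax(scores)                 # shape (T_x,)
    # Step 3: weighted sum of the encoder hidden states
    context = weights @ encoder_states        # shape (hidden_dim,)
    return weights, context

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(3, 4))  # stand-in context matrix for 3 input words
decoder_state = rng.normal(size=4)        # stand-in decoder hidden state

weights, context = attention_step(decoder_state, encoder_states)
print(weights, weights.sum())
print(context.shape)
```

In a full decoder, `attention_step` would be called once per output time step with the decoder's latest hidden state, and the resulting `context` would feed into producing the next hidden state and output word. That part is omitted here.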

## Check on Learning:

In machine translation applications, the encoder and decoder are typically:
* Generative Adversarial Networks (GANs)
* **Recurrent Neural Networks (Typically vanilla RNN, LSTM, or GRU)**
* Mentats

What's a more reasonable embedding size for a real-world application?
* 4
* **200**
* 6,000

What are the steps that require calculating an attention vector in a seq2seq model with attention?
* Every time step in the model (both encoder and decoder)
* Every time step in the encoder only
* **Every time step in the decoder only**

# 4. Bahdanau (additive) and Luong (multiplicative) Attention 

[Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) *additive*

[Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) *multiplicative*

Before delving into the details of the scoring functions, we need to make a distinction between the two major types of attention mechanisms: **additive attention** and **multiplicative attention**.

The scoring function in Bahdanau (additive) attention: $$e_{ij}=v_{a}^\top\text{tanh}\left ( W_{a}s_{i-1}+U_{a}h_{j} \right )$$

* $h_{j}$ is the hidden state from the encoder
* $s_{i-1}$ is the hidden state of the decoder in the previous time step
* $W_{a}$ and $U_{a}$ are weight matrices and $v_{a}$ is a weight vector, all of which are learned during the training process

Basically, this is a scoring function that takes an encoder hidden state ($h_{j}$) and the previous decoder hidden state ($s_{i-1}$) and produces a single number, $e_{ij}$. At each decoder time step, it yields one score per encoder hidden state.

The scores are then passed into the softmax:
$$a_{ij} = \frac{\exp \left( e_{ij} \right) }{\sum_{k=1}^{T_{x}} \exp\left(e_{ik}\right) }$$

and the resulting weights are then used in a weighted sum:
$$ c_{i} = \sum_{j=1}^{T_{x}} a_{ij}h_{j}$$

where we multiply each encoder hidden state $h_{j}$ by its attention weight $a_{ij}$ and then sum them, producing our attention context vector $c_{i}$.

<img src="assets/images/06/img_023.png" width=700 align='center'>
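
The three formulas above can be sketched for a single decoder time step $i$ in NumPy. This is a minimal sketch under stated assumptions: the function names `additive_scores` and `additive_attention` are hypothetical, the dimensions are toy values, and $W_{a}$, $U_{a}$, $v_{a}$ are random here, whereas in a real model they are learned during training.

```python
import numpy as np

def additive_scores(s_prev, H, W_a, U_a, v_a):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for every encoder position j.

    s_prev: previous decoder hidden state, shape (dec_dim,)
    H:      encoder hidden states, shape (T_x, enc_dim)
    W_a:    shape (attn_dim, dec_dim)
    U_a:    shape (attn_dim, enc_dim)
    v_a:    shape (attn_dim,)
    """
    # W_a @ s_prev is broadcast across all T_x rows of H @ U_a.T
    hidden = np.tanh(W_a @ s_prev + H @ U_a.T)  # shape (T_x, attn_dim)
    return hidden @ v_a                          # shape (T_x,)

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return the attention weights a_ij and the context vector c_i."""
    e = additive_scores(s_prev, H, W_a, U_a, v_a)
    exp_e = np.exp(e - e.max())   # softmax, shifted for numerical stability
    a = exp_e / exp_e.sum()
    c = a @ H                     # weighted sum of the encoder hidden states
    return a, c

rng = np.random.default_rng(0)
T_x, enc_dim, dec_dim, attn_dim = 5, 4, 6, 3  # toy sizes (assumptions)

H = rng.normal(size=(T_x, enc_dim))
s_prev = rng.normal(size=dec_dim)
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=attn_dim)

a, c = additive_attention(s_prev, H, W_a, U_a, v_a)
print(a.shape, c.shape, a.sum())
```

Because the encoder and decoder states are projected by separate matrices before being added, this form of scoring does not require the two hidden states to have the same dimension.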