# Introduction to Attention
<!-- estimated time: 4hours -->

This section will cover:

1. Sequence to sequence recap
2. Attention overview - Encoding
3. Attention overview - Decoding
4. Attention encoder
5. Attention decoder
6. Attention encoder & decoder
7. Bahdanau and Luong attention
8. Multiplicative attention
9. Additive attention
10. Computer vision applications
11. NLP application: Google neural machine translation
12. Other attention methods
13. The transformer and self-attention
14. Lab: Attention basics

Attention started out in the field of computer vision as an attempt to mimic human perception:
> "One important property of human perception is that one does not tend to process a whole scene in its entirety at once. Instead, humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene, guiding future eye movements and decision making"
- [Recurrent Models of Visual Attention](https://arxiv.org/abs/1406.6247)

In the example below, recognizing that the image shows a bird does not require processing the whole image; it is enough to ignore the background and focus on the item of interest. Further, if we can direct attention to individual components of the image rather than to the image as a whole, we can describe it in a more complete and nuanced manner:
<img src="assets/images/06/img_001.png" width=700 align='center'>

# 1. Seq2Seq Recap

Classic Seq2Seq models, i.e., those without attention, have to look at the original sentence to be translated once and then use that *entire* input to produce every single output term.

A sequence to sequence model takes in an input that is a sequence of items and then produces another sequence of items as an output.

* In machine translation, the input sequence is a series of words in one language and the output is a translation in another language.

* In text summarization, the input is a long sequence of words and the output is a short one.

The seq2seq model usually consists of an encoder and a decoder. The encoder first processes all of the inputs and turns them into a single representation, typically a single vector known as the **context** vector. The *context* vector contains whatever information the encoder was able to capture from the input sequence.
<img src="assets/images/06/img_002.png" width=700 align='center'>

The context vector is then sent to the decoder, which uses it to formulate an output sequence. In machine translation scenarios, the encoder and decoder are both recurrent neural networks (RNNs), usually built from LSTM (long short-term memory) cells.
<img src="assets/images/06/img_003.png" width=700 align='center'>

In this scenario, the context vector is a vector of numbers encoding the information that the encoder captured from the input sequence. In practice, its length equals the number of hidden units in the encoder RNN and is usually chosen to be a power of two, such as 256 or 512.
<img src="assets/images/06/img_004.png" width=700 align='center'>

If we look at the previous example, translating *comment allez vous* to *how are you*, we can see how the hidden state develops:

1. Take the first word and develop the first hidden state:
<img src="assets/images/06/img_005.png" width=700 align='center'>

2. In the second step, we take the second word AND the first hidden state as inputs to the RNN and produce the second hidden state:
<img src="assets/images/06/img_006.png" width=700 align='center'>

3. In the third step, we repeat the process from the second step: we take the third (and last) word AND the second hidden state as inputs and generate the third hidden state:
<img src="assets/images/06/img_007.png" width=700 align='center'>

The third hidden state is the context vector that will be passed to the decoder. **This highlights a limitation of seq2seq models!**

The encoder is confined to sending a single vector, no matter how long or short the input sequence is. With a reasonably sized vector, the model struggles with long input sequences, because too much information has to be squeezed into too few numbers. Simply making the hidden state very large, so that the context is very large, does not help either: the model then tends to overfit on short sequences, and performance suffers as the number of parameters grows. **Attention in neural nets solves this issue.**
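The encoding steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the model from the figures: it uses a plain tanh RNN cell instead of an LSTM, random weights instead of trained ones, and random vectors in place of real word embeddings. The names `encode`, `W_xh`, `W_hh`, and `b_h` are hypothetical. It shows that the context handed to the decoder has the same size for a 3-word and a 30-word input.

```python
import numpy as np

def encode(inputs, W_xh, W_hh, b_h):
    """Run a vanilla (tanh) RNN over `inputs` and return every hidden state."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state
    hidden_states = []
    for x in inputs:                     # one word (embedding) per step
        # New hidden state depends on the current word AND the previous state
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 8             # illustrative sizes, not from the text
W_xh = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

# Random vectors stand in for the embeddings of "comment", "allez", "vous"
sentence = rng.normal(size=(3, embed_dim))
states = encode(sentence, W_xh, W_hh, b_h)
context = states[-1]                     # without attention: last hidden state only

# A ten-times longer input still yields a context of the same size
long_sentence = rng.normal(size=(30, embed_dim))
long_context = encode(long_sentence, W_xh, W_hh, b_h)[-1]

print(context.shape, long_context.shape)  # (8,) (8,)
```

Whatever the encoder needs to remember about 30 words has to fit into the same 8 numbers it used for 3 words, which is the bottleneck described above.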

# 2. Attention overview - Encoding

A seq2seq model with attention works like this:

1. The encoder processes the input sequence, just like the model without attention, one word at a time. It produces a hidden state for each of these inputs and uses that hidden state in the next step.
<img src="assets/images/06/img_008.png" width=700 align='center'>

2. Then, the model passes the context to the decoder. However, unlike the context vector in the model WITHOUT attention, this one is not just the final hidden state; it is all of the hidden states.
<img src="assets/images/06/img_009.png" width=700 align='center'>

The benefit of passing all of the encoder's hidden states is that it gives us flexibility in the context size. Longer sequences produce a longer context, one hidden state per input, which can better capture the information from the input sequence.
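As a rough sketch of this difference, the snippet below keeps every hidden state and stacks them into a matrix, so the context grows with the input length. It assumes a plain tanh RNN cell with random weights and random stand-in embeddings; `encode_all`, `W_xh`, and `W_hh` are hypothetical names chosen for illustration.

```python
import numpy as np

def encode_all(inputs, W_xh, W_hh):
    """Return all hidden states of a vanilla RNN, stacked one row per input."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return np.stack(states)              # shape: (sequence_length, hidden_dim)

rng = np.random.default_rng(1)
embed_dim, hidden_dim = 4, 8             # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))

short_context = encode_all(rng.normal(size=(3, embed_dim)), W_xh, W_hh)
long_context = encode_all(rng.normal(size=(10, embed_dim)), W_xh, W_hh)

print(short_context.shape)               # (3, 8)
print(long_context.shape)                # (10, 8)
```

The hidden size stays fixed at 8, but the number of rows follows the input length, so a longer sentence hands the decoder more material to draw on.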

Intuitively, each hidden state is (likely) most associated with the input word that was processed immediately before it was produced. For example, the first hidden state was produced right after encoding the first word, so of all the hidden states it captures the essence of the first input the most.

So, when we **focus** on the first hidden state, we **focus** on the first input. And likewise when we focus on the second hidden state, we are focusing on the second input, and so on.
<img src="assets/images/06/img_010.png" width=700 align='center'>
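One way to picture "focusing" is as a weighted combination of the hidden states in which one weight dominates. The sketch below is only an illustration under assumptions: the hidden states are hand-written toy vectors, the scores are hand-picked rather than learned, and the softmax weighting is one common choice, not necessarily the exact mechanism used in the figures. How the scores are actually computed belongs to the decoding step.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

# Toy encoder hidden states for a three-word input (one row per word)
hidden_states = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Hand-picked scores: a high score at position 0 means "focus on the first word"
scores = np.array([5.0, 0.0, 0.0])
weights = softmax(scores)                # weights[0] is roughly 0.99

# The focused result is dominated by the first hidden state
focused = weights @ hidden_states
print(np.round(weights, 3))
print(np.round(focused, 3))
```

Moving the large score to position 1 or 2 would shift the focus to the second or third hidden state, and therefore to the second or third input word.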