# Attention mechanism 

State-of-the-art approaches to data-to-sequence problems in deep learning rely on what is called the attention mechanism! It introduces a major difference from the way recurrent neural networks (RNNs) process data.


## What will you learn in this course? 🧐🧐

Before we dig into the details, let's take a look at the outline of the course!

* Flaws of regular RNNs
* Attention theory
  * The concept
  * Attention in neural networks
  * The Bahdanau attention

## Flaws of regular RNNs

The main motivation for introducing the attention mechanism comes from the flaws of classic RNNs. The central idea of an RNN is to analyse the elements of a sequence one by one and to use the information gathered so far to analyse the next elements.

This idea is good, but it makes it difficult for information to persist (memory) in the recurrent layer. Even the performance of LSTM layers starts to crumble when analysing very long sequences (beyond roughly 50 elements).

In addition, RNNs are slow and hard to train: elements must be processed one after the other, and the large number of function compositions involved makes it difficult for the gradient descent algorithm to work properly, because gradients tend to vanish or explode as they are propagated back through many time steps.
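
To make the function composition problem concrete, here is a minimal sketch using a toy one-dimensional recurrence `h_t = tanh(w * h_{t-1})`. The function name `gradient_through_time` and the values of `w` and `h0` are assumptions chosen for illustration, not part of any real RNN layer. It shows how the gradient of the last state with respect to the first one shrinks as the sequence gets longer.

```python
import numpy as np

def gradient_through_time(n_steps, w=0.9, h0=0.5):
    """Return |d h_n / d h_0| for the toy recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(n_steps):
        h = np.tanh(w * h)
        # Chain rule: each time step multiplies the gradient by w * (1 - h_t^2).
        grad *= w * (1.0 - h ** 2)
    return abs(grad)

# The gradient gets smaller and smaller as the number of steps grows.
for n in (1, 10, 50, 100):
    print(n, gradient_through_time(n))
```

With a recurrent weight below 1, every extra time step multiplies the gradient by a factor smaller than 1, so the first elements of a long sequence barely influence the weight updates.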

## Attention theory



### The concept 

The principle of Attention is as follows:

_If someone asks you to write the Wikipedia article about the Star Wars character Darth Vader, you may:_

1. _Watch the entire series of movies._
2. _Watch only the portions where this character appears or is mentioned!_

Attention follows the same principle as option 2: you ask your neural network to focus only on the parts of your data that are relevant to your prediction at a given time!

In the context of translation for example, each time you have to predict a new word from the output sequence, you will ask the network to focus on a different portion of the input that is relevant at this point in time.

In the context of image captioning, we can define areas of the image that the model must *focus* on. The colored areas correspond to the level of attention that the model gives to different pixels in the image when trying to output the corresponding underlined word in the output sequence.

<img src="https://miro.medium.com/max/2200/0*mCHDMNdwb_gB1Rj9." width="600" />

In the top left image, you can see that when trying to output the word "girl", the area of the image that received the most attention is located around the girl. The same goes for the bench and the umbrella, which makes total sense!

### Attention in neural networks

The implementation of the attention mechanism is based on the encoder-decoder framework that, luckily, we studied yesterday! The following schema will give you a visual sense of what is happening in the network when using attention:


![attention](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/attention_gif.gif)

So the attention mechanism builds on the encoder-decoder framework, but with one major difference! Let's first take a look at the way classic encoder-decoder models work:

![encoder](https://full-stack-assets.s3.eu-west-3.amazonaws.com/models/M08_Deep_learning/encoder_decoder/Encoder_decoder_detail.png)

The difference is the following:

In the classic encoder-decoder, the encoder extracts information from the input, and this information is fed **in full** to the decoder. The decoder uses the encoder state to produce the first element of the output sequence and, from this point onwards, does not access the encoder output directly. It relies entirely on its recurrent layer to persist that information until the whole output sequence is produced!

With the attention mechanism, every time a new element of the output sequence has to be produced, a new vector of attention weights is computed and the decoder uses this **weighted** encoder output to make its prediction. For the first prediction, the attention weights are based on the encoder output and the encoder state. For all subsequent predictions, they are based on the encoder output and the decoder state.

This means that, with attention, the decoder can access the encoder output at every step, whereas in the standard encoder-decoder framework it only sees it once at the beginning.
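
The idea of recomputing a weighted view of the encoder output at every decoding step can be sketched in a few lines of NumPy. This is a simplified illustration only: the helper names `softmax` and `attend` are hypothetical, the encoder outputs and decoder states are random stand-ins, and the relevance score used here is a plain dot product, which is an assumption made for brevity and not the scoring function of any specific attention variant.

```python
import numpy as np

def softmax(x):
    """Turn a vector of scores into weights that are positive and sum to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(encoder_outputs, query):
    """encoder_outputs: (seq_len, units), query: (units,)."""
    scores = encoder_outputs @ query      # one relevance score per input element
    weights = softmax(scores)             # attention weights, shape (seq_len,)
    context = weights @ encoder_outputs   # weighted sum of encoder outputs, shape (units,)
    return context, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(5, 8))  # 5 input elements, 8 units each

# At every decoding step the state changes, so the weights change too.
for step in range(3):
    decoder_state = rng.normal(size=8)     # random stand-in for the real decoder state
    context, weights = attend(encoder_outputs, decoder_state)
    print(step, np.round(weights, 2))
```

Each printed row is a different distribution over the same five input elements, which is exactly what lets the decoder look at a different part of the input for each output element.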

Building this ability into the model helps it use the piece of information it needs at the right time, through an explicit importance assignment mechanism, instead of letting the decoder's recurrent layer figure it out by itself!

Remember that, theoretically, a dense network with two hidden layers (and enough neurons) should be able to approximate almost any function. The problem is that, in practice, training rarely finds that solution! This is why a large part of deep learning consists in developing smart ideas that explicitly enforce certain behaviours, so the model can more easily take advantage of the specific characteristics of the data.

### The Bahdanau attention

Now that we understand how attention works theoretically, let's dig into the details of how it works in the specific case of Bahdanau attention, one of the most popular implementations of this technique!

The schema below describes in detail an encoder decoder architecture with an attention mechanism!

![bahdanau](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Attention-encoder-decoder.drawio.png)

* Encoder:
  * The encoder works in a very standard way: the input text sequence is embedded, then processed by a GRU layer, which returns the full sequence of outputs (one per input element) plus a hidden state.

* Decoder:
  * First step:
    * The encoder output and the encoder state enter the attention layer (in white on the schema). Each goes through its own dense layer; the two outputs are then summed, squashed into the range [-1, 1] by a tanh function, and passed through another dense layer that produces one score per element of the input sequence. A softmax turns these scores into a probability distribution over the elements of the input sequence. This produces the first attention weights vector!
    * The attention weights are returned for reference (for example, to visualise them) and are multiplied with the encoder output, then summed over the input positions, to create the context vector!
    * In the meantime, the first element of the teacher forcing sequence is embedded and concatenated with the context vector.
    * The concatenated vector goes through a GRU layer, then a dense layer, to produce the prediction for the first element of the output sequence.
  * Subsequent steps:
    * The attention layer replaces the encoder state with the hidden state from the decoder's recurrent layer. The attention weights therefore evolve from one step to the next, and a new attention weights vector is created at each step.
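
The attention layer described in the steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the exact implementation behind the schema: the names `W1`, `W2`, `v` and `bahdanau_attention`, all the dimensions, and the random (untrained) weights are hypothetical, dense layer biases are omitted, and the GRU and final dense layer of the decoder are left out.

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, enc_units, dec_units, attn_units, emb_dim = 6, 16, 16, 10, 12

# Hypothetical random weights; in a real model they are learned during training.
W1 = rng.normal(scale=0.5, size=(enc_units, attn_units))  # dense layer applied to the encoder output
W2 = rng.normal(scale=0.5, size=(dec_units, attn_units))  # dense layer applied to the state
v = rng.normal(scale=0.5, size=(attn_units, 1))           # final dense layer: one score per position

def bahdanau_attention(encoder_output, state):
    """encoder_output: (seq_len, enc_units), state: (dec_units,)."""
    # Sum of the two dense layers; the state term is broadcast over all positions.
    summed = encoder_output @ W1 + state @ W2   # (seq_len, attn_units)
    scores = (np.tanh(summed) @ v)[:, 0]        # (seq_len,)
    exp = np.exp(scores - scores.max())
    weights = exp / exp.sum()                   # softmax over the input positions
    context = weights @ encoder_output          # weighted sum, shape (enc_units,)
    return context, weights

encoder_output = rng.normal(size=(seq_len, enc_units))  # stand-in for the encoder GRU outputs
encoder_state = rng.normal(size=dec_units)              # stand-in for the encoder final state

# First step: the attention layer receives the encoder state.
context, weights = bahdanau_attention(encoder_output, encoder_state)
embedded_token = rng.normal(size=emb_dim)                # stand-in for the embedded first target token
gru_input = np.concatenate([context, embedded_token])    # what would be fed to the decoder GRU

# Subsequent steps: the decoder hidden state replaces the encoder state.
decoder_state = rng.normal(size=dec_units)               # stand-in for the decoder GRU state
next_context, next_weights = bahdanau_attention(encoder_output, decoder_state)
print(np.round(weights, 2), np.round(next_weights, 2))
```

The two printed vectors are distributions over the same six input positions; they differ only because the state passed to the attention layer changed between the two steps.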

Now you should understand very concretely how the information is treated by the encoder decoder with attention!

Another popular type of attention mechanism is the Luong attention. We'll include some documentation on this in the resources section!

Now that we have studied the theory, it's time to apply all this in practice!

## Resources 📚📚

* [Bahdanau vs Luong attention](https://arabicprogrammer.com/article/3033619795/)
* [Bahdanau attention paper](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Bahdanau_attention.pdf)
* [Luong attention paper](https://full-stack-bigdata-datasets.s3.eu-west-3.amazonaws.com/Deep+Learning/attention/Luong_attention.pdf)
* [Attention is all you need](https://arxiv.org/abs/1706.03762)
* [Transformer Attention is all you need](https://towardsdatascience.com/transformer-attention-is-all-you-need-1e455701fdd9)
* [Multi-Modal Methods: Image Captioning (From Translation to Attention)](https://medium.com/mlreview/multi-modal-methods-image-captioning-from-translation-to-attention-895b6444256e)
* [Attention Model Intuition](https://www.youtube.com/watch?v=SysgYptB198)
* [Attention Model](https://www.youtube.com/watch?v=quoGRI-1l0A&list=PLkDaE6sCZn6F6wUI9tvS_Gw1vaFAx6rd6&index=6)