<a name='0'></a>

# Attention and Transformer

In the previous notebook, we learned about Recurrent Neural Networks and their applications in computer vision. Toward the end, we saw that Recurrent Neural Networks are rarely used today due to the advent of attention and Transformers. In this notebook, we will take a step back, look at the downsides of RNNs and how they motivate the Transformer, and then dive deep into the Transformer architecture.

***Outline***:

- [1. The Downsides of Recurrent Networks and Motivation of Transformers](#1)
- [2. Transformer Architecture](#2)
- [3. Implementation of Transformer from Scratch](#3)
- [4. Transformer beyond NLP](#4)
- [5. Final Notes](#5)
- [6. References and Further Learning](#6)

<a name='1'></a>

## 1. The Downsides of Recurrent Networks and Motivation of Transformers

As we saw in the previous notebook, recurrent networks are neural networks used to process sequential data such as text and video. Unlike feed-forward networks, recurrent networks have a feedback loop in their design, which is the key element behind their ability to model sequential data. But there is a problem in the way RNNs are wired.

Because of this inherently sequential design, there is a computational bottleneck: we can't process the elements of an input sequence in parallel. We have to process each word or character (or simply a token) of a sentence (or each frame of a video) one at a time and, as you might imagine, this becomes even harder for longer sequences.

So the issue is not only the unstable gradient problem that makes it hard to learn long-range dependencies, but also the fact that RNNs don't work well with modern GPUs (Graphics Processing Units, or GPUs, are designed for fast parallel computation and sit mostly idle during sequential operations).
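To make the bottleneck concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The weight names (`W_xh`, `W_hh`, `b_h`), the toy sizes, and the `tanh` update are assumptions chosen for illustration, not something taken from the previous notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, input_dim, hidden_dim = 6, 4, 3   # toy sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, input_dim))  # one sequence of 6 token vectors

# Hypothetical RNN parameters (random, untrained).
W_xh = rng.normal(size=(input_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla RNN over one sequence, one time step at a time."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x:
        # h at step t needs h from step t-1, so the steps cannot run in parallel.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

hidden_states = rnn_forward(x, W_xh, W_hh, b_h)
print(hidden_states.shape)  # (6, 3)
```

The `for` loop is the point: each iteration has to wait for the previous hidden state, no matter how many cores the hardware offers.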

The Transformer is a neural network architecture that can process sequential data in parallel. It doesn't use any recurrent or convolutional layers; rather, it is based purely on attention. We will learn more about attention in the next sections. The Transformer was introduced in 2017 in the famous paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Before its advent, researchers used to combine recurrent networks or convolutions with attention (examples [here](https://arxiv.org/pdf/1508.04025.pdf), [here](https://arxiv.org/pdf/1409.0473.pdf), and [here](https://arxiv.org/abs/1502.03044)). Since the Transformer, attention has been literally all you need, not only in natural language processing but also in other modalities such as computer vision.



<a name='2'></a>

## 2. Transformer Architecture

The Transformer is made of two main types of layers: self-attention layers and fully connected layers (called dense layers in [Keras/TensorFlow](https://keras.io/api/layers/core_layers/dense/) and linear layers in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)). These layers are stacked on top of each other and arranged in an encoder-decoder structure.

In addition to self-attention and fully connected layers, the Transformer also contains other layers such as layer normalization, an embedding layer (at the input), and a positional encoding layer for preserving the order of tokens. We will discuss them in detail.


![image](https://drive.google.com/uc?export=view&id=1bbcftxyRuYgpBr2Tj0fgxkcPp98G2LwT)

### 2.1 Attention

Attention is the main layer of the Transformer architecture. Attention can sound complicated, but in essence it is a mechanism that allows the neural network to pay more attention to the parts of the input data that contain the relevant semantic information and less attention to the rest. In image captioning, for example, that could mean paying attention to the parts of the image that contain key features, or features that are useful for generating the desired caption.

![image](https://drive.google.com/uc?export=view&id=1kd8PKpZzjK2bQhlEqfkKQdCbr7kA9LsB)

Another example: in machine translation, or sequence-to-sequence translation, attention could mean attending to the words that are related in meaning. A nice example of this is found in the image below. Notice how the words `European Economic Area` were translated to `zone économique européenne`.


![image](https://drive.google.com/uc?export=view&id=14UtDLC9vkFvrYg_dC9jTPXNHnGJR-_4f)

What's going on there? Do you see something? The order of the words was reversed in the translated sequence because that is the correct word order in the target language. So, when translating a word, attention helps the model not only translate the word correctly, but also place it in the right order by attending to the other words that correlate with that particular word. In short, attention can identify and preserve context when translating between languages.

The machine translation example we used above to convey the idea of attention does not come from the Transformer itself, but the idea of attention is the same: attend to, or focus on, the parts of the input that contain meaningful information and focus less on the rest of the input data.

### 2.1.1 Attention Inputs: Query, Keys, and Values


Technically speaking, attention is a function built on measuring the similarity between vectors. The attention function takes 3 inputs: `query`, `keys`, and `values`. These 3 terms might sound complicated now, but a real example will help you get the gist of them.

Let's say that you are searching for the *Attention Is All You Need* paper on [arXiv](https://arxiv.org). The title of the paper, or whatever you enter in the search field, is the `query`. Internally, arXiv finds papers that match your query based on `keys` such as titles, authors, fields, journals, abstracts, etc. arXiv computes the similarity between your `query` and the fixed `keys`, and displays the top papers that match your `query`, or simply the papers that have high similarity scores. The papers in the arXiv database that we are querying, based on some set of keys, can be referred to as `values`. The similarity scores can also be referred to as relevancy scores or attention scores.
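The search analogy can be sketched in a few lines of NumPy. The 3-dimensional vectors and paper names below are made up for illustration; a real search engine would use far richer representations.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, made up for illustration.
query = np.array([1.0, 0.0, 1.0])   # what we typed in the search box
keys = np.array([
    [1.0, 0.0, 1.0],   # paper A: metadata very close to the query
    [0.0, 1.0, 0.0],   # paper B: unrelated metadata
    [1.0, 1.0, 0.0],   # paper C: partially related metadata
])
values = ["paper A", "paper B", "paper C"]  # the papers themselves

scores = keys @ query           # one similarity score per key
ranking = np.argsort(-scores)   # indices sorted from best to worst match

print(scores)                          # [2. 0. 1.]
print([values[i] for i in ranking])    # ['paper A', 'paper C', 'paper B']
```

One difference from a search engine is worth keeping in mind: a search returns the top matches, whereas attention blends all the values together, using the scores as weights.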


![image](https://drive.google.com/uc?export=view&id=1KAuNB7qXCHvA1sFmMVhJ-qb6AnA1S83X)

So, let's use the above analogy to understand what our attention function is doing. But before we do that, let's look at the attention function itself.

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

In the above function:
* $Q$ denotes the query matrix or, in our arXiv analogy, the paper that we want to search for.
* $K$ denotes the keys matrix, or the set of things (such as title, authors, journal, etc.) that describe papers in our analogy. In the attention function, $K$ is transposed (i.e. $K^T$).
* $V$ is the values matrix, or all the papers that we are querying in our arXiv analogy.
* $\sqrt{d_k}$ is the scaling factor, where $d_k$ is the dimension of the keys. It is used to avoid unstable gradient problems. In our analogy, this could mean clipping off papers that are far from our query (not sure if that makes sense, but that's the idea).
* $\text{softmax}()$ is a function that transforms the scaled dot product $QK^T / \sqrt{d_k}$ into a probability distribution. The softmax output applied to the values $V$ produces the weighted outputs.
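To see why the scaling matters, note that if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance $d_k$, so raw scores grow with the key dimension. The sketch below uses random vectors and an example value of `d_k` (both assumptions for illustration) to compare softmax with and without the scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_k = 64                         # key dimension, an example value
q = rng.normal(size=(d_k,))      # one random query
K = rng.normal(size=(10, d_k))   # ten random keys

raw_scores = K @ q                          # spread grows like sqrt(d_k)
scaled_scores = raw_scores / np.sqrt(d_k)   # spread stays around 1

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print(raw_weights.max())     # largest weight without scaling
print(scaled_weights.max())  # largest weight with scaling (never larger)
```

Without scaling, one key tends to take almost all of the weight, which pushes softmax into a region where gradients are very small. Dividing by $\sqrt{d_k}$ keeps the weights more spread out.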

![image](https://drive.google.com/uc?export=view&id=18WVSE7jD87tXO0ufDVwHfwW8BnGWFmLo)

Putting things together, the attention function takes the query, keys, and values matrices, computes the dot product (or matrix multiplication) between the query and keys as a measure of similarity, scales the result $QK^T$ by $\frac{1}{\sqrt{d_k}}$, passes it through the softmax function to get a probabilistic output, and applies that to the values matrix to get a weighted sum of the values. In brief, attention computes similarity scores between the query and keys and uses those scores to weight the values. Everything else is there to smooth the computation.
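The steps above can be sketched directly in NumPy. This is a minimal illustration under simple assumptions (no masking, no batching, random inputs), and the function name `scaled_dot_product_attention` is chosen here for clarity rather than taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the output (n_queries, d_v) and the weights (n_queries, n_keys).
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity between queries and keys
    weights = softmax(scores, axis=-1)   # each row becomes a probability distribution
    return weights @ V, weights          # weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))    # 3 queries of dimension 8
K = rng.normal(size=(5, 8))    # 5 keys of dimension 8
V = rng.normal(size=(5, 16))   # 5 values of dimension 16

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)    # (3, 16)
print(weights.shape)   # (3, 5)
```

Each row of `weights` sums to 1, so each output row is a weighted average of the rows of `V`.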

Since the query, keys, and values are all matrices, computing the attention function is pretty fast and can be done in a single pass, because it is merely ordinary matrix multiplication (no loops or recurrence involved). And as we alluded to in the beginning, that's where the extreme parallelism of the Transformer comes from.
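As a quick check of this claim, the sketch below computes attention for all positions in one shot and then again one query at a time, and confirms the results match. The sizes and random inputs are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# All positions at once: a couple of matrix multiplications.
batched = softmax(Q @ K.T / np.sqrt(d_k)) @ V

# The same result computed one query at a time. Unlike an RNN loop, no
# iteration depends on a previous one, so the order does not matter.
looped = np.stack([softmax(q @ K.T / np.sqrt(d_k)) @ V for q in Q])

print(np.allclose(batched, looped))  # True
```

Because the per-query computations are independent, hardware such as a GPU is free to run them all at the same time.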





### 2.1.2 Multi-Head Attention