
# Week 11 Attention and Transformers

Proprietary material - Under Creative Commons 4.0 licence CC-BY-NC https://creativecommons.org/licenses/by-nc/4.0/legalcode

# Introduction

Over the last decade or so, NLP has gone through another paradigm shift with the introduction of attention models and transformers. Attention is a complex operation that can be hard to understand at first, so let's go through it step by step to make it as intuitive as possible.

# Encoder/Decoder

To understand attention, we need to explain a bit of the context where it was conceptualized.

Let's recall the previous section, where we talked about sequential models, and in particular sequence-to-sequence models like the one in the following illustration:

<img src="RNNs-Page-5.drawio.png">

A generalization of this structure is called an encoder-decoder model. There are several types of encoder-decoders with different applications, but their general form looks something like this:

<img src="RNNs-Page-12.png">

The job of the encoder is to create a vector representation of the input. This vector, often called the context vector (C), contains the relevant information from the input. The decoder then interprets this context vector to produce the output of the model.
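To make the idea of a context vector concrete, here is a minimal NumPy sketch of an encoder. It assumes a plain tanh RNN cell, random untrained weights, and that the context vector is simply the last hidden state; the function name `rnn_encoder` and all the sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_encoder(inputs, W_x, W_h, b):
    """Run a simple tanh RNN over `inputs` and return every hidden state."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:                      # one time step per input vector
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)               # shape: (sequence_length, hidden_size)

embedding_size, hidden_size = 4, 3        # illustrative sizes (assumption)
W_x = rng.normal(size=(hidden_size, embedding_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

short_sentence = rng.normal(size=(2, embedding_size))    # stand-in word embeddings
long_sentence = rng.normal(size=(10, embedding_size))

# Assumption: the context vector C is the last hidden state of the encoder.
C_short = rnn_encoder(short_sentence, W_x, W_h, b)[-1]
C_long = rnn_encoder(long_sentence, W_x, W_h, b)[-1]
print(C_short.shape, C_long.shape)        # both have shape (hidden_size,)
```

Note that the context vector has the same size whether the input has 2 or 10 steps, so everything the decoder knows about the input has to fit in it.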

### Autoencoders

It's a bit of a detour, but let's quickly mention autoencoders. Structurally, they are the same as encoder-decoders, but instead of decoding to a different space they reconstruct the original input. This forces the context vector to act as a projection of the input onto a latent space of reduced dimensionality.

A common example of this is the projection of the MNIST digits onto a 2D plane:

<img src="autoencoder_schema.jpg">

<img src="latent_space.png">
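As a rough sketch of the data flow only, the snippet below pushes one flattened 28x28 image through a linear encoder and decoder. The weights are random and untrained, the input is random noise standing in for a digit, and the names `encode` and `decode` are made up for this example; a real autoencoder would learn the weights by minimising the reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, latent_size = 784, 2          # a flattened 28x28 image and a 2D latent space

# Untrained, randomly initialised weights: this only illustrates the shapes.
W_enc = rng.normal(scale=0.01, size=(latent_size, input_size))
W_dec = rng.normal(scale=0.01, size=(input_size, latent_size))

def encode(x):
    return W_enc @ x                      # project the input onto the latent space

def decode(z):
    return W_dec @ z                      # map the latent point back to input space

x = rng.random(input_size)                # stand-in for one flattened digit
z = encode(x)                             # the 2D point we could plot on a plane
x_hat = decode(z)                         # the attempted reconstruction

# The quantity that training would try to minimise.
reconstruction_error = np.mean((x - x_hat) ** 2)
print(z.shape, x_hat.shape)
```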

This isn't pertinent to attention, but I couldn't find somewhere else to explain it.

## Machine Translation

Let's get back on track by looking at an example of machine translation with encoder-decoders.

<img src="RNNs-Page-13.png">

Here we see that the only communication between the encoder and the decoder is the context vector. It is therefore the biggest information bottleneck in the model. This is a glaring issue, as the quality of this context vector directly impacts the quality of the translation.

The original attention operator was designed to address this issue.

# Taking Attention

As we said in last week's material, RNNs in general have a tendency to forget previous observations. This is particularly the case in machine translation, where the model would forget the beginning of the sentence by the time it reached the end. LSTM and GRU models were introduced to fix this issue, but they still struggled with long sentences.

To fix this, attention was proposed to allow the model to focus on the relevant parts of the whole sentence.

To start, we need to make some changes to our encoder. Instead of discarding all the hidden states except the last one, we can combine them into a context matrix like so:

<img src="RNNs-Page-14.png">
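In code, the change to the encoder is small. The sketch below again assumes a plain tanh RNN cell with random weights and made-up sizes; the only difference from keeping the last state is that every hidden state is stacked as a column of the context matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
sequence_length, embedding_size, hidden_size = 5, 4, 3   # illustrative sizes

W_x = rng.normal(size=(hidden_size, embedding_size))
W_h = rng.normal(size=(hidden_size, hidden_size))
inputs = rng.normal(size=(sequence_length, embedding_size))

h = np.zeros(hidden_size)
hidden_states = []
for x in inputs:
    h = np.tanh(W_x @ x + W_h @ h)
    hidden_states.append(h)               # keep every state instead of discarding it

last_state = hidden_states[-1]                        # what the plain encoder keeps
context_matrix = np.stack(hidden_states, axis=1)      # one column per time step
print(context_matrix.shape)                           # (hidden_size, sequence_length)
```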

Now we can feed this matrix to the decoder. But with attention, the decoder has to calculate the attention over the inputs for each cell. This is done in the following steps:

1. Take the context matrix and pair each of its columns (one hidden state from the encoder) with the current hidden state from the decoder.
2. Calculate the attention score of each encoder hidden state with respect to the decoder hidden state. In theory, this can be any function that takes two vectors and returns a value. In practice, this is done with a fully connected (FC) layer, but there are several proposals on how to compute the score.
3. Apply a softmax to the attention scores and multiply each resulting weight by its corresponding hidden state. This allows the model to dynamically focus on different hidden states and ignore less important ones.
4. Finally, add all the weighted states together to end up with a context vector for this time step.

<img src="RNNs-Page-15.png">
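The four steps above can be sketched in a few lines of NumPy. This is only one possible version: it assumes the encoder and decoder hidden states have the same size, and it uses a single bias-free fully connected layer (`W_score`, a made-up name) as the scoring function. Other scoring functions would slot into step 2 in the same way.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exp / exp.sum()

def attention(context_matrix, decoder_state, W_score):
    """Return the context vector and attention weights for one decoder step."""
    n_steps = context_matrix.shape[1]
    # Step 1: pair every encoder state (a column) with the decoder state.
    pairs = [np.concatenate([context_matrix[:, j], decoder_state])
             for j in range(n_steps)]
    # Step 2: score each pair with one fully connected layer (no bias, for brevity).
    scores = np.array([W_score @ pair for pair in pairs])
    # Step 3: softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    # Step 4: the weighted sum of the encoder states is the context vector.
    context_vector = context_matrix @ weights
    return context_vector, weights

rng = np.random.default_rng(0)
hidden_size, n_steps = 3, 5                              # illustrative sizes
context_matrix = rng.normal(size=(hidden_size, n_steps))
decoder_state = rng.normal(size=hidden_size)
W_score = rng.normal(size=2 * hidden_size)               # untrained scoring weights

context_vector, weights = attention(context_matrix, decoder_state, W_score)
print(weights.shape, context_vector.shape)
```

With trained weights, the largest entries of `weights` would point at the encoder states that matter most for the current decoder step.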

Now we can operate the attention block inside the decoder by following the next steps:

1. Take the previous output $y_{i-1}$ and hidden state $h_{i-1}$ and pass them through the RNN cell to create a new hidden state $h_i$ for this time step. If it's the first cell, use the end token as the input and a default hidden state $h_0$
2. Apply attention to the context matrix using the new hidden state to generate a context vector $C_i$
3. Concatenate the context vector $C_i$ and the hidden state $h_i$
4. Pass the resulting vector through a small fully connected network (two layers is usually enough) to get the output for this time step
5. Repeat for the next time steps until the end token is produced

<img src="RNNs-Page-16.png">
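Putting the five steps together, a decoder loop could look roughly like the sketch below. Several things here are assumptions made for illustration: the weights are random and untrained, the RNN cell is a plain tanh cell, the attention score is a simple dot product rather than a fully connected layer, the previous output is fed back as a one-hot vector of the most likely token, and `END = 0` is a hypothetical index for the end token.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def decoder_step(y_prev, h_prev, context_matrix, params):
    # 1. RNN cell: previous output and hidden state give the new hidden state h_i.
    h_i = np.tanh(params["W_y"] @ y_prev + params["W_h"] @ h_prev)
    # 2. Attention over the context matrix (dot-product score, an assumption).
    weights = softmax(context_matrix.T @ h_i)
    c_i = context_matrix @ weights
    # 3. Concatenate the context vector and the hidden state.
    combined = np.concatenate([c_i, h_i])
    # 4. Small two-layer fully connected network giving the output distribution.
    hidden = np.tanh(params["W_1"] @ combined)
    y_i = softmax(params["W_2"] @ hidden)
    return y_i, h_i

rng = np.random.default_rng(0)
vocab_size, hidden_size, n_steps = 6, 4, 5      # illustrative sizes
END = 0                                          # hypothetical index of the end token
params = {
    "W_y": rng.normal(size=(hidden_size, vocab_size)),
    "W_h": rng.normal(size=(hidden_size, hidden_size)),
    "W_1": rng.normal(size=(hidden_size, 2 * hidden_size)),
    "W_2": rng.normal(size=(vocab_size, hidden_size)),
}
context_matrix = rng.normal(size=(hidden_size, n_steps))   # stand-in encoder states

y = np.eye(vocab_size)[END]      # first input: the end token, one-hot encoded
h = np.zeros(hidden_size)        # default hidden state h_0
outputs = []
for _ in range(10):              # 5. repeat, capped so untrained weights can't loop forever
    probs, h = decoder_step(y, h, context_matrix, params)
    token = int(np.argmax(probs))
    outputs.append(token)
    if token == END:
        break
    y = np.eye(vocab_size)[token]
print(outputs)
```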

With that, we have completed our architecture with attention! 

We can see an example of machine translation with attention in the next figure:

<img src="attention_sentence.png">

As you can see, attention even allows the model to handle words that change order between the source and the translated sentence. This power to understand complex sequences is what set attention models apart from the rest.

But even attention has some problems. First, like any sequential model, it's slow to evaluate and train. This is because each hidden state has to be computed before the attention scores for that time step can be calculated, so the steps can't run in parallel. Second, the scores, and therefore the attention calculation, depend heavily on the quality of the hidden states.

# Self-Attention

What if, instead of relying on the hidden state of an RNN, we could ask the data itself where the relevant information and correlations are?

This is where the concept of self-attention comes into play. It not only improved on the performance of the original attention models, but also allowed models to process sequences in parallel.

To apply self-attention, we can't just feed the same input twice to the attention score calculation. Instead, we need to follow the steps below. I'll start the explanation with a single input vector and show how it looks with a whole matrix at the end, so keep that in mind for the next few steps.


1. We need to start by calculating the Queries, Keys and Values for each of the sequence inputs. This is done by multiplying the input by $W^Q$, $W^K$ and $W^V$ to end up with the vectors $q$, $k$ and $v$ respectively. These vectors are abstractions of our input that allow us to calculate attention


<img src="RNNs-Page-17.png">

2. Next, we need to calculate the attention score of each observation with respect to the rest of the sequence. This is done by taking the dot product between the query vector of the observation $x_i$ we are interested in and the key vectors of the whole sequence

$$
q_i \cdot k_j^T
$$

3. Once we have the score, we scale it by dividing by the square root of the dimension of the key vector, $d_k$. This is done to stabilize the gradients and ease the learning process. For our example this would be the square root of 3, but in the original paper the dimension of $k$ is 64, so they divide by 8

$$
\frac{q_i \cdot k_j^T}{\sqrt{d_k}}
$$

4. Then, we take the softmax of the scores over the whole input so that they sum to 1

$$
s_j = \text{softmax}\left(\frac{q_i \cdot k_j^T}{\sqrt{d_k}}\right)
$$

5. Then, we multiply each value vector by its corresponding attention score

$$
s_j v_j
$$

6. Finally, we add all the weighted value vectors to end up with the self-attention vector for our corresponding observation $x_i$

$$
z_i = \sum_j s_j v_j
$$
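The six steps above can be followed one by one in NumPy for a single observation $x_i$. This is a sketch with random, untrained projection matrices; the sequence length, the input size `d_model`, and the choice of giving the value vectors the same size as the keys are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sequence_length, d_model, d_k = 4, 5, 3      # d_k = 3 as in the example above

X = rng.normal(size=(sequence_length, d_model))   # one row per input vector
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

i = 0                                        # the observation x_i we attend from

# Step 1: queries, keys and values for each input.
q = [x @ W_Q for x in X]
k = [x @ W_K for x in X]
v = [x @ W_V for x in X]

# Steps 2-3: dot product of q_i with every key, scaled by sqrt(d_k).
scores = np.array([q[i] @ k_j / np.sqrt(d_k) for k_j in k])

# Step 4: softmax over the whole sequence.
s = np.exp(scores - scores.max())
s = s / s.sum()

# Steps 5-6: weight each value vector and add them up.
z_i = sum(s_j * v_j for s_j, v_j in zip(s, v))
print(s.shape, z_i.shape)
```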

We could then repeat the process for each input in the sequence to get the whole self-attention matrix, but we can use matrix multiplications to do these calculations very efficiently in parallel. We now define the Queries, Keys and Values as follows:

<img src="RNNs-Page-18.png">

This leaves the final attention calculation as:

<img src="RNNs-Page-19.png">

In equation form:

$$
Q = X W^Q
$$

$$
K = X W^K
$$

$$
V = X W^V
$$

$$
Z = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V
$$
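The matrix form translates almost directly into NumPy. In this sketch the function names `softmax_rows` and `self_attention` are made up, the weights are random and untrained, and the inputs are stored as the rows of `X`, so the softmax is taken over each row of the score matrix.

```python
import numpy as np

def softmax_rows(scores):
    # Softmax over each row, so every query's weights sum to 1.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # Q = X W^Q, K = X W^K, V = X W^V
    d_k = K.shape[-1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights                    # Z and the attention weights

rng = np.random.default_rng(0)
sequence_length, d_model, d_k = 4, 5, 3            # illustrative sizes
X = rng.normal(size=(sequence_length, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Z, weights = self_attention(X, W_Q, W_K, W_V)
print(Z.shape, weights.shape)                      # (4, 3) and (4, 4)
```

Row $i$ of `weights` holds the scores $s_j$ for observation $x_i$, and row $i$ of `Z` is its self-attention vector $z_i$, all computed at once with no loop over the sequence.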


Let's see how self-attention takes focus with an example.


"The animal didn't cross the street because it was too tired"

What does "it" refer to in this sentence?

For humans this clearly refers to the animal, but previous models found it difficult to capture this correlation. Let's see how self-attention sees it:

<img src="selfattention.png">

Here we can see that the model is capable of making the connection to know that "it" refers to the animal.


# Multi-Headed Self-Attention

Now that we have the attention operator defined, are we done with our model?

No, we are not limited to a single self-attention operator. We refer to each of these operators as a "head", so using multiple operators is referred to as multi-headed self-attention.

This has the advantage of allowing each head to focus on specific positions or different contexts. It also gives the multi-headed layer the ability to generate different representation subspaces that can project the embeddings to a different space for deeper layers.
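As a rough sketch, multi-headed self-attention just runs several heads with their own weights on the same input. How the heads are combined is not covered in these notes; the version below follows the formulation in the "Attention is all you need" paper, which concatenates the head outputs and multiplies them by an output matrix $W^O$. The function names, sizes and random untrained weights are assumptions for illustration.

```python
import numpy as np

def softmax_rows(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    # One self-attention operator ("head") with its own projection matrices.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax_rows(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_self_attention(X, heads, W_O):
    # Each head attends independently; the results are concatenated and projected.
    outputs = [attention_head(X, *head) for head in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
sequence_length, d_model, d_k, n_heads = 4, 6, 3, 2      # illustrative sizes
X = rng.normal(size=(sequence_length, d_model))

# One (W_Q, W_K, W_V) triple per head.
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))

Z = multi_head_self_attention(X, heads, W_O)
print(Z.shape)                                           # (4, 6)
```

Because every head has its own $W^Q$, $W^K$ and $W^V$, each one can learn to focus on different positions of the same sentence.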

An example of the same word in the same sentence, but with multi-headed self-attention, can be seen here:


<img src="mhsa.png">

If you want to explore in more detail how attention works with interactive examples, I would recommend these two references:

https://github.com/jessevig/bertviz

https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb#scrollTo=OJKU36QAfqOC



# Transformer

As a last point for this session, let's talk about transformers!


Transformers are an architecture that was proposed alongside self-attention in the paper "Attention is all you need", a very influential paper that I would highly recommend checking out!

The general architecture of the transformer can be seen in the following diagram:

<img src="transformer_small.png">

And with that we are finished!

There are some details about the implementation that we haven't covered, but this should cover the fundamental idea behind attention. If you want to deepen your knowledge about transformers, I would highly recommend implementing them from scratch with PyTorch.

