<h1 align=center> Transformers (Attention Is All You Need) In Depth </h1>

Transformers, in the context of machine learning and artificial intelligence, refer to a type of deep learning model architecture designed primarily for natural language processing (NLP) tasks. They have revolutionized the field by enabling more effective training and performance on a variety of tasks such as translation, summarization, and question answering.

![alt text](../Images/nlp/transformers/transformer.png)

### 1. **Introduction and History**

- **Origins**: The transformer model was introduced in the paper "Attention Is All You Need" by Vaswani et al., published by Google Brain in 2017.
- **Impact**: Transformers have replaced recurrent neural networks (RNNs) and long short-term memory (LSTM) networks in many NLP tasks due to their efficiency and performance.

### 2. **Architecture Overview**

- **Attention Mechanism**: The core innovation of transformers is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence.
- **Encoder-Decoder Structure**: The original transformer model has an encoder-decoder structure. The encoder processes the input sequence, and the decoder generates the output sequence.
- **Parallelization**: Transformers can process all words in a sentence simultaneously, unlike RNNs which process sequentially, making training faster.

### 3. **Components of a Transformer**

- **Input Embeddings**: Converts words into vectors of real numbers.
- **Positional Encoding**: Adds information about the position of each word in the sentence, since transformers do not inherently understand order.
- **Multi-Head Self-Attention**: Allows the model to focus on different parts of the sentence simultaneously by using multiple attention mechanisms in parallel.
- **Feed-Forward Neural Networks**: Applied independently to each position to introduce non-linearity.
- **Layer Normalization**: Normalizes inputs to each layer to stabilize and accelerate training.
- **Residual Connections**: Helps with gradient flow and allows for deeper models by adding the input of each layer to its output.

### 4. **Self-Attention Mechanism**

- **Calculation**: Self-attention involves computing three vectors for each word - Query (Q), Key (K), and Value (V).
- **Scaled Dot-Product**: The dot product of the query and key vectors determines the attention scores, which are then scaled and passed through a SoftMax function to get attention weights.
- **Weighted Sum**: The attention weights are used to compute a weighted sum of the value vectors, producing the final output.
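
The three steps above are commonly written as a single formula. The notation here follows the original paper, with $d_k$ denoting the dimension of the key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$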

### 5. Transformer Practical Explanation

`Note`: All credit goes to [Jalammar](https://jalammar.github.io/illustrated-transformer/).

- Our goal is to translate a French sentence into English.

![transformer1.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/77b26e45-cae9-4d87-a588-1c0f327e1744/transformer1.png)

- If we open up the transformer architecture, we find an encoder and a decoder.

![transformer2.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/12aa1ca9-f54e-4b9d-85d1-80d2d97dab30/transformer2.png)

- If we go deeper, we find a stack of six encoders and a stack of six decoders.

![transformer3.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/2d1b4892-0348-4b3e-a4e1-dd9d579b5a9d/transformer3.png)

- Each encoder consists of two sub-layers:

![transformer4.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/7f786008-b775-41cc-9c46-d9ec83e312e2/transformer4.png)

- The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word.
- The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
- The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in [seq2seq models](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)).
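
To make "the exact same feed-forward network is independently applied to each position" concrete, here is a minimal NumPy sketch. The two-layer ReLU form and the sizes 512 and 2048 come from the original paper; the random weights, the sequence length, and the function name `position_wise_ffn` are assumptions made for illustration.

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """Apply the same two-layer network to every position (row) of x."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU non-linearity
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 3  # seq_len is arbitrary here

# Random stand-ins for weights that would be learned during training.
w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))  # one vector per word
out = position_wise_ffn(x, w1, b1, w2, b2)

# Feeding a single position on its own gives the same result for that position.
row_alone = position_wise_ffn(x[1:2], w1, b1, w2, b2)
print(out.shape, np.allclose(out[1:2], row_alone))
```

Because no position looks at any other inside this layer, all positions can be processed in parallel.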

![transformer5.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/91040ed1-11db-4c1c-b74f-2ebe8a09c4b8/transformer5.png)

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

- In NLP applications, we begin by turning each input word into a vector using an [embedding algorithm](https://medium.com/deeper-learning/glossary-of-deep-learning-word-embedding-f90c3cec34ca).

![transformer6.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/b6b25c5b-70bb-4523-be86-d6e37c0d2324/transformer6.png)

- The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512. In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set; basically it would be the length of the longest sentence in our training dataset.
- After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.

![transformer7.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/799cdc46-3cf0-4130-8b4b-0ec479bd5e1f/transformer7.png)
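
As a rough illustration of the embedding step, the sketch below looks up one 512-dimensional row per word. The three-word vocabulary and the random table are hypothetical; in a real model the table is learned and the vocabulary is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"je": 0, "suis": 1, "étudiant": 2}  # hypothetical toy vocabulary
d_model = 512

# Random stand-in for a learned embedding table: one row per vocabulary entry.
embedding_table = rng.normal(size=(len(vocab), d_model))

sentence = ["je", "suis", "étudiant"]
token_ids = [vocab[word] for word in sentence]

x = embedding_table[token_ids]  # shape (3, 512): the list of vectors the bottom encoder receives
print(x.shape)
```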

- Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.

### **Now We’re Encoding!**

- As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

![transformer8.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/aa7d91e5-36e8-4700-a14a-108a349d05eb/transformer8.png)

### **Self-Attention at a High Level**

Say the following sentence is an input sentence we want to translate:

“`The animal didn't cross the street because it was too tired`”

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

![transformer9.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/aa7bd30a-1caa-42bf-b27c-9d389753ba1b/transformer9.png)

### **Self-Attention in Detail**

- The **first step** in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512. They don’t have to be smaller; this is an architecture choice that keeps the total cost of multi-headed attention (mostly) the same as that of a single attention head with full dimensionality.

![transformer10.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/b9516bc7-f081-4962-9e4d-45d40865ad1c/transformer10.png)
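
The first step can be sketched in a few lines of NumPy. The sizes 512 and 64 come from the text above; the random embeddings and weight matrices are stand-ins for values a trained model would supply.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

# Embeddings for a two-word input (random stand-ins for illustration).
x1, x2 = rng.normal(size=(2, d_model))

# Three projection matrices; in a real model these are learned during training.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Each word gets its own query, key, and value vector.
q1, k1, v1 = x1 @ W_Q, x1 @ W_K, x1 @ W_V
q2, k2, v2 = x2 @ W_Q, x2 @ W_K, x2 @ W_V

print(x1.shape, q1.shape)
```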

What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention.

- The **second step** in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
- The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

![transformer11.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/2e921799-4962-43f7-a2b5-1b84f98005cc/transformer11.png)
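
A tiny numeric example of the scoring step, using made-up 4-dimensional vectors instead of the 64-dimensional ones from the paper so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical query/key vectors (illustrative values only).
q1 = np.array([1.0, 0.0, 2.0, 1.0])
k1 = np.array([1.0, 1.0, 2.0, 0.0])
k2 = np.array([0.0, 1.0, 0.5, 1.0])

score_11 = q1 @ k1  # word 1 scored against itself: 1 + 0 + 4 + 0
score_12 = q1 @ k2  # word 1 scored against word 2: 0 + 0 + 1 + 1

print(score_11, score_12)
```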

- The **third and fourth steps** are to divide the scores by 8 (the square root of 64, the dimension of the key vectors used in the paper) and then pass the result through a SoftMax operation. The scaling leads to more stable gradients; other values are possible, but this is the default. SoftMax normalizes the scores so they’re all positive and add up to 1.

![transformer12.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/a102ad56-9fe7-4dd8-8530-315b97fd75c5/transformer12.png)
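
The scaling and SoftMax steps might look like this in NumPy. The two raw scores are hypothetical values chosen for a two-word input; the key dimension of 64 is the one used in the paper.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

d_k = 64
scores = np.array([112.0, 96.0])  # hypothetical raw scores for word 1

scaled = scores / np.sqrt(d_k)    # divide by 8
weights = softmax(scaled)         # roughly [0.88, 0.12]

print(scaled, weights)
```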

- This SoftMax score determines how much each word will be expressed at this position. Usually the word at this position itself will have the highest SoftMax score, but sometimes it’s useful to attend to another word that is relevant to the current word.
- The **fifth step** is to multiply each value vector by the SoftMax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
- The **sixth step** is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

![transformer13.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/c7c6a670-f839-4b8d-a053-433b81eb4317/transformer13.png)
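
Steps five and six amount to a weighted sum. The sketch below uses made-up 3-dimensional value vectors and assumed SoftMax weights to keep the numbers readable:

```python
import numpy as np

weights = np.array([0.88, 0.12])   # assumed SoftMax weights for word 1

# Hypothetical value vectors for the two words.
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([10.0, 0.0, -10.0])

# Step five: scale each value vector by its weight.
weighted = [weights[0] * v1, weights[1] * v2]

# Step six: sum them to get the self-attention output for word 1.
z1 = weighted[0] + weighted[1]

print(z1)
```

The output stays close to `v1` because the first weight dominates, which is the "drown out irrelevant words" intuition in numbers.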

- That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level.

### **Matrix Calculation of Self-Attention**

- **The first step** is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

![transformer14.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/e5b3b8bd-eec2-48fa-a48b-ae98f74f3596/transformer14.png)

- **Finally**, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.

![transformer15.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/58a6a1a4-c0e4-4e4f-ae63-e692f3cfbfa8/transformer15.png)
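
A minimal NumPy sketch of the matrix form, assuming random stand-ins for the embeddings and the trained weight matrices. The function name `self_attention` is chosen for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Steps one through six for all positions at once."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (seq_len, seq_len), one row per query word
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 2, 512, 64

X = rng.normal(size=(seq_len, d_model))  # one embedding per row
W_Q, W_K, W_V = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k)) for _ in range(3))

Z, weights = self_attention(X, W_Q, W_K, W_V)
print(Z.shape, weights.sum(axis=-1))
```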

### **The Beast With Many Heads**

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

1. It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.
2. It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

![transformer16.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/0cd30677-4bb6-465c-b436-0bf7770858e7/transformer16.png)

- If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

![transformer17.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/ffc06671-50d5-4009-bdba-0e9094c25d43/transformer17.png)

This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? 

- We concatenate the matrices, then multiply the result by an additional weight matrix WO.

![transformer18.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/c16deb13-40c9-4158-9b9a-143ea868ddbb/transformer18.png)
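
Putting the heads together could look like the sketch below. Eight heads of size 64 match the text; the random weights and the helper name `multi_head_attention` are assumptions for illustration, and the per-head loop is written for clarity rather than speed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) tuples, one per attention head."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)             # one Z matrix per head
    concat = np.concatenate(outputs, axis=-1)   # (seq_len, n_heads * d_k)
    return concat @ W_O                         # back to (seq_len, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 2, 512, 8
d_k = d_model // n_heads  # 64

X = rng.normal(size=(seq_len, d_model))
heads = [
    tuple(rng.normal(scale=d_model ** -0.5, size=(d_model, d_k)) for _ in range(3))
    for _ in range(n_heads)
]
W_O = rng.normal(scale=d_model ** -0.5, size=(n_heads * d_k, d_model))

Z = multi_head_attention(X, heads, W_O)
print(Z.shape)
```

The output has the same shape as the input, which is what lets the feed-forward layer consume it as a single matrix.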

- That’s pretty much all there is to multi-headed self-attention.

![transformer19.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/1aa9b1ce-c66a-4b3a-9f40-627be5776f65/transformer19.png)

- Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:

![transformer20.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/2ccaeb97-ff35-4025-bb59-5b0ec60de6cb/transformer20.png)

- If we add all the attention heads to the picture, however, things can be harder to interpret:

![transformer21.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/d122b723-6e58-49ef-86f1-6b4e86ce8055/transformer21.png)

### **Representing The Order of The Sequence Using Positional Encoding**

- One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.
- To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern (fixed sine and cosine functions in the original paper) that the model learns to use, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

![transformer22.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/4295e4f1-a10a-4a44-823a-1ff64b731344/transformer22.png)

- If we assume the embedding has a dimensionality of 4, the actual positional encodings would look like this:

![transformer23.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/0f02a7f2-5b9c-4622-a121-9dfde634c0b5/transformer23.png)
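
For reference, here is a small sketch of the sinusoidal encoding defined in the original paper (sine on even dimensions, cosine on odd ones). The values may be laid out differently from the figure above, and the function name is chosen for this example. It assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    positions = np.arange(n_positions)[:, None]      # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # the "2i" in the paper's formula
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=3, d_model=4)
print(pe.round(4))

# The encoding is simply added to the word embeddings:
# x_with_position = embeddings + pe
```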

### **The Residuals**

- One detail in the architecture of the encoder that we need to mention before moving on is that each sub-layer (self-attention, feed-forward network) in each encoder has a residual connection around it and is followed by a [layer-normalization](https://arxiv.org/abs/1607.06450) step.

![transformer24.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/dcd050f2-ae05-40fa-9e67-74362d004289/transformer24.png)
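
The "add and normalize" pattern can be sketched as follows. This is a simplified version: the learned gain and bias of layer normalization are omitted, and the sub-layer is a trivial stand-in rather than real attention or a feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)  # learned gain/bias omitted in this sketch

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))

# Hypothetical stand-in for self-attention or the feed-forward network.
out = add_and_norm(x, lambda h: 0.5 * h)
print(out.shape)
```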

- If we’re to visualize the vectors and the layer-norm operation associated with self-attention, it would look like this:

![transformer25.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/c166c77e-4f85-4c12-ad01-10c49ff46ff5/transformer25.png)

- This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:

![transformer26.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/039677ca-6020-4389-a88d-fa11f23885c6/transformer26.png)

### **The Decoder Side**

- The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence:

![transformer27.gif](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/14326a8f-d14a-499a-b523-4435e7849e64/transformer27.gif)

- The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output.
- The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

![transformer28.gif](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/ab69f290-fa91-4385-aa4f-c1eddfa59df0/transformer28.gif)
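
The decoding loop described above can be sketched as follows. The decoder stack is replaced by a hypothetical stand-in function, `toy_decoder_step`, and the token ids `BOS` and `EOS` are assumptions made for illustration. Picking the highest-scoring word at every step (greedy decoding) is one simple strategy among several.

```python
import numpy as np

BOS, EOS = 0, 1  # hypothetical ids for the start and end-of-sentence symbols

def toy_decoder_step(encoder_output, generated_ids):
    """Stand-in for the decoder stack: returns scores over a 5-word vocabulary.

    This toy version ignores the encoder output and emits ids 2, 3, 4, then EOS.
    """
    script = [2, 3, 4, EOS]
    logits = np.full(5, -1.0)
    logits[script[len(generated_ids) - 1]] = 1.0
    return logits

def greedy_decode(encoder_output, step_fn, max_len=10):
    generated = [BOS]
    while len(generated) < max_len:
        # The decoder sees everything generated so far.
        logits = step_fn(encoder_output, generated)
        next_id = int(np.argmax(logits))
        generated.append(next_id)  # fed back in at the next time step
        if next_id == EOS:         # the special end symbol stops the loop
            break
    return generated

output_ids = greedy_decode(encoder_output=None, step_fn=toy_decoder_step)
print(output_ids)
```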

- The self-attention layers in the decoder operate slightly differently from those in the encoder:
- In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to `-inf`) before the SoftMax step in the self-attention calculation.
- The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
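
The masking of future positions can be illustrated with a small NumPy example. The raw scores are all set to zero here (a hypothetical choice) so that the effect of the mask alone is visible.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # hypothetical raw attention scores, all equal

# True above the diagonal: position i may not look at positions j > i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

masked = np.where(future, -np.inf, scores)  # future positions set to -inf
weights = softmax(masked)                   # exp(-inf) = 0, so they get zero weight

print(weights)
```

Each row spreads its attention only over itself and earlier positions: the first row attends entirely to position 0, the second splits evenly over positions 0 and 1, and so on.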

### **The Final Linear and SoftMax Layer**

The decoder stack outputs a vector of floats. How do we turn that into a word? 

That’s the job of the final Linear layer which is followed by a SoftMax Layer.

- The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.
- Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide, with each cell corresponding to the score of a unique word. That is how we interpret the output of the Linear layer.
- The SoftMax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

![transformer29.png](https://prod-files-secure.s3.us-west-2.amazonaws.com/8c823248-283b-4e7e-84a9-b8db58b76882/1a45cd6f-c952-4beb-a2ad-493d2cbae9e5/transformer29.png)
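
A rough sketch of this last stage, assuming a 10,000-word vocabulary as in the text and random stand-ins for the decoder output and the Linear layer's weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000

# Random stand-ins for the learned Linear layer.
W = rng.normal(scale=0.02, size=(d_model, vocab_size))
b = np.zeros(vocab_size)

decoder_output = rng.normal(size=d_model)  # vector produced by the decoder stack

logits = decoder_output @ W + b            # one score per vocabulary word
probs = softmax(logits)                    # all positive, summing to 1.0
next_word_id = int(np.argmax(probs))       # index of the most probable word

print(logits.shape, next_word_id)
```

Mapping `next_word_id` back to text requires the vocabulary used during training, which is not modeled here.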

### 6. **Popular Transformer Models**

- **BERT**: Designed for understanding the context of words in a sentence, enabling better performance on a variety of NLP tasks.
- **GPT**: Focuses on text generation, using a unidirectional approach where each word is predicted based on previous words.
- **T5**: Treats every NLP problem as a text-to-text problem, allowing for a unified approach to multiple tasks.
- **Transformer-XL**: Handles longer context by introducing recurrence.

### 7. **Advantages and Limitations**

- **Advantages**: High parallelization, effective handling of long-range dependencies, and superior performance on many benchmarks.
- **Limitations**: Large computational resources required, high memory usage, and complexity in understanding and interpreting model decisions.