# Comprehensive Guide to Transformers in NLP

### Chapter 1: Understanding Transformers

#### 1.1 Importance of Transformers

Transformers represent a significant advancement in NLP, addressing the limitations of previous models and enabling the development of state-of-the-art models. They are particularly important for generative AI and large language models (LLMs) like BERT and GPT. For instance, OpenAI's ChatGPT, which uses GPT-4, is based on Transformer architecture and is trained with vast amounts of data.

##### 1.1.1 Overview of Main Topics
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- Gated Recurrent Units (GRU)
- Encoder-Decoder Architecture
  - Sequence-to-sequence learning
  - Attention mechanism

#### 1.2 Detailed Architecture

The architecture of Transformers includes several key components:
- **Encoder and Decoder**: The encoder creates a representation from the input, while the decoder generates the output sequence based on this representation and previously generated tokens.
- **Self-Attention Module**: This involves query, key, and value pairs, allowing all words in a sentence to be sent in parallel to the encoder for further processing.
- **Positional Encoding**: Ensures the position of each word is taken into account, maintaining the context and meaning of the sentence.
- **Multi-Head Attention**: Runs several self-attention heads in parallel and combines their outputs, so the model can attend to different parts of the input at the same time.

#### 1.3 Sequence-to-Sequence Tasks

Transformers are particularly effective for sequence-to-sequence tasks, such as language translation. For example, translating text from English to French is a many-to-many task: a sequence of input words is mapped to a sequence of output words, and the two sequences may differ in length. Even as sentences get longer, Transformers cope with the added complexity and maintain accuracy.

#### 1.4 Encoder-Decoder Architecture

In the encoder-decoder architecture:
- **LSTM**: Used to process entire sentences.
- **Words**: Fed in one per time step, converted into vectors using an embedding layer, and then passed to the LSTM.
- **Context Vector**: Generated by the LSTM and provided to the next decoder layer for making predictions.
- **Challenges**: The context was often insufficient for longer sentences, leading to decreased accuracy.

#### 1.5 Attention Mechanism

To address the limitations of the encoder-decoder architecture, the attention mechanism was introduced. This mechanism creates additional context for the decoder, improving the accuracy of predictions for longer sentences. Despite its advantages, the attention mechanism still faced scalability issues, because words were still processed sequentially, one time step at a time.

#### 1.6 Scalability and Transfer Learning

One of the standout features of Transformers is their scalability. As the size of the dataset increases, Transformers continue to perform exceptionally well, producing models that are at the forefront of NLP research. This scalability is further enhanced by transfer learning. Pre-trained models like BERT and GPT can be fine-tuned for specific tasks without the need to train from scratch, saving time and computational resources.

#### 1.7 Application in Multimodal Tasks

Transformers are not limited to NLP tasks. They have proven to be highly effective in multimodal tasks that involve both text and images. For instance, OpenAI's DALL-E generates images based on textual descriptions, showcasing the versatility of Transformers. This capability is made possible by the same underlying architecture that powers NLP applications.

#### 1.8 Self-Attention Mechanism

The self-attention mechanism is central to the functionality of Transformers. It enables all words in a sentence to be sent in parallel to the encoder for further processing. This parallel execution is crucial for handling large datasets efficiently and effectively. The ability to process words simultaneously makes the model scalable and capable of producing state-of-the-art results.

##### 1.8.1 Importance of Self-Attention

The self-attention mechanism is crucial for the accuracy of Transformers. By capturing contextual relationships between words, it enhances the model's ability to understand and generate text. This makes Transformers particularly effective for a wide range of applications, from NLP to generative AI.

##### 1.8.2 Addressing Contextual Embeddings

A major limitation of previous models, such as encoder-decoder architectures, was the lack of contextual embeddings. Transformers address this issue through the self-attention mechanism, which creates contextual embeddings that capture the relationships between words in a sentence. This results in more accurate and meaningful representations of text.

##### 1.8.3 Example of Contextual Embeddings

Consider the sentence: "My name is Alex and I play chess." In this example, the embedding layer generates vectors for each word. However, contextual vectors should reflect the relationships between words, such as the connection between "Alex" and "I." The self-attention mechanism ensures that these relationships are captured, leading to more accurate embeddings.

#### 1.9 Summary

Transformers have revolutionized the field of artificial intelligence by enabling parallel processing of words and creating contextual embeddings. Their scalability and versatility make them suitable for various tasks, including NLP, multimodal applications, and generative AI. The self-attention module and positional encoding are key components that contribute to the success of Transformers, addressing the limitations of previous models and paving the way for future advancements in AI.

### Chapter 2: Transformer Architecture in Detail

#### 2.1 Introduction
In the Transformer architecture, the order from bottom to top is as follows:

1. **Positional Encoding**: This is applied first to give the model information about the position of each word in the sequence.
2. **Self-Attention Layer**: After positional encoding, the input goes through the self-attention layer, which converts words into contextual vectors.
3. **Feed-Forward Neural Network**: Finally, the output from the self-attention layer is processed by the feed-forward neural network.



<style>
    .center {
        display: block;
        margin-left: auto;
        margin-right: auto;
        width: 350px; /* Adjust the width as needed */
    }
</style>

<img src="Transformer.PNG" alt="Description" class="center" width="350">



#### 2.2 Basic Transformer Architecture
The basic Transformer architecture can be understood through a sequence-to-sequence task, such as translating an English sentence into French. The input is an English sentence, and the output is its French translation. We'll focus on the components inside this block diagram.

#### 2.3 Transformer Architecture
The Transformer architecture features an encoder-decoder structure with multiple encoders and decoders:
- **Encoders**: The text input passes through these encoders sequentially.
- **Decoders**: The output is generated after passing through multiple decoders.

This setup is based on the research paper "Attention is All You Need," which discusses:
- Positional encoding
- Self-attention
- Multi-head attention
- Feed-forward networks

#### 2.4 Encoder and Decoder Architecture
The Transformer model processes the input sentence through the stack of encoders, generating a set of encodings. These encodings are then passed to the decoders, which generate the output sentence. The use of self-attention and multi-head attention allows the model to capture complex dependencies between words, making it highly effective for tasks like translation.

#### 2.5 Positional Encoding
Transformers do not have a built-in sense of the order of words. Positional encoding is used to give the model information about the position of each word in the sequence. This is achieved by adding sine and cosine functions of different frequencies to the input embeddings.

#### 2.6 Self-Attention Layer
The self-attention layer takes the word vectors and turns them into contextual vectors that take the other words into account:
- **Vector Conversion**: Words are first converted into vectors by the embedding layer.
- **Contextual Vectors**: Self-attention updates each vector so that it reflects the context of the other words.

These vectors are processed by the feed-forward neural network and passed to the next encoder. This process repeats through multiple encoders.

#### 2.7 Multi-Head Attention
Multi-head attention allows the model to focus on different parts of the input sequence simultaneously. It involves running multiple self-attention mechanisms in parallel and then concatenating their outputs.

#### 2.8 Feed-Forward Networks
Each encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This network consists of two linear transformations with a ReLU activation in between.
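
The description above can be sketched in plain Python. This is a minimal illustration, not the implementation from the paper: the sizes `d_model = 3` and `d_ff = 4`, the random weights, and the helper names are all assumptions made for the example.

```python
import random

def relu(vector):
    """Element-wise ReLU: negative entries become zero."""
    return [max(0.0, x) for x in vector]

def linear(vector, weights, bias):
    """Compute vector @ weights + bias; weights has shape (len(vector), len(bias))."""
    return [
        sum(vector[i] * weights[i][j] for i in range(len(vector))) + bias[j]
        for j in range(len(bias))
    ]

def feed_forward(vector, w1, b1, w2, b2):
    """Two linear transformations with a ReLU activation in between."""
    hidden = relu(linear(vector, w1, b1))
    return linear(hidden, w2, b2)

def position_wise_ffn(sequence, w1, b1, w2, b2):
    """Apply the same network to every position, separately and identically."""
    return [feed_forward(vector, w1, b1, w2, b2) for vector in sequence]

# Hypothetical sizes and random weights, chosen only for illustration.
rng = random.Random(0)
d_model, d_ff = 3, 4
w1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
w2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

# Positions 0 and 2 hold the same vector, so they receive the same output.
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
ffn_out = position_wise_ffn(sequence, w1, b1, w2, b2)
print(ffn_out[0] == ffn_out[2])  # True
```

Because the network sees one position at a time, any mixing of information between words has to happen in the attention layer, not here.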

#### 2.9 Putting It All Together
For example, to translate the input "how are you" into French, the encoder first builds a representation of the sentence, and the decoder then generates the translation. Inside the encoder:
- **Self-Attention Layer**: Converts the word vectors into contextual vectors.
- **Feed-Forward Neural Network Layer**: Processes these vectors.

#### 2.10 Additional Details
- **Parallel Processing**: All words are processed in parallel, enhancing scalability.
- **Contextual Accuracy**: Improves accuracy for longer sentences by considering the context of other words.
- **Scalability**: The architecture allows for efficient processing of large datasets.

### Chapter 3: Self-Attention Mechanism

This chapter takes a closer look at the self-attention layer: how the layer functions as a whole and the mathematical intuition behind it. This will help in understanding how contextual embeddings are created.

#### 3.1. Key Points on Self-Attention Layer

The idea behind self-attention is to weigh the importance of different tokens in the input sequence relative to each other. Instead of keeping embedding vectors fixed, they are transformed into contextual embedding vectors based on the importance of all the other words. This is crucial for applications like language translation and text summarization.

1. **Importance Weighing**: Self-attention assigns different weights to different tokens in the input sequence.
2. **Contextual Embeddings**: Embedding vectors are adjusted to reflect the importance of other words in the sequence.
3. **Applications**: This mechanism is essential for tasks such as language translation and text summarization.

#### 3.2. Illustration of Self-Attention

To illustrate, consider the sentence "the cat sat." Each word is converted into a vector, for example:
- "the" as $$[1, 0, 0]$$
- "cat" as $$[0, 1, 0]$$
- "sat" as $$[0, 0, 1]$$

Passing these through the self-attention layer results in new vectors, which are the contextual embeddings. These embeddings take into account the importance of different tokens in the input sequence.

Self-attention, also known as scaled dot-product attention, is a mechanism in the Transformer architecture that allows the model to weigh the importance of different tokens in the input sequence. The process involves creating three important vectors: queries, keys, and values. These vectors help in determining the importance of other tokens in the context of the current token.

##### 3.2.1. Queries, Keys, and Values

- **Queries**: Help the model decide which part of the sequence to focus on for each specific token. By calculating the dot product between a query vector and all key vectors, the model assesses how much attention to give to each token relative to the current token.
- **Keys**: Represent all the tokens in the sequence and are used to compare with the query vector to calculate the attention scores.
- **Values**: Hold the actual information that will be aggregated to form the output of the attention mechanism.

#### 3.3. Token Embeddings and Linear Transformation

For example, consider the input sequence "the cat sat." Suppose the embedding size is four, meaning each word is converted into a vector with four components. The first step is token embedding, where the sentence is converted into a matrix of embedding vectors, X. The next step is linear transformation, where query, key, and value vectors are created by multiplying the embeddings by learned weight matrices.

This process ensures that the model has information about all the other words and sentences, allowing it to create contextual embeddings based on the entire sentence. This is essential for capturing dependencies and context, which is crucial for applications like language translation and text summarization.

#### 3.4. Practical Example

For a practical example, consider token embeddings. Let's initialize the weights $$ W_q, W_k, \text{and} \, W_v $$ as identity matrices. For instance, a 3x3 identity matrix has ones on the diagonal and zeros everywhere else. Multiplying an embedding by the identity matrix leaves it unchanged, so the query, key, and value vectors for the word "the" are the same as its original embedding.

Similarly, for other words like "cat" and "sat," the query, key, and value vectors are computed. This is the first step: obtaining token embeddings and creating Q, K, and V by multiplying the embeddings by weight matrices. Identity weights are used here only to keep the arithmetic simple; in a real model the weights start from random values and are adjusted through backpropagation.
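
As a minimal sketch of this step, the projection can be written in plain Python. The one-hot embeddings follow the "the cat sat" illustration from Section 3.2; the `matmul` and `identity` helpers and the alternative matrix `W_q_alt` are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def identity(n):
    """Return the n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

tokens = ["the", "cat", "sat"]
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # one embedding per row

W_q, W_k, W_v = identity(3), identity(3), identity(3)
Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
print(Q[0], K[0], V[0])  # [1, 0, 0] [1, 0, 0] [1, 0, 0]

# A hypothetical non-identity weight matrix changes the query vectors.
W_q_alt = [[2, 0, 0], [0, 1, 0], [0, 0, 1]]
Q_alt = matmul(X, W_q_alt)
print(Q_alt[0])  # [2, 0, 0]
```

With identity weights the projections simply copy the embeddings, which is why the worked example can reuse the same numbers for queries, keys, and values.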


### 3.5. Attention Scores

#### 3.5.1. **Computation of Attention Scores**:
To calculate the dot product between a query vector and all key vectors, you follow these steps:

1. **Identify the query and key vectors**: For each token, you have a corresponding query vector and key vector, $$ Q \, \text{and} \, K $$

2. **Compute the dot product**: For each pair of tokens, calculate the dot product of their query and key vectors.

$$ \text{Attention Score} = Q \times K^T $$

- **Attention score for "the" with respect to "the"**:
  - Query vector for "the": $$ Q_{\text{the}} $$
  - Key vector for "the": $$ K_{\text{the}} $$
  - Dot product: $$ Q_{\text{the}} \cdot K_{\text{the}}^T = 1 $$ (using the one-hot vectors and identity weights from the example above)

- **Attention score for "the" with "cat"**:
  - Query vector for "the": $$ Q_{\text{the}} $$
  - Key vector for "cat": $$ K_{\text{cat}} $$
  - Dot product: $$ Q_{\text{the}} \cdot K_{\text{cat}}^T $$

- **Attention score for "the" with "sat"**:
  - Query vector for "the": $$ Q_{\text{the}} $$
  - Key vector for "sat": $$ K_{\text{sat}} $$
  - Dot product: $$ Q_{\text{the}} \cdot K_{\text{sat}}^T $$

Each of these dot products gives you the attention score, indicating how much focus the model should place on each token relative to the current token "the."



#### 3.5.2. **Example for Token "cat"**:
- Scores are computed similarly, indicating the importance of "sat" for "cat."
- The same process is repeated for "sat," resulting in scores that reflect the dependencies between the tokens.
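
The score computation for all token pairs can be sketched as follows. This is an illustrative snippet that assumes the one-hot query and key vectors obtained with identity weights; the function names are made up for the example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_scores(queries, keys):
    """Entry [i][j] is the dot product of query i with key j."""
    return [[dot(q, k) for k in keys] for q in queries]

tokens = ["the", "cat", "sat"]
Q = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
K = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

scores = attention_scores(Q, K)
for token, row in zip(tokens, scores):
    print(token, row)
# the [1, 0, 0]
# cat [0, 1, 0]
# sat [0, 0, 1]
```

Each row holds the scores of one token against every token in the sentence. With one-hot vectors every token only matches itself; learned weights would produce non-zero scores between different tokens.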

### 3.6. Scaling

#### 3.6.1. **Purpose of Scaling**:
- Scaling is crucial to prevent the dot product from growing too large, ensuring stable gradients during training.
- Scores are scaled down by dividing by the square root of the dimensions of the key vectors.
- Example: If the dimension is 4, the square root is 2.

    $$ \text{Scaled Score} = \frac{\text{Attention Score}}{\sqrt{d_k}} $$

Scaling changes only the magnitude of the scores, not their ranking, so they still indicate the importance of each token relative to the current token while avoiding problems such as unstable gradients and softmax saturation.

#### 3.6.2. **Benefits of Scaling**:
- **Limits Large Scores**: Dot products tend to grow in magnitude as the key dimension $$d_k$$ increases. Dividing by $$\sqrt{d_k}$$ keeps the scores in a moderate range, which helps avoid unstable updates during backpropagation.
- **Maintains Stable Gradients**: By scaling the scores, the gradients remain stable, ensuring effective training.
- **Avoids Softmax Saturation**: Without scaling, the dot products could become very large, pushing the softmax function into regions where it has extremely small gradients, leading to poor learning. Scaling helps keep the values in a range where the softmax function can operate effectively.

### 3.7. Softmax and Attention Weights
After scaling, the scores are passed through a softmax function to obtain the attention weights. The softmax function ensures that the attention weights sum to 1, providing a probability distribution over the tokens.

The attention weights are therefore computed as:
$$ \text{Attention Weights} = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) $$

### 3.8. Weighted Sum of Values
The final step is to compute the weighted sum of the value vectors, using the attention weights as coefficients. This gives the final output of the self-attention mechanism. Here V denotes the value vectors, which are obtained from the embedding vectors.

The weighted sum for token $$i$$ is:
$$ \text{Output}_i = \sum_j \text{Attention Weights}_{ij} \cdot V_j $$
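
A small sketch of the weighted sum, under the assumption that the attention weights for one token are already known. The weights `[0.5, 0.25, 0.25]` and the second set of value vectors are hypothetical numbers chosen to keep the arithmetic easy to follow.

```python
def weighted_sum(weights, values):
    """Combine value vectors, using one attention weight per value vector."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# One-hot value vectors for "the", "cat", "sat".
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = [0.5, 0.25, 0.25]  # hypothetical attention weights for one token
output = weighted_sum(weights, values)
print(output)  # [0.5, 0.25, 0.25]

# With other value vectors, the output is a blend of their contents.
blended = weighted_sum([0.5, 0.5], [[2.0, 0.0], [0.0, 4.0]])
print(blended)  # [1.0, 2.0]
```

The output vector for a token is therefore a mixture of all value vectors, with the largest share coming from the tokens that received the highest attention weights.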


#### Recall of Value Vectors (V)
The value vectors (V) are derived from the input embeddings.
- **Embedding Vectors**: These are the initial representations of the input tokens. Each token in the input sequence is mapped to a high-dimensional vector that captures its semantic meaning.
- **Value Vectors (V)**: These are a linearly transformed version of the embedding vectors, $$V = XW_v$$. In this chapter's worked example $$W_v$$ is the identity matrix, so the value vectors equal the embeddings. The value vectors contain the actual information that will be combined based on the attention weights.

### Process
1. **Input Embeddings**: The input tokens are first converted into embedding vectors.
2. **Transformation**: The embeddings are linearly transformed by $$W_v$$ to produce the value vectors (with an identity $$W_v$$ they are unchanged).
3. **Attention Mechanism**: The attention mechanism computes the attention weights and uses them to combine the value vectors, resulting in the final output.

### 3.10. Self-Attention Mechanism: Detailed Mathematical Computation

Let's dive into the detailed mathematical computation for the self-attention mechanism using the example sentence "the cat sat."

#### Step 1: Token Embeddings
Assume the embedding vectors for the tokens are:
- "the": $$[1, 0, 0]$$
- "cat": $$[0, 1, 0]$$
- "sat": $$[0, 0, 1]$$

#### Step 2: Linear Transformation
Assume the weight matrices $$W_q$$, $$W_k$$, and $$W_v$$ are identity matrices:
- $$W_q = W_k = W_v = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

The query, key, and value vectors for each token are the same as the original vectors:
- Query vector for "the": $$Q_{\text{the}} = [1, 0, 0]$$
- Key vector for "the": $$K_{\text{the}} = [1, 0, 0]$$
- Value vector for "the": $$V_{\text{the}} = [1, 0, 0]$$

Similarly, for "cat" and "sat":
- Query vector for "cat": $$Q_{\text{cat}} = [0, 1, 0]$$
- Key vector for "cat": $$K_{\text{cat}} = [0, 1, 0]$$
- Value vector for "cat": $$V_{\text{cat}} = [0, 1, 0]$$

- Query vector for "sat": $$Q_{\text{sat}} = [0, 0, 1]$$
- Key vector for "sat": $$K_{\text{sat}} = [0, 0, 1]$$
- Value vector for "sat": $$V_{\text{sat}} = [0, 0, 1]$$

#### Step 3: Compute Attention Scores
Compute the dot product between the query vector of each token and the key vectors of all tokens.

For "the":
- Attention score for "the" with respect to "the":
  $$ Q_{\text{the}} \cdot K_{\text{the}}^T = [1, 0, 0] \cdot [1, 0, 0]^T = 1 \times 1 + 0 \times 0 + 0 \times 0 = 1 $$

- Attention score for "the" with respect to "cat":
  $$ Q_{\text{the}} \cdot K_{\text{cat}}^T = [1, 0, 0] \cdot [0, 1, 0]^T = 1 \times 0 + 0 \times 1 + 0 \times 0 = 0 $$

- Attention score for "the" with respect to "sat":
  $$ Q_{\text{the}} \cdot K_{\text{sat}}^T = [1, 0, 0] \cdot [0, 0, 1]^T = 1 \times 0 + 0 \times 0 + 0 \times 1 = 0 $$

For "cat":
- Attention score for "cat" with respect to "the":
  $$ Q_{\text{cat}} \cdot K_{\text{the}}^T = [0, 1, 0] \cdot [1, 0, 0]^T = 0 \times 1 + 1 \times 0 + 0 \times 0 = 0 $$

- Attention score for "cat" with respect to "cat":
  $$ Q_{\text{cat}} \cdot K_{\text{cat}}^T = [0, 1, 0] \cdot [0, 1, 0]^T = 0 \times 0 + 1 \times 1 + 0 \times 0 = 1 $$

- Attention score for "cat" with respect to "sat":
  $$ Q_{\text{cat}} \cdot K_{\text{sat}}^T = [0, 1, 0] \cdot [0, 0, 1]^T = 0 \times 0 + 1 \times 0 + 0 \times 1 = 0 $$

For "sat":
- Attention score for "sat" with respect to "the":
  $$ Q_{\text{sat}} \cdot K_{\text{the}}^T = [0, 0, 1] \cdot [1, 0, 0]^T = 0 \times 1 + 0 \times 0 + 1 \times 0 = 0 $$

- Attention score for "sat" with respect to "cat":
  $$ Q_{\text{sat}} \cdot K_{\text{cat}}^T = [0, 0, 1] \cdot [0, 1, 0]^T = 0 \times 0 + 0 \times 1 + 1 \times 0 = 0 $$

- Attention score for "sat" with respect to "sat":
  $$ Q_{\text{sat}} \cdot K_{\text{sat}}^T = [0, 0, 1] \cdot [0, 0, 1]^T = 0 \times 0 + 0 \times 0 + 1 \times 1 = 1 $$

#### Step 4: Scaling
Scale the attention scores by dividing by the square root of the dimension of the key vectors ($$d_k$$). Assuming $$d_k = 3$$, the square root of 3 is approximately 1.732.

For "the":
- Scaled score for "the" with respect to "the":
  $$ \frac{1}{\sqrt{3}} \approx 0.577 $$

- Scaled score for "the" with respect to "cat":
  $$ \frac{0}{\sqrt{3}} = 0 $$

- Scaled score for "the" with respect to "sat":
  $$ \frac{0}{\sqrt{3}} = 0 $$

For "cat":
- Scaled score for "cat" with respect to "the":
  $$ \frac{0}{\sqrt{3}} = 0 $$

- Scaled score for "cat" with respect to "cat":
  $$ \frac{1}{\sqrt{3}} \approx 0.577 $$

- Scaled score for "cat" with respect to "sat":
  $$ \frac{0}{\sqrt{3}} = 0 $$

For "sat":
- Scaled score for "sat" with respect to "the":
  $$ \frac{0}{\sqrt{3}} = 0 $$

- Scaled score for "sat" with respect to "cat":
  $$ \frac{0}{\sqrt{3}} = 0 $$

- Scaled score for "sat" with respect to "sat":
  $$ \frac{1}{\sqrt{3}} \approx 0.577 $$

#### Step 5: Softmax and Attention Weights
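Apply the softmax function to the scaled scores to obtain the attention weights.

For "the":
- Attention weights:
  $$ \text{softmax}([0.577, 0, 0]) = \left[\frac{e^{0.577}}{e^{0.577} + e^0 + e^0}, \frac{e^0}{e^{0.577} + e^0 + e^0}, \frac{e^0}{e^{0.577} + e^0 + e^0}\right] $$
  $$ \approx [0.471, 0.264, 0.264] $$

For "cat":
- Attention weights:
  $$ \text{softmax}([0, 0.577, 0]) = \left[\frac{e^0}{e^0 + e^{0.577} + e^0}, \frac{e^{0.577}}{e^0 + e^{0.577} + e^0}, \frac{e^0}{e^0 + e^{0.577} + e^0}\right] $$
  $$ \approx [0.264, 0.471, 0.264] $$

For "sat":
- Attention weights:
  $$ \text{softmax}([0, 0, 0.577]) = \left[\frac{e^0}{e^0 + e^0 + e^{0.577}}, \frac{e^0}{e^0 + e^0 + e^{0.577}}, \frac{e^{0.577}}{e^0 + e^0 + e^{0.577}}\right] $$
  $$ \approx [0.264, 0.264, 0.471] $$

#### Step 6: Weighted Sum of Values
Compute the weighted sum of the value vectors using the attention weights.

For "the":
- Output vector:
  $$ \text{Output}_{\text{the}} = 0.471 \times [1, 0, 0] + 0.264 \times [0, 1, 0] + 0.264 \times [0, 0, 1] $$
  $$ = [0.471, 0.264, 0.264] $$

For "cat":
- Output vector:
  $$ \text{Output}_{\text{cat}} = 0.264 \times [1, 0, 0] + 0.471 \times [0, 1, 0] + 0.264 \times [0, 0, 1] $$
  $$ = [0.264, 0.471, 0.264] $$

For "sat":
- Output vector:
  $$ \text{Output}_{\text{sat}} = 0.264 \times [1, 0, 0] + 0.264 \times [0, 1, 0] + 0.471 \times [0, 0, 1] $$
  $$ = [0.264, 0.264, 0.471] $$

Because the value vectors are one-hot in this example, each output vector equals its row of attention weights. The embedding of "the" has changed from $$[1, 0, 0]$$ to a contextual vector that also carries some weight from "cat" and "sat."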


#### 3.11. Summary
Without scaling, the dot product of query and key vectors can result in very high values, leading to issues like unstable gradients and softmax saturation. For example, if the scores are 6 and 4, applying softmax without scaling results in a large difference in attention weights, about 0.88 and 0.12. As the gap between scores grows, the softmax output approaches a one-hot vector and its gradients become very small, which can cause vanishing gradient problems during backpropagation.

By scaling the scores, the attention weights become more balanced. For example, scaling the scores 6 and 4 by dividing by the square root of 4 results in 3 and 2. Applying softmax to these scaled scores results in more balanced attention weights, such as 0.73 and 0.27. This balance prevents vanishing gradient problems and ensures stable training.
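
The effect of scaling on the scores 6 and 4 can be checked with a few lines of Python. This is a sketch that assumes $$d_k = 4$$, as in the example; the `softmax` helper subtracts the maximum before exponentiating, a common trick for numerical stability that does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [6.0, 4.0]
d_k = 4
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # [3.0, 2.0]

raw_weights = softmax(raw_scores)
scaled_weights = softmax(scaled_scores)

print([round(w, 2) for w in raw_weights])     # [0.88, 0.12]
print([round(w, 2) for w in scaled_weights])  # [0.73, 0.27]
```

The ranking of the two tokens is the same in both cases; scaling only makes the distribution less extreme.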

In summary, scaling stabilizes the training process by preventing extremely large dot products and ensuring balanced attention weights. This is crucial for the effective operation of the self-attention mechanism in the Transformer architecture.

## Chapter 4: Multi-Head Attention in Transformers

### 4.1 Multi-Head Attention in Transformers

Multi-head attention allows the model to capture more information by focusing on different parts of the input. This section covers the entire process of multi-head attention. In the next section, the process of passing these combined vectors to the feed-forward neural network and performing forward and backward propagation will be discussed.

### 4.2 Steps in Self-Attention

First, the query, key, and value vectors are calculated from learned weights \(W_q\), \(W_k\), and \(W_v\):

$$
Q = XW_q, \quad K = XW_k, \quad V = XW_v
$$

Then, the attention score is calculated by taking the dot product of the query and key vectors:

$$
\text{Attention Score} = QK^T
$$

The scores are scaled by the square root of the dimension of the key vectors (\(d_k\)):

$$
\text{Scaled Scores} = \frac{QK^T}{\sqrt{d_k}}
$$

A softmax activation function is applied to obtain the attention weights:

$$
\text{Attention Weights} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
$$

Finally, the weighted sum of the values is calculated by multiplying the attention weights with the value vectors:

$$
Z = \text{Attention Weights} \cdot V
$$
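
The steps above can be sketched end to end in plain Python. This is a minimal single-head illustration, assuming the one-hot "the cat sat" embeddings and identity weight matrices used in Chapter 3; the helper names are made up for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """Return (Z, attention_weights) for a single attention head."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                                  # QK^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # / sqrt(d_k)
    weights = [softmax(row) for row in scaled]                       # row-wise softmax
    return matmul(weights, V), weights                               # weights . V

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # "the", "cat", "sat"

Z, weights = self_attention(X, identity, identity, identity)
print([round(w, 3) for w in weights[0]])  # [0.471, 0.264, 0.264]
```

The softmax is applied row by row, so each token gets its own probability distribution over all tokens in the sentence.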

### 4.3 Contextual Vectors

The goal is this: given the embedding vectors, produce contextual vectors that reflect the context and each word's dependencies on the other words in the sentence.

### 4.4 Multi-Head Attention

Take words such as "the" and "cat" as input, each represented by its embedding vector. Multiplying the embeddings by \(W_q\), \(W_k\), and \(W_v\) gives the query, key, and value vectors, so each word has its own query, key, and value vector. The query is then multiplied with the keys, the result is divided by \(\sqrt{d_k}\) for scaling, a softmax activation function is applied, and finally the result is multiplied with the value vectors to get the \(Z\) value, which is the self-attention output for the word.

### 4.5 Purpose of Multi-Head Attention

The idea with multi-head attention is to run self-attention with multiple heads. For the same words, multiple attention heads can be created. For example, with "thinking machine," a first set of weights

$$
W_q, W_k, W_v
$$

is initialized, and the first head's vectors

$$
q_1 = XW_q, \quad k_1 = XW_k, \quad v_1 = XW_v
$$

are calculated. Further heads use their own weight sets in the same way. Each set of vectors may capture different contextual information. This expands the model's ability to focus on different positions of words or tokens.

### 4.6 Multiple Sets of Key-Value Pairs

In multi-head attention, multiple sets of query, key, and value weight matrices are used, each randomly initialized. After training, each set projects the input embeddings into a different representational subspace. This yields multiple attention heads, each capturing different aspects of the input.

### 4.7 Combining Attention Heads

After performing self-attention, this information needs to be provided to the feed-forward neural network. Instead of one attention output, there is now one output per head. These need to be combined before passing them to the feed-forward neural network. All the attention heads are concatenated and the result is multiplied by \(W_o\) to get the final \(Z\):

$$
Z = \text{Concat}(Z_1, Z_2, \ldots, Z_h)W_o
$$

### 4.8 Self-Attention Mechanism

In the self-attention mechanism, each word in the input sequence is transformed into three vectors: Query (Q), Key (K), and Value (V). These vectors are obtained by multiplying the input embeddings by learned weight matrices:

$$
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
$$

where \( X \) is the input embedding, and \( W^Q \), \( W^K \), and \( W^V \) are the weight matrices for the query, key, and value vectors, respectively.

The attention scores are calculated by taking the dot product of the query vector with all key vectors and scaling by \( \sqrt{d_k} \), followed by a softmax function to obtain the attention weights, which are then used to weight the value vectors:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

where \( d_k \) is the dimension of the key vectors, and the softmax function ensures that the attention weights sum to 1.

### 4.9 Multi-Head Attention

Multi-head attention allows the model to focus on different parts of the input sequence simultaneously. For each word, multiple attention heads are created, each with its own set of weight matrices. The outputs of these attention heads are concatenated and linearly transformed:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O
$$

where

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

and \( W^O \) is the output weight matrix.
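
A minimal sketch of this computation in plain Python follows. It assumes self-attention, so the same input \( X \) plays the role of \( Q \), \( K \), and \( V \) before the per-head projections. The sizes (3 tokens, model dimension 4, 2 heads), the random weights, and the helper names are all hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    keys_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, keys_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs position by position.
    concatenated = [
        [value for head in head_outputs for value in head[i]]
        for i in range(len(X))
    ]
    return matmul(concatenated, W_o)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, num_heads = 3, 4, 2
d_head = d_model // num_heads

X = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(num_heads)
]
W_o = random_matrix(num_heads * d_head, d_model, rng)

Z = multi_head_attention(X, heads, W_o)
print(len(Z), len(Z[0]))  # 3 4
```

Giving each head a smaller dimension (`d_model // num_heads`) is the choice made in the original paper; it keeps the concatenated output the same width as the input.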

### 4.10 Combining Attention Heads

The next step involves passing these vectors to a feed-forward neural network for further processing within the encoder. With a single head, each word yields one output vector; with multiple heads, it yields one vector per head. Before passing through the feed-forward neural network, these vectors need to be combined into a single matrix, and then the forward and backward propagation can be performed.

The feed-forward neural network expects a single matrix, so a method is needed to condense all these attention heads into one. This is achieved by concatenating all the attention heads and multiplying by the output weight matrix \( W_o \), which is learned together with the other weights. With eight heads:

$$
Z = \text{Concat}(Z_0, Z_1, Z_2, \ldots, Z_7)W_o
$$

### 4.11 Positional Encoding

One major advantage of using Transformers is the ability to process word tokens in parallel. However, this advantage also introduces a drawback: the lack of sequential structure. The order of words is crucial, as it changes the meaning of sentences. For example, "lion kills tiger" and "tiger kills lion" have different meanings due to word order.

To address this, positional encoding is used to represent the order of sequences. According to the research paper "Attention is All You Need," a positional encoding vector is created and added to the embedding vector of each word. This vector indicates the position of each word in the sequence.

The positional encoding for each position \( pos \) and dimension index \( i \) is defined as:

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

where \( d_{model} \) is the dimension of the model. These sinusoidal functions ensure that each position has a unique encoding, and similar positions have similar encodings.

The positional encoding vector is then added to the input embedding vector:

$$
X' = X + PE
$$

This combined vector \( X' \) is then used in the self-attention mechanism, allowing the model to incorporate the positional information of each word.
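
The two formulas and the addition step can be sketched as follows. This is an illustrative plain-Python version; the function names, the sequence length of 3, and \( d_{model} = 4 \) are assumptions made for the example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # even_dim plays the role of 2i in the formulas.
        for even_dim in range(0, d_model, 2):
            angle = pos / (10000 ** (even_dim / d_model))
            pe[pos][even_dim] = math.sin(angle)
            if even_dim + 1 < d_model:
                pe[pos][even_dim + 1] = math.cos(angle)
    return pe

def add_positional_encoding(embeddings, pe):
    """Compute X' = X + PE element by element."""
    return [
        [x + p for x, p in zip(x_row, p_row)]
        for x_row, p_row in zip(embeddings, pe)
    ]

pe = positional_encoding(seq_len=3, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]

# Hypothetical embeddings for a three-word sentence.
X = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
X_prime = add_positional_encoding(X, pe)
```

Position 0 always encodes to alternating 0 and 1 because the sine of 0 is 0 and the cosine of 0 is 1; later positions get distinct patterns.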

### References
[1] Krish Naik, "Complete Transformers For NLP Deep Learning One Shot With Handwritten Notes," https://www.youtube.com/watch?v=3bPhDUSAUYI

[2] A. Vaswani et al., "Attention Is All You Need," https://arxiv.org/pdf/1706.03762