# The Annotated Transformer
* * *

This is my implementation, based on http://nlp.seas.harvard.edu/annotated-transformer/. More precisely, the code is almost the same as the original, but with my own annotations added (with the help of my friend ChatGPT) to understand the Transformer architecture in my own way.

### Prelims

- First install the dependencies from the requirements.txt file in the repo.
- Then install the spaCy language models using the following cell.

In [1]:
!python -m spacy download de_core_news_sm
!python -m spacy download en_core_web_sm

DEPRECATION: https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.2.0/de_core_news_sm-3.2.0-py3-none-any.whl#egg=de_core_news_sm==3.2.0 contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617
Collecting de-core-news-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.2.0/de_core_news_sm-3.2.0-py3-none-any.whl (19.1 MB)
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')

Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Imports

<div style="background-color:#FFFFE0; padding: 20px;">

The following imports are used in the Python code for various purposes:

1. `os`: The `os` module provides a way of interacting with the operating system, such as navigating the file system, creating directories, and managing environment variables.
2. `os.path.exists`: A function to check if a file or directory exists in the file system.
3. `torch`: The main PyTorch library, used for creating and managing tensors, defining neural network layers, and performing various operations on tensors.
4. `torch.nn`: A sub-module of PyTorch containing predefined neural network layers and other utilities.
5. `torch.nn.functional`: A sub-module of PyTorch containing various activation functions and utility functions, such as padding and normalization.
6. `math`: The Python standard library's math module, containing mathematical functions and constants.
7. `copy`: The Python standard library's copy module, used for creating shallow and deep copies of Python objects.
8. `time`: The Python standard library's time module, used for measuring the execution time of code segments.
9. `torch.optim.lr_scheduler`: A sub-module of PyTorch containing various learning rate schedulers for adjusting the learning rate during training.
10. `pandas`: A library for data manipulation and analysis, particularly useful for working with tabular data.
11. `altair`: A library for declarative data visualization in Python.
12. `torchtext.data.functional`: A sub-module of the TorchText library containing utility functions for working with text data.
13. `torch.utils.data`: A sub-module of PyTorch containing utilities for working with datasets and data loaders.
14. `torchtext.vocab`: A sub-module of the TorchText library containing utilities for building and managing vocabularies.
15. `torchtext.datasets`: A sub-module of the TorchText library containing various pre-built datasets for natural language processing tasks.
16. `spacy`: A library for natural language processing, used for tokenization, part-of-speech tagging, dependency parsing, and more.
17. `GPUtil`: A library for monitoring and managing the GPU utilization and memory usage of NVIDIA GPUs.
18. `warnings`: The Python standard library's warnings module, used for managing warning messages during code execution.
19. `torch.utils.data.distributed`: A sub-module of PyTorch containing utilities for working with distributed data samplers in multi-GPU or multi-node settings.
20. `torch.distributed`: A sub-module of PyTorch containing utilities for distributed training and communication between processes.
21. `torch.multiprocessing`: A sub-module of PyTorch providing a PyTorch-specific wrapper around the Python multiprocessing module, used for parallelizing code execution across multiple CPU cores.
22. `torch.nn.parallel`: A sub-module of PyTorch containing utilities for parallelizing the training of neural networks across multiple devices.

The `warnings.filterwarnings("ignore")` line is used to suppress warning messages during the execution of the notebook. The `RUN_EXAMPLES` variable is set to `True` to enable the execution of examples in the notebook; set it to `False` to skip execution (e.g., for debugging).

</div>

In [3]:
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy
import GPUtil
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

warnings.filterwarnings("ignore")

# Set to False to skip notebook execution (e.g. for debugging)
RUN_EXAMPLES = True

### Convenience functions

In [4]:
def is_interactive_notebook():
    return __name__ == "__main__"

def show_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        return fn(*args)

def execute_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        fn(*args)


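class DummyOptimizer(torch.optim.Optimizer):
    # No-op optimizer for code paths that expect an optimizer but should not update weights
    def __init__(self):
        # Deliberately skips super().__init__(); only `param_groups` is provided
        self.param_groups = [{"lr": 0}]

    def step(self):
        pass

    def zero_grad(self, set_to_none=False):
        pass


class DummyScheduler:
    # No-op learning-rate scheduler to pair with DummyOptimizer
    def step(self):
        pass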

### Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks<sup>1</sup> as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer<sup>2</sup> this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention<sup>3</sup>.
<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Convolutional Neural Networks (CNNs)** are a class of deep learning models that are especially effective in processing grid-like data, such as images. They consist of convolutional layers, which apply filters to local input regions, allowing the model to learn spatial hierarchies and capture local patterns.

<sup>2</sup> **Self-Attention** is a mechanism in the Transformer model that allows it to weigh the importance of different input elements when processing a specific element. It computes a score for each pair of elements and uses these scores to create a weighted sum of the input elements, which is then used to compute the output.

<sup>3</sup> **Multi-Head Attention** is an extension of the self-attention mechanism in the Transformer model. It uses multiple parallel self-attention layers, or "heads," to focus on different parts of the input simultaneously. This allows the model to capture various aspects of the input data, improving its ability to learn and understand complex dependencies.
</div>


Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.

To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

## Part 1: Model Architecture
### Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure<sup>1</sup>. Here, the encoder maps an input sequence of symbol representations $(x_1,...,x_n)$ to a sequence of continuous representations $z = (z_1,...,z_n)$. Given $z$, the decoder then generates an output sequence $(y_1,...,y_m)$ of symbols one element at a time. At each step, the model is auto-regressive<sup>2</sup>, consuming the previously generated symbols as additional input when generating the next.

<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Encoder-Decoder Architecture**: This structure is commonly used in sequence-to-sequence (seq2seq) models, which are designed to map input sequences to output sequences. The encoder processes the input sequence and generates a continuous representation, often called a "context vector" or "hidden state" (in the Transformer it is one vector per input position, and the code below calls it `memory`). The decoder then uses this representation to generate the output sequence, step by step. This great <a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">reference</a> provides a cool visual explanation.

**Example**: Imagine a neural machine translation model that translates English sentences to French. The input sequence consists of English words (e.g., "How are you?"), and the output sequence consists of French words (e.g., "Comment ça va ?"). In this case, the encoder processes the English words and creates a continuous representation that captures their meaning. The decoder then generates the French translation based on this representation.

<sup>2</sup> **Auto-regressive Models**: These models generate output sequences one element at a time, using previously generated elements as additional input for generating the next element. In the context of the encoder-decoder architecture, this means that the decoder generates each output symbol based on the continuous representation from the encoder as well as the previously generated output symbols.
</div>
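
The auto-regressive loop described above can be sketched in plain Python. This is a toy illustration, not the model built in this notebook: `toy_next_token`, `TOY_TABLE`, and the `<s>` / `</s>` markers are hypothetical stand-ins for a trained decoder, chosen only to show how each step consumes the symbols generated so far.

```python
# Hypothetical stand-in for a trained decoder: maps the most recent
# output symbol to the next one.
TOY_TABLE = {
    "<s>": "Comment",
    "Comment": "ça",
    "ça": "va",
    "va": "?",
    "?": "</s>",
}


def toy_next_token(memory, generated):
    # A real decoder would attend over `memory` and all of `generated`;
    # this toy only looks at the last generated symbol.
    return TOY_TABLE[generated[-1]]


def greedy_decode(memory, max_len=10):
    generated = ["<s>"]  # start-of-sequence marker
    for _ in range(max_len):
        next_token = toy_next_token(memory, generated)
        if next_token == "</s>":  # stop at the end-of-sequence marker
            break
        generated.append(next_token)  # feed the output back in as input
    return generated[1:]  # drop the start marker


memory = ["How", "are", "you", "?"]  # stands in for the encoder output z
print(greedy_decode(memory))
```

The key point is the `generated.append(...)` line: every new symbol becomes part of the input for the next step, which is what makes the decoder auto-regressive.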

In [5]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture.
    Base for this and many other models.
    """
    
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder  # The encoder layer (typically an RNN, CNN, or Transformer)
        self.decoder = decoder  # The decoder layer (typically an RNN, CNN, or Transformer)
        self.src_embed = src_embed  # The embedding layer for the source language tokens
        self.tgt_embed = tgt_embed  # The embedding layer for the target language tokens
        self.generator = generator  # The generator layer, which produces the final output (e.g., a linear layer)
        
    def forward(self, src, tgt, src_mask, tgt_mask):
        # Take in and process masked src and target sequences.
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        # Encode the input sequence (src) using the source embedding layer and the encoder.
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        # Decode the encoded memory using the target embedding layer, the decoder, and the masks.
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


In [6]:
class Generator(nn.Module):
    """
    Define standard linear + softmax generation step.
    The Generator class is responsible for producing the final output
    after the Encoder-Decoder architecture processes the input.
    """
    
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)  # Linear layer that maps from the model's hidden dimension to the vocabulary size
        
    def forward(self, x):
        # Apply the linear layer and log softmax to produce the output probabilities
        return log_softmax(self.proj(x), dim=-1)


<div style="background-color:#FFFFE0; padding: 20px;">
**log_softmax**: The `log_softmax` function computes the natural logarithm of the softmax function. Softmax converts a vector of scores into a probability distribution, where each element of the output represents the probability of one class in a multi-class classification problem. Computing the logarithm and the softmax together in a single operation is more numerically stable than applying `log` to the output of `softmax`, especially with very small probabilities or large scores, because it avoids underflow and overflow in the intermediate exponentials. This stability is particularly useful during optimization.

In PyTorch, the `log_softmax` function can be found in the `torch.nn.functional` module and is used as follows:

```python
import torch.nn.functional as F
output = F.log_softmax(input_tensor, dim=-1)
```
Here, `input_tensor` is the tensor to which `log_softmax` is applied, and `dim` specifies the dimension along which the softmax is computed.
</div>
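
To make the stability argument concrete, here is a small dependency-free sketch. It compares a direct `log(softmax(x))` with the log-sum-exp formulation, which is a standard way to compute the same quantity stably. The function names are my own, and the sketch does not claim to reproduce PyTorch's internal implementation.

```python
import math


def naive_log_softmax(scores):
    # Direct translation of log(softmax(x)); exp() overflows for large scores.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def stable_log_softmax(scores):
    # Log-sum-exp trick: subtracting the max keeps every exponent <= 0.
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


small = [1.0, 2.0, 3.0]
print(stable_log_softmax(small))

large = [1000.0, 0.0]
print(stable_log_softmax(large))
try:
    naive_log_softmax(large)
except OverflowError as err:
    print("naive version failed:", err)
```

For moderate scores both versions agree; for large scores only the stable version returns a result.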

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

<img src="fig1.png" />

### Encoder and Decoder Stacks
#### Encoder

The encoder is composed of a stack of N=6 identical layers.
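
Stacking identical layers only works if each layer gets its own parameters, which is why the notebook's `clones` helper uses `copy.deepcopy`. The sketch below shows the difference with a plain Python object; `ToyLayer` is a hypothetical stand-in for an `nn.Module` with a single weight.

```python
import copy


class ToyLayer:
    # Hypothetical stand-in for an nn.Module with one learnable weight.
    def __init__(self, weight):
        self.weight = [weight]


template = ToyLayer(1.0)

# Three references to the SAME object: all "layers" share one weight.
shared = [template for _ in range(3)]
# Three deep copies: each layer owns an independent weight.
independent = [copy.deepcopy(template) for _ in range(3)]

shared[0].weight[0] = 99.0
independent[0].weight[0] = -1.0

print([layer.weight[0] for layer in shared])       # [99.0, 99.0, 99.0]
print([layer.weight[0] for layer in independent])  # [-1.0, 1.0, 1.0]
```

Without the deep copy, updating one layer would silently update all of them, and the stack would behave like one layer applied N times.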

In [7]:
def clones(module, N):
    # Produce N identical layers.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

In [8]:
class Encoder(nn.Module):
    """
    Core encoder is a stack of N layers.
    The Encoder class is a part of the Transformer architecture and
    inherits from PyTorch's nn.Module class.
    """
    
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)  # Create N clones of the given layer
        self.norm = LayerNorm(layer.size)  # Initialize layer normalization for the final output
        
    def forward(self, x, mask):
        # Pass the input (x) and mask through each layer in turn
        for layer in self.layers:
            x = layer(x, mask)  # Process input and mask through the current layer
        return self.norm(x)  # Apply layer normalization to the output after processing all layers

<div style="background-color:#FFFFE0; padding: 20px;">
    
Layer normalization is a technique used to improve the training of deep neural networks by normalizing the activations of neurons within each layer. Unlike batch normalization, which normalizes activations across a batch of inputs, layer normalization normalizes activations across the features of a single input.

The main idea behind layer normalization is to compute the mean and standard deviation for each input sample and normalize the activations accordingly. After normalization, a learnable scale and shift parameter are applied to the normalized activations. These learnable parameters help the model to adjust the normalization according to the data distribution.

Layer normalization has several benefits:

1. It reduces the internal covariate shift, making the training process more stable and allowing the use of higher learning rates.
2. It allows for faster convergence and, in some cases, better generalization.
3. It is less sensitive to the batch size, making it suitable for tasks where the batch size may vary or be small.

In PyTorch, layer normalization is available as the `nn.LayerNorm` class. Its constructor takes one required argument, `normalized_shape` (the size of the trailing dimension or dimensions to normalize over), plus optional arguments such as `eps`, a small value added for numerical stability, and `elementwise_affine`, which controls whether the learnable scale and shift parameters are created.

Example usage:

```python
import torch.nn as nn
layer_norm = nn.LayerNorm(num_features)
normalized_output = layer_norm(input_tensor)
```
</div>
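
The arithmetic behind layer normalization can be written out for a single sample without PyTorch. This is a minimal sketch, assuming the same formula as the custom `LayerNorm` class in this notebook (scale times `(x - mean) / (std + eps)`, plus shift); the names `layer_norm`, `gain`, and `bias` are my own.

```python
import statistics


def layer_norm(x, gain, bias, eps=1e-6):
    # Statistics are taken over the features of ONE sample, not over a batch.
    mean = statistics.fmean(x)
    std = statistics.stdev(x)  # sample standard deviation (divides by n - 1)
    return [g * (v - mean) / (std + eps) + b for v, g, b in zip(x, gain, bias)]


features = [2.0, 4.0, 6.0, 8.0]
gain = [1.0] * len(features)  # scale initialised to ones
bias = [0.0] * len(features)  # shift initialised to zeros

normalized = layer_norm(features, gain, bias)
print(normalized)
print(statistics.fmean(normalized), statistics.stdev(normalized))
```

With the scale at one and the shift at zero, the output has mean close to 0 and standard deviation close to 1. Note that the built-in `nn.LayerNorm` uses the biased variance and places `eps` inside the square root, so its numbers differ slightly from this formula.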

We employ a residual connection around each of the two sub-layers, followed by layer normalization.

<div style="background-color:#FFFFE0; padding: 20px;">
    
A residual connection, also known as a skip connection or shortcut connection, is an architectural component in deep neural networks that allows the output of a layer to be added to the output of a later layer. This creates a direct path for the gradient to flow during backpropagation, mitigating the vanishing gradient problem and allowing for the training of deeper networks.

In the Transformer architecture, residual connections are employed around each of the two sub-layers within the encoder and decoder layers. The output of a sub-layer is added to the input of that sub-layer, and this combined output is then passed through layer normalization. This mechanism helps the model learn more efficiently by allowing the gradients to flow more easily through the network.

In PyTorch, a residual connection can be implemented simply by adding the input and output of a layer (or sub-layer) together:

```python
output = sub_layer(input) + input
```
    
The combined output can then be passed through layer normalization.

</div>
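
As a small illustration of why the residual path is called a "shortcut", the sketch below adds a sub-layer's output to its input using plain lists. The helper names (`residual`, `zero_sublayer`, `halve`) are hypothetical and exist only for this example.

```python
def residual(x, sublayer):
    # Add the sub-layer's output back onto its own input, element by element.
    out = sublayer(x)
    if len(out) != len(x):
        raise ValueError("sub-layer must preserve the feature dimension")
    return [xi + oi for xi, oi in zip(x, out)]


def zero_sublayer(x):
    # A sub-layer that contributes nothing: it outputs all zeros.
    return [0.0 for _ in x]


def halve(x):
    return [0.5 * v for v in x]


x = [1.0, -2.0, 3.0]
print(residual(x, zero_sublayer))  # [1.0, -2.0, 3.0]: input passes through
print(residual(x, halve))          # [1.5, -3.0, 4.5]
```

The dimension check reflects a practical constraint: the addition only makes sense when the sub-layer's output has the same size as its input.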

In [10]:
class LayerNorm(nn.Module):
    # Define a custom LayerNorm class for layer normalization
    
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        # Initialize learnable scale (a_2) and shift (b_2) parameters
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        # Set a small constant value (epsilon) for numerical stability
        self.eps = eps
        
    def forward(self, x):
        # Compute the mean and standard deviation of the input tensor along the last dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        # Perform layer normalization: normalize input, then apply learnable scale and shift
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


That is, the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$, where $\mathrm{Sublayer}(x)$ is the function implemented by the sub-layer itself. We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d<sub>model</sub> = 512.

<div style="background-color:#FFFFE0; padding: 20px;">

Dropout is a regularization technique used in training neural networks to prevent overfitting. It is applied during training by randomly setting a fraction of the neurons' activations to zero at each update, effectively "dropping out" those neurons from the network. This helps the model to become more robust and generalize better to unseen data.

In the Transformer architecture, dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized. By adding dropout to the sub-layers, the model becomes more resistant to overfitting, allowing it to learn more complex patterns in the data.

In PyTorch, dropout can be applied using the `nn.Dropout` module or the `F.dropout` function from the `torch.nn.functional` module. The dropout probability (i.e., the fraction of neurons to be dropped out) is specified as a hyperparameter when creating the dropout layer or calling the function.

For example, to apply dropout with a probability of 0.1, you can use:

```python
dropout_layer = nn.Dropout(0.1)
# or
import torch.nn.functional as F
output = F.dropout(input, p=0.1, training=True)
```
</div>
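
What "dropping out" means numerically can be sketched in a few lines of plain Python. This is an illustrative version of inverted dropout, where surviving activations are scaled by `1 / (1 - p)` during training (the same scaling PyTorch's dropout applies); the function signature and the explicit `rng` argument are my own choices for the example.

```python
import random


def dropout(x, p, training, rng):
    # In evaluation mode dropout is the identity.
    if not training or p == 0.0:
        return list(x)
    keep_scale = 1.0 / (1.0 - p)
    # Zero each activation with probability p; rescale the survivors so the
    # expected value of every activation stays the same.
    return [0.0 if rng.random() < p else v * keep_scale for v in x]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10

print(dropout(activations, p=0.5, training=True, rng=rng))
print(dropout(activations, p=0.5, training=False, rng=rng))
```

In training mode each value is either 0 or the rescaled activation; in evaluation mode the input comes back unchanged.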

In [11]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm. Note for code simplicity the norm is first as opposed to last.
    """
    
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, sublayer):
        # Apply residual connection to any sublayer with the same size.
        return x + self.dropout(sublayer(self.norm(x)))
        

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

<div style="background-color:#FFFFE0; padding: 20px;">

**Multi-Head Self-Attention Mechanism**: The multi-head self-attention mechanism allows the model to jointly learn different types of relationships between words in a sequence. It works by first computing a set of attention scores for each word in the sequence with respect to all other words. These scores are then used to weight the input representations, producing a context-aware output representation for each word. The multi-head aspect comes from performing this process multiple times (i.e., using multiple "heads") with different learned linear projections, allowing the model to capture different aspects of the relationships between words. Finally, the outputs from all heads are concatenated and projected to produce the final output of the multi-head self-attention layer.

**Position-wise Fully Connected Feed-Forward Network**: This is a simple feed-forward network that consists of two linear layers with a ReLU activation function in between. Unlike the multi-head self-attention mechanism, the feed-forward network operates independently on each position (i.e., each word) in the sequence. Its purpose is to provide an additional layer of non-linearity and complexity to the model, allowing it to learn more sophisticated relationships between words in the input sequence.

These two sub-layers, combined with the residual connections and layer normalization, form the core building blocks of the Transformer architecture. By stacking multiple such layers, the model can learn increasingly complex patterns and dependencies in the input data, ultimately leading to better performance on a wide range of natural language processing tasks.

</div>
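
The phrase "position-wise" can be made concrete with a toy example: the same two linear maps and ReLU are applied to each position separately, so no information moves between positions. The sizes and weights below are hypothetical (a model dimension of 2 and an inner dimension of 3), picked only to keep the arithmetic easy to follow.

```python
def linear(x, weights, bias):
    # `weights` has one row per output feature.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    # Two linear maps with a ReLU in between, applied to ONE position.
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Hypothetical toy weights: 2 -> 3 -> 2.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B2 = [0.0, 0.0]

sequence = [[1.0, 2.0], [3.0, -1.0]]  # two positions, two features each
outputs = [feed_forward(pos, W1, B1, W2, B2) for pos in sequence]
print(outputs)  # [[3.0, 0.0], [3.0, 4.0]]
```

Because each position is processed independently, reordering the input positions simply reorders the outputs in the same way. Mixing information across positions is left to the attention sub-layer.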


In [13]:
class EncoderLayer(nn.Module):
    # Define the EncoderLayer class, which is made up of self-attention and feed-forward layers
    
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        # Initialize self-attention, feed-forward, and sublayer connection modules
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size
        
    def forward(self, x, mask):
        # Implement the forward pass, following the connections in Figure 1 (left) of the paper
        # Apply the self-attention sublayer and pass the output through a SublayerConnection
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # Apply the feed-forward sublayer and pass the output through another SublayerConnection
        return self.sublayer[1](x, self.feed_forward)


#### Decoder

The decoder is also composed of a stack of N = 6 identical layers.

In [14]:
class Decoder(nn.Module):
    # Define the Decoder class with N layers and masking capabilities
    
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        # Initialize the layers by cloning the provided layer N times
        self.layers = clones(layer, N)
        # Add layer normalization at the end of the processing
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, memory, src_mask, tgt_mask):
        # Implement the forward pass for the decoder
        # Iterate through each layer in the stack
        for layer in self.layers:
            # Pass the input, memory, source mask, and target mask to the current layer
            x = layer(x, memory, src_mask, tgt_mask)
        # Apply layer normalization to the final output
        return self.norm(x)


In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

In [16]:
class DecoderLayer(nn.Module):
    # Define the DecoderLayer class, which is made up of self-attention, source-attention, and feed-forward layers
    
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        # Initialize size, self-attention, source-attention, feed-forward, and sublayer connection modules
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
        
    def forward(self, x, memory, src_mask, tgt_mask):
        # Implement the forward pass for the decoder layer
        m = memory
        # Apply the self-attention sublayer and pass the output through a SublayerConnection
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # Apply the source-attention sublayer and pass the output through another SublayerConnection
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        # Apply the feed-forward sublayer and pass the output through the final SublayerConnection
        return self.sublayer[2](x, self.feed_forward)


We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

<div style="background-color: lightyellow; padding: 10px;">
For example, let's consider a simple sentence: "I like ice cream." During the decoding process, when predicting the word "ice," we want the decoder to only attend to the words "I" and "like" (positions before "ice") and not the word "cream" (a position after "ice"). By masking subsequent positions in the self-attention sub-layer, we ensure that the model only considers the words before the current position when making a prediction, thus preventing it from using future information. This is crucial for tasks like translation, where the model needs to generate the target sentence in a left-to-right manner.
</div>
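
The masking rule in this example can be written as a small lookup table. The sketch below is plain Python and assumes the same convention as the `subsequent_mask` function in this notebook, where a position may attend to itself and to everything before it; `causal_mask` is a name I made up for the illustration.

```python
def causal_mask(size):
    # mask[i][j] is True when position i may attend to position j (j <= i).
    return [[j <= i for j in range(size)] for i in range(size)]


tokens = ["I", "like", "ice", "cream"]
mask = causal_mask(len(tokens))

for i, row in enumerate(mask):
    visible = [tok for tok, allowed in zip(tokens, row) if allowed]
    print(f"{tokens[i]:>5} attends to {visible}")
```

Rows index decoder input positions. Because the decoder inputs are offset by one position, the row for "like" (which can see "I" and "like") is the one used to predict "ice", and "cream" stays hidden from it.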


In [17]:
def subsequent_mask(size):
    # Mask out subsequent positions
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)
    return subsequent_mask==0

The attention mask below shows the positions (columns) that each target word (row) is allowed to look at. Words are blocked from attending to future words during training.

In [21]:
def example_mask():
    # Combine all masking information into a Pandas DataFrame
    LS_data = pd.concat([
        pd.DataFrame({
            "Subsequent Mask": subsequent_mask(20)[0][x, y].flatten(),
            "Window": y,
            "Masking": x,
        }) for y in range(20) for x in range(20)
    ])
    
    # Create an Altair chart to visualize the subsequent mask
    return (
        alt.Chart(LS_data)
        .mark_rect()  # Use rectangular marks to represent masking values
        .properties(height=250, width=250)  # Set chart dimensions
        .encode(
            alt.X("Window:O"),  # Map the 'Window' column to the X-axis
            alt.Y("Masking:O"),  # Map the 'Masking' column to the Y-axis
            alt.Color("Subsequent Mask:Q", scale=alt.Scale(scheme="viridis")),  # Map the 'Subsequent Mask' column to the color of the marks
        )
        .interactive()  # Make the chart interactive (e.g., support zooming and panning)
    )

# Call the `show_example` function to display the chart
show_example(example_mask)


### Attention