# The Annotated Transformer
* * *

My implementation, based on http://nlp.seas.harvard.edu/annotated-transformer/ . More precisely, the code is almost the same, but with my own added annotations (written with the help of my friend ChatGPT) to understand the Transformer architecture in my own way.

### Prelims

- First install the dependencies from the requirements.txt file in the repo.
- Then, install the spacy dependencies using the following cells.

In [1]:
!python -m spacy download de_core_news_sm > /dev/null 2>&1
!python -m spacy download en_core_web_sm > /dev/null 2>&1

### Imports

<div style="background-color:#FFFFE0; padding: 20px;">

The following imports are used in the Python code for various purposes:

1. `os`: The `os` module provides a way of interacting with the operating system, such as navigating the file system, creating directories, and managing environment variables.
2. `os.path.exists`: A function to check if a file or directory exists in the file system.
3. `torch`: The main PyTorch library, used for creating and managing tensors, defining neural network layers, and performing various operations on tensors.
4. `torch.nn`: A sub-module of PyTorch containing predefined neural network layers and other utilities.
5. `torch.nn.functional`: A sub-module of PyTorch containing various activation functions and utility functions, such as padding and normalization.
6. `math`: The Python standard library's math module, containing mathematical functions and constants.
7. `copy`: The Python standard library's copy module, used for creating shallow and deep copies of Python objects.
8. `time`: The Python standard library's time module, used for measuring the execution time of code segments.
9. `torch.optim.lr_scheduler`: A sub-module of PyTorch containing various learning rate schedulers for adjusting the learning rate during training.
10. `pandas`: A library for data manipulation and analysis, particularly useful for working with tabular data.
11. `altair`: A library for declarative data visualization in Python.
12. `torchtext.data.functional`: A sub-module of the TorchText library containing utility functions for working with text data.
13. `torch.utils.data`: A sub-module of PyTorch containing utilities for working with datasets and data loaders.
14. `torchtext.vocab`: A sub-module of the TorchText library containing utilities for building and managing vocabularies.
15. `torchtext.datasets`: A sub-module of the TorchText library containing various pre-built datasets for natural language processing tasks.
16. `spacy`: A library for natural language processing, used for tokenization, part-of-speech tagging, dependency parsing, and more.
17. `GPUtil`: A library for monitoring and managing the GPU utilization and memory usage of NVIDIA GPUs.
18. `warnings`: The Python standard library's warnings module, used for managing warning messages during code execution.
19. `torch.utils.data.distributed`: A sub-module of PyTorch containing utilities for working with distributed data samplers in multi-GPU or multi-node settings.
20. `torch.distributed`: A sub-module of PyTorch containing utilities for distributed training and communication between processes.
21. `torch.multiprocessing`: A sub-module of PyTorch providing a PyTorch-specific wrapper around the Python multiprocessing module, used for parallelizing code execution across multiple CPU cores.
22. `torch.nn.parallel`: A sub-module of PyTorch containing utilities for parallelizing the training of neural networks across multiple devices.

The `warnings.filterwarnings("ignore")` line is used to suppress warning messages during the execution of the notebook. The `RUN_EXAMPLES` variable is set to `True` to enable the execution of examples in the notebook; set it to `False` to skip execution (e.g., for debugging).

</div>

In [2]:
import os
from os.path import exists
import torch
import torch.nn as nn
from torch.nn.functional import log_softmax, pad
import math
import copy
import time
from torch.optim.lr_scheduler import LambdaLR
import pandas as pd
import altair as alt
from torchtext.data.functional import to_map_style_dataset
from torch.utils.data import DataLoader
from torchtext.vocab import build_vocab_from_iterator
import torchtext.datasets as datasets
import spacy
import GPUtil
import warnings
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

warnings.filterwarnings("ignore")

# Set to False to skip notebook execution (e.g. for debugging)
RUN_EXAMPLES = True

#print(torch.backends.mps.is_available())
#print(torch.backends.mps.is_built())

### Convenience functions

In [3]:
def is_interactive_notebook():
    return __name__ == "__main__"

def show_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        return fn(*args)

def execute_example(fn, args=[]):
    if __name__ == "__main__" and RUN_EXAMPLES:
        fn(*args)


class DummyOptimizer(torch.optim.Optimizer):
    def __init__(self):
        self.param_groups = [{"lr": 0}]

    def step(self):
        pass

    def zero_grad(self, set_to_none=False):
        pass


class DummyScheduler:
    def step(self):
        pass

### Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks<sup>1</sup> as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer<sup>2</sup> this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention<sup>3</sup>.
<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Convolutional Neural Networks (CNNs)** are a class of deep learning models that are especially effective in processing grid-like data, such as images. They consist of convolutional layers, which apply filters to local input regions, allowing the model to learn spatial hierarchies and capture local patterns.

<sup>2</sup> **Self-Attention** is a mechanism in the Transformer model that allows it to weigh the importance of different input elements when processing a specific element. It computes a score for each pair of elements and uses these scores to create a weighted sum of the input elements, which is then used to compute the output.

<sup>3</sup> **Multi-Head Attention** is an extension of the self-attention mechanism in the Transformer model. It uses multiple parallel self-attention layers, or "heads," to focus on different parts of the input simultaneously. This allows the model to capture various aspects of the input data, improving its ability to learn and understand complex dependencies.
</div>


Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations. End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.  
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

## Part 1: Model Architecture
### Model Architecture

Most competitive neural sequence transduction models have an encoder-decoder structure<sup>1</sup>. Here, the encoder maps an input sequence of symbol representations $(x_1,...,x_n)$ to a sequence of continuous representations $z = (z_1,...,z_n)$. Given $z$, the decoder then generates an output sequence $(y_1,...,y_m)$ of symbols one element at a time. At each step, the model is auto-regressive<sup>2</sup>, consuming the previously generated symbols as additional input when generating the next.

<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Encoder-Decoder Architecture**: This structure is commonly used in sequence-to-sequence (seq2seq) models, which are designed to map input sequences to output sequences. The encoder processes the input sequence and generates a continuous representation, often called a "context vector" or "hidden state." The decoder then uses this representation to generate the output sequence, step by step. This great <a href="https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/">reference </a> provides a cool visual explanation.

**Example**: Imagine a neural machine translation model that translates English sentences to French. The input sequence consists of English words (e.g., "How are you?"), and the output sequence consists of French words (e.g., "Comment ça va ?"). In this case, the encoder processes the English words and creates a continuous representation that captures their meaning. The decoder then generates the French translation based on this representation.

<sup>2</sup> **Auto-regressive Models**: These models generate output sequences one element at a time, using previously generated elements as additional input for generating the next element. In the context of the encoder-decoder architecture, this means that the decoder generates each output symbol based on the continuous representation from the encoder as well as the previously generated output symbols.
</div>
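
The auto-regressive loop described above can be sketched without any neural network. In the toy example below, `next_token` is a hypothetical stand-in for a trained decoder (it simply reads the next word from a fixed list), `memory` stands in for the encoder output $z$, and `<s>` and `</s>` are assumed start and end symbols. None of these names come from the original notebook; this is only meant to show the shape of the loop.

```python
# Toy auto-regressive generation: each step sees everything generated so far.
# `next_token` is a hypothetical stand-in for a trained decoder.

def next_token(memory, generated):
    """Pretend decoder: chooses the next symbol from the 'encoded' source
    and the symbols generated so far."""
    translation = memory["translation"]
    step = len(generated) - 1  # tokens produced so far, not counting "<s>"
    return translation[step] if step < len(translation) else "</s>"


def greedy_generate(memory, max_len=10):
    generated = ["<s>"]  # assumed start-of-sequence symbol
    for _ in range(max_len):
        token = next_token(memory, generated)
        if token == "</s>":  # assumed end-of-sequence symbol
            break
        generated.append(token)  # the new token becomes input for the next step
    return generated[1:]  # drop the start symbol


# Stand-in for the encoder output: here just a dictionary.
memory = {"translation": ["Comment", "ça", "va", "?"]}
result = greedy_generate(memory)
print(result)
```

The important part is that `generated` grows by one symbol per iteration and is passed back in every time, which is exactly what "consuming the previously generated symbols as additional input" means.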

In [4]:
class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture.
    Base for this and many other models.
    """
    
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder  # The encoder layer (typically an RNN, CNN, or Transformer)
        self.decoder = decoder  # The decoder layer (typically an RNN, CNN, or Transformer)
        self.src_embed = src_embed  # The embedding layer for the source language tokens
        self.tgt_embed = tgt_embed  # The embedding layer for the target language tokens
        self.generator = generator  # The generator layer, which produces the final output (e.g., a linear layer)
        
    def forward(self, src, tgt, src_mask, tgt_mask):
        # Take in and process masked src and target sequences.
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)
    
    def encode(self, src, src_mask):
        # Encode the input sequence (src) using the source embedding layer and the encoder.
        return self.encoder(self.src_embed(src), src_mask)
    
    def decode(self, memory, src_mask, tgt, tgt_mask):
        # Decode the encoded memory using the target embedding layer, the decoder, and the masks.
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


In [5]:
class Generator(nn.Module):
    """
    Define standard linear + softmax generation step.
    The Generator class is responsible for producing the final output
    after the Encoder-Decoder architecture processes the input.
    """
    
    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)  # Linear layer that maps from the model's hidden dimension to the vocabulary size
        
    def forward(self, x):
        # Apply the linear layer and log softmax to produce the output log-probabilities
        return log_softmax(self.proj(x), dim=-1)


<div style="background-color:#FFFFE0; padding: 20px;">
**log_softmax**: The `log_softmax` function computes the natural logarithm of the softmax function. Softmax converts a vector of scores into a probability distribution, where each element of the output represents the probability of a class in a multi-class classification problem; `log_softmax` returns the logarithms of those probabilities. Computing both steps together, instead of calling `log` on the output of `softmax`, is more numerically stable, especially with very small probabilities or large score values: a probability that underflows to zero would otherwise become negative infinity once the logarithm is applied. This stability is particularly useful during optimization, where such underflow and overflow errors would corrupt the loss and its gradients.

In PyTorch, the `log_softmax` function can be found in the `torch.nn.functional` module and is used as follows:

```python
import torch.nn.functional as F
output = F.log_softmax(input_tensor, dim=-1)
```
Here, `input_tensor` is the tensor to which `log_softmax` is applied, and `dim` specifies the dimension along which the softmax is computed.
</div>
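
To see the stability argument in action, here is a small sketch comparing `log_softmax` with taking `log` after `softmax`. The scores are made up and chosen so that one probability underflows to zero in 32-bit floats.

```python
import torch
import torch.nn.functional as F

# Scores with a huge gap, so that exp(-1000) underflows to 0 in float32.
scores = torch.tensor([1000.0, 0.0])

naive = torch.log(F.softmax(scores, dim=-1))  # log applied after softmax
stable = F.log_softmax(scores, dim=-1)        # both steps computed together

print(naive)   # the second entry is -inf: softmax rounded it to exactly 0
print(stable)  # the second entry stays finite
```

An infinite log-probability would poison any loss computed from it, which is why the `Generator` above returns `log_softmax` directly.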

The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.

<img src="fig1.png" />

### Encoder and Decoder Stacks
#### Encoder

The encoder is composed of a stack of N=6 identical layers.

In [6]:
def clones(module, N):
    # Produce N identical layers.
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

In [7]:
class Encoder(nn.Module):
    """
    Core encoder is a stack of N layers.
    The Encoder class is a part of the Transformer architecture and
    inherits from PyTorch's nn.Module class.
    """
    
    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)  # Create N clones of the given layer
        self.norm = LayerNorm(layer.size)  # Initialize layer normalization for the final output
        
    def forward(self, x, mask):
        # Pass the input (x) and mask through each layer in turn
        for layer in self.layers:
            x = layer(x, mask)  # Process input and mask through the current layer
        return self.norm(x)  # Apply layer normalization to the output after processing all layers

<div style="background-color:#FFFFE0; padding: 20px;">
    
Layer normalization is a technique used to improve the training of deep neural networks by normalizing the activations of neurons within each layer. Unlike batch normalization, which normalizes activations across a batch of inputs, layer normalization normalizes activations across the features of a single input.

The main idea behind layer normalization is to compute the mean and standard deviation for each input sample and normalize the activations accordingly. After normalization, a learnable scale and shift parameter are applied to the normalized activations. These learnable parameters help the model to adjust the normalization according to the data distribution.

Layer normalization has several benefits:

1. It keeps the scale of the activations consistent from layer to layer (often described as reducing internal covariate shift), which tends to make training more stable and can allow higher learning rates.
2. It allows for faster convergence and, in some cases, better generalization.
3. It is less sensitive to the batch size, making it suitable for tasks where the batch size may vary or be small.

In PyTorch, layer normalization can be implemented using the `nn.LayerNorm` class. The constructor takes one required argument, `normalized_shape` (the size of the trailing dimension or dimensions to normalize over), and optional arguments such as `eps`, a small value added for numerical stability, and `elementwise_affine`, which controls whether the learnable scale and shift parameters are created.

Example usage:

```python
import torch.nn as nn
layer_norm = nn.LayerNorm(num_features)
normalized_output = layer_norm(input_tensor)
```
</div>
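
As a sanity check on the description above, the following sketch reproduces `nn.LayerNorm` by hand on a random tensor. The tensor sizes are toy values picked for illustration. It relies on the freshly created layer having scale 1 and shift 0, so only the normalization itself is visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 8)      # (batch, positions, features); toy sizes

layer_norm = nn.LayerNorm(8)  # normalizes over the last dimension
out = layer_norm(x)

# The same computation by hand: mean and (biased) variance per position.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(out, manual, atol=1e-5))
print(out.mean(dim=-1).abs().max())  # close to 0 for every position
```

Note that the statistics are computed separately for every position of every sample, never across the batch, which is the difference from batch normalization mentioned above.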

We employ a residual connection around each of the two sub-layers, followed by layer normalization.

<div style="background-color:#FFFFE0; padding: 20px;">
    
A residual connection, also known as a skip connection or shortcut connection, is an architectural component in deep neural networks that allows the output of a layer to be added to the output of a later layer. This creates a direct path for the gradient to flow during backpropagation, mitigating the vanishing gradient problem and allowing for the training of deeper networks.

In the Transformer architecture, residual connections are employed around each of the two sub-layers within the encoder and decoder layers. The output of a sub-layer is added to the input of that sub-layer, and this combined output is then passed through layer normalization. This mechanism helps the model learn more efficiently by allowing the gradients to flow more easily through the network.

In PyTorch, a residual connection can be implemented simply by adding the input and output of a layer (or sub-layer) together:

```python
output = sub_layer(input) + input
```
    
The combined output can then be passed through layer normalization.

</div>
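
Putting the two pieces together, the "add, then normalize" order described in the text can be sketched as follows. Here `sub_layer` is just an `nn.Linear` standing in for attention or the feed-forward network, and `d_model = 16` is a toy size; both are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16                             # toy size, for illustration only
sub_layer = nn.Linear(d_model, d_model)  # stand-in for attention or feed-forward
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model)           # (batch, positions, features)
out = norm(x + sub_layer(x))             # residual add, then layer normalization

print(out.shape)
```

The addition only works because the sub-layer returns a tensor of the same shape as its input, which is why every sub-layer has to keep the same output dimension.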

In [8]:
class LayerNorm(nn.Module):
    # Define a custom LayerNorm class for layer normalization
    
    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        # Initialize learnable scale (a_2) and shift (b_2) parameters
        self.a_2 = nn.Parameter(torch.ones(features))
        self.b_2 = nn.Parameter(torch.zeros(features))
        # Set a small constant value (epsilon) for numerical stability
        self.eps = eps
        
    def forward(self, x):
        # Compute the mean and standard deviation of the input tensor along the last dimension
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        # Perform layer normalization: normalize input, then apply learnable scale and shift
        return self.a_2 * (x - mean) / (std + self.eps) + self.b_2


That is, the output of each sub-layer is LayerNorm(x+Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d<sub>model</sub> = 512.

<div style="background-color:#FFFFE0; padding: 20px;">

Dropout is a regularization technique used in training neural networks to prevent overfitting. It is applied during training by randomly setting a fraction of the neurons' activations to zero at each update, effectively "dropping out" those neurons from the network. This helps the model to become more robust and generalize better to unseen data.

In the Transformer architecture, dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized. By adding dropout to the sub-layers, the model becomes more resistant to overfitting, allowing it to learn more complex patterns in the data.

In PyTorch, dropout can be applied using the `nn.Dropout` module or the `F.dropout` function from the `torch.nn.functional` module. The dropout probability (i.e., the fraction of neurons to be dropped out) is specified as a hyperparameter when creating the dropout layer or calling the function.

For example, to apply dropout with a probability of 0.1, you can use:

```python
dropout_layer = nn.Dropout(0.1)
# or
import torch.nn.functional as F
output = F.dropout(input, p=0.1, training=True)
```
</div>
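
A quick experiment makes the behaviour of dropout concrete. The sketch below uses `p=0.5` and a tensor of ones (both arbitrary choices) so the effect is easy to read: in training mode PyTorch zeroes some entries and rescales the survivors by `1 / (1 - p)`, and in evaluation mode the layer does nothing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)  # each entry is either 0 or 1 / (1 - p) = 2

dropout.eval()
eval_out = dropout(x)   # dropout is a no-op in evaluation mode

print(train_out.unique())
print(torch.equal(eval_out, x))
```

The rescaling keeps the expected value of each activation unchanged, so nothing has to be adjusted when switching the model to evaluation mode.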

In [9]:
class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm. Note for code simplicity the norm is first as opposed to last.
    """
    
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, sublayer):
        # Apply residual connection to any sublayer with the same size.
        return x + self.dropout(sublayer(self.norm(x)))
        

Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

<div style="background-color:#FFFFE0; padding: 20px;">

**Multi-Head Self-Attention Mechanism**: The multi-head self-attention mechanism allows the model to jointly learn different types of relationships between words in a sequence. It works by first computing a set of attention scores for each word in the sequence with respect to all other words. These scores are then used to weight the input representations, producing a context-aware output representation for each word. The multi-head aspect comes from performing this process multiple times (i.e., using multiple "heads") with different learned linear projections, allowing the model to capture different aspects of the relationships between words. Finally, the outputs from all heads are concatenated and projected to produce the final output of the multi-head self-attention layer.

**Position-wise Fully Connected Feed-Forward Network**: This is a simple feed-forward network that consists of two linear layers with a ReLU activation function in between. Unlike the multi-head self-attention mechanism, the feed-forward network operates independently on each position (i.e., each word) in the sequence. Its purpose is to provide an additional layer of non-linearity and complexity to the model, allowing it to learn more sophisticated relationships between words in the input sequence.

These two sub-layers, combined with the residual connections and layer normalization, form the core building blocks of the Transformer architecture. By stacking multiple such layers, the model can learn increasingly complex patterns and dependencies in the input data, ultimately leading to better performance on a wide range of natural language processing tasks.

</div>
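
To check what "position-wise" means in practice, the sketch below builds a small feed-forward network (the sizes `d_model = 8` and `d_ff = 32` are toy values I picked) and compares running it on a whole sequence with running it on a single position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32           # toy sizes chosen for illustration
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(1, 4, d_model)  # (batch, positions, features)
full = feed_forward(x)          # applied to the whole sequence at once
single = feed_forward(x[0, 2])  # applied to position 2 on its own

print(torch.allclose(full[0, 2], single, atol=1e-5))
```

The two results agree, so the feed-forward sub-layer never mixes information between positions; only the attention sub-layer does that.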


In [10]:
class EncoderLayer(nn.Module):
    # Define the EncoderLayer class, which is made up of self-attention and feed-forward layers
    
    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        # Initialize self-attention, feed-forward, and sublayer connection modules
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size
        
    def forward(self, x, mask):
        # Implement the forward pass, following the connections in Figure 1 (left) of the paper
        # Apply the self-attention sublayer and pass the output through a SublayerConnection
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # Apply the feed-forward sublayer and pass the output through another SublayerConnection
        return self.sublayer[1](x, self.feed_forward)


#### Decoder

The decoder is also composed of a stack of N = 6 identical layers.

In [11]:
class Decoder(nn.Module):
    # Define the Decoder class with N layers and masking capabilities
    
    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        # Initialize the layers by cloning the provided layer N times
        self.layers = clones(layer, N)
        # Add layer normalization at the end of the processing
        self.norm = LayerNorm(layer.size)
        
    def forward(self, x, memory, src_mask, tgt_mask):
        # Implement the forward pass for the decoder
        # Iterate through each layer in the stack
        for layer in self.layers:
            # Pass the input, memory, source mask, and target mask to the current layer
            x = layer(x, memory, src_mask, tgt_mask)
        # Apply layer normalization to the final output
        return self.norm(x)


In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization.

In [12]:
class DecoderLayer(nn.Module):
    # Define the DecoderLayer class, which is made up of self-attention, source-attention, and feed-forward layers
    
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        # Initialize size, self-attention, source-attention, feed-forward, and sublayer connection modules
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)
        
    def forward(self, x, memory, src_mask, tgt_mask):
        # Implement the forward pass for the decoder layer
        m = memory
        # Apply the self-attention sublayer and pass the output through a SublayerConnection
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # Apply the source-attention sublayer and pass the output through another SublayerConnection
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        # Apply the feed-forward sublayer and pass the output through the final SublayerConnection
        return self.sublayer[2](x, self.feed_forward)


We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

<div style="background-color: lightyellow; padding: 10px;">
For example, let's consider a simple sentence: "I like ice cream." During the decoding process, when predicting the word "ice," we want the decoder to only attend to the words "I" and "like" (positions before "ice") and not the word "cream" (a position after "ice"). By masking subsequent positions in the self-attention sub-layer, we ensure that the model only considers the words before the current position when making a prediction, thus preventing it from using future information. This is crucial for tasks like translation, where the model needs to generate the target sentence in a left-to-right manner.
</div>
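
The "I like ice cream" example can be written out as a small boolean mask. In this sketch I assume the decoder input is the target sentence shifted right by one position with a start symbol `<s>` in front (the symbol name is my assumption), so row `i` of the mask lists what the model may look at while predicting target word `i`.

```python
import torch

tokens = ["<s>", "I", "like", "ice"]     # decoder inputs, shifted right by one
targets = ["I", "like", "ice", "cream"]  # what each row has to predict

size = len(tokens)
# True where attention is allowed: row i may look at columns 0..i only.
allowed = torch.tril(torch.ones(size, size)).bool()

for row, target in enumerate(targets):
    visible = [tok for tok, ok in zip(tokens, allowed[row].tolist()) if ok]
    print(f"predicting {target!r}: can attend to {visible}")
```

When predicting "ice" the visible inputs are `<s>`, "I" and "like", and "ice" itself is hidden, which is the combination of masking and the one-position offset described above.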


In [13]:
def subsequent_mask(size):
    # Mask out subsequent positions
    attn_shape = (1, size, size)
    subsequent_mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)
    return subsequent_mask==0

The attention mask below shows the positions (columns) that each target word (row) is allowed to look at. During training, words are blocked from attending to future words.

In [14]:
def example_mask():
    # Combine all masking information into a Pandas DataFrame
    LS_data = pd.concat([
        pd.DataFrame({
            "Subsequent Mask": subsequent_mask(20)[0][x, y].flatten(),
            "Window": y,
            "Masking": x,
        }) for y in range(20) for x in range(20)
    ])
    
    # Create an Altair chart to visualize the subsequent mask
    return (
        alt.Chart(LS_data)
        .mark_rect()  # Use rectangular marks to represent masking values
        .properties(height=250, width=250)  # Set chart dimensions
        .encode(
            alt.X("Window:O"),  # Map the 'Window' column to the X-axis
            alt.Y("Masking:O"),  # Map the 'Masking' column to the Y-axis
            alt.Color("Subsequent Mask:Q", scale=alt.Scale(scheme="viridis")),  # Map the 'Subsequent Mask' column to the color of the marks
        )
        .interactive()  # Make the chart interactive (e.g., support zooming and panning)
    )

# Call the `show_example` function to display the chart
show_example(example_mask)


### Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key<sup>1</sup>.

We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values<sup>2</sup>.

<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Query, Key, and Value (Q, K, V)**: Imagine you're searching for a specific book in a library. In this context, your "query" is the information you're searching for (e.g., the book title). The "keys" are the information associated with each book in the library (e.g., the titles on all the books' spines). The "values" are the books themselves or the full contents of the books. An attention function works similarly: it takes a query and a set of key-value pairs and returns an output that is a weighted sum of the values (similar to "books"), where each value's weight is determined by the compatibility between the query and the corresponding key (like how well the title on the spine matches your query).

<sup>2</sup> **Scaled Dot-Product Attention** involves three steps: First, it computes the dot product between the query and each key to measure their compatibility. Next, it scales (divides) these measurements by the square root of the key's dimension ($d_k$) to prevent large dot products in large-dimensional spaces (since large dot products can cause the softmax function to have extremely small gradients, hindering learning). Finally, it applies a softmax function to convert these scores into weights, which are then used to compute a weighted sum of the values.
</div>
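
Here is the three-step recipe from the note above worked through on a tiny example with one query and three key-value pairs. All numbers are made up for illustration, with $d_k = d_v = 2$.

```python
import math
import torch

# One query and three key-value pairs, with d_k = d_v = 2 (toy values).
query = torch.tensor([[1.0, 0.0]])
keys = torch.tensor([[1.0, 0.0],    # points the same way as the query
                     [0.0, 1.0],    # orthogonal to the query
                     [-1.0, 0.0]])  # points the opposite way
values = torch.tensor([[10.0, 0.0],
                       [0.0, 10.0],
                       [-10.0, 0.0]])

d_k = query.size(-1)
scores = query @ keys.T / math.sqrt(d_k)  # steps 1 and 2: dot products, then scaling
weights = scores.softmax(dim=-1)          # step 3: scores become weights that sum to 1
output = weights @ values                 # weighted sum of the values

print(weights)
print(output)
```

The key that points the same way as the query gets the largest weight, so the output is pulled towards its value, like the best-matching book in the library analogy.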

<img src="fig2.png" />

In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. We compute the matrix of outputs as:

$$ Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$<sup>1</sup>

<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Softmax Function**: The softmax function, often used in machine learning, takes a vector of real-valued scores and converts it into a vector of non-negative weights that always sum to 1, so the result can be read as a probability distribution. Larger scores receive larger weights. In the context of the attention mechanism, the softmax function is used to convert the compatibility scores between the query and keys into weights that are used for the weighted sum of the values.
</div>



In [15]:
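def attention(query, key, value, mask=None, dropout=None):
    """Computes the scaled dot product attention."""
    d_k = query.size(-1)
    # Compatibility scores: dot product of each query with each key, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    # Normalize over the key dimension (the last one) so each query's weights sum to 1
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn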

The two most commonly used attention functions are additive attention<sup>1</sup>, and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of $\frac{1}{\sqrt{d_k}}$. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.

While for small values of $d_k$ the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of $d_k$<sup>2</sup>. We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients (To illustrate why the dot products get large, assume that the components of $q$ and $k$ are independent random variables with mean 0 and variance 1. Then their dot product, $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$, has mean 0 and variance $d_k$.). To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$.

<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Additive Attention**: In contrast to dot-product attention, additive attention computes the compatibility between the query and key using a feed-forward network with a single hidden layer. This method computes the attention score by applying a learned linear projection in a high-dimensional space, followed by a tanh activation function. The result is then projected onto a single dimension to compute the scores.

<sup>2</sup> **Large values of $d_k$**: For larger dimensions, without scaling the dot product can grow large, causing the softmax function to squash its input into regions with extremely small gradients. This means the model becomes difficult to optimize, as the steps taken in parameter space become very small. By scaling the dot products by $\frac{1}{\sqrt{d_k}}$, we ensure that the softmax operates in a region where it has a manageable gradient, allowing for more efficient learning.
</div>
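
The variance argument above can be checked numerically. The following is a minimal sketch in plain Python (no PyTorch), assuming the components of $q$ and $k$ are independent standard normal samples; the function names `dot_product_variances` and `sample_variance` are chosen here for illustration and are not part of the notebook.

```python
import math
import random


def dot(q, k):
    """Plain dot product of two equal-length vectors."""
    return sum(qi * ki for qi, ki in zip(q, k))


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def dot_product_variances(d_k, n_samples=4000, seed=0):
    """Return (variance of q.k, variance of q.k / sqrt(d_k)) for random q, k."""
    rng = random.Random(seed)
    raw = []
    for _ in range(n_samples):
        # Assumption: components are i.i.d. with mean 0 and variance 1.
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(dot(q, k))
    scaled = [r / math.sqrt(d_k) for r in raw]
    return sample_variance(raw), sample_variance(scaled)


raw_var, scaled_var = dot_product_variances(d_k=64)
print(f"unscaled variance ~ {raw_var:.1f} (theory: 64)")
print(f"scaled variance   ~ {scaled_var:.2f} (theory: 1)")
```

The unscaled variance should come out close to $d_k$, and dividing by $\sqrt{d_k}$ should bring it back to roughly 1, which is the motivation for the scaling factor.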

<img src="fig3.png" width="20%" height="20%"/>

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this<sup>1</sup>.

$$ MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O $$

where

$$ head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) $$

Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$<sup>2</sup>.

In this work we employ $h=8$ parallel attention layers, or heads. For each of these we use $d_k=d_v=d_{\text{model}}/h=64$. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality<sup>3</sup>.

<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Multi-head attention**: With a single attention head, each position receives one weighted average of the values, so information from all attended positions is blended into a single vector. This can prevent the model from capturing several types of dependencies at once. With multi-head attention, the model uses multiple sets of attention weights (or "heads"), each of which can focus on a different type of information. This allows the model to capture a richer variety of dependencies.

<sup>2</sup> **Parameter Matrices**: These are the learned weights associated with the queries, keys, values, and output in the multi-head attention mechanism. They are responsible for transforming the input and output to the appropriate dimensions. Specifically, $W_i^Q$, $W_i^K$, and $W_i^V$ are the parameter matrices for the query, key, and value, respectively, for each attention head $i$. $W^O$ is the parameter matrix for the output of the multi-head attention.

<sup>3</sup> **Reduced Dimension and Computational Cost**: By splitting the original dimension $d_{\text{model}}$ into multiple heads, each head has a reduced dimension ($d_k = d_v = d_{\text{model}} / h$). This allows each head to focus on a different subspace of the input, while keeping the computational cost similar to a single-head attention mechanism with full dimensionality.
</div>
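
The claim about similar cost can be made concrete by counting projection weights. This is a rough sketch that counts only weight matrices (biases ignored) and uses parameter count as a proxy for cost; `projection_params` is a helper name introduced here for illustration.

```python
def projection_params(d_model, h):
    """Weights in the Q, K, V and output projections (biases ignored)."""
    d_k = d_v = d_model // h
    # W_i^Q, W_i^K and W_i^V for a single head.
    per_head = d_model * d_k + d_model * d_k + d_model * d_v
    # W^O maps the concatenated heads (h * d_v) back to d_model.
    output = (h * d_v) * d_model
    return h * per_head + output


multi_head = projection_params(d_model=512, h=8)
single_head = projection_params(d_model=512, h=1)
print(multi_head, single_head)
```

Both counts equal $4 \cdot 512^2$, which is why the implementation below can use four `nn.Linear(d_model, d_model)` layers and split the result into heads with a reshape.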


In [16]:
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0  # Check if the model size can be evenly divided by the number of heads
        self.d_k = d_model // h  # Compute the dimension of the queries, keys and values
        self.h = h  # The number of heads
        self.linears = clones(nn.Linear(d_model, d_model), 4)  # Initialize 4 identical linear layers
        self.attn = None  # The attention weights are initially None
        self.dropout = nn.Dropout(p=dropout)  # The dropout layer to be used after the softmax

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # If a mask is provided, expand it to cover all attention heads
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)  # Batch size (number of sequences in the batch)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)  
            for lin, x in zip(self.linears, (query, key, value))
        ]  # Apply the linear layers and reshape the results

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )  # Apply the attention mechanism

        # 3) "Concat" using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )  # Rearrange and reshape the outputs

        del query
        del key
        del value
        return self.linears[-1](x)  # Apply the final linear layer


#### Applications of Attention in our Model

The Transformer uses multi-head attention in three different ways:

1) In "encoder-decoder attention" layers, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence. This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as (cite).

2) The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place, in this case, the output of the previous layer in the encoder. Each position in the encoder can attend to all positions in the previous layer of the encoder.

3) Similarly, self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to $-\infty$) all values in the input of the softmax which correspond to illegal connections.
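
To see what masking does to the attention weights, here is a small sketch in plain Python. It mirrors the `masked_fill(mask == 0, -1e9)` step of the `attention` function, but the helpers `softmax`, `causal_mask` and `masked_attention_weights` are written here for illustration only and are not the notebook's implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_mask(size):
    """mask[i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(size)] for i in range(size)]


def masked_attention_weights(scores, mask, fill=-1e9):
    """Replace disallowed scores with a large negative value, then softmax each row."""
    masked = [
        [s if m else fill for s, m in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]
    return [softmax(row) for row in masked]


# Uniform scores, so the effect of the mask is easy to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(scores, causal_mask(4))
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but all weight on later positions is zero: the first position attends only to itself, the second splits its weight over the first two positions, and so on.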


### Position-wise Feed-Forward Networks

In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

$$ FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 $$

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. The dimensionality of input and output is $d_{model} = 512$, and the inner-layer has dimensionality $d_{ff} = 2048$.

<div style="background-color:#FFFFE0; padding: 20px;">
<sup>1</sup> **Feed-Forward Network (FFN)**: This is a type of artificial neural network where connections between nodes do not form a cycle. In the Transformer model, it's used as a sub-layer in both the encoder and decoder stages. Although called a "network", this is essentially a single layer with two linear (fully connected) transformations and a ReLU (Rectified Linear Unit) activation function in between. This sub-layer is applied to each position separately and identically, meaning it does not alter the word order information.

<sup>2</sup> **ReLU**: The ReLU function is an activation function defined as the positive part of its argument, i.e., $ReLU(x) = max(0,x)$, where $x$ is the input to the function. It introduces the non-linearity between the two linear transformations; without it, the two would collapse into a single linear map.

<sup>3</sup> **Two convolutions with kernel size 1**: This phrase refers to the two different linear transformations (the "convolutions") applied to the input by this feed-forward network. Each transformation is applied to each position separately (which is what "kernel size 1" means in this context). 

<sup>4</sup> **Dimensions**: The dimensionality of input and output ($d_{model}$) is 512, meaning each input and output vector has 512 components. The inner layer (the one in the middle of the two linear transformations) has a dimensionality of 2048. This implies that the 512-dimensional input is expanded into an intermediate 2048-dimensional representation before reducing it back to 512 dimensions.
</div>
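
The phrase "applied to each position separately and identically" can be illustrated with a toy version of the FFN equation. This sketch uses plain Python lists and hand-picked weights with $d_{model} = 2$ and $d_{ff} = 3$; the sizes, the weights, and the helper names `linear` and `ffn` are assumptions made for illustration.

```python
def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]


def linear(v, W, b):
    """Compute v @ W + b for a vector v and a matrix W given as a list of rows."""
    return [
        sum(v[i] * W[i][j] for i in range(len(v))) + b[j]
        for j in range(len(b))
    ]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


# Toy sizes (d_model=2, d_ff=3) with hand-picked weights, purely illustrative.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in sequence]
reversed_outputs = [ffn(x, W1, b1, W2, b2) for x in reversed(sequence)]
print(outputs)
```

Because the same weights are used at every position and no position sees its neighbours, reversing the input sequence simply reverses the outputs.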


In [17]:
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))

### Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities. In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by $\sqrt{d_{model}}$.

<div style="background-color:#FFFFE0; padding: 20px;">

<sup>1</sup> **Learned Linear Transformation**: In the context of neural networks, a linear transformation involves changing the shape of the data by multiplying the input matrix with a set of learned weights.

<sup>2</sup> **Shared Weights**: In the Transformer model, the weights are shared between the two embedding layers (input and output) and the pre-softmax linear transformation. This parameter sharing can lead to a more cohesive and better-performing model.

<sup>3</sup> **Scaling by $\sqrt{d_{model}}$**: The embedding vectors are multiplied by $\sqrt{d_{model}}$ before the positional encodings are added. The paper does not state the reason; a common interpretation is that the scaling keeps the embeddings from being dwarfed by the positional encodings, whose entries lie in $[-1, 1]$.
</div>


In [18]:
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)  # Lookup table that stores embeddings of a fixed dictionary and size
        self.d_model = d_model  # The embedding dimension

    def forward(self, x):
        # Apply the lookup table to the input tensor x and scale by sqrt(d_model)
        return self.lut(x) * math.sqrt(self.d_model)

### Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. 

To this end, we add "positional encodings" to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed<sup>1</sup>.

There are many choices of positional encodings, learned and fixed.

In this work, we use sine and cosine functions of different frequencies:

$$ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) $$

$$ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) $$

where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid.

The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of $P_{drop} = 0.1$.

<div style="background-color: #ffffe0">

<sup>1</sup> **Example**: As an example, consider the phrase "I love coding". The positional encoding for "love" depends only on its position (2nd in this case), not on the word itself. Adding it to the embedding of "love" gives a vector that carries both the word's meaning and its position, which the model can use to recover the order of the words in the sentence. The same is done for each word in the sentence.

</div>
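
The statement that $PE_{pos+k}$ is a linear function of $PE_{pos}$ follows from the angle-addition identities: each (sine, cosine) pair is rotated by an angle that depends only on $k$. The sketch below checks this numerically in plain Python; `positional_encoding` and `shift` are illustrative helpers written for this check, and `d_model` is assumed to be even.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position: sin on even indices, cos on odd."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)
        pe[i + 1] = math.cos(angle)
    return pe


def shift(pe, k, d_model):
    """Map PE(pos) to PE(pos + k) using only k: a rotation of each (sin, cos) pair."""
    out = [0.0] * d_model
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))  # angular frequency of this pair
        s, c = pe[i], pe[i + 1]
        out[i] = s * math.cos(k * w) + c * math.sin(k * w)
        out[i + 1] = c * math.cos(k * w) - s * math.sin(k * w)
    return out


d_model, pos, k = 8, 5, 3
direct = positional_encoding(pos + k, d_model)
shifted = shift(positional_encoding(pos, d_model), k, d_model)
max_error = max(abs(a - b) for a, b in zip(direct, shifted))
print(f"max |PE(pos+k) - shift(PE(pos))| = {max_error:.2e}")
```

The coefficients inside `shift` never look at `pos`, so the same linear map moves any position forward by `k`; the printed error should be at the level of floating-point rounding.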


In [19]:
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

Below, the positional encoding adds a sine wave based on position. The frequency and offset of the wave are different for each dimension.

In [20]:
def example_positional():
    pe = PositionalEncoding(20, 0)
    y = pe.forward(torch.zeros(1, 100, 20))

    data = pd.concat(
        [
            pd.DataFrame(
                {
                    "embedding": y[0, :, dim],
                    "dimension": dim,
                    "position": list(range(100)),
                }
            )
            for dim in [4, 5, 6, 7]
        ]
    )

    return (
        alt.Chart(data)
        .mark_line()
        .properties(width=800)
        .encode(x="position", y="embedding", color="dimension:N")
        .interactive()
    )


show_example(example_positional)

We also experimented with using learned positional embeddings<sup>1</sup> instead, and found that the two versions produced nearly identical results. We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training<sup>2</sup>.

<div style="background-color: #ffffe0">

<sup>1</sup>: Learned positional embeddings refer to a method where the position information is not hardcoded (as in sinusoidal encoding) but is learned from the data during the training of the model. This is another common method used in Transformer models.

<sup>2</sup>: An example of how sinusoidal encoding works: Let's take a sentence "I love coding". Each word in the sentence is represented by a vector. In order to keep track of the word's position in the sentence, the model adds a 'positional encoding' to the original embedding of each word. For the sinusoidal version, these encodings are a combination of sine and cosine functions. For instance, the positional encoding for the second word 'love' would be a vector whose entries lie between -1 and 1, generated by the sine function for even dimensions and the cosine function for odd dimensions.

</div>


## Full Model

Here we define a function from hyperparameters to a full model.

In [21]:
def make_model(
    src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1
):
    "Helper: Construct a model from hyperparameters."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )

    # This was important from their code.
    # Initialize parameters with Glorot / fan_avg.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model

### Inference

Here we run a forward pass to generate a prediction from the model. We try to use our transformer to memorize the input. As you will see, the output is essentially random because the model has not been trained yet. In the next tutorial we will build the training function and try to train our model to memorize the numbers from 1 to 10.

In [22]:
def inference_test():
    test_model = make_model(11, 11, 2)
    test_model.eval()
    src = torch.LongTensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
    src_mask = torch.ones(1, 1, 10)

    memory = test_model.encode(src, src_mask)
    ys = torch.zeros(1, 1).type_as(src)

    for i in range(9):
        out = test_model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = test_model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.empty(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )

    print("Example Untrained Model Prediction:", ys)


def run_tests():
    for _ in range(10):
        inference_test()


show_example(run_tests)

Example Untrained Model Prediction: tensor([[ 0,  4,  9, 10,  7,  5,  4,  0,  4,  0]])
Example Untrained Model Prediction: tensor([[ 0, 10,  9,  5, 10,  9,  5,  5,  5,  5]])
Example Untrained Model Prediction: tensor([[0, 0, 0, 0, 0, 0, 0, 8, 4, 8]])
Example Untrained Model Prediction: tensor([[ 0,  6, 10,  6, 10,  6,  1,  6, 10,  6]])
Example Untrained Model Prediction: tensor([[ 0,  1,  6,  5,  5, 10,  8,  1,  1,  1]])
Example Untrained Model Prediction: tensor([[0, 6, 8, 8, 8, 8, 8, 8, 6, 3]])
Example Untrained Model Prediction: tensor([[ 0, 10, 10, 10, 10, 10, 10, 10,  3,  0]])
Example Untrained Model Prediction: tensor([[0, 8, 6, 6, 6, 6, 6, 6, 6, 4]])
Example Untrained Model Prediction: tensor([[0, 2, 5, 3, 3, 3, 3, 3, 3, 3]])
Example Untrained Model Prediction: tensor([[0, 9, 9, 9, 9, 9, 9, 3, 3, 3]])


## Model Training

### Training

This section describes the training regime for our models.

We stop for a quick interlude to introduce some of the tools needed to train a standard encoder decoder model. First we define a batch object that holds the src and target sentences for training, as well as constructing the masks.

#### Batches and Masking

In [23]:
class Batch:
    """Object for holding a batch of data with mask during training."""

    def __init__(self, src, tgt=None, pad=2):  # 2 = <blank>
        self.src = src
        self.src_mask = (src != pad).unsqueeze(-2)
        if tgt is not None:
            self.tgt = tgt[:, :-1]
            self.tgt_y = tgt[:, 1:]
            self.tgt_mask = self.make_std_mask(self.tgt, pad)
            self.ntokens = (self.tgt_y != pad).data.sum()

    @staticmethod
    def make_std_mask(tgt, pad):
        "Create a mask to hide padding and future words."
        tgt_mask = (tgt != pad).unsqueeze(-2)
        tgt_mask = tgt_mask & subsequent_mask(tgt.size(-1)).type_as(
            tgt_mask.data
        )
        return tgt_mask
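
What `Batch` does to a single target sequence can be sketched without tensors. The example below is a plain-Python approximation, not the class itself: the token ids other than the padding id `2` are hypothetical, and `shift_target` and `std_mask` are names introduced here for illustration.

```python
PAD = 2  # same padding id as the default in Batch above


def shift_target(tgt):
    """Split a target sequence into decoder input and next-token labels."""
    return tgt[:-1], tgt[1:]


def std_mask(decoder_input, pad=PAD):
    """mask[i][j] is True if position i may attend to position j:
    j must not be padding and must not lie in the future (j <= i)."""
    n = len(decoder_input)
    return [
        [decoder_input[j] != pad and j <= i for j in range(n)]
        for i in range(n)
    ]


# Hypothetical token ids: 0 = start, 1 = end, 2 = padding, others are words.
tgt = [0, 7, 8, 1, 2, 2]
decoder_input, labels = shift_target(tgt)
ntokens = sum(1 for t in labels if t != PAD)
mask = std_mask(decoder_input)

print("decoder input:", decoder_input)
print("labels:       ", labels)
print("ntokens:      ", ntokens)
```

The decoder sees the sequence without its last token and is trained to predict the sequence without its first token, so position `i` of the input is paired with token `i + 1` as its label. Only non-padding labels are counted in `ntokens`, which is later used to normalize the loss.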

Next we create a generic training and scoring function to keep track of loss. We pass in a generic loss compute function that also handles parameter updates.

#### Training Loop

In [24]:
class TrainState:
    """Track number of steps, examples, and tokens processed"""

    step: int = 0  # Steps in the current epoch
    accum_step: int = 0  # Number of gradient accumulation steps
    samples: int = 0  # total # of examples used
    tokens: int = 0  # total # of tokens processed

In [25]:
def run_epoch(
    data_iter,
    model,
    loss_compute,
    optimizer,
    scheduler,
    mode="train",
    accum_iter=1,
    train_state=TrainState(),
):
    """Train a single epoch"""
    
    # Record the start time for performance computation
    start = time.time()
    
    # Initialize counters for total tokens, total loss, and tokens
    total_tokens = 0
    total_loss = 0
    tokens = 0
    
    # Counter for number of accumulation steps
    n_accum = 0
    
    # Iterate over batches in data iterator
    for i, batch in enumerate(data_iter):
        
        # Forward pass through the model
        out = model.forward(
            batch.src, batch.tgt, batch.src_mask, batch.tgt_mask
        )
        
        # Compute loss
        loss, loss_node = loss_compute(out, batch.tgt_y, batch.ntokens)
        
        if mode == "train" or mode == "train+log":
            
            # Backward pass (compute gradient)
            loss_node.backward()
            
            # Update training state
            train_state.step += 1
            train_state.samples += batch.src.shape[0]
            train_state.tokens += batch.ntokens
            
            # Perform parameter update every accum_iter batches
            if i % accum_iter == 0:
                optimizer.step()  # Update parameters
                optimizer.zero_grad(set_to_none=True)  # Zero out gradients
                n_accum += 1
                train_state.accum_step += 1
            
            # Step the learning rate scheduler
            scheduler.step()

        # Update counters
        total_loss += loss
        total_tokens += batch.ntokens
        tokens += batch.ntokens
        
        # Every 40 batches, print a training status update
        if i % 40 == 1 and (mode == "train" or mode == "train+log"):
            lr = optimizer.param_groups[0]["lr"]  # Get learning rate
            elapsed = time.time() - start  # Time elapsed since last update
            
            # Print training update
            print(
                (
                    "Epoch Step: %6d | Accumulation Step: %3d | Loss: %6.2f "
                    + "| Tokens / Sec: %7.1f | Learning Rate: %6.1e"
                )
                % (i, n_accum, loss / batch.ntokens, tokens / elapsed, lr)
            )
            
            # Reset start time and token counter
            start = time.time()
            tokens = 0
        
        # Free memory occupied by loss and loss_node
        del loss
        del loss_node
    
    # Return average loss and training state
    return total_loss / total_tokens, train_state


### Training Data and Batching
We trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million sentence pairs. Sentences were encoded using byte-pair encoding, which has a shared source-target vocabulary of about 37000 tokens. For English-French, we used the significantly larger WMT 2014 English-French dataset consisting of 36M sentences and split tokens into a 32000 word-piece vocabulary.

Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

### Hardware and Schedule
We trained our models on one machine with 8 NVIDIA P100 GPUs. For our base models using the hyperparameters described throughout the paper, each training step took about 0.4 seconds. We trained the base models for a total of 100,000 steps or 12 hours. For our big models, step time was 1.0 seconds. The big models were trained for 300,000 steps (3.5 days).

We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We varied the learning rate over the course of training, according to the formula:

$$ lrate = d_{\text{model}}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot warmup\_steps^{-1.5}) $$

This corresponds to increasing the learning rate linearly for the first $warmup\_steps$ training steps, and decreasing it thereafter proportionally to the inverse square root of the step number. We used $warmup\_steps = 4000$.

Note: This part is very important. The model needs to be trained with this setup.

Example of the curves of this model for different model sizes and for optimization hyperparameters.

In [26]:
def rate(step, model_size, factor, warmup):
    """
    we have to default the step to 1 for LambdaLR function
    to avoid zero raising to negative power.
    """
    if step == 0:
        step = 1
    return factor * (
        model_size ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))
    )
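
A few spot values make the shape of the schedule easier to picture. The sketch below re-implements the same formula as a standalone function so that it runs on its own; the name `noam_rate` and the default arguments (the base-model values quoted above) are choices made here for illustration.

```python
def noam_rate(step, model_size=512, factor=1.0, warmup=4000):
    """Same formula as rate() above, re-implemented so this cell runs alone."""
    step = max(step, 1)  # avoid raising 0 to a negative power
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


peak = noam_rate(4000)
print(f"step     1: {noam_rate(1):.2e}")
print(f"step  2000: {noam_rate(2000):.2e}")
print(f"step  4000: {peak:.2e}")
print(f"step 16000: {noam_rate(16000):.2e}")
```

The two branches of the `min` meet at `step == warmup`, where the rate peaks at $(d_{\text{model}} \cdot warmup)^{-0.5}$, roughly $7 \times 10^{-4}$ for these values. Halfway through warmup the rate is half the peak, and it falls back to half the peak again at four times the warmup step count.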

In [27]:
def example_learning_schedule():
    opts = [
        [512, 1, 4000],  # example 1
        [512, 1, 8000],  # example 2
        [256, 1, 4000],  # example 3
    ]

    dummy_model = torch.nn.Linear(1, 1)
    learning_rates = []

    # we have 3 examples in opts list.
    for idx, example in enumerate(opts):
        # run 20000 dummy training steps for each example
        optimizer = torch.optim.Adam(
            dummy_model.parameters(), lr=1, betas=(0.9, 0.98), eps=1e-9
        )
        lr_scheduler = LambdaLR(
            optimizer=optimizer, lr_lambda=lambda step: rate(step, *example)
        )
        tmp = []
        # take 20K dummy training steps, save the learning rate at each step
        for step in range(20000):
            tmp.append(optimizer.param_groups[0]["lr"])
            optimizer.step()
            lr_scheduler.step()
        learning_rates.append(tmp)

    learning_rates = torch.tensor(learning_rates)

    # Enable altair to handle more than 5000 rows
    alt.data_transformers.disable_max_rows()

    opts_data = pd.concat(
        [
            pd.DataFrame(
                {
                    "Learning Rate": learning_rates[warmup_idx, :],
                    "model_size:warmup": ["512:4000", "512:8000", "256:4000"][
                        warmup_idx
                    ],
                    "step": range(20000),
                }
            )
            for warmup_idx in [0, 1, 2]
        ]
    )

    return (
        alt.Chart(opts_data)
        .mark_line()
        .properties(width=600)
        .encode(x="step", y="Learning Rate", color="model_size:warmup:N")
        .interactive()
    )


example_learning_schedule()

### Regularization

#### Label Smoothing
During training, we employed label smoothing of value $\epsilon_{ls} = 0.1$ (cite). This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.

We implement label smoothing using the KL div loss. Instead of using a one-hot target distribution, we create a distribution that assigns `confidence` to the correct word and distributes the rest of the `smoothing` mass throughout the vocabulary.

In [None]:
class LabelSmoothing(nn.Module):
    "Implement label smoothing."

    def __init__(self, size, padding_idx, smoothing=0.0):
        super(LabelSmoothing, self).__init__()
        self.criterion = nn.KLDivLoss(reduction="sum")
        self.padding_idx = padding_idx
        self.confidence = 1.0 - smoothing
        self.smoothing = smoothing
        self.size = size
        self.true_dist = None

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.dim() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, true_dist.clone().detach())
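
The target distribution built in `forward` can be reproduced for a single row with plain Python. This is a sketch of the same arithmetic under the same convention (the smoothing mass is spread over `size - 2` entries, excluding the target and the padding index); `smoothed_target` is a name introduced here for illustration.

```python
def smoothed_target(size, target, padding_idx, smoothing):
    """Target distribution for one row, built the same way as in LabelSmoothing.forward."""
    if target == padding_idx:
        return [0.0] * size  # padded positions contribute nothing to the loss
    # Spread the smoothing mass over all entries except target and padding.
    dist = [smoothing / (size - 2)] * size
    dist[target] = 1.0 - smoothing
    dist[padding_idx] = 0.0
    return dist


dist = smoothed_target(size=5, target=2, padding_idx=0, smoothing=0.4)
print([round(p, 3) for p in dist])
```

With `smoothing=0.4` the correct class keeps 0.6 of the probability mass, the three remaining non-padding classes share the other 0.4 equally, and the padding class gets none, so the row still sums to 1.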

Here we can see an example of how the mass is distributed to the words based on confidence.

In [None]:
# Example of label smoothing.


def example_label_smoothing():
    crit = LabelSmoothing(5, 0, 0.4)
    predict = torch.FloatTensor(
        [
            [0, 0.2, 0.7, 0.1, 0],
            [0, 0.2, 0.7, 0.1, 0],
            [0, 0.2, 0.7, 0.1, 0],
            [0, 0.2, 0.7, 0.1, 0],
            [0, 0.2, 0.7, 0.1, 0],
        ]
    )
    crit(x=predict.log(), target=torch.LongTensor([2, 1, 0, 3, 3]))
    LS_data = pd.concat(
        [
            pd.DataFrame(
                {
                    "target distribution": crit.true_dist[x, y].flatten(),
                    "columns": y,
                    "rows": x,
                }
            )
            for y in range(5)
            for x in range(5)
        ]
    )

    return (
        alt.Chart(LS_data)
        .mark_rect(color="Blue", opacity=1)
        .properties(height=200, width=200)
        .encode(
            alt.X("columns:O", title=None),
            alt.Y("rows:O", title=None),
            alt.Color(
                "target distribution:Q", scale=alt.Scale(scheme="viridis")
            ),
        )
        .interactive()
    )


show_example(example_label_smoothing)


Label smoothing actually starts to penalize the model if it gets very confident about a given choice.

In [None]:
def loss(x, crit):
    d = x + 3 * 1
    predict = torch.FloatTensor([[0, x / d, 1 / d, 1 / d, 1 / d]])
    return crit(predict.log(), torch.LongTensor([1])).data


def penalization_visualization():
    crit = LabelSmoothing(5, 0, 0.1)
    loss_data = pd.DataFrame(
        {
            "Loss": [loss(x, crit) for x in range(1, 100)],
            "Steps": list(range(99)),
        }
    ).astype("float")

    return (
        alt.Chart(loss_data)
        .mark_line()
        .properties(width=350)
        .encode(
            x="Steps",
            y="Loss",
        )
        .interactive()
    )


show_example(penalization_visualization)
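
The penalty on over-confidence can also be checked by hand. The sketch below recomputes the same quantity as the `loss` function above in plain Python, assuming the loss is the KL divergence between the smoothed target and the prediction, with zero-probability target entries skipped; `kl_div` and `smoothing_loss` are illustrative names.

```python
import math


def kl_div(target_dist, predicted):
    """KL(target || predicted), skipping zero-probability target entries."""
    return sum(
        t * (math.log(t) - math.log(p))
        for t, p in zip(target_dist, predicted)
        if t > 0
    )


def smoothing_loss(x, smoothing=0.1):
    """Loss for the prediction [0, x/d, 1/d, 1/d, 1/d] with d = x + 3, target class 1."""
    d = x + 3
    predicted = [0.0, x / d, 1 / d, 1 / d, 1 / d]
    # Smoothed target for size=5, padding_idx=0: mass spread over size - 2 = 3 entries.
    target_dist = [0.0, 1.0 - smoothing, smoothing / 3, smoothing / 3, smoothing / 3]
    return kl_div(target_dist, predicted)


for x in (1, 9, 27, 99):
    print(f"x = {x:2d}  p(correct) = {x / (x + 3):.3f}  loss = {smoothing_loss(x):.4f}")
```

The loss reaches its minimum at `x = 27`, where the predicted probability of the correct class is exactly 0.9 and matches the smoothed target. Pushing the prediction beyond that point, for example to `x = 99`, makes the loss rise again.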

## A First Example
We can begin by trying out a simple copy-task. Given a random set of input symbols from a small vocabulary, the goal is to generate back those same symbols.

### Synthetic Data

In [31]:
def data_gen(V, batch_size, nbatches):
    "Generate random data for a src-tgt copy task."
    for i in range(nbatches):
        data = torch.randint(1, V, size=(batch_size, 10))
        data[:, 0] = 1
        src = data.requires_grad_(False).clone().detach()
        tgt = data.requires_grad_(False).clone().detach()
        yield Batch(src, tgt, 0)

### Loss Computation

In [32]:
class SimpleLossCompute:
    "A simple loss compute and train function."

    def __init__(self, generator, criterion):
        # The generator is usually the output layer of the model which generates the final output
        self.generator = generator
        
        # The criterion is the loss function used in training
        self.criterion = criterion

    def __call__(self, x, y, norm):
        # Apply the generator to the input tensor 'x'
        x = self.generator(x)
        
        # Compute the loss between the model output 'x' and true labels 'y', and normalize it by 'norm'
        # 'x' and 'y' are reshaped into 2D and 1D tensors respectively for the computation of loss
        sloss = (
            self.criterion(
                x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)
            )
            / norm
        )
        
        # Return the total loss (scaled by 'norm') and the average loss (sloss)
        return sloss.data * norm, sloss


### Greedy Decoding

In [33]:
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    memory = model.encode(src, src_mask)
    ys = torch.zeros(1, 1).fill_(start_symbol).type_as(src.data)
    for i in range(max_len - 1):
        out = model.decode(
            memory, src_mask, ys, subsequent_mask(ys.size(1)).type_as(src.data)
        )
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.data[0]
        ys = torch.cat(
            [ys, torch.zeros(1, 1).type_as(src.data).fill_(next_word)], dim=1
        )
    return ys
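
The control flow of greedy decoding is independent of the model, so it can be tried out with a stand-in scorer. In this sketch `count_up_scores` is a hypothetical replacement for `model.decode` plus `model.generator`, and `greedy_decode_toy` is an illustrative name; neither is part of the notebook.

```python
def greedy_decode_toy(next_token_scores, start_symbol, max_len):
    """Repeatedly append the highest-scoring next token to the prefix."""
    ys = [start_symbol]
    for _ in range(max_len - 1):
        scores = next_token_scores(ys)
        # Index of the largest score, the list equivalent of torch.max(prob, dim=1).
        next_word = max(range(len(scores)), key=scores.__getitem__)
        ys.append(next_word)
    return ys


def count_up_scores(prefix, vocab_size=11):
    """Hypothetical stand-in for a trained model: prefers last token + 1."""
    preferred = (prefix[-1] + 1) % vocab_size
    return [1.0 if tok == preferred else 0.0 for tok in range(vocab_size)]


print(greedy_decode_toy(count_up_scores, start_symbol=0, max_len=10))
```

As in `greedy_decode`, the whole prefix is fed back in at every step and only the prediction for the last position is used. Greedy decoding commits to the single best token each time, which is simple but can miss sequences that a beam search would find.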

In [42]:
# Train the simple copy task.

def example_simple_model():
    V = 11
    criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
    model = make_model(V, V, N=2)

    optimizer = torch.optim.Adam(
        model.parameters(), lr=0.5, betas=(0.9, 0.98), eps=1e-9
    )
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, model_size=model.src_embed[0].d_model, factor=1.0, warmup=400
        ),
    )

    batch_size = 80
    for epoch in range(20):
        model.train()
        run_epoch(
            data_gen(V, batch_size, 20),
            model,
            SimpleLossCompute(model.generator, criterion),
            optimizer,
            lr_scheduler,
            mode="train",
        )
        model.eval()
        run_epoch(
            data_gen(V, batch_size, 5),
            model,
            SimpleLossCompute(model.generator, criterion),
            DummyOptimizer(),
            DummyScheduler(),
            mode="eval",
        )[0]

    model.eval()
    src = torch.LongTensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
    max_len = src.shape[1]
    src_mask = torch.ones(1, 1, max_len)
    print(greedy_decode(model, src, src_mask, max_len=max_len, start_symbol=0))
    
execute_example(example_simple_model)

Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.96 | Tokens / Sec:   502.3 | Learning Rate: 5.5e-06
Epoch Step:      1 | Accumulation Step:   2 | Loss:   2.00 | Tokens / Sec:   550.8 | Learning Rate: 6.1e-05
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.88 | Tokens / Sec:   546.9 | Learning Rate: 1.2e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.71 | Tokens / Sec:   529.2 | Learning Rate: 1.7e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.54 | Tokens / Sec:   528.5 | Learning Rate: 2.3e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   1.17 | Tokens / Sec:   535.7 | Learning Rate: 2.8e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.84 | Tokens / Sec:   543.4 | Learning Rate: 3.4e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.54 | Tokens / Sec:   530.8 | Learning Rate: 3.9e-04
Epoch Step:      1 | Accumulation Step:   2 | Loss:   0.36 | Tokens / Sec:   525.4 | Learning Rate: 4.5e-04
Epoch Step:      1 | Accumul
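
The rising "Learning Rate" column in the log reflects the warmup phase of the schedule. The sketch below assumes that the notebook's `rate` helper follows the schedule from the paper, `factor * model_size^-0.5 * min(step^-0.5, step * warmup^-1.5)`; `rate_sketch` is a hypothetical re-statement of it, and `LambdaLR` multiplies the optimizer's base learning rate by this value.

```python
def rate_sketch(step, model_size, factor, warmup):
    # Assumed form of the notebook's `rate` helper (warmup, then inverse sqrt decay).
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return factor * (
        model_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
    )


base_lr = 0.5  # the lr passed to Adam in example_simple_model
lrs = [
    base_lr * rate_sketch(s, model_size=512, factor=1.0, warmup=400)
    for s in (2, 400, 4000)
]
print(["%.1e" % lr for lr in lrs])  # ['5.5e-06', '1.1e-03', '3.5e-04']
```

Under these assumptions the rate grows linearly until step 400 and decays afterwards; the value at step 2 is consistent with the first learning rate shown in the log.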

## Real Life Example

Now we consider a real-world example using the Multi30k German-English translation task. This task is much smaller than the WMT task considered in the paper, but it illustrates the whole system. We also show how to use multi-GPU processing to speed up training.

### Data Loading

We will load the dataset with torchtext and tokenize it with spaCy.

In [35]:
def load_tokenizers():

    try:
        spacy_de = spacy.load("de_core_news_sm")
    except IOError:
        os.system("python -m spacy download de_core_news_sm")
        spacy_de = spacy.load("de_core_news_sm")

    try:
        spacy_en = spacy.load("en_core_web_sm")
    except IOError:
        os.system("python -m spacy download en_core_web_sm")
        spacy_en = spacy.load("en_core_web_sm")

    return spacy_de, spacy_en

In [36]:
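def tokenize(text, tokenizer):
    # Run only spaCy's tokenizer (not the full pipeline) and keep the token strings
    return [tok.text for tok in tokenizer.tokenizer(text)]


def yield_tokens(data_iter, tokenizer, index):
    # Each item is a (German, English) pair; index selects which side to tokenize
    for from_to_tuple in data_iter:
        yield tokenizer(from_to_tuple[index])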

In [37]:
def build_vocabulary(spacy_de, spacy_en):
    # Define tokenization functions for German (de) and English (en) texts
    def tokenize_de(text):
        return tokenize(text, spacy_de)

    def tokenize_en(text):
        return tokenize(text, spacy_en)

    # Build vocabulary for German
    print("Building German Vocabulary ...")
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))
    vocab_src = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_de, index=0),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],
    )

    # Build vocabulary for English
    print("Building English Vocabulary ...")
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))
    vocab_tgt = build_vocab_from_iterator(
        yield_tokens(train + val + test, tokenize_en, index=1),
        min_freq=2,
        specials=["<s>", "</s>", "<blank>", "<unk>"],
    )

    # Set default indices for unknown tokens
    vocab_src.set_default_index(vocab_src["<unk>"])
    vocab_tgt.set_default_index(vocab_tgt["<unk>"])

    return vocab_src, vocab_tgt

def load_vocab(spacy_de, spacy_en):
    # Check if vocabulary already exists, if not, build it
    if not exists("vocab.pt"):
        vocab_src, vocab_tgt = build_vocabulary(spacy_de, spacy_en)
        torch.save((vocab_src, vocab_tgt), "vocab.pt")
    else:
        # Load vocabulary from file
        vocab_src, vocab_tgt = torch.load("vocab.pt")
    
    print("Finished.\nVocabulary sizes:")
    print(len(vocab_src))
    print(len(vocab_tgt))
    return vocab_src, vocab_tgt

if is_interactive_notebook():
    # Load tokenizers and vocabularies (used in interactive notebook mode)
    spacy_de, spacy_en = show_example(load_tokenizers)
    vocab_src, vocab_tgt = show_example(load_vocab, args=[spacy_de, spacy_en])


Finished.
Vocabulary sizes:
8315
6384
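
To see what `min_freq`, `specials`, and `set_default_index` do, here is a simplified pure-Python analogue. It is a sketch under assumptions: `build_vocab_sketch` is a hypothetical helper, and the exact token ordering used by torchtext may differ from the frequency-then-alphabetical order chosen here.

```python
from collections import Counter


def build_vocab_sketch(token_lists, min_freq, specials):
    """Simplified analogue of build_vocab_from_iterator (assumption:
    specials come first, then tokens seen at least min_freq times)."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + kept
    return {tok: idx for idx, tok in enumerate(itos)}


specials = ["<s>", "</s>", "<blank>", "<unk>"]
stoi = build_vocab_sketch(
    [["ein", "hund", "rennt"], ["ein", "hund", "sitzt"]],
    min_freq=2,
    specials=specials,
)
unk = stoi["<unk>"]
# Tokens below min_freq fall back to <unk>, mirroring set_default_index.
ids = [stoi.get(tok, unk) for tok in ["ein", "hund", "bellt"]]
print(ids)  # [4, 5, 3]
```

Since the special tokens are listed first, they receive the indices 0 to 3, which is why `collate_batch` can use 0 for `<s>`, 1 for `</s>`, and 2 for `<blank>`.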


### Iterators

In [38]:
def collate_batch(
    batch,
    src_pipeline,
    tgt_pipeline,
    src_vocab,
    tgt_vocab,
    max_padding=128,
    pad_id=2,
):
    # Define start and end of sentence token ids.
    # Tensors are created on Apple's MPS device; change the device here on other hardware.
    bs_id = torch.tensor([0], device=torch.device('mps'))  # <s> token id
    eos_id = torch.tensor([1], device=torch.device('mps'))  # </s> token id
    src_list, tgt_list = [], []

    # Iterate over each source-target pair in the batch
    for (_src, _tgt) in batch:
        # Process source and target sentences
        processed_src = torch.cat(
            [
                bs_id,
                torch.tensor(
                    src_vocab(src_pipeline(_src)),  # Tokenize and convert source sentence to tensor
                    dtype=torch.int64,
                    device=torch.device('mps'),
                ),
                eos_id,
            ],
            0,
        )
        processed_tgt = torch.cat(
            [
                bs_id,
                torch.tensor(
                    tgt_vocab(tgt_pipeline(_tgt)),  # Tokenize and convert target sentence to tensor
                    dtype=torch.int64,
                    device=torch.device('mps'),
                ),
                eos_id,
            ],
            0,
        )

        # Pad sentences to max_padding length and add to lists
        src_list.append(
            pad(
                processed_src,
                (
                    0,
                    max_padding - len(processed_src),
                ),
                value=pad_id,
            )
        )
        tgt_list.append(
            pad(
                processed_tgt,
                (0, max_padding - len(processed_tgt)),
                value=pad_id,
            )
        )

    # Stack source and target sentences into tensors
    src = torch.stack(src_list)
    tgt = torch.stack(tgt_list)

    return (src, tgt)
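
The per-sentence work in `collate_batch` amounts to wrapping each sequence in start and end tokens and bringing it to a fixed length. A minimal list-based sketch follows; `wrap_and_pad` is a hypothetical helper, and it assumes that sequences longer than `max_padding` are cut off on the right.

```python
def wrap_and_pad(token_ids, max_padding, bos_id=0, eos_id=1, pad_id=2):
    """Add <s> and </s>, then right-pad (or truncate) to max_padding."""
    seq = [bos_id] + list(token_ids) + [eos_id]
    # Assumption: over-long sequences are truncated on the right.
    seq = seq[:max_padding]
    return seq + [pad_id] * (max_padding - len(seq))


batch = [
    wrap_and_pad(ids, max_padding=6)
    for ids in ([7, 8], [9, 10, 11, 12, 13])
]
print(batch)  # [[0, 7, 8, 1, 2, 2], [0, 9, 10, 11, 12, 13]]
```

Every row ends up with the same length, which is what allows `torch.stack` to combine the sentences into one tensor. Note that a truncated sentence loses its `</s>` token, so `max_padding` should comfortably exceed the longest sentence.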


In [39]:
def create_dataloaders(
    device,
    vocab_src,
    vocab_tgt,
    spacy_de,
    spacy_en,
    batch_size=12000,
    max_padding=128,
    is_distributed=True,
):
    # Define tokenization functions for German (de) and English (en) texts
    def tokenize_de(text):
        return tokenize(text, spacy_de)

    def tokenize_en(text):
        return tokenize(text, spacy_en)

    # Collate function used to process the batch
    def collate_fn(batch):
        return collate_batch(
            batch,
            tokenize_de,
            tokenize_en,
            vocab_src,
            vocab_tgt,
            max_padding=max_padding,
            pad_id=vocab_src.get_stoi()["<blank>"],
        )


    # Load the dataset
    train_iter, valid_iter, test_iter = datasets.Multi30k(
        language_pair=("de", "en")
    )

    # Convert iterators to map-style datasets
    train_iter_map = to_map_style_dataset(
        train_iter
    )  # DistributedSampler needs a dataset len()
    valid_iter_map = to_map_style_dataset(valid_iter)

    # Create samplers for distributed training
    train_sampler = (
        DistributedSampler(train_iter_map) if is_distributed else None
    )
    valid_sampler = (
        DistributedSampler(valid_iter_map) if is_distributed else None
    )

    # Create data loaders
    train_dataloader = DataLoader(
        train_iter_map,
        batch_size=batch_size,
        shuffle=(train_sampler is None),  # Only shuffle if no sampler is provided
        sampler=train_sampler,
        collate_fn=collate_fn,  # Use custom collate function for processing batches
    )
    valid_dataloader = DataLoader(
        valid_iter_map,
        batch_size=batch_size,
        shuffle=(valid_sampler is None),
        sampler=valid_sampler,
        collate_fn=collate_fn,
    )

    return train_dataloader, valid_dataloader
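
When `is_distributed` is set, each process sees only its own slice of the dataset. The sketch below shows one way such a split can work, without shuffling; `shard_indices` is a hypothetical helper and the pad-then-stride scheme is an assumption about how a distributed sampler behaves, not a copy of PyTorch's code.

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Simplified, unshuffled split of dataset indices across processes
    (assumption: pad by wrapping around, then take every n-th index)."""
    per_rank = math.ceil(dataset_len / num_replicas)
    total = per_rank * num_replicas
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # repeat a few so it divides evenly
    return indices[rank:total:num_replicas]


shards = [shard_indices(10, num_replicas=4, rank=r) for r in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]
```

Each process gets the same number of examples, which is why the training code divides the configured batch size by the number of GPUs. Calling `set_epoch` before each epoch changes the shuffling seed so that the slices differ from epoch to epoch.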


### Training

In [40]:
def train_worker(
    ngpus_per_node,
    vocab_src,
    vocab_tgt,
    spacy_de,
    spacy_en,
    config,
    is_distributed=False,
):
    # Set the padding index.
    pad_idx = vocab_tgt["<blank>"]

    # Set the dimensionality of the model's embeddings.
    d_model = 512

    # Initialize the model.
    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    
    # Keep a reference to the unwrapped model so module.generator stays reachable after DDP wrapping.
    module = model
    is_main_process = True

    # If distributed training is enabled, initialize the distributed environment.
    if is_distributed:
        dist.init_process_group(
            "gloo", init_method="env://", world_size=ngpus_per_node
        )
        # Wrap the model with DistributedDataParallel for multi-GPU training.
        model = DDP(model)

    # Initialize the loss function.
    criterion = LabelSmoothing(
        size=len(vocab_tgt), padding_idx=pad_idx, smoothing=0.1
    )

    # Create DataLoader objects for training and validation datasets.
    train_dataloader, valid_dataloader = create_dataloaders(
        1,  # device argument; create_dataloaders does not use it
        vocab_src,
        vocab_tgt,
        spacy_de,
        spacy_en,
        batch_size=config["batch_size"] // ngpus_per_node,
        max_padding=config["max_padding"],
        is_distributed=is_distributed,
    )

    # Initialize the optimizer.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=config["base_lr"], betas=(0.9, 0.98), eps=1e-9
    )

    # Set up a learning rate scheduler.
    lr_scheduler = LambdaLR(
        optimizer=optimizer,
        lr_lambda=lambda step: rate(
            step, d_model, factor=1, warmup=config["warmup"]
        ),
    )

    # Initialize the training state.
    train_state = TrainState()

    # Start the training loop.
    for epoch in range(config["num_epochs"]):
        # If distributed training is enabled, set the epoch for each sampler.
        if is_distributed:
            train_dataloader.sampler.set_epoch(epoch)
            valid_dataloader.sampler.set_epoch(epoch)

        # Set the model to training mode.
        model.train()
        
        # Print the epoch number.
        print(f"Epoch {epoch} Training ====", flush=True)
        
        # Run a training epoch.
        _, train_state = run_epoch(
            (Batch(b[0], b[1], pad_idx) for b in train_dataloader),
            model,
            SimpleLossCompute(module.generator, criterion),
            optimizer,
            lr_scheduler,
            mode="train+log",
            accum_iter=config["accum_iter"],
            train_state=train_state,
        )        

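        # If this is the main process, save the model after each epoch.
        if is_main_process:
            file_path = "%s%.2d.pt" % (config["file_prefix"], epoch)
            torch.save(module.state_dict(), file_path)

    # After the last epoch, save the final weights under the name that
    # load_trained_model expects ("multi30k_model_final.pt" with this config).
    if is_main_process:
        file_path = "%sfinal.pt" % config["file_prefix"]
        torch.save(module.state_dict(), file_path)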


In [41]:
def train_distributed_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
    # Get the number of available GPUs.
    ngpus = torch.cuda.device_count()
    
    # Set environment variables for distributed training.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12356"
    
    # Print the number of detected GPUs.
    print(f"Number of GPUs detected: {ngpus}")
    print("Spawning training processes ...")
    
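    # Spawn one training worker process per GPU.
    # Note: mp.spawn calls train_worker(i, *args), passing the process index i as
    # the first positional argument, so train_worker needs a leading rank
    # parameter (and a rank for init_process_group) for this path to run.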
    mp.spawn(
        train_worker,
        nprocs=ngpus,
        args=(ngpus, vocab_src, vocab_tgt, spacy_de, spacy_en, config, True),
    )


def train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config):
    # If distributed training is enabled in the config, train the model distributed.
    if config["distributed"]:
        train_distributed_model(
            vocab_src, vocab_tgt, spacy_de, spacy_en, config
        )
    # Otherwise, train the model in a single process (no distributed setup).
    else:
        train_worker(
            1, vocab_src, vocab_tgt, spacy_de, spacy_en, config, False
        )


def load_trained_model():
    # Define the training configuration.
    config = {
        "batch_size": 32,
        "distributed": False,
        "num_epochs": 8,
        "accum_iter": 10,
        "base_lr": 1.0,
        "max_padding": 72,
        "warmup": 3000,
        "file_prefix": "multi30k_model_",
    }
    
    model_path = "multi30k_model_final.pt"
    
    # If a trained model does not exist, train a new model.
    if not exists(model_path):
        train_model(vocab_src, vocab_tgt, spacy_de, spacy_en, config)

    # Load the trained model weights (map_location targets Apple's MPS backend).
    model = make_model(len(vocab_src), len(vocab_tgt), N=6)
    model.load_state_dict(torch.load(model_path, map_location=torch.device('mps')))
    
    return model


# If running in an interactive notebook, load the trained model.
#if is_interactive_notebook():
#    model = load_trained_model()

