# Implementing Transformer Models
## Practical VII
Carel van Niekerk & Hsien-Chin Lin

25-29.11.2024

---

In this practical we will combine the word embedding layer, positional encoding layer, and encoder and decoder layers from previous practicals to implement a transformer model.

# Exercises

1. Study the model in the paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762). Write down the structure of the proposed model.
2. Study the section on word embeddings and pay close attention to the parameter sharing. Explain the benefits of parameter sharing in the transformer model.
3. Based on your implementations of all the components, implement a transformer model as a PyTorch `nn.Module`. Your model should be configurable with the following parameters:
    - `vocab_size`: The size of the vocabulary
    - `d_model`: The dimensionality of the embedding layer
    - `n_heads`: The number of heads in the multi-head attention layers
    - `num_encoder_layers`: The number of encoder layers
    - `num_decoder_layers`: The number of decoder layers
    - `dim_feedforward`: The dimensionality of the feedforward layer
    - `dropout`: The dropout probability
    - `max_len`: The maximum length of the input sequence
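
One way to keep these parameters together is a small configuration object. The sketch below is only an illustration: the `TransformerConfig` class is hypothetical and not part of the practical. Its defaults follow the base model of the paper, except `max_len`, which is an assumption that mirrors the value used in the example call later in this notebook.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TransformerConfig:
    """Hypothetical container for the hyperparameters listed above."""
    vocab_size: int
    d_model: int = 512
    n_heads: int = 8
    num_encoder_layers: int = 6
    num_decoder_layers: int = 6
    dim_feedforward: int = 2048
    dropout: float = 0.1
    max_len: int = 5000  # assumption: taken from the notebook's example call

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")


config = TransformerConfig(vocab_size=100)
kwargs = asdict(config)  # keyword arguments matching the parameter names above
```

A model whose constructor uses the same parameter names could then be built with `Transformer(**kwargs)`.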

---
## 1: Structure of the Transformer Model
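
The paper proposes an encoder-decoder architecture built from attention and feed-forward layers only, without recurrence or convolution:

- **Input**: token embeddings, scaled by the square root of `d_model`, are summed with sinusoidal positional encodings; dropout is applied to the sum.
- **Encoder**: a stack of `N = 6` identical layers. Each layer has two sublayers: multi-head self-attention and a position-wise feed-forward network.
- **Decoder**: a stack of `N = 6` identical layers. Each layer has three sublayers: masked multi-head self-attention, multi-head attention over the encoder output, and a position-wise feed-forward network.
- **Residual connections and normalization**: every sublayer is wrapped as `LayerNorm(x + Sublayer(x))`, with dropout applied to the sublayer output before the addition.
- **Output**: a linear projection to the vocabulary size followed by a softmax.
- **Base model hyperparameters**: `d_model = 512`, `h = 8` heads, `d_ff = 2048`, dropout `0.1`.
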
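
For reference, the building blocks this exercise asks about are defined in the paper by the following equations. The notation follows the paper: $h$ is the number of heads and $d_k$ is the per-head key dimension.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})
$$

$$
\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2
$$

$$
\mathrm{SublayerOutput}(x) = \mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
$$

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
$$
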
## 2: Benefits of Parameter Sharing with a Shared BPE Vocabulary

In the Transformer model, the same Byte Pair Encoding (BPE) vocabulary is shared between the source (input) and target (output) languages, so both texts are tokenized into the same set of subwords. This allows the paper to share a single weight matrix between the two embedding layers and the pre-softmax linear transformation, which provides several advantages:

1. **Efficiency**: A shared vocabulary lets a single `vocab_size × d_model` embedding matrix serve both the source and the target (and, as in the paper, the output projection), which reduces the overall number of parameters.
2. **Consistency**: Using a shared vocabulary ensures that the model can easily relate similar or identical subwords between the source and target languages. This is particularly useful in cases where the source and target languages share common roots or loanwords.
3. **Simplification**: A shared vocabulary simplifies preprocessing and training, as only one vocabulary and one tokenizer need to be maintained.

However, a shared vocabulary can also present challenges, such as the need to balance the representation of both languages within a single vocabulary, which might be difficult if the languages are very different or have vastly different character sets.
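
The parameter saving can be made concrete with a small sketch of weight tying in PyTorch. This is an illustration under assumptions, not the notebook's implementation: the names `shared_embedding`, `output_projection`, and `count_unique_parameters` are hypothetical, and the `Transformer` class in section 3 keeps separate `src_embedding` and `tgt_embedding` tables.

```python
import math

import torch
from torch import nn

vocab_size, d_model = 100, 512

# One embedding table used for both source and target tokens.
shared_embedding = nn.Embedding(vocab_size, d_model)

# nn.Linear(d_model, vocab_size) stores its weight with shape
# (vocab_size, d_model), the same shape as the embedding table,
# so the two modules can point at the same parameter.
output_projection = nn.Linear(d_model, vocab_size, bias=False)
output_projection.weight = shared_embedding.weight


def count_unique_parameters(*modules):
    """Count parameter elements, counting shared tensors only once."""
    seen = {}
    for module in modules:
        for p in module.parameters():
            seen[id(p)] = p.numel()
    return sum(seen.values())


tied = count_unique_parameters(shared_embedding, output_projection)
untied = 3 * vocab_size * d_model  # separate source, target, and output matrices

# The paper multiplies the embeddings by sqrt(d_model) before adding positions.
tokens = torch.randint(0, vocab_size, (2, 5))
hidden = shared_embedding(tokens) * math.sqrt(d_model)
logits = output_projection(hidden)  # shape: (2, 5, vocab_size)

print(f"tied: {tied} parameters, untied: {untied} parameters")
```

With these toy sizes the tied setup stores one `100 × 512` matrix where an untied setup would store three.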

## 3: Implementing the Transformer Model


In [8]:
from modelling.attention import MultiHeadAttention
from modelling.feedforward import PointWiseFeedForward
from modelling.layernorm import LayerNorm
from modelling.functional import TransformerDecoderLayer, BaseTransformerLayer
from modelling.positional_encoding import PositionalEncoding

from torch import nn
import torch

In [9]:
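class Transformer(nn.Module):
    """Encoder-decoder transformer built from the layers of previous practicals.

    Args:
        vocab_size: size of the vocabulary
        d_model: dimensionality of the embedding layer
        n_heads: number of heads in the multi-head attention layers
        num_encoder_layers / num_decoder_layers: number of encoder / decoder layers
        dim_feedforward: dimensionality of the feedforward layer
        dropout: dropout probability
        max_len: maximum length of the input sequence
    """

    def __init__(self, vocab_size, d_model, n_heads, num_encoder_layers,
                 num_decoder_layers, dim_feedforward, dropout, max_len):
        super().__init__()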
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.num_encoder_layers = num_encoder_layers
        self.num_decoder_layers = num_decoder_layers
        self.dim_feedforward = dim_feedforward
        self.max_len = max_len

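        # Separate embedding tables for source and target tokens. They are not
        # tied here, so the parameter sharing discussed in section 2 is not applied.
        self.src_embedding = nn.Embedding(vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(vocab_size, d_model)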

        self.pos_encoder = PositionalEncoding(d_model, max_len, dropout)

        self.dropout = nn.Dropout(dropout)

        self.transformer_encoder = nn.ModuleDict({
            f"encoder_layer_{i}": BaseTransformerLayer(d_model, n_heads, dim_feedforward, dropout) 
            for i in range(num_encoder_layers)
        })

        self.transformer_decoder = nn.ModuleDict({
            f"decoder_layer_{i}": TransformerDecoderLayer(d_model, n_heads, dim_feedforward, dropout) 
            for i in range(num_decoder_layers)
        })

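        # Output head: project to the vocabulary size and normalize to probabilities.
        # Note: nn.CrossEntropyLoss expects unnormalized logits, so return the
        # linear output directly if the model is trained with that loss.
        self.head = nn.Sequential(
            nn.Linear(d_model, vocab_size),
            nn.Softmax(dim=-1)
        )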


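    def forward(self, src, tgt):
        """Return output token probabilities of shape (batch, tgt_len, vocab_size)
        for source and target token ids of shape (batch, seq_len)."""
        # Note: no attention masks are passed to the layers in this call.

        # Encoder: embed, add positional encoding, apply dropout, run the stack.
        src = self.src_embedding(src)
        src = self.pos_encoder(src)
        src = self.dropout(src)
        for encoder_layer in self.transformer_encoder.values():
            src = encoder_layer(src)

        # Decoder: same input pipeline; each layer also receives the
        # encoder output (src).
        tgt = self.tgt_embedding(tgt)
        tgt = self.pos_encoder(tgt)
        tgt = self.dropout(tgt)
        for decoder_layer in self.transformer_decoder.values():
            tgt = decoder_layer(tgt, src)

        return self.head(tgt)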

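# Example usage: base-model hyperparameters with a toy vocabulary of 100 tokens
transformer = Transformer(vocab_size=100, d_model=512, n_heads=8,
                          num_encoder_layers=6, num_decoder_layers=6,
                          dim_feedforward=2048, dropout=0.1, max_len=5000)
x_in = torch.randint(0, 100, (32, 10))  # source token ids: (batch, src_len)
y_in = torch.randint(0, 100, (32, 10))  # target token ids: (batch, tgt_len)
out = transformer(x_in, y_in)
print(out.shape)  # (batch, tgt_len, vocab_size)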

torch.Size([32, 10, 100])
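
The forward pass above calls the layers without attention masks. For padded batches and autoregressive training, the decoder self-attention usually needs a causal mask combined with a padding mask. How masks are passed depends on the signatures of `BaseTransformerLayer` and `TransformerDecoderLayer` from the previous practicals, which are not shown here, so the sketch below only builds the masks and applies them to dummy attention scores. The helper names `make_causal_mask` and `make_padding_mask`, the padding id `pad_id=0`, and the convention that `True` marks positions that may be attended to are all assumptions.

```python
import torch


def make_causal_mask(seq_len):
    """Lower-triangular mask: query position i may attend to key positions <= i."""
    return torch.ones(seq_len, seq_len).tril().bool()


def make_padding_mask(token_ids, pad_id=0):
    """Mask of shape (batch, seq_len): True for real tokens, False for padding."""
    return token_ids != pad_id


# Toy target batch, padded with the assumed pad_id 0.
tgt = torch.tensor([[5, 7, 9, 0],
                    [3, 4, 0, 0]])

causal = make_causal_mask(tgt.size(1))   # (tgt_len, tgt_len)
padding = make_padding_mask(tgt)         # (batch, tgt_len)

# Broadcast to (batch, query, key): a key is visible if it is not in the
# future of the query and is not a padding token.
combined = causal.unsqueeze(0) & padding.unsqueeze(1)

# Apply the mask to dummy (all-zero) attention scores before the softmax.
scores = torch.zeros(2, 4, 4)
masked_scores = scores.masked_fill(~combined, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights[1, 3])  # attention of the last query in the second sequence
```

With uniform scores, the last query of the second sequence spreads its attention evenly over the two non-padding keys and assigns zero weight to the padded positions.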
