# Overview

We have several notebooks that introduce the Transformer architecture:

* [Encoder in Transformer](https://www.kaggle.com/code/aisuko/encoder-in-transformers-architecture)
* [Decoder in Transformer](https://www.kaggle.com/code/aisuko/decoder-in-transformers-architecture)
* [Multiple Head Attention](https://www.kaggle.com/code/aisuko/mask-multi-multi-head-attention)

## Let's give a short review of these components.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/973/089/747/694/067/original/3de6c0032f0a7e53.webp)


**Encoder**

Each encoder block has two sub-layers: a `Multi-Head Attention` mechanism and a fully connected `Feed-Forward network`. There is a residual connection around each of the two sub-layers, followed by layer normalization of the sub-layer's output. All sub-layers in the model and the embedding layers produce outputs of dimension $d_{model}=512$.

**Decoder**

The decoder follows a similar structure, but it inserts a third sub-layer that performs multi-head attention over the output of the encoder block. The self-attention sub-layer in the decoder block is also modified to prevent positions from attending to subsequent positions. This masking ensures that the predictions for position `i` depend solely on the known outputs at positions less than `i`.

Both the encoder and decoder blocks are repeated N times. In the original paper, it is N=6, and we will define a similar value in this notebook.

# Input Embeddings

The `InputEmbeddings` class below is responsible for converting the input token IDs into numerical vectors of `d_model` dimensions. To prevent our input embeddings from becoming extremely small, we scale them by multiplying them by $\sqrt{d_{model}}$.

In [1]:
import math
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model=d_model # Dimension of vectors (512)
        self.vocab_size=vocab_size # Size of the vocabulary
        self.embedding=nn.Embedding(vocab_size, d_model)
    
    def forward(self, x):
        return self.embedding(x)*math.sqrt(self.d_model) # scaling the embeddings by sqrt(d_model)
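
As a quick sanity check of the scaling step, the sketch below embeds a random batch of token IDs and compares the spread of the values before and after multiplying by $\sqrt{d_{model}}$. The vocabulary size, batch shape, and seed are arbitrary values chosen for illustration; they are not taken from this notebook.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

d_model, vocab_size = 512, 1000  # vocab_size is an assumed toy value
embedding = nn.Embedding(vocab_size, d_model)

# A batch of 2 sequences with 5 token IDs each: (batch, seq_len)
token_ids = torch.randint(0, vocab_size, (2, 5))

raw = embedding(token_ids)           # (batch, seq_len, d_model)
scaled = raw * math.sqrt(d_model)    # same scaling as InputEmbeddings.forward

print(raw.shape)
print(raw.std().item(), scaled.std().item())
```

Because every element is multiplied by the same constant, the standard deviation of the scaled embeddings is exactly $\sqrt{512}\approx 22.6$ times that of the raw embeddings.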

# Positional Encoding

In the original paper, the authors add the positional encodings to the input embeddings at the bottom of both the encoder and decoder block so the model can have some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two vectors can be summed and we can combine the semantic content from the word embeddings and positional information from the positional encodings.

In the `PositionalEncoding` class below, we will create a matrix of positional encodings `pe` with dimensions `(seq_len, d_model)`. We will start by filling it with 0s. We will then apply the sine function to even indices of the positional encoding matrix while the cosine function is applied to the odd ones.

$$\text{Even indices } (2i): \quad PE(pos,2i)=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

$$\text{Odd indices } (2i+1): \quad PE(pos, 2i+1)=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$

We apply the sine and cosine functions because they allow the model to determine the position of a word based on the position of other words in the sequence, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. This happens due to the properties of sine and cosine functions, where a shift in the input results in a predictable change in the output.
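
The linear-shift property can be checked numerically. The sketch below builds a small positional encoding matrix from the formulas above and then constructs a block-diagonal matrix that depends only on the offset $k$, using the angle-addition identities for sine and cosine. The sizes `d_model=8`, `seq_len=50`, and `k=3` are toy values assumed for the example.

```python
import math
import torch

d_model, seq_len, k = 8, 50, 3  # toy sizes, assumed for illustration

# Positional encodings from the sine/cosine formulas
position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
freqs = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * freqs)
pe[:, 1::2] = torch.cos(position * freqs)

# Block-diagonal matrix that depends only on the offset k.
# For each frequency w and angle a = pos * w:
#   [sin(a + k*w), cos(a + k*w)] = [sin(a), cos(a)] @ [[c, -s], [s, c]]
# with c = cos(k*w) and s = sin(k*w).
shift = torch.zeros(d_model, d_model)
for i, w in enumerate(freqs.tolist()):
    c, s = math.cos(k * w), math.sin(k * w)
    shift[2 * i, 2 * i] = c
    shift[2 * i, 2 * i + 1] = -s
    shift[2 * i + 1, 2 * i] = s
    shift[2 * i + 1, 2 * i + 1] = c

# Applying the same matrix to every PE_pos yields PE_{pos+k}
shifted = pe[:-k] @ shift
print(torch.allclose(shifted, pe[k:], atol=1e-4))
```

The matrix `shift` contains no reference to `pos`, which is what makes relative offsets easy for a linear layer to express.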

In [2]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model:int, seq_len:int, dropout:float) -> None:
        super().__init__()
        self.d_model=d_model # Dimensionality of the model
        self.seq_len=seq_len # Maximum sequence length
        self.dropout=nn.Dropout(dropout) # dropout layer to prevent overfitting
        
        # creating a positional encoding matrix of shape (seq_len, d_model) filled with zeros
        pe=torch.zeros(seq_len, d_model)
        
        # creating a tensor representing positions (0 to seq_len -1)
        position=torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1) # transforming `position` into a 2D tensor[seq_len,1]
        
        # creating the division term for the positional encoding formula
        div_term=torch.exp(torch.arange(0, d_model, 2).float()*(-math.log(10000.0)/d_model))
        
        # apply sine to even indices in pe
        pe[:,0::2]=torch.sin(position*div_term)
        
        # apply cosine to odd indices in pe
        pe[:,1::2]=torch.cos(position*div_term)
        
        # adding an extra dimension at the beginning of pe matrix for batch handling: (1, seq_len, d_model)
        pe=pe.unsqueeze(0)
        
        # registering 'pe' as buffer, buffer is a tensor not considered as a model parameter
        self.register_buffer('pe',pe)
    
    def forward(self, x):
        # adding positional encoding to the input tensor X
        x=x+(self.pe[:,:x.shape[1],:].requires_grad_(False))
        return self.dropout(x) # dropout for regularization

# Layer Normalization

We have several normalization layers called `Add&Norm`.

The `LayerNormalization` class below performs layer normalization on the input data. During its forward pass, we compute the mean and standard deviation of the input data along the last dimension. We then normalize the input data by subtracting the mean and dividing by the standard deviation plus a small number called **epsilon** to avoid any division by zero. This process results in a normalized output with a mean of 0 and a standard deviation of 1.

We will then scale the normalized output by a learnable parameter `alpha` and add a learnable parameter called `bias`. The training process is responsible for adjusting these parameters. The final result is a layer-normalized tensor, which ensures that the scale of the inputs to layers in the network is consistent.

In [3]:
# creating layer normalization
class LayerNormalization(nn.Module):
    # we define epsilon as 0.000001 to avoid division by zero
    def __init__(self, eps: float=10**-6)-> None:
        super().__init__()
        self.eps=eps
        
        # we define alpha as a trainable parameter and initialize it with ones
        self.alpha=nn.Parameter(torch.ones(1)) # One-dimensional tensor that will be used to scale the input data
        
        # we define bias as a trainable parameter and initialize it with zeros
        self.bias=nn.Parameter(torch.zeros(1)) # One-dimensional tensor that will be added to the input data
        
    def forward(self, x):
        mean=x.mean(dim=-1, keepdim=True) # computing the mean of the input data. Keeping the number of dimensions unchanged
        std=x.std(dim=-1, keepdim=True) # computing the standard deviation of the input data. Keeping the number of dimensions unchanged
        
        # returning the normalized input
        return self.alpha*(x-mean)/(std+self.eps)+self.bias
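
To see the effect of the normalization step in isolation, the sketch below applies the same formula to a random tensor whose mean and scale are deliberately far from 0 and 1. The input shape and the shift and scale applied to it are assumptions made for the example. It also compares the result with PyTorch's built-in `torch.nn.functional.layer_norm`, which uses the biased variance and places epsilon inside the square root, so the two agree only approximately.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed

# Random activations with mean around 5 and standard deviation around 3
x = torch.randn(2, 4, 512) * 3.0 + 5.0
eps = 1e-6

# Same computation as LayerNormalization.forward with alpha=1 and bias=0
mean = x.mean(dim=-1, keepdim=True)
std = x.std(dim=-1, keepdim=True)   # unbiased estimate (PyTorch default)
normalized = (x - mean) / (std + eps)

print(normalized.mean(dim=-1).abs().max().item())  # close to 0
print(normalized.std(dim=-1).mean().item())        # close to 1

# Built-in version for comparison (biased variance, eps inside the sqrt)
builtin = F.layer_norm(x, (512,))
print((normalized - builtin).abs().max().item())   # small but not zero
```

Each position in the sequence is normalized independently, since the statistics are taken over the last dimension only.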

# Feed-Forward Network

In the fully connected feed-forward network, we apply two linear transformations with a ReLU activation in between. We can mathematically represent this operation as:

$$FFN(x)=\max(0, xW_{1}+b_{1})W_{2}+b_{2}$$

$W_{1}$ and $W_{2}$ are the weights, while $b_{1}$ and $b_{2}$ are the biases of the two linear transformations.

In the `FeedForwardBlock` below, we will define the two linear transformations, `self.linear_1` and `self.linear_2`, and the inner-layer dimension `d_ff`. The input data will first pass through the `self.linear_1` transformation, which increases its dimensionality from `d_model` to `d_ff`. The output of this operation passes through the ReLU activation function, which introduces non-linearity so the network can learn more complex patterns, and the `self.dropout` layer is applied to mitigate overfitting. The final operation applies the `self.linear_2` transformation to the dropout-modified tensor, which transforms it back to the original `d_model` dimension.

In [4]:
class FeedForwardBlock(nn.Module):
    def __init__(self,d_model:int, d_ff:int, dropout:float) -> None:
        super().__init__()
        # First linear transformation
        self.linear_1=nn.Linear(d_model, d_ff) # W1 & b1
        self.dropout=nn.Dropout(dropout) # Dropout to prevent overfitting
        
        # Second linear transformation
        self.linear_2=nn.Linear(d_ff, d_model) # W2 & b2
    
    def forward(self, x):
        # (batch, seq_len, d_model) --> (batch, seq_len, d_ff) --> (batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
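
The shape changes described above can be traced step by step. This is a minimal sketch that uses two standalone `nn.Linear` layers with the paper's sizes; the batch size of 2 and sequence length of 10 are assumed values, and dropout is left out so the result is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed

d_model, d_ff = 512, 2048
linear_1 = nn.Linear(d_model, d_ff)   # W1 and b1
linear_2 = nn.Linear(d_ff, d_model)   # W2 and b2

x = torch.randn(2, 10, d_model)       # (batch, seq_len, d_model)

hidden = torch.relu(linear_1(x))      # max(0, x W1 + b1): (batch, seq_len, d_ff)
out = linear_2(hidden)                # back to (batch, seq_len, d_model)

print(x.shape, hidden.shape, out.shape)
```

The same two linear layers are applied to every position independently, which is why the sequence length never appears in the layer definitions.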

# Multi-Head Attention

The Multi-Head Attention is the most crucial component of the Transformer. It is responsible for helping the model to understand complex relationships and patterns in the data.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/111/963/621/063/011/468/original/629903a63a938f0a.png)

The Multi-Head Attention block receives the input data split into queries, keys, and values organized into matrices Q, K and V. Each matrix contains different facets of the input, and they have the same dimensions as the input.

![](https://hostux.social/system/media_attachments/files/111/603/992/766/474/377/small/5df72b068852f4da.webp)

We then linearly transform each matrix by their respective weight matrices $W^Q$, $W^K$ and $W^V$. These transformations will result in new matrices $Q'$, $K'$ and $V'$, which will be split into smaller matrices corresponding to different heads $h$, allowing the model to attend to information from different representation subspaces in parallel. This split creates multiple sets of queries, keys, and values for each head.

Finally, we concatenate every head into an $H$ matrix, which is then transformed by another weight matrix $W^O$ to produce the multi-head attention output, a matrix $MH\text{-}A$ that retains the input dimensionality.

In [5]:
class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h:int, dropout:float)-> None: # h= number of heads
        super().__init__()
        self.d_model=d_model
        self.h=h
        
        # we ensure that the dimensions of the model is divisible by the number of heads
        assert d_model %h==0, 'd_model is not divisible by h'
        
        # d_k is the dimension of each attention head's key, query, and values vectors
        self.d_k =d_model // h # d_k formula, like in the original paper
        
        # defining the weight matrices
        self.w_q=nn.Linear(d_model, d_model) # W_q
        self.w_k=nn.Linear(d_model, d_model) # W_k
        self.w_v=nn.Linear(d_model, d_model) # W_v
        self.w_o=nn.Linear(d_model, d_model) # W_o
        
        self.dropout=nn.Dropout(dropout) # Dropout layer to avoid overfitting
        
    @staticmethod
    def attention(query, key, value, mask, dropout:nn.Dropout): # mask=>when we want certain words to not interact with others, we hide them
        d_k=query.shape[-1] # the last dimension of query, key and value
        
        # we calculate the Attention(Q,K,V) as in the formula in the image above
        attention_scores=(query@key.transpose(-2, -1))/math.sqrt(d_k) # @=matrix multiplication sign in PyTorch
        
        # before applying the softmax, we apply the mask to hide some interactions between words
        if mask is not None:
            attention_scores.masked_fill_(mask==0, -1e9) # replace each value where mask is equal to 0 by -1e9
        attention_scores=attention_scores.softmax(dim=-1) # applying softmax
        if dropout is not None:
            attention_scores=dropout(attention_scores) # we apply dropout to prevent overfitting
        
        return (attention_scores @ value), attention_scores # multiply the output matrix by the V matrix, as in the formula
    
    def forward(self, q,k,v, mask):
        query=self.w_q(q) # Q' matrix
        key=self.w_k(k) # K' matrix
        value=self.w_v(v) # V' matrix
        
        # splitting results into smaller matrices for the different heads
        # splitting embeddings (third dimension) into h parts
        
        # Transpose => bring the head to the second dimension
        query=query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1,2)
        
        # Transpose => bring the head to the second dimension
        key=key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1,2)
        
        # Transpose => bring the head to the second dimension
        value=value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1,2)
        
        # obtaining the output and the attention scores
        x, self.attention_scores=MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)
        
        # obtaining the H matrix
        x=x.transpose(1,2).contiguous().view(x.shape[0], -1, self.h * self.d_k)
        
        # multiply the H matrix by the weight matrix W_o, resulting in the MH-A matrix
        return self.w_o(x)
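
The splitting and merging of heads is mostly a matter of reshaping. The sketch below follows the same `view` and `transpose` steps on random stand-ins for the projected $Q'$, $K'$ and $V'$ matrices, without a mask or dropout. The sizes (`d_model=16`, `h=4`, a batch of 2 sequences of length 4) are toy values assumed for the example, and `split_heads` is a hypothetical helper name.

```python
import math
import torch

torch.manual_seed(0)  # arbitrary seed

batch, seq_len, d_model, h = 2, 4, 16, 4  # toy sizes
d_k = d_model // h

# Random stand-ins for the projected Q', K', V': (batch, seq_len, d_model)
q = torch.randn(batch, seq_len, d_model)
k = torch.randn(batch, seq_len, d_model)
v = torch.randn(batch, seq_len, d_model)

def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
    return t.view(batch, seq_len, h, d_k).transpose(1, 2)

qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

# Scaled dot-product attention, computed for all heads in parallel
weights = (qh @ kh.transpose(-2, -1) / math.sqrt(d_k)).softmax(dim=-1)
heads = weights @ vh                                   # (batch, h, seq_len, d_k)

# Concatenate the heads back into one matrix H: (batch, seq_len, d_model)
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)

print(weights.shape, heads.shape, merged.shape)
```

Each row of `weights` sums to 1, and the first `d_k` columns of `merged` are exactly the output of the first head.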

# Residual Connection

When we look at the architecture of the Transformer, we see that each sub-layer, including the `self-attention` and `Feed Forward` blocks, adds its output to its input before passing it to the `Add&Norm` layer. This approach integrates the output with the original input in the `Add&Norm` layer. This process is known as a skip connection, which allows the Transformer to train deep networks more effectively by providing a shortcut for the gradient to flow through during backpropagation.

The `ResidualConnection` class below is responsible for this process.

In [6]:
class ResidualConnection(nn.Module):
    def __init__(self, dropout: float) -> None:
        super().__init__()
        # we use a dropout layer to prevent overfitting
        self.dropout=nn.Dropout(dropout)
        # we use a normalization layer
        self.norm=LayerNormalization()
        
    def forward(self, x, sublayer):
        # we normalize the input, apply the sublayer and dropout, and add the result to the original input `x`. This creates the residual connection
        return x+self.dropout(sublayer(self.norm(x)))
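
Note that the `forward` method above normalizes the input before the sub-layer (often called pre-norm), whereas the original paper describes $LayerNorm(x + Sublayer(x))$ (post-norm). The sketch below contrasts the two orderings and illustrates the shortcut: when the sub-layer contributes nothing, the pre-norm residual passes the input through unchanged. It uses `nn.LayerNorm` and a zero-initialized `nn.Linear` as stand-ins; these, and the toy sizes, are assumptions for the example rather than the classes defined in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed

d_model = 8  # toy size
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)

# Make the sub-layer output exactly zero to isolate the skip path
nn.init.zeros_(sublayer.weight)
nn.init.zeros_(sublayer.bias)

x = torch.randn(2, 3, d_model)

pre_norm = x + sublayer(norm(x))    # ordering used by ResidualConnection
post_norm = norm(x + sublayer(x))   # ordering described in the original paper

print(torch.allclose(pre_norm, x))   # the input passes straight through
print(torch.allclose(post_norm, x))  # the input is still normalized
```

With pre-norm the identity path from input to output is untouched by normalization, which is one reason this variant is a common choice in practice.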

# Encoder

We will now build the encoder. We create the `EncoderBlock` class, consisting of the Multi-Head Attention and Feed Forward layers, plus the residual connections. In the original paper, the Encoder Block repeats six times. We create the `Encoder` class as an assembly of multiple `EncoderBlock`s. We also add layer normalization as a final step after processing the input through all its blocks.

In [7]:
# building encoder block
class EncoderBlock(nn.Module):
    # this block takes in the MultiHeadAttentionBlock and FeedForwardBlock, as well as the dropout rate for the residual connections.
    def __init__(self, self_attention_block: MultiHeadAttentionBlock, feed_forward_block:FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        # Storing the self-attention block and feed-forward block
        self.self_attention_block=self_attention_block
        self.feed_forward_block=feed_forward_block
        # 2 residual connections with dropout
        self.residual_connections=nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])
        
    def forward(self, x, src_mask):
        # Applying the first residual connection with the self-attention block
        # Three x corresponding to query, key and value inputs plus source mask
        x=self.residual_connections[0](x,lambda x: self.self_attention_block(x,x,x,src_mask))
        
        # Applying the second residual connection with the feed-forward block
        x=self.residual_connections[1](x, self.feed_forward_block)
        
        # output tensor after applying self-attention and feed-forward layers with residual connections
        return x


class Encoder(nn.Module):
    def __init__(self, layers: nn.ModuleList)-> None:
        super().__init__()
        self.layers=layers # storing the EncoderBlocks
        # layer for the normalization of the output of the encoder layers
        self.norm=LayerNormalization()
    
    def forward(self, x, mask):
        # Iterating over each EncoderBlock stored in self.layers
        for layer in self.layers:
            # Applying each EncoderBlock to the input tensor 'x'
            x=layer(x, mask)
        return self.norm(x) # Normalizing output

# Decoder

Similarly, the Decoder also consists of several DecoderBlocks that repeat six times in the original paper. The main difference is that it has an additional sub-layer that performs multi-head attention with a `cross-attention` component that uses the output of the Encoder as its keys and values while using the Decoder's input as queries. For the Output Embedding, we can use the same `InputEmbeddings` class we use for the Encoder. You can also notice that the self-attention sub-layer is masked, which restricts the model from accessing future elements in the sequence.

We will start by building the `DecoderBlock` class, and then we will build the `Decoder` class, which will assemble multiple `DecoderBlock`s.

In [8]:
class DecoderBlock(nn.Module):
    # the DecoderBlock takes in two MultiHeadAttentionBlock. One is self-attention, while the other is cross-attention.
    # it also takes in the feed-forward block and the dropout rate
    def __init__(self, self_attention_block: MultiHeadAttentionBlock, cross_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float)->None:
        super().__init__()
        self.self_attention_block=self_attention_block
        self.cross_attention_block=cross_attention_block
        self.feed_forward_block=feed_forward_block
        # list of three Residual Connection with dropout rate
        self.residual_connections=nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])
    
    def forward(self, x, encoder_output, src_mask, tgt_mask):
        # self-attention block with query, key and value plus the target language mask
        x=self.residual_connections[0](x, lambda x: self.self_attention_block(x,x,x, tgt_mask))
        # the cross-attention block using two `encoder_output` for key and value plus the source language mask. It also takes in `x` for Decoder queries
        x=self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        
        # feed-forward block with residual connections
        x=self.residual_connections[2](x,self.feed_forward_block)
        return x


class Decoder(nn.Module):
    def __init__(self, layers: nn.ModuleList)-> None:
        super().__init__()
        
        self.layers=layers
        self.norm=LayerNormalization()
    
    def forward(self, x, encoder_output, src_mask, tgt_mask):
        for layer in self.layers:
            x=layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)
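
The `tgt_mask` passed to the decoder has to hide both future positions and padding. The sketch below shows one way such a mask could be built and what it does to the attention weights. The helper name `causal_mask`, the padding ID of 1, and the example token IDs are all assumptions made for illustration.

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Hypothetical helper: True where attention is allowed,
    # i.e. where the key position is not after the query position
    return torch.tril(torch.ones(1, size, size)).bool()

pad_id = 1                              # assumed ID of the [PAD] token
tgt = torch.tensor([[2, 7, 9, 1, 1]])   # one target sequence, last two tokens are padding

# Padding mask broadcast over heads and query positions: (batch, 1, 1, seq_len)
padding_mask = (tgt != pad_id).unsqueeze(1).unsqueeze(1)

# Combined target mask: (batch, 1, seq_len, seq_len)
tgt_mask = padding_mask & causal_mask(tgt.size(1))

# Dummy attention scores for a single head, masked the same way as in `attention`
scores = torch.zeros(1, 1, 5, 5)
weights = scores.masked_fill(tgt_mask == 0, -1e9).softmax(dim=-1)

print(tgt_mask[0, 0].int())
print(weights[0, 0])
```

After the softmax, the first query position attends only to itself, and no position places any weight on the two padded tokens.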

# ProjectionLayer

The `ProjectionLayer` class below is responsible for converting the output of the model into a probability distribution over the `vocabulary`, where we select each output token from a vocabulary of possible tokens.

In [9]:
class ProjectionLayer(nn.Module):
    def __init__(self, d_model: int, vocab_size: int)-> None: # model dimension and the size of the output vocabulary
        super().__init__()
        # linear layer for projecting the feature space of `d_model` to the output space of `vocab_size`
        self.proj=nn.Linear(d_model, vocab_size)
    def forward(self, x):
        # applying the log Softmax function to the output
        return torch.log_softmax(self.proj(x), dim=-1)
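
Since the projection returns log-probabilities, exponentiating them recovers a distribution that sums to 1 over the vocabulary. The sketch below checks this on a random stand-in for the decoder output and picks the most likely next token greedily. The toy sizes (`d_model=16`, `vocab_size=10`) and the greedy choice are assumptions for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed

d_model, vocab_size = 16, 10  # toy sizes
proj = nn.Linear(d_model, vocab_size)

decoder_output = torch.randn(1, 3, d_model)                    # (batch, seq_len, d_model)
log_probs = torch.log_softmax(proj(decoder_output), dim=-1)    # (batch, seq_len, vocab_size)

# Exponentiating gives probabilities that sum to 1 for each position
probs = log_probs.exp()
print(probs.sum(dim=-1))

# Greedy choice for the last position: the token ID with the highest log-probability
next_token = log_probs[:, -1].argmax(dim=-1)
print(next_token.shape)
```

Working in log space pairs naturally with a negative log-likelihood loss during training.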

# Building the Transformer

We will bring together all the components of the model's architecture.

In [10]:
# Creating the Transformer Architecture
class Transformer(nn.Module):
    # This takes in the encoder and decoder, as well the embeddings for the source and target language.
    # It also takes in the Positional Encoding for the source and target language, as well as the projection layer
    def __init__(self, encoder: Encoder, decoder: Decoder, src_embed: InputEmbeddings, tgt_embed: InputEmbeddings, src_pos: PositionalEncoding, tgt_pos: PositionalEncoding, projection_layer:ProjectionLayer) -> None:
        super().__init__()
        self.encoder=encoder
        self.decoder=decoder
        self.src_embed=src_embed
        self.tgt_embed=tgt_embed
        self.src_pos=src_pos
        self.tgt_pos=tgt_pos
        self.projection_layer=projection_layer
    
    # named `encode` so that it does not shadow the `self.encoder` submodule
    def encode(self, src, src_mask):
        # applying source embeddings to the input source language
        src=self.src_embed(src)
        # applying source positional encoding to the source embeddings
        src=self.src_pos(src)
        # passing the result through the encoder; the source mask prevents attention to certain elements
        return self.encoder(src, src_mask)

    # named `decode` so that it does not shadow the `self.decoder` submodule
    def decode(self, encoder_output, src_mask, tgt, tgt_mask):
        tgt=self.tgt_embed(tgt) # applying target embeddings to the input target language (tgt)
        tgt=self.tgt_pos(tgt) # applying target positional encoding to the target embeddings
        
        # passing the target embeddings, the output of the encoder, and both source and target masks through the decoder
        # The target mask ensures that the model won't see future elements of the sequence
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)
    
    # applying projection layer with the log-Softmax function to the Decoder output
    def project(self, x):
        return self.projection_layer(x)
        

# Initializing Transformer

We now define a function called `build_transformer`, in which we set the hyperparameters and assemble everything we need for a fully operational Transformer model for the task of `machine translation`. We use the same hyperparameters as the original paper: $d_{model}=512$, $N=6$, $h=8$, dropout rate $P_{drop}=0.1$ and $d_{ff}=2048$.

In [11]:
def build_transformer(src_vocab_size: int, tgt_vocab_size: int, src_seq_len:int, tgt_seq_len:int, d_model:int=512, N:int=6, h:int=8, dropout:float=0.1, d_ff:int=2048)->Transformer:
    # creating embedding layers
    src_embed=InputEmbeddings(d_model, src_vocab_size) # source language (Source Vocabulary to 512-dimensional vectors)
    tgt_embed=InputEmbeddings(d_model, tgt_vocab_size) # target language (Target vocabulary to 512-dimensional vectors)
    
    # creating positional encoding layers
    src_pos=PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos=PositionalEncoding(d_model, tgt_seq_len, dropout)
    
    # creating EncoderBlocks
    encoder_blocks=[]
    for _ in range(N):
        encoder_self_attention_block=MultiHeadAttentionBlock(d_model, h, dropout) # self-attention
        feed_forward_block=FeedForwardBlock(d_model, d_ff, dropout) # feedforward
        
        # combine layers into an EncoderBlock
        encoder_block=EncoderBlock(encoder_self_attention_block, feed_forward_block, dropout)
        encoder_blocks.append(encoder_block) # appending EncoderBlock to the list of EncoderBlocks
        
    # creating decoder blocks
    decoder_blocks=[]
    for _ in range(N):
        decoder_self_attention_block=MultiHeadAttentionBlock(d_model, h, dropout)
        decoder_cross_attention_block=MultiHeadAttentionBlock(d_model, h, dropout) # cross-attention
        feed_forward_block=FeedForwardBlock(d_model, d_ff, dropout) # feedforward
        
        # combining layers into a DecoderBlock
        decoder_block=DecoderBlock(decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)
        decoder_blocks.append(decoder_block) # appending DecoderBlock to the list of DecoderBlocks
    
    # creating the Encoder and Decoder by using the EncoderBlocks and DecoderBlocks lists
    encoder=Encoder(nn.ModuleList(encoder_blocks))
    decoder=Decoder(nn.ModuleList(decoder_blocks))
    
    # Creating projection layer
    projection_layer=ProjectionLayer(d_model, tgt_vocab_size) # map the output of Decoder to the Target Vocabulary Space
    
    # creating the transformer by combining everything above
    transformer=Transformer(encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer)
    
    # Initialize the parameters
    for p in transformer.parameters():
        if p.dim()>1:
            nn.init.xavier_uniform_(p)
    
    # Assembled and initialized Transformer, Ready to be trained and validated!
    return transformer
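
The initialization loop at the end of `build_transformer` only touches parameters with more than one dimension, so weight matrices are re-initialized while bias vectors keep their defaults. The sketch below applies the same loop to a small stand-in model (an assumption for the example, not the Transformer itself) and checks the weights against the Xavier uniform bound $\sqrt{6/(fan_{in}+fan_{out})}$.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed

# Stand-in model with the same layer sizes as one feed-forward block
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

# Keep a copy of the first bias to confirm that 1-D parameters are left alone
bias_before = model[0].bias.detach().clone()

# Same loop as in build_transformer
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Xavier uniform samples from U(-bound, bound)
bound = math.sqrt(6.0 / (512 + 2048))
print(bound, model[0].weight.abs().max().item())
print(torch.equal(bias_before, model[0].bias))
```

The bound depends only on the layer's input and output sizes, which keeps the variance of activations roughly stable from layer to layer at the start of training.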

# Tokenizer

Tokenization is a crucial preprocessing step for our Transformer model. In this step, we convert raw text into a numeric format that the model can process. There are several tokenization strategies. We will use `word-level` tokenization to transform each word in a sentence into a token. After tokenizing a sentence, we map each token to a unique integer ID based on the vocabulary built from the training corpus when the tokenizer is trained. Each integer represents a specific word in the vocabulary.

Besides the words in the training corpus, Transformers use special tokens for specific purposes. These are some that we will define right away:

* **[UNK]**: This token is used to identify an unknown word in the sequence
* **[PAD]**: Padding token to ensure that all sequences in a batch have the same length, so we pad shorter sentences with this token. We use attention masks to "tell" the model to ignore the padded tokens during training since they don't have any real meaning to the task.
* **[SOS]**: This is a token used to signal the Start of Sentence.
* **[EOS]**: This is a token used to signal the End of Sentence.

In the `build_tokenizer` function below, we ensure a tokenizer is ready to train the model. It checks if there is an existing tokenizer, and if that is not the case, it trains a new tokenizer.

In [13]:
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
from datasets import load_dataset

# Defining Tokenizer
def build_tokenizer(config, ds, lang):
    # creating a file path for the tokenizer
    tokenizer_path=Path(config['tokenizer_file'].format(lang))
    
    # checking if Tokenizer already exists
    if not Path.exists(tokenizer_path):
        # initializing a new word-level tokenizer
        tokenizer=Tokenizer(WordLevel(unk_token='[UNK]'))
        tokenizer.pre_tokenizer=Whitespace() # we will split the text into tokens based on whitespace
        
        # creating a trainer for the new tokenizer
        trainer=WordLevelTrainer(special_tokens=['[UNK]', '[PAD]', '[SOS]','[EOS]'], min_frequency=2) # defining Word Level strategy and special tokens
        
        # training new tokenizer on sentences from the dataset and language specified
        tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path)) # saving trained tokenizer to the file path specified at the beginning of the function
    else:
        tokenizer=Tokenizer.from_file(str(tokenizer_path)) # if the tokenizer already exists, we load it
    return tokenizer # returns the loaded tokenizer or the trained tokenizer
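
To make the idea of word-level tokenization concrete, here is a minimal pure-Python sketch of what training with `min_frequency=2` amounts to conceptually. It is not the `tokenizers` library implementation: the helper names `build_vocab` and `encode`, the ordering of the vocabulary, and the tiny corpus are assumptions for illustration, and plain `str.split` handles punctuation differently from the library's `Whitespace` pre-tokenizer.

```python
from collections import Counter

SPECIAL_TOKENS = ['[UNK]', '[PAD]', '[SOS]', '[EOS]']

def build_vocab(sentences, min_frequency=2):
    # Count how often each whitespace-separated word occurs in the corpus
    counts = Counter(word for s in sentences for word in s.split())
    # Special tokens take the first IDs
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Keep only words seen at least `min_frequency` times, most frequent first
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_frequency:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    # Words that are not in the vocabulary map to the [UNK] ID
    unk = vocab['[UNK]']
    return [vocab.get(word, unk) for word in sentence.split()]

corpus = ['the cat sat', 'the dog sat', 'the bird flew']
vocab = build_vocab(corpus)
ids = encode('the cat sat', vocab)

print(vocab)
print(ids)
```

In this toy corpus only "the" and "sat" occur at least twice, so "cat" is encoded as the `[UNK]` ID even though it appears in the training data.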

# Loading Dataset

We are going to use the OpusBooks dataset. It consists of two features, `id` and `translation`. The `translation` feature contains pairs of sentences in different languages. We will train the model on the English-Italian pair.

In [14]:
def get_all_sentences(ds, lang):
    for pair in ds:
        yield pair['translation'][lang]

The `get_ds` function is defined to load and prepare the dataset for training and validation. In this function, we build or load the tokenizers, split the dataset, and create DataLoaders, so the model can successfully iterate over the dataset in batches. The function returns the tokenizers for the source and target languages plus the DataLoader objects.

In [16]:
from datasets import load_dataset
from torch.utils.data import DataLoader, random_split

def get_ds(config):
    # the language pairs will be defined in the 'config' dictionary we will build later
    ds_raw=load_dataset('opus_books', f'{config["lang_src"]}-{config["lang_tgt"]}', split='train')
    
    # building or loading tokenizer for both the source and target languages
    tokenizer_src=build_tokenizer(config, ds_raw, config['lang_src'])
    tokenizer_tgt=build_tokenizer(config, ds_raw, config['lang_tgt'])
    
    # splitting the dataset for training and validation
    train_ds_size=int(0.9*len(ds_raw)) #90% for training
    val_ds_size=len(ds_raw)-train_ds_size # 10% for validation
    train_ds_raw, val_ds_raw=random_split(ds_raw, [train_ds_size, val_ds_size]) # randomly splitting the dataset
    
    # processing data with the BilingualDataset class
    train_ds=BilingualDataset(train_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
    val_ds=BilingualDataset(val_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
    
    # iterating over the entire dataset and printing the maximum length found in the sentences of both the source and target languages
    max_len_src=0
    max_len_tgt=0
    
    for pair in ds_raw:
        src_ids=tokenizer_src.encode(pair['translation'][config['lang_src']]).ids
        tgt_ids=tokenizer_tgt.encode(pair['translation'][config['lang_tgt']]).ids
        
        max_len_src=max(max_len_src, len(src_ids))
        max_len_tgt=max(max_len_tgt, len(tgt_ids))
    
    print(f'Max length of source sentence: {max_len_src}')
    print(f'Max length of target sentence: {max_len_tgt}')
    
    # creating dataloaders for the training and validation sets
    # Dataloaders are used to iterate over the dataset in batches during training and validation
    train_dataloader=DataLoader(train_ds, batch_size=config['batch_size'], shuffle=True) # Batch size will be defined in the config dictionary
    val_dataloader=DataLoader(val_ds, batch_size=1, shuffle=True)
    
    
    return train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt # returning the dataloader objects and tokenizers
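
The 90/10 split and the batching can be tried out on a toy dataset without downloading anything. In the sketch below, a `TensorDataset` of 100 integers stands in for the real sentence pairs, and the batch size of 8 is an assumed value.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)  # arbitrary seed, makes the split reproducible

# Toy stand-in for the raw dataset: 100 items
ds_raw = TensorDataset(torch.arange(100))

# Same 90% / 10% split logic as in get_ds
train_ds_size = int(0.9 * len(ds_raw))
val_ds_size = len(ds_raw) - train_ds_size
train_ds, val_ds = random_split(ds_raw, [train_ds_size, val_ds_size])

# Batched iteration over the training split
train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)
first_batch = next(iter(train_loader))

print(len(train_ds), len(val_ds))
print(first_batch[0].shape)
```

Computing the validation size as the remainder guarantees that the two sizes always add up to the full dataset, even when 90% is not a whole number.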
    

# Credit

* https://www.kaggle.com/code/lusfernandotorres/transformer-from-scratch-with-pytorch/notebook?scriptVersionId=157547654
* https://www.youtube.com/watch?v=ISNdQcPhsts&t=9595s