Training the best custom Transformer 🤖
-----------------------------------

In this notebook, we continue training the best custom Transformer on the newly extracted sentences from the book **Grammaire de Wolof Moderne**. After hyperparameter tuning with `wandb`, we obtained a best BLEU score of **?** for the French to Wolof translation model. We provide below the main evaluation figures obtained from the hyperparameter search step.

- Parallel coordinates:

- Parameter importance (from [panel]()):


Let us import the required libraries below:

In [1]:
# let us import all necessary libraries
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, T5TokenizerFast, set_seed, AdamW
from wolof_translate.utils.sent_transformers import TransformerSequences
from torch.nn import TransformerEncoderLayer, TransformerDecoderLayer
from torch.utils.data import Dataset, DataLoader, random_split
from wolof_translate.data.dataset_v2 import SentenceDataset
from wolof_translate.utils.sent_corrections import *
from sklearn.model_selection import train_test_split
from torch.optim.lr_scheduler import _LRScheduler
# from custom_rnn.utils.kwargs import Kwargs
from torch.nn.utils.rnn import pad_sequence
from plotly.subplots import make_subplots
from nlpaug.augmenter import char as nac
from torch.nn import functional as F
import plotly.graph_objects as go
from tokenizers import Tokenizer
import matplotlib.pyplot as plt
from tqdm import tqdm, trange
from functools import partial
from torch.nn import utils
from copy import deepcopy
from torch import optim
from typing import *
from torch import nn
import pandas as pd
import numpy as np
import itertools
import evaluate
import random
import string
import shutil
import wandb
import torch
import json
import copy
import os




### Steps

We must first add the classes that we implemented during the hyperparameter search, namely:
- The custom sinusoidal positional encoder
- The custom size prediction module
- The custom Transformer, which requires the stacked PyTorch encoder and decoder layers
- The custom Transformer's learning rate scheduler
- The custom Trainer

We then include them in our `wolof-translate` package.

-------------------

After that, we will continue training the custom Transformer, restoring its parameters from the saved checkpoints.
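
Resuming from a checkpoint generally means restoring the model weights, the optimizer state, and the epoch counter. The snippet below is only a minimal sketch of that pattern in plain PyTorch. The stand-in `nn.Linear` model, the dictionary keys, and the temporary file path are all assumptions made for illustration; the custom Trainer of this notebook may store and load its checkpoints differently.

```python
import os
import tempfile

import torch
from torch import nn, optim

# Stand-in model and optimizer (assumption: the real ones are the custom Transformer and its optimizer)
model = nn.Linear(8, 2)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Run one optimization step so that the optimizer holds some state
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optimizer.step()

# Save a checkpoint (hypothetical path and hypothetical dictionary keys)
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
torch.save(
    {
        "epoch": 5,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    checkpoint_path,
)

# Later: rebuild the same objects, then restore their states
resumed_model = nn.Linear(8, 2)
resumed_optimizer = optim.Adam(resumed_model.parameters(), lr=1e-3)

checkpoint = torch.load(checkpoint_path, map_location="cpu")
resumed_model.load_state_dict(checkpoint["model_state_dict"])
resumed_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

# Training continues from the epoch following the saved one
start_epoch = checkpoint["epoch"] + 1
print(start_epoch)  # 6
```

Restoring the optimizer state along with the weights keeps the moving averages of Adam-like optimizers, so the resumed run does not restart from a cold optimizer.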

-------------------

The last part is to evaluate the model on the test set.

Let us go into our pipeline 👌

### Add custom modules

#### Custom Positional Encoder

Let us add below the positional encoder module, which adds the position of each sequence element to its embedding vector.

In [2]:
%%writefile wolof-translate/wolof_translate/models/transformers/position.py

from torch import nn
import numpy as np
import torch

class PositionalEncoding(nn.Module):

    def __init__(self, n_poses_max: int = 500, d_model: int = 512):
        super(PositionalEncoding, self).__init__()    
        
        self.n_poses = n_poses_max
        
        self.n_dims = d_model
        
        # the angle is calculated as following
        angle = lambda pos, i: pos / 10000 ** (i / self.n_dims)

        # let's initialize the different token positions
        poses = np.arange(0, self.n_poses)

        # let's initialize also the different dimension indexes
        dims = np.arange(0, self.n_dims)

        # let's initialize the index of the different positional vector values
        circle_index = np.arange(0, self.n_dims / 2)

        # let's create the possible combinations between a position and a dimension index
        xv, yv = np.meshgrid(poses, circle_index)

        # let's create a matrix which will contain all the different points initialized
        points = np.zeros((self.n_poses, self.n_dims))

        # let's calculate the circle y axis coordinates
        points[:, ::2] = np.sin(angle(xv.T, yv.T))

        # let's calculate the circle x axis coordinates
        points[:, 1::2] = np.cos(angle(xv.T, yv.T))
        
        self.register_buffer('pe', torch.from_numpy(points).unsqueeze(0))
    
    def forward(self, input_: torch.Tensor):
        
        # let's scale the input
        input_ = input_ * torch.sqrt(torch.tensor(self.n_dims))
        
        # let's return the sum of the scaled input and the positional encodings of its positions
        return input_ + self.pe[:, :input_.size(1), :].type_as(input_)
    

Writing wolof-translate/wolof_translate/models/transformers/position.py
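
To make the table built in `__init__` easier to inspect, here is a small NumPy-only sketch that reproduces the same computation with broadcasting instead of `np.meshgrid`. The function name `sinusoidal_table` and the toy sizes are assumptions made for illustration, and the sketch assumes an even `d_model`. Note that, with `i` being the index of a (sine, cosine) pair, the exponent `i / d_model` used in the module is half of the `2i / d_model` of the original Transformer paper; the sketch keeps the module's version.

```python
import numpy as np


def sinusoidal_table(n_poses_max: int = 500, d_model: int = 512) -> np.ndarray:
    """Build the (n_poses_max, d_model) table with the same angles as PositionalEncoding."""
    poses = np.arange(n_poses_max)[:, None]        # shape (n_poses_max, 1)
    pair_index = np.arange(d_model // 2)[None, :]  # shape (1, d_model / 2)

    # same angle as in the module: pos / 10000 ** (i / d_model)
    angles = poses / 10000 ** (pair_index / d_model)

    table = np.zeros((n_poses_max, d_model))
    table[:, 0::2] = np.sin(angles)  # even dimensions hold the sines
    table[:, 1::2] = np.cos(angles)  # odd dimensions hold the cosines
    return table


d_model, seq_len = 16, 10
table = sinusoidal_table(n_poses_max=50, d_model=d_model)
print(table.shape)  # (50, 16)

# forward pass on a dummy batch: scale by sqrt(d_model), then add the first seq_len positions
embeddings = np.ones((2, seq_len, d_model))
encoded = embeddings * np.sqrt(d_model) + table[None, :seq_len, :]
print(encoded.shape)  # (2, 10, 16)
```

At position 0 every angle is 0, so the first row alternates between 0 (sine) and 1 (cosine), which gives a quick sanity check of the layout.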


Let us define below the size prediction module. It is a multilayer perceptron made of several blocks, each consisting of a linear layer, a ReLU activation, dropout, and an optional layer normalization, followed by a final linear output layer. The input size, the number of features, the number of layers, whether layer normalization is applied, and the dropout rate are passed as parameters to the module.


In [None]:
class SizePredict(nn.Module):
    
    def __init__(self, input_size: int, target_size: int = 1, n_features: int = 100, n_layers: int = 1, normalization: bool = True, drop_out: float = 0.1):
        super(SizePredict, self).__init__()
        
        self.layers = nn.ModuleList([])
        
        for l in range(n_layers):
            
            # each block applies dropout, and layer normalization if it is requested
            self.layers.append(
                nn.Sequential(
                    nn.Linear(input_size if l == 0 else n_features, n_features),
                    nn.ReLU(),
                    nn.Dropout(drop_out),
                    nn.LayerNorm(n_features) if normalization else nn.Identity(),
                )
            )
        
        # Initiate the last linear layer
        self.output_layer = nn.Linear(n_features, target_size)
    
    def forward(self, input_: torch.Tensor):
        
        # let's pass the input into the different sequences
        out = input_
        
        for layer in self.layers:
            
            out = layer(out)
        
        # return the raw prediction (the caller has to take its absolute value to make the number positive)
        return self.output_layer(out)
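
The shape flow through such a module can be checked with a short, self-contained sketch. The helper `build_size_mlp` below is hypothetical: it rebuilds the same stack of blocks with `nn.Sequential` so that the snippet runs on its own. The input, one 512-dimensional vector per sentence, is also an assumption, since the way the Transformer feeds this module is not shown in this section.

```python
import torch
from torch import nn


def build_size_mlp(input_size: int, target_size: int = 1, n_features: int = 100,
                   n_layers: int = 1, normalization: bool = True, drop_out: float = 0.1) -> nn.Sequential:
    """Hypothetical helper mirroring SizePredict: (Linear -> ReLU -> Dropout -> LayerNorm) x n_layers -> Linear."""
    blocks = []
    for l in range(n_layers):
        blocks += [
            nn.Linear(input_size if l == 0 else n_features, n_features),
            nn.ReLU(),
            nn.Dropout(drop_out),
            nn.LayerNorm(n_features) if normalization else nn.Identity(),
        ]
    blocks.append(nn.Linear(n_features, target_size))
    return nn.Sequential(*blocks)


torch.manual_seed(0)
model = build_size_mlp(input_size=512, n_features=64, n_layers=2)
model.eval()  # disable dropout for inference

# assumed input: a batch of 4 sentence vectors of size 512
sentence_vectors = torch.randn(4, 512)

with torch.no_grad():
    raw = model(sentence_vectors)

# the module returns raw values, so the absolute value is taken outside of it
sizes = raw.abs()
print(raw.shape)  # torch.Size([4, 1])
```

Keeping the absolute value outside the module leaves the raw output unconstrained; the caller decides when to make the predicted size positive.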
        
        