Training the best custom Transformer 🤖
-----------------------------------

In this notebook, we continue training the best custom Transformer on the newly extracted sentences from the book **Grammaire de Wolof Moderne**. We provide, below, the main evaluation figures obtained from the hyperparameter search step. We will evaluate the training on the validation dataset.

- Parallel coordinates:

- Parameter importance (from [panel]()):


Let us import some libraries below:

In [1]:
# let us import all necessary libraries
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, T5TokenizerFast, set_seed, AdamW, get_linear_schedule_with_warmup, T5ForConditionalGeneration,\
    get_cosine_schedule_with_warmup, Adafactor
from wolof_translate.utils.sent_transformers import TransformerSequences
from torch.nn import TransformerEncoderLayer, TransformerDecoderLayer
from torch.utils.data import Dataset, DataLoader, random_split
from wolof_translate.data.dataset_v2 import SentenceDataset
from wolof_translate.utils.sent_corrections import *
from sklearn.model_selection import train_test_split
from torch.optim.lr_scheduler import _LRScheduler
# from custom_rnn.utils.kwargs import Kwargs
from torch.nn.utils.rnn import pad_sequence
from plotly.subplots import make_subplots
from nlpaug.augmenter import char as nac
from torch.utils.data import DataLoader
# from datasets  import load_metric # make pip install evaluate instead
# and pip install sacrebleu for instance
from torch.nn import functional as F
import plotly.graph_objects as go
from tokenizers import Tokenizer
import matplotlib.pyplot as plt
from tqdm import tqdm, trange
from functools import partial
from torch.nn import utils
from copy import deepcopy
from torch import optim
from typing import *
from torch import nn
import pandas as pd
import numpy as np
import itertools
import evaluate
import random
import string
import shutil
import wandb
import torch
import json
import copy
import os

os.environ["WANDB_DISABLED"] = "true"



### Steps

We must add some classes that we implemented during the hyperparameter search, including:
- The custom sinusoidal positional encoder
- The custom size prediction module
- The custom Transformer, which requires the stacked PyTorch encoder and decoder layers
- The custom Transformer's learning rate scheduler
- The custom Trainer

We then include them in our `wolof-translate` package.

-------------------

After that, we will continue training the custom Transformer, restoring its parameters from the saved checkpoints.

-------------------

The last part is to evaluate the model on the test set.

Let us go into our pipeline 👌

### Add custom modules

#### Custom Positional Encoder

Let us add below the positional encoder module, which adds the positions of the sequence elements to the embedding vectors.

In [2]:
%%writefile wolof-translate/wolof_translate/models/transformers/position.py

from torch import nn
import numpy as np
import torch

class PositionalEncoding(nn.Module):

    def __init__(self, n_poses_max: int = 500, d_model: int = 512):
        super(PositionalEncoding, self).__init__()    
        
        self.n_poses = n_poses_max
        
        self.n_dims = d_model
        
        # the angle is calculated as following
        angle = lambda pos, i: pos / 10000 ** (i / self.n_dims)

        # let's initialize the different token positions
        poses = np.arange(0, self.n_poses)

        # let's initialize also the different dimension indexes
        dims = np.arange(0, self.n_dims)

        # let's initialize the index of the different positional vector values
        circle_index = np.arange(0, self.n_dims / 2)

        # let's create the possible combinations between a position and a dimension index
        xv, yv = np.meshgrid(poses, circle_index)

        # let's create a matrix which will contain all the different points initialized
        points = np.zeros((self.n_poses, self.n_dims))

        # let's calculate the circle y axis coordinates
        points[:, ::2] = np.sin(angle(xv.T, yv.T))

        # let's calculate the circle x axis coordinates
        points[:, 1::2] = np.cos(angle(xv.T, yv.T))
        
        self.register_buffer('pe', torch.from_numpy(points).unsqueeze(0))
    
    def forward(self, input_: torch.Tensor):
        
        # let's scale the input
        input_ = input_ * torch.sqrt(torch.tensor(self.n_dims))
        
        # let's recuperate the result of the sum between the input and the positional encoding vectors
        return input_ + self.pe[:, :input_.size(1), :].type_as(input_)
    

Overwriting wolof-translate/wolof_translate/models/transformers/position.py
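
To make the table stored in `pe` more concrete, here is a minimal pure-Python sketch that rebuilds it with the same angle as the module above, `pos / 10000 ** (i / d_model)` for `i` in `[0, d_model / 2)`. The helper name `positional_table` and the tiny sizes are only illustrative assumptions. Note that the original paper writes the exponent as `2i / d_model`, so the module above uses a slightly different frequency range.

```python
import math

def positional_table(n_poses: int, d_model: int) -> list:
    """Rebuild the (n_poses, d_model) table with the same angle as PositionalEncoding."""
    table = [[0.0] * d_model for _ in range(n_poses)]
    for pos in range(n_poses):
        for i in range(d_model // 2):
            # same angle as in the module: pos / 10000 ** (i / d_model)
            angle = pos / 10000 ** (i / d_model)
            table[pos][2 * i] = math.sin(angle)      # even columns hold the sines
            table[pos][2 * i + 1] = math.cos(angle)  # odd columns hold the cosines
    return table

# hypothetical small sizes, only for illustration
pe = positional_table(n_poses=4, d_model=6)

print(len(pe), len(pe[0]))  # 4 6
print(pe[0])                # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

In the `forward` method, the embeddings are first multiplied by `sqrt(d_model)` and the first `input_.size(1)` rows of this table are then added to them.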


#### Size Prediction module

Let us define below the size prediction module. It is a multi-layer perceptron made of several blocks of `linear + layer normalization + relu activation + drop out`, followed by a final linear layer. `The number of features`, `the number of layers`, `whether layer normalization is applied` and `the drop out rate` are given as parameters to the module.


In [3]:
%%writefile wolof-translate/wolof_translate/models/transformers/size.py

from torch import nn
import torch
class SizePredict(nn.Module):
    
    def __init__(self, input_size: int, target_size: int = 1, n_features: int = 100, n_layers: int = 1, normalization: bool = True, drop_out: float = 0.1):
        super(SizePredict, self).__init__()
        
        self.layers = nn.ModuleList([])
        
        for l in range(n_layers):
            
            # each block is: linear -> layer normalization (if requested) -> relu -> drop out
            self.layers.append(
                nn.Sequential(
                    nn.Linear(input_size if l == 0 else n_features, n_features),
                    nn.LayerNorm(n_features) if normalization else nn.Identity(),
                    nn.ReLU(),
                    nn.Dropout(drop_out),
                )
            )
        
        # Initiate the last linear layer
        self.output_layer = nn.Linear(n_features, target_size)
    
    def forward(self, input_: torch.Tensor):
        
        # let's pass the input into the different sequences
        out = input_
        
        for layer in self.layers:
            
            out = layer(out)
        
        # return the final result (you have to take the absolute value of the result to make the number positive)
        return self.output_layer(out)
        
        

Overwriting wolof-translate/wolof_translate/models/transformers/size.py


#### Transformer

The following module is the primary transformer model. It takes the following arguments:
- a pytorch encoder and a pytorch decoder (they are defined outside of the module)
- the input size or vocabulary size
- the class criterion, i.e. the loss function for the predicted labels (it defaults to `nn.CrossEntropyLoss` with `label_smoothing=0.1`; this loss applies the (log-)softmax transformation to the logits internally, and label smoothing helps keep the model from becoming over-confident in its predictions)
- the size criterion (`Mean Squared Error` $\frac{1}{n}\sum_{i = 1}^n (y_i - \hat{y}_i)^2$ where $n$ is the batch size, $y_i$ is the true target length and $\hat{y}_i$ is the predicted target length)
- the number of features and the number of layers of the size prediction module
- the max number of positions (it must be the max number of tokens defined when creating the pytorch dataset or the tokenizer)
- the projection type (can be 'embedding' for 2-dimensional data containing integers as we are using or 'linear' for any other type of data different from the sequence of integers).
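
As a small worked example of the size criterion, the sketch below computes the Mean Squared Error between predicted and true target lengths in pure Python. It assumes, as in the `forward` method of the module, that the true length of each target is the sum of its padding mask. The masks, the predicted sizes and the helper `mse` are hypothetical values chosen for illustration.

```python
def mse(predicted, expected):
    """Mean squared error over a batch of scalar values."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)

# hypothetical padding masks for a batch of two targets (1 = token, 0 = padding)
target_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
]

# the true size of each target is the number of non-padded tokens
true_sizes = [sum(row) for row in target_mask]

# hypothetical sizes returned by the size prediction module
predicted_sizes = [2.5, 5.0]

size_loss = mse(predicted_sizes, true_sizes)
print(true_sizes, size_loss)  # [3, 5] 0.125
```

In the model this size loss is added to the classification loss before the backward pass.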

For the `forward` method we have the following arguments:

- the input sequence
- the input padding mask
- the target sequence or labels
- the target padding mask
- the padding token id.

For the `generate` method we have the following arguments:

- the input sequence
- the input padding mask
- the temperature
- the padding token id.
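
To illustrate the temperature argument, here is a minimal pure-Python sketch of a softmax applied to logits divided by a positive temperature. The logits and the helper `softmax` are illustrative assumptions. Since `generate` selects tokens with an argmax, a positive temperature changes the shape of the distribution but not the selected token.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of the logits divided by a positive temperature."""
    scaled = [value / temperature for value in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - shift) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens

sharp = softmax(logits, temperature=0.5)  # more peaked distribution
flat = softmax(logits, temperature=2.0)   # flatter distribution

# the most probable token is the same in both cases
best_sharp = sharp.index(max(sharp))
best_flat = flat.index(max(flat))
print(best_sharp, best_flat)  # 0 0
```

The temperature would only influence the generated tokens if they were sampled from the distribution instead of being chosen greedily.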

We also added two exception classes to handle errors.

In [4]:
%%writefile wolof-translate/wolof_translate/models/transformers/main.py

from wolof_translate.models.transformers.position import PositionalEncoding
from wolof_translate.models.transformers.size import SizePredict
from torch.nn.utils.rnn import pad_sequence
from torch import nn
from typing import *
import torch
import copy
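# new exceptions for that transformer
class TargetException(Exception):
    
    def __init__(self, error):
        
        # pass the message to the base class so that it appears in the traceback
        super().__init__(error)

class GenerationException(Exception):

    def __init__(self, error):

        # pass the message to the base class so that it appears in the traceback
        super().__init__(error)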

class Transformer(nn.Module):
    
    def __init__(self, 
                 vocab_size: int,
                 encoder,
                 decoder,
                 class_criterion = nn.CrossEntropyLoss(label_smoothing=0.1),
                 size_criterion = nn.MSELoss(),
                 n_features: int = 100,
                 n_layers: int = 2,
                 n_poses_max: int = 500,
                 projection_type: str = "embedding",
                 max_len: Union[int, None] = None):
        
        super(Transformer, self).__init__()
        
        assert len(encoder.layers) > 0 and len(decoder.layers) > 0
    
        self.dropout = encoder.layers._modules['0'].dropout.p
        
        self.enc_embed_dim = encoder.layers._modules['0'].linear1.in_features
        
        self.dec_embed_dim = decoder.layers._modules['0'].linear1.in_features
        
        # we can initiate the positional encoding model
        self.pe = PositionalEncoding(n_poses_max, self.enc_embed_dim)
        
        if projection_type == "embedding":
            
            self.embedding_layer = nn.Embedding(vocab_size, self.enc_embed_dim)
        
        elif projection_type == "linear":
            
            self.embedding_layer = nn.Linear(vocab_size, self.enc_embed_dim)
        
        # initialize the first encoder and decoder
        self.encoder = encoder
        
        self.decoder = decoder
        
        self.class_criterion = class_criterion
        
        self.size_criterion = size_criterion
        
        # let's initiate the mlp for predicting the target size
        self.size_prediction = SizePredict(
            self.enc_embed_dim,
            n_features=n_features,
            n_layers=n_layers,
            normalization=True, # we always use normalization
            drop_out=self.dropout
            )

        self.classifier = nn.Linear(self.dec_embed_dim, vocab_size)

        # let us share the weights between the embedding layer and classification
        # linear layer
        self.classifier.weight.data = self.embedding_layer.weight.data

        self.max_len = max_len
        
        
    def forward(self, input_, input_mask = None, target = None, target_mask = None, 
                pad_token_id:int = 3):

        # ---> Encoder prediction
        input_embed = self.embedding_layer(input_)
        
        # recuperate the last input (before position)
        last_input = input_embed[:, -1:]
       
        # add position to input_embedding
        input_embed = self.pe(input_embed)
        
        # recuperate the input mask for pytorch encoder
        pad_mask1 = (input_mask == 0).to(next(self.parameters()).device, dtype = torch.bool) if not input_mask is None else None
        
        # let us compute the states
        input_embed = input_embed.type_as(next(self.encoder.parameters()))
        
        states = self.encoder(input_embed, src_key_padding_mask = pad_mask1)
   
        # ---> Decoder prediction
        # let's predict the size of the target 
        target_size = self.size_prediction(states).mean(axis = 1)
        
        target_embed = self.embedding_layer(target)
        
        # recuperate target mask for pytorch decoder            
        pad_mask2 = (target_mask == 0).to(next(self.parameters()).device, dtype = torch.bool) if not target_mask is None else None
        
        # define the attention mask
        targ_mask = self.get_target_mask(target_embed.size(1))

        # let's concatenate the last input and the target shifted from one position to the right (new seq dim = target seq dim)
        target_embed = torch.cat((last_input, target_embed[:, :-1]), dim = 1)
        
        # add position to target embed
        target_embed = self.pe(target_embed)
        
        # we pass all of the shifted target sequence to the decoder if training mode
        if self.training:
            
            target_embed = target_embed.type_as(next(self.encoder.parameters()))
            
            outputs = self.decoder(target_embed, states, tgt_mask = targ_mask, tgt_key_padding_mask = pad_mask2)
            
        else: ## This part was understood with the help of Professor Bousso.
            
            # if we are in evaluation mode we will not use the target but the outputs to make prediction and it is
            # sequentially done (see comments)
            
            # let us recuperate the last input as the current outputs
            outputs = last_input.type_as(next(self.encoder.parameters()))
            
            # for each target that we want to predict
            for t in range(target.size(1)):
                
                # recuperate the target mask of the current decoder input
                current_targ_mask = targ_mask[:t+1, :t+1] # all attentions between the elements before the last target
                
                # we do the same for the padding mask
                current_pad_mask = None
                
                if not pad_mask2 is None:
                    
                    current_pad_mask = pad_mask2[:, :t+1]
                
                # make new predictions
                out = self.decoder(outputs, states, tgt_mask = current_targ_mask, tgt_key_padding_mask = current_pad_mask) 
                
                # add the last new prediction to the decoder inputs
                outputs = torch.cat((outputs, out[:, -1:]), dim = 1) # the prediction of the last output is the last to add (!)
            
            # let's take only the predictions (the last input will not be taken)
            outputs = outputs[:, 1:]
        
        # let us add padding index to the outputs
        if not target_mask is None: 
          target = copy.deepcopy(target.cpu())
          target = target.to(target_mask.device).masked_fill_(target_mask == 0, -100)

        # ---> Loss Calculation
        # let us calculate the loss of the size prediction
        size_loss = 0
        if not self.size_criterion is None:
            
            size_loss = self.size_criterion(target_size, target_mask.sum(axis = -1).unsqueeze(1).type_as(next(self.parameters())))
            
        outputs = self.classifier(outputs)
        
        # let us permute the two last dimensions of the outputs
        outputs_ = outputs.permute(0, -1, -2)

        # calculate the loss
        loss = self.class_criterion(outputs_, target)

        outputs = torch.softmax(outputs, dim = -1)

        # calculate the predictions
        outputs = copy.deepcopy(outputs.detach().cpu())
        predictions = torch.argmax(outputs, dim = -1).to(target_mask.device).masked_fill_(target_mask == 0, pad_token_id)

        return {'loss': loss + size_loss, 'preds': predictions}
    
    def generate(self, input_, input_mask = None, temperature: float = 0, pad_token_id:int = 3):

        if self.training:

          raise GenerationException("You cannot generate when the model is on training mode!")

        # ---> Encoder prediction
        input_embed = self.embedding_layer(input_)
        
        # recuperate the last input (before position)
        last_input = input_embed[:, -1:]
       
        # add position to input_embedding
        input_embed = self.pe(input_embed)
        
        # recuperate the input mask for pytorch encoder
        pad_mask1 = (input_mask == 0).bool().to(next(self.parameters()).device) if not input_mask is None else None
        
        # let us compute the states
        input_embed = input_embed.type_as(next(self.encoder.parameters()))
        
        states = self.encoder(input_embed, src_key_padding_mask = pad_mask1)

        # ---> Decoder prediction
        # let us recuperate the maximum length
        max_len = self.max_len if not self.max_len is None else 0

        # let's predict the size of the target and the target mask
        if max_len > 0:

          target_size = self.size_prediction(states).mean(axis = 1).round().clip(1, max_len)
        
        else:

          target_size = torch.max(self.size_prediction(states).mean(axis = 1).round(), torch.tensor(1.0))

        target_ = copy.deepcopy(target_size.cpu())

        target_mask = [torch.tensor(int(size[0])*[1] + [0] * max(max_len - int(size[0]), 0)) for size in target_.tolist()]

        if max_len > 0:

          target_mask = torch.stack(target_mask).to(next(self.parameters()).device, dtype = torch.bool)

        else:

          # pad the list of masks built above (not the predicted sizes) to the longest predicted size
          target_mask = pad_sequence(target_mask, batch_first = True).to(next(self.parameters()).device, dtype = torch.bool)
      
        # recuperate target mask for pytorch decoder            
        pad_mask2 = (target_mask == 0).to(next(self.parameters()).device, dtype = torch.bool) if not target_mask is None else None
        
        # define the attention mask
        targ_mask = self.get_target_mask(target_mask.size(1))
            
        # if we are in evaluation mode we will not use the target but the outputs to make prediction and it is
        # sequentially done (see comments)
        
        # let us recuperate the last input as the current outputs
        outputs = last_input.type_as(next(self.encoder.parameters()))
        
        # for each target that we want to predict
        for t in range(target_mask.size(1)):
            
            # recuperate the target mask of the current decoder input
            current_targ_mask = targ_mask[:t+1, :t+1] # all attentions between the elements before the last target
            
            # we do the same for the padding mask
            current_pad_mask = None
            
            if not pad_mask2 is None:
                
                current_pad_mask = pad_mask2[:, :t+1]
            
            # make new predictions
            out = self.decoder(outputs, states, tgt_mask = current_targ_mask, tgt_key_padding_mask = current_pad_mask) 
            
            # add the last new prediction to the decoder inputs
            outputs = torch.cat((outputs, out[:, -1:]), dim = 1) # the prediction of the last output is the last to add (!)
        
        # let's take only the predictions (the last input will not be taken)
        outputs = outputs[:, 1:]

        # ---> Predictions
        outputs = self.classifier(outputs)

        # calculate the resulted outputs with temperature
        if temperature > 0:

          outputs = torch.softmax(outputs / temperature, dim = -1)
        
        else:

          outputs = torch.softmax(outputs, dim = -1)

        # calculate the predictions
        outputs = copy.deepcopy(outputs.detach().cpu())
        predictions = torch.argmax(outputs, dim = -1).to(target_mask.device).masked_fill_(target_mask == 0, pad_token_id)

        return predictions
    

    def get_target_mask(self, attention_size: int):
        
        return torch.triu(torch.ones((attention_size, attention_size)), diagonal = 1).to(next(self.parameters()).device, dtype = torch.bool)

Overwriting wolof-translate/wolof_translate/models/transformers/main.py
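
The attention mask returned by `get_target_mask` and the way it is sliced during step-by-step decoding can be sketched in pure Python as follows. The helper `causal_mask` is a hypothetical stand-in for `torch.triu(torch.ones(...), diagonal = 1)` converted to booleans, where `True` marks a position that must not be attended to.

```python
def causal_mask(size: int) -> list:
    """Strict upper triangle: True marks a position that must NOT be attended to."""
    return [[col > row for col in range(size)] for row in range(size)]

mask = causal_mask(3)
for row in mask:
    print(row)
# [False, True, True]
# [False, False, True]
# [False, False, False]

# at decoding step t, only the top-left (t + 1) x (t + 1) block is used
t = 1
current = [row[: t + 1] for row in mask[: t + 1]]
print(current)  # [[False, True], [False, False]]
```

Each row therefore only sees itself and the previous positions, which is what lets the evaluation loop extend the decoder input one prediction at a time.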


#### Learning scheduler

Let us create our own learning rate scheduler following the paper [Attention Is All You Need](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).

![scheduler_transformer](https://i.stack.imgur.com/GQurA.png)

In [5]:
%%writefile wolof-translate/wolof_translate/models/transformers/optimization.py

from torch.optim.lr_scheduler import _LRScheduler
from torch import optim
from typing import *

class TransformerScheduler(_LRScheduler):
    
    def __init__(self, optimizer: Union[optim.AdamW, optim.Adam], d_model = 512, lr_warmup_step = 100, **kwargs):
        
        self._optimizer = optimizer
        
        self._dmodel = d_model
        
        self._lr_warmup = lr_warmup_step

        # get the number of parameters
        self.len_param_groups = len(self._optimizer.param_groups)

        # provide the LRScheduler parameters
        super().__init__(self._optimizer, **kwargs)
        
    def get_lr(self):
        
        # recuperate the step number
        _step_num = self._step_count
        
        # calculate the learning rate
        lr = self._dmodel ** -0.5 * min(_step_num ** -0.5, 
                                              _step_num * self._lr_warmup ** -1.5)
        # provide the corresponding learning rate of each parameter vector
        # for updating
        return [lr] * self.len_param_groups

        
        

Overwriting wolof-translate/wolof_translate/models/transformers/optimization.py
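
The learning rate computed in `get_lr` is $lr = d_{model}^{-0.5} \cdot \min(step^{-0.5}, step \cdot warmup^{-1.5})$. The sketch below evaluates this formula outside of any optimizer, with the same default values as the scheduler (`d_model = 512`, `lr_warmup_step = 100`). The function name `transformer_lr` is only an illustrative assumption, and the step is assumed to start at 1.

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 100) -> float:
    """Learning rate of the warmup schedule at a given step (step >= 1)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# the rate grows linearly during the warmup, peaks at the warmup step,
# then decays with the inverse square root of the step
for step in (1, 50, 100, 400):
    print(step, transformer_lr(step))
```

With these defaults the peak learning rate, reached at step 100, is about `0.0044`. A larger `d_model` or a longer warmup lowers that peak.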


In [6]:
# %%writefile wolof-translate/wolof_translate/models/transformers/optimization.py

# from torch.optim.lr_scheduler import _LRScheduler
# from torch import optim
# from typing import *
# class TransformerScheduler(_LRScheduler):
    
#     def __init__(self, optimizer: Union[optim.AdamW, optim.Adam], scale_factor = 1.0, lr_warmup_step = 100, **kwargs):

#         self._optimizer = optimizer

#         self._scale_factor = scale_factor
        
#         self._lr_warmup = lr_warmup_step

#         # get the number of parameters
#         self.len_param_groups = len(self._optimizer.param_groups)

#         # provide the LRScheduler parameters
#         super().__init__(self._optimizer, **kwargs)
        
#     def get_lr(self):
        
#         # recuperate the step number
#         _step_num = self._step_count
        
#         # calculate the learning rate
#         lr = self._scale_factor * min(_step_num ** -0.5, 
#                                               _step_num * self._lr_warmup ** -1.5)
#         # provide the corresponding learning rate of each parameter vector
#         # for updating
#         return [lr] * self.len_param_groups

        
        

#### Trainer

Let us define below a part of the long training class that we created, which is available on GitHub. The GitHub version is commented in French.

In [2]:
%%writefile wolof-translate/wolof_translate/trainers/transformer_trainer.py
"""New training class. It receives a model and hyperparameters as input.
We will create additional classes that support the training class.
"""

from wolof_translate.utils.evaluation import TranslationEvaluation
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import Dataset, DataLoader
from tokenizers import Tokenizer
from tqdm import tqdm, trange
from torch.nn import utils
from torch import optim
from typing import *
from torch import nn
import pandas as pd
import numpy as np
import string
import torch
import json
import copy
import os

# choose letters for random words
letters = string.ascii_lowercase

class PredictionError(Exception):
    
    def __init__(self, error: Union[str, None] = None):

        if error is None:
            
            error = "You cannot predict with this type of data! Provide a list of tensors, a list of numpy arrays, a numpy array or a torch tensor."
        
        # pass the message to the base class so that it appears in the traceback
        super().__init__(error)

class LossError(Exception):
    
    def __init__(self, error: Union[str, None] = None):

        if error is None:
            
            error = "A list of losses is provided for multiple outputs."
        
        # pass the message to the base class so that it appears in the traceback
        super().__init__(error)
        
class ModelRunner:

    def __init__(
        self,
        model: nn.Module,
        optimizer = optim.AdamW,
        seed: Union[int, None] = None, 
        evaluation: Union[TranslationEvaluation, None] = None,
        version: int = 1
    ):

        # Initialize the generator seed
        self.seed = seed
        
        # Initialize the version
        self.version = version

        # Recuperate the evaluation metric
        self.evaluation = evaluation

        # Seed the generator
        if self.seed:
            torch.manual_seed(self.seed)

        # The model to use for the different trainings
        self.orig_model = model

        # The optimizer to use for the different model updates
        self.orig_optimizer = optimizer

        # Recuperate the device type
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        self.compilation = False

    # ------------------------------ Training stuff (training and compilation part) --------------------------
    
    def batch_train(self, input_: torch.Tensor, input_mask: torch.Tensor,
                    labels: torch.Tensor, labels_mask: torch.Tensor, pad_token_id: int = 3):
        if self.hugging_face: # we will use a Hugging Face text-to-text model (but only for fine-tuning)

          # let us make a forward pass
          outputs = self.model(input_ids = input_, attention_mask = input_mask, 
                               labels = labels)
          
          # recuperate the predictions and the loss
          preds, loss = outputs.logits, outputs.loss
        
        else:

          # let us make a forward pass
          outputs = self.model(input_, input_mask, labels, labels_mask, pad_token_id = pad_token_id)

          # recuperate the predictions and the loss
          preds, loss = outputs['preds'], outputs['loss']

        # let us make a backward pass
        loss.backward()

        # clip the gradient values to a given interval if necessary
        if not self.clipping_value is None:

            utils.clip_grad_value_(
                self.model.parameters(), clip_value=self.clipping_value
            )

        # update the parameters
        self.optimizer.step()

        # update the learning rate at each iteration if necessary
        if not self.lr_scheduling is None:

            self.lr_scheduling.step()

        # reset the gradients
        self.optimizer.zero_grad()

        return preds, loss

    def batch_eval(self, input_: torch.Tensor, input_mask: torch.Tensor,
                    labels: torch.Tensor, labels_mask: torch.Tensor, pad_token_id: int = 3):

        if self.hugging_face: # we will use a Hugging Face text-to-text model (but only for fine-tuning)

          # let us make a forward pass
          outputs = self.model(input_ids = input_, attention_mask = input_mask, 
                               labels = labels)
          # recuperate the predictions and the loss
          preds, loss = outputs.logits, outputs.loss
        
        else:

          # let us make a forward pass
          outputs = self.model(input_, input_mask, labels, labels_mask, pad_token_id = pad_token_id)

          # recuperate the predictions and the loss
          preds, loss = outputs['preds'], outputs['loss']

        return preds, loss

    # We decided to add some parameters that were useful in the previous training classes
    def compile(
        self,
        train_dataset: Dataset,
        test_dataset: Union[Dataset, None] = None,
        tokenizer: Union[Tokenizer, None] = None,
        train_loader_kwargs: dict = {"batch_size": 16},
        test_loader_kwargs: dict = {"batch_size": 16},
        optimizer_kwargs: dict = {"lr": 1e-4, "weight_decay": 0.4},
        model_kwargs: dict = {'class_criterion': nn.CrossEntropyLoss(label_smoothing=0.1)},
        lr_scheduler_kwargs: dict = {'d_model': 512, 'lr_warmup_step': 100},
        lr_scheduler = None,
        gradient_clipping_value: Union[float, torch.Tensor, None] = None,
        predict_with_generate: bool = False,
        logging_dir: Union[str, None] = None,
        hugging_face: bool = False,
    ):

        if self.seed:
            torch.manual_seed(self.seed)

        # We must unpack the keyword arguments because the model parameters are not known in advance
        if isinstance(self.orig_model, nn.Module): # if it is a model instance then no parameters are required
            
            self.model = copy.deepcopy(self.orig_model).to(self.device)
        
        else: # otherwise we provide the parameters
        
            self.model = copy.deepcopy(self.orig_model(**model_kwargs)).to(self.device)

        # Initialize the optimizer parameters
        self.optimizer = self.orig_optimizer(
            self.model.parameters(), **optimizer_kwargs
        )
        
        # Add a learning rate scheduler if necessary
        self.lr_scheduling = None

        if not lr_scheduler is None and self.lr_scheduling is None:

            self.lr_scheduling = lr_scheduler(self.optimizer, **lr_scheduler_kwargs)

        self.train_loader = DataLoader(
            train_dataset,
            shuffle=True,
            **train_loader_kwargs,
        )
        
        if test_dataset:
          self.test_loader = DataLoader(
              test_dataset,
              shuffle=False,
              **test_loader_kwargs,
          )
        
        else:
          self.test_loader = None
        
        # Let us initialize the clipping value to make gradient clipping
        self.clipping_value = gradient_clipping_value

        # Other parameters for step tracking and metrics
        self.compilation = True

        self.current_epoch = None

        self.best_score = None

        self.best_epoch = self.current_epoch

        # Recuperate some boolean attributes
        self.predict_with_generate = predict_with_generate

        # Recuperate tokenizer
        self.tokenizer = tokenizer
        
        # Recuperate the logging directory
        self.logging_dir = logging_dir
        
        # Initialize the metrics
        self.metrics = {}

        # Initialize the attribute which indicate if the model is from huggingface
        self.hugging_face = hugging_face
        

    def train(
        self,
        epochs: int = 100,
        auto_save: bool = False,
        log_step: Union[int, None] = None,
        saving_directory: str = "data/checkpoints/last_checkpoints",
        file_name: str = "checkpoints",
        save_best: bool = True,
        metric_for_best_model: str = 'test_loss',
        metric_objective: str = 'minimize'
    ):
        """Train the model

        Args:
            epochs (int, optional): The number of epochs. Defaults to 100.
            auto_save (bool, optional): Whether to save the model automatically. Defaults to False.
            log_step (int, optional): The number of iterations between two performance logs. Defaults to None.
            saving_directory (str, optional): The directory where the model is saved. Defaults to "data/checkpoints/last_checkpoints".
            file_name (str, optional): The name of the checkpoint file. Defaults to "checkpoints".
            save_best (bool): Whether to save the best model. Defaults to True.
            metric_for_best_model (str): The name of the metric used to select the best model. Defaults to 'test_loss'.
            metric_objective (str): Whether the metric must be maximized ('maximize') or minimized ('minimize'). Defaults to 'minimize'.

        Raises:
            Exception: Training requires the parameters to be initialized beforehand
        """

        # the file name cannot be "best_checkpoints"
        assert file_name != "best_checkpoints"
        
        ##################### Error Handling ##################################################
        if not self.compilation:
            raise Exception(
                "You must initialize the datasets and parameters with the `compile` method "
                "before training the model."
            )

        ##################### Initializations #################################################

        if metric_objective in ['maximize', 'minimize']:

          best_score = float('-inf') if metric_objective == 'maximize' else float('inf')

        else:

          raise ValueError("The metric objective can only be 'maximize' or 'minimize'!")

        if not self.best_score is None:

          best_score = self.best_score

        start_epoch = self.current_epoch if not self.current_epoch is None else 0

        ##################### Training ########################################################

        modes = ['train', 'test'] 
        
        if self.test_loader is None: modes = ['train']

        for epoch in tqdm(range(start_epoch, start_epoch + epochs)):

            # Print the actual learning rate
            print(f"For epoch {epoch + 1}: {{Learning rate: {self.lr_scheduling.get_lr()}}}")
            
            self.metrics = {}
        
            for mode in modes:
            
                with torch.set_grad_enabled(mode == "train"):
                  
                    # Initialize the loss of the current mode
                    self.metrics[f'{mode}_loss'] = 0

                    # Let us initialize the predictions
                    predictions_ = []

                    # Let us initialize the labels
                    labels_ = []

                    if mode == "train":

                        self.model.train()

                        loader = list(iter(self.train_loader))

                    else:

                        self.model.eval()

                        loader = list(iter(self.test_loader))
                    
                    with trange(len(loader), unit = "batches", position = 0, leave = True) as pbar:
                      
                      for i in pbar:
                        
                        pbar.set_description(f"{mode[0].upper() + mode[1:]} batch number {i + 1}")
                        
                        data = loader[i]

                        input_ = data[0].long().to(self.device)
                        
                        input_mask = data[1].to(self.device)

                        labels = data[2].long().to(self.device)

                        if self.hugging_face:

                          # assign (not compare) so that padded positions are ignored by the loss
                          labels[labels == self.tokenizer.pad_token_id] = -100

                        labels_mask = data[3].to(self.device)
                        
                        # Get the id of the padding token (3 by default)
                        pad_token_id = 3 if self.tokenizer is None else self.tokenizer.pad_token_id

                        preds, loss = (
                            self.batch_train(input_, input_mask, labels, labels_mask, pad_token_id)
                            if mode == "train"
                            else self.batch_eval(input_, input_mask, labels, labels_mask, pad_token_id)
                        )

                        self.metrics[f"{mode}_loss"] += loss.item()
                        
                        # collect the predictions and the labels of the current batch for the metric computation
                        if mode == "test":

                            if self.predict_with_generate:

                              if self.hugging_face:

                                  preds = self.model.generate(input_, attention_mask = input_mask)

                              else:

                                  preds = self.model.generate(input_, input_mask, pad_token_id = pad_token_id)

                                  labels = labels.masked_fill_(labels_mask == 0, -100)

                            else:

                              if self.hugging_face:

                                  preds = torch.argmax(preds, dim = -1)

                              else:

                                  labels = labels.masked_fill_(labels_mask == 0, -100)

                            predictions_.extend(preds.detach().cpu().tolist())

                            labels_.extend(labels.detach().cpu().tolist())
                      
            if not self.evaluation is None and mode == 'test':
              
              self.metrics.update(self.evaluation.compute_metrics((np.array(predictions_), np.array(labels_))))

            self.metrics[f"train_loss"] = self.metrics[f"train_loss"] / len(self.train_loader)
            
            if not self.test_loader is None:
            
                self.metrics[f"test_loss"] = self.metrics[f"test_loss"] / len(self.test_loader)

            # for metric in self.metrics:

            #    if metric != 'train_loss':

            #     self.metrics[metric] = self.metrics[metric] / len(self.test_loader)

            # Display the metrics
            if not log_step is None and (epoch + 1) % log_step == 0:

              print(f"\nMetrics: {self.metrics}")
              
              if not self.logging_dir is None:
                  
                  with SummaryWriter(os.path.join(self.logging_dir, f'version_{self.version}')) as writer:
                      
                      for metric in self.metrics:
                          
                        writer.add_scalar(metric, self.metrics[metric], global_step = epoch)
                        
                        writer.add_scalar("global_step", epoch)

            print("\n=============================\n")

            ##################### Model saving #########################################################

            # Save the model at the end of the current epoch
            if auto_save:

                self.current_epoch = epoch + 1
                
                if save_best:

                  # verify if the current score is best and recuperate it if yes
                  if metric_objective == 'maximize':
                    
                    last_score = best_score < self.metrics[metric_for_best_model]
                  
                  elif metric_objective == 'minimize':

                    last_score = best_score > self.metrics[metric_for_best_model]
                  
                  else:
                      
                      raise ValueError("The metric objective can only be in ['maximize', 'minimize'] !")
                  
                  # recuperate the best score
                  if last_score: 

                    best_score = self.metrics[metric_for_best_model]

                    self.best_epoch = self.current_epoch + 1
                    
                    self.best_score = best_score 
                    
                    self.save(saving_directory, "best_checkpoints")
                             
                self.save(saving_directory, file_name)

    # This method is modeled on the save method of the DDPG agent (RL) that we wrote previously
    def save(
        self,
        directory: str = "data/checkpoints/last_checkpoints",
        file_name: str = "checkpoints"
    ):

          if not os.path.exists(directory):
              os.makedirs(directory)

          file_path = os.path.join(directory, f"{file_name}.pth")

          checkpoints = {
              "model_state_dict": self.model.state_dict(),
              "optimizer_state_dict": self.optimizer.state_dict(),
              "current_epoch": self.current_epoch,
              "metrics": self.metrics,
              "best_score": self.best_score,
              "best_epoch": self.best_epoch,
              "lr_scheduler_state_dict": self.lr_scheduling.state_dict() if not self.lr_scheduling is None else None
          }

          torch.save(checkpoints, file_path)

          # update metrics and the best score dict
          self.metrics['current_epoch'] = self.current_epoch + 1 if not self.current_epoch is None else self.current_epoch

          best_score_dict = {"best_score": self.best_score, "best_epoch": self.best_epoch}

          # save the metrics as json file
          metrics = json.dumps({'metrics': self.metrics, "best_performance": best_score_dict}, indent=4)

          with open(os.path.join(directory, f'{file_name}.json'), 'w') as f:

            f.write(metrics)   
          
    # The same goes for the load method
    def load(
        self,
        directory: str = "data/checkpoints/last_checkpoints",
        file_name: str = "checkpoints",
        load_best: bool = False
    ):

        if load_best: file_name = "best_checkpoints"
        
        file_path = os.path.join(
            directory, 
            f"{file_name}.pth"
        )

        if os.path.exists(file_path):

            checkpoints = torch.load(file_path)

            self.model.load_state_dict(checkpoints["model_state_dict"])

            self.optimizer.load_state_dict(checkpoints["optimizer_state_dict"])

            self.current_epoch = checkpoints["current_epoch"]

            self.best_score = checkpoints["best_score"]

            self.best_epoch = checkpoints["best_epoch"]

            if not self.lr_scheduling is None:
                
                self.lr_scheduling.load_state_dict(checkpoints["lr_scheduler_state_dict"])

        else:

            raise OSError(
                f"The file {file_path} could not be found. Check that the provided path is correct!"
            )
    
    def evaluate(self, test_dataset, batch_size: int = 16, loader_kwargs: dict = {}):
        
        self.model.eval()
        
        test_loader = list(iter(DataLoader(
            test_dataset,
            batch_size,
            shuffle=False,
            **loader_kwargs,
        )))
        
        # Let us initialize the predictions
        predictions_ = []

        # Let us initialize the labels
        labels_ = []
        
        metrics = {'test_loss': 0.0}

        results = {'original_sentences': [], 'translations': [], 'predictions': []}

        with torch.no_grad():

            with trange(len(test_loader), unit = "batches", position = 0, leave = True) as pbar:

                for i in pbar:
                
                    pbar.set_description(f"Evaluation batch number {i + 1}")
                    
                    data = test_loader[i]
                                
                    input_ = data[0].long().to(self.device)
                        
                    input_mask = data[1].to(self.device)

                    labels = data[2].long().to(self.device)

                    # id of the padding token, used for masking, generation and decoding
                    pad_token_id = test_dataset.tokenizer.pad_token_id

                    if self.hugging_face:

                        # assign (not compare) so that padded positions are ignored by the loss
                        labels[labels == pad_token_id] = -100

                    labels_mask = data[3].to(self.device)

                    preds, loss = self.batch_eval(input_, input_mask, labels, labels_mask, pad_token_id)

                    metrics[f"test_loss"] += loss.item()
                
                    if self.hugging_face:

                        preds = self.model.generate(input_, attention_mask = input_mask)
                        
                        labels_.extend(labels.detach().cpu().tolist())

                    else:

                        preds = self.model.generate(input_, input_mask, pad_token_id = pad_token_id)

                        labels__ = labels.masked_fill_(labels_mask == 0, -100)
                        
                        labels_.extend(labels__.detach().cpu().tolist())
                    
                    predictions_.extend(preds.detach().cpu().tolist())
                    
                    # let us recover the original sentences
                    results['original_sentences'].extend(test_dataset.tokenizer.batch_decode(input_, skip_special_tokens = True))

                    # restore the padding id before decoding, since -100 is not a valid token id
                    decodable_labels = labels.masked_fill(labels == -100, pad_token_id)

                    results['translations'].extend(test_dataset.tokenizer.batch_decode(decodable_labels, skip_special_tokens = True))

                    results['predictions'].extend(test_dataset.tokenizer.batch_decode(preds, skip_special_tokens = True))

            if not self.evaluation is None:
              
                metrics.update(self.evaluation.compute_metrics((np.array(predictions_), np.array(labels_))))

            metrics["test_loss"] = metrics["test_loss"] / len(test_loader)

            return metrics, pd.DataFrame(results)
        
            
            

Overwriting wolof-translate/wolof_translate/trainers/transformer_trainer.py
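
The best-checkpoint logic of `train` (`save_best`, `metric_for_best_model`, `metric_objective`) can be illustrated in isolation. The following is a minimal standalone sketch, not the trainer's code: `is_improvement` and `track_best` are hypothetical helpers and the scores are made-up values.

```python
from typing import List, Optional, Tuple


def is_improvement(score: float, best_score: float, objective: str = "minimize") -> bool:
    """Return True when `score` beats `best_score` for the given objective."""
    if objective == "maximize":
        return score > best_score
    if objective == "minimize":
        return score < best_score
    raise ValueError("The metric objective can only be 'maximize' or 'minimize'!")


def track_best(scores: List[float], objective: str = "minimize",
               best_score: Optional[float] = None) -> Tuple[float, Optional[int]]:
    """Scan per-epoch scores and return (best score, 1-based epoch of that score).

    `best_score` plays the role of a score restored from a checkpoint. When it is
    None, the search starts from the worst possible value for the objective.
    """
    if best_score is None:
        best_score = float("-inf") if objective == "maximize" else float("inf")
    best_epoch = None
    for epoch, score in enumerate(scores, start=1):
        if is_improvement(score, best_score, objective):
            best_score, best_epoch = score, epoch
    return best_score, best_epoch


# made-up BLEU scores for four epochs
print(track_best([0.41, 0.73, 0.68, 0.90], objective="maximize"))  # (0.9, 4)

# made-up losses, starting from a restored best loss of 0.5
print(track_best([0.9, 0.7, 0.6], objective="minimize", best_score=0.5))  # (0.5, None)
```

In the second call no epoch beats the restored score, which corresponds to a resumed run that never overwrites `best_checkpoints`.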


## French to wolof

### Configure dataset 🔠

In [3]:
%%writefile wolof-translate/wolof_translate/utils/split_with_valid.py
"""This module contains a function that splits the data into train, validation and test sets.
"""
from sklearn.model_selection import train_test_split
import pandas as pd
import os

def split_data(random_state: int = 50, data_directory: str = "data/extractions/new_data", csv_file: str = "sentences.csv"):
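  """Split the data into train, validation and test sets and save them as CSV files

  Args:
    random_state (int): the seed of the splitting generator. Defaults to 50
    data_directory (str): the directory containing the corpus and receiving the splits. Defaults to "data/extractions/new_data"
    csv_file (str): the name of the corpus file. Defaults to "sentences.csv"
  """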
  # load the corpora and split into train and test sets
  corpora = pd.read_csv(os.path.join(data_directory, csv_file))

  train_set, test_set = train_test_split(corpora, test_size=0.1, random_state=random_state)

  # let us save the final training set (train + validation) before carving out the validation set
  train_set.to_csv(os.path.join(data_directory, "final_train_set.csv"), index=False)

  # let us split the remaining sentences into train and validation sets
  train_set, valid_set = train_test_split(train_set, test_size=0.1, random_state=random_state)

  # let us save the sets
  train_set.to_csv(os.path.join(data_directory, "train_set.csv"), index=False)

  valid_set.to_csv(os.path.join(data_directory, "valid_set.csv"), index=False)

  test_set.to_csv(os.path.join(data_directory, "test_set.csv"), index=False)

Overwriting wolof-translate/wolof_translate/utils/split_with_valid.py
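
Because the validation set is taken from what remains after the test split, the effective proportions are roughly 81% train, 9% validation and 10% test. The sketch below illustrates this with the standard library only. `holdout` and `three_way_split` are hypothetical helpers, the corpus of 1,000 ids is made up, and the rounding rule (hold out the ceiling of the requested fraction) is an assumption of this sketch rather than a statement about `train_test_split`.

```python
import math
import random
from typing import List, Sequence, Tuple


def holdout(items: Sequence, test_size: float, seed: int) -> Tuple[List, List]:
    """Shuffle `items` and hold out ceil(test_size * n) of them."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


def three_way_split(items: Sequence, test_size: float = 0.1,
                    valid_size: float = 0.1, seed: int = 50) -> Tuple[List, List, List]:
    """Hold out a test set first, then a validation set from what remains."""
    train_valid, test = holdout(items, test_size, seed)
    train, valid = holdout(train_valid, valid_size, seed)
    return train, valid, test


# hypothetical corpus of 1,000 sentence ids
train, valid, test = three_way_split(range(1000))
print(len(train), len(valid), len(test))  # 810 90 100
```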


In [9]:
# load the tokenizer from a JSON file
tokenizer = T5TokenizerFast(tokenizer_file="wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v3.json")


The following function loads the datasets from the CSV files. The test set no longer serves as the validation set: the validation set is now carved out of the training set.

In [10]:
def recuperate_datasets(fr_char_p: float, fr_word_p: float):

  # Create augmentation to add on French sentences
  fr_augmentation = TransformerSequences(nac.KeyboardAug(aug_char_p=fr_char_p, aug_word_p=fr_word_p),
                                        remove_mark_space, delete_guillemet_space)

  # Recuperate the train dataset
  train_dataset_aug = SentenceDataset(f"data/extractions/new_data/train_set.csv",
                                        tokenizer,
                                        truncation = True,
                                        cp1_transformer = fr_augmentation)

  # Recuperate the validation dataset
  valid_dataset = SentenceDataset(f"data/extractions/new_data/valid_set.csv",
                                        tokenizer,
                                        truncation = True)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

### Configure the evaluation class ⚙️

We will evaluate the predictions with the BLEU metric, computed with `sacrebleu` (which reports scores on a 0 to 100 scale). As in the hyperparameter search, the predictions are produced with the model's `generate` method.

In [11]:
%%writefile wolof-translate/wolof_translate/utils/evaluation.py
from tokenizers import Tokenizer
from typing import *
import numpy as np
import evaluate

class TranslationEvaluation:
    
    def __init__(self, 
                 tokenizer: Tokenizer,
                 decoder: Union[Callable, None] = None,
                 metric = None,
                 ):
        
        self.tokenizer = tokenizer
        
        self.decoder = decoder
        
        # load the default metric on instantiation rather than at import time
        self.metric = evaluate.load('sacrebleu') if metric is None else metric
    
    def postprocess_text(self, preds, labels):
        
        preds = [pred.strip() for pred in preds]
        
        labels = [[label.strip()] for label in labels]
        
        return preds, labels

    def compute_metrics(self, eval_preds):

        preds, labels = eval_preds

        if isinstance(preds, tuple):
        
            preds = preds[0]
        
        decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)

        labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
        
        decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)

        decoded_preds, decoded_labels = self.postprocess_text(decoded_preds, decoded_labels)

        result = self.metric.compute(predictions=decoded_preds, references=decoded_labels)
        
        result = {"bleu": result["score"]}

        prediction_lens = [np.count_nonzero(pred != self.tokenizer.pad_token_id) for pred in preds]
        
        result["gen_len"] = np.mean(prediction_lens)
        
        result = {k: round(v, 4) for k, v in result.items()}
        
        return result

Overwriting wolof-translate/wolof_translate/utils/evaluation.py
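
Two steps of `compute_metrics` are easy to overlook: label positions masked with `-100` are mapped back to the padding id before decoding, and `gen_len` is the mean number of non-padding tokens per prediction. The sketch below reproduces both steps on plain Python lists instead of NumPy arrays. `restore_padding` and `mean_generated_length` are hypothetical helpers, and the token ids and the padding id `0` are made up.

```python
from statistics import mean
from typing import List

IGNORE_INDEX = -100  # value used to mask label positions that must not be scored


def restore_padding(labels: List[List[int]], pad_token_id: int,
                    ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Replace every masked position by the padding id so the ids can be decoded."""
    return [[pad_token_id if token == ignore_index else token for token in seq]
            for seq in labels]


def mean_generated_length(preds: List[List[int]], pad_token_id: int) -> float:
    """Average number of non-padding tokens per predicted sequence."""
    return mean(sum(token != pad_token_id for token in seq) for seq in preds)


PAD = 0  # hypothetical padding id
labels = [[5, 8, 2, -100, -100], [7, 2, -100, -100, -100]]
preds = [[5, 8, 2, PAD, PAD], [7, 9, 4, 2, PAD]]

print(restore_padding(labels, PAD))       # [[5, 8, 2, 0, 0], [7, 2, 0, 0, 0]]
print(mean_generated_length(preds, PAD))  # 3.5
```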


### Training

Let us import the transformer, the data splitter, the learning rate scheduler and the evaluation class below:

In [12]:
from wolof_translate.models.transformers.optimization import TransformerScheduler
from wolof_translate.trainers.transformer_trainer import ModelRunner
from wolof_translate.utils.evaluation import TranslationEvaluation
from wolof_translate.models.transformers.main import Transformer
from wolof_translate.utils.split_with_valid import split_data


### ---

Let us configure the parameters.

In [15]:
# let us initialize the hyperparameter configuration
config = {
    'random_state': 0,
    'fr_char_p': 0.02586151531081308,
    'fr_word_p': 0.8713619950477987,
    'dim_ff': 2092,
    'drop_out_rate': 0.1892588927795259,
    'label_smoothing': 0.1,
    'n_layers': 11,
    'n_features': 186,
    'learning_rate': 0.004094000716163921,
    'weight_decay': 0.5218045932883156,
    'batch_size': 8,
    'max_epoch': 7663,
    'warmup_ratio': 0.05133210495607013,
    'bleu': 0.9865,
    'model_dir': 'data/checkpoints/fw_custom_v3_checkpoints/',
    'new_model_dir': 'data/checkpoints/custom_results_fw_v3/'
}

# let us initialize the evaluation class
evaluation = TranslationEvaluation(tokenizer)

# let us initialize the trainer
trainer = ModelRunner(model = Transformer, seed = 0, evaluation = evaluation)

# split the data
split_data(config['random_state'])

# load the training and validation datasets (the validation set is used as the trainer's test set)
train_dataset, test_dataset = recuperate_datasets(config['fr_char_p'], 
                                                    config['fr_word_p'])

# initialize the encoder and the decoder layers
encoder_layer = nn.TransformerEncoderLayer(512, 
                                            8,
                                            config['dim_ff'],
                                            config['drop_out_rate'], batch_first = True)

decoder_layer = nn.TransformerDecoderLayer(512, 
                                            8,
                                            config['dim_ff'],
                                            config['drop_out_rate'], batch_first = True)

# let us initialize the encoder and the decoder
encoder = nn.TransformerEncoder(encoder_layer, 6)

decoder = nn.TransformerDecoder(decoder_layer, 6)

# let us calculate the number of warmup steps from the maximum number of epochs
length = len(train_dataset)

n_steps = length // config['batch_size']

num_steps = config['max_epoch'] * n_steps

warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

# Initialize the scheduler parameters
scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}

# Initialize the transformer parameters
model_args = {
    'vocab_size': len(tokenizer),
    'encoder': encoder,
    'decoder': decoder,
    'class_criterion': nn.CrossEntropyLoss(label_smoothing = config['label_smoothing']),
    'n_poses_max': train_dataset.max_len,
    'n_layers': config['n_layers'],
    'n_features': config['n_features'],
    'max_len': test_dataset.max_len
}

# Initialize the optimizer parameters
optimizer_args = {
    'lr': config['learning_rate'],
    'weight_decay': config['weight_decay'],
    'betas': (0.9, 0.98),
}

# Initialize the loaders parameters
train_loader_args = {'batch_size': config['batch_size']}

# Add the datasets and hyperparameters to trainer
trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                optimizer_kwargs = optimizer_args, model_kwargs = model_args,
                lr_scheduler=get_linear_schedule_with_warmup,
                lr_scheduler_kwargs=scheduler_args, 
                predict_with_generate = True,
                logging_dir="data/logs/custom_fw_v3"
                )

# We resume training from a checkpoint, so let us load the model
trainer.load(config['model_dir'], load_best=True) # Only for the first loading
# trainer.load(config['new_model_dir'])
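
The warmup arithmetic above can be checked by hand. The sketch below re-implements a linear warmup followed by a linear decay in plain Python, which is the shape of schedule `get_linear_schedule_with_warmup` is meant to produce; `linear_warmup_lr` is a hypothetical helper, not the library function. The value `n_steps = 163` is an assumption (the dataset length is not shown), chosen because the progress bar of the run reports 164 batches per epoch, which with floor division gives 163 whenever the length is not a multiple of 8.

```python
def linear_warmup_lr(step: int, base_lr: float, warmup_steps: float, total_steps: int) -> float:
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1.0, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1.0, total_steps - warmup_steps))


base_lr = 0.004094000716163921      # config['learning_rate']
warmup_ratio = 0.05133210495607013  # config['warmup_ratio']
max_epoch = 7663                    # config['max_epoch']
n_steps = 163                       # ASSUMED value of len(train_dataset) // config['batch_size']
batches_per_epoch = 164             # batches per epoch reported by the progress bar

total_steps = max_epoch * n_steps
warmup_steps = total_steps * warmup_ratio

# learning rate at the start of epoch 6, i.e. after 5 completed epochs
lr = linear_warmup_lr(5 * batches_per_epoch, base_lr, warmup_steps, total_steps)
print(f"{lr:.4e}")  # 5.2358e-05
```

Under these assumptions the result agrees with the learning rate printed for epoch 6 in the run. It also suggests that the scheduler advances once per batch, and that floor division makes `num_training_steps` slightly smaller than the number of steps actually taken.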

        

Let us train the model.

In [16]:
trainer.train(epochs = config['max_epoch'] - trainer.current_epoch, auto_save=True, metric_for_best_model='bleu', metric_objective='maximize', log_step=1,
              saving_directory = config['new_model_dir'])



For epoch 6: {Learning rate: [5.235838752223153e-05]}


Train batch number 164: 100%|██████████| 164/164 [00:42<00:00,  3.85batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:18<00:00,  1.84s/batches]



Metrics: {'train_loss': 25.17254326983196, 'test_loss': 22.810916709899903, 'bleu': 0.9354, 'gen_len': 7.0}




  0%|          | 1/7658 [01:04<137:18:32, 64.56s/it]

For epoch 7: {Learning rate: [6.283006502667784e-05]}


Train batch number 164: 100%|██████████| 164/164 [00:41<00:00,  3.94batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:19<00:00,  1.90s/batches]



Metrics: {'train_loss': 22.694559748579817, 'test_loss': 21.488267707824708, 'bleu': 0.6816, 'gen_len': 7.0}




  0%|          | 2/7658 [02:07<135:22:03, 63.65s/it]

For epoch 8: {Learning rate: [7.330174253112416e-05]}


Train batch number 164: 100%|██████████| 164/164 [00:42<00:00,  3.85batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:18<00:00,  1.87s/batches]



Metrics: {'train_loss': 21.025487684622043, 'test_loss': 19.922970581054688, 'bleu': 0.8345, 'gen_len': 7.0}




  0%|          | 3/7658 [03:10<135:07:07, 63.54s/it]

For epoch 9: {Learning rate: [8.377342003557045e-05]}


Train batch number 164: 100%|██████████| 164/164 [00:42<00:00,  3.83batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:19<00:00,  1.95s/batches]



Metrics: {'train_loss': 19.763988721661452, 'test_loss': 20.067963218688966, 'bleu': 0.3727, 'gen_len': 7.0}




  0%|          | 4/7658 [04:15<135:53:24, 63.91s/it]

For epoch 10: {Learning rate: [9.424509754001677e-05]}


Train batch number 164: 100%|██████████| 164/164 [00:48<00:00,  3.41batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:22<00:00,  2.28s/batches]



Metrics: {'train_loss': 18.89585805520779, 'test_loss': 18.763123416900633, 'bleu': 0.6127, 'gen_len': 8.0}




  0%|          | 5/7658 [05:28<142:49:56, 67.19s/it]

For epoch 11: {Learning rate: [0.00010471677504446306]}


Train batch number 164: 100%|██████████| 164/164 [00:49<00:00,  3.29batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:22<00:00,  2.22s/batches]



Metrics: {'train_loss': 18.004612713325315, 'test_loss': 17.962421703338624, 'bleu': 0.8481, 'gen_len': 7.0}




  0%|          | 6/7658 [06:42<147:54:03, 69.58s/it]

For epoch 12: {Learning rate: [0.00011518845254890938]}


Train batch number 164: 100%|██████████| 164/164 [00:42<00:00,  3.85batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:22<00:00,  2.25s/batches]



Metrics: {'train_loss': 17.304877121274064, 'test_loss': 19.071235752105714, 'bleu': 0.75, 'gen_len': 7.0}




  0%|          | 7/7658 [07:50<146:27:12, 68.91s/it]

For epoch 13: {Learning rate: [0.00012566013005335568]}


Train batch number 164: 100%|██████████| 164/164 [00:43<00:00,  3.78batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:19<00:00,  1.96s/batches]



Metrics: {'train_loss': 16.503681883579347, 'test_loss': 17.05331382751465, 'bleu': 0.63, 'gen_len': 7.0}




  0%|          | 8/7658 [08:59<146:57:27, 69.16s/it]

For epoch 14: {Learning rate: [0.000136131807557802]}


Train batch number 164: 100%|██████████| 164/164 [00:43<00:00,  3.74batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:19<00:00,  1.91s/batches]



Metrics: {'train_loss': 15.823595762252808, 'test_loss': 18.452527236938476, 'bleu': 0.7102, 'gen_len': 7.0}




  0%|          | 9/7658 [10:04<144:07:04, 67.83s/it]

For epoch 15: {Learning rate: [0.00014660348506224832]}


Train batch number 164: 100%|██████████| 164/164 [00:42<00:00,  3.86batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:19<00:00,  1.97s/batches]



Metrics: {'train_loss': 15.43495674540357, 'test_loss': 16.518132495880128, 'bleu': 0.431, 'gen_len': 8.0}




  0%|          | 10/7658 [11:08<141:40:45, 66.69s/it]

For epoch 16: {Learning rate: [0.00015707516256669463]}


Train batch number 164: 100%|██████████| 164/164 [00:44<00:00,  3.67batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:19<00:00,  1.96s/batches]



Metrics: {'train_loss': 14.792796794961138, 'test_loss': 17.613335800170898, 'bleu': 0.7231, 'gen_len': 8.0}




  0%|          | 11/7658 [12:15<141:32:59, 66.64s/it]

For epoch 17: {Learning rate: [0.0001675468400711409]}


Train batch number 164: 100%|██████████| 164/164 [00:45<00:00,  3.63batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:19<00:00,  1.95s/batches]



Metrics: {'train_loss': 14.484055321391036, 'test_loss': 18.315938663482665, 'bleu': 0.5124, 'gen_len': 7.0}




  0%|          | 12/7658 [13:22<141:32:44, 66.64s/it]

For epoch 18: {Learning rate: [0.0001780185175755872]}


Train batch number 164: 100%|██████████| 164/164 [00:41<00:00,  3.97batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:21<00:00,  2.18s/batches]



Metrics: {'train_loss': 13.276074848523955, 'test_loss': 16.692030334472655, 'bleu': 0.3781, 'gen_len': 7.7877}




  0%|          | 13/7658 [14:27<140:29:11, 66.15s/it]

For epoch 19: {Learning rate: [0.00018849019508003354]}


Train batch number 164: 100%|██████████| 164/164 [00:43<00:00,  3.80batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:21<00:00,  2.10s/batches]



Metrics: {'train_loss': 11.67305976879306, 'test_loss': 16.31410150527954, 'bleu': 0.3009, 'gen_len': 8.7055}




  0%|          | 14/7658 [15:33<140:32:05, 66.19s/it]

For epoch 20: {Learning rate: [0.00019896187258447985]}


Train batch number 164: 100%|██████████| 164/164 [00:41<00:00,  3.95batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:23<00:00,  2.32s/batches]



Metrics: {'train_loss': 11.60483580100827, 'test_loss': 15.374696254730225, 'bleu': 0.3983, 'gen_len': 7.5959}




  0%|          | 15/7658 [16:41<141:27:19, 66.63s/it]

For epoch 21: {Learning rate: [0.00020943355008892612]}


Train batch number 164: 100%|██████████| 164/164 [00:45<00:00,  3.64batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:21<00:00,  2.16s/batches]



Metrics: {'train_loss': 11.7785655056558, 'test_loss': 16.712066555023192, 'bleu': 0.7059, 'gen_len': 8.9726}




  0%|          | 16/7658 [17:49<142:34:16, 67.16s/it]

For epoch 22: {Learning rate: [0.00021990522759337246]}


Train batch number 164: 100%|██████████| 164/164 [00:42<00:00,  3.88batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:20<00:00,  2.08s/batches]



Metrics: {'train_loss': 11.041598311284693, 'test_loss': 13.52727108001709, 'bleu': 0.1913, 'gen_len': 6.5959}




  0%|          | 17/7658 [18:54<141:09:20, 66.50s/it]

For epoch 23: {Learning rate: [0.00023037690509781876]}


Train batch number 164: 100%|██████████| 164/164 [00:44<00:00,  3.67batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:21<00:00,  2.16s/batches]



Metrics: {'train_loss': 10.40609178891996, 'test_loss': 13.572932577133178, 'bleu': 0.4442, 'gen_len': 7.9315}




  0%|          | 18/7658 [20:02<142:20:10, 67.07s/it]

For epoch 24: {Learning rate: [0.00024084858260226506]}


Train batch number 164: 100%|██████████| 164/164 [00:44<00:00,  3.73batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:23<00:00,  2.32s/batches]



Metrics: {'train_loss': 11.525304282583841, 'test_loss': 15.881868267059327, 'bleu': 0.0881, 'gen_len': 8.726}




  0%|          | 19/7658 [21:12<143:47:32, 67.76s/it]

For epoch 25: {Learning rate: [0.00025132026010671137]}


Train batch number 164: 100%|██████████| 164/164 [00:43<00:00,  3.80batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:21<00:00,  2.19s/batches]



Metrics: {'train_loss': 10.475133506263175, 'test_loss': 14.993079662322998, 'bleu': 0.4345, 'gen_len': 7.1575}




  0%|          | 20/7658 [22:20<144:16:26, 68.00s/it]

For epoch 26: {Learning rate: [0.0002617919376111577]}


Train batch number 164: 100%|██████████| 164/164 [00:44<00:00,  3.69batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:23<00:00,  2.31s/batches]



Metrics: {'train_loss': 10.528580456245237, 'test_loss': 14.337573146820068, 'bleu': 0.2915, 'gen_len': 7.9658}




  0%|          | 21/7658 [23:31<145:52:49, 68.77s/it]

For epoch 27: {Learning rate: [0.000272263615115604]}


Train batch number 164: 100%|██████████| 164/164 [00:49<00:00,  3.33batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:19<00:00,  1.98s/batches]



Metrics: {'train_loss': 11.511181425757524, 'test_loss': 15.158411979675293, 'bleu': 0.3897, 'gen_len': 8.0}




  0%|          | 22/7658 [24:44<148:24:35, 69.97s/it]

For epoch 28: {Learning rate: [0.0002827352926200503]}


Train batch number 164: 100%|██████████| 164/164 [00:44<00:00,  3.68batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:22<00:00,  2.29s/batches]



Metrics: {'train_loss': 11.783760895089406, 'test_loss': 15.13263521194458, 'bleu': 0.4599, 'gen_len': 7.0}




  0%|          | 23/7658 [25:53<147:59:40, 69.78s/it]

For epoch 29: {Learning rate: [0.00029320697012449664]}


Train batch number 164: 100%|██████████| 164/164 [00:45<00:00,  3.63batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:22<00:00,  2.23s/batches]



Metrics: {'train_loss': 11.540030284625727, 'test_loss': 15.036835956573487, 'bleu': 0.3755, 'gen_len': 7.0}




  0%|          | 24/7658 [27:02<147:48:30, 69.70s/it]

For epoch 30: {Learning rate: [0.0003036786476289429]}


Train batch number 164: 100%|██████████| 164/164 [00:49<00:00,  3.33batches/s]
Test batch number 10: 100%|██████████| 10/10 [00:23<00:00,  2.34s/batches]



Metrics: {'train_loss': 10.84920192200963, 'test_loss': 15.045287132263184, 'bleu': 0.2928, 'gen_len': 8.0}




  0%|          | 25/7658 [28:17<151:02:08, 71.23s/it]

For epoch 31: {Learning rate: [0.00031415032513338925]}


Train batch number 147:  89%|████████▉ | 146/164 [00:44<00:05,  3.30batches/s]
  0%|          | 25/7658 [29:02<147:48:29, 69.71s/it]


KeyboardInterrupt: 