Custom Transformer Training
-------------------------------

In this notebook we train the custom transformer on multiple GPUs, when they are available, on a single machine. In [multiple](_custom_transformer_train_multiple.ipynb), we use SageMaker to distribute the training of the model over multiple instances.

We will go through the following steps:

- Load the libraries
- Create the function that recuperates the datasets (arguments: char_p, word_p, max_len, end_mark, tokenizer, corpus_1, corpus_2, train_file, test_file)
- Train the model, which is saved automatically (argument: the config dictionary initialized beforehand)
- Make predictions

➡️ Import the libraries.

In [2]:
from wolof_translate import *

# specify a seed for everything
lt.seed_everything(0)

Global seed set to 0


0
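The training cell later in this notebook prints a `huggingface/tokenizers` warning, most likely because `DataLoader` worker processes are forked after a tokenizer has already been used. The warning text itself suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it runs before any tokenizer is created (ideally before the imports above):

```python
import os

# Set this before any tokenizer is created, so the tokenizers backend
# does not use thread parallelism and forked worker processes no longer
# trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"false"` trades tokenizer-level parallelism for quiet, deadlock-free worker processes; whether that matters for throughput here has not been measured.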

➡️ Function to recuperate datasets

In [3]:
%%writefile wolof-translate/wolof_translate/utils/recuperate_datasets.py
from wolof_translate import *

def recuperate_datasets(char_p: float, word_p: float, max_len: int, end_mark: int, tokenizer: T5TokenizerFast,
                        corpus_1: str = 'french', corpus_2: str = 'wolof', 
                        train_file: str = 'data/extractions/new_data/train_set.csv', 
                        test_file: str = 'data/extractions/new_data/test_file.csv'):

  # Let us recuperate the end_mark adding option
  if end_mark == 1:
    # Create augmentation to add on French sentences
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space)

    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)
    
  else:
    
    if end_mark == 2:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!', replace = True)
    
    elif end_mark == 3:

      end_mark_fn = partial(add_end_mark)
    
    elif end_mark == 4:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!')
    
    else:  
        
        raise ValueError(f'No end mark number {end_mark}')

    # Create augmentation to add on French sentences
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
  # Recuperate the train dataset
  train_dataset_aug = SentenceDataset(train_file,
                                        tokenizer,
                                        truncation = False,
                                        cp1_transformer = fr_augmentation_1,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2
                                        )

  # Recuperate the valid dataset
  valid_dataset = SentenceDataset(test_file,
                                        tokenizer,
                                        cp1_transformer = fr_augmentation_2,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2,
                                        truncation = False)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

Overwriting wolof-translate/wolof_translate/utils/recuperate_datasets.py


In [4]:
%run wolof-translate/wolof_translate/utils/recuperate_datasets.py
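
`recuperate_datasets` uses `end_mark` to decide how `add_end_mark` is configured before it is appended to the transformation sequence. The same dispatch can be sketched with a lookup table instead of an `if`/`elif` chain. `add_end_mark_stub`, `END_MARK_KWARGS` and `build_end_mark_fn` are hypothetical names; the real `add_end_mark` lives in `wolof_translate` and its behaviour is not reproduced here, only the keyword arguments mirror the function above.

```python
from functools import partial


def add_end_mark_stub(sentence, end_mark_to_remove=None, replace=False):
    """Hypothetical stand-in for wolof_translate's add_end_mark.

    It only reports how it was configured; the real function's
    behaviour is not reproduced here.
    """
    return sentence, end_mark_to_remove, replace


# Keyword arguments per end_mark value, mirroring recuperate_datasets.
# end_mark == 1 means "no end-mark step", so it maps to None.
END_MARK_KWARGS = {
    1: None,
    2: {"end_mark_to_remove": "!", "replace": True},
    3: {},
    4: {"end_mark_to_remove": "!"},
}


def build_end_mark_fn(end_mark):
    """Return the configured end-mark function, or None for end_mark == 1."""
    if end_mark not in END_MARK_KWARGS:
        raise ValueError(f"No end mark number {end_mark}")
    kwargs = END_MARK_KWARGS[end_mark]
    return None if kwargs is None else partial(add_end_mark_stub, **kwargs)


print(build_end_mark_fn(2)("Bonjour"))
```

A table keeps the four cases side by side and makes the unsupported-value error a single check.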

➡️ Training

In [5]:
# initialize the configurations
config = {
    'epochs': 100,
    'max_epoch': None,
    'log_step': 5,
    'metric_for_best_model': 'test_loss',
    'metric_objective': 'minimize',
    'corpus_1': 'french',
    'corpus_2': 'wolof',
    'train_file': 'data/extractions/new_data/train_set.csv',
    'test_file': 'data/extractions/new_data/valid_set.csv',
    'drop_out_rate': 0.09208880734618193,
    'd_model': 256,
    'n_head': 4,
    'dim_ff': 1830,
    'n_encoders': 3,
    'n_decoders': 3,
    'learning_rate': None,
    'weight_decay': 0.0,
    'char_p': 0.4241110694728468,
    'word_p': 0.013007053766023914,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': [2, 31, 59, 87, 115, 143, 171],
    'batch_sizes': [256, 128, 64, 32, 16, 8, 4, 2],
    'batch_size': None, 
    'warmup_init': True,
    'relative_step': True,
    'num_workers': 0,
    'pin_memory': False,
    # --------------------> Must be changed when continuing a training
    'model_dir': 'custom_transformer_v6_fw_best',
    'new_model_dir': 'custom_transformer_v6_fw_2',
    'continue': False, # --------------------------> Must be changed when continuing training
    'logging_dir': 'data/logs/custom_transformer_fw_2',
    'save_best': True,
    'tokenizer_path': 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5.model',
    'data_directory': 'data/extractions/new_data/',
    'data_file': 'corpora_v6.csv',
    'version': 6,
    # in the case of a distributed training
    'backend': None,
    'hosts': [],
    'current_host': None,
    'num_gpus': 5,
    'logger': None,
    'return_trainer': True,
    'include_split': True,
}
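
The `boundaries` and `batch_sizes` entries describe length buckets: 7 boundaries define 8 buckets, and each bucket has its own batch size so that short sequences are batched in larger groups. How `SequenceLengthBatchSampler` assigns sequences to buckets is not shown in this notebook. The sketch below assumes the common convention that bucket `i` holds lengths from `boundaries[i-1]` (inclusive) up to `boundaries[i]` (exclusive); `batch_size_for_length` is a hypothetical helper.

```python
from bisect import bisect_right

boundaries = [2, 31, 59, 87, 115, 143, 171]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]

# One more batch size than boundaries: the last one covers every
# sequence at least as long as the final boundary.
assert len(batch_sizes) == len(boundaries) + 1


def batch_size_for_length(length):
    """Assumed rule: bucket i holds lengths in [boundaries[i-1], boundaries[i])."""
    return batch_sizes[bisect_right(boundaries, length)]


for length in (1, 20, 100, 200):
    print(length, batch_size_for_length(length))
```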

In [6]:
%%writefile wolof-translate/wolof_translate/utils/training.py
from wolof_translate import *
import warnings

def train(config: dict):
    
    # ---------------------------------------
    # add distribution if necessary (https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/mnist.py)
    
    logger = config['logger']
    
    is_distributed = len(config['hosts']) > 1 and config['backend'] is not None
    
    use_cuda = config['num_gpus'] > 0
    
    config.update({"num_workers": 1, "pin_memory": True} if use_cuda else {})

    if logger is not None:
        
        logger.debug("Distributed training - {}".format(is_distributed))
        
        logger.debug("Number of gpus available - {}".format(config['num_gpus']))
        
    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(config['hosts'])
        
        os.environ["WORLD_SIZE"] = str(world_size)
        
        host_rank = config['hosts'].index(config['current_host'])
        
        os.environ["RANK"] = str(host_rank)
        
        dist.init_process_group(backend=config['backend'], rank=host_rank, world_size=world_size)
        
        if logger is not None: logger.info(
            "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                config['backend'], dist.get_world_size()
            )
            + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), config['num_gpus'])
        )
    # ---------------------------------------
    
    # split the data
    if config['include_split']: split_data(config['random_state'], config['data_directory'], config['data_file'])

    # recuperate the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # recuperate train and test set
    train_dataset, test_dataset = recuperate_datasets(config['char_p'],
                                                        config['word_p'], config['max_len'],
                                                        config['end_mark'], tokenizer, config['corpus_1'],
                                                        config['corpus_2'],
                                                        config['train_file'], config['test_file'])
    
    # initialize the evaluation object
    evaluation = TranslationEvaluation(tokenizer, train_dataset.decode)

    # let us initialize the trainer
    trainer = ModelRunner(model = Transformer, version=config['version'], seed = 0, evaluation = evaluation, optimizer = Adafactor)

    # initialize the encoder and the decoder layers
    encoder_layer = nn.TransformerEncoderLayer(config['d_model'],
                                                config['n_head'],
                                                config['dim_ff'],
                                                config['drop_out_rate'], batch_first = True)

    decoder_layer = nn.TransformerDecoderLayer(config['d_model'],
                                                config['n_head'],
                                                config['dim_ff'],
                                                config['drop_out_rate'], batch_first = True)

    # let us initialize the encoder and the decoder
    encoder = nn.TransformerEncoder(encoder_layer, config['n_encoders'])

    decoder = nn.TransformerDecoder(decoder_layer, config['n_decoders'])

    #-------------------------------------
    # in the case when the linear learning rate scheduler with warmup is used
    
    # let us calculate the appropriate warmup steps (let us take a max epoch of 100)
    # length = len(train_dataset)

    # n_steps = length // config['batch_size']

    # num_steps = config['max_epoch'] * n_steps

    # warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

    # Initialize the scheduler parameters
    # scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}
    #-------------------------------------

    # Initialize the transformer parameters
    model_args = {
        'vocab_size': len(tokenizer),
        'encoder': encoder,
        'decoder': decoder,
        'class_criterion': nn.CrossEntropyLoss(label_smoothing = config['label_smoothing']),
        'max_len': config['max_len']
    }

    # Initialize the optimizer parameters
    optimizer_args = {
        'lr': config['learning_rate'],
        'weight_decay': config['weight_decay'],
        # 'betas': (0.9, 0.98),
        'warmup_init': config['warmup_init'],
        'relative_step': config['relative_step']
    }

    # ----------------------------
    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    train_sampler = SequenceLengthBatchSampler(train_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    # ------------------------------
    # initialize a bucket sampler with fixed batch size in the case of single machine
    # with parallelization on multiple gpus
    # train_sampler = BucketSampler(train_dataset, config['batch_size'])

    # test_sampler = BucketSampler(test_dataset, config['batch_size'])
    
    # ------------------------------

    # Initialize the loaders parameters
    train_loader_args = {'batch_sampler': train_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    # Add the datasets and hyperparameters to trainer
    trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                    test_loader_args, optimizer_kwargs = optimizer_args,
                    model_kwargs = model_args,
                    # lr_scheduler=get_linear_schedule_with_warmup,
                    # lr_scheduler_kwargs=scheduler_args,
                    predict_with_generate = True,
                    is_distributed=is_distributed,
                    logging_dir=config['logging_dir'],
                    dist=dist
                    )

    # load the model
    trainer.load(config['model_dir'], load_best = not config['continue'])
    
    # Train the model
    trainer.train(config['epochs'] - trainer.current_epoch, auto_save = True, log_step = config['log_step'], saving_directory=config['new_model_dir'], save_best = config['save_best'],
                  metric_for_best_model = config['metric_for_best_model'], metric_objective = config['metric_objective'])
    
    if config['return_trainer']:
        
        return trainer
    
    return None


Overwriting wolof-translate/wolof_translate/utils/training.py
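
`train` derives the process rank from the position of `current_host` in `hosts`, and treats the run as distributed only when more than one host and a backend are given. A minimal sketch of that bookkeeping, without calling `dist.init_process_group`; `distributed_env` and the host names are hypothetical.

```python
def distributed_env(hosts, current_host, backend):
    """Return (is_distributed, env) following the logic in train()."""
    is_distributed = len(hosts) > 1 and backend is not None
    if not is_distributed:
        return False, {}
    return True, {
        "WORLD_SIZE": str(len(hosts)),
        "RANK": str(hosts.index(current_host)),
    }


# Single machine, as configured in this notebook: nothing to set.
print(distributed_env([], None, None))

# Two hypothetical hosts: the second one gets rank 1.
print(distributed_env(["algo-1", "algo-2"], "algo-2", "gloo"))
```

With the configuration used here (`hosts` empty, `backend` set to `None`), the distributed branch is skipped entirely.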


Below, we train the model and, if needed, save it.

In [7]:
from wolof_translate.utils.training import train

In [None]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/95 [00:00<?, ?it/s]

For epoch 6: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Train batch number 42: 100%|██████████| 41/41 [00:10<00:00,  3.90batches/s]
  1%|          | 1/95 [00:10<16:27, 10.50s/it]



For epoch 7: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.11batches/s]
  2%|▏         | 2/95 [00:20<15:48, 10.20s/it]



For epoch 8: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.23batches/s]
  3%|▎         | 3/95 [00:30<15:17,  9.97s/it]



For epoch 9: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.15batches/s]
  4%|▍         | 4/95 [00:40<15:04,  9.94s/it]



For epoch 10: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.10batches/s]
  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)
  return torch._native_multi_head_attention(


Test batch number 10: 100%|██████████| 9/9 [00:10<00:00,  1.22s/batches]



Metrics: {'train_loss': 7.678882628483806, 'test_loss': 7.75705405548736, 'accuracy': 0.07233671328671328, 'bleu': 0.1418608391608392, 'gen_len': 34.04895104895105}




  5%|▌         | 5/95 [01:01<21:07, 14.08s/it]

For epoch 11: 






Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.21batches/s]
  6%|▋         | 6/95 [01:11<18:42, 12.61s/it]



For epoch 12: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:10<00:00,  3.82batches/s]
  7%|▋         | 7/95 [01:21<17:35, 12.00s/it]



For epoch 13: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.14batches/s]
  8%|▊         | 8/95 [01:31<16:26, 11.33s/it]



For epoch 14: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.12batches/s]
  9%|▉         | 9/95 [01:41<15:37, 10.90s/it]



For epoch 15: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.27batches/s]
Test batch number 10: 100%|██████████| 9/9 [00:10<00:00,  1.19s/batches]



Metrics: {'train_loss': 6.723200789486209, 'test_loss': 7.204742138202373, 'accuracy': 0.07606468531468531, 'bleu': 0.15678181818181822, 'gen_len': 34.04895104895105}




 11%|█         | 10/95 [02:02<19:47, 13.97s/it]

For epoch 16: 






Train batch number 42: 100%|██████████| 41/41 [00:10<00:00,  3.96batches/s]
 12%|█▏        | 11/95 [02:13<18:00, 12.86s/it]



For epoch 17: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.28batches/s]
 13%|█▎        | 12/95 [02:22<16:24, 11.86s/it]



For epoch 18: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.18batches/s]
 14%|█▎        | 13/95 [02:32<15:22, 11.25s/it]



For epoch 19: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.28batches/s]
 15%|█▍        | 14/95 [02:42<14:30, 10.74s/it]



For epoch 20: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.31batches/s]
Test batch number 10: 100%|██████████| 9/9 [00:10<00:00,  1.21s/batches]



Metrics: {'train_loss': 6.123343567791912, 'test_loss': 6.983398509192301, 'accuracy': 0.10837412587412587, 'bleu': 0.14396888111888112, 'gen_len': 34.04895104895105}




 16%|█▌        | 15/95 [03:02<18:22, 13.78s/it]

For epoch 21: 






Train batch number 42: 100%|██████████| 41/41 [00:10<00:00,  3.89batches/s]
 17%|█▋        | 16/95 [03:13<16:51, 12.80s/it]



For epoch 22: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.10batches/s]
 18%|█▊        | 17/95 [03:23<15:32, 11.96s/it]



For epoch 23: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.32batches/s]
 19%|█▉        | 18/95 [03:32<14:23, 11.22s/it]



For epoch 24: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.18batches/s]
 20%|██        | 19/95 [03:42<13:40, 10.80s/it]



For epoch 25: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:10<00:00,  3.73batches/s]
Test batch number 10: 100%|██████████| 9/9 [00:10<00:00,  1.20s/batches]



Metrics: {'train_loss': 5.769862362845373, 'test_loss': 6.949723965638167, 'accuracy': 0.08115314685314687, 'bleu': 0.18209160839160834, 'gen_len': 34.04895104895105}




 21%|██        | 20/95 [04:05<17:49, 14.26s/it]

For epoch 26: 






Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.18batches/s]
 22%|██▏       | 21/95 [04:14<15:56, 12.92s/it]



For epoch 27: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:09<00:00,  4.21batches/s]
 23%|██▎       | 22/95 [04:24<14:34, 11.97s/it]



For epoch 28: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:10<00:00,  4.10batches/s]
 24%|██▍       | 23/95 [04:34<13:39, 11.38s/it]



For epoch 29: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:10<00:00,  3.74batches/s]
 25%|██▌       | 24/95 [04:45<13:19, 11.26s/it]



For epoch 30: 


Train batch number 2:   0%|          | 0/41 [00:00<?, ?batches/s]



Train batch number 42: 100%|██████████| 41/41 [00:10<00:00,  4.03batches/s]
Test batch number 10: 100%|██████████| 9/9 [00:10<00:00,  1.21s/batches]



Metrics: {'train_loss': 5.499035483528341, 'test_loss': 7.266024682905291, 'accuracy': 0.05013006993006993, 'bleu': 0.14280244755244756, 'gen_len': 34.04895104895105}




 26%|██▋       | 25/95 [05:06<16:39, 14.27s/it]

For epoch 31: 






Train batch number 40:  93%|█████████▎| 38/41 [00:09<00:00,  7.78batches/s]
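
With `metric_for_best_model` set to `'test_loss'` and `metric_objective` set to `'minimize'`, the checkpoint kept as best is presumably the one with the lowest validation loss. The sketch below applies that rule to the five `Metrics` lines logged above, with values rounded to four decimals and epoch numbers taken from the `For epoch N` lines that precede each evaluation. How `ModelRunner` implements the selection is not shown in this notebook, so `best_epoch` is a hypothetical helper.

```python
# (epoch, train_loss, test_loss) copied from the log above, rounded.
history = [
    (10, 7.6789, 7.7571),
    (15, 6.7232, 7.2047),
    (20, 6.1233, 6.9834),
    (25, 5.7699, 6.9497),
    (30, 5.4990, 7.2660),
]


def best_epoch(history, objective="minimize"):
    """Pick the epoch with the best test loss under the given objective."""
    pick = min if objective == "minimize" else max
    return pick(history, key=lambda row: row[2])[0]


print("best epoch:", best_epoch(history))

# Gap between test and train loss at each evaluation.
for epoch, train_loss, test_loss in history:
    print(epoch, round(test_loss - train_loss, 4))
```

On these numbers the training loss keeps falling while the validation loss rises again at epoch 30, so the gap between the two widens at every evaluation.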

➡️ Predictions


In [15]:
if trainer is not None:
    
    # recuperate the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # recuperate the test dataset
    # initialize the transformation sequence
    end_mark_fn = partial(add_end_mark)
    augmentation = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)


    # let us get the test set
    test_dataset = SentenceDataset(f"{config['data_directory']}test_set.csv",
                                            tokenizer = tokenizer,
                                            cp1_transformer = augmentation,
                                            cp2_transformer = augmentation,
                                            corpus_1=config['corpus_1'],
                                            corpus_2=config['corpus_2'],
                                            truncation = False)

    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                            'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    metrics, prediction = trainer.evaluate(test_dataset, test_loader_args)


Evaluation batch number 8: 100%|██████████| 7/7 [00:18<00:00,  2.67s/batches]


In [16]:
metrics

{'test_loss': 8.648195539202009,
 'accuracy': 0.012357142857142858,
 'bleu': 0.12092857142857141,
 'gen_len': 82.14285714285714}

In [17]:
prediction

Unnamed: 0,original_sentences,translations,predictions
0,Es-tu en paix?,Jaam nga am?,Naka Yan ŋga ŋga ŋga ŋga ŋga ŋga ŋgaest
1,Il a vu tes autres amis!,Gis na sa xarit yeneen yi!,Gis Gis ngi ngia am am am am am
2,Donne deux fois rien que là!,Joxeel foofu doŋŋ ñari yoon!,Ci nataal bi waay dem dem dem moo moo moo
3,Pourquoi?,Lu waral?,A Yan naa yiy yiyestestest
4,Cela simplement!,Loolu doŋŋ!,Ci Seet bépp waay dem dem moo moo moo moo
...,...,...,...
281,"Ce n'est que longtemps après, quand l'égoïsme ...","Teg nañ ciy ati-at ma door a jëli ni jigéen, n...",Ci Baay ko ñu ñu ñu ñu ñu ñu ñu ñu ñu ñu ñu ñu...
282,"J'ai ressenti de l'étonnement, et même de l'in...","Li wóor te wér moo di ne bi loolu lépp weesoo,...",A A ko ñu ñi ñi ñi ñi ñi ñi ñi ñi ñi ñi ñi ñi ...
283,À quel point les arbres aux troncs rectilignes...,"Dàtti garab yaa ngi lunk, sànneeku jëm ca kow,...",Ciyyy waay ñi ñi ñi ñi ñi ñi ñi ñi ñi ñi ñi ñi...
284,Je peux ressentir l'émotion qu'il éprouve à tr...,"Li koy yëngal noonu, xam naa ko. Lan moo ko dà...",Ay ko ko waay ñi ñi ñi ñi ñi ñi ñi ñi ñi ñi ñi...
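
Most predictions above collapse into one token repeated until the end of the sequence. One quick way to quantify this is the longest run of identical consecutive tokens. The sketch below splits on whitespace, which is an assumption and not the tokenizer used by the model; `longest_run` is a hypothetical helper.

```python
from itertools import groupby


def longest_run(text):
    """Length of the longest run of identical whitespace-separated tokens."""
    tokens = text.split()
    return max((len(list(group)) for _, group in groupby(tokens)), default=0)


# Three predictions copied from the table above.
predictions = [
    "Naka Yan ŋga ŋga ŋga ŋga ŋga ŋga ŋgaest",
    "Gis Gis ngi ngia am am am am am",
    "A Yan naa yiy yiyestestest",
]

for prediction in predictions:
    print(longest_run(prediction), prediction)
```

Applied to the full `prediction` table, a statistic like this could be tracked next to BLEU to see whether later checkpoints repeat less.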
