Custom Transformer Training
-------------------------------

In this notebook we train the custom transformer on multiple GPUs when they are available. All GPUs sit on a single machine. In [multiple](_custom_transformer_train_multiple.ipynb), we use SageMaker to distribute the training of the model over multiple instances.

We will follow these steps:

- Load the libraries
- Create a function to retrieve the datasets (arguments: char_p, word_p, max_len, end_mark, tokenizer, corpus_1, corpus_2, train_file, test_file)
- Train the model, which is saved automatically (argument: the config dictionary initialized beforehand)
- Make predictions

-------------------------------------------

#### French-Wolof v5

➡️ Import the libraries.

In [1]:
from wolof_translate import *

# specify a seed for everything
lt.seed_everything(0)

Global seed set to 0


0
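The training output further down repeatedly prints a `huggingface/tokenizers` warning about forking after parallelism has been used. As the warning itself suggests, one way to silence it is to set the `TOKENIZERS_PARALLELISM` environment variable explicitly. The snippet below is a minimal sketch of that suggestion, assuming it runs before the tokenizer is first used; whether `"false"` or `"true"` is the better choice for this project is not established in the notebook.

```python
import os

# Set the variable before the tokenizer is first used, so that worker
# processes forked later (for example by a DataLoader) inherit it.
# "false" turns off the tokenizers library's internal parallelism.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```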

➡️ Function to retrieve the datasets

In [2]:
%%writefile wolof-translate/wolof_translate/utils/recuperate_datasets.py
from wolof_translate import *
from functools import partial  # explicit import; used below to pre-configure add_end_mark

def recuperate_datasets(char_p: float, word_p: float, max_len: int, end_mark: int, tokenizer: T5TokenizerFast,
                        corpus_1: str = 'french', corpus_2: str = 'wolof', 
                        train_file: str = 'data/extractions/new_data/train_set.csv', 
                        test_file: str = 'data/extractions/new_data/test_file.csv'):

  # Pick the end-mark transformation that matches the end_mark option
  if end_mark == 1:
    # no end-mark transformation
    end_mark_fns = []
  elif end_mark == 2:
    end_mark_fns = [partial(add_end_mark, end_mark_to_remove = '!', replace = True)]
  elif end_mark == 3:
    end_mark_fns = [partial(add_end_mark)]
  elif end_mark == 4:
    end_mark_fns = [partial(add_end_mark, end_mark_to_remove = '!')]
  else:
    raise ValueError(f'No end mark number {end_mark}')

  # Text clean-up steps shared by both augmentation pipelines
  cleaning_steps = [remove_mark_space, delete_guillemet_space, add_mark_space, *end_mark_fns]

  # Augmentation for the training source sentences: keyboard-typo noise, then clean-up
  fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                           aug_word_max = max_len),
                                           *cleaning_steps)

  # Clean-up only: used for the training targets and for the validation set
  fr_augmentation_2 = TransformerSequences(*cleaning_steps)
    
  # Recuperate the train dataset
  train_dataset_aug = SentenceDataset(train_file,
                                        tokenizer,
                                        truncation = False,
                                        cp1_transformer = fr_augmentation_1,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2
                                        )

  # Recuperate the valid dataset
  valid_dataset = SentenceDataset(test_file,
                                        tokenizer,
                                        cp1_transformer = fr_augmentation_2,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2,
                                        truncation = False)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

Overwriting wolof-translate/wolof_translate/utils/recuperate_datasets.py


In [3]:
%run wolof-translate/wolof_translate/utils/recuperate_datasets.py
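The function above maps an integer option to a pre-configured callable with `functools.partial` and appends it to a chain of text transformations. The sketch below illustrates that pattern in isolation. `add_mark`, `compose`, and `build_pipeline` are hypothetical stand-ins written for this example; they are not the package's `add_end_mark` or `TransformerSequences`, whose exact behavior is not shown in this notebook.

```python
from functools import partial, reduce


def add_mark(sentence: str, mark: str = ".", end_marks: str = ".!?") -> str:
    """Toy stand-in: append `mark` unless the sentence already ends with punctuation."""
    sentence = sentence.rstrip()
    if sentence and sentence[-1] in end_marks:
        return sentence
    return sentence + mark


def compose(*fns):
    """Return a callable that applies the given callables from left to right."""
    return lambda text: reduce(lambda acc, fn: fn(acc), fns, text)


# Hypothetical option table: None means "no end-mark step".
OPTION_TO_FN = {
    1: None,
    2: partial(add_mark, mark="!"),
    3: partial(add_mark),
}


def build_pipeline(option: int):
    """Build a clean-up pipeline whose last step depends on `option`."""
    if option not in OPTION_TO_FN:
        raise ValueError(f"No end mark number {option}")
    steps = [str.strip]
    if OPTION_TO_FN[option] is not None:
        steps.append(OPTION_TO_FN[option])
    return compose(*steps)


result = build_pipeline(3)("  Bonjour tout le monde  ")
```

Keeping the option table separate from the pipeline construction means a new end-mark variant only needs one new table entry.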

➡️ Training

In [12]:
# initialize the configurations
config = {
    'epochs': 21,
    'max_epoch': None,
    'log_step': 1,
    'metric_for_best_model': 'test_loss',
    'metric_objective': 'minimize',
    'corpus_1': 'french',
    'corpus_2': 'wolof',
    'train_file': 'data/extractions/new_data/train_set.csv',
    'test_file': 'data/extractions/new_data/valid_set.csv',
    'drop_out_rate': 0.291121690756753,
    'd_model': 512,
    'n_head': 8,
    'dim_ff': 2024,
    'n_encoders': 6,
    'n_decoders': 6,
    'learning_rate': 1e-3,
    'weight_decay': 0.0,
    'char_p': 0.8986208054599546,
    'word_p': 0.7876712525708085,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': [2, 23, 43, 64, 84, 104],
    'batch_sizes': [256, 128, 64, 32, 16, 8, 4],
    'batch_size': None, 
    'warmup_init': False,
    'relative_step': False,
    'num_workers': 0,
    'pin_memory': False,
    # --------------------> Must be changed when continuing a training
    'model_dir': 't5_small_v5_fw',
    'new_model_dir': 't5_small_v5_fw',
    'continue': False, # --------------------------> Must be changed when continuing training
    'logging_dir': 'data/logs/t5_small_fw',
    'save_best': True,
    'tokenizer_path': 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v4.model',
    'data_directory': 'data/extractions/new_data/',
    'data_file': 'ad_sentences.csv',
    'version': 5,
    # in the case of a distributed training
    'backend': None,
    'hosts': [],
    'current_host': None,
    'num_gpus': 5,
    'logger': None,
    'return_trainer': True,
    'include_split': True,
}
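In the configuration above, `boundaries` has six entries and `batch_sizes` has seven, which is the usual shape for bucketing by sequence length: six boundaries delimit seven length buckets, and longer sequences get smaller batches. The sketch below shows one plausible way such a scheme works. It is an assumption for illustration, not the actual `SequenceLengthBatchSampler` from the package; in particular, which side of a boundary a length falls on is a guess.

```python
from bisect import bisect_right
from collections import defaultdict

boundaries = [2, 23, 43, 64, 84, 104]
batch_sizes = [256, 128, 64, 32, 16, 8, 4]


def bucket_index(length: int, boundaries: list) -> int:
    """Index of the length bucket; a length equal to a boundary goes to the upper bucket (assumed)."""
    return bisect_right(boundaries, length)


def make_batches(lengths: list, boundaries: list, batch_sizes: list) -> list:
    """Group sample indices into batches whose size depends on the length bucket."""
    buckets = defaultdict(list)
    batches = []
    for idx, length in enumerate(lengths):
        b = bucket_index(length, boundaries)
        buckets[b].append(idx)
        # Emit the batch as soon as the bucket reaches its configured size.
        if len(buckets[b]) == batch_sizes[b]:
            batches.append(buckets.pop(b))
    # Flush the partially filled buckets at the end.
    batches.extend(batch for batch in buckets.values() if batch)
    return batches


# Five long sequences fall in the last bucket, whose batch size is 4.
long_batches = make_batches([150] * 5, boundaries, batch_sizes)
```

With this layout, the memory cost of a batch stays roughly bounded because the batch size shrinks as the padded sequence length grows.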

In [13]:
%%writefile wolof-translate/wolof_translate/utils/hg_training.py
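from wolof_translate import *
import os  # explicit import; used to set WORLD_SIZE and RANK
import torch.distributed as dist  # explicit import; used to initialize the process group
import warnings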

def train(config: dict):
    
    # ---------------------------------------
    # add distribution if necessary (https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/mnist.py)
    
    logger = config['logger']
    
    is_distributed = len(config['hosts']) > 1 and config['backend'] is not None
    
    use_cuda = config['num_gpus'] > 0
    
    # use a worker process and pinned memory when GPUs are available
    if use_cuda:
        config.update({"num_workers": 1, "pin_memory": True})

    if logger is not None:
        
        logger.debug("Distributed training - {}".format(is_distributed))
        
        logger.debug("Number of gpus available - {}".format(config['num_gpus']))
        
    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(config['hosts'])
        
        os.environ["WORLD_SIZE"] = str(world_size)
        
        host_rank = config['hosts'].index(config['current_host'])
        
        os.environ["RANK"] = str(host_rank)
        
        dist.init_process_group(backend=config['backend'], rank=host_rank, world_size=world_size)
        
        if logger is not None:
            logger.info(
                "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                    config['backend'], dist.get_world_size()
                )
                + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), config['num_gpus'])
            )
    # ---------------------------------------
    
    # split the data into train and test files if requested
    if config['include_split']:
        split_data(config['random_state'], config['data_directory'], config['data_file'])

    # recuperate the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # Initialize the model name
    model_name = 't5-small'

    # import the model with its pre-trained weights
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # resize the token embeddings
    model.resize_token_embeddings(len(tokenizer))
    
    # recuperate train and test set
    train_dataset, test_dataset = recuperate_datasets(config['char_p'],
                                                        config['word_p'], config['max_len'],
                                                        config['end_mark'], tokenizer, config['corpus_1'],
                                                        config['corpus_2'],
                                                        config['train_file'], config['test_file'])
    
    # initialize the evaluation object
    evaluation = TranslationEvaluation(tokenizer, train_dataset.decode)

    # let us initialize the trainer
    trainer = ModelRunner(model = model, version=config['version'], seed = 0, evaluation = evaluation, optimizer = Adafactor)

    #-------------------------------------
    # in the case when the linear learning rate scheduler with warmup is used
    
    # let us calculate the appropriate warmup steps (let us take a max epoch of 100)
    # length = len(train_dataset)

    # n_steps = length // config['batch_size']

    # num_steps = config['max_epoch'] * n_steps

    # warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

    # Initialize the scheduler parameters
    # scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}
    #-------------------------------------

    # Initialize the optimizer parameters
    optimizer_args = {
        'lr': config['learning_rate'],
        'weight_decay': config['weight_decay'],
        # 'betas': (0.9, 0.98),
        'warmup_init': config['warmup_init'],
        'relative_step': config['relative_step']
    }

    # ----------------------------
    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    train_sampler = SequenceLengthBatchSampler(train_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    # ------------------------------
    # initialize a bucket sampler with fixed batch size in the case of single machine
    # with parallelization on multiple gpus
    # train_sampler = BucketSampler(train_dataset, config['batch_size'])

    # test_sampler = BucketSampler(test_dataset, config['batch_size'])
    
    # ------------------------------

    # Initialize the loaders parameters
    train_loader_args = {'batch_sampler': train_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    # Add the datasets and hyperparameters to trainer
    trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                    test_loader_args, optimizer_kwargs = optimizer_args,
                    # lr_scheduler=get_linear_schedule_with_warmup,
                    # lr_scheduler_kwargs=scheduler_args,
                    predict_with_generate = True,
                    hugging_face = True,
                    is_distributed=is_distributed,
                    logging_dir=config['logging_dir'],
                    dist=dist
                    )

    # load the model
    trainer.load(config['model_dir'], load_best = not config['continue'])
    
    # Train the model
    trainer.train(config['epochs'] - trainer.current_epoch,
                  auto_save = True,
                  log_step = config['log_step'],
                  saving_directory = config['new_model_dir'],
                  save_best = config['save_best'],
                  metric_for_best_model = config['metric_for_best_model'],
                  metric_objective = config['metric_objective'])
    
    if config['return_trainer']:
        
        return trainer
    
    return None


Overwriting wolof-translate/wolof_translate/utils/hg_training.py
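The `train` function decides whether to run in distributed mode from `config['hosts']` and `config['backend']`, then derives the world size and the rank of the current host. The sketch below isolates that logic as a pure function so it can be checked without a cluster. `distributed_settings` is a hypothetical helper written for this example, and the host names in the usage line are illustrative only.

```python
def distributed_settings(hosts: list, current_host, backend) -> dict:
    """Mirror the distributed-mode decision made in `train`, without side effects."""
    is_distributed = len(hosts) > 1 and backend is not None
    if not is_distributed:
        # Single-process fallback values (an assumption of this sketch).
        return {"is_distributed": False, "world_size": 1, "rank": 0}
    return {
        "is_distributed": True,
        "world_size": len(hosts),          # one process per host
        "rank": hosts.index(current_host), # position of this host in the list
    }


# Illustrative host names, not taken from the notebook.
settings = distributed_settings(["host-a", "host-b"], "host-b", "gloo")
```

With the configuration shown in this notebook (`hosts` empty and `backend` set to `None`), the function reports a non-distributed run.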


Train the model below, then save it if needed.

In [14]:
from wolof_translate.utils.hg_training import train

In [37]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/25 [00:00<?, ?it/s]

For epoch 6: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.41batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.23batches/s]



Metrics: {'train_loss': 2.5114653375916345, 'test_loss': 2.7761967109911367, 'bleu': 2.266399494949495, 'gen_len': 13.217162121212121}




  4%|▍         | 1/25 [00:10<04:22, 10.92s/it]

For epoch 7: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.40batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.98batches/s]



Metrics: {'train_loss': 2.42468051816903, 'test_loss': 2.6943533950381813, 'bleu': 2.4339045454545456, 'gen_len': 12.560631313131314}




  8%|▊         | 2/25 [00:22<04:17, 11.18s/it]

For epoch 8: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.49batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.98batches/s]



Metrics: {'train_loss': 2.3517192236461786, 'test_loss': 2.634918566906091, 'bleu': 2.3541358585858587, 'gen_len': 13.671723737373739}




 12%|█▏        | 3/25 [00:33<04:06, 11.22s/it]

For epoch 9: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.39batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.99batches/s]



Metrics: {'train_loss': 2.2777167031615697, 'test_loss': 2.590723254463889, 'bleu': 2.4163686868686867, 'gen_len': 14.858569696969699}




 16%|█▌        | 4/25 [00:44<03:56, 11.26s/it]

For epoch 10: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.29batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.13batches/s]



Metrics: {'train_loss': 2.223224872882271, 'test_loss': 2.6014229601079766, 'bleu': 2.3500434343434353, 'gen_len': 11.525243434343436}




 20%|██        | 5/25 [00:54<03:35, 10.80s/it]

For epoch 11: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:06<00:00,  7.09batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.07batches/s]



Metrics: {'train_loss': 2.1764694935607425, 'test_loss': 2.5320193478555386, 'bleu': 2.618009595959596, 'gen_len': 12.217173737373738}




 24%|██▍       | 6/25 [01:06<03:29, 11.01s/it]

For epoch 12: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.34batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.32batches/s]



Metrics: {'train_loss': 2.1303964158841753, 'test_loss': 2.5434020721551147, 'bleu': 2.744057070707071, 'gen_len': 11.03030404040404}




 28%|██▊       | 7/25 [01:15<03:09, 10.55s/it]

For epoch 13: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.40batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.02batches/s]



Metrics: {'train_loss': 2.0804740817089926, 'test_loss': 2.5099077537806354, 'bleu': 2.563536868686869, 'gen_len': 12.616167676767677}




 32%|███▏      | 8/25 [01:27<03:03, 10.78s/it]

For epoch 14: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.33batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.17batches/s]



Metrics: {'train_loss': 2.0455848709229487, 'test_loss': 2.52968150919134, 'bleu': 3.1110070707070716, 'gen_len': 11.015133333333333}




 36%|███▌      | 9/25 [01:36<02:47, 10.47s/it]

For epoch 15: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.37batches/s]
Test batch number 3:  14%|█▍        | 1/7 [00:00<00:01,  4.58batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.09batches/s]



Metrics: {'train_loss': 1.9926729948742343, 'test_loss': 2.5232395668222445, 'bleu': 2.864400505050505, 'gen_len': 11.368671212121212}




 40%|████      | 10/25 [01:46<02:34, 10.30s/it]

For epoch 16: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.33batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.00batches/s]



Metrics: {'train_loss': 1.9518446629745607, 'test_loss': 2.5120888189835986, 'bleu': 2.8665191919191924, 'gen_len': 11.176735858585861}




 44%|████▍     | 11/25 [01:56<02:23, 10.23s/it]

For epoch 17: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.32batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.07batches/s]



Metrics: {'train_loss': 1.9194935183974475, 'test_loss': 2.5144295403451626, 'bleu': 3.2974808080808082, 'gen_len': 11.974762626262626}




 48%|████▊     | 12/25 [02:06<02:11, 10.13s/it]

For epoch 18: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:06<00:00,  7.16batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.09batches/s]



Metrics: {'train_loss': 1.8835548197144882, 'test_loss': 2.5088922808868714, 'bleu': 2.9407616161616166, 'gen_len': 11.671747474747475}




 52%|█████▏    | 13/25 [02:18<02:05, 10.50s/it]

For epoch 19: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.24batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.14batches/s]



Metrics: {'train_loss': 1.842434627055217, 'test_loss': 2.5319321829863273, 'bleu': 2.9120015151515157, 'gen_len': 10.03538484848485}




 56%|█████▌    | 14/25 [02:28<01:53, 10.31s/it]

For epoch 20: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.34batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.05batches/s]



Metrics: {'train_loss': 1.8105094993061694, 'test_loss': 2.497430601505318, 'bleu': 2.875379292929293, 'gen_len': 11.111095454545456}




 60%|██████    | 15/25 [02:39<01:45, 10.59s/it]

For epoch 21: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.26batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.26batches/s]



Metrics: {'train_loss': 1.7762581537525384, 'test_loss': 2.4893684290876292, 'bleu': 3.0885227272727276, 'gen_len': 11.181794444444446}




 64%|██████▍   | 16/25 [02:50<01:36, 10.72s/it]

For epoch 22: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.25batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:02<00:00,  2.47batches/s]



Metrics: {'train_loss': 1.7403693729228116, 'test_loss': 2.5324128372500643, 'bleu': 2.810114141414142, 'gen_len': 9.777791919191921}




 68%|██████▊   | 17/25 [02:59<01:22, 10.36s/it]

For epoch 23: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.26batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.31batches/s]



Metrics: {'train_loss': 1.6994967128113057, 'test_loss': 2.506803387343282, 'bleu': 2.749485858585859, 'gen_len': 10.61617676767677}




 72%|███████▏  | 18/25 [03:09<01:11, 10.17s/it]

For epoch 24: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.27batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.06batches/s]



Metrics: {'train_loss': 1.6744661027309942, 'test_loss': 2.52493722992714, 'bleu': 2.8192727272727276, 'gen_len': 12.17172676767677}




 76%|███████▌  | 19/25 [03:19<01:00, 10.13s/it]

For epoch 25: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.19batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.08batches/s]



Metrics: {'train_loss': 1.6336687108069412, 'test_loss': 2.5186826291710447, 'bleu': 3.126379797979798, 'gen_len': 11.545429292929294}




 80%|████████  | 20/25 [03:29<00:50, 10.12s/it]

For epoch 26: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.23batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.04batches/s]



Metrics: {'train_loss': 1.595658942463276, 'test_loss': 2.5293197246512986, 'bleu': 3.3226782828282833, 'gen_len': 11.797979292929293}




 84%|████████▍ | 21/25 [03:39<00:40, 10.10s/it]

For epoch 27: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.20batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.08batches/s]



Metrics: {'train_loss': 1.5806173942723958, 'test_loss': 2.5458529356754185, 'bleu': 3.031581818181818, 'gen_len': 12.040430303030304}




 88%|████████▊ | 22/25 [03:49<00:30, 10.07s/it]

For epoch 28: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.21batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.96batches/s]



Metrics: {'train_loss': 1.5406483437802352, 'test_loss': 2.5303414036529235, 'bleu': 3.12880101010101, 'gen_len': 12.808074747474748}




 92%|█████████▏| 23/25 [03:59<00:20, 10.11s/it]

For epoch 29: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:05<00:00,  7.20batches/s]
Test batch number 3:  14%|█▍        | 1/7 [00:00<00:01,  4.60batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.09batches/s]



Metrics: {'train_loss': 1.5093537220692503, 'test_loss': 2.5414568535005206, 'bleu': 3.5049414141414146, 'gen_len': 12.07069797979798}




 96%|█████████▌| 24/25 [04:09<00:10, 10.09s/it]

For epoch 30: 


Train batch number 2:   0%|          | 0/43 [00:00<?, ?batches/s]



Train batch number 44: 100%|██████████| 43/43 [00:06<00:00,  7.15batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.11batches/s]



Metrics: {'train_loss': 1.4772157685655651, 'test_loss': 2.5400167571173777, 'bleu': 3.1790439393939396, 'gen_len': 12.126261111111113}




100%|██████████| 25/25 [04:20<00:00, 10.40s/it]
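
The `huggingface/tokenizers` warning that recurs in the log above is printed when a process is forked after the tokenizer has already used its internal parallelism. As the message itself suggests, one way to silence it is to set `TOKENIZERS_PARALLELISM` explicitly before the tokenizer is first used. A minimal sketch follows; placing it in the first cell, ahead of `from wolof_translate import *`, is an assumption about this notebook rather than something the source shows.

```python
import os

# Set this before any tokenizer is created or used, so that later forks
# (typically data-loader workers) no longer trigger the warning.
# "false" turns off the tokenizer's internal thread pool.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Disabling the thread pool may slow tokenization slightly, which is usually a reasonable trade for a readable training log.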


In [15]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

0it [00:00, ?it/s]


➡️ Predictions


In [16]:
if trainer is not None:
    
    # recuperate the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # recuperate the test dataset
    # initialize the transformation sequence
    end_mark_fn = partial(add_end_mark)
    augmentation = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)


    # let us get the test set
    test_dataset = SentenceDataset(f"{config['data_directory']}test_set.csv",
                                            tokenizer = tokenizer,
                                            cp1_transformer = augmentation,
                                            cp2_transformer = augmentation,
                                            corpus_1=config['corpus_1'],
                                            corpus_2=config['corpus_2'],
                                            truncation = False)

    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                            'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    metrics, prediction = trainer.evaluate(test_dataset, test_loader_args)


Evaluation batch number 2:   0%|          | 0/6 [00:00<?, ?batches/s]



Evaluation batch number 7: 100%|██████████| 6/6 [00:03<00:00,  1.64batches/s]


In [17]:
metrics

{'test_loss': 2.249727627243659,
 'bleu': 2.7937212121212123,
 'gen_len': 7.303034343434343}

In [18]:
prediction

Unnamed: 0,original_sentences,translations,predictions
0,C'est des femmes.,Jigéen lanu.,Nit la.
1,Cet homme qui avait voulu.,Góor gii bëggóon.,Góor gii ŋga dem.
2,Par ici?,Ci fii?,Ci wax?
3,Qu'il entre!,Na dugg ci biir su bëggée!,Koo dem!
4,À l'intérieur si tu ne veux pas!,Ci biir soo bëggul!,Ci biir!
...,...,...,...
193,"Ceux-là, cependant, sont des cases. Celle qui ...","Waaw lii nag ay néegi ñax la, néegi ñax bi ci ...","Lii ab néeg la, néeg bi dañ kooale, néeg bi d..."
194,On voit sur la photo beaucoup de personnes sor...,Ñu gis ci nataal bi ay nit ñu bari ñu génn ci ...,Nataal bii de gis naa ci benn bool bu weex ak...
195,"Ceux-là, aussi, sont des gendarmes. Ils siègen...",Ñii moom tamit ay takk-der nañ. Ñi ñi ngi bàyy...,"Lii ab néeg la, néeg bi dañ kooale, néeg bi d..."
196,"Ceci, cependant, on a l'habitude de faire les ...",Lii nag dañ ciy faral di def ndugg maanaam jig...,Waaw nataal bii de ay bunt yu dóomu-taal moo ...
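
With a BLEU score this low, it can help to look at individual rows directly. The sketch below is not part of `wolof_translate`: `tokenize` and `word_precision` are hypothetical helpers that compute a simple clipped word precision (the share of predicted words that also appear in the reference). It is applied to rows 0 to 4 of the table above and is only a rough complement to BLEU.

```python
import re
import unicodedata
from collections import Counter

def tokenize(sentence: str) -> list:
    """Lowercase, NFC-normalize and split a sentence into word tokens."""
    normalized = unicodedata.normalize("NFC", sentence.lower())
    return re.findall(r"\w+", normalized)

def word_precision(reference: str, prediction: str) -> float:
    """Fraction of predicted words that also occur in the reference (clipped counts)."""
    ref_counts = Counter(tokenize(reference))
    pred_counts = Counter(tokenize(prediction))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[word]) for word, count in pred_counts.items())
    return matched / total

# (reference translation, model prediction) pairs copied from rows 0-4 above
rows = [
    ("Jigéen lanu.", "Nit la."),
    ("Góor gii bëggóon.", "Góor gii ŋga dem."),
    ("Ci fii?", "Ci wax?"),
    ("Na dugg ci biir su bëggée!", "Koo dem!"),
    ("Ci biir soo bëggul!", "Ci biir!"),
]

scores = [word_precision(ref, pred) for ref, pred in rows]
for (ref, pred), score in zip(rows, scores):
    print(f"{score:.2f}  {pred!r} vs {ref!r}")
```

Row 4 scores a perfect precision even though the prediction drops half of the reference, so precision alone rewards short outputs; BLEU's brevity penalty exists to counter exactly that.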


----------------------------------

#### Wolof-French v5

➡️ Import the libraries.

In [1]:
from wolof_translate import *

# specify a seed for everything
lt.seed_everything(0)

Global seed set to 0


0

➡️ Function to recuperate datasets

In [2]:
%%writefile wolof-translate/wolof_translate/utils/recuperate_datasets.py
from wolof_translate import *

def recuperate_datasets(char_p: float, word_p: float, max_len: int, end_mark: int, tokenizer: T5TokenizerFast,
                        corpus_1: str = 'french', corpus_2: str = 'wolof', 
                        train_file: str = 'data/extractions/new_data/train_set.csv', 
                        test_file: str = 'data/extractions/new_data/test_file.csv'):

  # Choose how the end mark is added, depending on `end_mark`
  if end_mark == 1:
    # Augmentation for the corpus_1 sentences (French by default): keyboard-typo noise plus punctuation clean-up
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space)

    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)
    
  else:
    
    if end_mark == 2:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!', replace = True)
    
    elif end_mark == 3:

      end_mark_fn = partial(add_end_mark)
    
    elif end_mark == 4:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!')
    
    else:

      raise ValueError(f'No end mark number {end_mark}')

    # Augmentation for the corpus_1 sentences (French by default): keyboard-typo noise, punctuation clean-up and end mark
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
  # Recuperate the train dataset
  train_dataset_aug = SentenceDataset(train_file,
                                        tokenizer,
                                        truncation = False,
                                        cp1_transformer = fr_augmentation_1,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2
                                        )

  # Recuperate the valid dataset
  valid_dataset = SentenceDataset(test_file,
                                        tokenizer,
                                        cp1_transformer = fr_augmentation_2,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2,
                                        truncation = False)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

Overwriting wolof-translate/wolof_translate/utils/recuperate_datasets.py
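
`TransformerSequences` is used above as an ordered pipeline of text transformations, with `partial` fixing the end-mark options ahead of time. Its implementation is not shown in this notebook, so the sketch below only illustrates the general pattern. `compose`, `collapse_spaces` and `ensure_end_mark` are hypothetical stand-ins and do not come from `wolof_translate`.

```python
from functools import partial, reduce
from typing import Callable

def compose(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that applies `steps` from left to right."""
    def run(sentence: str) -> str:
        return reduce(lambda text, step: step(text), steps, sentence)
    return run

def collapse_spaces(sentence: str) -> str:
    """Hypothetical cleaning step: squeeze runs of whitespace into one space."""
    return " ".join(sentence.split())

def ensure_end_mark(sentence: str, end_mark: str = ".") -> str:
    """Hypothetical end-mark step: append `end_mark` if there is no final punctuation."""
    return sentence if sentence.endswith((".", "!", "?")) else sentence + end_mark

# `partial` pre-binds the option, so every step takes a single sentence argument
pipeline = compose(collapse_spaces, partial(ensure_end_mark, end_mark="."))
print(pipeline("Ci   biir soo  bëggul"))
```

Because each step maps a string to a string, the same cleaning steps can be reused with or without the noisy augmentation in front, which is how the two `fr_augmentation_*` pipelines above differ.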


In [3]:
%run wolof-translate/wolof_translate/utils/recuperate_datasets.py

➡️ Training

In [12]:
# initialize the configurations
config = {
    'epochs': 29,
    'max_epoch': None,
    'log_step': 1,
    'metric_for_best_model': 'test_loss',
    'metric_objective': 'minimize',
    'corpus_1': 'wolof',
    'corpus_2': 'french',
    'train_file': 'data/extractions/new_data/train_set.csv',
    'test_file': 'data/extractions/new_data/valid_set.csv',
    'drop_out_rate': 0.291121690756753,
    'd_model': 512,
    'n_head': 8,
    'dim_ff': 2024,
    'n_encoders': 6,
    'n_decoders': 6,
    'learning_rate': 1e-3,
    'weight_decay': 0.0,
    'char_p': 0.5275538662009825,
    'word_p': 0.8981250882159111,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': [2, 23, 43, 64, 84, 104],
    'batch_sizes': [256, 128, 64, 32, 16, 8, 4],
    'batch_size': None, 
    'warmup_init': False,
    'relative_step': False,
    'num_workers': 0,
    'pin_memory': False,
    # --------------------> Must be changed when continuing a training
    'model_dir': 't5_small_v5_wf',
    'new_model_dir': 't5_small_v5_wf',
    'continue': False, # --------------------------> Must be changed when continuing training
    'logging_dir': 'data/logs/t5_small_wf',
    'save_best': True,
    'tokenizer_path': 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v4.model',
    'data_directory': 'data/extractions/new_data/',
    'data_file': 'ad_sentences.csv',
    'version': 5,
    # in the case of a distributed training
    'backend': None,
    'hosts': [],
    'current_host': None,
    'num_gpus': 5,
    'logger': None,
    'return_trainer': True,
    'include_split': True,
}
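
The `boundaries` and `batch_sizes` entries go together: six boundaries split sequence lengths into seven buckets, and each bucket gets its own batch size, with long sequences batched in smaller groups. The exact rule used by `SequenceLengthBatchSampler` is not shown here, so the sketch below assumes that bucket `i` holds lengths strictly below `boundaries[i]`; `batch_size_for_length` is a hypothetical helper.

```python
from bisect import bisect_right

boundaries = [2, 23, 43, 64, 84, 104]
batch_sizes = [256, 128, 64, 32, 16, 8, 4]

# One more batch size than boundaries: the last one covers the longest sequences.
assert len(batch_sizes) == len(boundaries) + 1

def batch_size_for_length(length: int) -> int:
    """Assumed rule: bucket i holds lengths strictly below boundaries[i]."""
    return batch_sizes[bisect_right(boundaries, length)]

for length in (1, 10, 50, 104, 300):
    print(length, "->", batch_size_for_length(length))
```

Under this assumption, a batch of 4 sequences of 104 or more tokens costs roughly as much memory as a batch of 128 short ones, which keeps the per-batch token count within a comparable range.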

In [13]:
%%writefile wolof-translate/wolof_translate/utils/hg_training.py
from wolof_translate import *
import warnings

def train(config: dict):
    
    # ---------------------------------------
    # add distribution if necessary (https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/mnist.py)
    
    logger = config['logger']
    
    is_distributed = len(config['hosts']) > 1 and config['backend'] is not None
    
    use_cuda = config['num_gpus'] > 0
    
    config.update({"num_workers": 1, "pin_memory": True} if use_cuda else {})

    if logger is not None:
        
        logger.debug("Distributed training - {}".format(is_distributed))
        
        logger.debug("Number of gpus available - {}".format(config['num_gpus']))
        
    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(config['hosts'])
        
        os.environ["WORLD_SIZE"] = str(world_size)
        
        host_rank = config['hosts'].index(config['current_host'])
        
        os.environ["RANK"] = str(host_rank)
        
        dist.init_process_group(backend=config['backend'], rank=host_rank, world_size=world_size)
        
        if logger is not None: logger.info(
            "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                config['backend'], dist.get_world_size()
            )
            + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), config['num_gpus'])
        )
    # ---------------------------------------
    
    # split the data
    if config['include_split']: split_data(config['random_state'], config['data_directory'], config['data_file'])

    # recuperate the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # Initialize the model name
    model_name = 't5-small'

    # import the model with its pre-trained weights
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # resize the token embeddings
    model.resize_token_embeddings(len(tokenizer))
    
    # recuperate train and test set
    train_dataset, test_dataset = recuperate_datasets(config['char_p'],
                                                        config['word_p'], config['max_len'],
                                                        config['end_mark'], tokenizer, config['corpus_1'],
                                                        config['corpus_2'],
                                                        config['train_file'], config['test_file'])
    
    # initialize the evaluation object
    evaluation = TranslationEvaluation(tokenizer, train_dataset.decode)

    # let us initialize the trainer
    trainer = ModelRunner(model = model, version=config['version'], seed = 0, evaluation = evaluation, optimizer = Adafactor)

    #-------------------------------------
    # in the case when the linear learning rate scheduler with warmup is used
    
    # let us calculate the appropriate warmup steps (let us take a max epoch of 100)
    # length = len(train_dataset)

    # n_steps = length // config['batch_size']

    # num_steps = config['max_epoch'] * n_steps

    # warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

    # Initialize the scheduler parameters
    # scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}
    #-------------------------------------

    # Initialize the optimizer parameters
    optimizer_args = {
        'lr': config['learning_rate'],
        'weight_decay': config['weight_decay'],
        # 'betas': (0.9, 0.98),
        'warmup_init': config['warmup_init'],
        'relative_step': config['relative_step']
    }

    # ----------------------------
    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    train_sampler = SequenceLengthBatchSampler(train_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    # ------------------------------
    # initialize a bucket sampler with fixed batch size in the case of single machine
    # with parallelization on multiple gpus
    # train_sampler = BucketSampler(train_dataset, config['batch_size'])

    # test_sampler = BucketSampler(test_dataset, config['batch_size'])
    
    # ------------------------------

    # Initialize the loaders parameters
    train_loader_args = {'batch_sampler': train_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    # Add the datasets and hyperparameters to trainer
    trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                    test_loader_args, optimizer_kwargs = optimizer_args,
                    # lr_scheduler=get_linear_schedule_with_warmup,
                    # lr_scheduler_kwargs=scheduler_args,
                    predict_with_generate = True,
                    hugging_face = True,
                    is_distributed=is_distributed,
                    logging_dir=config['logging_dir'],
                    dist=dist
                    )

    # load the model
    trainer.load(config['model_dir'], load_best = not config['continue'])
    
    # Train the model
    trainer.train(config['epochs'] - trainer.current_epoch, auto_save = True, log_step = config['log_step'], saving_directory=config['new_model_dir'], save_best = config['save_best'],
                  metric_for_best_model = config['metric_for_best_model'], metric_objective = config['metric_objective'])
    
    if config['return_trainer']:
        
        return trainer
    
    return None


Overwriting wolof-translate/wolof_translate/utils/hg_training.py
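
The distributed set-up at the top of `train` depends on three config entries: `hosts`, `current_host` and `backend`. With the values used in this notebook (`hosts: []`, `backend: None`) the distributed branch is skipped. The sketch below isolates that decision in a hypothetical helper, `resolve_distributed`. The host names are illustrative, and the `world_size = 1`, `rank = 0` fallback for the single-machine case is an assumption, since `train` does not set these values in that case.

```python
def resolve_distributed(hosts, current_host, backend):
    """Mirror the check in `train`: distributed only with several hosts and a backend."""
    is_distributed = len(hosts) > 1 and backend is not None
    if not is_distributed:
        # Assumed fallback for a single machine (not set by `train` itself)
        return {"is_distributed": False, "world_size": 1, "rank": 0}
    return {
        "is_distributed": True,
        "world_size": len(hosts),            # one process per host
        "rank": hosts.index(current_host),   # position of this host in the list
    }

print(resolve_distributed([], None, None))
print(resolve_distributed(["algo-1", "algo-2"], "algo-2", "gloo"))
```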


Below, we train the model and save it if needed.

In [14]:
from wolof_translate.utils.hg_training import train
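
Note that `train` runs `config['epochs'] - trainer.current_epoch` epochs, so `config['epochs']` is the total target and not the number of additional epochs. The arithmetic can be sketched as follows. `remaining_epochs` is a hypothetical helper, the clamp at zero is an addition that `train` does not have, and the checkpoint epoch of 4 is an assumption chosen because it matches the 25-step progress bar in the output of the next cell.

```python
def remaining_epochs(target_epochs: int, current_epoch: int) -> int:
    """Epochs still to run; never negative once the target has been reached."""
    return max(0, target_epochs - current_epoch)

print(remaining_epochs(29, 4))   # assumed checkpoint at epoch 4
print(remaining_epochs(29, 29))  # target already reached: nothing left to train
```

When continuing a run, raising `config['epochs']` is therefore what extends training; calling `train(config)` again with the same value leaves nothing to do.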

In [32]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/25 [00:00<?, ?it/s]

For epoch 6: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.64batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.75batches/s]



Metrics: {'train_loss': 3.025106288450267, 'test_loss': 3.0578086183528708, 'bleu': 1.330838383838384, 'gen_len': 19.964669191919192}




  4%|▍         | 1/25 [00:09<03:43,  9.31s/it]

For epoch 7: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.60batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.96batches/s]



Metrics: {'train_loss': 2.888327236344436, 'test_loss': 2.932071378736785, 'bleu': 1.3087666666666669, 'gen_len': 16.136386363636365}




  8%|▊         | 2/25 [00:19<03:49,  9.96s/it]

For epoch 8: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.56batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.95batches/s]



Metrics: {'train_loss': 2.7692004121759877, 'test_loss': 2.849424434430672, 'bleu': 1.4679595959595961, 'gen_len': 16.43939494949495}




 12%|█▏        | 3/25 [00:30<03:44, 10.19s/it]

For epoch 9: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.53batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.77batches/s]



Metrics: {'train_loss': 2.663334673526241, 'test_loss': 2.788634987792584, 'bleu': 1.5415909090909092, 'gen_len': 16.055548484848487}




 16%|█▌        | 4/25 [00:41<03:39, 10.45s/it]

For epoch 10: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.53batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:04<00:00,  1.74batches/s]



Metrics: {'train_loss': 2.5818608457117724, 'test_loss': 2.7095980692391444, 'bleu': 1.8785469696969699, 'gen_len': 13.808072222222222}




 20%|██        | 5/25 [00:51<03:32, 10.62s/it]

For epoch 11: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.60batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:04<00:00,  1.73batches/s]



Metrics: {'train_loss': 2.4917865036259266, 'test_loss': 2.6703582159196495, 'bleu': 2.21769898989899, 'gen_len': 15.121238383838381}




 24%|██▍       | 6/25 [01:02<03:23, 10.71s/it]

For epoch 12: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.59batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.82batches/s]



Metrics: {'train_loss': 2.4179942812970907, 'test_loss': 2.6327337351712314, 'bleu': 2.2543474747474748, 'gen_len': 16.212137373737374}




 28%|██▊       | 7/25 [01:13<03:12, 10.70s/it]

For epoch 13: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.58batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:02<00:00,  2.44batches/s]



Metrics: {'train_loss': 2.362779219643667, 'test_loss': 2.6032126094355728, 'bleu': 2.085411616161616, 'gen_len': 12.797957575757577}




 32%|███▏      | 8/25 [01:23<02:56, 10.40s/it]

For epoch 14: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.43batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.75batches/s]



Metrics: {'train_loss': 2.292300087677235, 'test_loss': 2.561257794649914, 'bleu': 2.3605954545454546, 'gen_len': 15.70708686868687}




 36%|███▌      | 9/25 [01:34<02:49, 10.58s/it]

For epoch 15: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.51batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.86batches/s]



Metrics: {'train_loss': 2.2410037018391695, 'test_loss': 2.5626361875823047, 'bleu': 2.6147292929292933, 'gen_len': 15.65151666666667}




 40%|████      | 10/25 [01:43<02:33, 10.23s/it]

For epoch 16: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.47batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.07batches/s]



Metrics: {'train_loss': 2.1965482313372076, 'test_loss': 2.527354256071226, 'bleu': 2.484096969696969, 'gen_len': 13.2525}




 44%|████▍     | 11/25 [01:54<02:23, 10.26s/it]

For epoch 17: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.45batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.93batches/s]



Metrics: {'train_loss': 2.1407281559226945, 'test_loss': 2.497628905556419, 'bleu': 2.9381095959595958, 'gen_len': 14.272710606060606}




 48%|████▊     | 12/25 [02:04<02:14, 10.36s/it]

For epoch 18: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.53batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.76batches/s]



Metrics: {'train_loss': 2.0784834201844107, 'test_loss': 2.497836967911384, 'bleu': 2.899290404040405, 'gen_len': 15.631311111111112}




 52%|█████▏    | 13/25 [02:14<02:01, 10.14s/it]

For epoch 19: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.49batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.89batches/s]



Metrics: {'train_loss': 2.05341728807325, 'test_loss': 2.481565406828216, 'bleu': 2.6675671717171716, 'gen_len': 16.57071616161616}




 56%|█████▌    | 14/25 [02:24<01:53, 10.28s/it]

For epoch 20: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.41batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:04<00:00,  1.68batches/s]



Metrics: {'train_loss': 2.002027263300544, 'test_loss': 2.461949396615077, 'bleu': 3.311483333333334, 'gen_len': 15.63637171717172}




 60%|██████    | 15/25 [02:36<01:45, 10.55s/it]

For epoch 21: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.41batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.78batches/s]



Metrics: {'train_loss': 1.9573933808280268, 'test_loss': 2.440742301218437, 'bleu': 3.6045080808080816, 'gen_len': 16.005034343434343}




 64%|██████▍   | 16/25 [02:47<01:36, 10.74s/it]

For epoch 22: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.43batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.82batches/s]



Metrics: {'train_loss': 1.9155979083354977, 'test_loss': 2.457561742175709, 'bleu': 4.930958080808081, 'gen_len': 16.479829292929296}




 68%|██████▊   | 17/25 [02:56<01:23, 10.40s/it]

For epoch 23: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:05<00:00,  6.38batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.87batches/s]



Metrics: {'train_loss': 1.8723649379952205, 'test_loss': 2.454600731531779, 'bleu': 4.282234343434344, 'gen_len': 15.005060101010102}




 72%|███████▏  | 18/25 [03:06<01:11, 10.16s/it]

For epoch 24: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.52batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.75batches/s]



Metrics: {'train_loss': 1.8316938165318433, 'test_loss': 2.439377421080464, 'bleu': 4.814183333333334, 'gen_len': 16.166674242424243}




 76%|███████▌  | 19/25 [03:17<01:02, 10.38s/it]

For epoch 25: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.48batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:04<00:00,  1.73batches/s]



Metrics: {'train_loss': 1.7985615124814047, 'test_loss': 2.456443261618566, 'bleu': 4.883823232323234, 'gen_len': 16.010118686868687}




 80%|████████  | 20/25 [03:27<00:51, 10.22s/it]

For epoch 26: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.44batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.79batches/s]



Metrics: {'train_loss': 1.7628219490334776, 'test_loss': 2.4467217958334717, 'bleu': 4.541628787878789, 'gen_len': 16.626272222222223}




 84%|████████▍ | 21/25 [03:36<00:40, 10.04s/it]

For epoch 27: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.44batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.14batches/s]



Metrics: {'train_loss': 1.7351601044638563, 'test_loss': 2.4418391073592987, 'bleu': 4.809811616161616, 'gen_len': 13.74244191919192}




 88%|████████▊ | 22/25 [03:45<00:29,  9.72s/it]

For epoch 28: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.43batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.84batches/s]



Metrics: {'train_loss': 1.6943547316249008, 'test_loss': 2.4684393875526665, 'bleu': 4.866994444444445, 'gen_len': 15.510115151515153}




 92%|█████████▏| 23/25 [03:55<00:19,  9.65s/it]

For epoch 29: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.45batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.76batches/s]



Metrics: {'train_loss': 1.6575911842065094, 'test_loss': 2.433619892958439, 'bleu': 5.274884848484849, 'gen_len': 15.570673232323234}




 96%|█████████▌| 24/25 [04:06<00:10, 10.04s/it]

For epoch 30: 


Train batch number 2:   0%|          | 0/32 [00:00<?, ?batches/s]



Train batch number 33: 100%|██████████| 32/32 [00:04<00:00,  6.44batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.98batches/s]



Metrics: {'train_loss': 1.6241250370666247, 'test_loss': 2.4374969884602713, 'bleu': 4.865085353535354, 'gen_len': 15.782837878787882}




100%|██████████| 25/25 [04:15<00:00, 10.22s/it]

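The per-epoch `Metrics: {...}` lines printed above can be collected into a list for comparison across epochs. The following is a minimal sketch, not part of `wolof_translate`: the helper name `parse_metrics` is hypothetical, and it assumes the log text has been saved as a string in which each metrics line starts with `Metrics: ` followed by a Python dictionary literal. The two sample lines are copied from the output above.

```python
import ast


def parse_metrics(log_text):
    """Return the metric dictionaries found on 'Metrics: {...}' lines, in order."""
    prefix = "Metrics: "
    return [
        ast.literal_eval(line[len(prefix):])  # safely evaluate the dict literal
        for line in log_text.splitlines()
        if line.startswith(prefix)
    ]


# Two lines copied from the training output (epochs 29 and 30)
log = """For epoch 29:
Metrics: {'train_loss': 1.6575911842065094, 'test_loss': 2.433619892958439, 'bleu': 5.274884848484849, 'gen_len': 15.570673232323234}
For epoch 30:
Metrics: {'train_loss': 1.6241250370666247, 'test_loss': 2.4374969884602713, 'bleu': 4.865085353535354, 'gen_len': 15.782837878787882}
"""

history = parse_metrics(log)

# Pick the entry with the highest BLEU score among the parsed epochs
best = max(history, key=lambda m: m["bleu"])
print(best["bleu"], best["test_loss"])
```

Comparing `train_loss` and `test_loss` over the parsed history is one way to see whether the two losses start to diverge as training continues.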

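The `huggingface/tokenizers` fork warning in the training output suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it is run in a cell before the tokenizer is first used and before the data loaders fork worker processes (the value `"false"` is a choice made here for illustration; the warning accepts `true` or `false`):

```python
import os

# Set explicitly, as the warning message recommends, so that the
# tokenizers library does not have to disable parallelism itself after a fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

With the variable set, the message should no longer be printed at the start of each training and test loop.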
In [36]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/20 [00:00<?, ?it/s]

For epoch 31: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.74batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.99batches/s]



Metrics: {'train_loss': 1.6031698549495028, 'test_loss': 2.48018310286782, 'bleu': 5.082866161616162, 'gen_len': 14.929275757575759}




  5%|▌         | 1/20 [00:09<02:54,  9.20s/it]

For epoch 32: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.66batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.83batches/s]



Metrics: {'train_loss': 1.56718063592006, 'test_loss': 2.4559369737451733, 'bleu': 5.706334848484849, 'gen_len': 15.72220505050505}




 10%|█         | 2/20 [00:18<02:50,  9.47s/it]

For epoch 33: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.62batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.79batches/s]



Metrics: {'train_loss': 1.5363509753774343, 'test_loss': 2.462271097934608, 'bleu': 4.999496464646464, 'gen_len': 16.78284292929293}




 15%|█▌        | 3/20 [00:28<02:42,  9.58s/it]

For epoch 34: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.68batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.76batches/s]



Metrics: {'train_loss': 1.4993085373171824, 'test_loss': 2.4647206417237872, 'bleu': 5.2097772727272735, 'gen_len': 16.76260101010101}




 20%|██        | 4/20 [00:38<02:33,  9.62s/it]

For epoch 35: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.65batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.18batches/s]



Metrics: {'train_loss': 1.4714344369505565, 'test_loss': 2.52563930039454, 'bleu': 5.206471717171718, 'gen_len': 13.84849494949495}




 25%|██▌       | 5/20 [00:47<02:20,  9.37s/it]

For epoch 36: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.63batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.97batches/s]



Metrics: {'train_loss': 1.4317455306828286, 'test_loss': 2.4958050130593654, 'bleu': 4.841671717171717, 'gen_len': 14.994919191919193}




 30%|███       | 6/20 [00:56<02:11,  9.39s/it]

For epoch 37: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.71batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:04<00:00,  1.71batches/s]



Metrics: {'train_loss': 1.4115574834544322, 'test_loss': 2.4901456832885747, 'bleu': 5.709369696969697, 'gen_len': 16.217164646464646}




 35%|███▌      | 7/20 [01:06<02:04,  9.55s/it]

For epoch 38: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.58batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.84batches/s]



Metrics: {'train_loss': 1.3803954382134265, 'test_loss': 2.5174373015008786, 'bleu': 5.082258080808081, 'gen_len': 14.899022222222225}




 40%|████      | 8/20 [01:16<01:55,  9.59s/it]

For epoch 39: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.64batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.33batches/s]



Metrics: {'train_loss': 1.3583325295603628, 'test_loss': 2.531702978442414, 'bleu': 5.318264646464646, 'gen_len': 12.757601515151515}




 45%|████▌     | 9/20 [01:24<01:42,  9.35s/it]

For epoch 40: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.52batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.91batches/s]



Metrics: {'train_loss': 1.3195869691721784, 'test_loss': 2.5422805367094097, 'bleu': 5.3666823232323235, 'gen_len': 15.166634848484849}




 50%|█████     | 10/20 [01:34<01:34,  9.43s/it]

For epoch 41: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.67batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.86batches/s]



Metrics: {'train_loss': 1.297868958158934, 'test_loss': 2.5708903038140503, 'bleu': 4.87779494949495, 'gen_len': 15.833323232323233}




 55%|█████▌    | 11/20 [01:44<01:25,  9.45s/it]

For epoch 42: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.62batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.88batches/s]



Metrics: {'train_loss': 1.2691234390449861, 'test_loss': 2.5951932126825508, 'bleu': 5.3977151515151505, 'gen_len': 15.232341414141414}




 60%|██████    | 12/20 [01:53<01:15,  9.48s/it]

For epoch 43: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.54batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.87batches/s]



Metrics: {'train_loss': 1.2421140144853033, 'test_loss': 2.590024545939282, 'bleu': 6.711281818181818, 'gen_len': 13.772703535353537}




 65%|██████▌   | 13/20 [02:03<01:06,  9.52s/it]

For epoch 44: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:04<00:00,  6.61batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.78batches/s]



Metrics: {'train_loss': 1.2233287424018762, 'test_loss': 2.5700962663900975, 'bleu': 5.34880303030303, 'gen_len': 16.36870303030303}




 70%|███████   | 14/20 [02:12<00:57,  9.59s/it]

For epoch 45: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.53batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.81batches/s]



Metrics: {'train_loss': 1.2071754128061318, 'test_loss': 2.6190930038991604, 'bleu': 5.325007070707071, 'gen_len': 15.06563181818182}




 75%|███████▌  | 15/20 [02:22<00:48,  9.62s/it]

For epoch 46: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.56batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.79batches/s]



Metrics: {'train_loss': 1.167711973529613, 'test_loss': 2.5440620123737996, 'bleu': 6.071860606060607, 'gen_len': 15.292952525252526}




 80%|████████  | 16/20 [02:32<00:38,  9.66s/it]

For epoch 47: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.55batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.83batches/s]



Metrics: {'train_loss': 1.151745243541128, 'test_loss': 2.610427420548719, 'bleu': 6.87779696969697, 'gen_len': 14.500006060606061}




 85%|████████▌ | 17/20 [02:42<00:28,  9.65s/it]

For epoch 48: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.51batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.76batches/s]



Metrics: {'train_loss': 1.1192170883152472, 'test_loss': 2.5948631040977714, 'bleu': 5.628589393939393, 'gen_len': 16.36867777777778}




 90%|█████████ | 18/20 [02:51<00:19,  9.70s/it]

For epoch 49: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.54batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  2.09batches/s]



Metrics: {'train_loss': 1.0961309396564627, 'test_loss': 2.650956584949686, 'bleu': 7.354377777777779, 'gen_len': 14.343432323232324}




 95%|█████████▌| 19/20 [03:01<00:09,  9.55s/it]

For epoch 50: 


Train batch number 2:   0%|          | 0/33 [00:00<?, ?batches/s]



Train batch number 34: 100%|██████████| 33/33 [00:05<00:00,  6.53batches/s]
Test batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Test batch number 8: 100%|██████████| 7/7 [00:03<00:00,  1.94batches/s]



Metrics: {'train_loss': 1.0732805997670491, 'test_loss': 2.6258159791580358, 'bleu': 6.835085353535354, 'gen_len': 14.914118686868687}




100%|██████████| 20/20 [03:10<00:00,  9.53s/it]
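
The per-epoch metrics logged above can be compared to see which epoch would be kept as the best model. The sketch below is only an illustration: `best_epoch` is a hypothetical helper (it is not part of `wolof_translate`), and it assumes the run tracks one metric with a `minimize` or `maximize` objective, as the `metric_for_best_model` and `metric_objective` entries of the v6 configuration later in this notebook do. The values are copied from the `Metrics:` lines of epochs 45 to 50.

```python
# Metrics copied from the "Metrics: {...}" lines logged above (epochs 45 to 50).
history = {
    45: {"test_loss": 2.6190930038991604, "bleu": 5.325007070707071},
    46: {"test_loss": 2.5440620123737996, "bleu": 6.071860606060607},
    47: {"test_loss": 2.610427420548719, "bleu": 6.87779696969697},
    48: {"test_loss": 2.5948631040977714, "bleu": 5.628589393939393},
    49: {"test_loss": 2.650956584949686, "bleu": 7.354377777777779},
    50: {"test_loss": 2.6258159791580358, "bleu": 6.835085353535354},
}


def best_epoch(history, metric, objective):
    """Hypothetical helper: return the epoch whose metric is best under the objective."""
    if objective not in ("minimize", "maximize"):
        raise ValueError(f"Unknown objective: {objective}")
    pick = min if objective == "minimize" else max
    # min/max iterate over the epoch numbers and compare the chosen metric
    return pick(history, key=lambda epoch: history[epoch][metric])


print(best_epoch(history, "test_loss", "minimize"))  # 46
print(best_epoch(history, "bleu", "maximize"))       # 49
```

Over these six epochs the lowest test loss (epoch 46) and the highest BLEU (epoch 49) do not coincide, so the choice of the tracked metric changes which checkpoint is kept.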


In [15]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

0it [00:00, ?it/s]


➡️ Predictions


In [16]:
if trainer is not None:

    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])

    # build the test dataset
    # initialize the transformation sequence
    end_mark_fn = partial(add_end_mark)
    augmentation = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)


    # let us get the test set
    test_dataset = SentenceDataset(f"{config['data_directory']}test_set.csv",
                                            tokenizer = tokenizer,
                                            cp1_transformer = augmentation,
                                            cp2_transformer = augmentation,
                                            corpus_1=config['corpus_1'],
                                            corpus_2=config['corpus_2'],
                                            truncation = False)

    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                            'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    metrics, prediction = trainer.evaluate(test_dataset, test_loader_args)


Evaluation batch number 2:   0%|          | 0/6 [00:00<?, ?batches/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Evaluation batch number 7: 100%|██████████| 6/6 [00:03<00:00,  1.73batches/s]


In [17]:
metrics

{'test_loss': 2.1652301561952845,
 'bleu': 4.696123737373738,
 'gen_len': 10.767653535353533}

In [18]:
prediction

Unnamed: 0,original_sentences,translations,predictions
0,Góor gi dem na ci biir.,L'homme est allé à l'intérieur.,Il est parti.
1,Ana kooku.,Où est celui-là?,Où est le maître.
2,Gis naa xale ba.,J'ai vu l'enfant.,J'ai vu.
3,Deŋkël lëf li kenn ki!,Confie la chose à l'un!,"Il est là-bas, là-bas,"
4,Sama aw xarit!,Un ami à moi!,C'est celui-là!
...,...,...,...
193,"Waaw lii nag ay néegi ñax la, néegi ñax bi ci ...","Ceux-là, cependant, sont des cases. Celle qui ...","Je suis en gris, d'autres d'autres d'autres d..."
194,Ñu gis ci nataal bi ay nit ñu bari ñu génn ci ...,On voit sur la photo beaucoup de personnes sor...,Ceux-ci sont de vieilles pourr que tu allais ...
195,Ñii moom tamit ay takk-der nañ. Ñi ñi ngi bàyy...,"Ceux-là, aussi, sont des gendarmes. Ils siègen...",Ceci est une photo sur laquelle je vois un ch...
196,Lii nag dañ ciy faral di def ndugg maanaam jig...,"Ceci, cependant, on a l'habitude de faire les ...","Ceci est une théière, similaire au poisson sé..."


-------------------------------------------

---------------------------

#### French-Wolof v6

➡️ Import the libraries.

In [31]:
from wolof_translate import *

# specify a seed for everything
lt.seed_everything(0)

Global seed set to 0


0
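
The `huggingface/tokenizers` fork warning that appears in the training output suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming the variable is set before the tokenizer is first used (ideally in the first cell of the notebook); choosing `"false"` here is our assumption, since the warning accepts either value:

```python
import os

# Disable the tokenizers' internal parallelism explicitly so that the library
# no longer has to disable it (and warn) when DataLoader workers are forked.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```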

➡️ Function to recuperate datasets

In [32]:
%%writefile wolof-translate/wolof_translate/utils/recuperate_datasets.py
from wolof_translate import *

def recuperate_datasets(char_p: float, word_p: float, max_len: int, end_mark: int, tokenizer: T5TokenizerFast,
                        corpus_1: str = 'french', corpus_2: str = 'wolof', 
                        train_file: str = 'data/extractions/new_data/train_set.csv', 
                        test_file: str = 'data/extractions/new_data/test_file.csv'):

  # Select the end-mark handling according to the end_mark option
  if end_mark == 1:
    # Create the augmentations to apply to the French sentences
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space)

    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)
    
  else:
    
    if end_mark == 2:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!', replace = True)
    
    elif end_mark == 3:

      end_mark_fn = partial(add_end_mark)
    
    elif end_mark == 4:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!')
    
    else:

      raise ValueError(f'Unknown end_mark option: {end_mark} (expected 1, 2, 3 or 4)')

    # Create the augmentations to apply to the French sentences
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
  # Build the training dataset
  train_dataset_aug = SentenceDataset(train_file,
                                        tokenizer,
                                        truncation = False,
                                        cp1_transformer = fr_augmentation_1,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2
                                        )

  # Build the validation dataset
  valid_dataset = SentenceDataset(test_file,
                                        tokenizer,
                                        cp1_transformer = fr_augmentation_2,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2,
                                        truncation = False)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

Overwriting wolof-translate/wolof_translate/utils/recuperate_datasets.py


In [33]:
%run wolof-translate/wolof_translate/utils/recuperate_datasets.py

➡️ Training

In [47]:
# initialize the configurations
config = {
    'epochs': 15,
    'max_epoch': None,
    'log_step': 1,
    'metric_for_best_model': 'test_loss',
    'metric_objective': 'minimize',
    'corpus_1': 'french',
    'corpus_2': 'wolof',
    'train_file': 'data/extractions/new_data/train_set.csv',
    'test_file': 'data/extractions/new_data/valid_set.csv',
    'drop_out_rate': 0.02121451891074674,
    'd_model': 512,
    'n_head': 8,
    'dim_ff': 2024,
    'n_encoders': 6,
    'n_decoders': 6,
    'learning_rate': 0.0012924460038848235,
    'weight_decay': 0.02121451891074674,
    'char_p': 0.4488567119340453,
    'word_p': 0.8710480007380237,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': [2, 31, 59, 87, 115, 143, 171],
    'batch_sizes': [256, 128, 64, 32, 16, 8, 4, 2],
    'batch_size': None, 
    'warmup_init': False,
    'relative_step': False,
    'num_workers': 0,
    'pin_memory': False,
    # --------------------> Must be changed when continuing training
    'model_dir': 't5_small_v6_fw',
    'new_model_dir': 't5_small_v6_fw',
    'continue': True, # --------------------------> Must be changed when continuing training
    'logging_dir': 'data/logs/t5_small_fw',
    'save_best': True,
    'tokenizer_path': 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5.model',
    'data_directory': 'data/extractions/new_data/',
    'data_file': 'corpora_v6.csv',
    'version': 6,
    # in the case of a distributed training
    'backend': None,
    'hosts': [],
    'current_host': None,
    'num_gpus': 5,
    'logger': None,
    'return_trainer': True,
    'include_split': True,
}
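
The `boundaries` and `batch_sizes` entries go together: 7 boundaries delimit 8 length buckets, and longer sequences get smaller batches. The sketch below shows one plausible reading of that mapping. It is an assumption, not the actual `SequenceLengthBatchSampler` logic: `batch_size_for_length` is a hypothetical helper, and treating a length equal to a boundary as belonging to the next bucket is our choice.

```python
from bisect import bisect_right

# Values taken from the configuration above.
boundaries = [2, 31, 59, 87, 115, 143, 171]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]


def batch_size_for_length(length, boundaries, batch_sizes):
    """Hypothetical helper: batch size of the bucket that a sequence length falls into."""
    if len(batch_sizes) != len(boundaries) + 1:
        raise ValueError("Expected one more batch size than boundaries")
    # Number of boundaries that are <= length, i.e. the bucket index (assumed convention)
    bucket = bisect_right(boundaries, length)
    return batch_sizes[bucket]


for length in (1, 10, 60, 200):
    print(length, batch_size_for_length(length, boundaries, batch_sizes))
# 1 256
# 10 128
# 60 32
# 200 2
```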

In [48]:
%%writefile wolof-translate/wolof_translate/utils/hg_training.py
from wolof_translate import *
import warnings

def train(config: dict):
    
    # ---------------------------------------
    # add distribution if necessary (https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/mnist.py)
    
    logger = config['logger']
    
    is_distributed = len(config['hosts']) > 1 and config['backend'] is not None
    
    use_cuda = config['num_gpus'] > 0
    
    config.update({"num_workers": 1, "pin_memory": True} if use_cuda else {})

    if logger is not None:
        
        logger.debug("Distributed training - {}".format(is_distributed))
        
        logger.debug("Number of gpus available - {}".format(config['num_gpus']))
        
    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(config['hosts'])
        
        os.environ["WORLD_SIZE"] = str(world_size)
        
        host_rank = config['hosts'].index(config['current_host'])
        
        os.environ["RANK"] = str(host_rank)
        
        dist.init_process_group(backend=config['backend'], rank=host_rank, world_size=world_size)
        
        if logger is not None:
            logger.info(
                "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                    config['backend'], dist.get_world_size()
                )
                + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), config['num_gpus'])
            )
    # ---------------------------------------
    
    # split the data
    if config['include_split']: split_data(config['random_state'], config['data_directory'], config['data_file'])

    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # Initialize the model name
    model_name = 't5-small'

    # import the model with its pre-trained weights
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # resize the token embeddings
    model.resize_token_embeddings(len(tokenizer))
    
    # build the train and test sets
    train_dataset, test_dataset = recuperate_datasets(config['char_p'],
                                                        config['word_p'], config['max_len'],
                                                        config['end_mark'], tokenizer, config['corpus_1'],
                                                        config['corpus_2'],
                                                        config['train_file'], config['test_file'])
    
    # initialize the evaluation object
    evaluation = TranslationEvaluation(tokenizer, train_dataset.decode)

    # let us initialize the trainer
    trainer = ModelRunner(model = model, version=config['version'], seed = 0, evaluation = evaluation, optimizer = Adafactor)

    #-------------------------------------
    # in the case when the linear learning rate scheduler with warmup is used
    
    # let us calculate the appropriate warmup steps (let us take a max epoch of 100)
    # length = len(train_dataset)

    # n_steps = length // config['batch_size']

    # num_steps = config['max_epoch'] * n_steps

    # warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

    # Initialize the scheduler parameters
    # scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}
    #-------------------------------------

    # Initialize the optimizer parameters
    optimizer_args = {
        'lr': config['learning_rate'],
        'weight_decay': config['weight_decay'],
        # 'betas': (0.9, 0.98),
        'warmup_init': config['warmup_init'],
        'relative_step': config['relative_step']
    }

    # ----------------------------
    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    train_sampler = SequenceLengthBatchSampler(train_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    # ------------------------------
    # initialize a bucket sampler with fixed batch size in the case of single machine
    # with parallelization on multiple gpus
    # train_sampler = BucketSampler(train_dataset, config['batch_size'])

    # test_sampler = BucketSampler(test_dataset, config['batch_size'])
    
    # ------------------------------

    # Initialize the loaders parameters
    train_loader_args = {'batch_sampler': train_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    # Add the datasets and hyperparameters to trainer
    trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                    test_loader_args, optimizer_kwargs = optimizer_args,
                    # lr_scheduler=get_linear_schedule_with_warmup,
                    # lr_scheduler_kwargs=scheduler_args,
                    predict_with_generate = True,
                    hugging_face = True,
                    is_distributed=is_distributed,
                    logging_dir=config['logging_dir'],
                    dist=dist
                    )

    # load the saved model (the best checkpoint is requested only when not continuing training)
    trainer.load(config['model_dir'], load_best = not config['continue'])
    
    # Train the model for the remaining epochs
    trainer.train(config['epochs'] - trainer.current_epoch, auto_save = True, log_step = config['log_step'],
                  saving_directory = config['new_model_dir'], save_best = config['save_best'],
                  metric_for_best_model = config['metric_for_best_model'], metric_objective = config['metric_objective'])
    
    if config['return_trainer']:
        
        return trainer
    
    return None


Overwriting wolof-translate/wolof_translate/utils/hg_training.py
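
The distribution checks at the top of `train` can be looked at in isolation. The sketch below mirrors them with plain Python only: `resolve_distribution` is a hypothetical helper, the host names are illustrative, and the defaults returned for the single-machine case (`world_size` of 1 and `rank` of 0) are our assumption, since `train` sets these values only when training is distributed.

```python
def resolve_distribution(hosts, current_host, backend):
    """Hypothetical helper mirroring the distribution checks made in `train`."""
    # Same condition as in `train`: several hosts and an explicit backend
    is_distributed = len(hosts) > 1 and backend is not None
    if not is_distributed:
        # Assumed defaults for a single machine
        return {"is_distributed": False, "world_size": 1, "rank": 0}
    return {
        "is_distributed": True,
        "world_size": len(hosts),           # one process per host
        "rank": hosts.index(current_host),  # position of this host in the list
    }


# Single machine, as in the configuration of this notebook
print(resolve_distribution([], None, None))
# {'is_distributed': False, 'world_size': 1, 'rank': 0}

# Two illustrative hosts with the "nccl" backend
print(resolve_distribution(["algo-1", "algo-2"], "algo-2", "nccl"))
# {'is_distributed': True, 'world_size': 2, 'rank': 1}
```

With `'hosts': []` and `'backend': None`, as configured above, the distributed branch is skipped entirely.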


Train the model below and save it if needed.

In [49]:
from wolof_translate.utils.hg_training import train
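
When a training run is continued, `train` asks the trainer for `config['epochs'] - trainer.current_epoch` epochs, so `epochs` acts as a total budget rather than as a number of additional epochs. A minimal sketch of that arithmetic, where `remaining_epochs` is a hypothetical helper and the clamp at zero is our addition (the original code passes the difference as is):

```python
def remaining_epochs(target_epochs, current_epoch):
    """Hypothetical helper: epochs still to run when resuming from a checkpoint."""
    # Clamp at zero so that an already completed budget never yields a negative count
    return max(target_epochs - current_epoch, 0)


print(remaining_epochs(15, 5))   # 10
print(remaining_epochs(15, 15))  # 0
```

To train further from a saved model, `epochs` therefore has to be larger than the epoch stored in the loaded checkpoint.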

In [37]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/25 [00:00<?, ?it/s]

For epoch 6: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.55batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.44batches/s]



Metrics: {'train_loss': 2.4375888028413804, 'test_loss': 2.966145953932008, 'bleu': 2.3349269230769227, 'gen_len': 21.185333566433567}




  4%|▍         | 1/25 [00:24<09:49, 24.57s/it]

For epoch 7: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.62batches/s]
Test batch number 3:  11%|█         | 1/9 [00:00<00:01,  4.63batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.29batches/s]



Metrics: {'train_loss': 2.3719612082175723, 'test_loss': 2.9198804418523827, 'bleu': 2.384601748251748, 'gen_len': 21.020954545454543}




  8%|▊         | 2/25 [00:49<09:33, 24.94s/it]

For epoch 8: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.60batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.29batches/s]



Metrics: {'train_loss': 2.322190919712635, 'test_loss': 2.896589669314298, 'bleu': 2.3052779720279717, 'gen_len': 19.241266433566434}




 12%|█▏        | 3/25 [01:14<09:11, 25.07s/it]

For epoch 9: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.59batches/s]
Test batch number 3:  11%|█         | 1/9 [00:00<00:01,  4.77batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.35batches/s]



Metrics: {'train_loss': 2.2724601538565983, 'test_loss': 2.8756876675399035, 'bleu': 2.2909538461538457, 'gen_len': 19.94757972027972}




 16%|█▌        | 4/25 [01:39<08:45, 25.01s/it]

For epoch 10: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.60batches/s]
Test batch number 3:  11%|█         | 1/9 [00:00<00:01,  4.79batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.22batches/s]



Metrics: {'train_loss': 2.228499875821344, 'test_loss': 2.8574944959653843, 'bleu': 2.504084615384616, 'gen_len': 20.21676433566434}




 20%|██        | 5/25 [02:05<08:24, 25.23s/it]

For epoch 11: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.51batches/s]
Test batch number 2:  11%|█         | 1/9 [00:00<00:01,  4.45batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.40batches/s]



Metrics: {'train_loss': 2.1911549471224245, 'test_loss': 2.8746307279680154, 'bleu': 2.0619433566433565, 'gen_len': 16.118908391608393}




 24%|██▍       | 6/25 [02:29<07:48, 24.66s/it]

For epoch 12: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.51batches/s]
Test batch number 3:  11%|█         | 1/9 [00:00<00:01,  4.70batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.37batches/s]



Metrics: {'train_loss': 2.150642161083681, 'test_loss': 2.8429726854070916, 'bleu': 2.8710527972027973, 'gen_len': 17.94405454545455}




 28%|██▊       | 7/25 [02:54<07:25, 24.78s/it]

For epoch 13: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.47batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.23batches/s]



Metrics: {'train_loss': 2.1094355544618604, 'test_loss': 2.833220173428942, 'bleu': 2.8371926573426576, 'gen_len': 19.74824615384615}




 32%|███▏      | 8/25 [03:19<07:07, 25.12s/it]

For epoch 14: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.50batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.24batches/s]



Metrics: {'train_loss': 2.0748607206407157, 'test_loss': 2.8436978813651557, 'bleu': 3.220885664335665, 'gen_len': 18.332169930069927}




 36%|███▌      | 9/25 [03:44<06:38, 24.89s/it]

For epoch 15: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.47batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:08<00:00,  1.09batches/s]



Metrics: {'train_loss': 2.0419498727370615, 'test_loss': 2.8237611914014478, 'bleu': 2.646240909090908, 'gen_len': 20.881122377622376}




 40%|████      | 10/25 [04:11<06:22, 25.49s/it]

For epoch 16: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.48batches/s]
Test batch number 3:  11%|█         | 1/9 [00:00<00:01,  4.49batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.34batches/s]



Metrics: {'train_loss': 2.008480124250478, 'test_loss': 2.8501608755205066, 'bleu': 2.9293461538461547, 'gen_len': 18.723774825174825}




 44%|████▍     | 11/25 [04:34<05:49, 24.98s/it]

For epoch 17: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.45batches/s]
Test batch number 2:  11%|█         | 1/9 [00:00<00:01,  4.46batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.25batches/s]



Metrics: {'train_loss': 1.9747594703896154, 'test_loss': 2.825546859861254, 'bleu': 3.1001982517482514, 'gen_len': 18.353144755244756}




 48%|████▊     | 12/25 [04:59<05:22, 24.79s/it]

For epoch 18: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.48batches/s]
Test batch number 3:  11%|█         | 1/9 [00:00<00:01,  4.53batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.17batches/s]



Metrics: {'train_loss': 1.9440243814415847, 'test_loss': 2.86058876731179, 'bleu': 2.7641552447552447, 'gen_len': 17.059434965034967}




 52%|█████▏    | 13/25 [05:24<04:57, 24.82s/it]

For epoch 19: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.43batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.38batches/s]



Metrics: {'train_loss': 1.9121805690082347, 'test_loss': 2.853731763946427, 'bleu': 2.945455594405594, 'gen_len': 17.19578951048951}




 56%|█████▌    | 14/25 [05:47<04:29, 24.49s/it]

For epoch 20: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.58batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.18batches/s]



Metrics: {'train_loss': 1.8839237426220727, 'test_loss': 2.82736990168378, 'bleu': 2.5408059440559434, 'gen_len': 19.545450349650356}




 60%|██████    | 15/25 [06:12<04:04, 24.49s/it]

For epoch 21: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.60batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.20batches/s]



Metrics: {'train_loss': 1.8511747155170752, 'test_loss': 2.8715496763482795, 'bleu': 2.97046993006993, 'gen_len': 18.86013286713287}




 64%|██████▍   | 16/25 [06:36<03:40, 24.45s/it]

For epoch 22: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.62batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.34batches/s]



Metrics: {'train_loss': 1.8166869640298207, 'test_loss': 2.8685822203442766, 'bleu': 3.062287062937062, 'gen_len': 17.62591118881119}




 68%|██████▊   | 17/25 [07:00<03:13, 24.15s/it]

For epoch 23: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:15<00:00,  6.63batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.33batches/s]



Metrics: {'train_loss': 1.78812318959505, 'test_loss': 2.864329903275817, 'bleu': 3.1684657342657347, 'gen_len': 16.423091608391605}




 72%|███████▏  | 18/25 [07:23<02:47, 23.94s/it]

For epoch 24: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:16<00:00,  6.62batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.35batches/s]



Metrics: {'train_loss': 1.7567113254286078, 'test_loss': 2.889874550012442, 'bleu': 4.056932867132867, 'gen_len': 18.95105664335664}




 76%|███████▌  | 19/25 [07:47<02:22, 23.79s/it]

For epoch 25: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:15<00:00,  6.66batches/s]
Test batch number 3:  11%|█         | 1/9 [00:00<00:01,  4.60batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.19batches/s]



Metrics: {'train_loss': 1.7311453506396117, 'test_loss': 2.900331260441066, 'bleu': 2.816500699300699, 'gen_len': 18.590922377622377}




 80%|████████  | 20/25 [08:11<01:59, 23.90s/it]

For epoch 26: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:15<00:00,  6.70batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.28batches/s]



Metrics: {'train_loss': 1.7051444899504187, 'test_loss': 2.9014232308714543, 'bleu': 3.8829940559440552, 'gen_len': 17.115365734265733}




 84%|████████▍ | 21/25 [08:34<01:35, 23.81s/it]

For epoch 27: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:15<00:00,  6.80batches/s]
Test batch number 3:  11%|█         | 1/9 [00:00<00:01,  4.56batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.21batches/s]



Metrics: {'train_loss': 1.6735967323281744, 'test_loss': 2.9207512398699786, 'bleu': 3.4692776223776223, 'gen_len': 19.27973776223776}




 88%|████████▊ | 22/25 [08:58<01:11, 23.78s/it]

For epoch 28: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:15<00:00,  6.88batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.16batches/s]



Metrics: {'train_loss': 1.651091969216501, 'test_loss': 2.894568519992428, 'bleu': 2.565493356643356, 'gen_len': 19.828672727272732}




 92%|█████████▏| 23/25 [09:22<00:47, 23.80s/it]

For epoch 29: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:15<00:00,  6.92batches/s]
Test batch number 2:  11%|█         | 1/9 [00:00<00:01,  4.50batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.16batches/s]



Metrics: {'train_loss': 1.6179291179569009, 'test_loss': 2.9350514862087227, 'bleu': 2.698149300699301, 'gen_len': 20.583897202797203}




 96%|█████████▌| 24/25 [09:46<00:23, 23.81s/it]

For epoch 30: 


Train batch number 2:   0%|          | 0/106 [00:00<?, ?batches/s]



Train batch number 107: 100%|██████████| 106/106 [00:15<00:00,  6.90batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.29batches/s]



Metrics: {'train_loss': 1.5896231090079034, 'test_loss': 2.94225197238522, 'bleu': 4.267848601398602, 'gen_len': 18.486016083916084}




100%|██████████| 25/25 [10:09<00:00, 24.37s/it]
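
The `huggingface/tokenizers` warning in the training output above is raised when the process is forked after the tokenizer has already been used, most likely when the data loaders start their worker processes. As the message itself suggests, the `TOKENIZERS_PARALLELISM` environment variable can be set explicitly. A minimal sketch, assuming it is run in the first cell of the notebook, before the tokenizer is used for the first time:

```python
import os

# Set the variable before the tokenizer is first used, so that the
# tokenizers library does not have to disable parallelism after a fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"false"` trades some tokenization speed for a quieter log; `"true"` is the other value the warning mentions.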


In [50]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

0it [00:00, ?it/s]
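
The `0it [00:00, ?it/s]` output above suggests that this call had no epochs left to run. Assuming this version of `train` requests `config['epochs'] - trainer.current_epoch` epochs, as the v6 version in this notebook does, a checkpoint that has already reached the target epoch leaves nothing to train. A minimal sketch of that arithmetic; `remaining_epochs` is a hypothetical helper, not part of `wolof_translate`, and the numbers are illustrative:

```python
def remaining_epochs(target_epochs: int, current_epoch: int) -> int:
    """Number of epochs still to run; never negative."""
    return max(0, target_epochs - current_epoch)


# Hypothetical values: a target of 30 epochs, resumed from epoch 5
print(remaining_epochs(30, 5))   # 25
# Same target, but the checkpoint has already reached epoch 30
print(remaining_epochs(30, 30))  # 0
```

To train further, `config['epochs']` would need to be raised above the epoch stored in the checkpoint.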


➡️ Predictions


In [51]:
if trainer is not None:
    
    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # load the test dataset
    # initialize the transformation sequence
    end_mark_fn = partial(add_end_mark)
    augmentation = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)


    # load the test set
    test_dataset = SentenceDataset(f"{config['data_directory']}test_set.csv",
                                            tokenizer = tokenizer,
                                            cp1_transformer = augmentation,
                                            cp2_transformer = augmentation,
                                            corpus_1=config['corpus_1'],
                                            corpus_2=config['corpus_2'],
                                            truncation = False)

    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                            'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    metrics, prediction = trainer.evaluate(test_dataset, test_loader_args)


Evaluation batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]



Evaluation batch number 8: 100%|██████████| 7/7 [00:04<00:00,  1.53batches/s]


In [52]:
metrics

{'test_loss': 2.927400617332725,
 'bleu': 4.409472027972028,
 'gen_len': 14.54896958041958}
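
To put these figures next to the training history, the per-epoch `Metrics: {...}` lines printed during training can be parsed to find the epoch that scored best on a chosen metric. A minimal sketch using only the standard library; `best_epoch` is a hypothetical helper, and the three log lines are copied from the training output above (epochs 16 to 18):

```python
import ast


def best_epoch(log_lines, metric="test_loss", minimize=True):
    """Return (epoch, value) for the best logged value of `metric`.

    log_lines: iterable of (epoch, line) pairs, where each line looks like
    "Metrics: {'train_loss': ..., 'test_loss': ..., ...}".
    """
    scores = []
    for epoch, line in log_lines:
        # Everything after the "Metrics:" prefix is a Python dict literal
        metrics = ast.literal_eval(line.split("Metrics:", 1)[1].strip())
        scores.append((metrics[metric], epoch))
    value, epoch = min(scores) if minimize else max(scores)
    return epoch, value


logs = [
    (16, "Metrics: {'train_loss': 2.008480124250478, 'test_loss': 2.8501608755205066, "
         "'bleu': 2.9293461538461547, 'gen_len': 18.723774825174825}"),
    (17, "Metrics: {'train_loss': 1.9747594703896154, 'test_loss': 2.825546859861254, "
         "'bleu': 3.1001982517482514, 'gen_len': 18.353144755244756}"),
    (18, "Metrics: {'train_loss': 1.9440243814415847, 'test_loss': 2.86058876731179, "
         "'bleu': 2.7641552447552447, 'gen_len': 17.059434965034967}"),
]

print(best_epoch(logs))                                 # (17, 2.825546859861254)
print(best_epoch(logs, metric="bleu", minimize=False))  # (17, 3.1001982517482514)
```

The `metric` and `minimize` arguments play the same role as the `metric_for_best_model` and `metric_objective` entries of the config.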

In [53]:
prediction

Unnamed: 0,original_sentences,translations,predictions
0,Toutes les portes étaient ouvertes.,Bunt yi yépp a tëjju woon.,Gis ŋga nit ki.
1,"Ici alors, si tu refuses!","Cii fii kon, soo bëggul!",Kooku dem!
2,Celui qui est en haut!,Kenn ki ci kaw!,Mi ŋgi fi!
3,C'est votre ami!,Seen xarit la!,Nit ku baax la!
4,Devant toi.,Cha kanam.,De moomu.
...,...,...,...
281,"Ce n'est que longtemps après, quand l'égoïsme ...","Teg nañ ciy ati-at ma door a jëli ni jigéen, n...","Xare bi jeex, ñu tekk ci ñaar-fukki at, ma àn..."
282,"J'ai ressenti de l'étonnement, et même de l'in...","Li wóor te wér moo di ne bi loolu lépp weesoo,...","Jigéen ñu bare, seen der jóge woon ca seen xe..."
283,À quel point les arbres aux troncs rectilignes...,"Dàtti garab yaa ngi lunk, sànneeku jëm ca kow,...","Ci dénd yi muy nettaliy xol di : « Moo, góor ..."
284,Je peux ressentir l'émotion qu'il éprouve à tr...,"Li koy yëngal noonu, xam naa ko. Lan moo ko dà...","Li muy nettaliy toog ci kër gi, di naka noonu..."


----------------------------------

#### Wolof-French v6

➡️ Import the libraries.

In [38]:
from wolof_translate import *

# specify a seed for everything
lt.seed_everything(0)

Global seed set to 0


0

➡️ Function to retrieve the datasets

In [39]:
%%writefile wolof-translate/wolof_translate/utils/recuperate_datasets.py
from wolof_translate import *

def recuperate_datasets(char_p: float, word_p: float, max_len: int, end_mark: int, tokenizer: T5TokenizerFast,
                        corpus_1: str = 'french', corpus_2: str = 'wolof', 
                        train_file: str = 'data/extractions/new_data/train_set.csv', 
                        test_file: str = 'data/extractions/new_data/test_file.csv'):

  # Select how the end mark is handled, according to the end_mark option
  if end_mark == 1:
    # Create the augmentations: keyboard noise plus cleaning for corpus_1 (training only), cleaning alone otherwise
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space)

    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)
    
  else:
    
    if end_mark == 2:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!', replace = True)
    
    elif end_mark == 3:

      end_mark_fn = partial(add_end_mark)
    
    elif end_mark == 4:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!')
    
    else:  
        
        raise ValueError(f'No end mark number {end_mark}')

    # Create the same augmentations, with the end-mark function appended
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
  # Load the training dataset
  train_dataset_aug = SentenceDataset(train_file,
                                        tokenizer,
                                        truncation = False,
                                        cp1_transformer = fr_augmentation_1,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2
                                        )

  # Load the validation dataset
  valid_dataset = SentenceDataset(test_file,
                                        tokenizer,
                                        cp1_transformer = fr_augmentation_2,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2,
                                        truncation = False)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

Overwriting wolof-translate/wolof_translate/utils/recuperate_datasets.py


In [40]:
%run wolof-translate/wolof_translate/utils/recuperate_datasets.py

➡️ Training

In [54]:
# initialize the configurations
config = {
    'epochs': 8,
    'max_epoch': None,
    'log_step': 1,
    'metric_for_best_model': 'test_loss',
    'metric_objective': 'minimize',
    'corpus_1': 'wolof',
    'corpus_2': 'french',
    'train_file': 'data/extractions/new_data/train_set.csv',
    'test_file': 'data/extractions/new_data/valid_set.csv',
    'drop_out_rate': 0.291121690756753,
    'd_model': 512,
    'n_head': 8,
    'dim_ff': 2024,
    'n_encoders': 6,
    'n_decoders': 6,
    'learning_rate': 0.004976535748221598,
    'weight_decay': 0.01725680581796274,
    'char_p': 0.6630185468549884,
    'word_p': 0.819968675819829,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': [2, 31, 59, 87, 115, 143, 171],
    'batch_sizes': [256, 128, 64, 32, 16, 8, 4, 2],
    'batch_size': None, 
    'warmup_init': False,
    'relative_step': False,
    'num_workers': 0,
    'pin_memory': False,
    # --------------------> Must be changed when continuing training
    'model_dir': 't5_small_v6_wf',
    'new_model_dir': 't5_small_v6_wf',
    'continue': True, # --------------------------> Must be changed when continuing training
    'logging_dir': 'data/logs/t5_small_wf',
    'save_best': True,
    'tokenizer_path': 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5.model',
    'data_directory': 'data/extractions/new_data/',
    'data_file': 'corpora_v6.csv',
    'version': 6,
    # used only for distributed training
    'backend': None,
    'hosts': [],
    'current_host': None,
    'num_gpus': 5,
    'logger': None,
    'return_trainer': True,
    'include_split': True,
}
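
The `boundaries` and `batch_sizes` entries pair up as length buckets: seven boundaries delimit eight buckets, one batch size per bucket, so shorter sequences are grouped into larger batches. The sketch below shows one common convention for this mapping, the one used by TensorFlow's `bucket_by_sequence_length`, where a length equal to a boundary falls into the next bucket. Whether `SequenceLengthBatchSampler` treats boundary values the same way is an assumption, and `batch_size_for_length` is a hypothetical helper:

```python
from bisect import bisect_right

# Values copied from the config above
boundaries = [2, 31, 59, 87, 115, 143, 171]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]


def batch_size_for_length(length, boundaries=boundaries, batch_sizes=batch_sizes):
    """Return the batch size of the bucket that a sequence of `length` tokens falls into."""
    # n boundaries split the length axis into n + 1 buckets
    assert len(batch_sizes) == len(boundaries) + 1
    # bisect_right counts the boundaries that are <= length, which is the bucket index
    return batch_sizes[bisect_right(boundaries, length)]


for length in (1, 30, 31, 200):
    print(length, batch_size_for_length(length))
```

Under this convention a 30-token sentence is batched with up to 128 others, while anything of 171 tokens or more is batched in pairs.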

In [55]:
%%writefile wolof-translate/wolof_translate/utils/hg_training.py
from wolof_translate import *
import warnings

def train(config: dict):
    
    # ---------------------------------------
    # add distribution if necessary (https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/mnist.py)
    
    logger = config['logger']
    
    is_distributed = len(config['hosts']) > 1 and config['backend'] is not None
    
    use_cuda = config['num_gpus'] > 0
    
    config.update({"num_workers": 1, "pin_memory": True} if use_cuda else {})

    if logger is not None:
        
        logger.debug("Distributed training - {}".format(is_distributed))
        
        logger.debug("Number of gpus available - {}".format(config['num_gpus']))
        
    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(config['hosts'])
        
        os.environ["WORLD_SIZE"] = str(world_size)
        
        host_rank = config['hosts'].index(config['current_host'])
        
        os.environ["RANK"] = str(host_rank)
        
        dist.init_process_group(backend=config['backend'], rank=host_rank, world_size=world_size)
        
        if logger is not None: logger.info(
            "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                config['backend'], dist.get_world_size()
            )
            + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), config['num_gpus'])
        )
    # ---------------------------------------
    
    # split the data
    if config['include_split']: split_data(config['random_state'], config['data_directory'], config['data_file'])

    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # Initialize the model name
    model_name = 't5-small'

    # import the model with its pre-trained weights
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # resize the token embeddings
    model.resize_token_embeddings(len(tokenizer))
    
    # load the train and test sets
    train_dataset, test_dataset = recuperate_datasets(config['char_p'],
                                                        config['word_p'], config['max_len'],
                                                        config['end_mark'], tokenizer, config['corpus_1'],
                                                        config['corpus_2'],
                                                        config['train_file'], config['test_file'])
    
    # initialize the evaluation object
    evaluation = TranslationEvaluation(tokenizer, train_dataset.decode)

    # let us initialize the trainer
    trainer = ModelRunner(model = model, version=config['version'], seed = 0, evaluation = evaluation, optimizer = Adafactor)

    #-------------------------------------
    # used only when the linear learning rate scheduler with warmup is enabled
    
    # compute the number of warmup steps (for example with a max epoch of 100)
    # length = len(train_dataset)

    # n_steps = length // config['batch_size']

    # num_steps = config['max_epoch'] * n_steps

    # warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

    # Initialize the scheduler parameters
    # scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}
    #-------------------------------------

    # Initialize the optimizer parameters
    optimizer_args = {
        'lr': config['learning_rate'],
        'weight_decay': config['weight_decay'],
        # 'betas': (0.9, 0.98),
        'warmup_init': config['warmup_init'],
        'relative_step': config['relative_step']
    }

    # ----------------------------
    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    train_sampler = SequenceLengthBatchSampler(train_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    # ------------------------------
    # initialize a bucket sampler with fixed batch size in the case of single machine
    # with parallelization on multiple gpus
    # train_sampler = BucketSampler(train_dataset, config['batch_size'])

    # test_sampler = BucketSampler(test_dataset, config['batch_size'])
    
    # ------------------------------

    # Initialize the loaders parameters
    train_loader_args = {'batch_sampler': train_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    # Add the datasets and hyperparameters to trainer
    trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                    test_loader_args, optimizer_kwargs = optimizer_args,
                    # lr_scheduler=get_linear_schedule_with_warmup,
                    # lr_scheduler_kwargs=scheduler_args,
                    predict_with_generate = True,
                    hugging_face = True,
                    is_distributed=is_distributed,
                    logging_dir=config['logging_dir'],
                    dist=dist
                    )

    # load the model
    trainer.load(config['model_dir'], load_best = not config['continue'])
    
    # Train the model
    trainer.train(config['epochs'] - trainer.current_epoch,
                  auto_save = True,
                  log_step = config['log_step'],
                  saving_directory = config['new_model_dir'],
                  save_best = config['save_best'],
                  metric_for_best_model = config['metric_for_best_model'],
                  metric_objective = config['metric_objective'])
    
    if config['return_trainer']:
        
        return trainer
    
    return None


Overwriting wolof-translate/wolof_translate/utils/hg_training.py
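
The commented-out block in `train` describes how the step counts for a linear warmup scheduler would be derived from the dataset size. A worked sketch of that arithmetic in plain Python; `warmup_schedule` is a hypothetical helper, and the dataset size, batch size and `warmup_ratio` are illustrative values (the config above sets `batch_size` and `max_epoch` to `None` and has no `warmup_ratio` entry). The warmup count is rounded to the nearest integer, a choice made here because schedulers count whole steps:

```python
def warmup_schedule(n_examples, batch_size, max_epoch, warmup_ratio):
    """Derive the scheduler step counts from the dataset size."""
    # Number of optimizer steps in one pass over the data
    steps_per_epoch = n_examples // batch_size
    # Total number of steps over the whole training run
    num_training_steps = max_epoch * steps_per_epoch
    # Share of those steps spent warming up the learning rate
    num_warmup_steps = round(num_training_steps * warmup_ratio)
    return {"num_warmup_steps": num_warmup_steps,
            "num_training_steps": num_training_steps}


# Hypothetical values: 3200 sentence pairs, batches of 32, 100 epochs, 10% warmup
print(warmup_schedule(n_examples=3200, batch_size=32, max_epoch=100, warmup_ratio=0.1))
# {'num_warmup_steps': 1000, 'num_training_steps': 10000}
```

The returned keys match the `scheduler_args` dictionary sketched in the commented-out code. With the bucket samplers used here the batch size varies per bucket, so the number of batches reported by the progress bar would be a more faithful value for `steps_per_epoch`.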


Train the model below and save it if needed.

In [56]:
from wolof_translate.utils.hg_training import train

In [44]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/25 [00:00<?, ?it/s]

For epoch 6: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]



Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.44batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.33batches/s]



Metrics: {'train_loss': 2.2012223856491864, 'test_loss': 2.97088413972121, 'bleu': 1.3288132867132867, 'gen_len': 32.2832027972028}




  4%|▍         | 1/25 [00:20<08:14, 20.62s/it]

For epoch 7: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]



Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.44batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.36batches/s]



Metrics: {'train_loss': 2.0587681743011346, 'test_loss': 2.920340236250337, 'bleu': 1.5949433566433566, 'gen_len': 31.53494195804196}




  8%|▊         | 2/25 [00:42<08:17, 21.65s/it]

For epoch 8: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]



Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.48batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]



Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.41batches/s]



Metrics: {'train_loss': 1.9235682864349695, 'test_loss': 2.907906102133798, 'bleu': 1.9670594405594404, 'gen_len': 30.622374825174827}




 12%|█▏        | 3/25 [01:04<07:57, 21.72s/it]

For epoch 9: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.40batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.32batches/s]



Metrics: {'train_loss': 1.8176907369232496, 'test_loss': 2.957092210129424, 'bleu': 1.7479314685314682, 'gen_len': 31.54194195804196}




 16%|█▌        | 4/25 [01:25<07:30, 21.44s/it]

For epoch 10: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.43batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.27batches/s]



Metrics: {'train_loss': 1.7026850958812894, 'test_loss': 3.0525858285543808, 'bleu': 1.4636772727272727, 'gen_len': 30.437095804195803}




 20%|██        | 5/25 [01:46<07:06, 21.34s/it]

For epoch 11: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.42batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.35batches/s]



Metrics: {'train_loss': 1.6265092072499394, 'test_loss': 3.011158271269365, 'bleu': 2.3412748251748248, 'gen_len': 30.237743356643353}




 24%|██▍       | 6/25 [02:07<06:42, 21.17s/it]

For epoch 12: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.41batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.24batches/s]



Metrics: {'train_loss': 1.5177782061035368, 'test_loss': 3.071008787288532, 'bleu': 3.338223426573427, 'gen_len': 30.769244055944053}




 28%|██▊       | 7/25 [02:29<06:22, 21.25s/it]

For epoch 13: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.37batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.29batches/s]



Metrics: {'train_loss': 1.4270186270617693, 'test_loss': 3.084889536970979, 'bleu': 2.645886013986014, 'gen_len': 29.01398391608392}




 32%|███▏      | 8/25 [02:50<06:00, 21.22s/it]

For epoch 14: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.41batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.33batches/s]



Metrics: {'train_loss': 1.3568135406520627, 'test_loss': 3.039921688866782, 'bleu': 2.096099300699301, 'gen_len': 27.216797902097902}




 36%|███▌      | 9/25 [03:11<05:37, 21.12s/it]

For epoch 15: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.50batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.28batches/s]



Metrics: {'train_loss': 1.2849820408372312, 'test_loss': 3.1051486388786693, 'bleu': 3.5916318181818183, 'gen_len': 29.580393006993006}




 40%|████      | 10/25 [03:32<05:16, 21.07s/it]

For epoch 16: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.45batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.36batches/s]



Metrics: {'train_loss': 1.2035034914815803, 'test_loss': 3.0593154080264218, 'bleu': 3.5271048951048947, 'gen_len': 27.00347762237762}




 44%|████▍     | 11/25 [03:52<04:53, 20.95s/it]

For epoch 17: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.45batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.32batches/s]



Metrics: {'train_loss': 1.1423003066430517, 'test_loss': 3.2354436237495254, 'bleu': 3.003247552447552, 'gen_len': 29.86014405594406}




 48%|████▊     | 12/25 [04:13<04:32, 20.94s/it]

For epoch 18: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.46batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.26batches/s]



Metrics: {'train_loss': 1.0782466187525848, 'test_loss': 3.068224074957254, 'bleu': 3.4960227272727273, 'gen_len': 30.300693706293707}




 52%|█████▏    | 13/25 [04:35<04:12, 21.03s/it]

For epoch 19: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.51batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.34batches/s]



Metrics: {'train_loss': 1.0126017112078027, 'test_loss': 3.157427514349664, 'bleu': 3.7345384615384614, 'gen_len': 30.650358041958047}




 56%|█████▌    | 14/25 [04:55<03:50, 20.94s/it]

For epoch 20: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.49batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.28batches/s]



Metrics: {'train_loss': 0.9505923225250786, 'test_loss': 3.2311591268419386, 'bleu': 4.545248951048951, 'gen_len': 29.867105594405597}




 60%|██████    | 15/25 [05:16<03:29, 20.99s/it]

For epoch 21: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.44batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.28batches/s]



Metrics: {'train_loss': 0.9221569982401709, 'test_loss': 3.342424379362093, 'bleu': 4.577914685314686, 'gen_len': 29.97205734265734}




 64%|██████▍   | 16/25 [05:38<03:09, 21.03s/it]

For epoch 22: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.49batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.24batches/s]



Metrics: {'train_loss': 0.8686813425580567, 'test_loss': 3.3138423116057067, 'bleu': 4.567988811188811, 'gen_len': 29.0175013986014}




 68%|██████▊   | 17/25 [05:59<02:48, 21.08s/it]

For epoch 23: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.44batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.25batches/s]



Metrics: {'train_loss': 0.805647796151038, 'test_loss': 3.304755932801253, 'bleu': 5.165895104895106, 'gen_len': 30.206317482517484}




 72%|███████▏  | 18/25 [06:20<02:28, 21.14s/it]

For epoch 24: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.44batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.30batches/s]



Metrics: {'train_loss': 0.7463729696834536, 'test_loss': 3.153714696844141, 'bleu': 5.976082167832168, 'gen_len': 25.657329370629373}




 76%|███████▌  | 19/25 [06:41<02:06, 21.10s/it]

For epoch 25: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.43batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.37batches/s]



Metrics: {'train_loss': 0.7106834085496115, 'test_loss': 3.3087262707156735, 'bleu': 4.279994055944056, 'gen_len': 28.41257832167832}




 80%|████████  | 20/25 [07:02<01:44, 20.96s/it]

For epoch 26: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.21batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.26batches/s]



Metrics: {'train_loss': 0.6612085418610337, 'test_loss': 3.3331449115192973, 'bleu': 4.954048951048952, 'gen_len': 28.080423776223782}




 84%|████████▍ | 21/25 [07:23<01:24, 21.18s/it]

For epoch 27: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.35batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.25batches/s]



Metrics: {'train_loss': 0.626731078167722, 'test_loss': 3.367448740072183, 'bleu': 6.475375174825176, 'gen_len': 28.678298601398602}




 88%|████████▊ | 22/25 [07:45<01:03, 21.27s/it]

For epoch 28: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.36batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.25batches/s]



Metrics: {'train_loss': 0.589140331246786, 'test_loss': 3.37711390581998, 'bleu': 6.344759090909091, 'gen_len': 30.244778321678325}




 92%|█████████▏| 23/25 [08:06<00:42, 21.32s/it]

For epoch 29: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.34batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:06<00:00,  1.30batches/s]



Metrics: {'train_loss': 0.5504833005073876, 'test_loss': 3.431763330539623, 'bleu': 7.882058041958043, 'gen_len': 27.56643636363637}




 96%|█████████▌| 24/25 [08:27<00:21, 21.29s/it]

For epoch 30: 


Train batch number 2:   0%|          | 0/86 [00:00<?, ?batches/s]

Train batch number 87: 100%|██████████| 86/86 [00:13<00:00,  6.29batches/s]
Test batch number 2:   0%|          | 0/9 [00:00<?, ?batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:07<00:00,  1.15batches/s]



Metrics: {'train_loss': 0.5199076705752737, 'test_loss': 3.3950424027609665, 'bleu': 6.214842657342657, 'gen_len': 27.10490769230769}




100%|██████████| 25/25 [08:50<00:00, 21.21s/it]
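
The `Metrics:` lines in the log above can be compared programmatically. The following is a minimal sketch, not part of the notebook's training code: it parses four of the logged lines (copied verbatim from epochs 8, 24, 29 and 30) and reports which epoch has the lowest `test_loss` and which has the highest `bleu`. The `log_text` string and the helper name `parse_metrics` are assumptions introduced for illustration.

```python
import ast
import re

# A few lines copied from the training log above (epochs 8, 24, 29 and 30).
log_text = """
For epoch 8:
Metrics: {'train_loss': 1.9235682864349695, 'test_loss': 2.907906102133798, 'bleu': 1.9670594405594404, 'gen_len': 30.622374825174827}
For epoch 24:
Metrics: {'train_loss': 0.7463729696834536, 'test_loss': 3.153714696844141, 'bleu': 5.976082167832168, 'gen_len': 25.657329370629373}
For epoch 29:
Metrics: {'train_loss': 0.5504833005073876, 'test_loss': 3.431763330539623, 'bleu': 7.882058041958043, 'gen_len': 27.56643636363637}
For epoch 30:
Metrics: {'train_loss': 0.5199076705752737, 'test_loss': 3.3950424027609665, 'bleu': 6.214842657342657, 'gen_len': 27.10490769230769}
"""


def parse_metrics(text):
    """Return {epoch: metrics_dict} from 'For epoch N:' / 'Metrics: {...}' pairs."""
    results = {}
    epoch = None
    for line in text.splitlines():
        line = line.strip()
        match = re.match(r"For epoch (\d+):", line)
        if match:
            # Remember which epoch the next Metrics line belongs to.
            epoch = int(match.group(1))
        elif line.startswith("Metrics:") and epoch is not None:
            # The logged dictionary is a plain Python literal.
            results[epoch] = ast.literal_eval(line[len("Metrics:"):].strip())
    return results


metrics_by_epoch = parse_metrics(log_text)
best_loss_epoch = min(metrics_by_epoch, key=lambda e: metrics_by_epoch[e]["test_loss"])
best_bleu_epoch = max(metrics_by_epoch, key=lambda e: metrics_by_epoch[e]["bleu"])

print("lowest test_loss at epoch", best_loss_epoch)
print("highest bleu at epoch", best_bleu_epoch)
```

On these lines the lowest `test_loss` falls at epoch 8 while the highest `bleu` falls at epoch 29, so the metric used to select the best model changes which checkpoint counts as best.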


In [57]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

0it [00:00, ?it/s]


➡️ Predictions


In [58]:
if trainer is not None:
    
    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # build the test dataset
    # initialize the transformation sequence
    end_mark_fn = partial(add_end_mark)
    augmentation = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)


    # let us get the test set
    test_dataset = SentenceDataset(f"{config['data_directory']}test_set.csv",
                                            tokenizer = tokenizer,
                                            cp1_transformer = augmentation,
                                            cp2_transformer = augmentation,
                                            corpus_1=config['corpus_1'],
                                            corpus_2=config['corpus_2'],
                                            truncation = False)

    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                            'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    metrics, prediction = trainer.evaluate(test_dataset, test_loader_args)


Evaluation batch number 2:   0%|          | 0/7 [00:00<?, ?batches/s]

Evaluation batch number 8: 100%|██████████| 7/7 [00:04<00:00,  1.46batches/s]
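
The `huggingface/tokenizers` message printed in the training logs of this notebook names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in the first notebook cell, before any tokenizer is created; choosing `false` is an assumption here (it matches what the message says the library does after the fork, namely disabling parallelism).

```python
import os

# Set the variable before any tokenizer is created or used, so the library
# does not have to decide for itself once the process is forked
# (for example, when data-loading worker processes are started).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```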


In [59]:
metrics

{'test_loss': 3.406999619690688,
 'bleu': 5.349946153846154,
 'gen_len': 24.363641608391607}

In [60]:
prediction

Unnamed: 0,original_sentences,translations,predictions
0,Nit la ku baax.,C'est un homme gentil.,C'est un homme de un homme gentil.
1,Góor gi dana dem.,L'homme ira.,L'homme qui allait partir.
2,Looy ñaan?,Tu pries pour quoi?,Qu'as-tu acheté?
3,Maa di dem.,C'est moi qui pars.,C'est moi qui vais partir de partir.
4,Ci biir.,À l'intérieur.,Chez celui qu'il a indiqué.
...,...,...,...
281,"Teg nañ ciy ati-at ma door a jëli ni jigéen, n...","Ce n'est que longtemps après, quand l'égoïsme ...",Je me souviens de la violence en Afrique. Non...
282,"Li wóor te wér moo di ne bi loolu lépp weesoo,...","J'ai ressenti de l'étonnement, et même de l'in...",Je me souviens de la violence en Afrique. Non...
283,"Dàtti garab yaa ngi lunk, sànneeku jëm ca kow,...",À quel point les arbres aux troncs rectilignes...,Je me souviens de la violence en Afrique. Non...
284,"Li koy yëngal noonu, xam naa ko. Lan moo ko dà...",Je peux ressentir l'émotion qu'il éprouve à tr...,"Jusqu'à présent, on se moquait des planteurs ..."


---------------------------

--------------

#### French-Wolof v7

➡️ Import the libraries.

In [29]:
from wolof_translate import *

# specify a seed for everything
lt.seed_everything(0)

Global seed set to 0


0

➡️ Function to recuperate datasets

In [30]:
%%writefile wolof-translate/wolof_translate/utils/recuperate_datasets.py
from wolof_translate import *

def recuperate_datasets(char_p: float, word_p: float, max_len: int, end_mark: int, tokenizer: T5TokenizerFast,
                        corpus_1: str = 'french', corpus_2: str = 'wolof', 
                        train_file: str = 'data/extractions/new_data/train_set.csv', 
                        test_file: str = 'data/extractions/new_data/test_file.csv'):

  # Choose how end marks are handled, depending on the end_mark option
  if end_mark == 1:
    # Option 1: no end-mark function; the first sequence adds keyboard noise, the second does not
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space)

    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)
    
  else:
    
    if end_mark == 2:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!', replace = True)
    
    elif end_mark == 3:

      end_mark_fn = partial(add_end_mark)
    
    elif end_mark == 4:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!')
    
    else:

      raise ValueError(f'No end mark number {end_mark}')

    # Build the same two sequences, with the end-mark function appended
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
  # Build the (augmented) training dataset
  train_dataset_aug = SentenceDataset(train_file,
                                        tokenizer,
                                        truncation = False,
                                        cp1_transformer = fr_augmentation_1,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2
                                        )

  # Build the validation dataset
  valid_dataset = SentenceDataset(test_file,
                                        tokenizer,
                                        cp1_transformer = fr_augmentation_2,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2,
                                        truncation = False)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

Overwriting wolof-translate/wolof_translate/utils/recuperate_datasets.py


In [31]:
%run wolof-translate/wolof_translate/utils/recuperate_datasets.py
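
The `end_mark` branches in `recuperate_datasets` differ only in the keyword arguments bound to `add_end_mark` with `partial`. As a sketch of an alternative layout, not the package's code, the same choice can be written as a lookup table. `make_end_mark_fn`, `END_MARK_KWARGS` and the stand-in `echo_add_end_mark` (which merely returns its arguments) are assumptions for illustration; option 1, which uses no end-mark function, is left to the caller.

```python
from functools import partial

# Keyword arguments bound for each end_mark option, as in the branches above.
END_MARK_KWARGS = {
    2: {"end_mark_to_remove": "!", "replace": True},
    3: {},
    4: {"end_mark_to_remove": "!"},
}


def make_end_mark_fn(add_end_mark, end_mark):
    """Bind the keyword arguments that correspond to the given end_mark option."""
    if end_mark not in END_MARK_KWARGS:
        raise ValueError(f"No end mark number {end_mark}")
    return partial(add_end_mark, **END_MARK_KWARGS[end_mark])


# Stand-in for the real add_end_mark: it only echoes what it receives,
# so the example makes no claim about the real function's behavior.
def echo_add_end_mark(sentence, end_mark_to_remove=None, replace=False):
    return sentence, end_mark_to_remove, replace


end_mark_fn = make_end_mark_fn(echo_add_end_mark, 2)
print(end_mark_fn("Ci biir."))
```

A table keeps the valid options in one place, and adding a new option becomes a one-line change.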

➡️ Training

In [36]:
# initialize the configurations
config = {
    'epochs': 7,
    'max_epoch': None,
    'log_step': 1,
    'metric_for_best_model': 'test_loss',
    'metric_objective': 'minimize',
    'corpus_1': 'french',
    'corpus_2': 'wolof',
    'train_file': 'data/extractions/new_data/train_set.csv',
    'test_file': 'data/extractions/new_data/valid_set.csv',
    'drop_out_rate': 0.02121451891074674,
    'd_model': 512,
    'n_head': 8,
    'dim_ff': 2024,
    'n_encoders': 6,
    'n_decoders': 6,
    'learning_rate': 0.0061194459986928336,
    'weight_decay': 0.004113564518012536,
    'char_p': 0.14939770357468674,
    'word_p': 0.05670290376911641,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': [2, 30, 57, 84, 112, 139, 166],
    'batch_sizes': [256, 128, 64, 32, 16, 8, 4, 2],
    'batch_size': None, 
    'warmup_init': False,
    'relative_step': False,
    'num_workers': 0,
    'pin_memory': False,
    # --------------------> Must be changed when continuing training
    'model_dir': 't5_small_v7_fw',
    'new_model_dir': 't5_small_v7_fw',
    'continue': True, # --------------------------> Must be changed when continuing training
    'logging_dir': 'data/logs/t5_small_fw',
    'save_best': True,
    'tokenizer_path': 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v6.model',
    'data_directory': 'data/extractions/new_data/',
    'data_file': 'corpora_v7.csv',
    'version': 7,
    # used only for distributed training
    'backend': None,
    'hosts': [],
    'current_host': None,
    'num_gpus': 5,
    'logger': None,
    'return_trainer': True,
    'include_split': True,
}
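
The `boundaries` and `batch_sizes` entries above are paired: 7 boundaries delimit 8 length buckets, each with its own batch size, so longer sequences are grouped into smaller batches. The sketch below illustrates that pairing only. How `SequenceLengthBatchSampler` actually assigns lengths to buckets (in particular, whether a boundary value belongs to the lower or the upper bucket) is not shown in this notebook, and the helper `batch_size_for_length` is hypothetical.

```python
from bisect import bisect_right

# Values copied from the config above.
boundaries = [2, 30, 57, 84, 112, 139, 166]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]

# n boundaries split the length axis into n + 1 buckets.
assert len(batch_sizes) == len(boundaries) + 1


def batch_size_for_length(length):
    """Hypothetical helper: batch size of the bucket a sequence length falls in.

    Assumption: a length equal to a boundary goes to the upper bucket.
    """
    return batch_sizes[bisect_right(boundaries, length)]


for length in (1, 40, 100, 200):
    print(length, "->", batch_size_for_length(length))
```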

In [37]:
%%writefile wolof-translate/wolof_translate/utils/hg_training.py
from wolof_translate import *
import warnings

def train(config: dict):
    
    # ---------------------------------------
    # set up distributed training if necessary (see https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/mnist.py)
    
    logger = config['logger']
    
    is_distributed = len(config['hosts']) > 1 and config['backend'] is not None
    
    use_cuda = config['num_gpus'] > 0
    
    config.update({"num_workers": 1, "pin_memory": True} if use_cuda else {})

    if logger is not None:
        
        logger.debug("Distributed training - {}".format(is_distributed))
        
        logger.debug("Number of gpus available - {}".format(config['num_gpus']))
        
    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(config['hosts'])
        
        os.environ["WORLD_SIZE"] = str(world_size)
        
        host_rank = config['hosts'].index(config['current_host'])
        
        os.environ["RANK"] = str(host_rank)
        
        dist.init_process_group(backend=config['backend'], rank=host_rank, world_size=world_size)
        
        if logger is not None:
            logger.info(
                "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                    config['backend'], dist.get_world_size()
                )
                + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), config['num_gpus'])
            )
    # ---------------------------------------
    
    # split the data
    if config['include_split']: split_data(config['random_state'], config['data_directory'], config['data_file'])

    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # Initialize the model name
    model_name = 't5-small'

    # import the model with its pre-trained weights
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # resize the token embeddings
    model.resize_token_embeddings(len(tokenizer))
    
    # load the train and test sets
    train_dataset, test_dataset = recuperate_datasets(config['char_p'],
                                                        config['word_p'], config['max_len'],
                                                        config['end_mark'], tokenizer, config['corpus_1'],
                                                        config['corpus_2'],
                                                        config['train_file'], config['test_file'])
    
    # initialize the evaluation object
    evaluation = TranslationEvaluation(tokenizer, train_dataset.decode)

    # let us initialize the trainer
    trainer = ModelRunner(model = model, version=config['version'], seed = 0, evaluation = evaluation, optimizer = Adafactor)

    #-------------------------------------
    # in the case when the linear learning rate scheduler with warmup is used
    
    # let us calculate the appropriate warmup steps (let us take a max epoch of 100)
    # length = len(train_dataset)

    # n_steps = length // config['batch_size']

    # num_steps = config['max_epoch'] * n_steps

    # warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

    # Initialize the scheduler parameters
    # scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}
    #-------------------------------------

    # Initialize the optimizer parameters
    optimizer_args = {
        'lr': config['learning_rate'],
        'weight_decay': config['weight_decay'],
        # 'betas': (0.9, 0.98),
        'warmup_init': config['warmup_init'],
        'relative_step': config['relative_step']
    }

    # ----------------------------
    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    train_sampler = SequenceLengthBatchSampler(train_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    # ------------------------------
    # initialize a bucket sampler with fixed batch size in the case of single machine
    # with parallelization on multiple gpus
    # train_sampler = BucketSampler(train_dataset, config['batch_size'])

    # test_sampler = BucketSampler(test_dataset, config['batch_size'])
    
    # ------------------------------

    # Initialize the loaders parameters
    train_loader_args = {'batch_sampler': train_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    # Add the datasets and hyperparameters to trainer
    trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                    test_loader_args, optimizer_kwargs = optimizer_args,
                    # lr_scheduler=get_linear_schedule_with_warmup,
                    # lr_scheduler_kwargs=scheduler_args,
                    predict_with_generate = True,
                    hugging_face = True,
                    is_distributed=is_distributed,
                    logging_dir=config['logging_dir'],
                    dist=dist
                    )

    # load the model
    trainer.load(config['model_dir'], load_best = not config['continue'])
    
    # Train the model for the remaining epochs (the epochs already done are subtracted when resuming)
    trainer.train(
        config['epochs'] - trainer.current_epoch,
        auto_save = True,
        log_step = config['log_step'],
        saving_directory = config['new_model_dir'],
        save_best = config['save_best'],
        metric_for_best_model = config['metric_for_best_model'],
        metric_objective = config['metric_objective']
    )
    
    if config['return_trainer']:
        
        return trainer
    
    return None


Overwriting wolof-translate/wolof_translate/utils/hg_training.py
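
In the distributed branch of `train`, the world size is the number of hosts and the rank of the current process is the position of its host in the host list. The sketch below reproduces only that bookkeeping, without initializing any process group. The host names `algo-1` and `algo-2` and the helper names are illustrative assumptions.

```python
def is_distributed(hosts, backend):
    """Same test as in `train`: more than one host and a backend are required."""
    return len(hosts) > 1 and backend is not None


def rank_and_world_size(hosts, current_host):
    """Hypothetical helper: derive (rank, world_size) from the host list."""
    world_size = len(hosts)

    # The rank is the index of the current host in the list of all hosts
    rank = hosts.index(current_host)

    return rank, world_size


# Hypothetical two-instance setup
hosts = ["algo-1", "algo-2"]

print(is_distributed(hosts, "gloo"))
print(rank_and_world_size(hosts, "algo-2"))

# With the config above (no hosts, no backend) training stays on one machine
print(is_distributed([], None))
```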


Below, we train the model and save it if needed.

In [38]:
from wolof_translate.utils.hg_training import train
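
The training output in the next cell contains a `huggingface/tokenizers` warning about forking. A likely cause is that the data loaders start a worker process (`num_workers` is set to 1 when GPUs are used) after the tokenizer has already been used. As the warning itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly avoids it. A minimal sketch, assuming it is executed before the tokenizer is first used:

```python
import os

# Disable the tokenizers' internal parallelism so that forking
# data-loader workers no longer triggers the warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```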

In [14]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/25 [00:00<?, ?it/s]

For epoch 6: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.80batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:09<00:00,  1.21batches/s]



Metrics: {'train_loss': 2.6273809150817358, 'test_loss': 3.429296156483838, 'bleu': 1.5858124786324788, 'gen_len': 19.186319658119654}




  4%|▍         | 1/25 [00:23<09:20, 23.35s/it]

For epoch 7: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.76batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:05<00:00,  1.86batches/s]



Metrics: {'train_loss': 2.4541173095474713, 'test_loss': 3.404978168520153, 'bleu': 1.8367466666666665, 'gen_len': 18.454717606837605}




  8%|▊         | 2/25 [00:51<09:59, 26.06s/it]

For epoch 8: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.77batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.63batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.63batches/s]



Metrics: {'train_loss': 2.2870384227831813, 'test_loss': 3.443212355915298, 'bleu': 1.8307552136752137, 'gen_len': 19.482052136752134}




 12%|█▏        | 3/25 [01:12<08:47, 23.97s/it]

For epoch 9: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.75batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.99batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.66batches/s]



Metrics: {'train_loss': 2.1352149050545752, 'test_loss': 3.438325615418263, 'bleu': 2.647915384615385, 'gen_len': 18.136759145299145}




 16%|█▌        | 4/25 [01:35<08:09, 23.33s/it]

For epoch 10: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.79batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.87batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.63batches/s]



Metrics: {'train_loss': 1.986386069497356, 'test_loss': 3.5020873004554685, 'bleu': 2.401228205128205, 'gen_len': 20.008519145299143}




 20%|██        | 5/25 [01:58<07:49, 23.50s/it]

For epoch 11: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.76batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.78batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.62batches/s]



Metrics: {'train_loss': 1.8512997479776316, 'test_loss': 3.569902686583691, 'bleu': 2.6901911111111105, 'gen_len': 19.938464273504273}




 24%|██▍       | 6/25 [02:20<07:14, 22.89s/it]

For epoch 12: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.79batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.87batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.77batches/s]



Metrics: {'train_loss': 1.6897891996582928, 'test_loss': 3.564492826380281, 'bleu': 3.600856581196582, 'gen_len': 19.615387350427348}




 28%|██▊       | 7/25 [02:41<06:40, 22.26s/it]

For epoch 13: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.77batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.93batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.72batches/s]



Metrics: {'train_loss': 1.5803535098291233, 'test_loss': 3.5194729707179926, 'bleu': 3.8159319658119655, 'gen_len': 18.323057435897432}




 32%|███▏      | 8/25 [03:02<06:12, 21.91s/it]

For epoch 14: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.78batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.77batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.65batches/s]



Metrics: {'train_loss': 1.4440055558724063, 'test_loss': 3.507848788530399, 'bleu': 4.121050256410257, 'gen_len': 18.683748888888886}




 36%|███▌      | 9/25 [03:24<05:49, 21.85s/it]

For epoch 15: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.65batches/s]
Test batch number 2:   9%|▉         | 1/11 [00:00<00:02,  4.58batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.81batches/s]



Metrics: {'train_loss': 1.3341362899260145, 'test_loss': 3.62996331891443, 'bleu': 4.754367692307693, 'gen_len': 18.868364786324786}




 40%|████      | 10/25 [03:45<05:25, 21.71s/it]

For epoch 16: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.73batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.71batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.64batches/s]



Metrics: {'train_loss': 1.217524775293543, 'test_loss': 3.761451134314904, 'bleu': 4.313910769230768, 'gen_len': 19.651287521367518}




 44%|████▍     | 11/25 [04:07<05:03, 21.65s/it]

For epoch 17: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.75batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.69batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.74batches/s]



Metrics: {'train_loss': 1.116343072042749, 'test_loss': 3.5930790754464947, 'bleu': 5.23972888888889, 'gen_len': 18.610275897435898}




 48%|████▊     | 12/25 [04:31<04:50, 22.32s/it]

For epoch 18: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.73batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.71batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.79batches/s]



Metrics: {'train_loss': 1.0311605998779383, 'test_loss': 3.714331290253207, 'bleu': 4.78909076923077, 'gen_len': 19.056409743589743}




 52%|█████▏    | 13/25 [04:52<04:23, 21.93s/it]

For epoch 19: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:13<00:00,  4.73batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.72batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:05<00:00,  2.13batches/s]



Metrics: {'train_loss': 0.9403659137550634, 'test_loss': 3.700549722329164, 'bleu': 6.10097641025641, 'gen_len': 17.04446376068376}




 56%|█████▌    | 14/25 [05:12<03:55, 21.37s/it]

For epoch 20: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.70batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.88batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.78batches/s]



Metrics: {'train_loss': 0.8617357562644241, 'test_loss': 3.7089439701830225, 'bleu': 6.111125982905984, 'gen_len': 18.157270769230767}




 60%|██████    | 15/25 [05:33<03:32, 21.30s/it]

For epoch 21: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.67batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.65batches/s]



Metrics: {'train_loss': 0.7813965496348204, 'test_loss': 3.7829569767682982, 'bleu': 5.685519829059829, 'gen_len': 18.36067333333333}




 64%|██████▍   | 16/25 [05:55<03:12, 21.43s/it]

For epoch 22: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.63batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.81batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.80batches/s]



Metrics: {'train_loss': 0.7253713563683709, 'test_loss': 3.7712708505809815, 'bleu': 6.514730427350428, 'gen_len': 17.788026495726495}




 68%|██████▊   | 17/25 [06:17<02:53, 21.63s/it]

For epoch 23: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.59batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.75batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.70batches/s]



Metrics: {'train_loss': 0.6614922964706429, 'test_loss': 3.760771334884514, 'bleu': 7.574549059829059, 'gen_len': 17.596589059829057}




 72%|███████▏  | 18/25 [06:39<02:31, 21.66s/it]

For epoch 24: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.58batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.75batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.67batches/s]



Metrics: {'train_loss': 0.6032347261150415, 'test_loss': 3.9356272681146605, 'bleu': 7.386919999999999, 'gen_len': 18.449588547008545}




 76%|███████▌  | 19/25 [07:01<02:10, 21.78s/it]

For epoch 25: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.56batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:05<00:00,  2.06batches/s]



Metrics: {'train_loss': 0.5677011773528785, 'test_loss': 4.0033436685545825, 'bleu': 7.683991111111111, 'gen_len': 17.141903247863244}




 80%|████████  | 20/25 [07:21<01:47, 21.47s/it]

For epoch 26: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.54batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.69batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.82batches/s]



Metrics: {'train_loss': 0.5069852984574893, 'test_loss': 3.9822296272995126, 'bleu': 7.061915042735043, 'gen_len': 17.222232136752133}




 84%|████████▍ | 21/25 [07:43<01:25, 21.45s/it]

For epoch 27: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.53batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.66batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.65batches/s]



Metrics: {'train_loss': 0.4642480768636187, 'test_loss': 4.044623067643908, 'bleu': 8.169699316239315, 'gen_len': 18.22220632478632}




 88%|████████▊ | 22/25 [08:05<01:05, 21.67s/it]

For epoch 28: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.52batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.67batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.69batches/s]



Metrics: {'train_loss': 0.44251986547597116, 'test_loss': 4.129232294946655, 'bleu': 7.961665470085469, 'gen_len': 17.35555213675213}




 92%|█████████▏| 23/25 [08:27<00:43, 21.79s/it]

For epoch 29: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.49batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.66batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:06<00:00,  1.79batches/s]



Metrics: {'train_loss': 0.41743088178926147, 'test_loss': 4.249863563439785, 'bleu': 7.452027692307692, 'gen_len': 17.839301025641024}




 96%|█████████▌| 24/25 [08:49<00:21, 21.77s/it]

For epoch 30: 


Train batch number 2:   0%|          | 0/66 [00:00<?, ?batches/s]



Train batch number 67: 100%|██████████| 66/66 [00:14<00:00,  4.45batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.66batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:05<00:00,  1.87batches/s]



Metrics: {'train_loss': 0.3819201127741519, 'test_loss': 4.304435429206262, 'bleu': 7.999988205128205, 'gen_len': 17.58118068376068}




100%|██████████| 25/25 [09:10<00:00, 22.04s/it]


In [39]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

0it [00:00, ?it/s]


➡️ Predictions


In [40]:
if trainer is not None:

    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])

    # build the test dataset
    # initialize the transformation sequence (cleaning functions only, no keyboard augmentation;
    # partial(add_end_mark) is the same end-mark option as end_mark == 3 in recuperate_datasets)
    end_mark_fn = partial(add_end_mark)
    augmentation = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)


    # load the test set
    test_dataset = SentenceDataset(f"{config['data_directory']}test_set.csv",
                                            tokenizer = tokenizer,
                                            cp1_transformer = augmentation,
                                            cp2_transformer = augmentation,
                                            corpus_1=config['corpus_1'],
                                            corpus_2=config['corpus_2'],
                                            truncation = False)

    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                            'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    metrics, prediction = trainer.evaluate(test_dataset, test_loader_args)


Evaluation batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Evaluation batch number 12: 100%|██████████| 11/11 [00:09<00:00,  1.18batches/s]


In [41]:
metrics

{'test_loss': 4.216073186988504,
 'bleu': 6.631126153846155,
 'gen_len': 18.235899999999994}

In [42]:
prediction

Unnamed: 0,original_sentences,translations,predictions
0,Toi tu connais personne.,Yaw xamuloo kenn.,Yaw doonkoon ŋga suñu njiit.
1,Donne un à chacun!,Mayal benn ñépp!,Mayal ñépp benn!
2,Je parle de ce petit.,Xale buu laa wax.,Maa ngiy féexlu ci suba.
3,Viens m'accompagner au marché.,Kaay gunge ma marse.,Kaay tànnal ma marse ba.
4,Où est la récompense?,Ana njëgu guró gi?,Ana boroom kër gi?
...,...,...,...
580,Je peux ressentir l'émotion qu'il éprouve à tr...,"Li koy yëngal noonu, xam naa ko. Lan moo ko dà...",Coppite gi may njort doon na àgg ba ci waxina...
581,"C'était avant la guerre, avant la solitude et ...","Walla ma ni boog loolu la doon gént, laata xar...","Su dee, su ma juumulee, baayam « District Off..."
582,"Je ne m'en souviens pas, mais j'ai dû crier, h...",Xam naa damaa waroon a yuuxu keroog ndax sama ...,"Sama xel dellu ci Afrig, am kàddu gu riir ci ..."
583,Ou bien les récits de grands Blancs qui voyage...,Walla boog mu mas maa may ci jaloorey ponkali ...,"Du lu war a jaaxal kenn. Su dee loolu la, maa..."
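
The aggregate BLEU score hides how individual predictions fail, so it can help to compare a few rows directly. The following is a minimal sketch in plain Python, using three (reference, prediction) pairs copied from the table above; `word_overlap` is a hypothetical helper, not part of `wolof_translate`, and it is a much cruder measure than BLEU.

```python
# Three (reference, prediction) pairs copied from the table above
pairs = [
    ("Yaw xamuloo kenn.", "Yaw doonkoon ŋga suñu njiit."),
    ("Mayal benn ñépp!", "Mayal ñépp benn!"),
    ("Ana njëgu guró gi?", "Ana boroom kër gi?"),
]


def word_overlap(reference: str, hypothesis: str) -> float:
    """Share of reference words found in the hypothesis (order ignored)."""
    ref_words = reference.split()
    hyp_words = set(hypothesis.split())
    return sum(word in hyp_words for word in ref_words) / len(ref_words)


for reference, hypothesis in pairs:
    # Whitespace splitting keeps punctuation attached to the word it follows
    exact = reference.strip() == hypothesis.strip()
    overlap = word_overlap(reference, hypothesis)
    print(f"{reference!r} vs {hypothesis!r}: exact={exact}, overlap={overlap:.2f}")
```

Because punctuation stays attached to the preceding word, "ñépp!" and "ñépp" count as different words here, which is why the second pair scores low even though it only swaps two words.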


----------------------------------

#### Wolof-French v7

➡️ Import the libraries.

In [22]:
from wolof_translate import *

# specify a seed for everything
lt.seed_everything(0)

Global seed set to 0


0

➡️ Function to recuperate datasets

In [23]:
%%writefile wolof-translate/wolof_translate/utils/recuperate_datasets.py
from wolof_translate import *

def recuperate_datasets(char_p: float, word_p: float, max_len: int, end_mark: int, tokenizer: T5TokenizerFast,
                        corpus_1: str = 'french', corpus_2: str = 'wolof', 
                        train_file: str = 'data/extractions/new_data/train_set.csv', 
                        test_file: str = 'data/extractions/new_data/test_file.csv'):

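  # Select the end-mark handling from the end_mark option
  if end_mark == 1:
    # Augmentation 1 adds keyboard typos before the cleaning functions (training corpus_1 sentences only);
    # augmentation 2 only applies the cleaning functions. corpus_1 is not necessarily French.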
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space)

    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)
    
  else:
    
    if end_mark == 2:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!', replace = True)
    
    elif end_mark == 3:

      end_mark_fn = partial(add_end_mark)
    
    elif end_mark == 4:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!')
    
    else:

      raise ValueError(f'No end mark number {end_mark}')

    # Same augmentations as above, with the end-mark function appended (corpus_1 is not necessarily French)
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
  # Recuperate the train dataset
  train_dataset_aug = SentenceDataset(train_file,
                                        tokenizer,
                                        truncation = False,
                                        cp1_transformer = fr_augmentation_1,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2
                                        )

  # Recuperate the valid dataset
  valid_dataset = SentenceDataset(test_file,
                                        tokenizer,
                                        cp1_transformer = fr_augmentation_2,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2,
                                        truncation = False)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

Overwriting wolof-translate/wolof_translate/utils/recuperate_datasets.py


In [24]:
%run wolof-translate/wolof_translate/utils/recuperate_datasets.py

➡️ Training

In [43]:
# initialize the configurations
config = {
    'epochs': 20,
    'max_epoch': None,
    'log_step': 1,
    'metric_for_best_model': 'test_loss',
    'metric_objective': 'minimize',
    'corpus_1': 'wolof',
    'corpus_2': 'french',
    'train_file': 'data/extractions/new_data/train_set.csv',
    'test_file': 'data/extractions/new_data/valid_set.csv',
    'drop_out_rate': 0.291121690756753,
    'd_model': 512,
    'n_head': 8,
    'dim_ff': 2024,
    'n_encoders': 6,
    'n_decoders': 6,
    'learning_rate': 0.002880919287770637,
    'weight_decay': 0.0003390262718277915,
    'char_p': 0.018203086625721746,
    'word_p': 0.02404347487358639,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': [2, 30, 57, 84, 112, 139, 166],
    'batch_sizes': [256, 128, 64, 32, 16, 8, 4, 2],
    'batch_size': None, 
    'warmup_init': False,
    'relative_step': False,
    'num_workers': 0,
    'pin_memory': False,
    # --------------------> Must be changed when continuing a training
    'model_dir': 't5_small_v7_wf',
    'new_model_dir': 't5_small_v7_wf',
    'continue': True, # --------------------------> Must be changed when continuing training
    'logging_dir': 'data/logs/t5_small_wf',
    'save_best': True,
    'tokenizer_path': 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v6.model',
    'data_directory': 'data/extractions/new_data/',
    'data_file': 'corpora_v7.csv',
    'version': 7,
    # in the case of a distributed training
    'backend': None,
    'hosts': [],
    'current_host': None,
    'num_gpus': 5,
    'logger': None,
    'return_trainer': True,
    'include_split': True,
}
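
The `boundaries` and `batch_sizes` entries go together: there is one more batch size than there are boundaries, which suggests one batch size per length bucket, with longer sequences getting smaller batches. The sketch below shows one plausible mapping from a sequence length to a batch size. It is an assumption about how such a sampler could work; the actual rule used by `SequenceLengthBatchSampler` is not shown in this notebook, and `batch_size_for_length` is a hypothetical helper.

```python
from bisect import bisect_right

# Values copied from the config above
boundaries = [2, 30, 57, 84, 112, 139, 166]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]


def batch_size_for_length(length: int) -> int:
    """Assumed rule: bucket i holds the lengths in [boundaries[i-1], boundaries[i])."""
    assert len(batch_sizes) == len(boundaries) + 1
    bucket = bisect_right(boundaries, length)
    return batch_sizes[bucket]


for length in (1, 10, 60, 200):
    print(length, batch_size_for_length(length))
```

Under this assumed rule, a sequence of length 60 falls in the fourth bucket (57 to 83) and is batched by 32, while anything of length 166 or more is batched by 2.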

In [44]:
%%writefile wolof-translate/wolof_translate/utils/hg_training.py
from wolof_translate import *
import warnings

def train(config: dict):
    
    # ---------------------------------------
    # add distribution if necessary (https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/mnist.py)
    
    logger = config['logger']
    
    is_distributed = len(config['hosts']) > 1 and config['backend'] is not None
    
    use_cuda = config['num_gpus'] > 0
    
    config.update({"num_workers": 1, "pin_memory": True} if use_cuda else {})

    if logger is not None:
        
        logger.debug("Distributed training - {}".format(is_distributed))
        
        logger.debug("Number of gpus available - {}".format(config['num_gpus']))
        
    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(config['hosts'])
        
        os.environ["WORLD_SIZE"] = str(world_size)
        
        host_rank = config['hosts'].index(config['current_host'])
        
        os.environ["RANK"] = str(host_rank)
        
        dist.init_process_group(backend=config['backend'], rank=host_rank, world_size=world_size)
        
        if logger is not None:
            logger.info(
                "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                    config['backend'], dist.get_world_size()
                )
                + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), config['num_gpus'])
            )
    # ---------------------------------------
    
    # split the data
    if config['include_split']: split_data(config['random_state'], config['data_directory'], config['data_file'])

    # recuperate the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # Initialize the model name
    model_name = 't5-small'

    # import the model with its pre-trained weights
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # resize the token embeddings
    model.resize_token_embeddings(len(tokenizer))
    
    # recuperate train and test set
    train_dataset, test_dataset = recuperate_datasets(config['char_p'],
                                                        config['word_p'], config['max_len'],
                                                        config['end_mark'], tokenizer, config['corpus_1'],
                                                        config['corpus_2'],
                                                        config['train_file'], config['test_file'])
    
    # initialize the evaluation object
    evaluation = TranslationEvaluation(tokenizer, train_dataset.decode)

    # let us initialize the trainer
    trainer = ModelRunner(model = model, version=config['version'], seed = 0, evaluation = evaluation, optimizer = Adafactor)

    #-------------------------------------
    # in the case when the linear learning rate scheduler with warmup is used
    
    # let us calculate the appropriate warmup steps (let us take a max epoch of 100)
    # length = len(train_dataset)

    # n_steps = length // config['batch_size']

    # num_steps = config['max_epoch'] * n_steps

    # warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

    # Initialize the scheduler parameters
    # scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}
    #-------------------------------------

    # Initialize the optimizer parameters
    optimizer_args = {
        'lr': config['learning_rate'],
        'weight_decay': config['weight_decay'],
        # 'betas': (0.9, 0.98),
        'warmup_init': config['warmup_init'],
        'relative_step': config['relative_step']
    }

    # ----------------------------
    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    train_sampler = SequenceLengthBatchSampler(train_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    # ------------------------------
    # initialize a bucket sampler with fixed batch size in the case of single machine
    # with parallelization on multiple gpus
    # train_sampler = BucketSampler(train_dataset, config['batch_size'])

    # test_sampler = BucketSampler(test_dataset, config['batch_size'])
    
    # ------------------------------

    # Initialize the loaders parameters
    train_loader_args = {'batch_sampler': train_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    # Add the datasets and hyperparameters to trainer
    trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                    test_loader_args, optimizer_kwargs = optimizer_args,
                    # lr_scheduler=get_linear_schedule_with_warmup,
                    # lr_scheduler_kwargs=scheduler_args,
                    predict_with_generate = True,
                    hugging_face = True,
                    is_distributed=is_distributed,
                    logging_dir=config['logging_dir'],
                    dist=dist
                    )

    # load the checkpoint (load_best is disabled when continuing a training)
    trainer.load(config['model_dir'], load_best = not config['continue'])
    
    # Train the model for the epochs that remain to reach config['epochs']
    trainer.train(config['epochs'] - trainer.current_epoch,
                  auto_save = True,
                  log_step = config['log_step'],
                  saving_directory = config['new_model_dir'],
                  save_best = config['save_best'],
                  metric_for_best_model = config['metric_for_best_model'],
                  metric_objective = config['metric_objective'])
    
    if config['return_trainer']:
        
        return trainer
    
    return None


Overwriting wolof-translate/wolof_translate/utils/hg_training.py
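
The distributed branch of `train` derives the world size from the number of hosts and the rank from the position of the current host in the host list. The sketch below isolates that bookkeeping in plain Python so it can be checked without starting a process group. `distributed_env` is a hypothetical helper, and the host names are illustrative values in the style SageMaker uses.

```python
def distributed_env(hosts: list, current_host: str) -> dict:
    """Return the WORLD_SIZE and RANK values, as strings, for one host."""
    if current_host not in hosts:
        raise ValueError(f"{current_host!r} is not in the host list {hosts}")
    return {"WORLD_SIZE": str(len(hosts)), "RANK": str(hosts.index(current_host))}


# Hypothetical host names (assumption, not taken from this notebook)
hosts = ["algo-1", "algo-2"]
print(distributed_env(hosts, "algo-2"))
```

With the config used in this notebook (`'hosts': []` and `'backend': None`), `is_distributed` is `False`, so this branch is skipped and the training stays on a single machine.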


Train the model below, and save it if necessary.

In [45]:
from wolof_translate.utils.hg_training import train
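
The training runs in this notebook print a `huggingface/tokenizers` warning each time data loader workers are forked (`train` sets `num_workers` to 1 when GPUs are used). The warning itself suggests setting the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming it is run before the training cell:

```python
import os

# Set the variable named in the warning before the data loader workers are forked
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Setting the value to `"false"` gives up tokenizer-level parallelism, which is what the library falls back to anyway once it detects the fork.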

In [28]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/25 [00:00<?, ?it/s]

For epoch 6: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Train batch number 61: 100%|██████████| 60/60 [00:23<00:00,  2.52batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.26batches/s]



Metrics: {'train_loss': 3.5544145637541567, 'test_loss': 3.557906624802157, 'bleu': 0.9559466666666665, 'gen_len': 20.58971863247863}




  4%|▍         | 1/25 [00:33<13:13, 33.08s/it]

For epoch 7: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:24<00:00,  2.46batches/s]
Test batch number 3:   9%|▉         | 1/11 [00:00<00:02,  4.59batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.36batches/s]



Metrics: {'train_loss': 3.3875999480766357, 'test_loss': 3.498508999897883, 'bleu': 1.3941232478632481, 'gen_len': 20.854708034188032}




  8%|▊         | 2/25 [01:07<13:05, 34.15s/it]

For epoch 8: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:23<00:00,  2.60batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.23batches/s]



Metrics: {'train_loss': 3.2478912552775627, 'test_loss': 3.4401248059721072, 'bleu': 1.641368205128205, 'gen_len': 21.083752478632476}




 12%|█▏        | 3/25 [01:42<12:32, 34.22s/it]

For epoch 9: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:25<00:00,  2.38batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.29batches/s]



Metrics: {'train_loss': 3.1177383677076347, 'test_loss': 3.4354852224007635, 'bleu': 1.7895782905982909, 'gen_len': 20.536745128205126}




 16%|█▌        | 4/25 [02:18<12:13, 34.94s/it]

For epoch 10: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:20<00:00,  2.92batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.31batches/s]



Metrics: {'train_loss': 2.991359824090049, 'test_loss': 3.4468863858117, 'bleu': 2.036101538461539, 'gen_len': 21.02394803418803}




 20%|██        | 5/25 [02:48<11:01, 33.09s/it]

For epoch 11: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:24<00:00,  2.46batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:09<00:00,  1.18batches/s]



Metrics: {'train_loss': 2.8759532886452917, 'test_loss': 3.37109613663111, 'bleu': 2.1354916239316233, 'gen_len': 21.10767846153846}




 24%|██▍       | 6/25 [03:24<10:48, 34.12s/it]

For epoch 12: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:21<00:00,  2.75batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.33batches/s]



Metrics: {'train_loss': 2.7573855085899135, 'test_loss': 3.365126796054026, 'bleu': 2.4606398290598297, 'gen_len': 20.71794444444444}




 28%|██▊       | 7/25 [03:56<10:03, 33.53s/it]

For epoch 13: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:22<00:00,  2.65batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:07<00:00,  1.42batches/s]



Metrics: {'train_loss': 2.6500792455428486, 'test_loss': 3.4173506288446927, 'bleu': 2.6750264957264958, 'gen_len': 20.74701675213675}




 32%|███▏      | 8/25 [04:27<09:18, 32.84s/it]

For epoch 14: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:21<00:00,  2.75batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.36batches/s]



Metrics: {'train_loss': 2.5380443433368174, 'test_loss': 3.3410615073310006, 'bleu': 2.7775425641025646, 'gen_len': 20.747032307692308}




 36%|███▌      | 9/25 [05:00<08:41, 32.61s/it]

For epoch 15: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:24<00:00,  2.44batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.27batches/s]



Metrics: {'train_loss': 2.4323231978254314, 'test_loss': 3.363137841836, 'bleu': 3.428517948717949, 'gen_len': 20.83416239316239}




 40%|████      | 10/25 [05:34<08:16, 33.08s/it]

For epoch 16: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:22<00:00,  2.67batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:09<00:00,  1.21batches/s]



Metrics: {'train_loss': 2.333548980788149, 'test_loss': 3.3440274169302397, 'bleu': 3.2440717948717945, 'gen_len': 21.005132820512816}




 44%|████▍     | 11/25 [06:06<07:40, 32.91s/it]

For epoch 17: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:24<00:00,  2.43batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.23batches/s]



Metrics: {'train_loss': 2.2358776534388882, 'test_loss': 3.3254459833487484, 'bleu': 4.005057264957265, 'gen_len': 20.62908102564102}




 48%|████▊     | 12/25 [06:42<07:19, 33.81s/it]

For epoch 18: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:23<00:00,  2.58batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:09<00:00,  1.18batches/s]



Metrics: {'train_loss': 2.122140735738343, 'test_loss': 3.324604402444301, 'bleu': 4.092709230769231, 'gen_len': 20.7367347008547}




 52%|█████▏    | 13/25 [07:17<06:49, 34.13s/it]

For epoch 19: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:22<00:00,  2.69batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:09<00:00,  1.18batches/s]



Metrics: {'train_loss': 2.0318410820475084, 'test_loss': 3.3232013095138417, 'bleu': 4.237663247863248, 'gen_len': 20.85469521367521}




 56%|█████▌    | 14/25 [07:51<06:14, 34.05s/it]

For epoch 20: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:26<00:00,  2.26batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.37batches/s]



Metrics: {'train_loss': 1.9317204730443407, 'test_loss': 3.3213615441933655, 'bleu': 4.656213846153847, 'gen_len': 20.470100854700853}




 60%|██████    | 15/25 [08:28<05:48, 34.87s/it]

For epoch 21: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:22<00:00,  2.64batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.33batches/s]



Metrics: {'train_loss': 1.8333162675755874, 'test_loss': 3.3651895767603173, 'bleu': 4.917832820512821, 'gen_len': 20.454705299145296}




 64%|██████▍   | 16/25 [09:00<05:06, 34.06s/it]

For epoch 22: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 61: 100%|██████████| 60/60 [00:27<00:00,  2.22batches/s]
Test batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]



Test batch number 12: 100%|██████████| 11/11 [00:08<00:00,  1.31batches/s]



Metrics: {'train_loss': 1.7523903528386144, 'test_loss': 3.3957172605726456, 'bleu': 5.2231699145299135, 'gen_len': 20.16238649572649}




 68%|██████▊   | 17/25 [09:36<04:38, 34.76s/it]

For epoch 23: 


Train batch number 2:   0%|          | 0/60 [00:00<?, ?batches/s]



Train batch number 30:  47%|████▋     | 28/60 [00:11<00:12,  2.54batches/s]
 68%|██████▊   | 17/25 [09:47<04:36, 34.57s/it]

KeyboardInterrupt
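
➡️ Reading the metrics of the last completed epochs

The last three completed epochs in the log above (20, 21 and 22) suggest that the model may be starting to overfit: the training loss keeps decreasing while the test loss increases, even though the BLEU score still improves slightly. The sketch below is not part of `wolof_translate`. It copies those three metric lines, rounded to four decimals, into a hypothetical `history` dictionary and compares the epochs. The names `history`, `best_loss_epoch`, `best_bleu_epoch` and `gaps` are our own assumptions.

```python
# Metrics copied by hand from the training log above (rounded to 4 decimals).
# `history` is a hypothetical helper structure, not something returned by `train`.
history = {
    20: {"train_loss": 1.9317, "test_loss": 3.3214, "bleu": 4.6562},
    21: {"train_loss": 1.8333, "test_loss": 3.3652, "bleu": 4.9178},
    22: {"train_loss": 1.7524, "test_loss": 3.3957, "bleu": 5.2232},
}

# Epoch with the lowest test loss and epoch with the highest BLEU score
best_loss_epoch = min(history, key=lambda epoch: history[epoch]["test_loss"])
best_bleu_epoch = max(history, key=lambda epoch: history[epoch]["bleu"])

# Gap between test loss and training loss: a growing gap hints at overfitting
gaps = {epoch: m["test_loss"] - m["train_loss"] for epoch, m in history.items()}

print(f"Lowest test loss at epoch {best_loss_epoch}")
print(f"Highest BLEU at epoch {best_bleu_epoch}")
for epoch, gap in gaps.items():
    print(f"Epoch {epoch}: test_loss - train_loss = {gap:.4f}")
```

The two criteria disagree on these three epochs, so which checkpoint to keep depends on the metric we care about most.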



In [46]:
# with warnings.catch_warnings():
#     warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

0it [00:00, ?it/s]


➡️ Predictions


In [47]:
if trainer is not None:

    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])

    # build the transformation sequence applied to both corpora of the test set
    end_mark_fn = partial(add_end_mark)
    augmentation = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)


    # let us get the test set
    test_dataset = SentenceDataset(f"{config['data_directory']}test_set.csv",
                                   tokenizer=tokenizer,
                                   cp1_transformer=augmentation,
                                   cp2_transformer=augmentation,
                                   corpus_1=config['corpus_1'],
                                   corpus_2=config['corpus_2'],
                                   truncation=False)

    # initialize the bucket sampler of the test set
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                              boundaries=boundaries,
                                              batch_sizes=batch_sizes)

    # arguments passed to the data loader used for the evaluation
    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    metrics, prediction = trainer.evaluate(test_dataset, test_loader_args)


Evaluation batch number 2:   0%|          | 0/11 [00:00<?, ?batches/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Evaluation batch number 12: 100%|██████████| 11/11 [00:10<00:00,  1.07batches/s]
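
➡️ Silencing the tokenizers fork warning

The `huggingface/tokenizers` message printed during training and evaluation asks us to set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming we accept disabling the tokenizer's internal parallelism: set the variable at the very top of the notebook, before the tokenizer is used and before the data loader workers are forked. Choosing `"false"` here is our assumption. The warning itself accepts either value.

```python
import os

# Set the variable before any tokenizer is used so that the worker
# processes forked by the data loader inherit it.
# "false" disables the tokenizer's internal parallelism (assumed choice).
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

This should only affect the tokenizer's own thread pool. The number of data loader workers is still controlled by `config['num_workers']`.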


In [48]:
metrics

{'test_loss': 3.3293098201099625,
 'bleu': 4.558357435897436,
 'gen_len': 21.695728376068374}

In [53]:
pd.options.display.max_rows = 1000
prediction.head(1000)

Unnamed: 0,original_sentences,translations,predictions
0,Jéndal gejju Kaasamaas.,Achète du poisson séché de Casamance.,Achète une autre chose de l
1,Bët yu réy la am.,Il a de gros yeux.,C'est cette femme qu
2,Asamaan saa ngiy dënnu.,Il tonne.,On y a des choses qui
3,Mi xar mépp.,Ce mouton en entier.,Le mouton n'est pas celui
4,Kër gu jëkk gi la.,C'est la première maison.,C'est la maison de la
5,Ndax kan dem?,Afin que parte?,Afin que qui parte?
6,Jamma rek.,Paix seulement.,Je n'ai pas été à
7,Ana ŋga?,Où es-tu?,Où est l'homme?
8,Naan?,De quelle manière?,Pai?
9,Yaw xamuloo kenn.,Toi tu connais personne.,Tois ne sont pas encore arrivée
