Custom Transformer Training
-------------------------------

In this notebook we train the custom transformer on multiple GPUs, when they are available, within a single machine. In [multiple](_custom_transformer_train_multiple.ipynb), we use SageMaker to distribute the training of the model over multiple instances.

We will proceed as follows:

- Load the libraries
- Create a function to retrieve the datasets (arguments: char_p, word_p, max_len, end_mark, tokenizer, corpus_1, corpus_2, train_file, test_file)
- Train the model, which is saved automatically (argument: the config dictionary initialized beforehand)
- Make predictions

➡️ Import the libraries.

In [1]:
from wolof_translate import *

# specify a seed for everything
lt.seed_everything(0)

Global seed set to 0


0

➡️ Function to retrieve the datasets

In [2]:
%%writefile wolof-translate/wolof_translate/utils/recuperate_datasets.py
from wolof_translate import *

def recuperate_datasets(char_p: float, word_p: float, max_len: int, end_mark: int, tokenizer: T5TokenizerFast,
                        corpus_1: str = 'french', corpus_2: str = 'wolof', 
                        train_file: str = 'data/extractions/new_data/train_set.csv', 
                        test_file: str = 'data/extractions/new_data/test_file.csv'):

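  """Build the augmented training dataset and the validation dataset.

  `end_mark` selects the end-mark handling: 1 applies no end-mark step,
  while 2, 3 and 4 each configure `add_end_mark` differently. Any other
  value raises a ValueError.
  """
  # Select the end-mark handling requested through end_mark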
  if end_mark == 1:
    # Create augmentation to add on French sentences
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space)

    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space)
    
  else:
    
    if end_mark == 2:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!', replace = True)
    
    elif end_mark == 3:

      end_mark_fn = partial(add_end_mark)
    
    elif end_mark == 4:

      end_mark_fn = partial(add_end_mark, end_mark_to_remove = '!')
    
    else:  
        
        raise ValueError(f'No end mark number {end_mark}')

    # Create augmentation to add on French sentences
    fr_augmentation_1 = TransformerSequences(nac.KeyboardAug(aug_char_p=char_p, aug_word_p=word_p,
                                                             aug_word_max = max_len),
                                          remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
    fr_augmentation_2 = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)
    
  # Build the training dataset (keyboard augmentation applied to corpus_1)
  train_dataset_aug = SentenceDataset(train_file,
                                        tokenizer,
                                        truncation = False,
                                        cp1_transformer = fr_augmentation_1,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2
                                        )

  # Build the validation dataset (no keyboard augmentation)
  valid_dataset = SentenceDataset(test_file,
                                        tokenizer,
                                        cp1_transformer = fr_augmentation_2,
                                        cp2_transformer = fr_augmentation_2,
                                        corpus_1=corpus_1,
                                        corpus_2=corpus_2,
                                        truncation = False)
  
  # Return the datasets
  return train_dataset_aug, valid_dataset

Overwriting wolof-translate/wolof_translate/utils/recuperate_datasets.py


In [3]:
%run wolof-translate/wolof_translate/utils/recuperate_datasets.py
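The `end_mark` argument selects which keyword arguments are bound to `add_end_mark` through `functools.partial`. The sketch below isolates that dispatch pattern only. `add_end_mark_stub` and `build_end_mark_fn` are hypothetical names introduced for illustration, and the stub does not reproduce what the real `add_end_mark` does to a sentence.

```python
from functools import partial


def add_end_mark_stub(sentence, end_mark_to_remove=None, replace=False):
    """Hypothetical stand-in: returns the options it was called with."""
    return (sentence, end_mark_to_remove, replace)


# Keyword arguments bound for each end_mark value, copied from recuperate_datasets.
END_MARK_OPTIONS = {
    2: {"end_mark_to_remove": "!", "replace": True},
    3: {},
    4: {"end_mark_to_remove": "!"},
}


def build_end_mark_fn(end_mark):
    """Return the configured callable, or None when end_mark == 1 (no end-mark step)."""
    if end_mark == 1:
        return None
    if end_mark not in END_MARK_OPTIONS:
        raise ValueError(f"No end mark number {end_mark}")
    return partial(add_end_mark_stub, **END_MARK_OPTIONS[end_mark])


fn = build_end_mark_fn(2)
print(fn("Bonjour"))  # ('Bonjour', '!', True)
```

A lookup table like this keeps the accepted values in one place, which may be easier to extend than a chain of `if`/`elif` branches.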

➡️ Training

In [4]:
# initialize the configurations
config = {
    'epochs': 15,
    'max_epoch': None,
    'log_step': 5,
    'metric_for_best_model': 'bleu',
    'metric_objective': 'maximize',
    'corpus_1': 'french',
    'corpus_2': 'wolof',
    'train_file': 'data/extractions/new_data/train_set.csv',
    'test_file': 'data/extractions/new_data/valid_set.csv',
    'drop_out_rate': 0.291121690756753,
    'd_model': 512,
    'n_head': 8,
    'dim_ff': 2024,
    'n_encoders': 6,
    'n_decoders': 6,
    'learning_rate': None,
    'weight_decay': 0.0,
    'char_p': 0.082269346292589,
    'word_p': 0.005292549318241768,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': [2, 31, 59, 87, 115, 143, 171],
    'batch_sizes': [256, 128, 64, 32, 16, 8, 4, 2],
    'batch_size': None, 
    'warmup_init': True,
    'relative_step': True,
    'num_workers': 0,
    'pin_memory': False,
    # --------------------> Must be changed when continuing a training
    'model_dir': 'data/checkpoints/custom_transformer_v6_fw_best',
    'new_model_dir': 'data/checkpoints/custom_transformer_v6_fw',
    'continue': False, # --------------------------> Must be changed when continuing training
    'logging_dir': 'data/logs/custom_transformer_fw',
    'save_best': True,
    'tokenizer_path': 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5.model',
    'data_directory': 'data/extractions/new_data/',
    'data_file': 'corpora_v6.csv',
    'version': 6,
    # in the case of a distributed training
    'backend': None,
    'hosts': [],
    'current_host': None,
    'num_gpus': 5,
    'logger': None,
    'return_trainer': True,
    'include_split': True,
}
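The seven values in `boundaries` delimit eight length buckets, one per entry of `batch_sizes`, so longer sequences are grouped into smaller batches. The sketch below shows one plausible lookup rule. It assumes that a length equal to a boundary falls into the next bucket; the actual rule used by `SequenceLengthBatchSampler` may differ, and `batch_size_for_length` is a hypothetical helper.

```python
from bisect import bisect_right

# Values copied from the config dictionary above.
boundaries = [2, 31, 59, 87, 115, 143, 171]
batch_sizes = [256, 128, 64, 32, 16, 8, 4, 2]

# n boundaries split the length axis into n + 1 buckets.
assert len(batch_sizes) == len(boundaries) + 1


def batch_size_for_length(length: int) -> int:
    """Return the batch size of the bucket that holds `length` (assumed rule)."""
    return batch_sizes[bisect_right(boundaries, length)]


for length in (1, 30, 31, 200):
    print(length, batch_size_for_length(length))
```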

In [5]:
# %%writefile wolof-translate/wolof_translate/utils/training.py
from wolof_translate import *
import warnings

def train(config: dict):
    
    # ---------------------------------------
    # add distribution if necessary (https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-python-sdk/pytorch_mnist/mnist.py)
    
    logger = config['logger']
    
    is_distributed = len(config['hosts']) > 1 and config['backend'] is not None
    
    use_cuda = config['num_gpus'] > 0
    
    config.update({"num_workers": 1, "pin_memory": True} if use_cuda else {})

    if logger is not None:
        
        logger.debug("Distributed training - {}".format(is_distributed))
        
        logger.debug("Number of gpus available - {}".format(config['num_gpus']))
        
    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(config['hosts'])
        
        os.environ["WORLD_SIZE"] = str(world_size)
        
        host_rank = config['hosts'].index(config['current_host'])
        
        os.environ["RANK"] = str(host_rank)
        
        dist.init_process_group(backend=config['backend'], rank=host_rank, world_size=world_size)
        
        if logger is not None: logger.info(
            "Initialized the distributed environment: '{}' backend on {} nodes. ".format(
                config['backend'], dist.get_world_size()
            )
            + "Current host rank is {}. Number of gpus: {}".format(dist.get_rank(), config['num_gpus'])
        )
    # ---------------------------------------
    
    # split the data
    if config['include_split']: split_data(config['random_state'], config['data_directory'], config['data_file'])

    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # build the train and test sets
    train_dataset, test_dataset = recuperate_datasets(config['char_p'],
                                                        config['word_p'], config['max_len'],
                                                        config['end_mark'], tokenizer, config['corpus_1'],
                                                        config['corpus_2'],
                                                        config['train_file'], config['test_file'])
    
    # initialize the evaluation object
    evaluation = TranslationEvaluation(tokenizer, train_dataset.decode)

    # let us initialize the trainer
    trainer = ModelRunner(model = Transformer, version=config['version'], seed = 0, evaluation = evaluation, optimizer = Adafactor)

    # initialize the encoder and the decoder layers
    encoder_layer = nn.TransformerEncoderLayer(config['d_model'],
                                                config['n_head'],
                                                config['dim_ff'],
                                                config['drop_out_rate'], batch_first = True)

    decoder_layer = nn.TransformerDecoderLayer(config['d_model'],
                                                config['n_head'],
                                                config['dim_ff'],
                                                config['drop_out_rate'], batch_first = True)

    # let us initialize the encoder and the decoder
    encoder = nn.TransformerEncoder(encoder_layer, config['n_encoders'])

    decoder = nn.TransformerDecoder(decoder_layer, config['n_decoders'])

    #-------------------------------------
    # in the case when the linear learning rate scheduler with warmup is used
    
    # let us calculate the appropriate warmup steps (let us take a max epoch of 100)
    # length = len(train_dataset)

    # n_steps = length // config['batch_size']

    # num_steps = config['max_epoch'] * n_steps

    # warmup_steps = (config['max_epoch'] * n_steps) * config['warmup_ratio']

    # Initialize the scheduler parameters
    # scheduler_args = {'num_warmup_steps': warmup_steps, 'num_training_steps': num_steps}
    #-------------------------------------

    # Initialize the transformer parameters
    model_args = {
        'vocab_size': len(tokenizer),
        'encoder': encoder,
        'decoder': decoder,
        'class_criterion': nn.CrossEntropyLoss(label_smoothing = config['label_smoothing']),
        'max_len': config['max_len']
    }

    # Initialize the optimizer parameters
    optimizer_args = {
        'lr': config['learning_rate'],
        'weight_decay': config['weight_decay'],
        # 'betas': (0.9, 0.98),
        'warmup_init': config['warmup_init'],
        'relative_step': config['relative_step']
    }

    # ----------------------------
    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    train_sampler = SequenceLengthBatchSampler(train_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    # ------------------------------
    # initialize a bucket sampler with fixed batch size in the case of single machine
    # with parallelization on multiple gpus
    # train_sampler = BucketSampler(train_dataset, config['batch_size'])

    # test_sampler = BucketSampler(test_dataset, config['batch_size'])
    
    # ------------------------------

    # Initialize the loaders parameters
    train_loader_args = {'batch_sampler': train_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                        'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    # Add the datasets and hyperparameters to trainer
    trainer.compile(train_dataset, test_dataset, tokenizer, train_loader_args,
                    test_loader_args, optimizer_kwargs = optimizer_args,
                    model_kwargs = model_args,
                    # lr_scheduler=get_linear_schedule_with_warmup,
                    # lr_scheduler_kwargs=scheduler_args,
                    predict_with_generate = True,
                    is_distributed=is_distributed,
                    logging_dir=config['logging_dir'],
                    dist=dist
                    )

    # load the model
    trainer.load(config['model_dir'], load_best = not config['continue'])
    
    # Train the model for the epochs that remain after the loaded checkpoint
    trainer.train(config['epochs'] - trainer.current_epoch,
                  auto_save = True,
                  log_step = config['log_step'],
                  saving_directory = config['new_model_dir'],
                  save_best = config['save_best'],
                  metric_for_best_model = config['metric_for_best_model'],
                  metric_objective = config['metric_objective'])
    
    if config['return_trainer']:
        
        return trainer
    
    return None
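The distributed branch of `train` only runs when more than one host is listed and a backend is set, and the rank of a host is its position in `config['hosts']`. The sketch below reproduces that decision without touching `torch.distributed`. `distributed_setup` is a hypothetical helper, and the host names are placeholders chosen for illustration.

```python
def distributed_setup(hosts, current_host, backend):
    """Return (is_distributed, world_size, rank), mirroring the checks in train()."""
    is_distributed = len(hosts) > 1 and backend is not None
    if not is_distributed:
        # train() defines neither a world size nor a rank in this case.
        return False, None, None
    return True, len(hosts), hosts.index(current_host)


# Defaults from the config dictionary: single machine, no backend.
print(distributed_setup([], None, None))  # (False, None, None)

# Placeholder host names; "gloo" is one of the torch.distributed backends.
print(distributed_setup(["algo-1", "algo-2"], "algo-2", "gloo"))  # (True, 2, 1)
```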


Below, we train the model and, if needed, save it.
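As a rough illustration of how many epochs the call runs: `train` passes `config['epochs'] - trainer.current_epoch` to `trainer.train`. The sketch assumes that the loaded checkpoint stored epoch 5, which would be consistent with the logs in this notebook (ten iterations, starting at epoch 6). `remaining_epochs` is a hypothetical helper, not part of `wolof_translate`.

```python
def remaining_epochs(target_epochs: int, current_epoch: int) -> int:
    """Number of epochs still to run; mirrors the subtraction done in train()."""
    return target_epochs - current_epoch


# 'epochs' comes from the config; the checkpoint epoch (5) is an assumption.
print(remaining_epochs(15, 5))  # 10
```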

In [6]:
from wolof_translate.utils.training import train

In [7]:
# with warnings.catch_warnings():
    # warnings.simplefilter("ignore")
trainer = train(config)

# save if necessary

  0%|          | 0/10 [00:00<?, ?it/s]

For epoch 6: 


Train batch number 41: 100%|██████████| 40/40 [00:33<00:00,  1.20batches/s]
 10%|█         | 1/10 [00:33<05:01, 33.46s/it]



For epoch 7: 


Train batch number 41: 100%|██████████| 40/40 [00:32<00:00,  1.24batches/s]
 20%|██        | 2/10 [01:05<04:21, 32.75s/it]



For epoch 8: 


Train batch number 41: 100%|██████████| 40/40 [00:31<00:00,  1.26batches/s]
 30%|███       | 3/10 [01:37<03:45, 32.24s/it]



For epoch 9: 


Train batch number 41: 100%|██████████| 40/40 [00:31<00:00,  1.28batches/s]
 40%|████      | 4/10 [02:08<03:11, 31.85s/it]



For epoch 10: 


Train batch number 41: 100%|██████████| 40/40 [00:34<00:00,  1.14batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:53<00:00,  5.92s/batches]



Metrics: {'train_loss': 33.21320210562812, 'test_loss': 7.546239852905273, 'accuracy': 0.06587777777777776, 'bleu': 0.16345555555555558, 'gen_len': 108.0}




 50%|█████     | 5/10 [03:41<04:28, 53.72s/it]

For epoch 11: 


Train batch number 41: 100%|██████████| 40/40 [00:32<00:00,  1.23batches/s]
 60%|██████    | 6/10 [04:13<03:06, 46.52s/it]



For epoch 12: 


Train batch number 41: 100%|██████████| 40/40 [00:31<00:00,  1.25batches/s]
 70%|███████   | 7/10 [04:45<02:05, 41.74s/it]



For epoch 13: 


Train batch number 41: 100%|██████████| 40/40 [00:35<00:00,  1.12batches/s]
 80%|████████  | 8/10 [05:21<01:19, 39.86s/it]



For epoch 14: 


Train batch number 41: 100%|██████████| 40/40 [00:34<00:00,  1.15batches/s]
 90%|█████████ | 9/10 [05:56<00:38, 38.25s/it]



For epoch 15: 


Train batch number 41: 100%|██████████| 40/40 [00:33<00:00,  1.20batches/s]

Test batch number 10: 100%|██████████| 9/9 [00:50<00:00,  5.56s/batches]



Metrics: {'train_loss': 29.828085475497776, 'test_loss': 7.098637474907769, 'accuracy': 0.07026666666666666, 'bleu': 0.1617, 'gen_len': 108.0}




100%|██████████| 10/10 [07:21<00:00, 44.17s/it]
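The configuration sets `metric_for_best_model` to `'bleu'` and `metric_objective` to `'maximize'`. The sketch below applies that rule to the two metric dictionaries logged above, to show which evaluation such a rule would favour. How `ModelRunner` compares checkpoints internally is not shown in this notebook, so `best_epoch` is a hypothetical helper, and `'minimize'` as the opposite objective is an assumption.

```python
# Metrics copied from the training logs above, keyed by epoch.
logged = {
    10: {'train_loss': 33.21320210562812, 'test_loss': 7.546239852905273,
         'accuracy': 0.06587777777777776, 'bleu': 0.16345555555555558, 'gen_len': 108.0},
    15: {'train_loss': 29.828085475497776, 'test_loss': 7.098637474907769,
         'accuracy': 0.07026666666666666, 'bleu': 0.1617, 'gen_len': 108.0},
}


def best_epoch(history: dict, metric: str, objective: str) -> int:
    """Return the epoch whose `metric` is best under `objective`."""
    choose = max if objective == 'maximize' else min
    return choose(history, key=lambda epoch: history[epoch][metric])


print(best_epoch(logged, 'bleu', 'maximize'))       # 10
print(best_epoch(logged, 'test_loss', 'minimize'))  # 15
```

The two criteria disagree here: BLEU is marginally higher at epoch 10, while the test loss is lower at epoch 15.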


➡️ Predictions


In [10]:
if trainer is not None:
    
    # load the tokenizer
    tokenizer = T5TokenizerFast(config['tokenizer_path'])
    
    # build the test dataset
    # initialize the transformation sequence
    end_mark_fn = partial(add_end_mark)
    augmentation = TransformerSequences(remove_mark_space, delete_guillemet_space, add_mark_space, end_mark_fn)


    # let us get the test set
    test_dataset = SentenceDataset(f"{config['data_directory']}test_set.csv",
                                            tokenizer = tokenizer,
                                            cp1_transformer = augmentation,
                                            cp2_transformer = augmentation,
                                            corpus_1=config['corpus_1'],
                                            corpus_2=config['corpus_2'],
                                            truncation = False)

    # initialize the bucket samplers for distributed environment
    boundaries = config['boundaries']
    batch_sizes = config['batch_sizes']

    test_sampler = SequenceLengthBatchSampler(test_dataset,
                                                boundaries = boundaries,
                                                batch_sizes = batch_sizes)

    test_loader_args = {'batch_sampler': test_sampler, 'collate_fn': collate_fn,
                            'num_workers': config['num_workers'], 'pin_memory': config['pin_memory']}

    metrics, prediction = trainer.evaluate(test_dataset, test_loader_args)



Evaluation batch number 8: 100%|██████████| 7/7 [00:36<00:00,  5.23s/batches]


In [11]:
metrics

{'test_loss': 7.070769105638776,
 'accuracy': 0.08175714285714286,
 'bleu': 0.22375714285714285,
 'gen_len': 82.14285714285714}

In [12]:
prediction

Unnamed: 0,original_sentences,translations,predictions
0,Vous parlez de quelle dame?,Jile jigéen jan ŋgeen wax?,??????????
1,C'est notre ami!,Suñu sarit la!,!!!!!!!!!!
2,L'homme t'avait vu.,Góor gi gisóon na la.,..........
3,Celui qui est en haut!,Kenn ki ci kaw!,!!!!!!!!!!
4,À l'intérieur.,Ci biir.,..........
...,...,...,...
281,"Ce n'est que longtemps après, quand l'égoïsme ...","Teg nañ ciy ati-at ma door a jëli ni jigéen, n...",",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,..."
282,"J'ai ressenti de l'étonnement, et même de l'in...","Li wóor te wér moo di ne bi loolu lépp weesoo,...",",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,..."
283,À quel point les arbres aux troncs rectilignes...,"Dàtti garab yaa ngi lunk, sànneeku jëm ca kow,...",",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,..."
284,Je peux ressentir l'émotion qu'il éprouve à tr...,"Li koy yëngal noonu, xam naa ko. Lan moo ko dà...",",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,..."
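Every prediction displayed above consists of one punctuation mark repeated. A quick way to quantify this is to count predictions made of a single repeated character. The sketch below is a minimal diagnostic under that definition; `is_degenerate` is a hypothetical helper, and the sample holds only the first five predictions shown in the table.

```python
def is_degenerate(prediction: str) -> bool:
    """True when a prediction longer than one character repeats a single character."""
    return len(prediction) > 1 and len(set(prediction)) == 1


# First five predictions from the table above.
sample = ["??????????", "!!!!!!!!!!", "..........", "!!!!!!!!!!", ".........."]

share = sum(is_degenerate(p) for p in sample) / len(sample)
print(share)  # 1.0
```

With the full frame, the same check could be applied to the `predictions` column, for example with `prediction['predictions'].map(is_degenerate).mean()`.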
