Custom Transformer Training
-------------------------------

In this notebook, we train the custom transformer on multiple GPUs when they are available. Each GPU sits in its own AWS instance. We reuse the functions created in [single](_custom_transformer_train_single.ipynb) to distribute the training over multiple AWS instances with SageMaker's PyTorch framework.

The work proceeds in the following steps:

- Configure the S3 bucket and retrieve the execution role
- Split the data from a local CSV file and upload each split to the S3 bucket
- Upload the tokenizer to the S3 bucket
- Upload the best model to the S3 bucket
- Specify the arguments passed to the Python file that compiles and trains the model on multiple g4dn machines
- Configure SageMaker's PyTorch framework with the necessary parameters and call the fit method to start the training
- Download the checkpoints and the logs from the S3 bucket

In [1]:
from wolof_translate import *

➡️ Parametrize the S3 bucket

In [2]:
# import sagemaker
import sagemaker

# initialize a session and a region
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name

# retrieve the default bucket and specify a prefix
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/wf_translation"

# get the role
role = sagemaker.get_execution_role()

➡️ Split the data and add the splits into the bucket

In [4]:
import os

# specify the data directory and the data file
data_directory = 'data/extractions/new_data/'
data_file = 'corpora_v6.csv'

# split the data
split_data(random_state=0, data_directory=data_directory, csv_file=data_file)

# upload the splits to the S3 bucket for the current session
train_split = sagemaker_session.upload_data(
    path=os.path.join(data_directory, 'train_set.csv'),
    bucket=bucket,
    key_prefix=prefix
)

valid_split = sagemaker_session.upload_data(
    path=os.path.join(data_directory, 'valid_set.csv'),
    bucket=bucket,
    key_prefix=prefix
)

test_split = sagemaker_session.upload_data(
    path=os.path.join(data_directory, 'test_set.csv'),
    bucket=bucket,
    key_prefix=prefix
)

# print the path where the splits are stored
print(f'Train is stored at: {train_split}\nTest is stored at: {test_split}\
    \nValid is stored at: {valid_split}')

# specify a dictionary containing the inputs
inputs = {
    'training': train_split,
    'testing': test_split,
    'validation': valid_split
}

Train is stored at: s3://sagemaker-us-east-1-634397825065/sagemaker/wf_translation/train_set.csv
Test is stored at: s3://sagemaker-us-east-1-634397825065/sagemaker/wf_translation/test_set.csv    
Valid is stored at: s3://sagemaker-us-east-1-634397825065/sagemaker/wf_translation/valid_set.csv
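
Each key of the inputs dictionary becomes a data channel of the training job. The sketch below shows how an entry point could locate the uploaded splits inside the container. It assumes the script reads the SM_CHANNEL_<NAME> environment variables that SageMaker sets for each channel; the helper names `channel_dir` and `split_path` are hypothetical and are not taken from the project's train.py.

```python
import os


def channel_dir(channel, environ=os.environ):
    """Return the local directory where a SageMaker channel is expected to be mounted."""
    # SageMaker exposes each channel as SM_CHANNEL_<NAME>; fall back to the
    # conventional mount point when the variable is absent (assumption for local runs).
    default = f"/opt/ml/input/data/{channel}"
    return environ.get(f"SM_CHANNEL_{channel.upper()}", default)


def split_path(channel, file_name, environ=os.environ):
    """Build the path of a file inside a channel directory."""
    return os.path.join(channel_dir(channel, environ), file_name)


if __name__ == "__main__":
    # Channel and file names copied from the cell above.
    print(split_path("training", "train_set.csv"))
    print(split_path("validation", "valid_set.csv"))
    print(split_path("testing", "test_set.csv"))
```

Passing the environment as an argument keeps the helpers easy to check outside a training job.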


➡️ Place the tokenizer inside the bucket

In [5]:
# path of the tokenizer
tokenizer_path = 'wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5.model'

# place the tokenizer inside the S3 bucket
tokenizer = sagemaker_session.upload_data(
    path=tokenizer_path,
    bucket=bucket,
    key_prefix=prefix
)

# print the path where the tokenizer is stored
print(f'Tokenizer is stored at: {tokenizer}')

# add the tokenizer to the inputs dictionary
inputs['tokenizer'] = tokenizer

Tokenizer is stored at: s3://sagemaker-us-east-1-634397825065/sagemaker/wf_translation/tokenizer_v5.model


➡️ Place the best checkpoints' directory inside the bucket

In [11]:
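import boto3
import os
import posixpath

# path of the model
model_path = 'custom_transformer_v6_fw_best' # --------------------------> Must be changed when continuing training

s3_client = boto3.client('s3')

# keep the directory name in the keys so that they match the URI stored in the inputs
model_prefix = posixpath.join(prefix, os.path.basename(model_path))

# upload every file of the directory, preserving its relative path
for root, _, files in os.walk(model_path):
    for file in files:
        local_path = os.path.join(root, file)
        relative_path = os.path.relpath(local_path, model_path)
        # S3 keys use forward slashes whatever the local separator is
        s3_object_key = posixpath.join(model_prefix, *relative_path.split(os.sep))

        s3_client.upload_file(local_path, bucket, s3_object_key)


# add the S3 URI of the directory to the inputs dictionary
inputs['model'] = f's3://{bucket}/{model_prefix}'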
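
Before uploading a whole directory, it can help to preview the keys that will be written. The sketch below is a dry run that never contacts S3. It assumes the keys should sit under the prefix followed by the directory name, so that they line up with the URI stored in inputs['model']; `plan_uploads` is a hypothetical helper, not part of boto3 or of the project.

```python
import os
import posixpath


def plan_uploads(local_dir, key_prefix):
    """Yield (local_path, s3_key) pairs for every file under local_dir, without touching S3."""
    base_name = os.path.basename(os.path.normpath(local_dir))
    for root, _, files in os.walk(local_dir):
        for file_name in sorted(files):
            local_path = os.path.join(root, file_name)
            relative = os.path.relpath(local_path, local_dir)
            # S3 keys always use forward slashes, whatever the local OS uses.
            key = posixpath.join(key_prefix, base_name, *relative.split(os.sep))
            yield local_path, key


if __name__ == "__main__":
    # Directory name and prefix copied from the cells above.
    for local_path, key in plan_uploads("custom_transformer_v6_fw_best", "sagemaker/wf_translation"):
        print(f"{local_path} -> {key}")
```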


➡️ Specify the arguments to pass to the framework as hyperparameters

In [12]:
# specify the output path
output_path = f's3://{bucket}/{prefix}/output'

# specify the instance type and the instance count
instance_type = 'ml.g4dn.2xlarge'
instance_count = 4

# specify the hyperparameters
hyperparameters = {
    'epochs': 1000,
    'log_step': 10,
    'metric_for_best_model': 'bleu',
    'metric_objective': 'maximize',
    'corpus_1': 'french',
    'corpus_2': 'wolof',
    'drop_out_rate': 0.291121690756753,
    'd_model': 512,
    'n_head': 8,
    'dim_ff': 2024,
    'n_encoders': 6,
    'n_decoders': 6,
    'learning_rate': None,
    'weight_decay': 0.0,
    'char_p': 0.082269346292589,
    'word_p': 0.005292549318241768,
    'end_mark': 3,
    'label_smoothing': 0.1,
    'max_len': 20,
    'random_state': 0,
    'boundaries': '2,31,59,87,115,143,171',
    'batch_sizes': '256,128,64,32,16,8,4,2',
    'batch_size': 256, 
    'warmup_init': True,
    'relative_step': True,
    'num_workers': 1,
    'pin_memory': True,
    'new_model_dir': 'custom_transformer_v6_fw', 
    'continue': False, # --------------------------> Must be changed when continuing training
    'logging_dir': 'custom_transformer_fw',
    'save_best': True,
    'version': 6,
    'backend': 'gloo'
}
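
The hyperparameters reach the entry point as command-line arguments, so values such as None, True or '2,31,59' arrive as text. Below is a minimal sketch of how a script could parse a few of them. It assumes argparse is used and that the boundaries and batch sizes follow the usual bucketing convention (one more batch size than boundaries); the parser and the helper names are illustrative and may differ from the project's train.py.

```python
import argparse
from bisect import bisect_right


def str_to_bool(value):
    """Parse 'True'/'False' style strings into booleans."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def optional_float(value):
    """Map 'None'/'null' to None and anything else to a float."""
    return None if str(value).strip().lower() in ("none", "null", "") else float(value)


def int_list(value):
    """Turn '2,31,59' into [2, 31, 59]."""
    return [int(item) for item in str(value).split(",") if item.strip()]


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--learning_rate", type=optional_float, default=None)
    parser.add_argument("--boundaries", type=int_list, default=[])
    parser.add_argument("--batch_sizes", type=int_list, default=[])
    parser.add_argument("--warmup_init", type=str_to_bool, default=True)
    # `continue` is a Python keyword, so the value is stored under another name.
    parser.add_argument("--continue", dest="continue_training", type=str_to_bool, default=False)
    return parser


def batch_size_for_length(length, boundaries, batch_sizes):
    """Pick the batch size of the bucket a sequence length falls into (assumed convention)."""
    if len(batch_sizes) != len(boundaries) + 1:
        raise ValueError("expected one more batch size than boundaries")
    return batch_sizes[bisect_right(boundaries, length)]


if __name__ == "__main__":
    # parse_known_args ignores the hyperparameters this sketch does not declare.
    args, _unknown = build_parser().parse_known_args()
    print(args)
```

With the values above, the seven boundaries pair with eight batch sizes, so short sequences would be batched by 256 and the longest ones by 2 under this convention.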

    

➡️ Configuration and training

In [13]:
from sagemaker.pytorch import PyTorch

# specify the estimator
estimator = PyTorch(
    entry_point='train.py',
    role=role,
    py_version='py38',
    framework_version='1.11.0',
    instance_count=instance_count,
    instance_type=instance_type,
    output_path=output_path,
    hyperparameters=hyperparameters,
)

# fit the estimator
estimator.fit(inputs)
    

Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-07-08-20-06-34-061


ClientError: An error occurred (AccessDeniedException) when calling the CreateTrainingJob operation: User: arn:aws:sts::634397825065:assumed-role/LabRole/SageMaker is not authorized to perform: sagemaker:CreateTrainingJob on resource: arn:aws:sagemaker:us-east-1:634397825065:training-job/pytorch-training-2023-07-08-20-06-34-061 with an explicit deny in an identity-based policy
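
Once a role is allowed to create the training job, every instance runs the same entry point and has to work out its own place in the cluster. The sketch below derives a rank and a world size from the SM_HOSTS and SM_CURRENT_HOST environment variables that SageMaker sets in the container. It assumes one process per instance, which matches one GPU per ml.g4dn.2xlarge; `distributed_context` is a hypothetical helper and not necessarily what train.py does.

```python
import json
import os


def distributed_context(environ=os.environ):
    """Derive rank, world size and master address from SageMaker's host variables."""
    # SM_HOSTS is a JSON list such as '["algo-1", "algo-2"]'; sorting gives every
    # host the same ordering, so the ranks are consistent across the cluster.
    hosts = sorted(json.loads(environ.get("SM_HOSTS", '["localhost"]')))
    current = environ.get("SM_CURRENT_HOST", hosts[0])
    return {
        "world_size": len(hosts),
        "rank": hosts.index(current),
        "master_addr": hosts[0],
    }


if __name__ == "__main__":
    print(distributed_context())
```

The resulting rank and world size are the values that torch.distributed.init_process_group expects, together with the 'gloo' backend chosen in the hyperparameters.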

➡️ Download logs and model from S3 bucket

In [None]:
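import os
import boto3

s3_client = boto3.client('s3')

# get the current directory
current_dir = os.getcwd()

# SM_MODEL_DIR and SM_OUTPUT_DATA_DIR only exist inside the training container,
# so from the notebook we read what the job wrote under the output path set above
output_prefix = f'{prefix}/output/'

# download_file copies a single object, so list every key under the prefix
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=output_prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('/'):
            continue  # skip folder placeholders
        relative_key = key[len(output_prefix):]
        local_path = os.path.join(current_dir, *relative_key.split('/'))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3_client.download_file(bucket, key, local_path)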
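
SageMaker stores the contents of the model and output directories as compressed archives (typically model.tar.gz and output.tar.gz), so the downloaded files still have to be unpacked. Here is a minimal sketch using only the standard library; `extract_archives` is a hypothetical helper, and the archive names are an assumption about what the job produced.

```python
import os
import tarfile


def extract_archives(root_dir):
    """Unpack every .tar.gz found under root_dir into a sibling folder named after it."""
    # Collect the archives first so that freshly extracted folders are not scanned.
    archives = [
        os.path.join(folder, name)
        for folder, _, files in os.walk(root_dir)
        for name in sorted(files)
        if name.endswith(".tar.gz")
    ]
    extracted = []
    for archive in archives:
        target = archive[: -len(".tar.gz")]
        os.makedirs(target, exist_ok=True)
        # Only extract archives you trust, here the ones written by your own job.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(target)
    return extracted


if __name__ == "__main__":
    for folder in extract_archives(os.getcwd()):
        print(f"Extracted to: {folder}")
```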
