# Part 3 - Training (aka *fine-tuning*) a Transformer model

In this part we will finally train our very own Transformer model. We saw that the zero-shot model didn't produce great results, probably because it was trained to summarise news articles, not academic papers.

These lines of code are the typical setup for SageMaker; we need them for training jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html

In [1]:
import sagemaker

sess = sagemaker.Session()
role = 'arn:aws:iam::647481066755:role/ArnabK'
bucket = sess.default_bucket()

print(f"IAM role arn used for running training: {role}")
print(f"S3 bucket used for storing artifacts: {bucket}")

sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\ASUS\AppData\Local\sagemaker\sagemaker\config.yaml
IAM role arn used for running training: arn:aws:iam::647481066755:role/ArnabK
S3 bucket used for storing artifacts: sagemaker-ap-south-1-647481066755


We are in the fortunate position of not having to write our own training script. Instead we will use a script from the transformers library on GitHub: https://github.com/huggingface/transformers/blob/v4.6.1/examples/pytorch/summarization/run_summarization.py

In [2]:
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

These are the hyperparameters for training, and they are among the most important levers available once we are in the experimentation phase. Changing them can influence model performance, and finding the best model will involve some trial and error. Also check out https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html for automated hyperparameter tuning.
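As a rough sketch of the trial-and-error loop described above, the snippet below expands a small search space into one set of overrides per combination. The `search_space` values and the `expand_grid` helper are hypothetical and purely illustrative; they are not taken from the original notebook or from any SageMaker API.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not recommendations.
search_space = {
    "learning_rate": [3e-5, 5e-5],
    "num_train_epochs": [3, 5],
}


def expand_grid(space):
    """Return one dict per combination of the listed values."""
    keys = list(space)
    combos = product(*(space[key] for key in keys))
    return [dict(zip(keys, values)) for values in combos]


candidates = expand_grid(search_space)
for overrides in candidates:
    print(overrides)
```

Each dictionary could then be merged over the base settings, for example `{**hyperparameters, **overrides}`, before launching a separate training job per candidate.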

In [3]:
# hyperparameters, which are passed into the training job
hyperparameters={'per_device_train_batch_size': 4,
                 'per_device_eval_batch_size': 4,
                 'model_name_or_path': 'sshleifer/distilbart-cnn-12-6',
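                 # each channel passed to fit() is mounted at /opt/ml/input/data/<channel_name>/
                 'train_file': '/opt/ml/input/data/train/train.csv',
                 # val.csv must also be supplied through a channel, otherwise this path will not exist
                 'validation_file': '/opt/ml/input/data/val.csv',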
                 'do_train': True,
                 'do_eval': True,
                 'do_predict': False,
                 'predict_with_generate': True,
                 'output_dir': '/opt/ml/model',
                 'num_train_epochs': 3,
                 'learning_rate': 5e-5,
                 'seed': 7,
                 'fp16': True,
                 'val_max_target_length': 20,
                 'text_column': 'text',
                 'summary_column': 'summary',
                 }

# configuration for running training on smdistributed Data Parallel
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

In Amazon SageMaker, training data is typically stored in Amazon S3. Before the training job starts, SageMaker downloads each input channel into its own directory inside the container, /opt/ml/input/data/<channel_name>/. The 'train_file' hyperparameter therefore has to point at the file inside the directory of the channel it was passed through; for a channel named train that contains train.csv, the file ends up at /opt/ml/input/data/train/train.csv.
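To make the path convention concrete: SageMaker places each input channel under /opt/ml/input/data/<channel_name>/ inside the training container. The small helper below builds the in-container path for a file in a given channel. The `channel_file_path` function is a hypothetical convenience for illustration, not part of the SageMaker SDK.

```python
import posixpath

# Root directory under which SageMaker mounts input channels in the container.
INPUT_ROOT = "/opt/ml/input/data"


def channel_file_path(channel_name, file_name):
    """Build the in-container path of a file delivered through a channel."""
    # posixpath is used because the training container runs Linux,
    # even when this notebook is executed on Windows.
    return posixpath.join(INPUT_ROOT, channel_name, file_name)


train_path = channel_file_path("train", "train.csv")
print(train_path)  # prints /opt/ml/input/data/train/train.csv
```

The resulting string is what a hyperparameter such as 'train_file' would need to contain when the data is passed to the job through a channel named train.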

In [4]:
from sagemaker.huggingface import HuggingFace

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_summarization.py',  # fine-tuning script used in the training job
    source_dir='./examples/pytorch/summarization',  # directory in the repo that contains the script
    git_config=git_config,
    instance_type='ml.p3.16xlarge',  # instance type used for the training job
    instance_count=2,  # number of instances used for training
    transformers_version='4.6',  # transformers version used in the training job
    pytorch_version='1.7',  # PyTorch version used in the training job
    py_version='py36',
    role=role,
    hyperparameters=hyperparameters,  # hyperparameters passed to the training script
    distribution=distribution,
)

In [5]:
training_input_path = "s3://sagemaker-ap-south-1-647481066755/summarization/data/train.csv"

In [8]:
import boto3

In [13]:
s3 = boto3.client('s3')

bucket_name = 'sagemaker-ap-south-1-647481066755'
object_key = 'summarization/data/train.csv'

# Verify that the training file exists in the S3 bucket
try:
    s3.head_object(Bucket=bucket_name, Key=object_key)
    print("Object exists")
except Exception as e:
    print("Object does not exist:", e)

# Verify that the credentials used by this boto3 client can list the bucket
try:
    response = s3.list_objects_v2(Bucket=bucket_name)
    print("Successfully listed objects in the bucket.")
except Exception as e:
    print("Failed to list objects in the bucket:", e)

Successfully listed objects in the bucket.
Object exists
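
The bucket name and object key used in the check above are simply the two halves of training_input_path. As a minimal sketch, they can be derived from the URI instead of being typed twice. The `split_s3_uri` helper is hypothetical and uses only the standard library.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The path component starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, object_key = split_s3_uri(
    "s3://sagemaker-ap-south-1-647481066755/summarization/data/train.csv"
)
print(bucket_name)
print(object_key)
```

Deriving both values from one string keeps the existence check and the training input pointing at the same object.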


This will kick off the training job, which should take around 1 hour. There is also the option to use distributed training with more instances; see https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html. Running this training with 2 distributed instances should take ~40 minutes.

In [6]:
huggingface_estimator.fit({'train': training_input_path})
# the dictionary key ('train') is the channel name

FileNotFoundError: [WinError 2] The system cannot find the file specified

The fit method takes a dictionary whose keys are the names of the input channels and whose values are the S3 locations of the corresponding data.

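If you encounter the error "[WinError 2] The system cannot find the file specified", note that it is raised on the local Windows machine, before the job reaches SageMaker. The training_input_path variable holds an S3 URI rather than a local file path, so it is an unlikely cause. Because git_config makes the SageMaker SDK clone the repository locally, one plausible cause is that the git executable is not installed or is not on the PATH.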
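
A quick local pre-flight check can help narrow this error down before calling fit. The sketch below assumes two possible causes: an input that is not an S3 URI, and a missing git executable (needed when git_config is used, since the repository is cloned on the local machine). The `preflight` function is a hypothetical helper, and the traceback above is truncated, so neither cause is confirmed.

```python
import shutil


def preflight(inputs):
    """Return a list of human-readable problems found before calling fit()."""
    problems = []
    # Assumption: with git_config set, a local git executable is required.
    if shutil.which("git") is None:
        problems.append("git executable not found on PATH")
    # Each channel should point at an S3 location rather than a local path.
    for channel, uri in inputs.items():
        if not uri.startswith("s3://"):
            problems.append(f"channel {channel!r} is not an S3 URI: {uri}")
    return problems


issues = preflight({"train": "s3://sagemaker-ap-south-1-647481066755/summarization/data/train.csv"})
for issue in issues:
    print(issue)
```

An empty result does not guarantee that the job will start, but any message printed here points to something that can be fixed locally first.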