# Sagemaker-Huggingface SDK Training Job for Clean Summary Model
In this notebook I fine-tune the initial version of the clean summary model, whose job is to summarize why messages marked as inappropriate by the rule adherence classifier were marked so. The caveat to this model is that the summary must itself be appropriate. To accomplish this, I used code interpreter to label the toxic comments of the wikipedia toxic comments dataset with clean summaries. The model being fine-tuned is T5, a seq2seq transformer architecture commonly used for summarization.

---------
For context about the training environment: the sagemaker-huggingface SDK runs a training job defined in the train.py script. The model and hyperparameters are specified in the huggingface estimator object. The prepared data is uploaded to an s3 bucket, and the URI of each dataset is passed to the estimator in the .fit() call. The base T5 model is also uploaded to s3 in this way so it can be passed into the training job.

Documentation for aws-huggingface training: https://huggingface.co/docs/sagemaker/train

## Setup Sagemaker Environment

In [1]:
import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

## Huggingface Estimator
The huggingface estimator runs the training script train.py inside a container managed by sagemaker. The hyperparameters below are sent to train.py, where the trainer object is instantiated with them.

In [2]:
from sagemaker.huggingface import HuggingFace


# hyperparameters which are passed to the training job
hyperparameters = {
    'epochs': 1,
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 16, 
    'model_name_or_path': 't5-small'
}

# create the Estimator
huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='/home/ec2-user/SageMaker/clean_summary_model_training', #path to your training script
        instance_type='ml.g4dn.xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.28.1',
        pytorch_version='2.0.0',
        py_version='py310',
        hyperparameters = hyperparameters,
        vpc_config={  #for using custom datasets
            'Subnets': ["replaced"],    #get these from VPC dash
            'SecurityGroupIds': ['replaced'] #delete sensitive info before uploading to github
        }
)
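For reference, sagemaker passes each entry of the `hyperparameters` dict to the entry point as a command-line argument (for example `--epochs 1`) and exposes paths such as the model output directory through `SM_*` environment variables. The actual train.py is not shown in this notebook, so the following is only a minimal sketch of how such a script might read these values. The function name `parse_args` and the fallback defaults are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters as a sagemaker entry point might (hypothetical sketch)."""
    parser = argparse.ArgumentParser()
    # These names mirror the keys of the hyperparameters dict in this notebook.
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per_device_train_batch_size", type=int, default=16)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=16)
    parser.add_argument("--model_name_or_path", type=str, default="t5-small")
    # SM_MODEL_DIR is set inside the training container; the fallback is for local runs.
    parser.add_argument(
        "--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    # parse_known_args ignores any extra arguments the container may append.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "1", "--per_device_train_batch_size", "16"])
print(args.epochs, args.per_device_train_batch_size, args.model_name_or_path)
```

Passing an explicit list to `parse_args` makes the parsing easy to try out in a notebook cell without launching a training job.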

## Upload datasets to s3 for estimator to access

I was having issues with the deep learning container losing internet access, so I opted to upload the datasets to s3 and pass their URIs into the estimator. This means the datasets are copied from the s3 bucket into the training container, and train.py loads them from there before running trainer.train().
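When s3 URIs are passed to `.fit()` as named channels, sagemaker downloads each channel into the container (by default under `/opt/ml/input/data/<channel>`) and exposes the local path in an environment variable named `SM_CHANNEL_<NAME>`. As a rough sketch of how train.py might locate the data, assuming a hypothetical helper `channel_dir` that is not part of this project:

```python
import os


def channel_dir(channel, environ=None):
    """Return the local directory of a named input channel (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # Fall back to the default mount point when the variable is not set.
    default = f"/opt/ml/input/data/{channel}"
    return environ.get(f"SM_CHANNEL_{channel.upper()}", default)


train_dir = channel_dir("train")
test_dir = channel_dir("test")
print(train_dir, test_dir)
# Assumption: inside train.py, a dataset written with Dataset.save_to_disk
# could then be read back with datasets.load_from_disk(train_dir).
```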

These next few cells are all related to uploading to s3. If the data is already uploaded these can be disregarded.

The below code creates a new s3 bucket to store our datasets. 

In [None]:
import boto3

s3 = boto3.client('s3', region_name='us-east-2') #us-east-2 is specific to my specs. May need to be changed if used
bucket_name = 'hambart-training'

# Create a new bucket in the us-east-2 region
s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': 'us-east-2'})

# Check the location of an existing bucket
response = s3.get_bucket_location(Bucket=bucket_name)
bucket_location = response['LocationConstraint']
print('Bucket location:', bucket_location)
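One caveat if the region is changed: for `us-east-1`, S3 expects `CreateBucketConfiguration` to be omitted rather than set to that region. A small sketch of a helper that builds the right arguments for either case follows; the name `create_bucket_kwargs` is made up for illustration.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the boto3 client's create_bucket (hypothetical helper)."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default region and must not be passed as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("hambart-training", "us-east-2"))
# Usage with the client from the cell above:
# s3.create_bucket(**create_bucket_kwargs(bucket_name, "us-east-2"))
```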

The datasets have already been tokenized and were saved as datasets.Dataset objects in the RAC_dataset_tokenization.ipynb notebook.

Below we upload the files for each dataset into separate directories created in the s3 bucket.

In [12]:
import os
import boto3

s3 = boto3.resource('s3')
bucket_name = 'hambart-training'

base_path = '/home/ec2-user/SageMaker/clean_summary_model_training/'
dataset_paths = [os.path.join(base_path, 'summary_train_dataset/'),
                 os.path.join(base_path, 'summary_test_dataset/')]

def upload_directory_to_s3(directory_path, s3_bucket, s3_key_prefix=''):
    '''
    Uploads a dataset folder to the s3 bucket for the
    estimator to access. The directory created in the
    bucket is named after the local folder the files
    are pulled from.
    '''
    directory_name = os.path.basename(directory_path.rstrip('/'))  # Ensure no trailing slash
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            file_path = os.path.join(root, file)
            s3_key = os.path.join(s3_key_prefix, directory_name, os.path.relpath(file_path, directory_path))
            print(f"Uploading {file_path} to s3://{s3_bucket.name}/{s3_key}")
            s3_bucket.Object(s3_key).upload_file(Filename=file_path)

bucket = s3.Bucket(bucket_name)

for folder_path in dataset_paths:
    upload_directory_to_s3(folder_path, bucket)  
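The upload function builds each s3 key with `os.path.join`, which yields forward slashes on the Linux notebook instance used here. If the same code were run on Windows, the keys would contain backslashes. A possible variant of the key-building step, sketched here with a hypothetical `build_s3_key` helper, always joins with `/`:

```python
import os
import posixpath


def build_s3_key(directory_path, file_path, s3_key_prefix=""):
    """Build an s3 key that uses '/' regardless of the local OS (hypothetical helper)."""
    # Name of the dataset folder, ignoring any trailing separator.
    directory_name = os.path.basename(directory_path.rstrip("/\\"))
    # Path of the file relative to the dataset folder, split on the local separator.
    relative = os.path.relpath(file_path, directory_path)
    parts = [part for part in relative.split(os.sep) if part]
    return posixpath.join(s3_key_prefix, directory_name, *parts)


base = os.path.join("data", "summary_train_dataset")
print(build_s3_key(base, os.path.join(base, "data.arrow")))
```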


Below we create and print the s3 URIs for each of the dataset directories. These will be passed to the estimator.

In [3]:
bucket_name = 'hambart-training'

train_dataset_uri = 's3://{}/{}'.format(bucket_name, 'summary_train_dataset')
test_dataset_uri = 's3://{}/{}'.format(bucket_name, 'summary_test_dataset')


print(train_dataset_uri, '\n', test_dataset_uri, sep='')

s3://hambart-training/summary_train_dataset
s3://hambart-training/summary_test_dataset
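Before starting a training job, it can be worth confirming that each prefix actually contains objects, since an empty or misspelled prefix only surfaces as an error once the job is running. The sketch below assumes a boto3 s3 client is passed in; `prefix_has_objects` is a hypothetical helper name.

```python
def prefix_has_objects(s3_client, bucket_name, prefix):
    """Return True if at least one object exists under the prefix (hypothetical helper)."""
    # MaxKeys=1 keeps the request cheap; only existence matters here.
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("s3")
# print(prefix_has_objects(client, "hambart-training", "summary_train_dataset/"))
```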


## Upload base model to s3
We also need to upload the base T5 model (t5-small) to s3 and pass it into the training job.

Below we download it and save it locally so that it can be uploaded to this notebook's s3 bucket.

In [1]:
#!pip install transformers

In [None]:
from transformers import T5ForConditionalGeneration, T5Config

# Load the configuration
config = T5Config.from_pretrained("t5-small")  

# Load the T5 model for conditional generation (summarization)
model = T5ForConditionalGeneration.from_pretrained("t5-small", config=config)

# Save the model in order to upload it to s3
model.save_pretrained("./t5-small-summarization-model")

Below we upload the model to the bucket

In [None]:
# Reuses `os` and `bucket` from the dataset upload cell above.
# The name of the directory to be created in the bucket
directory_name = "t5-small-summarization-model"

# Upload the entire directory
for root, dirs, files in os.walk("./t5-small-summarization-model"):
    for filename in files:
        local_path = os.path.join(root, filename)
        s3_path = os.path.join(directory_name, os.path.relpath(local_path, "./t5-small-summarization-model"))
        
        bucket.upload_file(local_path, s3_path)

## Fit Data

Calling the .fit() method on the huggingface estimator carries out the training specified in train.py.

In [6]:
huggingface_estimator.fit({'train': train_dataset_uri, 'test': test_dataset_uri})

Using provided s3_resource


INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-08-22-15-40-20-582


2023-08-22 15:46:59 Starting - Starting the training job...
2023-08-22 15:47:14 Starting - Preparing the instances for training......
2023-08-22 15:48:13 Downloading - Downloading input data...
2023-08-22 15:48:33 Training - Downloading the training image...................................................
2023-08-22 15:57:10 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-08-22 15:57:21,710 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-08-22 15:57:21,729 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-08-22 15:57:21,738 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-08-22 15:57:21,746 sagemaker_pytorch_container.training INFO     Invoking user training script.

Collecting nltk
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 38.8 MB/s eta 0:00:00
Collecting rouge_score
Downloading rouge_score-0.1.2.tar.gz (17 kB)
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: rouge_score
Building wheel for rouge_score (setup.py): started
Building wheel for rouge_score (setup.py): finished with status 'done'
Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=1a9c03c05e92175e7290f03f4abff3f6a4f6013ae0c6ea42ea4efce632d99f99
Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge_score
Installing collected packages: nltk, rouge_score
Successfully installed nltk-3.8.1 rouge_score-0

26%|██▌       | 89/344 [00:41<01:58,  2.14it/s]
26%|██▌       | 90/344 [00:41<01:58,  2.14it/s]
26%|██▋       | 91/344 [00:42<01:58,  2.14it/s]
27%|██▋       | 92/344 [00:42<01:57,  2.14it/s]
27%|██▋       | 93/344 [00:43<01:57,  2.14it/s]
27%|██▋       | 94/344 [00:43<01:56,  2.14it/s]
28%|██▊       | 95/344 [00:44<01:56,  2.14it/s]
28%|██▊       | 96/344 [00:44<01:55,  2.15it/s]
28%|██▊       | 97/344 [00:45<01:55,  2.14it/s]
28%|██▊       | 98/344 [00:45<01:54,  2.14it/s]
29%|██▉       | 99/344 [00:45<01:54,  2.15it/s]
29%|██▉       | 100/344 [00:46<01:53,  2.14it/s]
29%|██▉       | 101/344 [00:46<01:53,  2.14it/s]
30%|██▉       | 102/344 [00:47<01:52,  2.14it/s]
30%|██▉       | 103/344 [00:47<01:52,  2.15it/s]
30%|███       | 104/344 [00:48<01:51,  2.15it/s]
31%|███       | 105/344 [00:48<01:51,  2.14it/s]
31%|███       | 106/

69%|██████▉   | 237/344 [01:51<00:51,  2.09it/s]
69%|██████▉   | 238/344 [01:51<00:50,  2.09it/s]
69%|██████▉   | 239/344 [01:52<00:50,  2.09it/s]
70%|██████▉   | 240/344 [01:52<00:49,  2.09it/s]
70%|███████   | 241/344 [01:53<00:49,  2.09it/s]
70%|███████   | 242/344 [01:53<00:48,  2.08it/s]
71%|███████   | 243/344 [01:54<00:48,  2.09it/s]
71%|███████   | 244/344 [01:54<00:47,  2.09it/s]
71%|███████   | 245/344 [01:54<00:47,  2.08it/s]
72%|███████▏  | 246/344 [01:55<00:46,  2.09it/s]
72%|███████▏  | 247/344 [01:55<00:46,  2.08it/s]
72%|███████▏  | 248/344 [01:56<00:46,  2.08it/s]
72%|███████▏  | 249/344 [01:56<00:45,  2.09it/s]
73%|███████▎  | 250/344 [01:57<00:45,  2.09it/s]
73%|███████▎  | 251/344 [01:57<00:44,  2.09it/s]
73%|███████▎  | 252/344 [01:58<00:44,  2.09it/s]
74%|███████▎  | 253/344 [01:58<00:43,  2.08it/s]
74%|█████