## Model training on AWS Sagemaker

**Author:** Shaun Khoo  
**Date:** 15 Oct 2021  
**Context:** Training on a local computer takes too long; training our models on AWS Sagemaker instead would be much faster  
**Objective:** Develop code that will help us train our model directly on AWS Sagemaker   

**Note:** Referencing [this notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/pytorch_lstm_word_language_model/pytorch_rnn.ipynb)

#### A) Importing the required libraries

Note that your AWS credentials must be configured through the AWS CLI (for example with `aws configure`) before the code below will work
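As a quick pre-flight check, the sketch below looks in the two most common places credentials are configured: environment variables and the shared credentials file that `aws configure` writes. The helper name `find_credential_sources` is hypothetical, and the check is deliberately incomplete (it ignores SSO, instance roles and non-default file locations), so treat an empty result as a hint rather than proof that you are logged out.

```python
import os
from pathlib import Path


def find_credential_sources(env=None, home=None):
    """Return the common places where AWS credentials appear to be configured."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    # Access keys exported as environment variables
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        sources.append("environment variables")
    # `aws configure` writes to ~/.aws/credentials by default
    if (home / ".aws" / "credentials").is_file():
        sources.append("shared credentials file")
    return sources


print(find_credential_sources() or "No credentials found in the usual places")
```

If nothing is found, running `aws configure` and re-running the check is usually the quickest fix.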

In [1]:
import sagemaker
from sagemaker.pytorch import PyTorch
import boto3

The code below returns the identity of your IAM user, including its Amazon Resource Name (`Arn`). Make sure it runs without error, since that confirms you are logged in correctly

In [2]:
sts = boto3.client('sts')
sts.get_caller_identity()

{'UserId': 'AIDAYUZMQUYGUNJ2VXD2E',
 'Account': '594409465357',
 'Arn': 'arn:aws:iam::594409465357:user/shaunkhoo',
 'ResponseMetadata': {'RequestId': '36cedfbf-9749-4965-957b-b4c1c7d89981',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '36cedfbf-9749-4965-957b-b4c1c7d89981',
   'content-type': 'text/xml',
   'content-length': '406',
   'date': 'Sat, 08 Jan 2022 00:45:52 GMT'},
  'RetryAttempts': 0}}
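To make the `Arn` value in the output above easier to read, here is a small sketch that splits it into its colon-separated parts. The helper `parse_arn` is hypothetical (it is not part of `boto3`), and it assumes the general `arn:partition:service:region:account-id:resource` layout.

```python
def parse_arn(arn):
    """Split an ARN into its colon-separated components."""
    # General layout: arn:partition:service:region:account-id:resource
    prefix, partition, service, region, account_id, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError(f"Not an ARN: {arn!r}")
    return {
        "partition": partition,
        "service": service,
        "region": region,  # empty for global services such as IAM
        "account_id": account_id,
        "resource": resource,
    }


parts = parse_arn("arn:aws:iam::594409465357:user/shaunkhoo")
print(parts["service"], parts["account_id"], parts["resource"])
```

A resource that starts with `user/` means you are authenticated as an IAM user rather than as a role, which matters for the execution-role lookup later in this notebook.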

Changing the working directory to the top-level folder, so that relative paths such as `Data/Train/...` and `ssoc_autocoder` resolve correctly

In [3]:
import os
os.chdir('..')

#### B) Setting up Sagemaker and S3

Initialising the Sagemaker session object

In [4]:
sagemaker_session = sagemaker.Session()

Obtaining the default bucket for our Sagemaker session

In [5]:
bucket = sagemaker_session.default_bucket()
print(f"Bucket Name: {bucket}")

Bucket Name: sagemaker-us-east-1-594409465357


Set the prefix for where you want to store your data / model files in the S3 bucket

In [6]:
prefix = 'Sagemaker/ssoc-autocoder'

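Run the code below to retrieve Sagemaker's execution role. `sagemaker.get_execution_role()` only succeeds when the current identity is itself a role (for example inside a Sagemaker notebook instance); when running locally as an IAM user it raises a `ValueError`, so we fall back to looking up the role we have set up, which is called `mom-aws`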

In [8]:
try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName = 'mom-aws')['Role']['Arn']
print(f"Execution Role: {role}")

Couldn't call 'get_role' to get Role ARN from role name shaunkhoo to get Role path.


Execution Role: arn:aws:iam::594409465357:role/mom-aws


Upload the raw data to the S3 folder

In [13]:
inputs = sagemaker_session.upload_data(path = "Data/Train/pre-training-sample1000.txt", 
                                       bucket = bucket, 
                                       key_prefix = prefix)

In [14]:
print(f"Inputs stored in: {inputs}")

Inputs stored in: s3://sagemaker-us-east-1-594409465357/Sagemaker/ssoc-autocoder/pre-training-sample1000.txt
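The returned location follows a simple pattern: bucket, then key prefix, then the file name. The sketch below reproduces that pattern for a single file so you can predict where an upload will land; `expected_s3_uri` is a hypothetical helper, not part of the Sagemaker SDK.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Predict the S3 URI for a single uploaded file."""
    # Normalise Windows separators so only the file name is kept
    filename = posixpath.basename(local_path.replace("\\", "/"))
    return f"s3://{bucket}/{posixpath.join(key_prefix, filename)}"


uri = expected_s3_uri(
    "sagemaker-us-east-1-594409465357",
    "Sagemaker/ssoc-autocoder",
    "Data/Train/pre-training-sample1000.txt",
)
print(uri)
```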


#### C) Language modelling (or pretraining) on Sagemaker

Running masked language modelling to finetune the DistilBERT model on MCF data to improve downstream classification performance

Define the hyperparameters that need to be passed on to the masked language modelling script

In [6]:
mlm_parameters = {
    'model_name_or_path': 'mcf-pretrained-5epoch',#'distilbert-base-uncased',
    'train_file': "pre-training-full.txt",
    'line_by_line': True,
    'do_train': True,
    'do_eval': True,
    'evaluation_strategy': 'epoch',
    'logging_steps': 500,
    'save_strategy': 'epoch',
    'overwrite_output_dir': True,
    'output_dir': '20211228_test'
}
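In script mode, Sagemaker hands each hyperparameter to the entry-point script as a command-line argument. The sketch below approximates that conversion so you can see what the script will receive; `to_cli_args` is a hypothetical helper, and the exact ordering and quoting used by the training toolkit may differ.

```python
def to_cli_args(hyperparameters):
    """Approximate how a hyperparameter dict becomes `--key value` arguments."""
    args = []
    for key, value in hyperparameters.items():
        # Every value is passed as a string, including booleans and numbers
        args.extend([f"--{key}", str(value)])
    return args


example = {
    "train_file": "pre-training-full.txt",
    "line_by_line": True,
    "logging_steps": 500,
}
cli_args = to_cli_args(example)
print(" ".join(cli_args))
```

Because booleans arrive as the strings `True` and `False`, the training script's argument parser needs to accept that form.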

Create the estimator object and run it on the full pretraining text file

In [10]:
mlm_estimator = PyTorch(
    entry_point = "run_mlm_aws.py",
    role = role,
    framework_version = "1.8.1",
    instance_count = 1,
    instance_type = "ml.g4dn.xlarge",
    source_dir = "ssoc_autocoder",
    max_run = 432000,
    py_version = "py3",
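    # NOTE: this is an absolute path on a local Windows machine, which will not exist inside the training container
    env = {'SAGEMAKER_REQUIREMENTS': 'C:\\Users\\shaun\\PycharmProjects\\ssoc-autocoder\\ssoc_autocoder\\requirements.txt'},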
    hyperparameters = mlm_parameters
)
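The `max_run` argument is a time limit in seconds, which is hard to read at a glance. A short conversion, using only the value from the estimator above:

```python
from datetime import timedelta

max_run = 432000  # seconds, as passed to the estimator
limit = timedelta(seconds=max_run)
print(f"max_run = {max_run} s = {limit.days} days")
```

So the job is allowed to run for up to five days before Sagemaker stops it.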

In [11]:
mlm_estimator.fit({"training": f"s3://{bucket}/{prefix}"})

2022-01-08 00:50:45 Starting - Starting the training job...
2022-01-08 00:50:47 Starting - Launching requested ML instancesProfilerReport-1641603042: InProgress
...
2022-01-08 00:51:48 Starting - Preparing the instances for training......
2022-01-08 00:53:00 Downloading - Downloading input data......
2022-01-08 00:54:00 Training - Downloading the training image..................
2022-01-08 00:57:42 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2022-01-08 00:57:35,440 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2022-01-08 00:57:35,461 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2022-01-08 00:57:35,467 sagemaker_pytorch_container.training INFO     Invoking user training script.
2022-01-08 00:57:35,826 sagemaker-traini

ClientError: An error occurred (InvalidSignatureException) when calling the DescribeLogStreams operation: Signature expired: 20220109T110957Z is now earlier than 20220109T124605Z (20220109T125105Z - 5 min.)
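The error above is raised by `DescribeLogStreams`, the call the local client makes while streaming logs, rather than by the training script. The message itself contains three timestamps: when the request was signed, the earliest signing time the server would accept, and the server's own clock. The worked example below parses them to show how stale the signature was. The interpretation in the comments is an assumption: one plausible cause is that the local machine was suspended or its clock drifted while the notebook was streaming logs.

```python
import re
from datetime import datetime

message = (
    "Signature expired: 20220109T110957Z is now earlier than "
    "20220109T124605Z (20220109T125105Z - 5 min.)"
)

# Pull the three ISO-basic timestamps out of the error message
stamps = [
    datetime.strptime(s, "%Y%m%dT%H%M%SZ")
    for s in re.findall(r"\d{8}T\d{6}Z", message)
]
signed_at, earliest_accepted, server_time = stamps

staleness = server_time - signed_at         # how old the signature was
tolerance = server_time - earliest_accepted  # the 5-minute allowance

print(f"Signature age: {staleness}, allowed: {tolerance}")
```

The signature was about 1 hour 41 minutes old against a 5-minute allowance. If this happens, checking the job's status in the Sagemaker console is a reasonable next step before assuming the training run itself failed.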

#### D) Finetuning classification on Sagemaker

In [56]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point = "train_aws.py",
    role = role,
    framework_version = "1.8.1",
    instance_count = 1,
    instance_type = "ml.g4dn.xlarge",
    source_dir = "ssoc_autocoder",
    py_version = "py3",
    env = env,
    #use_spot_instances = True,
    #max_run = ,
    #max_wait = 600,
    hyperparameters = {"epochs": 4, "tied": True},
)
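The commented-out `use_spot_instances`, `max_run` and `max_wait` arguments relate to managed spot training. Sagemaker requires `max_wait` to be at least as large as `max_run`, so a small guard like the one below can catch a bad combination before a job is submitted. The helper `spot_training_kwargs` and the example values are hypothetical.

```python
def spot_training_kwargs(max_run, max_wait):
    """Build spot-training keyword arguments, checking the timing constraint."""
    # max_wait covers both waiting for spot capacity and the run itself
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}s) must be >= max_run ({max_run}s)"
        )
    return {
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": max_wait,
    }


# Hypothetical values: up to 1 hour of training, up to 2 hours in total
kwargs = spot_training_kwargs(max_run=3600, max_wait=7200)
print(kwargs)
```

The resulting dictionary could then be unpacked into the estimator call with `**kwargs`.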

In [53]:
estimator.fit({"training": inputs})

2021-10-11 07:40:29 Starting - Starting the training job...
2021-10-11 07:40:55 Starting - Launching requested ML instancesProfilerReport-1633938053: InProgress
......
2021-10-11 07:42:04 Starting - Preparing the instances for training.........
2021-10-11 07:43:40 Downloading - Downloading input data...
2021-10-11 07:44:16 Training - Downloading the training image......................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-10-11 07:48:29,655 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-10-11 07:48:29,679 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-10-11 07:48:29,689 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-10-11 07:48:30,459 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/

KeyboardInterrupt: 
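The log timestamps give a rough idea of the fixed startup cost of each Sagemaker job. The worked example below takes, for both `estimator.fit` runs in this section (the interrupted one above and the re-run below), the time the job started and the time the container reported importing the framework. It assumes both timestamp sources use the same time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (job start, "Imported framework" log line), copied from the outputs
runs = {
    "first run": ("2021-10-11 07:40:29", "2021-10-11 07:48:29"),
    "second run": ("2021-10-11 08:58:49", "2021-10-11 09:05:49"),
}

overheads = {
    name: datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    for name, (start, end) in runs.items()
}

for name, delta in overheads.items():
    print(f"{name}: {delta} before the container was ready")
```

Both runs spent roughly seven to eight minutes on provisioning and image download, which is worth keeping in mind for short experiments.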

In [None]:
estimator.fit({"training": inputs})

2021-10-11 08:58:49 Starting - Starting the training job...
2021-10-11 08:59:13 Starting - Launching requested ML instancesProfilerReport-1633942752: InProgress
...
2021-10-11 08:59:53 Starting - Preparing the instances for training.........
2021-10-11 09:01:37 Downloading - Downloading input data
2021-10-11 09:01:37 Training - Downloading the training image.......................
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-10-11 09:05:49,253 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-10-11 09:05:49,273 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-10-11 09:05:50,703 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-10-11 09:05:51,289 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:
/opt/conda