# Training a Language Model with Amazon SageMaker

#### Goal: To fine-tune a pre-trained transformer for a binary text classification task using the `imdb` dataset.

#### Note: This notebook is a modified version of the original work from the [huggingface/notebooks](https://github.com/huggingface/notebooks) repository. Modifications include bug fixes, added notes, and additional features.

1. [Development Environment and Permissions](#Development-Environment-and-Permissions)
    1. [Installation](#Installation)  
    2. [Development environment](#Development-environment)  
    3. [Permissions](#Permissions)
2. [Preprocessing](#Preprocessing)
    1. [Tokenization](#Tokenization)  
    2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)  
3. [Fine-tuning & starting Sagemaker Training Job](#Fine-tuning-\&-starting-Sagemaker-Training-Job)  
    1. [Creating an Estimator and start a training job](#Creating-an-Estimator-and-start-a-training-job)
    2. [Deploying the endpoint](#Deploying-the-endpoint)
4. [Extras](#Extras)
    1. [Estimator Parameters](#Estimator-Parameters)
    2. [Attach to old training job to an estimator](#Attach-to-old-training-job-to-an-estimator)
    3. [Load fine-tuned model from S3](#Load-fine-tuned-model-from-S3)


# Development Environment and Permissions 

## Installation

In [None]:
!pip install "sagemaker>=2.140.0" "transformers==4.26.1" "datasets==4.0.0" --upgrade

## Development environment 

In [None]:
import sagemaker.huggingface

## Permissions

Identify a default S3 bucket for storing data and models, and retrieve the IAM execution role that grants the necessary permissions.

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

# Preprocessing

We are using the `datasets` library to download and preprocess the `imdb` dataset. After preprocessing, the dataset will be uploaded to our `sagemaker_session_bucket` to be used within our training job. 

## Tokenization 

Preprocess the `imdb` dataset by tokenizing the text with a `DistilBERT` tokenizer and then formatting the data into PyTorch tensors ready for model training.

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'

# dataset used
dataset_name = 'imdb'

# s3 key prefix for the data
s3_prefix = 'samples/datasets/imdb'

In [None]:
# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function: pad and truncate every review to the model's maximum length
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load the train and test splits
train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])
# reduce the test dataset to 10k randomly selected samples
test_dataset = test_dataset.shuffle().select(range(10000))


# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
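
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal sketch using a toy whitespace tokenizer. It is not the DistilBERT tokenizer: the function name `toy_tokenize`, the vocabulary, `PAD_ID`, and `MAX_LENGTH = 8` are assumptions chosen for illustration (`distilbert-base-uncased` uses WordPiece with a maximum length of 512).

```python
# Toy illustration of padding and truncation; NOT the real DistilBERT tokenizer.
MAX_LENGTH = 8   # hypothetical value; distilbert-base-uncased uses 512
PAD_ID = 0       # hypothetical padding id


def toy_tokenize(text, vocab):
    """Map whitespace-separated words to ids, then truncate and pad to MAX_LENGTH."""
    # Assign a new id (starting at 1) to every word not seen before.
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]
    ids = ids[:MAX_LENGTH]                # corresponds to truncation=True
    attention_mask = [1] * len(ids)       # 1 marks a real token
    pad = MAX_LENGTH - len(ids)           # corresponds to padding='max_length'
    return {
        "input_ids": ids + [PAD_ID] * pad,
        "attention_mask": attention_mask + [0] * pad,  # 0 marks padding
    }


vocab = {}
short = toy_tokenize("a great movie", vocab)
long = toy_tokenize("one of the best films I have seen this year", vocab)
print(short)
print(len(long["input_ids"]))
```

Every example ends up with the same length, which is what allows the datasets to be converted into fixed-shape PyTorch tensors. The attention mask tells the model which positions hold real tokens and which hold padding.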

## Uploading data to `sagemaker_session_bucket`

In [None]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path)
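
As a rough sketch of what the two paths above look like, the snippet below builds the same kind of URI and splits it back into bucket and key. The helper `split_s3_uri` and the bucket name are assumptions for illustration; the actual bucket comes from `sess.default_bucket()`.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder bucket name; in the notebook this is sess.default_bucket().
bucket = "sagemaker-us-east-1-111122223333"
s3_prefix = "samples/datasets/imdb"
training_input_path = f"s3://{bucket}/{s3_prefix}/train"

print(split_s3_uri(training_input_path))
```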

# Fine-tuning & starting Sagemaker Training Job

To create a SageMaker training job we need a `HuggingFace` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks.

When we create a SageMaker training job, SageMaker starts and manages the required EC2 instances with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then it starts the training job.
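
The sketch below shows one way a training script could pick up these inputs inside the container. It is not the actual `scripts/train.py`: the argument names mirror the hyperparameters used later in this notebook, and the script is assumed to receive hyperparameters as command-line arguments and to read the data and model locations from the `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, and `SM_MODEL_DIR` environment variables that SageMaker sets.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and container paths (illustrative sketch only)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments, e.g. --epochs 1
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")

    # Each input channel is downloaded to /opt/ml/input/data/<channel_name>;
    # the fallbacks below are only used when the variables are not set.
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"))
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))

    # Ignore any extra arguments the platform may pass.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "1", "--train_batch_size", "32",
                   "--model_name", "distilbert-base-uncased"])
print(args.epochs, args.training_dir, args.model_dir)
```

The channel names `train` and `test` correspond to the keys of the dictionary passed to `fit()`.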


## Creating an Estimator and start a training job

Inspect the fine-tuning script `train.py`:

In [None]:
!pygmentize ./scripts/train.py

Define a SageMaker training job by creating a HuggingFace estimator.

In [None]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 1,
                 'train_batch_size': 32,
                 'model_name':'distilbert-base-uncased'
                 }

In [None]:
huggingface_estimator = HuggingFace(entry_point='train.py',
                            source_dir='./scripts',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            role=role,
                            transformers_version='4.26',
                            pytorch_version='1.13',
                            py_version='py39',
                            hyperparameters = hyperparameters)

Start the training job with our uploaded datasets as input.

In [None]:
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

The `train.py` script saves the model artifacts to a predefined local path `/opt/ml/model`.

SageMaker then automatically tars these artifacts and uploads the resulting `model.tar.gz` to the specified S3 output path.
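
To get a feel for what such an archive contains, the following sketch builds a tiny gzip-compressed tar in memory and lists its members. The file names `config.json` and `pytorch_model.bin` and their contents are placeholders; the real archive holds whatever `train.py` wrote to `/opt/ml/model`.

```python
import io
import tarfile


def build_archive(files):
    """Pack {name: bytes} into an in-memory gzip tar, like a tiny model.tar.gz."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)  # rewind so the archive can be read back
    return buffer


def list_archive(fileobj):
    """Return the sorted member names of a gzip-compressed tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        return sorted(tar.getnames())


# Placeholder artifacts standing in for the files saved by the training script.
archive = build_archive({"config.json": b"{}", "pytorch_model.bin": b"\x00" * 16})
members = list_archive(archive)
print(members)
```

The same `list_archive` helper can be pointed at a downloaded `model.tar.gz` opened in binary mode to check its contents before deploying.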

## Deploying the endpoint

In [None]:
predictor = huggingface_estimator.deploy(1, "ml.g4dn.xlarge")

Then, we use the returned predictor object to call the endpoint.

In [None]:
sentiment_input= {"inputs":"I love using the new Inference DLC."}

predictor.predict(sentiment_input)
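
For a text classification model, the response is typically a list of label/score dictionaries. The sketch below assumes that shape and maps the generic label names to readable sentiments; the `example_response` values are made up for illustration, and the mapping relies on the `imdb` convention that label 0 is negative and label 1 is positive.

```python
# Assumed response shape; the real output depends on the deployed model and task.
example_response = [{"label": "LABEL_1", "score": 0.98}]  # made-up values

# imdb convention: 0 = negative, 1 = positive
ID2LABEL = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(response):
    """Replace generic label names with human-readable sentiments."""
    return [
        {"sentiment": ID2LABEL.get(item["label"], item["label"]), "score": item["score"]}
        for item in response
    ]


print(readable(example_response))
```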

Finally, we delete the endpoint.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()

# Extras

## Estimator Parameters

In [None]:
# container image used for training job
print(f"container image used for training job: \n{huggingface_estimator.image_uri}\n")

# s3 uri where the trained model is located
print(f"s3 uri where the trained model is located: \n{huggingface_estimator.model_data}\n")

# latest training job name for this estimator
print(f"latest training job name for this estimator: \n{huggingface_estimator.latest_training_job.name}\n")

In [None]:
# access the logs of the training job
huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)

## Attach to old training job to an estimator 

Attach a previous training job to an estimator to retrieve its results (for example, the location of the model artifacts) or to reuse it in later steps.

In [None]:
from sagemaker.estimator import Estimator

# job which is going to be attached to the estimator
old_training_job_name=huggingface_estimator.latest_training_job.name

In [None]:
# attach old training job
huggingface_estimator_loaded = Estimator.attach(old_training_job_name)

# get model output s3 from training job
huggingface_estimator_loaded.model_data

## Load fine-tuned model from S3

In [None]:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

# Fill in the path to the model.tar.gz
s3_model_path = "s3://path/to/your/model.tar.gz" 

# Create a model object pointing to the model in S3
reloaded_model = HuggingFaceModel(
    model_data=s3_model_path,
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39"
)

# Deploy the endpoint
new_predictor = reloaded_model.deploy(1, "ml.g4dn.xlarge")

Then, we can use the returned predictor object to call the endpoint again.

In [None]:
sentiment_input= {"inputs":"I love using the new Inference DLC."}

new_predictor.predict(sentiment_input)

Delete the endpoint again.

In [None]:
new_predictor.delete_model()
new_predictor.delete_endpoint()