# Fine-tuning a BERT base model from Hugging Face on Amazon SageMaker

## Overview

This tutorial uses the Hugging Face transformers and datasets libraries with Amazon SageMaker to fine-tune a pre-trained BERT base model for binary text classification.

A [pre-trained model](https://huggingface.co/bert-base-uncased) is available in the transformers library from Hugging Face. You’ll be fine-tuning this pre-trained model using the [Amazon Polarity dataset](https://huggingface.co/datasets/amazon_polarity), which classifies content as either positive or negative feedback.

This Jupyter Notebook can run on either a SageMaker Notebook instance or a SageMaker Studio Notebook.

You can set up your SageMaker Notebook instance by following [Get Started with Amazon SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html), or your SageMaker Studio Notebook by following [Use Amazon SageMaker Studio Notebooks](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks.html).

The notebook was tested on an `ml.t3.medium` SageMaker Studio Notebook with the `Data Science` image.

## Development Environment and Permissions

For SageMaker Studio, make sure ipywidgets is installed, then restart the kernel.

In [None]:
%%capture
import IPython
import sys

!{sys.executable} -m pip install ipywidgets
IPython.Application.instance().kernel.do_shutdown(True)  # has to restart kernel so changes are used

This tutorial requires the following pip packages:
  + transformers
  + datasets

Please make sure the SageMaker Python SDK version is 2.146.0 or later.

In [None]:
import sys  # re-import: the kernel restart above cleared earlier imports

!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade --no-cache-dir sagemaker
!{sys.executable} -m pip install --upgrade --no-cache-dir torch==1.13.1
!{sys.executable} -m pip install --upgrade --no-cache-dir transformers==4.27.4 datasets

After these pip install commands in a notebook, make sure to restart your kernel.
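
The version requirement above can be checked programmatically once the kernel is back up. The following is a minimal sketch, not part of the original notebook: `parse_version` and `meets_minimum` are hypothetical helper names, and the parsing only handles plain dotted release numbers. It compares versions numerically because a plain string comparison would rank `2.99.0` above `2.146.0`.

```python
def parse_version(version):
    """Turn a dotted release string such as '2.146.0' into (2, 146, 0)."""
    numbers = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as "dev0" or "post1"
        numbers.append(int(piece))
    return tuple(numbers)


def meets_minimum(installed, minimum="2.146.0"):
    """Return True if `installed` is at least `minimum`, compared numerically."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so "3.0" compares like "3.0.0".
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


for candidate in ("2.145.9", "2.146.0", "2.99.0", "3.0"):
    print(candidate, meets_minimum(candidate))
```

In the notebook, the same check would read `meets_minimum(sagemaker.__version__)` after `import sagemaker`.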

In [None]:
import sagemaker
from sagemaker.pytorch import PyTorch
import torch
from transformers import AutoTokenizer
from datasets import load_dataset

### Create SageMaker session
Next, create a SageMaker session and define an execution role. The default role should suffice.

In [None]:
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
sess_bucket = sess.default_bucket()
region = sess.boto_region_name

print(f'sagemaker role arn: {role}')
print(f'sagemaker bucket: {sess_bucket}')
print(f'sagemaker session region: {region}')

## Dataset preparation
For this example, we will use the `datasets` library to download and preprocess the `amazon_polarity` dataset from Hugging Face.

https://huggingface.co/datasets/amazon_polarity

In [None]:
# pre-trained model used
model_name = 'bert-base-uncased'

# dataset used
dataset_name = 'amazon_polarity'

# s3 key prefix
s3_prefix = 'bert-base-uncased-amazon-polarity'

In [None]:
# Prepare dataset
train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])

# Reduce the dataset size for testing purposes
train_dataset = train_dataset.shuffle().select(range(4000))
test_dataset = test_dataset.shuffle().select(range(156))

# Let's take a look at one example from the training dataset.
index = 0
print(train_dataset[index])

In [None]:
# Tokenization
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Let's take a look at one example to understand how the original sentence is tokenized and encoded.
print('Tokenize:', tokenizer.tokenize(train_dataset['content'][index]))
print('Encode:', tokenizer.encode(train_dataset['content'][index]))

# Helper function to get the content to tokenize
def tokenize_function(examples):
    return tokenizer(examples['content'], padding='max_length', max_length=128, truncation=True)

# Tokenize
train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

# Set the format to PyTorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
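
To make the effect of `padding='max_length'`, `max_length=128` and `truncation=True` concrete, here is a simplified pure-Python illustration of the resulting `input_ids` and `attention_mask` for a single sequence. It is a sketch of the behaviour, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the content ids are arbitrary, and the BERT tokenizer additionally returns `token_type_ids`, which are omitted here. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def pad_or_truncate(content_ids, max_length=128):
    """Mimic fixed-length encoding of one sequence of content token ids."""
    # Reserve two positions for [CLS] and [SEP], then cut the content to fit.
    kept = list(content_ids[: max_length - 2])
    input_ids = [CLS_ID] + kept + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Pad on the right; padded positions are masked out with 0.
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A short review is padded, a long one is truncated (max_length=8 for readability).
short = pad_or_truncate([2023, 2003, 2307], max_length=8)
long = pad_or_truncate(range(1000, 1020), max_length=8)
print(short)
print(long)
```

Because every example ends up with the same length, batches can be stacked into tensors without a custom collate function, at the cost of some wasted computation on padding.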

### Uploading dataset to S3 Bucket
After processing the dataset, we use the `datasets` library's FileSystem integration to upload it to S3.

In [None]:
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()  

# save dataset to s3
training_input_path = f's3://{sess_bucket}/{s3_prefix}/{dataset_name}/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

test_input_path = f's3://{sess_bucket}/{s3_prefix}/{dataset_name}/test'
test_dataset.save_to_disk(test_input_path, fs=s3)

## Leveraging Neuron Persistent Cache
PyTorch Neuron performs just-in-time compilation of graphs during execution. At every step, if the traced graph differs from those of previous executions, it is compiled by the Neuron compiler.
The compilation result can be saved in an on-disk [Neuron Persistent Cache](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html) to avoid recompilation across training runs. To keep this cache across SageMaker training jobs, we add a mechanism that restores the cache from S3 at the beginning of the training job and stores it back to S3 at the end.

In [None]:
import pathlib
import tarfile

# If neuron compile cache does not exist in S3, create dummy file.
if not sess.list_s3_files(sess_bucket, f'{s3_prefix}/neuron-compile-cache.tar.gz'):
    pathlib.Path('./dummy').touch()
    with tarfile.open('neuron-compile-cache.tar.gz', 'w:gz') as t:
        t.add('dummy')
    sess.upload_data('./neuron-compile-cache.tar.gz', sess_bucket, s3_prefix)

s3_prefix_path = f's3://{sess_bucket}/{s3_prefix}'
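
The restore-and-store mechanism described above lives in the training script, which is not reproduced in this notebook. The following is a hypothetical sketch of how such a script could do it, under stated assumptions: `restore_cache` and `save_cache` are illustrative names, the archive arrives through an input channel directory, and the cache is copied into the job's output directory so that it comes back in `output.tar.gz`. Temporary directories stand in for the real container paths, and `graph.neff` is a placeholder file.

```python
import os
import shutil
import tarfile
import tempfile

ARCHIVE_NAME = "neuron-compile-cache.tar.gz"  # same file name the notebook uploads


def restore_cache(channel_dir, cache_dir):
    """Unpack the cache archive from an input channel directory into cache_dir.

    Returns False when no archive is present.
    """
    archive = os.path.join(channel_dir, ARCHIVE_NAME)
    if not os.path.exists(archive):
        return False
    os.makedirs(cache_dir, exist_ok=True)
    with tarfile.open(archive, "r:gz") as t:
        t.extractall(cache_dir)
    return True


def save_cache(cache_dir, output_dir):
    """Copy the cache into the output directory so it is returned with the job output."""
    shutil.copytree(cache_dir, output_dir, dirs_exist_ok=True)


# Local demonstration: temporary directories are hypothetical stand-ins
# for the channel, cache and output locations inside the container.
with tempfile.TemporaryDirectory() as tmp:
    channel, cache, output = (os.path.join(tmp, d) for d in ("channel", "cache", "output"))
    os.makedirs(channel)

    first_run = restore_cache(channel, cache)  # no archive in the channel yet

    # Pretend a training run compiled one graph, then saved its cache.
    os.makedirs(cache, exist_ok=True)
    with open(os.path.join(cache, "graph.neff"), "w") as f:
        f.write("compiled graph placeholder")
    save_cache(cache, output)

    # Repackage the output the way the notebook does after the job finishes.
    with tarfile.open(os.path.join(channel, ARCHIVE_NAME), "w:gz") as t:
        t.add(output, arcname="./")

    # A later run restores the cache from the channel.
    new_cache = os.path.join(tmp, "new-cache")
    second_run = restore_cache(channel, new_cache)
    restored_files = sorted(os.listdir(new_cache))

print(first_run, second_run, restored_files)
```

The dummy archive created in the cell above serves the same purpose as the `False` branch here: it lets the very first training job start without a real cache.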

## Fine-tuning & starting SageMaker Training Job
To create a SageMaker training job, we use the [PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#train-a-model-with-pytorch). The Estimator handles end-to-end Amazon SageMaker training and deployment tasks. In the Estimator we define which fine-tuning script to use as the `entry_point`, which `instance_type` to run on, which `hyperparameters` to pass in, and so on.

Note: SageMaker support for EC2 Trn1 instances is currently available only through the PyTorch Estimator. Support in the HuggingFace Estimator will be available in a future release.

In [None]:
# hyperparameters, which are passed into the training job
hyperparameters={'epochs': 10,               # number of training epochs
                 'train_batch_size': 8,      # batch size for training
                 'eval_batch_size': 8,       # batch size for evaluation
                 'learning_rate': 5e-5,      # learning rate used during training
                 'model_name':model_name,    # pre-trained model
                }
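
In script mode, SageMaker hands these hyperparameters to the entry point as command-line arguments of the form `--epochs 10`. Since `scripts/train.py` is not reproduced in this notebook, the following is only a hypothetical sketch of how such a script could read them with `argparse`; the function name `parse_args` and the default values are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed to the training script as CLI flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)            # assumed default
    parser.add_argument("--train_batch_size", type=int, default=8)  # assumed default
    parser.add_argument("--eval_batch_size", type=int, default=8)   # assumed default
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    parser.add_argument("--model_name", type=str, default="bert-base-uncased")
    # parse_known_args tolerates extra flags that the script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the command line produced from the hyperparameters dictionary above.
args = parse_args([
    "--epochs", "10",
    "--train_batch_size", "8",
    "--eval_batch_size", "8",
    "--learning_rate", "5e-05",
    "--model_name", "bert-base-uncased",
])
print(args)
```

Declaring a `type` for each flag matters because every value arrives as a string on the command line.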

In [None]:
# https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers
train_image_name="pytorch-training-neuronx"
image_tag="1.13.1-neuronx-py310-sdk2.12.0-ubuntu20.04"

In [None]:
pt_estimator = PyTorch(
    entry_point='train.py',
    source_dir='scripts',
    image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/{train_image_name}:{image_tag}",
    role=role,
    instance_count=1,
    instance_type='ml.trn1.2xlarge',
    base_job_name=s3_prefix,
    hyperparameters=hyperparameters,

    distribution={
        'torch_distributed': {
            'enabled': True
        }
    }
)

In [None]:
!pygmentize ./scripts/train.py

In [None]:
# define a data input dictionary with our uploaded S3 URIs
data = {
    'train': training_input_path,
    'test': test_input_path,
    's3_prefix': s3_prefix_path  # To restore Neuron Persistent Cache archive from S3
}
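
Each key of this dictionary becomes an input channel. Inside the training container, the data for a channel is placed under `/opt/ml/input/data/<channel>`, and the SageMaker training toolkit exposes that path in an environment variable named `SM_CHANNEL_<CHANNEL>` in upper case. The sketch below shows how a training script could resolve the three channels; `channel_dir` is a hypothetical helper and is not taken from `scripts/train.py`.

```python
import os


def channel_dir(channel_name, environ=os.environ):
    """Return the local directory for a named input channel.

    Prefers the SM_CHANNEL_* variable and falls back to the conventional path.
    """
    default = f"/opt/ml/input/data/{channel_name}"
    return environ.get(f"SM_CHANNEL_{channel_name.upper()}", default)


for name in ("train", "test", "s3_prefix"):
    print(name, "->", channel_dir(name))
```

Outside a training container the variables are unset, so the helper prints the fallback paths.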

In [None]:
%%time
# starting the train job with our uploaded datasets as input
pt_estimator.fit(data, wait=True)

### Back up the Neuron Persistent Cache to S3 for subsequent training jobs

In [None]:
TrainingJobName = pt_estimator.latest_training_job.describe()['TrainingJobName']

sess.download_data('./', sess_bucket, f'{TrainingJobName}/output/output.tar.gz')

import tarfile
with tarfile.open('output.tar.gz', 'r') as t:
    t.extractall('./output/')

with tarfile.open('neuron-compile-cache.tar.gz', 'w:gz') as t:
    t.add('./output/', arcname='./')

sess.upload_data('./neuron-compile-cache.tar.gz', sess_bucket, s3_prefix)

## Quick check of the prediction results
Let's quickly check whether the fine-tuned model produces the expected predictions.


In [None]:
sess.download_data('./', sess_bucket, f'{TrainingJobName}/output/model.tar.gz')
    
with tarfile.open('model.tar.gz', 'r') as t:
    t.extractall('./model/')

In [None]:
from transformers import pipeline

classifier = pipeline('text-classification', model = './model/')

print(classifier('This is the most amazing product I have ever seen. I love it.'))
print(classifier('big disappointment. I will not use it.'))
print(classifier('Wonderful product. I will let my friends know.'))
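
The `text-classification` pipeline returns a list of dictionaries with a `label` and a `score`. If the saved model config carries no custom label names, the labels appear as `LABEL_0` and `LABEL_1`. The sketch below maps them to readable names, assuming the `amazon_polarity` convention of 0 for negative and 1 for positive; `readable` is a hypothetical helper and the sample score is made up for illustration.

```python
# Assumed mapping: amazon_polarity uses 0 = negative and 1 = positive, and a
# model saved without custom id2label names reports LABEL_0 / LABEL_1.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(predictions):
    """Replace generic pipeline labels with sentiment names, keeping the score."""
    return [
        {"sentiment": LABEL_NAMES.get(p["label"], p["label"]), "score": p["score"]}
        for p in predictions
    ]


sample = [{"label": "LABEL_1", "score": 0.75}]  # made-up value for illustration
print(readable(sample))
```

In the notebook, this would be called as `readable(classifier('Wonderful product. I will let my friends know.'))`.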

### Congratulations! You can see the expected results!