# Amazon SageMaker x Hugging Face Transformers - Distributed Training and Checkpointing Demo
### Learn how to use SageMaker distributed training data parallelism and checkpoints in Transformer model fine-tuning.
#### Disclaimer: This demo showcases SageMaker Distributed Training data parallelism and checkpointing; it is not intended for direct production use.

[Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) and the [Hugging Face DLCs](https://huggingface.co/docs/sagemaker/main) make it easy to train transformer models using a pre-built Hugging Face framework container that supports [SageMaker distributed training](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html).

Amazon SageMaker also supports [remote S3 checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html): data written to a local path in the training container is saved to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path.

In this example, we are going to:

- preprocess a dataset in the notebook and upload it to Amazon S3
- configure checkpointing and distributed training in the `HuggingFace` estimator
- run training and resume training from checkpoints

_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.**_ When running in SageMaker Studio, choose the kernel `Python 3 (PyTorch 1.10 Python 3.8 CPU Optimized)`.

## **Development Environment and Permissions**

*Note: we only install the required libraries from Hugging Face and AWS. You also need PyTorch or TensorFlow if you don't already have one of them installed.*

In [None]:
!pip install --upgrade pip
!pip install "sagemaker>=2.77.0" "transformers==4.12.3" "datasets[s3]==1.18.3" s3fs --upgrade

In [None]:
!conda install -c conda-forge ipywidgets -y

## Permissions

*If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).*

In [None]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

## Preprocessing

We are using the `datasets` library to download and preprocess the `emotion` dataset. After preprocessing, the dataset will be uploaded to our `sagemaker_session_bucket` to be used within our training job. The [emotion](https://github.com/dair-ai/emotion_dataset) dataset consists of 16000 training examples, 2000 validation examples, and 2000 testing examples. A more detailed description of the dataset can be found in this [paper](https://aclanthology.org/D18-1404/).

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# model_id used for training and preprocessing
model_id = 'distilbert-base-uncased'

# dataset used
dataset_name = 'emotion'

# s3 key prefix for the data
s3_prefix = 'samples/datasets/emotion'

# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load dataset
train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])

# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for pytorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

After processing the datasets, we use the `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) of the `datasets` library to upload them to S3.

In [None]:
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()  

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path, fs=s3)

## Configure checkpointing and distributed training in the `HuggingFace` estimator

After uploading the dataset, we can configure our SageMaker estimator with checkpointing and distributed training enabled.

Checkpointing is also used in SageMaker [Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html). To configure spot training, we need to define `max_wait` and `max_run` in the `HuggingFace` estimator and set `use_spot_instances` to `True`. In this demo, we are not going to use spot training, so `use_spot_instances` is set to `False`, but feel free to experiment with it.

In spot training:
- `max_wait`: maximum time in seconds that Amazon SageMaker waits for the managed spot training job to complete before stopping it
- `max_run`: maximum duration in seconds of the training itself

`max_wait` must not be smaller than `max_run`, because `max_wait` covers both the time spent waiting for spot instances (which can take a while when no spot capacity is free) and the expected duration of the training job.
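
The relationship between the two limits can be checked before an estimator is created. The following is a minimal sketch, not part of the original notebook: `spot_training_kwargs` is a hypothetical helper, and only the three parameter names it returns come from the SageMaker estimator.

```python
def spot_training_kwargs(use_spot_instances, max_run, max_wait):
    """Build estimator keyword arguments for on-demand or managed spot training.

    Hypothetical helper for illustration only.
    """
    if not use_spot_instances:
        # max_wait only matters for spot training, so it is left out here.
        return {"use_spot_instances": False, "max_run": max_run}
    if max_wait < max_run:
        raise ValueError(
            f"max_wait ({max_wait}s) must cover at least max_run ({max_run}s)"
        )
    return {"use_spot_instances": True, "max_run": max_run, "max_wait": max_wait}


# Values used in this demo, with spot training switched on for illustration
print(spot_training_kwargs(True, max_run=4000, max_wait=7200))
```

The returned dictionary could then be unpacked into the estimator call with `**`, which keeps the spot-related arguments in one place.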

In [None]:
# enables spot training
use_spot_instances=False
# max time including spot start + training time
max_wait=7200
# expected training time
max_run=4000

To enable checkpointing, we need to define `checkpoint_s3_uri` in the `HuggingFace` estimator. `checkpoint_s3_uri` is an S3 URI in which to save the checkpoints. By default, Amazon SageMaker saves any file that the training job writes to `/opt/ml/checkpoints` to `checkpoint_s3_uri`.

*The local directory `/opt/ml/checkpoints` can be changed by setting `checkpoint_local_path` in the `HuggingFace` estimator.*
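
If the local checkpoint directory is changed, the directory the training script writes to has to change with it, otherwise nothing lands in the synced path. The sketch below illustrates this under stated assumptions: `checkpoint_settings` is a hypothetical helper, the bucket name and the custom path are placeholders, and it assumes the training script takes its checkpoint directory from an `output_dir` hyperparameter, as in the estimator cell of this notebook.

```python
def checkpoint_settings(bucket, job_name, local_path="/opt/ml/checkpoints"):
    """Return matching estimator kwargs and hyperparameter overrides.

    Hypothetical helper: keeps the estimator's checkpoint_local_path and the
    script's output_dir pointing at the same directory.
    """
    estimator_kwargs = {
        "checkpoint_s3_uri": f"s3://{bucket}/{job_name}/checkpoints",
        "checkpoint_local_path": local_path,
    }
    hyperparameter_overrides = {"output_dir": local_path}
    return estimator_kwargs, hyperparameter_overrides


# Placeholder bucket and a non-default local path, for illustration only
kwargs, overrides = checkpoint_settings(
    "my-example-bucket",
    "emotion-checkpointing-1st-job",
    local_path="/opt/ml/custom_checkpoints",
)
print(kwargs)
print(overrides)
```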

In [None]:
# base name of the first training job
base_job_name = "emotion-checkpointing-1st-job"

# s3 uri where our checkpoints will be uploaded during training
checkpoint_s3_uri = f's3://{sess.default_bucket()}/{base_job_name}/checkpoints'

Next, we'll enable distributed training with the [SageMaker Distributed Data Parallel Library](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html).

The SageMaker distributed data parallel library employs Message Passing Interface (MPI), a popular standard for managing communication between nodes in a high-performance cluster, and uses NVIDIA’s NCCL library for GPU-level communication. You can set custom MPI options using the `custom_mpi_options` parameter in the `Estimator`. Any `mpirun` flags passed in this field are added to the `mpirun` command and executed by SageMaker for training.

For example, you may define the `distribution` parameter of an `Estimator` as follows to use the `NCCL_DEBUG` variable to print the NCCL version at the start of the program:

`distribution = {'smdistributed':{'dataparallel':{'enabled': True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"}}}`

AWS has [pre-built framework container images](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) registered with Amazon ECR for SageMaker managed training and inference, and Hugging Face is one of the frameworks with pre-built Docker containers.

The next step is to create our `HuggingFace` estimator, provide our `hyperparameters`, and add our distributed training and checkpointing configurations. For training instances, choose from `ml.p3.16xlarge`, `ml.p3dn.24xlarge`, and `ml.p4d.24xlarge`.
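
With data parallelism, every GPU processes its own mini-batch, so the effective batch size grows with the number of GPUs. The following back-of-the-envelope sketch uses the values from this notebook (two instances, a batch size of 32, 16000 training examples). It assumes that `train_batch_size` is applied per GPU by the training script, which is not shown here; `data_parallel_plan` is a hypothetical helper and the step count is only an estimate.

```python
import math

# GPUs per instance for the instance types listed above
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def data_parallel_plan(instance_type, instance_count, per_device_batch_size, num_examples):
    """Estimate world size, global batch size and optimizer steps per epoch."""
    world_size = GPUS_PER_INSTANCE[instance_type] * instance_count
    global_batch_size = world_size * per_device_batch_size
    steps_per_epoch = math.ceil(num_examples / global_batch_size)
    return world_size, global_batch_size, steps_per_epoch


print(data_parallel_plan("ml.p3dn.24xlarge", 2, 32, 16000))  # (16, 512, 32)
```

Because the global batch size is 16 times larger than on a single GPU under this assumption, the learning rate may need retuning compared with a single-GPU run.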

In [None]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 2,                         # number of training epochs
    'train_batch_size': 32,              # batch size for training
    'eval_batch_size': 64,               # batch size for evaluation
    'learning_rate': 3e-5,               # learning rate used during training
    'model_id': model_id,                # pre-trained model id
    'fp16': True,                        # whether to use 16-bit (mixed) precision training
    'output_dir': '/opt/ml/checkpoints', # make sure files are saved to the checkpoint directory
}

# enable the SageMaker distributed data parallel library
distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'train.py',          # fine-tuning script used in the training job
    source_dir           = './scripts',         # directory where the fine-tuning script is stored
    instance_type        = 'ml.p3dn.24xlarge',  # instance type used for the training job
    instance_count       = 2,                   # the number of instances used for training
    base_job_name        = base_job_name,       # the name of the training job
    role                 = role,                # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.12.3',            # the transformers version used in the training job
    pytorch_version      = '1.9.1',             # the pytorch version used in the training job
    py_version           = 'py38',              # the python version used in the training job
    hyperparameters      = hyperparameters,     # the hyperparameters used for running the training job
    distribution         = distribution,        # distributed training configuration
    use_spot_instances   = use_spot_instances,  # whether to use spot instances or not
    # max_wait             = max_wait,            # max time including spot start + training time
    # max_run              = max_run,             # max expected training time
    checkpoint_s3_uri    = checkpoint_s3_uri,   # s3 uri where our checkpoints will be uploaded during training
)

When using remote S3 checkpointing, you have to make sure that your `train.py` also supports checkpointing. `Transformers` and its `Trainer` offer utilities for this. You only need to add the following snippet to your `Trainer` training script:

```python
from transformers.trainer_utils import get_last_checkpoint

# resume from the latest checkpoint in output_dir if one exists
# (args, logger and trainer are defined elsewhere in the training script)
last_checkpoint = get_last_checkpoint(args.output_dir)
if last_checkpoint is not None:
    logger.info("***** continue training *****")
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()
```
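
To make the resume logic more concrete, here is a simplified, standard-library-only sketch of the lookup that `get_last_checkpoint` performs. It assumes the `Trainer` naming convention of `checkpoint-<step>` subdirectories. `find_last_checkpoint` is a hypothetical stand-in written for illustration, not the library's implementation; unlike the library function, it returns `None` when the directory does not exist.

```python
import os
import re
import tempfile

# Trainer checkpoints are saved in folders named "checkpoint-<global step>"
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir):
    """Return the path of the checkpoint folder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    candidates = [
        name
        for name in os.listdir(output_dir)
        if _CHECKPOINT_RE.match(name)
        and os.path.isdir(os.path.join(output_dir, name))
    ]
    if not candidates:
        return None
    # Compare step numbers numerically, so that 1000 ranks above 500
    latest = max(candidates, key=lambda name: int(_CHECKPOINT_RE.match(name).group(1)))
    return os.path.join(output_dir, latest)


# Small self-contained demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-500", "checkpoint-1000", "runs"):
        os.mkdir(os.path.join(tmp, name))
    print(os.path.basename(find_last_checkpoint(tmp)))  # checkpoint-1000
```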

## Run training

To start the SageMaker training process, we simply call the `.fit` method of our estimator and provide our dataset.

In [None]:
# define train data object
data = {
    'train': training_input_path,
    'test': test_input_path
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, logs="None")

After training has finished, you'll see two checkpoints under `checkpoint_s3_uri` and the final model output as a `model.tar.gz` file at the S3 location printed below.

In [None]:
# s3 uri where the checkpoints are located
print(f"Checkpoint location: \n{checkpoint_s3_uri}\n")

# s3 uri where the trained model is located
print(f"s3 uri where the trained model is located: \n{huggingface_estimator.model_data}\n")

## Resume training from checkpoints 

To resume a training job from saved checkpoints, run a new estimator with the same `checkpoint_s3_uri` that you created in the section on configuring checkpointing above. Once the training has resumed, the checkpoints from this S3 location are restored to `checkpoint_local_path` on each instance of the new training job.
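
Before launching the second job, it can be useful to confirm which files are stored under the checkpoint URI. The sketch below is one possible way to do this, not part of the original notebook: `split_s3_uri` and `list_checkpoint_keys` are hypothetical helpers, and the listing function needs AWS credentials with read access to the bucket.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_checkpoint_keys(s3_uri):
    """List all object keys stored under the given S3 URI."""
    import boto3  # imported lazily so split_s3_uri works without AWS access

    bucket, prefix = split_s3_uri(s3_uri)
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing from a page when nothing matches the prefix
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Example with a placeholder bucket name; no AWS call is made here
print(split_s3_uri("s3://my-example-bucket/emotion-checkpointing-1st-job/checkpoints"))
```

In the notebook, `list_checkpoint_keys(checkpoint_s3_uri)` would return the keys written by the first job.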

In [None]:
new_base_job_name = "emotion-checkpointing-2nd-job"

In [None]:
# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 3,                         # number of training epochs
    'train_batch_size': 32,              # batch size for training
    'eval_batch_size': 64,               # batch size for evaluation
    'learning_rate': 3e-5,               # learning rate used during training
    'model_id': model_id,                # pre-trained model id
    'fp16': True,                        # whether to use 16-bit (mixed) precision training
    'output_dir': '/opt/ml/checkpoints'  # make sure files are saved to the checkpoint directory
}

In [None]:
huggingface_estimator_resume = HuggingFace(
    entry_point          = 'train.py',          # fine-tuning script used in the training job
    source_dir           = './scripts',         # directory where the fine-tuning script is stored
    instance_type        = 'ml.p3dn.24xlarge',  # instance type used for the training job
    instance_count       = 2,                   # the number of instances used for training
    base_job_name        = new_base_job_name,   # the name of the training job
    role                 = role,                # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version = '4.12.3',            # the transformers version used in the training job
    pytorch_version      = '1.9.1',             # the pytorch version used in the training job
    py_version           = 'py38',              # the python version used in the training job
    hyperparameters      = hyperparameters,     # the hyperparameters used for running the training job
    distribution         = distribution,        # distributed training configuration
    use_spot_instances   = use_spot_instances,  # whether to use spot instances or not
    # max_wait             = max_wait,            # max time including spot start + training time
    # max_run              = max_run,             # max expected training time
    checkpoint_s3_uri    = checkpoint_s3_uri,   # s3 uri where the previous checkpoints are located
)

Let's kick off the training again. Since the first training job saved two checkpoints, training resumes from the latest one and continues with the third epoch. After training has finished, you'll see three checkpoints under `checkpoint_s3_uri` and a new `model.tar.gz` file in the second training job's output location.

In [None]:
# starting the train job with our uploaded datasets as input
huggingface_estimator_resume.fit(data)