# Accelerating Hugging Face Transformers with AWS Accelerators and Amazon SageMaker

This notebook shows how to train and deploy Hugging Face Transformers on Amazon SageMaker using AWS accelerators, including [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/?nc1=h_ls) and [AWS Inferentia2](https://aws.amazon.com/machine-learning/inferentia/). As the field of deep learning continues to evolve, the need for efficient and cost-effective ways to train and deploy increasingly complex transformer models has become more critical than ever. AWS purpose-built accelerators are designed to deliver high performance at the lowest cost for deep learning inference and training.

This notebook walks you through an end-to-end example of how to train a RoBERTa model with Hugging Face on AWS Trainium and deploy it on Amazon SageMaker using AWS Inferentia2 accelerators for inference. You benefit from faster time-to-train, up to 50% cost-to-train savings, and, for inference, up to 4x higher throughput and 10x lower latency compared to first-generation Inferentia.

You will learn how to:

1. Set up the AWS environment
2. Load and prepare the dataset
3. Fine-tune RoBERTa using Hugging Face Transformers and Optimum Neuron on AWS Trainium
4. Deploy the model to Inferentia2 and run inference


_**NOTE: You can run this demo in SageMaker Studio, on your local machine, or on SageMaker Notebook Instances.**_

## 1. Set up Development Environment


In [2]:
!pip install "transformers>=4.28.0" "datasets[s3]==2.9.0" sagemaker --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).



In [1]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.


sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role
sagemaker bucket: sagemaker-us-east-1-558105141721
sagemaker session region: us-east-1


## 2. Load and prepare the dataset

We are using the `datasets` library to download and preprocess the `emotion` dataset. After preprocessing, the dataset will be uploaded to our `sagemaker_session_bucket` to be used within our training job. The [emotion](https://github.com/dair-ai/emotion_dataset) dataset consists of 16000 training examples, 2000 validation examples, and 2000 testing examples.

```python
{
  'text': 'im feeling rather festive here in south florida', 
  'label': 1
}
```

To load the `emotion` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [2]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("philschmid/emotion")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Validation dataset size: {len(dataset['validation'])}")


No config specified, defaulting to: emotion/split
Found cached dataset emotion (/home/ubuntu/.cache/huggingface/datasets/philschmid___emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd)


  0%|          | 0/3 [00:00<?, ?it/s]

Train dataset size: 16000
Validation dataset size: 2000


To train our model, we need to convert our inputs (text) to token IDs. This is done by a 🤗 Transformers Tokenizer. If you are not sure what this means, check out **[chapter 6](https://huggingface.co/course/chapter6/1?fw=tf)** of the Hugging Face Course.

In [3]:
from transformers import AutoTokenizer

model_id="roberta-base"

# Load tokenizer of RoBERTa
tokenizer = AutoTokenizer.from_pretrained(model_id)



AWS Trainium requires the inputs to have a static shape. To optimize our training throughput, we want to understand how long our inputs are so that we can pad them efficiently.

In [4]:
from datasets import concatenate_datasets

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["validation"]]).map(lambda x: tokenizer(x["text"], truncation=True), batched=True, remove_columns=["text"])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
print(f"Max source length: {max_source_length}")

Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/philschmid___emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-64679c6b492027f9.arrow


Max source length: 88


Our max sequence length is 88. We are going to use 128 as our max sequence length for training and inference and pad all inputs to this length.
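
One way to arrive at a padded length like this, sketched below, is to round the longest observed sequence up to the next power of two. This is only an illustrative heuristic (the notebook simply picks 128), and `next_power_of_two` is a hypothetical helper, not part of any library used here.

```python
def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


# Longest tokenized sequence reported by the cell output above
max_source_length = 88
padded_length = next_power_of_two(max_source_length)
print(padded_length)  # 128
```

A larger padded length wastes compute on padding tokens, while a smaller one truncates inputs, so the value is a trade-off rather than a fixed rule.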


In [5]:
MAX_LENGTH = 128

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', max_length=MAX_LENGTH, truncation=True, return_tensors="pt")

# tokenize dataset
tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized_dataset = tokenized_dataset.with_format("torch")
# rename label to labels to match the expected input
train_dataset =  tokenized_dataset['train'].rename_column("label", "labels")
validation_dataset =  tokenized_dataset['validation'].rename_column("label", "labels")


  0%|          | 0/16 [00:00<?, ?ba/s]

Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/philschmid___emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-a4f094e327cece66.arrow
Loading cached processed dataset at /home/ubuntu/.cache/huggingface/datasets/philschmid___emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd/cache-b3c48eb89cf1fc2f.arrow
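
To make the effect of `padding='max_length'` concrete, here is a minimal pure-Python sketch of what padding and truncating to a static length does to a single sequence. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, and it assumes RoBERTa's pad token ID of 1.

```python
from typing import Dict, List

PAD_TOKEN_ID = 1  # RoBERTa's <pad> ID (assumed here for illustration)


def pad_or_truncate(input_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Pad (or cut) one token-ID sequence to exactly max_length entries.

    Simplification: a real tokenizer keeps the closing special token when
    truncating; this sketch just cuts the list.
    """
    ids = input_ids[:max_length]
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_TOKEN_ID] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }


example = pad_or_truncate([0, 118, 2198, 1346, 2], max_length=8)
print(example["input_ids"])       # [0, 118, 2198, 1346, 2, 1, 1, 1]
print(example["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask marks which positions hold real tokens, so the padding does not influence the model's predictions.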


Let's print a random sample from the tokenized dataset.

In [6]:
from random import randint

# randint is inclusive on both ends, so subtract 1 to stay within bounds
print(tokenized_dataset['train'][randint(0, len(dataset['train']) - 1)])

{'label': tensor(1), 'input_ids': tensor([    0,   118,  2198,  1346,    14,    51,   115,  3999,    33,    41,
         3031, 24672,    53,  1782,    24,    95, 10122,    15, 19750,     5,
          619,     9,     5,   157,   626,   278,     2,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,     1,     1,     1,     1,     1,     1,     1,
            1,     1,     1,  

After processing the datasets, we use the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload them to S3. We are using `sess.default_bucket()`; adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 paths later in our training job.

In [7]:
# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/startup-loft/train'
train_dataset.save_to_disk(training_input_path)

# save validation_dataset to s3 (used as the test channel)
test_input_path = f's3://{sess.default_bucket()}/startup-loft/test'
validation_dataset.save_to_disk(test_input_path)


Saving the dataset (0/1 shards):   0%|          | 0/16000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/2000 [00:00<?, ? examples/s]

## 3. Fine-tune RoBERTa using Hugging Face Transformers and Optimum Neuron on AWS Trainium

Normally we would use the [Trainer](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.Trainer) and [TrainingArguments](https://huggingface.co/docs/transformers/v4.19.4/en/main_classes/trainer#transformers.TrainingArguments) to fine-tune PyTorch-based transformer models. 

But together with AWS, we have developed a `TrainiumTrainer` to improve performance, robustness, and safety when training on Trainium instances. The `TrainiumTrainer` also comes with a [model cache](https://huggingface.co/docs/optimum-neuron/guides/cache_system), which allows us to use precompiled models and configuration from Hugging Face Hub to skip the compilation step, which would be needed at the beginning of training. This can reduce the training time by ~3x. 

The `TrainiumTrainer` is part of the `optimum-neuron` library and can be used as a 1-to-1 replacement for the `Trainer`. You only have to adjust the import in your training script. 

```diff
- from transformers import Trainer
+ from optimum.neuron import TrainiumTrainer as Trainer
```

We prepared a simple [train.py](./scripts/train.py) training script based on the ["Getting started with Pytorch 2.0 and Hugging Face Transformers"](https://www.philschmid.de/getting-started-pytorch-2-0-transformers#3-fine-tune--evaluate-bert-model-with-the-hugging-face-trainer) blog post with the `TrainiumTrainer`.

To train on Amazon SageMaker, we need to create a `HuggingFace` Estimator. The Estimator defines which fine-tuning script (`entry_point`), which `instance_type`, which `hyperparameters`, etc. should be used.

Amazon SageMaker takes care of starting and managing all the required EC2 instances for us with the `huggingface` container, uploads the provided fine-tuning script `train.py`, and downloads the data from our `sagemaker_session_bucket` into the container at `/opt/ml/input/data`. Then it starts the training job by running the script, with the hyperparameters passed as command-line arguments.
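
As a rough illustration of what happens inside the container, a training script typically reads its hyperparameters as command-line arguments and locates the data channels through environment variables that SageMaker sets (`SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, `SM_MODEL_DIR`). The sketch below is an assumption about how such a script could be structured, not the contents of the actual script in `./scripts`; the argument names mirror the hyperparameters used in this notebook.

```python
import argparse
import os


def str_to_bool(value: str) -> bool:
    """Parse 'True'/'False' strings, since hyperparameters arrive as text."""
    return value.lower() in ("true", "1", "yes")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters passed by the estimator as --key value pairs
    parser.add_argument("--model_id", type=str, default="roberta-base")
    parser.add_argument("--lr", type=float, default=5e-5)
    parser.add_argument("--bf16", type=str_to_bool, default=True)
    parser.add_argument("--per_device_train_batch_size", type=int, default=16)
    parser.add_argument("--epochs", type=int, default=3)
    # Locations provided by SageMaker through environment variables,
    # with the conventional container paths as fallbacks
    parser.add_argument("--train_dir", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    parser.add_argument("--test_dir", default=os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    # Ignore any extra arguments the launcher may add
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--lr", "5e-05", "--epochs", "3", "--bf16", "True"])
print(args.lr, args.epochs, args.bf16)
```

The channel names `train` and `test` correspond to the keys of the dictionary passed to `fit()`.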


In [8]:
trainium_image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.0-transformers4.28.1-neuronx-py38-sdk2.9.1-ubuntu20.04-v1.0"

In [9]:
import time
from sagemaker.huggingface import HuggingFace

# define Training Job Name
job_name = f'huggingface-fsdp-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters={
    'model_id': model_id, # model id from huggingface.co/models
    'lr': 5e-5, # learning rate
    'bf16': True, # enable mixed precision training
    'per_device_train_batch_size': 16, # batch size per device during training
    'epochs': 3, # number of epochs to train
}

# estimator
huggingface_estimator = HuggingFace(
    entry_point='train_with_export.py',
    source_dir='./scripts',
    instance_type="ml.trn1.2xlarge",
    instance_count=1,
    volume_size=200,
    role=role,
    job_name=job_name,
    image_uri=trainium_image_uri,
    py_version='py38',
    hyperparameters = hyperparameters,
    distribution={"torch_distributed": {"enabled": True}} # enable torchrun
)


In [10]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

Using provided s3_resource


INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-neuronx-2023-06-02-07-40-42-712


2023-06-02 07:40:44 Starting - Starting the training job...
2023-06-02 07:40:59 Starting - Preparing the instances for training.........
2023-06-02 07:42:32 Downloading - Downloading input data...
2023-06-02 07:42:52 Training - Downloading the training image..............................
2023-06-02 07:47:59 Training - Training image download completed. Training in progress..bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-06-02 07:48:33,753 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-06-02 07:48:33,755 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)
2023-06-02 07:48:34,797 sagemaker-training-toolkit INFO     Found 2 neurons on this instance
2023-06-02 07:48:34,808 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-06-02 07:48:34,810 sagemaker_pytorch_container.training INFO     Invoking Torc

The following diagram shows how a model is trained and deployed with Amazon SageMaker:
![assets](./assets/platform.png)

## 4. Deploy model to Inferentia2 and run inference

Now that we have trained our model, we want to deploy it to `inferentia2` on Amazon SageMaker so that we can use it for inference. When deploying models to `inferentia2`, we need to compile the model with the `neuron-sdk` or `optimum-neuron`. 

If you want to learn more about how to compile models for `inferentia2`, check out the [Optimum Neuron documentation](https://huggingface.co/docs/optimum-neuron/guides/export_model). 

First, we need to install `optimum-neuron` and the required Neuron runtime.


In [None]:
# for amazon linux 2
# !sudo yum install aws-neuronx-runtime-lib-2.* -y
# for ubuntu 20.04
!sudo apt-get install aws-neuronx-runtime-lib=2.12.* -y 
!pip install optimum optimum-neuron==0.0.3 --upgrade
!pip install neuronx-cc==2.5.* torch-neuronx==1.13.0.1.6.1 --extra-index-url https://pip.repos.neuron.amazonaws.com --upgrade

Next, we need to load our trained model from S3 and compile it for `inferentia2`. We are using the `export()` function from `optimum-neuron` to compile our model.


In [None]:
from optimum.exporters.neuron import export
from optimum.exporters.neuron.model_configs import RobertaNeuronConfig
from pathlib import Path
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from sagemaker.s3 import S3Downloader
from tempfile import TemporaryDirectory
import shutil

with TemporaryDirectory() as tmp_dir:
    # S3Downloader.download(model_data, tmp_dir)
    S3Downloader.download(huggingface_estimator.model_data, tmp_dir)
    shutil.unpack_archive(f"{tmp_dir}/model.tar.gz", tmp_dir)

    model = AutoModelForSequenceClassification.from_pretrained(tmp_dir)
    tokenizer = AutoTokenizer.from_pretrained(tmp_dir)

    neuron_config = RobertaNeuronConfig(config=model.config,
                                           task="text-classification",
                                           batch_size=1, 
                                           sequence_length=128
                                           )
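    output_path = Path("tmp")
    # make sure the export directory exists before writing to it
    output_path.mkdir(parents=True, exist_ok=True)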
    # Export to Neuron model
    export(
        model=model,
        config=neuron_config,
        output=output_path.joinpath("neuron_model.pt"),
        auto_cast="all",
        auto_cast_type="bf16",
    )
    model.config.save_pretrained(output_path)
    tokenizer.save_pretrained(output_path)
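
Because the model is compiled for a fixed `batch_size=1` and `sequence_length=128`, requests with a different shape have to be split or padded to that shape before they reach the compiled model. The helper below is a hypothetical sketch of one way to split a list of texts into fixed-size batches; `iter_fixed_batches` is not part of `optimum-neuron`.

```python
from typing import Iterator, List


def iter_fixed_batches(texts: List[str], batch_size: int = 1, filler: str = "") -> Iterator[List[str]]:
    """Yield batches of exactly batch_size items, filling the last one if needed."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Fill the final batch so every batch matches the compiled shape
        batch += [filler] * (batch_size - len(batch))
        yield batch


batches = list(iter_fixed_batches(["a", "b", "c"], batch_size=2))
print(batches)  # [['a', 'b'], ['c', '']]
```

With a helper like this, the caller would need to discard the predictions that belong to filler entries.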

The [Hugging Face Inference Toolkit](https://github.com/aws/sagemaker-huggingface-inference-toolkit) supports zero-code deployments on top of the [pipeline feature](https://huggingface.co/transformers/main_classes/pipelines.html) from 🤗 Transformers. This allows users to deploy Hugging Face transformers without an inference script [[Example](https://github.com/huggingface/notebooks/blob/master/sagemaker/11_deploy_model_from_hf_hub/deploy_transformer_model_from_hf_hub.ipynb)].

Currently, this feature is not supported with AWS Inferentia2, which means we need to provide an `inference.py` script for running inference.

In our `inference.py` script, we override `model_fn` to load our Neuron model and `predict_fn` to tokenize the inputs and run text classification.

If you want to know more about the `inference.py` script check out this **[example](https://github.com/huggingface/notebooks/blob/master/sagemaker/17_custom_inference_script/sagemaker-notebook.ipynb)**. It explains amongst other things what `model_fn` and `predict_fn` are.
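
A `predict_fn` for text classification typically ends by turning the model's logits into a label and a score. That post-processing step can be illustrated without any Neuron hardware. The following pure-Python sketch applies a softmax and picks the highest-scoring label; the logits and the `id2label` mapping are made-up values for illustration only.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits: List[float], id2label: Dict[int, str]) -> Dict[str, object]:
    """Return the top label and its probability."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return {"label": id2label[best], "score": scores[best]}


# Hypothetical values, not taken from the trained model
id2label = {0: "LABEL_0", 1: "LABEL_1", 2: "LABEL_2"}
prediction = to_prediction([0.1, 2.3, -1.0], id2label)
print(prediction["label"])  # LABEL_1
```

Returning plain Python types such as `str` and `float` keeps the response JSON serializable.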



In [None]:
!mkdir code

In [None]:
%%writefile code/inference.py
import os
from transformers import AutoConfig, AutoTokenizer
import torch
import torch_neuronx

# saved weights name
TRACED_WEIGHTS_NAME = "neuron_model.pt"
TRACED_SEQUENCE_LENGTH = 128
os.environ["NEURON_RT_NUM_CORES"] = "1"


def model_fn(model_dir):
    # load tokenizer and neuron model from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)

    return model, tokenizer, model_config


def predict_fn(data, model_tokenizer_model_config):
    # unpack model, tokenizer and model config
    model, tokenizer, model_config = model_tokenizer_model_config

    # tokenize inputs and pad them to the traced sequence length
    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=TRACED_SEQUENCE_LENGTH,
        padding="max_length",
        truncation=True,
    )
    # convert to tuple for neuron model
    neuron_inputs = tuple(embeddings.values())

    # run prediction
    with torch.no_grad():
        predictions = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(predictions)

    # return a list of dictionaries, which is JSON serializable
    return [{"label": model_config.id2label[item.argmax().item()], "score": item.max().item()} for item in scores]


Before we can deploy our neuron model to Amazon SageMaker we need to create a `model.tar.gz` archive with all our model artifacts and inference script. We can do this with the `tar` command.


In [None]:
# copy inference.py into the code/ directory of the model directory.
!cp -r code/ tmp/code/
# create a model.tar.gz archive with all the model artifacts and the inference.py script.
%cd tmp
!tar zcvf model.tar.gz *
%cd ..
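
The same kind of archive can be built from Python with the standard `tarfile` module, which also makes the expected layout explicit: model artifacts at the top level and the inference script under `code/`. The sketch below uses placeholder files in a temporary directory purely for illustration; `build_model_archive` is a hypothetical helper.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(model_dir: Path, archive_path: Path) -> Path:
    """Pack everything in model_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for item in sorted(model_dir.iterdir()):
            # arcname keeps paths relative, so files sit at the archive root
            tar.add(item, arcname=item.name)
    return archive_path


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "model"
    (model_dir / "code").mkdir(parents=True)
    # Placeholder files standing in for the real artifacts
    (model_dir / "neuron_model.pt").write_bytes(b"placeholder")
    (model_dir / "config.json").write_text("{}")
    (model_dir / "code" / "inference.py").write_text("# inference script\n")

    archive = build_model_archive(model_dir, Path(tmp) / "model.tar.gz")
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())
    print(members)
```

Writing the archive outside the directory being packed avoids accidentally including the archive in itself.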

Now we can upload our `model.tar.gz` to our session S3 bucket with the SageMaker SDK.

In [None]:
from sagemaker.s3 import S3Uploader
import os 
os.environ["AWS_DEFAULT_REGION"] = "us-east-2"
import sagemaker

sess = sagemaker.Session()
# create s3 uri
s3_model_path = f"s3://{sess.default_bucket()}/startup-loft/compiled-models"

# upload model.tar.gz
s3_model_uri = S3Uploader.upload(local_path="tmp/model.tar.gz",desired_s3_uri=s3_model_path)
print(f"model artifacts uploaded to {s3_model_uri}")

After we have uploaded our `model.tar.gz` to Amazon S3, we can create a custom `HuggingFaceModel`. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.

When we create the endpoint, SageMaker automatically provisions the specified inference instances and deploys our model to them. We can then send inference requests to the endpoint and receive predictions from our model. We use the `deploy()` method of our `HuggingFaceModel`, passing in the desired number of instances and the instance type.

In [None]:
inferentia2_image_uri="763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-inference-neuronx:1.13.0-transformers4.28.1-neuronx-py38-sdk2.9.1-ubuntu20.04-v1.0"

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data=s3_model_uri,       # path to your model and script
   role=role,                    # iam role with permissions to create an Endpoint
   transformers_version="4.28.1", # transformers version used (matches the image above)
   image_uri=inferentia2_image_uri,
   py_version='py38',            # python version used (matches the image above)
   model_server_workers=2,
)

# Let SageMaker know that we've already compiled the model via neuronx-cc
huggingface_model._is_compiled_model = True

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type="ml.inf2.xlarge" # AWS Inferentia2 Instance
)

Inferentia2 is the second-generation, purpose-built machine learning inference accelerator from AWS. The Inferentia2 device architecture is depicted below:

![assets](./assets/inferentia2.jpeg)

Then, we use the returned predictor object to call the endpoint.

In [None]:
sentiment_input = {"inputs":"I love using the new Inferentia2 instance on Amazon SageMaker."}

predictor.predict(sentiment_input)

In [None]:
from sagemaker.huggingface.model import HuggingFacePredictor

predictor = HuggingFacePredictor(endpoint_name="huggingface-pytorch-inference-neuronx-m-2023-05-10-14-09-36-983")

We managed to deploy our Neuron-compiled RoBERTa model to AWS Inferentia2 on Amazon SageMaker. Now, let's test its performance. As a dummy load test, we will loop and send 5,000 requests to our endpoint and inspect the performance in CloudWatch.



In [11]:
print(f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{predictor.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{predictor.endpoint_name}")

https://console.aws.amazon.com/cloudwatch/home?region=us-east-2#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'huggingface-pytorch-inference-neuronx-m-2023-05-10-14-09-36-983~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'us-east-2~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20huggingface-pytorch-inference-neuronx-m-2023-05-10-14-09-36-983


In [9]:
from tqdm import tqdm

total_requests = 5_000  # 5k requests
for i in tqdm(range(total_requests)):
    predictor.predict(sentiment_input)


100%|██████████| 5000/5000 [06:59<00:00, 11.92it/s]


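The average model latency for our RoBERTa model is 1-2 ms for a sequence length of 128. The roughly 12 requests per second reported by the sequential loop above is lower than this latency alone would suggest, because each request also includes the network round trip from the client.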
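
CloudWatch reports server-side model latency, while a client-side loop measures end-to-end time. To look at the client-side distribution as well, one could time each request and summarise the percentiles. The sketch below does this with a stand-in function instead of a real endpoint call; `fake_predict`, `measure_latencies`, and `summarize` are hypothetical names, and in practice you would pass a function that calls `predictor.predict(sentiment_input)`.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latencies(call: Callable[[], object], n_requests: int) -> List[float]:
    """Call `call` n_requests times and return per-request latency in milliseconds."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return mean, median and 95th percentile of the measured latencies."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p95_ms": cuts[94],
    }


def fake_predict() -> dict:
    """Stand-in for a real endpoint call; sleeps for about 1 ms."""
    time.sleep(0.001)
    return {"label": "placeholder", "score": 1.0}


stats = summarize(measure_latencies(fake_predict, n_requests=50))
print(stats)
```

Percentiles such as p95 are often more informative than the average, because a few slow requests can hide behind a low mean.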



Finally, clean up by deleting the model and the running endpoint.

In [10]:
predictor.delete_model()
predictor.delete_endpoint()