# Run training on Amazon SageMaker

This notebook is a code example showing how to train a text classifier for content moderation on AWS, first in a non-distributed manner, and then how to extend it to distributed fine-tuning.

# Installation

Since I am using SageMaker notebooks, my kernel already has PyTorch installed. All I have to do is install sagemaker and transformers, plus datasets to download the dataset from the Hugging Face Hub.

In [13]:
!pip install "sagemaker>=2.77.0" "transformers==4.12.3" "datasets[s3]==1.18.3" --upgrade

# Permissions

We need a SageMaker session. The session handles the creation of a default bucket to store data, models, and logs.

As I am working in a SageMaker notebook, I can use the get_execution_role() function to retrieve the IAM role.

In [2]:
import sagemaker

sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

# Loading and preprocessing data

I chose the tweets_hate_speech_detection dataset from the Hugging Face Hub, as it is very close to a content moderation task.
The labels in this dataset are either 0 (no hate speech) or 1 (hate speech), so this is a binary rather than a multilabel text classification problem. The AWS code stays the same in both cases, apart from one detail in the training script when downloading the model with .from_pretrained: for a multilabel task you should add the num_labels parameter and two dictionaries mapping labels to ids and ids to labels.
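
As a minimal sketch of those two dictionaries, the helper below builds them from a list of label names. The label names and the `build_label_maps` helper are my own placeholders for illustration, and since the training script is not shown here, the `from_pretrained` call is only given as a comment.

```python
# Hypothetical label names, chosen only for illustration.
label_names = ["no_hate_speech", "hate_speech"]


def build_label_maps(names):
    """Return (id2label, label2id) dictionaries for a list of label names."""
    id2label = {idx: name for idx, name in enumerate(names)}
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


id2label, label2id = build_label_maps(label_names)

# In the training script, these could be passed when loading the model, e.g.:
# AutoModelForSequenceClassification.from_pretrained(
#     model_name,
#     num_labels=len(label_names),
#     id2label=id2label,
#     label2id=label2id,
# )
```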

I split the data into train and test splits.

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer

# load dataset
dataset = load_dataset("tweets_hate_speech_detection")
# split the data into train and test
dataset = dataset["train"].train_test_split(test_size=0.2)


The preprocessing here is just a simple tokenization of the tweets.

We truncate and pad every tweet to the maximum length.
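
To make the effect of truncation and padding concrete, here is a small pure-Python sketch of the idea. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer also takes care of special tokens when it truncates.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, then right-pad with pad_id."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# A sequence shorter than max_length gets padded
short_example = pad_and_truncate([11, 22, 33], max_length=5)
# A sequence longer than max_length gets cut
long_example = pad_and_truncate([11, 22, 33, 44, 55, 66, 77], max_length=5)
```

Either way, every example ends up with the same length, which is what lets the examples be stacked into fixed-size tensors.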


In [4]:
# load tokenizer
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# create tokenization function
def tokenize(batch):
    return tokenizer(batch["tweet"], padding="max_length", truncation=True)

# tokenize train and test datasets
tokenized_dataset = dataset.map(tokenize, batched=True)


In [5]:
# rename the "label" column to "labels", the keyword the model expects
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

# set dataset format for PyTorch
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Upload data to S3

Next, we upload the tokenized train and test data to two different folders in the S3 bucket created earlier.

In [6]:
import botocore
from datasets.filesystems import S3FileSystem

s3_prefix = 'samples/datasets/tweetsspeech'
s3 = S3FileSystem()

# save train_dataset to S3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
tokenized_dataset["train"].save_to_disk(training_input_path,fs=s3)

# save test_dataset to S3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
tokenized_dataset["test"].save_to_disk(test_input_path,fs=s3)

# Fine-tuning with Sagemaker

Now we create the HuggingFace Estimator and give it:
* the fine-tuning script, through the entry_point parameter
* the SageMaker instance type. To my knowledge, there are currently only GPU DLCs (Deep Learning Containers) for training Hugging Face models, so you should choose a GPU-based instance; otherwise you get an "unsupported CPU" value error. You can choose either p3 or g4 instances. In my case, I tried the least expensive instance in my region, which is ml.p3.2xlarge.
* optionally, the use_spot_instances parameter, to save costs with spot instances
* finally, the hyperparameters needed by our training script
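
The train.py script itself is not shown in this notebook. As a rough sketch of how such a script could receive these settings, SageMaker passes the hyperparameters as command-line arguments and exposes the data channels and model directory through environment variables such as SM_CHANNEL_TRAIN, SM_CHANNEL_TEST and SM_MODEL_DIR. The `parse_args` function, the argument names for the directories, and the local fallback paths below are my assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # hyperparameters from the estimator arrive as command-line arguments
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
    # locations provided by SageMaker; the fallbacks are placeholder local paths
    parser.add_argument("--model_dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--training_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN", "data/train"))
    parser.add_argument("--test_dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST", "data/test"))
    # ignore any extra arguments the platform may add
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(["--epochs", "3", "--train_batch_size", "16"])
```

The channel names in the environment variables match the keys ("train" and "test") of the dictionary passed to fit().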


In [11]:
from sagemaker.huggingface import HuggingFace

hyperparameters = {
    "epochs": 1,                              # number of training epochs
    "train_batch_size": 32,                   # training batch size
    "eval_batch_size": 64,                    # validation batch size
    "model_name": "distilbert-base-uncased",  # name of pretrained model
}

#Creating an estimator
huggingface_estimator = HuggingFace(
    entry_point="train.py",                 # fine-tuning script to use in training job
    source_dir="./",                 # directory where fine-tuning script is stored
    instance_type="ml.p3.2xlarge", # instance type
#     checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
#     use_spot_instances=True,
#     max_wait=3600,
#     max_run=1000,
    instance_count=1,                       # number of instances
    role=role,                              # IAM role used in training job to access AWS resources (S3)
    transformers_version="4.6",             # Transformers version
    pytorch_version="1.7",                  # PyTorch version
    py_version="py36",                      # Python version
    hyperparameters=hyperparameters         # hyperparameters to use in training job
)
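
The spot settings are commented out in the cell above. If you want to try them, one way to keep them consistent is a small helper like the sketch below. The `spot_training_kwargs` function and the bucket name are hypothetical; the check reflects the requirement that max_wait must not be smaller than max_run.

```python
def spot_training_kwargs(checkpoint_s3_uri, max_run=1000, max_wait=3600):
    """Build the spot-related keyword arguments for the estimator."""
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,            # maximum training time in seconds
        "max_wait": max_wait,          # maximum time, including waiting for spot capacity
        "checkpoint_s3_uri": checkpoint_s3_uri,  # where checkpoints are synced
    }


# "my-bucket" is a placeholder bucket name
spot_kwargs = spot_training_kwargs("s3://my-bucket/checkpoints")
# These could then be unpacked into the estimator: HuggingFace(..., **spot_kwargs)
```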

We launch training with the estimator's fit() function.

I hit a quota limit because I am using a Free Tier AWS account, which doesn't have access to GPU-based instances. I requested a quota increase, but I am still waiting for the support team to get back to me.

If you are using a company account, you probably already have some resources available in your region and won't run into this problem.

In [12]:
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-02-26-13-18-39-985


ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.2xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

# Question 2:
### Can you extend your example from 1) to a distributed setup?

SageMaker provides two strategies for distributed training:
* data parallelism
* model parallelism

#### Data parallelism

If you want to do data parallelism, skip the two previous cells and run the next one instead.
As we are using the Trainer API in our fine-tuning script, we only need to define the distribution parameter in the HuggingFace Estimator:

In [None]:
from sagemaker.huggingface import HuggingFace

hyperparameters = {
    "epochs": 1,                              # number of training epochs
    "train_batch_size": 32,                   # training batch size
    "eval_batch_size": 64,                    # validation batch size
    "model_name": "distilbert-base-uncased",  # name of pretrained model
}

# configuration for running training on smdistributed data parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

#Creating an estimator
huggingface_estimator = HuggingFace(
    entry_point="train.py",                 # fine-tuning script to use in training job
    source_dir="./",                 # directory where fine-tuning script is stored
    instance_type="ml.p3dn.24xlarge", # instance type
    instance_count=2,                       # number of instances
    role=role,                              # IAM role used in training job to access AWS resources (S3)
    transformers_version="4.6",             # Transformers version
    pytorch_version="1.7",                  # PyTorch version
    py_version="py36",                      # Python version
    hyperparameters=hyperparameters,        # hyperparameters to use in training job
    distribution=distribution,              # enable smdistributed data parallel
)

huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
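
One thing to keep in mind with data parallelism is that every GPU processes its own batch, so the effective batch size grows with the number of GPUs. The sketch below assumes that train.py forwards train_batch_size as a per-device batch size (the script is not shown here, so this is an assumption); `global_batch_size` is a hypothetical helper. An ml.p3.2xlarge has 1 GPU and an ml.p3dn.24xlarge has 8.

```python
def global_batch_size(per_device_batch_size, gpus_per_instance, instance_count):
    """Number of examples processed per optimizer step across all GPUs."""
    return per_device_batch_size * gpus_per_instance * instance_count


# single ml.p3.2xlarge (1 GPU), as in the non-distributed setup
single = global_batch_size(32, gpus_per_instance=1, instance_count=1)
# two ml.p3dn.24xlarge (8 GPUs each), as in the data-parallel setup
distributed = global_batch_size(32, gpus_per_instance=8, instance_count=2)
```

With the larger effective batch size, you may want to revisit the learning rate or reduce the per-device batch size.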

#### Model parallelism

To enable model parallelism, a small change is needed in the training script.
Swap the Trainer API for the SageMakerTrainer, which uses the model parallelism library, with these imports in the training script:

```python
from transformers.sagemaker import SageMakerTrainingArguments as TrainingArguments
from transformers.sagemaker import SageMakerTrainer as Trainer
```

Now we only need to define the distribution parameter in the Hugging Face Estimator:

In [None]:
from sagemaker.huggingface import HuggingFace

hyperparameters = {
    "epochs": 1,                              # number of training epochs
    "train_batch_size": 32,                   # training batch size
    "eval_batch_size": 64,                    # validation batch size
    "model_name": "distilbert-base-uncased",  # name of pretrained model
}

# configuration for running training on smdistributed model parallel
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 4,
        "ddp": True,
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

#Creating an estimator
huggingface_estimator = HuggingFace(
    entry_point="train.py",                 # fine-tuning script to use in training job
    source_dir="./",                 # directory where fine-tuning script is stored
    instance_type="ml.p3dn.24xlarge", # instance type
    instance_count=2,                       # number of instances
    role=role,                              # IAM role used in training job to access AWS resources (S3)
    transformers_version="4.6",             # Transformers version
    pytorch_version="1.7",                  # PyTorch version
    py_version="py36",                      # Python version
    hyperparameters=hyperparameters,        # hyperparameters to use in training job
    distribution=distribution,              # enable smdistributed model parallel
)

huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
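
To get a feel for what this configuration means, the sketch below works out the layout implied by the numbers above. It reflects my understanding that, with ddp enabled, the processes left over after partitioning the model are used as data-parallel replicas, and that the batch is split evenly into microbatches. `model_parallel_layout` is a hypothetical helper, not part of the SageMaker library.

```python
def model_parallel_layout(processes_per_host, instance_count, partitions):
    """Derive the total process count and number of model replicas."""
    world_size = processes_per_host * instance_count
    if world_size % partitions != 0:
        raise ValueError("total number of processes must be divisible by partitions")
    return {
        "world_size": world_size,
        "model_replicas": world_size // partitions,
    }


# values taken from mpi_options, smp_options and the estimator above
layout = model_parallel_layout(processes_per_host=8, instance_count=2, partitions=4)

# the batch should split evenly into microbatches
microbatches = 4
train_batch_size = 32
splits_evenly = train_batch_size % microbatches == 0
```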

# Deploying the endpoint

Deploy the fine-tuned model by calling deploy() on your estimator, then try to predict on some data.

In [None]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

In [None]:
sentiment_input = {"inputs": "put your sentence here to test the hate speech detection model"}

predictor.predict(sentiment_input)
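
As I could not run the training job, I have no real prediction to show. The sketch below only illustrates how a response could be turned into a readable answer, assuming the endpoint returns the usual text-classification format (a list of label/score dictionaries) and that the model kept the default LABEL_0/LABEL_1 names. The example response is made up and is not an output of the endpoint; `interpret` and `LABEL_MEANING` are hypothetical names.

```python
# Made-up response for illustration only, NOT a real result from the endpoint.
example_response = [{"label": "LABEL_1", "score": 0.91}]

# Mapping from default label names to the dataset's meaning (0 / 1)
LABEL_MEANING = {"LABEL_0": "no hate speech", "LABEL_1": "hate speech"}


def interpret(response):
    """Return the readable label and score of the highest-scoring entry."""
    top = max(response, key=lambda item: item["score"])
    return LABEL_MEANING.get(top["label"], top["label"]), top["score"]


label, score = interpret(example_response)
```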

Delete the endpoint and save costs :)

In [None]:
predictor.delete_endpoint()