# Sentiment Analysis App with Huggingface Distributed GPU Training and PyTorch Lightning

In this notebook, we reimplement the work from the SageMaker Project notebook, this time with a pretrained language model from Huggingface and with PyTorch Lightning for training. The motivation is that the model trained in the SageMaker notebook was an LSTM, an architecture that transformer-based models have largely superseded for language tasks.

What you will learn in this notebook:

* How to do distributed GPU training with a Huggingface model in SageMaker
* How to tokenize a text dataset and store it in S3
* How to deploy and test the trained model
* How to use spot instances to train the model

References: https://github.com/huggingface/notebooks/blob/master/sagemaker/05_spot_instances/sagemaker-notebook.ipynb

## Development Environment and Permissions

### Installation

In [1]:
!pip install "sagemaker>=2.48.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade


### Development environment

Upgrade `ipywidgets` for the `datasets` library and restart the kernel. This is only needed when preprocessing is done in the notebook.

In [2]:
%%capture
import IPython
!conda install -c conda-forge ipywidgets -y
IPython.Application.instance().kernel.do_shutdown(True) # has to restart kernel so changes are used

In [1]:
import sagemaker
from sagemaker.huggingface import HuggingFace
sagemaker.__version__

'2.53.0'

### Permissions

In [2]:
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

sagemaker role arn: arn:aws:iam::034262493329:role/service-role/AmazonSageMaker-ExecutionRole-20210810T151351
sagemaker bucket: sagemaker-us-east-1-034262493329
sagemaker session region: us-east-1


## Tokenization

This section is unnecessary if we simply want to train the model on IMDB, but I've added the steps here for future reference in case we want to use a dataset or task that Huggingface does not readily provide. In other words, if you only want to train on IMDB (or a similar HF-provided dataset), you can skip to the training part. For custom datasets, you will need to do the tokenization before training and may need to modify the training code a bit.

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer

# tokenizer used in preprocessing
tokenizer_name = 'distilbert-base-uncased'

# dataset used
dataset_name = 'imdb'

# s3 key prefix for the data
s3_prefix = 'samples/dataset/imdb'

In [5]:
# download tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

# load dataset
train_dataset, test_dataset = load_dataset(dataset_name, split=['train', 'test'])
test_dataset = test_dataset.shuffle().select(range(10000))

# tokenize dataset
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set format for PyTorch
train_dataset = train_dataset.rename_column('label', 'labels')
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset = test_dataset.rename_column('label', 'labels')
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Reusing dataset imdb (/home/ec2-user/.cache/huggingface/datasets/imdb/plain_text/1.0.0/4ea52f2e58a08dbc12c2bd52d0d92b30b88c00230b4522801b3636782f625c5b)
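
To make the effect of `padding='max_length'` and `truncation=True` more concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the helper `pad_or_truncate`, the pad id of 0, the example token ids, and the tiny `max_length` are all assumptions chosen for illustration, and a real tokenizer also takes care of special tokens when it truncates.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Toy illustration: force a list of token ids to exactly max_length entries."""
    kept = token_ids[:max_length]        # truncation: drop everything past max_length
    n_pad = max_length - len(kept)       # padding: number of pad tokens still missing
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short_example = pad_or_truncate([101, 2023, 102], max_length=5)
long_example = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_example)
print(long_example)
```

Every example ends up with the same length, which is what lets the dataset be converted to fixed-shape PyTorch tensors.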


## Uploading data to sagemaker_session_bucket

After processing the datasets with the tokenizer, we use the `datasets` filesystem integration (`S3FileSystem`) to upload them to S3.

In [6]:
import botocore
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to s3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path, fs=s3)
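
The two S3 paths built above are what a custom training script later receives as input channels. A minimal sketch of that layout, assuming a hypothetical bucket name (the notebook itself uses `sess.default_bucket()`) and a hypothetical helper `s3_dataset_uri`:

```python
def s3_dataset_uri(bucket: str, prefix: str, split: str) -> str:
    """Build the S3 URI for one dataset split, using the same layout as the cell above."""
    return f"s3://{bucket}/{prefix.strip('/')}/{split}"


bucket = "my-example-bucket"       # hypothetical; the notebook uses sess.default_bucket()
prefix = "samples/dataset/imdb"

# Each channel name is exposed inside the training container as SM_CHANNEL_<NAME>,
# e.g. the 'train' channel becomes the SM_CHANNEL_TRAIN environment variable.
channels = {split: s3_dataset_uri(bucket, prefix, split) for split in ("train", "test")}
print(channels)
```

A dictionary of this shape is what gets passed to `fit` when training with `train.py`.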

## Fine-tuning & Starting Training Job

At this point we can diverge in a few different directions. 

We can train the model on a single GPU or do distributed GPU training; we'll be doing distributed GPU training.
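
To get a feel for what data-parallel training changes, it helps to work out the effective batch size. The sketch below uses the settings from the training cell later in this notebook (2 `ml.p3.8xlarge` instances with 4 GPUs each, a per-device batch size of 12) and the 25,000-example IMDB training split. The helper names are illustrative, and the arithmetic assumes no gradient accumulation.

```python
import math


def global_batch_size(per_device_batch_size: int, gpus_per_instance: int, instance_count: int) -> int:
    """Each GPU processes its own batch, so the effective batch is the product of all three."""
    return per_device_batch_size * gpus_per_instance * instance_count


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


distributed_batch = global_batch_size(per_device_batch_size=12, gpus_per_instance=4, instance_count=2)
single_gpu_batch = global_batch_size(per_device_batch_size=12, gpus_per_instance=1, instance_count=1)

print(distributed_batch, steps_per_epoch(25_000, distributed_batch))
print(single_gpu_batch, steps_per_epoch(25_000, single_gpu_batch))
```

With 8 GPUs, each epoch needs roughly an eighth of the steps a single GPU would, at the cost of a larger effective batch size.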

We can create our own custom `train.py` script to train our model, or we can use one of the premade training scripts provided by Huggingface. We'll be using the `run_glue.py` script, since it saves us the time of writing our own.

Before we continue, we'll write out the boilerplate `train.py` script from the reference notebook in case we want to use it in the future.

In [7]:
%%writefile ./scripts/train.py

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from datasets import load_from_disk
import logging
import sys
import argparse
import os

# Set up logging
logger = logging.getLogger(__name__)

logging.basicConfig(
    level=logging.getLevelName("INFO"),
    handlers=[logging.StreamHandler(sys.stdout)],
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

if __name__ == "__main__":

    logger.info(sys.argv)

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train-batch-size", type=int, default=32)
    parser.add_argument("--eval-batch-size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--model_name", type=str)
    parser.add_argument("--learning_rate", type=str, default=5e-5)
    parser.add_argument("--output_dir", type=str)

    # Data, model, and output directories
    parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

    # load datasets
    train_dataset = load_from_disk(args.training_dir)
    test_dataset = load_from_disk(args.test_dir)

    logger.info(f" loaded train_dataset length is: {len(train_dataset)}")
    logger.info(f" loaded test_dataset length is: {len(test_dataset)}")

    # compute metrics function for binary classification
    def compute_metrics(pred):
        labels = pred.label_ids
        preds = pred.predictions.argmax(-1)
        precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
        acc = accuracy_score(labels, preds)
        return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

    # download model from model hub
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name)

    # define training args
    training_args = TrainingArguments(
        output_dir=args.output_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        warmup_steps=args.warmup_steps,
        evaluation_strategy="epoch",
        logging_dir=f"{args.output_data_dir}/logs",
        learning_rate=float(args.learning_rate),
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
    )

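    # train model, resuming from the most recent checkpoint in output_dir if one exists
    last_checkpoint = get_last_checkpoint(args.output_dir)
    if last_checkpoint is not None:
        logger.info("***** continue training *****")
        trainer.train(resume_from_checkpoint=last_checkpoint)
    else:
        trainer.train()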
    # evaluate model
    eval_result = trainer.evaluate(eval_dataset=test_dataset)

    # writes eval result to file which can be accessed later in s3 output
    with open(os.path.join(args.output_data_dir, "eval_results.txt"), "w") as writer:
        print("***** Eval results *****")
        for key, value in sorted(eval_result.items()):
            writer.write(f"{key} = {value}\n")

    # Saves the model to s3
    trainer.save_model(args.model_dir)

Writing ./scripts/train.py
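
To see what `compute_metrics` in the script reports, the binary precision, recall, and F1 can be worked out by hand. The following is a dependency-free sketch; the script itself relies on scikit-learn, and the helper name `binary_metrics` and the toy labels are made up for illustration.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)   # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)   # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)   # false negatives
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1, "precision": precision, "recall": recall}


toy_labels = [1, 0, 1, 1, 0, 0]
toy_preds = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(toy_labels, toy_preds)
print(metrics)
```

In this toy case there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.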


## Creating an Estimator and starting the training job

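If you want to use the `train.py` file or your own custom training file, you need to make the following changes in the code below:

* Remove `dataset_name` from the hyperparameters
* Rename the remaining hyperparameters to the flags your script defines (for `train.py`, for example, `model_name` instead of `model_name_or_path` and `epochs` instead of `num_train_epochs`)
* Replace the `entry_point` of `run_glue.py` with `train.py`
* Replace the `source_dir` of `./examples/pytorch/text-classification` with `./scripts`
* Remove `git_config`, since `source_dir` is otherwise resolved inside the cloned repository
* Pass the S3 dataset paths to `fit`: `{'train': training_input_path, 'test': test_input_path}`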
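
One detail worth checking when switching scripts is that the hyperparameter keys match the flags the script actually parses: `train.py` defines its own names, such as `--model_name` and `--epochs`. The sketch below rebuilds a small subset of the `train.py` parser and feeds it hyperparameters the way they arrive as command-line arguments. The helper `to_argv` and the chosen values are illustrative assumptions, not SageMaker code.

```python
import argparse


def to_argv(hyperparameters: dict) -> list:
    """Flatten a dict into ['--key', 'value', ...], the form in which a script receives it."""
    argv = []
    for key, value in hyperparameters.items():
        argv += [f"--{key}", str(value)]
    return argv


# A subset of the flags that train.py defines
parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--train-batch-size", type=int, default=32)
parser.add_argument("--model_name", type=str)
parser.add_argument("--output_dir", type=str)

# Keys renamed to match train.py
good = {"model_name": "distilbert-base-uncased", "epochs": 5, "train-batch-size": 12}
good_args, good_unknown = parser.parse_known_args(to_argv(good))

# Keys left as they were for run_glue.py: parse_known_args does not complain,
# it just sets them aside, so model_name stays None and epochs keeps its default.
stale = {"model_name_or_path": "distilbert-base-uncased", "num_train_epochs": 5}
stale_args, stale_unknown = parser.parse_known_args(to_argv(stale))

print(good_args.model_name, good_args.epochs, good_args.train_batch_size)
print(stale_args.model_name, stale_args.epochs, stale_unknown)
```

Because `parse_known_args` ignores unknown flags silently, a mismatched key does not fail at parse time; it only surfaces later, for example when the model name turns out to be `None`.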


### Attach an old training job to an estimator

Before we start training, if you have a previously trained model that you would like to use to continue training, get results, deploy, etc., you can use the following code to grab it by its training job name:

In [None]:
from sagemaker.estimator import Estimator

# job which is going to be attached to the estimator
old_training_job_name = '' # should be something like huggingface-training-2021-02-04-16-47-39-189

In [None]:
# attach old training job
huggingface_estimator_loaded = Estimator.attach(old_training_job_name)

# get model output s3 from training job
huggingface_estimator_loaded.model_data
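
The `model_data` attribute is an S3 URI pointing at the model archive. If you need the bucket and key separately (for example, to use them with another S3 client), a small helper like the one below can split it. The helper `split_s3_uri` and the example URI are hypothetical; the real value comes from the attached estimator.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical value for illustration only
example_model_data = "s3://my-example-bucket/my-training-job/output/model.tar.gz"
bucket_name, object_key = split_s3_uri(example_model_data)
print(bucket_name)
print(object_key)
```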

### Training the model

In [None]:
# hyperparameters passed to the training script as command-line arguments
hyperparameters = {
    'model_name_or_path':'distilbert-base-uncased',
    'output_dir':'/opt/ml/checkpoints', # local checkpoint directory that SageMaker syncs to checkpoint_s3_uri
    'dataset_name': 'imdb', # remove if using custom training script
    'do_train': True,
    'do_eval': True,
    'per_device_train_batch_size': 12,
    'num_train_epochs': 5,
    'max_seq_length': 128,
    'fp16': True,
    'pad_to_max_length': True,
}

# s3 uri where our checkpoints will be uploaded during training
job_name = 'using-spot'
checkpoint_s3_uri = f's3://{sess.default_bucket()}/{job_name}/checkpoints'

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

# configuration for running training on smdistributed Data Parallel
# smdistributed = SageMaker Distributed
distribution = {'smdistributed': {'dataparallel':{'enabled': True}}}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_glue.py', # can be replaced with train.py for custom script
    source_dir='./examples/pytorch/text-classification', # can be replaced with local scripts directory ('./scripts')
    instance_type='ml.p3.8xlarge', # has 4 GPUs
    instance_count=2, # changed to 2 instances
    base_job_name=job_name,
    checkpoint_s3_uri=checkpoint_s3_uri,
    use_spot_instances=True,
    max_wait=3600, # This should be equal to or greater than max_run in seconds
    max_run=1000, # Expected max run in seconds (so that we don't end up using the instance for too long if there is an issue)
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    hyperparameters = hyperparameters
)

# starting the train job
huggingface_estimator.fit() # put this inside fit if using train.py: {'train': training_input_path, 'test': test_input_path}

2021-08-12 20:11:54 Starting - Starting the training job...
2021-08-12 20:11:56 Starting - Launching requested ML instances
ProfilerReport-1628799107: InProgress
...
2021-08-12 20:12:53 Starting - Insufficient capacity error from EC2 while launching instances, retrying!...
ProfilerReport-1628799107: Stopped
...
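
The "Insufficient capacity" retries in the log are the trade-off of spot training: time spent waiting for capacity counts toward `max_wait`, while `max_run` only bounds the training time itself. A small sanity check along those lines, using the values from the estimator above (the helper `spot_wait_budget` is a hypothetical name, not part of the SageMaker SDK):

```python
def spot_wait_budget(max_run: int, max_wait: int) -> int:
    """Seconds left for waiting on spot capacity if training uses its full max_run."""
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return max_wait - max_run


budget = spot_wait_budget(max_run=1000, max_wait=3600)
print(f"up to {budget} seconds can be spent waiting for spot capacity")
```

If jobs regularly run out of time while waiting, raising `max_wait` (or choosing an instance type with better spot availability) is the usual lever.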

In [None]:
!pip install transformers torch==1.6.0

In [None]:
import os
import tarfile
from sagemaker.s3 import S3Downloader

local_path = 'imdb_sentiment_distributed_transformer'

os.makedirs(local_path, exist_ok=True)

# download model from S3
S3Downloader.download(
    s3_uri=huggingface_estimator.model_data, # s3 uri where the trained model is located
    local_path=local_path, # local path where *.tar.gz will be saved
)

# unzip model
tar = tarfile.open(f'{local_path}/model.tar.gz', 'r:gz')
tar.extractall(path=local_path)
tar.close()
os.remove(f'{local_path}/model.tar.gz')
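
The extraction step can also be written with a context manager, so the archive is closed even if extraction fails. The sketch below is self-contained: it builds a throwaway archive in a temporary directory instead of using the real `model.tar.gz`, and the helper name `extract_archive` is an assumption for illustration.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path: str, target_dir: str) -> list:
    """Extract a .tar.gz archive into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:   # closed automatically on exit
        names = tar.getnames()
        tar.extractall(path=target_dir)
    return names


# Demo on a throwaway archive containing a single file
with tempfile.TemporaryDirectory() as tmp:
    source_file = os.path.join(tmp, "config.json")
    with open(source_file, "w") as f:
        f.write("{}")

    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source_file, arcname="config.json")

    extracted = extract_archive(archive, os.path.join(tmp, "model"))
    extracted_exists = os.path.exists(os.path.join(tmp, "model", "config.json"))

print(extracted, extracted_exists)
```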

In [None]:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

model=AutoModelForSequenceClassification.from_pretrained(local_path)
tokenizer=AutoTokenizer.from_pretrained(local_path)

clf = pipeline('text-classification', model=model, tokenizer=tokenizer)

In [None]:
review = 'The Dark Knight is an excellent film!'

In [None]:
clf(review)

## Deploying the endpoint

As usual, to deploy our model we simply need to call `.deploy` on our estimator object.

In [None]:
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

Let's test out our endpoint:

In [None]:
review = {"inputs": "The Dark Knight is an excellent film!"}

predictor.predict(review)
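
Text-classification predictions come back as a list of dictionaries with a `label` and a `score`. The sketch below shows one way to turn such a response into a readable sentiment. The example response, the `LABEL_0`/`LABEL_1` names, and the helper `top_label` are assumptions; the actual label names depend on how the model's config was saved.

```python
def top_label(predictions, label_names=None):
    """Return (label, score) for the highest-scoring prediction, renaming the label if a mapping is given."""
    label_names = label_names or {}
    best = max(predictions, key=lambda p: p["score"])
    return label_names.get(best["label"], best["label"]), best["score"]


# Hypothetical response for illustration; real labels and scores come from the endpoint
example_response = [{"label": "LABEL_1", "score": 0.98}]

# Assumed mapping, following IMDB's convention of 0 = negative, 1 = positive
imdb_names = {"LABEL_0": "negative", "LABEL_1": "positive"}

sentiment, confidence = top_label(example_response, imdb_names)
print(sentiment, confidence)
```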

## Estimator Parameters

In [None]:
# container image used for training job
print(f"container image used for training job: \n{huggingface_estimator.image_uri}\n")

# s3 uri where the trained model is located
print(f"s3 uri where the trained model is located: \n{huggingface_estimator.model_data}\n")

# latest training job name for this estimator
print(f"latest training job name for this estimator: \n{huggingface_estimator.latest_training_job.name}\n")

In [None]:
# access the logs of the training job
huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)

## Deleting the endpoint

In [None]:
predictor.delete_endpoint()