# Develop training and inference scripts for Script Mode

## Overview
In this notebook, we will learn how to develop training and inference scripts using the HuggingFace framework. We will use SageMaker pre-built containers for HuggingFace (with a PyTorch backend).

We chose to solve a typical NLP task: text classification. We will use the `20 Newsgroups` dataset, which contains roughly 20,000 newsgroup documents spread across 20 different newsgroups (categories).

By the end of this notebook you will learn how to:
- prepare a text corpus for training and inference using Amazon SageMaker;
- develop a training script to run in a pre-built HuggingFace container;
- configure and schedule a training job;
- develop inference code;
- configure and deploy a real-time inference endpoint;
- test the SageMaker endpoint.

**Estimated time to review and complete this sample**: ~30-40 mins.

### Notes on environment
Please note that this notebook was tested on a SageMaker Notebook instance with the latest PyTorch dependencies installed (conda environment `conda_pytorch_latest_p38`). If you are using a different environment, make sure to install the following Python dependencies with pip or conda:
- `scikit-learn`
- `sagemaker`

You can install all required packages for this example by running the `pip` command below.

In [None]:
! pip install -r requirements.txt


### Selecting Model Architecture
In this example, we train a model that assigns a newsgroup article to one of the selected categories based on its content.

A number of model architectures can address this task. Current state-of-the-art (SOTA) models are usually based on the Transformer architecture. Encoder models trained with a masked language modeling objective, such as BERT and its various derivatives, are well suited to this task. We will use a concept known as `Transfer learning`, where a model pre-trained on one task is reused for a new task with minimal modifications.

As a baseline, we will use the model architecture known as `DistilBERT`, which provides high accuracy on a wide variety of tasks and is considerably smaller than other models (for instance, the original BERT model). To adapt the model for classification, we need to add a classification layer, which is trained during fine-tuning to recognize article categories.

![title](static/finetuning.png)

`HuggingFace Transformers` simplifies model selection and modification for fine-tuning:
- it provides a rich model zoo with a number of pre-trained models and tokenizers;
- it has a simple model API to adapt a baseline model for fine-tuning on a specific task;
- it implements inference pipelines that combine data preprocessing and the actual inference.

### Selecting SageMaker Training Containers

Amazon SageMaker supports the HuggingFace Transformers framework for inference and training. Hence, we won't need to develop a custom container. Instead, we will use the `Script Mode` feature to provide our custom training and inference scripts and execute them in pre-built containers. In this example, we will build intuition for how to develop these scripts.

## Preparing Dataset
First off, we need to acquire the `20 Newsgroups` dataset. For this, we can use a utility from the `sklearn` module. To shorten the training cycle, let's choose 6 newsgroup categories (the original dataset contains 20). The datasets will be loaded into memory.

In [None]:
from sklearn.datasets import fetch_20newsgroups

# We select 6 out of 20 diverse newsgroups
categories = [
    "comp.windows.x",
    "rec.autos",
    "sci.electronics",
    "misc.forsale",
    "talk.politics.misc",
    "alt.atheism"
]

train_dataset = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42
                                 )
test_dataset = fetch_20newsgroups(subset='test',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42
                                 )

n=6 # arbitrary sample index
print(f"Number of training samples: {len(train_dataset['data'])}")
print(f"Number of test samples: {len(test_dataset['data'])}")

print(f"\n=========== Sample article for category {train_dataset['target'][n]} ============== \n")
print(f"{train_dataset['data'][n]}")
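
Before training, it can help to check how the samples are spread across the selected categories. The sketch below works on a small stand-in for the `target` and `target_names` fields of the object returned by `fetch_20newsgroups`; the helper name `count_per_category` and the toy values are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def count_per_category(targets, target_names):
    """Return {category name: number of samples} for integer-encoded targets."""
    counts = Counter(int(t) for t in targets)
    return {target_names[idx]: counts.get(idx, 0) for idx in range(len(target_names))}


# Toy stand-in data (assumption); in the notebook you would pass
# train_dataset["target"] and train_dataset["target_names"] instead.
toy_names = ["alt.atheism", "rec.autos", "sci.electronics"]
toy_targets = [0, 2, 2, 1, 2, 0]

distribution = count_per_category(toy_targets, toy_names)
print(distribution)
```

A strongly skewed distribution would suggest looking at per-class metrics rather than overall accuracy alone.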


Now, we need to save the selected datasets into files and upload the resulting files to Amazon S3.
SageMaker will download them to the training container at training time.

In [None]:
import csv

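# Pair each output file with its own split so the test file does not end up holding training data
for file, dataset in [('train_dataset.csv', train_dataset), ('test_dataset.csv', test_dataset)]:
    # newline='' lets the csv module handle line breaks inside quoted article text
    with open(file, 'w', newline='', encoding='utf-8') as f:
        w = csv.DictWriter(f, ['data', 'category_id'])
        w.writeheader()
        for text, target in zip(dataset["data"], dataset["target"]):
            w.writerow({"data": text, "category_id": target})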
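
Newsgroup articles contain commas, quotes, and line breaks, so it is worth confirming that a file with the `data` and `category_id` columns reads back intact. The following is a minimal round-trip sketch on made-up rows; the helper `read_dataset` is hypothetical, and the actual `train.py` may load the files differently.

```python
import csv
import os
import tempfile


def read_dataset(path):
    """Load texts and integer labels from a CSV with 'data' and 'category_id' columns."""
    texts, labels = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            texts.append(row["data"])
            labels.append(int(row["category_id"]))
    return texts, labels


# Tiny made-up sample (assumption) with a comma, a line break, and quotes
rows = [
    {"data": "Selling a used monitor,\ngood condition", "category_id": 3},
    {"data": 'He said "X11 forwarding" works', "category_id": 0},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, ["data", "category_id"])
        writer.writeheader()
        writer.writerows(rows)
    texts, labels = read_dataset(sample_path)

print(labels)
```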

The `sagemaker.Session()` object provides a set of utilities to manage interaction with SageMaker and AWS services in general. Let's use it to upload our data files to a dedicated S3 bucket.

In [None]:
import sagemaker 

session = sagemaker.Session()
train_dataset_uri=session.upload_data("train_dataset.csv", key_prefix="newsgroups")
test_dataset_uri=session.upload_data("test_dataset.csv", key_prefix="newsgroups")

print(f"Datasets are available in following locations: {train_dataset_uri} and {test_dataset_uri}")
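
The URIs returned by `upload_data()` follow the `s3://{bucket}/{key_prefix}/{file_name}` pattern. If you later need the bucket and object key separately, a small helper like the one below can split them. This is only a sketch: `split_s3_uri` and the bucket name `example-bucket` are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI; in the notebook you would pass train_dataset_uri
bucket, key = split_s3_uri("s3://example-bucket/newsgroups/train_dataset.csv")
print(bucket, key)
```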

In [None]:
# removing local files
! rm train_dataset.csv
! rm test_dataset.csv

## Developing training script



To train our model on SageMaker, we need to provide a training script: `1_sources/train.py`. Feel free to review the script yourself. Here are several highlights of this script:
- At training time, SageMaker starts training by calling `user_training_script --arg1 value1 --arg2 value2 ...`. Here, `arg1..N` are the hyperparameters that users provide as part of the training job configuration.
- To capture hyperparameters correctly, the training script needs to parse command-line arguments. We use the Python `argparse` library to do this (see lines #104-#112).
- The `train()` method is responsible for running the end-to-end training job. It includes the following components:
    - Method `_get_tokenized_dataset()` loads and tokenizes the dataset using the pretrained DistilBERT tokenizer from the HuggingFace library;
    - Method `DistilBertForSequenceClassification.from_pretrained()` downloads the DistilBERT model from the HuggingFace Model Hub and loads it into memory. Please note that we update the default config for the classification task to match our chosen number of categories (line #80);
    - Class `Trainer()` captures the training configuration and starts the training process (lines #86-#93);
    - Method `model.save_pretrained()` saves the trained model (line #97).
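
The argument-parsing step described above can be sketched as follows. This is not the content of `train.py`: the argument names mirror the hyperparameters configured later in this notebook, while the defaults and the helper name `parse_args` are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse hyperparameters passed as command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--per-device-train-batch-size", type=int, default=16)
    parser.add_argument("--per-device-eval-batch-size", type=int, default=64)
    parser.add_argument("--warmup-steps", type=int, default=100)
    parser.add_argument("--logging-steps", type=int, default=100)
    parser.add_argument("--weight-decay", type=float, default=0.01)
    # parse_known_args() ignores any arguments this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the call `train.py --epochs 3 --weight-decay 0.05`
args = parse_args(["--epochs", "3", "--weight-decay", "0.05"])
print(args.epochs, args.per_device_train_batch_size, args.weight_decay)
```

Note that `argparse` exposes hyphenated options with underscores, so `--per-device-train-batch-size` becomes `args.per_device_train_batch_size`.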
    


The SageMaker Training toolkit sets up several environment variables that come in handy when writing your training script:
- `"SM_CHANNEL_TRAIN"` and `"SM_CHANNEL_TEST"` are the locations where data files are downloaded before training begins;
- `"SM_OUTPUT_DIR"` is a directory for any output artifacts; SageMaker uploads this directory to S3 whether the training job succeeds or fails;
- `"SM_MODEL_DIR"` is a directory to store the resulting model artifacts; SageMaker also uploads the model to S3.
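
A training script typically reads these variables once at startup. The sketch below shows one way to do that with local fallbacks, so the same script can also run outside a SageMaker container. The helper `resolve_sagemaker_paths`, the fallback directories, and the simulated environment values are assumptions for illustration.

```python
import os


def resolve_sagemaker_paths(environ=os.environ):
    """Read SageMaker-provided directories, falling back to local paths."""
    return {
        "train": environ.get("SM_CHANNEL_TRAIN", "./data/train"),
        "test": environ.get("SM_CHANNEL_TEST", "./data/test"),
        "output": environ.get("SM_OUTPUT_DIR", "./output"),
        "model": environ.get("SM_MODEL_DIR", "./model"),
    }


# Simulated container environment (example values only)
fake_env = {
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_MODEL_DIR": "/opt/ml/model",
}
paths = resolve_sagemaker_paths(fake_env)
print(paths["train"], paths["test"])
```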

In [None]:
! pygmentize -O linenos=1 1_sources/train.py

## Running training job

Once the training script and its dependencies are ready, we can schedule a training job via the SageMaker Python SDK.

We start by importing the HuggingFace Estimator object and getting the IAM execution role for our training job.

In [None]:
from sagemaker.huggingface.estimator import HuggingFace
from sagemaker import get_execution_role

role=get_execution_role()

Next, we need to define the hyperparameters of our model and training process. These values will be passed to our script at training time.

In [None]:
hyperparameters = {
    "epochs":1,
    # the 2 params below may need to be updated if a non-GPU instance is used for training
    "per-device-train-batch-size":16, 
    "per-device-eval-batch-size":64,
    "warmup-steps":100,
    "logging-steps":100,
    "weight-decay":0.01    
}
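
Since SageMaker hands these values to the script as `--name value` command-line arguments, the mapping from the dictionary to the command line can be approximated as shown below. The helper `to_cli_args` is hypothetical and only illustrates the idea; the exact serialization that SageMaker performs may differ in detail.

```python
def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--name', 'value', ...]."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend([f"--{name}", str(value)])
    return argv


# Subset of the hyperparameters above, used as an example
example = {"epochs": 1, "per-device-train-batch-size": 16, "weight-decay": 0.01}
cli_args = to_cli_args(example)
print(" ".join(cli_args))
```

This is why the dictionary keys should match the argument names that the training script declares with `argparse`.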

We then define the versions of Python and the DL frameworks that we intend to use.

In [None]:
PYTHON_VERSION = "py38"
PYTORCH_VERSION = "1.10.2"
TRANSFORMER_VERSION = "4.17.0"

In [None]:
estimator = HuggingFace(
    py_version=PYTHON_VERSION,
    entry_point="train.py",
    source_dir="1_sources",
    pytorch_version=PYTORCH_VERSION,
    transformers_version=TRANSFORMER_VERSION,
    hyperparameters=hyperparameters,
    instance_type="ml.p2.xlarge",
    instance_count=1,
    role=role,
    debugger_hook_config=False,
    disable_profiler=True,    
)


estimator.fit({
    "train":train_dataset_uri,
    "test":test_dataset_uri
})

## Developing Inference Code

Now that we have a trained model, let's deploy it as a SageMaker real-time endpoint. As with the training job, we will use the SageMaker pre-built HuggingFace container and only provide our inference script. The inference requests will be handled by [Multi-Model Server](https://github.com/awslabs/multi-model-server), which exposes the model for inference via an HTTP endpoint.

When using pre-built inference containers, SageMaker automatically recognizes our inference script. According to the SageMaker convention, the inference script should implement the following methods:
- `model_fn(model_dir)` (lines #16-#45) is executed at container start time to load the model into memory. This method takes the model directory as an input argument. You can use `model_fn()` to initialize other components of your inference pipeline, such as the tokenizer in our case. Note that HuggingFace Transformers has a convenient Pipeline API that combines data pre-processing (text tokenization in our example) and the actual inference in a single object. Hence, instead of the loaded model, we return an inference pipeline (line #45).
- `transform_fn(inference_pipeline, data, content_type, accept_type)` is responsible for running the actual inference (line #48). Since we communicate with the end client via HTTP, we also need to deserialize the inference request and serialize the model response. In our example we expect a JSON payload and return a JSON payload; however, you can support any other format that your requirements call for (e.g. CSV, Protobuf).
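
To make the request flow concrete, here is a minimal sketch of a `transform_fn()` that handles JSON in and JSON out. It is not the code from `1_sources/inference.py`: the stand-in `fake_pipeline`, the accepted payload shape (a string or a list of strings), and the error handling are assumptions for illustration.

```python
import json


def transform_fn(inference_pipeline, data, content_type, accept_type):
    """Deserialize the request, run inference, and serialize the response."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    texts = json.loads(data)
    if isinstance(texts, str):
        texts = [texts]  # accept a single string as well as a list of strings
    predictions = inference_pipeline(texts)
    # accept_type is ignored in this sketch; the response is always JSON
    return json.dumps(predictions)


# Hypothetical stand-in for a HuggingFace pipeline: one dict per input text
def fake_pipeline(texts):
    return [{"label": "LABEL_0", "score": 1.0} for _ in texts]


response = transform_fn(
    fake_pipeline, json.dumps(["Selling my car"]), "application/json", "application/json"
)
print(response)
```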

Sometimes combining deserialization, inference, and serialization in a single method is undesirable. For such cases, SageMaker supports a more granular API:
- `input_fn(request_body, request_content_type)` performs deserialization.
- `predict_fn(deser_input, model)` performs model inference.
- `output_fn(prediction, response_content_type)` performs serialization of model predictions.

Note that `transform_fn()` and the combination of `input_fn()`, `predict_fn()`, and `output_fn()` are mutually exclusive: implement one or the other.
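
The granular variant splits the same work into three small functions. The sketch below uses the signatures listed above with a JSON-only contract and a hypothetical stand-in model; the final line chains the hooks by hand purely to show the order in which they would run.

```python
import json

JSON = "application/json"


def input_fn(request_body, request_content_type):
    """Deserialize the request body into Python objects."""
    if request_content_type != JSON:
        raise ValueError(f"Unsupported content type: {request_content_type}")
    return json.loads(request_body)


def predict_fn(deser_input, model):
    """Run the model on already-deserialized input."""
    return model(deser_input)


def output_fn(prediction, response_content_type):
    """Serialize predictions for the HTTP response."""
    if response_content_type != JSON:
        raise ValueError(f"Unsupported accept type: {response_content_type}")
    return json.dumps(prediction)


# Hypothetical stand-in model: one dict per input text
def fake_model(texts):
    return [{"label": "LABEL_1", "score": 0.5} for _ in texts]


body = json.dumps(["first article", "second article"])
result = output_fn(predict_fn(input_fn(body, JSON), fake_model), JSON)
print(result)
```

Keeping the steps separate makes each one easy to unit test without starting a container.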


In [None]:
! pygmentize -O linenos=1 1_sources/inference.py

## Deploying Inference Endpoint

Now we are ready to deploy and test our newsgroup classification endpoint. We can use the `estimator.create_model()` method to prepare the model deployment. SageMaker automatically locates the trained model artifacts from the previous training job. Additionally, we need to supply a reference to the inference script. SageMaker will automatically package our script and all artifacts under `source_dir` and upload them to the inference instance along with our trained model.


In [None]:
from sagemaker.huggingface import HuggingFaceModel  # class of the object returned by create_model()

model = estimator.create_model(role=role, 
                               entry_point="inference.py", 
                               source_dir="1_sources",
                              )

Next, we define the parameters of our endpoint, such as the number and type of instances behind it. Remember, SageMaker supports horizontal scaling of your inference endpoints! The `model.deploy()` method starts the deployment (which usually takes several minutes) and returns a `Predictor` object for running inference requests.

In [None]:
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge"
)

Now that the endpoint is deployed, let's test it out. Note that we don't expect stellar performance: the model is likely undertrained because we only trained for a single epoch to shorten the training cycle. However, we expect the model to get most predictions right.


In [None]:
import random

for i in range(10):
    # randrange excludes the upper bound; randint(0, len(...)) could index past the end
    sample_id = random.randrange(len(test_dataset['data']))
    prediction = predictor.predict([test_dataset['data'][sample_id]])
    print(f"Sample index: {sample_id}; predicted newsgroup: {prediction[0]['label']}; actual newsgroup: {test_dataset['target'][sample_id]}")
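
The predicted label is a string, while the actual newsgroup is an integer index, so the two are not directly comparable. The sketch below assumes the endpoint returns labels in the `LABEL_<index>` form, which depends on the model configuration and should be checked against your own output. The helpers `label_to_index` and `accuracy` are hypothetical.

```python
def label_to_index(label):
    """Convert a label such as 'LABEL_3' to the integer 3."""
    return int(label.rsplit("_", 1)[-1])


def accuracy(predicted_labels, actual_indices):
    """Fraction of predictions whose index matches the actual category index."""
    if not actual_indices:
        return 0.0
    hits = sum(
        label_to_index(p) == int(a) for p, a in zip(predicted_labels, actual_indices)
    )
    return hits / len(actual_indices)


# Made-up predictions and targets (assumption), three of four match
score = accuracy(["LABEL_0", "LABEL_3", "LABEL_5", "LABEL_2"], [0, 3, 4, 2])
print(f"accuracy: {score:.2f}")
```

The resulting index can also be looked up in `test_dataset['target_names']` to print a readable newsgroup name.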



## Resource Cleanup

Execute the cell below to clean up the SageMaker endpoint, its configuration, and the model, and avoid further charges.

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)
model.delete_model()


# Summary

In this notebook, you learned how to train and deploy a custom HuggingFace model using **SageMaker Script mode**. Script mode provides a lot of flexibility for developers when it comes to developing training and inference scripts.

However, there are scenarios where you need more control over your runtime environment. SageMaker allows you to extend pre-built containers or Bring Your Own ("BYO") containers. In the `2_Extending_Prebuilt_Training_Container.ipynb` example, we will learn how to extend a pre-built SageMaker container according to your specific requirements.