# Multi-label Text Classification using BERT

This notebook has been sourced from the following blogs by Kaushal Trivedi [1](https://medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models-89ff763ad384) [2](https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d) and the associated [GitHub repos](https://github.com/kaushaltrivedi/fast-bert).

Let's look at what is happening here: this is how we use SageMaker to fine-tune Hugging Face BERT models.

## Anatomy of a typical Amazon SageMaker container 

![SageMaker Architecture](../img/sagemaker-architecture.png)

### Principal components of the architecture

*Container* - We start this lab by building our own container and using the SageMaker service to train with it and deploy the resulting model. The build cell below is commented out because, as of Nov 28 2019, the resulting container cannot train properly due to an unmet dependency; as of this writing I am still debugging it. Building this container from scratch and pushing it to ECR takes around 22 minutes of wall-clock time on ml.p2.xlarge.

Once the container is ready we proceed with the lab. 

In [1]:
#!../container/build_and_push.sh
#We have prebuilt containers and made them available to be pulled in us-east-1 and us-west-2

### Let's talk about the container a bit

[This notebook is a bit lean](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/scikit_bring_your_own) because most of our code resides in the container; the notebook acts as an orchestrator.

```bash
.
├── container
│   ├── bert
│   │   ├── download_pretrained_models.py
│   │   ├── nginx.conf
│   │   ├── predictor.py
│   │   ├── serve
│   │   ├── train
│   │   └── wsgi.py
│   ├── build_and_push.sh
│   └── Dockerfile_gpu
```

We have 3 important components in this container - 

1. __[nginx][nginx]__ is a light-weight layer that handles the incoming HTTP requests and manages the I/O in and out of the container efficiently.
2. __[gunicorn][gunicorn]__ is a WSGI pre-forking worker server that runs multiple copies of your application and load balances between them.
3. __[flask][flask]__ is a simple web framework used in the inference app that you write. It lets you respond to calls on the `/ping` and `/invocations` endpoints without having to write much code.

Let's talk about each file in turn:

* __Dockerfile__: The _Dockerfile_ describes how the image is built and what it contains. It is a recipe for your container and gives you tremendous flexibility to construct almost any execution environment you can imagine. Here, we use the Dockerfile to describe a pretty standard Python science stack and the simple scripts that we're going to add to it. See the [Dockerfile reference][dockerfile] for what's possible here.

* __build\_and\_push.sh__: The script that builds the Docker image (using the Dockerfile above) and pushes it to the [Amazon Elastic Container Registry (ECR)][ecr] so that it can be deployed to SageMaker. Specify the name of the image as the argument to this script. The script generates a full name for the repository in your account and your configured AWS region. If this ECR repository doesn't exist, the script creates it.


* __download_pretrained_models.py__: Downloads the pre-trained BERT models from Hugging Face's repository.
    
* __train__: The main program for training the model. When you build your own algorithm, you'll edit this to include your training code.
* __serve__: The wrapper that starts the inference server. In most cases, you can use this file as-is.
* __wsgi.py__: The start-up shell for the individual server workers. This only needs to be changed if you change the location or name of predictor.py.
* __predictor.py__: The algorithm-specific inference server. This is the file that you modify with your own algorithm's code.
* __nginx.conf__: The configuration for the nginx master server that manages the multiple workers.
    
Finally, when SageMaker starts a container, it invokes the container with an argument of either __train__ or __serve__. We have set this container up so that the argument is treated as the command that the container executes. When training, it runs the included __train__ program and, when serving, it runs the __serve__ program.

[dockerfile]: https://docs.docker.com/engine/reference/builder/ "The official Dockerfile reference guide"
[ecr]: https://aws.amazon.com/ecr/ "ECR Home Page"
[nginx]: http://nginx.org/
[gunicorn]: http://gunicorn.org/
[flask]: http://flask.pocoo.org/

[An excellent workshop dedicated to this concept, and the source of this information](https://sagemaker-workshop.com/custom/code.html).

[An even better source](https://github.com/aws/sagemaker-containers/blob/master/README.rst)

### Dockerfile dissection 

The Dockerfile describes the image that you want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it relies on Linux on the host machine for the basic operations.

This Dockerfile is special because we want to access the GPU while training, so we start from the NVIDIA CUDA base image:

```dockerfile
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
ARG ARCH=gpu
```

After setting up the necessary packages, we then proceed to download the pre-trained models into this container. 

```dockerfile
RUN python download_pretrained_models.py --location_dir ./pretrained_models/ --models bert-base-uncased roberta-base distilbert-base-uncased
```

Amazon SageMaker invokes the training code by running a version of the following command:


```bash
docker run <image> train
```

This means that your Docker image should contain an executable file called train. You will modify this program to implement your training algorithm. It can be written in any language that is capable of running inside the Docker environment, but the most common language options for data scientists include Python, R, Scala, and Java. In this example, we use Python.

At runtime, Amazon SageMaker injects the training data from an Amazon S3 location into the container. The training program ideally should produce a model artifact. The artifact is written, inside of the container, then packaged into a compressed tar archive and pushed to an Amazon S3 location by Amazon SageMaker.

When Amazon SageMaker runs training, your train script is run just like any regular program. A number of files are laid out for your use, under the /opt/ml directory:

```bash
/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│   └── <model files>
└── output
    └── failure
```

Let's take a look at the input paths: hyperparameters, the config for distributed training, and the data for training.


* __/opt/ml/input/config__ contains information to control how your program runs. hyperparameters.json is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. 

* __resourceConfig.json__ is a JSON-formatted file that describes the network layout used for distributed training. Since we train on a single instance in this notebook, we'll ignore it here.

* __/opt/ml/input/data/<channel_name>/__ (for File mode) contains the input data for that channel. The channels are created based on the call to CreateTrainingJob but it’s generally important that channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.

* __/opt/ml/input/data/<channel_name>_<epoch_number>__ (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
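Because every value in hyperparameters.json arrives as a string, the train program has to cast the values before using them. The following is a minimal sketch, assuming the hyperparameter names used later in this notebook; the `load_hyperparameters` helper and its `CASTS` table are hypothetical and are not taken from the container's train script.

```python
import json
from pathlib import Path

# Hypothetical cast table: hyperparameter name -> type to convert the string to.
CASTS = {"epochs": int, "lr": float, "max_seq_length": int,
         "train_batch_size": int, "warmup_steps": int}


def load_hyperparameters(config_dir="/opt/ml/input/config"):
    """Read hyperparameters.json and cast the string values to Python types."""
    raw = json.loads((Path(config_dir) / "hyperparameters.json").read_text())
    # Names without an entry in CASTS are left as strings.
    return {name: CASTS.get(name, str)(value) for name, value in raw.items()}
```

Inside the container this would be called without arguments; passing a different `config_dir` makes it easy to try out locally.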


Now this is where the output is directed - 


* __/opt/ml/model/__ is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the DescribeTrainingJob result.

* __/opt/ml/output__ is a directory where the algorithm can write a file failure that describes why the job failed. The contents of this file will be returned in the FailureReason field of the DescribeTrainingJob result. For jobs that succeed, there is no reason to write this file as it will be ignored.
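The two output locations can be exercised with a small wrapper around the training routine. This is a sketch under assumptions: `run_training` and the `train_fn` callback are hypothetical names, and the real train script in the container may organize its error handling differently.

```python
import traceback
from pathlib import Path


def run_training(train_fn, model_dir="/opt/ml/model", output_dir="/opt/ml/output"):
    """Run train_fn(model_dir); on error, record the reason in <output_dir>/failure.

    Returns the process exit code: 0 on success, 1 on failure.
    """
    try:
        Path(model_dir).mkdir(parents=True, exist_ok=True)
        train_fn(Path(model_dir))  # train_fn is expected to write model files here
        return 0
    except Exception as exc:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        # SageMaker surfaces the contents of this file as FailureReason.
        (Path(output_dir) / "failure").write_text(
            f"Exception during training: {exc}\n{traceback.format_exc()}"
        )
        return 1
```

A train script would typically end with `sys.exit(run_training(my_train_fn))` so that a failed job also exits with a non-zero status.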

More info [here](https://github.com/aws/sagemaker-containers/blob/master/README.rst) & [here](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-dist-training)



### Quick note on Distributed Training Configuration

Although we are not using it here, it is worth mentioning.

If you're performing distributed training with multiple containers, Amazon SageMaker makes information about all containers available in the /opt/ml/input/config/resourceconfig.json file.

To enable inter-container communication, this JSON file contains information for all containers. Amazon SageMaker makes this file available for both FILE and PIPE mode algorithms. The file provides the following information:

* `current_host`: The name of the current container on the container network, for example `algo-1`. Host values can change at any time. Don't write code with specific values for this variable.

* `hosts`: The list of names of all containers on the container network, sorted lexicographically, for example `["algo-1", "algo-2", "algo-3"]` for a three-node cluster. Containers can use these names to address other containers on the container network. Host values can change at any time. Don't write code with specific values for these variables.

* `network_interface_name`: The name of the network interface that is exposed to your container. For example, containers running the Message Passing Interface (MPI) can use this information to set the network interface name.

* Do not use the information in /etc/hostname or /etc/hosts because it might be inaccurate.

* Hostname information may not be immediately available to the algorithm container. We recommend adding a retry policy on hostname resolution operations as nodes become available in the cluster.

The following is an example file on node 1 in a three-node cluster:

```json
{
  "current_host": "algo-1",
  "hosts": ["algo-1", "algo-2", "algo-3"],
  "network_interface_name": "eth1"
}
```
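A training program can derive its position in the cluster from this file instead of hard-coding host names. The sketch below is illustrative only: `load_cluster_info` is a hypothetical helper, the default path uses the file name as spelled in this section, and treating the first host in sorted order as the leader is a common convention rather than something SageMaker mandates.

```python
import json
from pathlib import Path


def load_cluster_info(path="/opt/ml/input/config/resourceconfig.json"):
    """Return (current_host, hosts, rank, is_leader) from the resource config file."""
    config = json.loads(Path(path).read_text())
    hosts = sorted(config["hosts"])   # same order on every node
    current = config["current_host"]
    rank = hosts.index(current)       # position of this container in the cluster
    return current, hosts, rank, rank == 0
```

With the example file above, the container `algo-1` would get rank 0 and act as the leader.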

[A helpful blog about distributed training on SageMaker](https://aws.amazon.com/blogs/machine-learning/launching-tensorflow-distributed-training-easily-with-horovod-or-parameter-servers-in-amazon-sagemaker/) and [a workshop](https://sagemaker-workshop.com/builtin/parallelized.html)



### Serving a model, hosting a model in a container

Amazon SageMaker invokes the hosting service by running a version of the following command:

```bash
docker run <image> serve
```
This launches a RESTful API to serve HTTP requests for inference. Again, this can be done in any language or framework that works within the Docker environment.

In most Amazon SageMaker containers, serve is simply a wrapper that starts the inference server. Furthermore, Amazon SageMaker injects the model artifact produced in training into the container and unarchives it automatically.

Amazon SageMaker uses two URLs in the container:

*    __/ping__ will receive GET requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.
*    __/invocations__ is the endpoint that receives client inference POST requests. The format of the request and the response is up to the algorithm. If the client supplied ContentType and Accept headers, these will be passed in as well.
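To make the two endpoints concrete, here is a framework-free WSGI sketch. The container in this lab uses Flask in predictor.py; this plain WSGI callable is only a dependency-free stand-in that a WSGI server such as gunicorn could also run. The `predict` function is a hypothetical placeholder for the real model call, and the `{"text": ...}` payload shape is taken from the sample request sent at the end of this notebook.

```python
import json


def predict(text):
    """Hypothetical placeholder for the real model call in predictor.py."""
    return [["toxic", 0.0]]


def app(environ, start_response):
    """Minimal WSGI callable answering GET /ping and POST /invocations."""
    method, path = environ["REQUEST_METHOD"], environ.get("PATH_INFO", "")
    if method == "GET" and path == "/ping":
        # Health check: 200 means the container is up and accepting requests.
        status, body = "200 OK", b"\n"
    elif method == "POST" and path == "/invocations":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        status, body = "200 OK", json.dumps(predict(payload["text"])).encode()
    else:
        status, body = "404 Not Found", b""
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]
```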

### About pushing this image to ECR

**Amazon SageMaker currently requires Docker images to reside in Amazon ECR**

We see these commands in __build_and_push.sh__ 

For SageMaker to run a container for training or hosting, it needs to be able to find the image hosted in the image repository, Amazon Elastic Container Registry (Amazon ECR). The three main steps to this process are building locally, tagging with the repository location, and pushing the image to the repository.

To build the local image, call the following command:

```bash
docker build -t <image name> .
```
This takes instructions from the Dockerfile we discussed earlier to generate an image on your local instance. After the image is built, we need to let our local instance of Docker know where to store the image so that SageMaker can find it. We do this by tagging the image with the following command:


```bash
docker tag <image name> <repository name>
```
The repository name has the following structure:
```bash
<account number>.dkr.ecr.<region>.amazonaws.com/<image name>:<tag>
```
Without tagging the image with the repository name, Docker defaults to uploading to Docker Hub, and not Amazon ECR. Amazon SageMaker currently requires Docker images to reside in Amazon ECR. To push an image to ECR, and not the central Docker registry, you must tag it with the registry hostname.
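The repository name can also be assembled in Python, for example when passing the image to an estimator. This is a small sketch; `ecr_image_uri` is a hypothetical helper and the account ID below is a placeholder.

```python
def ecr_image_uri(account, region, image_name, tag="latest"):
    """Build the repository name used for `docker tag` and `docker push`."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{image_name}:{tag}"


# Placeholder account ID, not a real one.
print(ecr_image_uri("123456789012", "us-west-2", "sagemaker-bert", "1.0-gpu-py36"))
# → 123456789012.dkr.ecr.us-west-2.amazonaws.com/sagemaker-bert:1.0-gpu-py36
```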

Unlike Docker Hub, Amazon ECR images are private by default, which is a good practice with Amazon SageMaker. If you want to share your Amazon SageMaker images publicly, you can find more information in the Amazon ECR User Guide.

Finally, to upload the image to Amazon ECR, with the Region set in the repository name tag, call the following command:
```bash
docker push <repository name>
```
One final note on Amazon SageMaker Docker containers. We have already shown you that you have the option to build one Docker container serving both training and hosting, or you can build one for each. While building two Docker images can increase storage requirements and cost due to duplicated common libraries, you might get a benefit from building a significantly smaller inference container, allowing the hosting service to scale more quickly when reacting to traffic increases. This is especially common when you use GPUs for training, but your inference code is optimized for CPUs. You need to consider the tradeoffs when you decide if you want to build a single container or two.

### Now that we have uploaded the container to ECR, let's use it to fine-tune BERT

In [2]:
import sagemaker
from pathlib import Path
from sagemaker.predictor import json_serializer
import json
import numpy as np
import boto3

In [3]:
role = sagemaker.get_execution_role()
session = sagemaker.Session()

## Setup Path 

In [4]:
# Source of this data: [Kaggle Competition - Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
# Please note: the full dataset is not provided here; this copy is severely truncated.
# If you want better results, please sign up for the competition and use the real dataset.

# location for train.csv, val.csv and labels.csv
DATA_PATH = Path("../sm-data/")   

# Location for storing training_config.json
CONFIG_PATH = DATA_PATH/'config'
CONFIG_PATH.mkdir(exist_ok=True)

suffix = str(np.random.uniform())[4:9]

# S3 bucket name
bucket = 'toxic-pytorch-sagemaker-' + suffix

# Prefix for S3 bucket for input and output
prefix = 'toxic_comments/input'
prefix_output = 'toxic_comments/output'

In [5]:
!aws s3 mb s3://{bucket}

make_bucket: toxic-pytorch-sagemaker-92090


## Hyperparameters & Training Config

In [6]:
hyperparameters = {
    "epochs": 10,
    "lr": 8e-5,
    "max_seq_length": 512,
    "train_batch_size": 16,
    "lr_schedule": "warmup_cosine",
    "warmup_steps": 1000,
    "optimizer_type": "adamw"
}

### Hyperparameters 

* **epochs** - One epoch is one full pass of the entire dataset forward and backward through the neural network.
* **lr** - Learning rate: the step size used to adjust the weights when minimizing the loss function.
* **max_seq_length** - Maximum sequence length: the maximum number of tokens in an input.
* **train_batch_size** - The default is 32, but here it is 16. During training, the data is read in parallel and shuffled.
* **lr_schedule** - Learning rate schedules are heuristics that have significantly improved the convergence rate and final performance of common deep learning models. Too large a learning rate can cause numerical instability, especially at the very beginning of training, when parameters are randomly initialized. The warmup strategy increases the learning rate linearly from 0 to the initial learning rate over the first N epochs or m batches. After the warmup stage, the learning rate is typically decreased steadily from its initial value. Compared with widely used strategies such as exponential decay and step decay, cosine decay decreases the learning rate slowly at the beginning, almost linearly in the middle, and slowly again at the end, which can improve training progress.
* **warmup_steps** - Number of warmup steps.
* **optimizer_type** - Here we choose AdamW. Adam adapts the learning rate for each parameter, which helps speed up training. AdamW is a variant of Adam that applies weight decay directly to the weights (decoupled weight decay) rather than folding it into the gradient as L2 regularization.
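The shape of a warmup-plus-cosine schedule can be sketched in a few lines. This is a generic illustration of the idea, not necessarily the exact formula behind the container's `warmup_cosine` option; `warmup_cosine_factor` is a hypothetical helper and the total of 10,000 training steps is an assumed value.

```python
import math


def warmup_cosine_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, warmup_steps)
    # Cosine decay from 1 to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


base_lr = 8e-5  # the "lr" hyperparameter above
for step in (0, 500, 1000, 5500, 10000):
    # total_steps=10000 is an assumption made for this illustration
    print(step, base_lr * warmup_cosine_factor(step, 1000, 10000))
```

The printed values rise linearly to the base learning rate at step 1000, then fall along half a cosine wave to zero at the final step.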

#### References - source of above statements

* [Epochs](https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9)
* [Learning Rate Scheduling](https://www.deeplearningwizard.com/deep_learning/boosting_models_pytorch/lr_scheduling/)
* [A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation](https://openreview.net/forum?id=r14EOsCqKX)
* [Bag of Tricks for Image Classification](https://www.dlology.com/blog/bag-of-tricks-for-image-classification-with-convolutional-neural-networks-in-keras/)
* [AdamW and Super-convergence is now the fastest way to train neural nets](https://www.fast.ai/2018/07/02/adam-weight-decay/)
* [Why AdamW matters](https://towardsdatascience.com/why-adamw-matters-736223f31b5d)

In [7]:
training_config = {
    "run_text": "toxic comments",
    "finetuned_model": None,
    "do_lower_case": "True",
    "train_file": "train.csv",
    "val_file": "val.csv",
    "label_file": "labels.csv",
    "text_col": "comment_text",
    "label_col": '["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]',
    "multi_label": "True",
    "grad_accumulation_steps": "1",
    "fp16_opt_level": "O1",
    "fp16": "True",
    "model_type": "roberta",
    "model_name": "roberta-base",
    "logging_steps": "300"
}

with open(CONFIG_PATH/'training_config.json', 'w') as f:
    json.dump(training_config, f)

### Training config 

The hyperparameters and this training config are consumed by the **train** file inside the container, like so:

```python
# Excerpt: `prefix` and `training_config_path` are defined earlier in train.
hyperparam_path = os.path.join(
    prefix, "input/config/hyperparameters.json"
)  # /opt/ml/input/config/hyperparameters.json
config_path = os.path.join(
    training_config_path, "training_config.json"
)  # /opt/ml/input/data/training/config/training_config.json
```

* **run_text** - Used for tagging, debugging, and logging.
* **finetuned_model** - Location of an already fine-tuned model.
* **do_lower_case** - Lowercases the input text; intended for uncased models.
* **train_file** - Name of the training dataset file.
* **val_file** - Name of the validation dataset file.
* **label_file** - File where the labels are stored.
* **text_col** - Column where the comment text is stored.
* **label_col** - Names of the label columns.
* **multi_label** - We want to do multi-label classification.
* **grad_accumulation_steps** - Gradient accumulation: gradients from this many batches are accumulated before each optimizer step. If you have small GPUs, you may need gradient accumulation to make training stable.
* **fp16_opt_level** - O1 (conservative mixed precision): only some whitelisted ops are done in FP16. By switching to 16-bit, we use half the memory and theoretically less computation, at the expense of the available number range and precision. However, pure 16-bit training creates a lot of problems (imprecise weight updates, gradient underflow and overflow). Mixed precision training alleviates these problems.
* **fp16** - Enables the mixed precision training described above.
* **model_type** - RoBERTa. Introduced at Facebook, RoBERTa (Robustly optimized BERT approach) is a retraining of BERT with an improved training methodology, 1000% more data, and more compute power.
* **model_name** - RoBERTa: A Robustly Optimized BERT Pretraining Approach. 12-layer, 768-hidden, 12-heads, 125M parameters; RoBERTa using the BERT-base architecture.
* **logging_steps** - Controls logging granularity.
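Most values in training_config.json are strings, including the booleans, the integers, and the JSON-encoded list in `label_col`, so a consumer has to convert them. The sketch below shows one way to do that; `parse_training_config` is a hypothetical helper, and whether the container's train script converts the values this way is an assumption.

```python
import json


def parse_training_config(raw):
    """Convert the string-typed entries of training_config.json to Python values."""
    parsed = dict(raw)
    for key in ("do_lower_case", "multi_label", "fp16"):
        parsed[key] = str(raw[key]).lower() == "true"
    for key in ("grad_accumulation_steps", "logging_steps"):
        parsed[key] = int(raw[key])
    # label_col is a JSON-encoded list stored inside a string.
    parsed["label_col"] = json.loads(raw["label_col"])
    return parsed
```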


#### References - source of above statements

* [Multi-Task Deep Neural Networks for Natural Language Understanding](https://awesomeopensource.com/project/namisan/mt-dnn)
* [Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups](https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255)
* [Use NVIDIA Apex for Easy Mixed Precision Training in PyTorch](https://medium.com/the-artificial-impostor/use-nvidia-apex-for-easy-mixed-precision-training-in-pytorch-46841c6eed8c)
* [BERT, RoBERTa, DistilBERT, XLNet — which one to use?](https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8)
* [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://github.com/pytorch/fairseq/tree/master/examples/roberta)

## Upload Data

In [8]:
# This is a helper feature to upload data
# from your local machine to S3 bucket.

s3_input = session.upload_data(DATA_PATH, bucket=bucket , key_prefix=prefix)

session.upload_data(str(DATA_PATH/'val.csv'), bucket=bucket , key_prefix=prefix)

's3://toxic-pytorch-sagemaker-92090/toxic_comments/input/val.csv'

In [9]:
session.upload_data(str(DATA_PATH/'labels.csv'), bucket=bucket , key_prefix=prefix)

's3://toxic-pytorch-sagemaker-92090/toxic_comments/input/labels.csv'

In [10]:
session.upload_data(str(DATA_PATH/'train.csv'), bucket=bucket , key_prefix=prefix)

's3://toxic-pytorch-sagemaker-92090/toxic_comments/input/train.csv'

## Create an Estimator and start training

In [11]:
!aws configure get region

us-west-2


In [12]:
#account = session.boto_session.client('sts').get_caller_identity()['Account']
#region = session.boto_session.region_name

#image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-bert:1.0-gpu-py36".format(account, region)

#Please use only the following images - 
#US West 2  - 111652037296.dkr.ecr.us-west-2.amazonaws.com/chazarey-sagemaker-fast-bert:1.0-gpu-py36
#US East 1 - 111652037296.dkr.ecr.us-east-1.amazonaws.com/chazarey-sagemaker-fast-bert-copied:1.0-gpu-py36

image = "111652037296.dkr.ecr.us-east-1.amazonaws.com/chazarey-sagemaker-fast-bert-copied:1.0-gpu-py36"

In [13]:
output_path = "s3://{}/{}".format(bucket, prefix_output)

In [14]:
estimator = sagemaker.estimator.Estimator(image, 
                                          role,
                                          train_instance_count=1, 
                                          train_instance_type='ml.p3.8xlarge', 
                                          output_path=output_path, 
                                          base_job_name='toxic-comments',
                                          hyperparameters=hyperparameters,
                                          sagemaker_session=session
                                         )

So now we bring everything together and ask SageMaker to run our container and fine-tune our BERT model.

* **image** - image_name (str) – The container image to use for training.
* **role** - role (str) – An AWS IAM role (either name or full ARN); we obtained it earlier with `sagemaker.get_execution_role()`. The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role if it needs to access an AWS resource.
* **train_instance_count** - train_instance_count (int) – Number of Amazon EC2 instances to use for training.
* **train_instance_type** - train_instance_type (str) – Type of EC2 instance to use for training, for example, ‘ml.c4.xlarge’.
* **output_path** - output_path (str) – S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, the estimator creates the bucket during the fit() method execution.
* **base_job_name** - base_job_name (str) – Prefix for training job name when the fit() method launches. If not specified, the estimator generates a default job name, based on the training image name and current timestamp.
* **hyperparameters** - hyperparameters (dict) – Dictionary containing the hyperparameters to initialize this estimator with. We set this up in the previous cells
* **sagemaker_session** - sagemaker_session (sagemaker.session.Session) – Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed. If not specified, the estimator creates one using the default AWS configuration chain.


In [15]:
estimator.fit(s3_input)

2019-12-05 00:47:56 Starting - Starting the training job...
2019-12-05 00:47:58 Starting - Launching requested ML instances......
2019-12-05 00:48:59 Starting - Preparing the instances for training...
2019-12-05 00:49:55 Downloading - Downloading input data...
2019-12-05 00:50:11 Training - Downloading the training image............
2019-12-05 00:52:25 Training - Training image download completed. Training in progress...
Starting the training.
/opt/ml/input/data/training/config/training_config.json
{'run_text': 'toxic comments', 'finetuned_model': None, 'do_lower_case': 'True', 'train_file': 'train.csv', 'val_file': 'val.csv', 'label_file': 'labels.csv', 'text_col': 'comment_text', 'label_col': '["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]', 'multi_label': 'True', 'grad_accumulation_steps': '1', 'fp16_opt_level': 'O1', 'fp16': 'True', 'model_type': 'roberta', 'model_name': 'roberta-base', 'logging_steps': '300'}
{'train_batch_

### Finally start training our model 

The `fit()` call trains a model using the input training dataset.
It calls the Amazon SageMaker CreateTrainingJob API to start model training, using the configuration you provided when creating the estimator and the specified input training data to build the CreateTrainingJob request.
This is a synchronous operation. After model training completes successfully, you can call the deploy() method to host the model using the Amazon SageMaker hosting services.


## Deploy the model to hosting service

In [16]:
predictor = estimator.deploy(1, 
                             'ml.m5.large', 
                             endpoint_name='bert-toxic-comments', 
                             serializer=json_serializer)

--------------------------------------------------------------------------------------------------------------!

In [17]:
### Invoke the Endpoint
client = boto3.client('sagemaker-runtime')

sample_payload='{"text": "this is really really good thanks for recommending!!"}'

response = client.invoke_endpoint(
    EndpointName='bert-toxic-comments',
    Body=sample_payload,
    ContentType='application/json'
)
print('Our result for this payload is: {}'.format(response['Body'].read().decode('ascii')))

Our result for this payload is: [["toxic", 9.29754605749622e-05], ["insult", 4.071386865689419e-05], ["obscene", 3.189296694472432e-05], ["severe_toxic", 2.644667256390676e-05], ["identity_hate", 1.8441571228322573e-05], ["threat", 1.307918591919588e-05]]
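
The response is a JSON list of `[label, score]` pairs. A minimal sketch of turning it into predicted labels follows; the `labels_above` helper and the 0.5 decision threshold are illustrative assumptions and are not part of the notebook or the container.

```python
import json

# Response body copied from the output above.
body = ('[["toxic", 9.29754605749622e-05], ["insult", 4.071386865689419e-05], '
        '["obscene", 3.189296694472432e-05], ["severe_toxic", 2.644667256390676e-05], '
        '["identity_hate", 1.8441571228322573e-05], ["threat", 1.307918591919588e-05]]')


def labels_above(response_body, threshold=0.5):
    """Return the labels whose score is at least `threshold`."""
    return [label for label, score in json.loads(response_body) if score >= threshold]


print(labels_above(body))  # → []
```

Every score for this friendly comment is far below the threshold, so no toxicity label is assigned.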
