# Multi-label Text Classification using BERT

This notebook has been sourced from the following blogs by Kaushal Trivedi [1](https://medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models-89ff763ad384) [2](https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d) and the associated [GitHub repos](https://github.com/kaushaltrivedi/fast-bert).

Let's look at what is happening here. The diagram below shows how we use SageMaker to fine-tune Hugging Face BERT models.

![SageMaker Architecture](../img/sagemaker-architecture.png)

### Principal components of the architecture

*Container* - we start off this lab by building our own container and using the SageMaker service to train with it and deploy the resulting model. The build step below is commented out because, as of Nov 28 2019, the resulting container cannot train properly due to an unmet dependency, which I am still debugging. Building this container from scratch and pushing it to ECR takes around 22 minutes of wall-clock time on ml.p2.xlarge.

Once the container is ready we proceed with the lab. 

In [None]:
#!../container/build_and_push.sh
#We have prebuilt containers and made them available to be pulled in us-east-1 and us-west-2

### Let's talk about the container a bit

This notebook is fairly lean because most of our code resides in the container; the notebook acts as an orchestrator.

```bash
.
├── container
│   ├── bert
│   │   ├── download_pretrained_models.py
│   │   ├── nginx.conf
│   │   ├── predictor.py
│   │   ├── serve
│   │   ├── train
│   │   └── wsgi.py
│   ├── build_and_push.sh
│   └── Dockerfile_gpu
```

We have 3 important components in this container - 

1. __[nginx][nginx]__ is a light-weight layer that handles the incoming HTTP requests and manages the I/O in and out of the container efficiently.
2. __[gunicorn][gunicorn]__ is a WSGI pre-forking worker server that runs multiple copies of your application and load balances between them.
3. __[flask][flask]__ is a simple web framework used in the inference app that you write. It lets you respond to calls on the `/ping` and `/invocations` endpoints without having to write much code.

Let's talk about each file in turn - 

* __Dockerfile__: The _Dockerfile_ describes how the image is built and what it contains. It is a recipe for your container and gives you tremendous flexibility to construct almost any execution environment you can imagine. Here, we use the Dockerfile to describe a pretty standard Python science stack and the simple scripts that we're going to add to it. See the [Dockerfile reference][dockerfile] for what's possible here.

* __build\_and\_push.sh__: The script to build the Docker image (using the Dockerfile above) and push it to the [Amazon Elastic Container Registry (ECR)][ecr] so that it can be deployed to SageMaker. Specify the name of the image as the argument to this script. The script will generate a full name for the repository in your account and your configured AWS region. If this ECR repository doesn't exist, the script will create it.


* __download\_pretrained\_models.py__: Downloads the pre-trained BERT models from Hugging Face's repository.
    
* __train__: The main program for training the model. When you build your own algorithm, you'll edit this to include your training code.
* __serve__: The wrapper that starts the inference server. In most cases, you can use this file as-is.
* __wsgi.py__: The startup shell for the individual server workers. This only needs to be changed if you change the location or name of predictor.py.
* __predictor.py__: The algorithm-specific inference server. This is the file that you modify with your own algorithm's code.
* __nginx.conf__: The configuration for the nginx master server that manages the multiple workers.
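
To make the serving contract more concrete, here is a minimal sketch of the two routes the inference app answers. It is written as a plain WSGI callable using only the standard library rather than Flask, and `fake_predict` is a hypothetical stand-in for the model call. The actual predictor.py in the container is not reproduced here and may look quite different.

```python
import json


def fake_predict(text):
    """Hypothetical stand-in for the real model call; returns a dummy score."""
    return {"text": text, "score": 0.0}


def app(environ, start_response):
    """Minimal WSGI app answering GET /ping and POST /invocations."""
    path = environ.get("PATH_INFO", "")
    method = environ.get("REQUEST_METHOD", "GET")

    if path == "/ping" and method == "GET":
        # Health check: an HTTP 200 tells SageMaker the container is healthy.
        start_response("200 OK", [("Content-Type", "application/json")])
        return [b"{}"]

    if path == "/invocations" and method == "POST":
        # Read exactly the request body and parse it as JSON.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        body = json.dumps(fake_predict(payload.get("text", ""))).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]

    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

Because gunicorn serves any WSGI callable, a function like `app` could in principle sit behind the same nginx and gunicorn layers; Flask mainly saves you from writing the routing by hand.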
    
Finally, when SageMaker starts a container, it invokes the container with an argument of either __train__ or __serve__. We have set this container up so that the argument is treated as the command that the container executes. When training, it runs the included __train__ program and, when serving, it runs the __serve__ program.
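
When the container runs with the `train` argument, SageMaker makes the job's hyperparameters available in `/opt/ml/input/config/hyperparameters.json`, with every value serialized as a string. The sketch below shows one way a training program might read and cast them. The helper names `load_hyperparameters` and `cast_hyperparameters` are hypothetical, and this is not necessarily how the `train` script in this container does it.

```python
import json
from pathlib import Path


def load_hyperparameters(prefix="/opt/ml"):
    """Read the hyperparameters file SageMaker writes for a training job.

    Returns an empty dict when the file is absent (e.g. outside SageMaker).
    """
    path = Path(prefix) / "input" / "config" / "hyperparameters.json"
    if not path.exists():
        return {}
    with open(path) as f:
        return json.load(f)


def cast_hyperparameters(raw):
    """Convert string values to numbers for the keys we know about.

    The key names mirror the `hyperparameters` dict used in this notebook;
    unknown keys are kept as strings.
    """
    casts = {
        "epochs": int,
        "lr": float,
        "max_seq_length": int,
        "train_batch_size": int,
        "warmup_steps": int,
    }
    return {key: casts.get(key, str)(value) for key, value in raw.items()}
```

The `prefix` argument exists only so the function can be tried locally against a temporary directory.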

[dockerfile]: https://docs.docker.com/engine/reference/builder/ "The official Dockerfile reference guide"
[ecr]: https://aws.amazon.com/ecr/ "ECR Home Page"
[nginx]: http://nginx.org/
[gunicorn]: http://gunicorn.org/
[flask]: http://flask.pocoo.org/


In [None]:
import sagemaker
from pathlib import Path
from sagemaker.predictor import json_serializer
import json
import numpy as np
import boto3

In [None]:
role = sagemaker.get_execution_role()
session = sagemaker.Session()

## Setup Path 

In [None]:
# location for train.csv, val.csv and labels.csv
DATA_PATH = Path("../sm-data/")   

# Location for storing training_config.json
CONFIG_PATH = DATA_PATH/'config'
CONFIG_PATH.mkdir(exist_ok=True)

suffix = str(np.random.uniform())[4:9]

# S3 bucket name
bucket = 'toxic-pytorch-sagemaker-' + suffix

# Prefix for S3 bucket for input and output
prefix = 'toxic_comments/input'
prefix_output = 'toxic_comments/output'
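
S3 bucket names must be globally unique, which is why the cell above appends a random suffix. Slicing the string form of a random float can, in rare cases, yield fewer than five characters, so here is an alternative sketch that uses `uuid` and checks the basic naming rules. Both helper names are hypothetical, and the check covers only the core rules (length and allowed characters), not every S3 restriction.

```python
import re
import uuid


def make_bucket_name(base="toxic-pytorch-sagemaker"):
    """Append an 8-character random hex suffix to the base name."""
    return "{}-{}".format(base, uuid.uuid4().hex[:8])


def is_valid_bucket_name(name):
    """Check the basic S3 rules: 3-63 characters; lowercase letters,
    digits, dots and hyphens; starting and ending with a letter or digit."""
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", name) is not None


candidate = make_bucket_name()
print(candidate, is_valid_bucket_name(candidate))
```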

In [None]:
!aws s3 mb s3://{bucket}

## Hyperparameters & Training Config

In [None]:
hyperparameters = {
    "epochs": 10,
    "lr": 8e-5,
    "max_seq_length": 512,
    "train_batch_size": 16,
    "lr_schedule": "warmup_cosine",
    "warmup_steps": 1000,
    "optimizer_type": "adamw"
}
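
To give a feel for what `"lr_schedule": "warmup_cosine"` with `"warmup_steps": 1000` implies, the sketch below computes a learning rate that rises linearly to `lr` over the warmup steps and then follows a half cosine down to zero. This is the common definition of the schedule; the exact formula used inside the container is an assumption here, and `total_steps` is a made-up value for illustration.

```python
import math


def warmup_cosine_lr(step, base_lr=8e-5, warmup_steps=1000, total_steps=10000):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clipped to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 500, 1000, 5500, 10000):
    print(step, warmup_cosine_lr(step))
```

The peak learning rate is reached exactly at the end of warmup, so a longer warmup delays the point at which training runs at full `lr`.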

In [None]:
training_config = {
    "run_text": "toxic comments",
    "finetuned_model": None,
    "do_lower_case": "True",
    "train_file": "train.csv",
    "val_file": "val.csv",
    "label_file": "labels.csv",
    "text_col": "comment_text",
    "label_col": '["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]',
    "multi_label": "True",
    "grad_accumulation_steps": "1",
    "fp16_opt_level": "O1",
    "fp16": "True",
    "model_type": "roberta",
    "model_name": "roberta-base",
    "logging_steps": "300"
}

with open(CONFIG_PATH/'training_config.json', 'w') as f:
    json.dump(training_config, f)
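
Note that the config stores booleans and numbers as strings, stores `label_col` as a JSON-encoded list inside a string, and turns `None` into JSON `null` when written. The sketch below shows how a consumer might decode such values after reading the file back. `parse_flag` is a hypothetical helper, the `config` dict is a shortened copy for illustration, and how the container's `train` program actually parses the file is an assumption.

```python
import json


def parse_flag(value):
    """Interpret "True"/"False" strings (case-insensitive) as booleans."""
    return str(value).strip().lower() == "true"


# Shortened copy of the training config, for illustration only.
config = {
    "finetuned_model": None,
    "multi_label": "True",
    "label_col": '["toxic", "threat"]',
    "logging_steps": "300",
}

# Round-trip through JSON, as happens when the file is written and read back.
restored = json.loads(json.dumps(config))

labels = json.loads(restored["label_col"])       # nested JSON string -> list
multi_label = parse_flag(restored["multi_label"])
logging_steps = int(restored["logging_steps"])
```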

## Upload Data

In [None]:
# upload_data is a helper that copies local files (or a whole directory)
# to the S3 bucket and returns the resulting S3 URI.

s3_input = session.upload_data(str(DATA_PATH), bucket=bucket, key_prefix=prefix)

session.upload_data(str(DATA_PATH/'val.csv'), bucket=bucket, key_prefix=prefix)

In [None]:
session.upload_data(str(DATA_PATH/'labels.csv'), bucket=bucket , key_prefix=prefix)

In [None]:
session.upload_data(str(DATA_PATH/'train.csv'), bucket=bucket , key_prefix=prefix)

## Create an Estimator and start training

In [None]:
!aws configure get region

In [None]:
#account = session.boto_session.client('sts').get_caller_identity()['Account']
#region = session.boto_session.region_name

#image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-bert:1.0-gpu-py36".format(account, region)

#Please use only the following images - 
#US West 2 - 111652037296.dkr.ecr.us-west-2.amazonaws.com/chazarey-sagemaker-fast-bert:1.0-gpu-py36
#US East 1 - 111652037296.dkr.ecr.us-east-1.amazonaws.com/chazarey-sagemaker-fast-bert-copied:1.0-gpu-py36

image = "111652037296.dkr.ecr.us-west-2.amazonaws.com/chazarey-sagemaker-fast-bert:1.0-gpu-py36"
#TODO Convert this to using SM Pytorch 
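
An ECR image URI follows the pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`, so the prebuilt image can be picked by region instead of being hard-coded. The sketch below does this with the account, repository names and tag listed in the cell above; `ecr_image_uri`, `prebuilt_image` and the `PREBUILT_REPOS` mapping are hypothetical names introduced for illustration.

```python
# Repository name per region, taken from the image URIs listed above.
PREBUILT_REPOS = {
    "us-west-2": "chazarey-sagemaker-fast-bert",
    "us-east-1": "chazarey-sagemaker-fast-bert-copied",
}


def ecr_image_uri(account, region, repository, tag):
    """Assemble a full ECR image URI from its parts."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)


def prebuilt_image(region, account="111652037296", tag="1.0-gpu-py36"):
    """Return the prebuilt image URI for a supported region."""
    if region not in PREBUILT_REPOS:
        raise ValueError("No prebuilt image for region: {}".format(region))
    return ecr_image_uri(account, region, PREBUILT_REPOS[region], tag)


print(prebuilt_image("us-west-2"))
```

In the notebook, the region could come from `session.boto_session.region_name`, as in the commented-out lines above.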

In [None]:
output_path = "s3://{}/{}".format(bucket, prefix_output)

In [None]:
estimator = sagemaker.estimator.Estimator(image, 
                                          role,
                                          train_instance_count=1, 
                                          train_instance_type='ml.p3.8xlarge', 
                                          output_path=output_path, 
                                          base_job_name='toxic-comments',
                                          hyperparameters=hyperparameters,
                                          sagemaker_session=session
                                         )

In [None]:
estimator.fit(s3_input)

## Deploy the model to the hosting service

In [None]:
predictor = estimator.deploy(1, 
                             'ml.m5.large', 
                             endpoint_name='bert-toxic-comments', 
                             serializer=json_serializer)

In [None]:
### Invoke the Endpoint
client = boto3.client('sagemaker-runtime')

sample_payload='{"text": "this is really really good thanks for recommending!!"}'

response = client.invoke_endpoint(
    EndpointName='bert-toxic-comments',
    Body=sample_payload,
    ContentType='application/json'
)
print('Our result for this payload is: {}'.format(response['Body'].read().decode('ascii')))
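
Hand-writing the JSON payload works for simple text, but quotes or non-ASCII characters in a comment can break it. A slightly more robust sketch builds the body with `json.dumps` and decodes the response as UTF-8. Both helper names are hypothetical, and the assumption that the endpoint returns JSON is not confirmed by this notebook.

```python
import json


def build_payload(text):
    """Serialize the request body; json.dumps handles quoting and escaping."""
    return json.dumps({"text": text})


def decode_response(raw_bytes):
    """Decode the endpoint's response body, assuming it is UTF-8 JSON."""
    return json.loads(raw_bytes.decode("utf-8"))


payload = build_payload('she said "this is really good", thanks!')
print(payload)
```

The result of `build_payload` can be passed as `Body` to `invoke_endpoint`, and `response['Body'].read()` can be passed to `decode_response`.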