# Multi-label Text Classification using BERT

This notebook is adapted from two blog posts by Kaushal Trivedi, [1](https://medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models-89ff763ad384) and [2](https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d), and the associated [GitHub repository](https://github.com/kaushaltrivedi/fast-bert).

Let's understand what's happening here. This is how we use SageMaker to fine-tune Hugging Face BERT models:

![SageMaker Architecture](../img/sagemaker-architecture.png)

### Principal components of the architecture

*Container* - we start off this lab by building our own container, then use the SageMaker service to train with it and deploy the resulting model. I have commented out the build step because, as of Nov 28 2019, the resulting container cannot train properly due to an unmet dependency. As of this writing I am still debugging it. Building this container from scratch and pushing it to ECR takes around 22 minutes of wall-clock time on ml.p2.xlarge.

Once the container is ready we proceed with the lab. 

In [None]:
#!../container/build_and_push.sh
# Prebuilt containers are available to pull in us-east-1 and us-west-2.

In [None]:
import sagemaker
from pathlib import Path
from sagemaker.predictor import json_serializer
import json
import numpy as np
import boto3

In [None]:
role = sagemaker.get_execution_role()
session = sagemaker.Session()

## Setup Path 

In [None]:
# location for train.csv, val.csv and labels.csv
DATA_PATH = Path("../sm-data/")   

# Location for storing training_config.json
CONFIG_PATH = DATA_PATH/'config'
# parents=True also creates DATA_PATH if it does not exist yet
CONFIG_PATH.mkdir(parents=True, exist_ok=True)

suffix = str(np.random.uniform())[4:9]

# S3 bucket name
bucket = 'toxic-pytorch-sagemaker-' + suffix

# Prefix for S3 bucket for input and output
prefix = 'toxic_comments/input'
prefix_output = 'toxic_comments/output'
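
The suffix above is sliced out of the string form of a random float, so it can occasionally come out shorter than five characters or contain a character other than a digit. As a minimal sketch of an alternative, assuming only that the bucket name needs a short random lowercase suffix, the hypothetical helper `make_bucket_name` below (not part of the original notebook) builds the name from a UUID instead:

```python
import uuid


def make_bucket_name(base="toxic-pytorch-sagemaker-", suffix_len=5):
    """Return base plus a random lowercase hex suffix of fixed length.

    Hypothetical helper for illustration; S3 bucket names may only contain
    lowercase letters, digits, hyphens and dots, which hex digits satisfy.
    """
    suffix = uuid.uuid4().hex[:suffix_len]
    return base + suffix
```

Calling `bucket = make_bucket_name()` would then replace the two lines that build `suffix` and `bucket`.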

In [None]:
!aws s3 mb s3://{bucket}

## Hyperparameters & Training Config

In [None]:
hyperparameters = {
    "epochs": 10,
    "lr": 8e-5,
    "max_seq_length": 512,
    "train_batch_size": 16,
    "lr_schedule": "warmup_cosine",
    "warmup_steps": 1000,
    "optimizer_type": "adamw"
}
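
To give a feel for what `"lr_schedule": "warmup_cosine"` with `"warmup_steps": 1000` typically means, here is a rough sketch of a linear warmup followed by a cosine decay. This is an illustration only, not the scheduler code the container actually runs; the function name `warmup_cosine_lr` and the `total_steps` value are assumptions made for the example.

```python
import math


def warmup_cosine_lr(step, base_lr=8e-5, warmup_steps=1000, total_steps=10000):
    """Illustrative learning rate at a given optimizer step.

    Linear ramp from 0 to base_lr over warmup_steps, then cosine decay
    from base_lr down to 0 at total_steps (total_steps is assumed here).
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, clamped to 1.0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate peaks at `lr` once warmup ends and falls smoothly afterwards, which is why a larger `warmup_steps` delays the point at which the model trains at full rate.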

In [None]:
training_config = {
    "run_text": "toxic comments",
    "finetuned_model": None,
    "do_lower_case": "True",
    "train_file": "train.csv",
    "val_file": "val.csv",
    "label_file": "labels.csv",
    "text_col": "comment_text",
    "label_col": '["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]',
    "multi_label": "True",
    "grad_accumulation_steps": "1",
    "fp16_opt_level": "O1",
    "fp16": "True",
    "model_type": "roberta",
    "model_name": "roberta-base",
    "logging_steps": "300"
}

with open(CONFIG_PATH/'training_config.json', 'w') as f:
    json.dump(training_config, f)
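
Most values in `training_config` are stored as strings, including the booleans, the integers and the JSON-encoded list in `label_col`. How the training container decodes them is not shown in this notebook, so the following is only a sketch of one plausible way to turn them back into Python values; `decode_config`, `BOOL_KEYS` and `INT_KEYS` are hypothetical names introduced for the example.

```python
import json

# Keys whose values are string-encoded in training_config (taken from the cell above)
BOOL_KEYS = ("do_lower_case", "multi_label", "fp16")
INT_KEYS = ("grad_accumulation_steps", "logging_steps")


def decode_config(raw):
    """Return a copy of raw with string-encoded fields converted to Python types."""
    decoded = dict(raw)
    for key in BOOL_KEYS:
        if key in decoded:
            decoded[key] = str(decoded[key]).strip().lower() == "true"
    for key in INT_KEYS:
        if key in decoded:
            decoded[key] = int(decoded[key])
    # label_col holds a JSON list serialized as a string
    if isinstance(decoded.get("label_col"), str):
        decoded["label_col"] = json.loads(decoded["label_col"])
    return decoded
```

For instance, `decode_config(json.load(open(CONFIG_PATH/'training_config.json')))` can be used to check that the file just written round-trips into the expected types.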

## Upload Data

In [None]:
# upload_data is a helper that copies local files to the S3 bucket.
# Passing the DATA_PATH directory uploads the files found under it.

s3_input = session.upload_data(DATA_PATH, bucket=bucket, key_prefix=prefix)

session.upload_data(str(DATA_PATH/'val.csv'), bucket=bucket, key_prefix=prefix)

In [None]:
session.upload_data(str(DATA_PATH/'labels.csv'), bucket=bucket, key_prefix=prefix)

In [None]:
session.upload_data(str(DATA_PATH/'train.csv'), bucket=bucket, key_prefix=prefix)
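
A training job that fails on a missing column wastes GPU time, so it can be worth checking the CSV headers locally. The sketch below assumes that `train.csv` and `val.csv` have a header row containing the `text_col` and `label_col` names from the training config; `missing_columns` is a hypothetical helper, not part of the SageMaker SDK or fast-bert.

```python
import csv


def missing_columns(csv_path, text_col, label_cols):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Read only the first row; an empty file yields an empty header
        header = next(csv.reader(f), [])
    required = [text_col] + list(label_cols)
    return [col for col in required if col not in header]
```

For example, `missing_columns(DATA_PATH/'train.csv', "comment_text", ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])` should return an empty list if the file matches the config.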

## Create an Estimator and start training

In [None]:
!aws configure get region

In [None]:
#account = session.boto_session.client('sts').get_caller_identity()['Account']
#region = session.boto_session.region_name

#image = "{}.dkr.ecr.{}.amazonaws.com/sagemaker-bert:1.0-gpu-py36".format(account, region)

# Please use only the following images, matching the region you are running in:
# US West 2 - 111652037296.dkr.ecr.us-west-2.amazonaws.com/chazarey-sagemaker-fast-bert:1.0-gpu-py36
# US East 1 - 111652037296.dkr.ecr.us-east-1.amazonaws.com/chazarey-sagemaker-fast-bert-copied:1.0-gpu-py36

image = "111652037296.dkr.ecr.us-west-2.amazonaws.com/chazarey-sagemaker-fast-bert:1.0-gpu-py36"
#TODO Convert this to using SM Pytorch 
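
Since the prebuilt image differs per region, one option is to pick it from the session's region instead of hard-coding it. The sketch below uses the two image URIs listed in the cell above, keyed by the region that appears in each URI; `IMAGES_BY_REGION` and `image_for_region` are hypothetical names introduced for the example.

```python
# Prebuilt image URIs from this notebook, keyed by the region in the URI
IMAGES_BY_REGION = {
    "us-west-2": "111652037296.dkr.ecr.us-west-2.amazonaws.com/chazarey-sagemaker-fast-bert:1.0-gpu-py36",
    "us-east-1": "111652037296.dkr.ecr.us-east-1.amazonaws.com/chazarey-sagemaker-fast-bert-copied:1.0-gpu-py36",
}


def image_for_region(region):
    """Return the prebuilt image URI for region, or raise if none is listed."""
    try:
        return IMAGES_BY_REGION[region]
    except KeyError:
        raise ValueError(
            "No prebuilt image for region {!r}; supported: {}".format(
                region, sorted(IMAGES_BY_REGION)
            )
        )
```

With the session created earlier, this could be used as `image = image_for_region(session.boto_session.region_name)`.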

In [None]:
output_path = "s3://{}/{}".format(bucket, prefix_output)

In [None]:
estimator = sagemaker.estimator.Estimator(image, 
                                          role,
                                          train_instance_count=1, 
                                          train_instance_type='ml.p3.8xlarge', 
                                          output_path=output_path, 
                                          base_job_name='toxic-comments',
                                          hyperparameters=hyperparameters,
                                          sagemaker_session=session
                                         )

In [None]:
estimator.fit(s3_input)

## Deploy the model to hosting service

In [None]:
predictor = estimator.deploy(1, 
                             'ml.m5.large', 
                             endpoint_name='bert-toxic-comments', 
                             serializer=json_serializer)

In [None]:
# Invoke the endpoint
client = boto3.client('sagemaker-runtime')

sample_payload='{"text": "this is really really good thanks for recommending!!"}'

response = client.invoke_endpoint(
    EndpointName='bert-toxic-comments',
    Body=sample_payload,
    ContentType='application/json'
)
# Decode as UTF-8 so a response containing non-ASCII characters does not raise
print('Our result for this payload is: {}'.format(response['Body'].read().decode('utf-8')))
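
The sample payload above is a hand-written JSON string, which breaks as soon as the text contains a double quote or a backslash. A small sketch of building the same body with `json.dumps`, assuming the endpoint expects a single `"text"` key as in the sample; `build_payload` is a hypothetical helper name.

```python
import json


def build_payload(text):
    """Serialize text into the JSON body format used by the sample payload."""
    return json.dumps({"text": text})
```

The result can be passed as `Body=build_payload(comment)` in the `invoke_endpoint` call.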