## Deploy Active Learning for Text Classification Algorithm Model Package from AWS Marketplace 

Active Learning for Text Classification trains a text classification model on a small corpus of labeled training data and then selects, from a large corpus of unlabeled data, the samples whose annotation is expected to improve model accuracy the most. By identifying the most informative samples to tag first, the algorithm reduces the time and effort needed to build a usable machine learning model.

Use it to prioritize the data labeling task and thereby reduce the tagging effort required to build a working machine learning model.

The solution works iteratively: sample data points, have them labeled, and retrain the supervised machine learning model. At each iteration it uses what the current model has learned to decide which unlabeled samples to label next, so that a better-performing model can be built faster than with randomly chosen samples.

This sample notebook shows you how to deploy Active Learning for Text Classification Algorithm using Amazon SageMaker.

> **Note**: This is a reference notebook and it cannot run unless you make the changes suggested in the notebook.

#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy this ML model successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to Active Learning for Text Classification. If so, skip step: [Subscribe to the model package](#1.-Subscribe-to-the-model-package)
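
The three AWS Marketplace permissions listed above could be granted through an IAM policy document. The following is a minimal sketch of such a document built in Python; the `"Resource": "*"` scoping and the helper name `build_marketplace_policy` are assumptions made for illustration, so adapt them to your organization's policies.

```python
import json

# The three AWS Marketplace actions listed in the prerequisites.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]


def build_marketplace_policy(actions=MARKETPLACE_ACTIONS):
    """Return an IAM policy document (as a dict) that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Assumption: no resource scoping; tighten this if required.
                "Resource": "*",
            }
        ],
    }


policy_json = json.dumps(build_marketplace_policy(), indent=2)
print(policy_json)
```

The printed JSON can be pasted into the IAM console as an inline policy for the notebook's execution role.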

#### Contents:
1. [Subscribe to the model package](#1.-Subscribe-to-the-model-package)
2. [Usage Instruction](#2.-Usage-Instruction)
3. [Training](#3.-Training)
4. [Sample Data](#4.-Sample-Data)
5. [Perform batch inference](#5.-Perform-batch-inference)
6. [Create an endpoint and perform real-time inference](#6.-Create-an-endpoint-and-perform-real-time-inference)
   1. [Create an endpoint](#A.-Create-an-endpoint)
   2. [Perform real-time inference](#C.-Perform-real-time-inference)
7. [Clean-up](#7.-Clean-up)
    1. [Delete the model](#A.-Delete-the-model)
    2. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))
    

#### Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

### 1. Subscribe to the model package

To subscribe to the model package:
1. Open the algorithm listing page **Active Learning for Text Classification**.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **"Accept Offer"** if you and your organization agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn** displayed. This is the ARN that you need to specify when creating the estimator with the SageMaker Python SDK. Copy the ARN corresponding to your region and assign it to `algorithm_arn` in the **Algorithm ARN** cell.
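
Because the ARN is region-specific, it can help to check that the region embedded in the copied ARN matches the region of your SageMaker session before starting a job. The sketch below relies only on the general ARN layout (`arn:partition:service:region:account-id:resource`); the helper names and the example ARN are hypothetical.

```python
def arn_region(arn):
    """Return the region field of an ARN string."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[3]


def check_arn_region(arn, session_region):
    """Raise ValueError if the ARN belongs to a different region than the session."""
    region = arn_region(arn)
    if region != session_region:
        raise ValueError(
            f"ARN is for region {region!r} but the session uses {session_region!r}"
        )
    return region


# Hypothetical ARN used only to illustrate the check.
example_arn = "arn:aws:sagemaker:us-east-2:123456789012:algorithm/example-algorithm"
print(check_arn_region(example_arn, "us-east-2"))
```

In this notebook the session region would come from `sagemaker_session.boto_region_name`.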

### 2. Usage Instruction

The deployed solution has **2 steps**: training the algorithm and testing (sampling).

* The system trains on a user-provided text dataset.
* The training dataset must contain 2 files, "train.csv" and "validation.csv", both with 'utf-8' encoding.
* The machine learning model is built in the training step; once it is generated, it can be used to sample data points from unlabelled data using an active learning technique.
* The testing API takes a csv file, "unlabelled.csv", from which the data points are sampled for human annotation.
* In this usage instruction notebook, detailed steps are given before each cell.

**train.csv**
* train.csv must contain 3 columns: <b>ID</b>, <b>Text</b> and <b>Category</b>. <b>ID</b> is a unique identification number associated with the text, <b>Text</b> holds the textual data that needs to be categorized, and <b>Category</b> is the class associated with the text.
* It is recommended to start with a small amount of annotated data in train.csv and to increase the training data size iteratively by sampling data points from the unlabelled data.
* After each iteration of model building and text sampling, the sampled data returned by the model should be annotated and added to train.csv. The same data points should be removed from the unlabelled dataset.

**validation.csv**
* The format of validation.csv is the same as that of train.csv: it must contain 3 columns, <b>ID</b>, <b>Text</b> and <b>Category</b>.
* validation.csv is used to check the accuracy of the model.
* The contents of validation.csv should not be altered between iterations, so that it shows the change in accuracy after each round of sampling, annotation and training.
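
Both labelled files share the same three-column layout, so a quick local check before uploading can catch a missing column or a repeated ID. This is a minimal sketch using only the standard library; the function name and the two sample rows are illustrative and are not part of the algorithm package.

```python
import csv
import io

REQUIRED_LABELLED = ("ID", "Text", "Category")


def check_labelled_csv(stream, required=REQUIRED_LABELLED):
    """Validate the header and ID uniqueness; return the number of data rows."""
    reader = csv.DictReader(stream)
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    seen = set()
    for row in reader:
        if row["ID"] in seen:
            raise ValueError(f"duplicate ID: {row['ID']}")
        seen.add(row["ID"])
    return len(seen)


# Illustrative in-memory data standing in for train.csv or validation.csv.
sample = io.StringIO(
    "ID,Text,Category\n"
    "1,stocks rally on earnings,business\n"
    "2,late goal wins the derby,sport\n"
)
print(check_labelled_csv(sample))
```

For a file on disk, open it with `open(path, encoding="utf-8", newline="")` and pass the file object to the function.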
    
**unlabelled.csv**
* unlabelled.csv must contain 2 columns: <b>ID</b> and <b>Text</b>.
* unlabelled.csv is the large corpus of unlabelled data from which data points will be sampled for annotation. Sampling uses an active learning technique instead of random sampling.
* Active learning sampling helps attain better model accuracy in less time, because far fewer data points need to be labelled.
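
The listing does not state which active learning strategy the algorithm uses. As an illustration of the general idea only, the sketch below shows least-confidence sampling, one common strategy: the items for which the model's highest class probability is lowest are selected for annotation first. The probabilities and IDs are made up.

```python
def least_confidence_sample(probabilities, k):
    """Return the IDs of the k items whose top class probability is lowest.

    probabilities: dict mapping item ID -> list of class probabilities.
    """
    # A low maximum probability means the model is unsure about the item.
    ranked = sorted(probabilities.items(), key=lambda item: max(item[1]))
    return [item_id for item_id, _ in ranked[:k]]


# Hypothetical model outputs for four unlabelled texts over three classes.
probs = {
    101: [0.90, 0.05, 0.05],
    102: [0.40, 0.35, 0.25],
    103: [0.55, 0.30, 0.15],
    104: [0.34, 0.33, 0.33],
}
print(least_confidence_sample(probs, 2))  # [104, 102]
```

Random sampling would pick any two of the four items with equal chance, whereas this strategy concentrates annotation effort on the texts the current model finds hardest.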
    
#### Output:
* Content type: `text/csv`.
* The output is a csv file with data points sampled from the unlabelled data. It contains the ID and the associated text of each data point selected for human annotation.
* These sampled data points should be annotated, added to train.csv and removed from unlabelled.csv before the machine learning model is trained again.
* The model is then trained on a larger data set that better captures the variability present in the data, and it reports an updated accuracy on the validation dataset.
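
The bookkeeping between two iterations (add the annotated samples to the training set and remove them from the unlabelled pool) can be sketched as follows. The rows are represented as plain dictionaries with the column names described above; the function name and the example rows are hypothetical.

```python
def apply_annotations(train_rows, unlabelled_rows, annotations):
    """Move annotated samples from the unlabelled pool into the training set.

    train_rows: list of dicts with keys ID, Text, Category
    unlabelled_rows: list of dicts with keys ID, Text
    annotations: dict mapping a sampled ID to the Category chosen by the annotator
    Returns (new_train_rows, remaining_unlabelled_rows); inputs are not modified.
    """
    new_train = list(train_rows)
    remaining = []
    for row in unlabelled_rows:
        if row["ID"] in annotations:
            new_train.append(
                {"ID": row["ID"], "Text": row["Text"], "Category": annotations[row["ID"]]}
            )
        else:
            remaining.append(row)
    return new_train, remaining


# Illustrative data: one labelled row, two unlabelled rows, one new annotation.
train = [{"ID": "1", "Text": "stocks rally on earnings", "Category": "business"}]
pool = [
    {"ID": "7", "Text": "late goal wins the derby"},
    {"ID": "8", "Text": "new phone released"},
]
train, pool = apply_annotations(train, pool, {"7": "sport"})
print(len(train), len(pool))  # 2 1
```

After this step the updated rows would be written back to train.csv and unlabelled.csv before the next training job.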


#### Invoking endpoint
##### AWS CLI Command
If you are using real-time inference, create the endpoint first and then use the following command to invoke it:
``` bash 
aws sagemaker-runtime invoke-endpoint --endpoint-name "endpoint-name" --body fileb://$file_name --content-type text/csv output.csv
```
Substitute the following parameters:
* `"endpoint-name"` - name of the inference endpoint where the model is deployed.
* `file_name` - path of the input csv file (for example, unlabelled.csv) that is sent to the endpoint.
* `text/csv` - content type of the given input file.
* `output.csv` - filename where the inference results are written to.
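
The same call can be made from Python through the `invoke_endpoint` operation of the boto3 `sagemaker-runtime` client. The sketch below wraps that call in a small helper that takes the client as an argument; the helper name is hypothetical, and running the commented usage requires AWS credentials and a deployed endpoint.

```python
def invoke_csv(runtime_client, endpoint_name, payload):
    """Send CSV bytes to a SageMaker endpoint and return the response bytes."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    # The response body is a streaming object; read() returns the raw bytes.
    return response["Body"].read()


# Usage (assumes an endpoint named "endpoint-name" exists in your account):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# with open("data/transform/unlabelled.csv", "rb") as f:
#     result = invoke_csv(runtime, "endpoint-name", f.read())
# with open("output.csv", "wb") as f:
#     f.write(result)
```

Passing the client in keeps the helper easy to exercise with a stub object when no AWS account is available.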

In [1]:
# Imports used by this notebook (duplicate and unused imports removed)
import os

import boto3
import pandas as pd
import sagemaker as sage
from sagemaker import ModelPackage, get_execution_role

In [2]:
role = get_execution_role()

sagemaker_session = sage.Session()

bucket=sagemaker_session.default_bucket()
bucket

'sagemaker-us-east-2-786796469737'

In [3]:
# S3 prefixes
common_prefix = "active-learning"
training_input_prefix = common_prefix + "/training-input-data"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"

In [4]:
sagemaker_session = sage.Session()

In [5]:
TRAINING_WORKDIR = "data/training"

TRAINING_DATA = TRAINING_WORKDIR + "/"

In [6]:
# Upload the local training folder (train.csv and validation.csv) to S3.
# TRAINING_WORKDIR is defined in the previous cell.
training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)

In [7]:
training_input

's3://sagemaker-us-east-2-786796469737/active-learning/training-input-data'

### 3. Training 

In [8]:
import json
import time
from sagemaker.algorithm import AlgorithmEstimator

##### Algorithm ARN

In [9]:
algorithm_arn ='arn:aws:sagemaker:us-east-2:786796469737:algorithm/active-learning-text-classification-v04'

In [10]:
algo = AlgorithmEstimator(
    algorithm_arn=algorithm_arn,
    role=role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    base_job_name='active-learning-marketplace')

In [11]:
print(f"Now run the training job using algorithm arn {algorithm_arn} in region {sagemaker_session.boto_region_name}")
algo.fit({'training': training_input})

Now run the training job using algorithm arn arn:aws:sagemaker:us-east-2:786796469737:algorithm/active-learning-text-classification-v04 in region us-east-2
2021-05-24 03:54:12 Starting - Starting the training job...
2021-05-24 03:54:35 Starting - Launching requested ML instancesProfilerReport-1621828452: InProgress
......
2021-05-24 03:55:35 Starting - Preparing the instances for training...
2021-05-24 03:56:12 Downloading - Downloading input data
2021-05-24 03:56:12 Training - Downloading the training image......
2021-05-24 03:57:10 Training - Training image download completed. Training in progress..Training Starts
Accuracy : 0.84
['text_classification_model.pkl', 'vectorizer.pk']
/opt/ml/model
Accuracy: train=84.000

2021-05-24 03:57:36 Uploading - Uploading generated training model
2021-05-24 03:57:36 Completed - Training job completed
Training seconds: 86
Billable seconds: 86


### 4. Sample Data

In [12]:
import os

# Local folder and file name of the unlabelled data used for sampling
TRANSFORM_WORKDIR = "data/transform"
api_inputfile = "unlabelled.csv"
filename = os.path.join(TRANSFORM_WORKDIR, api_inputfile)
filepath = os.path.join(os.getcwd(), filename)

In [13]:
unlabelled = pd.read_csv(filename)
print("Unlabelled Data Shape",unlabelled.shape,"\n")
unlabelled.head()

Unlabelled Data Shape (150, 3) 



| Unnamed: 0.1 | Unnamed: 0 | ID | Text |
|---|---|---|---|
| 0 | 0 | 1149 | council tax rise reasonable welsh councils s... |
| 1 | 1 | 1524 | fa charges liverpool and millwall liverpool an... |
| 2 | 2 | 2198 | campbell returns to election team ex-downing s... |
| 3 | 3 | 346 | clijsters could play aussie open kim clijsters... |
| 4 | 4 | 1370 | anelka eyes man city departure striker nicol... |
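
The preview above shows extra `Unnamed` columns, which typically appear when a DataFrame is saved together with its index. Since the usage instructions state that unlabelled.csv must contain the two columns ID and Text, it may be worth stripping any other columns before uploading. This is a minimal standard-library sketch; the helper name and the sample rows are illustrative.

```python
import csv
import io


def keep_columns(src, dst, columns=("ID", "Text")):
    """Copy a CSV from src to dst, keeping only the given columns; return the row count."""
    reader = csv.DictReader(src)
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # extrasaction="ignore" drops every column that is not listed in fieldnames.
    writer = csv.DictWriter(dst, fieldnames=list(columns), extrasaction="ignore")
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Illustrative input with two stray index columns.
raw = io.StringIO(
    ",Unnamed: 0,ID,Text\n"
    "0,0,1149,council tax rise\n"
    "1,1,1524,fa charges liverpool\n"
)
cleaned = io.StringIO()
keep_columns(raw, cleaned)
print(cleaned.getvalue())
```

When working with pandas instead, writing the file with `to_csv(path, index=False)` avoids creating the extra column in the first place.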


### 5. Perform batch inference

In this section, you will perform batch inference using multiple input payloads together. If you are not familiar with batch transform, and want to learn more, see these links:
1. [How it works](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-batch-transform.html)
2. [How to run a batch transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html)

In [14]:
TRANSFORM_WORKDIR = "data/transform"
transform_input = sagemaker_session.upload_data(TRANSFORM_WORKDIR, key_prefix=batch_inference_input_prefix) + "/unlabelled.csv"
print("Transform input uploaded to " + transform_input)

Transform input uploaded to s3://sagemaker-us-east-2-786796469737/active-learning/batch-inference-input-data/unlabelled.csv


In [15]:
transformer = algo.transformer(1, 'ml.m4.xlarge')
transformer.transform(transform_input, content_type='text/csv')
transformer.wait()

print("Batch Transform output saved to " + transformer.output_path)

..........
.........................Starting the inference server with 4 workers.
[2021-05-24 04:06:42 +0000] [12] [INFO] Starting gunicorn 20.1.0
[2021-05-24 04:06:42 +0000] [12] [INFO] Listening at: unix:/tmp/gunicorn.sock (12)
[2021-05-24 04:06:42 +0000] [12] [INFO] Using worker: gevent
[2021-05-24 04:06:42 +0000] [16] [INFO] Booting worker with pid: 16
[2021-05-24 04:06:42 +0000] [17] [INFO] Booting worker with pid: 17
[2021-05-24 04:06:42 +0000] [18] [INFO] Booting worker with pid: 18
[2021-05-24 04:06:42 +0000] [19] [INFO] Booting worker with pid: 19

#### Inspect the Batch Transform Output in S3

In [16]:
from urllib.parse import urlparse

# Split the S3 output URI into bucket and key prefix
parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
# Batch transform names each result after its input file plus ".out"
file_key = '{}/{}.out'.format(parsed_url.path[1:], api_inputfile)

s3_client = sagemaker_session.boto_session.client('s3')

# Read from the bucket named in the output URI
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
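
The bucket and key handling in the cell above can be isolated in a small helper, which makes the naming rule (input file name plus `.out` under the output prefix) easier to verify without touching S3. This is a sketch; the helper name and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def transform_output_location(output_path, input_filename):
    """Split an S3 output URI into (bucket, key) for one input file's .out result."""
    parsed = urlparse(output_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {output_path!r}")
    prefix = parsed.path.strip("/")
    key = f"{prefix}/{input_filename}.out" if prefix else f"{input_filename}.out"
    return parsed.netloc, key


# Hypothetical output path, shaped like transformer.output_path.
print(transform_output_location("s3://example-bucket/example-transform-job", "unlabelled.csv"))
# ('example-bucket', 'example-transform-job/unlabelled.csv.out')
```

Unlike indexing into a split string, this also works when the output prefix has more than one path segment.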

In [17]:
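# Key prefix of the transform output: the path component of the S3 output URI
bucketFolder = urlparse(transformer.output_path).path.lstrip('/')

In [18]:
import boto3

# Download the transform result from the bucket named in the output URI
# instead of a hardcoded bucket name
s3_conn = boto3.client("s3")
bucket_name = urlparse(transformer.output_path).netloc
with open('output.csv', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder + '/' + api_inputfile + '.out', f)
print("Output file loaded from bucket")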

Output file loaded from bucket


In [19]:
output = pd.read_csv('output.csv')

In [31]:
output.head(10)

| Unnamed: 0.1 | Unnamed: 0 | ID | Text |
|---|---|---|---|
| 0 | 0 | 1697 | us airways staff agree to pay cut a union repr... |
| 1 | 1 | 935 | civil servants in strike ballot the uk s bigge... |
| 2 | 2 | 2108 | game warnings must be clearer violent video ... |
| 3 | 3 | 896 | jp morgan admits us slavery links thousands of... |
| 4 | 4 | 1838 | watchdog probes e-mail deletions the informati... |
| 5 | 5 | 967 | italy to get economic action plan italian prim... |
| 6 | 6 | 1465 | bmw cash to fuel mini production less than fou... |
| 7 | 7 | 1605 | kilroy launches veritas party ex-bbc chat sh... |
| 8 | 8 | 1156 | bid to cut court witness stress new targets to... |
| 9 | 9 | 1217 | uefa approves fake grass uefa says it will all... |


### 6. Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [21]:
model_name='active-learning'

content_type='text/csv'

real_time_inference_instance_type='ml.m5.xlarge'
batch_transform_inference_instance_type='ml.m5.large'

##### Algorithm ARN

In [22]:
algorithm_arn ='arn:aws:sagemaker:us-east-2:786796469737:algorithm/active-learning-text-classification-v04'

#### A. Create an endpoint

In [23]:
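from sagemaker.serializers import CSVSerializer

def predict_wrapper(endpoint, session):
    # SageMaker Python SDK v2 replaced RealTimePredictor with Predictor;
    # the request content type (text/csv) is set by the serializer.
    return sage.predictor.Predictor(endpoint, session, serializer=CSVSerializer())

# Create a deployable model object. Note: the endpoint below is deployed
# from the trained estimator `algo`, so this object is not used further.

model = ModelPackage(role=role,
                    model_package_arn=algorithm_arn,
                    sagemaker_session=sagemaker_session,
                    predictor_cls=predict_wrapper)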

In [24]:
#Deploy the model
predictor = algo.deploy(1, 'ml.m4.xlarge',endpoint_name=model_name)

..........
-------------!

Once the endpoint has been created, you can perform real-time inference.

#### C. Perform real-time inference

In [25]:
file_name = 'data/transform/unlabelled.csv'

In [26]:
output_file_name = 'output.csv'

In [27]:
!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name 'active-learning' \
    --body fileb://$file_name \
    --content-type 'text/csv' \
    --region us-east-2 \
    $output_file_name

{
    "ContentType": "text/csv; charset=utf-8",
    "InvokedProductionVariant": "AllTraffic"
}


In [28]:
output = pd.read_csv(output_file_name)

In [32]:
output.head(10)

| Unnamed: 0.1 | Unnamed: 0 | ID | Text |
|---|---|---|---|
| 0 | 0 | 1697 | us airways staff agree to pay cut a union repr... |
| 1 | 1 | 935 | civil servants in strike ballot the uk s bigge... |
| 2 | 2 | 2108 | game warnings must be clearer violent video ... |
| 3 | 3 | 896 | jp morgan admits us slavery links thousands of... |
| 4 | 4 | 1838 | watchdog probes e-mail deletions the informati... |
| 5 | 5 | 967 | italy to get economic action plan italian prim... |
| 6 | 6 | 1465 | bmw cash to fuel mini production less than fou... |
| 7 | 7 | 1605 | kilroy launches veritas party ex-bbc chat sh... |
| 8 | 8 | 1156 | bid to cut court witness stress new targets to... |
| 9 | 9 | 1217 | uefa approves fake grass uefa says it will all... |


### 7. Clean-up

#### A. Delete the model

In [30]:
# Delete the model first, then the endpoint (and its configuration)
predictor.delete_model()
predictor.delete_endpoint()

#### B. Unsubscribe to the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

