## Deploy Active Learning for Text Classification Algorithm from AWS Marketplace 

Active Learning for Text Classification trains a text classification model on a small corpus of labeled data, then selects from a large corpus of unlabeled data the samples whose annotation is expected to improve model accuracy the most. By identifying the most informative samples to tag first, the algorithm reduces the time and effort needed to build a usable machine learning model.

Use it to prioritize the data labeling task. In each iteration, the solution samples data points from the unlabeled data based on the rules learned by the current model, you annotate them, and the model is retrained on the enlarged training set. Repeating this loop builds a better-performing supervised machine learning model with less labeling effort than random sampling.

This sample notebook shows you how to deploy the Active Learning for Text Classification algorithm using Amazon SageMaker.

> **Note**: This is a reference notebook and it cannot run unless you make changes suggested in the notebook.

#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has **AmazonSageMakerFullAccess**.
1. To deploy this ML model successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to Active Learning for Text Classification.
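
The three AWS Marketplace permissions listed above can be written as a single IAM policy statement. The following is a minimal sketch, assuming the permissions are granted on all resources (`"Resource": "*"`); the variable name `marketplace_policy` is illustrative, and your organization may require a narrower policy.

```python
import json

# Illustrative policy document granting the three AWS Marketplace actions
# named in the prerequisites. "Resource": "*" is an assumption.
marketplace_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aws-marketplace:ViewSubscriptions",
                "aws-marketplace:Unsubscribe",
                "aws-marketplace:Subscribe",
            ],
            "Resource": "*",
        }
    ],
}

# Print the JSON form of the policy document
print(json.dumps(marketplace_policy, indent=2))
```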

#### Contents:
1. [Subscribe to the Algorithm](#1.-Subscribe-to-the-Algorithm)
2. [Prepare dataset](#2.-Prepare-dataset)
    1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
    2. [Configure and visualize train, validation and test dataset](#B.-Configure-and-visualize-train,-validation-and-test-dataset)
    3. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
3. [Train a machine learning model](#3.-Train-a-machine-learning-model)
    1. [Set up environment](#A.-Set-up-environment)
    2. [Train a model](#B.-Train-a-model)
4. [Deploy model and verify results](#4.-Deploy-model-and-verify-results)
    1. [Deploy trained model](#A.-Deploy-trained-model)
    2. [Create input payload](#B.-Create-input-payload)
    3. [Perform real-time inference](#C.-Perform-real-time-inference)
    4. [Visualize output](#D.-Visualize-output)
    5. [Delete the endpoint](#E.-Delete-the-endpoint)
5. [Perform Batch inference](#5.-Perform-Batch-inference)
    1. [Inspect the Batch Transform Output in S3](#A.-Inspect-the-Batch-Transform-Output-in-S3)
6. [Clean-up](#6.-Clean-up)
    1. [Delete the model](#A.-Delete-the-model)
    2. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))
    

#### Usage instructions
You can run this notebook one cell at a time by pressing Shift+Enter in each cell.

### 1. Subscribe to the Algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page **Active Learning for Text Classification**.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **"Accept Offer"** if you and your organization agree with the EULA, pricing, and support terms.
1. Once you click the **Continue to configuration** button and choose a **region**, a **Product Arn** is displayed. This is the algorithm ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the following cell.

In [1]:
algorithm_arn ='arn:aws:sagemaker:us-east-2:786796469737:algorithm/active-learning-text-classification-v04'
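
Because the ARN is region-specific, a quick sanity check can catch a copy-paste mismatch before training. The following is a minimal sketch; `parse_algorithm_arn` is a hypothetical helper (not part of the SageMaker SDK) that relies only on the general ARN layout `arn:partition:service:region:account-id:resource`.

```python
def parse_algorithm_arn(arn):
    """Split a SageMaker algorithm ARN into region, account and name."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError("Not a SageMaker ARN: %r" % arn)
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "algorithm" or not name:
        raise ValueError("ARN does not point to an algorithm: %r" % arn)
    return {"region": parts[3], "account": parts[4], "name": name}


# The ARN used in this notebook, split across two lines for readability
example_arn = (
    "arn:aws:sagemaker:us-east-2:786796469737:"
    "algorithm/active-learning-text-classification-v04"
)
info = parse_algorithm_arn(example_arn)
print(info["region"], info["name"])
```

Once a SageMaker session exists, `info["region"]` can be compared with `sagemaker_session.boto_region_name` to confirm that the ARN matches the region you are working in.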

### 2. Prepare dataset

In [16]:
import base64
import json 
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
from sagemaker.algorithm import AlgorithmEstimator
from urllib.parse import urlparse
import boto3
from IPython.display import Image
from PIL import Image as ImageEdit
import urllib.request
import numpy as np
import pandas as pd

#### A. Dataset format expected by the algorithm

The deployed solution has **2 steps**: training the algorithm and testing.

- The system trains on a user-provided text dataset.
- The training dataset must contain 2 files, "train.csv" and "validation.csv", both with 'utf-8' encoding.
- The machine learning model is built in the training step. Once the model is generated, it can be used to sample data points from unlabelled data using an active learning technique.
- The testing API takes a csv file, "unlabelled.csv", from which the data points are sampled for human annotation.

**train.csv**
- train.csv must contain 3 columns: <b>ID</b>, <b>Text</b> and <b>Category</b>. <b>ID</b> is a unique identification number associated with the text, <b>Text</b> holds the textual data to be categorized, and <b>Category</b> is the class associated with the text.
- It is recommended to start with a small amount of annotated data in train.csv and to increase the training data size iteratively by sampling data points from the unlabelled data.
- After each iteration of model building and sampling, the data points returned by the model should be annotated and added to train.csv. The same data points should be removed from the unlabelled dataset.

**validation.csv**
- The format of validation.csv is the same as that of train.csv: it must contain 3 columns, <b>ID</b>, <b>Text</b> and <b>Category</b>.
- validation.csv is used to check the accuracy of the model.
- The contents of validation.csv should not be altered between iterations, so that the reported accuracy shows the effect of each round of sampling, annotation and training.
    
**unlabelled.csv**
- unlabelled.csv must contain 2 columns: <b>ID</b> and <b>Text</b>.
- unlabelled.csv is the large corpus of unlabelled data from which data points are sampled for annotation. Sampling uses an active learning technique instead of random sampling.
- Active learning sampling helps attain better model accuracy in less time, because far fewer data points need to be labelled.
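
The column requirements above can be checked before uploading anything. The following is a minimal sketch using only the standard library; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration, and the check looks only at the header row.

```python
import csv
import io

# Required columns per file, as described above
REQUIRED_COLUMNS = {
    "train.csv": ["ID", "Text", "Category"],
    "validation.csv": ["ID", "Text", "Category"],
    "unlabelled.csv": ["ID", "Text"],
}


def missing_columns(csv_file, file_kind):
    """Return the required columns that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [col for col in REQUIRED_COLUMNS[file_kind] if col not in header]


# In-memory example standing in for a real file opened with encoding="utf-8"
sample = io.StringIO("ID,Text\n1,some text\n")
print(missing_columns(sample, "unlabelled.csv"))  # []
sample.seek(0)
print(missing_columns(sample, "train.csv"))       # ['Category']
```

For a real file, open it with `open("data/training/train.csv", encoding="utf-8", newline="")` and pass the file object to `missing_columns`.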

#### B. Configure and visualize train, validation and test dataset

In [3]:
training_dataset='data/training/train.csv'

In [4]:
validation_dataset='data/training/validation.csv'

In [6]:
train_input_df = pd.read_csv(training_dataset)
train_input_df.head()

| Unnamed: 0.1 | Unnamed: 0 | ID | Text | Category |
|---|---|---|---|---|
| 0 | 0 | 2035 | errors doomed first dome sale the initial att... | politics |
| 1 | 1 | 440 | lib dems to target stamp duty the liberal de... | politics |
| 2 | 2 | 1374 | uk economy facing major risks the uk manufac... | business |
| 3 | 3 | 360 | baa support ahead of court battle uk airport o... | politics |
| 4 | 4 | 1598 | survey confirms property slowdown government f... | business |


In [7]:
validation_input_df = pd.read_csv(validation_dataset)
validation_input_df.head()

| Unnamed: 0.1 | Unnamed: 0 | ID | Text | Category |
|---|---|---|---|---|
| 0 | 0 | 900 | williams stays on despite dispute matt william... | sport |
| 1 | 1 | 1424 | robben and cole earn chelsea win cheslea salva... | sport |
| 2 | 2 | 2186 | us bank loses customer details the bank of a... | business |
| 3 | 3 | 1109 | cairn shares up on new oil find shares in cair... | business |
| 4 | 4 | 1579 | golden rule boost for chancellor chancellor go... | business |


In [8]:
test_dataset='data/transform/unlabelled.csv'

In [9]:
test_input_df = pd.read_csv(test_dataset)
test_input_df.head()

| Unnamed: 0.1 | Unnamed: 0 | ID | Text |
|---|---|---|---|
| 0 | 0 | 1149 | council tax rise reasonable welsh councils s... |
| 1 | 1 | 1524 | fa charges liverpool and millwall liverpool an... |
| 2 | 2 | 2198 | campbell returns to election team ex-downing s... |
| 3 | 3 | 346 | clijsters could play aussie open kim clijsters... |
| 4 | 4 | 1370 | anelka eyes man city departure striker nicol... |
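
The previews in this section show `Unnamed: 0.1` and `Unnamed: 0` columns in addition to the documented ones. Columns like these typically appear when a DataFrame is saved together with its index and then read back. Whether the algorithm ignores them is not stated here, so the following is only a sketch of how they could be dropped; `drop_index_columns` is a hypothetical helper, and the in-memory CSV stands in for a real file.

```python
import io

import pandas as pd


def drop_index_columns(df):
    """Return a copy of df without pandas' auto-named 'Unnamed: N' columns."""
    keep = [col for col in df.columns if not str(col).startswith("Unnamed:")]
    return df[keep].copy()


# Small in-memory stand-in for unlabelled.csv
raw = io.StringIO(
    "Unnamed: 0.1,Unnamed: 0,ID,Text\n"
    "0,0,1149,council tax rise reasonable\n"
)
cleaned = drop_index_columns(pd.read_csv(raw))
print(list(cleaned.columns))  # ['ID', 'Text']

# Writing with index=False avoids adding a new unnamed column on the next read
print(cleaned.to_csv(index=False))
```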


#### C. Upload datasets to Amazon S3

In [30]:
sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [11]:
# training input location
common_prefix = "active-learning"
training_input_prefix = common_prefix + "/training-input-data"
TRAINING_WORKDIR = "data/training"
training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)

In [13]:
TRANSFORM_WORKDIR = "data/transform"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"
transform_input = sagemaker_session.upload_data(TRANSFORM_WORKDIR, key_prefix=batch_inference_input_prefix) + "/unlabelled.csv"
print("Transform input uploaded to " + transform_input)

Transform input uploaded to s3://sagemaker-us-east-2-786796469737/active-learning/batch-inference-input-data/unlabelled.csv


### 3. Train a machine learning model

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to train a machine learning model.

#### A. Set up environment

In [14]:
role = get_execution_role()

#### B. Train a model

In [17]:
algo = AlgorithmEstimator(
    algorithm_arn=algorithm_arn,
    role=role,
    instance_count=1,
    instance_type='ml.c4.xlarge',
    base_job_name='active-learning-marketplace')

In [18]:
print("Now run the training job using algorithm arn %s in region %s" % (algorithm_arn, sagemaker_session.boto_region_name))
# Start the training job on the dataset uploaded to S3
algo.fit({'training': training_input})

Now run the training job using algorithm arn arn:aws:sagemaker:us-east-2:786796469737:algorithm/active-learning-text-classification-v04 in region us-east-2
2021-06-24 03:17:05 Starting - Starting the training job...
2021-06-24 03:17:07 Starting - Launching requested ML instances
ProfilerReport-1624504625: InProgress
...
2021-06-24 03:18:01 Starting - Preparing the instances for training.........
2021-06-24 03:19:29 Downloading - Downloading input data
2021-06-24 03:19:29 Training - Downloading the training image.....
Training Starts

2021-06-24 03:20:35 Uploading - Uploading generated training model
2021-06-24 03:20:35 Completed - Training job completed
Accuracy : 0.84
['text_classification_model.pkl', 'vectorizer.pk']
/opt/ml/model
Accuracy: train=84.000
Training seconds: 81
Billable seconds: 81


### 4. Deploy model and verify results
Now you can deploy the model for performing real-time inference.

In [19]:
model_name='active-learning'

content_type='text/csv'

real_time_inference_instance_type='ml.c4.xlarge'
batch_transform_inference_instance_type='ml.c4.large'

#### A. Deploy trained model

In [None]:
# Deploy the model to a real-time endpoint
predictor = algo.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

..........
------------!

Once the endpoint is created, you can perform real-time inference.

#### B. Create input payload

In [21]:
file_name = 'data/transform/unlabelled.csv'
output_file_name = 'output.csv'

#### C. Perform real-time inference

In [22]:
!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name 'active-learning' \
    --body fileb://$file_name \
    --content-type 'text/csv' \
    --region us-east-2 \
    $output_file_name

{
    "ContentType": "text/csv; charset=utf-8",
    "InvokedProductionVariant": "AllTraffic"
}
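
The same request can be sent from Python instead of the AWS CLI. The following is a minimal sketch, assuming a boto3 `sagemaker-runtime` client is passed in; `invoke_with_csv` is a hypothetical helper, and `StubRuntimeClient` exists only so that the example runs without AWS credentials or a live endpoint.

```python
import io
import os
import tempfile


def invoke_with_csv(runtime_client, endpoint_name, csv_path, output_path):
    """Send a CSV file to an endpoint and save the response body to a file."""
    with open(csv_path, "rb") as f:
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=f.read(),
        )
    # The response "Body" is a stream; read it fully and write it out
    with open(output_path, "wb") as out:
        out.write(response["Body"].read())
    return response.get("ContentType")


class StubRuntimeClient:
    """Stand-in for boto3.client('sagemaker-runtime'); echoes the payload."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        return {"ContentType": ContentType, "Body": io.BytesIO(Body)}


# Write a tiny input file to a temporary directory for the demonstration
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "unlabelled.csv")
dst = os.path.join(workdir, "output.csv")
with open(src, "wb") as f:
    f.write(b"ID,Text\n1,hello\n")

returned_type = invoke_with_csv(StubRuntimeClient(), "active-learning", src, dst)
print(returned_type)
```

Against the real endpoint, the stub would be replaced with `boto3.client("sagemaker-runtime", region_name="us-east-2")`, and the call would become `invoke_with_csv(client, model_name, file_name, output_file_name)`.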


#### D. Visualize output

- The output is a csv file with data points sampled from the unlabelled data. It contains the ID and the associated text of each data point sampled for human annotation.
- These sampled data points should be annotated, added to train.csv, and removed from unlabelled.csv before the machine learning model is trained again.
- The model is then trained on a larger data set that captures more of the variability present in the data, and it reports an updated accuracy on the validation dataset.

In [23]:
output = pd.read_csv(output_file_name)
output.head(10)

| Unnamed: 0.1 | Unnamed: 0 | ID | Text |
|---|---|---|---|
| 0 | 0 | 1697 | us airways staff agree to pay cut a union repr... |
| 1 | 1 | 935 | civil servants in strike ballot the uk s bigge... |
| 2 | 2 | 2108 | game warnings must be clearer violent video ... |
| 3 | 3 | 896 | jp morgan admits us slavery links thousands of... |
| 4 | 4 | 1838 | watchdog probes e-mail deletions the informati... |
| 5 | 5 | 967 | italy to get economic action plan italian prim... |
| 6 | 6 | 1465 | bmw cash to fuel mini production less than fou... |
| 7 | 7 | 1605 | kilroy launches veritas party ex-bbc chat sh... |
| 8 | 8 | 1156 | bid to cut court witness stress new targets to... |
| 9 | 9 | 1217 | uefa approves fake grass uefa says it will all... |
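
The bookkeeping for one iteration (add the annotated samples to train.csv, remove them from unlabelled.csv) can be sketched with pandas. This is an illustration under assumptions: `apply_annotations` is a hypothetical helper, the tiny frames stand in for the real files, and the labels are made up for the example.

```python
import pandas as pd


def apply_annotations(train_df, unlabelled_df, annotated_df):
    """Move annotated rows from the unlabelled pool into the training set."""
    new_train = pd.concat(
        [train_df, annotated_df[["ID", "Text", "Category"]]],
        ignore_index=True,
    )
    # Keep only the pool rows whose ID was not just annotated
    remaining = unlabelled_df[~unlabelled_df["ID"].isin(annotated_df["ID"])]
    return new_train, remaining.reset_index(drop=True)


train = pd.DataFrame({"ID": [1], "Text": ["a"], "Category": ["sport"]})
pool = pd.DataFrame({"ID": [2, 3], "Text": ["b", "c"]})
# Sampled row 2 after a human assigned it a (made-up) label
annotated = pd.DataFrame({"ID": [2], "Text": ["b"], "Category": ["business"]})

train, pool = apply_annotations(train, pool, annotated)
print(len(train), pool["ID"].tolist())  # 2 [3]
```

After this step, both frames would be written back with `to_csv(..., index=False)` and uploaded again before the next training run.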


#### E. Delete the endpoint
Now that you have successfully performed a real-time inference, you no longer need the endpoint. You can delete it to avoid being charged.

In [24]:
predictor.delete_endpoint(delete_endpoint_config=True)

### 5. Perform Batch inference
In this section, you will perform batch inference using multiple input payloads together. If you are not familiar with batch transform and want to learn more, see these links:
1. [How it works](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-batch-transform.html)
2. [How to run a batch transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html)

In [25]:
TRANSFORM_WORKDIR = "data/transform"
transform_input = sagemaker_session.upload_data(TRANSFORM_WORKDIR, key_prefix=batch_inference_input_prefix) + "/unlabelled.csv"
print("Transform input uploaded to " + transform_input)

Transform input uploaded to s3://sagemaker-us-east-2-786796469737/active-learning/batch-inference-input-data/unlabelled.csv


In [26]:
transformer = algo.transformer(1, 'ml.m4.xlarge')
transformer.transform(transform_input, content_type='text/csv')
transformer.wait()

print("Batch Transform output saved to " + transformer.output_path)

..........
.........................
Starting the inference server with 4 workers.
[2021-06-24 03:52:30 +0000] [12] [INFO] Starting gunicorn 20.1.0
[2021-06-24 03:52:30 +0000] [12] [INFO] Listening at: unix:/tmp/gunicorn.sock (12)
[2021-06-24 03:52:30 +0000] [12] [INFO] Using worker: gevent
[2021-06-24 03:52:30 +0000] [16] [INFO] Booting worker with pid: 16
[2021-06-24 03:52:30 +0000] [17] [INFO] Booting worker with pid: 17
[2021-06-24 03:52:30 +0000] [18] [INFO] Booting worker with pid: 18
[2021-06-24 03:52:30 +0000] [23] [INFO] Booting worker with pid: 23
169.254.255.130 - - [24/Jun/2021:03:52:38 +0000] "GET /ping HTTP/1.1" 200 1 "-" "Go-http-client/1.1"
169.254.255.130 - - [24/Jun/2021:03:52:38 +0000] "GET /execution-parameters HTTP/1.1" 404 2 "-" "Go-http-client/1.1"
169.254.255.130 - - [24/Jun/2021:03:52:38 +0000] "GET /ping HTTP/1.1" 200 1 "-" "Go-http-client/1.1"
169.254.255.13

In [27]:
#output is available on following path
transformer.output_path

's3://sagemaker-us-east-2-786796469737/active-learning-marketplace-2021-06-24-03-48-29-936'

#### A. Inspect the Batch Transform Output in S3

In [32]:
from urllib.parse import urlparse

# Split the S3 output location into bucket name and object key
parsed_url = urlparse(transformer.output_path)
bucket_name = parsed_url.netloc
file_key = '{}/{}.out'.format(parsed_url.path[1:], "unlabelled.csv")

s3_client = sagemaker_session.boto_session.client('s3')

# Fetch the result from the bucket named in the output path
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)

In [33]:
bucketFolder = transformer.output_path.rsplit('/')[3]

In [34]:
import boto3

# Download the batch transform output ("unlabelled.csv.out") to a local file
s3_conn = boto3.client("s3")
bucket_name = bucket
with open('output.csv', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder + '/' + "unlabelled.csv" + '.out', f)
print("Output file loaded from bucket")

Output file loaded from bucket
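
Both cells above derive the object key from `transformer.output_path`. The following is a minimal sketch of that derivation using only the standard library; `transform_output_location` is a hypothetical helper, and the `.out` suffix follows the naming used in the cells above.

```python
from urllib.parse import urlparse


def transform_output_location(output_path, input_file_name):
    """Return (bucket, key) of the batch transform result for one input file."""
    parsed = urlparse(output_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Expected an s3:// URI, got %r" % output_path)
    prefix = parsed.path.strip("/")
    key = input_file_name + ".out"
    if prefix:
        key = prefix + "/" + key
    return parsed.netloc, key


# The output path printed earlier in this notebook
bucket_part, key_part = transform_output_location(
    "s3://sagemaker-us-east-2-786796469737/"
    "active-learning-marketplace-2021-06-24-03-48-29-936",
    "unlabelled.csv",
)
print(bucket_part)
print(key_part)
```

Unlike indexing into the result of `rsplit('/')`, this approach keeps working if the output path contains a nested prefix.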


- The output is a csv file with data points sampled from the unlabelled data. It contains the ID and the associated text of each data point sampled for human annotation.
- These sampled data points should be annotated, added to train.csv, and removed from unlabelled.csv before the machine learning model is trained again.
- The model is then trained on a larger data set that captures more of the variability present in the data, and it reports an updated accuracy on the validation dataset.

In [35]:
output = pd.read_csv('output.csv')

In [36]:
output.head(10)

| Unnamed: 0.1 | Unnamed: 0 | ID | Text |
|---|---|---|---|
| 0 | 0 | 1697 | us airways staff agree to pay cut a union repr... |
| 1 | 1 | 935 | civil servants in strike ballot the uk s bigge... |
| 2 | 2 | 2108 | game warnings must be clearer violent video ... |
| 3 | 3 | 896 | jp morgan admits us slavery links thousands of... |
| 4 | 4 | 1838 | watchdog probes e-mail deletions the informati... |
| 5 | 5 | 967 | italy to get economic action plan italian prim... |
| 6 | 6 | 1465 | bmw cash to fuel mini production less than fou... |
| 7 | 7 | 1605 | kilroy launches veritas party ex-bbc chat sh... |
| 8 | 8 | 1156 | bid to cut court witness stress new targets to... |
| 9 | 9 | 1217 | uefa approves fake grass uefa says it will all... |


### 6. Clean-up

#### A. Delete the model

In [None]:
predictor.delete_model()

#### B. Unsubscribe to the listing (optional)
If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.