## Train, tune, and deploy a custom ML model using a bias detection and mitigation algorithm for text data from AWS Marketplace


The solution uses the Double-Hard Debias algorithm to remove targeted biases from the vector space representation (word embeddings) of a text corpus.
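To make the idea of removing a bias from a vector space more concrete, here is a minimal sketch of the projection step that hard-debiasing approaches generally rely on. It is an illustration under stated assumptions, not the Marketplace algorithm's actual implementation: the bias direction is approximated by averaging the differences of definitional pairs, and the toy vectors and the function names `bias_direction` and `remove_component` are invented for this example.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def bias_direction(pairs, vectors):
    """Average the difference vectors of definitional pairs, e.g. ("he", "she")."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for a, b in pairs:
        for i in range(dim):
            total[i] += vectors[a][i] - vectors[b][i]
    return normalize(total)

def remove_component(v, direction):
    """Subtract the projection of v onto the unit-length bias direction."""
    dot = sum(x * d for x, d in zip(v, direction))
    return [x - dot * d for x, d in zip(v, direction)]

# Toy 3-dimensional embeddings (illustrative values only, not real model output).
toy_vectors = {
    "he": [1.0, 0.2, 0.0],
    "she": [-1.0, 0.2, 0.0],
    "doctor": [0.4, 0.9, 0.1],
}
direction = bias_direction([("he", "she")], toy_vectors)
debiased_doctor = remove_component(toy_vectors["doctor"], direction)
print(debiased_doctor)
```

After the projection, the vector for "doctor" has no component along the estimated bias direction, while its other coordinates are unchanged.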



This sample notebook shows you how to train, tune, and deploy a custom ML model using a bias detection and mitigation algorithm for text data from AWS Marketplace.


> **Note**: This is a reference notebook, and it cannot run unless you make the changes suggested in the notebook.

#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account already has a subscription to this algorithm's AWS Marketplace listing.
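As a sketch of what the three AWS Marketplace permissions look like in practice, the snippet below builds an IAM policy document that allows them. The helper name `build_marketplace_policy` is hypothetical, and granting the actions on `"Resource": "*"` is an assumption made for brevity; your organization may require a narrower policy.

```python
import json

# The three AWS Marketplace actions listed in the pre-requisites.
MARKETPLACE_ACTIONS = [
    "aws-marketplace:ViewSubscriptions",
    "aws-marketplace:Unsubscribe",
    "aws-marketplace:Subscribe",
]

def build_marketplace_policy(actions=MARKETPLACE_ACTIONS):
    """Return an IAM policy document (as a dict) allowing the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Assumption: all resources; tighten this if your account requires it.
            {"Effect": "Allow", "Action": list(actions), "Resource": "*"}
        ],
    }

policy_json = json.dumps(build_marketplace_policy(), indent=2)
print(policy_json)
```

The printed JSON can be pasted into the IAM console's policy editor and attached to the role used by the notebook.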

#### Contents:
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Prepare dataset](#2.-Prepare-dataset)
	1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
	1. [Configure dataset](#B.-Configure-dataset)
	1. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
1. [Execute DeBias model](#3.-Execute-DeBias-Model)
	1. [Set up environment](#3.1-Set-up-environment)
	1. [Train model](#3.2-Train-model)
	1. [Inspect the output in S3](#3.3-Inspect-the-Output-in-S3)
1. [Clean-up](#4.-Clean-up)
	1. [Unsubscribe to the listing (optional)](#Unsubscribe-to-the-listing-(optional))


#### Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

### 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page on AWS Marketplace.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **"Accept Offer"** if you agree with them.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

In [10]:
# Replace this value with the Product ARN shown for your region.
algo_arn = 'arn:aws:sagemaker:us-east-2:786796469737:algorithm/double-hard-debias-copy-07-07-copy-07-11'
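Because the ARN is region-specific, a quick sanity check before training can catch a mismatch early. The following is a minimal sketch, assuming the standard `arn:partition:service:region:account-id:resource` layout; the helper name `parse_algorithm_arn` is hypothetical.

```python
def parse_algorithm_arn(arn):
    """Split a SageMaker algorithm ARN into its named parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError("Not a SageMaker ARN: " + arn)
    resource_type, _, resource_name = parts[5].partition("/")
    if resource_type != "algorithm" or not resource_name:
        raise ValueError("ARN does not point to an algorithm: " + arn)
    return {
        "partition": parts[1],
        "region": parts[3],
        "account_id": parts[4],
        "algorithm_name": resource_name,
    }

example_arn = "arn:aws:sagemaker:us-east-2:786796469737:algorithm/double-hard-debias-copy-07-07-copy-07-11"
arn_parts = parse_algorithm_arn(example_arn)
print(arn_parts["region"])
```

In the notebook, the parsed region could be compared with the region of your SageMaker session (for example its `boto_region_name` attribute) before a training job is started.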

### 2. Prepare dataset

In [11]:
# Standard library
import base64
import io
import json
import tarfile
import urllib.request
import uuid
from pprint import pprint
from urllib.parse import urlparse
from zipfile import ZipFile

# Third-party
import boto3
import numpy as np
import pandas as pd
import sagemaker as sage
from sagemaker import ModelPackage, get_execution_role

#### A. Dataset format expected by the algorithm

For best results, provide the input data in the following format:
* The input file should be named `input_data.zip`.
* The zip file must contain an unstructured text corpus, two text files that each list words specific to one bias class (for example, a male word file containing male-specific keywords), a JSON file with bias-specific keywords, and a definitional-pair JSON file with keywords that belong to the same group but opposite categories.
* The input data files must contain all columns specified in the input data description; other columns will be ignored.
* For detailed instructions, refer to the sample notebook and the algorithm input details.
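The packaging step described above can be sketched as follows. Every member file name and the JSON layouts in `EXAMPLE_FILES` are hypothetical placeholders, since this notebook does not state the exact names the algorithm expects; check the listing's input details before using them.

```python
import json
import os
import tempfile
from zipfile import ZipFile

# Hypothetical member names and contents; the real names come from the listing.
EXAMPLE_FILES = {
    "corpus.txt": "He is a doctor. She is a nurse.\n",
    "male_words.txt": "he\nhim\nman\n",
    "female_words.txt": "she\nher\nwoman\n",
    "bias_keywords.json": json.dumps(["doctor", "nurse"]),
    "definitional_pairs.json": json.dumps([["he", "she"], ["man", "woman"]]),
}

def build_input_zip(target_dir, files=EXAMPLE_FILES):
    """Write the given files into <target_dir>/input_data.zip and return its path."""
    os.makedirs(target_dir, exist_ok=True)
    zip_path = os.path.join(target_dir, "input_data.zip")
    with ZipFile(zip_path, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return zip_path

# Write to a temporary "Input" directory so the sketch does not touch real data.
example_zip = build_input_zip(os.path.join(tempfile.mkdtemp(), "Input"))
with ZipFile(example_zip) as archive:
    print(archive.namelist())
```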

#### B. Configure dataset

In [12]:
training_dataset='Input/input_data.zip'

#### C. Upload datasets to Amazon S3

In [13]:
sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [14]:
# training input location
common_prefix = "double_hard_debias"
training_input_prefix = common_prefix + "/training-input-data"
TRAINING_WORKDIR = "Input"
training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print("Training input uploaded to " + training_input)

Training input uploaded to s3://sagemaker-us-east-2-786796469737/double_hard_debias/training-input-data
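Because the whole `Input` directory is uploaded, every file in it ends up in the training channel. The sketch below lists which S3 keys a local directory would map to, assuming the upload keeps the directory's relative layout under the key prefix; `planned_s3_keys` is a hypothetical helper and does not call AWS.

```python
import os
import tempfile

def planned_s3_keys(local_dir, key_prefix):
    """List (local_path, s3_key) pairs for every file under local_dir."""
    pairs = []
    for dirpath, _, filenames in os.walk(local_dir):
        rel_dir = os.path.relpath(dirpath, start=local_dir)
        for name in sorted(filenames):
            parts = [key_prefix] if rel_dir == "." else [key_prefix] + rel_dir.split(os.sep)
            pairs.append((os.path.join(dirpath, name), "/".join(parts + [name])))
    return pairs

# Demo directory with an empty placeholder file standing in for the real zip.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "input_data.zip"), "wb").close()
demo_keys = [key for _, key in planned_s3_keys(demo_dir, "double_hard_debias/training-input-data")]
print(demo_keys)
```

Running this against the real `Input` directory before uploading is a cheap way to spot stray files such as notebook checkpoints.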


## 3. Execute DeBias Model

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to execute the DeBias model.

### 3.1 Set up environment

In [15]:
role = get_execution_role()

In [16]:
output_location = 's3://{}/double_hard_debias/{}'.format(bucket, 'output')

### 3.2 Train model

For information on creating an `Estimator` object, see the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [17]:
training_instance_type='ml.m5.4xlarge'

In [18]:
# Create an estimator object for running a training job.
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="debias-training",
    role=role,
    # SageMaker Python SDK v2 argument names; the older train_instance_* duplicates are dropped.
    instance_count=1,
    instance_type=training_instance_type,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
)
# Run the training job.
estimator.fit({"training": training_input})

2022-07-11 13:11:30 Starting - Starting the training job...
2022-07-11 13:11:53 Starting - Preparing the instances for training
ProfilerReport-1657545089: InProgress
......
2022-07-11 13:12:59 Downloading - Downloading input data
2022-07-11 13:12:59 Training - Training image download completed. Training in progress...
Starting the training.
Extracting all the files now...
Files extraction from zip is Done!
All files found in uploaded zip file.
Text Pre-processing and Word Embeddings creating Algorithms initialized:
Sentences_list done
cleaned_text_array done
word_tokenized_array is generated
Started creating GloVe Word Embeddings!
GloVe Model training done!
GloVe Word Embeddings are created!
Started creating word2vec Word Embeddings!
word2vec Model training done!
GloVe Word Embeddings are created!
DeBiasing Algorithm initialized:
Started creating Glo

See this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics and logs in the **Monitor** section.
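The job status can also be read programmatically. The sketch below wraps the SageMaker `describe_training_job` call in a small helper; `summarize_training_job` and the job name `debias-training-example` are hypothetical, and a stub client stands in for the real one so the example runs without AWS credentials.

```python
def summarize_training_job(sm_client, job_name):
    """Return the primary and secondary status of a training job."""
    description = sm_client.describe_training_job(TrainingJobName=job_name)
    return {
        "status": description["TrainingJobStatus"],
        "secondary_status": description["SecondaryStatus"],
        # FailureReason is only present when the job has failed.
        "failure_reason": description.get("FailureReason"),
    }

class _StubClient:
    """Stand-in for a boto3 SageMaker client so the sketch runs offline."""
    def describe_training_job(self, TrainingJobName):
        return {"TrainingJobStatus": "Completed", "SecondaryStatus": "Completed"}

job_summary = summarize_training_job(_StubClient(), "debias-training-example")
print(job_summary)
```

In the notebook, you would pass `boto3.client("sagemaker")` and `estimator.latest_training_job.job_name` instead of the stub and the placeholder name.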

In [19]:
# The output is available at the following path.
estimator.output_path

's3://sagemaker-us-east-2-786796469737/double_hard_debias/output'

> **Note**: Inference is performed within the training pipeline, so a real-time inference endpoint or batch transform job is not required.

### 3.3 Inspect the Output in S3

In [20]:
from urllib.parse import urlparse

# Split the S3 output path into its bucket and key prefix.
parsed_url = urlparse(estimator.output_path)
bucket_name = parsed_url.netloc
# Key of the model artifact: <prefix>/<job name>/output/model.tar.gz
file_key = parsed_url.path[1:]+'/'+estimator.latest_training_job.job_name+'/output/'+"model.tar.gz"

s3_client = sagemaker_session.boto_session.client('s3')

# Fetch the object from the bucket named in the output path.
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)

In [21]:
# Reuse the artifact key built in the previous cell instead of re-deriving it by splitting the path.
bucketFolder = file_key
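The artifact location follows the pattern `<output prefix>/<job name>/output/model.tar.gz`. The sketch below derives the bucket and key from the output path with `urlparse` only; the helper `model_artifact_location` and the job name in the example are hypothetical.

```python
from urllib.parse import urlparse

def model_artifact_location(output_path, job_name):
    """Return (bucket, key) of model.tar.gz for a training job."""
    parsed = urlparse(output_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Expected an s3:// URI, got: " + output_path)
    prefix = parsed.path.strip("/")
    # Skip an empty prefix so the key never starts with a slash.
    key_parts = [p for p in (prefix, job_name, "output", "model.tar.gz") if p]
    return parsed.netloc, "/".join(key_parts)

artifact_bucket, artifact_key = model_artifact_location(
    "s3://sagemaker-us-east-2-786796469737/double_hard_debias/output",
    "debias-training-example",  # hypothetical job name
)
print(artifact_bucket, artifact_key)
```

The SageMaker Python SDK also exposes the full artifact URI as `estimator.model_data` once training has finished, which avoids building the key by hand.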

In [22]:
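import boto3
s3_conn = boto3.client("s3")
bucket_name = bucket
# Download the model artifact to a local file.
with open('output.tar.gz', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder, f)
# Report success only after the file has been written and closed.
print("Output file loaded from bucket")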



Output file loaded from bucket


In [23]:
with tarfile.open('output.tar.gz') as file:
    file.extractall('./output')    
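When extracting archives that you did not create yourself, it can be worth checking member paths first. The following is a minimal sketch of such a check, not part of the original notebook: `safe_extract` is a hypothetical helper that only validates member names (it does not inspect symbolic links), and the tiny archive built here stands in for the real `output.tar.gz`.

```python
import io
import os
import tarfile
import tempfile

def safe_extract(archive_path, dest_dir):
    """Extract a tar archive, refusing members that would land outside dest_dir."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("Unsafe path in archive: " + member.name)
        archive.extractall(dest_root)
    return sorted(os.listdir(dest_root))

# Build a tiny archive so the sketch runs without S3 access.
work_dir = tempfile.mkdtemp()
demo_archive = os.path.join(work_dir, "output.tar.gz")
payload = b"demo"
with tarfile.open(demo_archive, "w:gz") as archive:
    info = tarfile.TarInfo(name="outputDeBiasedVectors.zip")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

extracted = safe_extract(demo_archive, os.path.join(work_dir, "output"))
print(extracted)
```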

In [None]:
with ZipFile('./output/outputDeBiasedVectors.zip', "r") as output_zip:
    with io.TextIOWrapper(output_zip.open("GloVe_DeBiasedVectors.txt"), encoding="utf-8") as f:
        r = f.read()
    pprint(r)

In [None]:
with ZipFile('./output/outputDeBiasedVectors.zip', "r") as output_zip:
    with io.TextIOWrapper(output_zip.open("word2vec_DeBiasedVectors.txt"), encoding="utf-8") as f:
        r = f.read()
    pprint(r)
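Printing the whole file is useful for a first look, but the vectors are easier to work with once parsed. The sketch below assumes each line has the form `<word> <float> <float> ...`, which is a common plain-text embedding layout but is not confirmed by this notebook; a header line, if the file has one, would need to be skipped first. The helper names and the sample values are illustrative only.

```python
import io
import math

def load_vectors(text_stream):
    """Parse lines of the form '<word> <float> <float> ...' into a dict."""
    vectors = {}
    for line in text_stream:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative data only; real values would come from GloVe_DeBiasedVectors.txt.
sample = io.StringIO("doctor 0.0 0.9 0.1\nnurse 0.0 0.8 0.3\n")
sample_vectors = load_vectors(sample)
similarity = cosine_similarity(sample_vectors["doctor"], sample_vectors["nurse"])
print(round(similarity, 3))
```

With the real output, the `io.TextIOWrapper` object `f` from the cells above could be passed to `load_vectors` in place of the sample stream.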

### 4. Clean-up

#### Unsubscribe to the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on the [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.

