## Train, tune, and deploy a custom ML model using Search Similar Company Descriptions Algorithm from AWS Marketplace 


Given a company name and the number of companies to rank, this ML solution returns the companies with the most similar descriptions.

This sample notebook shows you how to train a custom ML model using Search Similar Company Descriptions Algorithm from AWS Marketplace.

> **Note**: This is a reference notebook and it cannot run unless you make the changes suggested in the notebook.

#### Pre-requisites:
1. **Note**: This notebook contains elements that render correctly only in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has the **AmazonSageMakerFullAccess** policy attached.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to Vector Search for Company Description. 

#### Contents:
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Prepare dataset](#2.-Prepare-dataset)
	1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
	1. [Configure the dataset](#B.-Configure-the-dataset)
	1. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
1. [Train a machine learning model](#3:-Train-a-machine-learning-model)
	1. [Set up environment](#3.1-Set-up-environment)
	1. [Train a model](#3.2-Train-a-model)
1. [Deploy model and verify results](#4:-Deploy-model-and-verify-results)
    1. [Deploy trained model](#A.-Deploy-trained-model)
    1. [Create input payload](#B.-Create-input-payload)
    1. [Perform real-time inference](#C.-Perform-real-time-inference)
    1. [Visualize output](#D.-Visualize-output)
    1. [Delete the endpoint](#E.-Delete-the-endpoint)
1. [Perform Batch inference](#5.-Perform-Batch-inference)
    1. [Run batch-transform job](#A.-Run-the-batch-transform-job)
    1. [Inspect the Output](#B.-Inspect-the-Batch-Transform-Output-in-S3)
1. [Clean-up](#6.-Clean-up)
	1. [Delete the model](#A.-Delete-the-model)
	1. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))


#### Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

### 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page Search Similar Text Descriptions.
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **"Accept Offer"** if you agree with them.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify while training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

In [1]:
algo_arn = 'arn:aws:sagemaker:us-east-2:786796469737:algorithm/vector-search-v2'
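
Because the ARN is region-specific, a quick sanity check before training can catch a copy-paste from the wrong region. The following is a minimal sketch using only the standard library; the helper names `parse_algorithm_arn` and `arn_matches_region` are hypothetical and not part of the SageMaker SDK.

```python
def parse_algorithm_arn(arn: str) -> dict:
    """Split an ARN into its named components."""
    # General ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    resource_type, _, resource_name = parts[5].partition("/")
    return {
        "partition": parts[1],
        "service": parts[2],
        "region": parts[3],
        "account_id": parts[4],
        "resource_type": resource_type,
        "resource_name": resource_name,
    }


def arn_matches_region(arn: str, region: str) -> bool:
    """Return True when the ARN belongs to the given AWS region."""
    return parse_algorithm_arn(arn)["region"] == region


algo_arn = "arn:aws:sagemaker:us-east-2:786796469737:algorithm/vector-search-v2"
print(parse_algorithm_arn(algo_arn)["region"])    # us-east-2
print(arn_matches_region(algo_arn, "us-east-2"))  # True
```

In the notebook, the second argument would typically be the session region, for example `sagemaker_session.boto_region_name` once the session has been created.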

### 2. Prepare dataset

In [2]:
import base64
import json 
import uuid
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
from urllib.parse import urlparse
import boto3
import urllib.request
import numpy as np
from zipfile import ZipFile
import pandas as pd

#### A. Dataset format expected by the algorithm

For best results, provide the training data in the following format:
* The input file must be named `input_data.zip`.
* The zip file must contain a CSV file named `input.csv` with the mandatory information in columns.
* The input data file must contain all columns specified in the input data description; other columns are ignored.
* For detailed instructions, please refer to the sample notebook and the algorithm input details.
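
The packaging rules above can be sketched as follows. This is an illustration only: the column names `company_name`, `company_description`, and `industry` are assumptions borrowed from the sample output later in this notebook, not the confirmed input schema, and the rows are placeholders. The helper `build_training_archive` is hypothetical.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def build_training_archive(rows, fieldnames, archive_path):
    """Write rows to a CSV named input.csv and package it as a zip archive."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

    archive_path = Path(archive_path)
    archive_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Assumption: input.csv sits at the root of the archive.
        archive.writestr("input.csv", buffer.getvalue())
    return archive_path


# Placeholder rows; the column names are assumptions, not the confirmed schema.
columns = ["company_name", "company_description", "industry"]
sample_rows = [
    {"company_name": "Example Co A", "company_description": "placeholder description a", "industry": "Placeholder"},
    {"company_name": "Example Co B", "company_description": "placeholder description b", "industry": "Placeholder"},
]

with tempfile.TemporaryDirectory() as workdir:
    archive = build_training_archive(sample_rows, columns, Path(workdir) / "Training Input" / "input_data.zip")
    with zipfile.ZipFile(archive) as check:
        print(check.namelist())  # ['input.csv']
```

The example writes to a temporary directory so that it does not overwrite an existing dataset; pointing `archive_path` at `Training Input/input_data.zip` would match the path used in the next cell.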

#### B. Configure the dataset

In [3]:
training_dataset='Training Input/input_data.zip'

#### C. Upload datasets to Amazon S3

In [4]:
sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [5]:
# training input location
common_prefix = "vector-search"
training_input_prefix = common_prefix + "/training-input-data"
TRAINING_WORKDIR = "Training Input"
training_input = sagemaker_session.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print("Training input uploaded to " + training_input)

Training input uploaded to s3://sagemaker-us-east-2-786796469737/vector-search/training-input-data


## 3: Train a machine learning model

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to train a machine learning model.

### 3.1 Set up environment

In [6]:
role = get_execution_role()

In [7]:
output_location = 's3://{}/vector_search/{}'.format(bucket, 'output')

### 3.2 Train a model

For information on creating an `Estimator` object, see the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

An `ml.m5.large` instance is sufficient for a training dataset containing ~2000 company descriptions.  
For larger datasets, select an instance type appropriate to the dataset size.  

In [None]:
instance_type='ml.m5.large'

In [8]:
#Create an estimator object for running a training job
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="vector-search-training",
    role=role,
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    instance_count=1,
    instance_type=instance_type
)
#Run the training job.
estimator.fit({"training": training_input})

2023-02-02 06:35:39 Starting - Starting the training job...
2023-02-02 06:36:03 Starting - Preparing the instances for training
ProfilerReport-1675319739: InProgress
......
2023-02-02 06:37:03 Downloading - Downloading input data...
2023-02-02 06:37:23 Training - Downloading the training image.........
2023-02-02 06:39:03 Training - Training image download completed. Training in progress..
Starting the training.

2023-02-02 06:39:38 Uploading - Uploading generated training model
Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Batches: 100%|██████████| 1/1 [00:20<00:00, 20.80s/it]
Length of the embeddings: 25
2023-02-02 06:39:30,554 [INFO]: Using 2 omp threads (processes), consider increasing --nb_cores if you have more
2023-02-02 06:39:30,555 [INFO]: Launching the whole pipeline 02/02/2023, 06:39:30
2023-02-02 06:39:30,555 [INFO]: Reading total number of vectors and dimension 

See this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics and logs in the **Monitor** section.

### 4: Deploy model and verify results

Now you can deploy the model for performing real-time inference.

In [9]:
model_name='vector-search'

content_type='application/json'

real_time_inference_instance_type='ml.m5.large'
batch_transform_inference_instance_type='ml.m5.large'

#### A. Deploy trained model

In [10]:
# Deploy the trained model to a real-time endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type=real_time_inference_instance_type)

..........
-------!

Once the endpoint is created, you can perform real-time inference.

#### B. Create input payload

The trained model accepts a JSON file containing the fields `company_name` and `k`.  
For detailed instructions, please refer to the sample input and the model input details.
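
As an illustration, a payload with these two fields could be generated as shown below. This sketch assumes a flat JSON object in which `company_name` is a string and `k` is a positive integer; the exact schema should be checked against the model input details. The helper `write_query_payload` is hypothetical, and the example values are chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_query_payload(company_name: str, k: int, path) -> Path:
    """Write a JSON payload with the company_name and k fields."""
    if not company_name:
        raise ValueError("company_name must be a non-empty string")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Assumption: a flat object with exactly these two fields.
    payload = {"company_name": company_name, "k": k}
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as workdir:
    payload_file = write_query_payload("024 Pharma, Inc.", 5, Path(workdir) / "model_input.json")
    print(json.loads(payload_file.read_text(encoding="utf-8")))
    # {'company_name': '024 Pharma, Inc.', 'k': 5}
```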

In [11]:
file_name = '"Model Input"/model_input.json'

#### C. Perform real-time inference

In [12]:
output_file_name = '"Model Output"/output.csv'

In [13]:
!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name $predictor.endpoint_name \
    --body fileb://$file_name \
    --content-type $content_type \
    --region $sagemaker_session.boto_region_name \
    $output_file_name

{
    "ContentType": "text/csv; charset=utf-8",
    "InvokedProductionVariant": "AllTraffic"
}
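
The same call can be made from Python instead of the AWS CLI. The sketch below wraps the boto3 `invoke_endpoint` call of the `sagemaker-runtime` client; the helper `invoke_and_save` is hypothetical, and the client is passed in as an argument so the function can be exercised without AWS credentials.

```python
from pathlib import Path


def invoke_and_save(runtime_client, endpoint_name, payload_path, output_path,
                    content_type="application/json"):
    """Send the payload file to the endpoint and store the response body."""
    body = Path(payload_path).read_bytes()
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType=content_type,
        Body=body,
    )
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # response["Body"] is a streaming object; read() returns the raw bytes.
    output_path.write_bytes(response["Body"].read())
    return response.get("ContentType")


# Usage inside the notebook (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name=sagemaker_session.boto_region_name)
# invoke_and_save(runtime, predictor.endpoint_name,
#                 "Model Input/model_input.json", "Model Output/output.csv")
```

Unlike the CLI cell, this form takes plain paths, so the extra shell quoting around `Model Input` and `Model Output` is not needed.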


#### D. Visualize output

In [14]:
output = pd.read_csv('Model Output/output.csv')
output

Unnamed: 0,company_name,company_description,distance_score,industry
0,"024 Pharma, Inc.",pharma inc provides healthcare products worldw...,0.642718,Beauty Care Products
1,"22nd Century Group, Inc.",nd century group inc plant biotechnology compa...,1.361139,Cigarettes
2,"20/20 Global, Inc.",rm investors inc supplies fruits vegetables no...,1.38911,Consumer Staples
3,"1PM Industries, Inc.",pm industries inc provides consulting services...,1.411064,Commercial and Professional Services
4,1st Prestige Wealth Management,st prestige wealth management provides wealth ...,1.431003,Asset Management and Custody Banks
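
To post-process the result, the rows can be filtered by `distance_score`. The sketch below uses only the standard library and the rows shown above. It assumes that a smaller `distance_score` means a closer match, which is consistent with the ascending order of the output but is not stated in the algorithm documentation; the threshold of 1.4 and the helper `closest_matches` are hypothetical.

```python
import csv
import io


def closest_matches(csv_text, max_distance):
    """Return (company_name, distance_score) pairs within max_distance, nearest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    scored = [(row["company_name"], float(row["distance_score"])) for row in reader]
    # Assumption: a lower distance_score indicates a more similar description.
    return sorted((pair for pair in scored if pair[1] <= max_distance), key=lambda pair: pair[1])


# Rows copied from the sample output above (descriptions are truncated there as well).
sample_output = """Unnamed: 0,company_name,company_description,distance_score,industry
0,"024 Pharma, Inc.",pharma inc provides healthcare products worldw...,0.642718,Beauty Care Products
1,"22nd Century Group, Inc.",nd century group inc plant biotechnology compa...,1.361139,Cigarettes
2,"20/20 Global, Inc.",rm investors inc supplies fruits vegetables no...,1.38911,Consumer Staples
3,"1PM Industries, Inc.",pm industries inc provides consulting services...,1.411064,Commercial and Professional Services
4,1st Prestige Wealth Management,st prestige wealth management provides wealth ...,1.431003,Asset Management and Custody Banks
"""

for name, score in closest_matches(sample_output, max_distance=1.4):
    print(f"{score:.6f}  {name}")
```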


#### E. Delete the endpoint

Now that you have successfully performed a real-time inference, you no longer need the endpoint. You can delete it to avoid being charged.

In [15]:
predictor.delete_endpoint(delete_endpoint_config=True)

### 5. Perform Batch inference

In this section, you will perform batch inference using multiple input payloads together.

#### A. Run the batch-transform job

In [16]:
#upload the batch-transform job input files to S3
transform_input_folder = "Model Input/model_input.json"
transform_input = sagemaker_session.upload_data(transform_input_folder, key_prefix=model_name) 
print("Transform input uploaded to " + transform_input)

Transform input uploaded to s3://sagemaker-us-east-2-786796469737/vector-search/model_input.json


In [17]:
#Run the batch-transform job
transformer = estimator.transformer(1, batch_transform_inference_instance_type)
transformer.transform(transform_input, content_type=content_type)
transformer.wait()

..........
................................
 * Serving Flask app 'serve'
 * Debug mode: off
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8080
 * Running on http://169.254.255.131:8080
Press CTRL+C to quit
169.254.255.130 - - [02/Feb/2023 06:50:47] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [02/Feb/2023 06:50:47] "GET /execution-parameters HTTP/1.1" 404 -
Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00,  6.84it/s]
Batches: 100%|███

In [18]:
#output is available on following path
transformer.output_path

's3://sagemaker-us-east-2-786796469737/vector-search-training-2023-02-02-06-45-28-427'

#### B. Inspect the Batch Transform Output in S3

In [20]:
import os
s3_conn = boto3.client("s3")
with open('results.csv', 'wb') as f:
    s3_conn.download_fileobj(bucket, os.path.basename(transformer.output_path)+'/model_input.json.out', f)
    print("Output file loaded from bucket")

Output file loaded from bucket
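
The download cell derives the object key with `os.path.basename`, which works only while the output path has a single path component after the bucket name. A slightly more general sketch splits the S3 URI with `urllib.parse.urlparse` and relies on batch transform appending `.out` to the input object name, as seen in the key above. The helpers `split_s3_uri` and `batch_output_key` are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def batch_output_key(output_path, input_file_name):
    """Return (bucket, key) of the batch transform result for one input file."""
    bucket, prefix = split_s3_uri(output_path)
    # Batch transform stores each result under the input object name plus '.out'.
    parts = [part for part in (prefix.rstrip("/"), input_file_name + ".out") if part]
    return bucket, "/".join(parts)


output_path = "s3://sagemaker-us-east-2-786796469737/vector-search-training-2023-02-02-06-45-28-427"
print(batch_output_key(output_path, "model_input.json"))
# ('sagemaker-us-east-2-786796469737', 'vector-search-training-2023-02-02-06-45-28-427/model_input.json.out')
```

The returned pair can be passed to `s3_conn.download_fileobj(bucket, key, f)` in place of the hand-built key.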


In [21]:
output = pd.read_csv('results.csv')
output

Unnamed: 0,company_name,company_description,distance_score,industry
0,"024 Pharma, Inc.",pharma inc provides healthcare products worldw...,0.642718,Beauty Care Products
1,"22nd Century Group, Inc.",nd century group inc plant biotechnology compa...,1.361139,Cigarettes
2,"20/20 Global, Inc.",rm investors inc supplies fruits vegetables no...,1.38911,Consumer Staples
3,"1PM Industries, Inc.",pm industries inc provides consulting services...,1.411064,Commercial and Professional Services
4,1st Prestige Wealth Management,st prestige wealth management provides wealth ...,1.431003,Asset Management and Custody Banks


### 6. Clean-up

#### A. Delete the model

In [22]:
transformer.delete_model()

#### B. Unsubscribe to the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing that you want to cancel the subscription for, and then choose __Cancel Subscription__  to cancel the subscription.

