# Using NannyML Performance Estimation Algorithm from AWS Marketplace 

## Overview

NannyML can estimate the performance of a machine learning model running in production. You can read more about how it works [here](https://nannyml.readthedocs.io/en/stable/how_it_works/performance_estimation.html).


This sample notebook shows you how to use [Model Performance Estimation - NannyML](https://aws.amazon.com/marketplace/pp/prodview-uotyt66szg34o) from AWS Marketplace to estimate the performance of your deployed machine learning models.

Performance estimation works for binary classification, multiclass classification, and regression problems.

> **Note**: This is a reference notebook, and it will not run until you make the changes suggested in the notebook.

## Pre-requisites
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has **AmazonSageMakerFullAccess**.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to [Model Performance Estimation - NannyML](https://aws.amazon.com/marketplace/pp/prodview-uotyt66szg34o). 

## Contents
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)
1. [Prepare dataset](#2.-Prepare-dataset)
	1. [Dataset format expected by the algorithm](#A.-Dataset-format-expected-by-the-algorithm)
	1. [Configure and visualize reference dataset](#B.-Configure-and-visualize-reference-dataset)
	1. [Upload datasets to Amazon S3](#C.-Upload-datasets-to-Amazon-S3)
1. [Train a machine learning model](#3:-Train-a-machine-learning-model)
	1. [Set up environment](#3.1-Set-up-environment)
	1. [Train a model](#3.2-Train-a-model)
1. [Deploy model](#4:-Deploy-model)
1. [Perform Batch inference](#5.-Perform-Batch-inference)
1. [Clean-up](#7.-Clean-up)
	1. [Delete the model](#A.-Delete-the-model)
	1. [Unsubscribe to the listing (optional)](#B.-Unsubscribe-to-the-listing-(optional))


## Usage instructions
You can run this notebook one cell at a time by pressing Shift+Enter in each cell.

## 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page: [Model Performance Estimation - NannyML](https://aws.amazon.com/marketplace/pp/prodview-uotyt66szg34o).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the EULA, pricing, and support terms, and click **Accept Offer** if you agree with them.
1. Once you click the **Continue to configuration** button and choose a **region**, you will see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

<font color='red'>Copy your assigned ARN into the cell below:</font>

In [1]:
# algo_arn = "<Customer to specify algorithm ARN corresponding to their AWS region>"
with open('arn.txt', 'r') as file:
    algo_arn = file.read().rstrip()
# algo_arn
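Before using the ARN, it can help to check that it looks like an algorithm ARN and that its region matches the one you are working in. The snippet below is a minimal sketch, not part of the listing's instructions: the helper `parse_algorithm_arn` is hypothetical, the ARN layout `arn:<partition>:sagemaker:<region>:<account-id>:algorithm/<name>` is an assumption about the value shown on the configuration page, and the example ARN is made up.

```python
import re

# Assumed shape of a SageMaker algorithm ARN:
#   arn:<partition>:sagemaker:<region>:<account-id>:algorithm/<name>
ALGORITHM_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):sagemaker:"
    r"(?P<region>[a-z0-9-]+):(?P<account>\d{12}):algorithm/(?P<name>.+)$"
)


def parse_algorithm_arn(arn: str) -> dict:
    """Split an algorithm ARN into its parts, or raise ValueError."""
    match = ALGORITHM_ARN_PATTERN.match(arn.strip())
    if match is None:
        raise ValueError(f"Not a SageMaker algorithm ARN: {arn!r}")
    return match.groupdict()


# Hypothetical ARN, for illustration only; use the one from your subscription.
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:algorithm/example-algorithm"
parts = parse_algorithm_arn(example_arn)
print(parts["region"])
```

In the notebook you could call `parse_algorithm_arn(algo_arn)` and compare the returned region with the region of your SageMaker session before starting a training job.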

## 2. Prepare dataset

In [2]:
import sagemaker as sage
from sagemaker import get_execution_role
import pandas as pd
import json

### A. Dataset format expected by the algorithm

To fully demonstrate the capabilities of NannyML's performance estimation we will provide code for all 3 supported machine learning problem types.

- For Binary Classification we are going to use [NannyML's synthetic car loan dataset](https://nannyml.readthedocs.io/en/stable/datasets/binary_car_loan.html).
- For Multiclass Classification we are going to use [NannyML's synthetic multiclass credit card assignment dataset](https://nannyml.readthedocs.io/en/stable/datasets/multiclass.html).
- For Regression we are going to use [NannyML's synthetic car price dataset](https://nannyml.readthedocs.io/en/stable/datasets/regression.html).


You can find some information about the dataset format in the **Usage Information** section of [Model Performance Estimation - NannyML](https://aws.amazon.com/marketplace/pp/prodview-uotyt66szg34o).
<br>
More detailed information can be found in [NannyML's Data Requirements Documentation](https://nannyml.readthedocs.io/en/stable/tutorials/data_requirements.html).

<font color='red'>Edit the code below as appropriate for the Machine Learning problem type you are interested in:</font>

In [3]:
# machine_learning_problem_type = "Binary Classification"
# machine_learning_problem_type = "Multiclass Classification"
machine_learning_problem_type = "Regression"

### B. Configure and visualize reference dataset

In [4]:
if machine_learning_problem_type == "Binary Classification":
    reference_dataset = "data/bc_reference.csv"
elif machine_learning_problem_type == "Multiclass Classification":
    reference_dataset = "data/mc_reference.csv"
elif machine_learning_problem_type == "Regression":
    reference_dataset = "data/reg_reference.csv"
else:
    raise ValueError("Unsupported Machine Learning Problem Type.")

# Show selected dataset
pd.read_csv(reference_dataset).head()

| | car_age | km_driven | price_new | accident_count | door_count | fuel | transmission | y_true | y_pred | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | 144020.0 | 42810.0 | 4.0 | 3.0 | diesel | automatic | 569.0 | 1246.0 | 2017-01-24 08:00:00.000 |
| 1 | 12.0 | 57078.0 | 31835.0 | 3.0 | 3.0 | electric | automatic | 4277.0 | 4924.0 | 2017-01-24 08:00:33.600 |
| 2 | 2.0 | 76288.0 | 31851.0 | 3.0 | 5.0 | diesel | automatic | 7011.0 | 5744.0 | 2017-01-24 08:01:07.200 |
| 3 | 7.0 | 97593.0 | 29288.0 | 2.0 | 3.0 | electric | manual | 5576.0 | 6781.0 | 2017-01-24 08:01:40.800 |
| 4 | 13.0 | 9985.0 | 41350.0 | 1.0 | 5.0 | diesel | automatic | 6456.0 | 6822.0 | 2017-01-24 08:02:14.400 |
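A quick check that a file contains every column you plan to reference can catch naming mistakes before a training job is launched. The sketch below uses only the standard library and the first two rows shown above; the helper `missing_columns` is hypothetical, and the list of required columns is taken from the regression hyperparameters defined later in this notebook, not from a documented schema.

```python
import csv
import io

# First rows of the regression reference data shown above
# (the unnamed index column is left out for brevity).
sample_csv = """car_age,km_driven,price_new,accident_count,door_count,fuel,transmission,y_true,y_pred,timestamp
15.0,144020.0,42810.0,4.0,3.0,diesel,automatic,569.0,1246.0,2017-01-24 08:00:00.000
12.0,57078.0,31835.0,3.0,3.0,electric,automatic,4277.0,4924.0,2017-01-24 08:00:33.600
"""

# Columns referenced by the regression hyperparameters in this notebook.
required_columns = [
    "car_age", "km_driven", "price_new", "accident_count", "door_count",
    "fuel", "transmission", "y_pred", "y_true", "timestamp",
]


def missing_columns(csv_text: str, required: list) -> list:
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in required if name not in header]


print(missing_columns(sample_csv, required_columns))
```

An empty list means every referenced column is present. For the binary classification data you would check for `repaid` and `y_pred_proba` instead, following the hyperparameters used for that problem type.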


### C. Upload datasets to Amazon S3

In [5]:
sagemaker_session = sage.Session()
bucket = sagemaker_session.default_bucket()
# bucket

In [6]:
demo_prefix = "doc-notebook-demo"

In [7]:
reference_data = sagemaker_session.upload_data(
    reference_dataset, bucket=bucket, key_prefix=demo_prefix
)

## 3: Train a machine learning model

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to train a machine learning model. 

### 3.1 Set up environment

In [8]:
role = get_execution_role()

In [9]:
output_location = f"s3://{bucket}/{demo_prefix}/output"
# output_location
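To make the S3 layout used so far concrete, the sketch below composes the two URIs by hand. It assumes that `upload_data` stores a single file under `<key_prefix>/<file name>` in the given bucket; the bucket name is a hypothetical placeholder and the helper `s3_uri` is not part of the SageMaker SDK.

```python
import posixpath


def s3_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and key parts into an s3:// URI."""
    return "s3://" + posixpath.join(bucket, *parts)


example_bucket = "example-bucket"  # hypothetical; the notebook uses the session default bucket
example_prefix = "doc-notebook-demo"
example_dataset = "data/reg_reference.csv"

# Assumed location of the uploaded reference data: only the file name is kept.
reference_uri = s3_uri(example_bucket, example_prefix, posixpath.basename(example_dataset))

# Location passed to the estimator as its output path.
output_uri = s3_uri(example_bucket, example_prefix, "output")

print(reference_uri)
print(output_uri)
```

Printing `reference_data` and `output_location` in the notebook shows the actual values for your account.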

### 3.2 Train a model

You can find more information about the hyperparameters in the **Hyperparameters** section of [Model Performance Estimation - NannyML](https://aws.amazon.com/marketplace/pp/prodview-uotyt66szg34o).

For even more detailed information, you can read the NannyML tutorials on performance estimation for:
- [Binary Classification](https://nannyml.readthedocs.io/en/stable/tutorials/performance_estimation/binary_performance_estimation.html)
- [Multiclass Classification](https://nannyml.readthedocs.io/en/stable/tutorials/performance_estimation/multiclass_performance_estimation.html)
- [Regression](https://nannyml.readthedocs.io/en/stable/tutorials/performance_estimation/regression_performance_estimation.html)

In [10]:
# Define hyperparameters
if machine_learning_problem_type == "Binary Classification":
    nannyml_parameters = {
        "y_pred_proba": "y_pred_proba",
        "y_pred": "y_pred",
        "y_true": "repaid",
        "timestamp_column_name": "timestamp",
        "metrics": ["roc_auc"],
        "chunk_size": 5000,
        "problem_type": "classification_binary",
    }
    # json.dumps needed due to sagemaker specifications
    sagemaker_hyperparameters = {
        "data_filename": reference_dataset.split("/")[-1],
        "data_type": "csv",
        "problem_type": "classification_binary",
        "parameters": json.dumps(nannyml_parameters),
    }
elif machine_learning_problem_type == "Multiclass Classification":
    nannyml_parameters = {
        "y_pred": "y_pred",
        "y_pred_proba": {
            "prepaid_card": "y_pred_proba_prepaid_card",
            "highstreet_card": "y_pred_proba_highstreet_card",
            "upmarket_card": "y_pred_proba_upmarket_card"
        },
        "y_true": "y_true",
        "timestamp_column_name": "timestamp",
        "metrics": ["roc_auc"],
        "chunk_size": 5000,
        "problem_type": "classification_multiclass",
    }
    # json.dumps needed due to sagemaker specifications
    sagemaker_hyperparameters = {
        "data_filename": reference_dataset.split("/")[-1],
        "data_type": "csv",
        "problem_type": "classification_multiclass",
        "parameters": json.dumps(nannyml_parameters),
    }
elif machine_learning_problem_type == "Regression":
    nannyml_parameters = {
        "feature_column_names": [
            "car_age",
            "km_driven",
            "price_new",
            "accident_count",
            "door_count",
            "fuel",
            "transmission",
        ],
        "y_pred": "y_pred",
        "y_true": "y_true",
        "timestamp_column_name": "timestamp",
        "metrics": ["rmse"],
        "chunk_size": 6000,
        "tune_hyperparameters": False,
    }
    # json.dumps needed due to sagemaker specifications
    sagemaker_hyperparameters = {
        "data_filename": reference_dataset.split("/")[-1],
        "data_type": "csv",
        "problem_type": "regression",
        "parameters": json.dumps(nannyml_parameters),
    }
else:
    raise ValueError("Unsupported Machine Learning Problem Type.")
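The `json.dumps` call in the cell above is needed because SageMaker hyperparameters are passed as a flat mapping of strings, so the nested NannyML settings are packed into one JSON string. The sketch below shows that round trip with a shortened copy of the regression settings; how the algorithm container decodes the string on its side is an assumption here, shown as a plain `json.loads`.

```python
import json

# Shortened copy of the regression settings, for illustration only.
example_parameters = {
    "y_pred": "y_pred",
    "y_true": "y_true",
    "metrics": ["rmse"],
    "chunk_size": 6000,
    "tune_hyperparameters": False,
}

# Nested values are serialized into a single string value.
encoded = json.dumps(example_parameters)
example_hyperparameters = {
    "problem_type": "regression",
    "parameters": encoded,
}

# Every hyperparameter value is now a string.
print(all(isinstance(value, str) for value in example_hyperparameters.values()))

# Assumed decoding step on the receiving side.
decoded = json.loads(example_hyperparameters["parameters"])
print(decoded["chunk_size"], decoded["tune_hyperparameters"])
```

Note that Python's `False` becomes JSON `false` in the encoded string and is restored to `False` after decoding, so booleans and lists survive the trip unchanged.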

For information on creating an `Estimator` object, see the [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html).

In [11]:
# Create an estimator object for running a training job
estimator = sage.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name='nml-perf-est',
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=sagemaker_hyperparameters,
)

In [12]:
# Run the training job.
estimator.fit(
    {'training': reference_data}
)

INFO:sagemaker:Creating training-job with name: nml-perf-est-2023-08-21-05-24-46-306

2023-08-21 05:24:46 Starting - Starting the training job...
2023-08-21 05:25:02 Starting - Preparing the instances for training......
2023-08-21 05:26:06 Downloading - Downloading input data...
2023-08-21 05:26:31 Training - Downloading the training image...
2023-08-21 05:27:07 Uploading - Uploading generated training model
INFO:nannyml:Logger object created.
INFO:nannyml:Hyperparameters read.
INFO:nannyml:Estimator Instantiated.
INFO:nannyml:Loaded data.
INFO:nannyml.base:fitting DLE[tune_hyperparameters=False, metrics=['RMSE']]
DEBUG:nannyml.performance_estimation.direct_loss_estimation.metrics:fitting RMSE
DEBUG:nannyml.performance_estimation.direct_loss_estimation.metrics:'tune_hyperparameters' set to 'False': skipping hyperparameter tuning
DEBUG:nannyml.performance_estimation.direct_loss_estimation.metrics:estimating RMSE
DEBUG:nannyml.performance_estimation.direct_loss_estimation.metrics:estimating RMSE

See this [blog post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics and logs in the **Monitor** section.

## 4: Deploy model

**NannyML's Performance Estimation is not designed for real-time inference, so using it that way is not recommended.**

For this reason we are not showcasing the real-time inference feature of the SageMaker algorithm.

## 5. Perform Batch inference

In this section, you will perform batch inference using multiple input payloads together.
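Results are reported per chunk, with the chunk size taken from the `chunk_size` hyperparameter set before training. The sketch below shows how size-based chunking maps row positions to keys such as `[0:5999]`, using the 60000 rows and chunk size of 6000 from the regression example. The helper `chunk_keys` is hypothetical, and the way it handles a shorter final chunk is an assumption; NannyML may treat incomplete chunks differently.

```python
def chunk_keys(n_rows: int, chunk_size: int) -> list:
    """Return keys of the form [start:end] for consecutive fixed-size chunks."""
    keys = []
    for start in range(0, n_rows, chunk_size):
        # End index is inclusive; the last chunk may be shorter (assumption).
        end = min(start + chunk_size, n_rows) - 1
        keys.append(f"[{start}:{end}]")
    return keys


keys = chunk_keys(n_rows=60000, chunk_size=6000)
print(len(keys), keys[0], keys[-1])
```

With these inputs the sketch produces ten chunks, from `[0:5999]` to `[54000:59999]`, which matches the keys in the results table at the end of this section.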

In [13]:
# upload the batch-transform job input files to S3

if machine_learning_problem_type == "Binary Classification":
    inference_dataset = "data/bc_analysis.csv"
elif machine_learning_problem_type == "Multiclass Classification":
    inference_dataset = "data/mc_analysis.csv"
elif machine_learning_problem_type == "Regression":
    inference_dataset = "data/reg_analysis.csv"
else:
    raise ValueError("Unsupported Machine Learning Problem Type.")

inference_data = sagemaker_session.upload_data(inference_dataset, bucket=bucket, key_prefix=demo_prefix)
# print("Transform input uploaded to " + inference_data)

In [14]:
# Run the batch-transform job
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=output_location
)
transformer.transform(inference_data, content_type="text/csv")
transformer.wait()

INFO:sagemaker:Creating model package with name: nannyml-performance-estimation-1350cd6b-2023-08-21-05-27-59-475


..........

INFO:sagemaker:Creating model with name: nannyml-performance-estimation-1350cd6b-2023-08-21-05-28-50-005





INFO:sagemaker:Creating transform job with name: nml-perf-est-2023-08-21-05-28-50-522


...........................
 * Serving Flask app '/web_app_serve.py'
 * Debug mode: off
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:8080
Press CTRL+C to quit
169.254.255.130 - - [21/Aug/2023 05:33:24] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [21/Aug/2023 05:33:24] "GET /execution-parameters HTTP/1.1" 404 -
Received POST invocation request.
Estimation invoked with 60000 rows
Estimation invoked with columns: ['car_age', '

**View Results of Performance Estimation**

In [15]:
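# The transform output is the input file name with ".out" appended; it has a two-level header
results_uri = transformer.output_path + "/" + inference_dataset.split("/")[-1] + ".out"
results = pd.read_csv(results_uri, header=[0, 1])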
results

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole


Unnamed: 0_level_0,chunk,chunk,chunk,chunk,chunk,chunk,chunk,rmse,rmse,rmse,rmse,rmse,rmse,rmse,rmse
Unnamed: 0_level_1,key,chunk_index,start_index,end_index,start_date,end_date,period,sampling_error,realized,value,upper_confidence_boundary,lower_confidence_boundary,upper_threshold,lower_threshold,alert
0,[0:5999],0,0,5999,2017-01-24 08:00:00,2017-01-26 15:59:26.400,reference,10.348009,1086.309762,1073.905515,1104.949542,1042.861488,1103.313741,1014.276167,False
1,[6000:11999],1,6000,11999,2017-01-26 16:00:00,2017-01-28 23:59:26.400,reference,10.348009,1060.221538,1056.620005,1087.664032,1025.575979,1103.313741,1014.276167,False
2,[12000:17999],2,12000,17999,2017-01-29 00:00:00,2017-01-31 07:59:26.400,reference,10.348009,1038.419338,1054.927053,1085.97108,1023.883026,1103.313741,1014.276167,False
3,[18000:23999],3,18000,23999,2017-01-31 08:00:00,2017-02-02 15:59:26.400,reference,10.348009,1038.398714,1054.427268,1085.471295,1023.383241,1103.313741,1014.276167,False
4,[24000:29999],4,24000,29999,2017-02-02 16:00:00,2017-02-04 23:59:26.400,reference,10.348009,1072.021221,1066.535506,1097.579533,1035.49148,1103.313741,1014.276167,False
5,[30000:35999],5,30000,35999,2017-02-05 00:00:00,2017-02-07 07:59:26.400,reference,10.348009,1074.967232,1064.803413,1095.84744,1033.759386,1103.313741,1014.276167,False
6,[36000:41999],6,36000,41999,2017-02-07 08:00:00,2017-02-09 15:59:26.400,reference,10.348009,1058.475997,1057.218829,1088.262856,1026.174803,1103.313741,1014.276167,False
7,[42000:47999],7,42000,47999,2017-02-09 16:00:00,2017-02-11 23:59:26.400,reference,10.348009,1050.695322,1055.10372,1086.147746,1024.059693,1103.313741,1014.276167,False
8,[48000:53999],8,48000,53999,2017-02-12 00:00:00,2017-02-14 07:59:26.400,reference,10.348009,1048.396774,1052.109723,1083.15375,1021.065696,1103.313741,1014.276167,False
9,[54000:59999],9,54000,59999,2017-02-14 08:00:00,2017-02-16 15:59:26.400,reference,10.348009,1060.04364,1053.167044,1084.211071,1022.123017,1103.313741,1014.276167,False
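The columns in this table are related in a simple way, which the sketch below recomputes from the values shown above. The numbers are consistent with a confidence band of the estimated value plus or minus three sampling errors, with thresholds at the mean plus or minus three population standard deviations of the realized RMSE over these chunks, and with an alert being raised when the estimated value falls outside the thresholds. These relationships are inferred from the table, not taken from the listing's documentation.

```python
import statistics

# Values copied from the results table above.
sampling_error = 10.348009
upper_threshold = 1103.313741
lower_threshold = 1014.276167
estimated = [
    1073.905515, 1056.620005, 1054.927053, 1054.427268, 1066.535506,
    1064.803413, 1057.218829, 1055.10372, 1052.109723, 1053.167044,
]
realized = [
    1086.309762, 1060.221538, 1038.419338, 1038.398714, 1072.021221,
    1074.967232, 1058.475997, 1050.695322, 1048.396774, 1060.04364,
]

# Confidence band of the first chunk: estimate +/- 3 sampling errors.
first_upper = estimated[0] + 3 * sampling_error
first_lower = estimated[0] - 3 * sampling_error

# Thresholds: mean +/- 3 population standard deviations of realized RMSE.
mean_realized = statistics.fmean(realized)
std_realized = statistics.pstdev(realized)
recomputed_upper = mean_realized + 3 * std_realized
recomputed_lower = mean_realized - 3 * std_realized

# An alert is assumed to fire when the estimate leaves the threshold range.
alerts = [not (lower_threshold <= value <= upper_threshold) for value in estimated]

print(round(first_upper, 6), round(first_lower, 6))
print(round(recomputed_upper, 3), round(recomputed_lower, 3))
print(any(alerts))
```

No chunk raises an alert here because every estimated RMSE lies inside the threshold range, in line with the `alert` column.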


## 7. Clean-up

### A. Delete the model

In [16]:
transformer.delete_model()

INFO:sagemaker:Deleting model with name: nannyml-performance-estimation-1350cd6b-2023-08-21-05-28-50-005


### B. Unsubscribe to the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing that you want to cancel the subscription for, and then choose __Cancel Subscription__.

