# Train, tune, and deploy a custom ML model using <font color='red'>AIA(AI-Advisor) Univariate Contextual Anomaly Detection(U-CAD) </font> Algorithm from AWS Marketplace 

⚠ **Change the link for `AIA-PAD-ALGO`: the current link points to our GitLab repository.**

<font color='red'> This product is able to detect contextual anomalies in time-series data that exhibit seasonal or cyclic patterns using unsupervised machine learning algorithms. It also offers auto hyperparameter optimization to provide users with greater convenience in finding proper hyperparameters for the algorithms and creating optimized anomaly detection models for their data. Additionally, the algorithm employs techniques to improve the accuracy of basic contextual anomaly detection algorithms. </font>

This sample notebook shows you how to train a U-CAD using <font color='red'> For Seller to update: [AIA-PAD-ALGO](http://mod.lge.com/hub/ai_contents_marketplace/aia-ml-marketplace)</font> from AWS Marketplace.

> **Note**: This is a reference notebook and it cannot run unless you make changes suggested in the notebook.

## Pre-requisites
1. **Note**: This notebook contains elements that render correctly in the Jupyter interface. Open this notebook from an Amazon SageMaker Notebook Instance or Amazon SageMaker Studio.
1. Ensure that the IAM role used has **AmazonSageMakerFullAccess**.
1. Some hands-on experience using [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
1. To use this algorithm successfully, ensure that:
    1. Either your IAM role has these three permissions and you have authority to make AWS Marketplace subscriptions in the AWS account used: 
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**  
    2. or your AWS account has a subscription to <font color='red'> For Seller to update: [AIA-PAD-ALGO](http://mod.lge.com/hub/ai_contents_marketplace/aia-ml-marketplace)</font>. 

## Contents
1. [Subscribe to the algorithm](#1.-Subscribe-to-the-algorithm)  
2. [Prepare dataset](#2.-Prepare-dataset)  
    2.1 [Dataset format expected by the algorithm](#2.1-Dataset-format-expected-by-the-algorithm)  
    2.2 [Configure and visualize train and test dataset](#2.2-Configure-and-visualize-train-and-test-dataset)  
    2.3 [Upload datasets to Amazon S3](#2.3-Upload-datasets-to-Amazon-S3)  
3. [Train a machine learning model](#3.-Train-a-machine-learning-model)  
    3.1 [Set up environment](#3.1-Set-up-environment)  
    3.2 [Train a model](#3.2-Train-a-model)  
4. [Deploy model and verify results](#4.-Deploy-model-and-verify-results)  
    4.1 [Deploy trained model](#4.1-Deploy-trained-model)  
    4.2 [Visualize output](#4.2-Visualize-output)  
5. [Perform inference](#5.-Perform-inference)  
    5.1 [Perform batch inference](#5.1-Perform-batch-inference)  
    5.2 [Visualize output](#5.2-Visualize-output)  
    5.3 [Perform real-time inference](#5.3-Perform-real-time-inference)  
    5.4 [Visualize output](#5.4-Visualize-output)  
    5.5 [Delete the endpoint](#5.5-Delete-the-endpoint)  
6. [Clean-up](#6.-Clean-up)  
    6.1 [Delete the model](#6.1-Delete-the-model)  
    6.2 [Unsubscribe to the listing (optional)](#6.2-Unsubscribe-to-the-listing-(optional))

## Usage instructions
You can run this notebook one cell at a time (press Shift+Enter to run a cell).

## 1. Subscribe to the algorithm

To subscribe to the algorithm:
1. Open the algorithm listing page <font color='red'> For Seller to update: [AIA-PAD-ALGO](http://mod.lge.com/hub/ai_contents_marketplace/aia-ml-marketplace)</font>
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **"Accept Offer"** if you agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and choose a **region**. You will then see a **Product Arn**. This is the algorithm ARN that you need to specify when training a custom ML model. Copy the ARN corresponding to your region and specify it in the following cell.

![product_arn_image](images/product_arn_image.png)
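
Because the ARN is region-specific, it can help to check that the ARN you copied matches the region you plan to use before starting a training job. The helper below is a minimal sketch, not part of the product: `parse_algorithm_arn` is a hypothetical name, and it assumes the usual `arn:aws:sagemaker:<region>:<account>:algorithm/<name>` layout seen in the sample ARN in the next cell.

```python
def parse_algorithm_arn(arn):
    """Split a SageMaker algorithm ARN into region, account, and name (hypothetical helper)."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "sagemaker":
        raise ValueError(f"Not a SageMaker ARN: {arn!r}")
    resource_type, _, name = parts[5].partition("/")
    if resource_type != "algorithm" or not name:
        raise ValueError(f"Not an algorithm ARN: {arn!r}")
    return {"region": parts[3], "account": parts[4], "name": name}


# Sample ARN taken from this notebook
info = parse_algorithm_arn(
    "arn:aws:sagemaker:us-east-2:438613450817:algorithm/cad-image-v1-1-0"
)
print(info["region"])
```

Comparing `info["region"]` with the `aws_region` value set in the next cell catches a mismatched region early.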

In [1]:
from getpass import getpass 

# SHAPE
# algo_arn = "<Customer to specify algorithm ARN corresponding to their AWS region follow the instruction above>"

########################################CHANGE####################################################
# SAMPLE
algo_arn='arn:aws:sagemaker:us-east-2:438613450817:algorithm/cad-image-v1-1-0'
##################################################################################################

# get your session information
#####################################################
aws_region = "us-east-2"  ##
aws_access_key = getpass(prompt="Access key: ")
aws_secret_key = getpass(prompt="Secret key: ")
#######################################################

## 2. Prepare dataset

In [2]:
import base64
import json
import uuid
import sagemaker as sage
from sagemaker import get_execution_role
from sagemaker import ModelPackage
from urllib.parse import urlparse
import boto3
from IPython.display import Image
from PIL import Image as ImageEdit
import urllib.request
import numpy as np

### 2.1 Dataset format expected by the algorithm

This solution has **2 steps**: `Training` and `Testing` the algorithm.

**Train**
- The algorithm trains on a user-provided dataset.
- The dataset must be a CSV file with UTF-8 encoding, stored under the `./data/train/` folder.

**Test**
- After the machine learning model is trained, it can be used to make predictions on a test dataset.
- The algorithm also tests on a user-provided dataset.
- The dataset must be a CSV file with UTF-8 encoding, stored under the `./data/test/` folder.
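
The file requirements above (CSV, UTF-8) can be checked locally before uploading anything. The following is a minimal sketch using only the standard library; `read_csv_header` is a hypothetical helper and is not part of the algorithm.

```python
import csv
from pathlib import Path


def read_csv_header(path):
    """Return the header row of a UTF-8 CSV file, raising if the file is not usable."""
    path = Path(path)
    if path.suffix.lower() != ".csv":
        raise ValueError(f"{path.name} is not a .csv file")
    # Reading every row with strict UTF-8 decoding raises UnicodeDecodeError
    # if any part of the file uses a different encoding.
    with path.open(encoding="utf-8", newline="") as handle:
        rows = list(csv.reader(handle))
    if not rows or not rows[0]:
        raise ValueError(f"{path.name} has no header row")
    return rows[0]
```

For example, `read_csv_header("data/train/real_ucr_1sddb40_train.csv")` would return the column names of the sample training file used later in this notebook.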

**Data format**  
U-CAD requires the input data to have three columns.
- time column: values must follow the format '%Y-%m-%dT%H:%M:%S'. If your data does not have a time column, you can create a fake time column and add it to your dataframe.
- data column: contains the data that you want to analyze.
- cycle column: identifies which pattern each row belongs to. If your data is a collection of patterns with the same length, you can use the code below (`make_cycle_column`). However, if the patterns in your data have different lengths, you must create the cycle column yourself and add it to your dataframe. All rows within one pattern must share the same cycle value.

*Tip*
- There's no need to worry if the patterns within your data have varying lengths. The U-CAD system can handle it! To ensure that the patterns are analyzed correctly, the system performs an interpolation process that aligns the patterns and makes them uniform in length.
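
If your data has no time column, a fake one in the required format can be generated as sketched below. This is an illustration only: `make_fake_time_column`, the start time, and the 10-second step are assumptions (the step mirrors the spacing in the sample data shown later), and `is_valid_time` simply checks a value against the format given above.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # format required for the time column


def make_fake_time_column(n_rows, start="2023-01-01T00:00:00", step_seconds=10):
    """Build n_rows evenly spaced timestamps as strings (start and step are arbitrary)."""
    first = datetime.strptime(start, TIME_FORMAT)
    return [
        (first + timedelta(seconds=i * step_seconds)).strftime(TIME_FORMAT)
        for i in range(n_rows)
    ]


def is_valid_time(value):
    """Return True if value parses with the required time format."""
    try:
        datetime.strptime(value, TIME_FORMAT)
        return True
    except ValueError:
        return False


fake_times = make_fake_time_column(3)
print(fake_times)
```

The resulting list can be assigned to a new dataframe column, for example `df["time_fake"] = make_fake_time_column(len(df))`.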

### 2.2 Configure and visualize train and test dataset
The `train` and `test` datasets should look like the samples below:

In [3]:
import pandas as pd # import pandas to show what the data looks like

In [4]:
# SHAPE
# training_dataset = "data/train/<FileName.ext>"

########################################CHANGE####################################################
# SAMPLE
training_dataset = "data/train/real_ucr_1sddb40_train.csv" # 300point cycle
##################################################################################################

In [5]:
# show sample of training dataset
train_df = pd.read_csv(training_dataset)
train_df.head()

Unnamed: 0,time_fake,data,cycle,project
0,2023-03-18T08:19:37,170.0,0.0,project
1,2023-03-18T08:19:47,171.0,0.0,project
2,2023-03-18T08:19:57,175.0,0.0,project
3,2023-03-18T08:20:07,171.0,0.0,project
4,2023-03-18T08:20:17,175.0,0.0,project


In [6]:
# SHAPE
# test_dataset = "data/test/<FileName.ext>"

########################################CHANGE####################################################
# SAMPLE
test_dataset = "data/test/real_ucr_1sddb40_inf.csv"
##################################################################################################

In [7]:
# show sample of test dataset
test_df = pd.read_csv(test_dataset)
test_df.head()

Unnamed: 0,time_fake,data,cycle,project
0,2023-03-18T08:19:37,128.0,0.0,project
1,2023-03-18T08:19:47,135.0,0.0,project
2,2023-03-18T08:19:57,138.0,0.0,project
3,2023-03-18T08:20:07,146.0,0.0,project
4,2023-03-18T08:20:17,146.0,0.0,project


If your data doesn't have a cycle column and you know the (fixed) pattern length, you can generate the column with the `make_cycle_column` helper in the next cell.  
If the patterns in your data have varying lengths, you'll need to add the cycle column manually.
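
For the varying-length case, one way to build the cycle column manually is from a list of known pattern lengths. The sketch below is a hypothetical helper (`cycle_column_from_lengths` is not provided by the product) and assumes you already know how many rows each pattern spans, in order.

```python
def cycle_column_from_lengths(pattern_lengths):
    """Return one cycle id per row, given the row count of each consecutive pattern."""
    labels = []
    for cycle_id, length in enumerate(pattern_lengths):
        # Every row of the same pattern receives the same cycle value
        labels.extend([float(cycle_id)] * length)
    return labels


# Two patterns: the first spans 2 rows, the second spans 3 rows
labels = cycle_column_from_lengths([2, 3])
print(labels)
```

The list can be assigned directly, for example `train_df['cycle'] = cycle_column_from_lengths(lengths)`, provided the lengths sum to `len(train_df)`.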

In [None]:
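def make_cycle_column(df, pattern_length):
    # Label every block of `pattern_length` consecutive rows with one cycle id (0.0, 1.0, 2.0, ...)
    n_cycles = len(df) // pattern_length + 1
    return np.repeat(np.arange(n_cycles, dtype=float), pattern_length)[:len(df)]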

In [None]:
train_df['cycle'] = make_cycle_column(train_df,300)
test_df['cycle'] = make_cycle_column(test_df,300)

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
train_df.to_csv(training_dataset,index=False)
test_df.to_csv(test_dataset,index=False)

### 2.3 Upload datasets to Amazon S3

<font color='red'>Do not change bucket parameter value. Do not hardcode your S3 bucket name.</font>

In [8]:
import boto3
import sagemaker



boto_session = boto3.Session(region_name=aws_region, aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key)
sagemaker_session = sagemaker.Session(boto_session=boto_session) # get session info


bucket = sagemaker_session.default_bucket()
bucket

'sagemaker-us-east-2-438613450817'

In [9]:
# upload training data to s3 bucket
training_data = sagemaker_session.upload_data(training_dataset, bucket=bucket, key_prefix="TRAINING_INPUT_DIR")
print("Training input uploaded to : " + training_data)

Training input uploaded to : s3://sagemaker-us-east-2-438613450817/TRAINING_INPUT_DIR/real_ucr_1sddb40_train.csv
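
The returned value is an S3 URI made of the default bucket, the key prefix, and the file name. If you need the bucket and key separately (for example, to inspect the object with `boto3`), a small helper such as the following sketch can split the URI. `split_s3_uri` is a hypothetical name, not a SageMaker function.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# URI taken from the cell output above
bucket_name, key = split_s3_uri(
    "s3://sagemaker-us-east-2-438613450817/TRAINING_INPUT_DIR/real_ucr_1sddb40_train.csv"
)
print(bucket_name, key)
```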


In [None]:
# # upload test data to s3 bucket
# test_data = sagemaker_session.upload_data(test_dataset, bucket=bucket, key_prefix="INFERENCE_INPUT_DIR")
# print("Inference input uploaded to : " + test_data)

## 3. Train a machine learning model

Now that the dataset is available in an accessible Amazon S3 bucket, we are ready to train a machine learning model.

### 3.1 Set up environment

In [10]:
## If you are running on a local server, enter the role name specified in IAM role.

sts = boto3.client('sts', region_name=aws_region, aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key)
caller_identity = sts.get_caller_identity()
account_id = caller_identity['Account']
role_name = input("Role name: ")
role = f'arn:aws:iam::{account_id}:role/{role_name}'



### If you are running in sagemaker jupyter notebook then uncomment the below. (The above is commented out.) 

#from sagemaker import get_execution_role
#role = get_execution_role(sagemaker_session=sagemaker_session)

print (f"Result: {role}")

Account ID: 438613450817


<font color='red'>For Seller to update: update the algorithm-specific unique prefix in the following cell. </font>

In [11]:
# SHAPE
# output_location = "s3://{}/<For seller to Update:Update a unique prefix>/{}".format(bucket, "output")

########################################CHANGE####################################################
# SAMPLE
output_location = "s3://{}/ai-advisor-cad/{}".format(bucket, "output")
##################################################################################################

### 3.2 Train a model

You can also find more information about dataset format in **Hyperparameters** section of <font color='red'> For Seller to update:[Title_of_your_product](Provide link to your marketplace listing of your product).</font>

**Hyperparameters**  
You must specify the column names and choose which CAD functions to enable or disable through the hyperparameters.  
The arguments supported by CAD are described below.
1. **common_x_columns**
    - The name of the data column to be analyzed in the input data.
2. **common_time_columns**
    - The name of the time column in your input data.
3. **common_index_columns**
    - In CAD, you can create a separate model for each group by using the 'groupkey' function through the 'common_index_columns' parameter.
    - A groupkey column specifies the groups; rows that belong to the same group must have the same value in that column.
    - You can set up to 3 groupkey columns through the 'common_index_columns' parameter, using the following argument format.  
    *Argument format*: 'column1,column2,column3'.
4. **common_pattern_column** 
    - The name of the cycle column in your input data.
5. **common_max_pattern_length** 
    - If the patterns in your data vary in length and you know the cycle length, this optional setting makes CAD interpolate every pattern to a common maximum pattern length.  
    *Argument format*: ex. '200'
6. **common_missing_x_adjust**
    - This function handles missing values in the data column.  
    *Argument format*: 'none','dropna'(drop the missing values)
7. **inference_adaptive_threshold**
    - In CAD, the 'w' argument acts as a threshold and ranges from 1 to 9 (float). Setting 'w' adjusts the anomaly threshold adaptively based on the input data.
    - Checking the 'w_table.csv' file after training allows you to choose an appropriate value for 'w'. If you leave it blank, CAD uses, at inference time, the value of 'w' that was found during hyperparameter optimization (HPO) in training.  
    *Argument format*: ex. '2.5' 
8. **inference_update_model**
    - To update the model with new inference data, set the value to 'true'. To leave the model unchanged after inference, set the value to 'false'.  
    *Argument format*: 'true','false'
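
The argument formats listed above can be checked before launching a training job. The function below is a minimal sketch under stated assumptions: `check_hyperparameters` is a hypothetical helper, the key names follow the dictionary defined in the next cell, and the rules encode only what the descriptions above state (at most 3 index columns, 'none' or 'dropna', 'w' between 1 and 9 or blank, 'true' or 'false').

```python
def check_hyperparameters(hp):
    """Return a list of human-readable problems found in a hyperparameter dict."""
    problems = []

    index_cols = [c for c in hp.get("common_index_columns", "").split(",") if c.strip()]
    if len(index_cols) > 3:
        problems.append("common_index_columns accepts at most 3 columns")

    if hp.get("common_missing_x_adjust", "none") not in ("none", "dropna"):
        problems.append("common_missing_x_adjust must be 'none' or 'dropna'")

    max_len = hp.get("common_max_pattern_length", "")
    if max_len != "" and not max_len.isdigit():
        problems.append("common_max_pattern_length must be a whole number or blank")

    w = hp.get("inference_adaptive_threshold", "")
    if w != "":
        try:
            w_value = float(w)
        except ValueError:
            problems.append("inference_adaptive_threshold must be a number or blank")
        else:
            if not 1 <= w_value <= 9:
                problems.append("inference_adaptive_threshold must be between 1 and 9")

    if hp.get("inference_update_model", "true") not in ("true", "false"):
        problems.append("inference_update_model must be 'true' or 'false'")

    return problems


sample = {
    "common_x_columns": "data",
    "common_index_columns": "",
    "common_pattern_column": "cycle",
    "common_max_pattern_length": "",
    "common_missing_x_adjust": "none",
    "inference_adaptive_threshold": "",
    "inference_update_model": "true",
}
print(check_hyperparameters(sample))
```

An empty list means none of these checks failed; it does not guarantee that the algorithm accepts the values.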

In [12]:
# SHAPE
# hyperparameters = {}

########################################CHANGE####################################################
# Define hyperparameters
hyperparameters = {'common_x_columns': 'data',
                   'common_time_column': 'time_fake',
                   'common_index_columns': '',
                   'common_pattern_column': 'cycle',
                   'common_max_pattern_length':'',
                   'common_missing_x_adjust':'none',
                   'inference_adaptive_threshold':'',
                   'inference_update_model':'true'}
##################################################################################################

<font color='red'>For Seller to update: Update appropriate values in estimator definition and ensure that fit call works as expected.</font>

For information on creating an `Estimator` object, see [documentation](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)

In [13]:
training_data

's3://sagemaker-us-east-2-438613450817/TRAINING_INPUT_DIR/real_ucr_1sddb40_train.csv'

In [14]:
########################################CHANGE####################################################
# Create an estimator object for running a training job
estimator = sagemaker.algorithm.AlgorithmEstimator(
    algorithm_arn=algo_arn,
    base_job_name="aia-contextual-anomaly-detection",
    role=role,
    instance_count=1,
    instance_type='ml.c5.xlarge',
    input_mode="File",
    output_path=output_location,
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,  # pass the dictionary defined in the previous cell
)
##################################################################################################

# Run the training job.
estimator.fit({'training': training_data})

2023-03-30 01:26:20 Starting - Starting the training job...
2023-03-30 01:26:44 Starting - Preparing the instances for trainingProfilerReport-1680139580: InProgress
...
2023-03-30 01:27:24 Downloading - Downloading input data...
2023-03-30 01:27:45 Training - Downloading the training image.........
2023-03-30 01:29:30 Training - Training image download completed. Training in progress..
json_path:  /opt/ml/input/config/hyperparameters.json
yaml_path:  /opt/program/framework/configure/cad.train.workflow.yaml
current mode :  train
##############Train Hyperparameters overload complete##############
json_path:  /opt/ml/input/config/hyperparameters.json
yaml_path:  /opt/program/framework/configure/cad.inference.workflow.yaml
current mode :  inference
##############Inference Hyperparameters overload complete##############
Train, Inference hyperparams overriden!
aip yaml file is replaced with aip_train yaml fi

See this [blog-post](https://aws.amazon.com/blogs/machine-learning/easily-monitor-and-visualize-metrics-while-training-models-on-amazon-sagemaker/) for more information on how to visualize metrics during training. You can also open the training job from the [Amazon SageMaker console](https://console.aws.amazon.com/sagemaker/home?#/jobs/) and monitor the metrics/logs in the **Monitor** section.
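
Job status can also be polled from code, which is useful if the training job was started without waiting for completion. The sketch below is an assumption-laden illustration: `wait_for_training_job` is a hypothetical helper that expects a `boto3` SageMaker client (or any object with a compatible `describe_training_job` method) and relies on the `TrainingJobStatus` and `SecondaryStatus` fields of that call's response.

```python
import time

# Statuses after which a training job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll describe_training_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        secondary = description.get("SecondaryStatus", "n/a")
        print(f"{job_name}: {status} ({secondary})")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")
```

With the session created earlier, the client could be obtained with `boto_session.client("sagemaker")` and the job name read from `estimator.latest_training_job.name`.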

## 4. Deploy model and verify results

Now you can deploy the model for performing real-time inference.

In [None]:
########################################CHANGE####################################################
model_name = "aia_cad_inference_model"
content_type='text/csv'

# set instance type
real_time_inference_instance_type = 'ml.c5.xlarge'
batch_transform_inference_instance_type = 'ml.c5.xlarge'
##################################################################################################

### 4.1 Deploy trained model

In [None]:
from sagemaker.serializers import CSVSerializer

# deploy model (SageMaker Python SDK v2 uses serializer classes instead of csv_serializer)
predictor = estimator.deploy(initial_instance_count=1, instance_type=real_time_inference_instance_type, serializer=CSVSerializer())

### 4.2 Visualize output

In [None]:
########################################CHANGE####################################################
result = pd.read_csv("inference_result.csv", header=None)
##################################################################################################

# print result
print(result.head())

Once endpoint is created, you can perform real-time inference.

## 5. Perform inference

In this section, you will perform batch inference on the test dataset and then real-time inference against the deployed endpoint.

### 5.1 Perform batch inference

In [None]:
test_dataset

In [None]:
########################################CHANGE####################################################
# upload the batch-transform job input files to S3
transform_input_folder = test_dataset
##################################################################################################

transform_input = sagemaker_session.upload_data(transform_input_folder, key_prefix=model_name)
print("Transform input uploaded to : " + transform_input)

In [None]:
# Run the batch-transform job
transformer = estimator.transformer(instance_count=1, instance_type=batch_transform_inference_instance_type)
transformer.transform(transform_input, content_type=content_type)
transformer.wait()

In [None]:
# output is available on following path
transformer.output_path
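
SageMaker batch transform writes one output object per input object and names it after the input with `.out` appended. Under that assumption, the expected location of the result can be derived as sketched below; `batch_output_uri` is a hypothetical helper and the bucket names in the example are placeholders.

```python
import posixpath
from urllib.parse import urlparse


def batch_output_uri(output_path, input_uri):
    """Return the S3 URI where batch transform is expected to write the result."""
    input_name = posixpath.basename(urlparse(input_uri).path)
    return f"{output_path.rstrip('/')}/{input_name}.out"


expected = batch_output_uri(
    "s3://example-bucket/transform-output",  # placeholder for transformer.output_path
    "s3://example-bucket/aia_cad_inference_model/real_ucr_1sddb40_inf.csv",
)
print(expected)
```

In this notebook the two arguments would be `transformer.output_path` and `transform_input`.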

### 5.2 Visualize output

In [None]:
########################################CHANGE####################################################
result = pd.read_csv("inference_result.csv", header=None)
##################################################################################################

# print result
print(result.head())


### 5.3 Perform real-time inference

In [None]:
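import io

import pandas as pd

# Client for invoking the deployed endpoint
runtime = boto3.client('sagemaker-runtime', region_name=aws_region, aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key)

# Send the test dataset as the request payload
with open(test_dataset, 'rb') as payload_file:
    payload = payload_file.read()

response = runtime.invoke_endpoint(EndpointName=predictor.endpoint_name, ContentType=content_type, Body=payload, Accept=content_type)

# Decode the response and drop NUL characters before parsing it as CSV
result = response['Body'].read().decode('utf-8').replace('\x00', '')
result_df = pd.read_csv(io.StringIO(result))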

In [None]:
result_df

In [None]:
result
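
The decoding step used above (UTF-8 decode, strip NUL characters, parse as CSV) can also be done without pandas. The following sketch uses only the standard library; `parse_csv_response` is a hypothetical helper, and the column names in the example payload are made up because the real response columns depend on the algorithm.

```python
import csv
import io


def parse_csv_response(raw_bytes):
    """Decode an endpoint response body and parse it into a list of row dicts."""
    text = raw_bytes.decode("utf-8").replace("\x00", "")
    return list(csv.DictReader(io.StringIO(text)))


# Made-up payload for illustration only
rows = parse_csv_response(b"time,score\n2023-03-18T08:19:37,0.1\x00\n")
print(rows)
```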

In [None]:
########################################CHANGE####################################################
file_name = test_dataset
output_file_name = "inference_result.csv"
##################################################################################################

!aws sagemaker-runtime invoke-endpoint \
    --endpoint-name $predictor.endpoint_name \
    --body fileb://$file_name \
    --content-type $content_type \
    --region $sagemaker_session.boto_region_name \
    --profile marketplace \
    $output_file_name

### 5.4 Visualize output

In [None]:
########################################CHANGE####################################################
result = pd.read_csv("inference_result.csv", header=None)
##################################################################################################

# print result
print(result.head())

### 5.5 Delete the endpoint

Now that you have successfully performed a real-time inference, you no longer need the endpoint. You can delete it to avoid being charged.

In [None]:
predictor.delete_endpoint(delete_endpoint_config=True)

## 6. Clean-up

### 6.1 Delete the model

In [None]:
predictor.delete_model()

### 6.2 Unsubscribe to the listing (optional)

If you would like to unsubscribe from the algorithm, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust)
2. Locate the listing that you want to cancel the subscription for, and then choose __Cancel Subscription__  to cancel the subscription.

