## Model Monitor: Monitor Data Quality

This example builds on an existing [XGBoost Regression Example](https://github.com/RamVegiraju/SageMaker-Deployment/blob/master/RealTime/Built-In/XGBoost/XGBoost-Abalone.ipynb) to create a real-time endpoint, for which we will enable data capture and model monitoring.

In [None]:
import boto3
import sagemaker
from sagemaker.estimator import Estimator

boto_session = boto3.session.Session()
region = boto_session.region_name

sagemaker_session = sagemaker.Session()
base_job_prefix = 'xgboost-example'
role = sagemaker.get_execution_role()

default_bucket = sagemaker_session.default_bucket()
s3_prefix = base_job_prefix

training_instance_type = 'ml.m5.xlarge'

## Download Data and Prepare Training Input in S3

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train_csv/abalone_dataset1_train.csv .

In [None]:
!aws s3 cp abalone_dataset1_train.csv s3://{default_bucket}/xgboost-regression/train.csv

We also need to download the Abalone dataset with headers, because headers are required for Model Monitoring.

In [None]:
!aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/abalone-with-headers.csv abalone-headers.csv

In [None]:
import pandas as pd
df = pd.read_csv("abalone-headers.csv")
df.head()

In [None]:
columns = list(df.columns)
columns

In [None]:
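# Move the target column (Rings) to the front; Model Monitor treats the first column as the target
df = df[['Rings', 'Sex', 'Length', 'Diameter', 'Height',
         'Whole weight', 'Shucked weight', 'Viscera weight', 'Shell weight']]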
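
The reordering above can also be done without spelling out every column. The following is a minimal sketch, assuming the target column is named `Rings` as in this dataset; `move_target_first` is a hypothetical helper introduced here for illustration and is not part of pandas or SageMaker.

```python
def move_target_first(columns, target):
    """Return the column names with `target` first and the rest in their original order."""
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    return [target] + [name for name in columns if name != target]


# Illustrative column order; in the notebook this would be list(df.columns).
example_columns = ["Sex", "Length", "Diameter", "Rings"]
reordered = move_target_first(example_columns, "Rings")
print(reordered)
```

In the notebook, this could be applied as `df = df[move_target_first(list(df.columns), 'Rings')]`.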

In [None]:
df.head()

In [None]:
df.to_csv("mm-headers.csv", index = False)

In [None]:
!aws s3 cp mm-headers.csv s3://{default_bucket}/xgboost-regression/baseline/mm-headers.csv

In [None]:
from sagemaker.inputs import TrainingInput
training_path = f's3://{default_bucket}/xgboost-regression/train.csv'
headers_path = f's3://{default_bucket}/xgboost-regression/baseline/mm-headers.csv'
train_input = TrainingInput(training_path, content_type="text/csv")

## Retrieve XGBoost Image and Prepare Training Estimator with Hyperparameters

In [None]:
model_path = f's3://{default_bucket}/{s3_prefix}/xgb_model'

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=region,
    version="1.0-1",
    py_version="py3",
    instance_type=training_instance_type,
)

xgb_train = Estimator(
    image_uri=image_uri,
    instance_type=training_instance_type,
    instance_count=1,
    output_path=model_path,
    sagemaker_session=sagemaker_session,
    role=role
)

xgb_train.set_hyperparameters(
    objective="reg:squarederror",  # replaces "reg:linear", a deprecated alias for the same objective
    num_round=50,
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.7,
    silent=0,
)

## Model Training

In [None]:
xgb_train.fit({'train': train_input})

## Retrieve Model Data

In [None]:
model_artifacts = xgb_train.model_data
model_artifacts

## Create SageMaker Client to Create Model, Endpoint Config, and Endpoint

In [None]:
sm_client = boto3.client(service_name='sagemaker')

## Model Creation

In [None]:
from time import gmtime, strftime
model_name = 'xgboost-reg' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Model name: ' + model_name)

reference_container = {
    "Image": image_uri,
    "ModelDataUrl": model_artifacts
}

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer= reference_container)

print("Model Arn: " + create_model_response['ModelArn'])

## Endpoint Config Creation

The main difference in this endpoint configuration is that we enable data capture, which records the input and output of a sampled percentage of the inferences made against our real-time endpoint.

In [None]:
# Sampling percentage. Choose an integer value between 0 and 100
initial_sampling_percentage = 50

# The S3 URI containing the captured data
s3_capture_upload_path = f's3://{default_bucket}/{s3_prefix}/captured-data'

# Capture Input and Output of Inference
capture_modes = ["Input", "Output"]

In [None]:
endpoint_config_name = 'xgboost-config' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
instance_type='ml.c5d.2xlarge'
print('Endpoint config name: ' + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': instance_type,
        'InitialInstanceCount': 1,
        'InitialVariantWeight': 1,
        'ModelName': model_name,
        'VariantName': 'AllTraffic',
        }],
    DataCaptureConfig= {
        'EnableCapture': True,
        'InitialSamplingPercentage' : initial_sampling_percentage,
        'DestinationS3Uri': s3_capture_upload_path,
        'CaptureOptions': [{"CaptureMode" : capture_mode} for capture_mode in capture_modes]
    })

print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])
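
The data capture settings above can be assembled and sanity-checked before the endpoint configuration is created. This is a minimal sketch: `build_data_capture_config` is a hypothetical helper, the bucket name in the example is made up, and the checks cover only the sampling percentage range stated above and the `s3://` prefix.

```python
def build_data_capture_config(destination_s3_uri, sampling_percentage=50,
                              capture_modes=("Input", "Output")):
    """Build a DataCaptureConfig dict shaped like the one passed to create_endpoint_config."""
    if not isinstance(sampling_percentage, int) or not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be an integer between 0 and 100")
    if not destination_s3_uri.startswith("s3://"):
        raise ValueError("destination_s3_uri must start with 's3://'")
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": destination_s3_uri,
        "CaptureOptions": [{"CaptureMode": mode} for mode in capture_modes],
    }


# Illustrative bucket name; the notebook uses default_bucket and s3_prefix instead.
example_config = build_data_capture_config("s3://example-bucket/xgboost-example/captured-data")
print(example_config)
```

The returned dictionary has the same shape as the `DataCaptureConfig` argument used in the cell above.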

## Endpoint Creation

In [None]:
%%time

import time

endpoint_name = 'xgboost-reg' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print('Endpoint name: ' + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Endpoint Status: " + status)

print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
waiter = sm_client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName=endpoint_name)

## Sample Invocation

We send sample requests to the endpoint so that Data Capture has traffic to record. It may take a few minutes for the captured data to appear in S3.

In [None]:
import boto3
smr = boto3.client('sagemaker-runtime')

In [None]:
%%time
# Send 1,000 identical CSV requests so that Data Capture has traffic to sample
payload = b'.345,0.224414,.131102,0.042329,.279923,-0.110329,-0.099358,0.0'
for _ in range(1000):
    resp = smr.invoke_endpoint(EndpointName=endpoint_name, Body=payload,
                               ContentType='text/csv')
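
The loop above does not inspect the responses. To look at a prediction, the response body can be read and parsed. This is a sketch under assumptions: it assumes the endpoint returns plain CSV text, `parse_csv_predictions` is a hypothetical helper, and an in-memory stream stands in for `resp['Body']` so the example runs without an endpoint.

```python
import io


def parse_csv_predictions(body_bytes):
    """Parse a CSV response body into a list of floats (comma or newline separated)."""
    text = body_bytes.decode("utf-8").strip()
    tokens = text.replace("\n", ",").split(",")
    return [float(token) for token in tokens if token.strip()]


# Stand-in for resp['Body'], which exposes a read() method returning bytes.
# The value 9.87 is made up for illustration.
fake_body = io.BytesIO(b"9.87\n")
predictions = parse_csv_predictions(fake_body.read())
print(predictions)
```

With a live endpoint, the same helper could be called as `parse_csv_predictions(resp['Body'].read())`.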

## Parse Data Capture Results

In [None]:
!pip install jsonlines

In [None]:
# Replace this with the S3 Path found in your Data Capture config in the Endpoint console
!aws s3 cp <replace with your S3 URL > results.jsonl
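
As an alternative to looking up the path in the console, the capture files can be listed under the prefix configured in `s3_capture_upload_path`. The sketch below keeps the filtering logic separate from the AWS call so it can run offline; `capture_uris_from_pages` is a hypothetical helper, and the boto3 usage in the comments assumes valid AWS credentials.

```python
def capture_uris_from_pages(bucket, pages):
    """Collect s3:// URIs of .jsonl objects from list_objects_v2 response pages, sorted by key."""
    uris = []
    for page in pages:
        # Pages for an empty prefix have no "Contents" entry.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith(".jsonl"):
                uris.append(f"s3://{bucket}/{key}")
    return sorted(uris)


# With AWS credentials, the pages would come from boto3 (not executed here):
#   s3 = boto3.client("s3")
#   pages = s3.get_paginator("list_objects_v2").paginate(
#       Bucket=default_bucket, Prefix=f"{s3_prefix}/captured-data/")
#   print(capture_uris_from_pages(default_bucket, pages))
```

Any URI returned this way can be used in place of the placeholder in the `aws s3 cp` command above.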

In [None]:
import jsonlines

with jsonlines.open('results.jsonl') as f:
    for line in f.iter():
        print(line)
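
Printing whole records is hard to read, so it can help to pull out just the request, the response, and the timestamp. This is a minimal sketch using only the standard library. The field names (`captureData`, `endpointInput`, `endpointOutput`, `encoding`, `eventMetadata`, `inferenceTime`) are assumptions about the capture record layout and should be checked against the lines printed above; the sample record and its values are made up.

```python
import base64
import json


def decode_payload(section):
    """Return the captured payload as text, decoding it when marked as BASE64."""
    data = section.get("data", "")
    if section.get("encoding") == "BASE64":
        return base64.b64decode(data).decode("utf-8")
    return data


def summarize_capture_record(line):
    """Extract input, output, and inference time from one JSON Lines capture record."""
    record = json.loads(line)
    capture = record.get("captureData", {})
    return {
        "input": decode_payload(capture.get("endpointInput", {})),
        "output": decode_payload(capture.get("endpointOutput", {})),
        "inference_time": record.get("eventMetadata", {}).get("inferenceTime"),
    }


# Illustrative record only; real records come from results.jsonl.
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {"data": "0.345,0.224414", "encoding": "CSV"},
        "endpointOutput": {"data": base64.b64encode(b"9.87").decode("ascii"),
                           "encoding": "BASE64"},
    },
    "eventMetadata": {"eventId": "example-id", "inferenceTime": "2024-01-01T00:00:00Z"},
})
summary = summarize_capture_record(sample_line)
print(summary)
```

Because `jsonlines` already yields parsed dictionaries, the same extraction could be applied to each `line` in the loop above by skipping the `json.loads` step.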

## Create A Baseline Dataset

For this we will utilize the Abalone dataset with headers. Calling `suggest_baseline` creates a Processing Job that computes baseline statistics and suggests constraints for the dataset.

In [None]:
baseline_results_uri = f's3://{default_bucket}/{s3_prefix}/baseline-results-xgboost'

In [None]:
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_default_monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Pass in the dataset with headers; note that the target column is the first column for Model Monitor
my_default_monitor.suggest_baseline(
    baseline_dataset=headers_path,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_uri,
    wait=True
)
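
Once the baselining job finishes, its statistics can be condensed into one row per feature for review. The sketch below works on a plain dictionary so that it runs offline. The layout (`features`, `name`, `inferred_type`, `numerical_statistics`) is an assumption about the statistics file written to `baseline_results_uri`, the sample values are made up, and `summarize_statistics` is a hypothetical helper.

```python
def summarize_statistics(statistics):
    """Flatten a baseline statistics dict into one small row per feature."""
    rows = []
    for feature in statistics.get("features", []):
        # Non-numeric features are assumed to have no numerical_statistics entry.
        numeric = feature.get("numerical_statistics", {})
        rows.append({
            "name": feature.get("name"),
            "inferred_type": feature.get("inferred_type"),
            "mean": numeric.get("mean"),
            "min": numeric.get("min"),
            "max": numeric.get("max"),
        })
    return rows


# Illustrative input only. In the notebook, the dict would likely come from
# my_default_monitor.latest_baselining_job.baseline_statistics().body_dict
sample_statistics = {
    "features": [
        {"name": "Rings", "inferred_type": "Integral",
         "numerical_statistics": {"mean": 9.9, "min": 1.0, "max": 29.0}},
        {"name": "Sex", "inferred_type": "String"},
    ]
}
rows = summarize_statistics(sample_statistics)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` for display alongside the other tables in this notebook.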