# Amazon SageMaker Batch Transform: Associate prediction results with their corresponding input records
_**Use SageMaker's XGBoost to train a binary classification model and, for a list of tumors in a batch file, predict whether each is malignant**_

_**It also shows in detail how to use the input/output join and filter features of Batch Transform**_

---



## Background
The purpose of this notebook is to train a model using SageMaker's XGBoost and UCI's breast cancer diagnostic data set, in order to illustrate how to run batch inference and how to use the Batch Transform I/O join feature. UCI's breast cancer diagnostic data set is available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29. The data set is also available on Kaggle at https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The goal is to use this data set to build a predictive model of whether a breast mass image indicates a benign or malignant tumor.





### File Architecture
This notebook creates the following architecture for processing batch transforms. It sets up a batch transformer and an event that passes data to a Lambda function, which then processes the inputs and predictions.


<img src="http://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/batch_a.png?" width="500" height="600">
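The architecture relies on an event rule that routes the transform job's state-change event to the Lambda function. A minimal sketch of such an event pattern follows, assuming the rule is an Amazon EventBridge (CloudWatch Events) rule. The `source` and `detail-type` values match the sample event used later in this notebook; filtering on `Completed` is an assumption made for this sketch, and the rule still has to be created and pointed at your function.

```python
import json

# Event pattern for an EventBridge (CloudWatch Events) rule.
# "source" and "detail-type" mirror the sample event used later in this notebook.
# Restricting the rule to "Completed" jobs is an assumption for this sketch.
event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Transform Job State Change"],
    "detail": {"TransformJobStatus": ["Completed"]},
}

# The rule definition expects the pattern as a JSON string.
print(json.dumps(event_pattern, indent=2))
```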




---

## Setup

Let's start by specifying:

* The SageMaker role ARN used to give training and batch transform access to your data. The snippet below uses the same role as your SageMaker notebook instance. Otherwise, specify the full ARN of a role with the AmazonSageMakerFullAccess policy attached.
* The S3 bucket that you want to use for training and storing model objects.

In [343]:
import os
import boto3
import sagemaker

role = sagemaker.get_execution_role()
sess = sagemaker.Session()

bucket=sess.default_bucket()
prefix = 'sagemaker/breast-cancer-prediction-xgboost' # place to upload training files within the bucket

In [344]:
bucket

'sagemaker-us-east-2-783112096916'

---
## Data preparation

Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data
        https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

Let's download the data, save it in the local folder as data.csv, and take a look at it.

In [346]:
import pandas as pd
import numpy as np

#data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header = None)
data = pd.read_csv('https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/datasets_180_408_data.csv')

print(data)
# specify columns extracted from wdbc.names
data.columns = ["id","diagnosis","radius_mean","texture_mean","perimeter_mean","area_mean","smoothness_mean",
                "compactness_mean","concavity_mean","concave points_mean","symmetry_mean","fractal_dimension_mean",
                "radius_se","texture_se","perimeter_se","area_se","smoothness_se","compactness_se","concavity_se",
                "concave points_se","symmetry_se","fractal_dimension_se","radius_worst","texture_worst",
                "perimeter_worst","area_worst","smoothness_worst","compactness_worst","concavity_worst",
                "concave points_worst","symmetry_worst","fractal_dimension_worst", "remove_col"]
data = data.drop(['remove_col'], axis=1)
# save the data
data.to_csv("data.csv", sep=',', index=False)

data.sample(8)

           id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0      842302         M        17.99         10.38          122.80     1001.0   
1      842517         M        20.57         17.77          132.90     1326.0   
2    84300903         M        19.69         21.25          130.00     1203.0   
3    84348301         M        11.42         20.38           77.58      386.1   
4    84358402         M        20.29         14.34          135.10     1297.0   
..        ...       ...          ...           ...             ...        ...   
564    926424         M        21.56         22.39          142.00     1479.0   
565    926682         M        20.13         28.25          131.20     1261.0   
566    926954         M        16.60         28.08          108.30      858.1   
567    927241         M        20.60         29.33          140.10     1265.0   
568     92751         B         7.76         24.54           47.92      181.0   

     smoothness_mean  compa

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
74,8610175,B,12.31,16.52,79.19,470.9,0.09172,0.06829,0.03372,0.02272,...,14.11,23.21,89.71,611.1,0.1176,0.1843,0.1703,0.0866,0.2618,0.07609
324,89511501,B,12.2,15.21,78.01,457.9,0.08673,0.06545,0.01994,0.01692,...,13.75,21.38,91.11,583.1,0.1256,0.1928,0.1167,0.05556,0.2661,0.07961
411,905520,B,11.04,16.83,70.92,373.2,0.1077,0.07804,0.03046,0.0248,...,12.41,26.44,79.93,471.4,0.1369,0.1482,0.1067,0.07431,0.2998,0.07881
386,902975,B,12.21,14.09,78.78,462.0,0.08108,0.07823,0.06839,0.02534,...,13.13,19.29,87.65,529.9,0.1026,0.2431,0.3076,0.0914,0.2677,0.08824
6,844359,M,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,...,22.88,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368
487,913505,M,19.44,18.82,128.1,1167.0,0.1089,0.1448,0.2256,0.1194,...,23.96,30.39,153.9,1740.0,0.1514,0.3725,0.5936,0.206,0.3266,0.09009
168,8712766,M,17.47,24.68,116.1,984.6,0.1049,0.1603,0.2159,0.1043,...,23.14,32.33,155.3,1660.0,0.1376,0.383,0.489,0.1721,0.216,0.093
128,866458,B,15.1,16.39,99.58,674.5,0.115,0.1807,0.1138,0.08534,...,16.11,18.33,105.9,762.6,0.1386,0.2883,0.196,0.1423,0.259,0.07779


#### Key observations:
* The data has 569 observations and 32 columns.
* The first field is the 'id' attribute, which we will drop before batch inference and add back to the final inference output next to the probability of malignancy.
* The second field, 'diagnosis', indicates the actual diagnosis ('M' = Malignant; 'B' = Benign).
* The remaining 30 numeric features are used for training and inference.

Let's replace the M/B diagnosis with a 1/0 boolean value.

In [347]:
data['diagnosis'] = (data['diagnosis'] == 'M').astype(int)  # 1 = malignant, 0 = benign
data.sample(8)

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
19,8510426,0,13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,...,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259
301,892604,0,12.46,19.89,80.43,471.3,0.08451,0.1014,0.0683,0.03099,...,13.46,23.07,88.13,551.3,0.105,0.2158,0.1904,0.07625,0.2685,0.07764
108,86355,1,22.27,19.67,152.8,1509.0,0.1326,0.2768,0.4264,0.1823,...,28.4,28.01,206.8,2360.0,0.1701,0.6997,0.9608,0.291,0.4055,0.09789
89,861598,0,14.64,15.24,95.77,651.9,0.1132,0.1339,0.09966,0.07064,...,16.34,18.24,109.4,803.6,0.1277,0.3089,0.2604,0.1397,0.3151,0.08473
483,912558,0,13.7,17.64,87.76,571.1,0.0995,0.07957,0.04548,0.0316,...,14.96,23.53,95.78,686.5,0.1199,0.1346,0.1742,0.09077,0.2518,0.0696
152,8710441,0,9.731,15.34,63.78,300.2,0.1072,0.1599,0.4108,0.07857,...,11.02,19.49,71.04,380.5,0.1292,0.2772,0.8216,0.1571,0.3108,0.1259
475,911408,0,12.83,15.73,82.89,506.9,0.0904,0.08269,0.05835,0.03078,...,14.09,19.35,93.22,605.8,0.1326,0.261,0.3476,0.09783,0.3006,0.07802
41,855563,1,10.95,21.35,71.9,371.1,0.1227,0.1218,0.1044,0.05669,...,12.84,35.34,87.22,514.0,0.1909,0.2698,0.4023,0.1424,0.2964,0.09606


Let's split the data as follows: roughly 80% for training, 10% for validation, and 10% set aside for our batch inference job. In addition, let's drop the 'id' field from the training and validation sets, since 'id' is not a training feature. For the batch set, however, we keep the 'id' field. We'll filter it out before running inference so that the input features match those of the training set, and ultimately we'll join it with the inference results. We do drop the 'diagnosis' attribute from the batch set, since this is what we are trying to predict.

In [348]:
data

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,1,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,842517,1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,84300903,1,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,84348301,1,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,84358402,1,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,1,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,926682,1,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,926954,1,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,927241,1,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


In [349]:
#data split in three sets, training, validation and batch inference
rand_split = np.random.rand(len(data))
train_list = rand_split < 0.8
val_list = (rand_split >= 0.8) & (rand_split < 0.9)
batch_list = rand_split >= 0.9

data_train = data[train_list].drop(['id'],axis=1)
data_val = data[val_list].drop(['id'],axis=1)
data_batch = data[batch_list].drop(['diagnosis'],axis=1)
data_batch_noID = data_batch.drop(['id'],axis=1)
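The split above draws fresh random numbers on every run, so the three sets change each time the cell is executed. A minimal sketch of a reproducible variant follows, using only the standard library; the function name `split_indices` and the seed value are hypothetical choices for illustration, not part of the original notebook.

```python
import random

def split_indices(n_rows, seed=42, train_frac=0.8, val_frac=0.1):
    """Assign each row index to 'train', 'val' or 'batch' using a seeded RNG."""
    rng = random.Random(seed)  # fixed seed gives the same split on every run
    groups = {"train": [], "val": [], "batch": []}
    for i in range(n_rows):
        r = rng.random()
        if r < train_frac:
            groups["train"].append(i)
        elif r < train_frac + val_frac:
            groups["val"].append(i)
        else:
            groups["batch"].append(i)
    return groups

groups = split_indices(569)
print({name: len(rows) for name, rows in groups.items()})
```

With NumPy, a similar effect can be had by calling `np.random.seed(...)` before `np.random.rand(len(data))`.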

Let's upload those data sets to S3.

In [350]:
train_file = 'train_data.csv'
data_train.to_csv(train_file,index=False,header=False)
sess.upload_data(train_file, key_prefix='{}/train'.format(prefix))

validation_file = 'validation_data.csv'
data_val.to_csv(validation_file,index=False,header=False)
sess.upload_data(validation_file, key_prefix='{}/validation'.format(prefix))

batch_file = 'batch_data.csv'
data_batch.to_csv(batch_file,index=False,header=False)
sess.upload_data(batch_file, key_prefix='{}/batch'.format(prefix))

batch_file_noID = 'batch_data_noID.csv'
data_batch_noID.to_csv(batch_file_noID,index=False,header=False)
sess.upload_data(batch_file_noID, key_prefix='{}/batch'.format(prefix))

's3://sagemaker-us-east-2-783112096916/sagemaker/breast-cancer-prediction-xgboost/batch/batch_data_noID.csv'

In [351]:
# Arize addition
# This was added to track the feature columns.
# XGBoost does not allow a header row in the CSV, so the column names cannot travel with the data file.
# Instead, we save the column names to a separate file that can be read back alongside the data.
features_file = 'features.csv'
data_batch.drop(columns=['id']).columns.to_series().to_csv(features_file,header=['features'], index=False)
features_file = sess.upload_data(features_file, key_prefix='{}/features'.format(prefix))
features_file

's3://sagemaker-us-east-2-783112096916/sagemaker/breast-cancer-prediction-xgboost/features/features.csv'
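To illustrate why the feature names are stored separately, here is a minimal sketch of reading them back and re-attaching them to headerless rows. It uses in-memory strings as stand-ins for the two files (only three columns, with a few values copied from the rows printed earlier); how the actual Lambda function consumes `features.csv` is not shown in this notebook, so treat this as an assumption about the general idea.

```python
import csv
import io

# Stand-ins for the two files: a one-column features file with a 'features'
# header, and a headerless data file. Only three columns, for illustration.
features_csv = "features\nradius_mean\ntexture_mean\nperimeter_mean\n"
batch_csv = "17.99,10.38,122.8\n20.57,17.77,132.9\n"

# Read the feature names, skipping the header row.
feature_rows = list(csv.reader(io.StringIO(features_csv)))
feature_names = [row[0] for row in feature_rows[1:]]

# Re-attach the names to each headerless data row.
records = [
    dict(zip(feature_names, map(float, row)))
    for row in csv.reader(io.StringIO(batch_csv))
]
print(records[0])
```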

---

## Training job and model creation

The cell below uses the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to kick off the training job using both our training set and validation set. Note that the objective is set to 'binary:logistic', which trains a model to output a probability between 0 and 1 (here, the probability of a tumor being malignant).
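Because the model outputs a probability rather than a class, a threshold is needed to turn a score into a malignant/benign label. A minimal sketch follows; the 0.5 cut-off and the helper name `to_label` are assumptions for illustration, and the scores are made up.

```python
def to_label(probability, threshold=0.5):
    """Map a predicted probability of malignancy to 1 (malignant) or 0 (benign)."""
    return 1 if probability >= threshold else 0

# Made-up scores, for illustration only.
scores = [0.03, 0.48, 0.91]
print([to_label(p) for p in scores])
```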

In [352]:
%%time
from time import gmtime, strftime
from sagemaker.amazon.amazon_estimator import get_image_uri


job_name = 'xgb-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
output_location = 's3://{}/{}/output/{}'.format(bucket, prefix, job_name)
image = get_image_uri(boto3.Session().region_name, 'xgboost')

sm_estimator = sagemaker.estimator.Estimator(image,
                                             role,
                                             train_instance_count=1,
                                             train_instance_type='ml.m5.4xlarge',
                                             train_volume_size=50,
                                             input_mode='File',
                                             output_path=output_location,
                                             sagemaker_session=sess)

sm_estimator.set_hyperparameters(objective="binary:logistic",
                                 max_depth=5,
                                 eta=0.2,
                                 gamma=4,
                                 min_child_weight=6,
                                 subsample=0.8,
                                 silent=0,
                                 num_round=100)

train_data = sagemaker.session.s3_input('s3://{}/{}/train'.format(bucket, prefix), distribution='FullyReplicated',
                                        content_type='text/csv', s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input('s3://{}/{}/validation'.format(bucket, prefix), distribution='FullyReplicated',
                                             content_type='text/csv', s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}


# Start training by calling the fit method in the estimator
sm_estimator.fit(inputs=data_channels, logs=True)

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
There is a more up to date SageMaker XGBoost image. To use the newer image, please set 'repo_version'='1.0-1'. For example:
	get_image_uri(region, 'xgboost', '1.0-1').
Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.
's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-07-29 22:08:38 Starting - Starting the training job...
2020-07-29 22:08:39 Starting - Launching requested ML instances......
2020-07-29 22:09:44 Starting - Preparing the instances for training...
2020-07-29 22:10:24 Downloading - Downloading input data...
2020-07-29 22:11:05 Training - Training image download completed. Training in progress.
2020-07-29 22:11:05 Uploading - Uploading generated training model
2020-07-29 22:11:05 Completed - Training job completed
Arguments: train
[2020-07-29:22:10:53:INFO] Running standalone xgboost training.
[2020-07-29:22:10:53:INFO] File size need to be processed in the node: 0.13mb. Available memory size in the node: 55495.22mb
[2020-07-29:22:10:53:INFO] Determined delimiter of CSV input is ','
[22:10:53] S3DistributionType set as FullyReplicated
[22:10:53] 455x30 matrix with 13650 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-07-29:22:10:53:INF

---

## Batch Transform

SageMaker Batch Transform provides three attributes for processing input and output data: __input_filter__, __join_source__ and __output_filter__. Please refer to [this page](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform-data-processing.html) to learn more about how to use them.




In [369]:
# Set up environment variables for Batch Transform
input_location = 's3://{}/{}/batch/{}'.format(bucket, prefix, batch_file_noID) # use input data without ID column
# Arize Batch Transform environment variables
model_name = 'sage_batch_test'
# If left as an empty string, the model version is just the date of the batch
model_version = 'test_v_1.0'
batch_id =  "batch_10" # Should be set to a different value for every batch
# These are added so that the event fired when the transform job finishes has access to them
env_var = {'ArizeMonitor':'1','prefix':prefix, 'batch_file':batch_file_noID, 'bucket':bucket,
          'features_file':features_file, 'input_location':input_location,
          'model_name':model_name, 'model_version' : model_version,
          'batch_id':batch_id} # Batch ID is used to build prediction IDs and match them back to actuals
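Since the batch ID has to differ for every batch, one option is to derive it from the current time rather than hard-coding it. A minimal sketch follows, reusing the timestamp format from the training job name; the helper name `make_batch_id` is hypothetical, and two batches started within the same second would still collide.

```python
from time import gmtime, strftime

def make_batch_id(prefix="batch"):
    """Build a batch ID from the current UTC time (one-second resolution)."""
    return "{}_{}".format(prefix, strftime("%Y-%m-%d-%H-%M-%S", gmtime()))

print(make_batch_id())
```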

In [370]:
import json
import io
from urllib.parse import urlparse

def get_csv_output_from_s3(s3uri, file_name):
    parsed_url = urlparse(s3uri)
    bucket_name = parsed_url.netloc
    prefix = parsed_url.path[1:]
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, '{}/{}'.format(prefix, file_name))
    return obj.get()["Body"].read().decode('utf-8')
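The helper above derives the bucket and object key from an S3 URI. A minimal sketch of just that parsing step follows, runnable without AWS access; `split_s3_uri` and the example bucket and file names are hypothetical. Unlike the helper above, it also tolerates a trailing slash or an empty prefix.

```python
from urllib.parse import urlparse

def split_s3_uri(s3uri, file_name):
    """Return (bucket, key) for file_name under the given S3 URI prefix."""
    parsed = urlparse(s3uri)
    prefix = parsed.path.strip("/")  # drop leading and trailing slashes
    key = "{}/{}".format(prefix, file_name) if prefix else file_name
    return parsed.netloc, key

# Hypothetical bucket, prefix and file name.
print(split_s3_uri("s3://example-bucket/some/prefix", "example.csv.out"))
```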

### Creating a Lambda ZIP Package
To create a Lambda function, you need to zip up all the required libraries and place the archive on S3. The code below copies the libraries required by Arize and those needed by the Lambda function. The build is done on an Amazon server so that it picks up the right NumPy binaries. The ZIP won't work correctly if built on a Mac, because NumPy built for macOS differs from NumPy built for AWS.

* Needs the same Python version (3.6) as the environment where the following packages are built
* Needs access to the S3 directory resource
* Needs a timeout of at least 10-15 seconds (the Lambda default is 3 seconds)

##### Lambda File
The Lambda function is in the file lambda_function.py.
You must set the API_KEY and SPACE_KEY in that file, which is then pushed to Lambda.



In [372]:
!mkdir lambda_package

mkdir: cannot create directory ‘lambda_package’: File exists


In [373]:
!mkdir lambda_pkg
!pip install arize -t ./lambda_pkg/
!pip install boto3 -t ./lambda_pkg/
!pip install s3fs -t ./lambda_pkg/
# datetime and concurrent.futures are part of the Python 3 standard library, so they need no pip install
!cp lambda_function.py ./lambda_pkg/lambda_function.py
%cd lambda_pkg
!zip -r ../myArizeDeploymentPackage.zip *
%cd ..

mkdir: cannot create directory ‘lambda_pkg’: File exists
Collecting arize
  Using cached arize-0.0.20-py2.py3-none-any.whl (16 kB)
Collecting numpy==1.18.4
  Using cached numpy-1.18.4-cp36-cp36m-manylinux1_x86_64.whl (20.2 MB)
Processing /home/ec2-user/.cache/pip/wheels/35/8d/af/a922cb18800b31fadac3523cadf6c1efdf233b788fe7a4da70/googleapis_common_protos-1.51.0-py3-none-any.whl
Collecting pandas==1.0.3
  Using cached pandas-1.0.3-cp36-cp36m-manylinux1_x86_64.whl (10.0 MB)
Collecting protobuf==3.11.3
  Using cached protobuf-3.11.3-cp36-cp36m-manylinux1_x86_64.whl (1.3 MB)
Processing /home/ec2-user/.cache/pip/wheels/15/03/c1/78bc17e91a6f740565af018749431a5c35e62ee93a32824344/requests_futures-1.0.0-py3-none-any.whl
Collecting python-dateutil>=2.6.1
  Using cached python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Collecting pytz>=2017.2
  Using cached pytz-2020.1-py2.py3-none-any.whl (510 kB)
Collecting six>=1.9
  Using cached six-1.15.0-py2.py3-none-any.whl (10 kB)
Collecting setuptools


In [374]:
!cp lambda_function.py ./lambda_pkg/lambda_function.py
%cd lambda_pkg
!zip -r ../myArizeDeploymentPackage.zip *
%cd ..

/home/ec2-user/SageMaker/lambda_pkg
updating: arize/ (stored 0%)
updating: arize/bounded_executor.py (deflated 60%)
updating: arize/public_pb2.py (deflated 88%)
updating: arize/__pycache__/ (stored 0%)
updating: arize/__pycache__/__init__.cpython-36.pyc (deflated 12%)
updating: arize/__pycache__/protocol_pb2.cpython-36.pyc (deflated 66%)
updating: arize/__pycache__/public_pb2.cpython-36.pyc (deflated 64%)
updating: arize/__pycache__/bounded_executor.cpython-36.pyc (deflated 43%)
updating: arize/__pycache__/validation_helper.cpython-36.pyc (deflated 57%)
updating: arize/__pycache__/input_transformer.cpython-36.pyc (deflated 50%)
updating: arize/__pycache__/api.cpython-36.pyc (deflated 55%)
updating: arize/input_transformer.py (deflated 71%)
updating: arize/api.py (deflated 78%)
updating: arize/__init__.py (stored 0%)
updating: arize/validation_helper.py (deflated 82%)
updating: arize/protocol_pb2.py (deflated 91%)
updating: arize-0.0.20.dist-info/ (stored 0%)
updating: arize-0.0.20.dist

In [375]:
batch_input_loc =sess.upload_data(path='myArizeDeploymentPackage.zip', key_prefix='{}/deploy'.format(prefix))

#### Copy the S3 location of the package for the Lambda configuration

In [376]:
print(batch_input_loc)

s3://sagemaker-us-east-2-783112096916/sagemaker/breast-cancer-prediction-xgboost/deploy/myArizeDeploymentPackage.zip


### Configuring A Lambda
The Lambda function needs the following configuration.

Create a Lambda function whose Python runtime matches the Python version used to build the ZIP package. Because the package is built on this server, the cell below prints the local version. In this example it is Python 3.6.

<img src="https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/p2.png?" width="500" height="600">

In [377]:
#Print out Python Version of local server
import sys
sys.version

'3.6.10 |Anaconda, Inc.| (default, Mar 25 2020, 23:51:54) \n[GCC 7.3.0]'

##### Create a role to access S3
The Lambda function needs to read files from S3. Create a role that grants S3 read access, either to the specific bucket or general S3 read-only access.

<img src="https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/p3.png" width="500" height="600">

Example S3 read-only role:

<img src="https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/p4.png" width="500" height="600">

The function is created:


<img src="https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/p5.png" width="500" height="600">


Set the function code to the ZIP package containing lambda_function.py that we uploaded to S3:

<img src="https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/p6.png" width="500" height="600">

<img src="https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/p7.png" width="500" height="600">

In [378]:
#Insert this into S3 URL Link above
print(batch_input_loc)

s3://sagemaker-us-east-2-783112096916/sagemaker/breast-cancer-prediction-xgboost/deploy/myArizeDeploymentPackage.zip


#### Raise the function timeout from the default 3 seconds to 10 minutes
The actual runtime is typically short, but if there is a lot of data it's good to set the timeout to 10 minutes.

<img src="https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/p8.png" width="500" height="600">

##### Batch Event Test
We create JSON that can be pasted in as a test event in the Lambda test configuration once the Lambda function is uploaded. It represents the event emitted when the batch transform job finishes.

In [379]:
test_json = {
  "version": "0",
  "id": "844e2571-85d4-695f-b930-0153b71dcb42",
  "detail-type": "SageMaker Transform Job State Change",
  "source": "aws.sagemaker",
  "account": "123456789012",
  "time": "2018-10-06T12:26:13Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:sagemaker:us-east-1:123456789012:transform-job/myjob"
  ],
  "detail": {
    "TransformJobName": "4b52bd8f-e034-4345-818d-884bdd7c9724",
    "TransformJobArn": "arn:aws:sagemaker:us-east-1:123456789012:transform-job/myjob",
    "TransformJobStatus": "Completed",
    "FailureReason": "failed why 1",
    "ModelName": "i am a beautiful model",
    "MaxConcurrentTransforms": 5,
    "MaxPayloadInMB": 10,
    "BatchStrategy": "Strategizing...",
    "Environment": {
      "bucket": bucket,
      "input_location": input_location,
      "model_name": model_name,
      "model_version":model_version,
      "batch_id": batch_id,
      "prefix": prefix,
      "ArizeMonitor": "1",
      "features_file": features_file,
      "batch_file": batch_file_noID
    },
    "TransformInput": {
      "DataSource": {
        "S3DataSource": {
          "S3DataType": "s3DataType",
          "S3Uri": "s3Uri"
        }
      },
      "ContentType": "content type",
      "CompressionType": "compression type",
      "SplitType": "split type"
    },
    "TransformOutput": {
      "S3OutputPath": "s3://sagemaker-us-east-2-783112096916/xgboost-2020-07-28-05-16-52-914",
      "Accept": "accept",
      "AssembleWith": "assemblyType",
      "KmsKeyId": "kmsKeyId"
    },
    "TransformResources": {
      "InstanceType": "instanceType",
      "InstanceCount": 3
    },
    "CreationTime": "2018-10-06T12:26:13Z",
    "TransformStartTime": "2018-10-06T12:26:13Z",
    "TransformEndTime": "2018-10-06T12:26:13Z",
    "Tags": {}
  }
}

#### Copy Paste below into Lambda Test Config

The below json should be copied into the test config when creating the Lambda function.


In [382]:
import json
print(json.dumps(test_json))

{"version": "0", "id": "844e2571-85d4-695f-b930-0153b71dcb42", "detail-type": "SageMaker Transform Job State Change", "source": "aws.sagemaker", "account": "123456789012", "time": "2018-10-06T12:26:13Z", "region": "us-east-1", "resources": ["arn:aws:sagemaker:us-east-1:123456789012:transform-job/myjob"], "detail": {"TransformJobName": "4b52bd8f-e034-4345-818d-884bdd7c9724", "TransformJobArn": "arn:aws:sagemaker:us-east-1:123456789012:transform-job/myjob", "TransformJobStatus": "Completed", "FailureReason": "failed why 1", "ModelName": "i am a beautiful model", "MaxConcurrentTransforms": 5, "MaxPayloadInMB": 10, "BatchStrategy": "Strategizing...", "Environment": {"bucket": "sagemaker-us-east-2-783112096916", "input_location": "s3://sagemaker-us-east-2-783112096916/sagemaker/breast-cancer-prediction-xgboost/batch/batch_data_noID.csv", "model_name": "sage_batch_test", "model_version": "test_v_1.0", "batch_id": "batch_10", "prefix": "sagemaker/breast-cancer-prediction-xgboost", "ArizeMonit

Copy the above test JSON into the Lambda test configuration. Click the JSON formatting button to clean it up. This allows you to test your Lambda function without having the transformer fire the event.

<img src="https://storage.googleapis.com/arize-assets/tutorials/sagemaker/batch/p10.png" width="500" height="600">

### Run Transformer
This runs the transformer set up above. When the transform job finishes, it emits an event that the Lambda function watches for. The event invokes the Lambda function with a payload similar to the test JSON above.
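To make the event flow concrete, here is a minimal sketch of a handler that reads the job status and the `Environment` map from an event shaped like the test JSON above. It is a hypothetical illustration, not the contents of lambda_function.py (which is not shown in this notebook); in particular, treating `ArizeMonitor == '1'` as an on/off switch is an assumption.

```python
def lambda_handler(event, context):
    """Hypothetical handler: extract status and Environment values from the event."""
    detail = event.get("detail", {})
    if detail.get("TransformJobStatus") != "Completed":
        return {"processed": False, "reason": "job not completed"}
    env = detail.get("Environment", {})
    if env.get("ArizeMonitor") != "1":  # assumed to act as an on/off switch
        return {"processed": False, "reason": "monitoring not enabled"}
    return {
        "processed": True,
        "batch_id": env.get("batch_id"),
        "model_name": env.get("model_name"),
        "output_path": detail.get("TransformOutput", {}).get("S3OutputPath"),
    }

# Trimmed-down event with the same shape as the test JSON above.
sample_event = {
    "detail": {
        "TransformJobStatus": "Completed",
        "Environment": {
            "ArizeMonitor": "1",
            "batch_id": "batch_10",
            "model_name": "sage_batch_test",
        },
        "TransformOutput": {"S3OutputPath": "s3://example-bucket/example-output"},
    }
}
print(lambda_handler(sample_event, None))
```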

####  Join the input and the prediction results
Now, let's associate the prediction results with their corresponding input records. We can also use the __input_filter__ to exclude the ID column easily and there's no need to have a separate file in S3.

* Set __input_filter__ to "$[1:]": indicates that we are excluding column 0 (the 'ID') before processing the inferences and keeping everything from column 1 to the last column (all the features or predictors)


* Set __join_source__ to "Input": indicates our desire to join the input data with the inference results

* Leave __output_filter__ at its default ('$'), indicating that the joined input and inference results will be saved as output.
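The effect of these settings on a single CSV record can be emulated locally. The sketch below is a plain-Python illustration of the behavior described above, not SageMaker code: the helper names, the placeholder model, and the truncated feature values are all assumptions.

```python
def apply_input_filter(row):
    """Emulate input_filter='$[1:]': drop column 0, keep the rest."""
    return row[1:]

def join_with_input(row, prediction):
    """Emulate join_source='Input' for CSV: append the prediction to the original row."""
    return row + [prediction]

def fake_model(features):
    """Placeholder for the real model; returns a made-up score."""
    return 0.5

# One input record: the ID followed by a few illustrative feature values.
input_row = ["842302", "17.99", "10.38", "122.80"]

model_input = apply_input_filter(input_row)  # what the model receives
prediction = fake_model(model_input)
output_row = join_with_input(input_row, str(prediction))  # what is written as output

print(",".join(model_input))
print(",".join(output_row))
```

Note that the joined output keeps the ID column, because the join uses the original record rather than the filtered one.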

In [383]:
%%time


sm_transformer = sm_estimator.transformer(1, 'ml.m4.xlarge')
sm_transformer.accept = 'text/csv'
sm_transformer.assemble_with = 'Line'
sm_transformer.env = env_var
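# start a transform job
# NOTE: input_filter='$[1:]' drops column 0 of every input record. That is only correct when the
# input file still has the ID in column 0 (batch_data.csv). input_location as set earlier points to
# batch_data_noID.csv, where column 0 is a feature, so either switch the input file or drop input_filter.
sm_transformer.transform(input_location, split_type='Line',content_type='text/csv', input_filter='$[1:]', join_source='Input')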
sm_transformer.wait()

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


......................Arguments: serve
[2020-07-29 22:27:34 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-07-29 22:27:34 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-07-29 22:27:34 +0000] [1] [INFO] Using worker: gevent
[2020-07-29 22:27:34 +0000] [37] [INFO] Booting worker with pid: 37
[2020-07-29 22:27:34 +0000] [38] [INFO] Booting worker with pid: 38
[2020-07-29:22:27:34:INFO] Model loaded successfully for worker : 37
[2020-07-29 22:27:34 +0000] [39] [INFO] Booting worker with pid: 39
[2020-07-29:22:27:34:INFO] Model loaded successfully for worker : 38
[2020-07-29 22:27:34 +0000] [40] [INFO] Booting worker with pid: 40
[2020-07-29:22:27:34:INFO] Model loaded successfully for worker : 39
[2020-07-29:22:27:35:INFO] Model loaded successfully for worker : 40
[2020-07-29:22:27:42:INFO] Sniff delimiter as ','
[2020-07-29:22:27:42:INFO] Determined de

#### Actuals
The example below shows how to link actuals back to the batch job. We combine the batch ID with each prediction's position in the batch to create a prediction ID.
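A minimal sketch of the ID scheme follows, mirroring the `str(x) + '_' + batch_id` format used in the cell below. The helper name `make_prediction_ids` is hypothetical, and the sketch assumes the Lambda function builds its prediction IDs the same way so that actuals and predictions line up.

```python
def make_prediction_ids(n_rows, batch_id):
    """Prediction ID = row position within the batch + '_' + batch ID."""
    return ["{}_{}".format(i, batch_id) for i in range(n_rows)]

print(make_prediction_ids(3, "batch_10"))
```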

In [384]:
actuals_df = data[batch_list]['diagnosis'].reset_index()
actuals_df

Unnamed: 0,index,diagnosis
0,0,1
1,12,1
2,14,1
3,16,1
4,17,1
5,20,0
6,23,1
7,25,1
8,29,1
9,54,1


In [385]:
from arize.api import Client
from arize.utils.types import ModelTypes
#SPACE KEY - SUPPLIED BY ARIZE
space_key = 'SPACE KEY'
#API KEY - GENERATED IN ARIZE ACCOUNT OR SUPPLIED
api_key = 'API KEY'

arize_client = Client(space_key=space_key, api_key=api_key)

In [386]:
actuals_df = data[batch_list]['diagnosis'].reset_index()
ids = pd.DataFrame([str(x) + '_' + batch_id for x in actuals_df.index])
tfuture = arize_client.log_bulk_actuals(model_id=model_name, model_type=ModelTypes.CATEGORICAL, prediction_ids=ids, actual_labels=actuals_df)

In [387]:
import concurrent.futures as cf
for response in cf.as_completed(tfuture):
  res = response.result()
  print(f'future completed with response code {res.status_code}, {res.text}')

<Response [200]>

In [None]:
print('Finished Notebook')

### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

