# Amazon SageMaker Workshop
## _**Batch Transform Deployment**_

---

In this part of the workshop we will use the model created in the previous lab to run a batch transform job, generating asynchronous (offline) inferences to predict mobile customer departure.

Batch transform uses the same mechanics as real-time hosting to generate predictions. However, unlike real-time hosted endpoints which have persistent hardware (instances stay running until you shut them down), batch transform clusters are torn down when the job completes.

---

## Contents

1. [Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html)
  * Set up an asynchronous batch job to get predictions from your model
  
---

## Background

In the previous labs, [Modeling](../../2-Modeling/modeling.ipynb) and [Evaluation](../../3-Evaluation/evaluation.ipynb), we trained multiple models with multiple SageMaker training jobs and evaluated them.

Let's import the libraries for this lab:


In [6]:
# Suppress default INFO logging
import logging
logger = logging.getLogger()
logger.setLevel(logging.ERROR)

In [8]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.6.1-py3-none-manylinux2014_x86_64.whl (192.9 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.6.1

In [9]:
import os
import time
import json
import tarfile
from time import strftime, gmtime

import boto3
import pandas as pd
import numpy as np
import pickle
import xgboost

import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer
from sagemaker.s3 import S3Uploader, S3Downloader

from sklearn import metrics

In [10]:
sess = boto3.Session()
sm = sess.client('sagemaker')
role = sagemaker.get_execution_role()

In [11]:
%store -r bucket
%store -r prefix
%store -r region
%store -r docker_image_name
%store -r framework_version
%store -r s3uri_test

In [12]:
bucket, prefix, region, docker_image_name, framework_version, s3uri_test

('sagemaker-studio-us-east-1-924155096146',
 'xgboost-churn',
 'us-east-1',
 '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.3-1',
 '1.3-1',
 's3://sagemaker-studio-us-east-1-924155096146/xgboost-churn/data/test/test.csv')

---
### - If you _**skipped**_ the `2-Modeling/` lab, follow these instructions:

   - **run this:**

In [13]:
# # Uncomment if you have not done Lab 2-Modeling

#from config.solution_lab2 import get_estimator_from_lab2
#xgb = get_estimator_from_lab2(docker_image_name, framework_version)

---
### - If you _**have done**_ the `2-Modeling/` lab, follow these instructions:

   - **run this:**

In [14]:
# Run this cell if you've done Lab 2-Modeling (comment it out otherwise)

%store -r training_job_name
xgb = sagemaker.estimator.Estimator.attach(training_job_name)


2022-08-17 14:10:04 Starting - Preparing the instances for training
2022-08-17 14:10:04 Downloading - Downloading input data
2022-08-17 14:10:04 Training - Training image download completed. Training in progress.
2022-08-17 14:10:04 Uploading - Uploading generated training model
2022-08-17 14:10:04 Completed - Training job completed


---
## Batch Prediction

Batch Transform manages all necessary compute resources, including launching instances to deploy endpoints and deleting them afterward.

#### Download Test Dataset and Model

In [15]:
S3Downloader.download(xgb.model_data, ".")
S3Downloader.download(s3uri_test, ".")

#### Visualizing Test Data

In [16]:
test_path = "test.csv"
df = pd.read_csv(test_path, header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,60,61,62,63,64,65,66,67,68,69
0,0,186,0,137.8,97,187.7,118,146.4,85,8.7,...,0,0,0,0,0,1,1,0,1,0
1,0,132,25,113.2,96,269.9,107,229.1,87,7.1,...,0,0,0,0,1,0,1,0,0,1
2,0,112,17,183.2,95,252.8,125,156.7,95,9.7,...,0,0,0,0,1,0,1,0,0,1
3,0,91,24,93.5,112,183.4,128,240.7,133,9.9,...,0,0,0,0,0,1,0,1,0,1
4,0,22,0,110.3,107,166.5,93,202.3,96,9.5,...,0,0,0,1,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
329,0,59,0,166.3,95,239.3,87,123.2,108,10.0,...,0,0,0,0,0,1,1,0,1,0
330,0,127,14,143.2,99,169.9,91,221.6,77,11.6,...,0,0,0,0,1,0,1,0,0,1
331,0,86,0,166.2,112,255.3,81,228.1,97,5.4,...,0,0,0,0,1,0,1,0,1,0
332,0,36,43,29.9,123,129.1,117,325.9,105,8.6,...,0,0,0,1,0,0,1,0,0,1


* `batch_input`: the batch input dataset used for prediction (the test dataset). It must not contain the target column and must be stored in an S3 bucket.
* `batch_output`: the S3 path where the batch output will be written.

In [17]:
test_true_y = df.iloc[:,0] # get target column
test_true_y.to_frame()
test_data_batch = df.iloc[:, 1:] # delete the target column
test_data_batch.to_csv('test_batch.csv', header=False, index=False)
test_data_batch

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,60,61,62,63,64,65,66,67,68,69
0,186,0,137.8,97,187.7,118,146.4,85,8.7,6,...,0,0,0,0,0,1,1,0,1,0
1,132,25,113.2,96,269.9,107,229.1,87,7.1,7,...,0,0,0,0,1,0,1,0,0,1
2,112,17,183.2,95,252.8,125,156.7,95,9.7,3,...,0,0,0,0,1,0,1,0,0,1
3,91,24,93.5,112,183.4,128,240.7,133,9.9,3,...,0,0,0,0,0,1,0,1,0,1
4,22,0,110.3,107,166.5,93,202.3,96,9.5,5,...,0,0,0,1,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
329,59,0,166.3,95,239.3,87,123.2,108,10.0,3,...,0,0,0,0,0,1,1,0,1,0
330,127,14,143.2,99,169.9,91,221.6,77,11.6,1,...,0,0,0,0,1,0,1,0,0,1
331,86,0,166.2,112,255.3,81,228.1,97,5.4,7,...,0,0,0,0,1,0,1,0,1,0
332,36,43,29.9,123,129.1,117,325.9,105,8.6,6,...,0,0,0,1,0,0,1,0,0,1
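The same label/feature split can be sanity-checked without pandas. The sketch below is illustrative only: `split_label_column` is a hypothetical helper, not part of the workshop code, and it assumes the label is the first field of every row and that the file has no header row, as in `test.csv` above.

```python
import csv
import io


def split_label_column(csv_text):
    """Split header-less CSV text into (labels, feature_csv_text).

    Assumes the label is the first field of every row.
    """
    labels = []
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:  # skip blank lines
            continue
        labels.append(row[0])
        writer.writerow(row[1:])  # everything except the label
    return labels, out.getvalue()


# Tiny made-up sample, not real workshop data
sample = "0,186,0,137.8\n1,132,25,113.2\n"
labels, features = split_label_column(sample)
print(labels)
print(features)
```

The feature text has one fewer column than the input and still has no header, which is the shape the batch input file needs.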


#### Upload to S3

In [18]:
# upload to S3 (S3 keys always use forward slashes, so build the key explicitly instead of with os.path.join)
boto3.Session().resource('s3').Bucket(bucket).Object(f'{prefix}/batch/test_batch.csv').upload_file('test_batch.csv')

In [19]:
s3_batch_input = 's3://{}/{}/batch/test_batch.csv'.format(bucket,prefix) # test data used for prediction
s3_batch_output = 's3://{}/{}/batch/batch-inference'.format(bucket, prefix) # specify the location of batch output

In [20]:
s3_batch_input, s3_batch_output

('s3://sagemaker-studio-us-east-1-924155096146/xgboost-churn/batch/test_batch.csv',
 's3://sagemaker-studio-us-east-1-924155096146/xgboost-churn/batch/batch-inference')

#### Load the Pickled Model

In [21]:
model_path = "model.tar.gz"
with tarfile.open(model_path) as tar:
    tar.extractall(path=".")

print("Loading xgboost model.")
# use a context manager so the file handle is closed after loading
with open("xgboost-model", "rb") as f:
    model = pickle.load(f)
model

Loading xgboost model.
  If you are loading a serialized model (like pickle in Python, RDS in R) generated by
  older XGBoost, please export the model by calling `Booster.save_model` from that version
  first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html

  for more details about differences between saving model and serializing.




<xgboost.core.Booster at 0x7f2ccd332520>

#### Testing the model locally on a random sample

In [22]:
print("Some random test data")
x = test_data_batch.sample(1)
print(x)


print("Performing predictions against test data.")

X_test = xgboost.DMatrix(x.values)
predictions_probs = model.predict(X_test)
predictions = predictions_probs.round()

print(predictions)

Some random test data
     1   2      3    4      5   6     7    8    9   10  ...  60  61  62  63  \
68  100   0  235.8  130  176.0  69  63.6  122  7.3   1  ...   0   0   0   0   

    64  65  66  67  68  69  
68   1   0   1   0   1   0  

[1 rows x 69 columns]
Performing predictions against test data.
[0.]
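Rounding the predicted probabilities is effectively a 0.5 cutoff (apart from an exact tie at 0.5, which rounding sends to 0). A minimal sketch of making that cutoff explicit, so it can be tuned if false negatives are costlier than false positives, might look like this. `to_labels` is a hypothetical helper and the probabilities are made up.

```python
def to_labels(probabilities, threshold=0.5):
    """Map churn probabilities to 0/1 labels using an explicit cutoff."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up probabilities for illustration
probs = [0.03, 0.49, 0.51, 0.97]
print(to_labels(probs))                 # default 0.5 cutoff
print(to_labels(probs, threshold=0.3))  # more aggressive churn flagging
```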


### Create Batch job and make batch predictions

As we saw in the **2-Modeling** lab, we added custom inference logic to our script (with `input_fn` and `predict_fn`). So, just by selecting our previous estimator, we can deploy it and run batch inference:

In [None]:
# creates a transformer object from the trained model
transformer = xgb.transformer(
                          instance_count=1,
                          instance_type='ml.m5.large',
                          output_path=s3_batch_output)

# calls that object's transform method to create a transform job
transformer.transform(data=s3_batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')

transformer.wait()

..............................
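The SDK call above roughly corresponds to a low-level `CreateTransformJob` request. The sketch below only builds the request dictionary so the moving parts are visible; it does not call AWS. The job name, model name, and bucket are placeholders, `build_transform_request` is a hypothetical helper, and the SDK may set additional fields beyond these.

```python
def build_transform_request(job_name, model_name, input_uri, output_uri,
                            instance_type="ml.m5.large", instance_count=1):
    """Return keyword arguments shaped for the boto3 SageMaker client's create_transform_job."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "text/csv",  # each record is a CSV row
            "SplitType": "Line",        # one record per line
        },
        "TransformOutput": {"S3OutputPath": output_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


# Placeholder names, not resources created in this lab
request = build_transform_request(
    "example-job",
    "example-model",
    "s3://example-bucket/xgboost-churn/batch/test_batch.csv",
    "s3://example-bucket/xgboost-churn/batch/batch-inference",
)
print(request["TransformResources"])
```

With credentials in place and an existing SageMaker model, a dictionary like this could be passed as `sm.create_transform_job(**request)`, where `sm` is the boto3 SageMaker client created earlier in the notebook.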

### Track Results on SageMaker Experiments
If you open *Experiments and trials* again, and select the "Unassigned trial components", you should see that your SageMaker Transform job executed successfully:

![batch_transform_result.png](./media/batch_transform_result.png)
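The job status can also be checked programmatically. The following is a sketch under stated assumptions: `wait_for_transform_job` is a hypothetical helper that polls `describe_transform_job` on a boto3 SageMaker client, and `FakeClient` is a stand-in so the example runs without AWS access.

```python
import time


def wait_for_transform_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll describe_transform_job until the job leaves the in-progress states."""
    for _ in range(max_polls):
        response = client.describe_transform_job(TransformJobName=job_name)
        status = response["TransformJobStatus"]
        if status not in ("InProgress", "Stopping"):
            return status  # Completed, Failed or Stopped
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")


class FakeClient:
    """Stand-in for the boto3 SageMaker client so the sketch runs offline."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_transform_job(self, TransformJobName):
        return {"TransformJobStatus": next(self._statuses)}


fake = FakeClient(["InProgress", "InProgress", "Completed"])
print(wait_for_transform_job(fake, "example-job", poll_seconds=0))
```

In the notebook, the real `sm` client and the actual transform job name would replace `fake` and `"example-job"`.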

#### Download Batch result from S3

In [None]:
batch_output = 's3://{}/{}/batch/batch-inference/test_batch.csv.out'.format(bucket,prefix)
S3Downloader.download(batch_output, ".")
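Batch Transform names each result after its input object and appends `.out`, which is why the download path ends in `test_batch.csv.out`. A small sketch of deriving that path instead of hard-coding it is shown below; `expected_output_uri` is a hypothetical helper, the bucket name is a placeholder, and it assumes a single input object.

```python
import posixpath


def expected_output_uri(input_uri, output_prefix):
    """Return the S3 URI where the result for one input object should land."""
    file_name = posixpath.basename(input_uri)  # e.g. test_batch.csv
    return f"{output_prefix.rstrip('/')}/{file_name}.out"


print(expected_output_uri(
    "s3://example-bucket/xgboost-churn/batch/test_batch.csv",
    "s3://example-bucket/xgboost-churn/batch/batch-inference",
))
```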

In [None]:
batch_output = pd.read_csv('test_batch.csv.out', header=None)
pred_y = np.round(batch_output)
pred_y

## Evaluating Results

The following code evaluates the job's output data to check the accuracy of our Batch Transform predictions.

In [None]:
def get_score(y_true,y_pred):
    f1 = metrics.f1_score(y_true, y_pred)
    precision = metrics.precision_score(y_true, y_pred)
    recall = metrics.recall_score(y_true, y_pred)
    accuracy = metrics.accuracy_score(y_true, y_pred)
    tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()
    return precision, recall, f1, accuracy, tn, fp, fn, tp

In [None]:
#get scores
temp_precision, temp_recall, temp_f1, temp_accuracy, tn, fp, fn, tp = get_score(test_true_y, pred_y)
output = [temp_precision,temp_recall,temp_f1,temp_accuracy,tp, fp, tn, fn]
output = pd.Series(output, index=['precision', 'recall', 'f1', 'accuracy', 'tp', 'fp', 'tn', 'fn']) 
print(output[['accuracy', 'tp', 'fp', 'tn', 'fn']])

from sklearn.metrics import classification_report
print(classification_report(test_true_y, pred_y))
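
For reference, the metrics above can all be derived from the four confusion-matrix counts. The sketch below is a plain-Python illustration of those formulas, not a replacement for `sklearn.metrics`; `scores_from_counts` is a hypothetical helper and the counts are made up.

```python
def scores_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted churners, how many churned
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual churners, how many were caught
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration, not results from this lab
print(scores_from_counts(tn=50, fp=10, fn=5, tp=35))
```

The argument order matches what `metrics.confusion_matrix(...).ravel()` returns for binary labels: `tn, fp, fn, tp`.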