# Important notes on updating to SageMaker SDK v2

* `get_image_uri` is deprecated; use `sagemaker.image_uris.retrieve` instead

* `train_instance_count` is renamed to `instance_count`

* `train_instance_type` is renamed to `instance_type`

* `s3_input` is deprecated; use `sagemaker.inputs.TrainingInput` instead

* Renamed `csv_serializer` and `json_deserializer` to `sagemaker.serializers.CSVSerializer()` and `sagemaker.deserializers.JSONDeserializer()` respectively
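
The parameter renames listed above can be applied mechanically to an old keyword-argument dictionary. The sketch below assumes the v1 spellings were `train_instance_count` and `train_instance_type`; `V1_TO_V2_KWARGS` and `rename_v1_kwargs` are hypothetical names used only for illustration and are not part of the SageMaker SDK.

```python
# Hypothetical helper (not part of the SageMaker SDK): map v1 estimator
# keyword names to their v2 equivalents, leaving other keys untouched.
V1_TO_V2_KWARGS = {
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
}


def rename_v1_kwargs(kwargs):
    """Return a copy of kwargs with known v1 names replaced by v2 names."""
    return {V1_TO_V2_KWARGS.get(key, key): value for key, value in kwargs.items()}


old_kwargs = {
    "train_instance_count": 1,
    "train_instance_type": "ml.m5.2xlarge",
    "output_path": "s3://projektbankapplication/xgboost-as-a-built-in-algo/output",
}
new_kwargs = rename_v1_kwargs(old_kwargs)
print(new_kwargs)
```

The renamed dictionary could then be unpacked into the estimator constructor with `**new_kwargs`.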


### Steps To Be Followed

1. Importing the necessary libraries
2. Creating the S3 bucket
3. Mapping the train and test data in S3
4. Mapping the path of the models in S3

### Importing Important Libraries

In [1]:
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
from sagemaker.session import Session
sagemaker.__version__

'2.59.3'

In [2]:
bucket_name = 'projektbankapplication' # Change the variable to the unique name of your bucket
my_region = boto3.session.Session().region_name # set the region of the instance
print(my_region)

us-east-1


In [3]:
s3 = boto3.resource('s3')
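try:
    if my_region == 'us-east-1':
        # us-east-1 is the default region and takes no LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        # Every other region must be named explicitly
        s3.create_bucket(Bucket=bucket_name,
                         CreateBucketConfiguration={'LocationConstraint': my_region})
    print('S3 bucket created successfully')
except Exception as e:
    print('S3 error: ', e)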

S3 bucket created successfully


In [4]:
# Set an output path where the trained model will be saved 
prefix = 'xgboost-as-a-built-in-algo'
output_path = 's3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)

s3://projektbankapplication/xgboost-as-a-built-in-algo/output


## Uploading File in S3 Bucket

**Downloading The Dataset And Storing in S3**

In [5]:
import pandas as pd
import urllib.request

try:
    urllib.request.urlretrieve("https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv", "bank_clean.csv")
    print('Success: downloaded bank_clean.csv.')
except Exception as e:
    print('Data load error: ', e)

try:
    model_data = pd.read_csv('./bank_clean.csv', index_col = 0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print("Data load error: ", e )

Success: downloaded bank_clean.csv.
Success: Data loaded into dataframe.


In [6]:
# Shuffle the rows, then split them 70/30 into train and test sets, as in the AWS tutorial

import numpy as np
train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data))])
print(train_data.shape, test_data.shape)

(28831, 61) (12357, 61)
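
The split above shuffles every row and then cuts the shuffled data at `int(0.7 * len(model_data))`. A minimal standard-library sketch of the same idea follows; `shuffle_split` is a hypothetical helper, and the row count 41188 is simply the sum of the two shapes printed above. The shuffle order will differ from pandas' `sample`, so only the sizes are comparable.

```python
import random


def shuffle_split(rows, train_frac=0.7, seed=1729):
    """Shuffle a copy of rows, then cut it at int(train_frac * len(rows))."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 41188 rows of the bank dataset.
train_rows, test_rows = shuffle_split(range(41188))
print(len(train_rows), len(test_rows))  # 28831 12357
```

Because the cut index is truncated with `int`, any leftover row goes to the test set.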


In [7]:
# Save the train and test sets to the bucket. Reminder: prefix = 'xgboost-as-a-built-in-algo'
# Start with the train data: the label column (y_yes) goes first and the CSV has no header row
import os
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], 
                                                axis=1)], 
                                                axis=1).to_csv('train.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
TrainingInput_train = TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')

In [8]:
# Test Data into Bucket
import os
pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'test/test.csv')).upload_file('test.csv')
TrainingInput_test = TrainingInput(s3_data='s3://{}/{}/test'.format(bucket_name, prefix), content_type='csv')
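
The two upload cells build S3 keys with `os.path.join`, which uses the host's path separator. That works on a Linux notebook instance but would produce backslashes on Windows. A minimal sketch that always joins with forward slashes is shown here; `s3_key` and `s3_uri` are hypothetical helper names, not boto3 or SageMaker functions.

```python
import posixpath


def s3_key(*parts):
    """Join key components with '/', regardless of the host operating system."""
    return posixpath.join(*parts)


def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and key components."""
    return "s3://{}/{}".format(bucket, s3_key(*parts))


demo_bucket = "projektbankapplication"
demo_prefix = "xgboost-as-a-built-in-algo"

print(s3_key(demo_prefix, "train/train.csv"))
print(s3_uri(demo_bucket, demo_prefix, "train"))
```

The key from `s3_key` can be passed to `Bucket(...).Object(...)`, and the URI from `s3_uri` to `TrainingInput(s3_data=...)`.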

## Building and Training the Model: XGBoost Built-in Algorithm

In [9]:
# Look up the URI of the built-in XGBoost container image for the current region.
# Set `version` to the XGBoost release you prefer.
container = image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

In [10]:
# Construct a SageMaker estimator that runs the XGBoost container
estimator = sagemaker.estimator.Estimator(image_uri=container,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1,
                                          instance_type='ml.m5.2xlarge',
                                          volume_size=5, #5 GB
                                          output_path=output_path,
                                          use_spot_instances=True,
                                          max_run=300,
                                          max_wait=600)

In [11]:
# Set the hyperparameters.
# NOTE: tune these outside of AWS to save cost, since tuning runs can take a long time.
estimator.set_hyperparameters(max_depth=5,
                              eta=0.2,
                              gamma=4,
                              min_child_weight=6,
                              num_round=100,
                              subsample=0.7,
                              objective="binary:logistic")

In [12]:
estimator.fit({'train': TrainingInput_train, 'validation': TrainingInput_test})

2021-10-27 00:56:21 Starting - Starting the training job...
2021-10-27 00:56:44 Starting - Launching requested ML instances
ProfilerReport-1635296181: InProgress
......
2021-10-27 00:57:44 Starting - Preparing the instances for training......
2021-10-27 00:58:46 Downloading - Downloading input data...
2021-10-27 00:59:19 Training - Training image download completed. Training in progress.
2021-10-27 00:59:19 Uploading - Uploading generated training model.
Arguments: train
[2021-10-27:00:59:14:INFO] Running standalone xgboost training.
[2021-10-27:00:59:14:INFO] File size need to be processed in the node: 4.83mb. Available memory size in the node: 23785.35mb
[2021-10-27:00:59:14:INFO] Determined delimiter of CSV input is ','
[00:59:14] S3DistributionType set as FullyReplicated
[00:59:14] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2021-10-27:00:59:14:INFO] Determi

## Deploy Machine Learning Model

In [13]:
xgb_predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

------------------!

## Prediction of the Test Data

In [23]:
from sagemaker.serializers import CSVSerializer
test_data_array = test_data.drop(['y_no', 'y_yes'], axis=1).values  # Load the features into an array
xgb_predictor.serializer = CSVSerializer()  # Serialize requests as CSV (text/csv)
predictions = xgb_predictor.predict(test_data_array).decode('utf-8')  # Run the prediction
predictions_array = np.fromstring(predictions[1:], sep=",")  # Parse the comma-separated scores into an array
print(predictions_array.shape)


(12357,)


In [24]:
predictions_array

array([0.05773537, 0.06024943, 0.04264852, ..., 0.03181691, 0.02607246,
       0.03320044])

In [28]:
cm = pd.crosstab(index=test_data['y_yes'], columns=np.round(predictions_array), rownames=['Observed'], colnames=['Predictions'])
tn = cm.iloc[0, 0] 
fn = cm.iloc[1, 0]
tp = cm.iloc[1, 1]
fp = cm.iloc[0, 1]
p = (tp+tn)/(tp+tn+fp+fn)*100
print("\n{0:<20}{1:<4.1f}%\n".format("Overall Classification Rate: ", p))
print("{0:<15}{1:<15}{2:>8}".format("Predicted", "No Purchase", "Purchase"))
print("Observed")
print("{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})".format("No Purchase", tn/(tn+fn)*100, tn, fp/(tp+fp)*100, fp))
print("{0:<16}{1:<1.0f}% ({2:<}){3:>6.0f}% ({4:<}) \n".format("Purchase", fn/(tn+fn)*100, fn, tp/(tp+fp)*100, tp))


Overall Classification Rate: 89.5%

Predicted      No Purchase    Purchase
Observed
No Purchase    90% (10764)    37% (172)
Purchase        10% (1130)    63% (291) 
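
The table reports column-wise percentages, so the 63% under "Purchase" is the precision. Recall, the share of actual purchasers the model finds, is not shown. A short sketch that recomputes both from the four counts above follows; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted purchases that are real
        "recall": tp / (tp + fn),        # share of real purchases that are predicted
    }


# Counts taken from the confusion matrix printed above.
metrics = classification_metrics(tn=10764, fp=172, fn=1130, tp=291)
for name, value in metrics.items():
    print("{:<10}{:.1%}".format(name, value))
# accuracy  89.5%
# precision 62.9%
# recall    20.5%
```

With these counts the overall rate is high mainly because "No Purchase" dominates the test set; at the 0.5 threshold implied by `np.round`, the model finds roughly one in five actual purchasers.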

