# Building, Training, and Deploying a Machine Learning Model with Amazon SageMaker

<br>

**Table of Contents:**

- [Libraries](#Libraries)
- [Defining Environment Variables](#Defining-Environment-Variables)
- [Creating an S3 Bucket](#Creating-an-S3-Bucket)
- [Retrieving the Dataset](#Retrieving-the-Dataset)
- [Splitting into Training/Test Sets](#Splitting-into-Training/Test-Sets)
- [Training the Model](#Training-the-Model)
- [Deploying the Model](#Deploying-the-Model)
- [Evaluating Model Performance](#Evaluating-Model-Performance)
- [Cleaning up](#Cleaning-up)
    - [Deleting Endpoint](#Deleting-Endpoint)
    - [Deleting Training Artifacts and S3 Bucket](#Deleting-Training-Artifacts-and-S3-Bucket)

<br>

## Libraries

We start the notebook by importing the necessary libraries and tools. The AWS-specific tools are:

- [`Boto3`](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html): the official AWS Software Development Kit (SDK) for Python. Boto3 enables developers to integrate Python applications, libraries, or scripts with AWS services, such as Amazon S3 and Amazon EC2.

- [`sagemaker`](https://sagemaker.readthedocs.io/en/stable/): an open-source library for training and deploying machine learning models on Amazon SageMaker.

<br>

From `sagemaker`, we separately import the `get_execution_role()` function and the `CSVSerializer` class. The former, as the name suggests, returns the execution role for the notebook instance, i.e. the IAM role that we created for it. This role will later be passed to the training job. The latter serialises data of various formats to a CSV-formatted string for the inference endpoint.
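To make the idea of CSV serialisation concrete, here is a minimal standard-library sketch of turning rows of numbers into a CSV-formatted string. The helper `rows_to_csv` and the sample values are hypothetical; this is a simplified stand-in for illustration, not the implementation used by `CSVSerializer`.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise an iterable of rows into a CSV string, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    # Drop the trailing newline so the payload ends with the last value.
    return buffer.getvalue().rstrip("\n")


# Two made-up feature rows, purely for illustration.
payload = rows_to_csv([[35, 1, 0.5], [52, 0, 1.25]])
print(payload)
```

Each row becomes one line of comma-separated values, which is the kind of text payload a CSV-speaking endpoint expects.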

In [1]:
import boto3, sagemaker
from sagemaker import get_execution_role
from sagemaker.serializers import CSVSerializer

import re, os, urllib.request

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'

Moreover, we will install and import [`watermark`](https://github.com/rasbt/watermark), an IPython magic extension that enables us to print version numbers and hardware information. The cell magic command `%%capture` is placed at the beginning of the following cell to suppress its output.

In [2]:
%%capture
!pip install watermark

In [3]:
import watermark
%load_ext watermark

# See version of system, Python, and libraries
%watermark -n -v -m -iv

re         2.2.1
matplotlib 3.3.4
numpy      1.19.5
boto3      1.21.42
pandas     1.1.5
sagemaker  2.86.2
watermark  2.0.2
Fri Jun 10 2022 

CPython 3.6.13
IPython 7.16.1

compiler   : GCC 9.3.0
system     : Linux
release    : 4.14.252-131.483.amzn1.x86_64
machine    : x86_64
processor  : x86_64
CPU cores  : 2
interpreter: 64bit


<br>

## Defining Environment Variables

We continue by defining the environment variables we need to prepare the data, train the ML model, and deploy the ML model.

Firstly, we define the execution role for the notebook instance using the `get_execution_role()` function. This is the IAM role we created for this particular notebook instance; it will later be passed to the training job.

Then, we extract the name of the AWS region where our instance is hosted. This variable is necessary for retrieving the XGBoost container image (see next paragraph) and later for creating an S3 bucket.

Finally, we retrieve the URI of the XGBoost container image using `sagemaker`'s `image_uris.retrieve` function. SageMaker uses Docker containers for training and deploying machine learning algorithms. Containers allow developers and data scientists to package software into standardised units that run consistently on any platform that supports Docker.

In this project, we use SageMaker's built-in XGBoost algorithm. The `image_uris.retrieve` API looks up the image URI of the built-in algorithm for our region. With the image URI in hand, we can construct an estimator using the SageMaker `Estimator` API and start a training job.

In [4]:
role = get_execution_role()
prefix = 'sagemaker/DEMO-xgboost-dm'

my_region = boto3.session.Session().region_name

# Look up the URI of the built-in XGBoost container image for this region.
xgboost_container = sagemaker.image_uris.retrieve('xgboost', my_region, 'latest')

print(f'Success - the MySageMakerInstance is in the {my_region} region.')
print(f'You will use the {xgboost_container} container for your SageMaker endpoint.')

Success - the MySageMakerInstance is in the us-east-1 region.
You will use the 811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest container for your SageMaker endpoint.
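The URI printed above follows the usual Amazon ECR layout of account ID, region, repository, and tag. The following sketch splits such a URI into its parts with a regular expression. The helper `parse_ecr_image_uri` and the pattern are hypothetical and only cover the standard `amazonaws.com` form shown in the output.

```python
import re

# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
ECR_URI_PATTERN = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    match = ECR_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Not a recognised ECR image URI: {uri!r}")
    return match.groupdict()


parts = parse_ecr_image_uri(
    "811284229777.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest"
)
print(parts)
```

Reading the URI this way makes it easy to confirm that the image lives in the same region as the notebook instance and to see which tag will be pulled.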


<br>

## Creating an S3 Bucket

In this section, we create an Amazon S3 bucket to store our data.
Simply put, an S3 bucket is a storage location to hold files. S3 files are referred to as objects.

Using the `boto3` library with Amazon S3 allows us to create, update, and delete S3 Buckets and objects from Python programs with ease. To establish a connection with S3, we need to use *resources*, i.e. an object-oriented interface to Amazon Web Services. For this purpose, we invoke the `resource()` method of a Session and pass in a service name, as in the following example:

```s3 = boto3.resource('s3')```

Once a connection with S3 is established, we can create a bucket. 

Note that every Amazon S3 Bucket must have a unique name across all AWS accounts and customers. Therefore, we need to use a name that hasn’t been taken yet, otherwise the program will yield an error.
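One way to reduce the chance of a name clash is to append a random suffix to a readable prefix. The sketch below does this and checks the result against a subset of the S3 naming rules (3 to 63 characters; lowercase letters, digits, hyphens, and dots; starting and ending with a letter or digit). The helper `make_bucket_name` and the prefix are hypothetical, and a random suffix makes a collision unlikely rather than impossible.

```python
import re
import uuid

# Subset of the S3 bucket naming rules, expressed as a pattern.
BUCKET_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix="sagemaker-tutorial"):
    """Return a bucket name made of the prefix plus a random hex suffix."""
    suffix = uuid.uuid4().hex[:12]
    name = f"{prefix}-{suffix}"
    if not BUCKET_NAME_PATTERN.match(name):
        raise ValueError(f"Invalid bucket name: {name!r}")
    return name


candidate = make_bucket_name()
print(candidate)
```

The generated value could then be assigned to `bucket_name` before calling `create_bucket`.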

In [5]:
bucket_name = 'sagemaker-tutorial-12345'  # <--- CHANGE THIS VARIABLE TO A UNIQUE NAME FOR YOUR BUCKET
s3 = boto3.resource('s3')

try:
    if my_region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': my_region})
    print('Success: S3 bucket created successfully!')
except Exception as e:
    print('S3 error: ', e)

Success: S3 bucket created successfully!




<br>

## Retrieving the Dataset

The next task is downloading the data to our SageMaker notebook instance and loading it into a Pandas DataFrame.

In [6]:
dataset_url = 'https://d1.awsstatic.com/tmt/build-train-deploy-machine-learning-model-sagemaker/bank_clean.27f01fbbdf43271788427f3682996ae29ceca05d.csv'

try:
    urllib.request.urlretrieve(url=dataset_url, filename='bank_clean.csv')
    print('Success: Downloaded bank_clean.csv.')
except Exception as e:
    print('Data load error: ', e)

try:
    model_data = pd.read_csv('./bank_clean.csv', index_col=0)
    print('Success: Data loaded into dataframe.')
except Exception as e:
    print('Data load error: ', e)

Success: Downloaded bank_clean.csv.
Success: Data loaded into dataframe.


In [None]:
model_data.head()

<br>

## Splitting into Training/Test Sets



The dataset is already processed, so we can skip typical machine learning pre-processing steps (such as encoding categorical features). The only modification we need to perform is randomly splitting the dataset into a training and a test set. The training data (70% of all customers) is used to train the model. The test data (the remaining 30% of customers) is used to evaluate the performance of the trained model and measure how well it generalises to unseen data.

In [7]:
train_data, test_data = np.split(model_data.sample(frac=1, random_state=1729),
                                 [int(0.7 * len(model_data))])

print(f'Train Set: {train_data.shape[0]} rows x {train_data.shape[1]} columns.')
print(f' Test Set: {test_data.shape[0]} rows x {test_data.shape[1]} columns.')

Train Set: 28831 rows x 61 columns.
 Test Set: 12357 rows x 61 columns.
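The split logic can be reproduced without pandas: shuffle the rows with a fixed seed, then cut at 70% of the length. The sketch below uses the standard library only; the helper `shuffle_split` is hypothetical, and its shuffle order will differ from the one produced by `DataFrame.sample`, so only the sizes are comparable.

```python
import random


def shuffle_split(rows, train_fraction=0.7, seed=1729):
    """Shuffle rows reproducibly and split them at the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))  # truncates, like the cell above
    return shuffled[:cut], shuffled[cut:]


# 41188 is the total row count implied by the output above (28831 + 12357).
train_rows, test_rows = shuffle_split(range(41188))
print(len(train_rows), len(test_rows))
```

Because the cut index is truncated to an integer, the training set receives 28831 rows and the test set the remaining 12357, matching the shapes printed by the notebook.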


<br>

## Training the Model

In [8]:
pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)

boto3.Session().resource('s3').Bucket(bucket_name).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')

s3_input_train = sagemaker.inputs.TrainingInput(s3_data='s3://{}/{}/train'.format(bucket_name, prefix), content_type='csv')
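The cell above writes the label `y_yes` as the first column and omits the header and index, which is the CSV layout the built-in XGBoost container reads (the training log further down reports `label_column=0`). The following standard-library sketch shows the same reordering on two made-up records; the helper `to_training_csv` and the sample values are hypothetical.

```python
import csv
import io


def to_training_csv(records, label="y_yes", drop=("y_no",)):
    """Write records as header-less CSV with the label in the first column."""
    feature_names = [
        name for name in records[0] if name != label and name not in drop
    ]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        writer.writerow(
            [record[label]] + [record[name] for name in feature_names]
        )
    return buffer.getvalue()


# Made-up rows for illustration only.
records = [
    {"age": 41, "campaign": 2, "y_no": 1, "y_yes": 0},
    {"age": 29, "campaign": 1, "y_no": 0, "y_yes": 1},
]
training_csv = to_training_csv(records)
print(training_csv)
```

Dropping `y_no` matters because it is the complement of the label; leaving it among the features would leak the answer to the model.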

We initialize a session, which allows us to manage interactions with the Amazon SageMaker APIs and any other AWS services needed.

Then, we initialize an Estimator instance. We need to specify the following parameters:

- `image_uri`: The container image to use for training.
- `role`: The AWS IAM role we defined earlier.
- `instance_count`: Number of Amazon EC2 instances to use for training.
- `instance_type`: Type of EC2 instance to use for training.
- `output_path`: S3 location for saving the training result (model artifacts and output files).
- `sagemaker_session`: Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed.

After creating the Estimator, we also define the model’s hyperparameters.

In [9]:
sess = sagemaker.Session()
xgb = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                    role=role,
                                    instance_count=1,
                                    instance_type='ml.m4.xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket_name, prefix),
                                    sagemaker_session=sess)

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        verbosity=2,
                        objective='binary:logistic',
                        num_round=100)

We are now ready to start the training job. The following command trains the model using gradient boosting on an `ml.m4.xlarge` instance. After a few minutes, we should see the training logs being generated in our Jupyter notebook.
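To give some intuition for what the `eta` and `num_round` hyperparameters control, here is a toy boosting loop in pure Python. It fits one-split trees (stumps) to the residuals of a squared-error loss and adds each stump scaled by `eta`. This is a deliberately simplified sketch under those assumptions, not XGBoost's actual algorithm, which also uses second-order gradients and regularisation; the data and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave points on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, num_round=100, eta=0.2):
    """Run num_round boosting rounds with learning rate eta."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + eta * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Tiny made-up dataset: one feature, binary target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, fitted = boost(xs, ys, num_round=100, eta=0.2)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, fitted)
print(initial_mse, final_mse)
```

A smaller `eta` makes each round contribute less, so more rounds are needed to reach the same training error; that trade-off is what the two hyperparameters balance.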

In [10]:
xgb.fit({'train': s3_input_train})

2022-06-10 12:18:25 Starting - Starting the training job...ProfilerReport-1654863505: InProgress
...
2022-06-10 12:19:09 Starting - Preparing the instances for training.........
2022-06-10 12:20:50 Downloading - Downloading input data...
2022-06-10 12:21:21 Training - Downloading the training image......
2022-06-10 12:22:16 Training - Training image download completed. Training in progress..Arguments: train
[2022-06-10:12:22:19:INFO] Running standalone xgboost training.
[2022-06-10:12:22:19:INFO] Path /opt/ml/input/data/validation does not exist!
[2022-06-10:12:22:19:INFO] File size need to be processed in the node: 3.38mb. Available memory size in the node: 8461.46mb
[2022-06-10:12:22:19:INFO] Determined delimiter of CSV input is ','
[12:22:19] S3DistributionType set as FullyReplicated
[12:22:20] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[12:22:20] 


2022-06-10 12:22:50 Uploading - Uploading generated training model
2022-06-10 12:22:50 Completed - Training job completed
Training seconds: 127
Billable seconds: 127


<br>

## Deploying the Model

After fitting the XGBoost estimator, we can host the newly created model in SageMaker. For this purpose, we call `deploy` on the estimator to create a SageMaker endpoint. The endpoint runs a SageMaker-provided XGBoost model server and hosts the model produced by the training job that ran when we called `fit`.

This step may take a few minutes to complete.

In [11]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

---------!

The next step is making predictions. We first set up a CSV serialiser, which takes numpy arrays (like our testing data) and serialises them to CSV format. 

Finally, we call the `predict` method, which returns the inference results as a comma-separated byte string. We decode this string and parse it into the `predictions_array` array.

In [12]:
test_data_array = test_data.drop(labels=['y_no', 'y_yes'],
                                 axis=1).values  # load the data into an array

xgb_predictor.serializer = CSVSerializer()  # set the serializer type
predictions = xgb_predictor.predict(test_data_array).decode('utf-8')  # make predictions

predictions_array = np.fromstring(predictions[1:], sep=',')  # and turn the prediction into an array
print(f'Prediction array: {predictions_array.shape[0]} rows')

Prediction array: 12357 rows


The output array stores the predictions as probabilities, i.e. values ranging from 0 to 1 describing the probability that a customer will subscribe to the product.
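Turning these probabilities into class labels requires a decision threshold. The sketch below applies a cut-off of 0.5, which is effectively what the `np.round` call in the evaluation cell does. The helper `to_labels` and the sample probabilities are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 (subscribes) or 0 (does not subscribe)."""
    return [1 if p > threshold else 0 for p in probabilities]


# Made-up probabilities for illustration.
sample_probabilities = [0.03, 0.48, 0.51, 0.97]
print(to_labels(sample_probabilities))
```

A different threshold trades false positives against false negatives, so 0.5 is a convention rather than a requirement.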

<br>

## Evaluating Model Performance

In this section, we evaluate the performance and accuracy of the trained machine learning model.

In [13]:
cm = pd.crosstab(index=test_data['y_yes'],
                 columns=np.round(predictions_array),
                 rownames=['Observed'],
                 colnames=['Predicted'])
tn = cm.iloc[0, 0]
fn = cm.iloc[1, 0]
tp = cm.iloc[1, 1]
fp = cm.iloc[0, 1]
p = (tp + tn) / (tp + tn + fp + fn) * 100

print('\n{0:<20}{1:<4.1f}%\n'.format('Overall Classification Rate: ', p))
print('{0:<15}{1:<15}{2:>8}'.format('Predicted', 'No Purchase', 'Purchase'))
print('Observed')
print('{0:<15}{1:<2.0f}% ({2:<}){3:>6.0f}% ({4:<})'.format(
    'No Purchase', tn / (tn + fn) * 100, tn, fp / (tp + fp) * 100, fp))
print('{0:<16}{1:<1.0f}% ({2:<}){3:>7.0f}% ({4:<}) \n'.format(
    'Purchase', fn / (tn + fn) * 100, fn, tp / (tp + fp) * 100, tp))


Overall Classification Rate: 89.5%

Predicted      No Purchase    Purchase
Observed
No Purchase    90% (10769)    37% (167)
Purchase        10% (1133)     63% (288) 
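The percentages in the table above are computed per predicted class (column-wise), so it can help to derive the usual metrics directly from the four counts. The following sketch does this with the numbers printed above; the helper `classification_metrics` is hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts taken from the confusion matrix printed above.
metrics = classification_metrics(tn=10769, fp=167, fn=1133, tp=288)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is about 89.5% and precision about 63%, but recall is only about 20%: the model identifies roughly one in five of the customers who actually purchased.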



<br>

## Cleaning up

### Deleting Endpoint

In [14]:
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

<br>

### Deleting Training Artifacts and S3 Bucket

In [15]:
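bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
# Delete every object in the bucket (training data and model artifacts).
# The emptied bucket itself remains; bucket_to_delete.delete() would remove it.
bucket_to_delete.objects.all().delete()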



[{'ResponseMetadata': {'RequestId': 'ZTNRTJRJVCTWBG2Y',
   'HostId': 'jQYxyIcKv2s1hOf660/cWcLa4gyynUxmztyr/N0P/hf+pNyEZ+OPxtDUxbcJdIeeTt5YBn2XZ8s=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'jQYxyIcKv2s1hOf660/cWcLa4gyynUxmztyr/N0P/hf+pNyEZ+OPxtDUxbcJdIeeTt5YBn2XZ8s=',
    'x-amz-request-id': 'ZTNRTJRJVCTWBG2Y',
    'date': 'Fri, 10 Jun 2022 12:27:42 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2022-06-10-11-27-18-893/profiler-output/system/incremental/2022061011/1654860600.algo-1.json'},
   {'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2022-06-10-12-18-25-032/rule-output/ProfilerReport-1654863505/profiler-output/profiler-reports/GPUMemoryIncrease.json'},
   {'Key': 'sagemaker/DEMO-xgboost-dm/output/xgboost-2022-06-10-11-44-47-531/output/model.tar.gz'},
   {'Key': 'sagemaker/DEMO-xgboost-d

<br>

---