# MLE Capstone: SageMaker Model Training and Deployment

This notebook contains the code that is needed to train and deploy a binary classifier for the profit/loss part of the project.

In [1]:
# for the sake of development, use this magic command to solve slow suggestion
%config Completer.use_jedi = False

In [2]:
import io
import os
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
from datetime import datetime
import math
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline
pd.set_option('max_columns', 150)

import boto3
import sagemaker
from sagemaker import get_execution_role

from load_data import load_data
from preprocess import preprocess

from sklearn.metrics import classification_report, f1_score, accuracy_score
from sklearn.metrics import recall_score, precision_score, confusion_matrix

In [3]:
print(sagemaker.__version__)

2.24.1


Store the SageMaker variables in the next cell.

In [4]:
# sagemaker session, role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

# S3 bucket name
bucket = sagemaker_session.default_bucket()

### General Outline

Below is an outline that this notebook will follow.

1. Process / Prepare the data
2. Upload the processed data to the linked S3 bucket
3. Train a chosen model
5. Deploy the trained model
6. Use the deployed model for inference

### 1. Prepare the data for use in SageMaker

In [5]:
df = load_data()
print('Data shape (rows, cols): ', df.shape)
print('Total # films that made money: ', df['class'].sum())
print('Profitable percentage = {:.2f}%'.format(
    100*df['class'].sum()/df.shape[0]))

Data shape (rows, cols):  (5579, 658)
Total # films that made money:  3894
Profitable percentage = 69.80%


#### 1.1 Clean the data

In [6]:
df_clean = preprocess(df)

In [7]:
df_clean.head()

Unnamed: 0,revenue,runtime,num_prods,num_languages,num_writers,UNRATE,PCE,class,original_language_en,original_language_fr,original_language_hi,month_Apr,month_Aug,month_Dec,month_Feb,month_Jan,month_Jul,month_Jun,month_Mar,month_May,month_Nov,month_Oct,month_Sep,genres_Adventure,genres_Animation,genres_Children,genres_Comedy,genres_Fantasy,genres_Drama,genres_Romance,genres_Action,genres_Crime,genres_Thriller,genres_Mystery,genres_Sci-Fi,genres_Musical,genres_Horror,genres_War,genres_IMAX,prod_comp_names_Warner_Bros._Pictures,prod_comp_names_Universal_Pictures,prod_comp_names_Columbia_Pictures,prod_comp_names_Paramount,prod_comp_names_20th_Century_Fox,prod_comp_names_New_Line_Cinema,prod_comp_names_Walt_Disney_Pictures,prod_comp_names_Canal+,prod_comp_names_Metro-Goldwyn-Mayer,prod_comp_names_Touchstone_Pictures,prod_comp_names_Relativity_Media,prod_comp_names_Miramax,prod_comp_cntry_US,prod_comp_cntry_GB,prod_comp_cntry_FR,num_top_100_actors,established_director,log10_budget,log10_director_pop,log10_avg_writer_pop,log10_max_writer_pop,log10_avg_actor_pop,log10_max_actor_pop,log10_min_actor_pop,log10_cast_crew_sum_pop,log10_cast_crew_product_pop
0,221546000.0,81.0,1,1,5,5.5,5013.9,1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0.0,7.250232,0.525304,0.652182,0.822822,1.051268,1.42951,0.393048,1.280904,2.228754
1,156265000.0,104.0,4,2,3,5.6,5097.5,1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0.0,7.587154,0.471732,0.260389,0.629817,0.876776,1.069668,0.318272,1.090399,1.608897
2,48433220.0,127.0,1,1,2,5.6,5097.5,1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0.0,6.978361,0.72583,-0.175874,0.146438,0.524006,0.754578,0.199206,0.969789,1.073962
3,111454000.0,170.0,3,2,1,5.6,5097.5,1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,3,1.0,7.552392,0.962985,0.485863,0.962985,1.037811,1.211307,0.914079,1.36462,2.486659
4,31914590.0,127.0,7,2,5,5.6,5097.5,0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0.0,7.537669,0.031408,0.337858,0.350829,0.858918,1.054498,0.697142,1.020292,1.228185


#### 1.2 Split the data into a train and test set

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# drops the two targets that we would be using
X = df.drop(['class', 'revenue'], axis=1)
y = df['class']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, train_size=0.8
)

In [11]:
# scaler = StandardScaler()
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)

#### 1.3 Upload the data to S3

First save the processed data locally. Take care for the label to proceed the features, as this is how SageMaker expects to receive the data.

In [12]:
data_dir = 'data'

In [13]:
# save the train data to the data/ directory
pd.concat([pd.DataFrame(y_train).reset_index().drop('index', axis=1)
           , pd.DataFrame(X_train)]
          , axis=1)\
    .to_csv(os.path.join(data_dir, 'train.csv'), index=False)

In [14]:
print(pd.read_csv('data/train.csv').shape)

(4463, 66)


Upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model. The below cell will upload the entire contents of the data directory.

In [15]:
prefix = 'sagemaker/mle-capstone'

In [16]:
input_data = sagemaker_session.upload_data(
    path=data_dir
    , bucket=bucket
    , key_prefix=prefix
)

### 2. Modelling - custom SciKit-Learn model for a Decision Tree

A model in SageMaker comprises three objects:

* Model Artifacts,
* Training Code, and
* Inference Code,

each of which interact with one another.
<!-- We will try using SageMaker's LinearLearner, for which there are two main applications:

1. Regression tasks in which a linear line is fit to some data points -->



In [17]:
from sagemaker import sklearn
from sagemaker.sklearn.estimator import SKLearn

In [18]:
estimator = SKLearn(
    entry_point='train.py'
    , py_version='py3'
    , framework_version='0.23-1'
    , role=role
    , instance_count=1
    , instance_type='ml.m4.xlarge'
    , sagemaker_session=sagemaker_session
    # specify hyperparameters to use
    , hyperparameters={
        'class_weight': 'balanced'
        , 'criterion': 'entropy'
        , 'max_depth': 8
        , 'max_features': 'log2'
    }
)

In [19]:
estimator.fit({'train': input_data})

2021-02-13 22:04:43 Starting - Starting the training job...
2021-02-13 22:04:46 Starting - Launching requested ML instancesProfilerReport-1613253883: InProgress
......
2021-02-13 22:06:01 Starting - Preparing the instances for training......
2021-02-13 22:07:12 Downloading - Downloading input data......
2021-02-13 22:08:06 Training - Training image download completed. Training in progress..[34m2021-02-13 22:08:07,768 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training[0m
[34m2021-02-13 22:08:07,771 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2021-02-13 22:08:07,783 sagemaker_sklearn_container.training INFO     Invoking user training script.[0m
[34m2021-02-13 22:08:23,467 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2021-02-13 22:08:23,482 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2021-02-13 22:08:23,496 s

In [20]:
estimator.hyperparameters()

{'class_weight': '"balanced"',
 'criterion': '"entropy"',
 'max_depth': '8',
 'max_features': '"log2"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-east-1-461533213257/sagemaker-scikit-learn-2021-02-13-22-04-43-561/source/sourcedir.tar.gz"',
 'sagemaker_program': '"train.py"',
 'sagemaker_container_log_level': '20',
 'sagemaker_job_name': '"sagemaker-scikit-learn-2021-02-13-22-04-43-561"',
 'sagemaker_region': '"us-east-1"'}

### 3. Inference - sending data to a deployed endpoint for inference

To deploy the model we only have to use the `deploy` method of the estimator object created above. The endpoint is a running cost until it is shut down. After using, it is important to be sure to clean up.

In [21]:
predictor = estimator.deploy(
    initial_instance_count=1
    , instance_type='ml.t2.medium'
)

-------------------!

In [22]:
X_test = scaler.fit_transform(X_test)

In [23]:
print(predictor.predict(data=X_test))
print(y_test.values)

[0 1 1 ... 1 1 0]
[1 0 1 ... 1 1 1]


In [24]:
preds = predictor.predict(data=X_test)

In [25]:
print(classification_report(y_test, preds))
print(confusion_matrix(y_test, preds))
print("F1-score: {:.2f}".format(f1_score(y_test, preds)))
print("Accuracy: {:.2f}".format(accuracy_score(y_test, preds)))
print("Precision: {:.2f}".format(precision_score(y_test, preds)))
print("Recall: {:.2f}".format(recall_score(y_test, preds)))
print()

              precision    recall  f1-score   support

           0       0.47      0.34      0.40       337
           1       0.75      0.83      0.79       779

    accuracy                           0.68      1116
   macro avg       0.61      0.59      0.59      1116
weighted avg       0.66      0.68      0.67      1116

[[116 221]
 [131 648]]
F1-score: 0.79
Accuracy: 0.68
Precision: 0.75
Recall: 0.83



### Tidy up resources by deleting the endpoint

In [26]:
predictor.delete_endpoint()