# Train a new model

For this second model, you will correct the gender imbalance in the dataset using SMOTE, then train another XGBoost model with hyperparameter tuning and Debugger. This model will also be saved to the model registry and eventually approved for deployment.

In [1]:
!pip install imbalanced-learn==0.7.0



In [69]:
# import os
# import json
import boto3
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import Rule, rule_configs
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

# from sagemaker import clarify
# from sagemaker.session import Session


In [3]:
# Set region, boto3 and SageMaker SDK variables

# You can change this to a region of your choice
region = sagemaker.Session().boto_region_name
print("Using AWS Region: {}".format(region))

boto3.setup_default_session(region_name=region)
boto_session = boto3.Session(region_name=region)

s3_client = boto3.client('s3', region_name=region)
sagemaker_boto_client = boto_session.client('sagemaker')

sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_boto_client)

sagemaker_role = sagemaker.get_execution_role()
account_id = boto3.client('sts').get_caller_identity()["Account"]

random_state = 42

Using AWS Region: us-east-1


In [4]:
# load stored variables
%store -r
%store

Stored variables and their in-db values:
clarify_data_uri                 -> 's3://sagemaker-us-east-1-367158743199/sagemaker-t
data_prefix                      -> 'sagemaker-tutorial/data'
default_bucket                   -> 'sagemaker-us-east-1-367158743199'
feature_group_name               -> 'FG-flow-sm-tutorial-31-16-16-17-9f41d66b'
hyperparameters                  -> {'max_depth': '5', 'eta': '0.2', 'gamma': '4', 'mi
local_processed_data             -> '../sm-tutorial/02_build_train/processed_data'
model_data                       -> 's3://sagemaker-us-east-1-367158743199/tf2-resnet-
mpg_name                         -> 'sagemaker-tutorial'
prefix                           -> 'sagemaker-tutorial'
s3_raw_data                      -> 's3://sagemaker-us-east-1-367158743199/sagemaker-t
s3_train_data                    -> 's3://sagemaker-us-east-1-367158743199/sagemaker-t
s3_validation_data               -> 's3://sagemaker-us-east-1-367158743199/sagemaker-t
test_data_uri               

In [31]:
# Load the processed dataset and drop the unnamed index column
df = pd.read_csv(f'{local_processed_data}/df_processed.csv')
df = df.drop(columns=['Unnamed: 0'])

### SMOTE

One approach to addressing imbalanced datasets is to oversample the minority class. New examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
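The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the helper `smote_like_sample`, the toy `minority` array, and the choice of `k=2` neighbours are all assumptions made for the example, and it assumes the minority samples contain no duplicate rows.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the chosen sample to every minority sample
    dists = np.linalg.norm(minority - base, axis=1)
    # Position 0 in the sort order is the sample itself (distance 0), so skip it
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    gap = rng.random()  # interpolation factor in [0, 1)
    return base + gap * (neighbour - base)

# Hypothetical minority-class samples with two features
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
rng = np.random.default_rng(42)
synthetic = np.array([smote_like_sample(minority, rng=rng) for _ in range(5)])
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority samples, so it stays inside the region the minority class already occupies.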

In [34]:
sm = SMOTE(random_state=42)

df_resampled, _ = sm.fit_resample(df, df['SEX'])

In [37]:
# Check the gender balance
df_resampled['SEX'].value_counts()

2.0    5214
1.0    5214
Name: SEX, dtype: int64

### Saving data back to S3

In [40]:
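# Hold out 20% of the resampled data for testing
X_train, X_test = train_test_split(df_resampled, test_size=0.2, random_state=random_state)

# Split the remaining 80% again so the validation set does not overlap the test set
X_train, X_val = train_test_split(X_train, test_size=0.2, random_state=random_state)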
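The proportions produced by two chained 80/20 splits can be checked on toy data. The sketch below uses a hypothetical 100-element array as a stand-in for the resampled rows. The second split is applied to the remainder of the first, so the three subsets do not overlap.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # hypothetical stand-in for the resampled rows

# First split: hold out 20% of all rows for testing
train_val, test = train_test_split(rows, test_size=0.2, random_state=42)
# Second split: take 20% of the remaining rows for validation
train, val = train_test_split(train_val, test_size=0.2, random_state=42)

print(len(train), len(val), len(test))
```

With these settings the data ends up roughly 64% train, 16% validation, and 20% test.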

In [49]:
# local_processed_data = './processed_data'

In [51]:
X_train.to_csv(f'{local_processed_data}/train_res.csv', header=False, index=False)

response = sagemaker_session.upload_data(f'{local_processed_data}/train_res.csv',
                                         bucket=default_bucket, 
                                         key_prefix=data_prefix)
train_res_data_uri = response
%store train_res_data_uri

Stored 'train_res_data_uri' (str)


In [52]:
X_val.to_csv(f'{local_processed_data}/validation_res.csv', header=False, index=False)

response = sagemaker_session.upload_data(f'{local_processed_data}/validation_res.csv',
                                         bucket=default_bucket, 
                                         key_prefix=data_prefix)
validation_res_data_uri = response
%store validation_res_data_uri

Stored 'validation_res_data_uri' (str)


In [53]:
X_test.to_csv(f'{local_processed_data}/test_res.csv', header=False, index=False)

response = sagemaker_session.upload_data(f'{local_processed_data}/test_res.csv',
                                         bucket=default_bucket, 
                                         key_prefix=data_prefix)
test_res_data_uri = response
%store test_res_data_uri

Stored 'test_res_data_uri' (str)


### Creating XGBoost model with Hyperparameter Tuning and Debugger

For SageMaker XGBoost training jobs, use the Debugger `CreateXgboostReport` rule to receive a comprehensive training report of the training progress and results. Following this guide, specify the CreateXgboostReport rule while constructing an XGBoost estimator. The `CreateXgboostReport` rule collects the following output tensors from your training job:

* hyperparameters – Saves at the first step.
* metrics – Saves loss and accuracy every 5 steps.
* feature_importance – Saves every 5 steps.
* predictions – Saves every 5 steps.
* labels – Saves every 5 steps.

In [75]:
train_instance_count = 1
train_instance_type = "ml.m4.xlarge"
content_type = "text/csv"
estimator_output_path = f's3://{default_bucket}/{prefix}/training_jobs'

rules=[
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

# Look up the URI of the built-in XGBoost container image for this region.
# Change the version string ("1.2-1") to use a different XGBoost release.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")


# construct a SageMaker estimator that calls the xgboost-container
xgb_estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                              hyperparameters=hyperparameters,
                                              role=sagemaker.get_execution_role(),
                                              instance_count=train_instance_count,
                                              instance_type=train_instance_type,
                                              volume_size=5,  # 5 GB
                                              output_path=estimator_output_path,
                                              rules=rules
                                             )

### Set up Hyperparameter Tuning


We will tune four hyperparameters in this example:

* eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative.
* alpha: L1 regularization term on weights. Increasing this value makes models more conservative.
* gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is.
* max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
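To make the effect of eta concrete, here is a toy boosting loop. It is a deliberately simplified sketch, not XGBoost's algorithm: the hypothetical `boost_constant` function assumes each round's "tree" fits the current residual perfectly, so the only thing that varies is how much of that fit eta lets through.

```python
def boost_constant(target, eta, n_rounds):
    """Toy boosting loop: each round's weak learner predicts the current
    residual exactly, and eta scales how much of it is added to the model."""
    prediction = 0.0
    history = []
    for _ in range(n_rounds):
        residual = target - prediction   # what the next learner tries to fit
        prediction += eta * residual     # shrunken update
        history.append(prediction)
    return history

fast = boost_constant(target=10.0, eta=1.0, n_rounds=3)
slow = boost_constant(target=10.0, eta=0.2, n_rounds=3)
print(fast)
print(slow)
```

With eta set to 1.0 the model reaches the target in one round, while with eta set to 0.2 it closes only 20% of the remaining gap per round. Smaller values therefore need more boosting rounds but make each individual tree less influential.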

In [76]:
hyperparameter_ranges = {'max_depth': IntegerParameter(1, 10),
                         'eta': ContinuousParameter(0, 1),
                         'gamma': ContinuousParameter(0, 5),
                         'alpha': ContinuousParameter(0, 2)
                         }

Next, we specify the objective metric to tune. Because we are using the built-in XGBoost algorithm, which emits predefined metrics such as validation:auc and validation:f1, we only need to specify the metric name and do not need to provide a regular expression (regex). Here we monitor validation:f1, as shown below. If you bring your own algorithm, it emits its own metrics. In that case, you need to add a metric definition that describes the format of those metrics through a regex, so that SageMaker knows how to extract them from the CloudWatch logs of the training job.
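For the bring-your-own-algorithm case, the regex-based extraction can be sketched locally with Python's `re` module. The log line and the regex below are hypothetical. The real pattern depends entirely on what your training script prints. The dictionary uses the `Name`/`Regex` keys that the SageMaker Python SDK accepts in its `metric_definitions` argument.

```python
import re

# Hypothetical metric definition in the Name/Regex shape used by metric_definitions
metric_definitions = [
    {"Name": "validation:f1", "Regex": r"validation-f1:([0-9\.]+)"},
]

# Hypothetical log line; the real format depends on what your algorithm prints
log_line = "[12] train-f1:0.8123 validation-f1:0.7456"

def extract_metric(definition, line):
    """Return the first captured group as a float, or None if the line does not match."""
    match = re.search(definition["Regex"], line)
    return float(match.group(1)) if match else None

value = extract_metric(metric_definitions[0], log_line)
print(value)
```

Testing the pattern against a sample log line like this before launching a tuning job is a cheap way to catch a regex that would never match.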



In [77]:
objective_metric_name = 'validation:f1'

Now, we'll create a HyperparameterTuner object, to which we pass:

* The XGBoost estimator we created above
* Our hyperparameter ranges
* The objective metric name
* Tuning resource configurations, such as the total number of training jobs to run and how many can run in parallel

In [83]:
tuner = HyperparameterTuner(xgb_estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=10,
                            max_parallel_jobs=4)

# max_jobs and max_parallel_jobs are set to 10 and 4 for demo purposes; increase them for a more thorough search

### Launch Hyperparameter Tuning job

Now we can launch the hyperparameter tuning job by calling the fit() function. After the job is created, we can track its progress in the SageMaker console until it completes.



In [84]:
# define the data type and paths to the SMOTE-resampled training and validation datasets
train_input = TrainingInput(train_res_data_uri, content_type="text/csv")
validation_input = TrainingInput(validation_res_data_uri, content_type="text/csv")

# launch the hyperparameter tuning job
tuner.fit({'train': train_input,
           'validation': validation_input})

# best_training_job() returns the name of the best-performing training job from the tuning run
training_smote_job_name = tuner.best_training_job()
%store training_smote_job_name

...................................................................................................................................................................................................................................!
Stored 'training_smote_job_name' (str)


### Download the Debugger XGBoost Training Report

In [89]:
report_output_path = f"{estimator_output_path}/{tuner.latest_tuning_job.name}"

In [90]:
report_output_path

's3://sagemaker-us-east-1-367158743199/sagemaker-tutorial/training_jobs/sagemaker-xgboost-210402-0144'
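Debugger rule output is usually written under the output prefix of an individual training job, in a `rule-output` folder. Assuming that layout, the report location for a given training job could be assembled as in the sketch below. The helper `debugger_rule_output_uri` and the bucket and job names are hypothetical placeholders, and you should confirm the actual location by listing the S3 prefix.

```python
def debugger_rule_output_uri(output_path, training_job_name):
    """Build the assumed S3 prefix for Debugger rule output of one training job."""
    return f"{output_path.rstrip('/')}/{training_job_name}/rule-output"

# Hypothetical values for illustration only
example_uri = debugger_rule_output_uri(
    "s3://example-bucket/sagemaker-tutorial/training_jobs/",
    "example-training-job-001",
)
print(example_uri)
```

In the notebook, the first argument would be `estimator_output_path` and the second the name of one of the training jobs launched by the tuner.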

In [82]:
xgb_estimator


<sagemaker.estimator.Estimator at 0x7f07bb11e290>