# Train a new model

In this second model, you will fix the gender imbalance in the dataset using SMOTE and train another model using XGBoost with hyperparameter tuning and Debugger. This model will also be saved to our registry and eventually approved for deployment.

In [1]:
!pip install imbalanced-learn==0.7.0

Collecting imbalanced-learn==0.7.0
  Using cached imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
Collecting scikit-learn>=0.23
  Using cached scikit_learn-0.24.1-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn, imbalanced-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22.1
    Uninstalling scikit-learn-0.22.1:
      Successfully uninstalled scikit-learn-0.22.1
Successfully installed imbalanced-learn-0.7.0 scikit-learn-0.24.1 threadpoolctl-2.1.0


In [2]:
import json
import boto3
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import Rule, rule_configs
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

In [3]:
# Set region, boto3 and SageMaker SDK variables¶

#You can change this to a region of your choice
region = sagemaker.Session().boto_region_name
print("Using AWS Region: {}".format(region))

boto3.setup_default_session(region_name=region)
boto_session = boto3.Session(region_name=region)

s3_client = boto3.client('s3', region_name=region)
sagemaker_boto_client = boto_session.client('sagemaker')

sagemaker_session = sagemaker.session.Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_boto_client)

sagemaker_role = sagemaker.get_execution_role()
account_id = boto3.client('sts').get_caller_identity()["Account"]

random_state = 42

Using AWS Region: us-east-1


In [4]:
# load stored variables
%store -r
%store

Stored variables and their in-db values:
data_prefix                      -> 'sagemaker-tutorial/data'
default_bucket                   -> 'sagemaker-us-east-1-367158743199'
header                           -> ['LABEL', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIA
hyperparameters                  -> {'max_depth': '5', 'eta': '0.2', 'gamma': '4', 'mi
local_data_dir                   -> '../data'
local_processed_path             -> '../data/df_processed.csv'
local_raw_path                   -> '../data/dataset.csv'
mpg_name                         -> 'sagemaker-tutorial'
prefix                           -> 'sagemaker-tutorial'
s3_raw_data                      -> 's3://sagemaker-us-east-1-367158743199/sagemaker-t
test_data_uri                    -> 's3://sagemaker-us-east-1-367158743199/sagemaker-t
train_data_uri                   -> 's3://sagemaker-us-east-1-367158743199/sagemaker-t
training_job_name                -> 'sagemaker-xgboost-2021-04-06-02-12-29-010'
validation_data_uri           

In [10]:
df = pd.read_csv(local_processed_path)
df = df.drop(columns=['Unnamed: 0'])

### SMOTE

One approach to addressing imbalanced datasets is to oversample the minority class. New examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.

In [11]:
sm = SMOTE(random_state=random_state)

df_resampled, _ = sm.fit_resample(df, df['SEX'])

In [12]:
# Check the gender balance
df_resampled['SEX'].value_counts()

2    18112
1    18112
Name: SEX, dtype: int64

### Saving data back to S3

In [13]:
X_train, X_test = train_test_split(df_resampled, test_size=0.2, random_state=random_state)
X_train, X_val = train_test_split(df_resampled, test_size=0.2, random_state=random_state)

In [None]:
# local_processed_data = './processed_data'

In [15]:
X_train.to_csv(f'{local_data_dir}/train_res.csv', header=False, index=False)

response = sagemaker_session.upload_data(f'{local_data_dir}/train_res.csv',
                                         bucket=default_bucket, 
                                         key_prefix=data_prefix)
train_res_data_uri = response
%store train_res_data_uri

Stored 'train_res_data_uri' (str)


In [16]:
X_val.to_csv(f'{local_data_dir}/validation_res.csv', header=False, index=False)

response = sagemaker_session.upload_data(f'{local_data_dir}/validation_res.csv',
                                         bucket=default_bucket, 
                                         key_prefix=data_prefix)
validation_res_data_uri = response
%store validation_res_data_uri

Stored 'validation_res_data_uri' (str)


In [17]:
X_test.to_csv(f'{local_data_dir}/test_res.csv', header=False, index=False)

response = sagemaker_session.upload_data(f'{local_data_dir}/test_res.csv',
                                         bucket=default_bucket, 
                                         key_prefix=data_prefix)
test_res_data_uri = response
%store test_res_data_uri

Stored 'test_res_data_uri' (str)


### Creating XGBoost model with Hyperparameter Tunining and Debugger

For SageMaker XGBoost training jobs, use the Debugger `CreateXgboostReport` rule to receive a comprehensive training report of the training progress and results. Following this guide, specify the CreateXgboostReport rule while constructing an XGBoost estimator. The `CreateXgboostReport` rule collects the following output tensors from your training job:

* hyperparameters – Saves at the first step.
* metrics – Saves loss and accuracy every 5 steps.
* feature_importance – Saves every 5 steps.
* predictions – Saves every 5 steps.
* labels – Saves every 5 steps.

In [18]:
train_instance_count = 1
train_instance_type = "ml.m4.xlarge"
content_type = "text/csv"
estimator_output_path = f's3://{default_bucket}/{prefix}/training_jobs'

# this line automatically looks for the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")


# construct a SageMaker estimator that calls the xgboost-container
xgb_estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container,
                                              hyperparameters=hyperparameters,
                                              role=sagemaker.get_execution_role(),
                                              instance_count=train_instance_count,
                                              instance_type=train_instance_type,
                                              volume_size=5,  # 5 GB
                                              output_path=estimator_output_path,
                                             )

### Set up Hyperparameter Tuning


We will tune four hyperparameters in this examples:

* eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.
* alpha: L1 regularization term on weights. Increasing this value makes models more conservative.
* min_child_weight: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the algorithm, the more conservative it is.
* max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted.

In [19]:
hyperparameter_ranges = {'max_depth': IntegerParameter(1, 10),
                         'eta': ContinuousParameter(0, 1),
                         'gamma': ContinuousParameter(0, 5),
                        'alpha': ContinuousParameter(0, 2)
                        }

Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (Regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using built-in XGBoost algorithm here, it emits two predefined metrics: validation:auc and train:auc, and we elected to monitor validation:auc as you can see below. In this case, we only need to specify the metric name and do not need to provide regex. If you bring your own algorithm, your algorithm emits metrics by itself. In that case, you'll need to add a MetricDefinition object here to define the format of those metrics through regex, so that SageMaker knows how to extract those metrics from your CloudWatch logs.



In [20]:
objective_metric_name = 'validation:f1'

Now, we'll create a HyperparameterTuner object, to which we pass:

* The XGBoost estimator we created above
* Our hyperparameter ranges
* Objective metric name and definition
* Tuning resource configurations such as Number of training jobs to run in total and how many training jobs can be run in parallel.

In [21]:
tuner = HyperparameterTuner(xgb_estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=4,
                            max_parallel_jobs=2)

# You can increase the number of jobs, etc. I set them to 10, 4 for the demo purpose

## Launch Hyperparameter Tuning job

Now we can launch a hyperparameter tuning job by calling fit() function. After the hyperparameter tuning job is created, we can go to SageMaker console to track the progress of the hyperparameter tuning job until it is completed.



In [22]:
# define the data type and paths to the training and validation datasets
train_input = TrainingInput(train_data_uri, content_type="text/csv")
validation_input = TrainingInput(validation_data_uri, content_type="text/csv")

# execute the XGBoost training job
tuner.fit({'train': train_input,
                   'validation': validation_input
                  }
                   )

........................................................................................................!


UsageError: Unknown variable 'training_smote_job_name'


In [27]:
training_smote_job_name = tuner.best_training_job()
%store training_smote_job_name

Stored 'training_smote_job_name' (str)


# Register Artifacts

### Create model from estimator

In [28]:
training_job_info = sagemaker_boto_client.describe_training_job(TrainingJobName=training_smote_job_name)

In [29]:
model_2_name = f'{prefix}-xgboost-smote'


model_matches = sagemaker_boto_client.list_models(NameContains=model_2_name)['Models']

if not model_matches:
    
    model_2 = sagemaker_session.create_model_from_job(
        name=model_2_name,
        training_job_name=training_job_info['TrainingJobName'],
        role=sagemaker_role,
        image_uri=training_job_info['AlgorithmSpecification']['TrainingImage'])
    %store model_2_name
    
else:
    
    print(f"Model {model_2_name} already exists.")

Model sagemaker-tutorial-xgboost-smote already exists.


In [30]:
model_2_name

'sagemaker-tutorial-xgboost-smote'

### Training data artifact

In [31]:
training_data_s3_uri = training_job_info['InputDataConfig'][0]['DataSource']['S3DataSource']['S3Uri']

matching_artifacts = list(sagemaker.lineage.artifact.Artifact.list(
    source_uri=training_data_s3_uri,
    sagemaker_session=sagemaker_session))

if matching_artifacts:
    training_data_artifact = matching_artifacts[0]
    print(f'Using existing artifact: {training_data_artifact.artifact_arn}')
else:
    training_data_artifact = sagemaker.lineage.artifact.Artifact.create(
        artifact_name='TrainingData',
        source_uri=training_data_s3_uri,
        artifact_type='Dataset',
        sagemaker_session=sagemaker_session)
    print(f'Create artifact {training_data_artifact.artifact_arn}: SUCCESSFUL')

Using existing artifact: arn:aws:sagemaker:us-east-1:367158743199:artifact/46e81ac5bb720131c95259d4f2325499


### Model artifact

In [32]:
trained_model_s3_uri = training_job_info['ModelArtifacts']['S3ModelArtifacts']

matching_artifacts = list(sagemaker.lineage.artifact.Artifact.list(
    source_uri=trained_model_s3_uri,
    sagemaker_session=sagemaker_session))

if matching_artifacts:
    model_artifact = matching_artifacts[0]
    print(f'Using existing artifact: {model_artifact.artifact_arn}')
else:
    model_artifact = sagemaker.lineage.artifact.Artifact.create(
        artifact_name='TrainedModel',
        source_uri=trained_model_s3_uri,
        artifact_type='Model',
        sagemaker_session=sagemaker_session)
    print(f'Create artifact {model_artifact.artifact_arn}: SUCCESSFUL')

Using existing artifact: arn:aws:sagemaker:us-east-1:367158743199:artifact/c627bbcebccfc80216660daf15fda7ba


### Set artifact associations¶


In [33]:
trial_component = sagemaker_boto_client.describe_trial_component(TrialComponentName=tuner.best_training_job()+'-aws-training-job')
trial_component_arn = trial_component['TrialComponentArn']

In [34]:
# Input artifacts
input_artifacts = [training_data_artifact]

for a in input_artifacts:
    try:
        sagemaker.lineage.association.Association.create(
            source_arn=a.artifact_arn,
            destination_arn=trial_component_arn,
            association_type='ContributedTo',
            sagemaker_session=sagemaker_session)
        print(f"Associate {trial_component_arn} and {a.artifact_arn}: SUCCEESFUL\n")
    except:
        print(f"Association already exists between {trial_component_arn} and {a.artifact_arn}.\n")


Association already exists between arn:aws:sagemaker:us-east-1:367158743199:experiment-trial-component/sagemaker-xgboost-210406-0236-004-a3848c52-aws-training-job and arn:aws:sagemaker:us-east-1:367158743199:artifact/46e81ac5bb720131c95259d4f2325499.



In [35]:
# Output artifacts

output_artifacts = [model_artifact]

for artifact_arn in output_artifacts:
    try:
        sagemaker.lineage.association.Association.create(
            source_arn=a.artifact_arn,
            destination_arn=trial_component_arn,
            association_type='Produced',
            sagemaker_session=sagemaker_session)
        print(f"Associate {trial_component_arn} and {a.artifact_arn}: SUCCEESFUL\n")
    except:
        print(f"Association already exists between {trial_component_arn} and {a.artifact_arn}.\n")

Association already exists between arn:aws:sagemaker:us-east-1:367158743199:experiment-trial-component/sagemaker-xgboost-210406-0236-004-a3848c52-aws-training-job and arn:aws:sagemaker:us-east-1:367158743199:artifact/46e81ac5bb720131c95259d4f2325499.



## Create Model Package for the Second Trained Model

In [36]:
model_metrics_report = {'classification_metrics': {}}
for metric in training_job_info['FinalMetricDataList']:
    stat = {metric['MetricName']: {'value': metric['Value']}}
    model_metrics_report['classification_metrics'].update(stat)
    
with open('training_metrics.json', 'w') as f:
    json.dump(model_metrics_report, f)
    
metrics_s3_key = f"{prefix}/training_jobs/{training_job_info['TrainingJobName']}/training_metrics.json"
s3_client.upload_file(Filename='training_metrics.json', Bucket=default_bucket, Key=metrics_s3_key)


In [37]:
model_metrics = {
    'ModelQuality': {
        'Statistics': {
            'ContentType': 'application/json',
            'S3Uri': f's3://{default_bucket}/{prefix}/{metrics_s3_key}'
        }
    }
}

In [38]:
inference_spec ={    
    "InferenceSpecification": {
        "Containers" : [{
            "Image": training_job_info['AlgorithmSpecification']['TrainingImage'],
            "ModelDataUrl": training_job_info['ModelArtifacts']['S3ModelArtifacts']
        }],
        "SupportedTransformInstanceTypes": ["ml.m4.xlarge"],
        "SupportedRealtimeInferenceInstanceTypes": ["ml.m4.xlarge"],
        "SupportedContentTypes": ['text/csv'],
        "SupportedResponseMIMETypes": ['text/csv']
    }
}



# {'ModelDataUrl': }

### Register second model package to Model Package Group

In [39]:

mp_input_dict = {
    'ModelPackageGroupName': mpg_name,
    'ModelPackageDescription': 'XGBoost classifier with SMOTE',
    'ModelApprovalStatus': 'PendingManualApproval',
    'ModelMetrics': model_metrics
}

mp_input_dict.update(inference_spec)
mp2_response = sagemaker_boto_client.create_model_package(**mp_input_dict)
mp2_arn = mp2_response['ModelPackageArn']
%store mp2_arn

Stored 'mp2_arn' (str)


In [40]:
# Check status of model package creation¶

mp_info = sagemaker_boto_client.describe_model_package(ModelPackageName=mp2_response['ModelPackageArn'])
mp_status = mp_info['ModelPackageStatus']

while mp_status not in ['Completed', 'Failed']:
    time.sleep(5)
    mp_info = sagemaker_boto_client.describe_model_package(ModelPackageName=mp2_response['ModelPackageArn'])
    mp_status = mp_info['ModelPackageStatus']
    print(f'model package status: {mp_status}')
print(f'model package status: {mp_status}')

model package status: Completed


### View both models in the registry

In [41]:
sagemaker_boto_client.list_model_packages(ModelPackageGroupName=mpg_name)['ModelPackageSummaryList']

[{'ModelPackageGroupName': 'sagemaker-tutorial',
  'ModelPackageVersion': 5,
  'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:367158743199:model-package/sagemaker-tutorial/5',
  'ModelPackageDescription': 'XGBoost classifier with SMOTE',
  'CreationTime': datetime.datetime(2021, 4, 6, 2, 50, 10, 620000, tzinfo=tzlocal()),
  'ModelPackageStatus': 'Completed',
  'ModelApprovalStatus': 'PendingManualApproval'},
 {'ModelPackageGroupName': 'sagemaker-tutorial',
  'ModelPackageVersion': 4,
  'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:367158743199:model-package/sagemaker-tutorial/4',
  'ModelPackageDescription': 'XGBoost classifier to detect insurance fraud.',
  'CreationTime': datetime.datetime(2021, 4, 6, 2, 32, 39, 455000, tzinfo=tzlocal()),
  'ModelPackageStatus': 'Completed',
  'ModelApprovalStatus': 'PendingManualApproval'},
 {'ModelPackageGroupName': 'sagemaker-tutorial',
  'ModelPackageVersion': 3,
  'ModelPackageArn': 'arn:aws:sagemaker:us-east-1:367158743199:model-package/sage