### Tracking and organizing training and tuning jobs with Amazon SageMaker Experiments

This notebook demonstrates how to use the SageMaker Experiments capability to organize, track, compare, and evaluate your machine learning (ML) model training experiments.


### Overview

1. Set up
2. Create a SageMaker Experiment
3. Train XGBoost regression model as part of the Experiment
4. Visualize results from the Experiment.

### 1. Set up

In [18]:
#Install the sagemaker experiments SDK
!pip install sagemaker-experiments



#### 1.1 Import libraries

In [19]:
import time

import boto3
import numpy as np
import pandas as pd
from IPython.display import set_matplotlib_formats
from matplotlib import pyplot as plt
import datetime

import sagemaker
from sagemaker import get_execution_role
from sagemaker.session import Session
from sagemaker.analytics import ExperimentAnalytics
from sagemaker.inputs import TrainingInput

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

region = 'us-west-2'

set_matplotlib_formats('retina')

In [20]:
sess = boto3.Session()
sm = sess.client('sagemaker')
role = get_execution_role()

#### 1.2 S3 paths to training and validation data and output paths

In [21]:
# define the data type and paths to the training and validation datasets
content_type = "csv"

#s3_bucket = 'bestpractices-bucket-sm'
#s3_prefix = 'prepared_parquet4'

#Set the s3_bucket to the correct bucket name created in your datascience environment
s3_bucket = 'datascience-environment-notebookinstance--06dc7a0224df'
s3_prefix = 'prepared'

train_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'train'), content_type=content_type, distribution='ShardedByS3Key')
validation_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'validation'), content_type=content_type, distribution='ShardedByS3Key')
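Both channel URIs above follow the same `s3://<bucket>/<prefix>/<channel>/` pattern. A minimal sketch of a helper that builds them in one place; `channel_uri` is a hypothetical name used for illustration and is not part of the SageMaker SDK.

```python
def channel_uri(bucket: str, prefix: str, channel: str) -> str:
    """Return the S3 URI for one data channel, ending in a slash.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    # Strip stray slashes so "prepared/" and "prepared" give the same URI.
    parts = [part.strip("/") for part in (bucket, prefix, channel)]
    return "s3://" + "/".join(parts) + "/"


# Bucket and prefix as set in the notebook cell above.
s3_bucket = "datascience-environment-notebookinstance--06dc7a0224df"
s3_prefix = "prepared"

train_uri = channel_uri(s3_bucket, s3_prefix, "train")
validation_uri = channel_uri(s3_bucket, s3_prefix, "validation")
print(train_uri)
print(validation_uri)
```

The resulting strings could be passed to `TrainingInput` in place of the inline `format` calls.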

Now let's track the parameters from the training step.

In [22]:
with Tracker.create(display_name="Training", sagemaker_boto_client=sm) as tracker:
    tracker.log_parameters({"learning_rate": 1.0, "dropout": 0.5})
    
    # we can log the location of the training dataset
    tracker.log_input(name="weather-training-dataset", media_type="s3/uri", value="s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'train'))

### 2. Set up the Experiment

Create an experiment to track all the model training iterations. Use Experiments to organize your data science work.

#### 2.1 Create an Experiment

In [23]:
weather_experiment = Experiment.create(
    experiment_name=f"weather-experiment-{int(time.time())}",
    description="Prediction of weather quality",
    sagemaker_boto_client=sm)
print(weather_experiment)

Experiment(sagemaker_boto_client=<botocore.client.SageMaker object at 0x7f8aea8c7320>,experiment_name='weather-experiment-1628392649',description='Prediction of weather quality',tags=None,experiment_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment/weather-experiment-1628392649',response_metadata={'RequestId': '3ae94205-d193-460a-acab-4a89d547bc2e', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '3ae94205-d193-460a-acab-4a89d547bc2e', 'content-type': 'application/x-amz-json-1.1', 'content-length': '101', 'date': 'Sun, 08 Aug 2021 03:17:29 GMT'}, 'RetryAttempts': 0})


#### 2.2 Track Experiment

Now create a Trial for each training run to track its inputs, parameters, and metrics.

While training the XGBoost model on SageMaker, we will experiment with several values of the max_depth hyperparameter. We will create a Trial to track each training job run. We will also take the TrialComponent from the tracker we created before and add it to the Trial.

Note the execution of the following code takes a while.

In [24]:
# Keep track of the trials, keyed by max_depth
max_depth_trial_name_map = {}
# Keep track of the training jobs launched, so we can check that they are complete before analyzing the experiment results.
training_jobs = []
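Every trial and training job needs a unique name, which this notebook gets by appending an epoch timestamp. A minimal sketch of that naming scheme, with a sanity check on the job name. The helper names are hypothetical, and the 63-character limit and character pattern reflect the CreateTrainingJob API reference as I understand it, so verify them before relying on the check.

```python
import re
import time

# Assumed constraints on training job names: at most 63 characters,
# alphanumerics and hyphens, starting and ending with an alphanumeric.
MAX_JOB_NAME_LENGTH = 63
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")


def make_trial_name(max_depth: int, timestamp: int) -> str:
    """Build a trial name that encodes the max_depth value being tried."""
    return f"xgboost-training-job-trial-{max_depth}-max-depth-{timestamp}"


def make_job_name(timestamp: int) -> str:
    """Build a training job name and check it against the assumed constraints."""
    name = f"xgboost-training-job-{timestamp}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.match(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name


now = int(time.time())
trial_names = {depth: make_trial_name(depth, now) for depth in (2, 5)}
print(trial_names)
print(make_job_name(now))
```

Because the trial name includes the max_depth value, two trials created within the same second still get distinct names. Training job names carry only the timestamp, so jobs launched in the same second would collide.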


### 3. Train XGBoost regression model as part of the Experiment

In [25]:
training_instance_type='ml.m5.12xlarge'
# Explore two different values of the max_depth hyperparameter for the XGBoost model
for i, max_depth in enumerate([2, 5]):
    # create trial
    trial_name = f"xgboost-training-job-trial-{max_depth}-max-depth-{int(time.time())}"
    xgboost_trial = Trial.create(
        trial_name=trial_name, 
        experiment_name=weather_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )
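    max_depth_trial_name_map[max_depth] = trial_name

    # attach the trial component created by the tracker earlier (parameters and dataset location)
    xgboost_trial.add_trial_component(tracker.trial_component)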
    
    # initialize hyperparameters
    hyperparameters = {
        "max_depth": max_depth,
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"5"}

    #set an output path where the trained model will be saved
    output_prefix = 'weather-experiments'
    output_path = 's3://{}/{}/{}/output'.format(s3_bucket, output_prefix, 'xgboost')

    # Look up the URI of the SageMaker built-in XGBoost container image for this region.
    # Change the version ("1.2-1") if you prefer a different release.
    xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

    # construct a SageMaker estimator that calls the xgboost-container
    estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type=training_instance_type,   
                                          volume_size=200,  # training volume size in GB
                                          output_path=output_path)

    xgboost_training_job_name = "xgboost-training-job-{}".format(int(time.time()))
    
    training_jobs.append(xgboost_training_job_name)
    
    # Now associate the estimator with the Experiment and Trial
    estimator.fit(
        inputs={'train': train_input}, 
        job_name=xgboost_training_job_name,
        experiment_config={
            "TrialName": xgboost_trial.trial_name,
            "TrialComponentDisplayName": "Training"
        },
        wait=False, #Don't wait for the training job to be completed
    )
    
    # Wait before launching the next training job
    time.sleep(2)

INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: xgboost-training-job-1628392650
INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: xgboost-training-job-1628392652
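The training loop above varies only max_depth. A sketch of how the same pattern could be extended to a small grid of settings, with one Trial per combination. The fixed values are copied from the notebook's hyperparameters dict; the `expand_grid` helper and the extra eta value are assumptions added for illustration.

```python
from itertools import product

# Fixed settings copied from the notebook's hyperparameters dict.
base_hyperparameters = {
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "5",
}

# Values to sweep. The eta value "0.1" is an illustrative assumption;
# the notebook itself only uses eta "0.2" and max_depth 2 and 5.
search_space = {
    "max_depth": [2, 5],
    "eta": ["0.1", "0.2"],
}


def expand_grid(base, space):
    """Yield one full hyperparameter dict per combination in `space`."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield {**base, **dict(zip(keys, values))}


grid = list(expand_grid(base_hyperparameters, search_space))
for hyperparameters in grid:
    print(hyperparameters["max_depth"], hyperparameters["eta"])
```

Each dict in `grid` could then be passed as `hyperparameters` to its own Estimator inside the loop, with a trial name that encodes the combination.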


In [26]:
max_depth_trial_name_map

{2: 'xgboost-training-job-trial-2-max-depth-1628392649',
 5: 'xgboost-training-job-trial-5-max-depth-1628392652'}

In [27]:
# Quick check of the trials in the experiment
trials = weather_experiment.list_trials()
for trial in trials:
    print(trial)

TrialSummary(trial_name='xgboost-training-job-trial-5-max-depth-1628392652',trial_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment-trial/xgboost-training-job-trial-5-max-depth-1628392652',display_name='xgboost-training-job-trial-5-max-depth-1628392652',creation_time=datetime.datetime(2021, 8, 8, 3, 17, 32, 730000, tzinfo=tzlocal()),last_modified_time=datetime.datetime(2021, 8, 8, 3, 17, 32, 730000, tzinfo=tzlocal()))
TrialSummary(trial_name='xgboost-training-job-trial-2-max-depth-1628392649',trial_arn='arn:aws:sagemaker:us-west-2:802439482869:experiment-trial/xgboost-training-job-trial-2-max-depth-1628392649',display_name='xgboost-training-job-trial-2-max-depth-1628392649',creation_time=datetime.datetime(2021, 8, 8, 3, 17, 29, 979000, tzinfo=tzlocal()),last_modified_time=datetime.datetime(2021, 8, 8, 3, 17, 29, 979000, tzinfo=tzlocal()))


In [None]:
# Wait until the training jobs reach a terminal state.
# "Stopped" is included so that a manually stopped job does not loop forever.
terminal_statuses = ("Completed", "Failed", "Stopped")
for training_job in training_jobs:
    print("Training job name: " + training_job)
    description = sm.describe_training_job(TrainingJobName=training_job)
    print("Status : " + description["TrainingJobStatus"])

    while description["TrainingJobStatus"] not in terminal_statuses:
        time.sleep(15)
        description = sm.describe_training_job(TrainingJobName=training_job)
        print("Status {}".format(description["TrainingJobStatus"]))

Training job name: xgboost-training-job-1628392650
Status : InProgress
Status InProgress
Status InProgress
Status InProgress
Status InProgress
Status InProgress
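The polling loop above can be factored into a reusable helper with a timeout. This is a minimal sketch, not the notebook's own method: `wait_for_training_job` is a hypothetical name, and the `describe` callable stands in for `sm.describe_training_job` so that the example runs without AWS access.

```python
import time
from typing import Callable, Dict

# Statuses after which a training job will not change any more.
TERMINAL_STATUSES = frozenset({"Completed", "Failed", "Stopped"})


def wait_for_training_job(
    describe: Callable[[], Dict],
    poll_seconds: float = 15.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `describe` until the job reaches a terminal status, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Stand-in for sm.describe_training_job, so the sketch runs offline.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_training_job(
    describe=lambda: {"TrainingJobStatus": next(fake_statuses)},
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(final_status)
```

In the notebook, the call would look like `wait_for_training_job(lambda: sm.describe_training_job(TrainingJobName=training_job))`. The boto3 SageMaker client also offers a built-in waiter, `sm.get_waiter("training_job_completed_or_stopped")`, which may be simpler if no custom status printing is needed.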


### 4. Visualize results from the Experiment.
Compare the training runs in the experiment using the analytics capabilities of the SageMaker Python SDK, which let you query the runs and identify the best model the experiment produced. You can also retrieve trial components by using a search expression.

In [None]:
experiment_name = weather_experiment.experiment_name
experiment_name

In [None]:
from sagemaker.analytics import ExperimentAnalytics
sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = Session(sess)

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session, experiment_name=experiment_name
)
trial_comp_ds_jobs = trial_component_analytics.dataframe()
trial_comp_ds_jobs

The results show the RMSE metrics for each hyperparameter setting tried as part of the Experiment.
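To pick the best run programmatically, the rows of the results table can be ranked by an RMSE column. A minimal sketch under stated assumptions: the column name `train:rmse - Last` is a guess at how the metric is labeled in the DataFrame (check `trial_comp_ds_jobs.columns` for the real name), `best_row` is a hypothetical helper, and the rows below are placeholders, not results from this experiment.

```python
import math
from typing import Dict, List, Optional


def best_row(rows: List[Dict], metric_column: str) -> Optional[Dict]:
    """Return the row with the smallest finite value in `metric_column`."""
    scored = [
        row for row in rows
        if isinstance(row.get(metric_column), (int, float))
        and math.isfinite(row[metric_column])
    ]
    return min(scored, key=lambda row: row[metric_column]) if scored else None


# Placeholder rows, NOT real results from this experiment.
rows = [
    {"max_depth": 2.0, "train:rmse - Last": 2.0},
    {"max_depth": 5.0, "train:rmse - Last": 1.0},
    # Rows without the metric (for example, a non-training component) are skipped.
    {"max_depth": None, "train:rmse - Last": float("nan")},
]

best = best_row(rows, "train:rmse - Last")
print(best)
```

With the real data, the rows could come from `trial_comp_ds_jobs.to_dict("records")`.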