## Task 2: Use SageMaker Experiments

In this lab, you set up an experiment using Amazon SageMaker Experiments. You train a machine learning (ML) model using XGBoost, perform hyperparameter tuning to test multiple hyperparameter settings and produce a more accurate model, and evaluate your model’s performance.

### Task 2.1: Using the Amazon CodeGuru security scan extension

In this task, you configure Amazon CodeGuru to perform scans of the notebook at regular intervals. 

Automatic code scans are disabled by default. The scans can be run automatically at regular intervals or manually run from any code cell. 

1. Choose **Settings** in the top navigation bar.
1. Choose **Advanced Settings Editor**.

SageMaker Studio displays the *Settings* tab.

1. In the left navigation bar, choose **Amazon CodeGuru**.
1. From the **Auto scans** drop-down menu, select **Enabled**.

Once enabled, automatic scans run every 240 seconds by default. Accept the default value for this lab. 

1. Close the *Settings* tab and return to the tab with the *lab_6.ipynb* notebook.




### Task 2.2: Set up the environment

Before you start training your model, import the required libraries and define the session variables that the rest of the notebook uses.


Refer to [Manage Machine Learning with Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html#experiments-features) to learn more about the features of SageMaker Experiments.

In [2]:
#install-dependencies

import boto3
import io
import json
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import sagemaker
import sys
import time
import zipfile

from IPython.display import display
from IPython.display import Image
from sagemaker.analytics import ExperimentAnalytics
from sagemaker.inputs import TrainingInput
from sagemaker.session import Session
from sagemaker.experiments.run import Run, load_run
#from sagemaker.utils import unique_name_from_base  #could be used instead of the date-time append approach, to create a unique Experiment name.
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.xgboost.estimator import XGBoost
from time import gmtime, strftime

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sess = boto3.Session()
sm = sess.client('sagemaker')
bucket = sagemaker.Session().default_bucket()
prefix = 'sagemaker/mlasms'

INFO:sagemaker:Created S3 bucket: sagemaker-us-west-2-404554437481


CodeGuru security scans can be started manually from within any code cell in a notebook. 

Start a manual code scan.

1. Select the previous code cell.
1. Open the context (right-click) menu and then choose **Run CodeGuru scan**.

**CodeGuru: Scan in progress** is displayed at the bottom edge of the SageMaker Studio environment while the scan is running. It disappears when the scan is finished.

1. Select the previous code cell.
1. Open the context (right-click) menu and then choose **Show diagnostics panel**.

SageMaker Studio opens the *Diagnostics Panel* tab on the bottom half of the environment.

The *Diagnostics Panel* displays a message, "Issue: Notebook best practice violation Suggested remediation: Import statements found beyond the first cell of the notebook." You do not need to fix this issue for this lab.

Refer to [Types of Amazon CodeGuru Extension recommendations](https://docs.aws.amazon.com/codeguru/latest/security-ug/recommendation-types-extension.html) for more information about the types of recommendations that the CodeGuru scans can make about your code.

1. Close the *Diagnostics Panel* tab.

Next, import the dataset.

In [3]:
#import-dataset
lab_test_data = pd.read_csv('adult_data_processed.csv')
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 20)
lab_test_data.head()

Unnamed: 0,income,age,workclass,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week
0,0,39,1,2,2,1,2,1,0,0,2174,0,40
1,0,50,2,2,2,0,2,0,0,0,0,0,13
2,0,38,0,0,0,2,0,1,0,0,0,0,40
3,0,53,0,3,6,0,0,0,1,0,0,0,40
4,0,28,0,2,2,0,3,4,1,1,0,0,40


You split the dataset into training (70 percent), validation (20 percent), and test (10 percent) datasets. The training and validation datasets are used during training. The test dataset is used to evaluate the model after deployment.
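The 70/20/10 proportions come from two cut points, at 70 percent and 90 percent of the shuffled rows. The following is a minimal standard-library sketch of that arithmetic. The helper name `split_70_20_10` is hypothetical and stands in for the pandas and NumPy calls that the lab uses.

```python
import random


def split_70_20_10(rows, seed=1729):
    """Shuffle the rows, then cut them at 70% and 90% of the total length."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n = len(shuffled)
    first_cut, second_cut = int(0.7 * n), int(0.9 * n)
    return (
        shuffled[:first_cut],            # training: first 70%
        shuffled[first_cut:second_cut],  # validation: next 20%
        shuffled[second_cut:],           # test: final 10%
    )


train, validation, test = split_70_20_10(range(1000))
print(len(train), len(validation), len(test))
```

Shuffling before cutting matters: without it, any ordering in the source file would leak into the three subsets.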

To train using Amazon SageMaker, you need to convert the datasets into either the libSVM or CSV format. This lab uses the CSV format for training. 

Refer to [XGBoost Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) for information about the XGBoost algorithm. 
Refer to [Input/Output Interface for the XGBoost Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#InputOutput-XGBoost) for more information about the supported input and output formats.
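For CSV training input, the SageMaker built-in XGBoost algorithm expects the label in the first column and no header row. The following sketch shows that layout using only the standard library. The helper `to_xgboost_csv` and the sample records are hypothetical and only illustrate the file shape.

```python
import csv
import io


def to_xgboost_csv(records, label_key):
    """Serialize dict records as headerless CSV with the label in column one."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        features = [value for key, value in record.items() if key != label_key]
        writer.writerow([record[label_key], *features])
    return buffer.getvalue()


# Illustrative records only; the values are not taken from the lab dataset.
rows = [
    {"age": 39, "income": 0, "hours_per_week": 40},
    {"age": 50, "income": 1, "hours_per_week": 13},
]
print(to_xgboost_csv(rows, label_key="income"))
```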


In [6]:
#split-dataset
train_data, validation_data, test_data = np.split(
    lab_test_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(lab_test_data)), int(0.9 * len(lab_test_data))],
)

train_data.to_csv('train_data.csv', index=False, header=False)
validation_data.to_csv('validation_data.csv', index=False, header=False)

You have created two dataset files, named *train_data.csv* and *validation_data.csv*. 
Upload these dataset files to Amazon Simple Storage Service (Amazon S3).

In [7]:
#upload-dataset
from sagemaker.s3 import S3Uploader
from sagemaker.inputs import TrainingInput

sagemaker_session = sagemaker.Session()

train_path = S3Uploader.upload('train_data.csv', 's3://{}/{}'.format(bucket, prefix))
validation_path = S3Uploader.upload('validation_data.csv', 's3://{}/{}'.format(bucket, prefix))

train_input = TrainingInput(train_path, content_type='text/csv')
validation_input = TrainingInput(validation_path, content_type='text/csv')

data_inputs = {
    'train': train_input,
    'validation': validation_input
}
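Each uploaded file lands under the bucket and prefix that you pass in, and the upload call returns the resulting S3 URI. As a rough sketch of how that URI is composed (the helper `s3_destination` and the bucket name `example-bucket` are assumptions for illustration and are not part of the SageMaker SDK):

```python
import posixpath


def s3_destination(bucket, prefix, local_path):
    """Return the S3 URI a local file would receive under s3://bucket/prefix/."""
    filename = posixpath.basename(local_path)
    return "s3://{}/{}".format(bucket, posixpath.join(prefix, filename))


print(s3_destination("example-bucket", "sagemaker/mlasms", "train_data.csv"))
```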

### Task 2.3: Create an experiment and run an initial training job

Use SageMaker Experiments to organize, track, compare, and evaluate ML model training experiments through various training components. In SageMaker Experiments, these components include datasets, algorithms, hyperparameters, and metrics. Refer to [Manage Machine Learning with Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) for more information about SageMaker Experiments.

In this task, you complete the following:
- Create and track the experiment in Amazon SageMaker Studio.
- Create a run to track inputs, parameters, and metrics.

First, create a name for the experiment and give it a description.

In [8]:
#create unique experiment name
create_date = strftime("%m%d%H%M")

lab_6_experiment_name = "lab-6-{}".format(create_date)
description = "Using SageMaker Experiments with the Adult dataset."
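The `%m%d%H%M` suffix has one-minute resolution, so two experiments created within the same minute would receive the same name. The following sketch adds seconds to the suffix and checks the result against a naming rule. The helper `make_experiment_name` is hypothetical, and the pattern is an assumption (alphanumeric characters and hyphens, starting with an alphanumeric character, up to 120 characters) that you should confirm against the SageMaker documentation.

```python
import re
from time import gmtime, strftime

# Assumed naming rule: letters, digits, and hyphens; max 120 characters.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,119}$")


def make_experiment_name(base, when=None):
    """Append a UTC timestamp with second resolution to a base name."""
    when = gmtime() if when is None else when
    name = "{}-{}".format(base, strftime("%m%d%H%M%S", when))
    if not NAME_PATTERN.match(name):
        raise ValueError("invalid experiment name: {}".format(name))
    return name


print(make_experiment_name("lab-6"))
```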




Then, define the optional values for a run name and tags.

In [9]:
# create initial run_name
run_name = "lab-6-run-{}".format(create_date)

# define a run_tag
run_tags = [{'Key': 'lab-6', 'Value': 'lab-6-run'}]

print(f"Experiment name - {lab_6_experiment_name},  run name - {run_name}")

Experiment name - lab-6-02122149,  run name - lab-6-run-02122149


### Task 2.4: Train and tune the model using the XGBoost algorithm

The experiment is set up and ready for training. After training is complete, you can analyze the results in SageMaker Studio. In this task, you do the following: 

- Train the XGBoost model.
- Analyze the experiments in SageMaker Studio.
- Tune the model with hyperparameters.
- Analyze the tuning results in SageMaker Studio.

### Task 2.5: Train the XGBoost Model

Now train the model using the XGBoost algorithm and the experiment that you created. 

The hyperparameters that you set are as follows:
- **eta**: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative.
- **gamma**: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is.
- **max_depth**: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
- **min_child_weight**: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.
- **num_round**: The number of rounds (trees) used for boosting. Increasing the number of trees can increase the model accuracy but also increases the risk of overfitting.
- **objective**: Specifies the learning task and the corresponding learning objective.
- **subsample**: Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees. This helps prevent overfitting.
- **verbosity**: Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), and 3 (debug).

The training takes approximately 3–4 minutes to run.

Refer to [XGBoost Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) for more information about XGBoost hyperparameters.
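To see how **gamma** and **min_child_weight** make tree growth more conservative, the following simplified sketch applies the split gain formula from the XGBoost paper to summed gradients and hessians. It is an illustration only, not XGBoost's implementation. The function names are hypothetical, and `reg_lambda=1.0` is assumed as the L2 regularization term.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def split_allowed(g_left, h_left, g_right, h_right, gamma, min_child_weight,
                  reg_lambda=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Same candidate split, evaluated under increasingly conservative settings.
print(split_allowed(-10, 8, 10, 8, gamma=4, min_child_weight=6))
print(split_allowed(-10, 8, 10, 8, gamma=4, min_child_weight=9))
print(split_allowed(-10, 8, 10, 8, gamma=12, min_child_weight=6))
```

Raising either value rejects splits that a looser setting would accept, which is what "more conservative" means in the list above.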

In [None]:
from sagemaker import image_uris
container = image_uris.retrieve(framework='xgboost',region=boto3.Session().region_name,version='1.5-1')

# initialize hyperparameters
eta=0.2
gamma=4
max_depth=5
min_child_weight=6
num_round=800
objective='binary:logistic'
subsample=0.8
verbosity=0

hyperparameters = {
        "max_depth":max_depth,
        "eta":eta,
        "gamma":gamma,
        "min_child_weight":min_child_weight,
        "subsample":subsample,
        "verbosity":verbosity,
        "objective":objective,
        "num_round":num_round
}

# Set up the estimator
xgb = sagemaker.estimator.Estimator(
    container,
    role, 
    instance_count=1, 
    instance_type='ml.m5.xlarge',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    sagemaker_session=sagemaker_session,
    enable_sagemaker_metrics=True,
    hyperparameters=hyperparameters,
    tags = run_tags
)


# Run the training job and link it to the experiment.
with Run(
    experiment_name=lab_6_experiment_name,
    run_name=run_name,
    tags=run_tags,
    sagemaker_session=sagemaker_session,
) as run:

    run.log_parameters({
                        "eta": eta, 
                        "gamma": gamma, 
                        "max_depth": max_depth,
                        "min_child_weight": min_child_weight,
                        "num_round": num_round,
                        "objective": objective,
                        "subsample": subsample,
                        "verbosity": verbosity
                       })
    
    # You can also log metrics explicitly, for example:
    # run.log_metric(name="", value=x)

    # Train the model, associating the training job with the current experiment run
    xgb.fit(
        inputs=data_inputs
    )

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-02-12-21-49-19-851


2024-02-12 21:49:19 Starting - Starting the training job...
2024-02-12 21:49:36 Starting - Preparing the instances for training...

### Task 2.6: Evaluate model performance pre-tuning

In SageMaker Studio, you can create charts to evaluate your training jobs. For example, after running lab-6 experiments, you can review the validation:logloss_max value in a chart format.

In this lab, you can plot additional metrics in the notebook.

In [None]:
#visualize-training-results-table
run_component_analytics = ExperimentAnalytics(
    experiment_name=lab_6_experiment_name,
    sagemaker_session=Session(sess, sm),
)
run_component_analytics.dataframe()["validation:logloss - Last"].plot(kind="bar", title="validation:logloss - Last", xlabel="training job", ylabel="logloss")
plt.show()
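For a binary classifier, log loss is the average negative log-likelihood of the true labels under the predicted probabilities, so lower values are better. The following is a minimal sketch of that calculation. The function name `binary_logloss` and the sample predictions are illustrative and are not taken from the lab's training job.

```python
import math


def binary_logloss(labels, probabilities, eps=1e-15):
    """Average negative log-likelihood of binary labels given predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Predictions that are confident and correct score lower than hesitant ones.
confident = binary_logloss([1, 0, 1], [0.9, 0.1, 0.8])
hesitant = binary_logloss([1, 0, 1], [0.6, 0.4, 0.5])
print(confident < hesitant)
```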

### Task 2.7: Tune the model with hyperparameters

You have successfully performed model training using SageMaker Experiments. Hyperparameters can significantly affect the performance of a trained model, and SageMaker Studio includes common hyperparameter tuning options for model training. How effective a given set of values is depends on the dataset, and testing many combinations manually can take a significant amount of time and effort.

SageMaker automatic model tuning automates the selection of hyperparameters to optimize training. Refer to [Perform Automatic Model Tuning with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) for more information about automatic model tuning. To use it, you specify a range, or a list of possible values, for each hyperparameter that you choose to tune. SageMaker automatic model tuning automatically runs multiple training jobs with various hyperparameter settings. It then evaluates the results of each job based on a specified objective metric and selects the hyperparameter settings for future attempts based on previous results. For each tuning job, you specify a maximum number of training jobs, and the tuning completes when that number has been reached.
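The loop that a tuning job runs can be sketched in a few lines: sample settings from each range, score them, and keep the best result until the job budget is used up. SageMaker's default strategy is Bayesian optimization, which uses earlier results to choose later settings. The sketch below swaps in plain random search to stay short. The `toy_objective` function is a hypothetical stand-in for a training job's validation metric.

```python
import random


def random_search(ranges, evaluate, max_jobs, seed=0):
    """Try max_jobs random settings and return the best (score, params) plus the history."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_jobs):
        params = {}
        for name, (low, high) in ranges.items():
            if isinstance(low, int) and isinstance(high, int):
                params[name] = rng.randint(low, high)   # integer range, inclusive
            else:
                params[name] = rng.uniform(low, high)   # continuous range
        history.append((evaluate(params), params))
    best = max(history, key=lambda item: item[0])       # objective type: Maximize
    return best, history


def toy_objective(params):
    # Hypothetical metric that peaks near eta=0.3 and max_depth=5.
    return 1.0 - abs(params["eta"] - 0.3) - 0.01 * abs(params["max_depth"] - 5)


ranges = {"eta": (0.0, 1.0), "max_depth": (1, 10)}
(best_score, best_params), history = random_search(ranges, toy_objective, max_jobs=12)
print(best_score, best_params)
```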

The hyperparameter ranges that you need to set are as follows:
- **alpha**: L1 regularization term on weights. Increasing this value makes models more conservative.
- **eta**: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter shrinks the feature weights to make the boosting process more conservative.
- **max_depth**: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
- **min_child_weight**: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression models, this corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.
- **num_round**: The number of rounds (trees) used for boosting. Increasing the number of trees can increase the model accuracy but also increases the risk of overfitting.

Tuning takes approximately 5 minutes to complete.

Refer to [XGBoost Hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) for more information about XGBoost hyperparameters.

In [None]:
#tune-model
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

# Set up the hyperparameter ranges
hyperparameter_ranges = {
    'alpha': ContinuousParameter(0, 2),
    'eta': ContinuousParameter(0, 1),
    'max_depth': IntegerParameter(1, 10),
    'min_child_weight': ContinuousParameter(1, 10),
    'num_round': IntegerParameter(100, 1000)
}
# Define the target metric and the objective type (max/min)
objective_metric_name = 'validation:auc'
objective_type='Maximize'
# Define the HyperparameterTuner
tuner = HyperparameterTuner(
    estimator = xgb,
    objective_metric_name = objective_metric_name,
    hyperparameter_ranges = hyperparameter_ranges,
    objective_type = objective_type,
    max_jobs=12,
    max_parallel_jobs=4,
    early_stopping_type='Auto',
)

with load_run(sagemaker_session=sagemaker_session, experiment_name=lab_6_experiment_name, run_name=run_name) as run:
    # Tune the model
    tuner.fit(
        inputs=data_inputs,
        job_name=lab_6_experiment_name,
    )
    run_component_analytics = ExperimentAnalytics(
        experiment_name=lab_6_experiment_name,
        sagemaker_session=Session(sess, sm),
    )

run_component_analytics.dataframe()["validation:logloss - Last"].plot(kind="bar", title="validation:logloss - Last", xlabel="training job", ylabel="logloss")
plt.show()

### Task 2.8: Evaluate model performance post-tuning

In SageMaker Studio, you can also create charts to evaluate your tuning jobs. For example, after running your lab-6-trial training job, you can look at your objective value, the **validation:auc_max**, in a chart format.

![An image of the validation:error_max charts in SageMaker Studio.](Task_2_3_4.png)
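The objective metric, **validation:auc**, can be read as the probability that the model scores a randomly chosen positive example higher than a randomly chosen negative one, so values closer to 1 are better. The following is a minimal pairwise sketch of that definition. The function name `auc` and the sample scores are illustrative only.

```python
def auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Three of the four positive/negative pairs are ranked correctly.
print(auc([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7]))
```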

In this lab, view the results from the best tuning job and visualize them using charts in the notebook.

In [None]:
#get_experiment_analytics 
run_component_analytics = ExperimentAnalytics(
    experiment_name=lab_6_experiment_name+"-aws-tuning-job",
    sagemaker_session=Session(sess, sm),
)

run_component_analytics.dataframe()

In [None]:
#visualize-tuning-results-auc-max
analytics_df = run_component_analytics.dataframe()
# Fall back to the "Last" statistic when the "Max" statistic is not populated
auc_column = "validation:auc - Max" if analytics_df["validation:auc - Max"].iloc[1] != 0 else "validation:auc - Last"
analytics_df[auc_column].plot(kind="bar", title=auc_column, xlabel="training job", ylabel="auc").set_ylim([0.8, 1])
    
plt.show()

In [None]:
#visualize-tuning-results-auc-max-scatter
analytics_df = run_component_analytics.dataframe()
# Fall back to the "Last" statistic when the "Max" statistic is not populated
auc_column = "validation:auc - Max" if analytics_df["validation:auc - Max"].iloc[1] != 0 else "validation:auc - Last"
sorted_df = analytics_df.sort_values(by=['TrialComponentName'])
x = sorted_df[auc_column]
y = sorted_df["num_round"]

plt.scatter(x, y, alpha=0.5)
plt.title("validation:auc by num_round")
plt.xlabel(auc_column)
plt.ylabel("num_round")
plt.show()

Finally, print the best training job from the tuning job, based on your objective metric.

In [None]:
#print-best
tuner.best_training_job()
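Conceptually, the best training job is the one whose objective metric is highest (for `Maximize`) or lowest (for `Minimize`). The following is a small sketch of that selection. The helper `best_job`, the job names, and the metric values are hypothetical and are not output from this lab.

```python
def best_job(jobs, metric, objective_type="Maximize"):
    """Return the name of the job with the best value for the given metric."""
    if objective_type not in ("Maximize", "Minimize"):
        raise ValueError("objective_type must be 'Maximize' or 'Minimize'")
    chooser = max if objective_type == "Maximize" else min
    return chooser(jobs, key=lambda job: job[metric])["name"]


# Hypothetical tuning results for illustration.
jobs = [
    {"name": "job-001", "validation:auc": 0.912, "validation:logloss": 0.31},
    {"name": "job-002", "validation:auc": 0.927, "validation:logloss": 0.29},
    {"name": "job-003", "validation:auc": 0.905, "validation:logloss": 0.28},
]
print(best_job(jobs, "validation:auc", "Maximize"))
print(best_job(jobs, "validation:logloss", "Minimize"))
```

The two calls can disagree, which is why the tuner needs both a metric name and an objective type.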

### Task 2.9: Graph experiment metrics with SageMaker Studio using built-in features

The previous method creates charts from experiment metrics using inline notebook cells. Another option is to plot some of the experiment metrics using features within SageMaker Studio. Now that the experiment has run at least once, create a new bar chart in SageMaker Studio.

The next task opens a new tab in SageMaker Studio. To follow these directions, use one of the following options:
- **Option 1:** View the tabs side by side. To create a split screen view from the main SageMaker Studio window, either drag the **lab_6.ipynb** tab to the side or choose the **lab_6.ipynb** tab, and then from the toolbar, select **File** and **New View for Notebook**. You can now have the directions displayed as you explore the artifacts.
- **Option 2:** Switch between the SageMaker Studio tabs to follow these instructions. When you are finished exploring the artifacts, return to the notebook by choosing the **lab_6.ipynb** tab.

1. Choose the **SageMaker Home** icon.
1. Choose **Experiments**.

SageMaker Studio displays the **Experiments** tab.

1. Select the experiment that ends with *aws-tuning-job*.

SageMaker Studio displays the list of **Runs** included in that experiment.

1. On the column header row, select the option in the **Name** column.
1. Choose <span style="background-color:#1a1b22; font-size:90%; color:#57c4f8; position:relative; top:-1px; padding-top:3px; padding-bottom:3px; padding-left:10px; padding-right:10px; border-color:#57c4f8; border-width:thin; border-style:solid; border-radius:2px; margin-right:5px; white-space:nowrap">Analyze</span>.

SageMaker Studio displays the **Run Analyze Chart** tab.

1. On the lower half of the tab, in the chart section, choose <span style="background-color:#1a1b22; font-size:90%; color:#57c4f8; position:relative; top:-1px; padding-top:3px; padding-bottom:3px; padding-left:10px; padding-right:10px; border-color:#57c4f8; border-width:thin; border-style:solid; border-radius:2px; margin-right:5px; white-space:nowrap">+ Add Chart</span>.
1. Choose **Bar**.

SageMaker Studio displays the **Add Chart** window.

1. For **Y-axis**, choose **min_child_weight**.
1. Choose <span style="background-color:#73cdf9; font-size:90%;  color:black; position:relative; top:-1px; padding-top:3px; padding-bottom:3px; padding-left:10px; padding-right:10px; border-color:#57c4f8; border-radius:2px; border-width:3px; margin-right:5px; white-space:nowrap">Create</span>.

A bar chart showing *min_child_weight* per *run* in the experiment is now saved to the charts section.

1. Repeat this process and create a new bar chart for the **train:auc** metric.
1. Repeat this process and create a new bar chart for the **validation:auc** metric.

### Conclusion

Congratulations! You have used Amazon SageMaker Experiments to train and tune models. In the next lab, you use SageMaker Debugger to get insights on potential issues while training a model.

### Cleanup

You have completed this notebook. To move to the next part of the lab, do the following:

- Close this notebook file.
- Return to the lab session and continue with the **Conclusion**.