## Debug training jobs with Amazon SageMaker Debugger

This notebook demonstrates how to use Amazon SageMaker Debugger with a training job.


Amazon SageMaker Debugger is a capability of Amazon SageMaker for debugging machine learning training jobs. It helps you monitor training jobs in near real time. Using Amazon SageMaker Debugger is a two-step process: saving model parameters during training, and analyzing the saved information.

#### Overview

1. Setup
2. Train XGBoost model with Amazon SageMaker Debugger enabled
3. Manually analyze debugger output 

## Section 1 - Setup <a id='setup'></a>

In this section, we import the necessary libraries, set up variables, and locate the dataset used to train the XGBoost model.

Let's start by specifying:

* The AWS region used to host your model.
* The IAM role associated with this SageMaker notebook instance.
* The S3 bucket used to store the data used to train the model, save debugger information during training and the trained model artifact.

<font color='red'>**Important**</font>: To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug library. In the following cell, change the third line to `install_needed = True` and run it to upgrade the libraries.

In [1]:
import sys
import IPython
install_needed = False  # Set to True to install/upgrade
if install_needed:
    print("installing deps and restarting kernel")
    !{sys.executable} -m pip install -U sagemaker
    !{sys.executable} -m pip install -U smdebug
    IPython.Application.instance().kernel.do_shutdown(True)
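
Before deciding whether to flip `install_needed`, it can help to see which versions are already present. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of either package.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two libraries this notebook depends on.
for name in ("sagemaker", "smdebug"):
    print(name, installed_version(name) or "not installed")
```

If either line prints `not installed`, or a version older than you expect, set `install_needed = True` in the cell above and rerun it.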

### 1.1 Import necessary libraries

In [2]:
import boto3
import sagemaker
import os
import pandas as pd

from sagemaker import get_execution_role
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig
from sagemaker.estimator import Estimator

from sagemaker.inputs import TrainingInput

### 1.2 Set up variables

In [3]:
region = boto3.Session().region_name
print("AWS Region: {}".format(region))

role = get_execution_role()
print("RoleArn: {}".format(role))

s3_bucket = 'datascience-environment-notebookinstance--06dc7a0224df'
s3_prefix  = 'prepared'

content_type = "csv"

AWS Region: us-west-2
RoleArn: arn:aws:iam::802439482869:role/DataScienceEnvironment-SageMakerRole-1SVE0FKUVRVO5


### 1.3 Create the service clients

In [4]:
sagemaker_client = boto3.client("sagemaker")
s3_client = boto3.client('s3', region_name=region)

### 1.4 S3 bucket and prefix to hold training data, debugger information, and model artifact

In [5]:
## Get the name of the file at position `index` under the 'prefix' folder
def get_file_in_bucket(prefix,index):
    response = s3_client.list_objects(
        Bucket=s3_bucket,
        Prefix=s3_prefix + "/" + prefix
    )
    ## Index 0 holds the SUCCESS/FAILURE marker for the upload to S3. The first data file is at index 1
    file_name = response['Contents'][index]['Key']
    print("Returing file name : " + file_name)
    return file_name
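
Relying on a fixed index works as long as the marker object always sorts first. As an alternative, here is a small sketch that picks the first data file by suffix instead. The function name `first_data_key` and the sample response are hypothetical; the only assumption about the S3 listing is the `Contents` / `Key` structure already used in the cell above.

```python
def first_data_key(response, suffix=".csv"):
    """Return the first object key ending in `suffix`, or None if there is none."""
    # A listing with no matching objects has no 'Contents' entry at all.
    keys = sorted(obj["Key"] for obj in response.get("Contents", []))
    for key in keys:
        if key.endswith(suffix):
            return key
    return None


# Hypothetical listing shaped like an S3 list_objects response.
sample_response = {
    "Contents": [
        {"Key": "prepared/train/_SUCCESS"},
        {"Key": "prepared/train/part-00000-example.csv"},
        {"Key": "prepared/train/part-00001-example.csv"},
    ]
}
print(first_data_key(sample_response))
```

This also returns `None` instead of raising an `IndexError` or `KeyError` when the prefix is empty, which makes a missing dataset easier to diagnose.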

In [6]:
# Training on powerful CPU/GPU instances can take hours, so you can choose to use a single file
# for training and validation instead of the entire dataset to save time and training costs.
# Set use_full_data to True to use the complete dataset.
use_full_data=False

# Define the content type and S3 paths of the training and validation inputs
if not use_full_data:
    train_input = TrainingInput("s3://{}/{}".format(s3_bucket, get_file_in_bucket('train',1)), content_type=content_type)
    validation_input = TrainingInput("s3://{}/{}".format(s3_bucket, get_file_in_bucket('validation',1)), content_type=content_type)
else:
    train_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'train'), content_type=content_type, distribution='ShardedByS3Key')
    validation_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'validation'), content_type=content_type, distribution='ShardedByS3Key')

##Debugger output needs to be saved in the session bucket
debugger_bucket = sagemaker.Session().default_bucket()
debugger_path = "s3://{}/debugger".format(debugger_bucket)

print(debugger_path)

Returing file name : prepared/train/part-00000-2554f113-947e-46bd-be31-9cd75cb4661c-c000.csv
Returing file name : prepared/validation/part-00000-85addac2-a753-4bc2-b157-26ff8f5d5952-c000.csv
s3://sagemaker-us-west-2-802439482869/debugger


## Section 2 - Train XGBoost model with Amazon SageMaker Debugger enabled <a id='train'></a>

Now train an XGBoost model with Amazon SageMaker Debugger enabled and monitor the training job. This is done using the Amazon SageMaker Estimator API. While the training job is running, you can use the Amazon SageMaker Debugger API to access saved model parameters in near real time and visualize them. Amazon SageMaker Debugger takes care of downloading a fresh set of model parameters every time you query for them.

Amazon SageMaker Debugger is available in Amazon SageMaker XGBoost container version 0.90-2 or later. To use XGBoost with Amazon SageMaker Debugger, request version `0.90-2` or later when you retrieve the container image URI with `sagemaker.image_uris.retrieve`. This notebook uses version `1.2-1`.

### 2.1 Retrieve the XGBoost container image


In [7]:
container = sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")
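
The version requirement above can be checked before starting a job. The sketch below assumes version strings follow the numeric `major.minor-patch` pattern seen in this notebook (`0.90-2`, `1.2-1`); both helper names are hypothetical, and non-numeric tags are not handled.

```python
import re


def parse_container_version(version):
    """Turn a version string such as '0.90-2' or '1.2-1' into a tuple of ints."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def supports_debugger(version, minimum="0.90-2"):
    """Return True if `version` is at least the minimum Debugger-enabled version."""
    return parse_container_version(version) >= parse_container_version(minimum)


for candidate in ("0.90-1", "0.90-2", "1.2-1"):
    print(candidate, supports_debugger(candidate))
```

Comparing tuples of integers avoids the pitfall of comparing the raw strings, where `"0.90-2"` and `"1.2-1"` would be ordered character by character.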

In [8]:
base_job_name = "weather-prediction-regression"

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"5"}

save_interval = 20
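
The interaction between `num_round` and `save_interval` determines how many steps the debugger hook records. The sketch below assumes that the hook saves every step whose zero-based index is a multiple of `save_interval`; the function `saved_steps` is a hypothetical illustration, not part of the SMDebug API.

```python
def saved_steps(num_round, save_interval):
    """List the zero-based training steps that would be saved under the assumption above."""
    return [step for step in range(num_round) if step % save_interval == 0]


# Values used in this notebook: 5 boosting rounds, saving every 20 steps.
print(saved_steps(num_round=5, save_interval=20))

# A longer run with the same interval records more steps.
print(saved_steps(num_round=50, save_interval=20))
```

Under this assumption, the settings in this notebook record only the first step. To inspect how a tensor evolves during training, increase `num_round` or lower `save_interval`.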

In [9]:
print(debugger_path)

s3://sagemaker-us-west-2-802439482869/debugger


### 2.2 Train the model

In [10]:
xgboost_estimator = Estimator(
    role=role,
    base_job_name=base_job_name,
    instance_count=1,
    instance_type='ml.m5.24xlarge',
    volume_size=1000, # EBS volume size in GB
    image_uri=container,
    hyperparameters=hyperparameters,
    max_run=3600,
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=debugger_path,
        collection_configs=[
            CollectionConfig(name="feature_importance", parameters={"save_interval": str(save_interval)}),
            CollectionConfig(name="average_shap", parameters={"save_interval": str(save_interval)})
        ],
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],
)

In the next step, start a training job by using the Estimator object you created above. Because `fit` is called with `wait=True`, the call blocks and streams the training logs until the job finishes. Pass `wait=False` instead to start the job in an asynchronous, non-blocking way, so that control returns to the notebook and further commands can run while the training job is progressing.

In [11]:
xgboost_estimator.fit(
    {"train": train_input, "validation": validation_input},
    wait=True
)

2021-08-08 23:18:49 Starting - Starting the training job...
2021-08-08 23:19:12 Starting - Launching requested ML instances
LossNotDecreasing: InProgress
ProfilerReport-1628464729: InProgress
...
2021-08-08 23:19:45 Starting - Preparing the instances for training.........
2021-08-08 23:21:17 Downloading - Downloading input data...
2021-08-08 23:21:50 Training - Training image download completed. Training in progress.
2021-08-08 23:21:50 Uploading - Uploading generated training model.
[2021-08-08 23:21:47.088 ip-10-0-153-242.us-west-2.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost

## Section 3 - Manually analyze debugger output

As a result of the above command, Amazon SageMaker starts **a training job and a rule job** for the `LossNotDecreasing` rule. The training job produces the model parameters to be analyzed. The rule job analyzes those parameters to check whether `train-rmse` and `validation-rmse` stop decreasing at any point during training. The rule job summary in section 3.1 also lists a `ProfilerReport` rule job that was not configured explicitly in this notebook.

Check the status of the training job below.
After your training job starts, Amazon SageMaker starts a rule-execution job to run the `LossNotDecreasing` rule.

The cell below polls every 10 seconds, for up to 100 attempts, until the training job reaches a terminal state.

In [12]:
import time

# These values do not change between polls, so look them up once
job_name = xgboost_estimator.latest_training_job.name
client = xgboost_estimator.sagemaker_session.sagemaker_client

for _ in range(100):
    description = client.describe_training_job(TrainingJobName=job_name)
    training_job_status = description["TrainingJobStatus"]
    rule_job_summary = xgboost_estimator.latest_training_job.rule_job_summary()
    rule_evaluation_status = rule_job_summary[0]["RuleEvaluationStatus"]
    print(
        "Training job status: {}, Rule Evaluation Status: {}".format(
            training_job_status, rule_evaluation_status
        )
    )

    # Stop polling once the training job reaches a terminal state
    if training_job_status in ["Completed", "Failed", "Stopped"]:
        break

    time.sleep(10)

Training job status: Completed, Rule Evaluation Status: InProgress
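
The polling pattern in the cell above can be factored into a reusable helper. This is a minimal sketch: `wait_for_status` and `TERMINAL_STATUSES` are hypothetical names, and the status source is passed in as a plain callable so the helper can be exercised without calling AWS.

```python
import time

# Training job states after which polling can stop.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_status(get_status, max_attempts=100, delay_seconds=10, sleep=time.sleep):
    """Call `get_status` until it returns a terminal status or attempts run out.

    Returns the last status seen, which may be non-terminal if the
    attempt budget was exhausted.
    """
    status = None
    for _ in range(max_attempts):
        status = get_status()
        if status in TERMINAL_STATUSES:
            break
        sleep(delay_seconds)
    return status


# Simulated sequence of statuses, with sleeping disabled for the demo.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
print(wait_for_status(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

In the notebook, `get_status` could be a lambda that calls `client.describe_training_job(TrainingJobName=job_name)["TrainingJobStatus"]`, using the `client` and `job_name` values from the cell above.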


### 3.1 Check the status of the Rule Evaluation Job

To get the rule evaluation jobs that Amazon SageMaker started for you, run the command below. The results show the `RuleConfigurationName`, `RuleEvaluationJobArn`, `RuleEvaluationStatus`, and `LastModifiedTime` of each job, plus `StatusDetails` when available.
If the model parameters meet a rule evaluation condition, the rule execution job throws a client error with `RuleEvaluationConditionMet`.

The logs of the rule evaluation job are available in the CloudWatch log group `/aws/sagemaker/ProcessingJobs`, under the job named in `RuleEvaluationJobArn`.

If the rule execution job detects that the loss is not decreasing in the training job, it raises the `RuleEvaluationConditionMet` exception and ends the job. In the output below, both rule jobs are still `InProgress`.

In [13]:
xgboost_estimator.latest_training_job.rule_job_summary()

[{'RuleConfigurationName': 'LossNotDecreasing',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:802439482869:processing-job/weather-prediction-regress-lossnotdecreasing-00e9e924',
  'RuleEvaluationStatus': 'InProgress',
  'LastModifiedTime': datetime.datetime(2021, 8, 8, 23, 22, 17, 806000, tzinfo=tzlocal())},
 {'RuleConfigurationName': 'ProfilerReport-1628464729',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:802439482869:processing-job/weather-prediction-regress-profilerreport-1628464729-47485b95',
  'RuleEvaluationStatus': 'InProgress',
  'LastModifiedTime': datetime.datetime(2021, 8, 8, 23, 22, 13, 80000, tzinfo=tzlocal())}]
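
When several rules are attached, a compact view of the summary is easier to scan. The sketch below reduces a summary list shaped like the output above to a name-to-status mapping and picks out rules that reported `IssuesFound`. Both helper names are hypothetical, and the sample data is abbreviated from the output above.

```python
def rule_statuses(summary):
    """Map each rule configuration name to its evaluation status."""
    return {rule["RuleConfigurationName"]: rule["RuleEvaluationStatus"] for rule in summary}


def rules_with_issues(summary):
    """Return the names of rules whose evaluation status is 'IssuesFound'."""
    return [name for name, status in rule_statuses(summary).items() if status == "IssuesFound"]


# Abbreviated sample with the same keys as the rule job summary above.
sample_summary = [
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "InProgress"},
    {"RuleConfigurationName": "ProfilerReport-1628464729", "RuleEvaluationStatus": "InProgress"},
]
print(rule_statuses(sample_summary))
print(rules_with_issues(sample_summary))
```

In the notebook, the same helpers could be applied directly to the list returned by `xgboost_estimator.latest_training_job.rule_job_summary()`.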

In [14]:
xgboost_estimator.latest_job_debugger_artifacts_path()

's3://sagemaker-us-west-2-802439482869/debugger/weather-prediction-regression-2021-08-08-23-18-48-922/debug-output'

Now that you've trained the model, analyze the saved data. Here, you focus on after-the-fact analysis.

You import a basic analysis library that defines the concept of a trial, which represents a single training run.


Before getting to the analysis, here are some notes on the Amazon SageMaker Debugger concepts it relies on.
- ***Trial*** - The centerpiece of the SageMaker Debugger API for accessing model parameters. It is a top-level abstraction that represents a single run of a training job. All model parameters emitted by a training job are associated with its trial.
- ***Tensor*** - Object that represents model parameters, such as weights, gradients, accuracy, and loss, that are saved during a training job.

For more details on these concepts and on the SageMaker Debugger API in general (including examples), see the [SageMaker Debugger Analysis API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/analysis.md) documentation.

In the following code cell, use a ***Trial*** to access model parameters. You do that by inspecting the training job and extracting its debugger artifacts path, which tells SageMaker Debugger where the data you are looking for is located. Keep in mind the following:
- Model parameters are stored in your own S3 bucket, which you can navigate to and inspect manually if desired.
- You might notice a slight delay before the trial object is created. This is normal, as SageMaker Debugger monitors the corresponding bucket and waits until model parameters appear. The delay comes from the less than instantaneous upload of model parameters from the training container to your S3 bucket.
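
To inspect the stored parameters manually, you need the bucket and key prefix rather than the full `s3://` URI. The following is a small sketch that splits such a URI with the standard library; `split_s3_uri` is a hypothetical helper, and the sample URI is the artifacts path printed earlier in this notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {}".format(uri))
    # The path starts with '/', which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


artifacts_uri = (
    "s3://sagemaker-us-west-2-802439482869/debugger/"
    "weather-prediction-regression-2021-08-08-23-18-48-922/debug-output"
)
bucket, key_prefix = split_s3_uri(artifacts_uri)
print(bucket)
print(key_prefix)
```

The resulting pair can be passed as `Bucket` and `Prefix` to an S3 listing call such as the one used in `get_file_in_bucket`.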

In [15]:
from smdebug.trials import create_trial

s3_output_path = xgboost_estimator.latest_job_debugger_artifacts_path()

print(s3_output_path)
trial = create_trial(s3_output_path)

[2021-08-08 23:22:31.992 ip-172-16-11-227:17594 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
s3://sagemaker-us-west-2-802439482869/debugger/weather-prediction-regression-2021-08-08-23-18-48-922/debug-output
[2021-08-08 23:22:32.033 ip-172-16-11-227:17594 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-west-2-802439482869/debugger/weather-prediction-regression-2021-08-08-23-18-48-922/debug-output


You can list the names of all saved model parameters that are available for analysis. Each name indicates whether the value is an evaluation metric, a feature importance, or a SHAP value and, where applicable, includes the feature name, which in this case is auto-assigned by XGBoost.

In [16]:
trial.tensor_names()

[2021-08-08 23:22:32.248 ip-172-16-11-227:17594 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
[2021-08-08 23:22:33.267 ip-172-16-11-227:17594 INFO trial.py:210] Loaded all steps


['average_shap/f0',
 'average_shap/f1',
 'average_shap/f10',
 'average_shap/f11',
 'average_shap/f12',
 'average_shap/f13',
 'average_shap/f14',
 'average_shap/f15',
 'average_shap/f16',
 'average_shap/f17',
 'average_shap/f2',
 'average_shap/f3',
 'average_shap/f4',
 'average_shap/f5',
 'average_shap/f6',
 'average_shap/f7',
 'average_shap/f8',
 'average_shap/f9',
 'train-rmse',
 'validation-rmse']
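
The list above is sorted as plain strings, so `f10` appears before `f2`. The sketch below groups names by the part before the `/` and sorts each group in natural (numeric-aware) order. The helper names are hypothetical, and labeling names without a `/` as `metrics` is an assumption made for this illustration.

```python
import re
from collections import defaultdict


def natural_key(name):
    """Sort key that compares embedded numbers by value, so 'f2' sorts before 'f10'."""
    return [int(token) if token.isdigit() else token for token in re.split(r"(\d+)", name)]


def group_by_collection(names):
    """Group tensor names by the prefix before '/', sorting each group naturally."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: names without a '/' are evaluation metrics.
        collection = name.split("/", 1)[0] if "/" in name else "metrics"
        groups[collection].append(name)
    return {collection: sorted(members, key=natural_key) for collection, members in groups.items()}


# A shortened version of the list returned by trial.tensor_names() above.
sample_names = [
    "average_shap/f0",
    "average_shap/f1",
    "average_shap/f10",
    "average_shap/f2",
    "train-rmse",
    "validation-rmse",
]
for collection, members in group_by_collection(sample_names).items():
    print(collection, members)
```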

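For each model parameter, we can get the values at all saved steps. In this run each tensor has a single entry, at step 0: the job trains for only 5 rounds, which is fewer than the `save_interval` of 20 set for the SHAP and feature-importance collections.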

In [17]:
##Check the values of individual tensors
trial.tensor("average_shap/f10").values()

{0: array([0.], dtype=float32)}

In [18]:
##Check the values of train rmse
trial.tensor("train-rmse").values()

{0: array([0.399997])}

In [19]:
##Check the values of validation rmse
trial.tensor("validation-rmse").values()

{0: array([0.400004])}
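
To plot a metric, the step-to-value mapping shown above first needs to become two aligned sequences. The sketch below assumes each value is a one-element array, as in the outputs above; `to_series` is a hypothetical helper, and the sample data uses plain lists with made-up numbers in place of NumPy arrays.

```python
def to_series(values_by_step):
    """Convert {step: one-element array} into (sorted steps, list of floats)."""
    steps = sorted(values_by_step)
    values = [float(values_by_step[step][0]) for step in steps]
    return steps, values


# Made-up sample shaped like the result of trial.tensor("train-rmse").values().
sample_values = {20: [0.31], 0: [0.4], 40: [0.27]}
steps, values = to_series(sample_values)
print(steps)
print(values)
```

The two lists can then be handed to a plotting call, for example `matplotlib.pyplot.plot(steps, values)`, once the training run saves more than one step.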

In this notebook we performed a manual analysis of tensors and metrics captured by SageMaker Debugger during the training process. Note that you can plot these values for better visualization, or use the SageMaker Studio environment to see built-in visualizations.