## Debug training jobs with Amazon SageMaker Debugger Built-In and Custom Rules


This notebook demonstrates how to use Amazon SageMaker Debugger rules with a training job.

#### Overview

1. Set up
2. Train a PyTorch model with SageMaker Debugger built-in rules
3. Train a PyTorch model with SageMaker Debugger custom rules

### 1. Set up

#### 1.1 Import libraries

In [1]:
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

from sagemaker.pytorch import PyTorch
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig



####  1.2 Define variables

In [2]:
# Set s3_bucket to the name of the bucket created in your data science environment
s3_bucket = 'datascience-environment-notebookinstance--06dc7a0224df'
s3_prefix = 'prepared'
region = boto3.Session().region_name

#### 1.3 Set up service clients

In [3]:
sagemaker_client = boto3.client("sagemaker")
s3_client = boto3.client('s3', region_name=region)

#### 1.4 Training and validation data inputs to training job

In [4]:
# Return the key of the object at position `index` under the given sub-folder of s3_prefix
def get_file_in_bucket(prefix, index):
    response = s3_client.list_objects(
        Bucket=s3_bucket,
        Prefix=s3_prefix + "/" + prefix
    )
    # Index 0 holds the SUCCESS/FAILURE marker for the file uploads to S3; the first data file is at index 1
    file_name = response['Contents'][index]['Key']
    print("Returing file name : " + file_name)
    return file_name
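
The helper above depends on the status marker sorting to index 0. A more defensive variant can select the first key by file extension instead. The following is a minimal sketch, assuming the data files end in `.csv` (as in the output recorded in this notebook). `first_data_file` is a hypothetical name, the `_SUCCESS` key is an assumed marker name, and the function works on a plain list of keys so it can be tried without an S3 call.

```python
def first_data_file(keys, suffix=".csv"):
    """Return the first key (in sorted order) that ends with `suffix`."""
    for key in sorted(keys):
        if key.endswith(suffix):
            return key
    raise ValueError(f"No object ending in {suffix!r} found")


# Illustrative keys, shaped like the 'Key' values in response['Contents'].
example_keys = [
    "prepared/train/_SUCCESS",  # assumed marker object name
    "prepared/train/part-00000-2554f113-947e-46bd-be31-9cd75cb4661c-c000.csv",
]
print(first_data_file(example_keys))
```

With a live bucket, the list of keys could be built from the `list_objects` response, for example `[obj['Key'] for obj in response['Contents']]`.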

In [5]:
content_type = "csv"

# Define the data type and paths to the training and validation datasets.

# Training runs for hours on powerful CPU/GPU instances, so you can choose to use a single file
# for training and validation instead of the entire dataset to save time and training costs.
# Set use_full_data to True to use the complete dataset.
use_full_data = False

if not use_full_data:
    # Use the first data file (index 1) found under each folder in your S3 bucket
    train_input = TrainingInput("s3://{}/{}".format(s3_bucket, get_file_in_bucket('train', 1)), content_type=content_type)
    validation_input = TrainingInput("s3://{}/{}".format(s3_bucket, get_file_in_bucket('validation', 1)), content_type=content_type)
else:
    train_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'train'), content_type=content_type, distribution='ShardedByS3Key')
    validation_input = TrainingInput("s3://{}/{}/{}/".format(s3_bucket, s3_prefix, 'validation'), content_type=content_type, distribution='ShardedByS3Key')

Returing file name : prepared/train/part-00000-2554f113-947e-46bd-be31-9cd75cb4661c-c000.csv
Returing file name : prepared/validation/part-00000-85addac2-a753-4bc2-b157-26ff8f5d5952-c000.csv


### 2. Train a PyTorch model with SageMaker Debugger built-in rules.

#### 2.1 Define training infrastructure related variables

In [6]:
train_instance_type = "ml.p3.2xlarge"
instance_count = 1

#### 2.2 Define hyperparameters

In [7]:
hyperparameters = {
    "epochs": 20
}
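
Each entry in `hyperparameters` reaches the entry-point script as a command-line argument. The job log recorded later in this notebook shows the resulting command, `train_pytorch.py --epochs 20`. The contents of `train_pytorch.py` are not shown here, so the following is only a sketch of how such a script might read the value, assuming it uses `argparse`. The default of 10 is an arbitrary placeholder.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed to the training script."""
    parser = argparse.ArgumentParser()
    # Placeholder default; the real script may use a different value.
    parser.add_argument("--epochs", type=int, default=10)
    return parser.parse_args(argv)


# Simulate the arguments shown in the job log.
args = parse_args(["--epochs", "20"])
print(args.epochs)
```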

#### 2.3 Configure the built-in rules to use

In [8]:
built_rules = [
    # Check for loss not decreasing during training and stop the training job
    Rule.sagemaker(
        rule_configs.loss_not_decreasing(),
        actions=rule_configs.StopTraining()
    ),
    # Check for overfitting
    Rule.sagemaker(rule_configs.overfit()),
    # Check for overtraining
    Rule.sagemaker(rule_configs.overtraining()),
    # Check for stalled training
    Rule.sagemaker(rule_configs.stalled_training_rule())
]

#### 2.4 Create PyTorch estimator with built-in rules

In [9]:
pt_estimator = PyTorch(
    entry_point="train_pytorch.py",
    source_dir="code",
    role=sagemaker.get_execution_role(),
    instance_count=instance_count,
    instance_type=train_instance_type,
    framework_version="1.6",
    py_version="py3",
    volume_size=1024,
    hyperparameters=hyperparameters,
    # Debugger built-in rules
    rules=built_rules
)

#### 2.5 Kick off training.

In [10]:
pt_estimator.fit({'train': train_input, 'test': validation_input})

2021-08-08 23:22:14 Starting - Starting the training job...
2021-08-08 23:22:44 Starting - Launching requested ML instances
LossNotDecreasing: InProgress
Overfit: InProgress
Overtraining: InProgress
StalledTrainingRule: InProgress
ProfilerReport-1628464934: InProgress
...
2021-08-08 23:23:07 Starting - Preparing the instances for training.........
2021-08-08 23:24:45 Downloading - Downloading input data...
2021-08-08 23:25:06 Training - Downloading the training image.........
2021-08-08 23:26:46 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-08-08 23:26:38,285 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-08-08 23:26:38,310 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-08-08 23:26:41,358 sagemaker_pytorch_container.

UnexpectedStatusException: Error for Training job pytorch-training-2021-08-08-23-22-14-143: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python3.6 train_pytorch.py --epochs 20"
Traceback (most recent call last):
  File "train_pytorch.py", line 11, in <module>
    from model_pytorch import TabularNet
ModuleNotFoundError: No module named 'model_pytorch'
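
The traceback above shows that `train_pytorch.py` imports `model_pytorch`, and that this module was not found inside the training container. A likely cause is that `model_pytorch.py` was missing from the `source_dir` uploaded with the job. A small pre-flight check before calling `fit` can catch this locally. This is a sketch: `missing_source_files` is a hypothetical helper, and the list of required files is inferred from the traceback rather than from the actual project layout.

```python
from pathlib import Path


def missing_source_files(source_dir, required):
    """Return the names in `required` that are not files under `source_dir`."""
    root = Path(source_dir)
    return [name for name in required if not (root / name).is_file()]


# File names inferred from the traceback; adjust to the real project layout.
required_files = ["train_pytorch.py", "model_pytorch.py"]
missing = missing_source_files("code", required_files)
if missing:
    print("Missing from source_dir:", ", ".join(missing))
else:
    print("All required files are present.")
```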

#### 2.6 View results of the rule execution.

In [11]:
pt_estimator.latest_training_job.rule_job_summary()

[{'RuleConfigurationName': 'LossNotDecreasing',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:802439482869:processing-job/pytorch-training-2021-08-0-lossnotdecreasing-5a5de53c',
  'RuleEvaluationStatus': 'Stopped',
  'LastModifiedTime': datetime.datetime(2021, 8, 8, 23, 31, 36, 825000, tzinfo=tzlocal())},
 {'RuleConfigurationName': 'Overfit',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:802439482869:processing-job/pytorch-training-2021-08-0-overfit-4a209ade',
  'RuleEvaluationStatus': 'Stopped',
  'LastModifiedTime': datetime.datetime(2021, 8, 8, 23, 31, 36, 825000, tzinfo=tzlocal())},
 {'RuleConfigurationName': 'Overtraining',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:802439482869:processing-job/pytorch-training-2021-08-0-overtraining-696a8833',
  'RuleEvaluationStatus': 'Stopped',
  'LastModifiedTime': datetime.datetime(2021, 8, 8, 23, 31, 36, 825000, tzinfo=tzlocal())},
 {'RuleConfigurationName': 'StalledTrainingRule',
  'RuleEvaluationJobArn': 'arn:a

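The rule summary reports the evaluation status of each rule. When the `LossNotDecreasing` rule finds an issue during training, the summary shows it for that rule. In the run recorded above, the training job failed with a `ModuleNotFoundError` before training began, so the rule evaluations are reported as `Stopped`. If you explore the CloudWatch logs for the processing job running the rule, you should see logs similar to the image below. As the logs show, once the `LossNotDecreasing` condition is detected, the rule is triggered and the corresponding action of stopping the training job follows.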


<IMG src = 'images/LossNotDecreasingRule.png'/>

If you check the status of the job, you will also see that the training job has been stopped.
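
Because `rule_job_summary()` returns a list of dictionaries, the statuses can be grouped without reading the raw output. The following is a minimal sketch that uses a hand-written sample shaped like the output above, with ARNs and timestamps omitted. `rules_by_status` is a hypothetical helper name.

```python
from collections import defaultdict


def rules_by_status(summary):
    """Group rule names by their RuleEvaluationStatus."""
    grouped = defaultdict(list)
    for rule in summary:
        grouped[rule["RuleEvaluationStatus"]].append(rule["RuleConfigurationName"])
    return dict(grouped)


# Sample entries mirroring the fields shown in the recorded summary.
sample_summary = [
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "Stopped"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "Stopped"},
]
print(rules_by_status(sample_summary))
```

With a live job, pass `pt_estimator.latest_training_job.rule_job_summary()` instead of the sample list.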

### 3. Train a PyTorch model with a SageMaker Debugger custom rule.

In [12]:
custom_rule = Rule.custom(
    name="CustomRule",  # used to identify the rule
    # rule evaluator container image
    image_uri="759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
    instance_type="ml.t3.medium",  # instance type to run the rule evaluation on
    source="rules/custom_rule.py",  # path to the rule source file
    rule_to_invoke="CustomGradientRule",  # name of the class to invoke in the rule source file
    volume_size_in_gb=30,  # size of the EBS volume attached to the rule evaluation instance
    # collections to be analyzed by the rule; "gradients" is a first-party collection, so it is fetched by name
    collections_to_save=[CollectionConfig("gradients")],
    rule_parameters={
        "threshold": "20.0"  # used to initialize the 'threshold' parameter in the rule's constructor
    },
)
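
The source file `rules/custom_rule.py` is not shown in this notebook. Note that `threshold` is supplied as the string `"20.0"`, so the rule's constructor has to convert it to a number. The standalone sketch below illustrates the kind of check a gradient rule might perform: flag a step when the mean absolute gradient of any tensor exceeds the threshold. It deliberately does not use the `smdebug` rule base class. The class name `GradientThresholdCheck`, its method, and the sample gradients are all hypothetical.

```python
class GradientThresholdCheck:
    """Standalone illustration of a threshold check on gradients (not the smdebug API)."""

    def __init__(self, threshold="20.0"):
        # Rule parameters arrive as strings, so convert explicitly.
        self.threshold = float(threshold)

    def condition_met(self, gradients):
        """gradients: dict mapping a tensor name to a list of gradient values."""
        for values in gradients.values():
            if not values:
                continue
            mean_abs = sum(abs(v) for v in values) / len(values)
            if mean_abs > self.threshold:
                return True
        return False


check = GradientThresholdCheck(threshold="20.0")
# Made-up gradient values for two tensors at a single step.
sample_gradients = {"fc1.weight": [0.5, -1.5, 2.0], "fc1.bias": [30.0, -50.0]}
print(check.condition_met(sample_gradients))
```

In an actual custom rule, the equivalent logic would live in the class named by `rule_to_invoke` and read the saved `gradients` collection rather than an in-memory dictionary.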

In [15]:
pt_estimator_custom = PyTorch(
    entry_point="train_pytorch.py",
    source_dir="code",
    role=sagemaker.get_execution_role(),
    instance_count=instance_count,
    instance_type=train_instance_type,
    framework_version="1.6",
    py_version="py3",
    volume_size=1024,
    hyperparameters={"epochs": 10},
    rules=[custom_rule]
)

In [17]:
pt_estimator_custom.fit({'train': train_input, 'test': validation_input})

2021-08-09 00:30:24 Starting - Starting the training job...
2021-08-09 00:30:48 Starting - Launching requested ML instances
CustomRule: InProgress
ProfilerReport-1628469024: InProgress
......
2021-08-09 00:31:48 Starting - Preparing the instances for training.........
2021-08-09 00:33:09 Downloading - Downloading input data...
2021-08-09 00:33:49 Training - Downloading the training image........
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2021-08-09 00:35:05,433 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2021-08-09 00:35:05,458 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2021-08-09 00:35:08,489 sagemaker_pytorch_container.training INFO     Invoking user training script.
2021-08-09 00:35:08,946 sagemaker-training-toolkit INFO     Invoking user script

Training Env:

[

In [18]:
pt_estimator_custom.latest_training_job.rule_job_summary()

[{'RuleConfigurationName': 'CustomRule',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:802439482869:processing-job/pytorch-training-2021-08-0-customrule-3975477c',
  'RuleEvaluationStatus': 'NoIssuesFound',
  'LastModifiedTime': datetime.datetime(2021, 8, 9, 2, 38, 39, 523000, tzinfo=tzlocal())},
 {'RuleConfigurationName': 'ProfilerReport-1628469024',
  'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:802439482869:processing-job/pytorch-training-2021-08-0-profilerreport-1628469024-421b9b62',
  'RuleEvaluationStatus': 'IssuesFound',
  'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule ProfilerReport at step 125 resulted in the condition being met\n',
  'LastModifiedTime': datetime.datetime(2021, 8, 9, 2, 38, 37, 201000, tzinfo=tzlocal())}]

From the rule summary you can see that the custom rule was not triggered, but the profiler report rule found issues. As in the previous experiment, explore the CloudWatch logs for more details.