# Profile Training Jobs with Amazon SageMaker Debugger

## Configuring a training job to use SageMaker Debugger

#### The first step is to configure training jobs to use Amazon SageMaker Debugger. By now, you are familiar with using the Estimator object from the SageMaker SDK to launch training jobs. To use Debugger, you pass three additional configuration parameters to the Estimator: debugger_hook_config (a DebuggerHookConfig object), rules (a list of Rule objects), and profiler_config (a ProfilerConfig object).

#### With DebuggerHookConfig, you can specify which debugging metrics to collect and where to store them, as shown in the following code block:

In [None]:
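from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
from sagemaker.estimator import Estimator

Estimator(
    # Other estimator arguments (image, role, instance settings) are omitted here.
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path=bucket_path,  # Where the debug data is stored.
        collection_configs=[  # Organize the data to collect into collections.
            CollectionConfig(
                name="metrics",
                # Save the collection every save_interval steps.
                parameters={"save_interval": str(save_interval)},
            )
        ],
    ),
)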

#### s3_output_path is the location where all the collected data is persisted. If this location is not specified, Debugger uses the default path, s3://<output_path>/debug-output/, where <output_path> is the output path of the SageMaker training job. The CollectionConfig list allows you to organize the debug data, or tensors, into collections for easier analysis. A tensor represents the state of a training network at a specific time during the training process. Data is collected at intervals specified by save_interval, which is the number of training steps between consecutive saves.

#### How do you know which tensors to collect? SageMaker Debugger comes with a set of built-in collections that capture common training data such as weights, layers, and outputs. You can collect all of the available tensors or a subset of them. In the preceding code sample, Debugger gathers the metrics collection. You can also define a custom collection that selects tensors by name with a regular expression, as in the following code block:

In [None]:
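# Use Debugger CollectionConfig to create a custom collection
from sagemaker.debugger import CollectionConfig

collection_configs = [
    CollectionConfig(
        name="custom_collection",
        # Collect tensors whose names match any of these patterns.
        parameters={"include_regex": ".*relu|.*tanh|.*weight"},
    )
]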
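
#### To get a feel for which tensors an include_regex pattern would select, you can try the pattern against a few names with Python's re module. This is only a sketch: the tensor names and the select_tensors helper are hypothetical, and it assumes a tensor is included when the pattern matches anywhere in its name. Whitespace inside a pattern is matched literally, so a pattern written with spaces around the | separators may select nothing.

```python
import re

# Hypothetical tensor names; real names depend on the framework and model.
tensor_names = [
    "conv1.weight",
    "fc1.bias",
    "relu_output_0",
    "tanh_input_0",
    "CrossEntropyLoss_output_0",
]


def select_tensors(names, include_regex):
    """Return the names that the pattern matches anywhere in the string."""
    pattern = re.compile(include_regex)
    return [name for name in names if pattern.search(name)]


# Pattern without stray whitespace.
clean = select_tensors(tensor_names, ".*relu|.*tanh|.*weight")
# Same alternatives, but with literal spaces inside the pattern.
spaced = select_tensors(tensor_names, ".*relu |.*tanh | *weight ")

print(clean)   # ['conv1.weight', 'relu_output_0', 'tanh_input_0']
print(spaced)  # []
```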

#### While DebuggerHookConfig allows you to configure and save tensors, a rule analyzes the tensors captured during training for specific conditions, such as the loss not decreasing. SageMaker Debugger supports two types of rules: built-in and custom. It comes with a set of built-in rules, written in Python, that detect and report common training problems such as overfitting, underfitting, and vanishing gradients. With custom rules, you write your own rules in Python for SageMaker Debugger to evaluate against the collected tensors.

#### For example, in the following code block, Debugger collects the tensors in the metrics collection and evaluates them to detect whether the training loss stops decreasing during training:

In [None]:
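from sagemaker.debugger import Rule, rule_configs
from sagemaker.estimator import Estimator

Estimator(
    # Other estimator arguments are omitted here.
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval * 2),
            },
        ),
    ],
)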
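
#### The idea behind a loss-not-decreasing check can be sketched in plain Python. The following is a simplified illustration, not the built-in rule's implementation: the function name, the diff_percent default, and the sample loss values are all hypothetical. It compares each loss value with the one recorded num_steps earlier and reports the first position where the loss failed to drop by the required percentage.

```python
def first_stalled_index(losses, num_steps, diff_percent=0.1):
    """Return the first index where the loss did not drop by diff_percent
    relative to the value num_steps earlier, or None if it always dropped.

    Assumes positive loss values recorded at evenly spaced steps.
    """
    for i in range(num_steps, len(losses), num_steps):
        previous, current = losses[i - num_steps], losses[i]
        # The largest loss value that still counts as a sufficient decrease.
        required = previous * (1 - diff_percent / 100.0)
        if current > required:
            return i
    return None


# Hypothetical loss curves.
healthy = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
stalled = [1.0, 0.9, 0.8, 0.8, 0.8, 0.8, 0.8]

print(first_stalled_index(healthy, num_steps=2))  # None
print(first_stalled_index(stalled, num_steps=2))  # 4
```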

#### Finally, ProfilerConfig allows you to collect system metrics, such as CPU, GPU, memory, and I/O utilization, as well as framework metrics specific to the framework used in your training job. For system metrics, you specify the interval at which the metrics are sampled; for framework metrics, you specify the starting step and the number of steps to profile, as shown in the following code block:

In [None]:
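from sagemaker.debugger import FrameworkProfile, ProfilerConfig
from sagemaker.estimator import Estimator

Estimator(
    # Other estimator arguments are omitted here.
    profiler_config=ProfilerConfig(
        # Monitoring interval in milliseconds
        system_monitor_interval_millis=500,
        # Start collecting framework metrics at step 2 and continue for 7 steps.
        framework_profile_params=FrameworkProfile(
            start_step=2,
            num_steps=7,
        ),
    ),
)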
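
#### As a quick sanity check on these settings, the arithmetic can be worked out in plain Python. This sketch assumes the profiling window includes start_step itself; both helper names are hypothetical and are not part of the SageMaker SDK.

```python
def profiled_steps(start_step, num_steps):
    """Steps covered by the framework profiling window (start_step inclusive)."""
    return list(range(start_step, start_step + num_steps))


def samples_per_minute(interval_millis):
    """Number of system-metric samples taken in one minute."""
    return 60_000 // interval_millis


print(profiled_steps(start_step=2, num_steps=7))  # [2, 3, 4, 5, 6, 7, 8]
print(samples_per_minute(500))                    # 120
```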
  

#### Additionally, you can use Debugger's built-in actions to automate the response when a rule is triggered. The following code block shows how to combine a built-in rule with a built-in action to stop a training job if the loss stops decreasing during training:

In [None]:
from sagemaker.debugger import Rule, rule_configs

built_in_rules = [
    # Check for loss not decreasing during training and stop the training job.
    Rule.sagemaker(
        rule_configs.loss_not_decreasing(),
        actions=rule_configs.StopTraining(),
    ),
]

#### On the other hand, when you have the ProfilerConfig parameter configured, a profiler report with a detailed analysis of system metrics and framework metrics is generated and persisted in S3. You can download and review the report, and then apply its recommendations.

In [None]:
from sagemaker.debugger import Rule, rule_configs
from sagemaker.pytorch import PyTorch

# Specify the rules you want to run.
built_in_rules = [
    # Check for loss not decreasing during training and stop the training job.
    Rule.sagemaker(
        rule_configs.loss_not_decreasing(),
        actions=rule_configs.StopTraining(),
    ),
    # Check for overfitting, overtraining, and stalled training.
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.stalled_training_rule()),
]

# Create an estimator and pass in the built-in rules.
pt_estimator = PyTorch(
    # Other estimator arguments are omitted here.
    rules=built_in_rules,
)

#### After you call fit, SageMaker starts the training job along with one processing job for each configured built-in rule. The rule evaluation status appears in the training logs in CloudWatch at regular intervals. You can also view the results of the rule evaluation programmatically using the following command:

In [None]:
pt_estimator.latest_training_job.rule_job_summary()
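
#### Once you have the summary, a small helper can pick out the rules that reported problems. The sketch below assumes each summary entry is a dictionary with RuleConfigurationName and RuleEvaluationStatus keys, and that a triggered rule reports the status IssuesFound; the helper name and the sample data are hypothetical stand-ins for a real response.

```python
def rules_with_issues(summaries):
    """Return the names of rules whose evaluation status is 'IssuesFound'."""
    return [
        summary["RuleConfigurationName"]
        for summary in summaries
        if summary.get("RuleEvaluationStatus") == "IssuesFound"
    ]


# Hypothetical sample shaped like a rule job summary.
sample_summary = [
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "IssuesFound"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "StalledTrainingRule", "RuleEvaluationStatus": "InProgress"},
]

print(rules_with_issues(sample_summary))  # ['LossNotDecreasing']
```

#### In a notebook, you would pass the list returned by rule_job_summary() in place of sample_summary.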

#### Built-in rules are managed by AWS, which frees you from having to manage updates to them; you simply plug them into the estimator. However, you may want to monitor a metric that the built-in rules do not cover, in which case you must configure a custom rule. Custom rules take a bit more work. For example, suppose you want to track whether the gradients are becoming too large during training. To create a custom rule for this, you must extend the Rule interface provided by SageMaker Debugger.

#### In the following example, the custom rule works with the tensors collected in the gradients collection. The invoke_at_step method provides the logic to be executed. At each step, the mean absolute value of each gradient tensor is compared against a threshold. If any value is greater than the threshold, the rule is triggered, as shown in the following code:

In [None]:
# --- Rule source, assumed to live in rules/my_custom_rule.py (the path passed as `source` below) ---
from smdebug.rules.rule import Rule


class CustomGradientRule(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        # Trigger the rule if any gradient's mean absolute value exceeds the threshold.
        for tname in self.base_trial.tensor_names(collection="gradients"):
            t = self.base_trial.tensor(tname)
            abs_mean = t.reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True
        return False


# --- Notebook code that registers the custom rule ---
from sagemaker.debugger import CollectionConfig, Rule
from sagemaker.pytorch import PyTorch

custom_rule = Rule.custom(
    name="CustomRule",  # used to identify the rule
    # rule evaluator container image
    image_uri="759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
    instance_type="ml.t3.medium",
    source="rules/my_custom_rule.py",  # path to the rule source file
    rule_to_invoke="CustomGradientRule",  # name of the class to invoke in the rule source file
    volume_size_in_gb=30,  # EBS volume size attached to the rule evaluation instance
    # Collections to be analyzed by the rule. Because "gradients" is a built-in
    # collection, it can be referenced by name.
    collections_to_save=[CollectionConfig("gradients")],
    rule_parameters={
        # Threshold to compare the gradient value against
        "threshold": "20.0"
    },
)

pt_estimator_custom = PyTorch(
    # Other estimator arguments are omitted here.
    rules=[custom_rule],  # new parameter
)

pt_estimator_custom.fit(wait=False)
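
#### To check the threshold logic without launching a training job, the same comparison can be reproduced on plain Python lists. This is a sketch only: the gradient values, tensor names, and helper name are hypothetical, and it assumes the reduction takes the mean of the absolute values of each tensor.

```python
def gradient_rule_triggered(gradients, threshold=10.0):
    """Return True if any tensor's mean absolute value exceeds the threshold.

    gradients maps a tensor name to a non-empty list of float values.
    """
    for values in gradients.values():
        abs_mean = sum(abs(value) for value in values) / len(values)
        if abs_mean > threshold:
            return True
    return False


# Hypothetical gradients captured at a single step.
step_gradients = {
    "gradient/fc1.weight": [0.5, -1.5, 2.0],     # mean absolute value is about 1.33
    "gradient/fc2.weight": [30.0, -45.0, 15.0],  # mean absolute value is 30.0
}

print(gradient_rule_triggered(step_gradients, threshold=20.0))  # True
print(gradient_rule_triggered(step_gradients, threshold=40.0))  # False
```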