# Monitoring Training with SageMaker Debugger

In this example, we will learn how to use SageMaker Debugger to monitor and profile training jobs.

SageMaker Debugger is a SageMaker capability that lets you automatically monitor, debug, and profile deep learning (DL) training jobs running on SageMaker. It provides insights into your DL training by capturing the internal state of your training loop and instance metrics in near-real time. Debugger can also automatically detect common issues during training and take appropriate actions when they are detected, so you can find problems in complex DL training jobs earlier and react accordingly. Additionally, SageMaker Debugger supports writing custom rules for scenarios not covered by built-in rules.

SageMaker Debugger has several key components:
- The open source [smdebug library](https://github.com/awslabs/sagemaker-debugger), which integrates with DL frameworks and Linux instances to persist debugging and profiling data to Amazon S3, and to retrieve and analyze it once the training job has started
- The SageMaker Python SDK, which allows you to configure the smdebug library with no or minimal code changes in your training script
- Automatically provisioned processing jobs to validate output tensors and profiling data against rules

SageMaker Debugger supports the TensorFlow, PyTorch, and MXNet DL frameworks. The `smdebug` library is installed by default in SageMaker DL containers, so you can start using SageMaker Debugger without having to make any modifications to your training script. You can also install the `smdebug` library in a custom Docker container and use all the features of SageMaker Debugger.

In this example, we will use the same CV `Hymenoptera` problem as in the TensorBoard example, so you can compare both debugging frameworks on the same problem.

### Prerequisites
This example requires `smdebug` to be installed locally. Feel free to run the cell below to install all required dependencies:

In [None]:
! pip install -r requirements.txt

### Initial Setup

In the cells below, we make the initial imports and download the required data. Feel free to reuse the dataset from the previous example (make sure to update `data_url` accordingly).

In [None]:
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()
role = get_execution_role()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sm-debugger'

In [None]:
# Downloading dataset and unzipping it locally
! wget https://download.pytorch.org/tutorial/hymenoptera_data.zip
! unzip hymenoptera_data.zip
data_url = sagemaker_session.upload_data(path="./hymenoptera_data", key_prefix="hymenoptera_data")
print(f"S3 location of dataset {data_url}")

## Debugging Your Training Job

In this section we will learn how to configure monitoring of your training job and analyze results for debugging purposes.

### Modifying Training Script

The `smdebug` library requires minimal changes to capture tensors and scalars. Let's review them.

1. You need to initialize the hook object outside of your training loop, after the model and optimizer have been initialized:

```python
    model = initialize_resnet_model(NUM_CLASSES, feature_extract=False, use_pretrained=True)
    model.to(torch.device("cuda"))
    optimizer = optim.SGD(params_to_update, lr=0.001, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
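    exp_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    # Create the hook from the hook configuration provided in the
    # SageMaker training job (smd refers to smdebug.pytorch)
    hook = smd.Hook.create_from_json_file()
    hook.register_hook(model)      # capture model tensors (weights, biases, ...)
    hook.register_loss(criterion)  # capture the loss scalar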
```

Note that we are using `.create_from_json_file()` to create our hook object. This method instantiates the hook based on the hook configuration you provide in the SageMaker training object. Since we register both the model and the criterion objects with the hook, we should expect to see both model parameters (weights, biases, and others) and the loss scalar.

2. Inside our training loop, the only modification we need to make is to differentiate between the training and validation phases by switching between `smdebug.modes.TRAIN` and `smdebug.modes.EVAL`. This allows `smdebug` to segregate the tensors that are captured in the training and evaluation phases:

```python
    for epoch in range(1, args.num_epochs + 1):
        for phase in ["train", "val"]:
            if phase == "train":
                model.train()  # Set model to training mode
                if hook:
                    hook.set_mode(modes.TRAIN)
            else:
                model.eval()  # Set model to evaluate mode
                if hook:
                    hook.set_mode(modes.EVAL)
            # Rest of the training loop
```
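
To make the effect of `set_mode` more concrete, here is a minimal, self-contained sketch of a recorder that keeps a separate step counter and value store per mode. It is a toy model for illustration only: `Mode`, `ToyRecorder`, and `record` are hypothetical names, and this is not how `smdebug` is implemented internally.

```python
from collections import defaultdict
from enum import Enum


class Mode(Enum):
    # Hypothetical stand-in for smdebug.modes; for illustration only.
    TRAIN = "train"
    EVAL = "eval"


class ToyRecorder:
    """Toy recorder that stores values separately for each mode."""

    def __init__(self):
        self.mode = Mode.TRAIN
        # values[mode][tensor_name] -> list of (mode_step, value)
        self.values = defaultdict(lambda: defaultdict(list))
        self.mode_steps = defaultdict(int)

    def set_mode(self, mode):
        self.mode = mode

    def record(self, name, value):
        # Each mode has its own step counter, so TRAIN and EVAL steps
        # are numbered independently.
        step = self.mode_steps[self.mode]
        self.values[self.mode][name].append((step, value))
        self.mode_steps[self.mode] += 1


recorder = ToyRecorder()
for epoch in range(2):
    for phase in ["train", "val"]:
        recorder.set_mode(Mode.TRAIN if phase == "train" else Mode.EVAL)
        for batch_loss in (0.9, 0.8):  # made-up loss values
            recorder.record("loss", batch_loss)

train_losses = recorder.values[Mode.TRAIN]["loss"]
eval_losses = recorder.values[Mode.EVAL]["loss"]
print(len(train_losses), len(eval_losses))
```

Because values are keyed by mode, the training and evaluation curves can later be read back and plotted independently.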

Run the cell below to review the training script in full:

In [None]:
! pygmentize 2_sources/train_resnet_sm.py

### Configuring Training Job

Now, let’s review how to configure `hook`, `rules`, `actions`, and tensor `collections` when running a SageMaker training job.

1. We will start by importing Debugger entities:

In [None]:
from sagemaker.debugger import (CollectionConfig, DebuggerHookConfig,
                                ProfilerRule, Rule, TensorBoardOutputConfig,
                                rule_configs)
from sagemaker.pytorch import PyTorch

2. Then, we must define automatic actions and a set of rules. Here, we are using Debugger’s built-in rules to detect some common DL training issues. Note that we can assign different actions to different rules. In our case, we want to stop our training job immediately when the rule is triggered:

In [None]:
actions = rule_configs.ActionList(
    rule_configs.StopTraining())


rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient(), actions=actions),
    Rule.sagemaker(rule_configs.overfit(), actions=actions),
    Rule.sagemaker(rule_configs.overtraining(), actions=actions),
    Rule.sagemaker(rule_configs.poor_weight_initialization(), actions=actions),
]
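
To give a feel for what a rule such as `vanishing_gradient()` looks for, the sketch below flags a step when the mean absolute gradient falls below a threshold. This is a simplified illustration, not the built-in rule's implementation. The function name, the threshold value, and the sample gradients are all assumptions made for the example.

```python
def gradients_vanished(gradients, threshold=1e-7):
    """Return True if the mean absolute gradient is below `threshold`.

    `gradients` is a flat list of floats; `threshold` is an illustrative value.
    """
    if not gradients:
        raise ValueError("gradients must not be empty")
    mean_abs = sum(abs(g) for g in gradients) / len(gradients)
    return mean_abs < threshold


healthy = [0.02, -0.01, 0.005]    # hypothetical gradient values
vanished = [1e-9, -2e-9, 5e-10]   # hypothetical gradient values

print(gradients_vanished(healthy))   # False
print(gradients_vanished(vanished))  # True
```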

3. Next, we must configure the collection of tensors and how they will be persisted. Here, we will define that we want to persist the weights and losses collection. For weights, we will also save a histogram that can be further visualized in TensorBoard. We will also set a saving interval for the training and evaluation phases:

In [None]:
collection_configs=[
        CollectionConfig(
            name="weights",
            parameters={
                "save_histogram": "True"
                }
            ),
        CollectionConfig(name="losses"),
    ]

hook_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "1", "eval.save_interval": "1"},
    collection_configs=collection_configs
)
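
The save interval controls how often tensors are written out. As a rough mental model, assuming a step is saved whenever its number is divisible by the interval, the following sketch lists which steps would be persisted. The helper `saved_steps` is hypothetical and only approximates the library's behavior.

```python
def saved_steps(total_steps, save_interval):
    """Return the step numbers that would be saved for a given interval."""
    if save_interval < 1:
        raise ValueError("save_interval must be >= 1")
    return [step for step in range(total_steps) if step % save_interval == 0]


every_step = saved_steps(total_steps=6, save_interval=1)
every_third = saved_steps(total_steps=6, save_interval=3)
print(every_step)   # [0, 1, 2, 3, 4, 5]
print(every_third)  # [0, 3]
```

With an interval of `1`, as configured above, every step is saved. A larger interval reduces the amount of data written to S3 at the cost of coarser charts.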

### Running Training Job

Now, we are ready to pass these objects to the SageMaker Estimator object and run the training job:

In [None]:

from sagemaker.pytorch import PyTorch

instance_type = 'ml.p2.xlarge'
instance_count = 1
job_name = "pytorch-sm-debugging"
tb_debug_path = f"s3://{bucket}/tensorboard/{job_name}"

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=tb_debug_path
)


debug_estimator = PyTorch(
          entry_point="train_resnet_sm.py",
          source_dir='2_sources',
          role=role,
          instance_type=instance_type,
          sagemaker_session=sagemaker_session,
          image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker",
          instance_count=instance_count,
          hyperparameters={
              "batch-size":64,
              "num-epochs":5,
              "input-size" : 224,
              "num-data-workers" : 4,
              "feature-extract":False,
          },
          disable_profiler=True,
          rules=rules,
          debugger_hook_config=hook_config,
          tensorboard_output_config=tensorboard_output_config,
          base_job_name=job_name,
      )

In [None]:
debug_estimator.fit(inputs={"train":f"{data_url}/train", "val":f"{data_url}/val"}, wait=False)
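
Because `fit()` is called with `wait=False`, the cell returns immediately while the job keeps running. If you prefer to block until the job finishes before reading its tensors, one option is a small polling loop. The sketch below is deliberately generic: `get_status` is a hypothetical callable that you would supply (for example, a function that wraps a `DescribeTrainingJob` call and returns its `TrainingJobStatus` field), and the fake status sequence exists only to make the example runnable.

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_terminal_status(get_status, poll_seconds=30, max_polls=120):
    """Poll `get_status()` until it returns a terminal status or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("training job did not reach a terminal status in time")


# Fake status source standing in for a real API call (illustration only).
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)  # Completed
```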

### Reviewing Debugger Results

SageMaker Debugger provides functionality to retrieve and analyze collected tensors from training jobs as part of the `smdebug` library. In the following steps, we will highlight some key APIs:

1. In the following code block, we create a new trial object using the S3 path where the tensors were persisted. Then, we print the list of all available tensors and the values of a specific tensor, `CrossEntropyLoss_output_0`.

In [None]:
import smdebug.pytorch as smd
from smdebug import modes

tensors_path = debug_estimator.latest_job_debugger_artifacts_path()

trial = smd.create_trial(tensors_path)

print(f"Saved these tensors: {trial.tensor_names()}")
# Restrict the values to the evaluation phase so the output matches the message
print(f"Loss values during evaluation were {trial.tensor('CrossEntropyLoss_output_0').values(mode=modes.EVAL)}")

2. Using the plotting function `plot_tensor()` defined below, we can visualize the loss for the training and evaluation phases. Running the following cell will produce a 2D loss chart. Similarly, you can access and process other tensors:

In [None]:
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot
from smdebug import modes

def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals

def plot_tensor(trial, tensor_name):

    steps_train, vals_train = get_data(trial, tensor_name, mode=modes.TRAIN)
    print("loaded TRAIN data")
    steps_eval, vals_eval = get_data(trial, tensor_name, mode=modes.EVAL)
    print("loaded EVAL data")

    fig = plt.figure(figsize=(10, 7))
    host = host_subplot(111)

    par = host.twiny()

    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")
    host.set_ylabel(tensor_name)

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    print("completed TRAIN plot")
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    print("completed EVAL plot")
    leg = plt.legend()

    host.xaxis.get_label().set_color(p1.get_color())
    leg.texts[0].set_color(p1.get_color())

    par.xaxis.get_label().set_color(p2.get_color())
    leg.texts[1].set_color(p2.get_color())

    plt.ylabel(tensor_name)

    plt.show()

plot_tensor(trial, "CrossEntropyLoss_output_0")

3. Let's check if any rules were triggered during our training. The rule evaluation results should show that no rule was triggered. You can experiment with rule settings. For instance, you can reset the weights of one of the model layers. This will trigger the `PoorWeightInitialization` rule and stop the training process.

In [None]:
for summary in debug_estimator.latest_training_job.rule_job_summary():
    print(f"Rule: {summary['RuleConfigurationName']}, status: {summary['RuleEvaluationStatus']}")
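
If you only care about rules that actually fired, you can filter the summaries by status. The sketch below works on plain dictionaries shaped like the ones printed above. The sample data is made up for illustration, and `rules_with_issues` is a hypothetical helper, not part of the SDK.

```python
def rules_with_issues(summaries):
    """Return names of rules whose evaluation status is 'IssuesFound'."""
    return [
        s["RuleConfigurationName"]
        for s in summaries
        if s.get("RuleEvaluationStatus") == "IssuesFound"
    ]


# Hypothetical summaries with the same keys as the real rule job summaries.
sample_summaries = [
    {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "PoorWeightInitialization", "RuleEvaluationStatus": "IssuesFound"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "InProgress"},
]

triggered = rules_with_issues(sample_summaries)
print(triggered)  # ['PoorWeightInitialization']
```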

4. Lastly, let’s visually inspect the saved tensors using TensorBoard. For this, we simply need to start TensorBoard using the S3 path we supplied to the Estimator object earlier. Feel free to explore TensorBoard on your own. You should expect to find histograms of weights.

In [None]:
! tensorboard --logdir {tb_debug_path}

## Profiling Training Application

In this section we review how to profile resource utilization of your training application using SageMaker Debugger. SageMaker Debugger allows you to collect various types of advanced metrics from your training instances. Once these metrics have been collected, SageMaker generates detailed metrics visualizations, detects resource bottlenecks, and provides recommendations on how instance utilization can be improved.

SageMaker Debugger collects two types of metrics:
- **System metrics**: These are the resource utilization metrics of training instances such as CPU, GPU, network, and I/O.
- **Framework metrics**: These are collected at the DL framework level. This includes metrics collected by native framework profilers (such as PyTorch profiler or TensorFlow Profiler), data loader metrics, and Python profiling metrics.

As in the case of debugging, you can define rules that will be automatically evaluated against collected metrics. If a rule is triggered, you can define one or several actions that will be taken. For example, you can send an email if the training job has GPU utilization below a certain threshold.
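
As a simplified illustration of such a threshold check, the sketch below flags a job when average GPU utilization over a set of samples falls below a limit. The built-in profiler rules apply their own criteria. The function name, the threshold, and the sample readings here are assumptions made for the example.

```python
def low_gpu_utilization(samples, threshold_percent=70.0):
    """Return True if average GPU utilization is below `threshold_percent`."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(samples) / len(samples) < threshold_percent


gpu_samples = [35.0, 42.5, 38.0, 51.0]  # hypothetical utilization readings (%)
underutilized = low_gpu_utilization(gpu_samples)
print(underutilized)  # True
```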

### Configuring Profiling

SageMaker Debugger doesn't require training script modification to collect metrics. However, you need to provide profiler configuration as part of your SageMaker `Estimator` object. Let's see how to do this.


1. We start with the necessary imports. As in the debugging example, we also need to create the `actions`, `rules`, and `hook_config` objects.

In [None]:
from sagemaker.debugger import (CollectionConfig, DebuggerHookConfig,
                                ProfilerRule, Rule, TensorBoardOutputConfig,
                                rule_configs)

actions = rule_configs.ActionList(
    rule_configs.StopTraining())

rules = [
    ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]

hook_config = DebuggerHookConfig(
    hook_parameters={"train.save_interval": "1", "eval.save_interval": "1"},
)

2. Next, we define which system and framework metrics we want to collect. For instance, we can provide a custom configuration for the framework, data loader, and Python. Note that system profiling is enabled by default:

In [None]:
from sagemaker.debugger import (DataloaderProfilingConfig,
                                DetailedProfilingConfig, FrameworkProfile,
                                ProfilerConfig, PythonProfiler,
                                PythonProfilingConfig, cProfileTimer)

profiler_config=ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(
            start_step=2, 
            num_steps=1
        ),
        dataloader_profiling_config=DataloaderProfilingConfig(
            start_step=2, 
            num_steps=1
        ),
        python_profiling_config=PythonProfilingConfig(
            start_step=2, 
            num_steps=1, 
            python_profiler=PythonProfiler.CPROFILE, 
            cprofile_timer=cProfileTimer.TOTAL_TIME
        )
    )
)
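
To reason about what this configuration implies, the sketch below computes the profiling window and the approximate number of system metric samples, assuming profiling covers `num_steps` consecutive steps beginning at `start_step` and that one system sample is taken per monitoring interval. Both helpers are hypothetical and exist only for this back-of-the-envelope estimate.

```python
def profiled_steps(start_step, num_steps):
    """Steps covered by a profiling window of `num_steps` starting at `start_step`."""
    return list(range(start_step, start_step + num_steps))


def system_samples(duration_seconds, interval_millis):
    """Approximate number of system metric samples collected over a duration."""
    return (duration_seconds * 1000) // interval_millis


window = profiled_steps(start_step=2, num_steps=1)
samples_per_minute = system_samples(duration_seconds=60, interval_millis=500)
print(window)              # [2]
print(samples_per_minute)  # 120
```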

### Running Training Job

Next, we provide the profiling config to the SageMaker training job configuration and start the training. Note that we set `num-data-workers` to 8, while `ml.p2.xlarge` instance has only 4 CPU cores. Usually, it’s recommended to have the number of data workers equal to the number of CPUs. Let’s see if SageMaker Debugger will be able to detect this suboptimal configuration.

In [None]:

from sagemaker.pytorch import PyTorch

instance_type = 'ml.p2.xlarge'
instance_count = 1
job_name = "pytorch-sm-profiling"
bucket = sagemaker_session.default_bucket()

profiler_estimator = PyTorch(
          entry_point="train_resnet_sm.py",
          source_dir='2_sources',
          role=role,
          instance_type=instance_type,
          sagemaker_session=sagemaker_session,
          image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker",
          instance_count=instance_count,
          hyperparameters={
              "batch-size":64,
              "num-epochs":5,
              "input-size" : 224,
              "num-data-workers" : 8,
              "feature-extract":False,
          },
          disable_profiler=False,
          profiler_config=profiler_config,
          rules=rules,
          base_job_name=job_name,
      )

In [None]:
profiler_estimator.fit(inputs={"train":f"{data_url}/train", "val":f"{data_url}/val"}, wait=False)

### Reviewing Profiling Results

You can start monitoring profiling outcomes in near-real time. We will use the `smdebug.profiler` API to process profiling outputs:

In [None]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

training_job_name = profiler_estimator.latest_training_job.job_name
tj = TrainingJob(training_job_name, sagemaker_session.boto_region_name)
tj.wait_for_sys_profiling_data_to_be_available()

Once the data is available, we can retrieve and visualize it. Running the following code will chart the CPU, GPU, and GPU memory utilization from system metrics.

In [None]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import \
    TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

Similarly, you can visualize other collected metrics. SageMaker Debugger also generates a detailed profiling report that aggregates all visualizations, insights, and recommendations in one place. Once your training job has finished, you can download the profiler report and all collected data by running the following cell:

In [None]:

rule_output_path = f"s3://{bucket}/{training_job_name}/rule-output"
! aws s3 cp {rule_output_path} ./ --recursive

Once all the assets have been downloaded, open the profiler-report.html file in your browser and review the generated information. Alternatively, you can open profiler-report.ipynb, which provides the same insights in the form of an executable Jupyter notebook.

The report covers the following aspects:
- System usage statistics
- Framework metrics summary
- Summary of rules and their status
- Training loop analysis and recommendations for optimizations

Note that, in the Dataloading analysis section, you should see a recommendation to decrease the number of data workers, as expected.
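
One way to avoid this misconfiguration is to cap the requested number of workers at the number of available CPUs before launching the job. The helper below is a minimal sketch of that rule of thumb. `suggested_num_workers` is a hypothetical name, and the CPU count is passed explicitly because the notebook machine usually differs from the training instance.

```python
import os


def suggested_num_workers(requested, cpu_count=None):
    """Cap the requested number of data workers at the available CPU count."""
    available = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(requested, available))


# 4 CPUs mirrors the training instance described in this example.
capped = suggested_num_workers(requested=8, cpu_count=4)
print(capped)  # 4
```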

As you can see, SageMaker Debugger provides extensive profiling capabilities, including optimization recommendations and automated rule validation, with minimal development effort. Like other Debugger capabilities, profiling is free of charge as long as you use built-in rules.