# Debugger and Profiler

- **SageMaker Debugger** helps developers **identify and debug issues during training**.
- **SageMaker Profiler** helps developers **optimize the performance of their training jobs**.

**model_profiling.ipynb** Tasks
1. Add the rules you want to track to the `rules` list.
2. Create the profiler and debugger configurations.
3. Create the estimator to train your model.
4. Print the names of all the tensors that were tracked.
5. Print the number of datapoints for one of those tensors for both train and eval mode.

In [10]:
# install dependencies
!pip install smdebug
!pip install jinja2==3.0.1  # To avoid jinja2 markup related error



In [11]:
hyperparameters = {
    "batch_size": 2048,
    "gpu": True,
    "epoch": 2,
    "model": "resnet50",
}

### Create Rules

In [12]:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

#TODO: Can you add the rules you want to track
rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()), # detects when the loss is not decreasing in value at an adequate rate
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()), # detect if GPU utilization is low or suffers from fluctuations
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()), #  ProfilerReport rule invokes all of the built-in rules for monitoring and profiling
    Rule.sagemaker(rule_configs.vanishing_gradient()), # detects if the gradients in a trial become extremely small or drop to a zero magnitude
    Rule.sagemaker(rule_configs.overfit()),  # detects if your model is being overfit to the training data
    Rule.sagemaker(rule_configs.overtraining()), # detects if a model is being overtrained
    Rule.sagemaker(rule_configs.poor_weight_initialization()), # detects if your model parameters have been poorly initialized
]

### Create profiler and debugger configurations

In [13]:
from sagemaker.debugger import DebuggerHookConfig, ProfilerConfig, FrameworkProfile

#TODO: Can you create the profiler and debugger configs
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,  # time interval for capturing system metrics in milliseconds
    framework_profile_params=FrameworkProfile(num_steps=10),  # profile 10 framework steps
)

debugger_config = DebuggerHookConfig(
    hook_parameters={
        "train.save_interval": "100",  # save tensors every 100 training steps
        "eval.save_interval": "10",  # save tensors every 10 evaluation steps
    }
)
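To make the effect of the two `save_interval` values concrete, here is a minimal sketch. It assumes the hook saves a step whenever `step % save_interval == 0`, which is how smdebug describes `save_interval`. The helper name `saved_steps` and the step counts are hypothetical and are not taken from this training job.

```python
def saved_steps(total_steps, save_interval):
    """Return the step indices a hook would save, assuming it saves
    every step where step % save_interval == 0 (illustrative only)."""
    if save_interval <= 0:
        raise ValueError("save_interval must be positive")
    return [step for step in range(total_steps) if step % save_interval == 0]


# Hypothetical run: 250 training steps and 40 evaluation steps.
train_saved = saved_steps(250, 100)  # mirrors "train.save_interval": "100"
eval_saved = saved_steps(40, 10)     # mirrors "eval.save_interval": "10"
print(train_saved)
print(eval_saved)
```

A large training interval keeps the debug output small but yields few datapoints per tensor, which is worth keeping in mind when the number of saved steps is counted later.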

### Create an estimator to train the model

In [14]:
import sagemaker
from sagemaker.pytorch import PyTorch

#TODO: Create the estimator to train your model
estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    source_dir="scripts",
    entry_point="pytorch_cifar_profiling.py",
    framework_version="1.8",
    py_version="py36",
    hyperparameters=hyperparameters,
    profiler_config=profiler_config,         # Pass profiler config
    debugger_hook_config=debugger_config,    # Pass debugger hook config
    rules=rules,                             # Pass set of rules you want to track
)

### Fit the estimator

In [15]:
estimator.fit(wait=True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2023-03-02-07-43-39-332


2023-03-02 07:43:40 Starting - Starting the training job...
2023-03-02 07:44:08 Starting - Preparing the instances for training
LossNotDecreasing: InProgress
VanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
LowGPUUtilization: InProgress
ProfilerReport: InProgress
.........
2023-03-02 07:45:32 Downloading - Downloading input data
2023-03-02 07:45:32 Training - Downloading the training image........................
2023-03-02 07:49:37 Training - Training image download completed. Training in progress...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-03-02 07:49:53,914 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-03-02 07:49:53,942 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-03-02 07:49:53,944 sagemaker_pytorch_container

#### Track details of latest training job

In [16]:
import boto3

session = boto3.session.Session()
region = session.region_name

training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")
print(f"Region: {region}")

Training jobname: pytorch-training-2023-03-02-07-43-39-332
Region: us-east-1


#### Create a trial object from the debugging artifacts generated by the latest SageMaker training job

In [17]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys

trial = create_trial(estimator.latest_job_debugger_artifacts_path())

[2023-03-02 08:04:36.385 datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395:18 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-east-1-293789295245/pytorch-training-2023-03-02-07-43-39-332/debug-output


The **ModeKeys** enum selects which phase of the job a query against the trial object refers to: `ModeKeys.TRAIN`, `ModeKeys.EVAL`, `ModeKeys.PREDICT`, or `ModeKeys.GLOBAL`.

In [18]:
# TODO: Can you print the names of all the tensors that were tracked
# TODO: Can you print the number of datapoints for one of those tensors
# for both train and eval mode

print(trial.tensor_names())
print(len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.TRAIN)))
print(len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.EVAL)))

[2023-03-02 08:04:36.687 datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395:18 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
[2023-03-02 08:04:37.723 datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395:18 INFO trial.py:210] Loaded all steps
['CrossEntropyLoss_output_0', 'gradient/ResNet_bn1.bias', 'gradient/ResNet_bn1.weight', 'gradient/ResNet_conv1.weight', 'gradient/ResNet_fc.bias', 'gradient/ResNet_fc.weight', 'gradient/ResNet_layer1.0.bn1.bias', 'gradient/ResNet_layer1.0.bn1.weight', 'gradient/ResNet_layer1.0.bn2.bias', 'gradient/ResNet_layer1.0.bn2.weight', 'gradient/ResNet_layer1.0.bn3.bias', 'gradient/ResNet_layer1.0.bn3.weight', 'gradient/ResNet_layer1.0.conv1.weight', 'gradient/ResNet_layer1.0.conv2.weight', 'gradient/ResNet_layer1.0.conv3.weight', 'gradient/ResNet_layer1.0.downsample.0.weight', 'gradient/ResNet_layer1.0.downsample.1.bias', 'gradient/ResNet_layer1.0.downsample.1.weight', 'gradient/ResNet_layer1.1.bn1.bias', 'gradient

**Note:** The `steps()` method returns the list of step numbers recorded for a tensor in the given mode, so `len()` of that list gives the number of datapoints for that mode.
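The per-mode counting can be wrapped in a small helper. The sketch below is illustrative: `count_steps_per_mode` is a hypothetical name, and the `_FakeTrial` and `_FakeTensor` stand-ins, with made-up step lists, exist only so the snippet runs without access to the training job's debug output.

```python
def count_steps_per_mode(trial, tensor_name, modes):
    """Map each mode to the number of steps saved for tensor_name.

    `trial` is expected to behave like an smdebug trial, where
    trial.tensor(name).steps(mode=...) returns a list of step numbers.
    """
    tensor = trial.tensor(tensor_name)
    return {mode: len(tensor.steps(mode=mode)) for mode in modes}


class _FakeTensor:
    """Stand-in for an smdebug tensor (hypothetical, for local runs only)."""

    def __init__(self, steps_by_mode):
        self._steps_by_mode = steps_by_mode

    def steps(self, mode=None):
        return self._steps_by_mode.get(mode, [])


class _FakeTrial:
    """Stand-in for an smdebug trial (hypothetical, for local runs only)."""

    def __init__(self, tensors):
        self._tensors = tensors

    def tensor(self, name):
        return self._tensors[name]


# Made-up step numbers; a real trial would supply its own.
fake_trial = _FakeTrial(
    {"CrossEntropyLoss_output_0": _FakeTensor({"TRAIN": [0, 100], "EVAL": [0, 10, 20]})}
)
counts = count_steps_per_mode(fake_trial, "CrossEntropyLoss_output_0", ["TRAIN", "EVAL"])
print(counts)
```

With the real `trial` object from this notebook, the same helper would be called with `[ModeKeys.TRAIN, ModeKeys.EVAL]` as the modes.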

The **wait_for_sys_profiling_data_to_be_available()** method, called on the `tj` (`TrainingJob`) object in the next cell, blocks until the system profiling data for the specified training job is available. System profiling data includes *CPU, GPU, and memory utilization and other low-level performance metrics.*

Once the system profiling data is available, it can be analyzed using the smdebug library to identify performance bottlenecks in the training process and optimize the training job.


In [19]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()

ProfilerConfig:{'S3OutputPath': 's3://sagemaker-us-east-1-293789295245/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}
s3 path:s3://sagemaker-us-east-1-293789295245/pytorch-training-2023-03-02-07-43-39-332/profiler-output


Profiler data from system is available
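Waiting for data to become available is usually done by polling. The sketch below shows a generic poll-with-timeout loop. It is not smdebug's implementation, and `wait_until` and `ready_on_third_call` are hypothetical names used only for illustration.

```python
import time


def wait_until(is_ready, poll_seconds=1.0, timeout_seconds=60.0):
    """Call is_ready() repeatedly until it returns True or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


# Hypothetical readiness check that succeeds on the third attempt.
attempts = {"count": 0}


def ready_on_third_call():
    attempts["count"] += 1
    return attempts["count"] >= 3


result = wait_until(ready_on_third_call, poll_seconds=0.01, timeout_seconds=1.0)
print(result, attempts["count"])
```

In a real setting the readiness check would look for profiler output files in S3, and the timeout guards against jobs that never emit profiling data.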


#### View Timeline Charts

The code below imports the **TimelineCharts** class from the **smdebug.profiler.analysis.notebook_utils.timeline_charts** module, which *visualizes system profiling data for SageMaker training jobs.*


The code then creates an instance of the TimelineCharts class and passes the following arguments to the constructor:

- **system_metrics_reader:** This is a reader object that reads the system profiling data for the specified training job. It is obtained by calling the *get_systems_metrics_reader()* method on the *tj* object created earlier.
- **framework_metrics_reader:** This is an optional reader object that reads the framework profiling data for the specified training job. It is set to *None* in this code, indicating that only system profiling data will be visualized.
- **select_dimensions:** This is a list of dimensions that will be included in the visualization. In this code, only the "CPU" and "GPU" dimensions are selected.
- **select_events:** This is a list of events that will be included in the visualization. In this code, only the "total" event is selected.

The **TimelineCharts** object can then be used to create interactive timeline charts that show how system metrics (such as CPU and GPU utilization) change over time during the training job. These charts can help identify performance bottlenecks in the training process and optimize the training job.

In [20]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)

[2023-03-02 08:04:37.892 datascience-1-0-ml-t3-medium-1abf3407f667f989be9d86559395:18 INFO metrics_reader_base.py:134] Getting 20 event files
select events:['total']
select dimensions:['CPU', 'GPU']
filtered_events:{'total'}
filtered_dimensions:{'GPUMemoryUtilization-nodeid:algo-1', 'GPUUtilization-nodeid:algo-1', 'CPUUtilization-nodeid:algo-1'}
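The output above shows that the selector `"GPU"` picked up both `GPUUtilization` and `GPUMemoryUtilization`. A plausible reading is that each selector is matched as a pattern anywhere in the dimension name. The sketch below reproduces that behaviour under this assumption; `filter_dimensions` is a hypothetical helper, and the `MemoryUsedPercent` entry is an illustrative extra that does not come from this job's output.

```python
import re


def filter_dimensions(available, selectors):
    """Keep the dimension names that match at least one selector pattern.

    Assumes each selector is a regular expression searched anywhere in the name.
    """
    patterns = [re.compile(selector) for selector in selectors]
    return {name for name in available if any(p.search(name) for p in patterns)}


available_dimensions = [
    "CPUUtilization-nodeid:algo-1",
    "GPUUtilization-nodeid:algo-1",
    "GPUMemoryUtilization-nodeid:algo-1",
    "MemoryUsedPercent-nodeid:algo-1",  # illustrative extra, filtered out below
]
selected = filter_dimensions(available_dimensions, ["CPU", "GPU"])
print(sorted(selected))
```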


In [21]:
# Rule output path
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")

You will find the profiler report in s3://sagemaker-us-east-1-293789295245/pytorch-training-2023-03-02-07-43-39-332/rule-output
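The string concatenation in the cell above works because `estimator.output_path` ends with a slash here, as the printed path suggests. A slightly more defensive way to build the path is sketched below. `s3_join` is a hypothetical helper and `example-bucket` is a placeholder bucket name.

```python
def s3_join(base, *parts):
    """Join S3 URI segments with exactly one slash between them."""
    segments = [base.rstrip("/")]
    segments += [part.strip("/") for part in parts if part.strip("/")]
    return "/".join(segments)


job_name = "pytorch-training-2023-03-02-07-43-39-332"

# The result is the same whether or not the base path ends with a slash.
with_slash = s3_join("s3://example-bucket/", job_name, "rule-output")
without_slash = s3_join("s3://example-bucket", job_name, "rule-output")
print(with_slash)
print(with_slash == without_slash)
```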


**Inspect the contents of the rule output location in S3.** This is useful for verifying that the previous steps generated the expected files. The `--recursive` flag lists everything under the path, including subdirectories.

In [22]:
# --recursive lists all objects under the given path, including those in subdirectories
! aws s3 ls {rule_output_path} --recursive

2023-03-02 08:04:26     413133 pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-report.html
2023-03-02 08:04:42     268943 pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb
2023-03-02 08:04:37        555 pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
2023-03-02 08:04:37      21707 pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
2023-03-02 08:04:37       1955 pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json
2023-03-02 08:04:37        130 pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json
2023-03-02 08:04:37       1235 pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-re

**Copy the files under `rule_output_path` from S3 to the current working directory (`./`).** The `--recursive` flag copies everything under the specified path, including any subdirectories.

In [23]:
! aws s3 cp {rule_output_path} ./ --recursive

download: s3://sagemaker-us-east-1-293789295245/pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-report.html to ProfilerReport/profiler-output/profiler-report.html
download: s3://sagemaker-us-east-1-293789295245/pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb to ProfilerReport/profiler-output/profiler-report.ipynb
download: s3://sagemaker-us-east-1-293789295245/pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json to ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
download: s3://sagemaker-us-east-1-293789295245/pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json to ProfilerReport/profiler-output/profiler-reports/BatchSize.json
download: s3://sagemaker-us-east-1-293789295245/pytorch-training-2023-03-02-07-43-39-332/rule-output/ProfilerRepor
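After the copy, it can be handy to confirm what landed on local disk. The sketch below walks a directory and returns the relative paths of all files. `list_report_files` is a hypothetical helper, and the temporary directory with two empty placeholder files only mimics the layout shown in the download log so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path


def list_report_files(root):
    """Return the sorted relative paths of all files under root."""
    root_path = Path(root)
    return sorted(
        path.relative_to(root_path).as_posix()
        for path in root_path.rglob("*")
        if path.is_file()
    )


# Build a small placeholder tree that mimics the downloaded layout.
with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp) / "ProfilerReport" / "profiler-output"
    (report_dir / "profiler-reports").mkdir(parents=True)
    (report_dir / "profiler-report.html").write_text("<html></html>")
    (report_dir / "profiler-reports" / "BatchSize.json").write_text("{}")
    found = list_report_files(tmp)

print(found)
```

In the notebook, calling `list_report_files("./ProfilerReport")` after the `aws s3 cp` step would list the real report files instead.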

#### Fetch Autogenerated Profiler Report Name

In [29]:
print(estimator.latest_training_job.rule_job_summary())

[{'RuleConfigurationName': 'LossNotDecreasing', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:293789295245:processing-job/pytorch-training-2023-03-0-lossnotdecreasing-e9a24e94', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'InternalServerError: We encountered an internal error. Please try again.', 'LastModifiedTime': datetime.datetime(2023, 3, 2, 8, 4, 42, 440000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'VanishingGradient', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:293789295245:processing-job/pytorch-training-2023-03-0-vanishinggradient-be62daf0', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'InternalServerError: We encountered an internal error. Please try again.', 'LastModifiedTime': datetime.datetime(2023, 3, 2, 8, 4, 42, 440000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'Overfit', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-east-1:293789295245:processing-job/pytorch-training-2023-03-0-overfit-0ace15e4', 'RuleEvaluationStatus': 'NoIssuesFound'
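The raw summary is hard to scan, so grouping rule names by status can help spot rules whose evaluation failed. The sketch below works on plain dictionaries. `group_rules_by_status` is a hypothetical helper, and the sample contains only the three entries visible in the output above, reduced to the two fields used.

```python
from collections import defaultdict


def group_rules_by_status(summaries):
    """Group rule names by their RuleEvaluationStatus."""
    grouped = defaultdict(list)
    for summary in summaries:
        grouped[summary["RuleEvaluationStatus"]].append(summary["RuleConfigurationName"])
    return dict(grouped)


# Reduced copies of the entries shown in the output above.
sample_summaries = [
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "Error"},
    {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "Error"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "NoIssuesFound"},
]
by_status = group_rules_by_status(sample_summaries)
print(by_status)
```

With the live job, the same function could be applied directly to `estimator.latest_training_job.rule_job_summary()`.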

In [24]:
import os

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]

In [30]:
print(profiler_report_name)

ProfilerReport


#### Display **profiler report**

In [31]:
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")

| Rule | Description | Recommendation | Number of times rule triggered | Number of datapoints | Rule parameters |
|---|---|---|---|---|---|
| LowGPUUtilization | Checks if the GPU utilization is low or fluctuating. This can happen due to bottlenecks, blocking calls for synchronizations, or a small batch size. | Check if there are bottlenecks, minimize blocking calls, change distributed training strategy, or increase the batch size. | 11 | 2244 | threshold_p95:70, threshold_p5:10, window:500, patience:1000 |
| BatchSize | Checks if GPUs are underutilized because the batch size is too small. To detect this problem, the rule analyzes the average GPU memory footprint, the CPU and the GPU utilization. | The batch size is too small, and GPUs are underutilized. Consider running on a smaller instance type or increasing the batch size. | 11 | 2243 | cpu_threshold_p95:70, gpu_threshold_p95:70, gpu_memory_threshold_p95:70, patience:1000, window:500 |
| MaxInitializationTime | Checks if the time spent on initialization exceeds a threshold percent of the total training time. The rule waits until the first step of training loop starts. The initialization can take longer if downloading the entire dataset from Amazon S3 in File mode. The default threshold is 20 minutes. | Initialization takes too long. If using File mode, consider switching to Pipe mode in case you are using TensorFlow framework. | 0 | 67 | threshold:20 |
| LoadBalancing | Detects workload balancing issues across GPUs. Workload imbalance can occur in training jobs with data parallelism. The gradients are accumulated on a primary GPU, and this GPU might be overused with regard to other GPUs, resulting in reducing the efficiency of data parallelization. | Choose a different distributed training strategy or a different distributed training framework. | 0 | 2244 | threshold:0.2, patience:1000 |
| IOBottleneck | Checks if the data I/O wait time is high and the GPU utilization is low. It might indicate IO bottlenecks where GPU is waiting for data to arrive from storage. The rule evaluates the I/O and GPU utilization rates and triggers the issue if the time spent on the IO bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Pre-fetch data or choose different file formats, such as binary formats that improve I/O performance. | 0 | 2246 | threshold:50, io_threshold:50, gpu_threshold:10, patience:1000 |
| CPUBottleneck | Checks if the CPU utilization is high and the GPU utilization is low. It might indicate CPU bottlenecks, where the GPUs are waiting for data to arrive from the CPUs. The rule evaluates the CPU and GPU utilization rates, and triggers the issue if the time spent on the CPU bottlenecks exceeds a threshold percent of the total training time. The default threshold is 50 percent. | Consider increasing the number of data loaders or applying data pre-fetching. | 0 | 2246 | threshold:50, cpu_threshold:90, gpu_threshold:10, patience:1000 |
| StepOutlier | Detects outliers in step duration. The step duration for forward and backward pass should be roughly the same throughout the training. If there are significant outliers, it may indicate a system stall or bottleneck issues. | Check if there are any bottlenecks (CPU, I/O) correlated to the step outliers. | 0 | 67 | threshold:3, mode:None, n_outliers:10, stddev:3 |
| Dataloader | Checks how many data loaders are running in parallel and whether the total number is equal the number of available CPU cores. The rule triggers if number is much smaller or larger than the number of available cores. If too small, it might lead to low GPU utilization. If too large, it might impact other compute intensive operations on CPU. | Change the number of data loader processes. | 0 | 10 | min_threshold:70, max_threshold:200 |
| GPUMemoryIncrease | Measures the average GPU memory footprint and triggers if there is a large increase. | Choose a larger instance type with more memory if footprint is close to maximum available memory. | 0 | 2244 | increase:5, patience:1000, window:10 |
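The `threshold_p95:70` and `threshold_p5:10` parameters of LowGPUUtilization refer to percentiles of GPU utilization. The sketch below is a simplified illustration of such a percentile check, not the rule's actual implementation. It assumes that a 95th percentile below `threshold_p95` means underutilization, and that a high 95th percentile combined with a 5th percentile below `threshold_p5` means fluctuation. The function name and the sample values are hypothetical, and the `window` and `patience` parameters are ignored.

```python
import statistics


def classify_gpu_utilization(samples, threshold_p95=70, threshold_p5=10):
    """Classify GPU utilization samples (in percent) with a percentile check.

    Simplified illustration: no sliding window and no patience handling.
    """
    # quantiles(n=100) returns 99 cut points; index i holds percentile i + 1.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    p5, p95 = cuts[4], cuts[94]
    if p95 < threshold_p95:
        return "underutilized"
    if p5 < threshold_p5:
        return "fluctuating"
    return "ok"


# Made-up utilization traces, 100 samples each.
steady_low = [30] * 100
steady_high = [95] * 100
bursty = [0] * 10 + [100] * 90

print(classify_gpu_utilization(steady_low))
print(classify_gpu_utilization(steady_high))
print(classify_gpu_utilization(bursty))
```

Reading the thresholds this way explains the recommendation in the report: a low 95th percentile means the GPU rarely gets busy, which a larger batch size or fewer blocking calls may improve.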
