# SageMaker Model Profiling

In this notebook we will use SageMaker profiling to inspect the system metrics of a training job and to generate a profiler report.

First we will need to install `smdebug`.

## `pytorch_cifar_profiling.py`
<details>
  <summary> Click here to see the full code for the script </summary>

```python

import argparse
import time

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.models as models
import torchvision.transforms as transforms

from smdebug import modes
from smdebug.profiler.utils import str2bool
from smdebug.pytorch import get_hook

def train(args, net, device):
    hook = get_hook(create_if_not_exists=True)
    batch_size = args.batch_size
    epoch = args.epoch
    transform_train = transforms.Compose(
        [
            transforms.RandomCrop(32, padding=4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
        ]
    )

    transform_valid = transforms.Compose(
        [
            transforms.ToTensor(),
            transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
        ]
    )

    trainset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform_train
    )
    trainloader = torch.utils.data.DataLoader(
        trainset,
        batch_size=batch_size,
        shuffle=True
    )

    validset = torchvision.datasets.CIFAR10(
        root="./data", train=False, download=True, transform=transform_valid
    )
    validloader = torch.utils.data.DataLoader(
        validset,
        batch_size=batch_size,
        shuffle=False
    )

    loss_optim = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=1.0, momentum=0.9)

    epoch_times = []

    if hook:
        hook.register_loss(loss_optim)
    # train the model

    for i in range(epoch):
        print("START TRAINING")
        if hook:
            hook.set_mode(modes.TRAIN)
        start = time.time()
        net.train()
        train_loss = 0
        for _, (inputs, targets) in enumerate(trainloader):
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = loss_optim(outputs, targets)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        print("START VALIDATING")
        if hook:
            hook.set_mode(modes.EVAL)
        net.eval()
        val_loss = 0
        with torch.no_grad():
            for _, (inputs, targets) in enumerate(validloader):
                inputs, targets = inputs.to(device), targets.to(device)
                outputs = net(inputs)
                loss = loss_optim(outputs, targets)
                val_loss += loss.item()

        epoch_time = time.time() - start
        epoch_times.append(epoch_time)
        print(
            "Epoch %d: train loss %.3f, val loss %.3f, in %.1f sec"
            % (i, train_loss, val_loss, epoch_time)
        )

    # median (50th percentile) epoch time across all epochs
    p50 = np.percentile(epoch_times, 50)
    return p50


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--epoch", type=int, default=1)
    parser.add_argument("--gpu", type=str2bool, default=True)
    parser.add_argument("--model", type=str, default="resnet50")

    opt = parser.parse_args()

    for key, value in vars(opt).items():
        print(f"{key}:{value}")
    # create model
    net = models.__dict__[opt.model](pretrained=True)
    if opt.gpu:
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    net.to(device)

    # Start the training.
    median_time = train(opt, net, device)
    print("Median training time per Epoch=%.1f sec" % median_time)


if __name__ == "__main__":
    main()
```

</details>

In [2]:
# install dependencies
!pip install smdebug

Collecting smdebug
  Using cached smdebug-1.0.34-py2.py3-none-any.whl.metadata (1.8 kB)
Collecting protobuf<=3.20.3,>=3.20.0 (from smdebug)
  Using cached protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Collecting pyinstrument==3.4.2 (from smdebug)
  Using cached pyinstrument-3.4.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting pyinstrument-cext>=0.2.2 (from pyinstrument==3.4.2->smdebug)
  Using cached pyinstrument_cext-0.2.4-cp311-cp311-linux_x86_64.whl
Using cached smdebug-1.0.34-py2.py3-none-any.whl (280 kB)
Using cached pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
Using cached protobuf-3.20.3-py2.py3-none-any.whl (162 kB)


Next we need to specify the rules we want to evaluate and configure the profiler. Below I have specified three rules: loss not decreasing, low GPU utilization, and the profiler report. I have also configured the profiler to sample system metrics every 500 milliseconds.

In [3]:
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]
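
To make the first rule more concrete, here is a minimal sketch of the kind of check a loss-not-decreasing rule performs. It is an illustration only, not the SageMaker Debugger implementation: the function name `loss_not_decreasing`, its parameters `window` and `min_drop_percent`, and the sample loss values are all assumptions made for this example.

```python
def loss_not_decreasing(losses, window=3, min_drop_percent=1.0):
    """Return True if the loss failed to drop by at least `min_drop_percent`
    over the last `window` recorded steps.

    Hypothetical helper for illustration; the built-in rule has its own
    parameters and logic.
    """
    if len(losses) <= window:
        # Not enough history to compare yet.
        return False
    previous = losses[-window - 1]
    current = losses[-1]
    if previous == 0:
        # Avoid dividing by zero; a loss of zero cannot drop any further.
        return current >= previous
    drop_percent = (previous - current) / abs(previous) * 100.0
    return drop_percent < min_drop_percent


healthy = [2.30, 1.90, 1.55, 1.30, 1.10]  # steadily improving
stalled = [2.30, 2.31, 2.29, 2.30, 2.30]  # essentially flat

print(loss_not_decreasing(healthy))
print(loss_not_decreasing(stalled))
```

With these made-up values the improving sequence is not flagged and the flat one is.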


In [4]:
from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)
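
As a rough illustration of what a 500 ms sampling interval means in practice, the sketch below counts how many system-metric samples a run of a given length would produce and measures how often GPU utilization is low in a list of readings. The helper names, the 70% threshold, and the sample readings are assumptions for this example and are not taken from the defaults of the SageMaker rule.

```python
def expected_samples(duration_seconds, interval_millis=500):
    """Number of system-metric samples collected over a run of this length."""
    return int(duration_seconds * 1000 // interval_millis)


def low_utilization_fraction(gpu_percentages, threshold=70.0):
    """Fraction of samples whose GPU utilization is below `threshold` percent."""
    if not gpu_percentages:
        return 0.0
    below = sum(1 for value in gpu_percentages if value < threshold)
    return below / len(gpu_percentages)


# Hypothetical readings, one per 500 ms sample.
readings = [12.0, 35.0, 88.0, 91.0, 40.0, 93.0, 15.0, 90.0]

print(expected_samples(60))                # samples collected in one minute
print(low_utilization_fraction(readings))  # share of low-utilization samples
```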


Now that we have specified our profiler rules and profiler configuration, we can create the hyperparameter dict and the estimator that will run the training. The rules and the profiler configuration are both passed to the estimator.

In [5]:
hyperparameters = {
    "batch_size": 2048,
    "gpu": True,
    "epoch": 2,
    "model": "resnet50",
}
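
SageMaker passes the entries of the hyperparameter dict to the entry-point script as command-line arguments, which is why the script parses them with `argparse`. The sketch below imitates that hand-off locally. `to_cli_args` and `build_parser` are hypothetical helpers, and `str2bool` is a local stand-in for the smdebug utility so that the example runs without extra dependencies.

```python
import argparse


def str2bool(value):
    """Local stand-in for the `str2bool` helper the script imports from smdebug."""
    if isinstance(value, bool):
        return value
    if value.lower() in ("true", "t", "yes", "1"):
        return True
    if value.lower() in ("false", "f", "no", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got {value!r}")


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into `--key value` command-line tokens."""
    args = []
    for key, value in hyperparameters.items():
        args.extend([f"--{key}", str(value)])
    return args


def build_parser():
    """Same arguments as the `main()` function of the training script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--epoch", type=int, default=1)
    parser.add_argument("--gpu", type=str2bool, default=True)
    parser.add_argument("--model", type=str, default="resnet50")
    return parser


hyperparameters = {"batch_size": 2048, "gpu": True, "epoch": 2, "model": "resnet50"}
cli_args = to_cli_args(hyperparameters)
opt = build_parser().parse_args(cli_args)

print(cli_args)
print(opt.batch_size, opt.gpu, opt.epoch, opt.model)
```

Any key that is missing from the dict falls back to the default declared in the parser.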


In [6]:
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    entry_point="pytorch_cifar_profiling.py",
    framework_version="1.8",
    py_version="py36",
    hyperparameters=hyperparameters,
    profiler_config=profiler_config,
    rules=rules,
)


In [7]:
estimator.fit(wait=True)


In [8]:
import boto3

session = boto3.session.Session()
region = session.region_name

training_job_name = estimator.latest_training_job.name
print(f"Training jobname: {training_job_name}")
print(f"Region: {region}")


## Checking System Utilization
Below is some boilerplate code that gets the training job object from the training job name and displays the system metrics. The plots may not show up in the classroom, but they will show up when you train the model in SageMaker Studio.

In [9]:
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob

tj = TrainingJob(training_job_name, region)
tj.wait_for_sys_profiling_data_to_be_available()


In [10]:
from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts

system_metrics_reader = tj.get_systems_metrics_reader()
system_metrics_reader.refresh_event_file_list()

view_timeline_charts = TimelineCharts(
    system_metrics_reader,
    framework_metrics_reader=None,
    select_dimensions=["CPU", "GPU"],
    select_events=["total"],
)


## Profiler Report
Next we will fetch the profiler report from the S3 bucket where it was stored and display it. The profiler report may not display in the notebook, but you can take a look at it from the ProfilerReport folder.


In [11]:
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"You will find the profiler report in {rule_output_path}")


In [13]:
! aws s3 ls {rule_output_path} --recursive



In [15]:
! aws s3 cp {rule_output_path} ./ --recursive
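
The `{rule_output_path}` syntax in the two shell commands above relies on IPython expanding Python variables inside `!` commands. If your kernel does not do that, one alternative is to build the command in Python and run it with `subprocess`. This is a minimal sketch, assuming the AWS CLI is installed. The helper names and the example S3 path are hypothetical, and the actual calls are left commented out.

```python
import subprocess


def s3_ls_command(s3_path):
    """Argument list equivalent to `aws s3 ls <path> --recursive`."""
    return ["aws", "s3", "ls", s3_path, "--recursive"]


def s3_cp_command(s3_path, local_dir="./"):
    """Argument list equivalent to `aws s3 cp <path> <local_dir> --recursive`."""
    return ["aws", "s3", "cp", s3_path, local_dir, "--recursive"]


# Hypothetical path; in the notebook you would pass `rule_output_path` instead.
example_path = "s3://my-bucket/my-training-job/rule-output"

print(" ".join(s3_ls_command(example_path)))
print(" ".join(s3_cp_command(example_path)))

# Uncomment to actually run the commands (requires the AWS CLI and credentials):
# subprocess.run(s3_ls_command(example_path), check=True)
# subprocess.run(s3_cp_command(example_path), check=True)
```

Passing the arguments as a list avoids shell quoting issues if the path contains unusual characters.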



In [16]:
import os

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]
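
The list comprehension above picks the first rule whose configuration name contains "Profiler" and raises an `IndexError` if there is none. The sketch below shows the same lookup on made-up data, returning `None` instead of failing. The helper name and the sample names (including the numeric suffix) are assumptions; only the `RuleConfigurationName` key comes from the code above.

```python
def find_profiler_report_name(rule_summaries):
    """Return the first rule configuration name containing "Profiler", or None."""
    return next(
        (
            rule["RuleConfigurationName"]
            for rule in rule_summaries
            if "Profiler" in rule["RuleConfigurationName"]
        ),
        None,
    )


# Hypothetical rule summaries; real ones contain more fields.
sample_summaries = [
    {"RuleConfigurationName": "LossNotDecreasing"},
    {"RuleConfigurationName": "LowGPUUtilization"},
    {"RuleConfigurationName": "ProfilerReport-1234567890"},
]

name = find_profiler_report_name(sample_summaries)
print(name)
# Local path of the report once the rule output has been copied from S3.
print(name + "/profiler-output/profiler-report.html")
```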


In [17]:
import IPython

IPython.display.HTML(filename=profiler_report_name + "/profiler-output/profiler-report.html")
