# PyTorch Profiling Tutorial (Quick Start)

The goal of this tutorial is to help you quickly get hands-on with **PyTorch's built-in profiler**, enabling you to analyze and diagnose performance bottlenecks in deep learning models.

By the end of this guide, you'll be able to:
- Configure and run the profiler for both short and long training runs.
- Understand key profiler options such as `record_shapes` and `with_stack`.
- Generate Chrome trace timelines.
- Use profiler scheduling and trace handlers to profile long-running training jobs efficiently.

We'll use a simple ResNet18 model with synthetic data to stay focused on profiling behavior—not model convergence or accuracy.


## 1  Environment & model setup


In [None]:
import torch
import torchvision.models as models

device = "cuda"
dtype  = torch.bfloat16
model  = models.resnet18().to(device).to(dtype)


## 2  Random data & **one reusable training step**

In [None]:
B, C, H, W   = 5, 3, 224, 224
num_classes  = 1000

dummy_input  = torch.randn(B, C, H, W,
                           device=device, dtype=dtype)
dummy_target = torch.randn(B, num_classes,
                           device=device, dtype=dtype)

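def train_step():
    # Single forward + backward pass (no optimizer step).
    # Clear old gradients so they do not accumulate across calls.
    model.zero_grad(set_to_none=True)
    output = model(dummy_input)
    loss   = torch.nn.functional.mse_loss(output, dummy_target)
    loss.backward()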

# test it out
train_step()

## 3  Warm‑up

In [None]:
def warm_up(iters: int = 10):
    for _ in range(iters):
        train_step()
    torch.cuda.synchronize()

## 4  Basic profiling pass – capture a timeline
This section shows how to capture a minimal execution trace using PyTorch's profiler. The `activities` argument specifies which device events to track:

* `ProfilerActivity.CPU` – records CPU-side operator execution.
* `ProfilerActivity.CUDA` – records GPU kernel launches and durations.



In [None]:
from torch.profiler import profile, ProfilerActivity

warm_up()
with profile(activities=[ProfilerActivity.CPU,
                         ProfilerActivity.CUDA]) as p:
    train_step()
p.export_chrome_trace("trace_minimal.json")
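
The cell above assumes a CUDA device is present. As a minimal sketch, the activity list can instead be built conditionally so the same code also runs on a CPU-only machine. The names `sketch_device`, `tiny_model`, and `tiny_input` are hypothetical stand-ins chosen for this illustration (a small linear layer rather than ResNet18), and the summary table is printed only to confirm that events were captured.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Always trace CPU activity; add CUDA only when a GPU is actually available.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# Hypothetical stand-ins for this sketch: a tiny model instead of ResNet18.
sketch_device = "cuda" if torch.cuda.is_available() else "cpu"
tiny_model = torch.nn.Linear(16, 4).to(sketch_device)
tiny_input = torch.randn(8, 16, device=sketch_device)

with profile(activities=activities) as prof:
    # One forward + backward pass, mirroring train_step() on a smaller scale.
    tiny_model(tiny_input).sum().backward()

# Aggregate the recorded events by operator name and render a text table.
summary = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(summary)
```

On a GPU machine, sorting by `"cuda_time_total"` may be more informative than the CPU-side sort key used here.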


Open the exported JSON trace files in the Perfetto UI at **https://ui.perfetto.dev/**
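
The exported file can also be inspected without a UI. The sketch below assumes the trace follows the Chrome Trace Event format, in which complete events carry `"ph": "X"` and a `"dur"` field in microseconds; the exact set of events and categories may vary between PyTorch versions. `summarize_trace` is a hypothetical helper written for this illustration, not part of `torch.profiler`.

```python
import json
import os
from collections import defaultdict

def summarize_trace(path, top=5):
    """Return the `top` event names ranked by total duration in microseconds."""
    with open(path) as f:
        trace = json.load(f)
    # A Chrome trace is either a bare list of events or an object
    # that stores the list under "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for evt in events:
        # Only complete events ("X") have a duration; skip metadata and others.
        if evt.get("ph") == "X":
            totals[evt.get("name", "<unnamed>")] += evt.get("dur", 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Usage: runs only if the trace from the cell above exists on disk.
if os.path.exists("trace_minimal.json"):
    for name, total_us in summarize_trace("trace_minimal.json"):
        print(f"{total_us:10.0f} us  {name}")
```

The totals mix CPU operators and GPU kernels, so treat them as a quick sanity check rather than a replacement for the timeline view.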


## 5  Recording tensor shapes

In [None]:
warm_up()
with profile(activities=[ProfilerActivity.CPU,
                         ProfilerActivity.CUDA],
             record_shapes=True) as p:
    train_step()
p.export_chrome_trace("trace_shapes.json")

## 6  Including Python stack traces (larger trace files)

In [None]:
warm_up()
with profile(activities=[ProfilerActivity.CPU,
                         ProfilerActivity.CUDA],
             record_shapes=True,
             with_stack=True) as p:
    train_step()
p.export_chrome_trace("trace_stack.json")


⚠️ `with_stack=True` noticeably increases the **trace file size** and adds profiling overhead, so it is best switched off for large profiling runs.
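
To see the size difference on your own machine, the three trace files written so far can be compared directly. This is a small sketch using only the standard library; `trace_sizes_mib` is a hypothetical helper, and files that have not been generated yet are simply skipped.

```python
import os

def trace_sizes_mib(paths):
    """Map each existing path to its size in MiB; missing files are skipped."""
    return {p: os.path.getsize(p) / 2**20 for p in paths if os.path.exists(p)}

# File names match the export_chrome_trace() calls in the earlier cells.
trace_files = ["trace_minimal.json", "trace_shapes.json", "trace_stack.json"]
for name, size in trace_sizes_mib(trace_files).items():
    print(f"{name}: {size:.2f} MiB")
```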

## 7  Profiling long trainings with a **schedule**

When profiling longer training runs, capturing every iteration is too expensive and unnecessary. PyTorch's `schedule()` allows fine-grained control over when to record and when to skip. Here's what the arguments mean:

* `wait`: Number of iterations to skip at the start of each cycle, with the profiler idle.
* `warmup`: Number of iterations during which the profiler is tracing but its results are discarded. Profiler start-up overhead is high, so discarding these iterations keeps it from skewing the recorded ones.
* `active`: Number of iterations to record traces for.
* `repeat`: How many times to repeat the (wait → warmup → active) cycle.

This lets you profile windows of activity in long runs without generating massive trace files.

By default, `repeat=0`, which means the profiler will continue executing (wait → warmup → active) cycles **indefinitely** until the job ends. Setting `repeat=1` means only one such cycle is run, and `repeat=2` runs two cycles, and so on.
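
To check which iterations a given configuration will record, the cycle can be modeled in a few lines of plain Python. `phase_for_step` is a hypothetical helper written for this illustration, not part of `torch.profiler`; it assumes 0-indexed steps and ignores the `skip_first` argument of `schedule()`.

```python
def phase_for_step(step, wait, warmup, active, repeat=0):
    """Return "wait", "warmup", "active", or "none" for a 0-indexed step."""
    cycle_len = wait + warmup + active
    # With repeat > 0, the profiler stays idle once all cycles have finished.
    if repeat > 0 and step // cycle_len >= repeat:
        return "none"
    pos = step % cycle_len  # position inside the current cycle
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"

# Same values as the cell below: wait=10, warmup=5, active=3, repeat=2.
active_steps = [s for s in range(100)
                if phase_for_step(s, wait=10, warmup=5, active=3, repeat=2) == "active"]
print(active_steps)  # → [15, 16, 17, 33, 34, 35]
```

With these settings only 6 of the 100 iterations are recorded, in two windows of three steps each.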


In [None]:
from torch.profiler import schedule

sched_wait, sched_warmup, sched_active, sched_repeat = 10, 5, 3, 2
sched = schedule(wait=sched_wait,
                 warmup=sched_warmup,
                 active=sched_active,
                 repeat=sched_repeat)

def trace_handler(p):
    # Called once at the end of each active window.
    # ``p.step_num`` counts completed ``p.step()`` calls, so with 1-based
    # iteration numbers it is the last iteration of the *active* window.
    start = p.step_num - sched_active + 1
    end   = p.step_num
    p.export_chrome_trace(f"trace_iter{start}_{end}.json")

with profile(activities=[ProfilerActivity.CPU,
                         ProfilerActivity.CUDA],
             schedule=sched,
             record_shapes=True,
             with_stack=True,
             on_trace_ready=trace_handler) as p:
    warm_up()
    for _ in range(100):
        train_step()
        p.step()                   # marks iteration boundary

Now that you understand the details of collecting a trace, we can move on to the analysis: [Torch Profiling Analysis docs](./profiling_analysis.md)