# PyTorch Profiler on Intel Gaudi HPU

## Overview

The PyTorch Profiler analyzes the performance of PyTorch models running on Intel Gaudi AI accelerators (HPU), which helps locate bottlenecks before optimizing. It provides insight into:

- **CPU and HPU Activity**: Track execution time spent on both host CPU and the Gaudi device
- **Kernel Utilization**: Monitor Tensor Processor Core (TPC) and Matrix Multiplication Engine (MME) usage
- **Memory Profiling**: Analyze HPU memory allocation patterns during training/inference
- **Performance Recommendations**: Get actionable guidance for optimization

## Key Concepts

### ProfilerActivity Types
- `torch.profiler.ProfilerActivity.CPU`: Profiles CPU operations
- `torch.profiler.ProfilerActivity.HPU`: Profiles Intel Gaudi HPU operations (used in place of `ProfilerActivity.CUDA`, which targets GPUs)

### Profiler Schedule
The profiler uses a schedule to control data collection:
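- `wait`: Number of steps to skip at the start of each cycle, with the profiler idle
- `warmup`: Number of steps during which the profiler is tracing but the results are discarded
- `active`: Number of steps that are recorded and written to the trace
- `repeat`: Number of times to repeat the wait/warmup/active cycle (`0` means the cycle repeats until profiling ends)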
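
To make the phase boundaries concrete, here is a simplified pure-Python model of how a schedule maps step numbers to phases. It is a sketch for illustration only, not the library's implementation; the function name `phase_for_step` and the phase labels are hypothetical.

```python
def phase_for_step(step, wait, warmup, active, repeat=0):
    """Return the phase ("idle", "warmup", or "active") for a 0-based step.

    Simplified model of a wait/warmup/active/repeat schedule. The name and
    the labels are illustrative and are not part of torch.profiler.
    """
    cycle = wait + warmup + active
    # With repeat > 0, nothing is profiled once all cycles have finished.
    if repeat > 0 and step >= cycle * repeat:
        return "idle"
    pos = step % cycle
    if pos < wait:
        return "idle"
    if pos < wait + warmup:
        return "warmup"
    return "active"


# Settings used throughout this notebook, applied to a 15-iteration loop.
phases = [phase_for_step(s, wait=0, warmup=5, active=5, repeat=1) for s in range(15)]
print(phases)
```

Under this model, steps 0-4 are warmup, steps 5-9 are recorded, and steps 10-14 are not profiled because the single cycle has already finished.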

### mark_step()
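In Lazy mode on Gaudi HPU, operations are accumulated into a graph instead of running immediately, and `htcore.mark_step()` triggers execution of the accumulated operations. Calling it in every profiled iteration matters because it ensures the work is actually executed on the device during the step being recorded.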
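
The effect of deferring execution until `mark_step()` can be pictured with a toy model. The class below is purely illustrative: `ToyLazyDevice` and its methods are hypothetical and only mimic the idea of queuing work and flushing it, not how the Habana runtime is implemented.

```python
class ToyLazyDevice:
    """Toy stand-in for a lazily executing device (not the Habana API)."""

    def __init__(self):
        self.pending = []   # operations queued but not yet run
        self.executed = []  # operations that have actually run

    def enqueue(self, op_name):
        # Issuing an op only records it; nothing runs yet.
        self.pending.append(op_name)

    def mark_step(self):
        # Flush: run everything queued so far, then clear the queue.
        self.executed.extend(self.pending)
        self.pending.clear()


dev = ToyLazyDevice()
dev.enqueue("add")
ran_before_flush = list(dev.executed)  # nothing has run yet
dev.mark_step()
ran_after_flush = list(dev.executed)   # the queued op has now run
print(ran_before_flush, ran_after_flush)
```

In this picture, a profiler step that ends before the flush would see no device work, which is why the notebook calls `htcore.mark_step()` before `profiler.step()`.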

## Trace Output
The profiler generates JSON trace files that can be:
1. Viewed in TensorBoard using the `torch-tb-profiler` plugin
2. Uploaded to the **Perfetto** trace viewer for detailed analysis
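
Because the traces are plain JSON, they can also be inspected with a few lines of Python. The sketch below assumes the Chrome trace object form, with a top-level `traceEvents` list whose complete events have `"ph": "X"` and a `dur` field in microseconds. The helper name `total_duration_by_name` and the miniature sample trace are hypothetical; the event names in a real file depend on your software versions.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace):
    """Sum the `dur` field (microseconds) of complete events, keyed by event name."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # "X" marks a complete event that carries both a start time and a duration.
        if event.get("ph") == "X":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    return dict(totals)


# Hypothetical miniature trace used only to exercise the function.
sample_trace = json.loads("""
{"traceEvents": [
  {"ph": "X", "name": "aten::add", "ts": 0,  "dur": 12},
  {"ph": "X", "name": "aten::add", "ts": 20, "dur": 8},
  {"ph": "X", "name": "aten::mm",  "ts": 40, "dur": 30},
  {"ph": "M", "name": "process_name"}
]}
""")

totals = total_duration_by_name(sample_trace)
for name, dur in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {dur} us")
```

To apply it to a real file, load the trace with `json.load` on the opened file and pass the resulting dictionary to the function.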


## Setup: Import Required Libraries


In [8]:
import torch
import habana_frameworks.torch.core as htcore
import os

# Create output directory for trace files
os.makedirs('./profile_traces', exist_ok=True)

# Set up device
device = torch.device('hpu')
print(f"Using device: {device}")


Using device: hpu
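
The setup cell assumes the Intel Gaudi software stack is installed. As a hedged sketch, the check below uses only the standard library to test whether the top-level `habana_frameworks` package can be located before attempting the import; the helper name `module_available` is hypothetical.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# `habana_frameworks` is the top-level package imported in the setup cell.
has_habana = module_available("habana_frameworks")
print(f"habana_frameworks available: {has_habana}")
```

This only confirms that the package is discoverable. It does not verify that a Gaudi device is present or usable.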


## Trace 1: Random Tensor Addition (A + B)

This trace captures the profiling data for element-wise addition of two random tensors with shape (1024, 1024).


In [9]:
# Define matrix size
MATRIX_SIZE = (1024, 1024)

# Define profiler activities for CPU and HPU
activities = [
    torch.profiler.ProfilerActivity.CPU,
    torch.profiler.ProfilerActivity.HPU
]

# Trace 1: Tensor Addition
print("Starting Trace 1: Tensor Addition (A + B)")

# Create random tensors on HPU
A = torch.randn(MATRIX_SIZE, dtype=torch.float32, device=device)
B = torch.randn(MATRIX_SIZE, dtype=torch.float32, device=device)

# Define output file path
trace1_path = './profile_traces/addition.pt.trace.json'

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=0, warmup=5, active=5, repeat=1),
    activities=activities,
    on_trace_ready=lambda p: p.export_chrome_trace(trace1_path),
    record_shapes=True,
    with_stack=True
) as profiler:
    # The schedule covers 10 steps (5 warmup + 5 active); the last 5 iterations are not profiled
    for step in range(15):
        # Perform addition
        C = A + B
        
        # Trigger execution of accumulated operations
        htcore.mark_step()
        
        # Step the profiler
        profiler.step()

print(f"Trace 1 completed! Saved to: {trace1_path}")


Starting Trace 1: Tensor Addition (A + B)
Trace 1 completed! Saved to: ./profile_traces/addition.pt.trace.json


## Trace 2: Random Tensor Matrix Multiplication (A @ B)

This trace captures the profiling data for matrix multiplication of two random tensors with shape (1024, 1024). Matrix multiplication utilizes the Matrix Multiplication Engine (MME) on Gaudi.


In [10]:
# Trace 2: Matrix Multiplication
print("Starting Trace 2: Matrix Multiplication (A @ B)")

# Create random tensors on HPU
A = torch.randn(MATRIX_SIZE, dtype=torch.float32, device=device)
B = torch.randn(MATRIX_SIZE, dtype=torch.float32, device=device)

# Define output file path
trace2_path = './profile_traces/matmul.pt.trace.json'

with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=0, warmup=5, active=5, repeat=1),
    activities=activities,
    on_trace_ready=lambda p: p.export_chrome_trace(trace2_path),
    record_shapes=True,
    with_stack=True
) as profiler:
    for step in range(15):
        # Perform matrix multiplication
        C = A @ B  # equivalent to torch.matmul(A, B)
        
        # Trigger execution of accumulated operations
        htcore.mark_step()
        
        # Step the profiler
        profiler.step()

print(f"Trace 2 completed! Saved to: {trace2_path}")


Starting Trace 2: Matrix Multiplication (A @ B)
Trace 2 completed! Saved to: ./profile_traces/matmul.pt.trace.json
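
When reading the matmul trace, it can help to relate a kernel's duration to the amount of arithmetic it performs. The sketch below uses the common convention that an (m x k) @ (k x n) product costs 2·m·k·n floating-point operations. The helper names are hypothetical, and the 100 µs duration is a made-up placeholder rather than a measurement from this notebook.

```python
def matmul_flops(m, k, n):
    """FLOPs for an (m x k) @ (k x n) product, counting a multiply-add as 2 operations."""
    return 2 * m * k * n


def achieved_tflops(flops, duration_us):
    """Convert an operation count and a duration in microseconds to TFLOP/s."""
    return flops / (duration_us * 1e-6) / 1e12


flops = matmul_flops(1024, 1024, 1024)
print(flops)  # 2147483648

# Hypothetical kernel duration; read the real value from your own trace.
example_duration_us = 100.0
print(f"{achieved_tflops(flops, example_duration_us):.2f} TFLOP/s")
```

Substituting the MME kernel duration observed in the trace gives a rough throughput figure that can be compared across runs.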


## Trace 3: Matmul → ReLU → Matmul (Eager Mode vs. torch.compile)

This trace compares the performance of a simple neural network pattern (matmul → relu → matmul) between:

1. **Eager Mode**: Direct execution without graph compilation
2. **torch.compile with hpu_backend**: Intel Gaudi optimized graph compilation

According to [Habana Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Getting_Started_with_Inference.html), `torch.compile` with `hpu_backend` is the recommended approach for optimal performance on Intel Gaudi. The backend enables graph-level optimizations specific to HPU architecture.


In [11]:
import torch.nn as nn
import torch.nn.functional as F

# Define a simple model: Matmul -> ReLU -> Matmul
class MatmulReluMatmul(nn.Module):
    def __init__(self, size):
        super().__init__()
        self.weight1 = nn.Parameter(torch.randn(size, size))
        self.weight2 = nn.Parameter(torch.randn(size, size))
    
    def forward(self, x):
        # Step 1: Matmul
        x = torch.matmul(x, self.weight1)
        # Step 2: ReLU
        x = F.relu(x)
        # Step 3: Matmul
        x = torch.matmul(x, self.weight2)
        return x

# Create model for Eager mode (no torch.compile)
model_eager = MatmulReluMatmul(MATRIX_SIZE[0]).to(device)
model_eager.eval()

print("Model created for Eager mode profiling")
print(model_eager)


Model created for Eager mode profiling
MatmulReluMatmul()


### Trace 3-a: Eager Mode (without torch.compile)


In [12]:
# Trace 3a: Eager Mode
print("Starting Trace 3a: Matmul -> ReLU -> Matmul (Eager Mode)")

# Create input tensor
input_tensor = torch.randn(MATRIX_SIZE, dtype=torch.float32, device=device)

# Define output file path
trace3a_path = './profile_traces/eager_mode.pt.trace.json'

with torch.no_grad():
    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=0, warmup=5, active=5, repeat=1),
        activities=activities,
        on_trace_ready=lambda p: p.export_chrome_trace(trace3a_path),
        record_shapes=True,
        with_stack=True
    ) as profiler:
        for step in range(15):
            # Forward pass: Matmul -> ReLU -> Matmul
            output = model_eager(input_tensor)
            
            # Trigger execution of accumulated operations
            htcore.mark_step()
            
            # Step the profiler
            profiler.step()

print(f"Trace 3a completed! Saved to: {trace3a_path}")


Starting Trace 3a: Matmul -> ReLU -> Matmul (Eager Mode)
Trace 3a completed! Saved to: ./profile_traces/eager_mode.pt.trace.json


### Trace 3-b: torch.compile with hpu_backend

Using `torch.compile(model, backend="hpu_backend")` enables Intel Gaudi-specific graph optimizations for improved performance.


In [13]:
# Create a new model instance for torch.compile
model_compiled = MatmulReluMatmul(MATRIX_SIZE[0]).to(device)
model_compiled.eval()

# Wrap with torch.compile using hpu_backend
model_compiled = torch.compile(model_compiled, backend="hpu_backend")

print("Model compiled with hpu_backend")

# Trace 3b: torch.compile Mode
print("Starting Trace 3b: Matmul -> ReLU -> Matmul (torch.compile)")

# Create input tensor
input_tensor = torch.randn(MATRIX_SIZE, dtype=torch.float32, device=device)

# Define output file path
trace3b_path = './profile_traces/torch_compile.pt.trace.json'

with torch.no_grad():
    with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=0, warmup=5, active=5, repeat=1),
        activities=activities,
        on_trace_ready=lambda p: p.export_chrome_trace(trace3b_path),
        record_shapes=True,
        with_stack=True
    ) as profiler:
        for step in range(15):
            # Forward pass: Matmul -> ReLU -> Matmul (compiled)
            output = model_compiled(input_tensor)
            
            # Trigger execution of accumulated operations
            htcore.mark_step()
            
            # Step the profiler
            profiler.step()

print(f"Trace 3b completed! Saved to: {trace3b_path}")


Model compiled with hpu_backend
Starting Trace 3b: Matmul -> ReLU -> Matmul (torch.compile)
Trace 3b completed! Saved to: ./profile_traces/torch_compile.pt.trace.json


## List Generated Trace Files


In [14]:
import os

print("=" * 60)
print("Generated Trace Files Summary")
print("=" * 60)

trace_files = [
    ("Trace 1 - Addition (A + B)", "./profile_traces/addition.pt.trace.json"),
    ("Trace 2 - Matrix Multiplication (A @ B)", "./profile_traces/matmul.pt.trace.json"),
    ("Trace 3a - Matmul→ReLU→Matmul (Eager Mode)", "./profile_traces/eager_mode.pt.trace.json"),
    ("Trace 3b - Matmul→ReLU→Matmul (torch.compile)", "./profile_traces/torch_compile.pt.trace.json"),
]

for name, path in trace_files:
    exists = "✅" if os.path.exists(path) else "❌"
    size = f"({os.path.getsize(path) / 1024:.1f} KB)" if os.path.exists(path) else "(not found)"
    print(f"\n{exists} {name}:")
    print(f"   {path} {size}")

print("\n" + "=" * 60)


Generated Trace Files Summary

✅ Trace 1 - Addition (A + B):
   ./profile_traces/addition.pt.trace.json (171.2 KB)

✅ Trace 2 - Matrix Multiplication (A @ B):
   ./profile_traces/matmul.pt.trace.json (137.2 KB)

✅ Trace 3a - Matmul→ReLU→Matmul (Eager Mode):
   ./profile_traces/eager_mode.pt.trace.json (282.7 KB)

✅ Trace 3b - Matmul→ReLU→Matmul (torch.compile):
   ./profile_traces/torch_compile.pt.trace.json (581.4 KB)



---

## 🔍 Viewing the Trace Results

### Habana Perfetto Viewer

**Upload the file to https://perfetto.habana.ai and view the API calls and hardware trace events.**

Steps:
1. Download the generated `.json` trace files to your local machine:
   - `./profile_traces/addition.pt.trace.json` - Tensor Addition trace
   - `./profile_traces/matmul.pt.trace.json` - Matrix Multiplication trace
   - `./profile_traces/eager_mode.pt.trace.json` - Eager Mode (Matmul→ReLU→Matmul)
   - `./profile_traces/torch_compile.pt.trace.json` - torch.compile Mode (Matmul→ReLU→Matmul)
2. Open https://perfetto.habana.ai in your browser
3. Click "Open trace file" and select your downloaded JSON file
4. Explore the trace timeline to analyze:
   - CPU and HPU activity
   - Kernel execution times (TPC and MME utilization)
   - Memory operations
   - API call sequences

### Comparing Eager Mode vs. torch.compile

When viewing Trace 3a (Eager Mode) and Trace 3b (torch.compile), compare:
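- **Graph compilation overhead**: `torch.compile` incurs a one-time compilation cost on the first call, which in this notebook falls within the warmup steps
- **Kernel fusion**: `torch.compile` with `hpu_backend` can fuse multiple operations into fewer kernels
- **Overall execution time**: Compiled graphs typically run faster than eager execution once compilation has finished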
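
The comparison can also be roughed out numerically from the two JSON files. The sketch below counts complete events and measures the overall time span in a trace, assuming the Chrome trace object form with `ts` and `dur` in microseconds. The helper name `trace_stats`, the event names, and all numbers in the miniature traces are hypothetical stand-ins, not results from this notebook.

```python
def trace_stats(trace):
    """Return (event_count, span_us) over the complete ("X") events of a trace."""
    events = [e for e in trace.get("traceEvents", []) if e.get("ph") == "X"]
    if not events:
        return 0, 0
    start = min(e["ts"] for e in events)
    end = max(e["ts"] + e.get("dur", 0) for e in events)
    return len(events), end - start


# Hypothetical miniature traces standing in for the eager and compiled exports.
eager = {"traceEvents": [
    {"ph": "X", "name": "matmul", "ts": 0, "dur": 10},
    {"ph": "X", "name": "relu", "ts": 15, "dur": 5},
    {"ph": "X", "name": "matmul", "ts": 25, "dur": 10},
]}
compiled = {"traceEvents": [
    {"ph": "X", "name": "fused_graph", "ts": 0, "dur": 20},
]}

eager_stats = trace_stats(eager)
compiled_stats = trace_stats(compiled)
print("eager:", eager_stats)
print("compiled:", compiled_stats)
```

On real traces the span covers host and device events together, so treat it as a coarse indicator and use the timeline view for per-kernel detail.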
