
test: add profiler regression guard for HIP graph replay #432

Merged
valarLip merged 2 commits into ROCm:main from sunway513:test/profiler-regression-guard
Mar 28, 2026

@sunway513
Collaborator

Summary

  • Adds a lightweight integration test (~3 seconds on MI355) that detects rocprofiler-sdk interception overhead in HIP graph replay
  • When PyTorch kineto links librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%
  • This directly impacts decode throughput in TP>1 serving scenarios

What it checks

| Metric | Pass | Fail (regression) |
|---|---|---|
| ldd libtorch_cpu.so | libroctracer64.so | librocprofiler-sdk.so |
| GPU occupancy | > 80% | < 80% (healthy = 97%) |
| Gaps > 100us | <= 5 | > 5 (healthy = 0) |
| hipGraphLaunch avg | < 150us | > 150us (healthy = 50us) |
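
The linkage row can be illustrated with a minimal sketch that classifies the output of `ldd`. The function names `classify_ldd_output` and `detect_backend` are hypothetical and not taken from the test file, and checking for rocprofiler-sdk first (so that a library linking both counts as regressed) is an assumption.

```python
import subprocess


def classify_ldd_output(ldd_text: str) -> str:
    """Classify the kineto profiler backend from text printed by `ldd`."""
    # Assumption: if both libraries appear, treat the build as rocprofiler-sdk.
    if "librocprofiler-sdk" in ldd_text:
        return "rocprofiler-sdk"
    if "libroctracer64" in ldd_text:
        return "roctracer"
    return "unknown"


def detect_backend(lib_path: str) -> str:
    """Run `ldd` on lib_path; return "unknown" if ldd is missing or fails."""
    try:
        result = subprocess.run(
            ["ldd", lib_path],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    if result.returncode != 0:
        return "unknown"
    return classify_ldd_output(result.stdout)
```

In practice `lib_path` would point at the `libtorch_cpu.so` inside the installed `torch/lib` directory; how the test locates that file is not shown in this PR page.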

Background

ROCm/pytorch#3056 (merged 2026-03-17) re-enabled rocprofiler-sdk in kineto after ROCm/pytorch#2579 had reverted it. The SDK's interception hooks add overhead to every HIP API call, which is particularly harmful for HIP graph replay where kernel execution times are small relative to dispatch overhead.

On a real model (Qwen3-0.6B, 28 layers, TP=1), the regression drops graph replay GPU occupancy from 89.5% to 9.0%, and hipGraphLaunch takes 103 ms instead of 7 ms.

Test plan

  • PASS on rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix (roctracer kineto)
  • Correctly FAIL on rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (rocprofiler-sdk kineto)
  • Runs via pytest (2 tests, 3.27s) and standalone python3 (3.8s)
  • No dependency on ATOM or AITER — uses only PyTorch + standard transformer ops

🤖 Generated with Claude Code

Adds a lightweight integration test (~3s) that detects rocprofiler-sdk
interception overhead which degrades HIP graph replay performance in the
inference serving stack.

When PyTorch kineto is linked against librocprofiler-sdk.so instead of
libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead,
dropping GPU occupancy from ~97% to ~75%. This directly impacts decode
throughput in TP>1 serving scenarios.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)
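
As a rough sketch of how the three runtime checks above combine, the snippet below compares a set of measured metrics against the stated thresholds. The `ReplayMetrics` dataclass and `find_regressions` helper are hypothetical names, and treating a value exactly at a threshold as a failure for occupancy and launch time is an assumption.

```python
from dataclasses import dataclass
from typing import List

GPU_OCCUPANCY_MIN = 0.80        # pass requires occupancy > 80%
GAP_100US_MAX_COUNT = 5         # pass requires at most 5 gaps longer than 100us
HIPGRAPHLAUNCH_MAX_US = 150.0   # pass requires average launch time < 150us


@dataclass
class ReplayMetrics:
    gpu_occupancy: float          # fraction in [0, 1]
    gaps_over_100us: int          # count of inter-kernel gaps > 100us
    hipgraphlaunch_avg_us: float  # average CPU time per hipGraphLaunch


def find_regressions(m: ReplayMetrics) -> List[str]:
    """Return one message per violated threshold; empty list means healthy."""
    failures = []
    if m.gpu_occupancy <= GPU_OCCUPANCY_MIN:
        failures.append(f"GPU occupancy {m.gpu_occupancy:.1%} is not above 80%")
    if m.gaps_over_100us > GAP_100US_MAX_COUNT:
        failures.append(f"{m.gaps_over_100us} gaps > 100us (max 5)")
    if m.hipgraphlaunch_avg_us >= HIPGRAPHLAUNCH_MAX_US:
        failures.append(f"hipGraphLaunch avg {m.hipgraphlaunch_avg_us:.0f}us is not below 150us")
    return failures
```

With the healthy figures quoted above (97%, 0 gaps, 50us) the list is empty; with regressed figures of the kind reported in this PR, all three checks fire.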

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a new integration-style regression test to detect ROCm/PyTorch profiler (kineto) backend linkage and runtime overhead that can degrade HIP graph replay performance.

Changes:

  • Add tests/test_profiler_regression.py with a small transformer graph-replay workload.
  • Detect kineto linkage via ldd libtorch_cpu.so and analyze a torch.profiler chrome trace for occupancy/gap/hipGraphLaunch metrics.
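
The trace analysis mentioned above could look roughly like the following sketch. It is not the PR's `_analyze_trace`; the function name `analyze_trace`, the set of GPU categories, and matching `hipGraphLaunch` by event name are assumptions about how kineto labels events in the exported chrome trace (timestamps and durations are taken to be in microseconds).

```python
# Assumed kineto categories for GPU-side events in the chrome trace.
GPU_CATEGORIES = {"kernel", "gpu_memcpy", "gpu_memset"}


def analyze_trace(trace: dict, gap_threshold_us: float = 100.0) -> dict:
    """Compute GPU occupancy, long-gap count, and average hipGraphLaunch time."""
    # Only complete ("X") events carry both a start time and a duration.
    events = [e for e in trace.get("traceEvents", []) if e.get("ph") == "X"]
    intervals = sorted(
        (float(e["ts"]), float(e["ts"]) + float(e["dur"]))
        for e in events
        if e.get("cat") in GPU_CATEGORIES
    )
    launch_durs = [float(e["dur"]) for e in events if e.get("name") == "hipGraphLaunch"]

    busy_us = 0.0
    span_us = 0.0
    long_gaps = 0
    if intervals:
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start > cur_end:
                # Close the current merged interval and measure the idle gap.
                busy_us += cur_end - cur_start
                if start - cur_end > gap_threshold_us:
                    long_gaps += 1
                cur_start, cur_end = start, end
            else:
                # Overlapping events are merged so busy time is not double counted.
                cur_end = max(cur_end, end)
        busy_us += cur_end - cur_start
        span_us = cur_end - intervals[0][0]

    return {
        "gpu_occupancy": busy_us / span_us if span_us > 0 else 0.0,
        "gaps_over_threshold": long_gaps,
        "hipgraphlaunch_avg_us": (
            sum(launch_durs) / len(launch_durs) if launch_durs else 0.0
        ),
    }
```

Merging overlapping intervals before summing keeps occupancy at or below 100% when kernels from different streams overlap.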

Comment on lines +134 to +156
def _profile_graph_replay(model, g, static_input, num_iters=GRAPH_ITERS):
    """Profile graph replay with torch.profiler and return trace dict."""
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        record_shapes=False,
        with_stack=False,
    ) as prof:
        for _ in range(num_iters):
            static_input.copy_(torch.randn_like(static_input))
            g.replay()
        torch.cuda.synchronize()

    with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as f:
        trace_path = f.name
    prof.export_chrome_trace(trace_path)

    with open(trace_path) as f:
        trace = json.load(f)
    os.unlink(trace_path)
    return trace, prof

Copilot AI Mar 28, 2026

_profile_graph_replay accepts a model argument but doesn't use it, and returns (trace, prof) even though callers never use prof. Dropping the unused parameter and returning just trace (or returning prof only when needed) would simplify the API and avoid unused variables at call sites.

Comment on lines +308 to +311
    if not torch.cuda.is_available():
        print("SKIP: No GPU available")
        exit(0)


Copilot AI Mar 28, 2026

The standalone runner uses the interactive-helper exit(...). In scripts/tests it's more reliable to use sys.exit(...) (after import sys) so the exit behavior is consistent even when site isn't imported.
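
One way to follow this suggestion is to have the runner return an exit code and leave the actual exit to `sys.exit` at the entry point. The sketch below is illustrative only: `run_standalone` and its `checks` parameter are hypothetical and do not appear in the PR.

```python
import sys
from typing import Callable, Sequence


def run_standalone(cuda_available: bool, checks: Sequence[Callable[[], None]]) -> int:
    """Run each check and return a process exit code (0 = ok or skipped)."""
    if not cuda_available:
        print("SKIP: No GPU available")
        return 0
    failed = 0
    for check in checks:
        try:
            check()
            print(f"PASS: {check.__name__}")
        except AssertionError as exc:
            failed += 1
            print(f"FAIL: {check.__name__}: {exc}")
    return 1 if failed else 0


# Hypothetical entry point (commented out so the sketch has no torch dependency):
# if __name__ == "__main__":
#     import torch
#     sys.exit(run_standalone(torch.cuda.is_available(), [test_a, test_b]))
```

Returning the code from a function also makes the skip path easy to exercise without a GPU.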

        with_stack=False,
    ) as prof:
        for _ in range(num_iters):
            static_input.copy_(torch.randn_like(static_input))

Copilot AI Mar 28, 2026

The profiling loop includes torch.randn_like(static_input) and a copy_ before each g.replay(), which adds extra GPU kernels/memcpy events into the trace. Since _analyze_trace() computes occupancy/gaps over all GPU events, these extra ops can skew the metrics and reduce sensitivity to hipGraphLaunch overhead. Consider using an in-place input update (e.g., static_input.normal_()), or keep the input constant during profiling so the measured occupancy/gaps reflect graph replay itself.

Suggested change
-            static_input.copy_(torch.randn_like(static_input))
+            # Keep inputs constant during profiling so that measured GPU
+            # occupancy and gaps primarily reflect graph replay itself,
+            # without additional kernels/memcpy from input updates.

Comment on lines +231 to +234
    assert backend in (
        "roctracer",
        "unknown",
    ), f"Unexpected profiler backend: {backend}"

Copilot AI Mar 28, 2026

test_kineto_backend_is_roctracer currently treats a failed/indeterminate backend detection as pass (backend == 'unknown'). That means the test may succeed without actually verifying the intended regression guard condition. Consider pytest.skip on 'unknown' (with a clear reason), or make the assertion stricter when running on ROCm builds so CI can't silently miss the linkage check.
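
The distinction being asked for can be sketched as a three-way verdict instead of a two-way assertion. `backend_verdict` is a hypothetical helper, not code from the PR; it only shows that an inconclusive detection maps to a skip and never to a pass.

```python
def backend_verdict(backend: str) -> str:
    """Map a detected backend name to "pass", "skip", or "fail"."""
    if backend == "roctracer":
        return "pass"
    if backend == "unknown":
        # Detection was inconclusive: report a skip, not a silent pass.
        return "skip"
    # Anything else (for example "rocprofiler-sdk") is the regression.
    return "fail"
```

Inside the pytest test, a "skip" verdict would translate to `pytest.skip(...)` with a reason string, and a "fail" verdict to a failing assertion.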

Comment thread tests/test_profiler_regression.py Outdated
Comment on lines +19 to +38
import ctypes
import json
import os
import subprocess
import tempfile

import pytest
import torch
import torch.nn as nn


# ---------------------------------------------------------------------------
# Thresholds
# ---------------------------------------------------------------------------
GPU_OCCUPANCY_PASS = 0.90 # >90% = healthy graph replay
GPU_OCCUPANCY_FAIL = 0.80 # <80% = regression detected
HIPGRAPHLAUNCH_MAX_US = 150 # healthy ~50us, regressed ~316us
GAP_100US_MAX_COUNT = 5 # healthy = 0, regressed = 9+
KERNEL_COUNT_TOLERANCE = 0.20 # allow 20% variance from expected


Copilot AI Mar 28, 2026

There are a few unused items that should be removed or used to avoid confusion: ctypes is imported but never referenced, and KERNEL_COUNT_TOLERANCE is defined but not used anywhere in the test.

@valarLip valarLip merged commit f70484d into ROCm:main Mar 28, 2026
22 of 26 checks passed
Jasen2201 pushed a commit to Jasen2201/ATOM that referenced this pull request Apr 10, 2026
* test: add profiler regression guard for HIP graph replay

* style: fix black + ruff formatting

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>