test: add profiler regression guard for HIP graph replay by sunway513 · Pull Request #432 · ROCm/ATOM

sunway513 · 2026-03-28T02:02:59Z

Summary

- Adds a lightweight integration test (~3 seconds on MI355) that detects rocprofiler-sdk interception overhead in HIP graph replay.
- When PyTorch kineto links librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%.
- This directly impacts decode throughput in TP>1 serving scenarios.

What it checks

| Metric | Pass | Fail (regression) |
|---|---|---|
| `ldd libtorch_cpu.so` | `libroctracer64.so` | `librocprofiler-sdk.so` |
| GPU occupancy | > 80% | < 80% (healthy = 97%) |
| Gaps > 100us | <= 5 | > 5 (healthy = 0) |
| hipGraphLaunch avg | < 150us | > 150us (healthy = 50us) |
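
The three runtime rows of the table can be read as a simple threshold check. The following is a minimal sketch, not the test's actual code: `ReplayMetrics` and `find_regressions` are hypothetical names, and treating a value that sits exactly on a threshold as a failure is an assumption, since the table leaves that case undefined.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the table above.
GPU_OCCUPANCY_MIN = 0.80        # pass requires occupancy strictly above 80%
GAP_100US_MAX_COUNT = 5         # pass allows at most 5 gaps longer than 100us
HIPGRAPHLAUNCH_MAX_US = 150.0   # pass requires an average strictly below 150us


@dataclass
class ReplayMetrics:
    """Hypothetical container for the three runtime metrics."""
    gpu_occupancy: float          # fraction in [0, 1]
    gaps_over_100us: int          # number of inter-kernel gaps > 100us
    hipgraphlaunch_avg_us: float  # mean CPU time per hipGraphLaunch call


def find_regressions(m: ReplayMetrics) -> List[str]:
    """Return one message per violated threshold; an empty list means pass."""
    failures = []
    # Assumption: a value exactly on the threshold counts as a failure.
    if m.gpu_occupancy <= GPU_OCCUPANCY_MIN:
        failures.append(f"GPU occupancy {m.gpu_occupancy:.1%} is not above 80%")
    if m.gaps_over_100us > GAP_100US_MAX_COUNT:
        failures.append(f"{m.gaps_over_100us} gaps > 100us (max 5)")
    if m.hipgraphlaunch_avg_us >= HIPGRAPHLAUNCH_MAX_US:
        failures.append(
            f"hipGraphLaunch avg {m.hipgraphlaunch_avg_us:.0f}us is not below 150us"
        )
    return failures
```

With the healthy values quoted in the table (97%, 0 gaps, 50us) the function returns an empty list.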

Background

ROCm/pytorch#3056 (merged 2026-03-17) re-enabled rocprofiler-sdk in kineto after ROCm/pytorch#2579 had reverted it. The SDK's interception hooks add overhead to every HIP API call, which is particularly harmful for HIP graph replay where kernel execution times are small relative to dispatch overhead.

On a real model (Qwen3-0.6B, 28 layers, TP=1), the regression drops graph replay GPU occupancy from 89.5% to 9.0%, with hipGraphLaunch taking 103 ms in the regressed build versus 7 ms in the healthy one.

Test plan

- PASS on rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix (roctracer kineto)
- Correctly FAIL on rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (rocprofiler-sdk kineto)
- Runs via pytest (2 tests, 3.27s) and standalone python3 (3.8s)
- No dependency on ATOM or AITER; uses only PyTorch + standard transformer ops

🤖 Generated with Claude Code

Adds a lightweight integration test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance in the inference serving stack.

When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%. This directly impacts decode throughput in TP>1 serving scenarios.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new integration-style regression test to detect ROCm/PyTorch profiler (kineto) backend linkage and runtime overhead that can degrade HIP graph replay performance.

Changes:

- Add tests/test_profiler_regression.py with a small transformer graph-replay workload.
- Detect kineto linkage via ldd libtorch_cpu.so and analyze a torch.profiler chrome trace for occupancy/gap/hipGraphLaunch metrics.
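
The linkage check described above can be sketched roughly as follows. This is an illustration rather than the code in the PR: the function names are hypothetical, and reporting `rocprofiler-sdk` when both libraries appear in the `ldd` output is an assumption chosen so that the regression case wins.

```python
import subprocess


def classify_kineto_backend(ldd_output: str) -> str:
    """Map `ldd` output to a backend label based on the linked library name."""
    # Assumption: if both libraries show up, report the regressing one.
    if "librocprofiler-sdk" in ldd_output:
        return "rocprofiler-sdk"
    if "libroctracer64" in ldd_output:
        return "roctracer"
    return "unknown"


def detect_kineto_backend(lib_path: str) -> str:
    """Run `ldd` on a shared library and classify the profiler backend.

    Returns "unknown" when `ldd` is missing, times out, or fails.
    """
    try:
        result = subprocess.run(
            ["ldd", lib_path],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    if result.returncode != 0:
        return "unknown"
    return classify_kineto_backend(result.stdout)
```

In a typical PyTorch install, `lib_path` would likely be `os.path.join(os.path.dirname(torch.__file__), "lib", "libtorch_cpu.so")`, though the layout may differ for custom builds.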

Copilot · 2026-03-28T02:09:10Z

+def _profile_graph_replay(model, g, static_input, num_iters=GRAPH_ITERS):
+    """Profile graph replay with torch.profiler and return trace dict."""
+    with torch.profiler.profile(
+        activities=[
+            torch.profiler.ProfilerActivity.CPU,
+            torch.profiler.ProfilerActivity.CUDA,
+        ],
+        record_shapes=False,
+        with_stack=False,
+    ) as prof:
+        for _ in range(num_iters):
+            static_input.copy_(torch.randn_like(static_input))
+            g.replay()
+        torch.cuda.synchronize()
+
+    with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as f:
+        trace_path = f.name
+    prof.export_chrome_trace(trace_path)
+
+    with open(trace_path) as f:
+        trace = json.load(f)
+    os.unlink(trace_path)
+    return trace, prof


_profile_graph_replay accepts a model argument but doesn't use it, and returns (trace, prof) even though callers never use prof. Dropping the unused parameter and returning just trace (or returning prof only when needed) would simplify the API and avoid unused variables at call sites.
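
One way to act on this comment is to split the export step into a helper that takes only the profiler and returns only the trace. The sketch below is a possible shape, not the merged code: `export_trace_dict` is a hypothetical name, and the `try`/`finally` cleanup is an addition beyond what the review asks for.

```python
import json
import os
import tempfile


def export_trace_dict(prof) -> dict:
    """Export a profiler's chrome trace to a temp file and load it as a dict.

    `prof` is any object with an export_chrome_trace(path) method, such as a
    finished torch.profiler.profile instance.
    """
    fd, trace_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)  # the profiler reopens the path itself
    try:
        prof.export_chrome_trace(trace_path)
        with open(trace_path) as f:
            return json.load(f)
    finally:
        # Remove the temp file even if export or parsing raises.
        os.unlink(trace_path)
```

Callers would then write `trace = export_trace_dict(prof)` after the `with torch.profiler.profile(...)` block, with no unused `model` argument or `prof` return value.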

Copilot · 2026-03-28T02:09:10Z

+    if not torch.cuda.is_available():
+        print("SKIP: No GPU available")
+        exit(0)
+


The standalone runner uses the interactive-helper exit(...). In scripts/tests it's more reliable to use sys.exit(...) (after import sys) so the exit behavior is consistent even when site isn't imported.
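
A small sketch of the suggested pattern, with the GPU check passed in as a parameter so the example runs without PyTorch. The names `main` and `run` are hypothetical.

```python
import sys


def main(gpu_available: bool) -> int:
    """Return a process exit code instead of calling exit() directly."""
    if not gpu_available:
        print("SKIP: No GPU available")
        return 0
    # ... run the checks here and return non-zero on failure ...
    return 0


def run(gpu_available: bool) -> None:
    """Entry point: terminate with main()'s return code via sys.exit."""
    sys.exit(main(gpu_available))
```

In the standalone script this would be invoked as `run(torch.cuda.is_available())` under an `if __name__ == "__main__":` guard.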

Copilot · 2026-03-28T02:09:11Z

+        with_stack=False,
+    ) as prof:
+        for _ in range(num_iters):
+            static_input.copy_(torch.randn_like(static_input))


The profiling loop includes torch.randn_like(static_input) and a copy_ before each g.replay(), which adds extra GPU kernels/memcpy events into the trace. Since _analyze_trace() computes occupancy/gaps over all GPU events, these extra ops can skew the metrics and reduce sensitivity to hipGraphLaunch overhead. Consider using an in-place input update (e.g., static_input.normal_()), or keep the input constant during profiling so the measured occupancy/gaps reflect graph replay itself.

Suggested change

-            static_input.copy_(torch.randn_like(static_input))
+            # Keep inputs constant during profiling so that measured GPU
+            # occupancy and gaps primarily reflect graph replay itself,
+            # without additional kernels/memcpy from input updates.
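
To see why extra GPU events matter, it helps to look at how occupancy and gaps might be derived from a chrome trace. The PR's `_analyze_trace()` is not shown here, so the following is only a sketch under assumptions: GPU work is taken to be complete events (`"ph": "X"`) whose `cat` is one of `kernel`, `gpu_memcpy`, or `gpu_memset`, timestamps are in microseconds, and the runtime call appears under the exact name `hipGraphLaunch`. All function names are hypothetical.

```python
from typing import Dict, Sequence


def analyze_gpu_activity(
    trace: dict,
    gpu_categories: Sequence[str] = ("kernel", "gpu_memcpy", "gpu_memset"),
    gap_threshold_us: float = 100.0,
) -> Dict[str, float]:
    """Compute GPU occupancy and the number of long idle gaps from a trace."""
    intervals = sorted(
        (float(e["ts"]), float(e["ts"]) + float(e["dur"]))
        for e in trace.get("traceEvents", [])
        if e.get("ph") == "X" and e.get("cat") in gpu_categories
    )
    if not intervals:
        return {"occupancy": 0.0, "long_gaps": 0}

    busy_us = 0.0
    long_gaps = 0
    first_start = intervals[0][0]
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start > cur_end:
            # Idle gap: close the current busy span and start a new one.
            busy_us += cur_end - cur_start
            if start - cur_end > gap_threshold_us:
                long_gaps += 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or back-to-back events extend the busy span.
            cur_end = max(cur_end, end)
    busy_us += cur_end - cur_start

    span_us = cur_end - first_start
    occupancy = busy_us / span_us if span_us > 0 else 0.0
    return {"occupancy": occupancy, "long_gaps": long_gaps}


def avg_call_duration_us(trace: dict, name: str = "hipGraphLaunch") -> float:
    """Mean duration of all complete events with the given name (0.0 if none)."""
    durs = [
        float(e["dur"])
        for e in trace.get("traceEvents", [])
        if e.get("ph") == "X" and e.get("name") == name
    ]
    return sum(durs) / len(durs) if durs else 0.0
```

Because every matching GPU event contributes to the busy spans and gaps, kernels launched by `torch.randn_like` and `copy_` between replays would be counted alongside the graph's own kernels, which is the skew the comment describes.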

Copilot · 2026-03-28T02:09:11Z

+        assert backend in (
+            "roctracer",
+            "unknown",
+        ), f"Unexpected profiler backend: {backend}"


test_kineto_backend_is_roctracer currently treats a failed/indeterminate backend detection as pass (backend == 'unknown'). That means the test may succeed without actually verifying the intended regression guard condition. Consider pytest.skip on 'unknown' (with a clear reason), or make the assertion stricter when running on ROCm builds so CI can't silently miss the linkage check.
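
The two options in this comment can be expressed as a small decision function. This is a sketch with hypothetical names (`backend_verdict`, `require_detection`), kept free of pytest so it runs on its own.

```python
def backend_verdict(backend: str, require_detection: bool = False) -> str:
    """Decide the test outcome for a detected profiler backend.

    Returns "pass", "skip", or "fail". With require_detection=True (for
    example on ROCm CI), an indeterminate backend is a failure, not a skip.
    """
    if backend == "roctracer":
        return "pass"
    if backend == "unknown":
        return "fail" if require_detection else "skip"
    # Any other backend, including rocprofiler-sdk, is the regression case.
    return "fail"
```

Inside the pytest test, a "skip" verdict would map to `pytest.skip(reason=...)` and a "fail" verdict to a failed assertion, so an undetected backend no longer counts as a silent pass.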

Copilot · 2026-03-28T02:09:11Z

+import ctypes
+import json
+import os
+import subprocess
+import tempfile
+
+import pytest
+import torch
+import torch.nn as nn
+
+
+# ---------------------------------------------------------------------------
+# Thresholds
+# ---------------------------------------------------------------------------
+GPU_OCCUPANCY_PASS = 0.90  # >90% = healthy graph replay
+GPU_OCCUPANCY_FAIL = 0.80  # <80% = regression detected
+HIPGRAPHLAUNCH_MAX_US = 150  # healthy ~50us, regressed ~316us
+GAP_100US_MAX_COUNT = 5  # healthy = 0, regressed = 9+
+KERNEL_COUNT_TOLERANCE = 0.20  # allow 20% variance from expected
+


There are a few unused items that should be removed or used to avoid confusion: ctypes is imported but never referenced, and KERNEL_COUNT_TOLERANCE is defined but not used anywhere in the test.

Copilot AI review requested due to automatic review settings March 28, 2026 02:03

sunway513 mentioned this pull request Mar 28, 2026

test: add profiler regression guard for HIP graph replay ROCm/aiter#2511

Closed

4 tasks

Copilot started reviewing on behalf of sunway513 March 28, 2026 02:04

Copilot AI reviewed Mar 28, 2026

style: fix black + ruff formatting

6cc7f5b

valarLip approved these changes Mar 28, 2026

valarLip merged commit f70484d into ROCm:main Mar 28, 2026
22 of 26 checks passed

valarLip merged 2 commits into ROCm:main from sunway513:test/profiler-regression-guard
