test: add profiler regression guard for HIP graph replay by sunway513 · Pull Request #2511 · ROCm/aiter

sunway513 · 2026-03-27T17:51:29Z

Summary

Adds a lightweight regression test (~3 seconds on MI355) that detects rocprofiler-sdk interception overhead in HIP graph replay
When PyTorch kineto links librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%
This test catches the regression before it reaches production

What it checks

Metric	Pass	Fail (regression)
`ldd libtorch_cpu.so`	`libroctracer64.so`	`librocprofiler-sdk.so`
GPU occupancy	> 80%	< 80% (healthy = 97%)
Gaps > 100us	<= 5	> 5 (healthy = 0)
hipGraphLaunch avg	< 150us	> 150us (healthy = 50us)

Background

ROCm/pytorch#3056 (merged 2026-03-17) re-enabled rocprofiler-sdk in kineto after ROCm/pytorch#2579 had reverted it. The SDK's interception hooks add overhead to every HIP API call, which is particularly harmful for HIP graph replay where kernel execution times are small relative to dispatch overhead.

Test plan

PASS on rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix (roctracer kineto)
Correctly FAIL on rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (rocprofiler-sdk kineto)
Runs via pytest (2 tests, 3.27s) and standalone python3 (3.8s)
No dependency on aiter kernels — uses only PyTorch + standard transformer ops

🤖 Generated with Claude Code

Adds a lightweight test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance. When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%. Test checks: - libtorch_cpu.so links roctracer (not rocprofiler-sdk) - HIP graph replay GPU occupancy > 80% (healthy = 97%) - Inter-kernel gaps > 100us count <= 5 (healthy = 0) - hipGraphLaunch CPU time < 150us (healthy = 50us) Uses a 4-layer mini transformer (BS=8, seq=1, FP16) to amplify dispatch overhead. Runs standalone or via pytest. Validated on MI355 with: - PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix - FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock) Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-27T17:52:05Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-355`	Run Triton tests on MI355 in addition to MI325
`ci:sglang`	SGLang integration tests
`ci:atom`	ATOM benchmark (DeepSeek-R1 + GPT-OSS)
`ci:vllm`	vLLM benchmark
`ci:all`	All of the above

Add labels via the sidebar or gh pr edit 2511 --add-label <label>

Copilot

Pull request overview

Adds a new regression test to detect performance degradation in HIP graph replay caused by PyTorch kineto linking against librocprofiler-sdk.so (profiler interception overhead) instead of libroctracer64.so.

Changes:

Add op_tests/test_profiler_regression.py with:
- A linkage check (ldd libtorch_cpu.so) to detect rocprofiler-sdk vs roctracer.
- A HIP graph replay workload + torch.profiler trace analysis to assert minimum GPU occupancy / gap / hipGraphLaunch timing thresholds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-27T17:57:42Z

+    @pytest.mark.skipif(not torch.cuda.is_available(), reason="No GPU available")
+    def test_hipgraph_replay_occupancy(self):
+        """HIP graph replay should have >90% GPU occupancy under profiler."""
+        device = "cuda:0"


torch.cuda.set_device() expects an int device index or a torch.device, but here device is a string ("cuda:0"). This will raise TypeError on typical PyTorch builds and cause the test to fail before running any assertions. Use torch.device("cuda:0") (or 0) consistently with the other tests in this repo.

Suggested change

device = "cuda:0"

device = torch.device("cuda:0")

Copilot · 2026-03-27T17:57:42Z

+        print("SKIP: No GPU available")
+        exit(0)
+
+    device = "cuda:0"


Same issue as above: torch.cuda.set_device() is passed a string ("cuda:0"), which can raise TypeError and break standalone execution. Use a torch.device or integer index.

Suggested change

device = "cuda:0"

device = torch.device("cuda:0")

Copilot · 2026-03-27T17:57:43Z

+    def __init__(self, hidden=HIDDEN, heads=HEADS):
+        super().__init__()


_TransformerBlock.__init__ takes a heads parameter, but the module never stores/uses it; forward() instead uses the global HEADS constant. This is easy to trip over if someone tries to instantiate the block with a different head count; prefer storing self.heads = heads and using it in the reshape logic (or remove the parameter).

Copilot · 2026-03-27T17:57:43Z

+Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056
+"""
+
+import ctypes


Unused import: ctypes is never referenced in this file. Please remove it to keep the test minimal and avoid confusion about any intended low-level HIP interaction.

Suggested change

import ctypes

Copilot · 2026-03-27T17:57:43Z

+GPU_OCCUPANCY_FAIL = 0.80  # <80% = regression detected
+HIPGRAPHLAUNCH_MAX_US = 150  # healthy ~50us, regressed ~316us
+GAP_100US_MAX_COUNT = 5  # healthy = 0, regressed = 9+
+KERNEL_COUNT_TOLERANCE = 0.20  # allow 20% variance from expected


KERNEL_COUNT_TOLERANCE is defined but never used. Either remove it, or add the intended kernel-count assertion so the constant has a purpose (otherwise it looks like an incomplete check).

Suggested change

KERNEL_COUNT_TOLERANCE = 0.20 # allow 20% variance from expected

Copilot · 2026-03-27T17:57:43Z

+    with open(trace_path) as f:
+        trace = json.load(f)
+    os.unlink(trace_path)
+    return trace, prof


_profile_graph_replay() doesn't use its model parameter, and the returned prof object is unused by both callers. Consider dropping the unused parameter and either returning only trace or using _ at the call sites to avoid dead assignments.

Suggested change

return trace, prof

return trace

Copilot · 2026-03-27T17:57:43Z

+
+    @pytest.mark.skipif(not torch.cuda.is_available(), reason="No GPU available")
+    def test_hipgraph_replay_occupancy(self):
+        """HIP graph replay should have >90% GPU occupancy under profiler."""


This docstring says the replay "should have >90% GPU occupancy", but the test only fails below 80% and merely warns below 90%. Please adjust the wording to match the actual pass/fail criteria (or tighten the assertion) to avoid confusion when diagnosing failures.

Suggested change

"""HIP graph replay should have >90% GPU occupancy under profiler."""

"""HIP graph replay should maintain high GPU occupancy (fails <80%, warns <90%)."""

Copilot · 2026-03-27T17:57:44Z

+    if not torch.cuda.is_available():
+        print("SKIP: No GPU available")
+        exit(0)
+


In the standalone runner, exit(...) relies on site injecting exit and isn't guaranteed in all Python invocation modes (e.g., python -S). Prefer import sys and sys.exit(...) for robustness.

sunway513 · 2026-03-28T02:04:14Z

Moving to ROCm/ATOM#432 — this is an integration-level regression test, better suited under ATOM.

sunway513 requested review from a team and Copilot March 27, 2026 17:51

Copilot started reviewing on behalf of sunway513 March 27, 2026 17:53 View session

Copilot AI reviewed Mar 27, 2026

View reviewed changes

sunway513 closed this Mar 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add profiler regression guard for HIP graph replay#2511

test: add profiler regression guard for HIP graph replay#2511
sunway513 wants to merge 1 commit into
ROCm:mainfrom
sunway513:test/profiler-regression-guard

sunway513 commented Mar 27, 2026

Uh oh!

github-actions Bot commented Mar 27, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

Copilot AI Mar 27, 2026

Uh oh!

sunway513 commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		def __init__(self, hidden=HIDDEN, heads=HEADS):
		super().__init__()

	"""HIP graph replay should have >90% GPU occupancy under profiler."""
	"""HIP graph replay should maintain high GPU occupancy (fails <80%, warns <90%)."""

Conversation

sunway513 commented Mar 27, 2026

Summary

What it checks

Background

Test plan

Uh oh!

github-actions Bot commented Mar 27, 2026

🏷️ CI Guide

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

sunway513 commented Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants