Skip to content

test: add profiler regression guard for HIP graph replay#2511

Closed
sunway513 wants to merge 1 commit into
ROCm:mainfrom
sunway513:test/profiler-regression-guard
Closed

test: add profiler regression guard for HIP graph replay#2511
sunway513 wants to merge 1 commit into
ROCm:mainfrom
sunway513:test/profiler-regression-guard

Conversation

@sunway513
Copy link
Copy Markdown
Collaborator

Summary

  • Adds a lightweight regression test (~3 seconds on MI355) that detects rocprofiler-sdk interception overhead in HIP graph replay
  • When PyTorch kineto links librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%
  • This test catches the regression before it reaches production

What it checks

Metric Pass Fail (regression)
ldd libtorch_cpu.so libroctracer64.so librocprofiler-sdk.so
GPU occupancy > 80% < 80% (healthy = 97%)
Gaps > 100us <= 5 > 5 (healthy = 0)
hipGraphLaunch avg < 150us > 150us (healthy = 50us)

Background

ROCm/pytorch#3056 (merged 2026-03-17) re-enabled rocprofiler-sdk in kineto after ROCm/pytorch#2579 had reverted it. The SDK's interception hooks add overhead to every HIP API call, which is particularly harmful for HIP graph replay where kernel execution times are small relative to dispatch overhead.

Test plan

  • PASS on rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix (roctracer kineto)
  • Correctly FAIL on rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (rocprofiler-sdk kineto)
  • Runs via pytest (2 tests, 3.27s) and standalone python3 (3.8s)
  • No dependency on aiter kernels — uses only PyTorch + standard transformer ops

🤖 Generated with Claude Code

Adds a lightweight test (~3s) that detects rocprofiler-sdk interception
overhead which degrades HIP graph replay performance. When PyTorch kineto
is linked against librocprofiler-sdk.so instead of libroctracer64.so,
every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy
from ~97% to ~75%.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Uses a 4-layer mini transformer (BS=8, seq=1, FP16) to amplify
dispatch overhead. Runs standalone or via pytest.

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sunway513 sunway513 requested review from a team and Copilot March 27, 2026 17:51
@github-actions
Copy link
Copy Markdown
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label Tests
ci:triton-355 Run Triton tests on MI355 in addition to MI325
ci:sglang SGLang integration tests
ci:atom ATOM benchmark (DeepSeek-R1 + GPT-OSS)
ci:vllm vLLM benchmark
ci:all All of the above

Add labels via the sidebar or gh pr edit 2511 --add-label <label>

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new regression test to detect performance degradation in HIP graph replay caused by PyTorch kineto linking against librocprofiler-sdk.so (profiler interception overhead) instead of libroctracer64.so.

Changes:

  • Add op_tests/test_profiler_regression.py with:
    • A linkage check (ldd libtorch_cpu.so) to detect rocprofiler-sdk vs roctracer.
    • A HIP graph replay workload + torch.profiler trace analysis to assert minimum GPU occupancy / gap / hipGraphLaunch timing thresholds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@pytest.mark.skipif(not torch.cuda.is_available(), reason="No GPU available")
def test_hipgraph_replay_occupancy(self):
"""HIP graph replay should have >90% GPU occupancy under profiler."""
device = "cuda:0"
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

torch.cuda.set_device() expects an int device index or a torch.device, but here device is a string ("cuda:0"). This will raise TypeError on typical PyTorch builds and cause the test to fail before running any assertions. Use torch.device("cuda:0") (or 0) consistently with the other tests in this repo.

Suggested change
device = "cuda:0"
device = torch.device("cuda:0")

Copilot uses AI. Check for mistakes.
print("SKIP: No GPU available")
exit(0)

device = "cuda:0"
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above: torch.cuda.set_device() is passed a string ("cuda:0"), which can raise TypeError and break standalone execution. Use a torch.device or integer index.

Suggested change
device = "cuda:0"
device = torch.device("cuda:0")

Copilot uses AI. Check for mistakes.
Comment on lines +52 to +53
def __init__(self, hidden=HIDDEN, heads=HEADS):
super().__init__()
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_TransformerBlock.__init__ takes a heads parameter, but the module never stores/uses it; forward() instead uses the global HEADS constant. This is easy to trip over if someone tries to instantiate the block with a different head count; prefer storing self.heads = heads and using it in the reshape logic (or remove the parameter).

Copilot uses AI. Check for mistakes.
Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056
"""

import ctypes
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused import: ctypes is never referenced in this file. Please remove it to keep the test minimal and avoid confusion about any intended low-level HIP interaction.

Suggested change
import ctypes

Copilot uses AI. Check for mistakes.
GPU_OCCUPANCY_FAIL = 0.80 # <80% = regression detected
HIPGRAPHLAUNCH_MAX_US = 150 # healthy ~50us, regressed ~316us
GAP_100US_MAX_COUNT = 5 # healthy = 0, regressed = 9+
KERNEL_COUNT_TOLERANCE = 0.20 # allow 20% variance from expected
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KERNEL_COUNT_TOLERANCE is defined but never used. Either remove it, or add the intended kernel-count assertion so the constant has a purpose (otherwise it looks like an incomplete check).

Suggested change
KERNEL_COUNT_TOLERANCE = 0.20 # allow 20% variance from expected

Copilot uses AI. Check for mistakes.
with open(trace_path) as f:
trace = json.load(f)
os.unlink(trace_path)
return trace, prof
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_profile_graph_replay() doesn't use its model parameter, and the returned prof object is unused by both callers. Consider dropping the unused parameter and either returning only trace or using _ at the call sites to avoid dead assignments.

Suggested change
return trace, prof
return trace

Copilot uses AI. Check for mistakes.

@pytest.mark.skipif(not torch.cuda.is_available(), reason="No GPU available")
def test_hipgraph_replay_occupancy(self):
"""HIP graph replay should have >90% GPU occupancy under profiler."""
Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This docstring says the replay "should have >90% GPU occupancy", but the test only fails below 80% and merely warns below 90%. Please adjust the wording to match the actual pass/fail criteria (or tighten the assertion) to avoid confusion when diagnosing failures.

Suggested change
"""HIP graph replay should have >90% GPU occupancy under profiler."""
"""HIP graph replay should maintain high GPU occupancy (fails <80%, warns <90%)."""

Copilot uses AI. Check for mistakes.
Comment on lines +308 to +311
if not torch.cuda.is_available():
print("SKIP: No GPU available")
exit(0)

Copy link

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the standalone runner, exit(...) relies on site injecting exit and isn't guaranteed in all Python invocation modes (e.g., python -S). Prefer import sys and sys.exit(...) for robustness.

Copilot uses AI. Check for mistakes.
@sunway513
Copy link
Copy Markdown
Collaborator Author

Moving to ROCm/ATOM#432 — this is an integration-level regression test, better suited under ATOM.

@sunway513 sunway513 closed this Mar 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants