test: add profiler regression guard for HIP graph replay #2511
Adds a lightweight test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance. When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Uses a 4-layer mini transformer (BS=8, seq=1, FP16) to amplify dispatch overhead. Runs standalone or via pytest.

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
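The three trace-based thresholds in the commit message could be evaluated roughly as follows. This is a minimal sketch, not the code from the PR: the function names are hypothetical, and it assumes the exported trace is a list of Chrome-trace "X" events with "ts"/"dur" in microseconds, kernels tagged with cat == "kernel", and launch time taken as the mean hipGraphLaunch duration.

```python
# Thresholds taken from the commit message; everything else is an assumption.
GPU_OCCUPANCY_MIN = 0.80      # occupancy must be > 80%
GAP_THRESHOLD_US = 100        # a "large" inter-kernel gap
GAP_MAX_COUNT = 5             # at most 5 large gaps
HIPGRAPHLAUNCH_MAX_US = 150   # hipGraphLaunch CPU time must be < 150us


def replay_metrics(events):
    """Compute occupancy, large-gap count, and mean hipGraphLaunch time."""
    complete = [e for e in events if e.get("ph") == "X"]
    kernels = sorted(
        (e for e in complete if e.get("cat") == "kernel"),
        key=lambda e: e["ts"],
    )
    launches = [e["dur"] for e in complete if e.get("name") == "hipGraphLaunch"]
    if not kernels or not launches:
        raise ValueError("trace has no kernel or hipGraphLaunch events")

    start = kernels[0]["ts"]
    end = max(e["ts"] + e["dur"] for e in kernels)
    if end <= start:
        raise ValueError("kernel events span zero time")

    # Assumes kernels run back to back on one stream (no overlap).
    busy = sum(e["dur"] for e in kernels)
    gaps = [
        nxt["ts"] - (cur["ts"] + cur["dur"])
        for cur, nxt in zip(kernels, kernels[1:])
    ]
    return {
        "occupancy": busy / (end - start),
        "large_gaps": sum(1 for g in gaps if g > GAP_THRESHOLD_US),
        "launch_us": sum(launches) / len(launches),
    }


def is_healthy(metrics):
    """Apply the pass/fail thresholds listed in the commit message."""
    return (
        metrics["occupancy"] > GPU_OCCUPANCY_MIN
        and metrics["large_gaps"] <= GAP_MAX_COUNT
        and metrics["launch_us"] < HIPGRAPHLAUNCH_MAX_US
    )
```

Keeping the metric computation separate from the threshold check makes it possible to print the measured values on failure, which helps when deciding whether a failure is a real regression or a noisy run.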
Pull request overview
Adds a new regression test to detect performance degradation in HIP graph replay caused by PyTorch kineto linking against librocprofiler-sdk.so (profiler interception overhead) instead of libroctracer64.so.
Changes:
- Add op_tests/test_profiler_regression.py with:
  - A linkage check (ldd libtorch_cpu.so) to detect rocprofiler-sdk vs roctracer.
  - A HIP graph replay workload + torch.profiler trace analysis to assert minimum GPU occupancy / gap / hipGraphLaunch timing thresholds.
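The linkage check described above can be sketched as a small classifier over ldd output. This is an illustration under assumptions, not the PR's code: classify_kineto_linkage and ldd_output are hypothetical names, and matching on the library file names is a simplification.

```python
import subprocess


def classify_kineto_linkage(ldd_text):
    """Classify which profiler backend appears in `ldd` output.

    Returns "roctracer", "rocprofiler-sdk", "both", or "unknown".
    """
    has_roctracer = "libroctracer64.so" in ldd_text
    has_sdk = "librocprofiler-sdk.so" in ldd_text
    if has_roctracer and has_sdk:
        return "both"
    if has_sdk:
        return "rocprofiler-sdk"
    if has_roctracer:
        return "roctracer"
    return "unknown"


def ldd_output(library_path):
    """Run `ldd` on a shared library and return its stdout (Linux only)."""
    result = subprocess.run(
        ["ldd", library_path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A test would then assert that classify_kineto_linkage(ldd_output(path)) equals "roctracer", where path points at libtorch_cpu.so (typically under the lib directory next to torch/__init__.py, though that layout is an assumption here).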
| @pytest.mark.skipif(not torch.cuda.is_available(), reason="No GPU available") | ||
| def test_hipgraph_replay_occupancy(self): | ||
| """HIP graph replay should have >90% GPU occupancy under profiler.""" | ||
| device = "cuda:0" |
torch.cuda.set_device() is documented as taking an int device index or a torch.device, but here device is a string ("cuda:0"). Recent PyTorch releases appear to accept a device string as well, so this may not raise TypeError everywhere, but relying on it goes beyond the documented signature and could fail the test before any assertion runs on builds that reject it. Use torch.device("cuda:0") (or 0) consistently with the other tests in this repo.
| device = "cuda:0" | |
| device = torch.device("cuda:0") |
| print("SKIP: No GPU available") | ||
| exit(0) | ||
|
|
||
| device = "cuda:0" |
Same issue as above: torch.cuda.set_device() is passed a string ("cuda:0"), which can raise TypeError and break standalone execution. Use a torch.device or integer index.
| device = "cuda:0" | |
| device = torch.device("cuda:0") |
| def __init__(self, hidden=HIDDEN, heads=HEADS): | ||
| super().__init__() |
_TransformerBlock.__init__ takes a heads parameter, but the module never stores/uses it; forward() instead uses the global HEADS constant. This is easy to trip over if someone tries to instantiate the block with a different head count; prefer storing self.heads = heads and using it in the reshape logic (or remove the parameter).
| Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056 | ||
| """ | ||
|
|
||
| import ctypes |
Unused import: ctypes is never referenced in this file. Please remove it to keep the test minimal and avoid confusion about any intended low-level HIP interaction.
| import ctypes |
| GPU_OCCUPANCY_FAIL = 0.80 # <80% = regression detected | ||
| HIPGRAPHLAUNCH_MAX_US = 150 # healthy ~50us, regressed ~316us | ||
| GAP_100US_MAX_COUNT = 5 # healthy = 0, regressed = 9+ | ||
| KERNEL_COUNT_TOLERANCE = 0.20 # allow 20% variance from expected |
KERNEL_COUNT_TOLERANCE is defined but never used. Either remove it, or add the intended kernel-count assertion so the constant has a purpose (otherwise it looks like an incomplete check).
| KERNEL_COUNT_TOLERANCE = 0.20 # allow 20% variance from expected |
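If the constant is kept, the missing kernel-count assertion might look like the sketch below. The helper name and the idea of comparing against an expected count are assumptions; the PR does not show what the intended check was.

```python
KERNEL_COUNT_TOLERANCE = 0.20  # allow 20% variance from expected


def kernel_count_within_tolerance(actual, expected,
                                  tolerance=KERNEL_COUNT_TOLERANCE):
    """Return True if `actual` is within `tolerance` (relative) of `expected`."""
    if expected <= 0:
        raise ValueError("expected kernel count must be positive")
    return abs(actual - expected) <= tolerance * expected
```

In the test this would be used as an assert on the number of kernel events found in the trace, with the expected count derived from the model (layers times kernels per layer times replays, for example).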
| with open(trace_path) as f: | ||
| trace = json.load(f) | ||
| os.unlink(trace_path) | ||
| return trace, prof |
_profile_graph_replay() doesn't use its model parameter, and the returned prof object is unused by both callers. Consider dropping the unused parameter and either returning only trace or using _ at the call sites to avoid dead assignments.
| return trace, prof | |
| return trace |
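A torch-free sketch of the tail of the helper after this suggestion, returning only the parsed trace. The function name load_and_remove_trace is hypothetical, and the try/finally is an addition not present in the original hunk: it removes the temporary file even when the JSON fails to parse.

```python
import json
import os


def load_and_remove_trace(trace_path):
    """Parse an exported Chrome trace and delete the file afterwards."""
    try:
        with open(trace_path) as f:
            return json.load(f)
    finally:
        # Runs after the file is closed, on success and on parse errors alike.
        os.unlink(trace_path)
```

Call sites then become trace = load_and_remove_trace(path), with no unused second return value to discard.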
|
|
||
| @pytest.mark.skipif(not torch.cuda.is_available(), reason="No GPU available") | ||
| def test_hipgraph_replay_occupancy(self): | ||
| """HIP graph replay should have >90% GPU occupancy under profiler.""" |
This docstring says the replay "should have >90% GPU occupancy", but the test only fails below 80% and merely warns below 90%. Please adjust the wording to match the actual pass/fail criteria (or tighten the assertion) to avoid confusion when diagnosing failures.
| """HIP graph replay should have >90% GPU occupancy under profiler.""" | |
| """HIP graph replay should maintain high GPU occupancy (fails <80%, warns <90%).""" |
| if not torch.cuda.is_available(): | ||
| print("SKIP: No GPU available") | ||
| exit(0) | ||
|
|
In the standalone runner, exit(...) relies on site injecting exit and isn't guaranteed in all Python invocation modes (e.g., python -S). Prefer import sys and sys.exit(...) for robustness.
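One way to structure the standalone runner around sys.exit is sketched below. The function names and the run_checks callable are hypothetical stand-ins for the PR's actual checks; mapping a skip to exit code 0 follows the exit(0) in the hunk above.

```python
import sys


def exit_code(gpu_available, run_checks):
    """Map the runner outcome to a process exit code (0 = pass or skip)."""
    if not gpu_available:
        print("SKIP: No GPU available")
        return 0
    return 0 if run_checks() else 1


def run_standalone(gpu_available, run_checks):
    """Exit via sys.exit, which does not depend on the site module."""
    sys.exit(exit_code(gpu_available, run_checks))
```

Under the usual main guard, the script would call run_standalone(torch.cuda.is_available(), checks). Computing the code in a plain function keeps the skip/pass/fail logic testable without terminating the interpreter.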
Moving to ROCm/ATOM#432: this is an integration-level regression test, better suited under ATOM.
Summary
- When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%

What it checks
- ldd libtorch_cpu.so: libroctracer64.so (healthy) vs librocprofiler-sdk.so (regressed)

Background
ROCm/pytorch#3056 (merged 2026-03-17) re-enabled rocprofiler-sdk in kineto after ROCm/pytorch#2579 had reverted it. The SDK's interception hooks add overhead to every HIP API call, which is particularly harmful for HIP graph replay where kernel execution times are small relative to dispatch overhead.
Test plan
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix (roctracer kineto)
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (rocprofiler-sdk kineto)
- pytest (2 tests, 3.27s) and standalone python3 (3.8s)

🤖 Generated with Claude Code