test: add profiler regression guard for HIP graph replay #432
Adds a lightweight integration test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance in the inference serving stack.

When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%. This directly impacts decode throughput in TP>1 serving scenarios.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
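The three runtime checks listed in the description can be read as a simple threshold gate. The following is a minimal sketch of that gate, not the test's actual code: the `ReplayMetrics` container and `find_regressions` helper are hypothetical names introduced for illustration, and only the threshold values come from the description.

```python
from dataclasses import dataclass

# Thresholds taken from the PR description; names are illustrative.
GPU_OCCUPANCY_MIN = 0.80        # occupancy must be strictly above this
GAP_100US_MAX_COUNT = 5         # at most this many inter-kernel gaps > 100us
HIPGRAPHLAUNCH_MAX_US = 150.0   # hipGraphLaunch CPU time must stay below this


@dataclass
class ReplayMetrics:
    """Hypothetical container for metrics derived from a profiler trace."""
    occupancy: float          # fraction of the traced span with the GPU busy
    gaps_over_100us: int      # number of idle gaps longer than 100us
    hipgraphlaunch_us: float  # CPU time per hipGraphLaunch call, in us


def find_regressions(m: ReplayMetrics) -> list[str]:
    """Return one message per violated threshold (empty list means healthy)."""
    failures = []
    if m.occupancy <= GPU_OCCUPANCY_MIN:
        failures.append(f"GPU occupancy {m.occupancy:.1%} <= {GPU_OCCUPANCY_MIN:.0%}")
    if m.gaps_over_100us > GAP_100US_MAX_COUNT:
        failures.append(f"{m.gaps_over_100us} gaps > 100us (max {GAP_100US_MAX_COUNT})")
    if m.hipgraphlaunch_us >= HIPGRAPHLAUNCH_MAX_US:
        failures.append(f"hipGraphLaunch {m.hipgraphlaunch_us:.0f}us >= {HIPGRAPHLAUNCH_MAX_US:.0f}us")
    return failures
```

With the healthy figures quoted above (97% occupancy, 0 gaps, 50us) the list is empty. A regressed run at ~75% occupancy fails at least the occupancy check.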
Pull request overview
Adds a new integration-style regression test to detect ROCm/PyTorch profiler (kineto) backend linkage and runtime overhead that can degrade HIP graph replay performance.
Changes:
- Add `tests/test_profiler_regression.py` with a small transformer graph-replay workload.
- Detect kineto linkage via `ldd libtorch_cpu.so` and analyze a `torch.profiler` chrome trace for occupancy/gap/hipGraphLaunch metrics.
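The linkage check described above comes down to looking for one of two library names in the `ldd` output. This is a rough sketch under that assumption. `classify_kineto_backend` and `run_ldd` are hypothetical helpers, not the functions in the PR, and the sample `ldd` lines in the usage note are made up for illustration.

```python
import subprocess


def classify_kineto_backend(ldd_output: str) -> str:
    """Classify the profiler backend from the text printed by `ldd`.

    rocprofiler-sdk is checked first because its presence is the
    regression condition, even if roctracer also appears.
    """
    if "librocprofiler-sdk" in ldd_output:
        return "rocprofiler-sdk"
    if "libroctracer64" in ldd_output:
        return "roctracer"
    return "unknown"


def run_ldd(library_path: str) -> str:
    """Run `ldd` on a shared library and return its stdout ("" on failure)."""
    try:
        result = subprocess.run(
            ["ldd", library_path], capture_output=True, text=True, check=False
        )
    except OSError:
        return ""  # ldd not installed or not executable
    return result.stdout if result.returncode == 0 else ""
```

For example, `classify_kineto_backend("libroctracer64.so.4 => /opt/rocm/lib/libroctracer64.so.4")` yields `"roctracer"`, and empty output yields `"unknown"`.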
```python
def _profile_graph_replay(model, g, static_input, num_iters=GRAPH_ITERS):
    """Profile graph replay with torch.profiler and return trace dict."""
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        record_shapes=False,
        with_stack=False,
    ) as prof:
        for _ in range(num_iters):
            static_input.copy_(torch.randn_like(static_input))
            g.replay()
        torch.cuda.synchronize()

    with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as f:
        trace_path = f.name
    prof.export_chrome_trace(trace_path)

    with open(trace_path) as f:
        trace = json.load(f)
    os.unlink(trace_path)
    return trace, prof
```
`_profile_graph_replay` accepts a `model` argument but doesn't use it, and returns `(trace, prof)` even though callers never use `prof`. Dropping the unused parameter and returning just `trace` (or returning `prof` only when needed) would simplify the API and avoid unused variables at call sites.
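One way to act on this comment is to split the trace export into a helper that returns only the parsed trace. The sketch below assumes only that the profiler object exposes `export_chrome_trace(path)`, as `torch.profiler.profile` does. `export_trace_dict` is a hypothetical name, and the temporary directory is a design choice of this sketch, not part of the PR.

```python
import json
import os
import tempfile


def export_trace_dict(prof) -> dict:
    """Export a chrome trace from `prof` and return it as a parsed dict.

    `prof` is any object with an `export_chrome_trace(path)` method.
    The temporary directory is removed even if export or parsing raises,
    so no stray trace file is left behind.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        trace_path = os.path.join(tmp_dir, "trace.json")
        prof.export_chrome_trace(trace_path)
        with open(trace_path) as f:
            return json.load(f)
```

A caller would then write `trace = export_trace_dict(prof)` after the profiling context exits, with no unused `prof` return value at the call site.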
```python
if not torch.cuda.is_available():
    print("SKIP: No GPU available")
    exit(0)
```
The standalone runner uses the interactive-helper `exit(...)`. In scripts/tests it's more reliable to use `sys.exit(...)` (after `import sys`) so the exit behavior is consistent even when `site` isn't imported.
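A minimal sketch of the suggested change follows. `exit_if_no_gpu` is a hypothetical helper that takes the availability flag as a parameter so the example runs without a GPU or PyTorch installed.

```python
import sys


def exit_if_no_gpu(gpu_available: bool) -> None:
    """Exit with status 0 (treated as a skip) when no GPU is present.

    sys.exit raises SystemExit directly, so it does not depend on the
    `exit` helper that the `site` module installs.
    """
    if not gpu_available:
        print("SKIP: No GPU available")
        sys.exit(0)
```

In the standalone runner this would be called as `exit_if_no_gpu(torch.cuda.is_available())`.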
```python
        with_stack=False,
    ) as prof:
        for _ in range(num_iters):
            static_input.copy_(torch.randn_like(static_input))
```
The profiling loop includes `torch.randn_like(static_input)` and a `copy_` before each `g.replay()`, which adds extra GPU kernels/memcpy events into the trace. Since `_analyze_trace()` computes occupancy/gaps over all GPU events, these extra ops can skew the metrics and reduce sensitivity to `hipGraphLaunch` overhead. Consider using an in-place input update (e.g., `static_input.normal_()`), or keep the input constant during profiling so the measured occupancy/gaps reflect graph replay itself.
```diff
-            static_input.copy_(torch.randn_like(static_input))
+            # Keep inputs constant during profiling so that measured GPU
+            # occupancy and gaps primarily reflect graph replay itself,
+            # without additional kernels/memcpy from input updates.
```
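To see why extra kernels and copies matter, it helps to spell out one plausible definition of the two metrics. The body of `_analyze_trace()` is not shown in this thread, so the following is an assumed reconstruction. Occupancy is taken as merged GPU-busy time divided by the span from first GPU event start to last GPU event end, and a gap is an idle stretch between merged busy intervals. The category names in `GPU_CATEGORIES` are also an assumption about how kineto labels GPU events.

```python
# Assumed category labels for GPU-side events in a kineto chrome trace.
GPU_CATEGORIES = {"kernel", "gpu_memcpy", "gpu_memset"}


def gpu_occupancy_and_gaps(events, gap_threshold_us=100.0):
    """Return (occupancy, long_gap_count) over complete ("ph" == "X") events.

    `ts` and `dur` are in microseconds, as in the chrome trace format.
    """
    intervals = sorted(
        (e["ts"], e["ts"] + e["dur"])
        for e in events
        if e.get("ph") == "X" and e.get("cat") in GPU_CATEGORIES
    )
    if not intervals:
        return 0.0, 0

    busy = 0.0
    long_gaps = 0
    first_start = intervals[0][0]
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start > cur_end:
            # Idle stretch: close the current busy interval and count the gap.
            busy += cur_end - cur_start
            if start - cur_end > gap_threshold_us:
                long_gaps += 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent event: extend the busy interval.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start

    span = cur_end - first_start
    return (busy / span if span > 0 else 0.0), long_gaps
```

Under this definition every extra `randn_like` kernel or `copy_` memcpy in the loop adds both busy time and potential gap boundaries, so the result no longer isolates the replay itself.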
```python
assert backend in (
    "roctracer",
    "unknown",
), f"Unexpected profiler backend: {backend}"
```
`test_kineto_backend_is_roctracer` currently treats a failed/indeterminate backend detection as pass (`backend == 'unknown'`). That means the test may succeed without actually verifying the intended regression guard condition. Consider `pytest.skip` on `'unknown'` (with a clear reason), or make the assertion stricter when running on ROCm builds so CI can't silently miss the linkage check.
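The three outcomes this comment distinguishes can be made explicit with a small decision function. This is only a sketch: `backend_verdict` is a hypothetical helper, and the backend strings are those used in the snippet above.

```python
def backend_verdict(backend: str) -> str:
    """Map a detected profiler backend to "pass", "skip", or "fail".

    "unknown" is reported as a skip rather than a pass, so an
    indeterminate detection is visible in the test report.
    """
    if backend == "roctracer":
        return "pass"
    if backend == "unknown":
        return "skip"
    return "fail"
```

Inside the test, a `"skip"` verdict would translate to `pytest.skip("could not determine kineto backend")` and a `"fail"` verdict to the existing assertion failure.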
```python
import ctypes
import json
import os
import subprocess
import tempfile

import pytest
import torch
import torch.nn as nn


# ---------------------------------------------------------------------------
# Thresholds
# ---------------------------------------------------------------------------
GPU_OCCUPANCY_PASS = 0.90  # >90% = healthy graph replay
GPU_OCCUPANCY_FAIL = 0.80  # <80% = regression detected
HIPGRAPHLAUNCH_MAX_US = 150  # healthy ~50us, regressed ~316us
GAP_100US_MAX_COUNT = 5  # healthy = 0, regressed = 9+
KERNEL_COUNT_TOLERANCE = 0.20  # allow 20% variance from expected
```
There are a few unused items that should be removed or used to avoid confusion: `ctypes` is imported but never referenced, and `KERNEL_COUNT_TOLERANCE` is defined but not used anywhere in the test.
* test: add profiler regression guard for HIP graph replay
* style: fix black + ruff formatting

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- When PyTorch kineto is linked against `librocprofiler-sdk.so` instead of `libroctracer64.so`, every `hipGraphLaunch` incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%

What it checks
- `ldd libtorch_cpu.so` shows `libroctracer64.so`, not `librocprofiler-sdk.so`

Background
ROCm/pytorch#3056 (merged 2026-03-17) re-enabled rocprofiler-sdk in kineto after ROCm/pytorch#2579 had reverted it. The SDK's interception hooks add overhead to every HIP API call, which is particularly harmful for HIP graph replay where kernel execution times are small relative to dispatch overhead.
On a real model (Qwen3-0.6B, 28 layers, TP=1), the regression drops graph replay GPU occupancy from 89.5% to 9.0%, with hipGraphLaunch taking 103ms vs 7ms.
Test plan
- PASS: `rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix` (roctracer kineto)
- FAIL: `rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326` (rocprofiler-sdk kineto)
- `pytest` (2 tests, 3.27s) and standalone `python3` (3.8s)

🤖 Generated with Claude Code