Use rocprofiler-sdk enabled kineto #3056
Merged
Jenkins build for commit a6f7fd3fb692c6a280783c0337b5a300159c224e finished as FAILURE
Collaborator
Docker built - rocm/pytorch-private:81_ubuntu24.04_py3.13_pytorch_release-2.9_rocprofiler
Collaborator
Merging this since it is already verified on the above Docker image.
pruthvistony
approved these changes
Mar 17, 2026
sunway513
added a commit
to sunway513/aiter
that referenced
this pull request
Mar 27, 2026
Adds a lightweight test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance. When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Uses a 4-layer mini transformer (BS=8, seq=1, FP16) to amplify dispatch overhead. Runs standalone or via pytest.

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
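The first check in the commit message above (which profiler library libtorch_cpu.so links against) could be approximated as follows. This is a minimal sketch, not the code from the referenced commit: the function name `linked_profiler_backends` is hypothetical, and it assumes the check is done by scanning `ldd`-style output for the two library names mentioned in the message.

```python
from __future__ import annotations


def linked_profiler_backends(ldd_output: str) -> set[str]:
    """Return which ROCm profiler libraries appear in `ldd`-style output.

    Hypothetical helper: matches only the two sonames named in the commit
    message (libroctracer64.so and librocprofiler-sdk.so).
    """
    backends: set[str] = set()
    for line in ldd_output.splitlines():
        # An ldd line looks like: "\tlibfoo.so.1 => /path/libfoo.so.1 (0x...)"
        name = line.strip().split(" ", 1)[0]
        if name.startswith("libroctracer64.so"):
            backends.add("roctracer")
        elif name.startswith("librocprofiler-sdk.so"):
            backends.add("rocprofiler-sdk")
    return backends
```

To use it, one would feed it the output of `ldd` run on libtorch_cpu.so (for example via `subprocess.run(["ldd", path], capture_output=True, text=True).stdout`) and fail the check if `"rocprofiler-sdk"` is in the returned set. The location of libtorch_cpu.so depends on how PyTorch was installed.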
sunway513
added a commit
to sunway513/ATOM
that referenced
this pull request
Mar 28, 2026
Adds a lightweight integration test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance in the inference serving stack. When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%. This directly impacts decode throughput in TP>1 serving scenarios.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
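The three numeric thresholds in the commit message above can be illustrated with a short sketch. It is not the referenced test: the function names are hypothetical, and it assumes "GPU occupancy" means the fraction of the measured time span during which at least one kernel is running, computed from kernel (start, end) timestamps in microseconds. How the real test collects those timestamps is not shown on this page.

```python
def summarize_kernels(intervals, gap_threshold_us=100.0):
    """Return (occupancy_pct, large_gap_count) for kernel (start_us, end_us) pairs.

    Assumption: occupancy = busy time / total span, with overlapping
    kernels merged so that busy time is not double counted.
    """
    ordered = sorted(intervals)
    if not ordered:
        return 0.0, 0
    busy = 0.0
    large_gaps = 0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start > cur_end:
            # Idle gap between the merged run so far and the next kernel.
            busy += cur_end - cur_start
            if start - cur_end > gap_threshold_us:
                large_gaps += 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    span = cur_end - ordered[0][0]
    occupancy = 100.0 * busy / span if span > 0 else 0.0
    return occupancy, large_gaps


def failed_checks(occupancy_pct, large_gap_count, graph_launch_cpu_us):
    """Apply the thresholds quoted in the commit message; return failed names."""
    failures = []
    if not occupancy_pct > 80.0:
        failures.append("occupancy")
    if not large_gap_count <= 5:
        failures.append("gaps")
    if not graph_launch_cpu_us < 150.0:
        failures.append("launch_cpu_time")
    return failures
```

An empty list from `failed_checks` corresponds to a passing run; the healthy values quoted in the message (97% occupancy, 0 large gaps, 50us launch time) pass all three checks.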
valarLip
pushed a commit
to ROCm/ATOM
that referenced
this pull request
Mar 28, 2026
* test: add profiler regression guard for HIP graph replay

  Adds a lightweight integration test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance in the inference serving stack. When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%. This directly impacts decode throughput in TP>1 serving scenarios.

  Test checks:
  - libtorch_cpu.so links roctracer (not rocprofiler-sdk)
  - HIP graph replay GPU occupancy > 80% (healthy = 97%)
  - Inter-kernel gaps > 100us count <= 5 (healthy = 0)
  - hipGraphLaunch CPU time < 150us (healthy = 50us)

  Validated on MI355 with:
  - PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
  - FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

  Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix black + ruff formatting

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jasen2201
pushed a commit
to Jasen2201/ATOM
that referenced
this pull request
Apr 10, 2026
* test: add profiler regression guard for HIP graph replay

  Adds a lightweight integration test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance in the inference serving stack. When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%. This directly impacts decode throughput in TP>1 serving scenarios.

  Test checks:
  - libtorch_cpu.so links roctracer (not rocprofiler-sdk)
  - HIP graph replay GPU occupancy > 80% (healthy = 97%)
  - Inter-kernel gaps > 100us count <= 5 (healthy = 0)
  - hipGraphLaunch CPU time < 150us (healthy = 50us)

  Validated on MI355 with:
  - PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
  - FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

  Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

* style: fix black + ruff formatting

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enables kineto to use rocprofiler-sdk to gather GPU data.
Roctracer is deprecated. For now, the roctracer code remains in place to support older ROCm versions.