
Use rocprofiler-sdk enabled kineto #3056

Merged: pruthvistony merged 1 commit into release/2.9 from kineto_rocprof_release2.9 on Mar 17, 2026

@mwootton

Enables kineto to use rocprofiler-sdk to gather GPU data.
Roctracer is deprecated. For now, the roctracer code remains in place to support older ROCm versions.

@rocm-repo-management-api

rocm-repo-management-api Bot commented Mar 10, 2026

Jenkins build for commit a6f7fd3fb692c6a280783c0337b5a300159c224e finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@pruthvistony
Collaborator

Docker image built: rocm/pytorch-private:81_ubuntu24.04_py3.13_pytorch_release-2.9_rocprofiler

@pruthvistony
Collaborator

Merging this since it has already been verified on the Docker image above.

@pruthvistony merged commit 81ed462 into release/2.9 on Mar 17, 2026
3 of 8 checks passed
@pruthvistony deleted the kineto_rocprof_release2.9 branch on March 17, 2026 16:12
sunway513 added a commit to sunway513/aiter that referenced this pull request Mar 27, 2026
Adds a lightweight test (~3s) that detects rocprofiler-sdk interception
overhead, which degrades HIP graph replay performance. When PyTorch kineto
is linked against librocprofiler-sdk.so instead of libroctracer64.so,
every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy
from ~97% to ~75%.
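
A minimal sketch of how the linkage condition described above might be checked, assuming the text output of `ldd` on `libtorch_cpu.so` is already available as a string. The function name `classify_profiler_linkage` and the sample `ldd` lines are hypothetical and are not taken from the actual test; only the two library names come from the commit message.

```python
def classify_profiler_linkage(ldd_output: str) -> str:
    """Classify which profiler backend a shared object is linked against.

    Returns one of "roctracer", "rocprofiler-sdk", "both", or "none",
    based purely on substring matches in the given ldd-style text.
    """
    has_roctracer = "libroctracer64.so" in ldd_output
    has_sdk = "librocprofiler-sdk.so" in ldd_output
    if has_roctracer and has_sdk:
        return "both"
    if has_sdk:
        return "rocprofiler-sdk"
    if has_roctracer:
        return "roctracer"
    return "none"


# Hypothetical ldd excerpts, for illustration only (paths and versions are made up).
SAMPLE_ROCTRACER = (
    "\tlibroctracer64.so.4 => /opt/rocm/lib/libroctracer64.so.4 (0x00007f0000000000)\n"
)
SAMPLE_SDK = (
    "\tlibrocprofiler-sdk.so.1 => /opt/rocm/lib/librocprofiler-sdk.so.1 (0x00007f0000000000)\n"
)

if __name__ == "__main__":
    for name, text in [("roctracer build", SAMPLE_ROCTRACER), ("sdk build", SAMPLE_SDK)]:
        print(name, "->", classify_profiler_linkage(text))
```

In practice the input string could come from running `ldd` on the installed `libtorch_cpu.so`; that step is omitted here to keep the sketch self-contained.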

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Number of inter-kernel gaps longer than 100us <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Uses a 4-layer mini transformer (BS=8, seq=1, FP16) to amplify
dispatch overhead. Runs standalone or via pytest.
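
The three timing thresholds listed above can be sketched as a small pass/fail function. This is an illustration under stated assumptions, not the test's actual implementation: occupancy is assumed to mean kernel busy time divided by the span from the first kernel start to the last kernel end, kernel intervals are assumed to be sorted and non-overlapping, and the names `ReplayMetrics`, `replay_metrics`, and `passes_guard` are hypothetical. Only the threshold values (80%, 100us, 5, 150us) come from the commit message.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReplayMetrics:
    occupancy_pct: float   # assumed definition: busy time / total span * 100
    large_gap_count: int   # inter-kernel gaps longer than the gap threshold


def replay_metrics(kernels: List[Tuple[float, float]],
                   gap_threshold_us: float = 100.0) -> ReplayMetrics:
    """Compute metrics from (start_us, end_us) kernel intervals.

    Assumes the intervals are sorted by start time and do not overlap.
    """
    busy = sum(end - start for start, end in kernels)
    span = kernels[-1][1] - kernels[0][0]
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(kernels, kernels[1:])]
    large = sum(1 for g in gaps if g > gap_threshold_us)
    return ReplayMetrics(100.0 * busy / span, large)


def passes_guard(metrics: ReplayMetrics, launch_cpu_us: float) -> bool:
    """Apply the thresholds quoted in the commit message."""
    return (
        metrics.occupancy_pct > 80.0
        and metrics.large_gap_count <= 5
        and launch_cpu_us < 150.0
    )


if __name__ == "__main__":
    # Made-up timelines: tightly packed kernels vs. a 270us gap between kernels.
    healthy = replay_metrics([(0.0, 97.0), (100.0, 197.0)])
    degraded = replay_metrics([(0.0, 100.0), (370.0, 470.0)])
    print(passes_guard(healthy, launch_cpu_us=50.0))
    print(passes_guard(degraded, launch_cpu_us=270.0))
```

The example timelines are synthetic and chosen only to exercise both branches; they are not measurements from MI355.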

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunway513 added a commit to sunway513/ATOM that referenced this pull request Mar 28, 2026
Adds a lightweight integration test (~3s) that detects rocprofiler-sdk
interception overhead, which degrades HIP graph replay performance in the
inference serving stack.

When PyTorch kineto is linked against librocprofiler-sdk.so instead of
libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead,
dropping GPU occupancy from ~97% to ~75%. This directly impacts decode
throughput in TP>1 serving scenarios.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Number of inter-kernel gaps longer than 100us <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
valarLip pushed a commit to ROCm/ATOM that referenced this pull request Mar 28, 2026
* test: add profiler regression guard for HIP graph replay

* style: fix black + ruff formatting

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Jasen2201 pushed a commit to Jasen2201/ATOM that referenced this pull request Apr 10, 2026
* test: add profiler regression guard for HIP graph replay

* style: fix black + ruff formatting

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>