Use rocprofiler-sdk enabled kineto by mwootton · Pull Request #3056 · ROCm/pytorch

mwootton · 2026-03-10T19:18:57Z

Enables Kineto to use rocprofiler-sdk to gather GPU data.
Roctracer is deprecated. For now, the roctracer code path remains in place to support older ROCm versions.

rocm-repo-management-api · 2026-03-10T19:35:57Z

Jenkins build for a6f7fd3fb692c6a280783c0337b5a300159c224e commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

pruthvistony · 2026-03-17T16:11:41Z

Docker image built: rocm/pytorch-private:81_ubuntu24.04_py3.13_pytorch_release-2.9_rocprofiler

pruthvistony · 2026-03-17T16:12:02Z

Merging this since it is already verified on the above Docker image.

Adds a lightweight test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance. When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Uses a 4-layer mini transformer (BS=8, seq=1, FP16) to amplify dispatch overhead. Runs standalone or via pytest.

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
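The three timing thresholds in the commit message above can be sketched as a small pass/fail check. This is a minimal illustration, not the test from the referenced commit: the names `ReplayMetrics` and `summarize` are hypothetical, occupancy is assumed to mean GPU busy time divided by the wall-clock span of the replay, and the hipGraphLaunch CPU time is assumed to be a mean over launches.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Thresholds quoted in the commit message.
MIN_OCCUPANCY_PCT = 80.0   # GPU occupancy must be above this
GAP_THRESHOLD_US = 100.0   # an inter-kernel gap longer than this is "large"
MAX_LARGE_GAPS = 5         # at most this many large gaps
MAX_LAUNCH_CPU_US = 150.0  # hipGraphLaunch CPU time must be below this


@dataclass
class ReplayMetrics:
    """Hypothetical container for the three measured quantities."""
    occupancy_pct: float
    large_gaps: int
    launch_cpu_us: float

    def passes(self) -> bool:
        return (
            self.occupancy_pct > MIN_OCCUPANCY_PCT
            and self.large_gaps <= MAX_LARGE_GAPS
            and self.launch_cpu_us < MAX_LAUNCH_CPU_US
        )


def summarize(
    kernels: List[Tuple[float, float]], launch_cpu_us: List[float]
) -> ReplayMetrics:
    """Reduce kernel (start_us, end_us) intervals and launch durations to metrics.

    Assumption: occupancy = merged busy time / (last end - first start).
    """
    if not kernels or not launch_cpu_us:
        raise ValueError("need at least one kernel interval and one launch time")

    ordered = sorted(kernels)
    span = max(end for _, end in ordered) - ordered[0][0]

    busy = 0.0
    large_gaps = 0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start > cur_end:
            # Idle gap between the merged interval so far and the next kernel.
            busy += cur_end - cur_start
            if start - cur_end > GAP_THRESHOLD_US:
                large_gaps += 1
            cur_start, cur_end = start, end
        else:
            # Overlapping kernels are merged so busy time is not double counted.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start

    occupancy = 100.0 * busy / span if span > 0 else 0.0
    mean_launch = sum(launch_cpu_us) / len(launch_cpu_us)
    return ReplayMetrics(occupancy, large_gaps, mean_launch)
```

In practice the intervals would come from a profiler trace of the graph replay. The sketch only shows how the quoted thresholds combine into one verdict.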

* test: add profiler regression guard for HIP graph replay

Adds a lightweight integration test (~3s) that detects rocprofiler-sdk interception overhead which degrades HIP graph replay performance in the inference serving stack. When PyTorch kineto is linked against librocprofiler-sdk.so instead of libroctracer64.so, every hipGraphLaunch incurs ~270us of overhead, dropping GPU occupancy from ~97% to ~75%. This directly impacts decode throughput in TP>1 serving scenarios.

Test checks:
- libtorch_cpu.so links roctracer (not rocprofiler-sdk)
- HIP graph replay GPU occupancy > 80% (healthy = 97%)
- Inter-kernel gaps > 100us count <= 5 (healthy = 0)
- hipGraphLaunch CPU time < 150us (healthy = 50us)

Validated on MI355 with:
- PASS: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326_full_fix
- FAIL: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326 (stock)

Reference: ROCm/rocm-systems#4401, ROCm/pytorch#2579, ROCm/pytorch#3056

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix black + ruff formatting

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
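The first check in the commit message above (which profiler library `libtorch_cpu.so` links) could be approximated by inspecting `ldd` output. The sketch below is an assumption about how such a check might look, not the code from the referenced commit. The helper names `classify_profiler_backend` and `ldd_output_for` are hypothetical; only the two library names come from the commit message.

```python
import subprocess

# Library names as given in the commit message.
ROCTRACER = "libroctracer64.so"
ROCPROFILER_SDK = "librocprofiler-sdk.so"


def classify_profiler_backend(ldd_output: str) -> str:
    """Classify which profiler library appears in `ldd`-style text.

    Returns one of "roctracer", "rocprofiler-sdk", "both", or "none".
    """
    has_tracer = ROCTRACER in ldd_output
    has_sdk = ROCPROFILER_SDK in ldd_output
    if has_tracer and has_sdk:
        return "both"
    if has_tracer:
        return "roctracer"
    if has_sdk:
        return "rocprofiler-sdk"
    return "none"


def ldd_output_for(library_path: str) -> str:
    """Run `ldd` on a shared library and return its stdout (Linux only)."""
    result = subprocess.run(
        ["ldd", library_path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A caller would pass the path of the installed `libtorch_cpu.so` to `ldd_output_for` and compare the classification against the expected backend. Note that `ldd` reports only link-time dependencies, so a library loaded later with `dlopen` would not appear in this output.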

Use rocprofiler-sdk enabled kineto

a6f7fd3

pruthvistony approved these changes Mar 17, 2026

pruthvistony merged commit 81ed462 into release/2.9 Mar 17, 2026
3 of 8 checks passed

pruthvistony deleted the kineto_rocprof_release2.9 branch March 17, 2026 16:12

sunway513 mentioned this pull request Mar 27, 2026

test: add profiler regression guard for HIP graph replay ROCm/aiter#2511

Closed

4 tasks

sunway513 mentioned this pull request Mar 28, 2026

test: add profiler regression guard for HIP graph replay ROCm/ATOM#432

Merged

4 tasks
