Fix roctracer (v1) profiler build for rocm-jaxlib-v0.9.0 #812
Cherry-pick b3248ad added RocmTracer::GpuAgents() and a matching test, both of which only exist on the rocprofiler-sdk (v3) tracer backend. The roctracer (v1) backend (--define=xla_rocm_profiler=v1) does not expose these APIs, breaking //pjrt/tools:build_gpu_plugin_wheel. Guard the v3-only call in device_tracer_rocm.cc and the v3-only tests in rocm_tracer_test.cc with the XLA_GPU_ROCM_TRACER_BACKEND macro. SetGpuAgents() is a no-op on the base collector, so the v1 path preserves pre-cherrypick behavior.
Hi @cj401-amd, I added support for building with roctracer (v1) as well. Since b3248ad came from upstream, it did not account for the roctracer backend. I also need to add this patch here because QA wants it.
@@ -235,3 +242,5 @@ TEST(RocmTracerTest, CapturesHipEvents) {
}  // namespace
}  // namespace profiler
}  // namespace xla

#endif  // XLA_GPU_ROCM_TRACER_BACKEND == XLA_GPU_ROCM_TRACER_BACKEND_V3
Nit: Wrapping the entire test body in the #if guard is a pragmatic fix. When built with --define=xla_rocm_profiler=v1, this produces a test binary with zero test cases. By default gtest reports that as a pass, so this is fine today — but if CI ever adds --gtest_fail_if_no_tests, this target would start failing for v1 builds. Just something to be aware of; no action needed now.
#if XLA_GPU_ROCM_TRACER_BACKEND == XLA_GPU_ROCM_TRACER_BACKEND_V3
  // GpuAgents() is only available on the rocprofiler-sdk (v3) backend.
  // The v1 (roctracer) tracer does not expose agent data; SetGpuAgents() on
  // the base RocmTraceCollector is a no-op by default, so skipping the call
  // in the v1 path preserves prior behavior.
  rocm_trace_collector_->SetGpuAgents(rocm_tracer_->GpuAgents());
#endif
Clean fix. The #if guard correctly scopes the SetGpuAgents() call to the v3 backend, and since it's a no-op on the base RocmTraceCollector, the v1 path preserves prior behavior. LGTM.
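To illustrate why skipping the call is behavior-preserving, here is a minimal, self-contained sketch of the same guard pattern. Everything in it is hypothetical and exists only for illustration: the `SKETCH_TRACER_BACKEND*` macros stand in for `XLA_GPU_ROCM_TRACER_BACKEND`, and `SketchTracer`, `SketchCollector`, `SketchAgentCollector`, and `ConfigureCollector` are not XLA types or functions. It assumes, as the PR description states, that the base collector's `SetGpuAgents()` does nothing by default.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real backend-selection macros (assumption:
// the real build sets an integer-valued macro per backend).
#define SKETCH_TRACER_BACKEND_V1 1
#define SKETCH_TRACER_BACKEND_V3 3
#ifndef SKETCH_TRACER_BACKEND
#define SKETCH_TRACER_BACKEND SKETCH_TRACER_BACKEND_V1  // default to v1 here
#endif

// Hypothetical agent record; the real agent type is not shown in the PR.
struct SketchAgent {
  std::string name;
};

// Base collector: SetGpuAgents() is a no-op, so never calling it is
// indistinguishable from calling it.
class SketchCollector {
 public:
  virtual ~SketchCollector() = default;
  virtual void SetGpuAgents(std::vector<SketchAgent> /*agents*/) {}
  virtual std::size_t AgentCount() const { return 0; }
};

// A derived collector that actually stores the agents it is given.
class SketchAgentCollector : public SketchCollector {
 public:
  void SetGpuAgents(std::vector<SketchAgent> agents) override {
    agents_ = std::move(agents);
  }
  std::size_t AgentCount() const override { return agents_.size(); }

 private:
  std::vector<SketchAgent> agents_;
};

// The tracer only declares GpuAgents() when the v3 backend is selected,
// so any unguarded call would fail to compile under v1.
class SketchTracer {
 public:
#if SKETCH_TRACER_BACKEND == SKETCH_TRACER_BACKEND_V3
  std::vector<SketchAgent> GpuAgents() const {
    return {SketchAgent{"gpu0"}};
  }
#endif
};

// Guarded call site: compiled in for v3, compiled out for v1.
inline void ConfigureCollector(const SketchTracer& tracer,
                               SketchCollector& collector) {
#if SKETCH_TRACER_BACKEND == SKETCH_TRACER_BACKEND_V3
  collector.SetGpuAgents(tracer.GpuAgents());
#else
  (void)tracer;     // unused in the v1 configuration
  (void)collector;  // left untouched, matching the pre-change behavior
#endif
}
```

With the default (v1) setting, `ConfigureCollector` leaves the collector untouched; compiling the same file with `-DSKETCH_TRACER_BACKEND=3` enables the call and the derived collector records one agent.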
Claude Review Summary
Overall: LGTM. Clean, well-scoped build fix. This PR correctly guards the v3-only … See inline comments for minor observations.