Cherry-pick openxla/xla#44428: [ROCm] packed bf16 atomic add for scatter/segment_sum by magaonka-amd · Pull Request #987 · ROCm/xla

magaonka-amd · 2026-06-24T06:09:20Z

Motivation

Requested in #985 review: cherry-pick openxla#44428 onto rocm-jaxlib-v0.10.2 (missing from the JAX 0.10.2 pinned XLA base 5a9e73cb; PR merged upstream Jun 17).

Commit (cherry-picked with `-x`)

97544f7a9e: openxla/xla#44428, [ROCm] Emit packed bf16 atomic add for scatter/segment_sum

Files changed

xla/backends/gpu/codegen/emitters/transforms/atomic_rmw_utils.cc
xla/backends/gpu/codegen/emitters/transforms/tests/lower_tensors.mlir

Test Plan

ROCm jaxlib build on rocm-jaxlib-v0.10.2; release-validation CI.

[ROCm] Emit packed bf16 atomic add for scatter/segment_sum by matching FloatNormalization conversions.

Imported from GitHub PR openxla#44428

📝 Summary of Changes

Make the atomic-RMW matcher (`GetAtomicModifierParameters`) look through the `extf` → `addf` (f32) → `truncf` body that FloatNormalization emits for bf16, recovering the narrow bf16 modifier so that scatter-add lowers to a packed `atomicrmw fadd <2 x bf16>` (`global_atomic_pk_add_bf16`) instead of a CAS loop. GpuFloatSupport and FloatNormalization are unchanged; targets without a native bf16 atomic still fall back to CAS.

🎯 Justification

bf16 segment_sum/scatter-add lowers to a slow CAS loop on MI300/MI350 even though the hardware has a packed bf16 atomic, making bf16 ~7x slower than f16.

🚀 Kind of Contribution

Please remove what does not apply: ⚡️ Performance Improvement, 🧪 Tests

📊 Benchmark (for Performance Improvements)

Please measure and include speedups for one of the public HLOs in `compiler/xla/tools/benchmarks/hlo/`.

🧪 Unit Tests

Added `direct_atomic_rmw_fadd_bf16_widened` and a gfx942 `CHECK-GFX942-MI300` RUN line to `lower_tensors.mlir`, asserting the packed `atomicrmw fadd <2 x bf16>` with no CAS. All 9 RUN-line prefixes pass.

🧪 Execution Tests

What execution tests were added? For example, a new optimization should be tested with an end-to-end execution test triggering the optimization and asserting correctness. Please provide test cases running with at most 2 GPUs.

Copybara import of the project:

--
edcb06b by Zoran Jovanovic <zjovanov@amd.com>:

[ROCm] Emit packed bf16 atomic add for scatter/segment_sum by matching FloatNormalization conversions.

Merging this change closes openxla#44428

COPYBARA_INTEGRATE_REVIEW=openxla#44428 from ROCm:rocm-bf16-atomic-scatter edcb06b
PiperOrigin-RevId: 933630040

(cherry picked from commit 97544f7)
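For readers unfamiliar with the "look through" idea in the summary, here is a minimal sketch of that kind of pattern match on a toy expression tree. It is not the XLA implementation: the real matcher works on MLIR ops inside `GetAtomicModifierParameters`, and its exact conditions are not shown in this PR text. The `Value` and `Op` classes, `strip_extf`, and `match_widened_fadd` are hypothetical names for illustration only. The sketch also assumes the modifier reaches the add through its own `extf`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Value:
    """Toy SSA value: a name plus an element type such as "bf16" or "f32"."""
    name: str
    dtype: str


@dataclass(frozen=True)
class Op:
    """Toy operation; kind is e.g. "extf", "addf" or "truncf"."""
    kind: str
    operands: Tuple["Node", ...]
    dtype: str  # element type of the result


Node = Union[Value, Op]


def strip_extf(node: Node) -> Optional[Node]:
    """Return x if node is extf(x), otherwise None."""
    if isinstance(node, Op) and node.kind == "extf" and len(node.operands) == 1:
        return node.operands[0]
    return None


def match_widened_fadd(result: Node, current: Value) -> Optional[Node]:
    """Recover m from truncf(addf(extf(current), extf(m))).

    Returns the narrow modifier m when the body has this shape and m has the
    same element type as `current`; returns None otherwise, in which case a
    caller would keep its fallback path (hypothetically, the CAS loop).
    """
    if not (isinstance(result, Op) and result.kind == "truncf"):
        return None
    if len(result.operands) != 1 or result.dtype != current.dtype:
        return None
    add = result.operands[0]
    if not (isinstance(add, Op) and add.kind == "addf" and len(add.operands) == 2):
        return None
    lhs, rhs = (strip_extf(operand) for operand in add.operands)
    # addf is commutative, so accept the current value on either side.
    if lhs == current and rhs is not None and rhs.dtype == current.dtype:
        return rhs
    if rhs == current and lhs is not None and lhs.dtype == current.dtype:
        return lhs
    return None
```

In this toy model, a body that widens both operands, adds in f32, and truncates back to bf16 yields the bf16 modifier. Any other shape yields `None`, mirroring the statement above that unmatched cases still fall back to CAS.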

magaonka-amd · 2026-06-24T14:26:16Z

Superseded by #993, which combines all four ROCm 0.10.2 cherry-pick PRs into a single PR against rocm-jaxlib-v0.10.2.

This was referenced Jun 24, 2026

Cherry-pick PR #983 ROCm fixes to rocm-jaxlib-v0.10.2 #985

Closed

[ROCm] Release fixes for rocm-jaxlib-v0.10.2 (combined cherry-picks) #993

Merged

magaonka-amd closed this Jun 24, 2026

magaonka-amd wants to merge 1 commit into rocm-jaxlib-v0.10.2 from cherrypick-44428-to-v0.10.2
