[AMDGPU] SIFoldOperands: Delay foldCopyToVGPROfScalarAddOfFrameIndex #141558

Conversation

ritter-x2a (Member)

foldCopyToVGPROfScalarAddOfFrameIndex rewrites an s_add whose result is copied
to a vector register into a v_add. We don't want to do that if foldInstOperand
(which previously ran after the frame-index fold) can fold the sreg->vreg copy
away instead. This patch therefore delays foldCopyToVGPROfScalarAddOfFrameIndex
until after foldInstOperand.

This avoids unnecessary movs in the flat-scratch-svs.ll test and also avoids
regressions in an upcoming patch that enables ISD::PTRADD nodes.
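
To illustrate the interaction, consider a hypothetical MIR-style sequence
(illustrative only, not taken from the patch):

    %sum:sreg_32 = S_ADD_I32 %stack.0, %soff      ; scalar add of a frame index
    %vcopy:vgpr_32 = COPY %sum                    ; sreg -> vreg copy
    %res:vgpr_32 = V_MAD_U32_U24 %idx, 2, %vcopy  ; VALU user of the copy

If foldCopyToVGPROfScalarAddOfFrameIndex runs first, it rewrites the S_ADD_I32
into a V_ADD_U32 just to make the COPY removable. foldInstOperand, however, can
often fold %sum directly into the VALU user (v_mad_u32_u24 accepts an SGPR
operand), which removes the COPY while keeping the cheaper scalar add. Running
the frame-index fold last preserves that outcome.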

ritter-x2a (Member, Author)

This stack of pull requests is managed by Graphite. Learn more about stacking.


llvmbot commented May 27, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

foldCopyToVGPROfScalarAddOfFrameIndex rewrites an s_add whose result is copied
to a vector register into a v_add. We don't want to do that if foldInstOperand
(which previously ran after the frame-index fold) can fold the sreg->vreg copy
away instead. This patch therefore delays foldCopyToVGPROfScalarAddOfFrameIndex
until after foldInstOperand.

This avoids unnecessary movs in the flat-scratch-svs.ll test and also avoids
regressions in an upcoming patch that enables ISD::PTRADD nodes.


Full diff: https://github.com/llvm/llvm-project/pull/141558.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+8-5)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-scratch-svs.ll (+9-15)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 8b12eeba15618..a28c6e2811abe 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1909,10 +1909,6 @@ bool SIFoldOperandsImpl::tryFoldFoldableCopy(
   if (!DstReg.isVirtual())
     return false;
 
-  if (OpToFold.isReg() &&
-      foldCopyToVGPROfScalarAddOfFrameIndex(DstReg, OpToFold.getReg(), MI))
-    return true;
-
   // Fold copy to AGPR through reg_sequence
   // TODO: Handle with subregister extract
   if (OpToFold.isReg() && MI.isCopy() && !MI.getOperand(1).getSubReg()) {
@@ -1947,7 +1943,14 @@ bool SIFoldOperandsImpl::tryFoldFoldableCopy(
     Changed = true;
   }
 
-  return Changed;
+  if (Changed)
+    return true;
+
+  // Run this after foldInstOperand to avoid turning scalar additions into
+  // vector additions when the scalar result could just be folded into the
+  // user(s).
+  return (OpToFold.isReg() &&
+          foldCopyToVGPROfScalarAddOfFrameIndex(DstReg, OpToFold.getReg(), MI));
 }
 
 // Clamp patterns are canonically selected to v_max_* instructions, so only
diff --git a/llvm/test/CodeGen/AMDGPU/flat-scratch-svs.ll b/llvm/test/CodeGen/AMDGPU/flat-scratch-svs.ll
index bcd5d1e87954f..a98df5c97293c 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-scratch-svs.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-scratch-svs.ll
@@ -182,8 +182,7 @@ define amdgpu_kernel void @soff1_voff2(i32 %soff) {
 ; GFX942-SDAG-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX942-SDAG-NEXT:    v_mov_b32_e32 v1, 1
 ; GFX942-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, s0
-; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 2, v2
+; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 2, s0
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v2, 1, v0
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v3, 2, v0
 ; GFX942-SDAG-NEXT:    scratch_store_byte v2, v1, off sc0 sc1
@@ -356,8 +355,7 @@ define amdgpu_kernel void @soff1_voff4(i32 %soff) {
 ; GFX942-SDAG-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX942-SDAG-NEXT:    v_mov_b32_e32 v1, 1
 ; GFX942-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, s0
-; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 4, v2
+; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 4, s0
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v2, 1, v0
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v3, 2, v0
 ; GFX942-SDAG-NEXT:    scratch_store_byte v2, v1, off sc0 sc1
@@ -701,14 +699,13 @@ define amdgpu_kernel void @soff2_voff2(i32 %soff) {
 ; GFX942-SDAG-NEXT:    s_load_dword s0, s[4:5], 0x24
 ; GFX942-SDAG-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX942-SDAG-NEXT:    v_mov_b32_e32 v1, 1
+; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, 2
 ; GFX942-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX942-SDAG-NEXT:    s_lshl_b32 s0, s0, 1
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, s0
-; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 2, v2
+; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 2, s0
 ; GFX942-SDAG-NEXT:    scratch_store_byte v0, v1, off offset:1 sc0 sc1
 ; GFX942-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v1, 2, v0
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, 2
 ; GFX942-SDAG-NEXT:    scratch_store_byte v1, v2, off sc0 sc1
 ; GFX942-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v0, 4, v0
@@ -884,14 +881,13 @@ define amdgpu_kernel void @soff2_voff4(i32 %soff) {
 ; GFX942-SDAG-NEXT:    s_load_dword s0, s[4:5], 0x24
 ; GFX942-SDAG-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX942-SDAG-NEXT:    v_mov_b32_e32 v1, 1
+; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, 2
 ; GFX942-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX942-SDAG-NEXT:    s_lshl_b32 s0, s0, 1
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, s0
-; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 4, v2
+; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 4, s0
 ; GFX942-SDAG-NEXT:    scratch_store_byte v0, v1, off offset:1 sc0 sc1
 ; GFX942-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v1, 2, v0
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, 2
 ; GFX942-SDAG-NEXT:    scratch_store_byte v1, v2, off sc0 sc1
 ; GFX942-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v0, 4, v0
@@ -1239,14 +1235,13 @@ define amdgpu_kernel void @soff4_voff2(i32 %soff) {
 ; GFX942-SDAG-NEXT:    s_load_dword s0, s[4:5], 0x24
 ; GFX942-SDAG-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX942-SDAG-NEXT:    v_mov_b32_e32 v1, 1
+; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, 2
 ; GFX942-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX942-SDAG-NEXT:    s_lshl_b32 s0, s0, 2
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, s0
-; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 2, v2
+; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 2, s0
 ; GFX942-SDAG-NEXT:    scratch_store_byte v0, v1, off offset:1 sc0 sc1
 ; GFX942-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v1, 2, v0
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, 2
 ; GFX942-SDAG-NEXT:    scratch_store_byte v1, v2, off sc0 sc1
 ; GFX942-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-SDAG-NEXT:    v_add_u32_e32 v0, 4, v0
@@ -1425,8 +1420,7 @@ define amdgpu_kernel void @soff4_voff4(i32 %soff) {
 ; GFX942-SDAG-NEXT:    v_mov_b32_e32 v2, 2
 ; GFX942-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX942-SDAG-NEXT:    s_lshl_b32 s0, s0, 2
-; GFX942-SDAG-NEXT:    v_mov_b32_e32 v3, s0
-; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 4, v3
+; GFX942-SDAG-NEXT:    v_mad_u32_u24 v0, v0, 4, s0
 ; GFX942-SDAG-NEXT:    scratch_store_byte v0, v1, off offset:1 sc0 sc1
 ; GFX942-SDAG-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-SDAG-NEXT:    scratch_store_byte v0, v2, off offset:2 sc0 sc1
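
For readers skimming the diff: after this change, the fold attempts in
tryFoldFoldableCopy run in roughly the following order (a simplified sketch
reassembled from the hunks above, not the verbatim function; parameters and
early exits are elided):

    // Inside SIFoldOperandsImpl::tryFoldFoldableCopy (sketch):

    // 1. The existing copy folds run first, e.g. folding a copy to an AGPR
    //    through a reg_sequence, unchanged by this patch.

    // 2. foldInstOperand tries to fold the copied value directly into its
    //    users; if it (or the related cleanup) changed anything, stop here,
    //    since the sreg->vreg copy can now simply become dead.
    bool Changed = foldInstOperand(MI, OpToFold);
    if (Changed)
      return true;

    // 3. Only if nothing else folded, rewrite an s_add of a frame index
    //    whose result is copied to a VGPR into a v_add.
    return OpToFold.isReg() &&
           foldCopyToVGPROfScalarAddOfFrameIndex(DstReg, OpToFold.getReg(), MI);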

ritter-x2a marked this pull request as ready for review May 27, 2025 07:57

ritter-x2a commented May 27, 2025

Merge activity

  • May 27, 9:29 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • May 27, 9:30 AM UTC: @ritter-x2a merged this pull request with Graphite.

ritter-x2a merged commit fb27867 into main May 27, 2025
11 checks passed
ritter-x2a deleted the users/ritter-x2a/05-27-_amdgpu_sifoldoperands_delay_foldcopytovgprofscalaraddofframeindex branch May 27, 2025 09:30
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Jun 3, 2025
[AMDGPU] SIFoldOperands: Delay foldCopyToVGPROfScalarAddOfFrameIndex (llvm#141558)

foldCopyToVGPROfScalarAddOfFrameIndex rewrites an s_add whose result is copied
to a vector register into a v_add. We don't want to do that if foldInstOperand
(which previously ran after the frame-index fold) can fold the sreg->vreg copy
away instead. This patch therefore delays foldCopyToVGPROfScalarAddOfFrameIndex
until after foldInstOperand.

This avoids unnecessary movs in the flat-scratch-svs.ll test and also avoids
regressions in an upcoming patch that enables ISD::PTRADD nodes.