Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMDGPU: Move insertion into V2SCopies map #130776

Merged

Conversation

arsenm
Copy link
Contributor

@arsenm arsenm commented Mar 11, 2025

Insert the start instruction directly into the map before the uses. This
prevents improperly re-visting sgpr->vgpr phi inputs multiple times which
would trigger a use after free.

I don't particularly trust the iteration scheme here. This is also
unnecessarily revisting transitive users of a phi or reg_sequence for every
input operand, but I will address that separately.

Fixes #130646. I also believe it fixes #130119, although that test fails
less consistently for me.

Copy link
Contributor Author

arsenm commented Mar 11, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@arsenm arsenm marked this pull request as ready for review March 11, 2025 13:32
@llvmbot
Copy link
Member

llvmbot commented Mar 11, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Insert the start instruction directly into the map before the uses. This
prevents improperly re-visting sgpr->vgpr phi inputs multiple times which
would trigger a use after free.

I don't particularly trust the iteration scheme here. This is also
unnecessarily revisting transitive users of a phi or reg_sequence for every
input operand, but I will address that separately.

Fixes #130646. I also believe it fixes #130119, although that test fails
less consistently for me.


Patch is 772.59 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/130776.diff

38 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp (+5-1)
  • (modified) llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll (+229-231)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll (+1776-2018)
  • (modified) llvm/test/CodeGen/AMDGPU/atomicrmw-expand.ll (+3-9)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+10-10)
  • (modified) llvm/test/CodeGen/AMDGPU/carryout-selection.ll (+629-728)
  • (modified) llvm/test/CodeGen/AMDGPU/cf-loop-on-constant.ll (+8-10)
  • (modified) llvm/test/CodeGen/AMDGPU/ctlz_zero_undef.ll (+7-10)
  • (modified) llvm/test/CodeGen/AMDGPU/cttz_zero_undef.ll (+8-10)
  • (modified) llvm/test/CodeGen/AMDGPU/extract_vector_dynelt.ll (+254-130)
  • (modified) llvm/test/CodeGen/AMDGPU/fcopysign.f16.ll (+145-144)
  • (added) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-regression-issue130646-issue130119.ll (+168)
  • (added) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-regression-issue130646.mir (+88)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.ll (+151-152)
  • (modified) llvm/test/CodeGen/AMDGPU/frem.ll (+85-110)
  • (modified) llvm/test/CodeGen/AMDGPU/ftrunc.f64.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/idiv-licm.ll (+309-322)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+26-45)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+48-58)
  • (modified) llvm/test/CodeGen/AMDGPU/insert_waitcnt_for_precise_memory.ll (+108-127)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.mulo.ll (+23-25)
  • (modified) llvm/test/CodeGen/AMDGPU/load-constant-i1.ll (+380-480)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_64_32.ll (+7-4)
  • (modified) llvm/test/CodeGen/AMDGPU/multilevel-break.ll (+8-9)
  • (modified) llvm/test/CodeGen/AMDGPU/opt-sgpr-to-vgpr-copy.mir (+19-19)
  • (modified) llvm/test/CodeGen/AMDGPU/scalar_to_vector.ll (+28-76)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv.ll (+243-259)
  • (modified) llvm/test/CodeGen/AMDGPU/sdiv64.ll (+78-96)
  • (modified) llvm/test/CodeGen/AMDGPU/sdwa-peephole.ll (+39-94)
  • (modified) llvm/test/CodeGen/AMDGPU/sminmax.v2i16.ll (+22-32)
  • (modified) llvm/test/CodeGen/AMDGPU/sra.ll (+43-79)
  • (modified) llvm/test/CodeGen/AMDGPU/srem.ll (+1195-1383)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+257-308)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv.ll (+187-226)
  • (modified) llvm/test/CodeGen/AMDGPU/udiv64.ll (+40-52)
  • (modified) llvm/test/CodeGen/AMDGPU/udivrem.ll (+211-243)
  • (modified) llvm/test/CodeGen/AMDGPU/urem64.ll (+156-176)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+220-252)
diff --git a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
index 4342e7a369c13..ba75afc593577 100644
--- a/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
@@ -691,6 +691,9 @@ bool SIFixSGPRCopies::run(MachineFunction &MF) {
                             TII->get(AMDGPU::COPY), NewDst)
                         .addReg(MO.getReg());
                 MO.setReg(NewDst);
+
+                // FIXME: We are transitively revisiting users of this
+                // instruction for every input.
                 analyzeVGPRToSGPRCopy(NewCopy);
               }
             }
@@ -928,6 +931,8 @@ void SIFixSGPRCopies::analyzeVGPRToSGPRCopy(MachineInstr* MI) {
 
   V2SCopyInfo Info(getNextVGPRToSGPRCopyId(), MI,
                       TRI->getRegSizeInBits(*DstRC));
+  V2SCopies[Info.ID] = Info;
+
   SmallVector<MachineInstr *, 8> AnalysisWorklist;
   // Needed because the SSA is not a tree but a graph and may have
   // forks and joins. We should not then go same way twice.
@@ -976,7 +981,6 @@ void SIFixSGPRCopies::analyzeVGPRToSGPRCopy(MachineInstr* MI) {
       AnalysisWorklist.push_back(U);
     }
   }
-  V2SCopies[Info.ID] = Info;
 }
 
 // The main function that computes the VGPR to SGPR copy score
diff --git a/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll b/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
index 3116b5d59a097..b955e9eb10c1e 100644
--- a/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
+++ b/llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll
@@ -512,317 +512,315 @@ define void @v32_asm_def_use(float %v0, float %v1) #4 {
 define amdgpu_kernel void @introduced_copy_to_sgpr(i64 %arg, i32 %arg1, i32 %arg2, i64 %arg3, <2 x half> %arg4, <2 x half> %arg5) #3 {
 ; GFX908-LABEL: introduced_copy_to_sgpr:
 ; GFX908:       ; %bb.0: ; %bb
-; GFX908-NEXT:    global_load_ushort v16, v[0:1], off glc
+; GFX908-NEXT:    global_load_ushort v0, v[0:1], off glc
 ; GFX908-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; GFX908-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x10
-; GFX908-NEXT:    s_load_dword s7, s[8:9], 0x18
-; GFX908-NEXT:    s_mov_b32 s6, 0
-; GFX908-NEXT:    s_mov_b32 s9, s6
+; GFX908-NEXT:    s_load_dwordx2 s[10:11], s[8:9], 0x10
+; GFX908-NEXT:    s_load_dword s5, s[8:9], 0x18
 ; GFX908-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX908-NEXT:    v_cvt_f32_u32_e32 v0, s3
-; GFX908-NEXT:    s_sub_i32 s8, 0, s3
-; GFX908-NEXT:    v_cvt_f32_f16_e32 v17, s7
-; GFX908-NEXT:    v_mov_b32_e32 v19, 0
-; GFX908-NEXT:    v_rcp_iflag_f32_e32 v2, v0
-; GFX908-NEXT:    v_mov_b32_e32 v0, 0
+; GFX908-NEXT:    v_cvt_f32_u32_e32 v1, s3
+; GFX908-NEXT:    s_sub_i32 s4, 0, s3
+; GFX908-NEXT:    s_lshr_b32 s12, s5, 16
+; GFX908-NEXT:    v_cvt_f32_f16_e32 v26, s5
+; GFX908-NEXT:    v_rcp_iflag_f32_e32 v1, v1
+; GFX908-NEXT:    v_cvt_f32_f16_e32 v27, s12
+; GFX908-NEXT:    s_lshl_b64 s[8:9], s[10:11], 5
+; GFX908-NEXT:    s_or_b32 s8, s8, 28
+; GFX908-NEXT:    v_mul_f32_e32 v1, 0x4f7ffffe, v1
+; GFX908-NEXT:    v_cvt_u32_f32_e32 v2, v1
+; GFX908-NEXT:    v_mov_b32_e32 v15, s9
+; GFX908-NEXT:    s_lshl_b64 s[6:7], s[0:1], 5
+; GFX908-NEXT:    v_mov_b32_e32 v14, s8
+; GFX908-NEXT:    v_mul_lo_u32 v1, s4, v2
+; GFX908-NEXT:    s_mov_b32 s4, 0
+; GFX908-NEXT:    v_mul_hi_u32 v3, v2, v1
 ; GFX908-NEXT:    v_mov_b32_e32 v1, 0
-; GFX908-NEXT:    v_mul_f32_e32 v2, 0x4f7ffffe, v2
-; GFX908-NEXT:    v_cvt_u32_f32_e32 v2, v2
-; GFX908-NEXT:    v_readfirstlane_b32 s10, v2
-; GFX908-NEXT:    s_mul_i32 s8, s8, s10
-; GFX908-NEXT:    s_mul_hi_u32 s8, s10, s8
-; GFX908-NEXT:    s_add_i32 s10, s10, s8
-; GFX908-NEXT:    s_mul_hi_u32 s8, s2, s10
-; GFX908-NEXT:    s_mul_i32 s10, s8, s3
-; GFX908-NEXT:    s_sub_i32 s2, s2, s10
-; GFX908-NEXT:    s_add_i32 s11, s8, 1
-; GFX908-NEXT:    s_sub_i32 s10, s2, s3
-; GFX908-NEXT:    s_cmp_ge_u32 s2, s3
-; GFX908-NEXT:    s_cselect_b32 s8, s11, s8
-; GFX908-NEXT:    s_cselect_b32 s2, s10, s2
-; GFX908-NEXT:    s_add_i32 s10, s8, 1
-; GFX908-NEXT:    s_cmp_ge_u32 s2, s3
-; GFX908-NEXT:    s_cselect_b32 s8, s10, s8
-; GFX908-NEXT:    s_lshr_b32 s7, s7, 16
-; GFX908-NEXT:    v_cvt_f32_f16_e32 v18, s7
-; GFX908-NEXT:    s_lshl_b64 s[2:3], s[0:1], 5
-; GFX908-NEXT:    s_lshl_b64 s[12:13], s[8:9], 5
-; GFX908-NEXT:    s_lshl_b64 s[10:11], s[4:5], 5
-; GFX908-NEXT:    s_or_b32 s10, s10, 28
+; GFX908-NEXT:    v_add_u32_e32 v2, v2, v3
+; GFX908-NEXT:    v_mul_hi_u32 v4, s2, v2
+; GFX908-NEXT:    v_mov_b32_e32 v2, s10
+; GFX908-NEXT:    v_mov_b32_e32 v3, s11
+; GFX908-NEXT:    v_mul_lo_u32 v5, v4, s3
+; GFX908-NEXT:    v_add_u32_e32 v6, 1, v4
+; GFX908-NEXT:    v_sub_u32_e32 v5, s2, v5
+; GFX908-NEXT:    v_subrev_u32_e32 v7, s3, v5
+; GFX908-NEXT:    v_cmp_le_u32_e32 vcc, s3, v5
+; GFX908-NEXT:    v_cndmask_b32_e32 v4, v4, v6, vcc
+; GFX908-NEXT:    v_cndmask_b32_e32 v5, v5, v7, vcc
+; GFX908-NEXT:    v_add_u32_e32 v7, 1, v4
+; GFX908-NEXT:    v_cmp_le_u32_e32 vcc, s3, v5
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    v_readfirstlane_b32 s7, v16
-; GFX908-NEXT:    s_and_b32 s7, 0xffff, s7
-; GFX908-NEXT:    s_mul_i32 s1, s1, s7
-; GFX908-NEXT:    s_mul_hi_u32 s9, s0, s7
-; GFX908-NEXT:    s_mul_i32 s0, s0, s7
-; GFX908-NEXT:    s_add_i32 s1, s9, s1
-; GFX908-NEXT:    s_lshl_b64 s[14:15], s[0:1], 5
+; GFX908-NEXT:    v_and_b32_e32 v28, 0xffff, v0
+; GFX908-NEXT:    v_cndmask_b32_e32 v0, v4, v7, vcc
+; GFX908-NEXT:    v_mul_lo_u32 v8, s1, v28
+; GFX908-NEXT:    v_mul_hi_u32 v9, s0, v28
+; GFX908-NEXT:    v_lshlrev_b64 v[4:5], 5, v[0:1]
+; GFX908-NEXT:    v_mul_lo_u32 v6, s0, v28
+; GFX908-NEXT:    v_add_u32_e32 v7, v9, v8
+; GFX908-NEXT:    v_accvgpr_write_b32 a0, v4
+; GFX908-NEXT:    v_accvgpr_write_b32 a1, v5
+; GFX908-NEXT:    v_lshlrev_b64 v[6:7], 5, v[6:7]
 ; GFX908-NEXT:    s_branch .LBB3_2
-; GFX908-NEXT:  .LBB3_1: ; %Flow20
+; GFX908-NEXT:  .LBB3_1: ; %bb12
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[0:1]
-; GFX908-NEXT:    s_cbranch_vccz .LBB3_12
+; GFX908-NEXT:    v_add_co_u32_e32 v2, vcc, v2, v0
+; GFX908-NEXT:    v_accvgpr_read_b32 v5, a1
+; GFX908-NEXT:    v_addc_co_u32_e32 v3, vcc, 0, v3, vcc
+; GFX908-NEXT:    v_accvgpr_read_b32 v4, a0
+; GFX908-NEXT:    v_add_co_u32_e32 v14, vcc, v14, v4
+; GFX908-NEXT:    v_addc_co_u32_e32 v15, vcc, v15, v5, vcc
+; GFX908-NEXT:    s_cbranch_execz .LBB3_12
 ; GFX908-NEXT:  .LBB3_2: ; %bb9
 ; GFX908-NEXT:    ; =>This Loop Header: Depth=1
 ; GFX908-NEXT:    ; Child Loop BB3_5 Depth 2
-; GFX908-NEXT:    s_mov_b64 s[16:17], -1
+; GFX908-NEXT:    s_mov_b64 s[0:1], -1
 ; GFX908-NEXT:    s_cbranch_scc0 .LBB3_10
 ; GFX908-NEXT:  ; %bb.3: ; %bb14
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT:    global_load_dwordx2 v[2:3], v[0:1], off
-; GFX908-NEXT:    v_cmp_gt_i64_e64 s[0:1], s[4:5], -1
-; GFX908-NEXT:    s_mov_b32 s7, s6
-; GFX908-NEXT:    v_cndmask_b32_e64 v6, 0, 1, s[0:1]
-; GFX908-NEXT:    v_mov_b32_e32 v4, s6
-; GFX908-NEXT:    v_cmp_ne_u32_e64 s[0:1], 1, v6
-; GFX908-NEXT:    v_mov_b32_e32 v6, s6
-; GFX908-NEXT:    v_mov_b32_e32 v9, s7
-; GFX908-NEXT:    v_mov_b32_e32 v5, s7
-; GFX908-NEXT:    v_mov_b32_e32 v7, s7
-; GFX908-NEXT:    v_mov_b32_e32 v8, s6
-; GFX908-NEXT:    v_cmp_lt_i64_e64 s[16:17], s[4:5], 0
-; GFX908-NEXT:    v_mov_b32_e32 v11, v5
-; GFX908-NEXT:    s_mov_b64 s[18:19], s[10:11]
-; GFX908-NEXT:    v_mov_b32_e32 v10, v4
+; GFX908-NEXT:    v_mov_b32_e32 v10, 0
+; GFX908-NEXT:    v_mov_b32_e32 v11, 0
+; GFX908-NEXT:    global_load_dwordx2 v[10:11], v[10:11], off
+; GFX908-NEXT:    v_cmp_lt_i64_e32 vcc, -1, v[2:3]
+; GFX908-NEXT:    s_mov_b32 s5, s4
+; GFX908-NEXT:    v_cndmask_b32_e64 v16, 0, 1, vcc
+; GFX908-NEXT:    v_accvgpr_write_b32 a2, v14
+; GFX908-NEXT:    v_cmp_gt_i64_e64 s[2:3], 0, v[2:3]
+; GFX908-NEXT:    v_accvgpr_write_b32 a3, v15
+; GFX908-NEXT:    v_mov_b32_e32 v13, s5
+; GFX908-NEXT:    v_cmp_ne_u32_e64 s[0:1], 1, v16
+; GFX908-NEXT:    v_mov_b32_e32 v17, s5
+; GFX908-NEXT:    v_mov_b32_e32 v12, s4
+; GFX908-NEXT:    v_mov_b32_e32 v16, s4
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    v_readfirstlane_b32 s7, v2
-; GFX908-NEXT:    v_readfirstlane_b32 s9, v3
-; GFX908-NEXT:    s_add_u32 s7, s7, 1
-; GFX908-NEXT:    s_addc_u32 s9, s9, 0
-; GFX908-NEXT:    s_mul_hi_u32 s20, s2, s7
-; GFX908-NEXT:    s_mul_i32 s9, s2, s9
-; GFX908-NEXT:    s_mul_i32 s21, s3, s7
-; GFX908-NEXT:    s_add_i32 s9, s20, s9
-; GFX908-NEXT:    s_mul_i32 s7, s2, s7
-; GFX908-NEXT:    s_add_i32 s9, s9, s21
+; GFX908-NEXT:    v_add_co_u32_e32 v20, vcc, 1, v10
+; GFX908-NEXT:    v_addc_co_u32_e32 v18, vcc, 0, v11, vcc
+; GFX908-NEXT:    v_mul_lo_u32 v21, s6, v18
+; GFX908-NEXT:    v_mul_hi_u32 v22, s6, v20
+; GFX908-NEXT:    v_mul_lo_u32 v23, s7, v20
+; GFX908-NEXT:    v_mul_lo_u32 v29, s6, v20
+; GFX908-NEXT:    v_mov_b32_e32 v19, s5
+; GFX908-NEXT:    v_add_u32_e32 v20, v22, v21
+; GFX908-NEXT:    v_add_u32_e32 v30, v20, v23
+; GFX908-NEXT:    v_mov_b32_e32 v21, s5
+; GFX908-NEXT:    v_mov_b32_e32 v18, s4
+; GFX908-NEXT:    v_mov_b32_e32 v20, s4
 ; GFX908-NEXT:    s_branch .LBB3_5
 ; GFX908-NEXT:  .LBB3_4: ; %bb58
 ; GFX908-NEXT:    ; in Loop: Header=BB3_5 Depth=2
-; GFX908-NEXT:    v_add_co_u32_sdwa v2, vcc, v2, v16 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
-; GFX908-NEXT:    v_addc_co_u32_e32 v3, vcc, 0, v3, vcc
-; GFX908-NEXT:    s_add_u32 s18, s18, s14
-; GFX908-NEXT:    v_cmp_lt_i64_e64 s[22:23], -1, v[2:3]
-; GFX908-NEXT:    s_addc_u32 s19, s19, s15
-; GFX908-NEXT:    s_mov_b64 s[20:21], 0
-; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[22:23]
+; GFX908-NEXT:    v_add_co_u32_e32 v10, vcc, v10, v28
+; GFX908-NEXT:    v_addc_co_u32_e32 v11, vcc, 0, v11, vcc
+; GFX908-NEXT:    v_add_co_u32_e32 v14, vcc, v14, v6
+; GFX908-NEXT:    v_cmp_lt_i64_e64 s[10:11], -1, v[10:11]
+; GFX908-NEXT:    v_addc_co_u32_e32 v15, vcc, v15, v7, vcc
+; GFX908-NEXT:    s_mov_b64 s[8:9], 0
+; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[10:11]
 ; GFX908-NEXT:    s_cbranch_vccz .LBB3_9
 ; GFX908-NEXT:  .LBB3_5: ; %bb16
 ; GFX908-NEXT:    ; Parent Loop BB3_2 Depth=1
 ; GFX908-NEXT:    ; => This Inner Loop Header: Depth=2
-; GFX908-NEXT:    s_add_u32 s20, s18, s7
-; GFX908-NEXT:    s_addc_u32 s21, s19, s9
-; GFX908-NEXT:    global_load_dword v21, v19, s[20:21] offset:-12 glc
+; GFX908-NEXT:    v_add_co_u32_e32 v22, vcc, v14, v29
+; GFX908-NEXT:    v_addc_co_u32_e32 v23, vcc, v15, v30, vcc
+; GFX908-NEXT:    global_load_dword v32, v[22:23], off offset:-12 glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    global_load_dword v20, v19, s[20:21] offset:-8 glc
+; GFX908-NEXT:    global_load_dword v31, v[22:23], off offset:-8 glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    global_load_dword v12, v19, s[20:21] offset:-4 glc
+; GFX908-NEXT:    global_load_dword v24, v[22:23], off offset:-4 glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    global_load_dword v12, v19, s[20:21] glc
+; GFX908-NEXT:    global_load_dword v22, v[22:23], off glc
 ; GFX908-NEXT:    s_waitcnt vmcnt(0)
-; GFX908-NEXT:    ds_read_b64 v[12:13], v19
-; GFX908-NEXT:    ds_read_b64 v[14:15], v0
+; GFX908-NEXT:    ds_read_b64 v[22:23], v1
+; GFX908-NEXT:    ds_read_b64 v[24:25], v0
 ; GFX908-NEXT:    s_and_b64 vcc, exec, s[0:1]
 ; GFX908-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX908-NEXT:    s_cbranch_vccnz .LBB3_7
 ; GFX908-NEXT:  ; %bb.6: ; %bb51
 ; GFX908-NEXT:    ; in Loop: Header=BB3_5 Depth=2
-; GFX908-NEXT:    v_cvt_f32_f16_sdwa v22, v21 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
-; GFX908-NEXT:    v_cvt_f32_f16_e32 v21, v21
-; GFX908-NEXT:    v_cvt_f32_f16_sdwa v23, v20 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
-; GFX908-NEXT:    v_cvt_f32_f16_e32 v20, v20
-; GFX908-NEXT:    v_add_f32_e32 v24, v17, v12
-; GFX908-NEXT:    v_add_f32_e32 v25, v18, v13
-; GFX908-NEXT:    v_add_f32_e32 v26, 0, v12
-; GFX908-NEXT:    v_add_f32_e32 v27, 0, v13
-; GFX908-NEXT:    v_add_f32_e32 v15, v22, v15
-; GFX908-NEXT:    v_add_f32_e32 v14, v21, v14
-; GFX908-NEXT:    v_add_f32_e32 v13, v23, v13
-; GFX908-NEXT:    v_add_f32_e32 v12, v20, v12
-; GFX908-NEXT:    v_add_f32_e32 v5, v5, v25
-; GFX908-NEXT:    v_add_f32_e32 v4, v4, v24
-; GFX908-NEXT:    v_add_f32_e32 v7, v7, v27
-; GFX908-NEXT:    v_add_f32_e32 v6, v6, v26
-; GFX908-NEXT:    v_add_f32_e32 v8, v8, v14
-; GFX908-NEXT:    v_add_f32_e32 v9, v9, v15
-; GFX908-NEXT:    v_add_f32_e32 v10, v10, v12
-; GFX908-NEXT:    v_add_f32_e32 v11, v11, v13
+; GFX908-NEXT:    v_cvt_f32_f16_sdwa v33, v32 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
+; GFX908-NEXT:    v_cvt_f32_f16_e32 v32, v32
+; GFX908-NEXT:    v_cvt_f32_f16_sdwa v34, v31 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
+; GFX908-NEXT:    v_cvt_f32_f16_e32 v31, v31
+; GFX908-NEXT:    v_add_f32_e32 v4, v26, v22
+; GFX908-NEXT:    v_add_f32_e32 v5, v27, v23
+; GFX908-NEXT:    v_add_f32_e32 v8, 0, v22
+; GFX908-NEXT:    v_add_f32_e32 v9, 0, v23
+; GFX908-NEXT:    v_add_f32_e32 v25, v33, v25
+; GFX908-NEXT:    v_add_f32_e32 v24, v32, v24
+; GFX908-NEXT:    v_add_f32_e32 v23, v34, v23
+; GFX908-NEXT:    v_add_f32_e32 v22, v31, v22
+; GFX908-NEXT:    v_add_f32_e32 v13, v13, v5
+; GFX908-NEXT:    v_add_f32_e32 v12, v12, v4
+; GFX908-NEXT:    v_add_f32_e32 v17, v17, v9
+; GFX908-NEXT:    v_add_f32_e32 v16, v16, v8
+; GFX908-NEXT:    v_add_f32_e32 v18, v18, v24
+; GFX908-NEXT:    v_add_f32_e32 v19, v19, v25
+; GFX908-NEXT:    v_add_f32_e32 v20, v20, v22
+; GFX908-NEXT:    v_add_f32_e32 v21, v21, v23
 ; GFX908-NEXT:    s_branch .LBB3_4
 ; GFX908-NEXT:  .LBB3_7: ; in Loop: Header=BB3_5 Depth=2
-; GFX908-NEXT:    s_mov_b64 s[20:21], s[16:17]
-; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[20:21]
+; GFX908-NEXT:    s_mov_b64 s[8:9], s[2:3]
+; GFX908-NEXT:    s_andn2_b64 vcc, exec, s[8:9]
 ; GFX908-NEXT:    s_cbranch_vccz .LBB3_4
 ; GFX908-NEXT:  ; %bb.8: ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT:    s_mov_b64 s[20:21], -1
-; GFX908-NEXT:    ; implicit-def: $vgpr2_vgpr3
-; GFX908-NEXT:    ; implicit-def: $sgpr18_sgpr19
+; GFX908-NEXT:    s_mov_b64 s[8:9], -1
+; GFX908-NEXT:    ; implicit-def: $vgpr10_vgpr11
+; GFX908-NEXT:    ; implicit-def: $vgpr14_vgpr15
 ; GFX908-NEXT:  .LBB3_9: ; %loop.exit.guard
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT:    s_xor_b64 s[16:17], s[20:21], -1
+; GFX908-NEXT:    v_accvgpr_read_b32 v15, a3
+; GFX908-NEXT:    s_xor_b64 s[0:1], s[8:9], -1
+; GFX908-NEXT:    v_accvgpr_read_b32 v14, a2
 ; GFX908-NEXT:  .LBB3_10: ; %Flow19
 ; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT:    s_mov_b64 s[0:1], -1
-; GFX908-NEXT:    s_and_b64 vcc, exec, s[16:17]
-; GFX908-NEXT:    s_cbranch_vccz .LBB3_1
-; GFX908-NEXT:  ; %bb.11: ; %bb12
-; GFX908-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX908-NEXT:    s_add_u32 s4, s4, s8
-; GFX908-NEXT:    s_addc_u32 s5, s5, 0
-; GFX908-NEXT:    s_add_u32 s10, s10, s12
-; GFX908-NEXT:    s_addc_u32 s11, s11, s13
-; GFX908-NEXT:    s_mov_b64 s[0:1], 0
-; GFX908-NEXT:    s_branch .LBB3_1
+; GFX908-NEXT:    s_and_b64 vcc, exec, s[0:1]
+; GFX908-NEXT:    s_cbranch_vccnz .LBB3_1
+; GFX908-NEXT:  ; %bb.11:
+; GFX908-NEXT:    ; implicit-def: $vgpr2_vgpr3
+; GFX908-NEXT:    ; implicit-def: $vgpr14_vgpr15
 ; GFX908-NEXT:  .LBB3_12: ; %DummyReturnBlock
 ; GFX908-NEXT:    s_endpgm
 ;
 ; GFX90A-LABEL: introduced_copy_to_sgpr:
 ; GFX90A:       ; %bb.0: ; %bb
-; GFX90A-NEXT:    global_load_ushort v18, v[0:1], off glc
+; GFX90A-NEXT:    global_load_ushort v10, v[0:1], off glc
 ; GFX90A-NEXT:    s_load_dwordx4 s[0:3], s[8:9], 0x0
-; GFX90A-NEXT:    s_load_dwordx2 s[4:5], s[8:9], 0x10
-; GFX90A-NEXT:    s_load_dword s7, s[8:9], 0x18
-; GFX90A-NEXT:    s_mov_b32 s6, 0
-; GFX90A-NEXT:    s_mov_b32 s9, s6
+; GFX90A-NEXT:    s_load_dword s5, s[8:9], 0x18
+; GFX90A-NEXT:    v_mov_b32_e32 v1, 0
+; GFX90A-NEXT:    s_load_dwordx2 s[8:9], s[8:9], 0x10
+; GFX90A-NEXT:    s_mov_b32 s4, 0
 ; GFX90A-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX90A-NEXT:    v_cvt_f32_u32_e32 v0, s3
-; GFX90A-NEXT:    s_sub_i32 s8, 0, s3
-; GFX90A-NEXT:    v_mov_b32_e32 v19, 0
-; GFX90A-NEXT:    v_rcp_iflag_f32_e32 v2, v0
-; GFX90A-NEXT:    v_pk_mov_b32 v[0:1], 0, 0
-; GFX90A-NEXT:    v_mul_f32_e32 v2, 0x4f7ffffe, v2
-; GFX90A-NEXT:    v_cvt_u32_f32_e32 v3, v2
-; GFX90A-NEXT:    v_cvt_f32_f16_e32 v2, s7
-; GFX90A-NEXT:    v_readfirstlane_b32 s10, v3
-; GFX90A-NEXT:    s_mul_i32 s8, s8, s10
-; GFX90A-NEXT:    s_mul_hi_u32 s8, s10, s8
-; GFX90A-NEXT:    s_add_i32 s10, s10, s8
-; GFX90A-NEXT:    s_mul_hi_u32 s8, s2, s10
-; GFX90A-NEXT:    s_mul_i32 s10, s8, s3
-; GFX90A-NEXT:    s_sub_i32 s2, s2, s10
-; GFX90A-NEXT:    s_add_i32 s11, s8, 1
-; GFX90A-NEXT:    s_sub_i32 s10, s2, s3
-; GFX90A-NEXT:    s_cmp_ge_u32 s2, s3
-; GFX90A-NEXT:    s_cselect_b32 s8, s11, s8
-; GFX90A-NEXT:    s_cselect_b32 s2, s10, s2
-; GFX90A-NEXT:    s_add_i32 s10, s8, 1
-; GFX90A-NEXT:    s_cmp_ge_u32 s2, s3
-; GFX90A-NEXT:    s_cselect_b32 s8, s10, s8
-; GFX90A-NEXT:    s_lshr_b32 s7, s7, 16
-; GFX90A-NEXT:    v_cvt_f32_f16_e32 v3, s7
-; GFX90A-NEXT:    s_lshl_b64 s[2:3], s[0:1], 5
-; GFX90A-NEXT:    s_lshl_b64 s[12:13], s[8:9], 5
-; GFX90A-NEXT:    s_lshl_b64 s[10:11], s[4:5], 5
+; GFX90A-NEXT:    s_sub_i32 s12, 0, s3
+; GFX90A-NEXT:    s_lshr_b32 s13, s5, 16
+; GFX90A-NEXT:    v_cvt_f32_f16_e32 v2, s5
+; GFX90A-NEXT:    v_rcp_iflag_f32_e32 v0, v0
+; GFX90A-NEXT:    v_cvt_f32_f16_e32 v3, s13
+; GFX90A-NEXT:    s_lshl_b64 s[10:11], s[8:9], 5
 ; GFX90A-NEXT:    s_or_b32 s10, s10, 28
+; GFX90A-NEXT:    v_mul_f32_e32 v0, 0x4f7ffffe, v0
+; GFX90A-NEXT:    v_cvt_u32_f32_e32 v0, v0
+; GFX90A-NEXT:    s_lshl_b64 s[6:7], s[0:1], 5
+; GFX90A-NEXT:    v_pk_mov_b32 v[4:5], s[8:9], s[8:9] op_sel:[0,1]
+; GFX90A-NEXT:    v_pk_mov_b32 v[6:7], s[10:11], s[10:11] op_sel:[0,1]
+; GFX90A-NEXT:    v_mul_lo_u32 v8, s12, v0
+; GFX90A-NEXT:    v_mul_hi_u32 v8, v0, v8
+; GFX90A-NEXT:    v_add_u32_e32 v0, v0, v8
+; GFX90A-NEXT:    v_mul_hi_u32 v0, s2, v0
+; GFX90A-NEXT:    v_mul_lo_u32 v8, v0, s3
+; GFX90A-NEXT:    v_sub_u32_e32 v8, s2, v8
+; GFX90A-NEXT:    v_add_u32_e32 v9, 1, v0
+; GFX90A-NEXT:    v_subrev_u32_e32 v11, s3, v8
+; GFX90A-NEXT:    v_cmp_le_u32_e32 vcc, s3, v8
+; GFX90A-NEXT:    v_cndmask_b32_e32 v0, v0, v9, vcc
+; GFX90A-NEXT:    v_cndmask_b32_e32 v8, v8, v11, vcc
+; GFX90A-NEXT:    v_add_u32_e32 v9, 1, v0
+; GFX90A-NEXT:    v_cmp_le_u32_e32 vcc, s3, v8
+; GFX90A-NEXT:    v_cndmask_b32_e32 v0, v0, v9, vcc
+; GFX90A-NEXT:    v_lshlrev_b64 v[8:9], 5, v[0:1]
 ; GFX90A-NEXT:    s_waitcnt vmcnt(0)
-; GFX90A-NEXT:    v_readfirstlane_b32 s7, v18
-; GFX90A-NEXT:    s_and_b32 s7, 0xffff, s7
-; GFX90A-NEXT:    s_mul_i32 s1, s1, s7
-; GFX90A-NEXT:    s_mul_hi_u32 s9, s0, s7
-; GFX90A-NEXT:    s_mul_i32 s0, s0, s7
-; GFX90A-NEXT:    s_add_i32 s1, s9, s1
-; GFX90A-NEXT:    s_lshl_b64 s[14:15], s[0:1], 5
+; GFX90A-NEXT:    v_and_b32_e32 v30, 0xffff, v10
+; GFX90A-NEXT:    v_mul_lo_u32 v11, s1, v30
+; GFX90A-NEXT:    v_mul_hi_u32 v12, s0, v30
+; GFX90A-NEXT:    v_mul_lo_u32 v10, s0, v30
+; GFX90A-NEXT:    v_add_u32_e32 v11, v12, v11
+; GFX90A-NEXT:    v_lshlrev_b64 v[10:11], 5, v[10:11]
+; GFX90A-NEXT:    v_pk_mov_b32 v[12:13], 0, 0
 ; GFX90A-NEXT:    s_branch .LBB3_2
-; GFX90A-NEXT:  .LBB3_1: ; %Flow20
+; GFX90A-NEXT:  .LBB3_1: ; %bb12
 ; GFX90A-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX90A-NEXT:    s_andn2_b64 vcc, exec, s[0:1]
-; GFX90A-NEXT:    s_cbranch_vccz .LBB3_12
+; GFX90A-NEXT:    v_add_co_u32_e32 v4, vcc, v4, v0
+; GFX90A-NEXT:    v_addc_co_u32_e32 v5, vcc, 0, v5, vcc
+; GFX90A-NEXT:    v_add_co_u32_e32 v6, vcc, v6, v8
+; GFX90A-NEXT:    v_addc_co_u32_e32 v7, vcc, v7, v9, vcc
+; GFX90A-NEXT:    s_cbranch_execz .LBB3_12
 ; GFX90A-NEXT:  .LBB3_2: ; %bb9
 ; GFX90A-NEXT:    ; =>This Loop Header: Depth=1
 ; GFX90A-NEXT:    ; Child Loop BB3_5 Depth 2
-; GFX90A-NEXT:    s_mov_b64 s[16:17], -1
+; GFX90A-NEXT:    s_mov_b64 s[0:1], -1
 ; GFX90A-NEXT:    s_cbranch_scc0 .LBB3_10
 ; GFX90A-NEXT:  ; %bb.3: ; %bb14
 ; GFX90A-NEXT:    ; in Loop: Header=BB3_2 Depth=1
-; GFX90A-NEXT:    global_load_dwordx2 v[4:5], v[0:1], off
-; GFX90A-NEXT:    v_cmp_gt_i64_e64 s[0:1], s[4:5], -1
-; GFX90A-NEXT:    s_mov_b32 s7, s6
-; GFX90A-NEXT:    v_cndmask_b32_e64 v8, 0, 1, s[0:1]
-; GFX90A-NEXT:    v_pk_mov_b32 v[6:7], s[6:7], s[6:7] op_sel:[0,1]
-; GFX90A-NEXT:    v_cmp_ne_u32_e64 s[0:1], 1, v8
-; GFX90A-NEXT:    v_pk_mov_b32 v[8:9], s[6:7], s[6:7] op_sel:[0,1]
-; GFX90A-NEXT:    v_pk_mov_b32 v[10:11], s[6:7], s[6:7] op_sel:[0,1]
-; GFX90A-NEXT:    v_cmp_lt_i64_e64 s[16:17], s[4:5], 0
-; GFX90A...
[truncated]

@arsenm arsenm force-pushed the users/arsenm/amdgpu/move-insert-of-illegal-v-to-s-copy-into-map branch from 871dcce to 32cde78 Compare March 13, 2025 15:13
@arsenm arsenm force-pushed the users/arsenm/amdgpu/move-insert-of-illegal-v-to-s-copy-into-map branch from 32cde78 to f3ebefa Compare March 17, 2025 06:31
Insert the start instruction directly into the map before the uses. This
prevents improperly re-visting sgpr->vgpr phi inputs multiple times which
would trigger a use after free.

I don't particularly trust the iteration scheme here. This is also
unnecessarily revisting transitive users of a phi or reg_sequence for every
input operand, but I will address that separately.

Fixes #130646. I also believe it fixes #130119, although that test fails
less consistently for me.
@arsenm arsenm force-pushed the users/arsenm/amdgpu/move-insert-of-illegal-v-to-s-copy-into-map branch from f3ebefa to c51fc6c Compare March 17, 2025 09:49
Copy link
Collaborator

@dstutt dstutt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable.
I encountered another case where this was an issue - and this fix worked.

@arsenm arsenm merged commit dea5aa7 into main Mar 18, 2025
11 checks passed
@arsenm arsenm deleted the users/arsenm/amdgpu/move-insert-of-illegal-v-to-s-copy-into-map branch March 18, 2025 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants