[VectorCombine][X86] Use updated getVectorInstrCost hook #137823


Conversation


@lukel97 lukel97 commented Apr 29, 2025

This addresses a TODO where scalarizeBinopOrCmp previously bailed out conservatively if one of the operands was a load.

getVectorInstrCost was updated in https://reviews.llvm.org/D140498 to take the values involved, so we can now pass in the scalar value to be inserted, which should return an accurate cost for a gather.

To prevent regressions on x86, this tries to constant fold NewVecC up front so we can pass it into TTI and get a more accurate cost.

We want to remove this restriction on RISC-V, since there the transform is always profitable whether or not the scalar is a load.

On X86 this seems to prevent scalarization on SSE where the index is 0, because the cost of an insertion into undef goes from 12 -> 1 with the value passed into it. Is this correct? Or is there a way to fix this in X86TTIImpl::getVectorInstrCost? cc @alexey-bataev
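
To make the enabled transform concrete, here is a hand-written sketch (modeled on the `shl_constant_op1_load` test updated below, not code copied from the patch). A binop whose non-constant operand is an inserted load can now be rewritten into the scalar op plus a single insert:

```llvm
define <2 x i64> @shl_load(ptr %p) {
  ; Before this patch, the load operand blocked scalarization, leaving:
  ;   %ins = insertelement <2 x i64> poison, i64 %ld, i32 0
  ;   %bo  = shl nuw <2 x i64> %ins, <i64 5, i64 2>
  %ld = load i64, ptr %p
  ; With the value-aware getVectorInstrCost, vector-combine now emits:
  %bo.scalar = shl nuw i64 %ld, 5
  %bo = insertelement <2 x i64> poison, i64 %bo.scalar, i64 0
  ret <2 x i64> %bo
}
```

This is reproducible with `opt -passes=vector-combine` on the affected targets, as in the RUN lines of the tests below.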

llvmbot commented Apr 29, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Luke Lau (lukel97)

Changes



Patch is 30.80 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137823.diff

7 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VectorCombine.cpp (+8-12)
  • (modified) llvm/test/Transforms/VectorCombine/X86/insert-binop-inseltpoison.ll (+23-10)
  • (modified) llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant-inseltpoison.ll (+58-28)
  • (modified) llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant.ll (+58-28)
  • (modified) llvm/test/Transforms/VectorCombine/X86/insert-binop.ll (+23-10)
  • (modified) llvm/test/Transforms/VectorCombine/X86/scalarize-cmp-inseltpoison.ll (+41-19)
  • (modified) llvm/test/Transforms/VectorCombine/X86/scalarize-cmp.ll (+41-19)
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index 04c084ffdda97..f046a7d305d51 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -1066,14 +1066,6 @@ bool VectorCombine::scalarizeBinopOrCmp(Instruction &I) {
       VecTy1->getElementCount().getKnownMinValue() <= Index1)
     return false;
 
-  // Bail for single insertion if it is a load.
-  // TODO: Handle this once getVectorInstrCost can cost for load/stores.
-  auto *I0 = dyn_cast_or_null<Instruction>(V0);
-  auto *I1 = dyn_cast_or_null<Instruction>(V1);
-  if ((IsConst0 && I1 && I1->mayReadFromMemory()) ||
-      (IsConst1 && I0 && I0->mayReadFromMemory()))
-    return false;
-
   uint64_t Index = IsConst0 ? Index1 : Index0;
   Type *ScalarTy = IsConst0 ? V1->getType() : V0->getType();
   Type *VecTy = I.getType();
@@ -1100,11 +1092,15 @@ bool VectorCombine::scalarizeBinopOrCmp(Instruction &I) {
   // both sequences.
   InstructionCost InsertCost = TTI.getVectorInstrCost(
       Instruction::InsertElement, VecTy, CostKind, Index);
-  InstructionCost OldCost =
-      (IsConst0 ? 0 : InsertCost) + (IsConst1 ? 0 : InsertCost) + VectorOpCost;
+  InstructionCost InsertCostV0 = TTI.getVectorInstrCost(
+      Instruction::InsertElement, VecTy, CostKind, Index, VecC0, V0);
+  InstructionCost InsertCostV1 = TTI.getVectorInstrCost(
+      Instruction::InsertElement, VecTy, CostKind, Index, VecC1, V1);
+  InstructionCost OldCost = (IsConst0 ? 0 : InsertCostV0) +
+                            (IsConst1 ? 0 : InsertCostV1) + VectorOpCost;
   InstructionCost NewCost = ScalarOpCost + InsertCost +
-                            (IsConst0 ? 0 : !Ins0->hasOneUse() * InsertCost) +
-                            (IsConst1 ? 0 : !Ins1->hasOneUse() * InsertCost);
+                            (IsConst0 ? 0 : !Ins0->hasOneUse() * InsertCostV0) +
+                            (IsConst1 ? 0 : !Ins1->hasOneUse() * InsertCostV1);
 
   // We want to scalarize unless the vector variant actually has lower cost.
   if (OldCost < NewCost || !NewCost.isValid())
diff --git a/llvm/test/Transforms/VectorCombine/X86/insert-binop-inseltpoison.ll b/llvm/test/Transforms/VectorCombine/X86/insert-binop-inseltpoison.ll
index c1100780254c1..76440c7047059 100644
--- a/llvm/test/Transforms/VectorCombine/X86/insert-binop-inseltpoison.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/insert-binop-inseltpoison.ll
@@ -8,10 +8,16 @@ declare void @usef(<4 x float>)
 ; Eliminating an insert is profitable.
 
 define <16 x i8> @ins0_ins0_add(i8 %x, i8 %y) {
-; CHECK-LABEL: @ins0_ins0_add(
-; CHECK-NEXT:    [[R_SCALAR:%.*]] = add i8 [[X:%.*]], [[Y:%.*]]
-; CHECK-NEXT:    [[R:%.*]] = insertelement <16 x i8> poison, i8 [[R_SCALAR]], i64 0
-; CHECK-NEXT:    ret <16 x i8> [[R]]
+; SSE-LABEL: @ins0_ins0_add(
+; SSE-NEXT:    [[I0:%.*]] = insertelement <16 x i8> poison, i8 [[X:%.*]], i32 0
+; SSE-NEXT:    [[I1:%.*]] = insertelement <16 x i8> poison, i8 [[Y:%.*]], i32 0
+; SSE-NEXT:    [[R:%.*]] = add <16 x i8> [[I0]], [[I1]]
+; SSE-NEXT:    ret <16 x i8> [[R]]
+;
+; AVX-LABEL: @ins0_ins0_add(
+; AVX-NEXT:    [[R_SCALAR:%.*]] = add i8 [[X:%.*]], [[Y:%.*]]
+; AVX-NEXT:    [[R:%.*]] = insertelement <16 x i8> poison, i8 [[R_SCALAR]], i64 0
+; AVX-NEXT:    ret <16 x i8> [[R]]
 ;
   %i0 = insertelement <16 x i8> poison, i8 %x, i32 0
   %i1 = insertelement <16 x i8> poison, i8 %y, i32 0
@@ -155,12 +161,19 @@ define <2 x i64> @ins1_ins1_urem(i64 %x, i64 %y) {
 ; Extra use is accounted for in cost calculation.
 
 define <4 x i32> @ins0_ins0_xor(i32 %x, i32 %y) {
-; CHECK-LABEL: @ins0_ins0_xor(
-; CHECK-NEXT:    [[I0:%.*]] = insertelement <4 x i32> poison, i32 [[X:%.*]], i32 0
-; CHECK-NEXT:    call void @use(<4 x i32> [[I0]])
-; CHECK-NEXT:    [[R_SCALAR:%.*]] = xor i32 [[X]], [[Y:%.*]]
-; CHECK-NEXT:    [[R:%.*]] = insertelement <4 x i32> poison, i32 [[R_SCALAR]], i64 0
-; CHECK-NEXT:    ret <4 x i32> [[R]]
+; SSE-LABEL: @ins0_ins0_xor(
+; SSE-NEXT:    [[I0:%.*]] = insertelement <4 x i32> poison, i32 [[X:%.*]], i32 0
+; SSE-NEXT:    call void @use(<4 x i32> [[I0]])
+; SSE-NEXT:    [[I1:%.*]] = insertelement <4 x i32> poison, i32 [[Y:%.*]], i32 0
+; SSE-NEXT:    [[R:%.*]] = xor <4 x i32> [[I0]], [[I1]]
+; SSE-NEXT:    ret <4 x i32> [[R]]
+;
+; AVX-LABEL: @ins0_ins0_xor(
+; AVX-NEXT:    [[I0:%.*]] = insertelement <4 x i32> poison, i32 [[X:%.*]], i32 0
+; AVX-NEXT:    call void @use(<4 x i32> [[I0]])
+; AVX-NEXT:    [[R_SCALAR:%.*]] = xor i32 [[X]], [[Y:%.*]]
+; AVX-NEXT:    [[R:%.*]] = insertelement <4 x i32> poison, i32 [[R_SCALAR]], i64 0
+; AVX-NEXT:    ret <4 x i32> [[R]]
 ;
   %i0 = insertelement <4 x i32> poison, i32 %x, i32 0
   call void @use(<4 x i32> %i0)
diff --git a/llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant-inseltpoison.ll b/llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant-inseltpoison.ll
index 05251cb829b2b..751539aa0f431 100644
--- a/llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant-inseltpoison.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant-inseltpoison.ll
@@ -3,10 +3,15 @@
 ; RUN: opt < %s -passes=vector-combine -S -mtriple=x86_64-- -mattr=AVX2 | FileCheck %s --check-prefixes=CHECK,AVX
 
 define <2 x i64> @add_constant(i64 %x) {
-; CHECK-LABEL: @add_constant(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = add i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @add_constant(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = add <2 x i64> [[INS]], <i64 42, i64 undef>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @add_constant(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = add i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> poison, i64 %x, i32 0
   %bo = add <2 x i64> %ins, <i64 42, i64 undef>
@@ -14,10 +19,15 @@ define <2 x i64> @add_constant(i64 %x) {
 }
 
 define <2 x i64> @add_constant_not_undef_lane(i64 %x) {
-; CHECK-LABEL: @add_constant_not_undef_lane(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = add i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @add_constant_not_undef_lane(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = add <2 x i64> [[INS]], <i64 42, i64 -42>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @add_constant_not_undef_lane(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = add i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> poison, i64 %x, i32 0
   %bo = add <2 x i64> %ins, <i64 42, i64 -42>
@@ -153,8 +163,8 @@ define <2 x i64> @shl_constant_op0_not_undef_lane(i64 %x) {
 define <2 x i64> @shl_constant_op0_load(ptr %p) {
 ; CHECK-LABEL: @shl_constant_op0_load(
 ; CHECK-NEXT:    [[LD:%.*]] = load i64, ptr [[P:%.*]], align 8
-; CHECK-NEXT:    [[INS:%.*]] = insertelement <2 x i64> poison, i64 [[LD]], i32 1
-; CHECK-NEXT:    [[BO:%.*]] = shl <2 x i64> <i64 undef, i64 2>, [[INS]]
+; CHECK-NEXT:    [[BO_SCALAR:%.*]] = shl i64 2, [[LD]]
+; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 1
 ; CHECK-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ld = load i64, ptr %p
@@ -204,8 +214,8 @@ define <2 x i64> @shl_constant_op1_not_undef_lane(i64 %x) {
 define <2 x i64> @shl_constant_op1_load(ptr %p) {
 ; CHECK-LABEL: @shl_constant_op1_load(
 ; CHECK-NEXT:    [[LD:%.*]] = load i64, ptr [[P:%.*]], align 8
-; CHECK-NEXT:    [[INS:%.*]] = insertelement <2 x i64> poison, i64 [[LD]], i32 0
-; CHECK-NEXT:    [[BO:%.*]] = shl nuw <2 x i64> [[INS]], <i64 5, i64 2>
+; CHECK-NEXT:    [[BO_SCALAR:%.*]] = shl nuw i64 [[LD]], 5
+; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
 ; CHECK-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ld = load i64, ptr %p
@@ -479,10 +489,15 @@ define <2 x i64> @sdiv_constant_op1_not_undef_lane(i64 %x) {
 }
 
 define <2 x i64> @and_constant(i64 %x) {
-; CHECK-LABEL: @and_constant(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = and i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @and_constant(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = and <2 x i64> [[INS]], <i64 42, i64 undef>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @and_constant(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = and i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> poison, i64 %x, i32 0
   %bo = and <2 x i64> %ins, <i64 42, i64 undef>
@@ -490,10 +505,15 @@ define <2 x i64> @and_constant(i64 %x) {
 }
 
 define <2 x i64> @and_constant_not_undef_lane(i64 %x) {
-; CHECK-LABEL: @and_constant_not_undef_lane(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = and i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @and_constant_not_undef_lane(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = and <2 x i64> [[INS]], <i64 42, i64 -42>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @and_constant_not_undef_lane(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = and i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> poison, i64 %x, i32 0
   %bo = and <2 x i64> %ins, <i64 42, i64 -42>
@@ -523,10 +543,15 @@ define <2 x i64> @or_constant_not_undef_lane(i64 %x) {
 }
 
 define <2 x i64> @xor_constant(i64 %x) {
-; CHECK-LABEL: @xor_constant(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = xor i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @xor_constant(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = xor <2 x i64> [[INS]], <i64 42, i64 undef>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @xor_constant(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = xor i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> poison, i64 %x, i32 0
   %bo = xor <2 x i64> %ins, <i64 42, i64 undef>
@@ -534,10 +559,15 @@ define <2 x i64> @xor_constant(i64 %x) {
 }
 
 define <2 x i64> @xor_constant_not_undef_lane(i64 %x) {
-; CHECK-LABEL: @xor_constant_not_undef_lane(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = xor i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @xor_constant_not_undef_lane(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> poison, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = xor <2 x i64> [[INS]], <i64 42, i64 -42>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @xor_constant_not_undef_lane(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = xor i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> poison, i64 %x, i32 0
   %bo = xor <2 x i64> %ins, <i64 42, i64 -42>
diff --git a/llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant.ll b/llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant.ll
index bbdd76c58b58e..2b4db0583e69c 100644
--- a/llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant.ll
@@ -3,10 +3,15 @@
 ; RUN: opt < %s -passes=vector-combine -S -mtriple=x86_64-- -mattr=AVX2 | FileCheck %s --check-prefixes=CHECK,AVX
 
 define <2 x i64> @add_constant(i64 %x) {
-; CHECK-LABEL: @add_constant(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = add i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> undef, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @add_constant(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> undef, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = add <2 x i64> [[INS]], <i64 42, i64 undef>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @add_constant(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = add i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> undef, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> undef, i64 %x, i32 0
   %bo = add <2 x i64> %ins, <i64 42, i64 undef>
@@ -14,10 +19,15 @@ define <2 x i64> @add_constant(i64 %x) {
 }
 
 define <2 x i64> @add_constant_not_undef_lane(i64 %x) {
-; CHECK-LABEL: @add_constant_not_undef_lane(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = add i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> undef, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @add_constant_not_undef_lane(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> undef, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = add <2 x i64> [[INS]], <i64 42, i64 -42>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @add_constant_not_undef_lane(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = add i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> undef, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> undef, i64 %x, i32 0
   %bo = add <2 x i64> %ins, <i64 42, i64 -42>
@@ -153,8 +163,8 @@ define <2 x i64> @shl_constant_op0_not_undef_lane(i64 %x) {
 define <2 x i64> @shl_constant_op0_load(ptr %p) {
 ; CHECK-LABEL: @shl_constant_op0_load(
 ; CHECK-NEXT:    [[LD:%.*]] = load i64, ptr [[P:%.*]], align 8
-; CHECK-NEXT:    [[INS:%.*]] = insertelement <2 x i64> undef, i64 [[LD]], i32 1
-; CHECK-NEXT:    [[BO:%.*]] = shl <2 x i64> <i64 undef, i64 2>, [[INS]]
+; CHECK-NEXT:    [[BO_SCALAR:%.*]] = shl i64 2, [[LD]]
+; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> poison, i64 [[BO_SCALAR]], i64 1
 ; CHECK-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ld = load i64, ptr %p
@@ -204,8 +214,8 @@ define <2 x i64> @shl_constant_op1_not_undef_lane(i64 %x) {
 define <2 x i64> @shl_constant_op1_load(ptr %p) {
 ; CHECK-LABEL: @shl_constant_op1_load(
 ; CHECK-NEXT:    [[LD:%.*]] = load i64, ptr [[P:%.*]], align 8
-; CHECK-NEXT:    [[INS:%.*]] = insertelement <2 x i64> undef, i64 [[LD]], i32 0
-; CHECK-NEXT:    [[BO:%.*]] = shl nuw <2 x i64> [[INS]], <i64 5, i64 2>
+; CHECK-NEXT:    [[BO_SCALAR:%.*]] = shl nuw i64 [[LD]], 5
+; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> zeroinitializer, i64 [[BO_SCALAR]], i64 0
 ; CHECK-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ld = load i64, ptr %p
@@ -479,10 +489,15 @@ define <2 x i64> @sdiv_constant_op1_not_undef_lane(i64 %x) {
 }
 
 define <2 x i64> @and_constant(i64 %x) {
-; CHECK-LABEL: @and_constant(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = and i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> <i64 0, i64 undef>, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @and_constant(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> undef, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = and <2 x i64> [[INS]], <i64 42, i64 undef>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @and_constant(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = and i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> <i64 0, i64 undef>, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> undef, i64 %x, i32 0
   %bo = and <2 x i64> %ins, <i64 42, i64 undef>
@@ -490,10 +505,15 @@ define <2 x i64> @and_constant(i64 %x) {
 }
 
 define <2 x i64> @and_constant_not_undef_lane(i64 %x) {
-; CHECK-LABEL: @and_constant_not_undef_lane(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = and i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> zeroinitializer, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @and_constant_not_undef_lane(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> undef, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = and <2 x i64> [[INS]], <i64 42, i64 -42>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @and_constant_not_undef_lane(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = and i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> zeroinitializer, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> undef, i64 %x, i32 0
   %bo = and <2 x i64> %ins, <i64 42, i64 -42>
@@ -523,10 +543,15 @@ define <2 x i64> @or_constant_not_undef_lane(i64 %x) {
 }
 
 define <2 x i64> @xor_constant(i64 %x) {
-; CHECK-LABEL: @xor_constant(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = xor i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> <i64 undef, i64 0>, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @xor_constant(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> undef, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = xor <2 x i64> [[INS]], <i64 42, i64 undef>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @xor_constant(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = xor i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> <i64 undef, i64 0>, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> undef, i64 %x, i32 0
   %bo = xor <2 x i64> %ins, <i64 42, i64 undef>
@@ -534,10 +559,15 @@ define <2 x i64> @xor_constant(i64 %x) {
 }
 
 define <2 x i64> @xor_constant_not_undef_lane(i64 %x) {
-; CHECK-LABEL: @xor_constant_not_undef_lane(
-; CHECK-NEXT:    [[BO_SCALAR:%.*]] = xor i64 [[X:%.*]], 42
-; CHECK-NEXT:    [[BO:%.*]] = insertelement <2 x i64> undef, i64 [[BO_SCALAR]], i64 0
-; CHECK-NEXT:    ret <2 x i64> [[BO]]
+; SSE-LABEL: @xor_constant_not_undef_lane(
+; SSE-NEXT:    [[INS:%.*]] = insertelement <2 x i64> undef, i64 [[X:%.*]], i32 0
+; SSE-NEXT:    [[BO:%.*]] = xor <2 x i64> [[INS]], <i64 42, i64 -42>
+; SSE-NEXT:    ret <2 x i64> [[BO]]
+;
+; AVX-LABEL: @xor_constant_not_undef_lane(
+; AVX-NEXT:    [[BO_SCALAR:%.*]] = xor i64 [[X:%.*]], 42
+; AVX-NEXT:    [[BO:%.*]] = insertelement <2 x i64> undef, i64 [[BO_SCALAR]], i64 0
+; AVX-NEXT:    ret <2 x i64> [[BO]]
 ;
   %ins = insertelement <2 x i64> undef, i64 %x, i32 0
   %bo = xor <2 x i64> %ins, <i64 42, i64 -42>
diff --git a/llvm/test/Transforms/VectorCombine/X86/insert-binop.ll b/llvm/test/Transforms/VectorCombine/X86/insert-binop.ll
index cd7e2ad2ca2c6..789ee7b3cdf0d 100644
--- a/llvm/test/Transforms/VectorCombine/X86/insert-binop.ll
+++ b/llvm/test/Transforms/VectorCombine/X86/insert-binop.ll
@@ -8,10 +8,16 @@ declare void @usef(<4 x float>)
 ; Eliminating an insert is profitable.
 
 define <16 x i8> @ins0_ins0_add(i8 %x, i8 %y) {
-; CHECK-LABEL: @ins0_ins0_add(
-; CHECK-NEXT:    [[R_SCALAR:%.*]] = add i8 [[X:%.*]], [[Y:%.*]]
-; CHECK-NEXT:    [[R:%.*]] = insertelement <16 x i8> undef, i8 [[R_SCALAR]], i64 0
-; CHECK-NEXT:    ret <16 x i8> [[R]]
+; SSE-LABEL: @ins0_ins0_add(
+; SSE-NEXT:    [[I0:%.*]] = insertelement <16 x i8> undef, i8 [[X:%.*]], i32 0
+; SSE-NEXT:    [[I1:%.*]] = insertelement <16 x i8> undef, i8 [[Y:%.*]], i32 0
+; SSE-NEXT:    [[R:%.*]] = add <16 x i8> [[I0]], [[I1]]
+; SSE-NEXT:    ret <16 x i8> [[R]]
+;
+; AVX-LABEL: @ins0_ins0_add(
+; AVX-NEXT:    [[R_SCALAR:%.*]] = add i8 [[X:%.*]], [[Y:%.*]]
+; AVX-NEXT:    [[R:%.*]] = insertelement <16 x i8> undef, i8 [[R_SCALAR]], i64 0
+; AVX-NEXT:    ret <16 x i8> [[R]]
 ;
   %i0 = insertelement <16 x i8> undef, i8 %x, i32 0
   %i1 = insertelement <16 x i8> undef, i8 %y, i32 0
@@ -155,12 +161,19 @@ define <2 x i64> @ins1_ins1_urem(i64 %x, i64 %y) {
 ; Extra use is accounted for in cost calculation.
 
 define <4 x i32> @ins0_ins0_xor(i32 %x, i32 %y) {
-; CHECK-LABEL: @ins0_ins0_xor(
-; CHECK-NEXT:    [[I0:%.*]] = insertelement <4 x i32> undef, i32 [[X:...
[truncated]


github-actions bot commented Apr 29, 2025

⚠️ undef deprecator found issues in your code. ⚠️

You can test this locally with the following command:

```shell
git diff -U0 --pickaxe-regex -S '([^a-zA-Z0-9#_-]undef[^a-zA-Z0-9_-]|UndefValue::get)' 'HEAD~1' HEAD llvm/test/Transforms/VectorCombine/RISCV/binop-scalarize.ll llvm/lib/Transforms/Vectorize/VectorCombine.cpp llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant-inseltpoison.ll llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant.ll llvm/test/Transforms/VectorCombine/X86/insert-binop.ll llvm/test/Transforms/VectorCombine/X86/scalarize-cmp-inseltpoison.ll llvm/test/Transforms/VectorCombine/X86/scalarize-cmp.ll
```

The following files introduce new uses of undef:

  • llvm/test/Transforms/VectorCombine/X86/insert-binop-with-constant.ll
  • llvm/test/Transforms/VectorCombine/X86/insert-binop.ll
  • llvm/test/Transforms/VectorCombine/X86/scalarize-cmp.ll

Undef is now deprecated and should only be used in the rare cases where no replacement is possible. For example, a load of uninitialized memory yields undef. You should use poison values for placeholders instead.

In tests, avoid using undef and having tests that trigger undefined behavior. If you need an operand with some unimportant value, you can add a new argument to the function and use that instead.

For example, this is considered a bad practice:

```llvm
define void @fn() {
  ...
  br i1 undef, ...
}
```

Please use the following instead:

```llvm
define void @fn(i1 %cond) {
  ...
  br i1 %cond, ...
}
```

Please refer to the Undefined Behavior Manual for more information.

```llvm
; CHECK-NEXT: [[BO_SCALAR:%.*]] = fadd double [[X:%.*]], 4.200000e+01
; CHECK-NEXT: [[BO:%.*]] = insertelement <2 x double> <double 0x7FF8000000000000, double undef>, double [[BO_SCALAR]], i64 0
; CHECK-NEXT: [[INS:%.*]] = insertelement <2 x double> undef, double [[X:%.*]], i32 0
; CHECK-NEXT: [[BO:%.*]] = fadd <2 x double> [[INS]], <double 4.200000e+01, double undef>
```

lukel97 (author) commented on this hunk:

I think the reason this is no longer scalarized is that on SSE2 and above, double and <2 x double> have the same fadd cost:

```llvm
; SSE2-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:3 SizeLat:1 for: %F64 = fadd double undef, undef
; SSE2-NEXT:  Cost Model: Found costs of RThru:2 CodeSize:1 Lat:3 SizeLat:1 for: %V2F64 = fadd <2 x double> undef, undef
```
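
Plugging those numbers into the pass's OldCost/NewCost comparison makes the decision concrete. This is an illustrative sketch only: the fadd costs of 2 come from the output above, but the two insert costs are assumptions.

```llvm
; OldCost = InsertCostV0(%x -> lane 0) + VectorOpCost(fadd <2 x double>)
;         =        0 (assumed)         +              2                  = 2
; NewCost = ScalarOpCost(fadd double)  + InsertCost(result -> lane 0)
;         =              2             +        1 (assumed)              = 3
; OldCost < NewCost, so scalarizeBinopOrCmp bails and keeps the vector op.
```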

lukel97 (author) added:

Leaving a note here that this still ultimately generates the same code, because DAGCombine will scalarize it either way.


lukel97 commented May 7, 2025

Ping, I've updated this to plumb through the constant-folded base vector, so I think the x86 regressions should be gone now. The remaining test diffs look more legitimate.

@RKSimon RKSimon requested a review from davemgreen May 7, 2025 12:04
```diff
@@ -168,11 +168,17 @@ define <2 x i1> @constant_op1_i64_not_undef_lane(i64 %x) {
; negative test - load prevents the transform
```

Collaborator review comment on this hunk:

update comments

@lukel97 lukel97 force-pushed the vector-combine/scalarizeBinOpOfSplats-insertelement-cost branch from 58af11e to 89bccad Compare May 21, 2025 10:04

lukel97 commented May 21, 2025

Gentle ping

@RKSimon RKSimon (Collaborator) left a comment:

LGTM

@lukel97 lukel97 merged commit 97f6076 into llvm:main May 27, 2025
10 of 11 checks passed
lukel97 added a commit that referenced this pull request May 28, 2025
…#141300)

We currently pull shuffles through binops and intrinsics, which is an
important canonical form for VectorCombine to be able to scalarize
vector sequences. But while binops can be folded with a constant
operand, intrinsics currently require all operands to be shufflevectors.

This extends intrinsic folding to be in line with regular binops by
reusing the constant "unshuffling" logic.

As far as I can tell, the intrinsics currently folded don't require any special UB handling. (An IR sketch of the unshuffling follows the assembly below.)

This change, in combination with #138095 and #137823, fixes the following C code:
C:

```c
void max(int *x, int *y, int n) {
  for (int i = 0; i < n; i++)
    x[i] += *y > 42 ? *y : 42;
}
```

Previously, on RISC-V with `-O3 -march=rva23u64`, this used the splatted vector form:

```asm
	vmv.s.x	v8, a4
	li	a4, 42
	vmax.vx	v10, v8, a4
	vrgather.vi	v8, v10, 0
.LBB0_9:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vl2re32.v	v10, (a5)
	vadd.vv	v10, v10, v8
	vs2r.v	v10, (a5)
```

i.e., it now generates

```asm
        li	a6, 42
        max	a6, a4, a6
.LBB0_9:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	vl2re32.v	v8, (a5)
	vadd.vx	v8, v8, a6
	vs2r.v	v8, (a5)
```
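
To make the constant "unshuffling" concrete, here is a hand-written IR sketch (not taken from the patch; the intrinsic and lane values are chosen for illustration and connect to the `max` example above):

```llvm
; Before: a splat sits between the scalar insert and the intrinsic.
;   %v     = insertelement <4 x i32> poison, i32 %x, i64 0
;   %splat = shufflevector <4 x i32> %v, <4 x i32> poison, <4 x i32> zeroinitializer
;   %max   = call <4 x i32> @llvm.smax.v4i32(<4 x i32> %splat, <4 x i32> splat (i32 42))
; After: the constant is "unshuffled" back through the splat mask (only lane 0
; is meaningful), the intrinsic is applied first, and the shuffle moves last:
;   %smax  = call <4 x i32> @llvm.smax.v4i32(<4 x i32> %v, <4 x i32> <i32 42, i32 poison, i32 poison, i32 poison>)
;   %splat = shufflevector <4 x i32> %smax, <4 x i32> poison, <4 x i32> zeroinitializer
```

With the shuffle pulled to the end, the scalarization from this PR can then fold the smax onto the scalar, which is roughly how the `vmax.vx`/`vrgather.vi` prologue above collapses to a scalar `max`.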
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Jun 3, 2025
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Jun 3, 2025

fmayer commented Jun 11, 2025

Hey, this seems to have broken MSAN for us (it caused a false positive). I'm not yet clear whether this is a problem with the patch per se or with MSAN. Would you be opposed to reverting it while we figure out what's wrong? Sorry for the bother.


lukel97 commented Jun 11, 2025

> Hey, this seems to have broken MSAN for us (it caused a false positive). I'm not yet clear whether this is a problem with the patch per se or with MSAN. Would you be opposed to reverting it while we figure out what's wrong? Sorry for the bother.

I think a few patches have landed on top of this which might make a revert tricky, unfortunately. Is there a buildbot failure or reproducer to try out?


fmayer commented Jun 11, 2025

The IR diff looks like this (in various places); `-` is before this patch, `+` is after:

```diff
   call void @__msan_unpoison(ptr noundef nonnull %add.ptr88, i64 noundef 4) #55, !dbg !242357
   %30 = load i32, ptr %add.ptr88, align 4, !dbg !242362, !alias.scope !242364, !noalias !242371
   %vecinit3.i.i.i.i = insertelement <4 x i32> <i32 poison, i32 0, i32 0, i32 0>, i32 %30, i64 0, !dbg !242373
-  %cmp.i.i.i54.i.i = icmp ult <4 x i32> %vecinit3.i.i.i.i, splat (i32 2), !dbg !242374
+  %cmp.i.i.i54.i.i.scalar = icmp ult i32 %30, 2, !dbg !242374
+  %cmp.i.i.i54.i.i = insertelement <4 x i1> <i1 poison, i1 true, i1 true, i1 true>, i1 %cmp.i.i.i54.i.i.scalar, i64 0, !dbg !242374
   %31 = bitcast <4 x i1> %cmp.i.i.i54.i.i to i4, !dbg !242381
   %32 = and i4 %31, 1, !dbg !242384
   %cmp.i32.i.i.not = icmp eq i4 %32, 0, !dbg !242386
```


fmayer commented Jun 11, 2025

Alive seems to approve of this transformation: https://alive2.llvm.org/ce/z/xUNKWG


lukel97 commented Jun 11, 2025

> Alive seems to approve of this transformation: https://alive2.llvm.org/ce/z/xUNKWG

FWIW this patch didn't actually change the transformation, it just changed the cost model so the transform fires in more places. So if the transformation were faulty, the bug would have existed before this patch.

Is it possible to reduce the test case with llvm-reduce somehow?


fmayer commented Jun 12, 2025

> Alive seems to approve of this transformation: https://alive2.llvm.org/ce/z/xUNKWG
>
> FWIW this patch didn't actually change the transformation, it just changed the cost model so the transform fires in more places. So if the transformation were faulty, the bug would have existed before this patch.
>
> Is it possible to reduce the test case with llvm-reduce somehow?

Yeah, I'm working on that. I just diffed the IR and hoped something would come out, but it all looks good:

```diff
   %28 = load i64, ptr %add.ptr87, align 1, !dbg !190701, !alias.scope !190705
-  %vecinit1.i.i.i519 = insertelement <2 x i64> <i64 poison, i64 0>, i64 %28, i64 0, !dbg !190701
-  %cmp.i.i.i362 = icmp ult <2 x i64> %vecinit1.i.i.i519, splat (i64 2), !dbg !190710
+  %cmp.i.i.i362.scalar = icmp ult i64 %28, 2, !dbg !190710
+  %cmp.i.i.i362 = insertelement <2 x i1> <i1 poison, i1 true>, i1 %cmp.i.i.i362.scalar, i64 0, !dbg !190710
```

https://alive2.llvm.org/ce/z/CyUZmA

```diff
-  %14 = shl <2 x i64> %vecinit1.i.i, splat (i64 1), !dbg !190577
+  %.scalar = shl i64 %11, 1, !dbg !190577
+  %14 = insertelement <2 x i64> <i64 poison, i64 0>, i64 %.scalar, i64 0, !dbg !190577
   %15 = call noundef <2 x double> @llvm.x86.avx.vpermilvar.pd(<2 x double> <double 4.940660e-324, double 9.881310e-324>, <2 x i64> %14), !dbg !190585
   %16 = bitcast <2 x double> %15 to <2 x i64>, !dbg !190586
   %17 = load ptr, ptr %expected, align 8, !dbg !190593
```

https://alive2.llvm.org/ce/z/SV9bKL
