[AMDGPU] Support D16 folding for image.sample with multiple extractelement and fptrunc users #141758

harrisonGPU · 2025-05-28T13:15:44Z

Currently, D16 folding is only supported for image sample instructions with a single user: an fptrunc to half.
However, D16 folding can also be supported for image.sample instructions with multiple users,
as long as each user is an extractelement followed by an fptrunc to half.
For example:

  %sample = call <4 x float> @llvm.amdgcn.image.sample
  %e0 = extractelement <4 x float> %sample, i32 0
  %h0 = fptrunc float %e0 to half
  %e1 = extractelement <4 x float> %sample, i32 1
  %h1 = fptrunc float %e1 to half
  %e2 = extractelement <4 x float> %sample, i32 2
  %h2 = fptrunc float %e2 to half

This change enables D16 folding for such cases and avoids generating v_cvt_f16_f32_e32 instructions.
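
As a rough illustration of the condition described above, here is a minimal, self-contained C++ sketch that models the user check on a toy IR. The `Node`, `Kind`, `Ty`, and `allUsersAreExtractThenHalfTrunc` names are hypothetical and are not LLVM APIs; the actual patch works on `ExtractElementInst` and `FPTruncInst`.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for IR concepts. These are illustrative only, not LLVM classes.
enum class Kind { ImageSample, ExtractElement, FPTrunc, Other };
enum class Ty { Float, Half, BFloat, Vec4Float };

struct Node {
  Kind kind;
  Ty type;                         // result type of this node
  std::vector<const Node *> users; // nodes that consume this node's result
};

// Returns true only if the sample has at least one user and every user is an
// extractelement whose single user is an fptrunc that produces half.
bool allUsersAreExtractThenHalfTrunc(const Node &sample) {
  if (sample.users.empty())
    return false;
  for (const Node *ext : sample.users) {
    // Reject anything that is not an extractelement with exactly one user.
    if (ext->kind != Kind::ExtractElement || ext->users.size() != 1)
      return false;
    // The single user must be an fptrunc whose result type is half.
    const Node *tr = ext->users.front();
    if (tr->kind != Kind::FPTrunc || tr->type != Ty::Half)
      return false;
  }
  return true;
}
```

In this model, a sample whose extracts each feed one fptrunc to half qualifies, while an extract with a second user or an fptrunc to another type disqualifies the whole sample.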

llvmbot · 2025-05-28T13:16:17Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Patch is 37.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141758.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+62)
(modified) llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll (+415-2)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 5f6ab24182d5e..2ff1501e2b977 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -269,6 +269,68 @@ simplifyAMDGCNImageIntrinsic(const GCNSubtarget *ST,
                                        ArgTys[0] = User->getType();
                                      });
         }
+      } else {
+        // Only perform D16 folding if every user of the image sample is
+        // an ExtractElementInst immediately followed by an FPTrunc to half.
+        SmallVector<ExtractElementInst *, 4> Extracts;
+        SmallVector<FPTruncInst *, 4> Truncs;
+        bool AllHalfExtracts = true;
+
+        for (User *U : II.users()) {
+          auto *Ext = dyn_cast<ExtractElementInst>(U);
+          if (!Ext || !Ext->hasOneUse()) {
+            AllHalfExtracts = false;
+            break;
+          }
+          auto *Tr = dyn_cast<FPTruncInst>(*Ext->user_begin());
+          if (!Tr || !Tr->getType()->getScalarType()->isHalfTy()) {
+            AllHalfExtracts = false;
+            break;
+          }
+          Extracts.push_back(Ext);
+          Truncs.push_back(Tr);
+        }
+
+        if (AllHalfExtracts && !Extracts.empty()) {
+          auto *VecTy = cast<VectorType>(II.getType());
+          unsigned NElts = VecTy->getElementCount().getKnownMinValue();
+          Type *HalfVecTy =
+              VectorType::get(Type::getHalfTy(II.getContext()), NElts, false);
+
+          // Obtain the original image sample intrinsic's signature
+          // and replace its return type with the half-vector for D16 folding
+          SmallVector<Type *, 8> SigTys;
+          if (!Intrinsic::getIntrinsicSignature(II.getCalledFunction(), SigTys))
+            return nullptr;
+          SigTys[0] = HalfVecTy;
+
+          Module *M = II.getModule();
+          Function *HalfDecl =
+              Intrinsic::getOrInsertDeclaration(M, ImageDimIntr->Intr, SigTys);
+
+          II.mutateType(HalfVecTy);
+          II.setCalledFunction(HalfDecl);
+
+          IRBuilder<> Builder(&II);
+          for (auto [lane, Ext] : enumerate(Extracts)) {
+            FPTruncInst *Tr = Truncs[lane];
+            Value *Idx = Ext->getIndexOperand();
+
+            Builder.SetInsertPoint(Tr);
+
+            Value *HalfExtract = Builder.CreateExtractElement(&II, Idx);
+            HalfExtract->takeName(Tr);
+
+            Tr->replaceAllUsesWith(HalfExtract);
+          }
+
+          for (auto *T : Truncs)
+            IC.eraseInstFromFunction(*T);
+          for (auto *E : Extracts)
+            IC.eraseInstFromFunction(*E);
+
+          return &II;
+        }
       }
     }
   }
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll b/llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll
index 30431ad724843..d39cceb9b549a 100644
--- a/llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll
@@ -1,8 +1,9 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx700 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX7 %s
 ; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx810 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX81PLUS %s
-; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx900 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX81PLUS %s
-; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx1010 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX81PLUS %s
+; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx900 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX9 %s
+; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx1010 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX10PLUS %s
+; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx1100 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX11 %s
 
 define amdgpu_ps half @image_sample_2d_fptrunc_to_d16(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %s, float %t) {
 ; GFX7-LABEL: @image_sample_2d_fptrunc_to_d16(
@@ -16,6 +17,21 @@ define amdgpu_ps half @image_sample_2d_fptrunc_to_d16(<8 x i32> inreg %rsrc, <4
 ; GFX81PLUS-NEXT:    [[TEX:%.*]] = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
 ; GFX81PLUS-NEXT:    ret half [[TEX]]
 ;
+; GFX9-LABEL: @image_sample_2d_fptrunc_to_d16(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    ret half [[TEX]]
+;
+; GFX10PLUS-LABEL: @image_sample_2d_fptrunc_to_d16(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    ret half [[TEX]]
+;
+; GFX11-LABEL: @image_sample_2d_fptrunc_to_d16(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[TEX:%.*]] = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    ret half [[TEX]]
+;
 main_body:
   %tex = call float @llvm.amdgcn.image.sample.lz.2d.f32.f32.v8i32.v4i32(i32 1, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
   %tex_half = fptrunc float %tex to half
@@ -40,6 +56,30 @@ define amdgpu_ps half @image_sample_2d_v2f32(<8 x i32> inreg %rsrc, <4 x i32> in
 ; GFX81PLUS-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
 ; GFX81PLUS-NEXT:    ret half [[ADDF_SUM_0]]
 ;
+; GFX9-LABEL: @image_sample_2d_v2f32(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call <2 x half> @llvm.amdgcn.image.sample.lz.2d.v2f16.f32.v8i32.v4i32(i32 3, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <2 x half> [[TEX]], i64 0
+; GFX9-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <2 x half> [[TEX]], i64 1
+; GFX9-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX9-NEXT:    ret half [[ADDF_SUM_0]]
+;
+; GFX10PLUS-LABEL: @image_sample_2d_v2f32(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call <2 x half> @llvm.amdgcn.image.sample.lz.2d.v2f16.f32.v8i32.v4i32(i32 3, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <2 x half> [[TEX]], i64 0
+; GFX10PLUS-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <2 x half> [[TEX]], i64 1
+; GFX10PLUS-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX10PLUS-NEXT:    ret half [[ADDF_SUM_0]]
+;
+; GFX11-LABEL: @image_sample_2d_v2f32(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[TEX:%.*]] = call <2 x half> @llvm.amdgcn.image.sample.lz.2d.v2f16.f32.v8i32.v4i32(i32 3, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <2 x half> [[TEX]], i64 0
+; GFX11-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <2 x half> [[TEX]], i64 1
+; GFX11-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX11-NEXT:    ret half [[ADDF_SUM_0]]
+;
 main_body:
   %tex = call <2 x float> @llvm.amdgcn.image.sample.lz.2d.v2f32.f32.v8i32.v4i32(i32 3, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
   %tex_2_half = fptrunc <2 x float> %tex to <2 x half>
@@ -71,6 +111,36 @@ define amdgpu_ps half @image_sample_2d_v3f32(<8 x i32> inreg %rsrc, <4 x i32> in
 ; GFX81PLUS-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[ADDF_SUM_0]], [[TEX_HALF_2]]
 ; GFX81PLUS-NEXT:    ret half [[ADDF_SUM_1]]
 ;
+; GFX9-LABEL: @image_sample_2d_v3f32(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <3 x half> [[TEX]], i64 0
+; GFX9-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <3 x half> [[TEX]], i64 1
+; GFX9-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <3 x half> [[TEX]], i64 2
+; GFX9-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX9-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[ADDF_SUM_0]], [[TEX_HALF_2]]
+; GFX9-NEXT:    ret half [[ADDF_SUM_1]]
+;
+; GFX10PLUS-LABEL: @image_sample_2d_v3f32(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <3 x half> [[TEX]], i64 0
+; GFX10PLUS-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <3 x half> [[TEX]], i64 1
+; GFX10PLUS-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <3 x half> [[TEX]], i64 2
+; GFX10PLUS-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX10PLUS-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[ADDF_SUM_0]], [[TEX_HALF_2]]
+; GFX10PLUS-NEXT:    ret half [[ADDF_SUM_1]]
+;
+; GFX11-LABEL: @image_sample_2d_v3f32(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[TEX:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <3 x half> [[TEX]], i64 0
+; GFX11-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <3 x half> [[TEX]], i64 1
+; GFX11-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <3 x half> [[TEX]], i64 2
+; GFX11-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX11-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[ADDF_SUM_0]], [[TEX_HALF_2]]
+; GFX11-NEXT:    ret half [[ADDF_SUM_1]]
+;
 main_body:
   %tex = call <3 x float> @llvm.amdgcn.image.sample.lz.2d.v3f32.f32.v8i32.v4i32(i32 7, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
   %tex_3_half = fptrunc <3 x float> %tex to <3 x half>
@@ -108,6 +178,42 @@ define amdgpu_ps half @image_sample_2d_v4f32(<8 x i32> inreg %rsrc, <4 x i32> in
 ; GFX81PLUS-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
 ; GFX81PLUS-NEXT:    ret half [[ADDF_SUM_2]]
 ;
+; GFX9-LABEL: @image_sample_2d_v4f32(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.sample.lz.2d.v4f16.f32.v8i32.v4i32(i32 15, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX9-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX9-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX9-NEXT:    [[TEX_HALF_3:%.*]] = extractelement <4 x half> [[TEX]], i64 3
+; GFX9-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX9-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[TEX_HALF_2]], [[TEX_HALF_3]]
+; GFX9-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
+; GFX9-NEXT:    ret half [[ADDF_SUM_2]]
+;
+; GFX10PLUS-LABEL: @image_sample_2d_v4f32(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.sample.lz.2d.v4f16.f32.v8i32.v4i32(i32 15, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX10PLUS-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX10PLUS-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX10PLUS-NEXT:    [[TEX_HALF_3:%.*]] = extractelement <4 x half> [[TEX]], i64 3
+; GFX10PLUS-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX10PLUS-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[TEX_HALF_2]], [[TEX_HALF_3]]
+; GFX10PLUS-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
+; GFX10PLUS-NEXT:    ret half [[ADDF_SUM_2]]
+;
+; GFX11-LABEL: @image_sample_2d_v4f32(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.sample.lz.2d.v4f16.f32.v8i32.v4i32(i32 15, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX11-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX11-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX11-NEXT:    [[TEX_HALF_3:%.*]] = extractelement <4 x half> [[TEX]], i64 3
+; GFX11-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX11-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[TEX_HALF_2]], [[TEX_HALF_3]]
+; GFX11-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
+; GFX11-NEXT:    ret half [[ADDF_SUM_2]]
+;
 main_body:
   %tex = call <4 x float> @llvm.amdgcn.image.sample.lz.2d.v4f32.f32.v8i32.v4i32(i32 15, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
   %tex_4_half = fptrunc <4 x float> %tex to <4 x half>
@@ -121,6 +227,79 @@ main_body:
   ret half %addf_sum.2
 }
 
+define void @image_sample_2d_multi_fptrunc_to_d16(<8 x i32> %surf_desc, <4 x i32> %samp, float %u, float %v, ptr addrspace(7) %out) {
+; GFX7-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX7-NEXT:  main_body:
+; GFX7-NEXT:    [[SAMPLE:%.*]] = call <3 x float> @llvm.amdgcn.image.sample.lz.2d.v3f32.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX7-NEXT:    [[E0:%.*]] = extractelement <3 x float> [[SAMPLE]], i64 0
+; GFX7-NEXT:    [[H0:%.*]] = fptrunc float [[E0]] to half
+; GFX7-NEXT:    [[E1:%.*]] = extractelement <3 x float> [[SAMPLE]], i64 1
+; GFX7-NEXT:    [[H1:%.*]] = fptrunc float [[E1]] to half
+; GFX7-NEXT:    [[E2:%.*]] = extractelement <3 x float> [[SAMPLE]], i64 2
+; GFX7-NEXT:    [[H2:%.*]] = fptrunc float [[E2]] to half
+; GFX7-NEXT:    [[MUL:%.*]] = fmul half [[H0]], [[H1]]
+; GFX7-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[H2]]
+; GFX7-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX7-NEXT:    ret void
+;
+; GFX81PLUS-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX81PLUS-NEXT:  main_body:
+; GFX81PLUS-NEXT:    [[SAMPLE:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX81PLUS-NEXT:    [[H0:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 0
+; GFX81PLUS-NEXT:    [[H1:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 1
+; GFX81PLUS-NEXT:    [[H2:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 2
+; GFX81PLUS-NEXT:    [[MUL:%.*]] = fmul half [[H0]], [[H1]]
+; GFX81PLUS-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[H2]]
+; GFX81PLUS-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX81PLUS-NEXT:    ret void
+;
+; GFX9-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[SAMPLE:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[HALF_EXT2:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 0
+; GFX9-NEXT:    [[HALF_EXT1:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 1
+; GFX9-NEXT:    [[HALF_EXT:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 2
+; GFX9-NEXT:    [[MUL:%.*]] = fmul half [[HALF_EXT2]], [[HALF_EXT1]]
+; GFX9-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[HALF_EXT]]
+; GFX9-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX9-NEXT:    ret void
+;
+; GFX10PLUS-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[SAMPLE:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[HALF_EXT2:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 0
+; GFX10PLUS-NEXT:    [[HALF_EXT1:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 1
+; GFX10PLUS-NEXT:    [[HALF_EXT:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 2
+; GFX10PLUS-NEXT:    [[MUL:%.*]] = fmul half [[HALF_EXT2]], [[HALF_EXT1]]
+; GFX10PLUS-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[HALF_EXT]]
+; GFX10PLUS-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX10PLUS-NEXT:    ret void
+;
+; GFX11-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[SAMPLE:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    [[HALF_EXT2:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 0
+; GFX11-NEXT:    [[HALF_EXT1:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 1
+; GFX11-NEXT:    [[HALF_EXT:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 2
+; GFX11-NEXT:    [[MUL:%.*]] = fmul half [[HALF_EXT2]], [[HALF_EXT1]]
+; GFX11-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[HALF_EXT]]
+; GFX11-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX11-NEXT:    ret void
+;
+main_body:
+  %sample = call <4 x float> @llvm.amdgcn.image.sample.lz.2d.v4f32.f32.v8i32.v4i32(i32 15, float %u, float %v, <8 x i32> %surf_desc, <4 x i32> %samp, i1 false, i32 0, i32 0)
+  %e0 = extractelement <4 x float> %sample, i32 0
+  %h0 = fptrunc float %e0 to half
+  %e1 = extractelement <4 x float> %sample, i32 1
+  %h1 = fptrunc float %e1 to half
+  %e2 = extractelement <4 x float> %sample, i32 2
+  %h2 = fptrunc float %e2 to half
+  %mul = fmul half %h0, %h1
+  %res = fadd half %mul, %h2
+  store half %res, ptr addrspace(7) %out, align 2
+  ret void
+}
+
 define amdgpu_ps half @image_gather4_2d_v4f32(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, half %s, half %t) {
 ; GFX7-LABEL: @image_gather4_2d_v4f32(
 ; GFX7-NEXT:  main_body:
@@ -147,6 +326,42 @@ define amdgpu_ps half @image_gather4_2d_v4f32(<8 x i32> inreg %rsrc, <4 x i32> i
 ; GFX81PLUS-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
 ; GFX81PLUS-NEXT:    ret half [[ADDF_SUM_2]]
 ;
+; GFX9-LABEL: @image_gather4_2d_v4f32(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.gather4.2d.v4f16.f16.v8i32.v4i32(i32 1, half [[S:%.*]], half [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX9-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX9-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX9-NEXT:    [[TEX_HALF_3:%.*]] = extractelement <4 x half> [[TEX]], i64 3
+; GFX9-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX9-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[TEX_HALF_2]], [[TEX_HALF_3]]
+; GFX9-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
+; GFX9-NEXT:    ret half [[ADDF_SUM_2]]
+;
+; GFX10PLUS-LABEL: @image_gather4_2d_v4f32(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.gather4.2d.v4f16.f16.v8i32.v4i32(i32 1, half [[S:%.*]], half [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX10PLUS-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX10PLUS-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX10PLUS-NEXT:...
[truncated]

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

jayfoad · 2025-05-28T15:18:41Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+          for (auto [lane, Ext] : enumerate(Extracts)) {
+            FPTruncInst *Tr = Truncs[lane];


Suggested change

-          for (auto [lane, Ext] : enumerate(Extracts)) {
-            FPTruncInst *Tr = Truncs[lane];
+          for (auto [Ext, Tr] : zip(Extracts, Truncs)) {

Or put them into a vector of std::pair in the first place.

Thanks, I now use:

for (auto [Ext, Tr] : zip(Extracts, Truncs))

arsenm · 2025-05-28T15:39:52Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+
+        for (User *U : II.users()) {
+          auto *Ext = dyn_cast<ExtractElementInst>(U);
+          if (!Ext || !Ext->hasOneUse()) {


Missing negative test for the !hasOneUse?

Thanks, I added a test called image_sample_2d_extractelement_multi_use_no_d16.

arsenm · 2025-05-28T15:40:14Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+            break;
+          }
+          auto *Tr = dyn_cast<FPTruncInst>(*Ext->user_begin());
+          if (!Tr || !Tr->getType()->getScalarType()->isHalfTy()) {


Missing negative test for fptrunc to a different type?

Thanks, I added a test called image_sample_2d_multi_fptrunc_non_half_no_d16.

jayfoad · 2025-05-28T16:07:43Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

@@ -269,6 +269,68 @@ simplifyAMDGCNImageIntrinsic(const GCNSubtarget *ST,
                                       ArgTys[0] = User->getType();
                                     });
        }
+      } else {


I think it would make a bit more sense to put all of this inside:

if (II.getType()->isVectorTy()) { new code here } else { old code here }

I don't think this makes sense, because if we encounter an image sample with a vector type whose only user is an fptrunc from a float vector to a half vector, we should not fall back to the "old code" path. For example:

%38 = call reassoc nnan nsz arcp contract afn <4 x float> @llvm.amdgcn.image.sample.l.2d.v4f32.f32.v8i32.v4i32(i32 15, float %36, float %37, float 0.000000e+00, <8 x i32> %34, <4 x i32> %35, i1 false, i32 0, i32 0)
%39 = fptrunc <4 x float> %38 to <4 x half>

In this case, the image sample result is a vector that is directly truncated to a half vector, which fits neither the new extractelement + fptrunc pattern nor the old code path that expects scalar values. So simply checking isVectorTy() is not sufficient to distinguish the cases.

Actually the old code path does handle that case.

Here's a new suggestion: just remove the "else" on line 272. So even if II has one use, if the old code path fails, we should still try the new code path. This will handle the case where the only use of II is a single extractelement instruction.

Thanks, I have updated it.
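
The control-flow change suggested above (dropping the `else` so the new fold is still attempted when the old one does not apply) can be sketched roughly as follows. This is a simplified model, not the code from the patch: `Sample`, `Fold`, `tryFoldSingleFPTrunc`, `tryFoldExtractThenFPTrunc`, and `foldD16` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>

// Hypothetical summary of an image sample call and its users.
struct Sample {
  unsigned numUses;
  bool singleUserIsFPTruncToHalf;       // matches the old pattern
  bool allUsersAreExtractThenHalfTrunc; // matches the new pattern
};

enum class Fold { None, SingleFPTrunc, ExtractThenFPTrunc };

// Old path: exactly one use, and that use is an fptrunc to half.
Fold tryFoldSingleFPTrunc(const Sample &s) {
  if (s.numUses == 1 && s.singleUserIsFPTruncToHalf)
    return Fold::SingleFPTrunc;
  return Fold::None;
}

// New path: every user is an extractelement followed by an fptrunc to half.
Fold tryFoldExtractThenFPTrunc(const Sample &s) {
  if (s.allUsersAreExtractThenHalfTrunc)
    return Fold::ExtractThenFPTrunc;
  return Fold::None;
}

Fold foldD16(const Sample &s) {
  Fold f = tryFoldSingleFPTrunc(s);
  if (f != Fold::None)
    return f;
  // No `else` here: a sample whose only use is a single extractelement
  // fails the old path but can still be folded by the new one.
  return tryFoldExtractThenFPTrunc(s);
}
```

With this shape, a sample that has one use which is an extractelement (followed by an fptrunc to half) is no longer skipped just because it has a single use.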

shiltian · 2025-05-28T16:13:26Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+          II.mutateType(HalfVecTy);
+          II.setCalledFunction(HalfDecl);
+
+          IRBuilder<> Builder(&II);


Do you need to use this c'tor since the insert point is updated right away in the loop?

Thanks, I’ve changed it to IRBuilder<> Builder(II.getContext()); to initialize the builder.

shiltian · 2025-06-08T13:49:52Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+          Truncs.push_back(Tr);
+        }
+
+        if (AllHalfExtracts && !Extracts.empty()) {


bail out early to avoid giant indentation here

We don’t bail out early, because doing so would change the existing control-flow logic: there are additional optimizations that run after the hasD16Images check (e.g. the A16/G16 path).

jayfoad

LGTM

jayfoad · 2025-06-16T09:53:02Z

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

+      SmallVector<ExtractElementInst *, 4> Extracts;
+      SmallVector<FPTruncInst *, 4> Truncs;


Nit: I still think this would be neater as a vector-of-std::pairs.

Thanks, I have updated it.
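
For readers unfamiliar with the pattern, here is a minimal standard-C++ sketch of the vector-of-pairs idea, in contrast to two parallel vectors kept in sync by index. The `Extract` and `Trunc` aliases and the `rewriteAll` function are hypothetical placeholders (plain strings stand in for the instruction pointers); the sketch only shows the container shape and the structured-binding loop, not the actual rewrite.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the extractelement and fptrunc instructions.
using Extract = std::string;
using Trunc = std::string;

// Each pair keeps an extract together with the trunc that consumes it, so the
// two can never get out of sync the way two parallel vectors could.
std::vector<std::string>
rewriteAll(const std::vector<std::pair<Extract, Trunc>> &pairs) {
  std::vector<std::string> log;
  for (const auto &[ext, tr] : pairs)
    log.push_back("replace " + tr + " via " + ext);
  return log;
}
```

Collecting the pairs during the matching loop also removes the need for `zip` or an index lookup in the rewrite loop.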

llvm-ci · 2025-06-18T01:15:10Z

LLVM Buildbot has detected a new failure on builder fuchsia-x86_64-linux running on fuchsia-debian-64-us-central1-a-1 while building llvm at step 4 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/11/builds/17534

Here is the relevant piece of the build log for the reference

Step 4 (annotate) failure: 'python ../llvm-zorg/zorg/buildbot/builders/annotated/fuchsia-linux.py ...' (failure)
...
[215/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.a64l.dir/a64l.cpp.obj
[216/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.labs.dir/labs.cpp.obj
[217/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strcmp.dir/strcmp.cpp.obj
[218/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncpy.dir/strncpy.cpp.obj
[219/2505] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.fini.dir/fini.cpp.obj
[220/2505] Generating header features.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/features.yaml
[221/2505] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.putchar.dir/putchar.cpp.obj
[222/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.llabs.dir/llabs.cpp.obj
[223/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.srand.dir/srand.cpp.obj
[224/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj
FAILED: libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj 
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/./bin/clang++ --target=armv8.1m.main-none-eabi -DLIBC_NAMESPACE=__llvm_libc_21_0_0_git -I/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc -isystem /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/include/armv8.1m.main-unknown-none-eabi --target=armv8.1m.main-none-eabi -Wno-atomic-alignment "-Dvfprintf(stream, format, vlist)=vprintf(format, vlist)" "-Dfprintf(stream, format, ...)=printf(format)" -D_LIBCPP_PRINT=1 -mthumb -mfloat-abi=hard -march=armv8.1-m.main+mve.fp+fp.dp -mcpu=cortex-m55 -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -ffunction-sections -fdata-sections -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/runtimes/runtimes-armv8.1m.main-none-eabi-bins=../../../../llvm-project -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/= -no-canonical-prefixes -Os -DNDEBUG --target=armv8.1m.main-none-eabi -DLIBC_QSORT_IMPL=LIBC_QSORT_HEAP_SORT -DLIBC_TYPES_TIME_T_IS_32_BIT -DLIBC_ADD_NULL_CHECKS "-DLIBC_MATH=(LIBC_MATH_SKIP_ACCURATE_PASS | LIBC_MATH_SMALL_TABLES)" -fpie -ffreestanding -DLIBC_FULL_BUILD -nostdlibinc -ffixed-point -fno-builtin -fno-exceptions -fno-lax-vector-conversions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -Wall -Wextra -Werror -Wconversion -Wno-sign-conversion -Wdeprecated -Wno-c99-extensions -Wno-gnu-imaginary-constant -Wno-pedantic -Wimplicit-fallthrough -Wwrite-strings -Wextra-semi -Wnewline-eof -Wnonportable-system-include-path -Wstrict-prototypes -Wthread-safety 
-Wglobal-constructors -DLIBC_COPT_PUBLIC_PACKAGING -MD -MT libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj -MF libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj.d -o libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj -c /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp:20:3: error: unknown type name 'uintptr_t'
   20 |   uintptr_t addr = reinterpret_cast<uintptr_t>(p);
      |   ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp:20:37: error: unknown type name 'uintptr_t'
   20 |   uintptr_t addr = reinterpret_cast<uintptr_t>(p);
      |                                     ^
2 errors generated.
[225/2505] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj
[226/2505] Copying CXX header __atomic/fence.h
[227/2505] Copying CXX header __algorithm/min_max_result.h
[228/2505] Copying CXX header __atomic/support.h
[229/2505] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.inv_trigf_utils.dir/inv_trigf_utils.cpp.obj
[230/2505] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.puts.dir/puts.cpp.obj
[231/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncmp.dir/strncmp.cpp.obj
[232/2505] Copying CXX header __assert
[233/2505] Copying CXX header __algorithm/move_backward.h
[234/2505] Generating header errno.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/errno.yaml
[235/2505] Copying CXX header __algorithm/move.h
[236/2505] Copying CXX header __algorithm/transform.h
[237/2505] Generating header strings.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/strings.yaml
[238/2505] Copying CXX header __algorithm/ranges_any_of.h
[239/2505] Copying CXX header __algorithm/min_element.h
[240/2505] Copying CXX header __bit/has_single_bit.h
[241/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strcoll.dir/strcoll.cpp.obj
[242/2505] Copying CXX header __algorithm/ranges_equal.h
[243/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strspn.dir/strspn.cpp.obj
[244/2505] Copying CXX header __bit/bit_cast.h
[245/2505] Building CXX object libc/src/compiler/generic/CMakeFiles/libc.src.compiler.generic.__stack_chk_fail.dir/__stack_chk_fail.cpp.obj
[246/2505] Generating header complex.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/complex.yaml
[247/2505] Generating header inttypes.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/inttypes.yaml
[248/2505] Building CXX object libc/src/strings/CMakeFiles/libc.src.strings.strcasecmp.dir/strcasecmp.cpp.obj
[249/2505] Generating header locale.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/locale.yaml
[250/2505] Building CXX object libc/src/strings/CMakeFiles/libc.src.strings.strncasecmp.dir/strncasecmp.cpp.obj
[251/2505] Generating header uchar.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/uchar.yaml
[252/2505] Generating header fenv.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/fenv.yaml
[253/2505] Generating header setjmp.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/setjmp.yaml
[254/2505] Generating header wchar.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/wchar.yaml
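
The two diagnostics in the log both point at `uintptr_t` being undeclared at line 20 of `memalignment.cpp`. The failing file is under `libc/`, not under the AMDGPU paths this PR touches. A common cause of this error is that no header declaring the type is visible in the translation unit. The sketch below is a hosted, standard C++ illustration of that point only. The helper name `pointerAlignment` is hypothetical, the body is an assumption about what an alignment query might compute, and it is not the llvm-libc source or its actual fix (the bot's build is freestanding with `-nostdlibinc`, so the header used there may differ).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint> // declares std::uintptr_t; without it the type name is unknown

// Hypothetical helper (not the llvm-libc implementation): returns the largest
// power of two that divides the address of p, or 0 for a null pointer.
std::size_t pointerAlignment(const void *p) {
  if (p == nullptr)
    return 0;
  std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
  // addr & (~addr + 1) isolates the lowest set bit of the address.
  return static_cast<std::size_t>(addr & (~addr + 1));
}
```

Removing the `<cstdint>` include from this sketch would typically produce an undeclared-identifier error for `uintptr_t`, similar in kind to the one the bot printed.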

…ement and fptrunc users (llvm#141758)

Now we only support D16 folding for `image sample` instructions with a single user: a `fptrunc` to half. However, we can actually support D16 folding for image.sample instructions with multiple users, as long as each user follows the pattern of extractelement followed by fptrunc to half. For example:

```
  %sample = call <4 x float> @llvm.amdgcn.image.sample
  %e0 = extractelement <4 x float> %sample, i32 0
  %h0 = fptrunc float %e0 to half
  %e1 = extractelement <4 x float> %sample, i32 1
  %h1 = fptrunc float %e1 to half
  %e2 = extractelement <4 x float> %sample, i32 2
  %h2 = fptrunc float %e2 to half
```

This change enables D16 folding for such cases and avoids generating `v_cvt_f16_f32_e32` instructions.
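
The precondition described in the commit message (every user of the sample is an extractelement that feeds an fptrunc to half) can be sketched without any LLVM headers. Everything below is a toy model: `Node`, `Op`, and `matchD16Users` are hypothetical names, and the requirement that each extract has exactly one user is an assumption made for the sketch, not something the commit message states. It is not the code in `AMDGPUInstCombineIntrinsic.cpp`.

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// Toy stand-ins for IR instructions; these are NOT the LLVM classes.
enum class Op { Sample, ExtractElement, FPTruncToHalf, Other };

struct Node {
  Op op;
  std::vector<const Node *> users; // instructions that consume this value
};

using ExtractTruncPair = std::pair<const Node *, const Node *>;

// Returns one (extract, trunc) pair per user of `sample` when every user is an
// extractelement whose single user (assumption) is an fptrunc to half.
// Returns std::nullopt as soon as any user breaks the pattern.
std::optional<std::vector<ExtractTruncPair>> matchD16Users(const Node &sample) {
  if (sample.users.empty())
    return std::nullopt;
  std::vector<ExtractTruncPair> pairs;
  for (const Node *ext : sample.users) {
    if (ext->op != Op::ExtractElement || ext->users.size() != 1)
      return std::nullopt;
    const Node *tr = ext->users.front();
    if (tr->op != Op::FPTruncToHalf)
      return std::nullopt;
    pairs.emplace_back(ext, tr);
  }
  return pairs;
}
```

The all-or-nothing result mirrors the idea that the sample can only switch to a half-typed result if no remaining user still needs the float vector.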

harrisonGPU requested review from jayfoad, arsenm, perlfu, nhaehnle, shiltian and ruiling May 28, 2025 13:15

llvmbot added backend:AMDGPU llvm:instcombine llvm:transforms labels May 28, 2025

harrisonGPU mentioned this pull request May 28, 2025

[NFC][AMDGPU] Add D16 test for multiple fptrunc image sample #141771

Closed

jayfoad reviewed May 28, 2025

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (outdated)

arsenm reviewed May 28, 2025

jayfoad reviewed May 28, 2025

shiltian reviewed May 28, 2025

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (outdated)

shiltian reviewed May 28, 2025

harrisonGPU force-pushed the amdgpu/imageCombine branch from cc66281 to 7b416b7 Compare May 28, 2025 17:06

harrisonGPU requested review from jayfoad, arsenm and shiltian May 29, 2025 04:50

harrisonGPU force-pushed the amdgpu/imageCombine branch from cce389a to c31843d Compare May 29, 2025 04:59

harrisonGPU force-pushed the amdgpu/imageCombine branch from c31843d to dfed9ec Compare June 8, 2025 08:10

shiltian reviewed Jun 8, 2025

harrisonGPU force-pushed the amdgpu/imageCombine branch from dfed9ec to 887688e Compare June 16, 2025 06:33

jayfoad approved these changes Jun 16, 2025

harrisonGPU added 4 commits June 17, 2025 16:41

[AMDGPU] Support D16 folding for image.sample with multiple extractel…

7ac4ddb

…ement and fptrunc users

[AMDGPU] Update for comments.

01edbef

[AMDGPU] Add two negative tests.

93b8337

[AMDGPU] Update lit test.

69e3ebe

harrisonGPU added 3 commits June 17, 2025 16:41

[AMDGPU] Update for comments.

b1cccd9

[AMDGPU] use std pair.

e62c1fc

[AMDGPU] Remove return nullopt

86a4102

harrisonGPU force-pushed the amdgpu/imageCombine branch from d26b66f to 86a4102 Compare June 17, 2025 10:27

harrisonGPU merged commit 0defde8 into llvm:main Jun 18, 2025
7 checks passed

harrisonGPU deleted the amdgpu/imageCombine branch June 18, 2025 01:00

for (auto [lane, Ext] : enumerate(Extracts)) {
  FPTruncInst *Tr = Truncs[lane];

for (auto [Ext, Tr] : zip(Extracts, Truncs)) {

SmallVector<ExtractElementInst *, 4> Extracts;
SmallVector<FPTruncInst *, 4> Truncs;
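
The fragments above show two ways of walking `Extracts` and `Truncs` together: indexing `Truncs` with the position from `enumerate`, and iterating both in lockstep with `zip`. A later commit in the timeline is titled "use std pair", which suggests the two parallel containers may have been merged into one container of pairs, though the final code is not shown here. The following standard C++ sketch illustrates that idea only. `zipToPairs` is a hypothetical helper written for this example; it is not `llvm::zip` and not code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: pairs element i of `a` with element i of `b`, so later
// loops can iterate one container instead of indexing a second one.
template <typename A, typename B>
std::vector<std::pair<A, B>> zipToPairs(const std::vector<A> &a,
                                        const std::vector<B> &b) {
  assert(a.size() == b.size() && "parallel vectors must have equal length");
  std::vector<std::pair<A, B>> out;
  out.reserve(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    out.emplace_back(a[i], b[i]);
  return out;
}
```

Keeping each pair together removes the implicit invariant that two separate vectors always have matching lengths and order.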
