Skip to content

[AMDGPU] Support D16 folding for image.sample with multiple extractelement and fptrunc users #141758

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 18, 2025

Conversation

harrisonGPU
Copy link
Contributor

@harrisonGPU harrisonGPU commented May 28, 2025

Now we only support D16 folding for image sample instructions with a single user: a fptrunc to half.
However, we can actually support D16 folding for image.sample instructions with multiple users,
as long as each user follows the pattern of extractelement followed by fptrunc to half.
For example:

  %sample = call <4 x float> @llvm.amdgcn.image.sample
  %e0 = extractelement <4 x float> %sample, i32 0
  %h0 = fptrunc float %e0 to half
  %e1 = extractelement <4 x float> %sample, i32 1
  %h1 = fptrunc float %e1 to half
  %e2 = extractelement <4 x float> %sample, i32 2
  %h2 = fptrunc float %e2 to half

This change enables D16 folding for such cases and avoids generating v_cvt_f16_f32_e32 instructions.

@llvmbot
Copy link
Member

llvmbot commented May 28, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Harrison Hao (harrisonGPU)

Changes

Now we only support D16 folding for image sample instructions with a single user: a fptrunc to half.
However, we can actually support D16 folding for image.sample instructions with multiple users, as long as each user follows the pattern of extractelement followed by fptrunc to half.
For example:

  %sample = call &lt;4 x float&gt; @<!-- -->llvm.amdgcn.image.sample
  %e0 = extractelement &lt;4 x float&gt; %sample, i32 0
  %h0 = fptrunc float %e0 to half
  %e1 = extractelement &lt;4 x float&gt; %sample, i32 1
  %h1 = fptrunc float %e1 to half
  %e2 = extractelement &lt;4 x float&gt; %sample, i32 2
  %h2 = fptrunc float %e2 to half

This change enables D16 folding for such cases and avoids generating v_cvt_f16_f32_e32 instructions.


Patch is 37.95 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/141758.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp (+62)
  • (modified) llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll (+415-2)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
index 5f6ab24182d5e..2ff1501e2b977 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
@@ -269,6 +269,68 @@ simplifyAMDGCNImageIntrinsic(const GCNSubtarget *ST,
                                        ArgTys[0] = User->getType();
                                      });
         }
+      } else {
+        // Only perform D16 folding if every user of the image sample is
+        // an ExtractElementInst immediately followed by an FPTrunc to half.
+        SmallVector<ExtractElementInst *, 4> Extracts;
+        SmallVector<FPTruncInst *, 4> Truncs;
+        bool AllHalfExtracts = true;
+
+        for (User *U : II.users()) {
+          auto *Ext = dyn_cast<ExtractElementInst>(U);
+          if (!Ext || !Ext->hasOneUse()) {
+            AllHalfExtracts = false;
+            break;
+          }
+          auto *Tr = dyn_cast<FPTruncInst>(*Ext->user_begin());
+          if (!Tr || !Tr->getType()->getScalarType()->isHalfTy()) {
+            AllHalfExtracts = false;
+            break;
+          }
+          Extracts.push_back(Ext);
+          Truncs.push_back(Tr);
+        }
+
+        if (AllHalfExtracts && !Extracts.empty()) {
+          auto *VecTy = cast<VectorType>(II.getType());
+          unsigned NElts = VecTy->getElementCount().getKnownMinValue();
+          Type *HalfVecTy =
+              VectorType::get(Type::getHalfTy(II.getContext()), NElts, false);
+
+          // Obtain the original image sample intrinsic's signature
+          // and replace its return type with the half-vector for D16 folding
+          SmallVector<Type *, 8> SigTys;
+          if (!Intrinsic::getIntrinsicSignature(II.getCalledFunction(), SigTys))
+            return nullptr;
+          SigTys[0] = HalfVecTy;
+
+          Module *M = II.getModule();
+          Function *HalfDecl =
+              Intrinsic::getOrInsertDeclaration(M, ImageDimIntr->Intr, SigTys);
+
+          II.mutateType(HalfVecTy);
+          II.setCalledFunction(HalfDecl);
+
+          IRBuilder<> Builder(&II);
+          for (auto [lane, Ext] : enumerate(Extracts)) {
+            FPTruncInst *Tr = Truncs[lane];
+            Value *Idx = Ext->getIndexOperand();
+
+            Builder.SetInsertPoint(Tr);
+
+            Value *HalfExtract = Builder.CreateExtractElement(&II, Idx);
+            HalfExtract->takeName(Tr);
+
+            Tr->replaceAllUsesWith(HalfExtract);
+          }
+
+          for (auto *T : Truncs)
+            IC.eraseInstFromFunction(*T);
+          for (auto *E : Extracts)
+            IC.eraseInstFromFunction(*E);
+
+          return &II;
+        }
       }
     }
   }
diff --git a/llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll b/llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll
index 30431ad724843..d39cceb9b549a 100644
--- a/llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll
+++ b/llvm/test/Transforms/InstCombine/AMDGPU/image-d16.ll
@@ -1,8 +1,9 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx700 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX7 %s
 ; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx810 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX81PLUS %s
-; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx900 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX81PLUS %s
-; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx1010 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX81PLUS %s
+; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx900 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX9 %s
+; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx1010 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX10PLUS %s
+; RUN: opt -mtriple=amdgcn--amdpal -mcpu=gfx1100 -S -passes=instcombine %s | FileCheck --check-prefixes=GFX11 %s
 
 define amdgpu_ps half @image_sample_2d_fptrunc_to_d16(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, float %s, float %t) {
 ; GFX7-LABEL: @image_sample_2d_fptrunc_to_d16(
@@ -16,6 +17,21 @@ define amdgpu_ps half @image_sample_2d_fptrunc_to_d16(<8 x i32> inreg %rsrc, <4
 ; GFX81PLUS-NEXT:    [[TEX:%.*]] = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
 ; GFX81PLUS-NEXT:    ret half [[TEX]]
 ;
+; GFX9-LABEL: @image_sample_2d_fptrunc_to_d16(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    ret half [[TEX]]
+;
+; GFX10PLUS-LABEL: @image_sample_2d_fptrunc_to_d16(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    ret half [[TEX]]
+;
+; GFX11-LABEL: @image_sample_2d_fptrunc_to_d16(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[TEX:%.*]] = call half @llvm.amdgcn.image.sample.lz.2d.f16.f32.v8i32.v4i32(i32 1, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    ret half [[TEX]]
+;
 main_body:
   %tex = call float @llvm.amdgcn.image.sample.lz.2d.f32.f32.v8i32.v4i32(i32 1, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
   %tex_half = fptrunc float %tex to half
@@ -40,6 +56,30 @@ define amdgpu_ps half @image_sample_2d_v2f32(<8 x i32> inreg %rsrc, <4 x i32> in
 ; GFX81PLUS-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
 ; GFX81PLUS-NEXT:    ret half [[ADDF_SUM_0]]
 ;
+; GFX9-LABEL: @image_sample_2d_v2f32(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call <2 x half> @llvm.amdgcn.image.sample.lz.2d.v2f16.f32.v8i32.v4i32(i32 3, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <2 x half> [[TEX]], i64 0
+; GFX9-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <2 x half> [[TEX]], i64 1
+; GFX9-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX9-NEXT:    ret half [[ADDF_SUM_0]]
+;
+; GFX10PLUS-LABEL: @image_sample_2d_v2f32(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call <2 x half> @llvm.amdgcn.image.sample.lz.2d.v2f16.f32.v8i32.v4i32(i32 3, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <2 x half> [[TEX]], i64 0
+; GFX10PLUS-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <2 x half> [[TEX]], i64 1
+; GFX10PLUS-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX10PLUS-NEXT:    ret half [[ADDF_SUM_0]]
+;
+; GFX11-LABEL: @image_sample_2d_v2f32(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[TEX:%.*]] = call <2 x half> @llvm.amdgcn.image.sample.lz.2d.v2f16.f32.v8i32.v4i32(i32 3, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <2 x half> [[TEX]], i64 0
+; GFX11-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <2 x half> [[TEX]], i64 1
+; GFX11-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX11-NEXT:    ret half [[ADDF_SUM_0]]
+;
 main_body:
   %tex = call <2 x float> @llvm.amdgcn.image.sample.lz.2d.v2f32.f32.v8i32.v4i32(i32 3, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
   %tex_2_half = fptrunc <2 x float> %tex to <2 x half>
@@ -71,6 +111,36 @@ define amdgpu_ps half @image_sample_2d_v3f32(<8 x i32> inreg %rsrc, <4 x i32> in
 ; GFX81PLUS-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[ADDF_SUM_0]], [[TEX_HALF_2]]
 ; GFX81PLUS-NEXT:    ret half [[ADDF_SUM_1]]
 ;
+; GFX9-LABEL: @image_sample_2d_v3f32(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <3 x half> [[TEX]], i64 0
+; GFX9-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <3 x half> [[TEX]], i64 1
+; GFX9-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <3 x half> [[TEX]], i64 2
+; GFX9-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX9-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[ADDF_SUM_0]], [[TEX_HALF_2]]
+; GFX9-NEXT:    ret half [[ADDF_SUM_1]]
+;
+; GFX10PLUS-LABEL: @image_sample_2d_v3f32(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <3 x half> [[TEX]], i64 0
+; GFX10PLUS-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <3 x half> [[TEX]], i64 1
+; GFX10PLUS-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <3 x half> [[TEX]], i64 2
+; GFX10PLUS-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX10PLUS-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[ADDF_SUM_0]], [[TEX_HALF_2]]
+; GFX10PLUS-NEXT:    ret half [[ADDF_SUM_1]]
+;
+; GFX11-LABEL: @image_sample_2d_v3f32(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[TEX:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <3 x half> [[TEX]], i64 0
+; GFX11-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <3 x half> [[TEX]], i64 1
+; GFX11-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <3 x half> [[TEX]], i64 2
+; GFX11-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX11-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[ADDF_SUM_0]], [[TEX_HALF_2]]
+; GFX11-NEXT:    ret half [[ADDF_SUM_1]]
+;
 main_body:
   %tex = call <3 x float> @llvm.amdgcn.image.sample.lz.2d.v3f32.f32.v8i32.v4i32(i32 7, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
   %tex_3_half = fptrunc <3 x float> %tex to <3 x half>
@@ -108,6 +178,42 @@ define amdgpu_ps half @image_sample_2d_v4f32(<8 x i32> inreg %rsrc, <4 x i32> in
 ; GFX81PLUS-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
 ; GFX81PLUS-NEXT:    ret half [[ADDF_SUM_2]]
 ;
+; GFX9-LABEL: @image_sample_2d_v4f32(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.sample.lz.2d.v4f16.f32.v8i32.v4i32(i32 15, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX9-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX9-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX9-NEXT:    [[TEX_HALF_3:%.*]] = extractelement <4 x half> [[TEX]], i64 3
+; GFX9-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX9-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[TEX_HALF_2]], [[TEX_HALF_3]]
+; GFX9-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
+; GFX9-NEXT:    ret half [[ADDF_SUM_2]]
+;
+; GFX10PLUS-LABEL: @image_sample_2d_v4f32(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.sample.lz.2d.v4f16.f32.v8i32.v4i32(i32 15, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX10PLUS-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX10PLUS-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX10PLUS-NEXT:    [[TEX_HALF_3:%.*]] = extractelement <4 x half> [[TEX]], i64 3
+; GFX10PLUS-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX10PLUS-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[TEX_HALF_2]], [[TEX_HALF_3]]
+; GFX10PLUS-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
+; GFX10PLUS-NEXT:    ret half [[ADDF_SUM_2]]
+;
+; GFX11-LABEL: @image_sample_2d_v4f32(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.sample.lz.2d.v4f16.f32.v8i32.v4i32(i32 15, float [[S:%.*]], float [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX11-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX11-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX11-NEXT:    [[TEX_HALF_3:%.*]] = extractelement <4 x half> [[TEX]], i64 3
+; GFX11-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX11-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[TEX_HALF_2]], [[TEX_HALF_3]]
+; GFX11-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
+; GFX11-NEXT:    ret half [[ADDF_SUM_2]]
+;
 main_body:
   %tex = call <4 x float> @llvm.amdgcn.image.sample.lz.2d.v4f32.f32.v8i32.v4i32(i32 15, float %s, float %t, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
   %tex_4_half = fptrunc <4 x float> %tex to <4 x half>
@@ -121,6 +227,79 @@ main_body:
   ret half %addf_sum.2
 }
 
+define void @image_sample_2d_multi_fptrunc_to_d16(<8 x i32> %surf_desc, <4 x i32> %samp, float %u, float %v, ptr addrspace(7) %out) {
+; GFX7-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX7-NEXT:  main_body:
+; GFX7-NEXT:    [[SAMPLE:%.*]] = call <3 x float> @llvm.amdgcn.image.sample.lz.2d.v3f32.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX7-NEXT:    [[E0:%.*]] = extractelement <3 x float> [[SAMPLE]], i64 0
+; GFX7-NEXT:    [[H0:%.*]] = fptrunc float [[E0]] to half
+; GFX7-NEXT:    [[E1:%.*]] = extractelement <3 x float> [[SAMPLE]], i64 1
+; GFX7-NEXT:    [[H1:%.*]] = fptrunc float [[E1]] to half
+; GFX7-NEXT:    [[E2:%.*]] = extractelement <3 x float> [[SAMPLE]], i64 2
+; GFX7-NEXT:    [[H2:%.*]] = fptrunc float [[E2]] to half
+; GFX7-NEXT:    [[MUL:%.*]] = fmul half [[H0]], [[H1]]
+; GFX7-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[H2]]
+; GFX7-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX7-NEXT:    ret void
+;
+; GFX81PLUS-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX81PLUS-NEXT:  main_body:
+; GFX81PLUS-NEXT:    [[SAMPLE:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX81PLUS-NEXT:    [[H0:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 0
+; GFX81PLUS-NEXT:    [[H1:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 1
+; GFX81PLUS-NEXT:    [[H2:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 2
+; GFX81PLUS-NEXT:    [[MUL:%.*]] = fmul half [[H0]], [[H1]]
+; GFX81PLUS-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[H2]]
+; GFX81PLUS-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX81PLUS-NEXT:    ret void
+;
+; GFX9-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[SAMPLE:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[HALF_EXT2:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 0
+; GFX9-NEXT:    [[HALF_EXT1:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 1
+; GFX9-NEXT:    [[HALF_EXT:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 2
+; GFX9-NEXT:    [[MUL:%.*]] = fmul half [[HALF_EXT2]], [[HALF_EXT1]]
+; GFX9-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[HALF_EXT]]
+; GFX9-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX9-NEXT:    ret void
+;
+; GFX10PLUS-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[SAMPLE:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[HALF_EXT2:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 0
+; GFX10PLUS-NEXT:    [[HALF_EXT1:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 1
+; GFX10PLUS-NEXT:    [[HALF_EXT:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 2
+; GFX10PLUS-NEXT:    [[MUL:%.*]] = fmul half [[HALF_EXT2]], [[HALF_EXT1]]
+; GFX10PLUS-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[HALF_EXT]]
+; GFX10PLUS-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX10PLUS-NEXT:    ret void
+;
+; GFX11-LABEL: @image_sample_2d_multi_fptrunc_to_d16(
+; GFX11-NEXT:  main_body:
+; GFX11-NEXT:    [[SAMPLE:%.*]] = call <3 x half> @llvm.amdgcn.image.sample.lz.2d.v3f16.f32.v8i32.v4i32(i32 7, float [[U:%.*]], float [[V:%.*]], <8 x i32> [[SURF_DESC:%.*]], <4 x i32> [[SAMPLER_DESC:%.*]], i1 false, i32 0, i32 0)
+; GFX11-NEXT:    [[HALF_EXT2:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 0
+; GFX11-NEXT:    [[HALF_EXT1:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 1
+; GFX11-NEXT:    [[HALF_EXT:%.*]] = extractelement <3 x half> [[SAMPLE]], i64 2
+; GFX11-NEXT:    [[MUL:%.*]] = fmul half [[HALF_EXT2]], [[HALF_EXT1]]
+; GFX11-NEXT:    [[RES:%.*]] = fadd half [[MUL]], [[HALF_EXT]]
+; GFX11-NEXT:    store half [[RES]], ptr addrspace(7) [[OUT:%.*]], align 2
+; GFX11-NEXT:    ret void
+;
+main_body:
+  %sample = call <4 x float> @llvm.amdgcn.image.sample.lz.2d.v4f32.f32.v8i32.v4i32(i32 15, float %u, float %v, <8 x i32> %surf_desc, <4 x i32> %samp, i1 false, i32 0, i32 0)
+  %e0 = extractelement <4 x float> %sample, i32 0
+  %h0 = fptrunc float %e0 to half
+  %e1 = extractelement <4 x float> %sample, i32 1
+  %h1 = fptrunc float %e1 to half
+  %e2 = extractelement <4 x float> %sample, i32 2
+  %h2 = fptrunc float %e2 to half
+  %mul = fmul half %h0, %h1
+  %res = fadd half %mul, %h2
+  store half %res, ptr addrspace(7) %out, align 2
+  ret void
+}
+
 define amdgpu_ps half @image_gather4_2d_v4f32(<8 x i32> inreg %rsrc, <4 x i32> inreg %samp, half %s, half %t) {
 ; GFX7-LABEL: @image_gather4_2d_v4f32(
 ; GFX7-NEXT:  main_body:
@@ -147,6 +326,42 @@ define amdgpu_ps half @image_gather4_2d_v4f32(<8 x i32> inreg %rsrc, <4 x i32> i
 ; GFX81PLUS-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
 ; GFX81PLUS-NEXT:    ret half [[ADDF_SUM_2]]
 ;
+; GFX9-LABEL: @image_gather4_2d_v4f32(
+; GFX9-NEXT:  main_body:
+; GFX9-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.gather4.2d.v4f16.f16.v8i32.v4i32(i32 1, half [[S:%.*]], half [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX9-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX9-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX9-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX9-NEXT:    [[TEX_HALF_3:%.*]] = extractelement <4 x half> [[TEX]], i64 3
+; GFX9-NEXT:    [[ADDF_SUM_0:%.*]] = fadd half [[TEX_HALF_0]], [[TEX_HALF_1]]
+; GFX9-NEXT:    [[ADDF_SUM_1:%.*]] = fadd half [[TEX_HALF_2]], [[TEX_HALF_3]]
+; GFX9-NEXT:    [[ADDF_SUM_2:%.*]] = fadd half [[ADDF_SUM_0]], [[ADDF_SUM_1]]
+; GFX9-NEXT:    ret half [[ADDF_SUM_2]]
+;
+; GFX10PLUS-LABEL: @image_gather4_2d_v4f32(
+; GFX10PLUS-NEXT:  main_body:
+; GFX10PLUS-NEXT:    [[TEX:%.*]] = call <4 x half> @llvm.amdgcn.image.gather4.2d.v4f16.f16.v8i32.v4i32(i32 1, half [[S:%.*]], half [[T:%.*]], <8 x i32> [[RSRC:%.*]], <4 x i32> [[SAMP:%.*]], i1 false, i32 0, i32 0)
+; GFX10PLUS-NEXT:    [[TEX_HALF_0:%.*]] = extractelement <4 x half> [[TEX]], i64 0
+; GFX10PLUS-NEXT:    [[TEX_HALF_1:%.*]] = extractelement <4 x half> [[TEX]], i64 1
+; GFX10PLUS-NEXT:    [[TEX_HALF_2:%.*]] = extractelement <4 x half> [[TEX]], i64 2
+; GFX10PLUS-NEXT:...
[truncated]

Comment on lines 315 to 316
for (auto [lane, Ext] : enumerate(Extracts)) {
FPTruncInst *Tr = Truncs[lane];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (auto [lane, Ext] : enumerate(Extracts)) {
FPTruncInst *Tr = Truncs[lane];
for (auto [Ext, Tr] : zip(Extracts, Truncs)) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or put them into a vector of std::pair in the first place.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have used :

for (auto [Ext, Tr] : zip(Extracts, Truncs))


for (User *U : II.users()) {
auto *Ext = dyn_cast<ExtractElementInst>(U);
if (!Ext || !Ext->hasOneUse()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing negative test for the !hasOneUse?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I added a test called image_sample_2d_extractelement_multi_use_no_d16.

break;
}
auto *Tr = dyn_cast<FPTruncInst>(*Ext->user_begin());
if (!Tr || !Tr->getType()->getScalarType()->isHalfTy()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing negative test for fptrunc to a different type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I added a test called image_sample_2d_multi_fptrunc_non_half_no_d16.

@@ -269,6 +269,68 @@ simplifyAMDGCNImageIntrinsic(const GCNSubtarget *ST,
ArgTys[0] = User->getType();
});
}
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make a bit more sense to put all of this inside:

  if (II.getType().isVectorTy()) {
    new code here
  } else {
    old code here
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this makes sense, because if we encounter an image sample with a vector type and its only user is a fptrunc from a float vector to a half vector, we should not fall back to the "old code" path. For example:

  %38 = call reassoc nnan nsz arcp contract afn <4 x float> @llvm.amdgcn.image.sample.l.2d.v4f32.f32.v8i32.v4i32(i32 15, float %36, float %37, float 0.000000e+00, <8 x i32> %34, <4 x i32> %35, i1 false, i32 0, i32 0)
  %39 = fptrunc <4 x float> %38 to <4 x half>

In this case, the image sample result is a vector and directly truncated to a half vector, which doesn't fit either the new extract‑element + fptrunc pattern or the old code path that expects scalar values. So simply checking isVectorTy() is not sufficient to distinguish the cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the old code path does handle that case.

Here's a new suggestion: just remove the "else" on line 272. So even if II has one use, if the old code path fails, we should still try the new code path. This will handle the case where the only use of II is a single extractelement instruction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have updated it.

II.mutateType(HalfVecTy);
II.setCalledFunction(HalfDecl);

IRBuilder<> Builder(&II);
Copy link
Contributor

@shiltian shiltian May 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to use this c'tor since the insert point is updated right away in the loop?

Copy link
Contributor Author

@harrisonGPU harrisonGPU May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I’ve changed it to IRBuilder<> Builder(II.getContext()); to initialize the builder.

@harrisonGPU harrisonGPU force-pushed the amdgpu/imageCombine branch from cc66281 to 7b416b7 Compare May 28, 2025 17:06
@harrisonGPU harrisonGPU force-pushed the amdgpu/imageCombine branch from cce389a to c31843d Compare May 29, 2025 04:59
@harrisonGPU harrisonGPU force-pushed the amdgpu/imageCombine branch from c31843d to dfed9ec Compare June 8, 2025 08:10
Truncs.push_back(Tr);
}

if (AllHalfExtracts && !Extracts.empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bail out early to avoid giant indentation here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don’t bail out early, because doing so would change the existing control-flow logic: there are additional optimizations that run after the hasD16Images check (e.g. the A16/G16 path).

@harrisonGPU harrisonGPU force-pushed the amdgpu/imageCombine branch from dfed9ec to 887688e Compare June 16, 2025 06:33
Copy link
Contributor

@jayfoad jayfoad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 276 to 277
SmallVector<ExtractElementInst *, 4> Extracts;
SmallVector<FPTruncInst *, 4> Truncs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I still thinks this would be neater as a vector-of-std::pairs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I have updated it.

@harrisonGPU harrisonGPU force-pushed the amdgpu/imageCombine branch from d26b66f to 86a4102 Compare June 17, 2025 10:27
@harrisonGPU harrisonGPU merged commit 0defde8 into llvm:main Jun 18, 2025
7 checks passed
@harrisonGPU harrisonGPU deleted the amdgpu/imageCombine branch June 18, 2025 01:00
@llvm-ci
Copy link
Collaborator

llvm-ci commented Jun 18, 2025

LLVM Buildbot has detected a new failure on builder fuchsia-x86_64-linux running on fuchsia-debian-64-us-central1-a-1 while building llvm at step 4 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/11/builds/17534

Here is the relevant piece of the build log for the reference
Step 4 (annotate) failure: 'python ../llvm-zorg/zorg/buildbot/builders/annotated/fuchsia-linux.py ...' (failure)
...
[215/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.a64l.dir/a64l.cpp.obj
[216/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.labs.dir/labs.cpp.obj
[217/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strcmp.dir/strcmp.cpp.obj
[218/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncpy.dir/strncpy.cpp.obj
[219/2505] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.fini.dir/fini.cpp.obj
[220/2505] Generating header features.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/features.yaml
[221/2505] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.putchar.dir/putchar.cpp.obj
[222/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.llabs.dir/llabs.cpp.obj
[223/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.srand.dir/srand.cpp.obj
[224/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj
FAILED: libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj 
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/./bin/clang++ --target=armv8.1m.main-none-eabi -DLIBC_NAMESPACE=__llvm_libc_21_0_0_git -I/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc -isystem /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/include/armv8.1m.main-unknown-none-eabi --target=armv8.1m.main-none-eabi -Wno-atomic-alignment "-Dvfprintf(stream, format, vlist)=vprintf(format, vlist)" "-Dfprintf(stream, format, ...)=printf(format)" -D_LIBCPP_PRINT=1 -mthumb -mfloat-abi=hard -march=armv8.1-m.main+mve.fp+fp.dp -mcpu=cortex-m55 -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -ffunction-sections -fdata-sections -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/runtimes/runtimes-armv8.1m.main-none-eabi-bins=../../../../llvm-project -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/= -no-canonical-prefixes -Os -DNDEBUG --target=armv8.1m.main-none-eabi -DLIBC_QSORT_IMPL=LIBC_QSORT_HEAP_SORT -DLIBC_TYPES_TIME_T_IS_32_BIT -DLIBC_ADD_NULL_CHECKS "-DLIBC_MATH=(LIBC_MATH_SKIP_ACCURATE_PASS | LIBC_MATH_SMALL_TABLES)" -fpie -ffreestanding -DLIBC_FULL_BUILD -nostdlibinc -ffixed-point -fno-builtin -fno-exceptions -fno-lax-vector-conversions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -Wall -Wextra -Werror -Wconversion -Wno-sign-conversion -Wdeprecated -Wno-c99-extensions -Wno-gnu-imaginary-constant -Wno-pedantic -Wimplicit-fallthrough -Wwrite-strings -Wextra-semi -Wnewline-eof -Wnonportable-system-include-path -Wstrict-prototypes -Wthread-safety -Wglobal-constructors -DLIBC_COPT_PUBLIC_PACKAGING -MD -MT libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj -MF libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj.d -o libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj -c /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp:20:3: error: unknown type name 'uintptr_t'
   20 |   uintptr_t addr = reinterpret_cast<uintptr_t>(p);
      |   ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp:20:37: error: unknown type name 'uintptr_t'
   20 |   uintptr_t addr = reinterpret_cast<uintptr_t>(p);
      |                                     ^
2 errors generated.
[225/2505] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj
[226/2505] Copying CXX header __atomic/fence.h
[227/2505] Copying CXX header __algorithm/min_max_result.h
[228/2505] Copying CXX header __atomic/support.h
[229/2505] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.inv_trigf_utils.dir/inv_trigf_utils.cpp.obj
[230/2505] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.puts.dir/puts.cpp.obj
[231/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncmp.dir/strncmp.cpp.obj
[232/2505] Copying CXX header __assert
[233/2505] Copying CXX header __algorithm/move_backward.h
[234/2505] Generating header errno.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/errno.yaml
[235/2505] Copying CXX header __algorithm/move.h
[236/2505] Copying CXX header __algorithm/transform.h
[237/2505] Generating header strings.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/strings.yaml
[238/2505] Copying CXX header __algorithm/ranges_any_of.h
[239/2505] Copying CXX header __algorithm/min_element.h
[240/2505] Copying CXX header __bit/has_single_bit.h
[241/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strcoll.dir/strcoll.cpp.obj
[242/2505] Copying CXX header __algorithm/ranges_equal.h
[243/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strspn.dir/strspn.cpp.obj
[244/2505] Copying CXX header __bit/bit_cast.h
[245/2505] Building CXX object libc/src/compiler/generic/CMakeFiles/libc.src.compiler.generic.__stack_chk_fail.dir/__stack_chk_fail.cpp.obj
[246/2505] Generating header complex.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/complex.yaml
[247/2505] Generating header inttypes.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/inttypes.yaml
[248/2505] Building CXX object libc/src/strings/CMakeFiles/libc.src.strings.strcasecmp.dir/strcasecmp.cpp.obj
[249/2505] Generating header locale.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/locale.yaml
[250/2505] Building CXX object libc/src/strings/CMakeFiles/libc.src.strings.strncasecmp.dir/strncasecmp.cpp.obj
[251/2505] Generating header uchar.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/uchar.yaml
[252/2505] Generating header fenv.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/fenv.yaml
[253/2505] Generating header setjmp.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/setjmp.yaml
[254/2505] Generating header wchar.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/wchar.yaml
Step 6 (build) failure: build (failure)
...
[215/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.a64l.dir/a64l.cpp.obj
[216/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.labs.dir/labs.cpp.obj
[217/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strcmp.dir/strcmp.cpp.obj
[218/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncpy.dir/strncpy.cpp.obj
[219/2505] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.fini.dir/fini.cpp.obj
[220/2505] Generating header features.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/features.yaml
[221/2505] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.putchar.dir/putchar.cpp.obj
[222/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.llabs.dir/llabs.cpp.obj
[223/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.srand.dir/srand.cpp.obj
[224/2505] Building CXX object libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj
FAILED: libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj 
/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/./bin/clang++ --target=armv8.1m.main-none-eabi -DLIBC_NAMESPACE=__llvm_libc_21_0_0_git -I/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc -isystem /var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/include/armv8.1m.main-unknown-none-eabi --target=armv8.1m.main-none-eabi -Wno-atomic-alignment "-Dvfprintf(stream, format, vlist)=vprintf(format, vlist)" "-Dfprintf(stream, format, ...)=printf(format)" -D_LIBCPP_PRINT=1 -mthumb -mfloat-abi=hard -march=armv8.1-m.main+mve.fp+fp.dp -mcpu=cortex-m55 -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -ffunction-sections -fdata-sections -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/build/llvm-build-9cluh2_w/runtimes/runtimes-armv8.1m.main-none-eabi-bins=../../../../llvm-project -ffile-prefix-map=/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/= -no-canonical-prefixes -Os -DNDEBUG --target=armv8.1m.main-none-eabi -DLIBC_QSORT_IMPL=LIBC_QSORT_HEAP_SORT -DLIBC_TYPES_TIME_T_IS_32_BIT -DLIBC_ADD_NULL_CHECKS "-DLIBC_MATH=(LIBC_MATH_SKIP_ACCURATE_PASS | LIBC_MATH_SMALL_TABLES)" -fpie -ffreestanding -DLIBC_FULL_BUILD -nostdlibinc -ffixed-point -fno-builtin -fno-exceptions -fno-lax-vector-conversions -fno-unwind-tables -fno-asynchronous-unwind-tables -fno-rtti -ftrivial-auto-var-init=pattern -fno-omit-frame-pointer -Wall -Wextra -Werror -Wconversion -Wno-sign-conversion -Wdeprecated -Wno-c99-extensions -Wno-gnu-imaginary-constant -Wno-pedantic -Wimplicit-fallthrough -Wwrite-strings -Wextra-semi -Wnewline-eof -Wnonportable-system-include-path -Wstrict-prototypes -Wthread-safety -Wglobal-constructors -DLIBC_COPT_PUBLIC_PACKAGING -MD -MT libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj -MF libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj.d -o libc/src/stdlib/CMakeFiles/libc.src.stdlib.memalignment.dir/memalignment.cpp.obj -c /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp:20:3: error: unknown type name 'uintptr_t'
   20 |   uintptr_t addr = reinterpret_cast<uintptr_t>(p);
      |   ^
/var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/libc/src/stdlib/memalignment.cpp:20:37: error: unknown type name 'uintptr_t'
   20 |   uintptr_t addr = reinterpret_cast<uintptr_t>(p);
      |                                     ^
2 errors generated.
[225/2505] Building CXX object libc/startup/baremetal/CMakeFiles/libc.startup.baremetal.init.dir/init.cpp.obj
[226/2505] Copying CXX header __atomic/fence.h
[227/2505] Copying CXX header __algorithm/min_max_result.h
[228/2505] Copying CXX header __atomic/support.h
[229/2505] Building CXX object libc/src/math/generic/CMakeFiles/libc.src.math.generic.inv_trigf_utils.dir/inv_trigf_utils.cpp.obj
[230/2505] Building CXX object libc/src/stdio/baremetal/CMakeFiles/libc.src.stdio.baremetal.puts.dir/puts.cpp.obj
[231/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strncmp.dir/strncmp.cpp.obj
[232/2505] Copying CXX header __assert
[233/2505] Copying CXX header __algorithm/move_backward.h
[234/2505] Generating header errno.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/errno.yaml
[235/2505] Copying CXX header __algorithm/move.h
[236/2505] Copying CXX header __algorithm/transform.h
[237/2505] Generating header strings.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/strings.yaml
[238/2505] Copying CXX header __algorithm/ranges_any_of.h
[239/2505] Copying CXX header __algorithm/min_element.h
[240/2505] Copying CXX header __bit/has_single_bit.h
[241/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strcoll.dir/strcoll.cpp.obj
[242/2505] Copying CXX header __algorithm/ranges_equal.h
[243/2505] Building CXX object libc/src/string/CMakeFiles/libc.src.string.strspn.dir/strspn.cpp.obj
[244/2505] Copying CXX header __bit/bit_cast.h
[245/2505] Building CXX object libc/src/compiler/generic/CMakeFiles/libc.src.compiler.generic.__stack_chk_fail.dir/__stack_chk_fail.cpp.obj
[246/2505] Generating header complex.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/complex.yaml
[247/2505] Generating header inttypes.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/inttypes.yaml
[248/2505] Building CXX object libc/src/strings/CMakeFiles/libc.src.strings.strcasecmp.dir/strcasecmp.cpp.obj
[249/2505] Generating header locale.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/locale.yaml
[250/2505] Building CXX object libc/src/strings/CMakeFiles/libc.src.strings.strncasecmp.dir/strncasecmp.cpp.obj
[251/2505] Generating header uchar.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/uchar.yaml
[252/2505] Generating header fenv.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/fenv.yaml
[253/2505] Generating header setjmp.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/setjmp.yaml
[254/2505] Generating header wchar.h from /var/lib/buildbot/fuchsia-x86_64-linux/llvm-project/runtimes/../libc/include/wchar.yaml

fschlimb pushed a commit to fschlimb/llvm-project that referenced this pull request Jun 18, 2025
…ement and fptrunc users (llvm#141758)

Now we only support D16 folding for `image sample` instructions with a
single user: a `fptrunc` to half.
However, we can actually support D16 folding for image.sample
instructions with multiple users,
as long as each user follows the pattern of extractelement followed by
fptrunc to half.
For example:
```
  %sample = call <4 x float> @llvm.amdgcn.image.sample
  %e0 = extractelement <4 x float> %sample, i32 0
  %h0 = fptrunc float %e0 to half
  %e1 = extractelement <4 x float> %sample, i32 1
  %h1 = fptrunc float %e1 to half
  %e2 = extractelement <4 x float> %sample, i32 2
  %h2 = fptrunc float %e2 to half
```
This change enables D16 folding for such cases and avoids generating
`v_cvt_f16_f32_e32` instructions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants