Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Clang] Add __builtin_selectvector and use it for AVX512 intrinsics #91306

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

philnik777
Copy link
Contributor

@philnik777 philnik777 commented May 7, 2024

This is also very useful for generic code. For example this would allow libc++ to vectorize {min,max,minmax}_element without having to use platform-specific intrinsics. I've done some testing and even at -O0 Clang compiles the code to the expected instructions for architectures where the mask vector has the same bit count as the element vector (i.e. every SIMD ISA except AVX512 that I'm aware of) as long as the comparison operation is visible.

@philnik777 philnik777 changed the title [Clang] Add __builtin_selectvector [Clang] Add __builtin_selectvector and use it for AVX512 intrinsics May 7, 2024
Copy link

github-actions bot commented May 7, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@philnik777 philnik777 marked this pull request as ready for review May 16, 2024 08:10
@llvmbot llvmbot added clang Clang issues not falling into any other category backend:X86 clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:headers Headers provided by Clang, e.g. for intrinsics clang:codegen labels May 16, 2024
@llvmbot
Copy link
Collaborator

llvmbot commented May 16, 2024

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-backend-x86

@llvm/pr-subscribers-clang-codegen

Author: Nikolas Klauser (philnik777)

Changes

This is also very useful for generic code. For example this would allow libc++ to vectorize {min,max,minmax}_element without having to use platform-specific intrinsics. I've done some testing and even at -O0 Clang compiles the code to the expected instructions for architectures where the mask vector has the same bit count as the element vector (i.e. every SIMD ISA except AVX512 that I'm aware of) as long as the comparison operation is visible.


Patch is 962.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/91306.diff

32 Files Affected:

  • (modified) clang/docs/LanguageExtensions.rst (+20)
  • (modified) clang/include/clang/Basic/Builtins.td (+6)
  • (modified) clang/include/clang/Basic/BuiltinsX86.def (-24)
  • (modified) clang/include/clang/Basic/DiagnosticSemaKinds.td (+3-1)
  • (modified) clang/lib/CodeGen/CGBuiltin.cpp (+6-25)
  • (modified) clang/lib/Headers/avx512bf16intrin.h (+12-12)
  • (modified) clang/lib/Headers/avx512bitalgintrin.h (+6-6)
  • (modified) clang/lib/Headers/avx512bwintrin.h (+386-396)
  • (modified) clang/lib/Headers/avx512cdintrin.h (+24-24)
  • (modified) clang/lib/Headers/avx512dqintrin.h (+134-134)
  • (modified) clang/lib/Headers/avx512fintrin.h (+958-957)
  • (modified) clang/lib/Headers/avx512fp16intrin.h (+90-85)
  • (modified) clang/lib/Headers/avx512ifmaintrin.h (+12-12)
  • (modified) clang/lib/Headers/avx512ifmavlintrin.h (+24-25)
  • (modified) clang/lib/Headers/avx512vbmi2intrin.h (+84-86)
  • (modified) clang/lib/Headers/avx512vbmiintrin.h (+21-22)
  • (modified) clang/lib/Headers/avx512vbmivlintrin.h (+42-43)
  • (modified) clang/lib/Headers/avx512vlbf16intrin.h (+24-24)
  • (modified) clang/lib/Headers/avx512vlbitalgintrin.h (+12-12)
  • (modified) clang/lib/Headers/avx512vlbwintrin.h (+776-788)
  • (modified) clang/lib/Headers/avx512vlcdintrin.h (+48-48)
  • (modified) clang/lib/Headers/avx512vldqintrin.h (+190-190)
  • (modified) clang/lib/Headers/avx512vlfp16intrin.h (+212-223)
  • (modified) clang/lib/Headers/avx512vlintrin.h (+1798-1982)
  • (modified) clang/lib/Headers/avx512vlvbmi2intrin.h (+168-168)
  • (modified) clang/lib/Headers/avx512vlvnniintrin.h (+48-48)
  • (modified) clang/lib/Headers/avx512vnniintrin.h (+24-24)
  • (modified) clang/lib/Headers/avx512vpopcntdqintrin.h (+6-4)
  • (modified) clang/lib/Headers/avx512vpopcntdqvlintrin.h (+12-8)
  • (modified) clang/lib/Headers/gfniintrin.h (+33-34)
  • (modified) clang/lib/Sema/SemaChecking.cpp (+56)
  • (added) clang/test/Sema/builtin-selectvector.c (+18)
diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index 96691b45d63a3..6513676438ffb 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -3019,6 +3019,26 @@ C-style cast applied to each element of the first argument.
 
 Query for this feature with ``__has_builtin(__builtin_convertvector)``.
 
+``__builtin_selectvector``
+--------------------------
+
+``__builtin_selectvector`` is used to express generic vector element selection.
+
+**Signature**:
+
+.. code-block:: c++
+
+  template <class T, size_t N>
+  simd_vec<T, N> __builtin_selectvector(simd_vec<T, N> lhs, simd_vec<T, N> rhs,
+                                        simd_vec<bool, N> cond)
+
+**Description**:
+
+The returned vector is equivalent to
+``simd_vec<T, N>{cond[0] ? rhs[0] : lhs[0], ..., cond[N - 1] ? rhs[N - 1] : lhs[N - 1]}``.
+
+Query for this feature with ``__has_builtin(__builtin_selectvector)``.
+
 ``__builtin_bitreverse``
 ------------------------
 
diff --git a/clang/include/clang/Basic/Builtins.td b/clang/include/clang/Basic/Builtins.td
index d6ceb450bd106..279330d9b5251 100644
--- a/clang/include/clang/Basic/Builtins.td
+++ b/clang/include/clang/Basic/Builtins.td
@@ -1176,6 +1176,12 @@ def ConvertVector : Builtin {
   let Prototype = "void(...)";
 }
 
+def SelectVector : Builtin {
+  let Spellings = ["__builtin_selectvector"];
+  let Attributes = [NoThrow, Const, CustomTypeChecking];
+  let Prototype = "void(...)";
+}
+
 def AllocaUninitialized : Builtin {
   let Spellings = ["__builtin_alloca_uninitialized"];
   let Attributes = [FunctionWithBuiltinPrefix, NoThrow];
diff --git a/clang/include/clang/Basic/BuiltinsX86.def b/clang/include/clang/Basic/BuiltinsX86.def
index eafcc219c1096..2e099b3ab4f05 100644
--- a/clang/include/clang/Basic/BuiltinsX86.def
+++ b/clang/include/clang/Basic/BuiltinsX86.def
@@ -1973,30 +1973,6 @@ TARGET_BUILTIN(__builtin_ia32_vfcmulcph256_mask,  "V8fV8fV8fV8fUc", "ncV:256:",
 TARGET_BUILTIN(__builtin_ia32_vfcmulcph512_mask,  "V16fV16fV16fV16fUsIi", "ncV:512:", "avx512fp16,evex512")
 
 // generic select intrinsics
-TARGET_BUILTIN(__builtin_ia32_selectb_128, "V16cUsV16cV16c", "ncV:128:", "avx512bw,avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectb_256, "V32cUiV32cV32c", "ncV:256:", "avx512bw,avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectb_512, "V64cUOiV64cV64c", "ncV:512:", "avx512bw,evex512")
-TARGET_BUILTIN(__builtin_ia32_selectw_128, "V8sUcV8sV8s", "ncV:128:", "avx512bw,avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectw_256, "V16sUsV16sV16s", "ncV:256:", "avx512bw,avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectw_512, "V32sUiV32sV32s", "ncV:512:", "avx512bw,evex512")
-TARGET_BUILTIN(__builtin_ia32_selectd_128, "V4iUcV4iV4i", "ncV:128:", "avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectd_256, "V8iUcV8iV8i", "ncV:256:", "avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectd_512, "V16iUsV16iV16i", "ncV:512:", "avx512f,evex512")
-TARGET_BUILTIN(__builtin_ia32_selectph_128, "V8xUcV8xV8x", "ncV:128:", "avx512fp16,avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectph_256, "V16xUsV16xV16x", "ncV:256:", "avx512fp16,avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectph_512, "V32xUiV32xV32x", "ncV:512:", "avx512fp16,evex512")
-TARGET_BUILTIN(__builtin_ia32_selectpbf_128, "V8yUcV8yV8y", "ncV:128:", "avx512bf16,avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectpbf_256, "V16yUsV16yV16y", "ncV:256:", "avx512bf16,avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectpbf_512, "V32yUiV32yV32y", "ncV:512:", "avx512bf16,evex512")
-TARGET_BUILTIN(__builtin_ia32_selectq_128, "V2OiUcV2OiV2Oi", "ncV:128:", "avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectq_256, "V4OiUcV4OiV4Oi", "ncV:256:", "avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectq_512, "V8OiUcV8OiV8Oi", "ncV:512:", "avx512f,evex512")
-TARGET_BUILTIN(__builtin_ia32_selectps_128, "V4fUcV4fV4f", "ncV:128:", "avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectps_256, "V8fUcV8fV8f", "ncV:256:", "avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectps_512, "V16fUsV16fV16f", "ncV:512:", "avx512f,evex512")
-TARGET_BUILTIN(__builtin_ia32_selectpd_128, "V2dUcV2dV2d", "ncV:128:", "avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectpd_256, "V4dUcV4dV4d", "ncV:256:", "avx512vl")
-TARGET_BUILTIN(__builtin_ia32_selectpd_512, "V8dUcV8dV8d", "ncV:512:", "avx512f,evex512")
 TARGET_BUILTIN(__builtin_ia32_selectsh_128, "V8xUcV8xV8x", "ncV:128:", "avx512fp16")
 TARGET_BUILTIN(__builtin_ia32_selectsbf_128, "V8yUcV8yV8y", "ncV:128:", "avx512bf16")
 TARGET_BUILTIN(__builtin_ia32_selectss_128, "V4fUcV4fV4f", "ncV:128:", "avx512f")
diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index 774d2b53a3825..7c2222fe51203 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -12035,7 +12035,9 @@ def err_builtin_invalid_arg_type: Error <
   "a floating point type|"
   "a vector of integers|"
   "an unsigned integer|"
-  "an 'int'}1 (was %2)">;
+  "an 'int'|"
+  "a vector of bools"
+  "}1 (was %2)">;
 
 def err_builtin_matrix_disabled: Error<
   "matrix types extension is disabled. Pass -fenable-matrix to enable it">;
diff --git a/clang/lib/CodeGen/CGBuiltin.cpp b/clang/lib/CodeGen/CGBuiltin.cpp
index c7b219dcfcec5..487f9a2099eb9 100644
--- a/clang/lib/CodeGen/CGBuiltin.cpp
+++ b/clang/lib/CodeGen/CGBuiltin.cpp
@@ -3744,6 +3744,12 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
     return RValue::get(Result);
   }
 
+  case Builtin::BI__builtin_selectvector: {
+    return RValue::get(Builder.CreateSelect(EmitScalarExpr(E->getArg(2)),
+                                            EmitScalarExpr(E->getArg(0)),
+                                            EmitScalarExpr(E->getArg(1))));
+  }
+
   case Builtin::BI__builtin_elementwise_abs: {
     Value *Result;
     QualType QT = E->getArg(0)->getType();
@@ -15513,31 +15519,6 @@ Value *CodeGenFunction::EmitX86BuiltinExpr(unsigned BuiltinID,
   case X86::BI__builtin_ia32_prorvq256:
   case X86::BI__builtin_ia32_prorvq512:
     return EmitX86FunnelShift(*this, Ops[0], Ops[0], Ops[1], true);
-  case X86::BI__builtin_ia32_selectb_128:
-  case X86::BI__builtin_ia32_selectb_256:
-  case X86::BI__builtin_ia32_selectb_512:
-  case X86::BI__builtin_ia32_selectw_128:
-  case X86::BI__builtin_ia32_selectw_256:
-  case X86::BI__builtin_ia32_selectw_512:
-  case X86::BI__builtin_ia32_selectd_128:
-  case X86::BI__builtin_ia32_selectd_256:
-  case X86::BI__builtin_ia32_selectd_512:
-  case X86::BI__builtin_ia32_selectq_128:
-  case X86::BI__builtin_ia32_selectq_256:
-  case X86::BI__builtin_ia32_selectq_512:
-  case X86::BI__builtin_ia32_selectph_128:
-  case X86::BI__builtin_ia32_selectph_256:
-  case X86::BI__builtin_ia32_selectph_512:
-  case X86::BI__builtin_ia32_selectpbf_128:
-  case X86::BI__builtin_ia32_selectpbf_256:
-  case X86::BI__builtin_ia32_selectpbf_512:
-  case X86::BI__builtin_ia32_selectps_128:
-  case X86::BI__builtin_ia32_selectps_256:
-  case X86::BI__builtin_ia32_selectps_512:
-  case X86::BI__builtin_ia32_selectpd_128:
-  case X86::BI__builtin_ia32_selectpd_256:
-  case X86::BI__builtin_ia32_selectpd_512:
-    return EmitX86Select(*this, Ops[0], Ops[1], Ops[2]);
   case X86::BI__builtin_ia32_selectsh_128:
   case X86::BI__builtin_ia32_selectsbf_128:
   case X86::BI__builtin_ia32_selectss_128:
diff --git a/clang/lib/Headers/avx512bf16intrin.h b/clang/lib/Headers/avx512bf16intrin.h
index b28d2e243f2cb..1c32831a8cc57 100644
--- a/clang/lib/Headers/avx512bf16intrin.h
+++ b/clang/lib/Headers/avx512bf16intrin.h
@@ -77,9 +77,9 @@ _mm512_cvtne2ps_pbh(__m512 __A, __m512 __B) {
 ///    conversion of __B, and higher 256 bits come from conversion of __A.
 static __inline__ __m512bh __DEFAULT_FN_ATTRS512
 _mm512_mask_cvtne2ps_pbh(__m512bh __W, __mmask32 __U, __m512 __A, __m512 __B) {
-  return (__m512bh)__builtin_ia32_selectpbf_512((__mmask32)__U,
-                                        (__v32bf)_mm512_cvtne2ps_pbh(__A, __B),
-                                        (__v32bf)__W);
+  return (__m512bh)__builtin_selectvector(
+      (__v32bf)_mm512_cvtne2ps_pbh(__A, __B), (__v32bf)__W,
+      __builtin_bit_cast(__vecmask32, __U));
 }
 
 /// Convert Two Packed Single Data to One Packed BF16 Data.
@@ -99,9 +99,9 @@ _mm512_mask_cvtne2ps_pbh(__m512bh __W, __mmask32 __U, __m512 __A, __m512 __B) {
 ///    conversion of __B, and higher 256 bits come from conversion of __A.
 static __inline__ __m512bh __DEFAULT_FN_ATTRS512
 _mm512_maskz_cvtne2ps_pbh(__mmask32 __U, __m512 __A, __m512 __B) {
-  return (__m512bh)__builtin_ia32_selectpbf_512((__mmask32)__U,
-                                        (__v32bf)_mm512_cvtne2ps_pbh(__A, __B),
-                                        (__v32bf)_mm512_setzero_si512());
+  return (__m512bh)__builtin_selectvector(
+      (__v32bf)_mm512_cvtne2ps_pbh(__A, __B), (__v32bf)_mm512_setzero_si512(),
+      __builtin_bit_cast(__vecmask32, __U));
 }
 
 /// Convert Packed Single Data to Packed BF16 Data.
@@ -200,9 +200,9 @@ _mm512_dpbf16_ps(__m512 __D, __m512bh __A, __m512bh __B) {
 ///  __A, __B and __D
 static __inline__ __m512 __DEFAULT_FN_ATTRS512
 _mm512_mask_dpbf16_ps(__m512 __D, __mmask16 __U, __m512bh __A, __m512bh __B) {
-  return (__m512)__builtin_ia32_selectps_512((__mmask16)__U,
-                                       (__v16sf)_mm512_dpbf16_ps(__D, __A, __B),
-                                       (__v16sf)__D);
+  return (__m512)__builtin_selectvector(
+      (__v16sf)_mm512_dpbf16_ps(__D, __A, __B), (__v16sf)__D,
+      __builtin_bit_cast(__vecmask16, __U));
 }
 
 /// Dot Product of BF16 Pairs Accumulated into Packed Single Precision.
@@ -224,9 +224,9 @@ _mm512_mask_dpbf16_ps(__m512 __D, __mmask16 __U, __m512bh __A, __m512bh __B) {
 ///  __A, __B and __D
 static __inline__ __m512 __DEFAULT_FN_ATTRS512
 _mm512_maskz_dpbf16_ps(__mmask16 __U, __m512 __D, __m512bh __A, __m512bh __B) {
-  return (__m512)__builtin_ia32_selectps_512((__mmask16)__U,
-                                       (__v16sf)_mm512_dpbf16_ps(__D, __A, __B),
-                                       (__v16sf)_mm512_setzero_si512());
+  return (__m512)__builtin_selectvector(
+      (__v16sf)_mm512_dpbf16_ps(__D, __A, __B), (__v16sf)_mm512_setzero_si512(),
+      __builtin_bit_cast(__vecmask16, __U));
 }
 
 /// Convert Packed BF16 Data to Packed float Data.
diff --git a/clang/lib/Headers/avx512bitalgintrin.h b/clang/lib/Headers/avx512bitalgintrin.h
index bad265ceb7db2..f4e31c287af18 100644
--- a/clang/lib/Headers/avx512bitalgintrin.h
+++ b/clang/lib/Headers/avx512bitalgintrin.h
@@ -29,9 +29,9 @@ _mm512_popcnt_epi16(__m512i __A)
 static __inline__ __m512i __DEFAULT_FN_ATTRS
 _mm512_mask_popcnt_epi16(__m512i __A, __mmask32 __U, __m512i __B)
 {
-  return (__m512i) __builtin_ia32_selectw_512((__mmask32) __U,
-              (__v32hi) _mm512_popcnt_epi16(__B),
-              (__v32hi) __A);
+  return (__m512i)__builtin_selectvector((__v32hi)_mm512_popcnt_epi16(__B),
+                                         (__v32hi)__A,
+                                         __builtin_bit_cast(__vecmask32, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS
@@ -51,9 +51,9 @@ _mm512_popcnt_epi8(__m512i __A)
 static __inline__ __m512i __DEFAULT_FN_ATTRS
 _mm512_mask_popcnt_epi8(__m512i __A, __mmask64 __U, __m512i __B)
 {
-  return (__m512i) __builtin_ia32_selectb_512((__mmask64) __U,
-              (__v64qi) _mm512_popcnt_epi8(__B),
-              (__v64qi) __A);
+  return (__m512i)__builtin_selectvector((__v64qi)_mm512_popcnt_epi8(__B),
+                                         (__v64qi)__A,
+                                         __builtin_bit_cast(__vecmask64, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS
diff --git a/clang/lib/Headers/avx512bwintrin.h b/clang/lib/Headers/avx512bwintrin.h
index c854720de6a65..ba77f979da1f8 100644
--- a/clang/lib/Headers/avx512bwintrin.h
+++ b/clang/lib/Headers/avx512bwintrin.h
@@ -369,16 +369,16 @@ _mm512_add_epi8 (__m512i __A, __m512i __B) {
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_add_epi8(__m512i __W, __mmask64 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectb_512((__mmask64)__U,
-                                             (__v64qi)_mm512_add_epi8(__A, __B),
-                                             (__v64qi)__W);
+  return (__m512i)__builtin_selectvector((__v64qi)_mm512_add_epi8(__A, __B),
+                                         (__v64qi)__W,
+                                         __builtin_bit_cast(__vecmask64, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_maskz_add_epi8(__mmask64 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectb_512((__mmask64)__U,
-                                             (__v64qi)_mm512_add_epi8(__A, __B),
-                                             (__v64qi)_mm512_setzero_si512());
+  return (__m512i)__builtin_selectvector((__v64qi)_mm512_add_epi8(__A, __B),
+                                         (__v64qi)_mm512_setzero_si512(),
+                                         __builtin_bit_cast(__vecmask64, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
@@ -388,16 +388,16 @@ _mm512_sub_epi8 (__m512i __A, __m512i __B) {
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_sub_epi8(__m512i __W, __mmask64 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectb_512((__mmask64)__U,
-                                             (__v64qi)_mm512_sub_epi8(__A, __B),
-                                             (__v64qi)__W);
+  return (__m512i)__builtin_selectvector((__v64qi)_mm512_sub_epi8(__A, __B),
+                                         (__v64qi)__W,
+                                         __builtin_bit_cast(__vecmask64, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_maskz_sub_epi8(__mmask64 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectb_512((__mmask64)__U,
-                                             (__v64qi)_mm512_sub_epi8(__A, __B),
-                                             (__v64qi)_mm512_setzero_si512());
+  return (__m512i)__builtin_selectvector((__v64qi)_mm512_sub_epi8(__A, __B),
+                                         (__v64qi)_mm512_setzero_si512(),
+                                         __builtin_bit_cast(__vecmask64, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
@@ -407,16 +407,16 @@ _mm512_add_epi16 (__m512i __A, __m512i __B) {
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_add_epi16(__m512i __W, __mmask32 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectw_512((__mmask32)__U,
-                                             (__v32hi)_mm512_add_epi16(__A, __B),
-                                             (__v32hi)__W);
+  return (__m512i)__builtin_selectvector((__v32hi)_mm512_add_epi16(__A, __B),
+                                         (__v32hi)__W,
+                                         __builtin_bit_cast(__vecmask32, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_maskz_add_epi16(__mmask32 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectw_512((__mmask32)__U,
-                                             (__v32hi)_mm512_add_epi16(__A, __B),
-                                             (__v32hi)_mm512_setzero_si512());
+  return (__m512i)__builtin_selectvector((__v32hi)_mm512_add_epi16(__A, __B),
+                                         (__v32hi)_mm512_setzero_si512(),
+                                         __builtin_bit_cast(__vecmask32, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
@@ -426,16 +426,16 @@ _mm512_sub_epi16 (__m512i __A, __m512i __B) {
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_sub_epi16(__m512i __W, __mmask32 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectw_512((__mmask32)__U,
-                                             (__v32hi)_mm512_sub_epi16(__A, __B),
-                                             (__v32hi)__W);
+  return (__m512i)__builtin_selectvector((__v32hi)_mm512_sub_epi16(__A, __B),
+                                         (__v32hi)__W,
+                                         __builtin_bit_cast(__vecmask32, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_maskz_sub_epi16(__mmask32 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectw_512((__mmask32)__U,
-                                             (__v32hi)_mm512_sub_epi16(__A, __B),
-                                             (__v32hi)_mm512_setzero_si512());
+  return (__m512i)__builtin_selectvector((__v32hi)_mm512_sub_epi16(__A, __B),
+                                         (__v32hi)_mm512_setzero_si512(),
+                                         __builtin_bit_cast(__vecmask32, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
@@ -445,32 +445,30 @@ _mm512_mullo_epi16 (__m512i __A, __m512i __B) {
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_mullo_epi16(__m512i __W, __mmask32 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectw_512((__mmask32)__U,
-                                             (__v32hi)_mm512_mullo_epi16(__A, __B),
-                                             (__v32hi)__W);
+  return (__m512i)__builtin_selectvector((__v32hi)_mm512_mullo_epi16(__A, __B),
+                                         (__v32hi)__W,
+                                         __builtin_bit_cast(__vecmask32, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_maskz_mullo_epi16(__mmask32 __U, __m512i __A, __m512i __B) {
-  return (__m512i)__builtin_ia32_selectw_512((__mmask32)__U,
-                                             (__v32hi)_mm512_mullo_epi16(__A, __B),
-                                             (__v32hi)_mm512_setzero_si512());
+  return (__m512i)__builtin_selectvector((__v32hi)_mm512_mullo_epi16(__A, __B),
+                                         (__v32hi)_mm512_setzero_si512(),
+                                         __builtin_bit_cast(__vecmask32, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_blend_epi8 (__mmask64 __U, __m512i __A, __m512i __W)
 {
-  return (__m512i) __builtin_ia32_selectb_512 ((__mmask64) __U,
-              (__v64qi) __W,
-              (__v64qi) __A);
+  return (__m512i)__builtin_selectvector((__v64qi)__W, (__v64qi)__A,
+                                         __builtin_bit_cast(__vecmask64, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_blend_epi16 (__mmask32 __U, __m512i __A, __m512i __W)
 {
-  return (__m512i) __builtin_ia32_selectw_512 ((__mmask32) __U,
-              (__v32hi) __W,
-              (__v32hi) __A);
+  return (__m512i)__builtin_selectvector((__v32hi)__W, (__v32hi)__A,
+                                         __builtin_bit_cast(__vecmask32, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
@@ -482,17 +480,17 @@ _mm512_abs_epi8 (__m512i __A)
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_abs_epi8 (__m512i __W, __mmask64 __U, __m512i __A)
 {
-  return (__m512i)__builtin_ia32_selectb_512((__mmask64)__U,
-                                             (__v64qi)_mm512_abs_epi8(__A),
-                                             (__v64qi)__W);
+  return (__m512i)__builtin_selectvector((__v64qi)_mm512_abs_epi8(__A),
+                                         (__v64qi)__W,
+                                         __builtin_bit_cast(__vecmask64, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_maskz_abs_epi8 (__mmask64 __U, __m512i __A)
 {
-  return (__m512i)__builtin_ia32_selectb_512((__mmask64)__U,
-                                             (__v64qi)_mm512_abs_epi8(__A),
-                                             (__v64qi)_mm512_setzero_si512());
+  return (__m512i)__builtin_selectvector((__v64qi)_mm512_abs_epi8(__A),
+                                         (__v64qi)_mm512_setzero_si512(),
+                                         __builtin_bit_cast(__vecmask64, __U));
 }
 
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
@@ -504,17 +502,17 @@ _mm512_abs_epi16 (__m512i __A)
 static __inline__ __m512i __DEFAULT_FN_ATTRS512
 _mm512_mask_abs_epi16 (__m512i __W, __mmask32 __U, __m512i __A)
 {
-  return (__m512i)__builtin_ia32_selectw_512((__mmask32)__U,...
[truncated]

@philnik777
Copy link
Contributor Author

I don't understand why the order of emitted instructions changes based on how exactly Clang is compiled, but other than that this should be ready. Hopefully someone spots what the problem could be.

``__builtin_selectvector``
--------------------------

``__builtin_selectvector`` is used to express generic vector element selection.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extend this description to explicitly describe the input/output types and mechanism - don't just rely on the code snippet (although that's a nice accompaniment): The input must all be vectors of the same same number of elements, the 2 first operands must be the same type etc. etc. (basically everything in SemaChecking).

(__v2di)__W);
return (__m128i)__builtin_selectvector((__v2di)_mm_add_epi64(__A, __B),
(__v2di)__W,
__builtin_bit_cast(__vecmask2, __U));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__mmask8 is an unsigned char - how do we safely bitcast to __vecmask2 which is bool ext vector type?

Copy link
Collaborator

@RKSimon RKSimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constexpr handling?

@@ -3744,6 +3744,12 @@ RValue CodeGenFunction::EmitBuiltinExpr(const GlobalDecl GD, unsigned BuiltinID,
return RValue::get(Result);
}

case Builtin::BI__builtin_selectvector: {
return RValue::get(Builder.CreateSelect(EmitScalarExpr(E->getArg(2)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check all one (or all zero) like we did in EmitX86Select?


template <class T, size_t N>
simd_vec<T, N> __builtin_selectvector(simd_vec<T, N> lhs, simd_vec<T, N> rhs,
simd_vec<bool, N> cond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put cond the first operand to match with select and the old X86 builtins?

Comment on lines +45 to +46
typedef bool __vecmask2 __attribute__((__ext_vector_type__(2)));
typedef bool __vecmask4 __attribute__((__ext_vector_type__(4)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used __mmask8 for v2i1 and v4i1 too. I don't we can bitcast them directly.

(__v32bf)__W);
return (__m512bh)__builtin_selectvector(
(__v32bf)_mm512_cvtne2ps_pbh(__A, __B), (__v32bf)__W,
__builtin_bit_cast(__vecmask32, __U));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use (__vecmask32) dirctly like __v32bf etc?

typedef bool __vecmask32 __attribute__((__ext_vector_type__(32)));
typedef bool __vecmask64 __attribute__((__ext_vector_type__(64)));
#else
typedef _Bool __vecmask2 __attribute__((__ext_vector_type__(2)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerning about the ABI when people abuse them in function call.

@efriedma-quic
Copy link
Collaborator

clang already supports ?: with a vector condition; does this add anything new on top of that?

@@ -1176,6 +1176,12 @@ def ConvertVector : Builtin {
let Prototype = "void(...)";
}

def SelectVector : Builtin {
let Spellings = ["__builtin_selectvector"];
let Attributes = [NoThrow, Const, CustomTypeChecking];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be constexpr?

ExprResult RHS = TheCall->getArg(1);

QualType Result = UsualArithmeticConversions(
LHS, RHS, TheCall->getExprLoc(), ACK_Comparison);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is comparison correct? The documentation makes it seem like this should actually be ACK_Conditional.


if (!LHSVecT) {
Diag(LHS.get()->getBeginLoc(), diag::err_builtin_invalid_arg_type)
<< 1 << 4 << LHST;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add inline comments to the magic numbers (here and below), like /*argument*/1

Copy link
Collaborator

@AaronBallman AaronBallman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please be sure to add a release note about the new functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 clang:codegen clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:headers Headers provided by Clang, e.g. for intrinsics clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants