Using SIMD intrinsics with Intel syntax with D.
Guillaume Piolat
Latest commit 457f76b Oct 12, 2018

intel-intrinsics

The practical D SIMD solution. Use Intel intrinsics in D code.

This is a work in progress; please complain in the bug tracker.

Usage

import inteli.xmmintrin; // allows SSE1 intrinsics
import inteli.emmintrin; // allows SSE2 intrinsics

// distance between two points in 4D
float distance(float[4] a, float[4] b) nothrow @nogc
{
    __m128 va = _mm_loadu_ps(a.ptr);
    __m128 vb = _mm_loadu_ps(b.ptr);
    __m128 diffSquared = _mm_sub_ps(va, vb);
    diffSquared = _mm_mul_ps(diffSquared, diffSquared);
    __m128 sum = _mm_add_ps(diffSquared, _mm_srli_si128!8(diffSquared));
    sum = _mm_add_ps(sum, _mm_srli_si128!4(sum));
    return _mm_cvtss_f32(_mm_sqrt_ss(sum));
}
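
// Statements such as assert cannot appear at module scope in D, so the check lives in a unittest.
unittest
{
    assert(distance([0, 2, 0, 0], [0, 0, 0, 0]) == 2);
}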
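
To make the two shift-and-add steps easier to follow, here is a scalar sketch of what the function above computes. `_mm_srli_si128!8` moves lanes 2 and 3 down into lanes 0 and 1, so the first add leaves `d0+d2` and `d1+d3` in the low lanes. `_mm_srli_si128!4` then moves lane 1 into lane 0, so the second add leaves the full sum in lane 0. The name `distanceScalar` is hypothetical and not part of the library. The summation order differs from the SIMD version, so results may differ in the last bits for some inputs.

```d
import std.math : sqrt;

// Scalar reference for the SIMD distance function (illustrative only, not library code).
float distanceScalar(float[4] a, float[4] b) nothrow @nogc
{
    float sum = 0;
    foreach (i; 0 .. 4)
    {
        float d = a[i] - b[i]; // per-lane difference, like _mm_sub_ps
        sum += d * d;          // square and accumulate, like _mm_mul_ps plus the two adds
    }
    return sqrt(sum);          // like _mm_sqrt_ss on lane 0
}

unittest
{
    assert(distanceScalar([0, 2, 0, 0], [0, 0, 0, 0]) == 2);
}
```

A reference like this can be used to check the intrinsics version on a handful of inputs.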

Who is using it?

Why?

Familiar syntax

Why Intel intrinsic syntax? Because it is more familiar to C++ programmers and there is a convenient online guide provided by Intel: https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Without this guide it's much more difficult to write sizeable SIMD code.

Future-proof

LDC SIMD intrinsics are a moving target (https://github.com/ldc-developers/ldc/issues/2019), and you need a layer over them if you want to be safe.

We maintain that layer because we need it for our products.

Because those x86 intrinsics are internally converted to IR, they are not tied to a particular architecture. So you could target ARM one day and still get some speed-up.

Portability

For now only LDC is supported, but in the future the same set of intrinsics will work with DMD too. This is intended to be the most practical SIMD solution for D, including an emulation layer for 32-bit DMD, which doesn't have any SIMD capability right now.

Supported instruction sets

  • SSE1
  • SSE2

AVX intrinsics are not provided, because those instruction sets do not bring enough raw speed gain.

Important difference

When using the LDC compatibility layer (i.e. when not using LDC), every conversion between same-sized vector types must be written as an explicit cast instead of relying on an implicit conversion.

__m128i b = _mm_set1_epi32(42);
__m128 a = b;             // NO, only works in LDC
__m128 a = cast(__m128)b; // YES, works in all D compilers

This is because D does not allow such implicit conversions for the types used by the compatibility layer. The compiler only provides them, as a special case, for real vector types.
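
As a small illustration of the explicit cast, the sketch below reinterprets an integer vector as a float vector. It uses only names that appear in this README (`_mm_set1_epi32`, `_mm_cvtss_f32`, `cast(__m128)`). The function name `onesFromBits` is hypothetical. The sketch assumes the cast reinterprets the bits without converting values, which is how D treats casts between same-sized built-in vector types.

```d
import inteli.xmmintrin; // __m128, _mm_cvtss_f32
import inteli.emmintrin; // __m128i, _mm_set1_epi32

// Illustrative only: build a float vector from raw bit patterns.
float onesFromBits()
{
    // 0x3F80_0000 is the IEEE 754 single-precision bit pattern of 1.0f.
    __m128i bits = _mm_set1_epi32(0x3F80_0000);

    // Explicit cast: the portable spelling recommended above.
    __m128 asFloats = cast(__m128) bits;

    // Extract the lowest lane as a scalar float.
    return _mm_cvtss_f32(asFloats);
}
```

Under that assumption the function returns 1.0f. Writing the cast explicitly costs nothing under LDC and keeps the code valid for the other compilers.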