Add ARM NEON support #72

42Bastian · 2016-07-05T08:43:50Z

Inspired by the AVX2 discussion, I suggest the following code for ARM targets.
It replaces the main loop in this function:
FORCE_INLINE U32 XXH32_endian_align(const void* input, size_t len, U32 seed, XXH_endianess endian, XXH_alignment align)

`if (len >= 16) {
    const BYTE* const limit = bEnd - 16;
    /* Per-lane start values; the seed is added to all four lanes below. */
    const uint32_t initial[4] = {
        PRIME32_1 + PRIME32_2,
        PRIME32_2,
        0,
        0U - PRIME32_1
    };
    U32 v1;
    U32 v2;
    U32 v3;
    U32 v4;
    uint32x4_t vseed  = vdupq_n_u32(seed);       // vseed(0,1,2,3) = seed
    uint32x4_t prime1 = vdupq_n_u32(PRIME32_1);  // prime1(0,1,2,3) = PRIME32_1
    uint32x4_t prime2 = vdupq_n_u32(PRIME32_2);  // prime2(0,1,2,3) = PRIME32_2
    uint32x4_t v      = vld1q_u32(initial);      // read initial into vector
    uint32x4_t data;                             // renamed: "input" is the function parameter
    uint32x4_t tmp;
    v = vaddq_u32(v, vseed);                     // v += seed, in every lane
    do {
        data = vld1q_u32((const uint32_t *)p);   // no alignment or endianness handling yet
        p += 16;
        /* round */
        v   = vmlaq_u32(v, data, prime2);        // v += data * PRIME32_2;
        tmp = vshrq_n_u32(v, 19);                // XXH_rotl32(v, 13): low 13 bits ...
        v   = vsliq_n_u32(tmp, v, 13);           // ... merged with v << 13
        v   = vmulq_u32(v, prime1);              // v *= PRIME32_1;
    } while (p <= limit);

    v1 = vgetq_lane_u32(v, 0);
    v2 = vgetq_lane_u32(v, 1);
    v3 = vgetq_lane_u32(v, 2);
    v4 = vgetq_lane_u32(v, 3);

    h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

} else {
    h32 = seed + PRIME32_5;
}`

On a Zynq (Cortex-A9) it nearly doubles the speed.
PS: I am new to GitHub, so please bear with me.
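
For readers without ARM hardware at hand, the vector loop above can be checked lane by lane in plain C. The following is only a sketch: `model_vshr`, `model_vsli`, `lane_round`, `ref_round` and `init_lanes` are hypothetical names introduced for illustration, and the two `model_*` helpers are one-lane approximations of `vshrq_n_u32` and `vsliq_n_u32`, not the intrinsics themselves. It shows that the shift-right plus shift-left-insert pair should behave like a 13-bit left rotation, and how the four lanes start out once the seed is added.

```c
#include <stdint.h>

#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U

/* Scalar rotate-left, valid for 0 < r < 32. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    return (x << r) | (x >> (32 - r));
}

/* Assumed one-lane model of vshrq_n_u32(v, n): logical shift right. */
static uint32_t model_vshr(uint32_t v, unsigned n)
{
    return v >> n;
}

/* Assumed one-lane model of vsliq_n_u32(a, b, n):
   shift b left by n and keep the low n bits of a. */
static uint32_t model_vsli(uint32_t a, uint32_t b, unsigned n)
{
    uint32_t low_mask = (1U << n) - 1U;
    return (b << n) | (a & low_mask);
}

/* One round on a single lane, following the order of the vector loop. */
static uint32_t lane_round(uint32_t acc, uint32_t word)
{
    uint32_t tmp;
    acc += word * PRIME32_2;          /* vmlaq_u32   */
    tmp  = model_vshr(acc, 19);       /* vshrq_n_u32 */
    acc  = model_vsli(tmp, acc, 13);  /* vsliq_n_u32 */
    acc *= PRIME32_1;                 /* vmulq_u32   */
    return acc;
}

/* The same round written with an explicit rotation, for comparison. */
static uint32_t ref_round(uint32_t acc, uint32_t word)
{
    acc += word * PRIME32_2;
    acc  = rotl32(acc, 13);
    acc *= PRIME32_1;
    return acc;
}

/* Lane start values after the seed has been added to "initial". */
static void init_lanes(uint32_t seed, uint32_t v[4])
{
    v[0] = seed + PRIME32_1 + PRIME32_2;
    v[1] = seed + PRIME32_2;
    v[2] = seed;
    v[3] = seed - PRIME32_1;
}
```

Comparing `lane_round` against `ref_round` over a range of inputs is a cheap sanity check before testing on a real Cortex-A target.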

Cyan4973 · 2016-07-06T17:30:47Z

Thanks for the suggestion Bastian.

It's certainly an interesting addition to the library.

I'll need some free time
and availability of relevant hardware / emulator
to properly test its integration.

So bear with me, I'll keep this topic on the side, and come back to it as soon as I get the ingredients to make it work.

Regards

42Bastian · 2016-07-06T19:09:26Z

Hi Yann

> Thanks for the suggestion Bastian.

Thanks for the hash. I compared it on a Cortex-A9 at 866 MHz against
a hand-optimized slicing-by-four CRC32 and it was 3 to 4 times faster!

At first I thought the multiplications would be a show-stopper, but today's CPUs
can do the multiplication/addition mostly in one cycle.

> It's certainly an interesting addition to the library.
>
> I'll need some free time
> and availability of relevant hardware / emulator
> to properly test its integration.
>
> So bear with me, I'll keep this topic on the side, and come back to it as soon as I get the ingredients to make it work.

It is not yet perfect anyway. It was just a proof of concept. In particular,
alignment checks are missing.

Cheers

42Bastian
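
The missing alignment checks mentioned above could take roughly the following shape. This is a minimal sketch in portable C, not code from the library: `is_aligned16` and `load4_words` are hypothetical helpers, and whether a 16-byte check is the right granularity for a given ARM core is an assumption. In a NEON build, one commonly used alternative is to load bytes with `vld1q_u8` and reinterpret them with `vreinterpretq_u32_u8`, which avoids casting the input pointer to `uint32_t *`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: non-zero when p lies on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15U) == 0;
}

/* Hypothetical helper: copy four consecutive 32-bit words from p in
   native byte order. memcpy makes the read safe for any alignment;
   compilers typically lower it to plain loads where that is legal. */
static void load4_words(const void *p, uint32_t out[4])
{
    memcpy(out, p, 4 * sizeof(uint32_t));
}
```

A caller could use `is_aligned16` to pick between a fast aligned path and a fallback, much as the existing `XXH_alignment` parameter already distinguishes the two cases.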

Cyan4973 · 2017-12-19T11:14:26Z

I suspect we should rather convert this PR into an issue,
since there is actually no code to merge,
but the comment contains interesting guidelines.

Cyan4973 added 3 commits June 2, 2016 11:51

Merge pull request #68 from Cyan4973/dev

8601182

Dev

Merge pull request #70 from Cyan4973/dev

b806ee4

v0.6.1

Merge pull request #71 from Cyan4973/dev

228d727

add `XXH*_copyState()` to list for `XXH_NAMESPACE`

Cyan4973 and others added 3 commits August 10, 2016 07:38

Merge pull request #80 from Cyan4973/dev

4ab17c7

v0.6.2

Merge pull request #81 from Cyan4973/dev

88c6ee1

fixed minor conversion warnings

Merge pull request #112 from Cyan4973/dev

15c3512

v0.6.3

Cyan4973 mentioned this pull request Dec 22, 2017

Add NEON support #119

Closed

Cyan4973 closed this Dec 22, 2017
