Add ARM NEON support #72

Closed

42Bastian
Inspired by the AVX2 discussion, I suggest the following code for ARM targets:
Function:
FORCE_INLINE U32 XXH32_endian_align(const void* input, size_t len, U32 seed, XXH_endianess endian, XXH_alignment align)

`if (len>=16) {
const BYTE* const limit = bEnd - 16;
const uint32_t initial[4] = {
PRIME32_1 + PRIME32_2,
PRIME32_2,
0,
-PRIME32_1
};
U32 v1;
U32 v2;
U32 v3;
U32 v4;
uint32x4_t vseed = vdupq_n_u32 (seed); // vseed(0,1,2,3) = seed
uint32x4_t prime1 = vdupq_n_u32(PRIME32_1); // prime1(0,1,2,3) = prime1
uint32x4_t prime2 = vdupq_n_u32(PRIME32_2); // prime2(0,1,2,3) = prime2
uint32x4_t v = vld1q_u32 (initial); // read initial into vector
uint32x4_t input;
uint32x4_t tmp;
v = vaddq_u32 (v, vseed); // add seed to every lane (intrinsic form of v += vseed)
do {
input = vld1q_u32((const uint32_t*)p);
p += 16;
/* round */
v = vmlaq_u32 (v, input, prime2); // seed += input * PRIME32_2;
tmp = vshrq_n_u32 (v, 19); // XXH_rotl32(seed, 13);
v = vsliq_n_u32 (tmp, v, 13);
v = vmulq_u32 (v, prime1); // seed *= PRIME32_1;
} while (p<=limit);

v1 = vgetq_lane_u32(v,0);
v2 = vgetq_lane_u32(v,1);
v3 = vgetq_lane_u32(v,2);
v4 = vgetq_lane_u32(v,3);

h32 = XXH_rotl32(v1, 1) + XXH_rotl32(v2, 7) + XXH_rotl32(v3, 12) + XXH_rotl32(v4, 18);

} else {
h32 = seed + PRIME32_5;
}`

On a Zynq (Cortex-A9), it nearly doubles the speed.
PS: I am new to GitHub, so please bear with me.
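
The rotate in the loop above is built from a right shift (`vshrq_n_u32`) followed by a shift-left-and-insert (`vsliq_n_u32`). As a rough illustration only, here is a portable scalar model of one lane and of one four-lane round, so the result can be compared with a plain rotate on any machine. The helper names `rotl32`, `rotl13_shift_insert` and `round4_model` are hypothetical and not part of xxHash; the two prime values are the XXH32 constants.

```c
#include <assert.h>
#include <stdint.h>

#define PRIME32_1 2654435761U
#define PRIME32_2 2246822519U

/* Reference rotate-left, the scalar operation XXH_rotl32 performs. */
static uint32_t rotl32(uint32_t x, int r)
{
    assert(r > 0 && r < 32);
    return (x << r) | (x >> (32 - r));
}

/* Scalar model of vshrq_n_u32(v, 19) followed by vsliq_n_u32(tmp, v, 13)
   for a single lane: the insert keeps the low 13 bits of tmp and fills
   the upper 19 bits with v << 13. */
static uint32_t rotl13_shift_insert(uint32_t v)
{
    uint32_t tmp = v >> 19;
    uint32_t low_mask = (1U << 13) - 1U;
    return (v << 13) | (tmp & low_mask);
}

/* One round applied to four independent lanes (models the loop body). */
static void round4_model(uint32_t v[4], const uint32_t in[4])
{
    for (int i = 0; i < 4; i++) {
        v[i] += in[i] * PRIME32_2;        /* vmlaq_u32: multiply-accumulate */
        v[i] = rotl13_shift_insert(v[i]); /* vshrq_n_u32 + vsliq_n_u32 */
        v[i] *= PRIME32_1;                /* vmulq_u32 */
    }
}
```

Comparing `rotl13_shift_insert(x)` with `rotl32(x, 13)` for a few inputs is a quick sanity check that the two-instruction sequence really is a rotate by 13.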

@Cyan4973
Owner

Cyan4973 commented Jul 6, 2016

Thanks for the suggestion Bastian.

It's certainly an interesting addition to the library.

I'll need some free time
and availability of relevant hardware / emulator
to properly test its integration.

So bear with me, I'll keep this topic on the side, and come back to it as soon as I get the ingredients to make it work.

Regards

@42Bastian
Author

Hi Yann

> Thanks for the suggestion Bastian.

Thanks for the hash. I compared it on a Cortex-A9 at 866 MHz against a
hand-optimized CRC32 (slicing-by-four), and it was 3 to 4 times faster!

I first thought the multiplications would be a show-stopper, but today's CPUs
can do the multiplication/addition mostly in one cycle.

> It's certainly an interesting addition to the library.
>
> I'll need some free time
> and availability of relevant hardware / emulator
> to properly test its integration.
>
> So bear with me, I'll keep this topic on the side, and come back to it as soon as I get the ingredients to make it work.

It is not yet perfect anyway. It was just a proof of concept. In particular,
alignment checks are missing.
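
As a minimal sketch of what such a check might look like (an assumption, not code from this proposal or from xxHash), the input pointer can be tested for 4-byte alignment, with a `memcpy`-based read as the fallback when it is not aligned. The helper names `is_aligned32` and `read32_any` are hypothetical; how the result would be mapped onto the existing `align` parameter is left open here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if p sits on a 4-byte boundary and can be
   read directly as uint32_t. */
static int is_aligned32(const void* p)
{
    assert(p != NULL);
    return ((uintptr_t)p & 3U) == 0;
}

/* Alignment-safe 32-bit read in native byte order. Copying through memcpy
   avoids dereferencing a misaligned uint32_t pointer. */
static uint32_t read32_any(const void* p)
{
    uint32_t val;
    assert(p != NULL);
    memcpy(&val, p, sizeof val);
    return val;
}
```

A caller could test `is_aligned32(input)` once before the loop and choose the direct vector load only when it holds, falling back to the byte-wise path otherwise.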

Cheers

42Bastian

@Cyan4973
Owner

I suspect we should rather convert this PR into an issue,
since there is actually no code to merge,
but the comment contains interesting guidelines.

@Cyan4973 Cyan4973 mentioned this pull request Dec 22, 2017
@Cyan4973 Cyan4973 closed this Dec 22, 2017