
Improve glm_quat_conjugate #342

Open
gottfriedleibniz opened this issue Aug 4, 2023 · 3 comments
gottfriedleibniz commented Aug 4, 2023

The current implementation of quat_conjugate is quite slow when compiled with SSE. For reference, here is clang's output.

Included in this link are alternate implementations, one of which can be easily extended to WASM and NEON, e.g.,

  float32x4_t mask = glmm_float32x4_init(-1.0f, -1.0f, -1.0f, 1.0f);
  glmm_store(dest, vmulq_f32(glmm_load(q), mask));
recp (Owner) commented Aug 5, 2023

@gottfriedleibniz nice suggestion, thanks!

To avoid mul overhead (if there is no special optimization for multiplying by -1), it would be nice to do this without mul, as in your implementations on godbolt:

extern
void glm_quat_conjugate_simd(versor q, versor dest) {
#if 0
  __m128i mask = _mm_set_epi32(0, GLMM_NEGZEROf, GLMM_NEGZEROf, GLMM_NEGZEROf);
  glmm_store(dest, _mm_xor_ps(glmm_load(q), _mm_castsi128_ps(mask)));
#else
  __m128 mask = _mm_set_ps(1.0f, -1.0f, -1.0f, -1.0f);
  glmm_store(dest, _mm_mul_ps(glmm_load(q), mask));
#endif
}

By defining GLMM__SIGNMASKf or glmm_float32x4_SIGNMASK_NNNP for SSE, NEON and WASM, we could write it as:

CGLM_INLINE
void
glm_quat_conjugate(versor q, versor dest) {
#if defined(CGLM_SIMD)
  glmm_store(dest, glmm_xor(glmm_load(q), glmm_float32x4_SIGNMASK_NNNP));
#else
  dest[0] = -q[0];
  dest[1] = -q[1];
  dest[2] = -q[2];
  dest[3] =  q[3];
#endif
}

Currently there is no glmm_xor in WASM; it would be nice to improve the glmm_ API to make this easier.

gottfriedleibniz (Author) commented

Seems good.

Although, I wouldn't be surprised if the scalar equivalent is faster or as fast on ARM64 (and maybe ARMv7). Some timing data here would be nice, but that can be done later.

recp (Owner) commented Aug 5, 2023

Thanks,

> Although, I wouldn't be surprised if the scalar equivalent is faster or as fast on ARM64 (and maybe ARMv7). Some timing data here would be nice, but that can be done later.

Sure, SIMD can be skipped if there is no benefit on ARM (or maybe on other platforms too); as you said, a benchmark can be done later.
