Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use BMI2 pdep/pext instructions #6

Closed
gnzlbg opened this issue Dec 28, 2015 · 7 comments
Closed

Use BMI2 pdep/pext instructions #6

gnzlbg opened this issue Dec 28, 2015 · 7 comments
Assignees

Comments

@gnzlbg
Copy link

gnzlbg commented Dec 28, 2015

BMI2 provides parallel bit deposit/extract instructions that allow an efficient encoding/decoding of morton codes. Something like the following should do the trick. Basically in 3D one needs 3 calls to pdep/pext intrinsics to encode/decode morton codes.

#include <array>
#include <type_traits>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// wrappers around the bmi2 pdep/pext intrinsics
namespace bmi2_detail {

constexpr uint32_t pdep(uint32_t source, uint32_t mask) noexcept {
  return _pdep_u32(source, mask);
}
constexpr uint64_t pdep(uint64_t source, uint64_t mask) noexcept {
  return _pdep_u64(source, mask);
}

constexpr uint32_t pext(uint32_t source, uint32_t mask) noexcept {
  return _pext_u32(source, mask);
}
constexpr uint64_t pext(uint64_t source, uint64_t mask) noexcept {
  return _pext_u64(source, mask);
}

}  // namespace bmi2_detail

/// Parallel Bits Deposit implementation w/o BMI2 intrinsics
template <typename Integral>
__attribute__((no_sanitize("integer"))) 
constexpr Integral deposit_bits(Integral x, Integral mask) {
#if !defined(__BMI2__)
  Integral res = 0;
  for (Integral bb = 1; mask != 0; bb += bb) {
    if (x & bb) { res |= mask & (-mask); }
    mask &= (mask - 1);
  }
  return res;
#else
  return bmi2_detail::pdep(x, mask);
#endif
}

/// Parallel Bits Extract implementation w/o BMI2 intrinsics
template <typename Integral>
__attribute__((no_sanitize("integer"))) 
constexpr Integral extract_bits(Integral x, Integral mask) {
#if !defined(__BMI2__)
  Integral res = 0;
  for (Integral bb = 1; mask != 0; bb += bb) {
    if (x & mask & -mask) { res |= bb; }
    mask &= (mask - 1);
  }
  return res;
#else
  return bmi2_detail::pext(x, mask);
#endif
}

// restrict to unsigned integer types
template <typename T>
using enable_if_unsigned_t = std::enable_if_t<std::is_unsigned<T>{}>;

template <typename UInt, typename = enable_if_unsigned_t<UInt>>
constexpr UInt encode(std::array<UInt, 1> xs) noexcept {
  return xs[0];
}
template <typename UInt, typename = enable_if_unsigned_t<UInt>>
constexpr std::array<UInt, 1> decode(UInt code, std::array<UInt, 1>) noexcept {
  // I use std::array for decoding but using anything else should be trivial
  return {{code}};
}

template <typename UInt, typename = enable_if_unsigned_t<UInt>> 
UInt encode(std::array<UInt, 2> xs) noexcept {
  return deposit_bits(xs[1], static_cast<UInt>(0xAAAAAAAAAAAAAAAA))
         | deposit_bits(xs[0], static_cast<UInt>(0x5555555555555555));
}
template <typename UInt,  typename = enable_if_unsigned_t<UInt>>
std::array<UInt, 2> decode(UInt code, std::array<UInt, 2>) noexcept {
  return {{extract_bits(code, static_cast<UInt>(0x555555555555555)),
           extract_bits(code, static_cast<UInt>(0xAAAAAAAAAAAAAAAA))}};
}

template <typename UInt, typename = enable_if_unsigned_t<UInt>>
UInt encode(std::array<UInt, 3> xs) noexcept {
  return deposit_bits(xs[2], static_cast<UInt>(0x4924924924924924))
         | deposit_bits(xs[1], static_cast<UInt>(0x2492492492492492))
         | deposit_bits(xs[0], static_cast<UInt>(0x9249249249249249));
}
template <typename UInt, enable_if_unsigned_t<UInt>>
std::array<UInt, 3> decode(UInt code, std::array<UInt, 3>) noexcept {
  return {{extract_bits(code, static_cast<UInt>(0x9249249249249249)),
           extract_bits(code, static_cast<UInt>(0x2492492492492492)),
           extract_bits(code, static_cast<UInt>(0x4924924924924924))}};
}
@Forceflow
Copy link
Owner

I am aware of the BMI2 instructions set - will try to work this in!

@Forceflow Forceflow self-assigned this Jan 30, 2016
@Forceflow
Copy link
Owner

I'm implementing this in 1b57576, but without the fallback to a for-based implementation if the BMI2 instruction set is unsupported.

Hard to detect this at compile-time on a MSVC compiler, too.

@gnzlbg
Copy link
Author

gnzlbg commented Nov 24, 2016

You can do something like:

#if defined(LIBMORTON_USE_BMI2) || defined(__BMI2__)
  //...
#else
  //...
#endif

to allow users on windows to pass it a flag. I have zero experience with windows, but MSVC must offer a way to detect BMI2 support.

@gnzlbg
Copy link
Author

gnzlbg commented Nov 24, 2016

Anyhow, when BMI2 is not supported, the "for-based fallback implementation" of deposit_bits/extract_bits might be slower than other techniques of computing morton indices (I haven't benchmarked it though).

@Forceflow
Copy link
Owner

Forceflow commented Nov 26, 2016

Yeah, better fall back on LUT-based or magicbits based methods.

At compile time, you'd think MSVC would have a flag like GCC's __BMI2__, but no. You can check for AVX2 compilation, but that doesn't necessarily mean BMI2 instructions are available.

At runtime, you can manually check the CPUID bits, but it obviously kills performance to do that for every libmorton call.

So for now, I'm just going to use the __BMI2__ flag and hope MSVC follows GCC's example soon. Until then, Windows compilation will have to define the flag manually.

@gnzlbg
Copy link
Author

gnzlbg commented Nov 26, 2016

You can check for AVX2 compilation, but that doesn't necessarily mean BMI2 instructions are available.

This is technically correct but AFAIK all existing CPUs from both Intel and AMD that support AVX2 also support BMI2, so checking from AVX2 compilation on MSVC might be "good enough".

@Forceflow
Copy link
Owner

Yep, @gnzlbg, I decided to go for that in baad935.

Thanks for the excellent suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants