Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use vector instructions for union2by2 #288

Open
lemire opened this issue Oct 29, 2020 · 5 comments
Open

Use vector instructions for union2by2 #288

lemire opened this issue Oct 29, 2020 · 5 comments

Comments

@lemire
Copy link
Member

lemire commented Oct 29, 2020

We have ARM assembly for union2by2 as per #287

It should be said that there are vectorized algorithm (hint: ARM NEON) for union routines. It appears in the C code (for SSE only but it is portable).

https://github.com/RoaringBitmap/CRoaring/blob/master/src/array_util.c#L1569-L1647

It is described in section 4.3 of the following paper:

It is actually not difficult to implement. For someone that is either fluent in ARM NEON or wants to learn... it is a great project.

@jacksonrnewhouse
Copy link

https://github.com/DLTcollab/sse2neon seems like a valuable reference

@lemire
Copy link
Member Author

lemire commented Oct 30, 2020

If someone is interested, I am fluent in the various SIMD dialects. Finding the intrinsics and the instructions is not so difficult. There are few of them to find anyhow.

The hard part is to put it all together.

@jacksonrnewhouse
Copy link

Unfortunately golang hasn't ported over UMIN and UMAX. I opened an issue on it, but will probably have to figure out the bitwise representation for now.

@jacksonrnewhouse
Copy link

@lemire, the intersection algorithms use _mm_cmpestrm to compare 2 registers of 8 shorts each, and there isn't an equivalent instruction in arm64, I believe. Have any suggestions? My thoughts are to do something like sse_merge, where you run CMEQ between the registers, then rotate 8 times using EXT, and pop_count the result, shifting right by 4 bits (each equal case sets 16 bits). Any suggestions on other approaches?

@lemire
Copy link
Member Author

lemire commented Nov 9, 2020

Though cmpestrm looks really good, it is an expensive instruction with many cycles of latency so the approach you suggest is more competitive than it appears.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants