Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 #127

saik0 · 2022-01-04T11:57:00Z

Purpose

This PR adds some optimizations from CRoaring as outlined in arXiv:1709.07821 section 3

Overview

All inserts and removes are now branchless (!in arXiv:1709.0782, in CRoaring)
Section 3.1 was already implemented, except for BitmapIter. This is covered in Speed up bitmap iteration #125
Implement Array-Bitset aggregates as outlined in section 3.2
- Also branchless 😎
Tracks bitmap cardinality while performing bitmap-bitmap ops
- This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready
- Curious to hear what you think about this @lemire
In order to track bitmap cardinality the len field had to moved into Store::Bitmap
- This is unfortunately a cross cutting change
Store was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing Store such hat it's field variants are two new container types: each responsible for maintaining their invariants and implementing ops
- Bitmap8K keeps track of it's cardinality
- SortedU16Vec maintains its sorting
- Store now only delegates to these containers
- My hope is that this will be useful when implementing run containers. 🤞
- Unfortunately so much code was moved this PR is HUGE

Out of scope

Inline ASM for Array-Bitset aggregates
Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned

Notes

I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the papers 5). While it was an improvement, I abandoned the effort in favor of waiting for the asm! macro to stabilize. Tracking Issue for inline assembly (asm!) rust-lang/rust#72016

…on 3

saik0 · 2022-01-08T07:50:03Z

Got a benchmark harness with ops over the sample datasets. The results are consistent and very promising!

Kerollmops · 2022-01-09T10:07:21Z

That's quit good indeed 🚀 😄

However, I am not sure about the use of the new asm! macro to do inline asm. I would like to keep this crate simple and able to compile on most of the platforms.

Maybe we can make sure that using inline asm doesn't break the build, I had too much trouble compiling croaring-rs (the wrapper of the C roaring library in Rust) due to a lot of raw SIMD function calls, unavailable on the platforms. Maybe we can use the std::simd soon to be released crate?

saik0 · 2022-01-10T01:11:37Z

That's quit good indeed 🚀 😄

However, I am not sure about the use of the new asm! macro to do inline asm. I would like to keep this crate simple and able to compile on most of the platforms.

Maybe we can make sure that using inline asm doesn't break the build, I had too much trouble compiling croaring-rs (the wrapper of the C roaring library in Rust) due to a lot of raw SIMD function calls, unavailable on the platforms. Maybe we can use the std::simd soon to be released crate?

Let's have this discussion in #60. This PR does not include any inline asm or explicit SIMD. The perf gains here are in making inserts and remove branchless, and by interleaving the cardinality tracking.

saik0 · 2022-01-10T01:15:11Z

Maintaining the len in the bitmap store has the additional micro-optimization of making sizeof<Container>() smaller, so binary searching containers should be slightly cache-friendlier.

Kerollmops

This PR is huge indeed but the work is great! It reduces the time it takes to do most of the operations and also reduces the amount of code in the bitmap/store.rs file!

Thank you very much! I don't see any reason not to mark this PR as a candidate for merge!

src/bitmap/bitmap_8k.rs

src/bitmap/sorted_u16_vec.rs

src/bitmap/bitmap_8k.rs

saik0 · 2022-01-10T17:09:06Z

@Kerollmops I'm checking invariants with debug assertion in: 62e2eed. I think it's sufficient. ~~However: deserialize might panic now for malformed inputs! (but only in debug builds)~~ Fixed

src/bitmap/sorted_u16_vec.rs

Co-authored-by: Clément Renault <renault.cle@gmail.com>

saik0 · 2022-01-11T18:34:42Z

Re: Naming

Bitmap8k and SortedU16Vec are named after what they are. However, if we're being honest with ourselves they are not general purpose types.

We could also name them after what they're for. BitmapStore ArrayStore and possibly rename Store enum to StoreKind or StoreType and move all three into a store module, or include Container and call the mod container.

I don't have strong opinions on this. I'm just trying to think of what would be least surprising to me if I were seeing the codebase for the first time.

saik0 · 2022-01-11T19:20:33Z

The checks are not free, but I think for deserialization correctness matters more than performance. If this change is controversial I can revert it for this PR and we can have a separate discussion.

Kerollmops · 2022-01-12T10:30:31Z

Re: Naming

We could also name them after what they're for. BitmapStore ArrayStore and possibly rename Store enum to StoreKind or StoreType and move all three into a store module, or include Container and call the mod container.

I like the idea, I would just like to keep the Store enum name. This is a store that can be of one of two forms. This is not just the kind or type of a store, that's a store!

Re: Unchecked Deserialization

~~I am not sure to understand, what are both of these methods? Have you introduced a new method to deserialize without checking some bounds or other stuff? I am not able to find it on your branch.~~

Ho! I understand now, you were trying to use the try_from and from_unchecked versions of the BitmapStore and ArrayStore, I got it. I don't know about keeping these checked or unchecked versions, but I like that users can choose to use the unchecked version of a deserialization method when they know that's safe. Don't you think?

Also, the timings seem quite low, 2.5ms is very low, which dataset were you using to graph that?

saik0 · 2022-01-12T11:52:18Z

Re: Unchecked Deserialization

~~I am not sure to understand, what are both of these methods? Have you introduced a new method to deserialize without checking some bounds or other stuff? I am not able to find it on your branch.~~

Ho! I understand now, you were trying to use the try_from and from_unchecked versions of the BitmapStore and ArrayStore, I got it.

Also, the timings seem quite low, 2.5ms is very low, which dataset were you using to graph that?

Each point is one of the datasets. The one with the most elements is weather_85 if my memory serves me correctly.

In any case, I don't want to hold up this PR with this discussion, as it's out of scope. I will revert deserialize to the prior behavior. We can discuss elsewhere.

Kerollmops

I like what you've done! Indeed I would like you to remove the unchecked version of the deserialize method. Maybe you can create an issue for this specific issue?

saik0 · 2022-01-12T12:46:23Z

I like what you've done! Indeed I would like you to remove the unchecked version of the deserialize method. Maybe you can create an issue for this specific issue?

OK, I'll open up a followup PR to add data validation to deserialize

Kerollmops · 2022-01-12T14:19:12Z

Do you think I can merge this PR while waiting for your follow-up PR on data validation?

saik0 · 2022-01-12T14:53:08Z

Do you think I can merge this PR while waiting for your follow-up PR on data validation?

Yes. Follow up PR will depend on the try versions implemented in this one, so it's required to merge

Kerollmops · 2022-01-12T14:55:03Z

bors merge

bors · 2022-01-12T14:59:07Z

Build succeeded:

ci (stable)

127: Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 r=Kerollmops a=saik0 ### Purpose This PR adds some optimizations from CRoaring as outlined in arXiv:1709.07821 section 3 ### Overview * All inserts and removes are now branchless (!in arXiv:1709.0782, in CRoaring) * Section 3.1 was already implemented, except for `BitmapIter`. This is covered in RoaringBitmap#125 * Implement Array-Bitset aggregates as outlined in section 3.2 * Also branchless 😎 * Tracks bitmap cardinality while performing bitmap-bitmap ops * This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready * Curious to hear what you think about this `@lemire` * In order to track bitmap cardinality the len field had to moved into `Store::Bitmap` * This is unfortunately a cross cutting change * `Store` was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing `Store` such hat it's field variants are two new container types: each responsible for maintaining their invariants and implementing `ops` * `Bitmap8K` keeps track of it's cardinality * `SortedU16Vec` maintains its sorting * `Store` now only delegates to these containers * My hope is that this will be useful when implementing run containers. 🤞 * Unfortunately so much code was moved this PR is _HUGE_ ### Out of scope * Inline ASM for Array-Bitset aggregates * Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned ### Notes * I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the papers 5). While it was an improvement, I abandoned the effort in favor of waiting for the `asm!` macro to stabilize. rust-lang/rust#72016 Co-authored-by: saik0 <github@saik0.net> Co-authored-by: Joel Pedraza <github@saik0.net>

saik0 added 2 commits January 4, 2022 03:30

implement scalar optimizations from CRoaring / arXiv:1709.07821 secti…

ee0152c

…on 3

fix formatting

1678d1c

saik0 mentioned this pull request Jan 10, 2022

len should take constant-time #45

Open

Kerollmops requested changes Jan 10, 2022

View reviewed changes

change op_bitmaps method param from fn to Fn

f29fa21

Kerollmops mentioned this pull request Jan 10, 2022

Make append O(N) #132

Merged

check invariats with debug assertions

62e2eed

saik0 added 3 commits January 10, 2022 09:33

Merge remote-tracking branch 'upstream/master' into optimized-section-3

9450642

add Error tyoes and TryFrom impl for SortedU16Vec, use in deserialize

c694f54

add Error tyoes and try_from for Bitmap8k, use in deserialize

c130a9b

Kerollmops requested changes Jan 11, 2022

View reviewed changes

src/bitmap/sorted_u16_vec.rs Outdated Show resolved Hide resolved

src/bitmap/sorted_u16_vec.rs Outdated Show resolved Hide resolved

saik0 and others added 2 commits January 11, 2022 09:24

Update src/bitmap/sorted_u16_vec.rs

50f84f3

Co-authored-by: Clément Renault <renault.cle@gmail.com>

Update src/bitmap/sorted_u16_vec.rs

56c6eb8

Co-authored-by: Clément Renault <renault.cle@gmail.com>

create store module

dbc246e

saik0 marked this pull request as ready for review January 12, 2022 12:35

Kerollmops requested changes Jan 12, 2022

View reviewed changes

bors bot merged commit 72e1ca7 into RoaringBitmap:master Jan 12, 2022

Kerollmops mentioned this pull request Jan 13, 2022

Introduce methods that check the validity of stores when deserializing #138

Merged

This was referenced Jan 29, 2022

Tracking Issue: Implementation of an Optimized Software Library #164

Closed

Hand tune array-bitset aggregates #166

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 #127

Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 #127

saik0 commented Jan 4, 2022 •

edited

saik0 commented Jan 8, 2022

Kerollmops commented Jan 9, 2022 •

edited

saik0 commented Jan 10, 2022

saik0 commented Jan 10, 2022

Kerollmops left a comment

saik0 commented Jan 10, 2022 •

edited

saik0 commented Jan 11, 2022 •

edited

saik0 commented Jan 11, 2022

Kerollmops commented Jan 12, 2022 •

edited

saik0 commented Jan 12, 2022

Kerollmops left a comment

saik0 commented Jan 12, 2022

Kerollmops commented Jan 12, 2022

saik0 commented Jan 12, 2022

Kerollmops commented Jan 12, 2022

bors bot commented Jan 12, 2022

Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 #127

Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 #127

Conversation

saik0 commented Jan 4, 2022 • edited

Purpose

Overview

Out of scope

Notes

saik0 commented Jan 8, 2022

Kerollmops commented Jan 9, 2022 • edited

saik0 commented Jan 10, 2022

saik0 commented Jan 10, 2022

Kerollmops left a comment

Choose a reason for hiding this comment

saik0 commented Jan 10, 2022 • edited

saik0 commented Jan 11, 2022 • edited

saik0 commented Jan 11, 2022

Kerollmops commented Jan 12, 2022 • edited

saik0 commented Jan 12, 2022

Kerollmops left a comment

Choose a reason for hiding this comment

saik0 commented Jan 12, 2022

Kerollmops commented Jan 12, 2022

saik0 commented Jan 12, 2022

Kerollmops commented Jan 12, 2022

bors bot commented Jan 12, 2022

saik0 commented Jan 4, 2022 •

edited

Kerollmops commented Jan 9, 2022 •

edited

saik0 commented Jan 10, 2022 •

edited

saik0 commented Jan 11, 2022 •

edited

Kerollmops commented Jan 12, 2022 •

edited