implement batch iterators #243
Conversation
The difference is huge:
The numbers above are with a 1024-element buffer; there is investigation to be done on what the best buffer size is.
Yes, we should do this for some kind of typical OLAP use case, like computing the sum of a column of numeric values, masked by a bitmap. Even though this iteration style is faster than the status quo, it will lead to unpredictable access patterns into such a column of numbers, which will be difficult for JIT compilers to optimise. When iterating over runs, it should be possible to get good vectorised code for the sum.
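The masked-sum use case described above can be sketched as follows. The hasNext()/nextBatch(int[]) shape mirrors what this PR discusses, but ToyBatchIterator and its array-backed store are stand-ins invented for illustration, not the library's actual classes:

```java
// Toy batch iterator: drains indices from a sorted int array into a caller-supplied
// buffer, mirroring the hasNext()/nextBatch(int[]) shape discussed in this PR.
final class ToyBatchIterator {
    private final int[] indices;
    private int pos = 0;

    ToyBatchIterator(int[] indices) { this.indices = indices; }

    boolean hasNext() { return pos < indices.length; }

    // Fills as much of the buffer as possible; returns the number of indices written.
    int nextBatch(int[] buffer) {
        int n = Math.min(buffer.length, indices.length - pos);
        System.arraycopy(indices, pos, buffer, 0, n);
        pos += n;
        return n;
    }
}

public class MaskedSum {
    // Sums column[i] for every index i produced by the iterator. Because the indices
    // arrive in a flat buffer, the inner summation is a simple counted loop, rather
    // than one virtual call per element.
    static long maskedSum(long[] column, ToyBatchIterator it, int[] buffer) {
        long sum = 0;
        while (it.hasNext()) {
            int batch = it.nextBatch(buffer);
            for (int i = 0; i < batch; i++) {
                sum += column[buffer[i]];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] column = {10, 20, 30, 40, 50};
        int[] mask = {0, 2, 4};      // set positions in the "bitmap"
        int[] buffer = new int[2];   // deliberately smaller than the mask
        System.out.println(maskedSum(column, new ToyBatchIterator(mask), buffer)); // 90
    }
}
```

The irregular access into column is still there, but the buffer decouples decoding the bitmap from consuming the values, which is the point made above.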
public int next(int key, int[] buffer) {
    int consumed = 0;
    short[] data = array.content;
    while (consumed < buffer.length && index < array.getCardinality()) {
I would be tempted to send array.getCardinality() and buffer.length to final local variables, just to be safe performance-wise.
I would be surprised if this made a difference. Since cardinality is not volatile, it is permissible to store the variable in a register. The method call will very likely be inlined. Nevertheless, I will try it, and I want to give this code the best chance against the alternative approach I have in mind.
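The hoisting being discussed would look roughly like this. ToyArrayContainer is a minimal stand-in for the real array container, and the loop body (combining the container key with the low 16 bits of each short) is a guess at the iterator's intent, not the PR's exact code:

```java
// Minimal stand-in for the array container referenced in the snippet above.
final class ToyArrayContainer {
    final short[] content;
    ToyArrayContainer(short[] content) { this.content = content; }
    int getCardinality() { return content.length; }
}

public class HoistedNext {
    private final ToyArrayContainer array;
    private int index = 0;

    HoistedNext(ToyArrayContainer array) { this.array = array; }

    public int next(int key, int[] buffer) {
        int consumed = 0;
        final short[] data = array.content;
        // Hoisted once: neither value changes during the loop, and final locals
        // make it explicit that they are evaluated exactly once.
        final int limit = buffer.length;
        final int cardinality = array.getCardinality();
        while (consumed < limit && index < cardinality) {
            buffer[consumed++] = key | (data[index++] & 0xFFFF);
        }
        return consumed; // number of values written into the buffer
    }
}
```

As noted above, a JIT is usually free to do this itself when the field is not volatile and the call inlines; the final locals simply remove the doubt.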
I would be surprised if this made a difference.
So would I.
while (consumed < buffer.length) {
    while (word == 0) {
        ++wordIndex;
        if (wordIndex == 1024) {
Nevermind: I don't understand this if (wordIndex == 1024) clause. Maybe a comment is warranted.
Oh. Now I see it.
This could indeed be cleaned up.
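For the record, the 1024 bound comes from the bitmap container layout: 65536 bits stored as 1024 64-bit words, so wordIndex == 1024 means the container is exhausted. A stand-alone, commented reconstruction of that inner loop (not the PR's exact code):

```java
public class BitmapWordScan {
    // Collects the set-bit positions of a 1024-word (65536-bit) bitmap container
    // into an int array, skipping empty words the same way the snippet above does.
    static int scan(long[] words, int[] out) {
        int consumed = 0;
        int wordIndex = 0;
        long word = words[0];
        while (consumed < out.length) {
            while (word == 0) {
                ++wordIndex;
                // A bitmap container is exactly 1024 longs (1024 * 64 = 65536 bits),
                // so reaching index 1024 means every word has been consumed.
                if (wordIndex == 1024) {
                    return consumed;
                }
                word = words[wordIndex];
            }
            out[consumed++] = wordIndex * 64 + Long.numberOfTrailingZeros(word);
            word &= word - 1; // clear the lowest set bit
        }
        return consumed;
    }

    public static void main(String[] args) {
        long[] words = new long[1024];
        words[0] = 0b101L;  // bits 0 and 2
        words[3] = 1L << 7; // bit 3 * 64 + 7 = 199
        int[] out = new int[8];
        int n = scan(words, out);
        System.out.println(n + " bits: " + out[0] + ", " + out[1] + ", " + out[2]); // 3 bits: 0, 2, 199
    }
}
```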
    } else {
        cursor += chunk;
    }
} while (consumed < buffer.length && run != runs.numberOfRuns());
I'd be tempted to send buffer.length and runs.numberOfRuns() to final local variables, to ensure best performance.
public int next(int key, int[] buffer) {
    int consumed = 0;
    ShortBuffer data = array.content;
    while (consumed < buffer.length && index < array.getCardinality()) {
Same comment: I'd be tempted to send buffer.length and array.getCardinality() to final local variables.
int[] buffer = benchmarkState.buffer;
int batch = 0;
while (it.hasNext()) {
    batch = it.nextBatch(buffer);
Here I would iterate over the buffer; otherwise I think the test is not fair against standard iterators.
Agreed.
I doubt this can actually be measured unless each value is folded into an aggregate; I would expect an optimising compiler to realise that only the last element is used. I will delete these new benchmarks and start a new set of comparative benchmarks.
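Folding every value into an aggregate, as discussed here, keeps the JIT from discarding the buffer fill as dead code. A sketch of the fairer benchmark body; ToyIterator is a placeholder for the PR's batch iterator, and in a real JMH benchmark the sum would be returned or fed to a Blackhole so the aggregate itself survives:

```java
public class FairBenchmarkBody {
    // Placeholder for the PR's batch iterator: drains values from an array.
    static final class ToyIterator {
        private final int[] values;
        private int pos = 0;
        ToyIterator(int[] values) { this.values = values; }
        boolean hasNext() { return pos < values.length; }
        int nextBatch(int[] buffer) {
            int n = Math.min(buffer.length, values.length - pos);
            System.arraycopy(values, pos, buffer, 0, n);
            pos += n;
            return n;
        }
    }

    // Every value in every batch feeds the sum, so the compiler cannot prove the
    // buffer contents are unused and eliminate the fill.
    static long consume(ToyIterator it, int[] buffer) {
        long sum = 0;
        while (it.hasNext()) {
            int batch = it.nextBatch(buffer);
            for (int i = 0; i < batch; i++) {
                sum += buffer[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(consume(new ToyIterator(new int[]{1, 2, 3, 4, 5}), new int[2])); // 15
    }
}
```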
int[] buffer = benchmarkState.buffer;
int batch = 0;
while (it.hasNext()) {
    batch = it.nextBatch(buffer);
Same comment: I would iterate over the buffer; otherwise I think the test is not fair against standard iterators.
Ditto
@@ -237,6 +270,9 @@ public int testReverseFlyweight_c(BenchmarkState benchmarkState) {
@State(Scope.Benchmark)
public static class BenchmarkState {

    int[] buffer = new int[256];
That's 1kB (256 * 4) which is going to fit in L1. Good. You want the buffer to fit in the L1 cache, so 1kB is good.
Yes, this does a lot better than 1024. I think the main benefit of this approach in general is decoupling instruction chains allowing more pipelining. Maybe even 64 elements is enough to achieve this.
@richardstartin Have a look at my comments.
The new benchmark will take a couple of hours to run but should explore some of the dimensions of the problem.
This is up to 10x faster than the standard iterator.
@richardstartin Too many numbers!!! Can you select an interesting sample of results so that people can review this quickly and understand why they might want to give it a try?
Sure. The batch iterator is always at least twice as fast as the standard iterator, and that is true across a range of bitmap sizes, proportions of run, array, and bitmap containers, and a range of buffer sizes. 128-256 element buffers seem to do the best job. Depending on the contents and size of the bitmap, the batch iterator can be 10x faster. For example, a best case and worst case:

This all looks very good. Let us merge.
* implement batch iterators
* add benchmark
Sorry, I had to erase the last branch because of some of the commits. This is the same code, complete with the buffer implementation. I'm still keen to get some help benchmarking it.