partialRadixSort optimisation #472
Comments
Will do later today. Just out on a walk with my SO. |
@Ignition @richardstartin Those are not the numbers I am getting. Just one minute. |
A claim that the new code without the copy is slower seems extraordinary, and I don't think it holds up. Here are my numbers....

Java 8 on M1: (the M1 has a fantastically clean performance profile, look at the errors!)

Java 12 on Skylake: (noisier than I'd like, see the error margin)

Source code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/extra/radix/java

Of course, numbers will vary depending on your system, but if one thinks that removing a copy could make the code slower, one has to explain how that could happen. It is important to include error measures in such discussions. For example, my numbers show that BM_likeNew and BM_unroll have indistinguishable performance. One might think, for example, that the Skylake numbers show that likeNew is inferior to unroll, but once you factor in the error margin, that conclusion is no longer tenable. The JMH framework reports an 18% error margin, which means that I cannot measure differences smaller than that (with JMH configured as it is). I do invite a pull request, because it seems like a nice gain is possible (30%), but my best guess is that it is on top of a 35% gain, not on top of a regression. If one really thinks that there was a regression, I'd like to see more evidence. It would be a nice touch to contribute this benchmarking code to our jmh tests. |
I agree. It's also suspicious when there are no error bars: that means there weren't enough iterations to know whether the measurement hit steady state. |
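For context, a minimal JMH harness of the kind being discussed might look like the sketch below. The class and benchmark names are illustrative, and partialRadixSort stands in for whichever variant is under test; enough measurement iterations (and more than one fork) are what produce the error margins mentioned above.

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)        // let the JIT reach steady state first
@Measurement(iterations = 10)  // enough samples to report an error margin
@Fork(2)                       // separate JVMs expose run-to-run variance
public class PartialRadixSortBench {
  @Param({"1000000"})
  int size;

  int[] source;
  int[] work;

  @Setup(Level.Trial)
  public void setup() {
    source = new Random(42).ints(size).toArray();
    work = new int[size];
  }

  @Benchmark
  public int[] likeNew() {
    System.arraycopy(source, 0, work, 0, size); // same input every invocation
    partialRadixSort(work);                     // the variant under test (defined elsewhere)
    return work;                                // returning defeats dead-code elimination
  }
}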
I get better results with this simpler code....

import java.util.Arrays;

// method signature added here for context; the original comment shows only the body
static void partialRadixSort(int[] data) {
  final int radix = 8;
  int shift = 16;
  int[] copy = new int[data.length];
  int[] histogram = new int[(1 << radix) + 1];
  // We want to avoid copying the data, see
  // https://github.com/RoaringBitmap/RoaringBitmap/issues/470
  int[] primary = data;
  int[] secondary = copy;
  while (shift < 32) {
    // count occurrences of each byte value at the current shift
    for (int i = 0; i < data.length; ++i) {
      ++histogram[((primary[i] >>> shift) & 0xFF) + 1];
    }
    // prefix sum turns the counts into bucket start offsets
    for (int i = 0; i < 1 << radix; ++i) {
      histogram[i + 1] += histogram[i];
    }
    // scatter each element into its bucket in the other buffer
    for (int i = 0; i < primary.length; ++i) {
      secondary[histogram[(primary[i] >>> shift) & 0xFF]++] = primary[i];
    }
    // swap buffers; after both passes, primary points back at data
    int[] tmp = primary;
    primary = secondary;
    secondary = tmp;
    shift += radix;
    if (shift < 32) { Arrays.fill(histogram, 0); }
  }
}

It matches and maybe beats the suggested unrolled versions.
|
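To sanity-check any of these variants, a small helper (hypothetical, not from the thread) can assert the partial-sort contract: the top 16 bits must end up nondecreasing and the output must be a permutation of the input.

import java.util.Arrays;

// Hypothetical helper: verifies the contract of a partial radix sort
// on the upper 16 bits.
static void checkPartialSort(int[] before, int[] after) {
  for (int i = 1; i < after.length; ++i) {
    if ((after[i - 1] >>> 16) > (after[i] >>> 16)) {
      throw new AssertionError("upper 16 bits not sorted at index " + i);
    }
  }
  int[] a = before.clone();
  int[] b = after.clone();
  Arrays.sort(a);
  Arrays.sort(b);
  if (!Arrays.equals(a, b)) {
    throw new AssertionError("output is not a permutation of the input");
  }
}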
Fixing the mask, that is a very nice trick. |
@Ignition I'd love to see even faster code!!! My intuition here is that the unrolling allows the compiler to figure out that it does not need to do bounds checking, whereas the single loop makes Java shy about bounds-check elision. I could be wrong, but my mental model seems to match what I am seeing in terms of numbers. |
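For what it's worth, here is a sketch of the distinction that intuition points at; whether HotSpot's C2 actually elides a given check depends on the JDK version and is not guaranteed.

// Canonical shape the JIT can usually prove safe: the loop bound is the
// array's own length, so per-element range checks can be eliminated.
static long sumAll(int[] a) {
  long sum = 0;
  for (int i = 0; i < a.length; ++i) {
    sum += a[i];
  }
  return sum;
}

// Here the store index is data-dependent (it comes out of the histogram),
// so the JIT cannot prove it is in range and a check remains on every write.
static void scatter(int[] src, int[] dst, int[] offsets, int shift) {
  for (int i = 0; i < src.length; ++i) {
    dst[offsets[(src[i] >>> shift) & 0xFF]++] = src[i];
  }
}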
The elephant in the room is the copy of the 100M element array (381 MB), which is going to catch up with you pretty quickly here. |
@richardstartin If you are ingesting data in blocks of 381 MB, you can probably help matters greatly with some re-engineering: break your input into smaller chunks. |
@lemire unroll still improves performance on my machine even with the hardcoded mask. This time I did enough runs to get an error range. (OpenJDK 11 on macOS with an Intel i9 (Coffee Lake).) I have no idea whether this perf difference comes from the runtime or the processor architecture. As @richardstartin says, we are hitting the limit: at a minimum you have to copy all the data between buffers and back again. It is not clear what the best way to avoid that is.

public static void newnewUnrollPartialRadixSort(int[] primary) {
  final int radix = 8;
  int[] secondary = new int[primary.length];
  int[] histogram = new int[(1 << radix) + 1];
  { // first pass: sort from primary into secondary on bits 16-23
    final int shift = 16;
    for (int i = 0; i < primary.length; ++i) {
      ++histogram[((primary[i] >>> shift) & 0xFF) + 1];
    }
    for (int i = 0; i < 1 << radix; ++i) {
      histogram[i + 1] += histogram[i];
    }
    for (int i = 0; i < primary.length; ++i) {
      secondary[histogram[(primary[i] >>> shift) & 0xFF]++] = primary[i];
    }
  }
  Arrays.fill(histogram, 0);
  { // second pass: sort from secondary back into primary on bits 24-31
    final int shift = 24;
    for (int i = 0; i < secondary.length; ++i) {
      ++histogram[((secondary[i] >>> shift) & 0xFF) + 1];
    }
    for (int i = 0; i < 1 << radix; ++i) {
      histogram[i + 1] += histogram[i];
    }
    for (int i = 0; i < secondary.length; ++i) {
      primary[histogram[(secondary[i] >>> shift) & 0xFF]++] = secondary[i];
    }
  }
}
|
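A usage sketch, assuming the hypothetical checkPartialSort helper from earlier:

import java.util.Random;

public static void main(String[] args) {
  int[] data = new Random(42).ints(1_000_000).toArray();
  int[] before = data.clone();
  newnewUnrollPartialRadixSort(data);
  checkPartialSort(before, data); // top 16 bits sorted, content preserved
}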
I have added your new code to my benchmark and duplicated some functions (with a 2 suffix added) so we can see the effect of noise...

Apple M1 / Java 8

Skylake / Java 12

The clear conclusion you can draw is that "_new" was better than the original, and that everything else we tried is better than "_new". It could be interesting to investigate further with a matrix of JDKs, a matrix of different systems, and varied inputs... but you are hitting diminishing returns. See code at

I recommend not adding 100M elements in bulk... add smaller chunks. Once you are dealing with 100M elements, you are outside the CPU cache; there is no helping that. |
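A sketch of what chunked ingestion could look like, assuming RoaringBitmap's addN(int[] dat, int offset, int n) bulk-add method; the chunk size is illustrative.

import org.roaringbitmap.RoaringBitmap;

static RoaringBitmap ingestInChunks(int[] values) {
  final int CHUNK = 1 << 16; // 64K ints (~256KB), small enough to stay cache-resident
  RoaringBitmap bitmap = new RoaringBitmap();
  for (int begin = 0; begin < values.length; begin += CHUNK) {
    int n = Math.min(CHUNK, values.length - begin);
    bitmap.addN(values, begin, n); // bulk-add one cache-friendly chunk at a time
  }
  return bitmap;
}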
The reason I'm raising the cost of the copy is that I think having already materialised 100M unsorted integers before adding them to the bitmap should be avoidable. The spirit of the

If you have large, unsorted data and want to chunk it up, the first chunk of inserts will all be appends, but the second and subsequent batches will revert to container location, so the cost of the inserts increases. This was the motivation for doing the partial sort: cheaper than a full sort, but it allows containers to be appended rather than located and mutated. If you have data large enough for this copy to be a problem, then the use case is kind of degenerate and not what the |
If you have huge arrays (say 100M integers), then you almost surely want to do something entirely different. The most obvious solution would be to parallelize: fragment your large array into N subarrays, where N depends in part on the number of CPUs you have, then sort each of them (each on its own core), then merge back. There are radix-sort versions of this approach. We could provide a really fast parallelized sort function if someone wants to write it, but I don't think it should be tightly integrated with Roaring. |
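For a full sort, the stock JDK already ships this split-sort-merge strategy as Arrays.parallelSort, which runs on the fork/join common pool (note it is a parallel merge sort, not a radix sort):

import java.util.Arrays;
import java.util.Random;

public class ParallelSortDemo {
  public static void main(String[] args) {
    // ~400MB of ints: far outside any CPU cache, so parallelism pays off
    int[] huge = new Random(1).ints(100_000_000).toArray();
    Arrays.parallelSort(huge); // sorts fragments on separate cores, then merges
  }
}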
I get the point that Roaring can be materialised in many different ways; my refactoring of code that was never designed for Roaring should not be a reason to implement new approaches to sorting in Roaring. @lemire @richardstartin Thank you for looking at |
@Ignition if we give up on stability (which we don't gain from) we can get rid of the copy, in which case this code could eventually be used in production safely. Anyway, good luck with your migration and please come back with more fun problems soon! |
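One way to cash that in (a sketch, not code from this thread) is an American-flag-style in-place permutation over the top 16 bits: it trades the full-size secondary buffer for roughly 512KB of bucket bookkeeping, at the cost of stability.

import java.util.Arrays;

// Sketch: unstable, in-place partial radix sort on the upper 16 bits.
static void inPlacePartialRadixSort(int[] data) {
  final int buckets = 1 << 16;
  int[] end = new int[buckets + 1]; // after the prefix sum, end[b + 1] is the exclusive end of bucket b
  for (int v : data) {
    ++end[(v >>> 16) + 1];
  }
  for (int b = 0; b < buckets; ++b) {
    end[b + 1] += end[b];
  }
  int[] next = Arrays.copyOf(end, buckets); // next free slot in each bucket
  for (int b = 0; b < buckets; ++b) {
    while (next[b] < end[b + 1]) {
      int v = data[next[b]];
      int target = v >>> 16;
      if (target == b) {
        ++next[b];                          // already in its home bucket
      } else {
        data[next[b]] = data[next[target]]; // pull a misplaced value back here
        data[next[target]++] = v;           // and send v to its home bucket
      }
    }
  }
}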
I took a look at the change you did in #471; removing the memory move wasn't enough. Don't ask me why, but you need the loop unroll to get the performance gain.