partialRadixSort optimisation #472

Closed
Ignition opened this issue Apr 8, 2021 · 17 comments

Ignition commented Apr 8, 2021

I took a look at the change you made in #471, and removing the memory move wasn't enough. Don't ask me why, but you need the loop unroll to get the performance gain.

Benchmark                              (listSize)  Mode  Cnt   Score   Error  Units
PartialRadixSortBenchmark.BM_fullSort     1000000  avgt    2  65.492          ms/op
PartialRadixSortBenchmark.BM_original     1000000  avgt    2   5.586          ms/op
PartialRadixSortBenchmark.BM_new          1000000  avgt    2   5.648          ms/op
PartialRadixSortBenchmark.BM_likeNew      1000000  avgt    2   3.310          ms/op
PartialRadixSortBenchmark.BM_unroll       1000000  avgt    2   3.164          ms/op
package benchmark;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 2)
@Measurement(batchSize = 1, iterations = 2)
@Fork(1)
public class PartialRadixSortBenchmark {

	public static void originalPartialRadixSort(int[] data) {
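		// Two stable LSD counting-sort passes over bits 16..23 and then 24..31; the
		// low 16 bits are left unsorted, which is enough to group values by their
		// RoaringBitmap container key (the high 16 bits). Each pass scatters into
		// `copy` and then copies the result back into `data`.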
		final int radix = 8;

		int shift = 16;
		int mask = 0xFF0000;
		int[] copy = new int[data.length];
		int[] histogram = new int[(1 << radix) + 1];
		while (shift < 32) {
			for (int i = 0; i < data.length; ++i) {
				++histogram[((data[i] & mask) >>> shift) + 1];
			}
			for (int i = 0; i < 1 << radix; ++i) {
				histogram[i + 1] += histogram[i];
			}
			for (int i = 0; i < data.length; ++i) {
				copy[histogram[(data[i] & mask) >>> shift]++] = data[i];
			}
			System.arraycopy(copy, 0, data, 0, data.length);
			shift += radix;
			mask <<= radix;
			Arrays.fill(histogram, 0);
		}
	}

	public static void newPartialRadixSort(int[] data) {
		final int radix = 8;
		int shift = 16;
		int mask = 0xFF0000;
		int[] copy = new int[data.length];
		int[] histogram = new int[(1 << radix) + 1];
		// We want to avoid copying the data, see
		// https://github.com/RoaringBitmap/RoaringBitmap/issues/470
		int[] primary = data;
		int[] secondary = copy;
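		// Ping-pong between the two buffers instead of copying back after each pass;
		// after the two passes the buffers have been swapped twice, so the sorted
		// result ends up back in `data`.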
		while (shift < 32) {
			for (int i = 0; i < data.length; ++i) {
				++histogram[((primary[i] & mask) >>> shift) + 1];
			}
			for (int i = 0; i < 1 << radix; ++i) {
				histogram[i + 1] += histogram[i];
			}
			for (int i = 0; i < primary.length; ++i) {
				secondary[histogram[(primary[i] & mask) >>> shift]++] = primary[i];
			}
			// swap
			int[] tmp = primary;
			primary = secondary;
			secondary = tmp;
			//
			shift += radix;
			mask <<= radix;
			Arrays.fill(histogram, 0);
		}
	}

	public static void unrollPartialRadixSort(int[] primary) {
		final int radix = 8;
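		// Fully unrolled: the two passes are written out explicitly with constant
		// shift and mask values; the first pass scatters into `secondary` and the
		// second back into `primary`, so no final copy is needed.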

		final int[] secondary = new int[primary.length];
		final int[] histogram = new int[(1 << radix) + 1];
		{
			final int shift = 16;
			final int mask = 0x00FF0000;
			for (int i = 0; i < primary.length; ++i) {
				++histogram[((primary[i] & mask) >>> shift) + 1];
			}
			for (int i = 0; i < 1 << radix; ++i) {
				histogram[i + 1] += histogram[i];
			}
			for (int i = 0; i < primary.length; ++i) {
				secondary[histogram[(primary[i] & mask) >>> shift]++] = primary[i];
			}
		}
		Arrays.fill(histogram, 0);
		{
			final int shift = 24;
			final int mask = 0xFF000000;
			for (int i = 0; i < secondary.length; ++i) {
				++histogram[((secondary[i] & mask) >>> shift) + 1];
			}
			for (int i = 0; i < 1 << radix; ++i) {
				histogram[i + 1] += histogram[i];
			}
			for (int i = 0; i < secondary.length; ++i) {
				primary[histogram[(secondary[i] & mask) >>> shift]++] = secondary[i];
			}
		}
	}

	public static void likeNewPartialRadixSort(int[] data) {
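		// Same as newPartialRadixSort, but with the while loop manually unrolled
		// into its two iterations; shift and mask are still ordinary variables
		// rather than constants.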
		final int radix = 8;
		int shift = 16;
		int mask = 0xFF0000;
		int[] copy = new int[data.length];
		int[] histogram = new int[(1 << radix) + 1];
		// We want to avoid copying the data, see
		// https://github.com/RoaringBitmap/RoaringBitmap/issues/470
		int[] primary = data;
		int[] secondary = copy;
		for (int i = 0; i < data.length; ++i) {
			++histogram[((primary[i] & mask) >>> shift) + 1];
		}
		for (int i = 0; i < 1 << radix; ++i) {
			histogram[i + 1] += histogram[i];
		}
		for (int i = 0; i < primary.length; ++i) {
			secondary[histogram[(primary[i] & mask) >>> shift]++] = primary[i];
		}
		// swap
		int[] tmp = primary;
		primary = secondary;
		secondary = tmp;
		//
		shift += radix;
		mask <<= radix;
		Arrays.fill(histogram, 0);
		for (int i = 0; i < data.length; ++i) {
			++histogram[((primary[i] & mask) >>> shift) + 1];
		}
		for (int i = 0; i < 1 << radix; ++i) {
			histogram[i + 1] += histogram[i];
		}
		for (int i = 0; i < primary.length; ++i) {
			secondary[histogram[(primary[i] & mask) >>> shift]++] = primary[i];
		}
		// swap
		tmp = primary;
		primary = secondary;
		secondary = tmp;
		//
		shift += radix;
		mask <<= radix;
		Arrays.fill(histogram, 0);
	}

	@State(Scope.Benchmark)
	public static class BenchmarkState {
		@Param({/*"100", "10000",*/ "1000000"})
		public int listSize;

		private int[] origArr;
		public int[] testArr;

		@Setup(Level.Trial)
		public void setUp() {
			origArr = ThreadLocalRandom.current()
					.ints()
					.limit(listSize)
					.toArray();
			testArr = new int[origArr.length];
		}

		@Setup(Level.Invocation)
		public void perInvocation(){
			System.arraycopy(origArr, 0, testArr, 0, origArr.length);
		}
	}

	@Benchmark
	public void BM_original(Blackhole blackhole, BenchmarkState state) {
		originalPartialRadixSort(state.testArr);
		blackhole.consume(state.testArr);
	}
	@Benchmark
	public void BM_new(Blackhole blackhole, BenchmarkState state) {
		newPartialRadixSort(state.testArr);
		blackhole.consume(state.testArr);
	}

	@Benchmark
	public void BM_unroll(Blackhole blackhole, BenchmarkState state) {
		unrollPartialRadixSort(state.testArr);
		blackhole.consume(state.testArr);
	}

	@Benchmark
	public void BM_likeNew(Blackhole blackhole, BenchmarkState state) {
		likeNewPartialRadixSort(state.testArr);
		blackhole.consume(state.testArr);
	}

	@Benchmark
	public void BM_fullSort(Blackhole blackhole, BenchmarkState state) {
		Arrays.sort(state.testArr);
		blackhole.consume(state.testArr);
	}

}
@richardstartin (Member) commented

@lemire it seems you regressed my mission critical partial radix sort by about 50us. Disappointed!!!1

@Ignition it looks like a nice speedup - do you feel like submitting a PR?


Ignition commented Apr 8, 2021

Will do later today. Just out on a walk with my SO.


lemire commented Apr 8, 2021

@Ignition @richardstartin Those are not the numbers I am getting. Just one minute.


lemire commented Apr 8, 2021

@Ignition @richardstartin

A claim that the new code without the copy is slower seems extraordinary, and I don't think it holds up. Here are my numbers....

Java 8 on M1:

# Run complete. Total time: 00:01:19

Benchmark                     (listSize)  Mode  Cnt      Score    Error  Units
PartialRadixSort.BM_fullSort     1000000  avgt   10  60502.821 ± 95.842  us/op
PartialRadixSort.BM_likeNew      1000000  avgt   10   2871.702 ± 23.353  us/op
PartialRadixSort.BM_new          1000000  avgt   10   3755.863 ± 24.021  us/op
PartialRadixSort.BM_original     1000000  avgt   10   5062.976 ± 68.652  us/op
PartialRadixSort.BM_unroll       1000000  avgt   10   2871.098 ± 37.959  us/op

(the M1 has a fantastically clean performance profile, look at the errors!)

Java 12 on Skylake

PartialRadixSort.BM_fullSort     1000000  avgt   10  72124.115 ±  74.282  us/op
PartialRadixSort.BM_likeNew      1000000  avgt   10   4752.748 ± 839.942  us/op
PartialRadixSort.BM_new          1000000  avgt   10   6194.519 ± 821.995  us/op
PartialRadixSort.BM_original     1000000  avgt   10   7549.826 ± 717.538  us/op
PartialRadixSort.BM_unroll       1000000  avgt   10   4703.043 ± 818.181  us/op

(noisier than I'd like, see error margin)

Source code: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/extra/radix/java

Of course, numbers will vary depending on your system but if one thinks that removing a copy could make the code slower, one has to explain how it could happen.

It is important to include error measures in such discussions. For example, my numbers show that BM_likeNew and BM_unroll have indistinguishable performance. One might think that the Skylake numbers show that likeNew is inferior to unroll, but once you factor in the error margin, that conclusion is no longer viable. The JMH framework reports an 18% error margin, which means I cannot measure differences smaller than that (with JMH configured as it is).

I do invite a pull request, because it seems like a nice gain is possible (30%), but my best guess is that it is on top of a 35% gain, not on top of a regression. If one really thinks there was a regression, I'd like to see more evidence.

It would be a nice touch to contribute this benchmarking code to our JMH tests.

@richardstartin (Member) commented

> Of course, numbers will vary depending on your system but if one thinks that removing a copy could make the code slower, one has to explain how it could happen.

I agree. It's also suspicious when there are no error bars, which means there weren't enough iterations to know whether the measurement reached steady state.
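For reference, the ± error columns in the runs above come from JMH collecting enough measurement iterations to estimate a confidence interval; a configuration along these lines (the iteration counts are illustrative, not the exact settings used) is what produces the Cnt 10 rows:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)       // give the JIT time to reach steady state
@Measurement(iterations = 10) // enough samples for JMH to report an error estimate
@Fork(1)
public class PartialRadixSortBenchmark {
	// ... @Benchmark methods as in the listing above ...
}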


lemire commented Apr 8, 2021

I get better results with this simpler code....

		final int radix = 8;
		int shift = 16;
		int[] copy = new int[data.length];
		int[] histogram = new int[(1 << radix) + 1];
		// We want to avoid copying the data, see
		// https://github.com/RoaringBitmap/RoaringBitmap/issues/470
		int[] primary = data;
		int[] secondary = copy;
		while (shift < 32) {
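			// shifting first and masking with a constant 0xFF keeps the bucket index
			// in [0, 255] on every pass, so a variable mask is no longer needed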
			for (int i = 0; i < data.length; ++i) {
				++histogram[((primary[i] >>> shift) & 0xFF) + 1];
			}
			for (int i = 0; i < 1 << radix; ++i) {
				histogram[i + 1] += histogram[i];
			}
			for (int i = 0; i < primary.length; ++i) {
				secondary[histogram[(primary[i] >>> shift) & 0xFF]++] = primary[i];
			}
			// swap
			int[] tmp = primary;
			primary = secondary;
			secondary = tmp;
			//
			shift += radix;
			if(shift < 32) { Arrays.fill(histogram, 0); }
		}	

It matches and maybe beats the suggested unrolled versions.

Benchmark                     (listSize)  Mode  Cnt      Score     Error  Units
PartialRadixSort.BM_fullSort     1000000  avgt   10  60647.667 ± 160.573  us/op
PartialRadixSort.BM_likeNew      1000000  avgt   10   3148.308 ±  57.290  us/op
PartialRadixSort.BM_new          1000000  avgt   10   4070.347 ±  73.890  us/op
PartialRadixSort.BM_newnew       1000000  avgt   10   2895.875 ±  28.825  us/op
PartialRadixSort.BM_original     1000000  avgt   10   5355.076 ±  78.458  us/op
PartialRadixSort.BM_unroll       1000000  avgt   10   3148.585 ±  60.783  us/op


Ignition commented Apr 8, 2021

Fixing the mask: that is a very nice trick.


lemire commented Apr 8, 2021

@Ignition I'd love to see even faster code!!!

#473

My intuition here is that the unrolling allows the compiler to figure out that it does not need to do bounds checking, whereas the single loop makes Java shy about bounds-check elision. I could be wrong, but my mental model seems to match the numbers I am seeing.
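A side-by-side of the two loop shapes that intuition is about; whether the bounds check actually disappears would need to be confirmed with something like -XX:+PrintAssembly or JITWatch, so treat this as a sketch of the hypothesis rather than observed JIT behaviour:

	// (a) variable shift/mask, as in the single-loop versions: the JIT has a
	// harder time proving ((x & mask) >>> shift) + 1 < histogram.length, so the
	// histogram write may keep its bounds check
	static void histogramVariable(int[] data, int[] histogram, int mask, int shift) {
		for (int i = 0; i < data.length; ++i) {
			++histogram[((data[i] & mask) >>> shift) + 1];
		}
	}

	// (b) constant shift and 0xFF mask, as in the unrolled and "newnew" versions:
	// the index is provably in [1, 256] and histogram.length is 257, so the
	// bounds check can in principle be elided
	static void histogramConstant(int[] data, int[] histogram) {
		for (int i = 0; i < data.length; ++i) {
			++histogram[((data[i] >>> 16) & 0xFF) + 1];
		}
	}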

@richardstartin (Member) commented

The elephant in the room is the copy of the 100M-element array (381 MB), which is going to catch up with you pretty quickly here.


lemire commented Apr 8, 2021

@richardstartin If you are ingesting data in blocks of 381 MB, you can probably help matters greatly with some re-engineering where you break your blocks into smaller chunks.


Ignition commented Apr 8, 2021

@lemire unroll still improves performance on my machine even with the hardcoded mask. This time I did enough runs to get an error range.

Benchmark                                  (listSize)  Mode  Cnt  Score   Error  Units
PartialRadixSortBenchmark.BM_newNew           1000000  avgt   10  3.346 ± 0.042  ms/op
PartialRadixSortBenchmark.BM_newNewUnroll     1000000  avgt   10  3.082 ± 0.050  ms/op
PartialRadixSortBenchmark.BM_unroll           1000000  avgt   10  3.052 ± 0.044  ms/op

(OpenJDK 11 on macOS with an Intel i9 (Coffee Lake))

I have no idea whether this perf difference comes from the runtime or the processor architecture.

As @richardstartin says, we are hitting the limit: at a minimum you have to copy all the data between buffers and back again. It's not clear what the best way to avoid that is.

public static void newnewUnrollPartialRadixSort(int[] primary) {
	final int radix = 8;
	int[] secondary = new int[primary.length];
	int[] histogram = new int[(1 << radix) + 1];
	{
		final int shift = 16;
		for (int i = 0; i < primary.length; ++i) {
			++histogram[((primary[i] >>> shift) & 0xFF) + 1];
		}
		for (int i = 0; i < 1 << radix; ++i) {
			histogram[i + 1] += histogram[i];
		}
		for (int i = 0; i < primary.length; ++i) {
			secondary[histogram[(primary[i] >>> shift) & 0xFF]++] = primary[i];
		}
	}
	Arrays.fill(histogram, 0);
	{
		final int shift = 24;
		for (int i = 0; i < secondary.length; ++i) {
			++histogram[((secondary[i] >>> shift) & 0xFF) + 1];
		}
		for (int i = 0; i < 1 << radix; ++i) {
			histogram[i + 1] += histogram[i];
		}
		for (int i = 0; i < secondary.length; ++i) {
			primary[histogram[(secondary[i] >>> shift) & 0xFF]++] = secondary[i];
		}
	}
}


lemire commented Apr 8, 2021

I have added your new code to my benchmark and duplicated some functions (added the 2 suffix) so we can see the effect of noise...

Apple M1 / Java 8

PartialRadixSort.BM_fullSort          1000000  avgt   10  60200.995 ±  44.614  us/op
PartialRadixSort.BM_likeNew           1000000  avgt   10   2867.208 ±  56.601  us/op
PartialRadixSort.BM_new               1000000  avgt   10   4167.666 ±  58.935  us/op
PartialRadixSort.BM_newnew            1000000  avgt   10   2667.998 ±  11.424  us/op
PartialRadixSort.BM_newnew2           1000000  avgt   10   2650.875 ±  18.746  us/op
PartialRadixSort.BM_newnewUnroll      1000000  avgt   10   2994.352 ± 177.318  us/op
PartialRadixSort.BM_newnewUnroll2     1000000  avgt   10   3313.587 ± 439.159  us/op
PartialRadixSort.BM_original          1000000  avgt   10   5099.736 ±  71.129  us/op
PartialRadixSort.BM_unroll            1000000  avgt   10   3015.358 ± 499.114  us/op
PartialRadixSort.BM_unroll2           1000000  avgt   10   3084.749 ± 284.166  us/op

Skylake / Java 12

Benchmark                          (listSize)  Mode  Cnt      Score     Error  Units
PartialRadixSort.BM_fullSort          1000000  avgt   10  79575.113 ±  10.545  us/op
PartialRadixSort.BM_likeNew           1000000  avgt   10   4712.443 ± 839.045  us/op
PartialRadixSort.BM_new               1000000  avgt   10   6202.734 ± 816.924  us/op
PartialRadixSort.BM_newnew            1000000  avgt   10   4799.698 ± 820.412  us/op
PartialRadixSort.BM_newnew2           1000000  avgt   10   4798.734 ± 833.117  us/op
PartialRadixSort.BM_newnewUnroll      1000000  avgt   10   4744.756 ± 841.817  us/op
PartialRadixSort.BM_newnewUnroll2     1000000  avgt   10   4722.411 ± 827.029  us/op
PartialRadixSort.BM_original          1000000  avgt   10   7893.155 ±  33.460  us/op
PartialRadixSort.BM_unroll            1000000  avgt   10   4710.394 ± 815.227  us/op
PartialRadixSort.BM_unroll2           1000000  avgt   10   4710.788 ± 822.601  us/op

The clear conclusion you can draw is that "_new" is better than the original, and that everything else we tried is better than "_new".

It could be interesting to investigate the issue further using a matrix of JDKs, a matrix of different systems, and varied inputs... but you are hitting diminishing returns.

See code at
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/extra/radix/java

> It's not clear what the best way to avoid that is.

I recommend not adding 100M elements in bulk... add smaller chunks.

Once you are dealing with 100M elements, you are outside the CPU cache; there is no helping that.

@richardstartin (Member) commented

The reason I'm raising the cost of the copy is that I think it should be avoidable to have already materialised 100M unsorted integers before adding them to the bitmap. The spirit of the Writer API, the way it's used in the system I wrote it for, and in Druid and Pinot, is that row IDs are monotonic and appended to the bitmap almost as soon as they are generated. So you just repeatedly call writer.add, and when a multiple of 2^16 is crossed a container is appended to the bitmap. This is good because you don't have to locate the container for each added value.

If you have large, unsorted data and want to chunk it up, the first chunk of inserts will all be appends, but the second and subsequent batches will revert to container location, so the cost of the inserts increases. This was the motivation for the partial sort: it is cheaper than a full sort, but it still allows containers to be appended rather than located and mutated.

If you have data large enough for this copy to be a problem, then the use case is kind of degenerate and not what the Writer API was designed for. So what I'm getting at is do we want to optimise much for huge pre-materialised arrays of integers?
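A minimal sketch of that streaming pattern, assuming the org.roaringbitmap.RoaringBitmapWriter API (the rowCount and matches parameters are hypothetical stand-ins for wherever the row IDs come from):

import java.util.function.IntPredicate;
import org.roaringbitmap.RoaringBitmap;
import org.roaringbitmap.RoaringBitmapWriter;

public final class StreamingAppendExample {
	// Append monotonic row IDs as they are generated instead of materialising a
	// huge unsorted array first; containers are appended as 2^16 boundaries are
	// crossed, so no per-value container lookup is needed.
	static RoaringBitmap build(int rowCount, IntPredicate matches) {
		RoaringBitmapWriter<RoaringBitmap> writer = RoaringBitmapWriter.writer().get();
		for (int rowId = 0; rowId < rowCount; rowId++) {
			if (matches.test(rowId)) {
				writer.add(rowId);
			}
		}
		return writer.get(); // flushes the current container and returns the bitmap
	}
}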


lemire commented Apr 8, 2021

> If you have data large enough for this copy to be a problem, then the use case is kind of degenerate and not what the Writer API was designed for. So what I'm getting at is do we want to optimise much for huge pre-materialised arrays of integers?

If you have huge arrays (say 100M integers), then you almost surely want to do something entirely different.

The most obvious solution would be to parallelize. So fragment your large array into N subarrays, where N depends in part on the number of CPUs you have, then sort each of them (each on its own core) and merge back. There are radix-sort versions of this approach.

We could provide a really fast parallelized sort function if someone wants to write it, but I don't think it should be tightly integrated with Roaring.
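A rough sketch of that fragment / sort-per-core / merge idea (the chunking policy and names are just for illustration, and a plain Arrays.sort stands in for whatever per-chunk sort you prefer; java.util.Arrays.parallelSort is a built-in variant of the same strategy):

import java.util.Arrays;
import java.util.PriorityQueue;
import java.util.stream.IntStream;

public final class ParallelChunkSort {
	// Sort fixed-size chunks in parallel, then k-way merge them back together.
	static void sort(int[] data) {
		int chunks = Runtime.getRuntime().availableProcessors();
		int chunkSize = (data.length + chunks - 1) / chunks;

		// sort each chunk on its own core
		IntStream.range(0, chunks).parallel().forEach(c -> {
			int from = c * chunkSize;
			int to = Math.min(data.length, from + chunkSize);
			if (from < to) {
				Arrays.sort(data, from, to);
			}
		});

		// k-way merge of the sorted chunks into a scratch buffer
		int[] out = new int[data.length];
		int[] cursor = new int[chunks];
		int[] limit = new int[chunks];
		for (int c = 0; c < chunks; c++) {
			cursor[c] = Math.min(data.length, c * chunkSize);
			limit[c] = Math.min(data.length, cursor[c] + chunkSize);
		}
		// chunk indices ordered by each chunk's current head value; a chunk is
		// only in the queue while it still has unmerged elements
		PriorityQueue<Integer> heads =
				new PriorityQueue<>((a, b) -> Integer.compare(data[cursor[a]], data[cursor[b]]));
		for (int c = 0; c < chunks; c++) {
			if (cursor[c] < limit[c]) {
				heads.add(c);
			}
		}
		int i = 0;
		while (!heads.isEmpty()) {
			int c = heads.poll();
			out[i++] = data[cursor[c]++];
			if (cursor[c] < limit[c]) {
				heads.add(c);
			}
		}
		System.arraycopy(out, 0, data, 0, data.length);
	}
}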

@richardstartin (Member) commented

@Ignition please try the implementation here: #474


Ignition commented Apr 9, 2021

I get the point that partialRadixSort shouldn't be used like this in production, particularly with my 100M array example. In my case the code I'm refactoring is a localised change, converting from one representation into Roaring, so at the moment I can't guarantee it's ordered, and so the construction cost dominates. Eventually I hope to move Roaring to other areas of the code and avoid converting representations (and hopefully eliminate the cases which are ordering dependent).

Roaring can be materialised in many different ways; my refactoring of code that was never designed for Roaring should not be a reason to implement new approaches to sorting in Roaring.

@lemire @richardstartin Thank you for looking at partialRadixSort this week; even if it is not a priority, I hope it was at least fun and interesting.

@richardstartin (Member) commented

@Ignition if we give up on stability (which we gain nothing from) we can get rid of the copy, in which case this code could eventually be used safely in production.
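For what it's worth, the unstable route looks roughly like a single in-place counting pass per byte (the "American flag sort" permutation step), applied first to the top byte and then, with from/to bounds, to each resulting bucket for the next byte, instead of relying on stability across LSD passes. A sketch of that idea only, not the code in #474:

public final class InPlaceRadixPass {
	// One unstable, in-place counting pass over data[from, to) on the byte
	// selected by `shift`: elements are swapped directly into their bucket's
	// region, so no secondary buffer (and no copy) is needed. For a partial
	// sort of the top 16 bits, run it with shift = 24 over the whole array,
	// then with shift = 16 over each non-empty bucket.
	static void inPlacePass(int[] data, int from, int to, int shift) {
		int[] count = new int[256];
		for (int i = from; i < to; i++) {
			++count[(data[i] >>> shift) & 0xFF];
		}
		int[] next = new int[256]; // next free slot in each bucket
		int[] end = new int[256];  // exclusive end of each bucket
		for (int b = 0, offset = from; b < 256; b++) {
			next[b] = offset;
			offset += count[b];
			end[b] = offset;
		}
		for (int b = 0; b < 256; b++) {
			while (next[b] < end[b]) {
				int value = data[next[b]];
				int bucket = (value >>> shift) & 0xFF;
				if (bucket == b) {
					next[b]++; // already where it belongs
				} else {
					// move it to the front of its own bucket and re-examine the
					// displaced element on the next iteration
					int dest = next[bucket]++;
					data[next[b]] = data[dest];
					data[dest] = value;
				}
			}
		}
	}
}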

Anyway, good luck with your migration and please come back with more fun problems soon!

Ignition closed this as completed on Jun 5, 2023