partial radix sort with early exit #474

Merged


richardstartin
Member

@richardstartin richardstartin commented Apr 8, 2021

Since we're beating up on my 3-year-old radix sort, here's another variant, which has two benefits:

  1. Reduce the number of passes over the data to build the histograms by populating both at once
  2. If the maximum element in the data has an empty high byte, skip one level of the sort
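A rough sketch of these two ideas follows. It is not the merged code: the class and method names are hypothetical, and it assumes the sort orders `int` values by their upper 16 bits only, using two 8-bit levels (bits 16-23, then bits 24-31).

```java
// Hypothetical sketch, not the code from this PR.
final class PartialRadixSortSketch {

  /** Stable sort of data by its upper 16 bits (unsigned); the lower 16 bits are ignored. */
  static void partialRadixSort(int[] data) {
    int[] low = new int[257];  // counts for bits 16-23, shifted by one slot for the prefix sum
    int[] high = new int[257]; // counts for bits 24-31, shifted by one slot for the prefix sum
    // Benefit 1: a single pass over the data populates both histograms.
    for (int value : data) {
      ++low[((value >>> 16) & 0xFF) + 1];
      ++high[(value >>> 24) + 1];
    }
    // Benefit 2: if every value has an empty high byte, the second level is a no-op.
    boolean skipHigh = high[1] == data.length;
    // Turn the counts into starting offsets for each bucket.
    for (int i = 1; i < 257; ++i) {
      low[i] += low[i - 1];
      high[i] += high[i - 1];
    }
    // Linear extra space: one copy of the input.
    int[] copy = new int[data.length];
    for (int value : data) {
      copy[low[(value >>> 16) & 0xFF]++] = value;
    }
    if (skipHigh) {
      // Only one level ran, so the sorted values are in the copy.
      System.arraycopy(copy, 0, data, 0, data.length);
      return;
    }
    for (int value : copy) {
      data[high[value >>> 24]++] = value;
    }
  }
}
```

Checking `high[1]` before the prefix sum is what allows the skip to be decided without another look at the data.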

I added a benchmark which varies the number of bits in the input data. On my branch (Skylake, 2.6GHz) I get:

Benchmark              (bits)  (seed)     (size)  Mode  Cnt  Score   Error  Units
RadixSort.partialSort      23       0  100000000  avgt    5  0.507 ± 0.011   s/op
RadixSort.partialSort      25       0  100000000  avgt    5  0.600 ± 0.016   s/op

On master I get

Benchmark              (bits)  (seed)     (size)  Mode  Cnt  Score   Error  Units
RadixSort.partialSort      23       0  100000000  avgt    5  0.921 ± 0.165   s/op
RadixSort.partialSort      25       0  100000000  avgt    5  0.749 ± 0.003   s/op

I haven't dug into why the existing two-pass algorithm is sensitive to there being 9 leading zeros, but I suspect there is a dependency on the histogram's single populated bucket in the last pass.

NB: I still worry that this code isn't production-worthy because of the linear space requirement!

@richardstartin
Member Author

richardstartin commented Apr 9, 2021

For 1M elements:

branch

Benchmark              (bits)  (seed)   (size)  Mode  Cnt     Score     Error  Units
RadixSort.partialSort      23       0  1000000  avgt    5  4507.471 ±  80.487  us/op
RadixSort.partialSort      25       0  1000000  avgt    5  5615.708 ± 111.422  us/op

master

Benchmark              (bits)  (seed)   (size)  Mode  Cnt     Score    Error  Units
RadixSort.partialSort      23       0  1000000  avgt    5  8580.655 ± 91.773  us/op
RadixSort.partialSort      25       0  1000000  avgt    5  7077.933 ± 61.236  us/op

@Ignition

Ignition commented Apr 9, 2021

LGTM. Ran it locally and I see the same results. For the cost of an extra small histogram buffer, this is a nice gain.

@lemire
Member

lemire commented Apr 9, 2021

Saving a whole pass over the data (in all cases) is a really nice optimization. Of course, it slightly increases the size of the buffer memory, but spending another kilobyte is probably worth it for large inputs.

@richardstartin
Member Author

richardstartin commented Apr 9, 2021

I added some more test cases and in doing so noticed a couple more cases we can optimise for just by looking at the histograms after the first pass:

  • all of the values have the same bits in positions 16-24, so we don't need to do the first sort
  • all of the values have the same bits in positions 16-32, in which case we don't allocate the copy or do any sorting at all
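The two shortcuts above could be sketched roughly as follows. This is an illustration under assumptions rather than the merged code: the class, the `singleBucket` and `scatter` helpers, and the returned pass count are all hypothetical, and "same bits" is detected by checking whether one histogram bucket holds every element.

```java
// Hypothetical sketch, not the code from this PR.
final class EarlyExitRadixSortSketch {

  /** True if all counted values fell into one bucket, so sorting on that byte changes nothing. */
  static boolean singleBucket(int[] counts, int total) {
    for (int count : counts) {
      if (count != 0) {
        return count == total;
      }
    }
    return true; // empty input
  }

  /** Sorts by the upper 16 bits; returns how many scatter passes ran (0, 1 or 2). */
  static int partialRadixSort(int[] data) {
    int[] low = new int[256];  // histogram of bits 16-23
    int[] high = new int[256]; // histogram of bits 24-31
    for (int value : data) {
      ++low[(value >>> 16) & 0xFF];
      ++high[value >>> 24];
    }
    boolean sortLow = !singleBucket(low, data.length);
    boolean sortHigh = !singleBucket(high, data.length);
    if (!sortLow && !sortHigh) {
      return 0; // upper 16 bits are identical: no copy allocated, no sorting
    }
    int[] src = data;
    int[] dst = new int[data.length];
    int passes = 0;
    if (sortLow) {
      scatter(src, dst, low, 16);
      int[] tmp = src; src = dst; dst = tmp;
      ++passes;
    }
    if (sortHigh) {
      scatter(src, dst, high, 24);
      int[] tmp = src; src = dst; dst = tmp;
      ++passes;
    }
    if (src != data) {
      // An odd number of passes leaves the result in the scratch array.
      System.arraycopy(src, 0, data, 0, data.length);
    }
    return passes;
  }

  /** Stable counting-sort pass on the byte selected by shift. */
  static void scatter(int[] src, int[] dst, int[] counts, int shift) {
    int offset = 0;
    for (int i = 0; i < counts.length; ++i) { // exclusive prefix sum
      int count = counts[i];
      counts[i] = offset;
      offset += count;
    }
    for (int value : src) {
      dst[counts[(value >>> shift) & 0xFF]++] = value;
    }
  }
}
```

In this sketch the zero-pass case returns before the scratch array is allocated, which is the case where both the copy and the sort are avoided.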

When we can skip the copy and the sort, we get out a lot quicker:

Benchmark                                           (bits)  (seed)   (size)  Mode  Cnt     Score       Error   Units
RadixSort.partialSort                                   16       0  1000000  avgt    5  2197.522 ±   107.813   us/op
RadixSort.partialSort:·gc.alloc.rate                    16       0  1000000  avgt    5     0.494 ±     0.052  MB/sec
RadixSort.partialSort:·gc.alloc.rate.norm               16       0  1000000  avgt    5  2097.281 ±     1.588    B/op
RadixSort.partialSort:·gc.churn.G1_Eden_Space           16       0  1000000  avgt    5     1.830 ±    15.757  MB/sec
RadixSort.partialSort:·gc.churn.G1_Eden_Space.norm      16       0  1000000  avgt    5  8000.035 ± 68882.713    B/op
RadixSort.partialSort:·gc.count                         16       0  1000000  avgt    5     1.000              counts
RadixSort.partialSort:·gc.time                          16       0  1000000  avgt    5     2.000                  ms

Compared to e.g.

Benchmark                                           (bits)  (seed)   (size)  Mode  Cnt        Score         Error   Units
RadixSort.partialSort                                   24       0  1000000  avgt    5     4883.339 ±     117.287   us/op
RadixSort.partialSort:·gc.alloc.rate                    24       0  1000000  avgt    5      455.707 ±       6.660  MB/sec
RadixSort.partialSort:·gc.alloc.rate.norm               24       0  1000000  avgt    5  4002114.282 ±       0.293    B/op
RadixSort.partialSort:·gc.churn.G1_Eden_Space           24       0  1000000  avgt    5        4.996 ±       2.864  MB/sec
RadixSort.partialSort:·gc.churn.G1_Eden_Space.norm      24       0  1000000  avgt    5    43882.183 ±   25380.760    B/op
RadixSort.partialSort:·gc.churn.G1_Old_Gen              24       0  1000000  avgt    5      468.996 ±     137.111  MB/sec
RadixSort.partialSort:·gc.churn.G1_Old_Gen.norm         24       0  1000000  avgt    5  4118724.911 ± 1197359.334    B/op
RadixSort.partialSort:·gc.count                         24       0  1000000  avgt    5       29.000                counts
RadixSort.partialSort:·gc.time                          24       0  1000000  avgt    5       16.000                    ms

I also applied the suggestion to mask out rather than bound the random numbers in the benchmark for better comparability.
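For illustration, masking could look like the sketch below. The class and method names are hypothetical, and the use of `SplittableRandom` is an assumption; the benchmark in the repository may generate its data differently.

```java
import java.util.SplittableRandom;

// Hypothetical sketch of benchmark input generation, not the benchmark from this PR.
final class BenchmarkDataSketch {

  /** Generates size random ints, keeping only the lowest `bits` bits of each. */
  static int[] maskedInts(long seed, int size, int bits) {
    // Masking a full-width random int, rather than bounding it with nextInt(1 << bits),
    // means every bit count draws from the same underlying random sequence.
    int mask = bits >= 32 ? -1 : (1 << bits) - 1;
    SplittableRandom random = new SplittableRandom(seed);
    int[] data = new int[size];
    for (int i = 0; i < size; ++i) {
      data[i] = random.nextInt() & mask;
    }
    return data;
  }
}
```

With a fixed seed, the inputs for `bits = 23` and `bits = 25` then differ only in the two extra bits that the wider mask lets through.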

@richardstartin richardstartin merged commit 13b1376 into RoaringBitmap:master Apr 9, 2021
@richardstartin richardstartin deleted the early-exit-radix-sort branch April 9, 2021 21:59