Optimisation: Add zero-garbage deserialiser for ByteBuffer to RoaringBitmap #650

Merged
lemire merged 7 commits into RoaringBitmap:master on Sep 2, 2023

Contributor

@shikharid shikharid commented Aug 7, 2023

SUMMARY

  • There is no direct way to convert a byte array or a ByteBuffer (representing an uncompressed bitmap) to a RoaringBitmap
  • The best way is to create a BitSet first and then use: BitSetUtil.bitmapOf(bitset)
  • This creates unnecessary heap garbage as, essentially, a full copy of the byte array is created for BitSet
  • This PR introduces a method in BitSetUtil that can be used to directly convert a ByteBuffer to RoaringBitmap
  • The implementation tries to do it with constant and minimum possible memory allocation
  • This is very useful for performance-sensitive code that does this conversion very frequently. A RoaringBitmap can now be created directly from bytes read off the wire with almost no unnecessary memory allocation
  • For testing, BitSetUtil tests are replicated and benchmarks are added
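
To make the contrast in the summary concrete, here is a minimal, JDK-only sketch of the two read patterns. It is not the code added by this PR: `WordBlockReader`, `cardinalityViaBitSet` and `cardinalityDirect` are hypothetical names, the block size of 1024 words is assumed from the `BLOCK_LENGTH` comment in the diff, and the sketch only counts set bits instead of building containers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.BitSet;

final class WordBlockReader {
  // 1024 words * 64 bits = 65536 bits per block (value assumed from the diff comment).
  static final int BLOCK_LENGTH = 1024;

  // Existing route: BitSet.valueOf copies the bytes into a fresh long[] first.
  static long cardinalityViaBitSet(byte[] bytes) {
    return BitSet.valueOf(bytes).cardinality();
  }

  // Direct route: read little-endian words block by block into one reusable buffer.
  static long cardinalityDirect(ByteBuffer source, long[] block) {
    if (block.length == 0) {
      throw new IllegalArgumentException("block buffer must not be empty");
    }
    ByteBuffer bb = source.slice().order(ByteOrder.LITTLE_ENDIAN);
    long total = 0;
    while (bb.hasRemaining()) {
      int n = 0;
      while (n < block.length && bb.remaining() >= Long.BYTES) {
        block[n++] = bb.getLong();
      }
      if (n < block.length && bb.hasRemaining()) {
        // Fewer than 8 bytes remain: assemble the final partial word by hand.
        long word = 0;
        for (int shift = 0; bb.hasRemaining(); shift += 8) {
          word |= (bb.get() & 0xFFL) << shift;
        }
        block[n++] = word;
      }
      for (int i = 0; i < n; i++) {
        total += Long.bitCount(block[i]);
      }
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] bytes = {0x0F, 0x00, 0x01};
    long[] block = new long[BLOCK_LENGTH];
    System.out.println(cardinalityViaBitSet(bytes));
    System.out.println(cardinalityDirect(ByteBuffer.wrap(bytes), block));
  }
}
```

In this sketch the only per-call allocation on the direct route is the sliced `ByteBuffer` view; the `long[]` block is supplied by the caller and can be reused across calls.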

RESULTS

  • Benchmark Results are posted here
  • TL;DR
  • Average time is up to 10-20% lower than the existing approach as we go from small to larger bitsets
  • GC pressure is about 3-4x lower across all sizes (by gc.alloc.rate.norm in the posted results)

Automated Checks

  • I have run ./gradlew test and made sure that my PR does not break any unit test.
  • I have run ./gradlew checkstyleMain or the equivalent and corrected the formatting warnings reported.

- existing most performant way was to convert it to a BitSet and then use BitSetUtil
- this adds a helper which you can use to get a RoaringBitmap directly from the byte array you read on the wire
@shikharid
Contributor Author

shikharid commented Aug 7, 2023

Benchmark Setup

  • JMH version: 1.23
  • VM version: JDK 1.8.0_342, OpenJDK 64-Bit Server VM, 25.342-b07
  • VM invoker: ***/Java/JavaVirtualMachines/liberica-1.8.0_342/jre/bin/java
  • OS: macOS Ventura 13.2.1
  • Arch: Apple M1 Max (32 GB, aarch64)
  • JMH Threads: 1
  • Warmup: 30 sec (6 rounds of 5 sec)
  • Measure: 60 sec (6 rounds of 10 sec)
  • Forks: 1
Show Results

Small bitsets, wordSize = 64 represents 4096 bits (512 bytes)

Benchmark                                                                        Mode  Cnt           Score           Error   Units
BitSetUtilBenchmark.ByteArrayToBitsetToRoaring                                   avgt    6      926566.872 ±      1418.564   us/op
BitSetUtilBenchmark.ByteArrayToRoaring                                           avgt    6      817612.465 ±      3338.594   us/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.alloc.rate                    avgt    6        2911.060 ±         5.388  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.alloc.rate                            avgt    6         913.190 ±         3.108  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.alloc.rate.norm               avgt    6  2969120079.273 ±         0.001    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.alloc.rate.norm                       avgt    6   820106009.231 ±         0.001    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Eden_Space           avgt    6        2909.076 ±       252.130  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Eden_Space                   avgt    6         894.322 ±       213.054  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Eden_Space.norm      avgt    6  2967120554.667 ± 259843058.204    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Eden_Space.norm              avgt    6   803141999.590 ± 190543693.672    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Survivor_Space       avgt    6           0.027 ±         0.027  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Survivor_Space               avgt    6           0.010 ±         0.011  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Survivor_Space.norm  avgt    6       27306.667 ±     27663.142    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Survivor_Space.norm          avgt    6        9242.256 ±      9657.104    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.count                         avgt    6          93.000                  counts
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.count                                 avgt    6          39.000                  counts

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.time                          avgt    6         310.000                      ms
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.time                                  avgt    6         142.000                      ms

Medium-sized bitsets, wordSize = 512 represents 32768 bits (4 KB)

Benchmark                                                                        Mode  Cnt           Score           Error   Units
BitSetUtilBenchmark.ByteArrayToBitsetToRoaring                                   avgt    6     1014957.615 ±      2676.383   us/op
BitSetUtilBenchmark.ByteArrayToRoaring                                           avgt    6      874951.650 ±     28685.135   us/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.alloc.rate                    avgt    6        2535.867 ±         7.028  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.alloc.rate                            avgt    6         688.141 ±        21.026  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.alloc.rate.norm               avgt    6  2833754699.733 ±         3.663    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.alloc.rate.norm                       avgt    6   661747620.000 ±         0.001    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Eden_Space           avgt    6        2536.466 ±       274.322  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Eden_Space                   avgt    6         680.573 ±        41.338  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Eden_Space.norm      avgt    6  2834388309.333 ± 303144939.109    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Eden_Space.norm              avgt    6   654435214.222 ±  27481301.088    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Survivor_Space       avgt    6           0.022 ±         0.019  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Survivor_Space               avgt    6           0.009 ±         0.012  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Survivor_Space.norm  avgt    6       24576.000 ±     20751.140    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Survivor_Space.norm          avgt    6        8647.111 ±     11271.239    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.count                         avgt    6          81.000                  counts
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.count                                 avgt    6          30.000                  counts

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.time                          avgt    6         292.000                      ms
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.time                                  avgt    6         137.000                      ms

Large bitsets, wordSize = 8192 represents 524288 bits (64 KB)

Benchmark                                                                        Mode  Cnt           Score           Error   Units
BitSetUtilBenchmark.ByteArrayToBitsetToRoaring                                   avgt    6      979256.137 ±      6679.003   us/op
BitSetUtilBenchmark.ByteArrayToRoaring                                           avgt    6      833433.943 ±       884.490   us/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.alloc.rate                    avgt    6        2847.125 ±        19.011  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.alloc.rate                            avgt    6         997.572 ±         5.987  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.alloc.rate.norm               avgt    6  3061785079.758 ±         3.330    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.alloc.rate.norm                       avgt    6   913781810.154 ±         4.010    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Eden_Space           avgt    6        2829.040 ±       250.767  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Eden_Space                   avgt    6        1005.457 ±       286.823  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Eden_Space.norm      avgt    6  3042467095.273 ± 282401987.966    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Eden_Space.norm              avgt    6   920915157.821 ± 259584190.833    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Survivor_Space       avgt    6           0.034 ±         0.019  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Survivor_Space               avgt    6           0.014 ±         0.017  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Survivor_Space.norm  avgt    6       36254.061 ±     20748.672    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Survivor_Space.norm          avgt    6       12498.051 ±     15258.993    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.count                         avgt    6          86.000                  counts
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.count                                 avgt    6          27.000                  counts

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.time                          avgt    6         359.000                      ms
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.time                                  avgt    6         100.000                      ms

Very large bitsets, wordSize = 131072 represents 8388608 bits (~8.4 million bits, 1 MB)

Benchmark                                                                        Mode  Cnt           Score           Error   Units
BitSetUtilBenchmark.ByteArrayToBitsetToRoaring                                   avgt    6     1043406.306 ±     32719.910   us/op
BitSetUtilBenchmark.ByteArrayToRoaring                                           avgt    6      917331.653 ±      4053.123   us/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.alloc.rate                    avgt    6        2544.777 ±        58.797  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.alloc.rate                            avgt    6         757.633 ±         3.118  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.alloc.rate.norm               avgt    6  2919403257.600 ±  21151458.690    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.alloc.rate.norm                       avgt    6   765114215.273 ±         0.001    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Eden_Space           avgt    6        2537.170 ±       291.747  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Eden_Space                   avgt    6         764.355 ±       334.761  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Eden_Space.norm      avgt    6  2911008515.067 ± 353108067.114    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Eden_Space.norm              avgt    6   771910811.152 ± 338239315.720    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Survivor_Space       avgt    6           0.373 ±         0.205  MB/sec
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Survivor_Space               avgt    6           0.074 ±         0.169  MB/sec

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.churn.PS_Survivor_Space.norm  avgt    6      427634.400 ±    240371.386    B/op
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.churn.PS_Survivor_Space.norm          avgt    6       74472.727 ±    171357.077    B/op

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.count                         avgt    6          77.000                  counts
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.count                                 avgt    6          20.000                  counts

BitSetUtilBenchmark.ByteArrayToBitsetToRoaring:·gc.time                          avgt    6         186.000                      ms
BitSetUtilBenchmark.ByteArrayToRoaring:·gc.time                                  avgt    6          49.000                      ms
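
As a cross-check, the relative gains can be recomputed from the avgt scores and the gc.alloc.rate.norm rows of the four tables above. The following is only an arithmetic sketch; `BenchmarkRatios`, `timeSaved` and `ratio` are hypothetical helper names, and the numbers are copied from the posted results.

```java
final class BenchmarkRatios {
  // Fraction of time saved by the new path relative to the old one.
  static double timeSaved(double oldScore, double newScore) {
    return 1.0 - newScore / oldScore;
  }

  // How many times larger the old value is than the new one.
  static double ratio(double oldValue, double newValue) {
    return oldValue / newValue;
  }

  public static void main(String[] args) {
    String[] sizes = {"small", "medium", "large", "very large"};
    // {old, new} avgt scores in us/op, copied from the tables above.
    double[][] time = {
      {926566.872, 817612.465},
      {1014957.615, 874951.650},
      {979256.137, 833433.943},
      {1043406.306, 917331.653}
    };
    // {old, new} gc.alloc.rate.norm in B/op, copied from the tables above.
    double[][] alloc = {
      {2969120079.273, 820106009.231},
      {2833754699.733, 661747620.000},
      {3061785079.758, 913781810.154},
      {2919403257.600, 765114215.273}
    };
    for (int i = 0; i < sizes.length; i++) {
      System.out.printf("%-10s time saved %.1f%%, allocation ratio %.2fx%n",
          sizes[i],
          100.0 * timeSaved(time[i][0], time[i][1]),
          ratio(alloc[i][0], alloc[i][1]));
    }
  }
}
```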

@shikharid shikharid changed the title optimisation: add deserialiser for bits byte array to RoaringBitmap Optimisation: Add zero-garbage deserialiser for ByteBuffer to RoaringBitmap Aug 7, 2023
@shikharid
Contributor Author

shikharid commented Aug 7, 2023

Also, anyone with more experience with the codebase please confirm this.

This is a bug: https://github.com/RoaringBitmap/RoaringBitmap/pull/650/files#diff-608ac1c40d6f95be23548cf97937dfcee083b21634337f6bb57617565c467f05R163

The copy should be (from, to) and not (from, from + BLOCK_LENGTH)

Once we fix this, I won't need to zero-out the thread local block buffer after every use.
And we will also probably save some unnecessary copying as to < from + BLOCK_LENGTH

The fixed impl will look like:

  • create a new long[BLOCK_LENGTH]
  • copy (from, to) of words[]

EDIT: Went ahead and made the change, as it doesn't change any existing behaviour and benchmarks showed a consistent gain of ~5% when not zeroing out the thread-local word buffer after each use.
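
A small, self-contained illustration of why the copy range matters when the word buffer is reused. This is a sketch, not the library code: `BlockCopy`, `copyFullBlock` and `copyValidRange` are hypothetical names, and `BLOCK_LENGTH` is assumed to be 1024 as in the diff comment.

```java
import java.util.Arrays;

final class BlockCopy {
  static final int BLOCK_LENGTH = 1024; // assumed from the diff comment

  // Old behaviour: always copies a full block, including words at or beyond `to`.
  static long[] copyFullBlock(long[] words, int from) {
    return Arrays.copyOfRange(words, from, from + BLOCK_LENGTH);
  }

  // Fixed behaviour: copies only [from, to); the rest of the new array stays zero.
  static long[] copyValidRange(long[] words, int from, int to) {
    long[] container = new long[BLOCK_LENGTH];
    System.arraycopy(words, from, container, 0, to - from);
    return container;
  }

  public static void main(String[] args) {
    long[] reused = new long[BLOCK_LENGTH];
    Arrays.fill(reused, -1L); // stale words left behind by a previous, larger input
    reused[0] = 0b1011L;      // only the first two words are valid this time
    reused[1] = 1L;
    System.out.println("full copy, word 2:  " + copyFullBlock(reused, 0)[2]);
    System.out.println("range copy, word 2: " + copyValidRange(reused, 0, 2)[2]);
  }
}
```

With the range copy, stale words beyond `to` never reach the new container, so the shared buffer does not have to be cleared between uses.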

- this removes the need to zero-out the threadlocal buffer everytime
@lemire
Member

lemire commented Aug 7, 2023

Would you consider porting your code over to RoaringBitmap/src/main/java/org/roaringbitmap/buffer/BufferBitSetUtil.java?

We try to keep them in sync. It should be easy work.

@@ -71,6 +72,71 @@ public static RoaringBitmap bitmapOf(final long[] words) {
return ans;
}

// To avoid memory allocation, reuse ThreadLocal buffers
private static final ThreadLocal<long[]> WORD_BLOCK = ThreadLocal.withInitial(() ->
Member

This is 8 kB that may get allocated for each thread for one function (below) and it never gets released.

I don't think we want that.

What would be acceptable is to allow the user to (optionally) pass a buffer to the function below. If the user passes a buffer, then you use it; otherwise, you allocate it.

In this manner, you give the user full control over the performance, and you don't make people pay 8 kB that they don't want to lock down for one function.
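
The pattern suggested here can be sketched as a pair of overloads, one that allocates a scratch buffer per call and one that accepts a caller-owned buffer. This is an illustration only: `OptionalBufferApi` and `blockCardinalities` are hypothetical names and not the signatures that ended up in the PR, and the sketch returns per-block bit counts rather than a bitmap.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

final class OptionalBufferApi {
  static final int BLOCK_LENGTH = 1024; // assumed from the diff comment

  // Convenience overload: allocates a scratch buffer for this call only.
  static int[] blockCardinalities(ByteBuffer source) {
    return blockCardinalities(source, new long[BLOCK_LENGTH]);
  }

  // Caller-supplied scratch buffer: the caller decides whether to keep and reuse it.
  static int[] blockCardinalities(ByteBuffer source, long[] wordsBuffer) {
    if (wordsBuffer.length < BLOCK_LENGTH) {
      throw new IllegalArgumentException(
          "wordsBuffer needs at least " + BLOCK_LENGTH + " words");
    }
    // Sketch only: trailing bytes that do not fill a whole word are ignored.
    LongBuffer words = source.slice().order(ByteOrder.LITTLE_ENDIAN).asLongBuffer();
    int blocks = (words.remaining() + BLOCK_LENGTH - 1) / BLOCK_LENGTH;
    int[] cardinalities = new int[blocks];
    for (int b = 0; b < blocks; b++) {
      int n = Math.min(BLOCK_LENGTH, words.remaining());
      words.get(wordsBuffer, 0, n); // overwrites only the first n words
      for (int i = 0; i < n; i++) {
        cardinalities[b] += Long.bitCount(wordsBuffer[i]);
      }
    }
    return cardinalities;
  }

  public static void main(String[] args) {
    ByteBuffer bb = ByteBuffer.allocate(Long.BYTES * 3);
    bb.putLong(0, 0xFFL);
    long[] scratch = new long[BLOCK_LENGTH]; // owned and reused by the caller
    System.out.println(blockCardinalities(bb, scratch)[0]);
  }
}
```

Nothing is retained per thread in this shape: callers that convert frequently keep one buffer alive, and occasional callers pay only a short-lived allocation.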

Contributor Author

Makes sense, will change

ans.highLowContainer.insertNewKeyValueAt(containerIndex++, Util.highbits(offset),
BitSetUtil.containerOf(0, blockLength, blockCardinality, words));
}
offset += (BLOCK_LENGTH * Long.SIZE);
Member

Though this can be reasonably dismissed, there is the possibility that offset overflows. Make sure that the offset variable cannot overflow (hopefully it cannot due to the max size of a Java BitSet, but please be specific, maybe with a comment).

Contributor Author

@shikharid shikharid Aug 7, 2023

Can't do much here, will add a comment.
As you said, this won't overflow unless the BitSet size is more than Integer.MAX_VALUE - 64.
And even in that case, right at the end where it's not needed anymore.
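
The arithmetic behind this can be worked through in a few lines. The sketch assumes the input is capped at 2^31 bits (the `java.util.BitSet` index range) and that `BLOCK_LENGTH` is 1024; `OffsetOverflow` and `startOffset` are hypothetical names used only here.

```java
final class OffsetOverflow {
  static final int BLOCK_LENGTH = 1024;                       // assumed from the diff comment
  static final int BITS_PER_BLOCK = BLOCK_LENGTH * Long.SIZE; // 65536 bits per block

  // Start offset (in bits) of a block, computed in long so overflow is visible.
  static long startOffset(long blockIndex) {
    return blockIndex * BITS_PER_BLOCK;
  }

  public static void main(String[] args) {
    long maxBits = 1L << 31;                  // assumption: BitSet-sized input
    long blocks = maxBits / BITS_PER_BLOCK;   // number of full blocks
    long lastStart = startOffset(blocks - 1); // offset used for the last container
    // The increment after the last block is the only one that leaves the int range.
    int afterLast = (int) lastStart + BITS_PER_BLOCK;
    System.out.println("blocks:            " + blocks);
    System.out.println("last start offset: " + lastStart);
    System.out.println("fits in int:       " + (lastStart <= Integer.MAX_VALUE));
    System.out.println("after last block:  " + afterLast);
  }
}
```

Under that assumption every offset actually passed to a container fits in an `int`; only the final, unused increment wraps.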

@lemire
Member

lemire commented Aug 7, 2023

This is a very good PR. Please consider my comments.

@shikharid
Contributor Author

@lemire made the changes you recommended and also ported to BufferBitSetUtil

@lemire
Member

lemire commented Aug 7, 2023

@shikharid

Regarding...

 private static Container containerOf(final int from, final int to, final int blockCardinality,
      final long[] words) {
    // find the best container available
    if (blockCardinality <= ArrayContainer.DEFAULT_MAX_SIZE) {
      // containers with DEFAULT_MAX_SIZE or less integers should be
      // ArrayContainers
      return arrayContainerOf(from, to, blockCardinality, words);
    } else {
      // otherwise use bitmap container
      return new BitmapContainer(Arrays.copyOfRange(words, from, from + BLOCK_LENGTH),
          blockCardinality);
    }
  }

A possible alternative would be...

      if (to - from < BLOCK_LENGTH) {
        long[] newbuffer = new long[BLOCK_LENGTH];
        System.arraycopy(words, from, newbuffer, 0, to - from);
        return new BitmapContainer(newbuffer, blockCardinality);
      } else {
        return new BitmapContainer(Arrays.copyOfRange(words, from, from + BLOCK_LENGTH),
            blockCardinality);
      }

(This code is untested... it is conceptually correct but could be technically wrong.)

If you'd like to make this change (it will need to be done in both versions of containerOf, one in the buffer package and one in the main package), or a related change, that would be fine.

Please advise.

@shikharid
Contributor Author

Can't really see if that helps with anything.
OpenJDK 17 implementation of copyOfRange looks like:

public static long[] copyOfRange(long[] original, int from, int to) {
        int newLength = to - from;
        if (newLength < 0)
            throw new IllegalArgumentException(from + " > " + to);
        long[] copy = new long[newLength];
        System.arraycopy(original, from, copy, 0,
                         Math.min(original.length - from, newLength));
        return copy;
    }

Doesn't seem to have changed since 8.

So the changes I made are essentially the same, without the extra if checks:

  private static Container containerOf(final int from, final int to, final int blockCardinality,
      final long[] words) {
    // find the best container available
    if (blockCardinality <= ArrayContainer.DEFAULT_MAX_SIZE) {
      // containers with DEFAULT_MAX_SIZE or less integers should be
      // ArrayContainers
      return arrayContainerOf(from, to, blockCardinality, words);
    } else {
      // otherwise use bitmap container
      long[] container = new long[BLOCK_LENGTH];
      System.arraycopy(words, from, container, 0, to - from);
      return new BitmapContainer(container, blockCardinality);
    }
  }
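
The equivalence argued above can be checked directly with the JDK alone. In this sketch (`CopyEquivalence` and `sameResult` are hypothetical names, `BLOCK_LENGTH` is assumed to be 1024), the unconditional allocate-then-arraycopy form is compared with `Arrays.copyOfRange` for a block of an exactly sized words array, where `copyOfRange` zero-pads anything past the end of the source.

```java
import java.util.Arrays;

final class CopyEquivalence {
  static final int BLOCK_LENGTH = 1024; // assumed from the diff comment

  // True when both copy strategies yield the same block for words[from, to).
  // Precondition for the comparison: `to` does not exceed words.length.
  static boolean sameResult(long[] words, int from, int to) {
    long[] viaCopyOfRange = Arrays.copyOfRange(words, from, from + BLOCK_LENGTH);
    long[] viaArraycopy = new long[BLOCK_LENGTH];
    System.arraycopy(words, from, viaArraycopy, 0, to - from);
    return Arrays.equals(viaCopyOfRange, viaArraycopy);
  }

  public static void main(String[] args) {
    long[] words = new long[1500]; // one full block plus a 476-word tail
    Arrays.fill(words, 0x5555555555555555L);
    System.out.println(sameResult(words, 0, BLOCK_LENGTH));               // full block
    System.out.println(sameResult(words, BLOCK_LENGTH, words.length));    // tail block
  }
}
```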

@@ -15,7 +16,7 @@ public class BitSetUtil {

// a block consists has a maximum of 1024 words, each representing 64 bits,
// thus representing at maximum 65536 bits
- static final private int BLOCK_LENGTH = BitmapContainer.MAX_CAPACITY / Long.SIZE; //
+ public static final int BLOCK_LENGTH = BitmapContainer.MAX_CAPACITY / Long.SIZE; //
Member

Is it necessary to make BLOCK_LENGTH public?

Contributor Author

Didn't see any other neat way to expose the information.
Since the user can provide the buffer, they need to know at least what size it needs to be.
Could just mention it in the Javadoc; the bounds check will raise an error anyway if a badly sized buffer is provided.

had hidden locally, forgot to uncomment
@lemire
Member

lemire commented Aug 8, 2023

Ok. The PR looks good. Running tests.

I will give some time for folks to come in and comment.

This will be part of the next release.

@shikharid
Contributor Author

shikharid commented Sep 2, 2023

Hey @lemire, any plans of merging/releasing this soon?
I added it because I need it for my use case, so it would be nice to have it in a release rather than having to use a fork.

@lemire
Member

lemire commented Sep 2, 2023

Merging. I will issue a release.

@lemire lemire merged commit 07ec0dd into RoaringBitmap:master Sep 2, 2023
9 checks passed
srowen pushed a commit to apache/spark that referenced this pull request Sep 27, 2023
### What changes were proposed in this pull request?
- The PR aims to upgrade RoaringBitmap from 0.9.45 to 1.0.0.
- From version 1.0.0, the `ArraysShim` class has been moved from `shims-x.x.x.jar` to `RoaringBitmap-x.x.x.jar`, so we no longer need to rely on it.

### Why are the changes needed?
- The newest version brings some improvements, e.g.:
Add zero-garbage deserialiser for ByteBuffer to RoaringBitmap by shikharid in RoaringBitmap/RoaringBitmap#650
More specialized method for value decrementation by xtonik in RoaringBitmap/RoaringBitmap#640
Duplicated small array sort routine by xtonik in RoaringBitmap/RoaringBitmap#638
Avoid intermediate byte array creation by xtonik in RoaringBitmap/RoaringBitmap#635
Useless back and forth BD bytes conversion by xtonik in RoaringBitmap/RoaringBitmap#636

- The full release notes:
https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/1.0.0
https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.49
https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.48
https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.47
https://github.com/RoaringBitmap/RoaringBitmap/releases/tag/0.9.46

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #42143 from panbingkun/SPARK-44539.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>