
Faster Bulk Next() and Add() #123

Closed
cakey opened this issue Nov 3, 2017 · 16 comments

@cakey

cakey commented Nov 3, 2017

Hey!
Great library! Big fan of the research you put out :)

We've started using roaring for much faster union/intersection operations, but we've found that a big bottleneck has been creating those recordsets, and then iterating through them after the operations.
Internally we've had a lot of success implementing bulk creation and iteration functions to avoid unnecessary per-row work. We see speedups of up to 10-20x for some of our use cases.
We'd love to contribute them back upstream!

Before cleaning up our code and submitting a PR, it would be great to get your feedback on what you'd like the API to look like.
Our signatures currently look something like:

// fills up as much of the buffer as it can, and returns the number of rows produced
func (ii *intIterator) NextRows(buf []uint32) int

// AddManySeqNew adds all of the values in buf.
// Values in buf need to be in ascending order.
// Assumes that all values already in the Bitmap are lower than the new rows in buf.
// (Used when constructing a roar from scratch.)
func (rb *Bitmap) AddManySeqNew(bufHigh, bufLow []uint16)

// rows in bufHigh/Low must be ascending, but it is fine to overlap with existing rows in the bitmap
// (Used when unioning a large bitmap, with a small number of new rows)
func (rb *Bitmap) AddManySeq(bufHigh, bufLow []uint16)

Each container type then has corresponding functions that can use the assumptions to avoid unnecessary work on each row.
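
For reference, a caller drives NextRows roughly like this (a sketch against the proposed signature; the 4096 buffer size and the process function are placeholders):

buf := make([]uint32, 4096)
it := bm.Iterator()
for n := it.NextRows(buf); n > 0; n = it.NextRows(buf) {
	for _, v := range buf[:n] {
		process(v) // per-value work, done one buffer fill at a time
	}
}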

For example, the RLE nextRows:

func (ri *runIterator16) nextRows(hs uint32, buf []uint32) int {
	n := 0
	if !ri.hasNext() {
		return n
	}

	if ri.curIndex == -1 {
		// first time is special
		ri.curIndex = 0
	}

	// start and last are inclusive
	for n < len(buf) {
		// add as many as you can from this seq
		// careful of uint16 overflow (valsLeft can be uint16Max + 1)
		valsLeft := int(ri.rc.iv[ri.curIndex].last) - (int(ri.rc.iv[ri.curIndex].start) + int(ri.curPosInIndex)) + 1
		moreVals := min(valsLeft, len(buf)-n)

		base := uint32(ri.rc.iv[ri.curIndex].start+ri.curPosInIndex) | hs

		// allows BCE
		buf2 := buf[n : n+moreVals]
		for i := range buf2 {
			buf2[i] = base + uint32(i)
		}

		// update values
		ri.curPosInIndex += uint16(moreVals) //moreVals always fits in uint16
		ri.curSeq += int64(moreVals)
		n += moreVals
		if moreVals == valsLeft {
			ri.curPosInIndex = 0
			ri.curIndex++
			if ri.curIndex == int64(len(ri.rc.iv)) {
				break
			}
		}
	}
	return n
}

This gives us 20x the performance of naively calling next() repeatedly.

Excited to contribute :) What are the next steps here?

@lemire
Member

lemire commented Nov 4, 2017

Evidently, we are always happy to receive contributions. Anything that you use in an actual application with good results should make it into our code. At a minimum, we should make sure that you can get the features you need without having to maintain a private fork (something that is not good for anyone).

Let us chat.

I'd break your contributions into two distinct ones.

First, it seems that you propose a way to iterate over values faster by doing it in bulk. The nextRows function clearly has the potential to be faster, if only because you make fewer function calls. I'd still want to see a benchmark that does something less trivial than just running through the data (i.e., filling the buffers): you fill your buffer, but then your application needs to read it again, so you take the value, store it in your buffer, then come back to it later and reread it. It may end up being faster, but it is not obviously much faster in practice, and it is certainly less convenient for most users. The name is not ideal because we have no notion of rows. I don't want to bikeshed the naming convention, but can you propose a name that fits the current API as per our godoc? As a hint, we have a closely related method called ToArray.

In C, we sidestep the problem of iterating over the values with an iterator, and use a kind of for-each construct instead. The function takes a function pointer as input and calls it with each value (possibly stopping early if your function returns false). I don't know how well this works in Go, but in C it is ridiculously fast. So it is possible that there are alternative designs that you have not considered that might serve your performance needs even better. Or not.
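
A rough Go equivalent of that for-each idea, layered on the public iterator (the name iterate and this shape are just a sketch; a real implementation would push the callback down into each container to cut the per-value overhead):

// iterate calls f with each value in ascending order, stopping early
// if f returns false. Assumes the usual roaring import.
func iterate(rb *roaring.Bitmap, f func(x uint32) bool) {
	it := rb.Iterator()
	for it.HasNext() {
		if !f(it.Next()) {
			return
		}
	}
}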

Then you want to add values in bulk. We have a function called AddMany that is meant to quickly add many values... I think you are proposing another, faster way of doing the same. Your function signatures look a tad complicated. I'd be interested in a benchmark between your functions and the existing AddMany. Do you have such a thing? Obviously, if your code is anything like 10x faster, it may warrant a more complicated function signature but if... say... it is only 2% faster... it may not be worth it. Right? Another possibility is that your functions have functionality not covered by AddMany. If so, can you make this clearer?

@lemire
Member

lemire commented Nov 4, 2017

c.c. @maciej @glycerine

@cakey
Author

cakey commented Nov 4, 2017

@lemire Thanks for the thorough response. Absolutely not tied to any of the names we're currently using. Here are some benchmarks for AddMany that I quickly whipped up. Forgive the randomness here (not ideal), but the general trend of the numbers holds up across runs.

func randomRecordSetUint32(count, max int) []uint32 {
	h := make([]bool, max)
	for i := 0; i < count; {
		r := rand.Intn(max)
		if !h[r] {
			h[r] = true
			i++
		}
	}
	rows := make([]uint32, count)
	i := 0
	for r, ok := range h {
		if ok {
			rows[i] = uint32(r + 1)
			i++
		}
	}
	return rows
}

func randomRecordSetUint16s(count, max int) ([]uint16, []uint16) {
	h := make([]bool, max)
	for i := 0; i < count; {
		r := rand.Intn(max)
		if !h[r] {
			h[r] = true
			i++
		}
	}
	highs := make([]uint16, count)
	lows := make([]uint16, count)
	i := 0
	for r, ok := range h {
		if ok {
			highs[i] = uint16((r + 1) >> 16)
			lows[i] = uint16((r + 1) & 0xFFFF)
			i++
		}
	}
	return highs, lows
}

func (s *RecordSetSuite) BenchmarkCreateAddMany(c *C) {
	c.StopTimer()
	// 1% density
	count := 1000000
	rrs := randomRecordSetUint32(count, 100000000)
	c.StartTimer()
	var bm *roaring.Bitmap
	for n := 0; n < c.N; n++ {
		bm = roaring.NewBitmap()
		bm.AddMany(rrs)
	}

	if bm.GetCardinality() == 0 {
		// avoid optimising away
		c.FailNow()
	}
}

func (s *RecordSetSuite) BenchmarkCreateSeqNew(c *C) {
	c.StopTimer()
	// 1% density
	count := 1000000
	rrs := randomRecordSetUint32(count, 100000000)
	rowsHigh := make([]uint16, count)
	rowsLow := make([]uint16, count)
	c.StartTimer()
	var bm *roaring.Bitmap
	for n := 0; n < c.N; n++ {
		bm = roaring.NewBitmap()
		for i, v := range rrs {
			rowsHigh[i] = uint16(v >> 16)
			rowsLow[i] = uint16(v & 0xFFFF)
		}

		bm.AddManySeqNew(rowsHigh, rowsLow)
	}

	if bm.GetCardinality() == 0 {
		// avoid optimising away
		c.FailNow()
	}
}
func (s *RecordSetSuite) BenchmarkCreateSeqNewBestCase(c *C) {
	c.StopTimer()
	// 1% density
	count := 1000000
	highs, lows := randomRecordSetUint16s(count, 100000000)
	c.StartTimer()
	var bm *roaring.Bitmap
	for n := 0; n < c.N; n++ {
		bm = roaring.NewBitmap()
		bm.AddManySeqNew(highs, lows)
	}

	if bm.GetCardinality() == 0 {
		// avoid optimising away
		c.FailNow()
	}
}

1% density

BenchmarkCreateAddMany_______	     500	   7198367 ns/op	 6383084 B/op	   15295 allocs/op
BenchmarkCreateSeqNew_________	    1000	   2027800 ns/op	 2243602 B/op	    3087 allocs/op
BenchmarkCreateSeqNewBestCase	    1000	   1181379 ns/op	 2245140 B/op	    3087 allocs/op

So in an apples-to-apples comparison (same overall inputs) we see a 3-4x speedup.
In a more realistic scenario (it's easy for us to change our other code to provide the []uint16 slices directly), it's more like 6x.

Let me know if you spot any errors here; these were very quickly constructed.

For comparison:
10% density:

BenchmarkCreateAddMany________	     500	   7114560 ns/op	 3582671 B/op	    2318 allocs/op
BenchmarkCreateSeqNew_________	    1000	   2926960 ns/op	 1272978 B/op	     482 allocs/op
BenchmarkCreateSeqNewBestCase	    1000	   2194713 ns/op	 1272978 B/op	     482 allocs/op

0.1% density:

BenchmarkCreateAddMany	     100	  11338850 ns/op	 7654162 B/op	   99878 allocs/op
BenchmarkCreateSeqNew	     500	   4229738 ns/op	 3860664 B/op	   30577 allocs/op
BenchmarkCreateSeqNewBestCase	     500	   3648316 ns/op	 3860196 B/op	   30577 allocs/op

I'll try to put together something similar for nextRows soon, including trying out your function-pointer suggestion.

@lemire
Member

lemire commented Nov 4, 2017

Thanks! I love hard numbers!

@cakey
Author

cakey commented Nov 4, 2017

Next rows

Again, forgive the rough benchmarks here, and point out any mistakes I've made.

func lookupValue(i uint32) int64 {
	return int64(i % 50)
}

func (s *RecordSetSuite) benchmarkRoarNext(c *C, bm *roaring.Bitmap) {
	var sum int64
	for n := 0; n < c.N; n++ {
		sum = int64(0)
		i := bm.Iterator()
		for i.HasNext() {
			row := i.Next()
			sum += lookupValue(row)
		}
	}

	if sum == 0 {
		// avoid optimising away
		c.FailNow()
	}
}

func (s *RecordSetSuite) benchmarkRoarNextRows(c *C, bm *roaring.Bitmap) {
	var sum int64
	for n := 0; n < c.N; n++ {
		sum = int64(0)
		buf := make([]uint32, 4096)
		it := bm.Iterator()
		for n := it.NextRows(buf); n != 0; n = it.NextRows(buf) {
			for i := 0; i < n; i++ {
				sum += lookupValue(buf[i])
			}
		}
	}
	if sum == 0 {
		// avoid optimising away
		c.FailNow()
	}
}

func (s *RecordSetSuite) BenchmarkRoarNextRows_dens1(c *C) {
	c.StopTimer()
	count := 1000000
	dat := randomRecordSetUint32(count, count*100)
	bm := roaring.BitmapOf(dat...)
	c.StartTimer()
	s.benchmarkRoarNextRows(c, bm)
}
func (s *RecordSetSuite) BenchmarkRoarNextRows_dens20(c *C) {
	c.StopTimer()
	count := 1000000
	dat := randomRecordSetUint32(count, count*5)
	bm := roaring.BitmapOf(dat...)
	c.StartTimer()
	s.benchmarkRoarNextRows(c, bm)
}
func (s *RecordSetSuite) BenchmarkRoarNext_____dens1(c *C) {
	c.StopTimer()
	count := 1000000
	dat := randomRecordSetUint32(count, count*100)
	bm := roaring.BitmapOf(dat...)
	c.StartTimer()
	s.benchmarkRoarNext(c, bm)
}
func (s *RecordSetSuite) BenchmarkRoarNext_____dens20(c *C) {
	c.StopTimer()
	count := 1000000
	dat := randomRecordSetUint32(count, count*5)
	bm := roaring.BitmapOf(dat...)
	c.StartTimer()
	s.benchmarkRoarNext(c, bm)
}
func (s *RecordSetSuite) BenchmarkRoarNextRows___rle(c *C) {
	c.StopTimer()
	count := uint64(1000000)
	bm := roaring.NewBitmap()
	bm.AddRange(0, count)
	c.StartTimer()
	s.benchmarkRoarNextRows(c, bm)
}
func (s *RecordSetSuite) BenchmarkRoarNext_______rle(c *C) {
	c.StopTimer()
	count := uint64(1000000)
	bm := roaring.NewBitmap()
	bm.AddRange(0, count)
	c.StartTimer()
	s.benchmarkRoarNext(c, bm)
}

BenchmarkRoarNextRows___rle	    1000	   1434033 ns/op	   16946 B/op	      18 allocs/op
BenchmarkRoarNext_______rle	     100	  23526095 ns/op	     571 B/op	      17 allocs/op

BenchmarkRoarNextRows_dens1	     500	   3584035 ns/op	   65268 B/op	    1528 allocs/op
BenchmarkRoarNext_____dens1	     100	  12932847 ns/op	   48900 B/op	    1527 allocs/op

BenchmarkRoarNextRows_dens20	     500	   7310742 ns/op	 4185580 B/op	     387 allocs/op
BenchmarkRoarNext_____dens20	     100	  18294503 ns/op	    3764 B/op	      78 allocs/op

So ~16x on RLE, ~3.5x on array, ~2.5x on bitmap.

In practice these numbers are a worst case for us, as dealing with the rows in blocks also speeds up the rest of our code.

(The bitmap nextRows is very naively implemented, hence the extra memory pressure; I'd fix this when upstreaming, and likely achieve quite a bit better numbers.)

@glycerine
Member

So ~16x on RLE, ~3.5x on array, ~2.5x on bitmap.

In practice these numbers are a worst case for us, as dealing with the rows in blocks also speeds up the rest of our code.

(The bitmap nextRows is very naively implemented, hence the extra memory pressure; I'd fix this when upstreaming, and likely achieve quite a bit better numbers.)

Awesome. Wow, that is significant. This is great work. I'd just like to encourage you to polish it and submit a PR so roaring can be improved for all users. When I wrote the RLE part of roaring in Go, we established correctness with extensive tests, but there was/is still performance tuning that can and should be done. So thanks for doing this!

@cakey
Author

cakey commented Nov 6, 2017

Implemented a smarter nextRows for bitmaps:

BenchmarkRoarNextRows_dens20	     500	   4710256 ns/op	   21364 B/op	      79 allocs/op

Memory usage is much better, and it's up to ~3.5x faster.

I'll put together a PR in the near future.

I'm not entirely decided on the interface.

AddManySeqNew is only useful when initially creating a roaring.Bitmap, so perhaps hiding it behind some kind of builder is better:

builder := bitmap.NewBulkBuilder()
builder.AddManySeq(buf)
bitmap := builder.Finish()

It's kinda ugly though

As for nextRows(buf) - nextMany(buf)?

Also, it's a bit more work to ensure things stay in sync if the same iterator object supports calls to both NextMany and Next. And the interface is different anyway: Next() requires a HasNext(), while in our NextMany implementation we use a return of 0 to indicate no more rows.
We could easily hide this behind a .BulkIterator()/.ManyIterator(), perhaps?
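
One possible shape, as a sketch (names hypothetical; NextMany returning 0 doubles as the termination signal, so no HasNext is needed):

// A dedicated bulk-iterator interface, so Next() and NextMany() never
// have to share cursor state.
type ManyIntIterable interface {
	// NextMany fills buf with the next values in ascending order and
	// returns how many it wrote; 0 means iteration is finished.
	NextMany(buf []uint32) int
}

func sumAll(it ManyIntIterable) (sum uint64) {
	buf := make([]uint32, 4096)
	for n := it.NextMany(buf); n > 0; n = it.NextMany(buf) {
		for _, v := range buf[:n] {
			sum += uint64(v)
		}
	}
	return
}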

@lemire
Member

lemire commented Nov 7, 2017

I agree with @glycerine.

I'll put together a PR in the near future.

I would support this.

Again, though I will not get into bikeshedding about exactly how things should be named, we have no concept of rows in our API, so it is probably best not to use the term. I think nextMany is nice and consistent.

It's kinda ugly though

Yes.

Some naive thoughts:

  • AddManySeqNew should really create a new bitmap; it looks to me like a version of BitmapOf. So why not call it BitmapOfSequential or something?
  • AddManySeq could maybe be renamed AddManySequential.

Maybe spelling out Seq as Sequential is not best, but my view is that it makes things nicer, because it is not intuitive what the suffix Seq means.

Also, it's a bit more work to ensure things stay in sync if the same iterator object supports calls to both NextMany and Next.

This could be openly discouraged/disallowed. The scenario where one would try to use both sounds a bit adversarial. If we simply document it ("don't do this"), that is probably good enough.

We could easily hide this behind a (...)

The Go way is to make things simple. Adding extra wrappers does not sound Go-like to me, but that's not something I feel strongly about.

@maciej
Member

maciej commented Nov 7, 2017

I completely agree with what was said here.

NextMany would be great to have!

I don't understand why AddManySequential needs the input split into bufHigh and bufLow. Can't it just accept a []uint32 like AddMany (assuming ascending order)? Would that impact its performance?

@lemire
Member

lemire commented Nov 7, 2017

@maciej That's a very good question. I am somewhat puzzled by the split between low and high.

@cakey
Author

cakey commented Nov 7, 2017

Cool, I'll start with a PR for NextMany as the interface is less up in the air. :)

why not call it BitmapOfSequential or something?

We want the ability to construct the bitmap in multiple chunks; I'm not sure how to do that without the function being a method on the bitmap itself, or part of a builder.
There's always something like BitmapOfSequential(func ([]uint32)) that calls your function until there are no more values.
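
One way to read that signature, as a sketch (assumed semantics: fill writes values into the provided buffer and returns how many it wrote, 0 once exhausted; AddMany stands in for the eventual sequential bulk add):

func BitmapOfSequential(fill func(buf []uint32) int) *roaring.Bitmap {
	rb := roaring.NewBitmap()
	buf := make([]uint32, 4096)
	for n := fill(buf); n > 0; n = fill(buf) {
		rb.AddMany(buf[:n]) // stand-in for the proposed sequential path
	}
	return rb
}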

I don't understand why AddManySequential needs the input split into bufHigh and bufLow.

We saw better performance. Our implementation iterates through the buffer looking for where the high 16 bits change (i.e. where the container changes), and then passes all of the low 16 bits down to the container at once. In the best case, with something like an arrayContainer, the high and low bits are already split, so it's a trivial memcpy of the entire uint16 buffer. If they were uint32s, we would have to iterate through, converting them to uint16s.
I'll whip up some benchmarks; obviously if it turns out to be just a few percent, then it's not worth the interface complexity, and I'll go with []uint32.
(On our end we already have to iterate through to create these buffers, so the split is very cheap.)

@lemire
Member

lemire commented Nov 7, 2017

@cakey

We want the ability to construct the bitmap in multiple chunks, I'm not sure how to do that without the function being a method on bitmap itself, or as part of a builder.

What is wrong with the following...

    r := BitmapOfSequential(buf1)
    r.addManySequential(buf2)
    r.addManySequential(buf3)

This is not rhetorical, I'd like to better understand your point of view.

@maciej
Member

maciej commented Nov 7, 2017

@cakey I was pretty curious about how a BitmapOfSequential([]uint32) would perform, so I wrote a small piece of code that does what I imagine is more or less what your code is doing. It's available here. It's a bit rough and might not cover all edge cases (e.g. a 0-element input slice). The code generating the benchmark input data is, of course, copied from you, to get a (somewhat) proper benchmark comparison.

The benchmarks are not noticeably worse than what you showed (1% density):

# go test -bench BenchmarkBitmapOf -benchmem -run -

goos: darwin
goarch: amd64
pkg: github.com/RoaringBitmap/roaring
BenchmarkBitmapOf/baseline-8         	     200	   7824027 ns/op	 6383080 B/op	   15295 allocs/op
BenchmarkBitmapOf/dense-8            	     100	  13826256 ns/op	 4349873 B/op	    4615 allocs/op
BenchmarkBitmapOf/sparse-8           	    1000	   1713547 ns/op	 2421936 B/op	    4612 allocs/op
PASS
ok  	github.com/RoaringBitmap/roaring	5.851s

which makes me even more curious to see your pull request!

@cakey
Author

cakey commented Nov 8, 2017

@lemire Nothing wrong with that, if r := BitmapOfSequential(buf1) is essentially just sugar for r := roaring.NewBitmap(); r.addManySequential(buf1). One of the main performance wins is being able to assume that there are no values in the roar larger than any in buf, so we know we can always append in the array case, or avoid recomputing the cardinality in the bitmap case.
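
To illustrate the array-case win that assumption enables, a hypothetical container-level helper (not library API): new lows for the current key can be blindly appended, with no search, shifting, or dedup.

// appendSorted assumes every value in lows is strictly larger than
// anything already in the container.
func (ac *arrayContainer) appendSorted(lows []uint16) {
	ac.content = append(ac.content, lows...)
}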

@maciej
Ah, I didn't realise you could embed .Run like that, much cleaner and saner!

I compared your functions to mine:
Your functions are quite dependent on choosing the right sparse/dense function, but the idea is very interesting. I tried a variant of your code (bitmapofmixes) where I look ahead arrayDefaultMaxSize values in the buffer and select the appropriate container ahead of time based on the high bits of that element, but I couldn't get the performance I wanted; I think in practice it will be quite hard to choose the right function.

I then tried a variant of my own AddManySeqNew, taking inspiration from your much simpler functions, which remove a ton of container overhead, and again trying to choose the right container. I'm quite happy with the results (BenchmarkBitmapOf/split; see code below).

func BitmapOfSequentialSplit(highs, lows []uint16) *Bitmap {
	rb := New()
	if len(highs) == 0 {
		// guard the empty-input edge case (highs[0] below would panic)
		return rb
	}

	prevN := 0
	prevHigh := highs[0]
	for n, i := range highs {

		if prevHigh != i {
			card := n - prevN
			if card < arrayDefaultMaxSize {
				cc := make([]uint16, card)
				copy(cc, lows[prevN:n])
				var nc container = &arrayContainer{cc}
				rb.highlowcontainer.appendContainer(prevHigh, nc, false)
			} else {
				var c = newBitmapContainer()
				c.cardinality = card
				bm := c.bitmap
				for _, lb := range lows[prevN:n] {
					bm[lb/64] |= uint64(1) << (lb % 64)
				}
				rb.highlowcontainer.appendContainer(prevHigh, c, false)
			}
			prevN = n
			prevHigh = i
		}
	}

	cc := make([]uint16, len(highs)-prevN)
	copy(cc, lows[prevN:])
	var lc container = &arrayContainer{cc}
	if len(highs)-prevN > arrayDefaultMaxSize {
		lc = lc.(*arrayContainer).toBitmapContainer()
	}
	rb.highlowcontainer.appendContainer(prevHigh, lc, false)

	return rb
}

and without the split:

func BitmapOfSequential(dat ...uint32) *Bitmap {
	rb := New()
	if len(dat) == 0 {
		// guard the empty-input edge case (dat[0] below would panic)
		return rb
	}

	prevN := 0
	prevHigh := highbits(dat[0])
	for n, i := range dat {

		if prevHigh != highbits(i) {
			card := n - prevN
			if card < arrayDefaultMaxSize {
				cc := make([]uint16, card)
				for j, v := range dat[prevN:n] {
					cc[j] = lowbits(v)
				}
				var nc container = &arrayContainer{cc}
				rb.highlowcontainer.appendContainer(prevHigh, nc, false)
			} else {
				var c = newBitmapContainer()
				c.cardinality = card
				bm := c.bitmap
				for _, v := range dat[prevN:n] {
					lb := lowbits(v)
					bm[lb/64] |= uint64(1) << (lb % 64)
				}
				rb.highlowcontainer.appendContainer(prevHigh, c, false)
			}
			prevN = n
			prevHigh = highbits(i)
		}
	}

	cc := make([]uint16, len(dat)-prevN)
	for j, v := range dat[prevN:] {
		cc[j] = lowbits(v)
	}
	var lc container = &arrayContainer{cc}
	if len(dat)-prevN > arrayDefaultMaxSize {
		lc = lc.(*arrayContainer).toBitmapContainer()
	}
	rb.highlowcontainer.appendContainer(prevHigh, lc, false)

	return rb
}


Results:

0.1% density:

BenchmarkBitmapOf/baseline-8         	     100	  11998353 ns/op	 7650806 B/op	   99865 allocs/op
BenchmarkBitmapOf/dense-8            	      50	  27068608 ns/op	 5974192 B/op	   45838 allocs/op
BenchmarkBitmapOf/sparse-8           	     500	   3042467 ns/op	 3990976 B/op	   30577 allocs/op
BenchmarkBitmapOf/addmanyseq-8       	     500	   3365152 ns/op	 3860000 B/op	   30577 allocs/op
BenchmarkBitmapOf/bitmapofmixes-8    	     300	   5045504 ns/op	 3999264 B/op	   30579 allocs/op
BenchmarkBitmapOf/notSplit-8         	     500	   2781168 ns/op	 3860000 B/op	   30577 allocs/op
BenchmarkBitmapOf/split-8            	     500	   2530099 ns/op	 3860000 B/op	   30577 allocs/op

1% density:

BenchmarkBitmapOf/baseline-8         	     200	  10176469 ns/op	 6383076 B/op	   15295 allocs/op
BenchmarkBitmapOf/dense-8            	     100	  13417180 ns/op	 4353456 B/op	    4615 allocs/op
BenchmarkBitmapOf/sparse-8           	    1000	   1910863 ns/op	 2374928 B/op	    3087 allocs/op
BenchmarkBitmapOf/addmanyseq-8       	    1000	   1272524 ns/op	 2245008 B/op	    3087 allocs/op
BenchmarkBitmapOf/bitmapofmixes-8    	     500	   3126546 ns/op	 2384272 B/op	    3089 allocs/op
BenchmarkBitmapOf/notSplit-8         	    1000	   1478694 ns/op	 2245008 B/op	    3087 allocs/op
BenchmarkBitmapOf/split-8            	    1000	   1123379 ns/op	 2245008 B/op	    3087 allocs/op

10% density

BenchmarkBitmapOf/baseline-8         	     200	   7841014 ns/op	 5767851 B/op	    2622 allocs/op
BenchmarkBitmapOf/dense-8            	    1000	   2020032 ns/op	 1284501 B/op	     332 allocs/op
BenchmarkBitmapOf/sparse-8           	     500	   3662102 ns/op	 3458160 B/op	     633 allocs/op
BenchmarkBitmapOf/addmanyseq-8       	    1000	   2881489 ns/op	 1272947 B/op	     481 allocs/op
BenchmarkBitmapOf/bitmapofmixes-8    	     500	   3053739 ns/op	 1407344 B/op	     331 allocs/op
BenchmarkBitmapOf/notSplit-8         	    1000	   2357370 ns/op	 1268080 B/op	     329 allocs/op
BenchmarkBitmapOf/split-8            	    1000	   2042487 ns/op	 1268080 B/op	     329 allocs/op

So it's about a 10-30% gain from splitting into highs and lows.

It might be a bit of work to keep the performance of the above approach with chunking, though. The other reason we want chunking is that it gives us the opportunity to use AddRange for RLE, which is common in the format we're converting from and is key for performance in a subset of our use cases (it avoids materialising all the uint32s).

@maciej
Member

maciej commented Nov 8, 2017

@cakey OK, now looking at your implementation, the performance gains seem obvious. If you can copy a subslice from the input lows slice, it's always going to be faster. My function has to split every uint32; there is no way around it.
Does your system produce the input as already-split highs and lows? It seems to me a somewhat unlikely case to have the input data already available in that form. If you have to split the data, then you're still doing the work, just somewhere else, outside of roaring.

What about the idea of having a function that accepts a "generator" (an anonymous function; I'm not sure what you'd call that pattern in Go): BitmapFromGenerator(func () (next uint32, ok bool))? You proposed it in one of the previous comments. Have you benchmarked it?
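
A minimal sketch of that generator shape, with AddMany standing in for the eventual bulk path (the 4096-value buffer just amortises the per-value call overhead):

func BitmapFromGenerator(gen func() (next uint32, ok bool)) *roaring.Bitmap {
	rb := roaring.NewBitmap()
	buf := make([]uint32, 0, 4096)
	for v, ok := gen(); ok; v, ok = gen() {
		buf = append(buf, v)
		if len(buf) == cap(buf) {
			rb.AddMany(buf)
			buf = buf[:0]
		}
	}
	rb.AddMany(buf) // flush the tail
	return rb
}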

You said that the high-low split gives you an opportunity to use AddRange for RLE. Would it be much more costly if you used uint32s?

@lemire
Member

lemire commented May 29, 2018

I think that this issue has been resolved entirely. Reopen if you disagree.

@lemire lemire closed this as completed May 29, 2018