Performance improvement array_container_contains() #800

Merged
lemire merged 1 commit into RoaringBitmap:master from andreigudkov:array_contains
Apr 24, 2026

@andreigudkov
Member

  1. Limit initial binsearch boundary by x value
  2. Reorder conditional logic to follow the pattern "first find smallest item >= x", next perform return decision
  3. Consume 16 remaining elements in blocks of four

Tested on Kaby Lake. Overall, this gives a ~33% improvement in array_container_benchmark:contains_test and a ~8% improvement in real_bitmaps_contains_benchmark.
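
The three changes listed above can be sketched in a few lines of C. This is a hedged illustration, not the code merged in this PR: the names `sketch_array_contains` and `sketch_self_check` are hypothetical, the 16-element cutoff and the blocks of four come from the description, and the exact loop shapes are assumptions. Step 1 relies on the fact that a sorted array of distinct 16-bit values satisfies `a[i] >= i`, so `x` can only appear at an index no larger than `x`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch, NOT the merged CRoaring code: membership test on a
 * sorted array of distinct 16-bit values. */
static bool sketch_array_contains(const uint16_t *a, int32_t n, uint16_t x) {
    assert(n >= 0 && n <= 65536);

    /* Step 1: values are sorted and distinct, so a[i] >= i. If x is present,
     * its index is at most x, which caps the initial upper boundary. */
    int32_t low = 0;
    int32_t high = (n - 1 < (int32_t)x) ? n - 1 : (int32_t)x;

    /* Step 2: lower-bound search. Everything left of low is < x and
     * everything right of high is > x, so x can only be in [low, high]. */
    while (high - low >= 16) {
        int32_t mid = (low + high) >> 1;
        if (a[mid] < x) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }

    /* Step 3: at most 16 candidates remain. Skip whole blocks of four whose
     * last element is still < x. */
    int32_t i = low;
    while (i + 3 <= high && a[i + 3] < x) {
        i += 4;
    }
    /* The first element >= x (if any) is now at most four positions away;
     * only at this point is the return decision made. */
    for (; i <= high; i++) {
        if (a[i] >= x) {
            return a[i] == x;
        }
    }
    return false;
}

/* Hypothetical cross-check: fills the array with multiples of stride and
 * compares every possible query against the arithmetic answer. */
static bool sketch_self_check(int32_t n, int32_t stride) {
    static uint16_t a[65536];
    assert(n >= 0 && stride >= 1 && (int64_t)n * stride <= 65536);
    for (int32_t i = 0; i < n; i++) {
        a[i] = (uint16_t)(i * stride);
    }
    for (int32_t x = 0; x <= 65535; x++) {
        bool expected = (x % stride == 0) && (x / stride < n);
        if (sketch_array_contains(a, n, (uint16_t)x) != expected) {
            return false;
        }
    }
    return true;
}
```

For example, `sketch_self_check(1000, 3)` runs all 65536 possible queries against an array holding 0, 3, 6, ..., 2997. Deferring the equality test until the lower bound is known keeps the search loop to a single comparison per step, which is one plausible reading of the second item in the list.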

Before:

contains_test(B):  37.48 cycles per operation
contains_test(B):  37.48 cycles per operation
contains_test(B):  37.45 cycles per operation
contains_test(B):  37.47 cycles per operation
contains_test(B):  37.47 cycles per operation
Quartile queries on 200 bitmaps took 15299 cycles
Quartile queries on 200 bitmaps took 14222 cycles
Quartile queries on 200 bitmaps took 14945 cycles
Quartile queries on 200 bitmaps took 15525 cycles
Quartile queries on 200 bitmaps took 15311 cycles

After:

contains_test(B):  24.64 cycles per operation
contains_test(B):  24.55 cycles per operation
contains_test(B):  24.61 cycles per operation
contains_test(B):  24.62 cycles per operation
contains_test(B):  24.56 cycles per operation
Quartile queries on 200 bitmaps took 13898 cycles
Quartile queries on 200 bitmaps took 14159 cycles
Quartile queries on 200 bitmaps took 13608 cycles
Quartile queries on 200 bitmaps took 13995 cycles
Quartile queries on 200 bitmaps took 13706 cycles

@lemire
Member

lemire commented Apr 14, 2026

Very interesting. Let us investigate!

@lemire
Member

lemire commented Apr 21, 2026

I am not forgetting this.

@andreigudkov
Member Author

Meanwhile, I realized that both array_container_benchmark and real_bitmaps_contains_benchmark are (very) far from perfect.

For array_container_benchmark:

  1. remove_test uses the BEST_TIME macro, which repeats the measurement 500 times on the same bitmap instance and reports the minimum time across all iterations. Every iteration after the first therefore tries to remove elements from an already empty bitmap, hence such ridiculous values as 0.01 cycles per element.

  2. add_benchmark has a similar issue: actual addition happens only during the very first iteration.

  3. The size argument of the BEST_TIME macro is used inconsistently. For example, it is passed as 2048 (=TESTSIZE) for add_test, but internally this test adds 2^16/3 elements.

  4. Unlike BEST_TIME, BEST_TIME_PRE_ARRAY reports the average (not the minimum) time across iterations, so its name is misleading.

  5. The contains_tests with prefetch/flush use values from the testvalues array, which is never populated.
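
The first two items can be reproduced without any timing at all. The sketch below is a hypothetical stand-in, not the BEST_TIME macro or the CRoaring container: `toy_set_t`, `toy_remove_all` and `iterations_with_work` are invented names, and only the 500 repeats and the 2048-element size are taken from the discussion. It counts how many iterations of a repeat loop perform real removals, with and without an untimed reset before each iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { TOY_SIZE = 2048, TOY_REPEATS = 500 };

/* Hypothetical stand-in for a container under test. */
typedef struct {
    bool present[TOY_SIZE];
} toy_set_t;

static void toy_fill(toy_set_t *s) {
    assert(s != NULL);
    for (int i = 0; i < TOY_SIZE; i++) {
        s->present[i] = true;
    }
}

/* The "benchmarked" operation: removes everything, returns how many
 * elements were actually removed. */
static int toy_remove_all(toy_set_t *s) {
    assert(s != NULL);
    int removed = 0;
    for (int i = 0; i < TOY_SIZE; i++) {
        if (s->present[i]) {
            s->present[i] = false;
            removed++;
        }
    }
    return removed;
}

/* Counts the iterations in which the benchmarked call did real work. */
static int iterations_with_work(bool reset_each_iteration) {
    toy_set_t s;
    toy_fill(&s);
    int busy = 0;
    for (int r = 0; r < TOY_REPEATS; r++) {
        if (reset_each_iteration) {
            toy_fill(&s); /* setup step that a harness would leave untimed */
        }
        if (toy_remove_all(&s) > 0) {
            busy++;
        }
    }
    return busy;
}
```

Without the reset, only the first of the 500 iterations removes anything, so a harness that reports the minimum time ends up reporting the cost of scanning an empty container. With the reset, all 500 iterations do the same work.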

real_bitmaps_contains_benchmark queries three elements from each of the preloaded bitmaps. These values are selected at 1/4, 1/2, and 3/4 of the way between zero and maxvalue. However, this maxvalue is precomputed as the maximum value among all bitmaps. I didn't investigate further, but I have a feeling that this leads to many queried elements falling outside the range of the target bitmap.
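
The suspicion about out-of-range queries could be checked with a small counter along the following lines. This is only a sketch under assumptions: `probes_beyond_own_max` is a hypothetical helper, it takes the per-bitmap maxima as a plain array rather than reading real bitmaps, and the probe positions are computed as 1/4, 1/2 and 3/4 of the global maximum as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given the maximum value of each bitmap, counts the
 * quartile probes (derived from the global maximum) that fall above the
 * maximum of the bitmap they are sent to. */
static size_t probes_beyond_own_max(const uint32_t *own_max, size_t count) {
    assert(own_max != NULL || count == 0);

    uint32_t global_max = 0;
    for (size_t i = 0; i < count; i++) {
        if (own_max[i] > global_max) {
            global_max = own_max[i];
        }
    }

    /* 64-bit intermediate avoids overflow in the 3/4 probe. */
    const uint32_t probes[3] = {
        global_max / 4,
        global_max / 2,
        (uint32_t)(((uint64_t)global_max * 3) / 4),
    };

    size_t beyond = 0;
    for (size_t i = 0; i < count; i++) {
        for (size_t p = 0; p < 3; p++) {
            if (probes[p] > own_max[i]) {
                beyond++;
            }
        }
    }
    return beyond;
}
```

With made-up maxima of 100, 1000 and 40, the probes are 250, 500 and 750, and six of the nine queries land above the maximum of the bitmap being queried. Whether the real datasets behave this way was not verified in the discussion.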

@lemire
Member

lemire commented Apr 22, 2026

@andreigudkov That's correct. The benchmarks are not great.

@lemire
Member

lemire commented Apr 22, 2026

The BEST_TIME macro is old crap.

@lemire
Member

lemire commented Apr 22, 2026

Note that I did not forget this PR.

@lemire
Member

lemire commented Apr 24, 2026

Before

> sudo ./build/benchmarks/benchmarkref --filter contains_quartiles
| benchmark                                              |      ns/op |     cyc/op |    GHz |     ins/op |  ins/cyc |     brm/op |    miss/op |
|--------------------------------------------------------|-----------:|-----------:|-------:|-----------:|---------:|-----------:|-----------:|
| real_bitmaps/contains_quartiles/census-income          |       5.70 |      25.76 |   4.52 |     128.03 |     4.97 |     0.0000 |     4.6365 |
| real_bitmaps/contains_quartiles/census-income_srt      |       6.01 |      27.15 |   4.52 |     114.94 |     4.23 |     0.0017 |     3.0653 |
| real_bitmaps/contains_quartiles/census1881             |       2.04 |       9.21 |   4.53 |      51.85 |     5.63 |     0.0033 |     0.0289 |
| real_bitmaps/contains_quartiles/census1881_srt         |       2.47 |      11.18 |   4.52 |      54.30 |     4.86 |     0.0001 |     0.0477 |
| real_bitmaps/contains_quartiles/uscensus2000           |       1.99 |       9.00 |   4.53 |      45.17 |     5.02 |     0.0083 |     0.0004 |
| real_bitmaps/contains_quartiles/weather_sept_85        |       6.17 |      27.87 |   4.52 |     126.25 |     4.53 |     0.0001 |     4.9091 |
| real_bitmaps/contains_quartiles/weather_sept_85_srt    |       5.06 |      22.89 |   4.52 |     106.66 |     4.66 |     0.0000 |     2.8324 |
| real_bitmaps/contains_quartiles/wikileaks-noquotes     |       3.41 |      15.43 |   4.52 |      68.64 |     4.45 |     0.0001 |     0.4939 |
| real_bitmaps/contains_quartiles/wikileaks-noquotes_srt |       2.81 |      12.73 |   4.53 |      56.73 |     4.46 |     0.0000 |     0.0457 |

After

> sudo ./build/benchmarks/benchmark --filter contains_quartiles

| benchmark                                              |      ns/op |     cyc/op |    GHz |     ins/op |  ins/cyc |     brm/op |    miss/op |
|--------------------------------------------------------|-----------:|-----------:|-------:|-----------:|---------:|-----------:|-----------:|
| real_bitmaps/contains_quartiles/census-income          |       4.65 |      21.02 |   4.52 |     108.42 |     5.16 |     0.0000 |     3.7335 |
| real_bitmaps/contains_quartiles/census-income_srt      |       5.35 |      24.18 |   4.52 |     107.21 |     4.43 |     0.0017 |     2.6187 |
| real_bitmaps/contains_quartiles/census1881             |       1.48 |       6.71 |   4.53 |      44.97 |     6.71 |     0.0000 |     0.0013 |
| real_bitmaps/contains_quartiles/census1881_srt         |       2.09 |       9.44 |   4.52 |      50.24 |     5.32 |     0.0017 |     0.0027 |
| real_bitmaps/contains_quartiles/uscensus2000           |       1.30 |       5.89 |   4.54 |      40.15 |     6.82 |     0.0001 |     0.0021 |
| real_bitmaps/contains_quartiles/weather_sept_85        |       4.86 |      21.97 |   4.52 |     104.38 |     4.75 |     0.0017 |     5.0421 |
| real_bitmaps/contains_quartiles/weather_sept_85_srt    |       4.57 |      20.64 |   4.52 |      98.39 |     4.77 |     0.0017 |     2.8745 |
| real_bitmaps/contains_quartiles/wikileaks-noquotes     |       3.16 |      14.26 |   4.52 |      65.69 |     4.61 |     0.0050 |     0.4766 |
| real_bitmaps/contains_quartiles/wikileaks-noquotes_srt |       2.36 |      10.65 |   4.52 |      53.50 |     5.02 |     0.0000 |     0.0204 |

Before:

> sudo ./build/benchmarks/benchmarkref --filter array_container/
| benchmark                                    |      ns/op |     cyc/op |    GHz |     ins/op |  ins/cyc |     brm/op |    miss/op |
|----------------------------------------------|-----------:|-----------:|-------:|-----------:|---------:|-----------:|-----------:|
| array_container/add                          |       0.45 |       2.03 |   4.52 |      21.05 |    10.38 |     0.0001 |     0.0040 |
| array_container/contains_all_u16             |      15.40 |      69.20 |   4.49 |     175.00 |     2.53 |     1.5036 |     0.0001 |
| array_container/remove                       |     182.08 |     820.47 |   4.51 |    4242.94 |     5.17 |     0.9525 |     0.0582 |
| array_container/to_uint32_array              |       0.23 |       1.06 |   4.59 |       8.01 |     7.53 |     0.0000 |     0.0002 |
| array_container/contains_random_prefetch     |     216.55 |     980.25 |   4.53 |    3605.13 |     3.68 |     7.4082 |     0.0684 |
| array_container/contains_random_flush        |      13.52 |      61.15 |   4.52 |     174.78 |     2.86 |     0.7004 |     0.0045 |
| array_container/union_stride3_stride5        |       0.29 |       1.26 |   4.32 |      10.38 |     8.27 |     0.0000 |     0.0004 |
| array_container/intersection_stride3_stride5 |       0.25 |       1.13 |   4.49 |       7.50 |     6.65 |     0.0000 |     0.0002 |
| array_container/union_stride16_pow2          |       0.18 |       0.81 |   4.52 |       6.22 |     7.65 |     0.0015 |     0.0000 |
| array_container/intersection_stride16_pow2   |       0.02 |       0.09 |   4.53 |       0.34 |     3.60 |     0.0000 |     0.0000 |

After

> sudo ./build/benchmarks/benchmark --filter array_container/
| benchmark                                    |      ns/op |     cyc/op |    GHz |     ins/op |  ins/cyc |     brm/op |    miss/op |
|----------------------------------------------|-----------:|-----------:|-------:|-----------:|---------:|-----------:|-----------:|
| array_container/add                          |       0.45 |       2.02 |   4.52 |      21.04 |    10.42 |     0.0001 |     0.0018 |
| array_container/contains_all_u16             |      10.67 |      48.00 |   4.50 |     144.38 |     3.01 |     0.7853 |     0.0008 |
| array_container/remove                       |     182.76 |     822.47 |   4.50 |    4243.37 |     5.16 |     0.9366 |     0.1345 |
| array_container/to_uint32_array              |       0.23 |       1.05 |   4.53 |       8.01 |     7.64 |     0.0000 |     0.0003 |
| array_container/contains_random_prefetch     |     204.55 |     926.21 |   4.53 |    3576.83 |     3.86 |     5.6064 |     1.9395 |
| array_container/contains_random_flush        |      10.43 |      47.22 |   4.53 |     144.64 |     3.06 |     0.4304 |     0.0106 |
| array_container/union_stride3_stride5        |       0.28 |       1.23 |   4.43 |      10.38 |     8.40 |     0.0000 |     0.0001 |
| array_container/intersection_stride3_stride5 |       0.26 |       1.14 |   4.41 |       7.50 |     6.60 |     0.0000 |     0.0002 |
| array_container/union_stride16_pow2          |       0.18 |       0.81 |   4.51 |       6.22 |     7.67 |     0.0015 |     0.0000 |
| array_container/intersection_stride16_pow2   |       0.02 |       0.09 |   4.53 |       0.34 |     3.61 |     0.0000 |     0.0000 |

@lemire
Member

lemire commented Apr 24, 2026

It is a clear win with the updated benchmarks.

So merging.

@lemire lemire merged commit 4234f13 into RoaringBitmap:master Apr 24, 2026
18 of 19 checks passed