
Implement parallel OR and XOR #211

Merged
merged 2 commits into RoaringBitmap:master from richardstartin:parallel on Mar 7, 2018

Conversation

6 participants
@richardstartin
Contributor

richardstartin commented Feb 3, 2018

This PR provides a way to execute arbitrary boolean reductions on bitmaps in parallel.

Parallel AND and ANDNOT were implemented but found to be unprofitable. They can be implemented as circuits if necessary.
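
For orientation, here is a minimal usage sketch of the API this change introduces, based on the method names used later in this thread (ParallelAggregation.or and ParallelAggregation.xor); the package location and exact signatures are assumptions, so treat it as illustrative rather than definitive.

  import org.roaringbitmap.RoaringBitmap;
  // assumed location of the new class; adjust to wherever the PR actually places it
  import org.roaringbitmap.ParallelAggregation;

  public class ParallelAggregationExample {
    public static void main(String[] args) {
      RoaringBitmap a = RoaringBitmap.bitmapOf(1, 2, 3, 1 << 20);
      RoaringBitmap b = RoaringBitmap.bitmapOf(3, 4, 5, 1 << 21);
      RoaringBitmap c = RoaringBitmap.bitmapOf(5, 6, 7, 1 << 22);

      // Both reductions run as fork-join tasks in the common pool.
      RoaringBitmap union = ParallelAggregation.or(a, b, c);
      RoaringBitmap symmetricDifference = ParallelAggregation.xor(a, b, c);

      System.out.println(union.getCardinality());               // 10
      System.out.println(symmetricDifference.getCardinality());
    }
  }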

@coveralls


coveralls commented Feb 3, 2018


Coverage increased (+0.3%) to 91.059% when pulling 3a3510f on richardstartin:parallel into 6aca167 on RoaringBitmap:master.

@richardstartin richardstartin changed the title from Implement parallel OR, XOR and boolean circuits - don't merge to Implement parallel OR, XOR and boolean circuits Feb 5, 2018

@lemire


Member

lemire commented Feb 6, 2018

We need at least one person to review this and I am too busy.

Pinging @owenkaser @okrische @maciej

@richardstartin


Contributor

richardstartin commented Feb 6, 2018

I'm not likely to push very hard to get this merged - I can leave this here in case someone wants to use the code or improve it.

@lemire


Member

lemire commented Feb 6, 2018

It is interesting. I think @owenkaser would love this stuff... but I don't know if he has time.

@richardstartin


Contributor

richardstartin commented Feb 6, 2018

BTW, I started a repo to do exactly this and only this; I'm using the containers from RoaringBitmap inside a different index.

@maciej


Member

maciej commented Feb 9, 2018

Over the week I came up and implemented a draft version of another approach to parallel roaring aggregations, described in brief here: RoaringBitmap/roaring#140 (code is here: RoaringBitmap/roaring@master...maciej:par2)

It does less work upfront (before the parallel aggregation starts) at the cost of possible work-unit size skew.
I think this approach might be better suited to Java 8 parallel streams. Unfortunately, I don't think I'll have the time to implement it in Java anytime soon.

@richardstartin


Contributor

richardstartin commented Feb 9, 2018

If you implement it and benchmark it against the code here (perhaps considering cases like evaluation of arbitrary boolean circuits too) it would be a useful point of comparison. Do you have measurements of the throughput multiplier (parallel throughput / single threaded throughput) for each of the real data sets in Go?

@lemire


Member

lemire commented Feb 10, 2018

I might end up reviewing this code, if nobody else has time, but it won't be next week nor probably the week after that.

@richardstartin


Contributor

richardstartin commented Feb 27, 2018

I took another look at this and tuned the initial grouping method (groupByKey); I think this is in a fit state to review. The upfront work is cheap: its contribution to the overall cost of the aggregation is very small, and doing it results in balanced tasks, as can be seen from the multiplicative improvements (parallel* vs fast*).

I removed the circuits stuff.

These methods execute in the default fork-join pool rather than managing a thread pool on behalf of the user. To execute inside a controlled thread pool, the user can run an expression like RoaringBitmap result = myExecutor.submit(() -> ParallelAggregation.or(bitmaps)).get();
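
To make the thread-pool point concrete, here is a hypothetical wrapper (my own sketch, not code from this PR) that confines the aggregation to a dedicated pool. It relies on the documented ForkJoinPool behaviour that fork-join work started from one of a pool's worker threads stays in that pool; the pool size of 8 is an arbitrary example.

  import java.util.concurrent.ExecutionException;
  import java.util.concurrent.ForkJoinPool;
  import org.roaringbitmap.RoaringBitmap;
  // ParallelAggregation is assumed to be on the classpath (same package assumption as above)

  public final class PooledOr {
    // Dedicated pool so a large aggregation cannot saturate the common pool.
    private static final ForkJoinPool AGGREGATION_POOL = new ForkJoinPool(8);

    public static RoaringBitmap or(RoaringBitmap... bitmaps)
        throws InterruptedException, ExecutionException {
      // The lambda runs on an AGGREGATION_POOL worker, so the parallel reduction
      // inside ParallelAggregation.or executes in that pool as well.
      return AGGREGATION_POOL.submit(() -> ParallelAggregation.or(bitmaps)).get();
    }
  }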

Summary:

| Operation | Dataset | Fast Score | Parallel Score | Ratio |
|---|---|---|---|---|
| or | census-income | 2186.408 | 590.787 | 3.700839727 |
| or | census1881 | 3039.89 | 590.113 | 5.151369314 |
| or | dimension_008 | 7815.298 | 1948.751 | 4.01041385 |
| or | dimension_003 | 27330.666 | 1742.342 | 15.68616609 |
| or | dimension_033 | 3816.513 | 592.466 | 6.441741805 |
| or | uscensus2000 | 415.614 | 191.871 | 2.166111606 |
| or | weather_sept_85 | 6695.696 | 1325.299 | 5.052215387 |
| or | wikileaks-noquotes | 2832.173 | 400.455 | 7.072387659 |
| or | census-income_srt | 1577.818 | 514.695 | 3.065539786 |
| or | census1881_srt | 2498.874 | 621.687 | 4.019504992 |
| or | weather_sept_85_srt | 4389.666 | 602.811 | 7.281993859 |
| or | wikileaks-noquotes_srt | 1854.733 | 297.483 | 6.23475291 |
| xor | census-income | 4953.7 | 2282.418 | 2.1703737 |
| xor | census1881 | 12470.463 | 1911.917 | 6.522491824 |
| xor | dimension_008 | 14941.415 | 3078.301 | 4.853786228 |
| xor | dimension_003 | 67537.161 | 2839.221 | 23.78721523 |
| xor | dimension_033 | 7543.789 | 885.083 | 8.523256011 |
| xor | uscensus2000 | 769.519 | 185.91 | 4.139201764 |
| xor | weather_sept_85 | 15564.005 | 2199.873 | 7.074956145 |
| xor | wikileaks-noquotes | 5692.782 | 948.319 | 6.003024299 |
| xor | census-income_srt | 3635.227 | 1121.972 | 3.240033619 |
| xor | census1881_srt | 3140.666 | 1403.689 | 2.23743721 |
| xor | weather_sept_85_srt | 8875.121 | 1068.565 | 8.305644486 |
| xor | wikileaks-noquotes_srt | 3153.011 | 827.203 | 3.811653246 |

Raw:

Benchmark                                             (dataset)  Mode  Cnt      Score       Error  Units
ParallelAggregatorBenchmark.fastOr                census-income  avgt    5   2186.408 ±   396.156  us/op
ParallelAggregatorBenchmark.fastOr                   census1881  avgt    5   3039.890 ±   583.813  us/op
ParallelAggregatorBenchmark.fastOr                dimension_008  avgt    5   7815.298 ±  1650.698  us/op
ParallelAggregatorBenchmark.fastOr                dimension_003  avgt    5  27330.666 ±  7684.766  us/op
ParallelAggregatorBenchmark.fastOr                dimension_033  avgt    5   3816.513 ±   929.668  us/op
ParallelAggregatorBenchmark.fastOr                 uscensus2000  avgt    5    415.614 ±    29.042  us/op
ParallelAggregatorBenchmark.fastOr              weather_sept_85  avgt    5   6695.696 ±  1511.225  us/op
ParallelAggregatorBenchmark.fastOr           wikileaks-noquotes  avgt    5   2832.173 ±   564.162  us/op
ParallelAggregatorBenchmark.fastOr            census-income_srt  avgt    5   1577.818 ±   487.531  us/op
ParallelAggregatorBenchmark.fastOr               census1881_srt  avgt    5   2498.874 ±   361.164  us/op
ParallelAggregatorBenchmark.fastOr          weather_sept_85_srt  avgt    5   4389.666 ±   543.126  us/op
ParallelAggregatorBenchmark.fastOr       wikileaks-noquotes_srt  avgt    5   1854.733 ±   173.413  us/op
ParallelAggregatorBenchmark.fastXor               census-income  avgt    5   4953.700 ±  1261.132  us/op
ParallelAggregatorBenchmark.fastXor                  census1881  avgt    5  12470.463 ±  4129.793  us/op
ParallelAggregatorBenchmark.fastXor               dimension_008  avgt    5  14941.415 ±  7312.867  us/op
ParallelAggregatorBenchmark.fastXor               dimension_003  avgt    5  67537.161 ± 29808.591  us/op
ParallelAggregatorBenchmark.fastXor               dimension_033  avgt    5   7543.789 ±   189.345  us/op
ParallelAggregatorBenchmark.fastXor                uscensus2000  avgt    5    769.519 ±    10.626  us/op
ParallelAggregatorBenchmark.fastXor             weather_sept_85  avgt    5  15564.005 ±   207.713  us/op
ParallelAggregatorBenchmark.fastXor          wikileaks-noquotes  avgt    5   5692.782 ±    63.245  us/op
ParallelAggregatorBenchmark.fastXor           census-income_srt  avgt    5   3635.227 ±    50.380  us/op
ParallelAggregatorBenchmark.fastXor              census1881_srt  avgt    5   3140.666 ±   347.140  us/op
ParallelAggregatorBenchmark.fastXor         weather_sept_85_srt  avgt    5   8875.121 ±  1981.947  us/op
ParallelAggregatorBenchmark.fastXor      wikileaks-noquotes_srt  avgt    5   3153.011 ±   327.130  us/op
ParallelAggregatorBenchmark.groupByKey            census-income  avgt    5     15.063 ±     2.408  us/op
ParallelAggregatorBenchmark.groupByKey               census1881  avgt    5     33.225 ±     0.310  us/op
ParallelAggregatorBenchmark.groupByKey            dimension_008  avgt    5    202.036 ±     9.271  us/op
ParallelAggregatorBenchmark.groupByKey            dimension_003  avgt    5    471.520 ±   233.010  us/op
ParallelAggregatorBenchmark.groupByKey            dimension_033  avgt    5     40.107 ±    10.954  us/op
ParallelAggregatorBenchmark.groupByKey             uscensus2000  avgt    5     94.619 ±     0.726  us/op
ParallelAggregatorBenchmark.groupByKey          weather_sept_85  avgt    5     50.692 ±     0.629  us/op
ParallelAggregatorBenchmark.groupByKey       wikileaks-noquotes  avgt    5     37.157 ±     0.394  us/op
ParallelAggregatorBenchmark.groupByKey        census-income_srt  avgt    5     13.912 ±     1.946  us/op
ParallelAggregatorBenchmark.groupByKey           census1881_srt  avgt    5     48.620 ±     2.146  us/op
ParallelAggregatorBenchmark.groupByKey      weather_sept_85_srt  avgt    5     45.402 ±    15.315  us/op
ParallelAggregatorBenchmark.groupByKey   wikileaks-noquotes_srt  avgt    5     32.599 ±     9.463  us/op
ParallelAggregatorBenchmark.parallelOr            census-income  avgt    5    590.787 ±   410.733  us/op
ParallelAggregatorBenchmark.parallelOr               census1881  avgt    5    590.113 ±    81.046  us/op
ParallelAggregatorBenchmark.parallelOr            dimension_008  avgt    5   1948.751 ±    29.713  us/op
ParallelAggregatorBenchmark.parallelOr            dimension_003  avgt    5   1742.342 ±    24.026  us/op
ParallelAggregatorBenchmark.parallelOr            dimension_033  avgt    5    592.466 ±    11.182  us/op
ParallelAggregatorBenchmark.parallelOr             uscensus2000  avgt    5    191.871 ±    16.775  us/op
ParallelAggregatorBenchmark.parallelOr          weather_sept_85  avgt    5   1325.299 ±    89.872  us/op
ParallelAggregatorBenchmark.parallelOr       wikileaks-noquotes  avgt    5    400.455 ±     4.644  us/op
ParallelAggregatorBenchmark.parallelOr        census-income_srt  avgt    5    514.695 ±    34.790  us/op
ParallelAggregatorBenchmark.parallelOr           census1881_srt  avgt    5    621.687 ±    15.332  us/op
ParallelAggregatorBenchmark.parallelOr      weather_sept_85_srt  avgt    5    602.811 ±    56.558  us/op
ParallelAggregatorBenchmark.parallelOr   wikileaks-noquotes_srt  avgt    5    297.483 ±     4.345  us/op
ParallelAggregatorBenchmark.parallelXor           census-income  avgt    5   2282.418 ±   416.933  us/op
ParallelAggregatorBenchmark.parallelXor              census1881  avgt    5   1911.917 ±    51.065  us/op
ParallelAggregatorBenchmark.parallelXor           dimension_008  avgt    5   3078.301 ±    21.038  us/op
ParallelAggregatorBenchmark.parallelXor           dimension_003  avgt    5   2839.221 ±    27.686  us/op
ParallelAggregatorBenchmark.parallelXor           dimension_033  avgt    5    885.083 ±    48.304  us/op
ParallelAggregatorBenchmark.parallelXor            uscensus2000  avgt    5    185.910 ±     6.636  us/op
ParallelAggregatorBenchmark.parallelXor         weather_sept_85  avgt    5   2199.873 ±    77.680  us/op
ParallelAggregatorBenchmark.parallelXor      wikileaks-noquotes  avgt    5    948.319 ±    16.758  us/op
ParallelAggregatorBenchmark.parallelXor       census-income_srt  avgt    5   1121.972 ±    51.130  us/op
ParallelAggregatorBenchmark.parallelXor          census1881_srt  avgt    5   1403.689 ±    33.235  us/op
ParallelAggregatorBenchmark.parallelXor     weather_sept_85_srt  avgt    5   1068.565 ±   307.331  us/op
ParallelAggregatorBenchmark.parallelXor  wikileaks-noquotes_srt  avgt    5    827.203 ±    17.198  us/op

@richardstartin richardstartin changed the title from Implement parallel OR, XOR and boolean circuits to Implement parallel OR and XOR Feb 28, 2018

@richardstartin


Contributor

richardstartin commented Feb 28, 2018

Ported to the buffer package - similar story:

| Operation | Dataset | Fast Score | Parallel Score | Ratio |
|---|---|---|---|---|
| or | census-income | 1689.988 | 578.129 | 2.923202261 |
| or | census1881 | 3212.083 | 603.321 | 5.324003308 |
| or | dimension_008 | 7389.3 | 2039.142 | 3.623729981 |
| or | dimension_003 | 27540.562 | 1802.796 | 15.2765826 |
| or | dimension_033 | 3604.492 | 598.17 | 6.025865557 |
| or | uscensus2000 | 469.418 | 188.369 | 2.492013017 |
| or | weather_sept_85 | 6393.732 | 1355.158 | 4.718071251 |
| or | wikileaks-noquotes | 3010.916 | 426.046 | 7.067114819 |
| or | census-income_srt | 1530.562 | 420.987 | 3.635651457 |
| or | census1881_srt | 2353.514 | 654.602 | 3.595335792 |
| or | weather_sept_85_srt | 3995.665 | 624.395 | 6.399258482 |
| or | wikileaks-noquotes_srt | 1760.478 | 312.241 | 5.638202542 |
| xor | census-income | 4916.101 | 2140.669 | 2.296525525 |
| xor | census1881 | 13000.856 | 2159.061 | 6.021532509 |
| xor | dimension_008 | 15181.078 | 3151.216 | 4.817530122 |
| xor | dimension_003 | 65379.065 | 2914.92 | 22.42911126 |
| xor | dimension_033 | 7409.519 | 929.755 | 7.969324177 |
| xor | uscensus2000 | 1004.825 | 191.123 | 5.257478169 |
| xor | weather_sept_85 | 15803.065 | 2221.875 | 7.112490577 |
| xor | wikileaks-noquotes | 5780.093 | 1001.636 | 5.770652213 |
| xor | census-income_srt | 3737.726 | 1122.103 | 3.331000808 |
| xor | census1881_srt | 3125.041 | 1424.627 | 2.193585409 |
| xor | weather_sept_85_srt | 8415.521 | 1232.288 | 6.8291836 |
| xor | wikileaks-noquotes_srt | 3007.755 | 1041.546 | 2.88777932 |

Raw:

Benchmark                                                   (dataset)  Mode  Cnt      Score       Error  Units
ParallelAggregatorBenchmark.bufferFastOr                census-income  avgt    5   1689.988 ±    28.747  us/op
ParallelAggregatorBenchmark.bufferFastOr                   census1881  avgt    5   3212.083 ±   200.296  us/op
ParallelAggregatorBenchmark.bufferFastOr                dimension_008  avgt    5   7389.300 ±   130.547  us/op
ParallelAggregatorBenchmark.bufferFastOr                dimension_003  avgt    5  27540.562 ± 10296.860  us/op
ParallelAggregatorBenchmark.bufferFastOr                dimension_033  avgt    5   3604.492 ±    31.472  us/op
ParallelAggregatorBenchmark.bufferFastOr                 uscensus2000  avgt    5    469.418 ±     7.070  us/op
ParallelAggregatorBenchmark.bufferFastOr              weather_sept_85  avgt    5   6393.732 ±   161.357  us/op
ParallelAggregatorBenchmark.bufferFastOr           wikileaks-noquotes  avgt    5   3010.916 ±    40.259  us/op
ParallelAggregatorBenchmark.bufferFastOr            census-income_srt  avgt    5   1530.562 ±    17.559  us/op
ParallelAggregatorBenchmark.bufferFastOr               census1881_srt  avgt    5   2353.514 ±    41.470  us/op
ParallelAggregatorBenchmark.bufferFastOr          weather_sept_85_srt  avgt    5   3995.665 ±    39.720  us/op
ParallelAggregatorBenchmark.bufferFastOr       wikileaks-noquotes_srt  avgt    5   1760.478 ±    21.939  us/op
ParallelAggregatorBenchmark.bufferFastXor               census-income  avgt    5   4916.101 ±  1075.463  us/op
ParallelAggregatorBenchmark.bufferFastXor                  census1881  avgt    5  13000.856 ±   111.705  us/op
ParallelAggregatorBenchmark.bufferFastXor               dimension_008  avgt    5  15181.078 ±  1259.014  us/op
ParallelAggregatorBenchmark.bufferFastXor               dimension_003  avgt    5  65379.065 ± 15856.760  us/op
ParallelAggregatorBenchmark.bufferFastXor               dimension_033  avgt    5   7409.519 ±   308.058  us/op
ParallelAggregatorBenchmark.bufferFastXor                uscensus2000  avgt    5   1004.825 ±    13.447  us/op
ParallelAggregatorBenchmark.bufferFastXor             weather_sept_85  avgt    5  15803.065 ±   146.089  us/op
ParallelAggregatorBenchmark.bufferFastXor          wikileaks-noquotes  avgt    5   5780.093 ±    65.436  us/op
ParallelAggregatorBenchmark.bufferFastXor           census-income_srt  avgt    5   3737.726 ±    33.094  us/op
ParallelAggregatorBenchmark.bufferFastXor              census1881_srt  avgt    5   3125.041 ±    37.007  us/op
ParallelAggregatorBenchmark.bufferFastXor         weather_sept_85_srt  avgt    5   8415.521 ±   106.932  us/op
ParallelAggregatorBenchmark.bufferFastXor      wikileaks-noquotes_srt  avgt    5   3007.755 ±    22.935  us/op
ParallelAggregatorBenchmark.bufferGroupByKey            census-income  avgt    5     15.443 ±     2.994  us/op
ParallelAggregatorBenchmark.bufferGroupByKey               census1881  avgt    5     33.224 ±     0.325  us/op
ParallelAggregatorBenchmark.bufferGroupByKey            dimension_008  avgt    5    199.027 ±     9.035  us/op
ParallelAggregatorBenchmark.bufferGroupByKey            dimension_003  avgt    5    386.616 ±    74.411  us/op
ParallelAggregatorBenchmark.bufferGroupByKey            dimension_033  avgt    5     39.495 ±     5.876  us/op
ParallelAggregatorBenchmark.bufferGroupByKey             uscensus2000  avgt    5     91.571 ±     1.407  us/op
ParallelAggregatorBenchmark.bufferGroupByKey          weather_sept_85  avgt    5     52.130 ±     0.581  us/op
ParallelAggregatorBenchmark.bufferGroupByKey       wikileaks-noquotes  avgt    5     39.188 ±     0.255  us/op
ParallelAggregatorBenchmark.bufferGroupByKey        census-income_srt  avgt    5     12.358 ±     0.124  us/op
ParallelAggregatorBenchmark.bufferGroupByKey           census1881_srt  avgt    5     48.445 ±     0.343  us/op
ParallelAggregatorBenchmark.bufferGroupByKey      weather_sept_85_srt  avgt    5     37.379 ±     0.287  us/op
ParallelAggregatorBenchmark.bufferGroupByKey   wikileaks-noquotes_srt  avgt    5     30.380 ±     0.201  us/op
ParallelAggregatorBenchmark.bufferParallelOr            census-income  avgt    5    578.129 ±    35.053  us/op
ParallelAggregatorBenchmark.bufferParallelOr               census1881  avgt    5    603.321 ±     5.164  us/op
ParallelAggregatorBenchmark.bufferParallelOr            dimension_008  avgt    5   2039.142 ±    14.237  us/op
ParallelAggregatorBenchmark.bufferParallelOr            dimension_003  avgt    5   1802.796 ±   128.352  us/op
ParallelAggregatorBenchmark.bufferParallelOr            dimension_033  avgt    5    598.170 ±     1.580  us/op
ParallelAggregatorBenchmark.bufferParallelOr             uscensus2000  avgt    5    188.369 ±     4.252  us/op
ParallelAggregatorBenchmark.bufferParallelOr          weather_sept_85  avgt    5   1355.158 ±    82.169  us/op
ParallelAggregatorBenchmark.bufferParallelOr       wikileaks-noquotes  avgt    5    426.046 ±    10.051  us/op
ParallelAggregatorBenchmark.bufferParallelOr        census-income_srt  avgt    5    420.987 ±    60.501  us/op
ParallelAggregatorBenchmark.bufferParallelOr           census1881_srt  avgt    5    654.602 ±    10.360  us/op
ParallelAggregatorBenchmark.bufferParallelOr      weather_sept_85_srt  avgt    5    624.395 ±   227.496  us/op
ParallelAggregatorBenchmark.bufferParallelOr   wikileaks-noquotes_srt  avgt    5    312.241 ±     2.179  us/op
ParallelAggregatorBenchmark.bufferParallelXor           census-income  avgt    5   2140.669 ±   246.233  us/op
ParallelAggregatorBenchmark.bufferParallelXor              census1881  avgt    5   2159.061 ±    31.341  us/op
ParallelAggregatorBenchmark.bufferParallelXor           dimension_008  avgt    5   3151.216 ±    25.534  us/op
ParallelAggregatorBenchmark.bufferParallelXor           dimension_003  avgt    5   2914.920 ±   553.993  us/op
ParallelAggregatorBenchmark.bufferParallelXor           dimension_033  avgt    5    929.755 ±    48.649  us/op
ParallelAggregatorBenchmark.bufferParallelXor            uscensus2000  avgt    5    191.123 ±     1.956  us/op
ParallelAggregatorBenchmark.bufferParallelXor         weather_sept_85  avgt    5   2221.875 ±   152.606  us/op
ParallelAggregatorBenchmark.bufferParallelXor      wikileaks-noquotes  avgt    5   1001.636 ±   258.027  us/op
ParallelAggregatorBenchmark.bufferParallelXor       census-income_srt  avgt    5   1122.103 ±    15.700  us/op
ParallelAggregatorBenchmark.bufferParallelXor          census1881_srt  avgt    5   1424.627 ±    13.187  us/op
ParallelAggregatorBenchmark.bufferParallelXor     weather_sept_85_srt  avgt    5   1232.288 ±    72.835  us/op
ParallelAggregatorBenchmark.bufferParallelXor  wikileaks-noquotes_srt  avgt    5   1041.546 ±    27.056  us/op

@lemire


Member

lemire commented Feb 28, 2018

This version looks superb to me.

@richardstartin What is your core count?

@maciej Do you want to have a look before we merge?

@richardstartin


Contributor

richardstartin commented Feb 28, 2018

4 cores, 8 hardware threads with hyperthreading. So some of the results are a bit funny. Perhaps there are cases where lining the containers up first is beneficial.

@richardstartin


Contributor

richardstartin commented Mar 3, 2018

I just investigated why some of the improvements are so high: it's possible to aggregate bitmaps a lot faster on a single thread. ParallelAggregation.groupByKey is quite cheap, and once you have its result, you know how big the RoaringArray should be. This code (quickly written, not tested etc.) is much faster than FastAggregation.priorityqueue_or:

  /**
   * Groups the containers before aggregation.
   *
   * @param bitmaps input bitmaps
   * @return aggregated bitmap
   */
  public static RoaringBitmap groupedOr(RoaringBitmap... bitmaps) {
    SortedMap<Short, List<Container>> grouped = ParallelAggregation.groupByKey(bitmaps);
    short[] keys = new short[grouped.size()];
    Container[] containers = new Container[grouped.size()];
    int position = 0;
    for (Map.Entry<Short, List<Container>> slice : grouped.entrySet()) {
      List<Container> cs = slice.getValue();
      Container reduced = cs.get(0).clone();
      for (int i = 1; i < cs.size(); ++i) {
        reduced = reduced.lazyIOR(cs.get(i));
      }
      keys[position] = slice.getKey();
      containers[position] = reduced.repairAfterLazy();
      ++position;
    }
    RoaringArray hlc = new RoaringArray(keys, containers);
    return new RoaringBitmap(hlc);
  }
Benchmark                                          (dataset)  Mode  Cnt      Score      Error  Units
ParallelAggregatorBenchmark.fastOr             census-income  avgt    5   1762.007 ±   17.951  us/op
ParallelAggregatorBenchmark.fastOr                census1881  avgt    5   2836.202 ±   51.495  us/op
ParallelAggregatorBenchmark.fastOr             dimension_008  avgt    5   6975.359 ±  103.086  us/op
ParallelAggregatorBenchmark.fastOr             dimension_003  avgt    5  26886.985 ± 8719.185  us/op
ParallelAggregatorBenchmark.fastOr             dimension_033  avgt    5   3611.131 ±  108.956  us/op
ParallelAggregatorBenchmark.fastOr              uscensus2000  avgt    5    408.797 ±    5.133  us/op
ParallelAggregatorBenchmark.fastOr           weather_sept_85  avgt    5   6368.642 ±  110.388  us/op
ParallelAggregatorBenchmark.fastOr        wikileaks-noquotes  avgt    5   2720.545 ±   28.647  us/op
ParallelAggregatorBenchmark.fastOr         census-income_srt  avgt    5   1494.022 ±    7.070  us/op
ParallelAggregatorBenchmark.fastOr            census1881_srt  avgt    5   2364.890 ±   22.560  us/op
ParallelAggregatorBenchmark.fastOr       weather_sept_85_srt  avgt    5   4031.584 ±   67.850  us/op
ParallelAggregatorBenchmark.fastOr    wikileaks-noquotes_srt  avgt    5   1756.808 ±   10.443  us/op
ParallelAggregatorBenchmark.fasterOr           census-income  avgt    5    881.500 ±   29.317  us/op
ParallelAggregatorBenchmark.fasterOr              census1881  avgt    5   1680.120 ±   18.386  us/op
ParallelAggregatorBenchmark.fasterOr           dimension_008  avgt    5   2775.348 ±  641.789  us/op
ParallelAggregatorBenchmark.fasterOr           dimension_003  avgt    5   5607.536 ±  744.890  us/op
ParallelAggregatorBenchmark.fasterOr           dimension_033  avgt    5   2319.855 ±  361.051  us/op
ParallelAggregatorBenchmark.fasterOr            uscensus2000  avgt    5    282.189 ±    3.661  us/op
ParallelAggregatorBenchmark.fasterOr         weather_sept_85  avgt    5   4383.067 ±   65.590  us/op
ParallelAggregatorBenchmark.fasterOr      wikileaks-noquotes  avgt    5    790.273 ±   14.556  us/op
ParallelAggregatorBenchmark.fasterOr       census-income_srt  avgt    5    822.005 ±  153.132  us/op
ParallelAggregatorBenchmark.fasterOr          census1881_srt  avgt    5   1801.195 ±   14.127  us/op
ParallelAggregatorBenchmark.fasterOr     weather_sept_85_srt  avgt    5   2059.228 ±  147.410  us/op
ParallelAggregatorBenchmark.fasterOr  wikileaks-noquotes_srt  avgt    5    846.731 ±   61.631  us/op

In fact, benchmarking the method above against the proposed parallel implementation reveals that the majority of the improvement is not necessarily because of parallelism.

| Dataset | Score | Parallel Score | Ratio |
|---|---|---|---|
| census-income | 891.872 | 482.788 | 1.847337 |
| census1881 | 1708.736 | 666.409 | 2.564095 |
| dimension_008 | 2892.662 | 2046.431 | 1.413516 |
| dimension_003 | 5997.619 | 1744.191 | 3.438625 |
| dimension_033 | 2396.151 | 622.017 | 3.852228 |
| uscensus2000 | 301.926 | 218.289 | 1.383148 |
| weather_sept_85 | 4783.59 | 1358.899 | 3.520195 |
| wikileaks-noquotes | 936.034 | 433.436 | 2.159567 |
| census-income_srt | 894.362 | 404.108 | 2.213176 |
| census1881_srt | 1900.356 | 694.883 | 2.734786 |
| weather_sept_85_srt | 2489.202 | 860.713 | 2.892023 |
| wikileaks-noquotes_srt | 847.99 | 340.343 | 2.491575 |
Benchmark                                            (dataset)  Mode  Cnt     Score      Error  Units
ParallelAggregatorBenchmark.fasterOr             census-income  avgt    5   891.872 ±   38.896  us/op
ParallelAggregatorBenchmark.fasterOr                census1881  avgt    5  1708.736 ±   21.811  us/op
ParallelAggregatorBenchmark.fasterOr             dimension_008  avgt    5  2892.662 ±  457.517  us/op
ParallelAggregatorBenchmark.fasterOr             dimension_003  avgt    5  5997.619 ± 1937.485  us/op
ParallelAggregatorBenchmark.fasterOr             dimension_033  avgt    5  2396.151 ±  366.040  us/op
ParallelAggregatorBenchmark.fasterOr              uscensus2000  avgt    5   301.926 ±   40.905  us/op
ParallelAggregatorBenchmark.fasterOr           weather_sept_85  avgt    5  4783.590 ±   44.695  us/op
ParallelAggregatorBenchmark.fasterOr        wikileaks-noquotes  avgt    5   936.034 ±  347.029  us/op
ParallelAggregatorBenchmark.fasterOr         census-income_srt  avgt    5   894.362 ±  145.087  us/op
ParallelAggregatorBenchmark.fasterOr            census1881_srt  avgt    5  1900.356 ±  167.163  us/op
ParallelAggregatorBenchmark.fasterOr       weather_sept_85_srt  avgt    5  2489.202 ±  656.373  us/op
ParallelAggregatorBenchmark.fasterOr    wikileaks-noquotes_srt  avgt    5   847.990 ±   45.905  us/op
ParallelAggregatorBenchmark.parallelOr           census-income  avgt    5   482.788 ±  100.495  us/op
ParallelAggregatorBenchmark.parallelOr              census1881  avgt    5   666.409 ±  529.329  us/op
ParallelAggregatorBenchmark.parallelOr           dimension_008  avgt    5  2046.431 ±  469.036  us/op
ParallelAggregatorBenchmark.parallelOr           dimension_003  avgt    5  1744.191 ±   53.774  us/op
ParallelAggregatorBenchmark.parallelOr           dimension_033  avgt    5   622.017 ±   56.898  us/op
ParallelAggregatorBenchmark.parallelOr            uscensus2000  avgt    5   218.289 ±   13.643  us/op
ParallelAggregatorBenchmark.parallelOr         weather_sept_85  avgt    5  1358.899 ±   95.831  us/op
ParallelAggregatorBenchmark.parallelOr      wikileaks-noquotes  avgt    5   433.436 ±    4.123  us/op
ParallelAggregatorBenchmark.parallelOr       census-income_srt  avgt    5   404.108 ±    8.688  us/op
ParallelAggregatorBenchmark.parallelOr          census1881_srt  avgt    5   694.883 ±   11.276  us/op
ParallelAggregatorBenchmark.parallelOr     weather_sept_85_srt  avgt    5   860.713 ±  315.753  us/op

FastAggregation itself could be a lot faster: this pull request already offers a big improvement, and with a little more effort the throughput could be improved dramatically.

@lemire


Member

lemire commented Mar 3, 2018

This is intriguing.

The PriorityQueue approach is generally not the best, so we'd like to compare against FastAggregation.or. Have you tried it?

You might be onto something. We never considered a hash-based algorithm such as the one you are using. What you are doing is obvious in retrospect but I think none of us tried it.

It has different complexity. So maybe you found a strategy worth pushing forward.

Can you report your results against both FastAggregation.or and FastAggregation.priorityqueue_or?

We should investigate this further.

@richardstartin


Contributor

richardstartin commented Mar 4, 2018

That's a good catch. I should have been using FastAggregation.or as a baseline.

With a different baseline, the picture isn't quite so rosy but is quite interesting. The grouped single threaded implementation is usually similar to FastAggregation.or but in one dataset (wikileaks-noquotes_srt) is half as fast, and in another (uscensus2000) is 4x faster. The parallel implementation ranges from slightly better to almost 6x better than FastAggregation.or. The far right column below is the multiple of each implementation's throughput relative to FastAggregation.or.

| Benchmark | Dataset | Mode | Cnt | Score | Error | Units | Ratio | Difference | Multiplier |
|---|---|---|---|---|---|---|---|---|---|
| ParallelAggregatorBenchmark.fastGroupedOr | census-income | avgt | 5 | 898.8 | 149 | us/op | 1.11 | 91.295 | 0.898426 |
| ParallelAggregatorBenchmark.fastGroupedOr | census1881 | avgt | 5 | 1678.97 | 20.2 | us/op | 0.94 | -101.428 | 1.060411 |
| ParallelAggregatorBenchmark.fastGroupedOr | dimension_008 | avgt | 5 | 2843.53 | 562 | us/op | 1.43 | 848.548 | 0.701586 |
| ParallelAggregatorBenchmark.fastGroupedOr | dimension_003 | avgt | 5 | 5438.267 | 589 | us/op | 0.97 | -159.089 | 1.029254 |
| ParallelAggregatorBenchmark.fastGroupedOr | dimension_033 | avgt | 5 | 2292.313 | 381 | us/op | 0.91 | -221.023 | 1.096419 |
| ParallelAggregatorBenchmark.fastGroupedOr | uscensus2000 | avgt | 5 | 276.933 | 2.07 | us/op | 0.25 | -816.211 | 3.947323 |
| ParallelAggregatorBenchmark.fastGroupedOr | weather_sept_85 | avgt | 5 | 4377.849 | 111 | us/op | 1.04 | 187.202 | 0.957239 |
| ParallelAggregatorBenchmark.fastGroupedOr | wikileaks-noquotes | avgt | 5 | 788.334 | 13.1 | us/op | 1.41 | 229.983 | 0.708267 |
| ParallelAggregatorBenchmark.fastGroupedOr | census-income_srt | avgt | 5 | 785.497 | 13.9 | us/op | 0.97 | -22.348 | 1.028451 |
| ParallelAggregatorBenchmark.fastGroupedOr | census1881_srt | avgt | 5 | 1774.21 | 7.6 | us/op | 1.55 | 627.833 | 0.646134 |
| ParallelAggregatorBenchmark.fastGroupedOr | weather_sept_85_srt | avgt | 5 | 2029.066 | 169 | us/op | 1.01 | 21.172 | 0.989566 |
| ParallelAggregatorBenchmark.fastGroupedOr | wikileaks-noquotes_srt | avgt | 5 | 793.957 | 9.34 | us/op | 2.24 | 440.015 | 0.445795 |
| ParallelAggregatorBenchmark.fastOr | census-income | avgt | 5 | 807.505 | 28.7 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | census1881 | avgt | 5 | 1780.398 | 138 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | dimension_008 | avgt | 5 | 1994.982 | 147 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | dimension_003 | avgt | 5 | 5597.356 | 450 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | dimension_033 | avgt | 5 | 2513.336 | 157 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | uscensus2000 | avgt | 5 | 1093.144 | 176 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | weather_sept_85 | avgt | 5 | 4190.647 | 805 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | wikileaks-noquotes | avgt | 5 | 558.351 | 8.18 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | census-income_srt | avgt | 5 | 807.845 | 29.3 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | census1881_srt | avgt | 5 | 1146.377 | 15.3 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | weather_sept_85_srt | avgt | 5 | 2007.894 | 63.5 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.fastOr | wikileaks-noquotes_srt | avgt | 5 | 353.942 | 2.34 | us/op | 1 | 0 | 1 |
| ParallelAggregatorBenchmark.parallelOr | census-income | avgt | 5 | 561.877 | 35.4 | us/op | 0.7 | -245.628 | 1.437156 |
| ParallelAggregatorBenchmark.parallelOr | census1881 | avgt | 5 | 549.337 | 13.2 | us/op | 0.31 | -1231.06 | 3.240994 |
| ParallelAggregatorBenchmark.parallelOr | dimension_008 | avgt | 5 | 1817.589 | 20.3 | us/op | 0.91 | -177.393 | 1.097598 |
| ParallelAggregatorBenchmark.parallelOr | dimension_003 | avgt | 5 | 1773.036 | 129 | us/op | 0.32 | -3824.32 | 3.156933 |
| ParallelAggregatorBenchmark.parallelOr | dimension_033 | avgt | 5 | 578.908 | 7.25 | us/op | 0.23 | -1934.43 | 4.341512 |
| ParallelAggregatorBenchmark.parallelOr | uscensus2000 | avgt | 5 | 185.418 | 0.56 | us/op | 0.17 | -907.726 | 5.895566 |
| ParallelAggregatorBenchmark.parallelOr | weather_sept_85 | avgt | 5 | 1294.675 | 59 | us/op | 0.31 | -2895.97 | 3.236833 |
| ParallelAggregatorBenchmark.parallelOr | wikileaks-noquotes | avgt | 5 | 399.439 | 4.1 | us/op | 0.72 | -158.912 | 1.397838 |
| ParallelAggregatorBenchmark.parallelOr | census-income_srt | avgt | 5 | 424.368 | 26.3 | us/op | 0.53 | -383.477 | 1.903643 |
| ParallelAggregatorBenchmark.parallelOr | census1881_srt | avgt | 5 | 688.326 | 136 | us/op | 0.6 | -458.051 | 1.665456 |
| ParallelAggregatorBenchmark.parallelOr | weather_sept_85_srt | avgt | 5 | 584.801 | 45.3 | us/op | 0.29 | -1423.09 | 3.433465 |
| ParallelAggregatorBenchmark.parallelOr | wikileaks-noquotes_srt | avgt | 5 | 301.668 | 1.95 | us/op | 0.85 | -52.274 | 1.173283 |

It's interesting to look at the number of bitmaps with a container for each key in each dataset. This dictates how balanced parallel tasks will be. Some of these datasets create skewed aggregations, others are well balanced. Some have many keys, some very few.

census-income: [196, 196, 195, 181]
census1881: [23, 27, 23, 28, 20, 23, 20, 20, 19, 26, 26, 23, 21, 21, 14, 17, 20, 20, 19, 21, 21, 18, 23, 24, 18, 24, 19, 20, 17, 23, 26, 27, 19, 24, 25, 22, 22, 23, 24, 19, 21, 19, 21, 19, 26, 24, 22, 20, 17, 20, 29, 22, 26, 24, 23, 24, 22, 29, 23, 23, 22, 25, 24, 23, 25, 22]
dimension_008: [4444, 799, 995, 15, 100, 69, 18, 58, 197, 199, 4, 17, 58, 66, 4, 7, 7, 11, 53, 23, 42, 38, 45, 106, 24, 88, 88, 3, 2, 3, 1, 5, 5, 6, 5, 11, 58, 55, 46, 54, 648, 316, 17, 10, 6, 5, 1, 3, 8, 12, 19, 8, 1, 95, 111, 5]
dimension_003: [1996, 1560, 1928, 128, 138, 322, 160, 206, 515, 525, 121, 89, 190, 253, 92, 117, 117, 150, 159, 97, 110, 76, 77, 152, 54, 242, 276, 3, 14, 4, 81, 76, 39, 10, 32, 10, 23, 15, 20, 29, 392, 893, 298, 313, 1721, 838, 669, 91, 35, 60, 9, 28, 30, 62, 64, 71, 3, 664, 249, 9]
dimension_033: [62, 42, 64, 43, 28, 52, 31, 30, 46, 65, 42, 24, 38, 57, 44, 41, 45, 47, 32, 29, 31, 22, 19, 33, 23, 61, 66, 3, 5, 3, 28, 31, 24, 8, 12, 8, 14, 8, 13, 15, 45, 30, 23, 18, 29, 43, 59, 26, 11, 21, 4, 12, 12, 18, 14, 17, 3, 55, 32, 7]
uscensus2000: [6, 6, 6, 6, 4, 10, 13, 10, 2, 3, 5, 4, 3, 9, 10, 5, 4, 3, 5, 5, 5, 9, 5, 1, 1, 4, 4, 4, 6, 3, 5, 2, 4, 5, 3, 9, 5, 7, 1, 2, 4, 5, 6, 7, 14, 17, 11, 11, 14, 10, 20, 25, 2, 3, 2, 5, 2, 5, 6, 6, 4, 3, 4, 4, 2, 7, 8, 1, 1, 1, 3, 3, 2, 3, 4, 4, 4, 4, 3, 4, 7, 6, 1, 1, 1, 3, 1, 3, 4, 3, 3, 7, 6, 4, 8, 11, 4, 1, 1, 3, 1, 2, 3, 5, 1, 3, 3, 2, 7, 8, 4, 1, 2, 2, 1, 3, 3, 1, 1, 1, 4, 1, 3, 4, 3, 1, 2, 2, 1, 5, 4, 2, 6, 2, 7, 3, 4, 9, 5, 5, 4, 5, 3, 5, 7, 6, 1, 4, 3, 5, 3, 4, 6, 3, 2, 2, 3, 2, 6, 1, 1, 1, 3, 4, 2, 4, 6, 3, 2, 2, 5, 2, 4, 6, 4, 3, 2, 3, 4, 3, 7, 6, 1, 2, 2, 4, 4, 3, 6, 5, 1, 3, 3, 4, 4, 8, 6, 2, 2, 3, 5, 3, 5, 7, 3, 1, 3, 4, 3, 5, 8, 8, 3, 3, 5, 6, 4, 6, 5, 2, 4, 3, 4, 3, 4, 5, 3, 1, 3, 4, 3, 1, 6, 5, 5, 3, 5, 8, 3, 9, 5, 6, 2, 5, 3, 2, 4, 4, 3, 2, 5, 6, 4, 4, 10, 5, 1, 3, 2, 2, 5, 2, 1, 1, 1, 3, 1, 4, 2, 2, 1, 1, 2, 2, 4, 4, 2, 2, 3, 3, 3, 3, 6, 5, 3, 2, 3, 5, 3, 4, 6, 5, 2, 4, 6, 4, 3, 6, 7, 1, 1, 3, 2, 3, 3, 3, 1, 4, 3, 3, 6, 3, 1, 1, 3, 1, 2, 3, 1, 2, 2, 1, 3, 3, 2, 1, 2, 2, 1, 3, 2, 2, 2, 2, 5, 1, 4, 5, 2, 1, 2, 3, 3, 2, 4, 3, 1, 1, 2, 3, 1, 3, 4, 1, 1, 2, 3, 2, 5, 2, 2, 2, 5, 2, 3, 5, 5, 5, 6, 11, 10, 13, 9, 8, 2, 2, 3, 1, 4, 3, 1, 1, 1, 4, 1, 3, 4, 2, 1, 2, 2, 1, 4, 5, 6, 3, 4, 8, 6, 7, 9, 9, 4, 4, 6, 6, 9, 11, 9, 3, 3, 5, 9, 6, 11, 10, 4, 2, 3, 3, 4, 5, 5, 4, 1, 3, 1, 3, 2, 6, 3, 1, 1, 1, 3, 1, 2, 3, 7, 4, 4, 7, 7, 6, 20, 12, 3, 5, 3, 7, 6, 9, 8, 5, 2, 2, 3, 5, 4, 7, 6, 2, 2, 2, 5, 3, 4, 5, 3, 2, 2, 4, 3, 5, 5, 4, 2, 3, 2, 6, 3, 6, 6, 13, 11, 14, 15, 11, 21, 16, 15, 1, 3, 3, 2, 7, 2, 3, 2, 2, 6, 3, 4, 5, 4, 1, 1, 4, 2, 2, 6, 2, 1, 1, 3, 1, 2, 3, 2, 1, 3, 3, 3, 3, 4, 5, 1, 4, 3, 4, 3, 5, 4, 3, 1, 1, 2, 1, 3, 4, 2, 1, 1, 2, 2, 1, 5, 2]
weather_sept_85: [187, 182, 177, 176, 177, 175, 177, 174, 175, 176, 176, 176, 178, 184, 183, 183]
wikileaks-noquotes: [94, 98, 95, 97, 96, 95, 92, 94, 91, 95, 91, 90, 90, 85, 88, 91, 84, 79, 81, 73, 93]
census-income_srt: [193, 194, 188, 116]
census1881_srt: [36, 33, 33, 32, 33, 33, 39, 30, 40, 37, 31, 39, 31, 35, 35, 33, 37, 34, 36, 31, 28, 29, 34, 34, 37, 42, 30, 40, 44, 52, 47, 44, 30, 35, 41, 46, 40, 51, 45, 50, 44, 46, 29, 46, 46, 40, 44, 41, 45, 28, 39, 45, 47, 43, 42, 32, 46, 43, 42, 31, 46, 49, 34, 48, 29, 26]
weather_sept_85_srt: [54, 133, 164, 149, 108, 90, 131, 141, 151, 166, 161, 159, 138, 161, 139, 123]
wikileaks-noquotes_srt: [81, 93, 67, 76, 67, 70, 72, 68, 89, 65, 72, 76, 76, 80, 71, 85, 73, 73, 76, 84, 61]

As things stand, it's likely that it will always be beneficial to use the parallel implementation, but won't always justify the cost of using up capacity in a threadpool for the task.

Various heuristics can be based on the grouped keys to check if it is worth doing the work in parallel. I will implement a spliterator for the map which splits based on the number of containers rather than the number of keys (dimension_008 should probably put the work for the first key on a dedicated thread). The default spliterator on TreeMap is good but doesn't know the cost of each key.
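
As an illustration of the kind of heuristic meant here (hypothetical, not part of this PR): one could inspect the grouped map and only go parallel when there is enough work to amortise the fork-join overhead. The method name and the threshold below are invented, and for brevity the sketch simply re-groups when it falls back to the sequential path.

  static RoaringBitmap adaptiveOr(RoaringBitmap... bitmaps) {
    SortedMap<Short, List<Container>> grouped = ParallelAggregation.groupByKey(bitmaps);
    // The total container count approximates the amount of work better than the key count alone.
    int totalContainers = grouped.values().stream().mapToInt(List::size).sum();
    if (grouped.size() < 2 || totalContainers < 64) {
      // too little work to split across threads: use the sequential grouped reduction sketched above
      return groupedOr(bitmaps);
    }
    return ParallelAggregation.or(bitmaps);
  }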

Also, I'm implementing this "for fun" rather than having a particular concrete need this time. It might be better to close this until the approach shows more clear cut advantage.

@maciej


Member

maciej commented Mar 4, 2018

@lemire @richardstartin I think I have little to add at this point. The code is very readable, it's well documented and for a large number of test cases it's faster.

@richardstartin


Contributor

richardstartin commented Mar 4, 2018

I just thought up a quick win for large skewed tasks, but I don't think it's worthwhile making many more changes, and it's important not to overtrain for my laptop or these datasets.

| Benchmark | Dataset | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| ParallelAggregatorBenchmark.parallelOr | census-income | avgt | 5 | 555.713 | 34.241 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census1881 | avgt | 5 | 570.718 | 3.39 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_008 | avgt | 5 | 757.614 | 9.179 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_003 | avgt | 5 | 1743.939 | 27.555 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_033 | avgt | 5 | 591.09 | 29.429 | us/op |
| ParallelAggregatorBenchmark.parallelOr | uscensus2000 | avgt | 5 | 188.677 | 1.916 | us/op |
| ParallelAggregatorBenchmark.parallelOr | weather_sept_85 | avgt | 5 | 1333.605 | 99.963 | us/op |
| ParallelAggregatorBenchmark.parallelOr | wikileaks-noquotes | avgt | 5 | 219.896 | 2.404 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census-income_srt | avgt | 5 | 460.229 | 180.861 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census1881_srt | avgt | 5 | 446.696 | 76.561 | us/op |
| ParallelAggregatorBenchmark.parallelOr | weather_sept_85_srt | avgt | 5 | 688.588 | 331.204 | us/op |
| ParallelAggregatorBenchmark.parallelOr | wikileaks-noquotes_srt | avgt | 5 | 149.8 | 5.171 | us/op |

@okrische


Member

okrische commented Mar 4, 2018

At the price of keeping another array, one could also do the reduction for each key in parallel. I have not tested it, but it could look like this:

  public static RoaringBitmap groupedOr(RoaringBitmap... bitmaps) {
    SortedMap<Short, List<Container>> grouped = ParallelAggregation.groupByKey(bitmaps);

    short[] keys = new short[grouped.size()];
    Container[] containers = new Container[grouped.size()];
    List<List<Container>> values = new ArrayList<>(grouped.size());
    int position = 0;
    for (Map.Entry<Short, List<Container>> slice : grouped.entrySet()) {
      keys[position] = slice.getKey();
      values.add(slice.getValue());
      ++position;
    }

    IntStream.range(0, position).parallel().forEach(pos -> {
      // here it could also be worth it to stream/collect in parallel, if the list is really long
      // (with a supplier, adder, accumulator and the repairAfterLazy as finisher)
      List<Container> cs = values.get(pos);
      Container reduced = cs.get(0).clone();
      for (int i = 1; i < cs.size(); ++i) {
        reduced = reduced.lazyIOR(cs.get(i));
      }
      containers[pos] = reduced.repairAfterLazy();
    });

    RoaringArray hlc = new RoaringArray(keys, containers);
    return new RoaringBitmap(hlc);
  }

Maybe one can already collect the containers into a single one, while grouping by key. Something like stream().collect(Collectors.groupingBy(Collectors.mapping...Collectors.toRoaringArray(...))). So many options!
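
A rough sketch of that "reduce while grouping" idea (my own, untested): pairs(...) is a hypothetical helper that streams every (key, container) pair out of the input bitmaps, and the clone/lazyIOR/repairAfterLazy protocol is copied from the snippets above. Assumes the usual java.util, java.util.stream imports.

  // pairs(bitmaps) is assumed to return a Stream<Map.Entry<Short, Container>>
  TreeMap<Short, Container> merged = pairs(bitmaps)
      .collect(Collectors.groupingBy(
          Map.Entry::getKey,                          // group by the high 16 bits
          TreeMap::new,                               // keep the keys sorted
          Collectors.mapping(Map.Entry::getValue,
              Collectors.<Container>reducing(null, (acc, next) ->
                  acc == null ? next.clone() : acc.lazyIOR(next)))));
  // finish the lazy unions in a single pass
  merged.replaceAll((key, container) -> container.repairAfterLazy());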

@richardstartin


Contributor

richardstartin commented Mar 5, 2018

@okrische that's a good idea and similar to one I tried. A potential problem is that it can't be generalised to XOR: if you have two identical containers at a key, the key should not be present in the output (see the sketch at the end of this comment).

I tried it.

  public static RoaringBitmap or2(RoaringBitmap... bitmaps) {
    SortedMap<Short, List<Container>> grouped = groupByKey(bitmaps);
    short[] keys = new short[grouped.size()];
    Container[] values = new Container[grouped.size()];
    List<List<Container>> slices = new ArrayList<>(grouped.size());
    int i = 0;
    for (Map.Entry<Short, List<Container>> slice : grouped.entrySet()) {
      keys[i++] = slice.getKey();
      slices.add(slice.getValue());
    }
    IntStream.range(0, i).parallel()
            .forEach(position -> values[position] = or(slices.get(position)));
    return new RoaringBitmap(new RoaringArray(keys, values, i));
  }

For smaller aggregations it's much faster because it's more memory efficient, but usually the difference is unnoticeable.

| Benchmark | Dataset | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| ParallelAggregatorBenchmark.parallelOr | census-income | avgt | 5 | 499.008 | 47.1 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census1881 | avgt | 5 | 634.849 | 199 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_008 | avgt | 5 | 822.22 | 455 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_003 | avgt | 5 | 1724.995 | 31 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_033 | avgt | 5 | 579.19 | 26 | us/op |
| ParallelAggregatorBenchmark.parallelOr | uscensus2000 | avgt | 5 | 214.843 | 158 | us/op |
| ParallelAggregatorBenchmark.parallelOr | weather_sept_85 | avgt | 5 | 1313.411 | 103 | us/op |
| ParallelAggregatorBenchmark.parallelOr | wikileaks-noquotes | avgt | 5 | 217.131 | 2.2 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census-income_srt | avgt | 5 | 410.109 | 36.6 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census1881_srt | avgt | 5 | 388.093 | 9.38 | us/op |
| ParallelAggregatorBenchmark.parallelOr | weather_sept_85_srt | avgt | 5 | 593.551 | 49.8 | us/op |
| ParallelAggregatorBenchmark.parallelOr | wikileaks-noquotes_srt | avgt | 5 | 147.971 | 4.62 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | census-income | avgt | 5 | 328.729 | 10.9 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | census1881 | avgt | 5 | 488.69 | 89.2 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | dimension_008 | avgt | 5 | 767.88 | 56 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | dimension_003 | avgt | 5 | 1800.419 | 15.9 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | dimension_033 | avgt | 5 | 556.236 | 92.3 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | uscensus2000 | avgt | 5 | 172.038 | 1.88 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | weather_sept_85 | avgt | 5 | 1284.25 | 551 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | wikileaks-noquotes | avgt | 5 | 193.273 | 2.65 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | census-income_srt | avgt | 5 | 378.977 | 162 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | census1881_srt | avgt | 5 | 386.318 | 81.9 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | weather_sept_85_srt | avgt | 5 | 549.375 | 44.7 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | wikileaks-noquotes_srt | avgt | 5 | 125.469 | 1.69 | us/op |

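To illustrate the XOR caveat mentioned at the top of this comment, here is a hypothetical (untested) sketch, not code from this PR: a per-key reduction for XOR has to drop keys whose reduced container comes out empty, so the result arrays cannot be sized up front the way they can for OR. The xor(List<Container>) helper and RoaringArray.append are assumed to exist with the obvious meaning.

  static RoaringBitmap xorGrouped(RoaringBitmap... bitmaps) {
    SortedMap<Short, List<Container>> grouped = groupByKey(bitmaps);
    RoaringArray result = new RoaringArray();
    for (Map.Entry<Short, List<Container>> slice : grouped.entrySet()) {
      Container reduced = xor(slice.getValue());
      // Unlike OR, XOR can cancel a key entirely (e.g. two identical containers),
      // so empty results must be skipped rather than appended.
      if (reduced.getCardinality() > 0) {
        result.append(slice.getKey(), reduced);
      }
    }
    return new RoaringBitmap(result);
  }
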
@lemire


Member

lemire commented Mar 6, 2018

@richardstartin It seems worthwhile to adopt @okrische's approach. No?

@okrische


Member

okrische commented Mar 6, 2018

I would not use the style I suggested. It's not good style; it has side effects.

I also did not like the extra iteration, but we can get the position from headMap with some extra cost:

sortedMap.entrySet().parallelStream().forEach(e -> {
    int pos = sortedMap.headMap(e.getKey()).size();
    keys[pos] = e.getKey();
    values[pos] = or(e.getValue());
});

But is setting short[] values in parallel thread-safe? Probably not?

Maybe I'm overthinking it. Does the SortedMap's stream not announce "SORTED" as a characteristic? Maybe it's just enough to do this:

// will need an extra conversion to short[], there is no toShortArray
int[] keys = sortedMap.entrySet().parallelStream().mapToInt(e -> e.getKey()).toArray();
Container[] values = sortedMap.entrySet().parallelStream().map(e ->
   or(e.getValue())).toArray(Container[]::new);

And if not, we can still sort:

int[] keys = sortedMap.entrySet().parallelStream().mapToInt(e -> e.getKey()).sorted().toArray();
Container[] values = sortedMap.entrySet().parallelStream().sorted(Map.Entry.comparingByKey()).map(e ->
   or(e.getValue())).toArray(Container[]::new);

@richardstartin


Contributor

richardstartin commented Mar 6, 2018

It’s thread safe to assign to different indices of an array, so long as no two threads write to the same index. I have this code already written if needed. My only misgiving is that it uses an approach inapplicable to XOR (which means more code) and it’s only faster beyond measurement error when the task is already very fast.

@richardstartin


Contributor

richardstartin commented Mar 6, 2018

I tried your latest suggestion, modified on the assumption that the first parallel stream invocation couldn't possibly justify parallel execution, let alone a subsequent conversion to a short[] (try it).

  public static RoaringBitmap or3(RoaringBitmap... bitmaps) {
    SortedMap<Short, List<Container>> grouped = groupByKey(bitmaps);
    short[] keys = new short[grouped.size()];
    int i = 0;
    for (Map.Entry<Short, List<Container>> slice : grouped.entrySet()) {
      keys[i++] = slice.getKey();
    }
    Container[] values = grouped.entrySet()
                                .parallelStream()
                                .map(e -> or(e.getValue()))
                                .toArray(Container[]::new);
    return new RoaringBitmap(new RoaringArray(keys, values, i));
  }

It's usually worse.

| Benchmark | Dataset | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| ParallelAggregatorBenchmark.parallelOr | census-income | avgt | 5 | 555.241 | 37.3 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census1881 | avgt | 5 | 573.727 | 23.3 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_008 | avgt | 5 | 759.337 | 80.4 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_003 | avgt | 5 | 1724.561 | 18 | us/op |
| ParallelAggregatorBenchmark.parallelOr | dimension_033 | avgt | 5 | 589.229 | 25.6 | us/op |
| ParallelAggregatorBenchmark.parallelOr | uscensus2000 | avgt | 5 | 193.171 | 1.26 | us/op |
| ParallelAggregatorBenchmark.parallelOr | weather_sept_85 | avgt | 5 | 1327.188 | 95.3 | us/op |
| ParallelAggregatorBenchmark.parallelOr | wikileaks-noquotes | avgt | 5 | 216.523 | 3.67 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census-income_srt | avgt | 5 | 478.56 | 96.9 | us/op |
| ParallelAggregatorBenchmark.parallelOr | census1881_srt | avgt | 5 | 390.966 | 9.33 | us/op |
| ParallelAggregatorBenchmark.parallelOr | weather_sept_85_srt | avgt | 5 | 602.929 | 47.7 | us/op |
| ParallelAggregatorBenchmark.parallelOr | wikileaks-noquotes_srt | avgt | 5 | 149.752 | 5.41 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | census-income | avgt | 5 | 326.995 | 98.4 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | census1881 | avgt | 5 | 476.763 | 4.36 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | dimension_008 | avgt | 5 | 744.415 | 14.2 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | dimension_003 | avgt | 5 | 1708.778 | 12.1 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | dimension_033 | avgt | 5 | 621.399 | 183 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | uscensus2000 | avgt | 5 | 173.205 | 3.14 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | weather_sept_85 | avgt | 5 | 1045.635 | 133 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | wikileaks-noquotes | avgt | 5 | 197.271 | 19.3 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | census-income_srt | avgt | 5 | 268.951 | 32 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | census1881_srt | avgt | 5 | 330.806 | 7.79 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | weather_sept_85_srt | avgt | 5 | 541.925 | 41 | us/op |
| ParallelAggregatorBenchmark.parallelOr2 | wikileaks-noquotes_srt | avgt | 5 | 124.192 | 0.96 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | census-income | avgt | 5 | 541.171 | 22.2 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | census1881 | avgt | 5 | 581.749 | 3.93 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | dimension_008 | avgt | 5 | 748.952 | 17.4 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | dimension_003 | avgt | 5 | 1867.997 | 31 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | dimension_033 | avgt | 5 | 607 | 65.9 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | uscensus2000 | avgt | 5 | 205.238 | 3.71 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | weather_sept_85 | avgt | 5 | 1334.836 | 89.4 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | wikileaks-noquotes | avgt | 5 | 229.825 | 1.88 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | census-income_srt | avgt | 5 | 487.834 | 15.7 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | census1881_srt | avgt | 5 | 401.117 | 15.9 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | weather_sept_85_srt | avgt | 5 | 616.065 | 53.8 | us/op |
| ParallelAggregatorBenchmark.parallelOr3 | wikileaks-noquotes_srt | avgt | 5 | 163.409 | 4.74 | us/op |

The approach is also not applicable to XOR. I'm happy to brainstorm but would like to shift the focus to the code that has been written and tested. Java is a big language and there are multiple ways to solve most problems. As mentioned before, this code is usually a little bit faster, but can't be generalised to XOR. I find it attractive to treat OR and XOR as instances of the same problem, but if the performance gains are considered important by others, I'm happy to test and submit this implementation.

  /**
   * Computes the bitwise union of the input bitmaps
   * @param bitmaps the input bitmaps
   * @return the union of the bitmaps
   */
  public static RoaringBitmap or2(RoaringBitmap... bitmaps) {
    SortedMap<Short, List<Container>> grouped = groupByKey(bitmaps);
    short[] keys = new short[grouped.size()];
    Container[] values = new Container[grouped.size()];
    List<List<Container>> slices = new ArrayList<>(grouped.size());
    int index = 0;
    for (Map.Entry<Short, List<Container>> slice : grouped.entrySet()) {
      keys[index++] = slice.getKey();
      slices.add(slice.getValue());
    }
    IntStream.range(0, index).parallel()
            .forEach(position -> values[position] = or(slices.get(position)));
    return new RoaringBitmap(new RoaringArray(keys, values, index));
  }
@richardstartin


Contributor

richardstartin commented Mar 6, 2018

My initial worry was that the calling thread isn't guaranteed to see the latest values in the slightly faster implementation (or2), even though the writing threads are guaranteed not to conflict by the JLS; the original implementation doesn't have any potential for stale reads.

On closer reading, though, the JLS ensures that the array writes are non-interfering, and forEach calls ForkJoinTask.invoke, which guarantees the array updates are visible to the calling thread.

@okrische


Member

okrische commented Mar 6, 2018

@richardstartin I am glad you took the time to test the performance. Thank you.

@okrische


Member

okrische commented Mar 6, 2018

One idea, though: we can avoid copying the values into the slices list by looking up each slice via the short key. But it's probably slower.

IntStream.range(0, index).parallel()
            .forEach(position -> values[position] = or(grouped.get(keys[position])));

@richardstartin


Contributor

richardstartin commented Mar 6, 2018

@okrische there are indeed a few ways you can cut this. I have tried and benchmarked more implementations than have been presented here, including:

  • Balancing the work by container size
  • Your suggestion, with a hash-based position-to-index lookup
  • Your suggestion inside a collector, doing a binary search to find the position-to-index mapping, then writing sequentially after the initial binary search

There are pros and cons to all of them, and it's easier to suggest them than to implement, test and benchmark them. Can we review the code that's been written?

@lemire


Member

lemire commented Mar 6, 2018

@richardstartin I think we can merge this. The code has been reviewed. Just give your ok.

@richardstartin


Contributor

richardstartin commented Mar 6, 2018

@lemire I'm OK with merging this. FTR, I'm concerned about the quantity of temporary memory: one auxiliary data structure turned into two and is now three. If someone does a really big aggregation, it could end up being at least a megabyte of temporary memory.

@lemire


Member

lemire commented Mar 6, 2018

@richardstartin Do you want to add a remark in the JavaDoc regarding memory usage?

@richardstartin


Contributor

richardstartin commented Mar 6, 2018

It looks like there's a warning already in the class javadoc (it's been a while, forgot it was there.)

@lemire


Member

lemire commented Mar 6, 2018

Ok. Ok.

I'll issue a release in the near future.

@ppiotrow


Member

ppiotrow commented Mar 7, 2018

I've compared Maciej's implementation and this one in our production application on two very heavy queries. This concerns buffered bitmaps only.

Maciej’s (parr.OR + parr.AND)
[info] Benchmark                                Mode  Cnt     Score     Error  Units
[info] ProdApp.benchA                   	avgt   20  2873.975 ± 574.882  ms/op
[info] ProdApp.benchB                    	avgt   20   728.568 ±  70.563  ms/op

Maciej's (parr.OR)
[info] Benchmark                                Mode  Cnt     Score     Error  Units
[info] ProdApp.benchA                   	avgt   20  5652.284 ± 765.898  ms/op
[info] ProdApp.benchB                    	avgt   20  1550.732 ± 654.949  ms/op

Official
[info] Benchmark                                Mode  Cnt     Score     Error  Units
[info] ProdApp.benchA                   	avgt   20  6351.598 ± 508.392  ms/op
[info] ProdApp.benchB                    	avgt   20  2915.789 ± 571.573  ms/op

Two conclusions: the parallel AND makes a difference, and in general @maciej's approach was faster even with only OR enabled.
Sorry, but I cannot provide the benchmark data or queries; in general they are disjunctions and conjunctions of about 3-100 bitmaps with values ranging from 0 to 500,000,000.

@richardstartin


Contributor

richardstartin commented Mar 7, 2018

For a small number of bitmaps I'd expect that. As has been alluded to throughout this PR, there are a few dimensions to this problem (number of bitmaps, size of bitmaps, overlap of keys). It would be interesting if you ran your benchmark on thousands of bitmaps. The alternative implementation was previously invited to be benchmarked against this one on the real datasets, and that invitation is obviously still open; I don't think it can be merged until that is taken up.

Another point: there's a huge optimisation in the unbuffered version that hasn't been ported to the buffered implementation (I just saw the bold text above). Let's port that over, then you can benchmark OR again if you want.

@lemire lemire merged commit 2884f89 into RoaringBitmap:master Mar 7, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage increased (+0.3%) to 91.059%
@lemire


Member

lemire commented Mar 7, 2018

To move things forward, I merged @richardstartin's code and will issue a release.

I opened a separate issue where I invite better benchmarking...

#223

@richardstartin richardstartin referenced this pull request Apr 10, 2018

Merged

Faster ParOr #178

@richardstartin richardstartin deleted the richardstartin:parallel branch Apr 15, 2018
