
Implement FastGetObjectMap to get better read performance #845

Closed

Conversation

@pyb1993 commented Jul 23, 2021

1. Unroll locateKey into the get method

  • The unrolled version performs better: I benchmarked the rolled and unrolled versions (the difference is about 2%)
    SimpleMapBenchmark.read          0.5f    unroll           2048       1000  thrpt    9  360077.662 ± 7646.572  ops/s
    SimpleMapBenchmark.read          0.5f    unroll           2048       3000  thrpt    9  352933.493 ± 5690.624  ops/s
    SimpleMapBenchmark.read          0.5f    roll           2048          1000  thrpt    9  368682.803 ± 7567.719  ops/s
    SimpleMapBenchmark.read          0.5f    roll           2048          3000  thrpt    9  358348.669 ± 7085.066  ops/s

There are two possible reasons:
  • The JIT may not inline the locateKey method, which adds a small call overhead.
  • get has to compare the index returned by locateKey against a sentinel before it can return the real value.
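
For reference, a minimal sketch of what the unrolled get() looks like, assuming an open-addressing table with parallel key/value arrays and a power-of-two capacity. Field and method names follow Kryo's ObjectMap, but this is an illustrative reconstruction, not the PR's exact diff:

```java
// A sketch of the unrolled get(): the probe loop from locateKey() is inlined
// so a hit can return the value directly, instead of returning an index that
// the caller must then test against a sentinel.
abstract class ProbeMapSketch<K, V> {
	K[] keyTable;   // parallel arrays, as in Kryo's ObjectMap
	V[] valueTable;
	int mask;       // capacity - 1; capacity is a power of two

	abstract int place (K item); // maps a key's hash to its initial bucket

	public V get (K key) {
		K[] keyTable = this.keyTable;
		for (int i = place(key);; i = (i + 1) & mask) { // linear probing
			K other = keyTable[i];
			if (other == null) return null;              // empty slot: key absent
			if (other.equals(key)) return valueTable[i]; // hit: return directly
		}
	}
}
```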

2. Remove the magic number in place()

  • This algorithm lowers the probability of hash collisions, but it has some cost.
  • With a low loadFactor, we don't need to worry about hash collisions as much.
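
For context, the two place() strategies being compared look roughly like this. The multiplier constant and shift follow libGDX's ObjectMap (shift is derived from the table size); a sketch, not the PR's exact code:

```java
class PlaceSketch {
	// Fibonacci hashing (the "magic number"): multiply by 2^64 / golden ratio
	// and keep the top bits, so the high bits of hashCode() also influence
	// the bucket index.
	static int placeFibonacci (Object item, int shift) {
		return (int)(item.hashCode() * 0x9E3779B97F4A7C15L >>> shift);
	}

	// The PR's simplification: mask the low bits directly. Cheaper, but it
	// trusts the low bits of hashCode() to be well distributed (true for
	// identity hash codes, not for every key type).
	static int placeMask (Object item, int mask) {
		return item.hashCode() & mask;
	}
}
```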

3. Automatically adjust the load factor

  • According to the earlier benchmarks, different map sizes have different optimal loadFactors.
  • My guess is that a small map can exploit cache locality when the loadFactor is high.
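
A hypothetical sketch of the idea; the thresholds and values below are invented for illustration (the PR derives its values from benchmarks):

```java
class LoadFactorSketch {
	// Assumption: small tables tolerate dense packing because probes stay in
	// cache; large tables benefit more from fewer collisions. These cutoffs
	// are made up for illustration, not taken from the PR.
	static float loadFactorFor (int capacity) {
		if (capacity <= 256) return 0.8f;
		if (capacity <= 2048) return 0.6f;
		return 0.5f;
	}
}
```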
Other small optimizations (possibly done automatically by the JIT; no obvious improvement):
- Remove the key-not-null check in get (CuckooObjectMap doesn't check it either; checking in put is enough).
- Swap the order of the null check and the equality check: equality is more likely to be true, and since key cannot be null we can safely call key.equals(other). See the sketch after this list.
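
Continuing the ProbeMapSketch above, the reordered probe loop might look like this (again a sketch, not the PR's exact code):

```java
// In ProbeMapSketch: equality is tested first because it is the likelier
// branch, and since put() rejects null keys, key.equals(other) is safe even
// when other is null (the equals() contract makes x.equals(null) false).
public V getEqualsFirst (K key) {
	K[] keyTable = this.keyTable;
	for (int i = place(key);; i = (i + 1) & mask) {
		K other = keyTable[i];
		if (key.equals(other)) return valueTable[i]; // most lookups end here
		if (other == null) return null;              // empty slot: key absent
	}
}
```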

Benchmark Result

num = 100

Benchmark                (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt        Score        Error  Units
SimpleMapBenchmark.read    fastGet           2048           100  thrpt    9  1767666.150 ±   7451.419  ops/s
SimpleMapBenchmark.read     cuckoo           2048           100  thrpt    9  1458135.074 ± 373999.983  ops/s

num = 1000

Benchmark                (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
SimpleMapBenchmark.read    fastGet           2048          1000  thrpt    9  370069.484 ± 20877.873  ops/s
SimpleMapBenchmark.read     cuckoo           2048          1000  thrpt    9  363330.568 ± 18756.168  ops/s

num = 3000

Benchmark                (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
SimpleMapBenchmark.read    fastGet           2048          3000  thrpt    9  355674.722 ± 18533.691  ops/s
SimpleMapBenchmark.read     cuckoo           2048          3000  thrpt    9  333397.797 ± 78791.696  ops/s

num = 10000

Benchmark                (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
SimpleMapBenchmark.read    fastGet           2048         10000  thrpt    9  319202.343 ± 11023.639  ops/s
SimpleMapBenchmark.read     cuckoo           2048         10000  thrpt    9  316253.575 ± 68273.397  ops/s

@theigl (Collaborator) left a comment

Thanks for your PR @pyb1993! I left some initial comments. Please also run mvn formatter:format to apply project code styles.

I will try to get some feedback by the original authors of ObjectMap as well.

@theigl (Collaborator) commented Jul 23, 2021

@pyb1993: I ran the benchmark on your current branch on one of my linux boxes with JDK 11:

With a single thread:

mvn -f benchmarks/pom.xml compile exec:java -Dexec.args="-f 3 -wi 10 -i 3 -t 1 -w 2s -r 2s FastGetObjectMapBenchmark.read"
Benchmark                       (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
FastGetObjectMapBenchmark.read    fastGet           2048           100  thrpt    9  855709.085 ±  5643.041  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048          1000  thrpt    9  189912.648 ±  3902.263  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048          3000  thrpt    9  184169.213 ±  8908.928  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048         10000  thrpt    9  181783.603 ±  3773.340  ops/s

FastGetObjectMapBenchmark.read     cuckoo           2048           100  thrpt    9  846559.307 ± 31966.850  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048          1000  thrpt    9  194285.589 ± 22357.330  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048          3000  thrpt    9  184905.701 ± 14447.756  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048         10000  thrpt    9  187945.091 ±  4137.380  ops/s

And with two threads:

mvn -f benchmarks/pom.xml compile exec:java -Dexec.args="-f 3 -wi 10 -i 3 -t 2 -w 2s -r 2s FastGetObjectMapBenchmark.read"
Benchmark                       (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt        Score       Error  Units
FastGetObjectMapBenchmark.read    fastGet           2048           100  thrpt    9  1775902.890 ± 44836.590  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048          1000  thrpt    9   362333.710 ±  7255.862  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048          3000  thrpt    9   360358.157 ±  4465.184  ops/s
FastGetObjectMapBenchmark.read    fastGet           2048         10000  thrpt    9   340645.007 ±  6892.110  ops/s

FastGetObjectMapBenchmark.read     cuckoo           2048           100  thrpt    9  1703987.634 ± 59457.182  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048          1000  thrpt    9   385086.809 ±  6782.309  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048          3000  thrpt    9   378497.012 ±  6633.835  ops/s
FastGetObjectMapBenchmark.read     cuckoo           2048         10000  thrpt    9   371224.718 ±  4818.188  ops/s

For me, fastGet is still a little slower than cuckoo, but the overall read performance is much better than the default objectMap.

@pyb1993 (Author) commented Jul 24, 2021

> I ran the benchmark on your current branch on one of my linux boxes with JDK 11 [...] For me, fastGet is still a little slower than cuckoo, but the overall read performance is much better than the default objectMap.

Yes, you are correct. I reran the benchmark many times and found the results are not very stable: most of the time cuckoo is a little quicker than fastGet, but sometimes fastGet is quicker than cuckoo. Only for small maps (num = 100) does fastGet consistently outperform cuckoo.

In that case, I'm not sure it is worth replacing cuckoo; keeping it may be the better option.

@tommyettinger commented Jul 24, 2021

I looked at the code changes for a minute and noticed that the FastGetObjectMap code is mostly the same... except that it always uses a lower load factor than ObjectMap or CuckooObjectMap. If you make ObjectMap and CuckooObjectMap also use a 0.5 load factor, which is what they should be using at the capacity of 2048 this benchmark runs at, you get a very different outcome:

Benchmark                            (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score        Error  Units
FastGetObjectMapBenchmark.read         fastGet           2048           100  thrpt    8  984496.868 ± 138226.490  ops/s
FastGetObjectMapBenchmark.read          cuckoo           2048           100  thrpt    8  964066.798 ±  31559.059  ops/s
FastGetObjectMapBenchmark.read             obj           2048           100  thrpt    8  868244.026 ±  24440.959  ops/s
FastGetObjectMapBenchmark.read            hash           2048           100  thrpt    8  966474.123 ±  27272.680  ops/s
FastGetObjectMapBenchmark.write        fastGet           2048           100  thrpt    8  737184.677 ± 120212.539  ops/s
FastGetObjectMapBenchmark.write         cuckoo           2048           100  thrpt    8  713588.461 ± 109947.526  ops/s
FastGetObjectMapBenchmark.write            obj           2048           100  thrpt    8  706915.647 ± 169966.607  ops/s
FastGetObjectMapBenchmark.write           hash           2048           100  thrpt    8  672497.488 ± 135969.047  ops/s
FastGetObjectMapBenchmark.writeRead    fastGet           2048           100  thrpt    8  223227.096 ±   3652.897  ops/s
FastGetObjectMapBenchmark.writeRead     cuckoo           2048           100  thrpt    8  210817.956 ±  11824.875  ops/s
FastGetObjectMapBenchmark.writeRead        obj           2048           100  thrpt    8  219017.636 ±   6081.790  ops/s
FastGetObjectMapBenchmark.writeRead       hash           2048           100  thrpt    8  194159.176 ±  10909.609  ops/s

This could easily have benchmark quality issues because not very many iterations were used, but if it holds up (note these are throughput scores, so higher is better): ObjectMap (obj in this listing) is slowest by over 10% for read, second-slowest for write, and second-fastest in a close bunch for writeRead, where HashMap is the leader.

What this seems to suggest is that just lowering the load factor on the existing ObjectMap is enough to make it significantly faster, even relative to CuckooObjectMap. 0.5 is probably a good amount. I'll run more benchmarks with varying load factors to see where the changes start to matter.

GitHub doesn't let me attach a bare .java file, but here are my changes to the benchmark, zipped:
FastGetObjectMapBenchmark.zip
Note that I changed a usage of Random to ThreadLocalRandom, because the seed and statistical quality don't matter here, and ThreadLocalRandom should take up much less of the time in the benchmark -- the synchronization in Random can really slow it down.
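
The Random-to-ThreadLocalRandom swap looks roughly like this (a sketch; the real benchmark code is in the zip above):

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

class RandomKeySketch {
	static final Random SHARED = new Random(12345);

	// Before: a shared Random. Its seed is an AtomicLong updated by CAS, so
	// concurrent benchmark threads contend on it.
	static int pickShared (int bound) {
		return SHARED.nextInt(bound);
	}

	// After: ThreadLocalRandom keeps per-thread state, so there is no
	// contention. The seed is not controllable, which is fine here because
	// statistical quality and reproducibility don't matter for this benchmark.
	static int pickThreadLocal (int bound) {
		return ThreadLocalRandom.current().nextInt(bound);
	}
}
```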

@tommyettinger commented

So, I ran some more benchmarks. These all use FastGetObjectMapBenchmark:

  • fastGet is the customized subclass of ObjectMap being evaluated.
  • cuckoo is CuckooObjectMap.
  • obj is ObjectMap with the get() and place() optimizations from FastGetObjectMap applied, because they all make sense here (but not its loadFactor changes, because I want something to be different so I can compare).
  • hash is java.util.HashMap.

This tests with 500 Classes inserted, instead of 100, and the results are less favorable to CuckooObjectMap at the higher item count. It tests loadFactors of 0.4, 0.5, 0.6, 0.7, and 0.8, but fastGet doesn't take loadFactor into account, so it's a little confusing that it gets different results at all across loadFactors. There is also a lot of noise in the benchmark; I ran this on Windows 10, and I think there may have been some OS junk competing for CPU time.

Benchmark  (loadFactor)  (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
read                0.4    fastGet           2048           500  thrpt   10  203396.861 ±  8688.657  ops/s
read                0.4     cuckoo           2048           500  thrpt   10  174474.845 ± 11104.342  ops/s
read                0.4        obj           2048           500  thrpt   10  182592.193 ±  4077.766  ops/s
read                0.4       hash           2048           500  thrpt   10  159097.986 ±  9607.009  ops/s
read                0.5    fastGet           2048           500  thrpt   10  193455.437 ± 14741.645  ops/s
read                0.5     cuckoo           2048           500  thrpt   10  161923.298 ± 22251.526  ops/s
read                0.5        obj           2048           500  thrpt   10  177151.514 ±  7222.974  ops/s
read                0.5       hash           2048           500  thrpt   10  163061.511 ±  3982.901  ops/s
read                0.6    fastGet           2048           500  thrpt   10  200107.413 ±  2153.898  ops/s
read                0.6     cuckoo           2048           500  thrpt   10  166540.735 ±  9245.270  ops/s
read                0.6        obj           2048           500  thrpt   10  180055.150 ±  3254.601  ops/s
read                0.6       hash           2048           500  thrpt   10  162189.791 ±  7832.268  ops/s
read                0.7    fastGet           2048           500  thrpt   10  174583.487 ±  4305.354  ops/s
read                0.7     cuckoo           2048           500  thrpt   10  173032.679 ±  3766.070  ops/s
read                0.7        obj           2048           500  thrpt   10  182344.666 ±  6568.991  ops/s
read                0.7       hash           2048           500  thrpt   10  161397.007 ±  2309.251  ops/s
read                0.8    fastGet           2048           500  thrpt   10  198841.597 ± 12824.240  ops/s
read                0.8     cuckoo           2048           500  thrpt   10  169526.261 ± 13254.839  ops/s
read                0.8        obj           2048           500  thrpt   10  152559.914 ± 52911.405  ops/s
read                0.8       hash           2048           500  thrpt   10  155779.667 ± 20023.779  ops/s
write               0.4    fastGet           2048           500  thrpt   10  173894.412 ±  3791.294  ops/s
write               0.4     cuckoo           2048           500  thrpt   10  168937.060 ±  4303.122  ops/s
write               0.4        obj           2048           500  thrpt   10  177942.996 ±  4306.470  ops/s
write               0.4       hash           2048           500  thrpt   10  132062.077 ±  3824.929  ops/s
write               0.5    fastGet           2048           500  thrpt   10  176622.602 ±  4953.666  ops/s
write               0.5     cuckoo           2048           500  thrpt   10  163975.929 ±  3902.864  ops/s
write               0.5        obj           2048           500  thrpt   10  174190.288 ±  5795.178  ops/s
write               0.5       hash           2048           500  thrpt   10  130471.241 ± 15769.696  ops/s
write               0.6    fastGet           2048           500  thrpt   10  176066.884 ±  8087.540  ops/s
write               0.6     cuckoo           2048           500  thrpt   10  162391.330 ±  4616.634  ops/s
write               0.6        obj           2048           500  thrpt   10  177106.562 ±  3092.769  ops/s
write               0.6       hash           2048           500  thrpt   10  133279.551 ±  6136.359  ops/s
write               0.7    fastGet           2048           500  thrpt   10  177620.663 ±  5643.269  ops/s
write               0.7     cuckoo           2048           500  thrpt   10  163657.554 ±  3696.383  ops/s
write               0.7        obj           2048           500  thrpt   10  155664.683 ± 30198.158  ops/s
write               0.7       hash           2048           500  thrpt   10  138005.766 ±  5484.546  ops/s
write               0.8    fastGet           2048           500  thrpt   10  174708.845 ±  3925.234  ops/s
write               0.8     cuckoo           2048           500  thrpt   10  163353.853 ±  2017.528  ops/s
write               0.8        obj           2048           500  thrpt   10  165161.512 ± 16316.204  ops/s
write               0.8       hash           2048           500  thrpt   10  136692.675 ±  3913.704  ops/s
writeRead           0.4    fastGet           2048           500  thrpt   10   43591.521 ±   704.075  ops/s
writeRead           0.4     cuckoo           2048           500  thrpt   10   46173.253 ±   824.982  ops/s
writeRead           0.4        obj           2048           500  thrpt   10   48410.097 ±   687.127  ops/s
writeRead           0.4       hash           2048           500  thrpt   10   45380.194 ±  1189.894  ops/s
writeRead           0.5    fastGet           2048           500  thrpt   10   43689.811 ±   712.184  ops/s
writeRead           0.5     cuckoo           2048           500  thrpt   10   40964.239 ±   693.984  ops/s
writeRead           0.5        obj           2048           500  thrpt   10   42961.852 ±  1806.682  ops/s
writeRead           0.5       hash           2048           500  thrpt   10   43414.371 ±  1383.136  ops/s
writeRead           0.6    fastGet           2048           500  thrpt   10   43878.141 ±   455.079  ops/s
writeRead           0.6     cuckoo           2048           500  thrpt   10   40819.561 ±   441.262  ops/s
writeRead           0.6        obj           2048           500  thrpt   10   42724.069 ±   485.276  ops/s
writeRead           0.6       hash           2048           500  thrpt   10   43415.844 ±   831.169  ops/s
writeRead           0.7    fastGet           2048           500  thrpt   10   43819.019 ±   616.127  ops/s
writeRead           0.7     cuckoo           2048           500  thrpt   10   41281.160 ±   457.417  ops/s
writeRead           0.7        obj           2048           500  thrpt   10   43204.736 ±   595.106  ops/s
writeRead           0.7       hash           2048           500  thrpt   10   43279.243 ±   626.552  ops/s
writeRead           0.8    fastGet           2048           500  thrpt   10   41917.364 ±  4236.458  ops/s
writeRead           0.8     cuckoo           2048           500  thrpt   10   40736.723 ±  1126.091  ops/s
writeRead           0.8        obj           2048           500  thrpt   10   42525.028 ±   765.568  ops/s
writeRead           0.8       hash           2048           500  thrpt   10   42937.855 ±   891.393  ops/s

To sum this up, FastGetObjectMap seems like a good set of changes, but I would be wary of taking away the option to configure loadFactor. That's one of the few tools hash tables have to adjust for higher- or lower-quality hashCode() implementations. Here we're only testing identity hash codes, and those are generally quite well-distributed compared to, say, Vector2 hash codes from libGDX, or maliciously crafted Strings.

I think there's some issue with the benchmark code or the way I'm running it, too; running just obj without running fastGet first makes obj much faster, which suggests there could be some bimorphic-callsite slowdown. It's hard to tell, though if FastGetObjectMap replaced ObjectMap, that issue couldn't occur. There's also a lot of variance between forked VMs here; obj in particular could slow down by 20% on some forked runs and then be back at full speed on another fork (other types had this happen as well, but it is more noticeable with obj).

So, yeah, in this case I agree with all the changes in FastGetObjectMap except the fixed loadFactor -- I think that should still be up to the user, but could default to 0.7f or 0.5f in the constructors that don't specify a loadFactor.

@tommyettinger commented

I figured out the flaw with the benchmark... on my end. I was mvn package-ing kryo but not mvn install-ing it, so the benchmarks were stuck with an old version. Implementing the unrolled get() and masking place(), but not the load factor change, now makes ObjectMap faster than CuckooObjectMap pretty reliably -- I can run more benchmarks later, but for now:

Benchmark                       (loadFactor)  (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
FastGetObjectMapBenchmark.read           0.4        obj           2048           500  thrpt   10  275569.483 ±  3758.463  ops/s
FastGetObjectMapBenchmark.read           0.4     cuckoo           2048           500  thrpt   10  272176.565 ±  3442.241  ops/s
FastGetObjectMapBenchmark.read           0.5        obj           2048           500  thrpt   10  267311.200 ± 10068.484  ops/s
FastGetObjectMapBenchmark.read           0.5     cuckoo           2048           500  thrpt   10  265858.911 ±  3855.796  ops/s
FastGetObjectMapBenchmark.read           0.6        obj           2048           500  thrpt   10  275953.601 ±  3599.506  ops/s
FastGetObjectMapBenchmark.read           0.6     cuckoo           2048           500  thrpt   10  251233.291 ± 36603.872  ops/s
FastGetObjectMapBenchmark.read           0.7        obj           2048           500  thrpt   10  272036.503 ±  5120.783  ops/s
FastGetObjectMapBenchmark.read           0.7     cuckoo           2048           500  thrpt   10  262150.137 ±  3546.619  ops/s
FastGetObjectMapBenchmark.read           0.8        obj           2048           500  thrpt   10  269115.094 ± 20723.802  ops/s
FastGetObjectMapBenchmark.read           0.8     cuckoo           2048           500  thrpt   10  263997.198 ±  4647.231  ops/s

This shows obj is faster than cuckoo at every load factor tested, using pyb1993's optimizations. Without those optimizations, it's probably considerably slower, since that should match the above comment's benchmarks.

If you're making changes to ObjectMap, you should also make sure IdentityMap has similar changes. I found in my code that the magic-multiply-then-shift is unnecessary for identityHashCode() results, and it is faster with a mask. It seems like this (for the current benchmarks) shows that the magic-multiply-then-shift is usually slower, though I prefer it when the hashCode() could have a strong bias towards high bits instead of the low bits the mask uses.

@pyb1993 , is it OK if I credit you in a CONTRIBUTORS file in the repo where I've been working on these types of data structure? These are very useful changes, and I'll be incorporating something like them in my code. @NathanSweet is also going in there, because a lot of the changes after my initial wave of code were his work.

@pyb1993 (Author) commented Jul 25, 2021

> I figured out the flaw with the benchmark... on my end. I was mvn package-ing kryo but not mvn install-ing it, so the benchmarks were stuck with an old version. [...] @pyb1993, is it OK if I credit you in a CONTRIBUTORS file in the repo where I've been working on these types of data structure?

As for the CONTRIBUTORS file in your repo, I'd be very happy to be added ^_^
As for the benchmark results, I reran them many times on macOS: sometimes cuckoo is better and sometimes fastGet is better. The results are not stable, but cuckoo is more likely to win when the item count is higher (1000, 3000).

So I'm not sure about the benchmark results.
If you and @theigl can confirm them, I'm happy to undo the dynamic loadFactor change
(though fastGet is mainly used for class registration, so hashCode quality may not be a problem there).

@tommyettinger commented

It makes sense that the results aren't stable, I think, because this is using Object.hashCode() or just the identity hash code of each object. The identity hash codes should be different each time, so if the references aren't identical between the cuckoo and fastGet benchmarks, the actual number of collisions will vary between runs. It might make more sense to generate Strings, which have a more representative hashCode() implementation for most Java classes, or random Integers generated from the same seed for each group of objects. If this is only being used with Class keys, then we're stuck with identity hash codes; they tend to be relatively high-quality random numbers, but have some issues with locking that I don't fully understand. ByteBuddy might be implementing a normal hashCode() for the generated classes, but I don't really know how that library works. I'll benchmark this some more later.

@theigl (Collaborator) commented Jul 26, 2021

@tommyettinger: Thanks a lot for your effort and input on this topic!

Let me summarize briefly what we found out so far:

  1. FastGetObjectMap outperforms ObjectMap by 10-20% when reading

This is mainly due to simplified hashing and a lower load-factor. According to my benchmarks, a load-factor of 0.5 or 0.6 seems ideal, at least for the use-case we are benchmarking. The simplified hashing only makes sense for high quality hashes like identity hashcodes.

  2. CuckooObjectMap outperforms FastGetObjectMap by 5-10% when reading for map sizes > 500 elements.

This result is consistent in all my benchmarks. Once we reach a certain number of elements (somewhere between 500 and 1000), the Cuckoo map starts to perform significantly better. Although I'd very much like to delete the Cuckoo map from Kryo, I don't think it is a good idea until we find a replacement that comes closer in performance.

However, there are other parts of Kryo that currently use ObjectMap that might benefit from switching to the new, faster implementation. The IdentityObjectIntMap in MapReferenceResolver for instance. But this has to be benchmarked with a real-world use case before I would consider changing the implementation. For scenarios where reads and writes are balanced, my benchmarks show that FastGetObjectMap is still faster, but only by about 2%.

For reference, these are the benchmark results I got from running @tommyettinger's benchmark on my linux box with a large number of forks and warmups. The error rate for the runs with > 100 elements is quite low, so I'm pretty sure these results are valid:

Benchmark                             (mapType)  (maxCapacity)  (numClasses)   Mode  Cnt       Score       Error  Units
FastGetObjectMapBenchmark3.read          cuckoo           2048           100  thrpt    9  901751.050 ± 69378.088  ops/s
FastGetObjectMapBenchmark3.read         fastGet           2048           100  thrpt    9  874029.724 ± 11040.301  ops/s
FastGetObjectMapBenchmark3.read             obj           2048           100  thrpt    9  803881.628 ± 17089.752  ops/s
FastGetObjectMapBenchmark3.read            hash           2048           100  thrpt    9  873987.979 ± 68187.282  ops/s

FastGetObjectMapBenchmark3.read          cuckoo           2048          1000  thrpt    9  200249.675 ±  3628.701  ops/s
FastGetObjectMapBenchmark3.read         fastGet           2048          1000  thrpt    9  178400.099 ± 13617.125  ops/s
FastGetObjectMapBenchmark3.read            hash           2048          1000  thrpt    9  169823.262 ±  3492.215  ops/s
FastGetObjectMapBenchmark3.read             obj           2048          1000  thrpt    9  152193.884 ± 27983.011  ops/s

FastGetObjectMapBenchmark3.read          cuckoo           2048          3000  thrpt    9  190520.528 ±  1260.122  ops/s
FastGetObjectMapBenchmark3.read         fastGet           2048          3000  thrpt    9  180000.873 ± 11179.701  ops/s
FastGetObjectMapBenchmark3.read             obj           2048          3000  thrpt    9  165642.223 ±  2900.557  ops/s
FastGetObjectMapBenchmark3.read            hash           2048          3000  thrpt    9  165305.519 ±  3006.401  ops/s

FastGetObjectMapBenchmark3.read          cuckoo           2048         10000  thrpt    9  181784.360 ±  4693.853  ops/s
FastGetObjectMapBenchmark3.read         fastGet           2048         10000  thrpt    9  176059.147 ±  6336.086  ops/s
FastGetObjectMapBenchmark3.read             obj           2048         10000  thrpt    9  158154.273 ±  2574.914  ops/s
FastGetObjectMapBenchmark3.read            hash           2048         10000  thrpt    9  156412.495 ±  3297.872  ops/s

FastGetObjectMapBenchmark3.write        fastGet           2048           100  thrpt    9  711911.286 ±  7407.523  ops/s
FastGetObjectMapBenchmark3.write         cuckoo           2048           100  thrpt    9  688634.806 ± 17693.374  ops/s
FastGetObjectMapBenchmark3.write            obj           2048           100  thrpt    9  640426.948 ± 13311.081  ops/s
FastGetObjectMapBenchmark3.write           hash           2048           100  thrpt    9  630192.809 ±  4457.366  ops/s

FastGetObjectMapBenchmark3.write         cuckoo           2048          1000  thrpt    9   75099.094 ±  2870.669  ops/s
FastGetObjectMapBenchmark3.write        fastGet           2048          1000  thrpt    9   74290.204 ±  1002.739  ops/s
FastGetObjectMapBenchmark3.write            obj           2048          1000  thrpt    9   72653.609 ±  1966.490  ops/s
FastGetObjectMapBenchmark3.write           hash           2048          1000  thrpt    9   62677.821 ±  5744.457  ops/s

FastGetObjectMapBenchmark3.write        fastGet           2048          3000  thrpt    9   23054.563 ±  1383.498  ops/s
FastGetObjectMapBenchmark3.write            obj           2048          3000  thrpt    9   22640.087 ±  1596.283  ops/s
FastGetObjectMapBenchmark3.write         cuckoo           2048          3000  thrpt    9   21663.940 ±  2328.371  ops/s
FastGetObjectMapBenchmark3.write           hash           2048          3000  thrpt    9   17995.772 ±   796.708  ops/s

FastGetObjectMapBenchmark3.write         cuckoo           2048         10000  thrpt    9    5979.943 ±   406.254  ops/s
FastGetObjectMapBenchmark3.write        fastGet           2048         10000  thrpt    9    5718.929 ±   385.197  ops/s
FastGetObjectMapBenchmark3.write            obj           2048         10000  thrpt    9    5701.094 ±   309.234  ops/s
FastGetObjectMapBenchmark3.write           hash           2048         10000  thrpt    9    4860.473 ±    78.571  ops/s

FastGetObjectMapBenchmark3.writeRead       hash           2048           100  thrpt    9  264731.769 ± 14282.130  ops/s
FastGetObjectMapBenchmark3.writeRead    fastGet           2048           100  thrpt    9  218024.219 ± 18505.642  ops/s
FastGetObjectMapBenchmark3.writeRead     cuckoo           2048           100  thrpt    9  213267.021 ±  6174.069  ops/s
FastGetObjectMapBenchmark3.writeRead        obj           2048           100  thrpt    9  208221.008 ±  5742.714  ops/s

FastGetObjectMapBenchmark3.writeRead    fastGet           2048          1000  thrpt    9   25579.959 ±   402.784  ops/s
FastGetObjectMapBenchmark3.writeRead     cuckoo           2048          1000  thrpt    9   24133.977 ±   620.268  ops/s
FastGetObjectMapBenchmark3.writeRead       hash           2048          1000  thrpt    9   24115.726 ±   151.297  ops/s
FastGetObjectMapBenchmark3.writeRead        obj           2048          1000  thrpt    9   23627.770 ±   614.566  ops/s

FastGetObjectMapBenchmark3.writeRead       hash           2048          3000  thrpt    9    6453.622 ±   191.857  ops/s
FastGetObjectMapBenchmark3.writeRead    fastGet           2048          3000  thrpt    9    6032.564 ±   131.448  ops/s
FastGetObjectMapBenchmark3.writeRead        obj           2048          3000  thrpt    9    5887.065 ±   346.304  ops/s
FastGetObjectMapBenchmark3.writeRead     cuckoo           2048          3000  thrpt    9    5140.471 ±    88.034  ops/s

FastGetObjectMapBenchmark3.writeRead       hash           2048         10000  thrpt    9    1243.688 ±    27.968  ops/s
FastGetObjectMapBenchmark3.writeRead    fastGet           2048         10000  thrpt    9    1213.300 ±    29.741  ops/s
FastGetObjectMapBenchmark3.writeRead        obj           2048         10000  thrpt    9    1197.335 ±    16.430  ops/s
FastGetObjectMapBenchmark3.writeRead     cuckoo           2048         10000  thrpt    9    1176.832 ±    16.654  ops/s

@pyb1993 (Author) commented Jul 26, 2021

Hi, following your suggestion, I chose nameIdToClass for some benchmarks and tried replacing its IntMap. I use the continuous id as the key and the Class as the value.

See my latest commit for the details.
If this works, I can benchmark other parts of Kryo (classToNameId, nameToClass, idToRegistration) and create a new PR.
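
The int-keyed lookup presumably follows the same pattern; here is a sketch under the same assumptions (open addressing, power-of-two capacity, 0 reserved as the empty marker with an actual zero key tracked separately, as in Kryo's IntMap), with illustrative names:

```java
abstract class IntProbeMapSketch<V> {
	int[] keyTable;  // 0 marks an empty slot; a real zero key is tracked separately
	V[] valueTable;
	int mask;        // capacity - 1

	// With dense sequential ids (like registration ids), key & mask drops the
	// keys into consecutive buckets with essentially no collisions -- close
	// to a best case for masked placement.
	public V get (int key, V defaultValue) {
		int[] keyTable = this.keyTable;
		for (int i = key & mask;; i = (i + 1) & mask) {
			int other = keyTable[i];
			if (other == key) return valueTable[i];
			if (other == 0) return defaultValue; // empty slot: key absent
		}
	}
}
```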

For read, fastGetIntMap is 3% to 10% quicker.
When many objects share the same class (as commonly happens with collections), read matters more than write.

FastGetObjectMapBenchmark.read         intMap           2048           500  thrpt   10  217789.340 ± 52158.272  ops/s
FastGetObjectMapBenchmark.read         intMap           2048          1000  thrpt   10  185616.635 ± 20610.151  ops/s
FastGetObjectMapBenchmark.read         intMap           2048          3000  thrpt   10  189498.133 ±  3460.429  ops/s
FastGetObjectMapBenchmark.read         intMap           2048         10000  thrpt   10  178081.649 ±  8869.369  ops/s
FastGetObjectMapBenchmark.read  fastGetIntMap           2048           500  thrpt   10  224623.596 ± 22657.120  ops/s
FastGetObjectMapBenchmark.read  fastGetIntMap           2048          1000  thrpt   10  203991.136 ±  6355.197  ops/s
FastGetObjectMapBenchmark.read  fastGetIntMap           2048          3000  thrpt   10  198668.113 ± 11263.387  ops/s
FastGetObjectMapBenchmark.read  fastGetIntMap           2048         10000  thrpt   10  195530.189 ± 58851.877  ops/s



For writeRead, fastGetIntMap is quicker by 1%, 4%, 5%, and 6% respectively:
FastGetObjectMapBenchmark.writeRead         intMap           2048           500  thrpt   10  87734.833 ± 5324.936  ops/s
FastGetObjectMapBenchmark.writeRead         intMap           2048          1000  thrpt   10  43089.341 ± 3225.335  ops/s
FastGetObjectMapBenchmark.writeRead         intMap           2048          3000  thrpt   10  14206.493 ± 1199.021  ops/s
FastGetObjectMapBenchmark.writeRead         intMap           2048         10000  thrpt   10   3626.960 ±   40.521  ops/s
FastGetObjectMapBenchmark.writeRead  fastGetIntMap           2048           500  thrpt   10  88981.727 ± 9759.120  ops/s
FastGetObjectMapBenchmark.writeRead  fastGetIntMap           2048          1000  thrpt   10  45685.725 ±  521.659  ops/s
FastGetObjectMapBenchmark.writeRead  fastGetIntMap           2048          3000  thrpt   10  14964.966 ±  490.173  ops/s
FastGetObjectMapBenchmark.writeRead  fastGetIntMap           2048         10000  thrpt   10   3787.810 ±  209.264  ops/s

On the other hand, if we use the dynamic loadFactor and add a constructor parameter like enableDynamicLoadFactor, the overall writeRead/read performance becomes more stable (5% to 10% quicker than intMap). Do you think that is necessary if we only use this trick in nameIdToClass? (Its hashCodes are very stable.)

@tommyettinger commented

So first: I think two parts of the PR are purely good optimizations, and make no changes to the behavior; these are the places where locateKey() is inlined/unrolled. I suspect that just these optimizations to get() provide a major improvement over the ObjectMap without this PR. In all the benchmarks I have run, these optimizations are present; there's no reason not to have them. Well, unless you rely on users being able to extend ObjectMap and override its locateKey() method for equality; even then, it's just another place to change the equality comparison.

A more debatable change is in place(), making it just mask the hashCode() instead of doing the Fibonacci rehashing (the magic multiply and shift); this should help performance for good hashCode() implementations, but cause more collisions with low-quality hashCode() implementations. Unlike CuckooObjectMap, ObjectMap (and FastGetObjectMap) can handle having more collisions without, at some point, an OutOfMemoryError killing the program. CuckooObjectMap is likely to never encounter enough completely-colliding hash codes to have that crash if it sticks to using Class keys, but it cannot be safely used for third-party-submitted data as keys. If the goal is to replace CuckooObjectMap, you can afford the same assumption that the hash codes will be at least moderately-high-quality, and use the mask optimization to place(). I do use the mask optimization in my benchmarks of both FastGetObjectMap and ObjectMap; I can roll back this change to ObjectMap and benchmark again, but I think it's a reasonable change to make. The Fibonacci rehash can also sometimes increase hash collisions, particularly if all hash codes are Fibonacci numbers or low multiples thereof, so it isn't always the best option.

The load factor changes could help, but they also reduce the ability of future code to easily adapt to changing needs for keys. I think changing the default load factor in the constructor from 0.8f to something lower (0.7 is reasonable, maybe 0.5 would be too) could be good here; it uses more memory by default, but if the capacity isn't filled-up, then it makes minimal difference on performance. I would need to run more benchmarks, with more varied key counts, before I could tell where the load factor starts to matter, and what might be a good balance.
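
Concretely, that suggestion only touches the no-argument constructor's default, something like the sketch below. The thread states the current default is 0.8f; the initial capacity of 51 is an assumption about the existing code, not confirmed here:

```java
public class ObjectMap<K, V> {
	// Hypothetical default change: keep the loadFactor parameter for users,
	// just soften the no-arg default from 0.8f to 0.7f.
	public ObjectMap () {
		this(51, 0.7f); // 51: assumed existing default initial capacity
	}

	public ObjectMap (int initialCapacity, float loadFactor) {
		// existing initialization elided
	}
}
```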

I mentioned I would benchmark random Integers; these are (as expected) much faster than Classes to hash, because getting an identity hash code is a pretty involved process. Either fastGet or obj (with most of the optimizations from fastGet) is at the top of pretty much all benches here, though I only tried 500 for numIntegers (like numClasses before). I use a seed for this, which helps ensure more reliable benchmark results (I think I have it right).

Benchmark  (loadFactor)  (mapType)  (maxCapacity)  (numIntegers)   Mode  Cnt       Score       Error  Units
read                0.4    fastGet           2048            500  thrpt   24  311247.642 ±  8117.935  ops/s
read                0.4     cuckoo           2048            500  thrpt   24  288889.556 ±  2796.552  ops/s
read                0.4        obj           2048            500  thrpt   24  281159.447 ± 20055.693  ops/s
read                0.4       hash           2048            500  thrpt   24  187050.992 ± 11477.826  ops/s
read                0.5    fastGet           2048            500  thrpt   24  305052.970 ±  3072.985  ops/s
read                0.5     cuckoo           2048            500  thrpt   24  295044.878 ±  3335.154  ops/s
read                0.5        obj           2048            500  thrpt   24  310254.856 ±  9237.816  ops/s
read                0.5       hash           2048            500  thrpt   24  208629.023 ±  1622.164  ops/s
read                0.6    fastGet           2048            500  thrpt   24  324609.057 ±  2998.449  ops/s
read                0.6     cuckoo           2048            500  thrpt   24  305444.803 ±  3484.108  ops/s
read                0.6        obj           2048            500  thrpt   24  324293.284 ±  2554.168  ops/s
read                0.6       hash           2048            500  thrpt   24  207541.244 ±  2609.115  ops/s
read                0.7    fastGet           2048            500  thrpt   24  325357.080 ±  1724.044  ops/s
read                0.7     cuckoo           2048            500  thrpt   24  309505.610 ±  2738.307  ops/s
read                0.7        obj           2048            500  thrpt   24  325688.085 ±  2939.122  ops/s
read                0.7       hash           2048            500  thrpt   24  197686.334 ± 10791.997  ops/s
read                0.8    fastGet           2048            500  thrpt   24  309515.955 ± 19954.716  ops/s
read                0.8     cuckoo           2048            500  thrpt   24  305579.325 ±  9399.305  ops/s
read                0.8        obj           2048            500  thrpt   24  320494.343 ±  4875.453  ops/s
read                0.8       hash           2048            500  thrpt   24  208023.992 ±  2880.216  ops/s
write               0.4    fastGet           2048            500  thrpt   24  251423.107 ±  1779.975  ops/s
write               0.4     cuckoo           2048            500  thrpt   24  209027.134 ±  1657.113  ops/s
write               0.4        obj           2048            500  thrpt   24  267438.405 ±  2373.473  ops/s
write               0.4       hash           2048            500  thrpt   24  153771.147 ±  4671.341  ops/s
write               0.5    fastGet           2048            500  thrpt   24  252150.900 ±  1862.121  ops/s
write               0.5     cuckoo           2048            500  thrpt   24  202480.173 ±  2087.405  ops/s
write               0.5        obj           2048            500  thrpt   24  246146.243 ±  5024.671  ops/s
write               0.5       hash           2048            500  thrpt   24  145083.640 ±  1014.747  ops/s
write               0.6    fastGet           2048            500  thrpt   24  251033.614 ±  2065.863  ops/s
write               0.6     cuckoo           2048            500  thrpt   24  201611.388 ±  2114.741  ops/s
write               0.6        obj           2048            500  thrpt   24  250234.362 ±  2371.303  ops/s
write               0.6       hash           2048            500  thrpt   24  145220.120 ±   964.643  ops/s
write               0.7    fastGet           2048            500  thrpt   24  249378.654 ±  2752.061  ops/s
write               0.7     cuckoo           2048            500  thrpt   24  198853.919 ±  3785.977  ops/s
write               0.7        obj           2048            500  thrpt   24  250717.465 ±  2158.502  ops/s
write               0.7       hash           2048            500  thrpt   24  145666.077 ±   943.396  ops/s
write               0.8    fastGet           2048            500  thrpt   24  250936.847 ±  1302.827  ops/s
write               0.8     cuckoo           2048            500  thrpt   24  198121.092 ±  4365.235  ops/s
write               0.8        obj           2048            500  thrpt   24  258688.034 ±  2100.709  ops/s
write               0.8       hash           2048            500  thrpt   24  144406.426 ±  1876.688  ops/s
writeRead           0.4    fastGet           2048            500  thrpt   24   78596.949 ±   541.856  ops/s
writeRead           0.4     cuckoo           2048            500  thrpt   24   70088.231 ±   457.812  ops/s
writeRead           0.4        obj           2048            500  thrpt   24   88078.772 ±   654.802  ops/s
writeRead           0.4       hash           2048            500  thrpt   24   59556.445 ±  2882.402  ops/s
writeRead           0.5    fastGet           2048            500  thrpt   24   78138.961 ±   648.683  ops/s
writeRead           0.5     cuckoo           2048            500  thrpt   24   62374.084 ±   714.386  ops/s
writeRead           0.5        obj           2048            500  thrpt   24   78035.299 ±   528.528  ops/s
writeRead           0.5       hash           2048            500  thrpt   24   60077.107 ±   433.450  ops/s
writeRead           0.6    fastGet           2048            500  thrpt   24   78473.978 ±   373.149  ops/s
writeRead           0.6     cuckoo           2048            500  thrpt   24   61797.286 ±   994.279  ops/s
writeRead           0.6        obj           2048            500  thrpt   24   77929.805 ±   558.747  ops/s
writeRead           0.6       hash           2048            500  thrpt   24   60125.884 ±   538.126  ops/s
writeRead           0.7    fastGet           2048            500  thrpt   24   78678.622 ±   511.142  ops/s
writeRead           0.7     cuckoo           2048            500  thrpt   24   62300.585 ±   905.950  ops/s
writeRead           0.7        obj           2048            500  thrpt   24   78540.848 ±   607.452  ops/s
writeRead           0.7       hash           2048            500  thrpt   24   60073.270 ±   578.024  ops/s
writeRead           0.8    fastGet           2048            500  thrpt   24   78729.001 ±   426.471  ops/s
writeRead           0.8     cuckoo           2048            500  thrpt   24   61330.391 ±  1444.338  ops/s
writeRead           0.8        obj           2048            500  thrpt   24   77886.240 ±   825.636  ops/s
writeRead           0.8       hash           2048            500  thrpt   24   60000.079 ±   460.316  ops/s

It's surprising how slow HashMap is with Integer keys, and CuckooObjectMap often isn't much better on writeRead benches.

Here's my code for the benchmark with Integers:
FastGetObjectMapBenchmark2.zip
And my changes to ObjectMap:
ObjectMap.zip

@theigl (Collaborator) commented Jul 27, 2021

@tommyettinger: Thanks again, and I fully agree with your analysis!

Do you plan on integrating any of the changes into libGDX? If so, we should coordinate the changes, since Nate used to copy the map implementations from libGDX to Kryo in the past, and I'd like to keep them as compatible as possible.

@tommyettinger commented

@theigl I'd like to test with some of libGDX's more problematic keys, since the change in place() seems very significant for speed on the keys we've tested so far, but is unknown for low-quality hashCodes. GridPoint2 is not uncommon as a key type in sparse area maps, and has hit crashes with the cuckoo-hashed object map in some larger maps. Vector2 can often have all of its useful bits in the upper 16 of the 32-bit hashCode, leaving a huge space for hashes to collide.

Although place() can change the rehashing technique, user code can't subclass locateKey() in libGDX (it's package-private), so code outside ObjectMap's package can't change how equality is handled. Being able to do this would be kind of nice if users wanted to use arrays as keys (treating them as immutable and comparing/hashing them by value); this might be reasonable as a substitute for the missing Vector4 class. The working changes I'm making in jdkgdxds use a protected boolean equate(Object left, Object right) method that defaults to just calling left.equals(right), and can be overridden for things like IdentityMap, which compares with ==, while the array-key case could use Arrays.equals(). I suspect there's a tiny speed hit for the default equate() that gets bigger for more complex checks, but these are checks that would need to be done in multiple places with changes coordinated between them -- a prime case for one central method. The default is so small that it should be inlined like a getter.
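
The equate() hook described here boils down to something like the following sketch (jdkgdxds holds the real implementation; class names here are illustrative):

```java
class EquateSketch<K, V> {
	// One central, overridable equality test, called everywhere locateKey(),
	// get(), remove(), etc. would otherwise hard-code equals(). The default
	// is small enough that HotSpot should inline it like a getter.
	protected boolean equate (Object left, Object right) {
		return left.equals(right);
	}

	// An identity-keyed subclass changes equality in exactly one place:
	static class IdentitySketch<K, V> extends EquateSketch<K, V> {
		@Override
		protected boolean equate (Object left, Object right) {
			return left == right;
		}
	}

	// An array-keyed variant could likewise override with Arrays.equals(...),
	// treating arrays as immutable value keys, as suggested above.
}
```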

Also, since locateKey() is package-private now, no code could be relying on it outside libGDX (or Kryo) -- this makes the unrolled/inlined locateKey() in get() no problem at all to incorporate. It shouldn't change any behavior for users other than being faster, and the only code changes are in subclasses of ObjectMap or ObjectSet, in the same package in libGDX (or Kryo).

I'm leaving the decision for any changes up to @NathanSweet , since he has the final word in both projects, as far as I know. The get()/containsKey() unroll optimization is 100% safe in libGDX, the place() optimization needs more testing to see how it affects keys from libGDX, and the equate() change is currently only used in jdkgdxds, and isn't necessary for the code to be used as it is now. As for jdkgdxds, it isn't possible to merge with libGDX any time soon, since it needs Java 8. I'm not in favor of changing load factor forcibly without user control, but I am totally on-board with changing the defaults to a lower load factor (maybe not by much).

@pyb1993 (Author) commented Jul 27, 2021

@theigl @tommyettinger
Thanks for your analysis, but I'm a little confused now and have several questions:

  • What I have done is apply some optimizations in a subclass of ObjectMap and try to use it to replace some parts of Kryo.

  • Now theigl's benchmark shows that fastGet is a little slower than cuckoo, so it is not reasonable to replace it.

  • Following theigl's suggestion, I implemented fastGetIntMap in the same way and tried replacing the IntMap in nameIdToClass, with the benchmarks shown above. @theigl, could you take a look at that part?

  • Regarding @tommyettinger's suggestion ("there's no reason not to have them"): do you mean you want to apply these optimizations to ObjectMap itself rather than to FastGetObjectMap (in which case FastGetObjectMap is unnecessary)? But hashCode quality is a concern, so you will do more testing on that (correct me if I misunderstand).

To sum up, I'm curious what I should do next and what actions you will take for this PR.

@theigl (Collaborator) commented Jul 27, 2021

@pyb1993:

> Now theigl's benchmark shows that fastGet is a little slower than cuckoo, so it is not reasonable to replace it.

Correct. We will not replace it for now.

> Following theigl's suggestion, I implemented fastGetIntMap in the same way and tried replacing the IntMap in nameIdToClass, with the benchmarks shown above. @theigl, could you take a look at that part?

I will do some benchmarks in the coming days and get back to you. The IntMap would be a good start to test your changes because it is independent of ObjectMap, always produces good hashcodes, and doesn't have sub-classes.

> Regarding @tommyettinger's suggestion ("there's no reason not to have them"): do you mean you want to apply these optimizations to ObjectMap itself rather than to FastGetObjectMap? [...]

I think what tommyettinger means is that inlining the locateKey method would make sense for all ObjectMaps.
That said, it is currently not possible to simply inline locateKey directly into ObjectMap.get, because IdentityMap overrides it to adjust the definition of equality. The same is true for ObjectIntMap and its subclass IdentityObjectIntMap.

The changes to the hashcode have to be evaluated separately for every sub-class and every usage of the map.

@NathanSweet (Member) commented

> I'm leaving the decision for any changes up to @NathanSweet

It doesn't look like there are controversial changes, so I'm good with whatever you guys come up with for making the maps faster/better! I do wonder if it's worth optimizing using inlining/unrolling in Java-land. Since this is so sensitive to JVM optimizations, maybe we should test with at least Java 16.

@tommyettinger commented

> Since this is so sensitive to JVM optimizations, maybe we should test with at least Java 16.

I'm pretty sure my tests were with Java 16, unless Maven was somehow forcing another version. That might explain some discrepancies, though everything is quite close. My JAVA_HOME is set to an AdoptOpenJDK 16 JDK.

@pyb1993 (Author) commented Aug 12, 2021

@theigl
Hi, are there any updates on this PR? You said you would do more benchmarks~

@theigl (Collaborator) commented Aug 22, 2021

@pyb1993: Sorry this is taking so long. I'm very busy at my job at the moment and can only do minimal work on Kryo. I'll get back to you as soon as I have more time.

@pyb1993 (Author) commented Oct 11, 2021

> Sorry this is taking so long. I'm very busy at my job at the moment and can only do minimal work on Kryo. I'll get back to you as soon as I have more time.

Are you still working in your job?

@tommyettinger commented

So, I should probably follow up here. For a while, I was using several of the optimizations from FastGetObjectMap in various data structures in jdkgdxds (I'm on my phone because my home network is down, so getting a link is a challenge).

I had to roll back some changes to int- and long-keyed maps, because in one case where most of the changed bits were higher-order, an IntMap slowed down to a crawl (though it didn't crash). That case used the x coordinate in the upper 16 bits and y in the lower 16; for a 512x256 grid, the upper bits were the only part that changed, they weren't seen by 16-bit-or-lower hashes, and so each key collided 512 times. I went back to Fibonacci hashing, which doesn't have this issue, and performance returned to normal. Object-keyed maps still use the plain mask of the hashCode(), because most of those can be hashed differently by subclassing the map in the few cases that need it. Identity hash codes also don't benefit from Fibonacci hashing.

I use an overridable equate() method in my Object-keyed maps and sets so users can compare keys for equality however they want, and can make case-insensitive data structures, among other changes. This separates the equality logic from locateKey() and allows internals to be better encapsulated. It could be higher or lower performance; equate() is usually very small and so gets optimized well by HotSpot, but there is an extra method call if inlining hasn't happened. equate() can also be used in equals(). Otherwise, the inlining has been good for me.
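
The pileup described above is easy to reproduce. In this sketch, keys pack x into the upper 16 bits with the low bits held fixed, as in the grid example; masking sees only the constant low bits, while Fibonacci hashing (constant as in libGDX) mixes the high bits into the index:

```java
import java.util.HashSet;
import java.util.Set;

public class MaskCollisionDemo {
	public static void main (String[] args) {
		int mask = 511; // a 512-bucket table
		Set<Integer> maskedBuckets = new HashSet<>();
		Set<Integer> fibonacciBuckets = new HashSet<>();
		for (int x = 0; x < 512; x++) {
			int key = (x << 16) | 42; // y fixed at 42: only the upper bits vary
			maskedBuckets.add(key & mask);
			fibonacciBuckets.add((int)(key * 0x9E3779B97F4A7C15L >>> 55)); // top 9 bits
		}
		System.out.println("buckets used with mask:      " + maskedBuckets.size()); // 1
		System.out.println("buckets used with Fibonacci: " + fibonacciBuckets.size()); // a few hundred
	}
}
```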

@theigl (Collaborator) commented Jan 18, 2022

@tommyettinger: Thanks a lot for your feedback! I finally found the time to revisit this topic and the info you shared gives me more confidence in applying these changes.

I will leave hashing in the maps with numeric keys as it is now and replace Fibonacci hashing in the other maps.
I will apply inlining to all maps. I did some more benchmarks and it really makes sense in all implementations even with JDK 17.

@pyb1993: I will create another PR with your changes to hashing and inlining and test the resulting code in our production environment.

theigl added a commit that referenced this pull request Jan 18, 2022
@theigl theigl mentioned this pull request Jan 18, 2022
theigl added a commit that referenced this pull request Jan 18, 2022
@tommyettinger
Copy link

@theigl This is all up to you and depends on your usage scenarios, but I should emphasize that the Fibonacci hashing acts as a form of defense against heavy pileup of collisions when the lower bits of the keys' hashCode() are extremely similar or identical. If you know that a map will not deal with third-party-submitted, potentially-malicious String keys, or keys of some other type that can be engineered to collide completely, then CuckooObjectMap remains a good choice and so does an ObjectMap with masking (without Fibonacci). But, even for int keys, if the lower bits of those keys are more-often-than-not duplicated in other keys, Fibonacci hashing provides a very significant speed improvement by using the upper bits to help reduce collisions.

My point is, if you control the keys at every stage and know their hashCode() implementations will be well-distributed, masking is a valid option, and will likely be slightly faster than using Fibonacci hashing. That case is true for all identity-hashed maps, since the identity hash code is very thoroughly mixed. If you don't control the keys or have to accept keys with poorly distributed hashCode() implementations, like Float or even unusual sets of int keys, then Fibonacci hashing can help considerably, as I described in my previous comment.
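
To make the pileup concrete, here is a small self-contained sketch contrasting the two placement strategies for the packed grid keys mentioned earlier (the 1024-bucket table size is an arbitrary assumption; the multiplier is the usual 2^64 divided by the golden ratio):

```java
// Contrast of plain masking vs. Fibonacci hashing for keys packed as
// (x << 16 | y). With masking, a small table never sees the x bits.
public class PlacementDemo {
    static final int MASK = 1023; // 1024-bucket table, an arbitrary assumption
    static final int SHIFT = 64 - 10; // keep the top 10 bits of the product

    // Plain masking: only the low 10 bits of the hash matter.
    static int maskPlace (int h) {
        return h & MASK;
    }

    // Fibonacci hashing: multiply by 2^64 / golden ratio and take the top
    // bits, so high-order key bits influence the bucket index too.
    static int fibonacciPlace (int h) {
        return (int)(h * 0x9E3779B97F4A7C15L >>> SHIFT);
    }

    public static void main (String[] args) {
        for (int x = 0; x < 4; x++) {
            int key = x << 16 | 7; // x varies, y is fixed at 7
            System.out.println("mask: " + maskPlace(key) + "  fibonacci: " + fibonacciPlace(key));
        }
        // Masking prints bucket 7 every time (total pileup); Fibonacci
        // hashing spreads the same keys across distinct buckets.
    }
}
```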

I also should mention that I ran some JMH benchmarks on CuckooObjectMap and some variants that are hardened somewhat against mass-collision scenarios; all variants had comparable speed to the original, with very slow insertion relative to any other maps I tried, and an insignificant improvement to lookup times (within the margin of error) relative to ObjectMap. I was rather discouraged by this, because I had been hoping to switch some implementations to a variant of CuckooObjectMap, but it didn't substantially outperform the existing code in any metric. I just hardened one variant some more, and it can now successfully insert 256 keys with the same hashCode() (0 for all of them); I expect it performs a little worse than the existing CuckooObjectMap, and could perform substantially worse, but won't throw an OutOfMemoryError on small input sets: https://github.com/tommyettinger/jdkgdxds/blob/master/src/test/java/com/github/tommyettinger/ds/test/ObjectObjectCuckooMap.java#L408

theigl added a commit that referenced this pull request Jan 21, 2022
theigl added a commit that referenced this pull request Jan 22, 2022
theigl added a commit that referenced this pull request Jan 23, 2022
@theigl
Copy link
Collaborator

theigl commented Jan 24, 2022

@tommyettinger:

My point is, if you control the keys at every stage and know their hashCode() implementations will be well-distributed, masking is a valid option, and will likely be slightly faster than using Fibonacci hashing. That case is true for all identity-hashed maps, since the identity hash code is very thoroughly mixed. If you don't control the keys or have to accept keys with poorly distributed hashCode() implementations, like Float or even unusual sets of int keys, then Fibonacci hashing can help considerably, as I described in my previous comment.

I decided against dropping Fibonacci hashing from all maps and instead only removed it from the two identity maps. Thank you very much for your advice!

I also should mention that I ran some JMH benchmarks on CuckooObjectMap and some variants that are hardened somewhat against mass-collision scenarios; all variants had comparable speed to the original, with very slow insertion relative to any other maps I tried, and an insignificant improvement to lookup times (within the margin of error) relative to ObjectMap. I was rather discouraged by this, because I had been hoping to switch some implementations to a variant of CuckooObjectMap, but it didn't substantially outperform the existing code in any metric. I just hardened one variant some more, and it can now successfully insert 256 keys with the same hashCode() (0 for all of them); I expect it performs a little worse than the existing CuckooObjectMap, and could perform substantially worse, but won't throw an OutOfMemoryError on small input sets:

I ran several days' worth of benchmarks against my PR #876 and was finally able to get rid of CuckooObjectMap in Kryo. The only place where we used the cuckoo map was resolving registrations for classes. I replaced it with the improved version of IdentityMap (inlined get and masking instead of Fibonacci hashing). Performance is the same or slightly faster in all my real-world benchmarks.

I pushed a custom Kryo version built from the PR into our production environment and will monitor it for a day or two. So far it looks like a 3-5% performance boost over master.

@tommyettinger
Copy link

A 3-5% performance boost is great considering no features were lost in exchange. I like the approach where identity maps can rely on identityHashCode() producing good output, since the cases where it doesn't are vanishingly rare, and being able to omit the long multiply and shift should help there (especially since getting an identityHashCode() can be slower than some hashCode() implementations). I also like not having to worry about collision buildup in the other maps/sets, because those still use Fibonacci hashing to mix the upper and lower bits.

I'm glad you were able to figure out a solution that avoids using CuckooObjectMap. The insertion speed of the Cuckoo maps was surprisingly poor to me, and the lookup speed is very close, as you said, so their main advantage is diminished compared to the maps in this PR.

I think part of the problem is that cuckoo hashing really does seem to depend on an implementation of double hashing (where a key has two independent hashCode() values); C++ maps can provide this, but Java really can't. The stash behavior also only guarantees constant-time operations if the stash is fixed-size, and the larger the stash, the larger the constant factor. If the stash is fixed-size and runs out of room, it's a major problem. Being able to use an old standard like linear probing is certainly more comforting.
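
For reference, a lookup in the cuckoo-with-stash style discussed here is roughly the following (a sketch of the general scheme, not the exact CuckooObjectMap code; the remixing multiplier is an arbitrary stand-in):

```java
// Sketch of cuckoo-with-stash lookup, not the exact CuckooObjectMap code.
// A key may sit in one of two bucket positions derived from one hashCode(),
// plus a small stash, so get() probes at most 2 + stashSize slots.
class CuckooSketch<K, V> {
    K[] keyTable; // length is capacity + stashSize; the stash sits at the end
    V[] valueTable;
    int mask, capacity, stashSize;

    int hash1 (int h) {
        return h & mask;
    }

    int hash2 (int h) {
        // Java offers no second, independent hash of the key, so the second
        // position just remixes the same hashCode() bits.
        h *= 0xB4B82E39; // arbitrary odd multiplier, a stand-in constant
        return (h ^ h >>> 16) & mask;
    }

    V get (K key) {
        int h = key.hashCode();
        int i = hash1(h);
        if (key.equals(keyTable[i])) return valueTable[i];
        i = hash2(h);
        if (key.equals(keyTable[i])) return valueTable[i];
        for (int s = capacity; s < capacity + stashSize; s++) // linear stash scan
            if (key.equals(keyTable[s])) return valueTable[s];
        return null;
    }
}
```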

theigl added a commit that referenced this pull request Jan 25, 2022
* #845 Replace Fibonacci hashing with plain masking for identity maps and inline `get` methods

* #845 Adjust JavaDoc

* #845 Add updated `MapBenchmark`
@theigl
Copy link
Collaborator

theigl commented Jan 25, 2022

@pyb1993 @tommyettinger: Most changes suggested in this PR have been applied in #876 and #877. Many thanks to both of you!

@theigl theigl closed this Jan 25, 2022