This repository has been archived by the owner on Oct 21, 2024. It is now read-only.

Improve 128-bit hash performance #6

Merged
kevinconaway merged 1 commit into master from kevinconaway/mmh-128-perf on Aug 5, 2020

@kevinconaway kevinconaway commented May 26, 2020

This commit improves 128-bit hash performance, mainly in HashWriter128 but also in the standalone Hash128 function.

Previously, the writer copied all incoming data into a 16-byte buffer and processed it block by block, which incurred a lot of memmove overhead. The writer now reads the incoming data directly in 16-byte blocks using unsafe and copies only the remainder to a tail buffer. This is much more efficient for larger inputs, as well as for inputs whose length is a multiple of 16 bytes.
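The block handling described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `blockWriter` type and its fields are hypothetical, it uses `encoding/binary` in place of unsafe reads to stay portable, and it records each block instead of mixing it into hash state.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blockWriter is a hypothetical stand-in for a streaming 128-bit hasher.
type blockWriter struct {
	tail    [16]byte    // leftover bytes (at most 15) carried between writes
	tailLen int         // number of valid bytes in tail
	blocks  [][2]uint64 // consumed blocks; a real hasher would mix these into its state
}

// consume decodes one 16-byte block as two little-endian 64-bit words.
func (w *blockWriter) consume(block []byte) {
	k1 := binary.LittleEndian.Uint64(block[0:8])
	k2 := binary.LittleEndian.Uint64(block[8:16])
	w.blocks = append(w.blocks, [2]uint64{k1, k2})
}

// Write processes full blocks straight from p and buffers only the remainder.
func (w *blockWriter) Write(p []byte) (int, error) {
	n := len(p)
	// Top up a partially filled tail left over from an earlier write.
	if w.tailLen > 0 {
		c := copy(w.tail[w.tailLen:], p)
		w.tailLen += c
		p = p[c:]
		if w.tailLen < 16 {
			return n, nil
		}
		w.consume(w.tail[:])
		w.tailLen = 0
	}
	// Consume full 16-byte blocks directly from the input, with no copying.
	for len(p) >= 16 {
		w.consume(p[:16])
		p = p[16:]
	}
	// Only the remainder (fewer than 16 bytes) is copied to the tail buffer.
	w.tailLen = copy(w.tail[:], p)
	return n, nil
}

func main() {
	data := make([]byte, 40)
	for i := range data {
		data[i] = byte(i)
	}

	var oneShot, chunked blockWriter
	oneShot.Write(data)
	for i := 0; i < len(data); i += 7 {
		end := i + 7
		if end > len(data) {
			end = len(data)
		}
		chunked.Write(data[i:end])
	}
	fmt.Println(len(oneShot.blocks), oneShot.tailLen)
	fmt.Println(len(chunked.blocks), chunked.tailLen)
}
```

The point of the design is that the per-byte copy only happens for data that cannot form a full block yet, so an input that is already aligned to 16 bytes is never copied at all.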

The other change is to use bits.RotateLeft64 where appropriate instead of hand-rolled rotation code. bits.RotateLeft64 compiles to a rotate instruction (ROLQ) where available, which speeds this up.

The changes to HashWriter128 are backwards compatible. New methods (AddString and AddBytes) let callers skip handling an error result, and a new Sum128() method returns a Hash128Value, similar to the standalone hash function.
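The shape of such an API can be sketched as below. Everything here is hypothetical: `writer128` and `value128` are illustrative names, the method signatures are guesses based only on the description above, and the sketch wraps FNV-1a from the standard library as a placeholder hash rather than murmur3.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash"
	"hash/fnv"
)

// value128 is a hypothetical 128-bit result type (a stand-in for Hash128Value).
type value128 struct {
	Hi, Lo uint64
}

// writer128 is a hypothetical streaming hasher. FNV-1a is only a placeholder
// so the example runs with the standard library alone.
type writer128 struct {
	h hash.Hash
}

func newWriter128() *writer128 { return &writer128{h: fnv.New128a()} }

// Write keeps the io.Writer signature, so existing callers are unaffected.
func (w *writer128) Write(p []byte) (int, error) { return w.h.Write(p) }

// AddBytes feeds bytes without returning an error; hash.Hash documents
// that its Write never returns one.
func (w *writer128) AddBytes(p []byte) { w.h.Write(p) }

// AddString is the string counterpart of AddBytes.
func (w *writer128) AddString(s string) { w.h.Write([]byte(s)) }

// Sum128 returns the current hash as a value type instead of a byte slice.
func (w *writer128) Sum128() value128 {
	sum := w.h.Sum(nil) // 16 bytes for a 128-bit hash
	return value128{
		Hi: binary.BigEndian.Uint64(sum[:8]),
		Lo: binary.BigEndian.Uint64(sum[8:]),
	}
}

func main() {
	w := newWriter128()
	w.AddString("hello")
	w.AddBytes([]byte(" world"))
	v := w.Sum128()
	fmt.Printf("%016x%016x\n", v.Hi, v.Lo)
}
```

Keeping `Write` alongside the new methods is what preserves backwards compatibility in this sketch: code that treats the writer as an io.Writer still compiles, while new callers can use the error-free methods.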

Some tests & benchmarks from the twmb/murmur3 project were borrowed to test edge cases around the handling of tail bytes in the writer.

Below are benchmarks for the standalone Hash128 function:

name                  old time/op    new time/op    delta
Hash128Branches/0-8     1.87ns ±17%    1.88ns ± 7%     ~     (p=0.469 n=10+10)
Hash128Branches/1-8     6.21ns ± 3%    6.26ns ± 4%     ~     (p=0.448 n=9+9)
Hash128Branches/2-8     7.02ns ± 0%    6.53ns ± 2%   -7.02%  (p=0.000 n=8+10)
Hash128Branches/3-8     8.21ns ± 4%    7.75ns ± 4%   -5.63%  (p=0.000 n=10+10)
Hash128Branches/4-8     8.61ns ± 4%    7.35ns ± 0%  -14.60%  (p=0.000 n=10+6)
Hash128Branches/5-8     8.77ns ± 4%    7.96ns ± 2%   -9.29%  (p=0.000 n=10+9)
Hash128Branches/6-8     9.90ns ± 4%    8.40ns ± 7%  -15.16%  (p=0.000 n=10+10)
Hash128Branches/7-8     10.5ns ± 2%     9.1ns ± 5%  -13.76%  (p=0.000 n=9+10)
Hash128Branches/8-8     10.1ns ± 0%     9.8ns ± 4%   -3.25%  (p=0.003 n=7+10)
Hash128Branches/9-8     11.8ns ± 1%    10.9ns ± 3%   -7.41%  (p=0.000 n=10+9)
Hash128Branches/10-8    12.0ns ± 1%    10.9ns ± 3%   -9.09%  (p=0.000 n=9+10)
Hash128Branches/11-8    13.5ns ± 7%    11.9ns ± 7%  -12.13%  (p=0.000 n=10+10)
Hash128Branches/12-8    13.5ns ± 1%    12.8ns ± 7%   -5.51%  (p=0.002 n=9+9)
Hash128Branches/13-8    14.0ns ± 1%    12.5ns ± 3%  -10.74%  (p=0.000 n=10+10)
Hash128Branches/14-8    15.1ns ± 1%    12.9ns ± 5%  -14.62%  (p=0.000 n=9+10)
Hash128Branches/15-8    15.4ns ± 1%    13.8ns ± 1%  -10.81%  (p=0.000 n=10+8)
Hash128Branches/16-8    8.34ns ± 2%    7.54ns ± 4%   -9.64%  (p=0.000 n=10+10)
Hash128Sizes/32-8       11.0ns ± 1%     9.6ns ± 3%  -12.81%  (p=0.000 n=10+10)
Hash128Sizes/64-8       16.2ns ± 1%    14.7ns ± 7%   -8.91%  (p=0.000 n=10+10)
Hash128Sizes/128-8      26.4ns ± 4%    22.5ns ± 6%  -14.73%  (p=0.000 n=10+10)
Hash128Sizes/256-8      45.5ns ± 1%    40.7ns ± 4%  -10.53%  (p=0.000 n=10+10)
Hash128Sizes/512-8      90.0ns ± 1%    79.3ns ± 4%  -11.98%  (p=0.000 n=10+10)
Hash128Sizes/1024-8      163ns ± 1%     146ns ± 7%  -10.84%  (p=0.000 n=10+10)
Hash128Sizes/2048-8      327ns ± 2%     298ns ± 4%   -9.07%  (p=0.000 n=10+10)
Hash128Sizes/4096-8      637ns ± 1%     558ns ± 3%  -12.51%  (p=0.000 n=10+9)
Hash128Sizes/8192-8     1.27µs ± 1%    1.10µs ± 1%  -13.33%  (p=0.000 n=9+9)

name                  old speed      new speed      delta
Hash128Branches/1-8    161MB/s ± 3%   160MB/s ± 4%     ~     (p=0.340 n=9+9)
Hash128Branches/2-8    285MB/s ± 0%   307MB/s ± 2%   +7.62%  (p=0.000 n=8+10)
Hash128Branches/3-8    365MB/s ± 4%   387MB/s ± 4%   +5.98%  (p=0.000 n=10+10)
Hash128Branches/4-8    465MB/s ± 4%   544MB/s ± 0%  +17.04%  (p=0.000 n=10+6)
Hash128Branches/5-8    570MB/s ± 4%   629MB/s ± 2%  +10.22%  (p=0.000 n=10+9)
Hash128Branches/6-8    606MB/s ± 4%   715MB/s ± 7%  +17.89%  (p=0.000 n=10+10)
Hash128Branches/7-8    662MB/s ± 6%   773MB/s ± 5%  +16.81%  (p=0.000 n=10+10)
Hash128Branches/8-8    792MB/s ± 1%   819MB/s ± 4%   +3.39%  (p=0.005 n=10+10)
Hash128Branches/9-8    765MB/s ± 1%   820MB/s ± 8%   +7.09%  (p=0.001 n=10+10)
Hash128Branches/10-8   832MB/s ± 1%   915MB/s ± 3%  +10.07%  (p=0.000 n=9+10)
Hash128Branches/11-8   814MB/s ± 6%   927MB/s ± 7%  +13.82%  (p=0.000 n=10+10)
Hash128Branches/12-8   888MB/s ± 1%   943MB/s ± 7%   +6.14%  (p=0.004 n=9+9)
Hash128Branches/13-8   932MB/s ± 1%  1044MB/s ± 3%  +12.04%  (p=0.000 n=10+10)
Hash128Branches/14-8   923MB/s ± 1%  1084MB/s ± 5%  +17.37%  (p=0.000 n=10+10)
Hash128Branches/15-8   973MB/s ± 1%  1089MB/s ± 2%  +11.97%  (p=0.000 n=10+10)
Hash128Branches/16-8  1.92GB/s ± 2%  2.12GB/s ± 4%  +10.69%  (p=0.000 n=10+10)
Hash128Sizes/32-8     2.89GB/s ± 1%  3.33GB/s ± 3%  +14.91%  (p=0.000 n=10+10)
Hash128Sizes/64-8     3.96GB/s ± 1%  4.35GB/s ± 7%   +9.83%  (p=0.000 n=10+10)
Hash128Sizes/128-8    4.85GB/s ± 4%  5.69GB/s ± 5%  +17.29%  (p=0.000 n=10+10)
Hash128Sizes/256-8    5.63GB/s ± 1%  6.29GB/s ± 4%  +11.80%  (p=0.000 n=10+10)
Hash128Sizes/512-8    5.69GB/s ± 1%  6.46GB/s ± 4%  +13.62%  (p=0.000 n=10+10)
Hash128Sizes/1024-8   6.26GB/s ± 1%  7.04GB/s ± 7%  +12.50%  (p=0.000 n=9+10)
Hash128Sizes/2048-8   6.26GB/s ± 2%  6.88GB/s ± 4%  +10.02%  (p=0.000 n=10+10)
Hash128Sizes/4096-8   6.43GB/s ± 1%  7.35GB/s ± 3%  +14.34%  (p=0.000 n=10+9)
Hash128Sizes/8192-8   6.47GB/s ± 1%  7.46GB/s ± 1%  +15.38%  (p=0.000 n=9+9)

And benchmarks for HashWriter128:

name              old time/op    new time/op    delta
128Branches/0-8     8.52ns ± 2%   10.22ns ± 5%   +19.89%  (p=0.000 n=9+10)
128Branches/1-8     13.7ns ± 3%    12.8ns ± 4%    -6.93%  (p=0.000 n=10+8)
128Branches/2-8     14.6ns ± 4%    13.6ns ± 1%    -6.46%  (p=0.000 n=10+8)
128Branches/3-8     14.7ns ± 0%    14.0ns ± 5%    -4.97%  (p=0.000 n=7+10)
128Branches/4-8     14.8ns ± 1%    13.6ns ± 4%    -7.50%  (p=0.000 n=10+9)
128Branches/5-8     14.3ns ± 2%    14.4ns ± 5%      ~     (p=0.925 n=10+10)
128Branches/6-8     14.4ns ± 1%    13.9ns ± 3%    -2.83%  (p=0.000 n=9+10)
128Branches/7-8     14.5ns ± 2%    15.1ns ± 3%    +4.21%  (p=0.000 n=10+10)
128Branches/8-8     15.3ns ± 1%    14.4ns ± 1%    -5.80%  (p=0.000 n=10+8)
128Branches/9-8     15.9ns ± 1%    15.5ns ± 2%    -2.14%  (p=0.000 n=10+10)
128Branches/10-8    16.2ns ± 2%    15.6ns ± 4%    -3.89%  (p=0.000 n=10+10)
128Branches/11-8    16.7ns ± 4%    15.8ns ± 2%    -5.42%  (p=0.000 n=10+9)
128Branches/12-8    17.0ns ± 2%    16.4ns ± 3%    -3.54%  (p=0.000 n=10+9)
128Branches/13-8    17.5ns ± 3%    16.9ns ± 3%    -3.20%  (p=0.000 n=10+10)
128Branches/14-8    18.0ns ± 6%    17.6ns ± 1%    -2.20%  (p=0.047 n=10+9)
128Branches/15-8    18.2ns ± 1%    18.5ns ± 3%    +1.65%  (p=0.022 n=9+9)
128Branches/16-8    18.7ns ±30%    12.7ns ± 3%   -32.21%  (p=0.000 n=10+10)
128Sizes/32-8       26.0ns ± 1%    14.4ns ± 1%   -44.42%  (p=0.000 n=9+8)
128Sizes/64-8       42.7ns ± 1%    20.1ns ± 1%   -52.88%  (p=0.000 n=9+9)
128Sizes/128-8      78.3ns ± 1%    33.8ns ± 2%   -56.78%  (p=0.000 n=8+10)
128Sizes/256-8       159ns ± 3%      60ns ± 3%   -61.88%  (p=0.000 n=10+10)
128Sizes/512-8       301ns ± 3%     111ns ± 1%   -62.94%  (p=0.000 n=10+10)
128Sizes/1024-8      587ns ± 3%     222ns ± 2%   -62.18%  (p=0.000 n=10+10)
128Sizes/2048-8     1.15µs ± 6%    0.45µs ± 2%   -61.10%  (p=0.000 n=10+10)
128Sizes/4096-8     2.36µs ± 4%    0.87µs ± 1%   -63.21%  (p=0.000 n=10+10)
128Sizes/8192-8     4.37µs ± 4%    1.76µs ± 2%   -59.65%  (p=0.000 n=10+10)

name              old speed      new speed      delta
128Branches/1-8   73.1MB/s ± 3%  78.4MB/s ± 4%    +7.25%  (p=0.000 n=10+8)
128Branches/2-8    137MB/s ± 4%   147MB/s ± 1%    +6.78%  (p=0.000 n=10+8)
128Branches/3-8    204MB/s ± 1%   215MB/s ± 5%    +5.18%  (p=0.000 n=9+10)
128Branches/4-8    271MB/s ± 1%   293MB/s ± 4%    +8.11%  (p=0.000 n=10+9)
128Branches/5-8    349MB/s ± 2%   348MB/s ± 5%      ~     (p=0.853 n=10+10)
128Branches/6-8    418MB/s ± 1%   430MB/s ± 2%    +2.92%  (p=0.000 n=9+10)
128Branches/7-8    483MB/s ± 1%   464MB/s ± 3%    -3.95%  (p=0.000 n=10+10)
128Branches/8-8    524MB/s ± 1%   552MB/s ± 3%    +5.40%  (p=0.000 n=10+10)
128Branches/9-8    567MB/s ± 1%   580MB/s ± 2%    +2.16%  (p=0.000 n=10+10)
128Branches/10-8   618MB/s ± 2%   643MB/s ± 4%    +4.11%  (p=0.001 n=10+10)
128Branches/11-8   658MB/s ± 4%   698MB/s ± 1%    +6.08%  (p=0.000 n=10+8)
128Branches/12-8   706MB/s ± 3%   735MB/s ± 1%    +4.07%  (p=0.000 n=10+8)
128Branches/13-8   743MB/s ± 3%   767MB/s ± 2%    +3.31%  (p=0.000 n=10+10)
128Branches/14-8   780MB/s ± 6%   798MB/s ± 1%    +2.27%  (p=0.028 n=10+9)
128Branches/15-8   824MB/s ± 1%   806MB/s ± 6%    -2.21%  (p=0.010 n=9+10)
128Branches/16-8   876MB/s ±25%  1263MB/s ± 3%   +44.22%  (p=0.000 n=10+10)
128Sizes/32-8     1.23GB/s ± 2%  2.22GB/s ± 1%   +79.91%  (p=0.000 n=10+7)
128Sizes/64-8     1.50GB/s ± 1%  3.18GB/s ± 1%  +112.04%  (p=0.000 n=9+10)
128Sizes/128-8    1.63GB/s ± 1%  3.78GB/s ± 2%  +131.40%  (p=0.000 n=8+10)
128Sizes/256-8    1.61GB/s ± 3%  4.23GB/s ± 3%  +162.30%  (p=0.000 n=10+10)
128Sizes/512-8    1.70GB/s ± 3%  4.59GB/s ± 1%  +169.50%  (p=0.000 n=10+10)
128Sizes/1024-8   1.74GB/s ± 3%  4.61GB/s ± 3%  +164.50%  (p=0.000 n=10+10)
128Sizes/2048-8   1.78GB/s ± 6%  4.58GB/s ± 2%  +156.91%  (p=0.000 n=10+10)
128Sizes/4096-8   1.74GB/s ± 4%  4.73GB/s ± 1%  +171.70%  (p=0.000 n=10+10)
128Sizes/8192-8   1.87GB/s ± 4%  4.65GB/s ± 2%  +147.77%  (p=0.000 n=10+10)

mmh3.go
@@ -199,160 +186,190 @@ func Hash128x64(key []byte) []byte {
 	return Hash128(key).Bytes()
 }

-// WriteHash128x64 is a
+// WriteHash128x64 creates a hash of key and writes it to ret
😅

@jmoiron jmoiron left a comment

Great PR.

@kevinconaway kevinconaway merged commit 30884ca into master Aug 5, 2020
@kevinconaway kevinconaway deleted the kevinconaway/mmh-128-perf branch August 5, 2020 15:16