
Hash Testing


Here are some numbers I got running against the same dataset with sha256 (gcrypt), xxhash and murmur3 from the hashes-update branch. Each run came up with the same total number of unique hashes, which is encouraging to see.

Each run was with 16 threads, writing to a hash file on a different partition. I rebooted between runs. SHA256 is our baseline (and the default in duperemove). Neither the xxhash nor the murmur3 patch is quite in a mergeable state yet.

SHA256 via libgcrypt:

time duperemove.git/duperemove -rh --hash-threads=16 --write-hashes=hash-logs/hashes-gcrypt.dup /btrfs/
real    26m25.251s
user    78m38.612s
sys     4m52.848s
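
For reference, the baseline is a one-shot SHA256 digest computed through libgcrypt. This is only a minimal sketch of that call path, not duperemove's actual hashing code; the block size and buffer contents are placeholders.

#include <stdio.h>
#include <gcrypt.h>

#define BLOCK_SIZE (128 * 1024)	/* placeholder block size */

/* Sketch: hash one block of file data with libgcrypt's one-shot helper.
 * The digest is 32 bytes for SHA256. */
static void sha256_block(const unsigned char *buf, size_t len,
			 unsigned char *digest /* 32 bytes */)
{
	gcry_md_hash_buffer(GCRY_MD_SHA256, digest, buf, len);
}

int main(void)
{
	static unsigned char buf[BLOCK_SIZE];	/* stand-in for file data */
	unsigned char digest[32];
	size_t i;

	gcry_check_version(NULL);	/* initialize libgcrypt */
	sha256_block(buf, sizeof(buf), digest);

	for (i = 0; i < sizeof(digest); i++)
		printf("%02x", digest[i]);
	printf("\n");
	return 0;
}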

murmur3:

time duperemove.git/duperemove -rh --hash-threads=16 --write-hashes=hashes-murmur3.dup /btrfs/
real    26m19.224s
user    15m56.144s
sys     5m16.848s

xxhash, with changes (fills the 32-byte digest by splitting the input buffer into chunks and hashing each chunk; see the sketch after the timing below):

time duperemove.git/duperemove -rh --hash-threads=16 --write-hashes=hash-logs/hashes-xxhash.dup /btrfs/
real    26m11.466s
user    1m11.408s
sys     9m57.720s
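
The chunked fill works roughly like the following. This is a minimal sketch, assuming the one-shot XXH64() call from the public xxHash API; the chunk split and seed values are my illustration, not necessarily what the patch does. The idea is to split the block into four pieces, hash each with XXH64, and concatenate the four 64-bit results to fill the 32-byte digest.

#include <string.h>
#include <stdint.h>
#include <xxhash.h>

/* Sketch: fill a 32-byte digest from 64-bit xxhash results by hashing
 * four chunks of the input buffer. Seeds and chunk split are illustrative. */
static void xxhash_chunked_digest(const void *buf, size_t len,
				  unsigned char digest[32])
{
	const unsigned char *p = buf;
	size_t chunk = len / 4;
	int i;

	for (i = 0; i < 4; i++) {
		/* Last chunk picks up any remainder from the division. */
		size_t this_len = (i == 3) ? len - 3 * chunk : chunk;
		uint64_t h = XXH64(p + i * chunk, this_len, (uint64_t)i);

		memcpy(digest + i * sizeof(h), &h, sizeof(h));
	}
}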

Some observations:

  • On my system, the overall time to complete the scan + checksum was dominated by I/O (not CPU), so the total wall-clock time of the test didn't change. On a system with a slow CPU this might be different (I haven't tested that, though).

  • The difference in CPU usage still makes alternatives to the default hash worth having. SHA256 is very heavy for what we're doing, and duperemove can tolerate a small number of false positives - the kernel verifies the data for us anyway.

Here are some numbers for a direct xxhash encoding.

time duperemove.git/duperemove -rh --hash-threads=16 --write-hashes=hash-logs-2014-11-24/hashes-xxhash-direct.dup /btrfs/
real    26m17.420s
user    10m1.050s
sys     5m4.412s

The direct encoding version exists so that I can compare our implementation against the reference, and so that the digest returned from a running checksum is the same as the digest from a one-shot call.
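
Whether the two call paths agree can be checked with something like the following. This is a minimal sketch assuming the public xxHash streaming API (XXH64_createState/reset/update/digest); the buffer contents and split point are arbitrary.

#include <assert.h>
#include <string.h>
#include <stdint.h>
#include <xxhash.h>

/* Sketch: a running (streaming) checksum should match the one-shot call. */
int main(void)
{
	unsigned char buf[4096];
	uint64_t one_shot, running;
	XXH64_state_t *state;

	memset(buf, 0xab, sizeof(buf));	/* arbitrary test data */

	one_shot = XXH64(buf, sizeof(buf), 0);

	state = XXH64_createState();
	XXH64_reset(state, 0);
	XXH64_update(state, buf, 1024);			/* feed in pieces */
	XXH64_update(state, buf + 1024, sizeof(buf) - 1024);
	running = XXH64_digest(state);
	XXH64_freeState(state);

	assert(one_shot == running);
	return 0;
}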

I still have to investigate why CPU usage went up relative to the chunked version. That said, the extent results look OK, though we need more testing.