
MD5-related optimizations#6

Closed
Chainfire wants to merge 3 commits into RsyncProject:master from Chainfire:CSUM2-AVX2


Chainfire commented May 31, 2020

  • This is a 3-parter. The first commit moves the OpenSSL-related defines from checksum.c to mdigest.h, because otherwise I found myself copy/pasting them more than once.

  • The second commit enables parallel computation of MD5 hashes in the block-matching get_checksum2() phase. As each block's hash is independent, we can process up to 4 blocks simultaneously with SSE2, and 8 blocks with AVX2, leading to a real-world 2x to 6x performance gain / CPU usage reduction (even over OpenSSL-optimized MD5).

However, to make this happen without significant changes to the rest of rsync's codebase, a block prefetcher had to be created. (This whole commit requires --enable-simd, like my previous contributions.) Full compatibility is maintained with non-SIMD counterparts.

The same mechanism could be used for multithreading checksums as well, but that is beyond the scope of this patch.

  • The third commit provides the MD5P8 whole-file checksum. This is an MD5-based checksum that cuts the input stream into 8 independent streams (64-byte interleave), whose final states are brought together to create the final MD5 checksum. This has the same strengths (and weaknesses) as the normal MD5 checksum, but allows parallel processing, again with a 2x to 6x performance gain.

Both an optimized parallel implementation (--enable-simd) and a reference C implementation are available. MD5P8 is slightly slower on <10kB files due to the additional overhead, but similar to MD5 on larger files without SIMD, and much faster on larger files with SIMD.

Note that get_checksum2() keeps using normal MD5 even if whole-file checksum is MD5P8, because that is parallelized with SIMD anyway if available, and using MD5P8 would just add overhead and quite probably be slower.

Further note that the CSUM_MD5 and CSUM_MD5P8 defines now appear in both checksum.c and simd-checksum-x86_64.cpp and need to be kept in sync. Perhaps they should be moved to a header?
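To make the MD5P8 idea from the third commit concrete, here is a minimal Python sketch of a 64-byte interleaved, 8-stream MD5. It only illustrates the structure. The function name `md5p8_sketch` is hypothetical, and the way the eight lane results are folded into one digest (a final MD5 over the concatenated lane digests) is an assumption, since the PR text does not spell out how the final states are combined. Its output should therefore not be expected to match the real MD5P8.

```python
import hashlib

LANES = 8        # number of independent streams, per the PR description
INTERLEAVE = 64  # bytes per chunk; 64 is also MD5's internal block size


def md5p8_sketch(data: bytes) -> bytes:
    """Illustrative interleaved MD5; NOT bit-compatible with the real MD5P8."""
    lanes = [hashlib.md5() for _ in range(LANES)]
    # Deal 64-byte chunks round-robin: chunk 0 -> lane 0, chunk 1 -> lane 1, ...
    for index, offset in enumerate(range(0, len(data), INTERLEAVE)):
        lanes[index % LANES].update(data[offset:offset + INTERLEAVE])
    # Assumption: fold the 8 lane digests into one with a final MD5 pass.
    final = hashlib.md5()
    for lane in lanes:
        final.update(lane.digest())
    return final.digest()
```

Because each lane only ever sees its own chunks, the eight MD5 states never depend on one another, which is what lets a SIMD implementation advance all of them in lockstep.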


Motivation: although xxhash is now available for rsync, it is an external dependency rather than part of the code itself. By my last evaluation, many distros do not yet ship xxhash, so their rsync packages will be built without xxhash support. That said, I can imagine this PR not being merged because it is not part of the direction rsync is moving in. I need it myself because of an uncommon build target, and the code might as well be available to everyone.

The parallel computation of MD5 hashes in get_checksum2() will benefit connections both to recent builds without xxhash and to older builds of rsync, provided the block-matching phase applies. If it doesn't reduce transfer time because of connection or disk speed limitations, it will at least massively reduce CPU usage on the supporting client.

The use case for MD5P8 is more limited, as its usefulness requires both ends to be running a supporting rsync build, but one end not supporting xxhash. If both ends do support xxhash, that should always be the preferred checksum (while MD5P8 can reach gigabytes per second, xxhash is still twice as fast). I only created it because it was a small effort now that parallel MD5 computation was available anyway, and it doesn't have any external dependencies.


I've done some benchmarks for transferring 1GB files between a fast and a slow CPU on 1GbE LAN, compared to normal MD5 usage (all tests already including my previous block size patches and get_checksum1() optimizations):

get_checksum2() MD5 parallelization with MD5 whole-file checksum, both files existing on both ends:

  • 33% transfer time reduction
  • 52% CPU usage reduction

get_checksum2() MD5 parallelization and MD5P8 whole-file checksum, both files existing on both ends:

  • 54% transfer time reduction
  • 84% CPU usage reduction

xxhash for both get_checksum2() and whole-file checksum, both files existing on both ends:

  • 54% transfer time reduction
  • 90% CPU usage reduction

MD5P8, new file:

  • 33% transfer time reduction
  • 86% CPU usage reduction

xxhash, new file:

  • 33% transfer time reduction
  • 92% CPU usage reduction

MD5P8, local checksum:

  • 83% CPU usage reduction

xxhash, local checksum:

  • 94% CPU usage reduction

Obviously these results are highly specific to my setup and YMMV. However, my daily syncing of TBs of data is now twice as fast, with average CPU usage down to less than a quarter. xxhash doesn't pull ahead much in this case because CPU power while checksumming is no longer the bottleneck after these patches. With an even faster network and disks (10GbE + NVMe), xxhash might be twice as fast.
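For readers comparing the percentage reductions above with the "2x to 6x" figures quoted for the commits, the conversion is simple arithmetic: a reduction of p percent leaves (100 - p) percent of the original cost, so the speedup factor is 1 / (1 - p/100). The helper name below is hypothetical; the percentages are the ones from the benchmark list.

```python
def speedup_from_reduction(reduction_percent: float) -> float:
    """Convert 'N% less time/CPU' into an 'X times faster' factor."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    remaining = 1.0 - reduction_percent / 100.0
    return 1.0 / remaining


# Percentages taken from the benchmark list above.
for label, pct in [
    ("MD5 parallelization, transfer time", 33),
    ("MD5 parallelization, CPU usage", 52),
    ("MD5P8, transfer time", 54),
    ("MD5P8, CPU usage", 84),
]:
    print(f"{label}: {pct}% reduction = {speedup_from_reduction(pct):.2f}x")
```

By this measure the 54% transfer time reduction is a little over 2x, and the 84% CPU usage reduction is 6.25x.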

Chainfire added 3 commits May 31, 2020 13:31

Works just as well, prevents having to repeat them across files

MD5 hashes computed during rsync's block matching phase are independent
and thus possible to process in parallel. This code processes 4 blocks
in parallel if SSE2 is available, or 8 if AVX2 is available. An increase
of performance (or decrease of CPU usage) of up to 6x has been measured.

A prefetching algorithm is used to predict and load upcoming blocks, as
this prevents the need for extensive modifications to other parts of
the rsync sources to get this working.

Splits the input up into 8 independent streams (64-byte interleave), and
produces a final checksum based on the end state of those 8 streams. If
parallelization of MD5 hashing is available, the performance gain is 2x
to 6x.

xxHash is still preferred (and faster), but this provides a reasonably
fast fallback for the case where xxHash libraries are not available at
build time.
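
The batching described in the second commit message can be sketched roughly as follows. This is an illustration only: the function name `block_digests_batched` is hypothetical, a Python loop stands in for the SSE2/AVX2 kernel, and the sketch hashes bare block data without reproducing everything rsync's get_checksum2() does. The point is the shape: gather up to `lanes` upcoming blocks first, hash them as one batch, and return exactly the digests a one-at-a-time loop would produce.

```python
import hashlib
from typing import Iterator


def block_digests_batched(data: bytes, block_len: int, lanes: int = 8) -> Iterator[bytes]:
    """Yield one MD5 digest per block, computing up to `lanes` blocks per batch."""
    if block_len <= 0 or lanes <= 0:
        raise ValueError("block_len and lanes must be positive")
    offsets = range(0, len(data), block_len)
    for start in range(0, len(offsets), lanes):
        # "Prefetch": collect the next `lanes` blocks before hashing any of them.
        batch = [data[off:off + block_len] for off in offsets[start:start + lanes]]
        # Stand-in for the SIMD kernel, which would hash the whole batch in lockstep.
        digests = [hashlib.md5(block).digest() for block in batch]
        yield from digests
```

With `lanes=4` this mirrors the SSE2 case and with `lanes=8` the AVX2 case. Since every block is hashed independently, the batch size changes only how the work is scheduled, never the resulting digests.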

WayneD commented Jun 2, 2020

Thanks! I've put the changes into a file named "md5p8.diff" in the rsync-patches repo for now. I incorporated some of the changes that put more info into lib/mdigest.h, and I tweaked a few things for style and to fix a compiler warning. Here's the resulting patch:

https://git.samba.org/?p=rsync-patches.git;a=blob_plain;f=md5p8.diff;hb=ac98f867ff5e7e53a0157b967c7b216c86b0b0a6


WayneD commented Jun 8, 2020

I'm going to leave it as a maintained patch for now and consider merging it later.

WayneD closed this Jun 8, 2020
WayneD self-assigned this Jun 19, 2020

Chainfire commented

I'll update this to include the new build tests and to apply against the latest master.

Trogious added a commit to Trogious/rsync that referenced this pull request May 14, 2026
rsync.exe -av <local> user@host:/dst/ now transfers files over SSH with
byte-exact verification. Idempotent re-push transfers 0 bytes.

Four fixes to clear the runtime path after build came up:

* win32/win_select.c: select() shim. winsock's select() only handles
  SOCKETs; rsync's io.c calls select() on the pipe fds from
  piped_child. Classify each fd via GetFileType+GetNamedPipeInfo,
  defer sockets to real winsock select, poll pipes via PeekNamedPipe.
  10 ms cadence. ~170 LOC.

* win_spawn.c: bump CreatePipe buffer hint to 1 MB so the file-list
  phase doesn't deadlock on a full 4 KB anonymous pipe.

* util1.c::change_dir: treat 'X:\…', 'X:/…', and '\…' as absolute on
  Windows. Normalize curr_dir to forward slashes after getcwd so path
  joins don't mix separators.

* syscall.c::do_open_nofollow: force O_BINARY (MSVC defaults to text
  mode); skip the lstat→open→fstat dev/ino symlink-race check on
  Windows because MSVC's stat/fstat don't return stable values for
  those fields.

Pull and local-copy still hit the RtlCloneUserProcess fork hang —
tracked as task RsyncProject#6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tridge added a commit to tridge/rsync that referenced this pull request May 20, 2026
…ncProject#6 (chmod mode arg)

These two small syscall.c fixes were made at the start of the round-4
work but got dropped on the floor when I split the commit -- only the
docs (RsyncProject#1) and the bigger RsyncProject#2/RsyncProject#4/RsyncProject#5 deferred-immutable-dir series ended
up landed.  The tree was left dirty.

RsyncProject#3: do_rename (the non-_at variant) was missing the hardlink-aware
restore I added to do_rename_at last round.  Same shape -- when
renameat replaces a destination inode that had st_nlink > 1, the
remaining hardlinks survive carrying the cleared flags.  Restore via
new_fd before close (the fd still refers to the surviving inode).

RsyncProject#6: do_chmod and do_chmod_at force_change recovery were calling
make_mutable_fd(fd, mode, ...) where mode was the caller-supplied
chmod-target mode -- some callers (notably xattrs.c's set_xattr
recovery path) pass perm bits only, no S_IFREG / S_IFDIR, so on Linux
rsync_fchflags rejects the call as neither regular file nor directory
and recovery silently fails.  Use st.st_mode from the freshly-fstatted
target instead, which always has the right S_IFx bits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>