
Additional Perf. Optimizations #6

Merged: 35 commits, Nov 23, 2021
Conversation

@Sewer56 (Contributor) commented Oct 1, 2021

Hi, thanks for the performance optimizations.

I'm working on a delta patching algorithm for a mod loader and was interested in a managed implementation of VCDiff. In the process, I ended up making some miscellaneous extra improvements to the implementation.

Summary

  • Removed duplicate code in VCEncoder.
  • Removed unused code from internal APIs, e.g. Adler32.Combine.
  • Added Span returns for IByteBuffer to reduce the total number of heap allocations. The new API is used where possible.
  • Support using ByteBuffer as source in VCEncoder (to avoid an unnecessary memory copy).
  • Faster copying of the source in VCEncoder (the allocated array isn't zero-initialized [default .NET behaviour], as it will be entirely overwritten anyway).
  • Added the [SkipLocalsInit] attribute for the .NET 5 target to methods with 1+ million invocations (on 100 MB+ files).
  • Changed: Capped the max bufferSize in VCEncoder to the target length to avoid unnecessary excess memory allocation.
  • Added: Reads from the target in EncodeAsync are now asynchronous. Only writing to the output still isn't.
  • Added: Usage of GC.AllocateUninitializedArray for .NET 5 in BlockHash to avoid the implicit array zero-fill.
  • Added: Simple BenchmarkDotNet program (VCDiff.Benchmark).
  • Added: Vector fill of BlockHash tables.
  • Optimized: Bounds checks on vector operations.
  • Devirtualized VcDecoder (removed virtual method calls by eliminating the interface).
  • Added: Caching for small memory reads and fewer allocations in ByteStreamReader.
  • Added: Read directly into Span<T> for IByteBuffer.
  • Added: Use array rentals for fewer heap allocations on repeat decodes.
  • Reduced: Unnecessary heap allocations, particularly by using structs where appropriate and not re-creating e.g. InstructionMap unnecessarily.
  • Removed: Unnecessary behind-the-scenes MemoryStream allocations by using the array pooling in Microsoft.IO.RecyclableMemoryStream.
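For readers unfamiliar with the allocation tricks named above, here's a minimal, untested sketch of the general patterns (uninitialized arrays, [SkipLocalsInit], and ArrayPool rentals); the names here are illustrative and not the actual VCDiff code:

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

static class AllocationSketch
{
    // GC.AllocateUninitializedArray skips the implicit zero-fill (.NET 5+).
    // This is only safe when the buffer is fully overwritten afterwards.
    public static byte[] CopySource(ReadOnlySpan<byte> source)
    {
#if NET5_0_OR_GREATER
        byte[] copy = GC.AllocateUninitializedArray<byte>(source.Length);
#else
        byte[] copy = new byte[source.Length];
#endif
        source.CopyTo(copy);
        return copy;
    }

    // ArrayPool rentals avoid a fresh heap allocation on every decode.
    // Note: a rented array may be longer than requested and may contain
    // stale data from a previous use, so slice it to the needed length.
    // [SkipLocalsInit] (requires AllowUnsafeBlocks) tells the JIT not to
    // zero stack locals, which matters in very hot methods.
    [SkipLocalsInit]
    public static void ProcessBlock(int count, Action<Memory<byte>> process)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(count);
        try
        {
            process(buffer.AsMemory(0, count));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```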

Note

Although unused code was cleaned up a little, no public APIs were removed.
The code passes all tests as expected; it should be slightly faster than before, but I didn't benchmark that strictly.

Benchmarks

Encode

Encoding a 16 MB file once.

Before:

| Method | Bytes | BlockSize | Mean | Error | StdDev |
|---|---|---|---|---|---|
| EncodeSlightlyModified | 16777216 | 32 | 173.7 ms | 7.74 ms | 4.05 ms |
| EncodeHeavilyModified | 16777216 | 32 | 954.9 ms | 18.77 ms | 9.82 ms |

Current (15f0e67):

| Method | Bytes | BlockSize | Mean | Error | StdDev |
|---|---|---|---|---|---|
| EncodeSlightlyModified | 16777216 | 32 | 166.9 ms | 4.40 ms | 2.30 ms |
| EncodeHeavilyModified | 16777216 | 32 | 887.8 ms | 3.07 ms | 1.61 ms |

Decode

Decoding a 16 MB file 128 times (~2 GB per run).

Before:

| Method | Bytes | BlockSize | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| DecodeSlightlyModified | 16777216 | 32 | 2.820 s | 0.0890 s | 0.0466 s | 106000.0000 | 44000.0000 | 42000.0000 | 5 GB |
| DecodeHeavilyModified | 16777216 | 32 | 1.095 s | 0.0136 s | 0.0071 s | 19000.0000 | 16000.0000 | 16000.0000 | 2 GB |

Current:

| Method | Bytes | BlockSize | Mean | Error | StdDev | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|
| DecodeSlightlyModified | 16777216 | 32 | 1,647.2 ms | 4.97 ms | 1.77 ms | 3000.0000 | 31 MB |
| DecodeHeavilyModified | 16777216 | 32 | 466.5 ms | 6.25 ms | 2.77 ms | 3000.0000 | 31 MB |

For decode, I've now reached the point where MemoryStream itself is the bottleneck, with speeds anywhere between 740 MB/s and 1870 MB/s depending on the file.

Real World Tests

On an i7 4790K @ 4.4 GHz with 1866 MHz CL9 DDR3 RAM.
The input is a 1.4 GB GameCube ROM and a modified version of said ROM with custom code and ~590 MiB of music.
(Not actually what I plan on using it on, but it made a good test case.)

xdelta 3.0.11 x64 (cmd: xdelta.exe -e -S -A -s

  • RAM (Private Working Bytes): 150 MB
  • Time Taken: ~6.1 minutes.
  • Patch Size: 980MiB

vcdiff (15f0e67)

  • RAM (Private Working Bytes): 6.1GB (No Change)
  • Time Taken: ~10.8 minutes.
  • Patch Size: 619MiB

This isn't the only ROM I've tested. Another game mod with a similar amount of custom files yielded an even greater patch size improvement, from 1320 MiB to 620 MiB.

There's a chance that xdelta is performing some sort of trick to optimise speed over compression ratio. The RAM usage suggests a small table size; it may simply be splitting files into smaller chunks as it encodes. I haven't personally investigated the xdelta source to verify this.

With small files, the performance difference is mostly negligible, as per the blog.

I'm bringing these numbers up because your blog post focused a lot on comparing performance with xdelta.
Don't forget about compression efficiency; xdelta seems to drop off more the bigger the file is, though I haven't sampled enough data to verify how strong that correlation is.

Other Notes

  • The blog mentions Span<T> as a bottleneck.

That kind of surprises me, as creating a span in release mode is basically free.

See JIT Generated Code on Sharplab:
https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQcAMACOSB0ASgK4B2wAllAKb4DCA9lAA4UA21ATgMpcBuFAMbUIAbnTo4AZlwJsdbAG902VdhVrp2MhDAAzatm7MwpADwAjAJ7BqAPmwBZABTXbAKmzMGFclwA02L7A2BykAObAABYAlOpoakoaiapwAOzYpNQA7kYm5m72zt7BAaHUEdEx4glqAL7odUA=

The only artifact of creating a Span is a bounds check (length < 0); otherwise it's basically two instructions:

L000c: mov [edx], ecx
L000e: mov [edx+4], eax

I wonder if the profiler may have something to do with that. Or maybe it was running in Debug; I don't know.

That said, although it's been a while since I've experimented with SIMD, I do remember there being a noticeable performance difference between the more generic Vector<T> and a specific instruction set implementation, i.e. Avx2. I would prefer to believe the bigger performance difference came from there.

What is actually not free is generating a Span from Memory<T>:
https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQcAMACOSB0ASgK4B2wAllAKb4DCA9lAA4UA21ATgMpcBuFAMbUIAbnTo4AZlwJsdbAG902VdhVrp2MhDAAzatm7MwpADwAjAJ7BqAPmwBZABSPqUBpytmAQjfvYACZgwGAAlOpoakoa0apwAOxBIWD4xqbiUWoAvujZQA==

You're not the only one :) https://stackoverflow.com/questions/56285370/c-sharp-memory-span-unexpectedly-slow
This is due to some special handling arising from the fact that Memory<T> can also hold managed types, not only unmanaged ones.
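To make the difference concrete, here's a small illustrative sketch (not code from this PR) showing the cheap and the not-so-cheap way to get a Span:

```csharp
using System;

static class SpanVsMemorySketch
{
    static void Example()
    {
        byte[] array = new byte[1024];
        Memory<byte> memory = array;

        // Cheap: a Span over an array is essentially just (pointer, length).
        Span<byte> fromArray = array.AsSpan();

        int sum = 0;
        for (int i = 0; i < array.Length; i++)
        {
            // Not free: Memory<T>.Span must branch on what the Memory
            // wraps (array, string, or MemoryManager<T>) on every access.
            sum += memory.Span[i];
        }

        // Better: derive the Span once outside the loop and reuse it.
        Span<byte> hoisted = memory.Span;
        for (int i = 0; i < hoisted.Length; i++)
            sum += hoisted[i];
    }
}
```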

Edit: Here's an extra

Untested. Just to show the idea. Relies on runtime implementation, would not recommend.

@Sewer56 Sewer56 changed the title Miscellaneous Improvements [WIP] Miscellaneous Improvements Oct 2, 2021
@Sewer56 Sewer56 changed the title [WIP] Miscellaneous Improvements Additional Perf. Optimizations Oct 20, 2021
@Sewer56 (Contributor, Author) commented Oct 20, 2021

Update: I believe this pull request has now been sufficiently tried and tested, and it should be ready for merging.

@chyyran (Member) commented Nov 23, 2021

Sorry for the late reply, I've been busy lately and my inbox buried this.

@chyyran (Member) commented Nov 23, 2021

LGTM. Will push a release soon so you can use mainline on NuGet. Apologies again for the late reply.

@chyyran chyyran merged commit 118efed into SnowflakePowered:master Nov 23, 2021
@chyyran (Member) commented Nov 23, 2021

Published as https://www.nuget.org/packages/VCDiff/4.0.0

Capping bufferSize is technically a breaking change, and I wouldn't want to accidentally break other consumers, so I bumped the major version just in case.

Thanks again @Sewer56 !
