
Additional Perf. Optimizations #6

Merged: 35 commits, Nov 23, 2021
Conversation

@Sewer56 (Contributor) commented Oct 1, 2021

Hi, thanks for the performance optimizations.

I'm working on a delta patching algorithm for a mod loader and was interested in a managed implementation of VCDiff. In the process, I ended up making some miscellaneous extra improvements to the implementation.

Summary

  • Removed duplicate code in VCEncoder.
  • Removed unused code from internal APIs, e.g. Adler32.Combine.
  • Added Span returns for IByteBuffer to reduce the total number of heap allocations. The new API is used where possible.
  • Support using ByteBuffer as source in VCEncoder (to avoid an unnecessary memory copy).
  • Faster copying of the source in VCEncoder (the allocated array isn't zero-initialized [default .NET behaviour], as it will be entirely overwritten anyway).
  • Added the [SkipLocalsInit] attribute for the .NET 5 target to methods with 1+ million invocations (on 100 MB+ files).
  • Changed: Capped the max bufferSize in VCEncoder to the target length to avoid unnecessary excess memory allocation.
  • Added: Reads from the target in EncodeAsync are now asynchronous. Only writing to the output still isn't.
  • Added: Usage of GC.AllocateUninitializedArray for .NET 5 in BlockHash to avoid the implicit array zero-fill.
  • Added: Simple BenchmarkDotNet program (VCDiff.Benchmark).
  • Added: Vector fill of BlockHash tables.
  • Optimized: Bounds checks on vector operations.
  • Devirtualized VcDecoder (removed virtual method calls by eliminating the interface).
  • Added: Caching for small memory reads and fewer allocations in ByteStreamReader.
  • Added: Read directly into Span<T> for IByteBuffer.
  • Added: Use array rentals for fewer heap allocations on repeat decodes.
  • Reduced: Unnecessary heap allocations, particularly by using structs where appropriate and not re-creating e.g. InstructionMap unnecessarily.
  • Removed: Unnecessary behind-the-scenes MemoryStream allocations by using the array pooling in Microsoft.IO.RecyclableMemoryStream.
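For readers unfamiliar with the allocation tricks named above, here's a minimal, untested sketch of the general patterns (uninitialized arrays, [SkipLocalsInit], and ArrayPool rentals); the names here are illustrative and not the actual VCDiff code:

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

static class AllocationSketch
{
    // GC.AllocateUninitializedArray skips the implicit zero-fill (.NET 5+).
    // This is only safe when the buffer is fully overwritten afterwards.
    public static byte[] CopySource(ReadOnlySpan<byte> source)
    {
#if NET5_0_OR_GREATER
        byte[] copy = GC.AllocateUninitializedArray<byte>(source.Length);
#else
        byte[] copy = new byte[source.Length];
#endif
        source.CopyTo(copy);
        return copy;
    }

    // ArrayPool rentals avoid a fresh heap allocation on every decode.
    // Note: a rented array may be longer than requested and may contain
    // stale data from a previous use, so slice it to the needed length.
    // [SkipLocalsInit] (requires AllowUnsafeBlocks) tells the JIT not to
    // zero stack locals, which matters in very hot methods.
    [SkipLocalsInit]
    public static void ProcessBlock(int count, Action<Memory<byte>> process)
    {
        byte[] buffer = ArrayPool<byte>.Shared.Rent(count);
        try
        {
            process(buffer.AsMemory(0, count));
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```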

Note

Although unused code was cleaned up a little, no public APIs were removed.
The code passes all tests as expected; it should be slightly faster than before, but I didn't benchmark that strictly.

Benchmarks

Encode

Encoding a 16 MB file once.

Before:

| Method | Bytes | BlockSize | Mean | Error | StdDev |
|---|---|---|---|---|---|
| EncodeSlightlyModified | 16777216 | 32 | 173.7 ms | 7.74 ms | 4.05 ms |
| EncodeHeavilyModified | 16777216 | 32 | 954.9 ms | 18.77 ms | 9.82 ms |

Current (15f0e67):

| Method | Bytes | BlockSize | Mean | Error | StdDev |
|---|---|---|---|---|---|
| EncodeSlightlyModified | 16777216 | 32 | 166.9 ms | 4.40 ms | 2.30 ms |
| EncodeHeavilyModified | 16777216 | 32 | 887.8 ms | 3.07 ms | 1.61 ms |

Decode

Decoding a 16 MB file 128 times (~2 GB per run).

Before:

| Method | Bytes | BlockSize | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| DecodeSlightlyModified | 16777216 | 32 | 2.820 s | 0.0890 s | 0.0466 s | 106000.0000 | 44000.0000 | 42000.0000 | 5 GB |
| DecodeHeavilyModified | 16777216 | 32 | 1.095 s | 0.0136 s | 0.0071 s | 19000.0000 | 16000.0000 | 16000.0000 | 2 GB |

Current:

| Method | Bytes | BlockSize | Mean | Error | StdDev | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|
| DecodeSlightlyModified | 16777216 | 32 | 1,647.2 ms | 4.97 ms | 1.77 ms | 3000.0000 | 31 MB |
| DecodeHeavilyModified | 16777216 | 32 | 466.5 ms | 6.25 ms | 2.77 ms | 3000.0000 | 31 MB |

For decode, I've now reached the point where MemoryStream itself is the bottleneck, with speeds anywhere between 740 MB/s and 1870 MB/s depending on the file.

Real World Tests

On an i7 4790K @ 4.4 GHz with 1866 MHz CL9 DDR3 RAM.
The input is a 1.4 GB GameCube ROM and a modified version of said ROM with custom code and ~590 MiB of music.
(Not actually what I plan on using it on, but it made a good test case.)

xdelta 3.0.11 x64 (cmd: xdelta.exe -e -S -A -s

  • RAM (Private Working Bytes): 150 MB
  • Time Taken: ~6.1 minutes.
  • Patch Size: 980MiB

vcdiff (15f0e67)

  • RAM (Private Working Bytes): 6.1GB (No Change)
  • Time Taken: ~10.8 minutes.
  • Patch Size: 619MiB

This isn't the only ROM I've tested. Another game mod with a similar amount of custom files yielded an even greater patch size improvement, from 1320 MiB to 620 MiB.

There's a chance that xdelta is performing some sort of trick to optimise speed over compression ratio. The RAM usage suggests a small table size; it may simply be splitting files into smaller chunks as it encodes. I haven't personally investigated the xdelta source to verify this.

With small files, the performance difference is mostly negligible, as per the blog.

I'm bringing these numbers up because your blog post focused a lot on comparing performance with xdelta.
Don't forget about compression efficiency; xdelta seems to drop off more the bigger the file is, though I haven't sampled enough data to verify how strong that correlation is.

Other Notes

  • The blog mentions Span<T> as a bottleneck.

That kind of surprises me, as creating a span in release mode is basically free.

See JIT Generated Code on Sharplab:
https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQcAMACOSB0ASgK4B2wAllAKb4DCA9lAA4UA21ATgMpcBuFAMbUIAbnTo4AZlwJsdbAG902VdhVrp2MhDAAzatm7MwpADwAjAJ7BqAPmwBZABTXbAKmzMGFclwA02L7A2BykAObAABYAlOpoakoaiapwAOzYpNQA7kYm5m72zt7BAaHUEdEx4glqAL7odUA=

The only artifact of creating a Span is a bounds check (length < 0); otherwise it's basically two instructions:

L000c: mov [edx], ecx
L000e: mov [edx+4], eax

I wonder if the profiler may have something to do with that. Or maybe it was running in Debug; I don't know.

That said, although it's been a while since I've experimented with SIMD, I do remember there being a noticeable performance difference between the more generic Vector<T> and a specific instruction set implementation, i.e. Avx2. I would prefer to believe the bigger performance difference came from there.

What is actually not free is generating a Span from Memory<T>:
https://sharplab.io/#v2:C4LghgzgtgPgAgJgIwFgBQcAMACOSB0ASgK4B2wAllAKb4DCA9lAA4UA21ATgMpcBuFAMbUIAbnTo4AZlwJsdbAG902VdhVrp2MhDAAzatm7MwpADwAjAJ7BqAPmwBZABSPqUBpytmAQjfvYACZgwGAAlOpoakoa0apwAOxBIWD4xqbiUWoAvujZQA==

You're not the only one :) https://stackoverflow.com/questions/56285370/c-sharp-memory-span-unexpectedly-slow
This is due to some special handling arising from the fact that Memory<T> can also hold managed types, not only unmanaged ones.
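To make the difference concrete, here's a small illustrative sketch (not code from this PR) showing the cheap and the not-so-cheap way to get a Span:

```csharp
using System;

static class SpanVsMemorySketch
{
    static void Example()
    {
        byte[] array = new byte[1024];
        Memory<byte> memory = array;

        // Cheap: a Span over an array is essentially just (pointer, length).
        Span<byte> fromArray = array.AsSpan();

        int sum = 0;
        for (int i = 0; i < array.Length; i++)
        {
            // Not free: Memory<T>.Span must branch on what the Memory
            // wraps (array, string, or MemoryManager<T>) on every access.
            sum += memory.Span[i];
        }

        // Better: derive the Span once outside the loop and reuse it.
        Span<byte> hoisted = memory.Span;
        for (int i = 0; i < hoisted.Length; i++)
            sum += hoisted[i];
    }
}
```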

Edit: Here's an extra

Untested. Just to show the idea. Relies on runtime implementation, would not recommend.

@Sewer56 Sewer56 changed the title Miscellaneous Improvements [WIP] Miscellaneous Improvements Oct 2, 2021
@Sewer56 Sewer56 changed the title [WIP] Miscellaneous Improvements Additional Perf. Optimizations Oct 20, 2021
@Sewer56 (Contributor, Author) commented Oct 20, 2021

Update: I believe this pull request has now been sufficiently tried and tested, and it should be ready for merging.

@chyyran (Member) commented Nov 23, 2021

Sorry for the late reply, I've been busy lately and my inbox buried this.

@chyyran (Member) commented Nov 23, 2021

LGTM. Will push a release soon so you can use mainline on NuGet. Apologies again for the late reply.

@chyyran chyyran merged commit 118efed into SnowflakePowered:master Nov 23, 2021
@chyyran (Member) commented Nov 23, 2021

Published as https://www.nuget.org/packages/VCDiff/4.0.0

Capping bufferSize is technically a breaking change, and I wouldn't want to accidentally break other consumers, so I bumped the major version just in case.

Thanks again @Sewer56 !
