Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory layout optimizations for JpegDecoderCore #25

Merged
merged 21 commits into from
Dec 5, 2016

Conversation

antonfirsov
Copy link
Member

@antonfirsov antonfirsov commented Nov 11, 2016

  • Block, Component, Huffman classes made to be struct
  • Using ArrayPools in JpegDecoderCore & related classes when possible
  • Replaced Block with SIMD-based Block8x8F in JpegDecoderCore
  • Various micro-optimizations & ugly tricks to keep the data on the stack

@JimBobSquarePants
Copy link
Member

Hi @antonfirsov thanks for this.,

There's a lot of additional files included in this PR, could you please remove them so we can continue.

Cheers

James

Merge branch 'without46' of https://github.com/antonfirsov/ImageSharp into without46

Conflicts:
	src/ImageSharp/Formats/Jpg/JpegEncoderCore.cs
	src/ImageSharp/Formats/Png/Filters/AverageFilter.cs
	src/ImageSharp/Formats/Png/Filters/NoneFilter.cs
	src/ImageSharp/Formats/Png/Filters/PaethFilter.cs
	src/ImageSharp/Formats/Png/Filters/SubFilter.cs
	src/ImageSharp/Formats/Png/Filters/UpFilter.cs
	src/ImageSharp/Formats/Png/PngDecoderCore.cs
	src/ImageSharp/Formats/Png/PngEncoderCore.cs
@antonfirsov
Copy link
Member Author

Hi @JimBobSquarePants,
just removed all my testing junk except JpegTests.cs + rebased my stuff on your current master.

@antonfirsov
Copy link
Member Author

antonfirsov commented Nov 12, 2016

Results on my PC:

Before:

Method Median StdDev Gen 0 Gen 1 Gen 2 Bytes Allocated/Op
ImageSharp Jpeg 48.7835 ms 0.4524 ms 724.00 - 144.00 6,063,638.46

After:

Method Median StdDev Gen 0 Gen 1 Gen 2 Bytes Allocated/Op
ImageSharp Jpeg 43.6793 ms 0.2811 ms 1.00 - 165.00 2,505,161.84

@antonfirsov
Copy link
Member Author

FYI: My progress with the vectorized IDCT is promising:
If I can manage to refactor Block objects in a certain way (GC pinning data for calculation), maybe I can reach significant speedup. The word maybe is important here, it's a path-finding game inside the secret labyrinth of the RyuJIT.

@JimBobSquarePants
Copy link
Member

@antonfirsov This is some seriously impressive refactoring work! The memory reduction is fantastic and the perf boost is a good boost so far.

The vector stuff is way beyond me I'm afraid so you are on your own there.

@antonfirsov
Copy link
Member Author

@JimBobSquarePants I think we see modifications in unrelated files because I failed to merge your changes correctly to my branch. I always got fooled by git.
I will clean this up + add my newest SIMD work, but I need some time to manage it.

@JimBobSquarePants
Copy link
Member

@antonfirsov No worries, 😄

I use GitHub Desktop now to avoid issues like this. I can update from the origin master to my fork in 1 click.

@antonfirsov
Copy link
Member Author

antonfirsov commented Nov 15, 2016

Is it OK now?
Removed all unrelated changes except lacking using directives in PngTests.cs & BitmapTests.cs which were breaking the build in my VS.
Nevertheless - with the new SIMD stuff included - there are 19 changed/additional files. It was a huge refactor/reimplementation.

@JimBobSquarePants
Copy link
Member

It depends... If you still have your csproj you can do it fairly automatically using Stylecop Analysers. Otherwise the Resharper StyleCop extension will give you some automatic fixes and suggestions.

We actually have Analysers in the repo but in xproj it only works on build and you have to follow the output messages to navigate and manually fix the formatting

Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @antonfirsov I've had a look at the code and given a wee bit of feedback so that we can merge the PR.

I'm absolutely dying to merge it in but I need a few more minor things added first if that's ok?

p.s. How big exactly is that brain of yours? You've got me feeling quite dim.

/// 2. Test
/// 3. Refactor BlockF -> Block8x8F
/// </summary>
internal struct BlockF : IDisposable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this class still in use?

Copy link
Member Author

@antonfirsov antonfirsov Dec 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking: not. But: JpegEncoderCore still has to be refactored to use Block8x8F, and it's more safe to do it in 2 steps, as described in the comment. I think it would be better to remove it (together with the old Blockclass), when JpegEncoderCore refactor is finished.
Or we can keep it in a different repository/branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave that up to you then depending on how you want to handle the encoder refactor.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need to fix a merge conflict though in JpegEncoderCore.cs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could not figure out where's the conflict now, can you help me?

@@ -0,0 +1,93 @@
<#@ template debug="false" hostspecific="false" language="C#" #>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still required or can it go?

Copy link
Member Author

@antonfirsov antonfirsov Dec 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definetly. It IS source code. It's common to keep T4 templates in the repository.
see: (System/Numerics/Vector.tt)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


namespace ImageSharp.Formats
{
internal partial struct Block8x8F
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✈️ VROOOOOOOOOM......

That's the sound of all this going over my head. Could you please add some comments here so I can figure out what is going on? It's proper into CS territory here.

Copy link
Member Author

@antonfirsov antonfirsov Dec 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the purpose of these functions is simple. (e.g. transposing the block) I will try to add some comments here.
A few words about the implementation:
The methods are unrolled loops that work on a constant-sized (8x8) block. In C# it's typically faster to access fields of a struct, than array indexing or unsafe+pointers. (See: Matrix classes in XNA or WPF)
First I started to implement these methods by hand, but quicky realized that I will go faster with T4, and the code will be more reliable. It's also easier to maintain & understand the T4 metaprogram, than the bloated output code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, just comment in the T4 template then.


public int I { get; set; }
public int I;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish I knew what I and J stood for. I can't find anything in my big book o' jpeg and the original source had no comments. Any ideas what they represent?

Copy link
Member Author

@antonfirsov antonfirsov Dec 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I only understand about 10% of the whole Jpeg decoding stuff. I don't even know what's happening inside the DCT implementation methods. I'm just trying to figure out what are the building blocks of the algorithm, and refactoring the code to make the blocks run faster.
Maybe in the future I will will dive into the theoretical details in order to port libjpeg-turbo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the values represent the last two used index positions within the block but I'd have to have another read through.

No worries, we'll figure it out eventually.

/// <param name="maxNCodes">The maximum (inclusive) number of codes in a Huffman tree.</param>
/// <param name="maxCodeLength">The maximum (inclusive) number of bits in a Huffman code.</param>
public Huffman(int lutSize, int maxNCodes, int maxCodeLength)
private static ArrayPool<ushort> UshortBuffer =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do with a little cleanup here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah definately :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

{
/// <summary>
/// Like corefxlab Span, but with an AddOffset() method for efficiency.
/// TODO: When Span will be official, consider replacing this class!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By my understanding we would be looking at some perf improvements switching to span when it's out. Is that correct?

Copy link
Member Author

@antonfirsov antonfirsov Dec 4, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully. However, the main purpose would be to have a code more conform to the style of the new fancy super-fast .NET API-s.
I think Span<T> would be very useful for the whole ImageSharp project. (Simplify existing code, integrate into the forecoming System.IO.Pipelines API.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking forward to being able to use it. I think it's get me out of some tricky situations I have with pixel rows in my Png codec where I want the byte array for unmanaged use but not the first item, plus I don't want to copy it.


Block b = Block.Create();
Block cb = Block.Create();
Block cr = Block.Create();
// ReSharper disable once InconsistentNaming
int prevDCY = 0, prevDCCb = 0, prevDCCr = 0;

for (int y = 0; y < pixels.Height; y += 8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you reckon we could use Vector<T> here on the loop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need Block8x8F here. JpegEncoderCore should be refactored to use it instead of the old Block class.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Do you want to do that now or after this PR is merged?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After :(
I don't have the time for that in the next 2-3 weeks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I might even be able to do it based on your prior work.

public class Block8x8FTests : UtilityTestClassBase
{
#if BENCHMARKING
public const int Times = 1000000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should move this into the benchmark project?

Copy link
Member Author

@antonfirsov antonfirsov Dec 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. There are many arguments for and against.
I could not figure out how to run dotTrace against BenchmarkDotNet benchmarks easily, while there is a straightforward way to profile unit tests. If performance is a long term goal, it's important to make profiling as easy as unit testing.
Is there any R# support for BenchmarkDotNet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Let's keep it then.

As far as I'm aware there isn't. BenchmarkDotNet now includes accurate memory analysis but doesn't give us the detail we require. This might actually end up being a pattern we could adopt elsewhere in the tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can develop an xUnit extension that uses BenchmarkDotNet internally, or at least helps eliminating the boilerplate code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aye maybe one day...


// ReSharper disable InconsistentNaming

namespace ImageSharp.Tests.Formats.Jpg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a few comments here describing what you are testing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try my best :)

@antonfirsov
Copy link
Member Author

Hi!
I'm planning to finish the PR & answer your questions this weekend. /* It's still friday in Budapest :) */

@JimBobSquarePants
Copy link
Member

Excellent. Thanks!

@antonfirsov
Copy link
Member Author

Any suggestion how to rename these?

 private static readonly Vector4 _1_501321 = new Vector4(1.501321f);

--->

 warning SA1309: Field '_1_501321' must not begin with an underscore

@JimBobSquarePants
Copy link
Member

Hmmm... I'd go with v0 to v etc... like you have with the r0 values above. Couldn't find the equivalent in the master branch so not too sure what they represent.

@antonfirsov
Copy link
Member Author

antonfirsov commented Dec 5, 2016

That's all for today, have to sleep now :( I think I managed to fit to the main points of the style guide. I tried my best but there are still lots of StyleCop warnings.

I could not locate the conflicting lines, the appveyor error message is not too verbose

Decided to keep the numbers in the magic constants. If someone looks into the original c++ implementation, it would be easier to compare the two.

@JimBobSquarePants
Copy link
Member

Thanks for the updates... 💯

I'm going to merge this in now in it's current form as I want to do some refactoring of the way we use formats. The merge conflicts in JpegEncoderCore.cs were odd, don't know why git had an issue as they were simple enough once I had the IDE opened.

Looking forward to what else you can bring once you are able to do so again. This is really great!

@JimBobSquarePants JimBobSquarePants merged commit 4d27da9 into SixLabors:master Dec 5, 2016
@JimBobSquarePants
Copy link
Member

Just been running some benchmarks and it seems memory usage has skyrocketed again..

Method Mean StdDev Scaled Scaled-StdDev Gen 0 Gen 1 Gen 2 Allocated
'System.Drawing Jpeg' 7.9317 ms 0.1200 ms 1.00 0.00 - - - 260.61 kB
'ImageSharp Jpeg' 34.0607 ms 0.3736 ms 4.30 0.08 387.5000 387.5000 387.5000 6.82 MB

@JimBobSquarePants
Copy link
Member

Compare that to the encoder...

Method Mean StdDev Scaled Scaled-StdDev Gen 0 Allocated
'System.Drawing Jpeg' 4.3141 ms 0.1021 ms 1.00 0.00 156.2500 894.67 kB
'ImageSharp Jpeg' 12.9259 ms 0.2706 ms 3.00 0.09 - 274.11 kB

@antonfirsov
Copy link
Member Author

nooo...
I will check this tonight.

@antonfirsov
Copy link
Member Author

antonfirsov commented Dec 5, 2016

BenchmarkDotNet 0.10.1 behaviour is quite strange. Having same result for all Gens is definetately not true, it must be a bug.

Here are my results with different BenchmarkDotNet versions (with your current master):

0.9.9:

              Method |     Median |    StdDev | Scaled | Scaled-SD | Gen 0 | Gen 1 |  Gen 2 | Bytes Allocated/Op |
-------------------- |----------- |---------- |------- |---------- |------ |------ |------- |------------------- |
 System.Drawing Jpeg |  6.3680 ms | 0.1492 ms |   1.00 |      0.00 |  9.75 |     - |      - |          98,840.33 |
     ImageSharp Jpeg | 28.0535 ms | 0.5364 ms |   4.37 |      0.12 |     - |     - | 137.00 |       2,020,788.21 |

0.10:

                Method |       Mean |    StdDev |     Median | Scaled | Scaled-StdDev | Gen 0 | Bytes Allocated/Op |
---------------------- |----------- |---------- |----------- |------- |-------------- |------ |------------------- |
 'System.Drawing Jpeg' |  6.4355 ms | 0.1326 ms |  6.4049 ms |   1.00 |          0.00 | 27.63 |         692,415.50 |
     'ImageSharp Jpeg' | 28.0320 ms | 0.1629 ms | 28.0288 ms |   4.36 |          0.09 |     - |      12,006,051.23 |

0.10.1:

                Method |       Mean |    StdDev | Scaled | Scaled-StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |
---------------------- |----------- |---------- |------- |-------------- |--------- |--------- |--------- |---------- |
 'System.Drawing Jpeg' |  6.3580 ms | 0.0804 ms |   1.00 |          0.00 |        - |        - |        - | 260.61 kB |
     'ImageSharp Jpeg' | 29.0222 ms | 1.1586 ms |   4.57 |          0.18 | 337.5000 | 337.5000 | 337.5000 |   6.79 MB |

Fortunately, there is one thing that remained unchanged! The 'System.Drawing Jpeg' : 'ImageSharp Jpeg' total allocation proportion is the same across all benchmarks.

I think if you revert my PR on your branch & run the benchmark with 0.10.1, the value for 'Allocatioed' will be around 16MB.

@JimBobSquarePants
Copy link
Member

Yeah @antonfirsov... Something's not correct here

@adamsitnik, Sorry for the name drop but would you be able to explain these numbers for us a little bit more please? The results are a bit confusing and unexpected.

@adamsitnik
Copy link
Contributor

Sorry for the name drop

No problem, it was a good idea to ask me!

To tell the long story short: previous versions of BenchmarkDotNet (older than 0.10.1) did not have very accurate memory diagnostics. We were using ETW events to track the allocations, which have the accuracy of 100kB. We also did not scale the Gen 0/1/2 collections counts, as we do now. What you can see is how many GCs took place for 1k of operations. We display info about that below the results table.

@antonfirsov My suggestion: if you want to compare some code please do that with 0.10.1. It's very accurate, I have spent few weeks to get it 100% accurate and trustworthy.

For the results that I can see above my guess is that you are either forcing a GC.Collect on your own (previous version would not include that) or you allocate a lot of object that are > 85kB, which are stored on LOH, which are born as Gen 2 objects and if they need to be reclaimed a Full GC takes place (Gen 0=>1=>2 & LOH). It's probably the LOH thing. Some good article about LOH. Pooling the memory buffers or using the native memory allocated with Marshal api could help with that. For pooling you can consider usage of System.Buffers package.

@antonfirsov
Copy link
Member Author

antonfirsov commented Dec 5, 2016

@JimBobSquarePants it means, there is still place for more ArrayPools in the decoder (good news)

@adamsitnik any explanation for this part?

Gen 0 Gen 1 Gen 2
- - -
337.5000 337.5000 337.5000

@JimBobSquarePants
Copy link
Member

Thanks @adamsitnik That's really good of you to pitch in. It's a lot more clear now.

@antonfirsov That's the number of GC collects per 1K operations I believe.

@adamsitnik
Copy link
Contributor

@antonfirsov That's the number of GC collects per 1K operations I believe.

Exactly, which means that every third operation invokes full garbage collection.

The - means no collections. I suspect that the other library is using native memory internally. What I recommend is to try some Memory Profiler. The problem might be a limited compatibility with .NET Core and xprojs.

To workaround this you could create a csrpoj, add a reference to your dll and write some simple code that does the same thing that benchmark does. Then you could run some Allocations Profiler. The one that is build in VS 2015 should do the job.

@antonfirsov
Copy link
Member Author

@adamsitnik @JimBobSquarePants I mean .. why do we see the same number for all gens?

@antonfirsov
Copy link
Member Author

@adamsitnik compare to this 0.9.9 benchmark

@adamsitnik
Copy link
Contributor

compare to this 0.9.9 benchmark

Don't compare them. The previous results were not scaled, they were representing a total sum of all collections. And please use Memory Profiler to verify the results if you are not sure. It should give you a full picture.

@antonfirsov
Copy link
Member Author

@adamsitnik I understand now, thanks a lot!

@antonfirsov
Copy link
Member Author

Just made two benchmarks with 10.1, comparing different revisions.

Revision before merging my PR:

                Method |       Mean |    StdDev | Scaled | Scaled-StdDev |     Gen 0 |    Gen 1 |    Gen 2 | Allocated |
---------------------- |----------- |---------- |------- |-------------- |---------- |--------- |--------- |---------- |
 'System.Drawing Jpeg' |  6.4107 ms | 0.0168 ms |   1.00 |          0.00 |         - |        - |        - | 260.61 kB |
     'ImageSharp Jpeg' | 47.9732 ms | 0.1406 ms |   7.48 |          0.03 | 4745.8333 | 370.8333 | 370.8333 |  20.63 MB |

Current ImageSharp master:

                Method |       Mean |    StdDev | Scaled | Scaled-StdDev |    Gen 0 |    Gen 1 |    Gen 2 | Allocated |
---------------------- |----------- |---------- |------- |-------------- |--------- |--------- |--------- |---------- |
 'System.Drawing Jpeg' |  6.5726 ms | 0.2155 ms |   1.00 |          0.00 |        - |        - |        - | 260.61 kB |
     'ImageSharp Jpeg' | 27.8481 ms | 0.1314 ms |   4.24 |          0.13 | 370.8333 | 370.8333 | 370.8333 |   6.79 MB |

I think we had some progress :)
I'm getting addicted to this game, so maybe I will play with a memory profiler to track down the remaining unecassary LOH allocations.

@JimBobSquarePants
Copy link
Member

That ⬆️ is AWESOME!

@antonfirsov antonfirsov deleted the without46 branch January 15, 2017 01:00
antonfirsov pushed a commit that referenced this pull request Jan 18, 2020
Throw better exceptions from MemoryAllocators
JimBobSquarePants added a commit that referenced this pull request Feb 17, 2021
Throw better exceptions from MemoryAllocators
JimBobSquarePants added a commit that referenced this pull request Feb 17, 2021
Throw better exceptions from MemoryAllocators
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants