-
-
Notifications
You must be signed in to change notification settings - Fork 844
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
memory layout optimizations for JpegDecoderCore #25
Conversation
antonfirsov
commented
Nov 11, 2016
•
edited
Loading
edited
- Block, Component, Huffman classes made to be struct
- Using ArrayPools in JpegDecoderCore & related classes when possible
- Replaced Block with SIMD-based Block8x8F in JpegDecoderCore
- Various micro-optimizations & ugly tricks to keep the data on the stack
Hi @antonfirsov thanks for this., There's a lot of additional files included in this PR, could you please remove them so we can continue. Cheers James |
Merge branch 'without46' of https://github.com/antonfirsov/ImageSharp into without46 Conflicts: src/ImageSharp/Formats/Jpg/JpegEncoderCore.cs src/ImageSharp/Formats/Png/Filters/AverageFilter.cs src/ImageSharp/Formats/Png/Filters/NoneFilter.cs src/ImageSharp/Formats/Png/Filters/PaethFilter.cs src/ImageSharp/Formats/Png/Filters/SubFilter.cs src/ImageSharp/Formats/Png/Filters/UpFilter.cs src/ImageSharp/Formats/Png/PngDecoderCore.cs src/ImageSharp/Formats/Png/PngEncoderCore.cs
Hi @JimBobSquarePants, |
Results on my PC: Before:
After:
|
FYI: My progress with the vectorized IDCT is promising: |
@antonfirsov This is some seriously impressive refactoring work! The memory reduction is fantastic and the perf boost is a good boost so far. The vector stuff is way beyond me I'm afraid so you are on your own there. |
@JimBobSquarePants I think we see modifications in unrelated files because I failed to merge your changes correctly to my branch. I always got fooled by git. |
@antonfirsov No worries, 😄 I use GitHub Desktop now to avoid issues like this. I can update from the origin master to my fork in 1 click. |
Is it OK now? |
It depends... If you still have your csproj you can do it fairly automatically using Stylecop Analysers. Otherwise the Resharper StyleCop extension will give you some automatic fixes and suggestions. We actually have Analysers in the repo but in xproj it only works on build and you have to follow the output messages to navigate and manually fix the formatting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @antonfirsov I've had a look at the code and given a wee bit of feedback so that we can merge the PR.
I'm absolutely dying to merge it in but I need a few more minor things added first if that's ok?
p.s. How big exactly is that brain of yours? You've got me feeling quite dim.
/// 2. Test | ||
/// 3. Refactor BlockF -> Block8x8F | ||
/// </summary> | ||
internal struct BlockF : IDisposable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this class still in use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking: not. But: JpegEncoderCore
still has to be refactored to use Block8x8F
, and it's more safe to do it in 2 steps, as described in the comment. I think it would be better to remove it (together with the old Block
class), when JpegEncoderCore
refactor is finished.
Or we can keep it in a different repository/branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave that up to you then depending on how you want to handle the encoder refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do need to fix a merge conflict though in JpegEncoderCore.cs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could not figure out where's the conflict now, can you help me?
@@ -0,0 +1,93 @@ | |||
<#@ template debug="false" hostspecific="false" language="C#" #> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still required or can it go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definetly. It IS source code. It's common to keep T4 templates in the repository.
see: (System/Numerics/Vector.tt)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
namespace ImageSharp.Formats | ||
{ | ||
internal partial struct Block8x8F |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the sound of all this going over my head. Could you please add some comments here so I can figure out what is going on? It's proper into CS territory here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the purpose of these functions is simple. (e.g. transposing the block) I will try to add some comments here.
A few words about the implementation:
The methods are unrolled loops that work on a constant-sized (8x8) block. In C# it's typically faster to access fields of a struct, than array indexing or unsafe+pointers. (See: Matrix classes in XNA or WPF)
First I started to implement these methods by hand, but quicky realized that I will go faster with T4, and the code will be more reliable. It's also easier to maintain & understand the T4 metaprogram, than the bloated output code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, just comment in the T4 template then.
|
||
public int I { get; set; } | ||
public int I; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish I knew what I and J stood for. I can't find anything in my big book o' jpeg and the original source had no comments. Any ideas what they represent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I only understand about 10% of the whole Jpeg decoding stuff. I don't even know what's happening inside the DCT implementation methods. I'm just trying to figure out what are the building blocks of the algorithm, and refactoring the code to make the blocks run faster.
Maybe in the future I will will dive into the theoretical details in order to port libjpeg-turbo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the values represent the last two used index positions within the block but I'd have to have another read through.
No worries, we'll figure it out eventually.
/// <param name="maxNCodes">The maximum (inclusive) number of codes in a Huffman tree.</param> | ||
/// <param name="maxCodeLength">The maximum (inclusive) number of bits in a Huffman code.</param> | ||
public Huffman(int lutSize, int maxNCodes, int maxCodeLength) | ||
private static ArrayPool<ushort> UshortBuffer = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do with a little cleanup here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah definately :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
{ | ||
/// <summary> | ||
/// Like corefxlab Span, but with an AddOffset() method for efficiency. | ||
/// TODO: When Span will be official, consider replacing this class! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By my understanding we would be looking at some perf improvements switching to span when it's out. Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully. However, the main purpose would be to have a code more conform to the style of the new fancy super-fast .NET API-s.
I think Span<T>
would be very useful for the whole ImageSharp project. (Simplify existing code, integrate into the forecoming System.IO.Pipelines API.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking forward to being able to use it. I think it's get me out of some tricky situations I have with pixel rows in my Png codec where I want the byte array for unmanaged use but not the first item, plus I don't want to copy it.
|
||
Block b = Block.Create(); | ||
Block cb = Block.Create(); | ||
Block cr = Block.Create(); | ||
// ReSharper disable once InconsistentNaming | ||
int prevDCY = 0, prevDCCb = 0, prevDCCr = 0; | ||
|
||
for (int y = 0; y < pixels.Height; y += 8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you reckon we could use Vector<T>
here on the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need Block8x8F
here. JpegEncoderCore
should be refactored to use it instead of the old Block
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, Do you want to do that now or after this PR is merged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After :(
I don't have the time for that in the next 2-3 weeks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, I might even be able to do it based on your prior work.
public class Block8x8FTests : UtilityTestClassBase | ||
{ | ||
#if BENCHMARKING | ||
public const int Times = 1000000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should move this into the benchmark project?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. There are many arguments for and against.
I could not figure out how to run dotTrace against BenchmarkDotNet benchmarks easily, while there is a straightforward way to profile unit tests. If performance is a long term goal, it's important to make profiling as easy as unit testing.
Is there any R# support for BenchmarkDotNet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, Let's keep it then.
As far as I'm aware there isn't. BenchmarkDotNet now includes accurate memory analysis but doesn't give us the detail we require. This might actually end up being a pattern we could adopt elsewhere in the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can develop an xUnit extension that uses BenchmarkDotNet internally, or at least helps eliminating the boilerplate code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aye maybe one day...
|
||
// ReSharper disable InconsistentNaming | ||
|
||
namespace ImageSharp.Tests.Formats.Jpg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a few comments here describing what you are testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try my best :)
Hi! |
Excellent. Thanks! |
Any suggestion how to rename these?
--->
|
Hmmm... I'd go with v0 to v etc... like you have with the r0 values above. Couldn't find the equivalent in the master branch so not too sure what they represent. |
That's all for today, have to sleep now :( I think I managed to fit to the main points of the style guide. I tried my best but there are still lots of StyleCop warnings. I could not locate the conflicting lines, the appveyor error message is not too verbose Decided to keep the numbers in the magic constants. If someone looks into the original c++ implementation, it would be easier to compare the two. |
Thanks for the updates... 💯 I'm going to merge this in now in it's current form as I want to do some refactoring of the way we use formats. The merge conflicts in JpegEncoderCore.cs were odd, don't know why git had an issue as they were simple enough once I had the IDE opened. Looking forward to what else you can bring once you are able to do so again. This is really great! |
Just been running some benchmarks and it seems memory usage has skyrocketed again..
|
Compare that to the encoder...
|
nooo... |
BenchmarkDotNet 0.10.1 behaviour is quite strange. Having same result for all Gens is definetately not true, it must be a bug. Here are my results with different BenchmarkDotNet versions (with your current master): 0.9.9:
0.10:
0.10.1:
Fortunately, there is one thing that remained unchanged! The 'System.Drawing Jpeg' : 'ImageSharp Jpeg' total allocation proportion is the same across all benchmarks. I think if you revert my PR on your branch & run the benchmark with 0.10.1, the value for 'Allocatioed' will be around 16MB. |
Yeah @antonfirsov... Something's not correct here @adamsitnik, Sorry for the name drop but would you be able to explain these numbers for us a little bit more please? The results are a bit confusing and unexpected. |
No problem, it was a good idea to ask me! To tell the long story short: previous versions of BenchmarkDotNet (older than 0.10.1) did not have very accurate memory diagnostics. We were using ETW events to track the allocations, which have the accuracy of 100kB. We also did not scale the Gen 0/1/2 collections counts, as we do now. What you can see is how many GCs took place for 1k of operations. We display info about that below the results table. @antonfirsov My suggestion: if you want to compare some code please do that with For the results that I can see above my guess is that you are either forcing a GC.Collect on your own (previous version would not include that) or you allocate a lot of object that are > 85kB, which are stored on LOH, which are born as Gen 2 objects and if they need to be reclaimed a Full GC takes place (Gen 0=>1=>2 & LOH). It's probably the LOH thing. Some good article about LOH. Pooling the memory buffers or using the native memory allocated with Marshal api could help with that. For pooling you can consider usage of |
@JimBobSquarePants it means, there is still place for more ArrayPools in the decoder (good news) @adamsitnik any explanation for this part?
|
Thanks @adamsitnik That's really good of you to pitch in. It's a lot more clear now. @antonfirsov That's the number of GC collects per 1K operations I believe. |
Exactly, which means that every third operation invokes full garbage collection. The To workaround this you could create a csrpoj, add a reference to your dll and write some simple code that does the same thing that benchmark does. Then you could run some Allocations Profiler. The one that is build in VS 2015 should do the job. |
@adamsitnik @JimBobSquarePants I mean .. why do we see the same number for all gens? |
@adamsitnik compare to this 0.9.9 benchmark |
Don't compare them. The previous results were not scaled, they were representing a total sum of all collections. And please use Memory Profiler to verify the results if you are not sure. It should give you a full picture. |
@adamsitnik I understand now, thanks a lot! |
Just made two benchmarks with 10.1, comparing different revisions. Revision before merging my PR:
Current ImageSharp master:
I think we had some progress :) |
That ⬆️ is AWESOME! |
Throw better exceptions from MemoryAllocators
Throw better exceptions from MemoryAllocators
Throw better exceptions from MemoryAllocators