Faster Linear Transforms #1591
    internal static int GetSamplingRadius<TResampler>(in TResampler sampler, int sourceSize, int destinationSize)
        where TResampler : struct, IResampler
    {
        double scale = sourceSize / destinationSize;
Spot the rounding error 😄
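For readers who don't spot it straight away: both operands in the highlighted line are `int`, so C# performs integer division and only then widens the already-truncated quotient to `double`. A minimal sketch of the effect follows; `ScaleDemo` and its method names are made up for illustration, and the cast shown is just one common way to avoid the truncation, not necessarily the exact fix applied in this PR.

```csharp
public static class ScaleDemo
{
    // Integer division: the quotient is truncated BEFORE it is widened to double.
    // TruncatedScale(150, 100) evaluates 150 / 100 as ints, giving 1, then 1.0.
    public static double TruncatedScale(int sourceSize, int destinationSize)
        => sourceSize / destinationSize;

    // Casting one operand first forces floating-point division.
    // ExactScale(150, 100) evaluates 150.0 / 100, giving 1.5.
    public static double ExactScale(int sourceSize, int destinationSize)
        => (double)sourceSize / destinationSize;
}
```

Any radius derived from the truncated scale is computed as if the ratio were a whole number, so non-integer resize ratios end up with the wrong sampling radius.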
@antonfirsov I'm gonna need some help here. macOS on .NET Core 3.1 was showing a 4% error on projective transforms, but I couldn't pull down the test output because it was incomplete (DebugSave is disabled on CI). I enabled saving and now I'm getting a bunch of MemoryAllocator errors. Any idea?
ImageSharp/tests/ImageSharp.Tests/Formats/Png/PngDecoderTests.cs Lines 414 to 417 in 5ab768c
I can't think of any better solution than changing the following line to use
@antonfirsov Thanks for your help! I've just created a clone of the encoder and use it when registering for tests.
OK... tried everything I can think of here. Rewrote all the sampling to use double precision (since reverted) and it still fails for .NET Core 3.1 on macOS.
antonfirsov
left a comment
The new transform code looks good and easy to follow, great job!
Can you post some Beyond Compare screenshots for both the most problematic and the less problematic failures?
For troubleshooting I would log and compare as many internals as possible. Since it may take a lot of time, we can just disable the Mac tests with [ActiveIssue] instead of trying to figure out the problem as part of the PR.
    int top = LinearTransformUtility.GetRangeStart(yRadius, pY, maxY);
    int bottom = LinearTransformUtility.GetRangeEnd(yRadius, pY, maxY);
    int left = LinearTransformUtility.GetRangeStart(xRadius, pX, maxX);
    int right = LinearTransformUtility.GetRangeEnd(xRadius, pX, maxX);
Idea 1:
log these 4 values during the whole operation, and look for differences between Mac and Windows.
    float xWeight = sampler.GetValue(xK - pX);

    Vector4 current = sourceBuffer.GetElementUnsafe(xK, yK).ToScaledVector4();
    Numerics.Premultiply(ref current);
Idea 2:
Also log the premultiplied current value, compare the log files, look for highest difference.
    in this.sampler,
    point,
    sourceBuffer,
    Numerics.UnPremultiply(span);
Have we used the bulk UnPremultiply in the old code?
We didn't. I have just run a test commit using the non-bulk variant and the major issues disappeared, leaving only 2 minor ones (max 0.03% diff).
Below is the output of the worst issue. The last 3 pixels of each row appear to be affected, which suggests to me that something is going wrong here. My guess is that it has something to do with unpremultiplying Vector4.Zero.
ImageSharp/src/ImageSharp/Common/Helpers/Numerics.cs
Lines 556 to 576 in bb72769
            if (Modulo2(vectors.Length) != 0)
            {
                // Vector4 fits neatly in pairs. Any overlap has to be equal to 1.
                UnPremultiply(ref MemoryMarshal.GetReference(vectors.Slice(vectors.Length - 1)));
            }
        }
        else
#endif
        {
            ref Vector4 vectorsStart = ref MemoryMarshal.GetReference(vectors);
            ref Vector4 vectorsEnd = ref Unsafe.Add(ref vectorsStart, vectors.Length);
            while (Unsafe.IsAddressLessThan(ref vectorsStart, ref vectorsEnd))
            {
                UnPremultiply(ref vectorsStart);
                vectorsStart = ref Unsafe.Add(ref vectorsStart, 1);
            }
        }
    }
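To make the `Vector4.Zero` guess concrete, here is a rough sketch of what a divide-by-alpha unpremultiply does to a fully transparent pixel. `NaiveUnPremultiply` is a simplified stand-in written for this illustration, not the library's actual `Numerics.UnPremultiply`, and it only shows why zero alpha is a suspicious input. It does not explain why the bulk path on Mac behaves differently.

```csharp
using System.Numerics;

public static class UnPremultiplyDemo
{
    // Hypothetical scalar unpremultiply: divide the colour channels by alpha
    // and then restore the original alpha.
    public static Vector4 NaiveUnPremultiply(Vector4 source)
    {
        float w = source.W;
        Vector4 result = source / w; // for Vector4.Zero this is 0 / 0 per component
        result.W = w;
        return result;
    }
}
```

With `Vector4.Zero` as input, the X, Y and Z components come out as `NaN` while W stays 0. How those `NaN` values are handled further down the pipeline could plausibly differ between code paths, which is why logging the values at this point is a reasonable next step.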
The question is, what do we do? The issue is clearly fixed for .NET 5 but .NET Core 3.1 is LTS.
I think we should keep the bulk code for the rest of the platforms, because it likely takes an important part in the perf improvements we see in your benchmarks, and fall back to scalar stuff on Mac. I would add a check for the OS platform at this line with an explanatory comment:
    if (Avx2.IsSupported && vectors.Length >= 2)
changing it to:
    if (Avx2.IsSupported && !RuntimeInformation.IsOSPlatform(OSPlatform.OSX) && vectors.Length >= 2)
In no case would I complicate this with additional logic to detect if we are running on .NET 5 (Mac is not that important).
You'd better re-review this, @antonfirsov. There's been some awful hackery.
Codecov Report
@@ Coverage Diff @@
## master #1591 +/- ##
==========================================
- Coverage 83.69% 83.68% -0.02%
==========================================
Files 747 748 +1
Lines 33055 33032 -23
Branches 3692 3698 +6
==========================================
- Hits 27665 27642 -23
+ Misses 4676 4675 -1
- Partials 714 715 +1
I'm wondering why you decided to duplicate. Wasn't there a mistake in 35065c7, adding the new condition checks to Premultiply instead of UnPremultiply?
If there is a perf concern regarding the long expression, I recommend caching !RuntimeEnvironment.IsOSPlatform(OSPlatform.OSX) && RuntimeEnvironment.IsNetCore in a static variable.
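A rough sketch of what caching the platform check could look like, built only on the BCL `RuntimeInformation` API. The `PlatformQuirks` class, its members, and the `".NET Core"` prefix test are assumptions made for this illustration. The project's own `RuntimeEnvironment` helper may detect the runtime differently, and the AVX2 flag is passed in as a parameter here so the snippet compiles without the intrinsics namespace.

```csharp
using System;
using System.Runtime.InteropServices;

public static class PlatformQuirks
{
    // Evaluated once, when the type is first used, instead of on every call.
    // On .NET Core 3.1 FrameworkDescription starts with ".NET Core";
    // on .NET 5 and later it starts with ".NET " followed by the version.
    public static readonly bool IsNetCoreOnMac =
        RuntimeInformation.IsOSPlatform(OSPlatform.OSX)
        && RuntimeInformation.FrameworkDescription.StartsWith(".NET Core", StringComparison.OrdinalIgnoreCase);

    // Hypothetical guard for the bulk path: needs AVX2, at least one pair of
    // vectors, and must not be running on the affected platform.
    public static bool CanUseBulkPath(bool avx2Supported, int length)
        => avx2Supported && !IsNetCoreOnMac && length >= 2;
}
```

The hot path then reads a single static field rather than re-querying the OS and framework description on each invocation.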
    /// <summary>
    /// Gets the name of the .NET installation on which an app is running.
    /// </summary>
    public static string FrameworkDescription => RuntimeInformation.FrameworkDescription;
Does this have to be exposed? I would make it private.
I did make a mistake, yeah, but corrected it in a later commit.
What I've tried so far:
- Filtered out the runtime intrinsics part of the UnPremultiply method to avoid Mac on Core 3.1.
- Rewrote the scalar overhang portion of the method to use a simple for loop over the span.
Neither worked. The only thing I could get working was duplicating the code, which is simply awful. I couldn't replicate the issue locally either, so I couldn't debug it effectively.
Regarding the method, no it doesn't need to be exposed.
antonfirsov
left a comment
If it doesn't work without the duplication, let's keep it. LGTM with one suggestion.
Co-authored-by: Anton Firszov <antonfir@gmail.com>

Description
I haven't been happy with the existing linear transform code since I originally wrote it so thought I'd have another crack.
Benchmarks are very good. Large speedup on all targets with a 2x speedup on .NET Core 3.1+. However, they don't tell the whole story. Originally two 2D buffers of DestinationHeight x MaxDiameter were rented from the pool to store kernel weights but, since the kernel weights must be calculated based on the exact transformed subpixel point, it was pointless to do so. Those buffers are now gone.
In addition, I discovered a rounding issue where the incorrect scaled filter radius was being calculated, which led to rogue pixels containing RGB values but 0 alpha being generated outside transform edges. I've updated two references to reflect the improved accuracy.
I consider readability much improved now also. It's much easier to understand how the transform operation actually works.
Ignore individual commits during review, I generated a lot of noise during my experimentation (basically properly learning how the process actually worked 😇) so I'll be squashing to a single commit on merge.
Benchmarks
ROTATE
BEFORE
AFTER
SKEW
BEFORE
AFTER