
Fix and normalize Vector4 UnPremultiply #2369

Merged 7 commits into main on Feb 26, 2023
Conversation

@JimBobSquarePants (Member) commented Feb 24, 2023

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Our existing UnPremultiply methods suffered from an issue: they allowed division by zero. This was masked by our tests, since the transform to byte would lead to zero in most cases, but it could cause issues when chaining certain operations requiring premultiplication using single-precision pixel formats, since interim values for the XYZ components would be NaN or infinity.

The Porter-Duff operators contained a bad workaround for this: dividing by the maximum of the W component and an epsilon of 0.001F. When W was zero but the X, Y, or Z components were non-zero, this produced out-of-range values. This was generally masked by clamping, but again could cause issues for single-precision pixel formats.
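To see why the epsilon guard was wrong, here is a minimal sketch in C (the function name is hypothetical; the real code is C# operating on Vector4):

```c
/* Hypothetical sketch of the old guard: divide by max(W, epsilon).
   When W == 0 but the component is non-zero, the result explodes
   far outside the [0, 1] range instead of leaving the value alone. */
static float old_unpremultiply_component(float v, float w)
{
    float d = (w > 0.001f) ? w : 0.001f; /* the 0.001F epsilon clamp */
    return v / d;
}
```

For example, a component of 0.5 with zero alpha becomes roughly 500, which only survived because later clamping hid it.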

The fix is simple: compare the alpha component to zero and, if it is zero, return the source value unchanged. We return the source rather than an explicitly zeroed value for two reasons:

  1. For all our operations this is equivalent to Vector4.Zero.
  2. There are Tiff reference files in our test suite that are encoded with associated alpha values of zero but have non-zero RGB component values. Returning the source allows us to maintain the integrity of those values. (Note: the reference decoder makes the same mistake we did before this fix.)
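The fixed behavior can be sketched as follows, in C for illustration (the struct and function names are hypothetical stand-ins for the C# Vector4 code):

```c
/* Un-premultiply: divide XYZ by alpha, but return the source
   untouched when alpha is zero. This avoids NaN/Inf from division
   by zero and preserves encoded RGB values under zero alpha. */
typedef struct { float x, y, z, w; } vec4;

static vec4 un_premultiply(vec4 s)
{
    if (s.w == 0.0f)
    {
        return s; /* preserve the source value */
    }
    vec4 r = { s.x / s.w, s.y / s.w, s.z / s.w, s.w };
    return r;
}
```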

The fix does not affect the performance of UnPremultiply. Migrating the Porter-Duff operators to use the fixed method, however, yields an additional 1.55x performance boost over the work done in #2359.

This brings an overall performance boost of 14.4x over our v2 version of the operators.

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.200
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
  DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT

Porter-Duff NormalSrcOver

Main

| Method | Mean     | Error   | StdDev  | Ratio |
|--------|---------:|--------:|--------:|------:|
| Scalar | 416.1 ns | 0.86 ns | 0.80 ns | 1.00  |
| Avx    | 235.3 ns | 0.62 ns | 0.55 ns | 0.57  |

PR

| Method | Mean     | Error   | StdDev  | Ratio |
|--------|---------:|--------:|--------:|------:|
| Scalar | 354.6 ns | 0.56 ns | 0.50 ns | 1.00  |
| Avx    | 151.4 ns | 0.43 ns | 0.40 ns | 0.43  |

PremultiplyVector4

Main

| Method              | Mean     | Error    | StdDev   | Ratio |
|---------------------|---------:|---------:|---------:|------:|
| PremultiplyBaseline | 31.16 us | 0.061 us | 0.054 us | 1.00  |
| Premultiply         | 22.95 us | 0.050 us | 0.044 us | 0.74  |

PR

| Method              | Mean     | Error    | StdDev   | Ratio |
|---------------------|---------:|---------:|---------:|------:|
| PremultiplyBaseline | 31.17 us | 0.074 us | 0.066 us | 1.00  |
| Premultiply         | 22.94 us | 0.042 us | 0.039 us | 0.74  |

UnPremultiplyVector4

Main

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| UnPremultiplyBaseline | 1.539 us | 0.0023 us | 0.0020 us | 1.00  |
| UnPremultiply         | 1.265 us | 0.0015 us | 0.0013 us | 0.82  |

PR

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| UnPremultiplyBaseline | 1.538 us | 0.0031 us | 0.0029 us | 1.00  |
| UnPremultiply         | 1.266 us | 0.0013 us | 0.0012 us | 0.82  |

One thing I do not understand, though, is why there is such a difference between Premultiply and UnPremultiply.

The baseline implementations in the benchmark use virtually identical code and generate virtually identical ASM.

@tannergooding would you have any insight?

private static void Premultiply(ref Vector4 source)
{
    float w = source.W;
    source *= w;
    source.W = w;
}

private static void UnPremultiply(ref Vector4 source)
{
    float w = source.W;
    source /= w;
    source.W = w;
}
C.Premultiply(System.Numerics.Vector4 ByRef)
    L0000: vzeroupper
    L0003: vmovss xmm0, [rcx+0xc]
    L0008: vmovupd xmm1, [rcx]
    L000c: vbroadcastss xmm2, xmm0
    L0011: vmulps xmm1, xmm1, xmm2
    L0015: vmovupd [rcx], xmm1
    L0019: vmovss [rcx+0xc], xmm0
    L001e: ret

C.UnPremultiply(System.Numerics.Vector4 ByRef)
    L0000: vzeroupper
    L0003: vmovss xmm0, [rcx+0xc]
    L0008: vmovupd xmm1, [rcx]
    L000c: vbroadcastss xmm2, xmm0
    L0011: vdivps xmm1, xmm1, xmm2
    L0015: vmovupd [rcx], xmm1
    L0019: vmovss [rcx+0xc], xmm0
    L001e: ret

@tannergooding (Contributor) commented Feb 24, 2023

One thing I do not understand, though, is why there is such a difference between Premultiply and UnPremultiply.

Floating-point division is quite a bit slower than multiplication (although not quite as bad as integral division).

divps takes around 11-12 cycles to execute (its latency) and has about 3 cycles before another non-dependent instruction can execute on the same port (reciprocal throughput).

mulps on the other hand takes 3-4 cycles to execute and has about 0.5 cycles before another non-dependent instruction can execute on the same port.


Edit: I had looked at the question and sample asm without looking at the new implementation. Looks like you're already doing this, just without also removing the duplicate memory access.

This is notably another case where you'd probably see some benefit in using the PermuteW and WithW helpers.

What you have now is roughly:

C.Premultiply(System.Numerics.Vector4 ByRef)
    L0000: vzeroupper                   ;
    L0003: vmovss xmm0, [rcx+0xc]       ; 4-7 cycles;  port 2 or 3
    L0008: vmovupd xmm1, [rcx]          ; 4-7 cycles;  port 2 or 3
    L000c: vbroadcastss xmm2, xmm0      ; 1 cycle;     port 5
    L0011: vmulps xmm1, xmm1, xmm2      ; 4 cycles;    port 0 or 1
    L0015: vmovupd [rcx], xmm1          ; 4-10 cycles; port 2 or 3 or 7 + port 4
    L0019: vmovss [rcx+0xc], xmm0       ; 4-10 cycles; port 2 or 3 or 7 + port 4
    L001e: ret                          ;

Given dependencies, parallel dispatch, and other bits this will take around 17-39 cycles.

However, if you change the implementation to something like:

private static void Premultiply1(ref Vector4 source)
{
    var vsource = source.AsVector128();

    Vector4 w = PermuteW(vsource);
    vsource *= w;
    source = WithW(vsource, w);
}

We instead generate:

    L0000: vzeroupper                           ;
    L0003: vmovupd xmm0, [rcx]                  ; 4-7 cycles;  port 2 or 3
    L0007: vpermilps xmm1, xmm0, 0xff           ; 1 cycle;     port 5
    L000d: vmulps xmm0, xmm0, xmm1              ; 4 cycles;    port 0 or 1
    L0011: vinsertps xmm0, xmm0, xmm1, 0xf0     ; 1 cycle;     port 5
    L0017: vmovupd [rcx], xmm0                  ; 4-10 cycles; port 2 or 3 or 7 + port 4
    L001b: ret                                  ;

Which is around 14-23 cycles instead; it avoids the extra memory accesses and uses fewer ports overall.

Comment on lines 477 to 478
Vector4 alpha = PermuteW(source);
source = WithW(source * alpha, alpha);
Contributor

You'll want to load source into a local here.

Suggested change
Vector4 alpha = PermuteW(source);
source = WithW(source * alpha, alpha);
Vector4 src = source;
Vector4 alpha = PermuteW(src);
source = WithW(src * alpha, alpha);

The reasoning is that since source is a ref, each access ends up coming from memory. However, we're using the value multiple times (twice given the implementation) and so accessing it twice is unnecessary.
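The same idea can be sketched in C, where a pointer parameter plays the role of the C# ref (names are hypothetical):

```c
/* Copy the by-reference value into a local once, work on the copy,
   and store it back once, rather than re-reading *source for every
   use. The compiler can then keep the value in registers. */
typedef struct { float x, y, z, w; } vec4;

static void premultiply(vec4 *source)
{
    vec4 src = *source;   /* single load from memory */
    float w = src.w;
    src.x *= w;
    src.y *= w;
    src.z *= w;           /* reuse the register copy */
    *source = src;        /* single store back; W unchanged */
}
```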

Contributor

-- Since it's AggressiveInlining, it's possible the JIT is already optimizing this almost equivalently. However, it might also make it more likely to spill the value since it's "address taken" with multiple uses, so it's probably better to be "safe".

Comment on lines +536 to +543
if (alpha == Vector4.Zero)
{
return;
}

// Divide source by alpha if alpha is nonzero, otherwise set all components to match the source value
// Blend the result with the alpha vector to ensure that the alpha component is unchanged
source = WithW(source / alpha, alpha);
Contributor

Branching in SIMD code is "generally" undesirable.

It's probably worth comparing this to an implementation that does CompareEqual + Blend instead (much like you're already doing for the Avx2 path) to see which is better.

You should be able to use AsVector128() again, with Sse.CompareEqual and Sse41.BlendVariable.

If Sse41 isn't supported (which is unlikely since it's been around since 2007), you can replace it with Sse.Or(Sse.And(left, mask), Sse.AndNot(mask, right))

-- That is: (left & mask) | (right & ~mask)
-- Sse.AndNot(x, y) does ~x & y, despite its name implying x & ~y; a weird quirk of the instruction
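A per-lane sketch of that fallback formula in C, using 32-bit integer lanes for illustration:

```c
#include <stdint.h>

/* Fallback blend from the formula above: (left & mask) | (~mask & right).
   Where mask is all ones the left value is selected; where it is all
   zeros, the right value. Note the AndNot operand order: ~mask & right,
   matching the Sse.AndNot(x, y) == ~x & y quirk described above. */
static uint32_t blend_lane(uint32_t left, uint32_t right, uint32_t mask)
{
    return (left & mask) | (~mask & right);
}
```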

Contributor

-- left here represents the value to use if the comparison result (mask) was true for that element
-- right correspondingly is the value to use if it was false

Member Author

Would you believe the version I had was actually faster!?

UnPremultiply using Vector4 and branching.

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| UnPremultiplyBaseline | 1.545 us | 0.0070 us | 0.0066 us | 1.00  |
| UnPremultiply         | 1.827 us | 0.0039 us | 0.0037 us | 1.18  |
| UnPremultiplyBulk     | 1.267 us | 0.0033 us | 0.0027 us | 0.82  |

UnPremultiply using Sse41.

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| UnPremultiplyBaseline | 1.546 us | 0.0077 us | 0.0068 us | 1.00  |
| UnPremultiply         | 2.058 us | 0.0043 us | 0.0036 us | 1.33  |
| UnPremultiplyBulk     | 1.268 us | 0.0025 us | 0.0022 us | 0.82  |

Contributor

I'd believe it.

The general reason to avoid branching is that it often slows down the overall operation. However, the exact difference depends on the scenario being tested and how frequently that branch will be hit as compared to the cost of computing and blending both paths.

If I'm reading your table right, then in this case, they look practically identical with a little less error and deviation for blending in Bulk. But the branching ends up winning out for single operations.

Member Author

Yeah. We'd only ever use it for single operations or for overlaps in the AVX run.

@JimBobSquarePants (Member Author)

One thing I do not understand, though, is why there is such a difference between Premultiply and UnPremultiply.

Floating-point division is quite a bit slower than multiplication (although not quite as bad as integral division).

divps takes around 11-12 cycles to execute (its latency) and has about 3 cycles before another non-dependent instruction can execute on the same port (reciprocal throughput).

mulps on the other hand takes 3-4 cycles to execute and has about 0.5 cycles before another non-dependent instruction can execute on the same port.

Thanks for all the info there, much appreciated.

What I'm seeing, though, is that the division is faster. A lot faster: 1.538 us vs 31.17 us. It makes no sense to me.

@tannergooding (Contributor)

What I'm seeing though is that the division is faster. A lot faster. 1.538 us vs 31.17 us. It makes no sense to me.

Ah yes. I misread the numbers initially.

This isn't clear to me either and I'd probably end up doing a profile using Intel VTune or AMD uProf so I could get instruction level timing information at this point.

My gut instinct would be a difference in input data or in how BDN is measuring the two profiles.

@JimBobSquarePants JimBobSquarePants merged commit cac0960 into main Feb 26, 2023
@JimBobSquarePants JimBobSquarePants deleted the js/premultiply branch February 26, 2023 11:14