
Fix and normalize Vector4 UnPremultiply #2369

Merged 7 commits into main on Feb 26, 2023
Conversation

@JimBobSquarePants (Member) commented Feb 24, 2023

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Our existing UnPremultiply methods suffered from an issue: they allowed division by zero. This was masked by our tests, since the transform to byte would lead to zero in most cases, but it could cause issues when chaining certain operations requiring premultiplication using single-precision pixel formats, since interim values for the XYZ components would be NaN or infinity.

The Porter-Duff operators contained a bad workaround for this: dividing by the maximum of the W component and an epsilon of 0.001F. When W was zero but the X, Y, or Z components were non-zero, this produced out-of-range values. This was generally masked by clamping, but again could cause issues for single-precision pixel formats.
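To see why the epsilon guard was wrong, here is a minimal sketch in C (the function name is hypothetical; the real code is C# operating on Vector4):

```c
/* Hypothetical sketch of the old guard: divide by max(W, epsilon).
   When W == 0 but the component is non-zero, the result explodes
   far outside the [0, 1] range instead of leaving the value alone. */
static float old_unpremultiply_component(float v, float w)
{
    float d = (w > 0.001f) ? w : 0.001f; /* the 0.001F epsilon clamp */
    return v / d;
}
```

For example, a component of 0.5 with zero alpha becomes roughly 500, which only survived because later clamping hid it.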

The fix is simple: compare the alpha component to zero and, if it is zero, return the source value unchanged. We return the source rather than an explicitly zeroed value for two reasons:

  1. For all our operations this is equivalent to Vector4.Zero.
  2. There are Tiff reference files in our test suite that are encoded with associated alpha values of zero but have non-zero RGB component values. Returning the source allows us to maintain the integrity of those values. (Note: the reference decoder makes the same mistake we did before this fix.)
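The fixed behavior can be sketched as follows, in C for illustration (the struct and function names are hypothetical stand-ins for the C# Vector4 code):

```c
/* Un-premultiply: divide XYZ by alpha, but return the source
   untouched when alpha is zero. This avoids NaN/Inf from division
   by zero and preserves encoded RGB values under zero alpha. */
typedef struct { float x, y, z, w; } vec4;

static vec4 un_premultiply(vec4 s)
{
    if (s.w == 0.0f)
    {
        return s; /* preserve the source value */
    }
    vec4 r = { s.x / s.w, s.y / s.w, s.z / s.w, s.w };
    return r;
}
```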

The fix does not affect the performance of UnPremultiply. Migrating the Porter-Duff operators to use the fixed method, however, yields an additional 1.55x performance boost over the work done in #2359.

This brings an overall performance boost of 14.4x over our v2 version of the operators.

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.200
  [Host]     : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT
  DefaultJob : .NET 6.0.14 (6.0.1423.7309), X64 RyuJIT

Porter-Duff NormalSrcOver

Main

| Method | Mean     | Error   | StdDev  | Ratio |
|--------|---------:|--------:|--------:|------:|
| Scalar | 416.1 ns | 0.86 ns | 0.80 ns | 1.00  |
| Avx    | 235.3 ns | 0.62 ns | 0.55 ns | 0.57  |

PR

| Method | Mean     | Error   | StdDev  | Ratio |
|--------|---------:|--------:|--------:|------:|
| Scalar | 354.6 ns | 0.56 ns | 0.50 ns | 1.00  |
| Avx    | 151.4 ns | 0.43 ns | 0.40 ns | 0.43  |

PremultiplyVector4

Main

| Method              | Mean     | Error    | StdDev   | Ratio |
|---------------------|---------:|---------:|---------:|------:|
| PremultiplyBaseline | 31.16 us | 0.061 us | 0.054 us | 1.00  |
| Premultiply         | 22.95 us | 0.050 us | 0.044 us | 0.74  |

PR

| Method              | Mean     | Error    | StdDev   | Ratio |
|---------------------|---------:|---------:|---------:|------:|
| PremultiplyBaseline | 31.17 us | 0.074 us | 0.066 us | 1.00  |
| Premultiply         | 22.94 us | 0.042 us | 0.039 us | 0.74  |

UnPremultiplyVector4

Main

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| UnPremultiplyBaseline | 1.539 us | 0.0023 us | 0.0020 us | 1.00  |
| UnPremultiply         | 1.265 us | 0.0015 us | 0.0013 us | 0.82  |

PR

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| UnPremultiplyBaseline | 1.538 us | 0.0031 us | 0.0029 us | 1.00  |
| UnPremultiply         | 1.266 us | 0.0013 us | 0.0012 us | 0.82  |

One thing I do not understand, though, is why there is such a difference between Premultiply and UnPremultiply.

The baseline implementations in the benchmark use virtually identical code and generate virtually identical ASM.

@tannergooding would you have any insight?

private static void Premultiply(ref Vector4 source)
{
    float w = source.W;
    source *= w;
    source.W = w;
}

private static void UnPremultiply(ref Vector4 source)
{
    float w = source.W;
    source /= w;
    source.W = w;
}
C.Premultiply(System.Numerics.Vector4 ByRef)
    L0000: vzeroupper
    L0003: vmovss xmm0, [rcx+0xc]
    L0008: vmovupd xmm1, [rcx]
    L000c: vbroadcastss xmm2, xmm0
    L0011: vmulps xmm1, xmm1, xmm2
    L0015: vmovupd [rcx], xmm1
    L0019: vmovss [rcx+0xc], xmm0
    L001e: ret

C.UnPremultiply(System.Numerics.Vector4 ByRef)
    L0000: vzeroupper
    L0003: vmovss xmm0, [rcx+0xc]
    L0008: vmovupd xmm1, [rcx]
    L000c: vbroadcastss xmm2, xmm0
    L0011: vdivps xmm1, xmm1, xmm2
    L0015: vmovupd [rcx], xmm1
    L0019: vmovss [rcx+0xc], xmm0
    L001e: ret

@tannergooding (Contributor) commented Feb 24, 2023

One thing I do not understand, though, is why there is such a difference between Premultiply and UnPremultiply.

Floating-point division is quite a bit slower than multiplication (although not quite as bad as integral division).

divps takes around 11-12 cycles to execute (its latency) and has about 3 cycles before another non-dependent instruction can execute on the same port (reciprocal throughput).

mulps on the other hand takes 3-4 cycles to execute and has about 0.5 cycles before another non-dependent instruction can execute on the same port.


Edit: I had looked at the question and sample asm without looking at the new implementation. Looks like you're already doing this, just without also removing the duplicate memory access.

This is notably another case where you'd probably see some benefit in using the PermuteW and WithW helpers.

What you have now is roughly:

C.Premultiply(System.Numerics.Vector4 ByRef)
    L0000: vzeroupper                   ;
    L0003: vmovss xmm0, [rcx+0xc]       ; 4-7 cycles;  port 2 or 3
    L0008: vmovupd xmm1, [rcx]          ; 4-7 cycles;  port 2 or 3
    L000c: vbroadcastss xmm2, xmm0      ; 1 cycle;     port 5
    L0011: vmulps xmm1, xmm1, xmm2      ; 4 cycles;    port 0 or 1
    L0015: vmovupd [rcx], xmm1          ; 4-10 cycles; port 2 or 3 or 7 + port 4
    L0019: vmovss [rcx+0xc], xmm0       ; 4-10 cycles; port 2 or 3 or 7 + port 4
    L001e: ret                          ;

Given dependencies, parallel dispatch, and other bits this will take around 17-39 cycles.

However, if you change the implementation to something like:

private static void Premultiply1(ref Vector4 source)
{
    var vsource = source.AsVector128();

    Vector4 w = PermuteW(vsource);
    vsource *= w;
    source = WithW(vsource, w);
}

We instead generate:

    L0000: vzeroupper                           ;
    L0003: vmovupd xmm0, [rcx]                  ; 4-7 cycles;  port 2 or 3
    L0007: vpermilps xmm1, xmm0, 0xff           ; 1 cycle;     port 5
    L000d: vmulps xmm0, xmm0, xmm1              ; 4 cycles;    port 0 or 1
    L0011: vinsertps xmm0, xmm0, xmm1, 0xf0     ; 1 cycle;     port 5
    L0017: vmovupd [rcx], xmm0                  ; 4-10 cycles; port 2 or 3 or 7 + port 4
    L001b: ret                                  ;

Which is around 14-23 cycles instead; it avoids the extra memory accesses and uses fewer ports overall.

Comment on lines 477 to 478
Vector4 alpha = PermuteW(source);
source = WithW(source * alpha, alpha);
Contributor

You'll want to load source into a local here.

Suggested change
Vector4 alpha = PermuteW(source);
source = WithW(source * alpha, alpha);
Vector4 src = source;
Vector4 alpha = PermuteW(src);
source = WithW(src * alpha, alpha);

The reasoning is that since source is a ref, each access ends up coming from memory. However, we're using the value multiple times (twice given the implementation) and so accessing it twice is unnecessary.
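The same idea can be sketched in C, where a pointer parameter plays the role of the C# ref (names are hypothetical):

```c
/* Copy the by-reference value into a local once, work on the copy,
   and store it back once, rather than re-reading *source for every
   use. The compiler can then keep the value in registers. */
typedef struct { float x, y, z, w; } vec4;

static void premultiply(vec4 *source)
{
    vec4 src = *source;   /* single load from memory */
    float w = src.w;
    src.x *= w;
    src.y *= w;
    src.z *= w;           /* reuse the register copy */
    *source = src;        /* single store back; W unchanged */
}
```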

Contributor

-- Since it's AggressiveInlining, it's possible the JIT is already optimizing this almost equivalently. However, it might also make it more likely to spill the value since it's "address taken" with multiple uses, so it's probably better to be "safe".

Comment on lines +536 to +543
if (alpha == Vector4.Zero)
{
return;
}

// Divide source by alpha if alpha is nonzero, otherwise set all components to match the source value
// Blend the result with the alpha vector to ensure that the alpha component is unchanged
source = WithW(source / alpha, alpha);
Contributor

Branching in SIMD code is "generally" undesirable.

It's probably worth comparing this to an implementation that does CompareEqual + Blend instead (much like you're already doing for the Avx2 path) to see which is better.

You should be able to use AsVector128() again, with Sse.CompareEqual and Sse41.BlendVariable.

If Sse41 isn't supported (which is unlikely since it's been around since 2007), you can replace it with Sse.Or(Sse.And(left, mask), Sse.AndNot(mask, right))

-- That is: (left & mask) | (right & ~mask)
-- Sse.AndNot(x, y) does ~x & y, despite its name implying x & ~y; a weird quirk of the instruction
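A per-lane sketch of that fallback formula in C, using 32-bit integer lanes for illustration:

```c
#include <stdint.h>

/* Fallback blend from the formula above: (left & mask) | (~mask & right).
   Where mask is all ones the left value is selected; where it is all
   zeros, the right value. Note the AndNot operand order: ~mask & right,
   matching the Sse.AndNot(x, y) == ~x & y quirk described above. */
static uint32_t blend_lane(uint32_t left, uint32_t right, uint32_t mask)
{
    return (left & mask) | (~mask & right);
}
```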

Contributor

-- left here represents the value to use if the comparison result (mask) was true for that element
-- right correspondingly is the value to use if it was false

Member Author

Would you believe the version I had was actually faster!?

UnPremultiply using Vector4 and branching.

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| UnPremultiplyBaseline | 1.545 us | 0.0070 us | 0.0066 us | 1.00  |
| UnPremultiply         | 1.827 us | 0.0039 us | 0.0037 us | 1.18  |
| UnPremultiplyBulk     | 1.267 us | 0.0033 us | 0.0027 us | 0.82  |

UnPremultiply using Sse41.

| Method                | Mean     | Error     | StdDev    | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| UnPremultiplyBaseline | 1.546 us | 0.0077 us | 0.0068 us | 1.00  |
| UnPremultiply         | 2.058 us | 0.0043 us | 0.0036 us | 1.33  |
| UnPremultiplyBulk     | 1.268 us | 0.0025 us | 0.0022 us | 0.82  |

Contributor

I'd believe it.

The general reason to avoid branching is that it often slows down the overall operation. However, the exact difference depends on the scenario being tested and how frequently that branch will be hit as compared to the cost of computing and blending both paths.

If I'm reading your table right, then in this case, they look practically identical with a little less error and deviation for blending in Bulk. But the branching ends up winning out for single operations.

Member Author

Yeah. We'd only ever use it for single operations or for overlaps in the AVX run.

@JimBobSquarePants (Member Author)

One thing I do not understand, though, is why there is such a difference between Premultiply and UnPremultiply.

Floating-point division is quite a bit slower than multiplication (although not quite as bad as integral division).

divps takes around 11-12 cycles to execute (its latency) and has about 3 cycles before another non-dependent instruction can execute on the same port (reciprocal throughput).

mulps on the other hand takes 3-4 cycles to execute and has about 0.5 cycles before another non-dependent instruction can execute on the same port.

Thanks for all the info there, much appreciated.

What I'm seeing, though, is that the division is faster. A lot faster: 1.538 us vs 31.17 us. It makes no sense to me.

@tannergooding (Contributor)

What I'm seeing though is that the division is faster. A lot faster. 1.538 us vs 31.17 us. It makes no sense to me.

Ah yes. I misread the numbers initially.

This isn't clear to me either and I'd probably end up doing a profile using Intel VTune or AMD uProf so I could get instruction level timing information at this point.

My gut instinct would be a difference in input data or in how BDN is measuring the two profiles.

@JimBobSquarePants JimBobSquarePants merged commit cac0960 into main Feb 26, 2023
@JimBobSquarePants JimBobSquarePants deleted the js/premultiply branch February 26, 2023 11:14