Add Avx2 Vector4 Span Premultiplication and Reverse #1399

JimBobSquarePants · 2020-10-21T17:11:13Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following matches the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

Adds Avx2 implementation of Vector4Utilities.Premultiply(span) and Vector4Utilities.UnPremultiply(span).
Hat tip to @Turnerj for some advice on Twitter that helped me get started.

Benchmarks are..... Odd. I think they're lying to me.

I'm seeing a massive speedup 38% for UnPremultiply in my benchmark but a 27% speedup for Premultiply. I'm suspecting that BMDN is somehow optimizing something away because according to the intel docs latency and throughput is a good but slower for divide than multiply whereas our benchmark is for both baseline and enhanced versions is ~30x faster!!

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_div_ps&expand=2159
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_mul_ps&expand=2159,3931

@tannergooding is there anything obvious you can see?

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-WXSVRE : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1  IterationCount=3  LaunchCount=1
WarmupCount=3

Method	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
UnPremultiplyBaseline	2.018 us	0.1879 us	0.0103 us	1.00	-	-	-	-
UnPremultiply	1.255 us	0.0452 us	0.0025 us	0.62	-	-	-	-

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-IGGZLK : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1  IterationCount=3  LaunchCount=1
WarmupCount=3

Method	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
PremultiplyBaseline	37.64 us	1.482 us	0.081 us	1.00	-	-	-	-
Premultiply	27.42 us	1.738 us	0.095 us	0.73	-	-	-	-

tannergooding · 2020-10-21T17:44:30Z

is there anything obvious you can see?

Could you clarify what CPU you are using? It can be slightly more complicated than just looking at the latency/throughput as there are also specific "ports" that instructions can be dispatched against.

JimBobSquarePants · 2020-10-21T17:57:23Z

@tannergooding It's an i7-8650U CPU @ 1.90GHz. Let me know if you need any more details.

codecov · 2020-10-21T17:58:52Z

Codecov Report

Merging #1399 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1399   +/-   ##
=======================================
  Coverage   82.88%   82.89%           
=======================================
  Files         690      690           
  Lines       30985    30926   -59     
  Branches     3554     3550    -4     
=======================================
- Hits        25683    25637   -46     
+ Misses       4580     4570   -10     
+ Partials      722      719    -3

Flag	Coverage Δ
#unittests	`82.89% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/ImageSharp/Common/Helpers/ImageMaths.cs	`91.54% <100.00%> (+0.12%)`	⬆️
src/ImageSharp/Common/Helpers/Vector4Utilities.cs	`100.00% <100.00%> (ø)`
src/ImageSharp/Common/Helpers/SimdUtils.cs	`65.90% <0.00%> (ø)`
...mageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs
...geSharp/Common/Helpers/SimdUtils.Avx2Intrinsics.cs	`100.00% <0.00%> (ø)`
...arp/Common/Helpers/SimdUtils.ExtendedIntrinsics.cs	`82.19% <0.00%> (+9.58%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b577d8e...f1959f3. Read the comment docs.

tannergooding · 2020-10-21T18:09:58Z

I don't see anything obvious looking at https://uops.info/table.html (probably the best resource for "documented" vs "measured" latency/throughput and port info).

Do you have the disassembly to share, I'd expect similar perf including looking at the code differences and what I'd expect out of the data dependencies.

JimBobSquarePants · 2020-10-21T18:27:03Z

Away from my computer just now but I’ll stick all the relevant code in SharpLab ASAP.

JimBobSquarePants · 2020-10-21T20:23:48Z

Premultiply

; Core CLR v4.700.20.41105 on amd64

C..ctor()
    L0000: ret

C.get_PermuteAlphaMask8x32()
    L0000: mov rax, 0x2940d110bac
    L000a: mov [rcx], rax
    L000d: mov dword ptr [rcx+8], 0x20
    L0014: mov rax, rcx
    L0017: ret

C.Premultiply(System.Span`1<System.Numerics.Vector4>)
    L0000: vzeroupper
    L0003: mov rax, [rcx]
    L0006: mov rdx, 0x2940d110bac
    L0010: vmovupd ymm0, [rdx]
    L0014: mov edx, [rcx+8]
    L0017: shl edx, 2
    L001a: mov ecx, edx
    L001c: sar ecx, 0x1f
    L001f: and ecx, 7
    L0022: add edx, ecx
    L0024: sar edx, 3
    L0027: xor ecx, ecx
    L0029: test edx, edx
    L002b: jle short L0052
    L002d: movsxd r8, ecx
    L0030: shl r8, 5
    L0034: add r8, rax
    L0037: vpermps ymm1, ymm0, [r8]
    L003c: vmulps ymm1, ymm1, [r8]
    L0041: vblendps ymm1, ymm1, [r8], 0x88
    L0047: vmovupd [r8], ymm1
    L004c: inc ecx
    L004e: cmp ecx, edx
    L0050: jl short L002d
    L0052: vzeroupper
    L0055: ret

UnPremultiply Extra line at L003c.

; Core CLR v4.700.20.41105 on amd64

C..ctor()
    L0000: ret

C.get_PermuteAlphaMask8x32()
    L0000: mov rax, 0x2940d0e0bac
    L000a: mov [rcx], rax
    L000d: mov dword ptr [rcx+8], 0x20
    L0014: mov rax, rcx
    L0017: ret

C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
    L0000: vzeroupper
    L0003: mov rax, [rcx]
    L0006: mov rdx, 0x2940d0e0bac
    L0010: vmovupd ymm0, [rdx]
    L0014: mov edx, [rcx+8]
    L0017: shl edx, 2
    L001a: mov ecx, edx
    L001c: sar ecx, 0x1f
    L001f: and ecx, 7
    L0022: add edx, ecx
    L0024: sar edx, 3
    L0027: xor ecx, ecx
    L0029: test edx, edx
    L002b: jle short L0056
    L002d: movsxd r8, ecx
    L0030: shl r8, 5
    L0034: add r8, rax
    L0037: vpermps ymm1, ymm0, [r8]
    L003c: vmovupd ymm2, [r8]
    L0041: vdivps ymm1, ymm2, ymm1
    L0045: vblendps ymm1, ymm1, [r8], 0x88
    L004b: vmovupd [r8], ymm1
    L0050: inc ecx
    L0052: cmp ecx, edx
    L0054: jl short L002d
    L0056: vzeroupper
    L0059: ret

tannergooding · 2020-10-21T20:57:20Z

The codegen is basically identical other than vdivps not folding the load, which doesn't really explain the difference.
It's also not great codegen and could likely be improved a bit....

Namely I think using a ref like you are is causing the repeat loads from [r8] while caching it in a local would likely improve perf:

https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgDkBXfGKASzFwBuGvSasAShwB2GXtxYBJGXym5+QkY2YtJMuTEXLeq9SwAaADiTDaWidNnyAwhHwAHXgBseAZR4A3fhgNWzEdB31DDB4INz8oQLBgmxEAZiZSBicGAG8aBgKGNz5/bGiGSFUMBmNqgCFvKQATAEFPNwALbBdlCE8GAF4GOmByOnGx8ZTqQqKSspgxJAZxGGwmgHkpTwBPHzdsKQAeYB3ogD4GAAUefA5ots7sAFk8AGsLBFTMgcupGAA7gxTtEANoAXVyDFSaGGsLo8NhMLhKIR0MRqKRGLRaIA7NiMfjMSiiTjCQThgwAL42Wb5QrEdLMZbEFAMACqUiusDunlkbl2AAp9ocjgA1GBgDDQFCXfyS6VQXAASnpBTy1DVswYsAAZgwJVLoKQAKxII66zwQMpyhXQXB1PCLAZa7WFPUc1TYXUGFq4cV2qDoA2B03my3WjDnc6Cj3PGD4aA7V5KrqeFgAcRgGFWPtgUiSgvlRqVyuV0zdBUNirDR1ql3w70GrsrnNw3t9/pBMFh1eNZrrMmjsZg+vjiagyewqew6azOdHPBgBZggpuUDuD3aXVeuA+X1IZYrldqDCkgwYRcDuBYABllwBzDAdBgAKgYKGVDAA9CGS7WIxtFgXAcWlK11aBL1PXgLzoQQagYI4z3g3gAGpUNVGZKw1SttT7KAAKtG0GFwCAOCgJILzbDsWBaJomhHfVi0VB0nVhXhyxbN18MIyMGw4PleAFHYLxafwEFIFh103GAxWnT5vkFUjyKSWFGz3TisNwgpqJ9Wj6MYhhmPtR1cB7GovyGMSEBYBplwY6yWAAEV4QImlXZSKPM3l+V2ZVYU81SGDs5pHi6HoMCgPpNNwqktTi6gqSAA

There is also some overhead (although not in the hot path) with the sar, and, add section which could likely be improved a bit...
Namely its doing signed multiplication and division, but the length and count are guaranteed to be unsigned, so the following improves it a bit more:
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgDkBXfGKASzFwBuGvSasAShwB2GXtxYBJGXym5+QkY2YtJMuTEXLeq9SwAaADiTDaWidNnyAwhHwAHXgBseAZR4A3fhgNWzEdB31DDB4INz8oQLBgmxEAZiZSBicGAG8aBgKGNz5/bGiGSFUMBmNqgCFvKQATAEFPNwALbBdlCE8GAF4GOmByOnGx8ZTqQqKSspgxJAZxGGwmgHkpTwBPHzdsKQAeYB3ogD4GAAUefA5ots7sAFk8AGsLBFTMgcupGAA7gxTtEANoAXVyDFSaGGsLo8NhMLhKIR0MRqKRGLRaIA7NiMfjMSiiTjCQThgwAL42Wb5QrEdLMZbEFAMACqUiusDunlkbl2AAp9ocjgA1GBgDDQFCXfyS6VQXAASnpBTy1DVswYsAAZgwJVLoKQAKxII66zwQMpyhXQXB1PCLAZa7WFPUc1TYXUGFq4cV2qDoA2B03my3WjDnc6Cj3PGD4aA7V5KrqeFgAcRgGFWPtgUiSgvlRqVyuV0zdBUNirDR1ql3w70GrsrnNw3t9/pBMFh1eNZrrMmjsZg+vjiagyewqew6azOdHPBgBZggpuUDuD3aXVeuA+X1IZYrlY4tQYUkGDEFgtPMmVxcVuBYABllwBzDAdBgAKgYKGVDAAPRXreGDKn2UC1hGNosC4Di0pWurQCBZ68JedCCDUDBHOemG8AA1PhqozJWGqVtqEFQVaNoMLgEAcFASSXm2HYsC0TRNCO+oPvajq4D2V5KBgVwYFAyq8OWLZupRA7QVGDC8vyuyXi0/gIKQLDrpuMBitOnzfIKdEMUksKNnukkkeRBQsT6bEcVxDA8UqfECYKQkiWJEkqWpLANMunGqQgLAACK8IETSrkZjECYpvACjsyqwlFJkMH5zSPF0PSiX0FnkVSWr5dQVJAA

and then the general indexing logic is not "the best" since you really shouldn't need to add the index independently...
so I think tweaking it to track the last index as a ref might be slightly better still:
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgDkBXfGKASzFwBuGvSasAShwB2GXtxYBJGXym5+QkY2YtJMuTEXLeq9SwAaADiTDaWidNnyAwhHwAHXgBseAZR4A3fhgNWzEdB31DDB4INz8oQLBgmxEAZiZSBicGAG8aBgKGNz5/bGiGSFUMBmNqgCFvKQATAEFPNwALbBdlCE8GAF4GOmByOnGx8ZTqQqKSspgxJAZxGGwmgHkpTwBPHzdsKQAeYB3ogD4GAAUefA5ots7sAFk8AGsLBFTMgcupGAA7gxTtEANoAXVyDFSaGGsLo8NhMLhKIR0MRqKRGLRaIA7NiMfjMSiiTjCQThgwAL42Wb5QrEdLMZbEFAMACqUiusDunlkbl2AAp9ocjgA1GBgDDQFCXfyS6VQXAASnpBTyM1mhVgADMGBKpdBSABWJBHHWeCBlOUK6C4Op4RYDNVa2a6jmqbA6gwtXDi21QdD6gMms0Wq0Yc7nQXu54wfDQHavJVdTwsADiMAwq29sCkSUF8sNSuVyumroKBsVoaOtUu+HegxdFY9uC9Pr9IJgsKrRtNtZkUZjMD1cYTUCT2BT2DTmezI54MHzMEFNygdwe7S6r1wHy+pFL5YrAHpjwwmrxAk1FqcGJk8Aw2TBvNwZLgijxg8W2YcmgwLB+UBftW/bhtazbaiOwF9mGlrWgwRaKrgAAyeDVEM7qcm23osC0TRNMOeqIXaDq4N2DCCkoGBXBgUDKoKgocLUyrEUqLDIUuADmGAdAwZ6kBwh4QQUwkMACHReIsgpYe2ii4HhTSwLgKHBLgAAqXRSIRCEBvajqwu6rEoWhpaiRqLazL2UA1mBkYMLgEAcFASSDDpxZ6WRtIWZWIagXBdm8vyuyuS0/gIKQLBrhuMBilOnzfIKDlOUksINruZaibMRmkU6DChQgLANEuBH5SwAAil68NeiWOc55GBbwAo7MqsJJXVsJFc0jxdD0tF9BlmredljquZhno4Qp2nDWRsLkANLZUi6i3UFSQA=

tannergooding · 2020-10-21T20:59:04Z

That gives the updated method to be something like:

    public static void UnPremultiply(Span<Vector4> vectors)
    {
        ref Vector256<float> vectorsBase =
            ref Unsafe.As<Vector4, Vector256<float>>(ref MemoryMarshal.GetReference(vectors));

        Vector256<int> mask =
            Unsafe.As<byte, Vector256<int>>(ref MemoryMarshal.GetReference(PermuteAlphaMask8x32));

        // divide by 2 as 4 elements per Vector4 and 8 per Vector256<float>
        ref Vector256<float> vectorsLast = ref Unsafe.Add(ref vectorsBase, (IntPtr)((uint)vectors.Length / 2u));
        
        while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
        {
            Vector256<float> source = vectorsBase;
            Vector256<float> multiply = Avx2.PermuteVar8x32(source, mask);
            vectorsBase = Avx.Blend(Avx.Divide(source, multiply), source, BlendAlphaControl);
            vectorsBase = ref Unsafe.Add(ref vectorsBase, 1);
        }
    }

which on 3.1 generates assembly like:

C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
    L0000: vzeroupper
    L0003: mov rax, [rcx]
    L0006: mov rdx, 0x29413720bf0
    L0010: vmovupd ymm0, [rdx]
    L0014: mov edx, [rcx+8]
    L0017: shr edx, 1
    L0019: mov edx, edx
    L001b: shl rdx, 5
    L001f: add rdx, rax
    L0022: cmp rax, rdx
    L0025: jae short L0047
    L0027: vmovupd ymm1, [rax]
    L002b: vpermps ymm2, ymm0, ymm1
    L0030: vdivps ymm2, ymm1, ymm2
    L0034: vblendps ymm1, ymm2, ymm1, 0x88
    L003a: vmovupd [rax], ymm1
    L003e: add rax, 0x20
    L0042: cmp rax, rdx
    L0045: jb short L0027
    L0047: vzeroupper
    L004a: ret

Which is quite a bit smaller and with a much tighter inner loop as compared to the original codegen:

C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
    L0000: vzeroupper
    L0003: mov rax, [rcx]
    L0006: mov rdx, 0x29413760bac
    L0010: vmovupd ymm0, [rdx]
    L0014: mov edx, [rcx+8]
    L0017: shl edx, 2
    L001a: mov ecx, edx
    L001c: sar ecx, 0x1f
    L001f: and ecx, 7
    L0022: add edx, ecx
    L0024: sar edx, 3
    L0027: xor ecx, ecx
    L0029: test edx, edx
    L002b: jle short L0056
    L002d: movsxd r8, ecx
    L0030: shl r8, 5
    L0034: add r8, rax
    L0037: vpermps ymm1, ymm0, [r8]
    L003c: vmovupd ymm2, [r8]
    L0041: vdivps ymm1, ymm2, ymm1
    L0045: vblendps ymm1, ymm1, [r8], 0x88
    L004b: vmovupd [r8], ymm1
    L0050: inc ecx
    L0052: cmp ecx, edx
    L0054: jl short L002d
    L0056: vzeroupper
    L0059: ret

Of course, you'll want to profile that to be certain 😄

JimBobSquarePants · 2020-10-21T21:22:27Z

@tannergooding Crikey that was a bit of a masterclass! I didn't even know the Unsafe.IsAddressLessThan method existed!

Benchmarks are still wonky on my machine but better than before. I'll push your changes now.

Method	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
UnPremultiplyBaseline	38.490 us	7.0771 us	0.3879 us	1.00	-	-	-	-
UnPremultiply	1.348 us	0.4586 us	0.0251 us	0.04	-	-	-	-

Method	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
PremultiplyBaseline	37.68 us	3.704 us	0.203 us	1.00	-	-	-	-
Premultiply	27.53 us	1.072 us	0.059 us	0.73	-	-	-	-

tannergooding · 2020-10-21T22:09:59Z

You might actually try to run it without the "hot path" flag as it can actually negatively impact some codegen scenarios.

Namely, "aggressive optimization" causes it to skip tier 0 and go straight to tier 1. However, this is actually more like a tier 0.75 as there are some optimizations, like removing the check of whether a beforefieldinit static constructor needs to run, which can't happen without rejitting (and, IIRC, we don't currently rejit methods that go straight to tier 1).

I don't think you actually have any static field accesses here, as I believe PermuteAlphaMask8x32 will get compiled down to a method that constructs a span directly over a constant set of bytes, but it might be worth checking anyways.

JimBobSquarePants · 2020-10-21T22:57:59Z

I only added the hot path attribution after it appeared to give me a slight edge on benchmarking. Happy to remove if you see no benefit.

saucecontrol · 2020-10-22T00:17:26Z

src/ImageSharp/Common/Helpers/Vector4Utilities.cs

+                while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
+                {
+                    Vector256<float> source = vectorsBase;
+                    Vector256<float> multiply = Avx2.PermuteVar8x32(source, mask);


You actually don't need to permute here since you're not crossing 128-bit lanes. Avx.Shuffle(source, source, 0b_11_11_11_11) will do the same thing with lower latency while eliminating the need to load the mask register.

Thanks @saucecontrol I didn't know Shuffle had that overload. Slight speedup.

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated

PremultiplyBaseline 37.64 us 1.482 us 0.081 us 1.00 - - - -

Premultiply 27.42 us 1.738 us 0.095 us 0.73 - - - -

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated

UnPremultiplyBaseline 37.753 us 3.9513 us 0.2166 us 1.00 - - - -

UnPremultiply 1.322 us 0.0998 us 0.0055 us 0.04 - - - -

Still don't understand the same method with Divide instead of Multiply results in a ~30X difference in benchmark result though 😖

Ha, I didn't read through all the comments and missed that bit. Your baseline UnPremultiply method in the benchmark is multiplying instead of dividing, but I don't see why the vectorized version is coming out so much faster. Will have a look after sleep if you don't figure it out ;)

I knew it! I knew there was a mistake! Thanks!

I guess my computer just doesn't like multiplying stuff. The baseline is way faster now too.

Method Mean Error StdDev Ratio Gen 0 Gen 1 Gen 2 Allocated

UnPremultiplyBaseline 2.018 us 0.1879 us 0.0103 us 1.00 - - - -

UnPremultiply 1.255 us 0.0452 us 0.0025 us 0.62 - - - -

That's strange the division is showing lower times. Is that just an iteration count difference between the Premultiply and UnPremultiply runs? BDN is too clever sometimes.

Nah... Exactly the same setup. I think it's optimizing something away.

JimBobSquarePants · 2020-10-23T10:05:34Z

@SixLabors/core I've everyone is happy to ignore BMDN (since perf difference matches baseline) I'd like to get this merged asap.

antonfirsov

LGTM. Can you post some resize(only) benchmark results before+after to feed my curiosity?

JimBobSquarePants · 2020-10-23T12:32:04Z

@antonfirsov Jumped about ~5%

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-PDTUQZ : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
  Job-QRBITN : .NET Core 2.1.23 (CoreCLR 4.6.29321.03, CoreFX 4.6.29321.01), X64 RyuJIT
  Job-JVJPHZ : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

IterationCount=3 LaunchCount=1 WarmupCount=3

Method	Job	Runtime	SourceToDest	Mean	Error	StdDev	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
SystemDrawing	Job-PDTUQZ	.NET 4.7.2	3032-400	83.04 ms	9.761 ms	0.535 ms	1.00	0.00	-	-	-	1170 B
'ImageSharp, MaxDegreeOfParallelism = 1'	Job-PDTUQZ	.NET 4.7.2	3032-400	54.62 ms	41.477 ms	2.274 ms	0.66	0.03	-	-	-	17203 B

SystemDrawing	Job-QRBITN	.NET Core 2.1	3032-400	84.71 ms	19.817 ms	1.086 ms	1.00	0.00	-	-	-	96 B
'ImageSharp, MaxDegreeOfParallelism = 1'	Job-QRBITN	.NET Core 2.1	3032-400	67.72 ms	16.151 ms	0.885 ms	0.80	0.02	-	-	-	16928 B

SystemDrawing	Job-JVJPHZ	.NET Core 3.1	3032-400	85.19 ms	5.483 ms	0.301 ms	1.00	0.00	-	-	-	96 B
'ImageSharp, MaxDegreeOfParallelism = 1'	Job-JVJPHZ	.NET Core 3.1	3032-400	55.27 ms	11.544 ms	0.633 ms	0.65	0.01	-	-	-	16904 B

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-VTPBNH : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
  Job-LYXMSX : .NET Core 2.1.23 (CoreCLR 4.6.29321.03, CoreFX 4.6.29321.01), X64 RyuJIT
  Job-VNWGRF : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

IterationCount=3 LaunchCount=1 WarmupCount=3

Method	Job	Runtime	SourceToDest	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
SystemDrawing	Job-VTPBNH	.NET 4.7.2	3032-400	80.27 ms	7.021 ms	0.385 ms	1.00	-	-	-	1365 B
'ImageSharp, MaxDegreeOfParallelism = 1'	Job-VTPBNH	.NET 4.7.2	3032-400	49.39 ms	2.429 ms	0.133 ms	0.62	-	-	-	17129 B

SystemDrawing	Job-LYXMSX	.NET Core 2.1	3032-400	79.82 ms	3.718 ms	0.204 ms	1.00	-	-	-	96 B
'ImageSharp, MaxDegreeOfParallelism = 1'	Job-LYXMSX	.NET Core 2.1	3032-400	64.13 ms	4.689 ms	0.257 ms	0.80	-	-	-	16928 B

SystemDrawing	Job-VNWGRF	.NET Core 3.1	3032-400	79.76 ms	1.253 ms	0.069 ms	1.00	-	-	-	96 B
'ImageSharp, MaxDegreeOfParallelism = 1'	Job-VNWGRF	.NET Core 3.1	3032-400	44.90 ms	0.947 ms	0.052 ms	0.56	-	-	-	17013 B

Add Avx2 Vector4 Span Premultiplication and Reverse

Add Avx2 Span Premultiplication and Reverse

6c4d65b

JimBobSquarePants added the area:performance label Oct 21, 2020

JimBobSquarePants added this to the 1.1.0 milestone Oct 21, 2020

JimBobSquarePants requested a review from antonfirsov October 21, 2020 17:11

JimBobSquarePants added this to To Do in ImageSharp via automation Oct 21, 2020

Use Tanner's updated code.

bee9e9a

JimBobSquarePants moved this from To Do to In progress in ImageSharp Oct 21, 2020

Remove hotpath attr

c4f849c

saucecontrol reviewed Oct 22, 2020

View reviewed changes

JimBobSquarePants added 2 commits October 22, 2020 10:34

Use Avx.Shuffle for lower latency

59fa1fd

Fix base unpremultiply benchmark

4aebc13

JimBobSquarePants requested a review from a team October 23, 2020 10:02

antonfirsov approved these changes Oct 23, 2020

View reviewed changes

Merge branch 'master' into js/avx2-premultiplication

f1959f3

JimBobSquarePants merged commit b5975a3 into master Oct 23, 2020

ImageSharp automation moved this from In progress to Done Oct 23, 2020

JimBobSquarePants deleted the js/avx2-premultiplication branch October 23, 2020 10:41

JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021

Merge pull request #1399 from SixLabors/js/avx2-premultiplication

8d1cb3e

Add Avx2 Vector4 Span Premultiplication and Reverse

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Avx2 Vector4 Span Premultiplication and Reverse #1399

Add Avx2 Vector4 Span Premultiplication and Reverse #1399

JimBobSquarePants commented Oct 21, 2020 •

edited

tannergooding commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020 •

edited

codecov bot commented Oct 21, 2020 •

edited

tannergooding commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020

tannergooding commented Oct 21, 2020

tannergooding commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020

tannergooding commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020

saucecontrol Oct 22, 2020

JimBobSquarePants Oct 22, 2020

JimBobSquarePants Oct 22, 2020

saucecontrol Oct 22, 2020

JimBobSquarePants Oct 22, 2020

saucecontrol Oct 22, 2020

JimBobSquarePants Oct 22, 2020

JimBobSquarePants commented Oct 23, 2020

antonfirsov left a comment

JimBobSquarePants commented Oct 23, 2020 •

edited

Method	Mean	Error	StdDev	Ratio	Gen 0	Gen 1	Gen 2	Allocated
UnPremultiplyBaseline	37.753 us	3.9513 us	0.2166 us	1.00	-	-	-	-
UnPremultiply	1.322 us	0.0998 us	0.0055 us	0.04	-	-	-	-

Add Avx2 Vector4 Span Premultiplication and Reverse #1399

Add Avx2 Vector4 Span Premultiplication and Reverse #1399

Conversation

JimBobSquarePants commented Oct 21, 2020 • edited

Prerequisites

Description

tannergooding commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020 • edited

codecov bot commented Oct 21, 2020 • edited

Codecov Report

tannergooding commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020

tannergooding commented Oct 21, 2020

tannergooding commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020

tannergooding commented Oct 21, 2020

JimBobSquarePants commented Oct 21, 2020

saucecontrol Oct 22, 2020

Choose a reason for hiding this comment

JimBobSquarePants Oct 22, 2020

Choose a reason for hiding this comment

JimBobSquarePants Oct 22, 2020

Choose a reason for hiding this comment

saucecontrol Oct 22, 2020

Choose a reason for hiding this comment

JimBobSquarePants Oct 22, 2020

Choose a reason for hiding this comment

saucecontrol Oct 22, 2020

Choose a reason for hiding this comment

JimBobSquarePants Oct 22, 2020

Choose a reason for hiding this comment

JimBobSquarePants commented Oct 23, 2020

antonfirsov left a comment

Choose a reason for hiding this comment

JimBobSquarePants commented Oct 23, 2020 • edited

JimBobSquarePants commented Oct 21, 2020 •

edited

JimBobSquarePants commented Oct 21, 2020 •

edited

codecov bot commented Oct 21, 2020 •

edited

JimBobSquarePants commented Oct 23, 2020 •

edited