
Add Avx2 Vector4 Span Premultiplication and Reverse #1399

Merged

merged 6 commits into master from js/avx2-premultiplication on Oct 23, 2020

Conversation

@JimBobSquarePants (Member) commented Oct 21, 2020

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practices as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

Adds Avx2 implementations of Vector4Utilities.Premultiply(span) and Vector4Utilities.UnPremultiply(span).
Hat tip to @Turnerj for some advice on Twitter that helped me get started.
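
For context, the scalar per-element operation these span methods vectorize looks roughly like this (a simplified sketch, not the actual ImageSharp source):

    // Premultiply scales RGB by alpha; UnPremultiply reverses it by dividing.
    // The W (alpha) component is restored afterwards, mirroring the blend-back
    // of the alpha lane in the AVX2 versions.
    static void Premultiply(ref Vector4 source)
    {
        float w = source.W;
        source *= w;   // scales X, Y, Z (and W)
        source.W = w;  // restore the original alpha
    }

    static void UnPremultiply(ref Vector4 source)
    {
        float w = source.W;
        source /= w;   // divides X, Y, Z (and W)
        source.W = w;  // restore the original alpha
    }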

Benchmarks are... odd. I think they're lying to me.

I'm seeing a massive 38% speedup for UnPremultiply in my benchmark but only a 27% speedup for Premultiply. I suspect BDN is somehow optimizing something away because, according to the Intel docs, latency and throughput for divide are good but slower than for multiply, yet in our benchmark both the baseline and enhanced versions of UnPremultiply come out ~30x faster than their Premultiply counterparts!!

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_div_ps&expand=2159
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_mul_ps&expand=2159,3931

@tannergooding is there anything obvious you can see?

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-WXSVRE : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1  IterationCount=3  LaunchCount=1
WarmupCount=3
| Method                | Mean     | Error     | StdDev    | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-----------------------|---------:|----------:|----------:|------:|------:|------:|------:|----------:|
| UnPremultiplyBaseline | 2.018 us | 0.1879 us | 0.0103 us |  1.00 |     - |     - |     - |         - |
| UnPremultiply         | 1.255 us | 0.0452 us | 0.0025 us |  0.62 |     - |     - |     - |         - |

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-IGGZLK : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Runtime=.NET Core 3.1  IterationCount=3  LaunchCount=1
WarmupCount=3
| Method              | Mean     | Error    | StdDev   | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------|---------:|---------:|---------:|------:|------:|------:|------:|----------:|
| PremultiplyBaseline | 37.64 us | 1.482 us | 0.081 us |  1.00 |     - |     - |     - |         - |
| Premultiply         | 27.42 us | 1.738 us | 0.095 us |  0.73 |     - |     - |     - |         - |

@JimBobSquarePants added this to the 1.1.0 milestone Oct 21, 2020
@JimBobSquarePants added this to To Do in ImageSharp via automation Oct 21, 2020
@tannergooding (Contributor)

> is there anything obvious you can see?

Could you clarify what CPU you are using? It can be slightly more complicated than just looking at the latency/throughput as there are also specific "ports" that instructions can be dispatched against.

@JimBobSquarePants (Member, Author) commented Oct 21, 2020

@tannergooding It's an i7-8650U CPU @ 1.90GHz. Let me know if you need any more details.

codecov bot commented Oct 21, 2020

Codecov Report

Merging #1399 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1399   +/-   ##
=======================================
  Coverage   82.88%   82.89%           
=======================================
  Files         690      690           
  Lines       30985    30926   -59     
  Branches     3554     3550    -4     
=======================================
- Hits        25683    25637   -46     
+ Misses       4580     4570   -10     
+ Partials      722      719    -3     
| Flag       | Coverage Δ                    |
|------------|-------------------------------|
| #unittests | 82.89% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Impacted Files                                        | Coverage Δ                |
|-------------------------------------------------------|---------------------------|
| src/ImageSharp/Common/Helpers/ImageMaths.cs           | 91.54% <100.00%> (+0.12%) ⬆️ |
| src/ImageSharp/Common/Helpers/Vector4Utilities.cs     | 100.00% <100.00%> (ø)     |
| src/ImageSharp/Common/Helpers/SimdUtils.cs            | 65.90% <0.00%> (ø)        |
| ...mageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs |                           |
| ...geSharp/Common/Helpers/SimdUtils.Avx2Intrinsics.cs | 100.00% <0.00%> (ø)       |
| ...arp/Common/Helpers/SimdUtils.ExtendedIntrinsics.cs | 82.19% <0.00%> (+9.58%) ⬆️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b577d8e...f1959f3. Read the comment docs.

@tannergooding (Contributor)

I don't see anything obvious looking at https://uops.info/table.html (probably the best resource for "documented" vs "measured" latency/throughput and port info).

Do you have the disassembly to share? I'd expect similar perf from looking at the code differences and what I'd expect out of the data dependencies.

@JimBobSquarePants (Member, Author)

Away from my computer just now but I'll stick all the relevant code in SharpLab ASAP.

@JimBobSquarePants (Member, Author)

Premultiply

; Core CLR v4.700.20.41105 on amd64

C..ctor()
    L0000: ret

C.get_PermuteAlphaMask8x32()
    L0000: mov rax, 0x2940d110bac
    L000a: mov [rcx], rax
    L000d: mov dword ptr [rcx+8], 0x20
    L0014: mov rax, rcx
    L0017: ret

C.Premultiply(System.Span`1<System.Numerics.Vector4>)
    L0000: vzeroupper
    L0003: mov rax, [rcx]
    L0006: mov rdx, 0x2940d110bac
    L0010: vmovupd ymm0, [rdx]
    L0014: mov edx, [rcx+8]
    L0017: shl edx, 2
    L001a: mov ecx, edx
    L001c: sar ecx, 0x1f
    L001f: and ecx, 7
    L0022: add edx, ecx
    L0024: sar edx, 3
    L0027: xor ecx, ecx
    L0029: test edx, edx
    L002b: jle short L0052
    L002d: movsxd r8, ecx
    L0030: shl r8, 5
    L0034: add r8, rax
    L0037: vpermps ymm1, ymm0, [r8]
    L003c: vmulps ymm1, ymm1, [r8]
    L0041: vblendps ymm1, ymm1, [r8], 0x88
    L0047: vmovupd [r8], ymm1
    L004c: inc ecx
    L004e: cmp ecx, edx
    L0050: jl short L002d
    L0052: vzeroupper
    L0055: ret

UnPremultiply (note the extra load at L003c):

; Core CLR v4.700.20.41105 on amd64

C..ctor()
    L0000: ret

C.get_PermuteAlphaMask8x32()
    L0000: mov rax, 0x2940d0e0bac
    L000a: mov [rcx], rax
    L000d: mov dword ptr [rcx+8], 0x20
    L0014: mov rax, rcx
    L0017: ret

C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
    L0000: vzeroupper
    L0003: mov rax, [rcx]
    L0006: mov rdx, 0x2940d0e0bac
    L0010: vmovupd ymm0, [rdx]
    L0014: mov edx, [rcx+8]
    L0017: shl edx, 2
    L001a: mov ecx, edx
    L001c: sar ecx, 0x1f
    L001f: and ecx, 7
    L0022: add edx, ecx
    L0024: sar edx, 3
    L0027: xor ecx, ecx
    L0029: test edx, edx
    L002b: jle short L0056
    L002d: movsxd r8, ecx
    L0030: shl r8, 5
    L0034: add r8, rax
    L0037: vpermps ymm1, ymm0, [r8]
    L003c: vmovupd ymm2, [r8]
    L0041: vdivps ymm1, ymm2, ymm1
    L0045: vblendps ymm1, ymm1, [r8], 0x88
    L004b: vmovupd [r8], ymm1
    L0050: inc ecx
    L0052: cmp ecx, edx
    L0054: jl short L002d
    L0056: vzeroupper
    L0059: ret

@tannergooding (Contributor)

The codegen is basically identical other than vdivps not folding the load, which doesn't really explain the difference. It's also not great codegen and could likely be improved a bit...

Namely, I think using a ref like you are is causing the repeated loads from [r8]; caching the value in a local would likely improve perf:
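
Something along these lines for the loop body (a rough sketch to illustrate the point, not the exact PR code; names mirror the full snippet further down):

    // Load once into a local, do all the work in registers,
    // then store once, instead of re-reading through the ref
    // for every operand.
    Vector256<float> source = Unsafe.Add(ref vectorsBase, i);
    Vector256<float> alpha = Avx2.PermuteVar8x32(source, mask);
    Unsafe.Add(ref vectorsBase, i) = Avx.Blend(Avx.Multiply(source, alpha), source, BlendAlphaControl);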

There is also some overhead (although not in the hot path) with the sar/and/add section, which could likely be improved a bit. Namely, it's doing signed multiplication and division, but the length and count are guaranteed to be unsigned, so the following improves it a bit more:
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgDkBXfGKASzFwBuGvSasAShwB2GXtxYBJGXym5+QkY2YtJMuTEXLeq9SwAaADiTDaWidNnyAwhHwAHXgBseAZR4A3fhgNWzEdB31DDB4INz8oQLBgmxEAZiZSBicGAG8aBgKGNz5/bGiGSFUMBmNqgCFvKQATAEFPNwALbBdlCE8GAF4GOmByOnGx8ZTqQqKSspgxJAZxGGwmgHkpTwBPHzdsKQAeYB3ogD4GAAUefA5ots7sAFk8AGsLBFTMgcupGAA7gxTtEANoAXVyDFSaGGsLo8NhMLhKIR0MRqKRGLRaIA7NiMfjMSiiTjCQThgwAL42Wb5QrEdLMZbEFAMACqUiusDunlkbl2AAp9ocjgA1GBgDDQFCXfyS6VQXAASnpBTy1DVswYsAAZgwJVLoKQAKxII66zwQMpyhXQXB1PCLAZa7WFPUc1TYXUGFq4cV2qDoA2B03my3WjDnc6Cj3PGD4aA7V5KrqeFgAcRgGFWPtgUiSgvlRqVyuV0zdBUNirDR1ql3w70GrsrnNw3t9/pBMFh1eNZrrMmjsZg+vjiagyewqew6azOdHPBgBZggpuUDuD3aXVeuA+X1IZYrlY4tQYUkGDEFgtPMmVxcVuBYABllwBzDAdBgAKgYKGVDAAPRXreGDKn2UC1hGNosC4Di0pWurQCBZ68JedCCDUDBHOemG8AA1PhqozJWGqVtqEFQVaNoMLgEAcFASSXm2HYsC0TRNCO+oPvajq4D2V5KBgVwYFAyq8OWLZupRA7QVGDC8vyuyXi0/gIKQLDrpuMBitOnzfIKdEMUksKNnukkkeRBQsT6bEcVxDA8UqfECYKQkiWJEkqWpLANMunGqQgLAACK8IETSrkZjECYpvACjsyqwlFJkMH5zSPF0PSiX0FnkVSWr5dQVJAA

And the general indexing logic is not "the best", since you really shouldn't need to add the index independently, so I think tweaking it to track the last index as a ref might be slightly better still:
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIAYACY8gOgDkBXfGKASzFwBuGvSasAShwB2GXtxYBJGXym5+QkY2YtJMuTEXLeq9SwAaADiTDaWidNnyAwhHwAHXgBseAZR4A3fhgNWzEdB31DDB4INz8oQLBgmxEAZiZSBicGAG8aBgKGNz5/bGiGSFUMBmNqgCFvKQATAEFPNwALbBdlCE8GAF4GOmByOnGx8ZTqQqKSspgxJAZxGGwmgHkpTwBPHzdsKQAeYB3ogD4GAAUefA5ots7sAFk8AGsLBFTMgcupGAA7gxTtEANoAXVyDFSaGGsLo8NhMLhKIR0MRqKRGLRaIA7NiMfjMSiiTjCQThgwAL42Wb5QrEdLMZbEFAMACqUiusDunlkbl2AAp9ocjgA1GBgDDQFCXfyS6VQXAASnpBTyM1mhVgADMGBKpdBSABWJBHHWeCBlOUK6C4Op4RYDNVa2a6jmqbA6gwtXDi21QdD6gMms0Wq0Yc7nQXu54wfDQHavJVdTwsADiMAwq29sCkSUF8sNSuVyumroKBsVoaOtUu+HegxdFY9uC9Pr9IJgsKrRtNtZkUZjMD1cYTUCT2BT2DTmezI54MHzMEFNygdwe7S6r1wHy+pFL5YrAHpjwwmrxAk1FqcGJk8Aw2TBvNwZLgijxg8W2YcmgwLB+UBftW/bhtazbaiOwF9mGlrWgwRaKrgAAyeDVEM7qcm23osC0TRNMOeqIXaDq4N2DCCkoGBXBgUDKoKgocLUyrEUqLDIUuADmGAdAwZ6kBwh4QQUwkMACHReIsgpYe2ii4HhTSwLgKHBLgAAqXRSIRCEBvajqwu6rEoWhpaiRqLazL2UA1mBkYMLgEAcFASSDDpxZ6WRtIWZWIagXBdm8vyuyuS0/gIKQLBrhuMBilOnzfIKDlOUksINruZaibMRmkU6DChQgLANEuBH5SwAAil68NeiWOc55GBbwAo7MqsJJXVsJFc0jxdD0tF9BlmredljquZhno4Qp2nDWRsLkANLZUi6i3UFSQA=

@tannergooding (Contributor)

That gives an updated method something like:

    public static void UnPremultiply(Span<Vector4> vectors)
    {
        ref Vector256<float> vectorsBase =
            ref Unsafe.As<Vector4, Vector256<float>>(ref MemoryMarshal.GetReference(vectors));

        Vector256<int> mask =
            Unsafe.As<byte, Vector256<int>>(ref MemoryMarshal.GetReference(PermuteAlphaMask8x32));

        // divide by 2 as 4 elements per Vector4 and 8 per Vector256<float>
        ref Vector256<float> vectorsLast = ref Unsafe.Add(ref vectorsBase, (IntPtr)((uint)vectors.Length / 2u));
        
        while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
        {
            Vector256<float> source = vectorsBase;
            Vector256<float> multiply = Avx2.PermuteVar8x32(source, mask);
            vectorsBase = Avx.Blend(Avx.Divide(source, multiply), source, BlendAlphaControl);
            vectorsBase = ref Unsafe.Add(ref vectorsBase, 1);
        }
    }

which on 3.1 generates assembly like:

C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
    L0000: vzeroupper
    L0003: mov rax, [rcx]
    L0006: mov rdx, 0x29413720bf0
    L0010: vmovupd ymm0, [rdx]
    L0014: mov edx, [rcx+8]
    L0017: shr edx, 1
    L0019: mov edx, edx
    L001b: shl rdx, 5
    L001f: add rdx, rax
    L0022: cmp rax, rdx
    L0025: jae short L0047
    L0027: vmovupd ymm1, [rax]
    L002b: vpermps ymm2, ymm0, ymm1
    L0030: vdivps ymm2, ymm1, ymm2
    L0034: vblendps ymm1, ymm2, ymm1, 0x88
    L003a: vmovupd [rax], ymm1
    L003e: add rax, 0x20
    L0042: cmp rax, rdx
    L0045: jb short L0027
    L0047: vzeroupper
    L004a: ret

This is quite a bit smaller, with a much tighter inner loop, compared to the original codegen:

C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
    L0000: vzeroupper
    L0003: mov rax, [rcx]
    L0006: mov rdx, 0x29413760bac
    L0010: vmovupd ymm0, [rdx]
    L0014: mov edx, [rcx+8]
    L0017: shl edx, 2
    L001a: mov ecx, edx
    L001c: sar ecx, 0x1f
    L001f: and ecx, 7
    L0022: add edx, ecx
    L0024: sar edx, 3
    L0027: xor ecx, ecx
    L0029: test edx, edx
    L002b: jle short L0056
    L002d: movsxd r8, ecx
    L0030: shl r8, 5
    L0034: add r8, rax
    L0037: vpermps ymm1, ymm0, [r8]
    L003c: vmovupd ymm2, [r8]
    L0041: vdivps ymm1, ymm2, ymm1
    L0045: vblendps ymm1, ymm1, [r8], 0x88
    L004b: vmovupd [r8], ymm1
    L0050: inc ecx
    L0052: cmp ecx, edx
    L0054: jl short L002d
    L0056: vzeroupper
    L0059: ret

Of course, you'll want to profile that to be certain 😄

@JimBobSquarePants (Member, Author)

@tannergooding Crikey that was a bit of a masterclass! I didn't even know the Unsafe.IsAddressLessThan method existed!

Benchmarks are still wonky on my machine but better than before. I'll push your changes now.

| Method                | Mean      | Error     | StdDev    | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-----------------------|----------:|----------:|----------:|------:|------:|------:|------:|----------:|
| UnPremultiplyBaseline | 38.490 us | 7.0771 us | 0.3879 us |  1.00 |     - |     - |     - |         - |
| UnPremultiply         |  1.348 us | 0.4586 us | 0.0251 us |  0.04 |     - |     - |     - |         - |

| Method              | Mean     | Error    | StdDev   | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------|---------:|---------:|---------:|------:|------:|------:|------:|----------:|
| PremultiplyBaseline | 37.68 us | 3.704 us | 0.203 us |  1.00 |     - |     - |     - |         - |
| Premultiply         | 27.53 us | 1.072 us | 0.059 us |  0.73 |     - |     - |     - |         - |

@JimBobSquarePants moved this from To Do to In progress in ImageSharp Oct 21, 2020
@tannergooding (Contributor)

You might try running it without the "hot path" flag, as it can actually negatively impact some codegen scenarios.

Namely, "aggressive optimization" causes it to skip tier 0 and go straight to tier 1. However, this is actually more like a tier 0.75 as there are some optimizations, like removing the check of whether a beforefieldinit static constructor needs to run, which can't happen without rejitting (and, IIRC, we don't currently rejit methods that go straight to tier 1).

I don't think you actually have any static field accesses here, as I believe PermuteAlphaMask8x32 will get compiled down to a method that constructs a span directly over a constant set of bytes, but it might be worth checking anyways.
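
The pattern being referred to is the C# compiler's handling of constant ReadOnlySpan<byte> properties, roughly like this (a sketch; the mask bytes shown are an assumption based on the permute indices {3,3,3,3,7,7,7,7} implied by the disassembly):

    // The bytes are emitted into the assembly's data section and the property
    // returns a span wrapped directly over that constant data: no array
    // allocation and no static-constructor check required.
    private static ReadOnlySpan<byte> PermuteAlphaMask8x32 => new byte[]
    {
        3, 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, // alpha index for the 1st Vector4
        7, 0, 0, 0, 7, 0, 0, 0, 7, 0, 0, 0, 7, 0, 0, 0, // alpha index for the 2nd Vector4
    };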

@JimBobSquarePants (Member, Author)

I only added the hot path attribute after it appeared to give me a slight edge on benchmarking. Happy to remove it if you see no benefit.

    while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
    {
        Vector256<float> source = vectorsBase;
        Vector256<float> multiply = Avx2.PermuteVar8x32(source, mask);
Contributor:

You actually don't need to permute here since you're not crossing 128-bit lanes. Avx.Shuffle(source, source, 0b_11_11_11_11) will do the same thing with lower latency while eliminating the need to load the mask register.
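
Applied to the loop above, that suggestion would look something like this (a sketch of the UnPremultiply body):

    // vshufps with control 0b_11_11_11_11 replicates element 3 within each
    // 128-bit lane, broadcasting each Vector4's alpha across its own lane,
    // so no mask vector has to be loaded.
    Vector256<float> source = vectorsBase;
    Vector256<float> alpha = Avx.Shuffle(source, source, 0b_11_11_11_11);
    vectorsBase = Avx.Blend(Avx.Divide(source, alpha), source, BlendAlphaControl);
    vectorsBase = ref Unsafe.Add(ref vectorsBase, 1);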

Member Author:
Thanks @saucecontrol, I didn't know Shuffle had that overload. Slight speedup.

| Method              | Mean     | Error    | StdDev   | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------|---------:|---------:|---------:|------:|------:|------:|------:|----------:|
| PremultiplyBaseline | 37.64 us | 1.482 us | 0.081 us |  1.00 |     - |     - |     - |         - |
| Premultiply         | 27.42 us | 1.738 us | 0.095 us |  0.73 |     - |     - |     - |         - |

| Method                | Mean      | Error     | StdDev    | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-----------------------|----------:|----------:|----------:|------:|------:|------:|------:|----------:|
| UnPremultiplyBaseline | 37.753 us | 3.9513 us | 0.2166 us |  1.00 |     - |     - |     - |         - |
| UnPremultiply         |  1.322 us | 0.0998 us | 0.0055 us |  0.04 |     - |     - |     - |         - |

Member Author:
Still don't understand why the same method with Divide instead of Multiply results in a ~30x difference in benchmark results though 😕

Contributor:
Ha, I didn't read through all the comments and missed that bit. Your baseline UnPremultiply method in the benchmark is multiplying instead of dividing, but I don't see why the vectorized version is coming out so much faster. Will have a look after sleep if you don't figure it out ;)

Member Author:
I knew it! I knew there was a mistake! Thanks!

I guess my computer just doesn't like multiplying stuff. The baseline is way faster now too.

| Method                | Mean     | Error     | StdDev    | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-----------------------|---------:|----------:|----------:|------:|------:|------:|------:|----------:|
| UnPremultiplyBaseline | 2.018 us | 0.1879 us | 0.0103 us |  1.00 |     - |     - |     - |         - |
| UnPremultiply         | 1.255 us | 0.0452 us | 0.0025 us |  0.62 |     - |     - |     - |         - |

Contributor:
That's strange that the division is showing lower times. Is that just an iteration count difference between the Premultiply and UnPremultiply runs? BDN is too clever sometimes.

Member Author:
Nah... Exactly the same setup. I think it's optimizing something away.

@JimBobSquarePants requested a review from a team October 23, 2020 10:02
@JimBobSquarePants (Member, Author)

@SixLabors/core If everyone is happy to ignore the BDN weirdness (since the perf difference matches the baseline) I'd like to get this merged ASAP.

@antonfirsov (Member) left a comment:

LGTM. Can you post some resize-only benchmark results before and after to feed my curiosity?

@JimBobSquarePants merged commit b5975a3 into master Oct 23, 2020
ImageSharp automation moved this from In progress to Done Oct 23, 2020
@JimBobSquarePants deleted the js/avx2-premultiplication branch October 23, 2020 10:41
@JimBobSquarePants (Member, Author) commented Oct 23, 2020

@antonfirsov Jumped by ~5%.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-PDTUQZ : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
  Job-QRBITN : .NET Core 2.1.23 (CoreCLR 4.6.29321.03, CoreFX 4.6.29321.01), X64 RyuJIT
  Job-JVJPHZ : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

IterationCount=3 LaunchCount=1 WarmupCount=3

| Method                                   | Job        | Runtime       | SourceToDest | Mean     | Error     | StdDev   | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------------------------|------------|---------------|--------------|---------:|----------:|---------:|------:|--------:|------:|------:|------:|----------:|
| SystemDrawing                            | Job-PDTUQZ | .NET 4.7.2    | 3032-400     | 83.04 ms |  9.761 ms | 0.535 ms |  1.00 |    0.00 |     - |     - |     - |    1170 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-PDTUQZ | .NET 4.7.2    | 3032-400     | 54.62 ms | 41.477 ms | 2.274 ms |  0.66 |    0.03 |     - |     - |     - |   17203 B |
| SystemDrawing                            | Job-QRBITN | .NET Core 2.1 | 3032-400     | 84.71 ms | 19.817 ms | 1.086 ms |  1.00 |    0.00 |     - |     - |     - |      96 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-QRBITN | .NET Core 2.1 | 3032-400     | 67.72 ms | 16.151 ms | 0.885 ms |  0.80 |    0.02 |     - |     - |     - |   16928 B |
| SystemDrawing                            | Job-JVJPHZ | .NET Core 3.1 | 3032-400     | 85.19 ms |  5.483 ms | 0.301 ms |  1.00 |    0.00 |     - |     - |     - |      96 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-JVJPHZ | .NET Core 3.1 | 3032-400     | 55.27 ms | 11.544 ms | 0.633 ms |  0.65 |    0.01 |     - |     - |     - |   16904 B |

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.572 (2004/?/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.2.20479.15
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
  Job-VTPBNH : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
  Job-LYXMSX : .NET Core 2.1.23 (CoreCLR 4.6.29321.03, CoreFX 4.6.29321.01), X64 RyuJIT
  Job-VNWGRF : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

IterationCount=3 LaunchCount=1 WarmupCount=3

| Method                                   | Job        | Runtime       | SourceToDest | Mean     | Error    | StdDev   | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------------------------------|------------|---------------|--------------|---------:|---------:|---------:|------:|------:|------:|------:|----------:|
| SystemDrawing                            | Job-VTPBNH | .NET 4.7.2    | 3032-400     | 80.27 ms | 7.021 ms | 0.385 ms |  1.00 |     - |     - |     - |    1365 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-VTPBNH | .NET 4.7.2    | 3032-400     | 49.39 ms | 2.429 ms | 0.133 ms |  0.62 |     - |     - |     - |   17129 B |
| SystemDrawing                            | Job-LYXMSX | .NET Core 2.1 | 3032-400     | 79.82 ms | 3.718 ms | 0.204 ms |  1.00 |     - |     - |     - |      96 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-LYXMSX | .NET Core 2.1 | 3032-400     | 64.13 ms | 4.689 ms | 0.257 ms |  0.80 |     - |     - |     - |   16928 B |
| SystemDrawing                            | Job-VNWGRF | .NET Core 3.1 | 3032-400     | 79.76 ms | 1.253 ms | 0.069 ms |  1.00 |     - |     - |     - |      96 B |
| 'ImageSharp, MaxDegreeOfParallelism = 1' | Job-VNWGRF | .NET Core 3.1 | 3032-400     | 44.90 ms | 0.947 ms | 0.052 ms |  0.56 |     - |     - |     - |   17013 B |

JimBobSquarePants added a commit that referenced this pull request Mar 13, 2021
Add Avx2 Vector4 Span Premultiplication and Reverse