Add Avx2 Vector4 Span Premultiplication and Reverse #1399
Conversation
Could you clarify what CPU you are using? It can be slightly more complicated than just looking at the latency/throughput, as there are also specific "ports" that instructions can be dispatched against.
@tannergooding It's an i7-8650U CPU @ 1.90GHz. Let me know if you need any more details.
Codecov Report
@@            Coverage Diff            @@
##           master    #1399    +/-  ##
=======================================
  Coverage   82.88%   82.89%
=======================================
  Files         690      690
  Lines       30985    30926      -59
  Branches     3554     3550       -4
=======================================
- Hits        25683    25637      -46
+ Misses       4580     4570      -10
+ Partials      722      719       -3
I don't see anything obvious looking at https://uops.info/table.html (probably the best resource for "documented" vs "measured" latency/throughput and port info). Do you have the disassembly to share? Even looking at the code differences and what I'd expect out of the data dependencies, I'd expect similar perf.
Away from my computer just now, but I'll stick all the relevant code in SharpLab ASAP.
; Core CLR v4.700.20.41105 on amd64
C..ctor()
L0000: ret
C.get_PermuteAlphaMask8x32()
L0000: mov rax, 0x2940d110bac
L000a: mov [rcx], rax
L000d: mov dword ptr [rcx+8], 0x20
L0014: mov rax, rcx
L0017: ret
C.Premultiply(System.Span`1<System.Numerics.Vector4>)
L0000: vzeroupper
L0003: mov rax, [rcx]
L0006: mov rdx, 0x2940d110bac
L0010: vmovupd ymm0, [rdx]
L0014: mov edx, [rcx+8]
L0017: shl edx, 2
L001a: mov ecx, edx
L001c: sar ecx, 0x1f
L001f: and ecx, 7
L0022: add edx, ecx
L0024: sar edx, 3
L0027: xor ecx, ecx
L0029: test edx, edx
L002b: jle short L0052
L002d: movsxd r8, ecx
L0030: shl r8, 5
L0034: add r8, rax
L0037: vpermps ymm1, ymm0, [r8]
L003c: vmulps ymm1, ymm1, [r8]
L0041: vblendps ymm1, ymm1, [r8], 0x88
L0047: vmovupd [r8], ymm1
L004c: inc ecx
L004e: cmp ecx, edx
L0050: jl short L002d
L0052: vzeroupper
L0055: ret
UnPremultiply:

; Core CLR v4.700.20.41105 on amd64
C..ctor()
L0000: ret
C.get_PermuteAlphaMask8x32()
L0000: mov rax, 0x2940d0e0bac
L000a: mov [rcx], rax
L000d: mov dword ptr [rcx+8], 0x20
L0014: mov rax, rcx
L0017: ret
C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
L0000: vzeroupper
L0003: mov rax, [rcx]
L0006: mov rdx, 0x2940d0e0bac
L0010: vmovupd ymm0, [rdx]
L0014: mov edx, [rcx+8]
L0017: shl edx, 2
L001a: mov ecx, edx
L001c: sar ecx, 0x1f
L001f: and ecx, 7
L0022: add edx, ecx
L0024: sar edx, 3
L0027: xor ecx, ecx
L0029: test edx, edx
L002b: jle short L0056
L002d: movsxd r8, ecx
L0030: shl r8, 5
L0034: add r8, rax
L0037: vpermps ymm1, ymm0, [r8]
L003c: vmovupd ymm2, [r8]
L0041: vdivps ymm1, ymm2, ymm1
L0045: vblendps ymm1, ymm1, [r8], 0x88
L004b: vmovupd [r8], ymm1
L0050: inc ecx
L0052: cmp ecx, edx
L0054: jl short L002d
L0056: vzeroupper
L0059: ret
That gives the updated method to be something like:

public static void UnPremultiply(Span<Vector4> vectors)
{
    ref Vector256<float> vectorsBase =
        ref Unsafe.As<Vector4, Vector256<float>>(ref MemoryMarshal.GetReference(vectors));
    Vector256<int> mask =
        Unsafe.As<byte, Vector256<int>>(ref MemoryMarshal.GetReference(PermuteAlphaMask8x32));

    // Divide the length by 2: 4 floats per Vector4, 8 per Vector256<float>.
    ref Vector256<float> vectorsLast = ref Unsafe.Add(ref vectorsBase, (IntPtr)((uint)vectors.Length / 2u));

    while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
    {
        Vector256<float> source = vectorsBase;
        Vector256<float> multiply = Avx2.PermuteVar8x32(source, mask);
        vectorsBase = Avx.Blend(Avx.Divide(source, multiply), source, BlendAlphaControl);
        vectorsBase = ref Unsafe.Add(ref vectorsBase, 1);
    }
}

which on 3.1 generates assembly like:

C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
L0000: vzeroupper
L0003: mov rax, [rcx]
L0006: mov rdx, 0x29413720bf0
L0010: vmovupd ymm0, [rdx]
L0014: mov edx, [rcx+8]
L0017: shr edx, 1
L0019: mov edx, edx
L001b: shl rdx, 5
L001f: add rdx, rax
L0022: cmp rax, rdx
L0025: jae short L0047
L0027: vmovupd ymm1, [rax]
L002b: vpermps ymm2, ymm0, ymm1
L0030: vdivps ymm2, ymm1, ymm2
L0034: vblendps ymm1, ymm2, ymm1, 0x88
L003a: vmovupd [rax], ymm1
L003e: add rax, 0x20
L0042: cmp rax, rdx
L0045: jb short L0027
L0047: vzeroupper
L004a: ret

which is quite a bit smaller, with a much tighter inner loop, compared to the original codegen:

C.UnPremultiply(System.Span`1<System.Numerics.Vector4>)
L0000: vzeroupper
L0003: mov rax, [rcx]
L0006: mov rdx, 0x29413760bac
L0010: vmovupd ymm0, [rdx]
L0014: mov edx, [rcx+8]
L0017: shl edx, 2
L001a: mov ecx, edx
L001c: sar ecx, 0x1f
L001f: and ecx, 7
L0022: add edx, ecx
L0024: sar edx, 3
L0027: xor ecx, ecx
L0029: test edx, edx
L002b: jle short L0056
L002d: movsxd r8, ecx
L0030: shl r8, 5
L0034: add r8, rax
L0037: vpermps ymm1, ymm0, [r8]
L003c: vmovupd ymm2, [r8]
L0041: vdivps ymm1, ymm2, ymm1
L0045: vblendps ymm1, ymm1, [r8], 0x88
L004b: vmovupd [r8], ymm1
L0050: inc ecx
L0052: cmp ecx, edx
L0054: jl short L002d
L0056: vzeroupper
L0059: ret

Of course, you'll want to profile that to be certain 😄
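For reference, a Premultiply counterpart under the same approach might look like the sketch below. The PermuteAlphaMask8x32 bytes and BlendAlphaControl value here are assumptions inferred from the disassembly above (vpermps with indices 3,3,3,3,7,7,7,7 and vblendps with control 0x88), not the PR's actual source.

// Sketch only. The mask encodes the int indices { 3, 3, 3, 3, 7, 7, 7, 7 },
// broadcasting each Vector4's alpha across its own 128-bit lane.
private static ReadOnlySpan<byte> PermuteAlphaMask8x32 => new byte[]
{
    3, 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0,
    7, 0, 0, 0, 7, 0, 0, 0, 7, 0, 0, 0, 7, 0, 0, 0,
};

// Bits 3 and 7 set: keep the original alpha lanes when blending (vblendps 0x88).
private const byte BlendAlphaControl = 0b_1000_1000;

public static void Premultiply(Span<Vector4> vectors)
{
    ref Vector256<float> vectorsBase =
        ref Unsafe.As<Vector4, Vector256<float>>(ref MemoryMarshal.GetReference(vectors));
    Vector256<int> mask =
        Unsafe.As<byte, Vector256<int>>(ref MemoryMarshal.GetReference(PermuteAlphaMask8x32));

    // Two Vector4 values (8 floats) are processed per Vector256<float>.
    ref Vector256<float> vectorsLast = ref Unsafe.Add(ref vectorsBase, (IntPtr)((uint)vectors.Length / 2u));

    while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
    {
        Vector256<float> source = vectorsBase;
        Vector256<float> alpha = Avx2.PermuteVar8x32(source, mask);

        // Multiply RGB by alpha, then blend the original alpha back in.
        vectorsBase = Avx.Blend(Avx.Multiply(source, alpha), source, BlendAlphaControl);
        vectorsBase = ref Unsafe.Add(ref vectorsBase, 1);
    }
}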
@tannergooding Crikey, that was a bit of a masterclass! I didn't even know half of that was possible. Benchmarks are still wonky on my machine, but better than before. I'll push your changes now.
You might actually try to run it without the "hot path" flag, as it can actually negatively impact some codegen scenarios. Namely, "aggressive optimization" causes it to skip tier 0 and go straight to tier 1. However, this is actually more like a tier 0.75, as there are some optimizations, like removing the check of whether a static has been initialized, that can't be applied when the method is compiled up-front. I don't think you actually have any static field accesses here, as I believe the ReadOnlySpan<byte> property compiles down to constant data, but it's worth verifying.
I only added the hot path attribute after it appeared to give me a slight edge when benchmarking. Happy to remove it if you see no benefit.
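For context, the "hot path" flag being discussed presumably corresponds to MethodImplOptions.AggressiveOptimization, which matches the tiering behaviour described above; that mapping is an assumption, since the attribute itself isn't shown in this thread. Applied, it would look something like:

using System;
using System.Numerics;
using System.Runtime.CompilerServices;

static class HotPathExample // hypothetical class name, for illustration only
{
    // AggressiveOptimization opts the method out of tiered compilation:
    // it is jitted once, fully optimized, instead of starting at tier 0.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static void Premultiply(Span<Vector4> vectors)
    {
        // body as in the snippets above
    }
}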
while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
{
    Vector256<float> source = vectorsBase;
    Vector256<float> multiply = Avx2.PermuteVar8x32(source, mask);
You actually don't need to permute here, since you're not crossing 128-bit lanes. Avx.Shuffle(source, source, 0b_11_11_11_11) will do the same thing with lower latency, while eliminating the need to load the mask register.
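Applied to the loop above, that suggestion would look roughly like this (a sketch reusing the names from the earlier UnPremultiply snippet, with BlendAlphaControl assumed to be the 0x88 blend mask):

while (Unsafe.IsAddressLessThan(ref vectorsBase, ref vectorsLast))
{
    Vector256<float> source = vectorsBase;

    // vshufps with control 0b_11_11_11_11 broadcasts element 3 of each
    // 128-bit lane, i.e. each Vector4's alpha, with no mask load required.
    Vector256<float> alpha = Avx.Shuffle(source, source, 0b_11_11_11_11);

    vectorsBase = Avx.Blend(Avx.Divide(source, alpha), source, BlendAlphaControl);
    vectorsBase = ref Unsafe.Add(ref vectorsBase, 1);
}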
Thanks @saucecontrol, I didn't know Shuffle had that overload. Slight speedup:
| Method | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| PremultiplyBaseline | 37.64 us | 1.482 us | 0.081 us | 1.00 | - | - | - | - |
| Premultiply | 27.42 us | 1.738 us | 0.095 us | 0.73 | - | - | - | - |
| Method | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| UnPremultiplyBaseline | 37.753 us | 3.9513 us | 0.2166 us | 1.00 | - | - | - | - |
| UnPremultiply | 1.322 us | 0.0998 us | 0.0055 us | 0.04 | - | - | - | - |
Still don't understand why the same method with Divide instead of Multiply results in a ~30x difference in the benchmark results though 😕
Ha, I didn't read through all the comments and missed that bit. Your baseline UnPremultiply method in the benchmark is multiplying instead of dividing, but I don't see why the vectorized version is coming out so much faster. Will have a look after sleep if you don't figure it out ;)
I knew it! I knew there was a mistake! Thanks!
I guess my computer just doesn't like multiplying stuff. The baseline is way faster now too:
| Method | Mean | Error | StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| UnPremultiplyBaseline | 2.018 us | 0.1879 us | 0.0103 us | 1.00 | - | - | - | - |
| UnPremultiply | 1.255 us | 0.0452 us | 0.0025 us | 0.62 | - | - | - | - |
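For clarity, the baseline fix described above amounts to something like the following scalar loop. This is a hypothetical reconstruction; the actual benchmark code isn't shown in this thread.

// Hypothetical reconstruction of the corrected scalar baseline:
// UnPremultiply must divide RGB by alpha; the buggy version multiplied instead.
public static void UnPremultiplyScalar(Span<Vector4> vectors)
{
    for (int i = 0; i < vectors.Length; i++)
    {
        Vector4 v = vectors[i];
        float w = v.W;
        v /= w;    // was effectively 'v *= w;' in the buggy baseline
        v.W = w;   // restore the original alpha
        vectors[i] = v;
    }
}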
That's strange that the division is showing lower times. Is that just an iteration-count difference between the Premultiply and UnPremultiply runs? BDN is too clever sometimes.
Nah... exactly the same setup. I think it's optimizing something away.

@SixLabors/core If everyone is happy to ignore BDN (since the perf difference matches the baseline), I'd like to get this merged ASAP.
LGTM. Can you post some resize-only benchmark results, before and after, to feed my curiosity?
@antonfirsov Jumped ~5%.

IterationCount=3 LaunchCount=1 WarmupCount=3
Add Avx2 Vector4 Span Premultiplication and Reverse
Description
Adds Avx2 implementations of Vector4Utilities.Premultiply(span) and Vector4Utilities.UnPremultiply(span).

Hat tip to @Turnerj for some advice on Twitter that helped me get started.
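A minimal usage sketch, assuming the Vector4Utilities API named above (the pixel values are illustrative):

// Premultiply a buffer of RGBA values in place, then reverse the operation.
Span<Vector4> pixels = stackalloc Vector4[]
{
    new Vector4(0.5f, 0.25f, 0.75f, 0.5f),
    new Vector4(1f, 1f, 1f, 0.25f),
};

Vector4Utilities.Premultiply(pixels);   // RGB *= A
Vector4Utilities.UnPremultiply(pixels); // RGB /= A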
Benchmarks are..... odd. I think they're lying to me.

I'm seeing a massive (~30x) speedup for UnPremultiply in my benchmark, but only a 27% speedup for Premultiply. I suspect that BDN is somehow optimizing something away, because according to the Intel docs latency and throughput are a good bit slower for divide than for multiply, yet in our benchmark the enhanced UnPremultiply is ~30x faster than its baseline!!
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_div_ps&expand=2159
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_mul_ps&expand=2159,3931
@tannergooding is there anything obvious you can see?