[API Proposal]: Add non-overflowing widening sum to Vector types #114832
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
If you want an overflowing sum, then you likely shouldn't be vectorizing like that and will effectively be falling back to "scalar" logic regardless. There are some potential minor optimizations possible when doing a sum across an entire collection, but less so for an individual vector. There are also complexities here in deciding whether to throw if "anything fails" or only if the total sum would overflow.
That is what we already have. I want a non-overflowing widening sum, as the title says. My benchmarks show an easy 40-fold performance win from vectorization with Avx2 over scalar addition when summing bytes in a Vector256.
Again, it's a non-overflowing widening sum. Obviously the sum of the sbytes [127, 1] is the short 128.
Benchmarks of some different implementations for Vector256 with AVX2.
Benchmark code:
using System.Diagnostics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
namespace BitsetConditionalSumBenchmark;
[DisassemblyDiagnoser(printInstructionAddresses: true, syntax: DisassemblySyntax.Masm)]
public class SumVector256Bytes
{
private Vector256<byte> vec;
private uint expectedSum;
[GlobalSetup]
public void Setup()
{
var random = new Random(42);
var data = GC.AllocateUninitializedArray<byte>(32);
random.NextBytes(data);
vec = Vector256.LoadUnsafe(ref data[0]);
for (int i = 0; i < data.Length; i++)
{
expectedSum += data[i];
}
}
[Benchmark]
public uint ScalarSum()
{
var v = vec;
var sum = (uint)(
v[0]
+ v[1]
+ v[2]
+ v[3]
+ v[4]
+ v[5]
+ v[6]
+ v[7]
+ v[8]
+ v[9]
+ v[10]
+ v[11]
+ v[12]
+ v[13]
+ v[14]
+ v[15]
+ v[16]
+ v[17]
+ v[18]
+ v[19]
+ v[20]
+ v[21]
+ v[22]
+ v[23]
+ v[24]
+ v[25]
+ v[26]
+ v[27]
+ v[28]
+ v[29]
+ v[30]
+ v[31]);
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public uint LoopSum()
{
var v = vec;
uint sum = 0;
for (int i = 0; i < 32; i++)
{
sum += v[i];
}
return sum;
}
[Benchmark]
public uint WidenAddSum()
{
var (vlo, vhi) = Vector256.Widen(vec);
var vsum = Vector256.Add(vlo, vhi);
var sum = Vector256.Sum(vsum);
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public uint SADSum()
{
var v = vec;
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
var sum = Vector256.Sum(sad);
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public uint SADSum64()
{
var v = vec;
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
var sad64 = sad.AsUInt64();
var sum = (uint)Vector256.Sum(sad64);
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public uint SADShuffleWordSum()
{
var v = vec;
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
var shuffled = Avx2.Shuffle(sad.AsByte(), Vector256.Create(
0, 1, 8, 9, 16, 17, 24, 25,
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1).AsByte()).AsUInt16();
var sum = Vector64.Sum(shuffled.GetLower().GetLower());
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public uint SADAddSwapAdd16()
{
var v = vec;
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
Vector128<ushort> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
const byte control =
(1 << 6) |
(0 << 4) |
(3 << 2) |
2;
var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt16();
var s2 = Sse2.Add(s1, s1Hi);
var sum = s2[0];
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public uint SADAddSwapAdd16_NoCast()
{
var v = vec;
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
Vector128<ushort> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
const byte control =
(1 << 6) |
(0 << 4) |
(3 << 2) |
2;
var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt16();
var s2 = Sse2.Add(s1, s1Hi);
var sum = s2.AsUInt32()[0];
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public uint SADAddSwapAdd32()
{
var v = vec;
Vector256<uint> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero).AsUInt32();
Vector128<uint> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
const byte control =
(1 << 6) |
(0 << 4) |
(3 << 2) |
2;
var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control);
var s2 = Sse2.Add(s1, s1Hi);
var sum = s2[0];
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public uint SADAddSwapAdd64()
{
var v = vec;
Vector256<ulong> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero).AsUInt64();
Vector128<ulong> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
const byte control =
(1 << 6) |
(0 << 4) |
(3 << 2) |
2;
var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt64();
var s2 = Sse2.Add(s1, s1Hi);
var sum = (uint)s2[0];
Debug.Assert(sum == expectedSum);
return sum;
}
}
Sorry, I mean one that throws on overflow.
This is already possible today with the existing primitives. I don't think this is common enough to warrant a built-in API or the relevant JIT/etc. handling; it's a trivial extension method to add (particularly with the new C# 14 extension members feature: https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-14#extension-members), and it's not what you would end up using for a typical SIMD primitive that forms part of some other algorithm.
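For reference, a minimal sketch of such a helper using only the cross-platform `Vector256` APIs (the name `SumWidening` is hypothetical and not a BCL API; this is the straightforward Widen + Sum baseline, not the faster SAD-based sequence discussed later in the thread):

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper, not a BCL API: a widening sum for byte elements.
// The result cannot overflow, since 32 * 255 = 8160 fits comfortably in a ushort.
static ushort SumWidening(Vector256<byte> v)
{
    var (lo, hi) = Vector256.Widen(v);  // two Vector256<ushort> halves
    return Vector256.Sum(lo + hi);      // per-lane add, then one horizontal reduce
}

Console.WriteLine(SumWidening(Vector256.Create((byte)1))); // 32
```

With C# 14 extension members the same body could be exposed as `v.SumWidening()` instead of a static call.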
Most of these are too small to be measured accurately and are below what the hardware timer can resolve. Many of them are visibly non-representative because they are showing measurements that are less than the time of a single cycle of the 3.10GHz CPU. There is a tracking issue for this here: dotnet/BenchmarkDotNet#1802. My personal recommendation is to ensure you've got at least enough "computational work" to ensure a 100-300ns "mean time", as this helps account for any inherent latency in terms of cache misses (~20-200 cycles), the general querying of the hardware timer (~30 cycles), branch mispredicts (~20 cycles), and general system overhead.
That's disappointing to hear; BenchmarkDotNet certainly tries to give the impression that it takes care of all those issues for you. I have tried making it do more work here.
Benchmark code:
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
namespace BitsetConditionalSumBenchmark;
[DisassemblyDiagnoser(printInstructionAddresses: true, syntax: DisassemblySyntax.Masm)]
public class SumVector256Bytes
{
private int N = 1000;
private TestData[] casesData = null!;
private struct TestData
{
public Vector256<byte> Vec;
public uint ExpectedSum;
}
[GlobalSetup]
public void Setup()
{
var random = new Random(42);
casesData = GC.AllocateUninitializedArray<TestData>(N);
var data = GC.AllocateUninitializedArray<byte>(32);
for (int n = 0; n < data.Length; n++)
{
random.NextBytes(data);
var vec = Vector256.LoadUnsafe(ref data[0]);
uint expectedSum = 0;
for (int i = 0; i < data.Length; i++)
{
expectedSum += data[i];
}
casesData[n] = new() { Vec = vec, ExpectedSum = expectedSum };
}
}
[Benchmark]
public ulong RawOverhead()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += casesData[n].ExpectedSum;
}
return sum;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint ScalarSumIteration(Vector256<byte> v, uint expectedSum)
{
var sum = (uint)(
v[0]
+ v[1]
+ v[2]
+ v[3]
+ v[4]
+ v[5]
+ v[6]
+ v[7]
+ v[8]
+ v[9]
+ v[10]
+ v[11]
+ v[12]
+ v[13]
+ v[14]
+ v[15]
+ v[16]
+ v[17]
+ v[18]
+ v[19]
+ v[20]
+ v[21]
+ v[22]
+ v[23]
+ v[24]
+ v[25]
+ v[26]
+ v[27]
+ v[28]
+ v[29]
+ v[30]
+ v[31]);
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong ScalarSum()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += ScalarSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint LoopSumIteration(Vector256<byte> v, uint expectedSum)
{
uint sum = 0;
for (int i = 0; i < 32; i++)
{
sum += v[i];
}
return sum;
}
[Benchmark]
public ulong LoopSum()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += LoopSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint WidenAddSumIteration(Vector256<byte> v, uint expectedSum)
{
var (vlo, vhi) = Vector256.Widen(v);
var vsum = Vector256.Add(vlo, vhi);
var sum = Vector256.Sum(vsum);
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong WidenAddSum()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += WidenAddSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint SADSumIteration(Vector256<byte> v, uint expectedSum)
{
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
var sum = Vector256.Sum(sad);
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong SADSum()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += SADSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint SADSum64Iteration(Vector256<byte> v, uint expectedSum)
{
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
var sad64 = sad.AsUInt64();
var sum = (uint)Vector256.Sum(sad64);
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong SADSum64()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += SADSum64Iteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
/*[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint SADShuffleWordSumIteration(Vector256<byte> v, uint expectedSum)
{
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
var shuffled = Avx2.Shuffle(sad.AsByte(), Vector256.Create(
0, 1, 8, 9, 16, 17, 24, 25,
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1).AsByte()).AsUInt16();
var sum = Vector64.Sum(shuffled.GetLower().GetLower());
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong SADShuffleWordSum()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += SADShuffleWordSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}*/
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint SADAddSwapAdd16Iteration(Vector256<byte> v, uint expectedSum)
{
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
Vector128<ushort> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
const byte control =
(1 << 6) |
(0 << 4) |
(3 << 2) |
2;
var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt16();
var s2 = Sse2.Add(s1, s1Hi);
var sum = s2[0];
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong SADAddSwapAdd16()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += SADAddSwapAdd16Iteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint SADAddSwapAdd16_NoCastIteration(Vector256<byte> v, uint expectedSum)
{
Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
Vector128<ushort> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
const byte control =
(1 << 6) |
(0 << 4) |
(3 << 2) |
2;
var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt16();
var s2 = Sse2.Add(s1, s1Hi);
var sum = s2.AsUInt32()[0];
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong SADAddSwapAdd16_NoCast()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += SADAddSwapAdd16_NoCastIteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint SADAddSwapAdd32Iteration(Vector256<byte> v, uint expectedSum)
{
Vector256<uint> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero).AsUInt32();
Vector128<uint> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
const byte control =
(1 << 6) |
(0 << 4) |
(3 << 2) |
2;
var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control);
var s2 = Sse2.Add(s1, s1Hi);
var sum = s2[0];
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong SADAddSwapAdd32()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += SADAddSwapAdd32Iteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint SADAddSwapAdd64Iteration(Vector256<byte> v, uint expectedSum)
{
Vector256<ulong> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero).AsUInt64();
Vector128<ulong> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
const byte control =
(1 << 6) |
(0 << 4) |
(3 << 2) |
2;
var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt64();
var s2 = Sse2.Add(s1, s1Hi);
var sum = (uint)s2[0];
Debug.Assert(sum == expectedSum);
return sum;
}
[Benchmark]
public ulong SADAddSwapAdd64()
{
ulong sum = 0;
for (int n = 0; n < casesData.Length; n++)
{
sum += SADAddSwapAdd64Iteration(casesData[n].Vec, casesData[n].ExpectedSum);
}
return sum;
}
}
Assuming my attempted overhead measurement is useful, the SADAddSwapAdd32 implementation is still about 2.8 times faster than WidenAddSum.
Yes, but there is ultimately only so far it can go. Things which take less time than the timer's precision will naturally get mismeasured (accounting for that is technically possible, but immensely complex and error-prone), so they will inherently have some run-to-run flakiness simply due to how computers work.
This is also very hardware-, type-, and scenario-dependent.
However, neither of these is strictly representative. With a simple tweak, suddenly you can trivially beat the SAD-based versions:
Given:
[Benchmark]
public ulong WidenAddSumSeparate()
{
Vector256<ushort> sum = Vector256<ushort>.Zero;
for (int n = 0; n < casesData.Length; n++)
{
var v = casesData[n].Vec;
sum += Vector256.WidenLower(v) + Vector256.WidenUpper(v);
}
return Vector256.Sum(sum);
}
[Benchmark]
public ulong WidenAddSumV512()
{
Vector512<ushort> sum = Vector512<ushort>.Zero;
for (int n = 0; n < casesData.Length; n++)
{
sum += Avx512BW.ConvertToVector512UInt16(casesData[n].Vec);
}
return Vector512.Sum(sum);
}
All these microbenchmarks are also a bit skewed because they aren't set up like a real-world example. In a more realistic example you likely wouldn't have the data preloaded as an array of vectors. Even without pulling out "all the tricks", simply updating the loops to be a bit more representative (plus adding a couple of additional examples), where you operate over the raw bytes, makes a difference.
So all in all, I would consider exposing a dedicated API for this to be low priority.
My goal isn't to sum a lot of bytes from memory together; that was just an attempt to give more work to BenchmarkDotNet. The method in the example I gave above is what I originally needed the widening sum for:
public static ushort ConditionalSumByBitSet(uint bitset, Vector256<byte> items)
{
Vector256<byte> xbcast = Vector256.Create(bitset).AsByte();
// Each byte gets the source byte containing the corresponding bit
Vector256<byte> indices = Vector256.Create(
0x0000000000000000UL,
0x0101010101010101UL,
0x1E1E1E1E1E1E1E1EUL,
0x1F1F1F1F1F1F1F1FUL).AsByte();
Vector256<byte> shuf = Vector256.Shuffle(xbcast, indices);
Vector256<byte> andmask = Vector256.Create(0x8040201008040201UL).AsByte();
Vector256<byte> isolated = Vector256.BitwiseAnd(shuf, andmask);
Vector256<byte> notSelectedMask = Vector256.Equals(isolated, Vector256<byte>.Zero);
Vector256<byte> selected = Vector256.ConditionalSelect(notSelectedMask, Vector256<byte>.Zero, items);
return Vector256.SumWidening(selected); // <-- Used here
}
It's to calculate the size of a variable structure with a lot of fields of different sizes that may or may not be present, as specified by a bitset. I don't know if there is a better way to do it, but I got into the rabbit hole of the framework not having an optimized sum function readily available that I could use to test the approach out. (There certainly are better ways if I had Avx512, but even if the server where it ends up running has it, I currently have no easy way of benchmarking it beforehand.)
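For what it's worth, the same computation has a trivially correct scalar form that is useful as a reference when testing a vectorized version (the method name and the sample field sizes below are mine, purely for illustration):

```csharp
using System;

// Scalar reference: add items[i] for every i whose bit is set in the bitset.
// Assumes items has at least 32 elements, matching a Vector256<byte>.
static ushort ConditionalSumByBitSetScalar(uint bitset, ReadOnlySpan<byte> items)
{
    ushort sum = 0;
    for (int i = 0; i < 32; i++)
        if ((bitset & (1u << i)) != 0)
            sum += items[i];
    return sum;
}

// Hypothetical field-size table: fields 0, 1, and 5 are present in the bitset.
byte[] fieldSizes = new byte[32];
fieldSizes[0] = 4; fieldSizes[1] = 8; fieldSizes[5] = 2;
Console.WriteLine(ConditionalSumByBitSetScalar(0b100011, fieldSizes)); // 4 + 8 + 2 = 14
```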
That would be useful, but I'd imagine it would be a lot of work for the JIT to be able to optimize it down to something like the SAD+Sum if I just called it as:
Span<byte> selectedBytes = stackalloc byte[32];
selected.CopyTo(selectedBytes);
return TensorPrimitives.SumWidening<byte, ushort>(selectedBytes);
or (if this is even legal; I'm not seeing anywhere saying it isn't, so I guess the JIT has to figure out what to do):
return TensorPrimitives.SumWidening<byte, ushort>(MemoryMarshal.AsBytes(new Span<Vector256<byte>>(ref selected)));
If you think that is feasible then there is no need for adding it as a new primitive for each vector type.
The consideration is more that this API is not something that will be added as a new API on the vector types. If you really have the case of just needing to do it precisely once over a single vector, then it is a scenario-specific thing and something that can be rolled for that scenario using the existing primitives. There are a few patterns that might be optimizable by the JIT, but the couple-cycle perf difference for a single vector isn't likely to matter, so adding such pattern recognition would be low priority. For more typical scenarios, you're going to be summing things in a loop, and so such an API is the exact opposite of what you want and will hurt perf more than it will help. Instead you want to avoid any reduction across the vector for as long as possible, which will net you nearly a 2x or more throughput improvement.
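As a concrete illustration of the "defer the reduction" advice, here is a sketch (mine, under the stated assumptions, using only cross-platform `Vector256` APIs) that widens and accumulates per iteration and reduces across the vector only once at the end:

```csharp
using System;
using System.Runtime.Intrinsics;

// Accumulate widened partial sums per lane; reduce across the vector once at the end.
// Assumes data.Length <= 4096 so no ushort lane can overflow:
// each lane gains at most 510 per 32-byte block, and 128 blocks * 510 = 65280 < 65536.
static uint SumBytes(ReadOnlySpan<byte> data)
{
    var acc = Vector256<ushort>.Zero;
    int i = 0;
    for (; i + 32 <= data.Length; i += 32)
    {
        var v = Vector256.Create(data.Slice(i, 32));
        acc += Vector256.WidenLower(v) + Vector256.WidenUpper(v);
    }
    var (lo, hi) = Vector256.Widen(acc);            // widen to uint before reducing
    uint total = Vector256.Sum(lo + hi);
    for (; i < data.Length; i++) total += data[i];  // scalar tail
    return total;
}

byte[] data = new byte[100];
Array.Fill(data, (byte)3);
Console.WriteLine(SumBytes(data)); // 300
```

A full-size version would periodically flush the ushort accumulator into a wider one to lift the 4096-byte cap; this sketch keeps only the structural point about deferring the horizontal reduction.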
Background and motivation
The various `Vector` types have `Sum` methods available for efficiently summing elements. However, they all return the same type as the element type, which means they can't be used if the sum of elements might overflow.
Writing non-overflowing sum methods yourself is a lot of work if you want them to be efficient on diverse platforms with different availability of SIMD instructions; the obvious solution of Widen + Sum is generally not the most performant choice.
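To make the overflow hazard concrete, a small sketch using the existing cross-platform APIs: `Vector256.Sum` uses unchecked element-type arithmetic and wraps, while Widen + Sum does not:

```csharp
using System;
using System.Runtime.Intrinsics;

// 32 bytes of 200: the true sum is 6400, which does not fit in a byte.
var v = Vector256.Create((byte)200);

byte wrapped = Vector256.Sum(v);          // unchecked byte arithmetic: 6400 % 256 = 0
var (lo, hi) = Vector256.Widen(v);        // two Vector256<ushort>
ushort widened = Vector256.Sum(lo + hi);  // 6400, no overflow

Console.WriteLine($"{wrapped} {widened}"); // 0 6400
```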
API Proposal
API Usage
Alternative Designs
No response
Risks
None known