[API Proposal]: Add non-overflowing widening sum to Vector types #114832


Closed
poizan42 opened this issue Apr 19, 2025 · 10 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime.Intrinsics

Comments

@poizan42
Contributor

poizan42 commented Apr 19, 2025

Background and motivation

The various Vector types have Sum methods available for efficiently summing elements. However, they all return the same type as the elements, which means they can't be used when the sum of the elements might overflow.

Writing your own non-overflowing sum is a lot of work if you want it to be efficient across diverse platforms with different SIMD instruction availability; the obvious solution of Widen + Sum is generally not the most performant choice.
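For illustration, the intended semantics can be modeled in a few lines (a Python sketch, not part of the proposal): every lane is summed into a result twice the element width, which by construction cannot overflow.

```python
# Python sketch (illustration only) of the proposed non-overflowing widening
# sum: lanes of an N-bit element type are summed into a single 2N-bit result.

def sum_widening(lanes, elem_bits, signed=False):
    total = sum(lanes)
    # The widened result always fits: a Vector256<byte> has 32 lanes, so the
    # largest possible sum is 32 * 255 = 8160, well inside ushort's 65535.
    lo = -(1 << (2 * elem_bits - 1)) if signed else 0
    hi = (1 << (2 * elem_bits - 1)) if signed else (1 << (2 * elem_bits))
    assert lo <= total < hi
    return total

print(sum_widening([255] * 32, 8))              # -> 8160
print(sum_widening([127, 1], 8, signed=True))   # -> 128
```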

API Proposal

namespace System.Numerics;

public static class Vector
{
    public static ushort SumWidening(Vector<byte> value);
    public static uint SumWidening(Vector<ushort> value);
    public static ulong SumWidening(Vector<uint> value);
    public static UInt128 SumWidening(Vector<ulong> value);

    public static short SumWidening(Vector<sbyte> value);
    public static int SumWidening(Vector<short> value);
    public static long SumWidening(Vector<int> value);
    public static Int128 SumWidening(Vector<long> value);
}
namespace System.Runtime.Intrinsics;

public static class Vector64
{
    public static ushort SumWidening(Vector64<byte> value);
    public static uint SumWidening(Vector64<ushort> value);
    public static ulong SumWidening(Vector64<uint> value);
    public static UInt128 SumWidening(Vector64<ulong> value);

    public static short SumWidening(Vector64<sbyte> value);
    public static int SumWidening(Vector64<short> value);
    public static long SumWidening(Vector64<int> value);
    public static Int128 SumWidening(Vector64<long> value);
}
namespace System.Runtime.Intrinsics;

public static class Vector128
{
    public static ushort SumWidening(Vector128<byte> value);
    public static uint SumWidening(Vector128<ushort> value);
    public static ulong SumWidening(Vector128<uint> value);
    public static UInt128 SumWidening(Vector128<ulong> value);

    public static short SumWidening(Vector128<sbyte> value);
    public static int SumWidening(Vector128<short> value);
    public static long SumWidening(Vector128<int> value);
    public static Int128 SumWidening(Vector128<long> value);
}
namespace System.Runtime.Intrinsics;

public static class Vector256
{
    public static ushort SumWidening(Vector256<byte> value);
    public static uint SumWidening(Vector256<ushort> value);
    public static ulong SumWidening(Vector256<uint> value);
    public static UInt128 SumWidening(Vector256<ulong> value);

    public static short SumWidening(Vector256<sbyte> value);
    public static int SumWidening(Vector256<short> value);
    public static long SumWidening(Vector256<int> value);
    public static Int128 SumWidening(Vector256<long> value);
}
namespace System.Runtime.Intrinsics;

public static class Vector512
{
    public static ushort SumWidening(Vector512<byte> value);
    public static uint SumWidening(Vector512<ushort> value);
    public static ulong SumWidening(Vector512<uint> value);
    public static UInt128 SumWidening(Vector512<ulong> value);

    public static short SumWidening(Vector512<sbyte> value);
    public static int SumWidening(Vector512<short> value);
    public static long SumWidening(Vector512<int> value);
    public static Int128 SumWidening(Vector512<long> value);
}

API Usage

    public static ushort ConditionalSumByBitSet(uint bitset, Vector256<byte> items)
    {
        Vector256<byte> xbcast = Vector256.Create(bitset).AsByte();

        // Each byte gets the source byte containing the corresponding bit
        Vector256<byte> indices = Vector256.Create(
            0x0000000000000000UL,
            0x0101010101010101UL,
            0x1E1E1E1E1E1E1E1EUL,
            0x1F1F1F1F1F1F1F1FUL).AsByte();
        
        Vector256<byte> shuf = Vector256.Shuffle(xbcast, indices);
        Vector256<byte> andmask = Vector256.Create(0x8040201008040201UL).AsByte();
        Vector256<byte> isolated = Vector256.BitwiseAnd(shuf, andmask);
        Vector256<byte> notSelectedMask = Vector256.Equals(isolated, Vector256<byte>.Zero);
        
        Vector256<byte> selected = Vector256.ConditionalSelect(notSelectedMask, Vector256<byte>.Zero, items);
        return Vector256.SumWidening(selected); // <-- Used here
    }
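For reference, a scalar model (hypothetical Python helper, for clarity only) of what ConditionalSumByBitSet computes: sum items[i] for every i whose bit is set in the 32-bit bitset. The vector version gets there by broadcasting the bitset, shuffling each source byte to the eight lanes that test one of its bits, AND-masking each lane with the power of two for its bit, and conditionally selecting before the final widening sum.

```python
# Scalar reference (illustration only) for the vectorized
# ConditionalSumByBitSet above.

def conditional_sum_by_bitset(bitset, items):
    assert len(items) == 32
    # Sum every item whose corresponding bit in the 32-bit bitset is set.
    return sum(items[i] for i in range(32) if (bitset >> i) & 1)

# Bits 0, 1, and 3 are set, so the result is 10 + 20 + 40 = 70.
print(conditional_sum_by_bitset(0b1011, [10, 20, 30, 40] + [0] * 28))  # -> 70
```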

Alternative Designs

No response

Risks

None known

@poizan42 poizan42 added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Apr 19, 2025
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Apr 19, 2025

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@tannergooding
Member

If you want an overflowing sum, then you likely shouldn't be vectorizing like that and will effectively be falling back to "scalar" logic regardless. There are some potential minor optimizations possible if doing a sum across an entire collection, but for an individual VectorXXX<T> there isn't really anything feasible as it breaks down to doing almost sequential sums plus overflow checks which is likely more expensive than simply doing the standard scalar loop instead.

There's also complexities here in deciding whether to throw if "anything fails" or only if the total sum would overflow. For example, given an sbyte[] of [127, 1] it is clear that it should overflow, but what about [127, 1, -1]? Should that overflow when it computes the intermediate sum of 128 or should it not throw because the final sum is 127 and therefore still in range? These types of complexities further influence what could be done with any vectorized algorithm and the cost/overhead (which further reduces any benefits you might get from vectorization).

@poizan42
Contributor Author

poizan42 commented Apr 19, 2025

If you want an overflowing sum,

That is what we already have. I want a non-overflowing widening sum, as the title says. My benchmarks show an easy 40-fold performance win from vectorizing with Avx2 over scalar addition when summing the bytes in a Vector256.

There's also complexities here in deciding whether to throw if "anything fails" or only if the total sum would overflow. For example, given an sbyte[] of [127, 1] it is clear that it should overflow, but what about [127, 1, -1]? Should that overflow when it computes the intermediate sum of 128 or should it not throw because the final sum is 127 and therefore still in range? These types of complexities further influence what could be done with any vectorized algorithm and the cost/overhead (which further reduces any benefits you might get from vectorization).

Again, it's a non-overflowing widening sum. Obviously the sum of the sbytes [127, 1] is the short 128.

@poizan42
Contributor Author

poizan42 commented Apr 19, 2025

Benchmarking of some different implementations for Vector256 with AVX2

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3476)
Intel Core i9-9900 CPU 3.10GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-preview.3.25201.16
  [Host]     : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2
  DefaultJob : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2


| Method                 | Mean       | Error     | StdDev    | Code Size |
|------------------------|-----------:|----------:|----------:|----------:|
| ScalarSum              | 10.1521 ns | 0.1264 ns | 0.1120 ns | 421 B     |
| LoopSum                | 15.5248 ns | 0.2052 ns | 0.1919 ns | 58 B      |
| WidenAddSum            | 0.6930 ns  | 0.0220 ns | 0.0205 ns | 81 B      |
| SADSum                 | 0.2400 ns  | 0.0139 ns | 0.0130 ns | 65 B      |
| SADSum64               | 0.3418 ns  | 0.0148 ns | 0.0131 ns | 45 B      |
| SADShuffleWordSum      | 1.1476 ns  | 0.0521 ns | 0.0558 ns | 85 B      |
| SADAddSwapAdd16        | 0.3827 ns  | 0.0394 ns | 0.0349 ns | 43 B      |
| SADAddSwapAdd16_NoCast | 0.2878 ns  | 0.0171 ns | 0.0152 ns | 40 B      |
| SADAddSwapAdd32        | 0.2421 ns  | 0.0202 ns | 0.0189 ns | 40 B      |
| SADAddSwapAdd64        | 0.3052 ns  | 0.0287 ns | 0.0269 ns | 41 B      |
Benchmark code
using System.Diagnostics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

namespace BitsetConditionalSumBenchmark;

[DisassemblyDiagnoser(printInstructionAddresses: true, syntax: DisassemblySyntax.Masm)]
public class SumVector256Bytes
{
    private Vector256<byte> vec;
    private uint expectedSum;

    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);
        var data = GC.AllocateUninitializedArray<byte>(32);
        random.NextBytes(data);
        vec = Vector256.LoadUnsafe(ref data[0]);
        for (int i = 0; i < data.Length; i++)
        {
            expectedSum += data[i];
        }
    }

    [Benchmark]
    public uint ScalarSum()
    {
        var v = vec;
        var sum = (uint)(
            v[0]
            + v[1]
            + v[2]
            + v[3]
            + v[4]
            + v[5]
            + v[6]
            + v[7]
            + v[8]
            + v[9]
            + v[10]
            + v[11]
            + v[12]
            + v[13]
            + v[14]
            + v[15]
            + v[16]
            + v[17]
            + v[18]
            + v[19]
            + v[20]
            + v[21]
            + v[22]
            + v[23]
            + v[24]
            + v[25]
            + v[26]
            + v[27]
            + v[28]
            + v[29]
            + v[30]
            + v[31]);
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public uint LoopSum()
    {
        var v = vec;
        uint sum = 0;
        for (int i = 0; i < 32; i++)
        {
            sum += v[i];
        }
        return sum;
    }

    [Benchmark]
    public uint WidenAddSum()
    {
        var (vlo, vhi) = Vector256.Widen(vec);
        var vsum = Vector256.Add(vlo, vhi);
        var sum = Vector256.Sum(vsum);
        Debug.Assert(sum == expectedSum);
        return sum;
    }
    
    [Benchmark]
    public uint SADSum()
    {
        var v = vec;
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        var sum = Vector256.Sum(sad);
        Debug.Assert(sum == expectedSum);
        return sum;
    }
   
    
    [Benchmark]
    public uint SADSum64()
    {
        var v = vec;
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        var sad64 = sad.AsUInt64();
        var sum = (uint)Vector256.Sum(sad64);
        Debug.Assert(sum == expectedSum);
        return sum;
    }
    
    [Benchmark]
    public uint SADShuffleWordSum()
    {
        var v = vec;
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        var shuffled = Avx2.Shuffle(sad.AsByte(), Vector256.Create(
             0,  1,  8,  9, 16, 17, 24, 25,
            -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1).AsByte()).AsUInt16();
        var sum = Vector64.Sum(shuffled.GetLower().GetLower());
        Debug.Assert(sum == expectedSum);
        return sum;
    }
    
    [Benchmark]
    public uint SADAddSwapAdd16()
    {
        var v = vec;
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        Vector128<ushort> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt16();
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2[0];
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public uint SADAddSwapAdd16_NoCast()
    {
        var v = vec;
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        Vector128<ushort> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt16();
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2.AsUInt32()[0];
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public uint SADAddSwapAdd32()
    {
        var v = vec;
        Vector256<uint> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero).AsUInt32();
        Vector128<uint> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control);
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2[0];
        Debug.Assert(sum == expectedSum);
        return sum;
    }
    
    [Benchmark]
    public uint SADAddSwapAdd64()
    {
        var v = vec;
        Vector256<ulong> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero).AsUInt64();
        Vector128<ulong> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt64();
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = (uint)s2[0];
        Debug.Assert(sum == expectedSum);
        return sum;
    }
}
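As an aside, the `control` immediate used by the SADAddSwapAdd* benchmarks is easiest to read through PSHUFD's semantics (output lane i takes source lane `(control >> 2*i) & 3`); a small Python model (illustration only):

```python
# Model (illustration only) of the 4 x 32-bit PSHUFD shuffle with an
# 8-bit immediate, as emitted for Sse2.Shuffle(Vector128<uint>, control).

def pshufd(src, control):
    # Output lane i takes source lane (control >> 2*i) & 3.
    return [src[(control >> (2 * i)) & 3] for i in range(4)]

# The constant used in the benchmarks selects lanes [2, 3, 0, 1], i.e.
# it swaps the two 64-bit halves so they can be added together.
control = (1 << 6) | (0 << 4) | (3 << 2) | 2
print(hex(control))                        # -> 0x4e
print(pshufd([10, 11, 12, 13], control))   # -> [12, 13, 10, 11]
```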

@poizan42 poizan42 changed the title [API Proposal]: Add non-overflowing sum to Vector types [API Proposal]: Add non-overflowing widening sum to Vector types Apr 19, 2025
@tannergooding
Member

That is what we already have.

Sorry, I meant one that throws on overflow.

I want a non-overflowing widening sum as the title says

This is already possible and is effectively Sum(WidenLower(vector) + WidenUpper(vector)); this is exactly what the underlying implementation would do for a theoretical WidenedSum (although the name matching existing conventions would probably be SumWidening).

I don't think this is common enough to warrant a built-in API or the relevant JIT/etc. handling; it's a trivial extension method to add (particularly with the new C# 14 extension members feature: https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-14#extension-members), and it's not what you would end up using for a typical SIMD primitive that forms part of some other algorithm.

Benchmarking of some different implementations for Vector256 with AVX2

Most of these are too small to measure accurately and are below what the hardware timer can resolve. Many of them are visibly non-representative because they show measurements shorter than a single cycle of the 3.10GHz CPU.

There is a tracking issue for this here: dotnet/BenchmarkDotNet#1802. My personal recommendation is to ensure you've got at least enough "computational work" to ensure a 100-300ns "mean time" as this helps account for any inherent latency in terms of cache misses (~20-200 cycles), the general querying of the hardware timer (~30 cycles), branch mispredicts (~20 cycles), and general system overhead.
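For context on the precision point, a quick back-of-the-envelope check (Python, illustration only): one cycle at 3.10 GHz is about 0.32 ns, so several means in the table above (e.g. 0.24 ns for SADSum) report less than one CPU cycle per invocation.

```python
# One cycle of a 3.10 GHz core takes 1 / 3.10 nanoseconds, so any reported
# mean below that is smaller than a single cycle of work.

cycle_ns = 1 / 3.10
print(round(cycle_ns, 3))   # -> 0.323
print(0.24 < cycle_ns)      # -> True: the 0.24 ns SADSum mean is sub-cycle
```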

@poizan42
Contributor Author

Most of these are too small to measure accurately and are below what the hardware timer can resolve. Many of them are visibly non-representative because they show measurements shorter than a single cycle of the 3.10GHz CPU.

There is a tracking issue for this here: dotnet/BenchmarkDotNet#1802. My personal recommendation is to ensure you've got at least enough "computational work" to ensure a 100-300ns "mean time" as this helps account for any inherent latency in terms of cache misses (~20-200 cycles), the general querying of the hardware timer (~30 cycles), branch mispredicts (~20 cycles), and general system overhead.

That's disappointing to hear; BenchmarkDotNet certainly tries to give the impression that it takes care of all of those issues for you.

I have tried making it do more work here.

Benchmark code
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

namespace BitsetConditionalSumBenchmark;

[DisassemblyDiagnoser(printInstructionAddresses: true, syntax: DisassemblySyntax.Masm)]
public class SumVector256Bytes
{
    private int N = 1000;
    private TestData[] casesData = null!;
    private struct TestData
    {
        public Vector256<byte> Vec;
        public uint ExpectedSum;
    }

    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);
        casesData = GC.AllocateUninitializedArray<TestData>(N);
        var data = GC.AllocateUninitializedArray<byte>(32);
        for (int n = 0; n < casesData.Length; n++)
        {
            random.NextBytes(data);
            var vec = Vector256.LoadUnsafe(ref data[0]);
            uint expectedSum = 0;
            for (int i = 0; i < data.Length; i++)
            {
                expectedSum += data[i];
            }
            casesData[n] = new() { Vec = vec, ExpectedSum = expectedSum };
        }
    }
    
    [Benchmark]
    public ulong RawOverhead()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += casesData[n].ExpectedSum;
        }
        return sum;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint ScalarSumIteration(Vector256<byte> v, uint expectedSum)
    {
        var sum = (uint)(
            v[0]
            + v[1]
            + v[2]
            + v[3]
            + v[4]
            + v[5]
            + v[6]
            + v[7]
            + v[8]
            + v[9]
            + v[10]
            + v[11]
            + v[12]
            + v[13]
            + v[14]
            + v[15]
            + v[16]
            + v[17]
            + v[18]
            + v[19]
            + v[20]
            + v[21]
            + v[22]
            + v[23]
            + v[24]
            + v[25]
            + v[26]
            + v[27]
            + v[28]
            + v[29]
            + v[30]
            + v[31]);
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public ulong ScalarSum()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += ScalarSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint LoopSumIteration(Vector256<byte> v, uint expectedSum)
    {
        uint sum = 0;
        for (int i = 0; i < 32; i++)
        {
            sum += v[i];
        }
        return sum;
    }

    [Benchmark]
    public ulong LoopSum()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += LoopSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint WidenAddSumIteration(Vector256<byte> v, uint expectedSum)
    {
        var (vlo, vhi) = Vector256.Widen(v);
        var vsum = Vector256.Add(vlo, vhi);
        var sum = Vector256.Sum(vsum);
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public ulong WidenAddSum()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += WidenAddSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint SADSumIteration(Vector256<byte> v, uint expectedSum)
    {
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        var sum = Vector256.Sum(sad);
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public ulong SADSum()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += SADSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint SADSum64Iteration(Vector256<byte> v, uint expectedSum)
    {
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        var sad64 = sad.AsUInt64();
        var sum = (uint)Vector256.Sum(sad64);
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public ulong SADSum64()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += SADSum64Iteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }


    /*[MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint SADShuffleWordSumIteration(Vector256<byte> v, uint expectedSum)
    {
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        var shuffled = Avx2.Shuffle(sad.AsByte(), Vector256.Create(
             0,  1,  8,  9, 16, 17, 24, 25,
            -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1,
            -1, -1, -1, -1, -1, -1, -1, -1).AsByte()).AsUInt16();
        var sum = Vector64.Sum(shuffled.GetLower().GetLower());
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    
    [Benchmark]
    public ulong SADShuffleWordSum()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += SADShuffleWordSumIteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }*/

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint SADAddSwapAdd16Iteration(Vector256<byte> v, uint expectedSum)
    {
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        Vector128<ushort> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt16();
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2[0];
        Debug.Assert(sum == expectedSum);
        return sum;
    }
    
    [Benchmark]
    public ulong SADAddSwapAdd16()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += SADAddSwapAdd16Iteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint SADAddSwapAdd16_NoCastIteration(Vector256<byte> v, uint expectedSum)
    {
        Vector256<ushort> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero);
        Vector128<ushort> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt16();
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2.AsUInt32()[0];
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public ulong SADAddSwapAdd16_NoCast()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += SADAddSwapAdd16_NoCastIteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint SADAddSwapAdd32Iteration(Vector256<byte> v, uint expectedSum)
    {
        Vector256<uint> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero).AsUInt32();
        Vector128<uint> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control);
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = s2[0];
        Debug.Assert(sum == expectedSum);
        return sum;
    }

    [Benchmark]
    public ulong SADAddSwapAdd32()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += SADAddSwapAdd32Iteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static uint SADAddSwapAdd64Iteration(Vector256<byte> v, uint expectedSum)
    {
        Vector256<ulong> sad = Avx2.SumAbsoluteDifferences(v, Vector256<byte>.Zero).AsUInt64();
        Vector128<ulong> s1 = Sse2.Add(sad.GetUpper(), sad.GetLower());
        const byte control =
            (1 << 6) |
            (0 << 4) |
            (3 << 2) |
            2;
        var s1Hi = Sse2.Shuffle(s1.AsUInt32(), control).AsUInt64();
        var s2 = Sse2.Add(s1, s1Hi);
        var sum = (uint)s2[0];
        Debug.Assert(sum == expectedSum);
        return sum;
    }
    
    [Benchmark]
    public ulong SADAddSwapAdd64()
    {
        ulong sum = 0;
        for (int n = 0; n < casesData.Length; n++)
        {
            sum += SADAddSwapAdd64Iteration(casesData[n].Vec, casesData[n].ExpectedSum);
        }
        return sum;
    }
}
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3476)
Intel Core i9-9900 CPU 3.10GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK 10.0.100-preview.3.25201.16
  [Host]     : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2
  DefaultJob : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX2
| Method                 | Mean        | Error    | StdDev   | Code Size |
|------------------------|------------:|---------:|---------:|----------:|
| RawOverhead            | 583.7 ns    | 11.45 ns | 12.26 ns | 74 B      |
| ScalarSum              | 11,378.3 ns | 55.37 ns | 51.79 ns | 536 B     |
| LoopSum                | 11,145.7 ns | 48.45 ns | 45.32 ns | 116 B     |
| WidenAddSum            | 2,445.4 ns  | 38.77 ns | 32.37 ns | 147 B     |
| SADSum                 | 1,958.1 ns  | 15.69 ns | 13.91 ns | 131 B     |
| SADSum64               | 1,459.4 ns  | 16.33 ns | 15.27 ns | 112 B     |
| SADAddSwapAdd16        | 1,389.4 ns  | 20.52 ns | 19.20 ns | 109 B     |
| SADAddSwapAdd16_NoCast | 1,466.2 ns  | 10.30 ns | 9.63 ns  | 105 B     |
| SADAddSwapAdd32        | 1,234.5 ns  | 10.62 ns | 9.42 ns  | 105 B     |
| SADAddSwapAdd64        | 1,388.3 ns  | 7.25 ns  | 6.42 ns  | 108 B     |

Assuming my attempted overhead measurement is useful, the SADAddSwapAdd32 implementation is still about 2.8 times faster than WidenAddSum (which does Sum(WidenLower(vector) + WidenUpper(vector))).

@tannergooding
Member

BenchmarkDotNet is certainly trying to give the impression that it should take care of all those issues for you.

Yes, but there is ultimately only so far it can go. Things which take less time than the timer's precision will naturally get mismeasured (it's technically possible to do, but immensely complex and error prone), so they will inherently have some run to run flakiness simply due to how computers work.

Even in your own code RawOverhead isn't strictly "correct" and is including its own noise and additional overhead that isn't strictly representative of the other code paths.

the SADAddSwapAdd32 implementation is still about 2.8 times faster than WidenAddSum

This is also very hardware-, type-, and scenario-dependent. On your box SADAddSwapAdd32 is about 49.5% faster, while my Zen4 box only has it 32.9% faster:

BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3902)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 10.0.100-preview.3.25201.16
  [Host]     : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 10.0.0 (10.0.25.17105), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method                 | Mean        | Error     | StdDev    | Code Size |
|------------------------|------------:|----------:|----------:|----------:|
| RawOverhead            | 454.8 ns    | 8.98 ns   | 17.30 ns  | 74 B      |
| ScalarSum              | 8,974.4 ns  | 171.61 ns | 176.23 ns | 536 B     |
| LoopSum                | 11,342.1 ns | 114.81 ns | 101.77 ns | 116 B     |
| WidenAddSum            | 1,028.2 ns  | 20.19 ns  | 28.30 ns  | 147 B     |
| SADSum                 | 904.4 ns    | 17.21 ns  | 15.26 ns  | 131 B     |
| SADSum64               | 765.0 ns    | 14.84 ns  | 22.21 ns  | 112 B     |
| SADAddSwapAdd16        | 728.9 ns    | 13.99 ns  | 17.69 ns  | 109 B     |
| SADAddSwapAdd16_NoCast | 683.4 ns    | 13.54 ns  | 17.60 ns  | 105 B     |
| SADAddSwapAdd32        | 689.8 ns    | 13.68 ns  | 24.32 ns  | 105 B     |
| SADAddSwapAdd64        | 734.6 ns    | 14.67 ns  | 24.10 ns  | 108 B     |

However, neither of these is strictly representative, as the SumAbsoluteDifferences trick only works for byte; it isn't available for sbyte, short, ushort, int, uint, long, or ulong. Additionally, WidenSum isn't what you want at all for a bigger loop in the first place, because it bottlenecks on doing a horizontal reduction every single iteration. You also only get 1 dispatch port for psadbw while you get 2-4 for padd*, so if such a loop actually matters for perf to that degree then even with psadbw you're potentially leaving a significant bit of perf on the table, since you could otherwise be saturating the CPU and be bottlenecked only by your RAM speed. -- The WidenSum approach is also notably not quite "correct" if you want to prevent overflow, as summing a byte[] into a ushort can overflow in as few as 256 inputs. As indicated in my earlier reply, it's a complex space with many nuances and considerations.

With a simple tweak, you can trivially beat the SAD implementation:

| Method              | Mean     | Error    | StdDev   | Code Size |
|---------------------|---------:|---------:|---------:|----------:|
| WidenAddSumSeparate | 572.9 ns | 11.42 ns | 17.79 ns | 139 B     |
| WidenAddSumV512     | 471.4 ns | 9.08 ns  | 9.33 ns  | 138 B     |

Given:

[Benchmark]
public ulong WidenAddSumSeparate()
{
    Vector256<ushort> sum = Vector256<ushort>.Zero;
    for (int n = 0; n < casesData.Length; n++)
    {
        var v = casesData[n].Vec;
        sum += Vector256.WidenLower(v) + Vector256.WidenUpper(v);
    }
    return Vector256.Sum(sum);
}

[Benchmark]
public ulong WidenAddSumV512()
{
    Vector512<ushort> sum = Vector512<ushort>.Zero;
    for (int n = 0; n < casesData.Length; n++)
    {
        sum += Avx512BW.ConvertToVector512UInt16(casesData[n].Vec);
    }
    return Vector512.Sum(sum);
}
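The pattern behind WidenAddSumSeparate and WidenAddSumV512, modeled in Python (illustration only): keep widened partial sums in a vector accumulator across the loop and do the horizontal reduction exactly once at the end.

```python
# Accumulate-then-reduce sketch (illustration only): lane_count models the
# 16 ushort lanes of a Vector256<ushort> accumulator.

def sum_bytes_accumulate_then_reduce(data, lane_count=16):
    assert len(data) % (2 * lane_count) == 0  # simplification for the sketch
    acc = [0] * lane_count
    for start in range(0, len(data), 2 * lane_count):
        lo = data[start:start + lane_count]                    # ~ WidenLower
        hi = data[start + lane_count:start + 2 * lane_count]   # ~ WidenUpper
        acc = [a + l + h for a, l, h in zip(acc, lo, hi)]      # ~ vector Add
    return sum(acc)  # the single final horizontal Vector256.Sum

print(sum_bytes_accumulate_then_reduce(list(range(256))))  # -> 32640
```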

All these microbenchmarks are also a bit skewed because they aren't set up like a real world example. In a more realistic example you likely wouldn't have some TestData[] with gaps between the values being summed, and you likely wouldn't have a Vector256<byte>[] either. You'd instead most likely have a byte[] and be operating on that, which maximizes prefetch, cache coherency, and other considerations. You might even have a byte* instead if alignment matters, might take advantage of non-temporal loads/stores, and various other considerations.

Even without pulling out "all the tricks" and simply updating the loops to be a bit more representative (plus adding a couple additional examples) where you operate over byte[], you can see a big change in all the tests (and if I spent a little more time, I could likely get another 25-50% speedup):

| Method                      | Mean        | Error     | StdDev    | Code Size |
|-----------------------------|------------:|----------:|----------:|----------:|
| ScalarSum                   | 10,618.5 ns | 207.73 ns | 310.92 ns | 68 B      |
| ScalarSumUnrolledx4         | 3,431.7 ns  | 59.02 ns  | 55.21 ns  | 88 B      |
| WidenAddSum                 | 929.5 ns    | 7.05 ns   | 5.89 ns   | 124 B     |
| WidenAddSumSeparate         | 440.4 ns    | 7.71 ns   | 7.21 ns   | 129 B     |
| WidenAddSumV512             | 380.9 ns    | 6.98 ns   | 6.53 ns   | 125 B     |
| WidenAddSumUnrolledV512x2   | 288.1 ns    | 4.01 ns   | 3.94 ns   | 148 B     |
| WidenAddSumUnrolledV512x4   | 258.5 ns    | 4.27 ns   | 3.79 ns   | 202 B     |
| SADSum                      | 731.7 ns    | 10.69 ns  | 9.99 ns   | 108 B     |
| SADIntermediateThenSumFinal | 304.5 ns    | 6.01 ns   | 6.44 ns   | 113 B     |
| SADSum64                    | 554.5 ns    | 9.78 ns   | 9.15 ns   | 97 B      |
| SADAddSwapAdd16             | 525.4 ns    | 8.96 ns   | 9.58 ns   | 94 B      |
| SADAddSwapAdd16_NoCast      | 490.8 ns    | 6.86 ns   | 6.42 ns   | 82 B      |
| SADAddSwapAdd32             | 495.9 ns    | 9.66 ns   | 9.49 ns   | 82 B      |
| SADAddSwapAdd64             | 525.6 ns    | 10.18 ns  | 9.52 ns   | 93 B      |

Where SADIntermediateThenSumFinal is an example that accumulates the SAD results in a Vector256<ushort> intermediate then calls Sum once at the end of the loop.


So all in all, WidenSum is not something I'm interested in adding. It's a bit of a pit of failure and not what users actually want or should be doing, particularly not for summing large arrays of inputs.

However, I would consider exposing a U TensorPrimitives.SumWidening<T, U>(ReadOnlySpan<T> x) where T : INumberBase<T> where U : INumberBase<U> API that takes in a T[] and computes a U result, so that users don't have to think about any of the complexities here and can get something around 125-200ns instead.
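If such an API existed, usage would presumably collapse to a one-liner. This is purely hypothetical: SumWidening is only proposed in this thread, and GetBytes is a placeholder for wherever the input comes from.

```csharp
using System.Numerics.Tensors;

// Hypothetical: TensorPrimitives.SumWidening<T, U> does not exist yet; the
// signature is the one proposed above.
byte[] data = GetBytes(); // placeholder input source
ushort total = TensorPrimitives.SumWidening<byte, ushort>(data);
```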

@poizan42
Contributor Author

[Benchmark]
public ulong WidenAddSumSeparate()
{
    Vector256<ushort> sum = Vector256<ushort>.Zero;
    for (int n = 0; n < casesData.Length; n++)
    {
        var v = casesData[n].Vec;
        sum += Vector256.WidenLower(v) + Vector256.WidenUpper(v);
    }
    return Vector256.Sum(sum);
}

My goal isn't to sum a lot of bytes from memory together, that was just an attempt to give more work for BenchmarkDotNet. The method in the example I gave above was what I originally needed the widening sum for:

    public static ushort ConditionalSumByBitSet(uint bitset, Vector256<byte> items)
    {
        Vector256<byte> xbcast = Vector256.Create(bitset).AsByte();

        // Each byte gets the source byte containing the corresponding bit
        Vector256<byte> indices = Vector256.Create(
            0x0000000000000000UL,
            0x0101010101010101UL,
            0x1E1E1E1E1E1E1E1EUL,
            0x1F1F1F1F1F1F1F1FUL).AsByte();
        
        Vector256<byte> shuf = Vector256.Shuffle(xbcast, indices);
        Vector256<byte> andmask = Vector256.Create(0x8040201008040201UL).AsByte(); // one bit per byte position
        Vector256<byte> isolated = Vector256.BitwiseAnd(shuf, andmask);
        Vector256<byte> notSelectedMask = Vector256.Equals(isolated, Vector256<byte>.Zero);
        
        Vector256<byte> selected = Vector256.ConditionalSelect(notSelectedMask, Vector256<byte>.Zero, items);
        return Vector256.SumWidening(selected); // <-- Used here
    }

It's to calculate the size of a variable-layout structure with a lot of fields of different sizes that may or may not be present, as specified by a bitset. I don't know if there is a better way to do it, but I went down the rabbit hole of the framework not having an optimized sum function readily available that I could use to test the approach. (There certainly are better ways if I had Avx512, but even if the server where this ends up running has it, I currently have no easy way of benchmarking it beforehand.)

@poizan42
Contributor Author

> I would consider exposing a U TensorPrimitives.SumWidening<T, U>(ReadOnlySpan<T> x) where T : INumberBase<T> where U : INumberBase<U> API that takes in a T[] and computes a U result, so that users don't have to think about any of the complexities here and they can get something around 125-200ns instead, however.

That would be useful, but I'd imagine it would be a lot of work for the JIT to be able to optimize it down to something like the SAD+Sum if I just called it as

    Span<byte> selectedBytes = stackalloc byte[32];
    selected.CopyTo(selectedBytes);
    return TensorPrimitives.SumWidening<byte, ushort>(selectedBytes);

or (if this is even legal; I'm not seeing anything saying it isn't, so I guess the JIT has to figure out what to do):

    return TensorPrimitives.SumWidening<byte, ushort>(MemoryMarshal.AsBytes(new Span<Vector256<byte>>(ref selected)));

If you think that is feasible then there is no need for adding it as a new primitive for each Vector* type.

@tannergooding
Member

> then there is no need for adding it as a new primitive for each Vector* type.

The consideration is more that this API is not something that will be added as a new API on the Vector* types. As covered above, there are many pits of failure with such an API; it isn't something that is common enough to warrant it, and it is incredibly unlikely to ever be a performance bottleneck.

If you really have the case of just needing to do it precisely once over a single vector, then it is a scenario-specific thing and something that can be hand-rolled for that scenario using the existing primitives. There are a few patterns that might be optimizable by the JIT, but the couple-of-cycles perf difference for a single vector isn't likely to matter, so adding such pattern recognition would be low priority.

For more typical scenarios, you're going to be summing things in a loop and so such an API is the exact opposite of what you want and will hurt perf more than it will help. Instead you want to avoid any reduction across the vector for as long as possible, which will net you nearly a 2x or more throughput improvement.
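The two shapes can be sketched like this (method names are mine, not from the thread); the first does a horizontal reduction on every iteration, while the second keeps the accumulation in vector registers and reduces across lanes only once:

```csharp
using System;
using System.Runtime.Intrinsics;

// Anti-pattern: a cross-lane reduction inside the loop serializes every iteration.
static uint SumPerIteration(ReadOnlySpan<Vector256<ushort>> vectors)
{
    uint total = 0;
    foreach (Vector256<ushort> v in vectors)
        total += Vector256.Sum(Vector256.WidenLower(v) + Vector256.WidenUpper(v));
    return total;
}

// Preferred: accumulate element-wise in a vector, reduce once at the end.
static uint SumDeferred(ReadOnlySpan<Vector256<ushort>> vectors)
{
    Vector256<uint> acc = Vector256<uint>.Zero;
    foreach (Vector256<ushort> v in vectors)
        acc += Vector256.WidenLower(v) + Vector256.WidenUpper(v);
    return Vector256.Sum(acc);
}
```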

@tannergooding tannergooding closed this as not planned Won't fix, can't repro, duplicate, stale Apr 28, 2025
@dotnet-policy-service dotnet-policy-service bot removed the untriaged New issue has not been triaged by the area owner label Apr 28, 2025
@github-actions github-actions bot locked and limited conversation to collaborators May 29, 2025