SIMD vectorization of Array.sum<int>, etc #18509
base: main
Conversation
…ge, Array.sum and Array.average to take advantage of vectorization in System.Linq.Enumerable module.
❗ Release notes required
src/FSharp.Core/array.fs
Outdated
[<CompiledName("Sum")>]
let inline sumFloat (array: float array) : float =
    System.Linq.Enumerable.Sum array

[<CompiledName("Sum")>]
let inline sumFloat32 (array: float32 array) : float32 =
    System.Linq.Enumerable.Sum array

[<CompiledName("Sum")>]
let inline sumInt (array: int array) : int =
    System.Linq.Enumerable.Sum array

[<CompiledName("Sum")>]
let inline sumInt64 (array: int64 array) : int64 =
    System.Linq.Enumerable.Sum array
I would think that this would be a reasonable place to use static optimization syntax to specify which types should delegate to the LINQ method and which to the existing code, e.g.,
let inline sum (array: ^T array) : ^T =
    existingSumCode array
    when ^T : int = System.Linq.Enumerable.Sum array
    when ^T : int64 = System.Linq.Enumerable.Sum array
    …
Sure. I expect "static optimization conditionals" are a compile-time thing and not a runtime one? I can't easily check with sharplab.io; it says "error FS0817: Static optimization conditionals are only for use within the F# library".
Yes, the appropriate implementation would be chosen at compile-time once the type parameter was resolved.
You could add some IL tests under https://github.com/dotnet/fsharp/tree/main/tests/FSharp.Compiler.ComponentTests/EmittedIL if you wanted.
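(For context, a rough sketch of what such an EmittedIL test could look like, assuming the FSharp.Test.Compiler helpers (FSharp, withOptimize, compile, shouldSucceed, verifyIL) used elsewhere in that folder; the asserted IL fragment is illustrative only, not the exact text a real test would need:)

open Xunit
open FSharp.Test.Compiler

[<Fact>]
let ``Array sum over int array forwards to Enumerable.Sum`` () =
    FSharp """
module Test
let sumThem (xs: int array) = Array.sum xs
"""
    |> withOptimize
    |> compile
    |> shouldSucceed
    // check that the optimized code calls into LINQ rather than the generic loop
    |> verifyIL [ "System.Linq.Enumerable::Sum" ]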
It looks like the error case is now different for average over an empty collection. (We should not evaluate a seq before calling into Average, since this would be breaking in a different way.)
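(A quick FSI sketch of that difference, based on the documented behaviour of both APIs: F#'s average of an empty array raises ArgumentException, while Enumerable.Average over an empty sequence raises InvalidOperationException.)

([||]: float array) |> Array.average;;
// throws System.ArgumentException (the input array was empty)

([||]: float array) |> System.Linq.Enumerable.Average;;
// throws System.InvalidOperationException (sequence contains no elements)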
I did some initial tests to see if this makes sense at all or not. I compared the current main and this PR with the following benchmarks:

[<Benchmark>]
member x.ArraySum() = // Array sum
    array
    |> Array.sum
    |> ignore

[<Benchmark>]
member x.ArrayAverage() = // There is an extra map here, because average needs float
    array
    |> Array.map float
    |> Array.average
    |> ignore

[<Benchmark>]
member x.ArraySeqSum() = // Seq sum, using an array as the base data
    array
    |> Seq.sum
    |> ignore

[<Benchmark>]
member x.ListSeqSum() = // Seq sum, using a list as the base data
    list
    |> Seq.sum
    |> ignore

And here are the results with current main:
Here are the results with this PR:
Edit: I used .NET 9 but the netstandard2.0 build of FSharp.Core.dll in both, which was probably a mistake because Spans are truly efficient only on netstandard2.1 (?)
Hmm, I'm thinking: if the default FSI is compiled in debug mode and people use F# as a scripting language, then performance optimizations like this could be marked as …

I ran the same performance tests with the netstandard2.1 version of FSharp.Core.dll.

Main branch:
This PR:
At a quick look, Sum seems to be improved but Average not really. Should I revert the Average change?

Edit: Average removed.
1. As per your benchmark, the cost for average might be dominated by doing the casts and the additional array allocation.
2. The idea of using …
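(A minimal illustration of point 1: the intermediate array created by Array.map float can be avoided by folding the cast into the traversal with Array.averageBy. The benchmark member below is hypothetical and just mirrors the ArrayAverage benchmark shown earlier, reusing its array field.)

[<Benchmark>]
member x.ArrayAverageBy() = // same as ArrayAverage, but without allocating the intermediate float array
    array
    |> Array.averageBy float
    |> ignore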
What are the numbers when running on the full CLR, 4.6-4.8? Enumerable probably has a different implementation there.
Average will now throw a different exception, so it's a breaking change at this point.
@Thorium:
I did check this, and the results were worse for Enumerable.Sum.

This PR - with Enumerable.Sum:
…nchmarks\CompiledCodeBenchmarks\MicroPerf\MicroPerf.fsproj
Because this is just forwarding, the behaviour difference is easy to explore already in FSI:

[| 1.; Double.NaN |] |> Array.sum;;
// val it: float = nan

[| 1.; Double.NaN |] |> System.Linq.Enumerable.Sum;;
// val it: float = nan

Seems to be the same with edge cases like Double.MaxValue and infinity.
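(Another edge case worth probing the same way is integer overflow: Enumerable.Sum over Int32 is documented to throw OverflowException when the sum exceeds Int32.MaxValue, so before forwarding the int overload it is worth confirming in FSI whether both paths behave the same. A sketch of the check, with the outcome deliberately left to be verified rather than asserted here:)

open System

// run in FSI: does the int path overflow the same way on both sides?
[| Int32.MaxValue; 1 |] |> Array.sum;;
[| Int32.MaxValue; 1 |] |> System.Linq.Enumerable.Sum;;
// Enumerable.Sum(IEnumerable<int>) is documented to throw OverflowException on overflow;
// whether Array.sum wraps or throws should be confirmed before the overloads are swapped.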
I don't know how to easily run MicroPerf.fsproj in .NET, but I can run
// * Summary *

BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.3915)
Job=.NET Framework 4.8  Runtime=.NET Framework 4.8
Not exactly the results I was hoping for.
A framework-dependent runtime switch would solve this (and since it is static, I would hope the JIT would eliminate such branching), but it is not something we are doing in FSharp.Core.

On Desktop:
On Core:
It would be great if we could make use of https://learn.microsoft.com/en-us/dotnet/api/system.numerics.tensors.tensorprimitives.sum?view=net-9.0-pp#system-numerics-tensors-tensorprimitives-sum-1(system-readonlyspan((-0))) for some of those implementations. Right now, any such support depending on Tensors will have to be done as a separate lib outside of FSharp.Core.
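(For reference, a minimal sketch of what such a separate library could look like, assuming the generic TensorPrimitives.Sum overload from the System.Numerics.Tensors 9.x package and a .NET 8+ target; this is not something FSharp.Core itself could ship:)

// hypothetical helper in a library outside FSharp.Core
open System
open System.Numerics.Tensors

let sumInts (array: int array) : int =
    // TensorPrimitives.Sum works over spans and uses SIMD where the hardware allows it
    TensorPrimitives.Sum(ReadOnlySpan<int>(array))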
Yeah, that's what I suspected would happen :( The BCL in the full framework doesn't have those functions vectorised. And the issue with fslib is that it ships as netstandard, so there's no way to tailor separate implementations for netfx and netcore.
The
What do you mean by "On Desktop" and "On Core"? I can see that in FSharp.Core.fsproj there is
Reflection can likely negate many optimizations (needs proving, though).
"Desktop" usually refers to the full .NET Framework, and "Core" to when the CoreCLR runtime and BCL are used.
These should already be defined. However, they will be defined on both net48 and net9.0.
I meant a statically stored flag based on something like:

open System.Runtime.InteropServices

let runtimeDescription = RuntimeInformation.FrameworkDescription
let hasLinqAcceleration = runtimeDescription > "..." // string-based logic here

And switching at runtime (so it cannot be part of a statically optimized e.g. top-level function:
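(Putting the flag together with the forwarding, a hedged sketch of how one overload could be gated at runtime; the names isNetFramework and sumInt are illustrative rather than the PR's final code:)

open System.Runtime.InteropServices

// evaluated once at module initialization; old .NET Framework lacks the vectorized LINQ Sum
let private isNetFramework =
    RuntimeInformation.FrameworkDescription.StartsWith ".NET Framework"

[<CompiledName("Sum")>]
let sumInt (array: int array) : int =
    if isNetFramework then
        // existing loop-based fallback
        let mutable acc = 0
        for i = 0 to array.Length - 1 do
            acc <- acc + array.[i]
        acc
    else
        System.Linq.Enumerable.Sum array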
Runtime check added. Compared to this …

This PR:
After …
Is it vectorized as well? Is it doing copies first? I am curious.
Oh, you are right, it isn't. That's sad; it makes this a very marginal feature, especially when tools like FSharpLint prefer sumBy over (map >> sum).
.NET Framework is not .NET Standard 2.1 compatible, thus the .NET Standard 2.1 version shouldn't need the runtime check.
Will all netstandard2.1 implementations be vectorized? If yes, then it's safe enough (and also it should be defined already).
@Thorium: I think the discussion about the future deployments of FSharp.Core derailed the discussion. (It's not that a different strategy for shipping FSharp.Core for …)

I would like this optimization to go into .NET 10 as is. Are you still interested in finalizing this PR and getting it in?
Sure. Is there anything that should be added still? I think all the existing conversation items were already resolved?
Tests were failing, otherwise it was good.
@@ -1588,6 +1587,32 @@ module Array =

        acc

    let isNetFramework = System.Runtime.InteropServices.RuntimeInformation.FrameworkDescription.StartsWith ".NET Framework"
Is this going to functionally compile to a static readonly bool, so that the JIT can optimize things appropriately?

Is this going to do the "wrong" thing on custom runtimes or scenarios where vectorization may not be available or possible? For example, there was no SIMD acceleration on 32-bit Unix for a while and there is none on Arm32 today. Likewise, acceleration can be disabled via environment variables for testing purposes.

In general it's expected that Enumerable.Sum is going to do the most optimal thing over time based on the underlying hardware and other user options (like if you're compiling for size vs speed).
Is this going to functionally compile to a static readonly bool, so that the JIT can optimize things appropriately?

This is a runtime check, not a compile-time check, because nothing says the code is compiled and run on similar machines.

Is this going to do the "wrong" thing on custom runtimes or scenarios where vectorization may not be available or possible? For example, there was no SIMD acceleration on 32-bit Unix for a while and there is none on Arm32 today. Likewise, acceleration can be disabled via environment variables for testing purposes.

No, because Enumerable.Sum already checks that within its implementation. The only reason for this check is that Enumerable.Sum is slow on the old .NET Framework.
This is a runtime check, not a compile-time check, because nothing says the code is compiled and run on similar machines.

Yes, which is why you should ensure it compiles down to a static readonly bool in IL. That will cause it to be initialized at runtime in Tier 0 and then allow the JIT to treat it as a constant in Tier 1 (or for NativeAOT), allowing the check to be elided once we do know the actual machine/runtime it's running on.

The only reason for this check is that Enumerable.Sum is slow on the old .NET Framework.

The point was that you're doing a specific check for .NET Framework, which doesn't account for custom runtimes or other scenarios that may or may not be relevant. So I'm just asking if the nuance of that has been fully considered.
Hi @tannergooding, I think you have it right - Enumerable.Sum will do the best thing that is available on all versions of modern .NET. Even without vectorization, it will still detect when the sequence passed to it can be treated as a ReadOnlySpan.

It just does not apply to the desktop framework, which is still supported and can be used with the latest F# and latest FSharp.Core - there Enumerable.Sum is slower than FSharp.Core's Array.sum.

@tannergooding:
Is there a recommended way to locally benchmark a modern .NET version, but with intentionally disabled vectorization? (To prove that .NET 9/10, even when not vectorized, does not carry the drastic perf regression that the .NET Framework implementation of Enumerable.Sum shows compared to FSharp.Core's Array.sum?)
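(As an aside, one way to try that locally is sketched below, assuming BenchmarkDotNet's per-job environment variables and the runtime's DOTNET_EnableHWIntrinsic configuration knob; the knob name and its effect should be double-checked against the runtime version being benchmarked:)

open BenchmarkDotNet.Configs
open BenchmarkDotNet.Jobs

// two jobs on the same modern runtime: the default one, and one with hardware
// intrinsics (and therefore SIMD acceleration) disabled via an environment variable
let config =
    DefaultConfig.Instance
        .AddJob(Job.Default.WithId("Default"))
        .AddJob(
            Job.Default
                .WithEnvironmentVariable("DOTNET_EnableHWIntrinsic", "0")
                .WithId("NoHWIntrinsics"))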
@T-Gro I think he means that the check should be emitted such that it can be optimized away by the JIT, so that the relevant implementation is used directly depending on the framework - Array.sum on .NET Framework, Enumerable.Sum otherwise.
That is one part of the concern - F#'s emit for let does not use initonly because of the file+module based initialization semantics, which guarantee execution order as written in the file.

The other part of the concern is whether Enumerable.Sum is indeed not worse for any configuration other than .NET Framework, especially environments without vectorization support.
Description
Specific overloads (float, float32, int, int64) of Seq.sum, Seq.average, Array.sum and Array.average to take advantage of vectorization in the System.Linq.Enumerable module.

This is potentially a naive first try to solve #16230 in the spirit of @T-Gro's comment #16230 (comment).
Checklist