Description
Background and motivation
Currently FMA is exposed for primitives (double & float) and full-blown SIMD vectors, but AFAIK there is nothing for convenience primitives like Vector2/3/4, which sit between them. FMA isn't only about perf (on hardware that has built-in FMA/SIMD FMA, of course) but also about avoiding intermediate rounding.
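To make the intermediate-rounding point concrete, here is a minimal scalar sketch using the existing MathF.FusedMultiplyAdd. The class and method names (FmaRoundingDemo, MultiplyThenAdd, Fused, Demo) are made up for illustration; the inputs are chosen so that the exact product 1 - 2^-26 is not representable as a float.

```csharp
using System;

// Hypothetical demo type, for illustration only.
public static class FmaRoundingDemo
{
    // Unfused: the product is rounded to float before the add.
    // The explicit cast forces that rounding even if intermediates
    // are evaluated at higher precision.
    public static float MultiplyThenAdd(float a, float b, float c)
        => (float)(a * b) + c;

    // Fused: a * b + c is computed as if exactly, then rounded once.
    public static float Fused(float a, float b, float c)
        => MathF.FusedMultiplyAdd(a, b, c);

    public static void Demo()
    {
        float eps = 1f / 8192f; // 2^-13, exactly representable
        float a = 1f + eps;
        float b = 1f - eps;
        float c = -1f;

        // The exact product 1 - 2^-26 rounds to 1.0f, so the add yields 0.
        Console.WriteLine(MultiplyThenAdd(a, b, c));       // 0
        // The fused form keeps the -2^-26 term.
        Console.WriteLine(Fused(a, b, c) == -(eps * eps)); // True
    }
}
```

The unfused result loses the small term entirely, while the fused result keeps it, which is the behaviour the proposal wants to make reachable from Vector2/3/4.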
API Proposal
namespace System.Numerics;

public struct Vector2
{
+    public static Vector2 MultiplyAddEstimate(Vector2 x, Vector2 y, Vector2 z);
}

public struct Vector3
{
+    public static Vector3 MultiplyAddEstimate(Vector3 x, Vector3 y, Vector3 z);
}

public struct Vector4
{
+    public static Vector4 MultiplyAddEstimate(Vector4 x, Vector4 y, Vector4 z);
}
Under the hood, Vector2 could use Vector64<float> or MathF.FusedMultiplyAdd where applicable and faster. Vector3, I imagine, would likely widen to Vector128<float> with the last element set to 0, since that element is discarded when returning, while Vector4 would be used as-is as Vector128<float>.
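A rough sketch of the Vector3 case described above, assuming .NET 5 or later so that both the x86 and Arm intrinsics are available. Vector3FmaSketch, Widen and Narrow are hypothetical names, and this is only one possible shape for such a helper, not how the runtime would necessarily implement it.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, not part of System.Numerics.
public static class Vector3FmaSketch
{
    public static Vector3 MultiplyAddEstimate(Vector3 x, Vector3 y, Vector3 z)
    {
        if (Fma.IsSupported)
        {
            // x86/x64: (a * b) + c per lane with a single rounding.
            return Narrow(Fma.MultiplyAdd(Widen(x), Widen(y), Widen(z)));
        }

        if (AdvSimd.IsSupported)
        {
            // Arm: the addend is the first argument.
            return Narrow(AdvSimd.FusedMultiplyAdd(Widen(z), Widen(x), Widen(y)));
        }

        // No hardware FMA: plain multiply then add, two roundings per component.
        return (x * y) + z;
    }

    // Widen to four lanes, zeroing the unused last lane.
    private static Vector128<float> Widen(Vector3 v)
        => Vector128.Create(v.X, v.Y, v.Z, 0f);

    // Drop the last lane again when returning.
    private static Vector3 Narrow(Vector128<float> v)
        => new Vector3(v.GetElement(0), v.GetElement(1), v.GetElement(2));
}
```

Vector4 would follow the same pattern without the zeroed lane. Whether Vector2 is better served by Vector64<float> or by scalar calls is left open here, as in the text above.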
The software fallback would be a simple component-wise (a * b) + c for perf reasons, hence the Estimate suffix, since the software fallback would differ in rounding behaviour for very large components.
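One way the large-component difference can show up, sketched with Vector2: the intermediate product overflows to infinity in the simple fallback, while a fused computation rounds only the final, finite result. FallbackRoundingDemo, Fallback and Strict are illustrative names, and Strict is just a per-component use of MathF.FusedMultiplyAdd, not a proposed API.

```csharp
using System;
using System.Numerics;

// Hypothetical demo type, for illustration only.
public static class FallbackRoundingDemo
{
    // The simple fallback: component-wise (x * y) + z, two roundings.
    public static Vector2 Fallback(Vector2 x, Vector2 y, Vector2 z)
        => (x * y) + z;

    // A strictly fused variant: one scalar FMA per component.
    public static Vector2 Strict(Vector2 x, Vector2 y, Vector2 z)
        => new Vector2(
            MathF.FusedMultiplyAdd(x.X, y.X, z.X),
            MathF.FusedMultiplyAdd(x.Y, y.Y, z.Y));

    public static void Demo()
    {
        var x = new Vector2(2e38f, 1f);
        var y = new Vector2(2f, 1f);
        var z = new Vector2(-3e38f, 1f);

        // 2e38 * 2 exceeds float.MaxValue, so the unfused X overflows.
        Console.WriteLine(float.IsPositiveInfinity(Fallback(x, y, z).X)); // True
        // The fused X is about 1e38, which is finite.
        Console.WriteLine(float.IsFinite(Strict(x, y, z).X));             // True
    }
}
```

With an Estimate-style method, a caller could see either result for X depending on the hardware, which is what the suffix is meant to signal.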
API Usage
var x = Vector3.UnitX;
var y = Vector3.UnitY;
var z = Vector3.UnitZ;
var fma = Vector3.MultiplyAddEstimate(x, y, z);
Alternative Designs
An alternative would be to write a platform-agnostic SIMD FMA (which currently would use S.R.I.x86.FMA and S.R.I.ARM plus a software fallback under the hood), at which point hand-rolling FMA for Vector2/3/4 wouldn't be too bad.
Another alternative is to hand-roll FMA yourself for each component, but that gets uglier the more components there are, and adding SIMD FMA support for perf makes it even worse, especially since there's no platform-agnostic SIMD FMA AFAIK.
Risks
Estimate behaviour in the face of differing hardware support for FMA could be surprising, but that's mostly a documentation exercise, and the Estimate suffix already signals that it's not exactly FusedMultiplyAdd.