
Conversation

@ffevotte ffevotte commented Jul 10, 2020

Most of what I wanted to explore with Float32 is now done, so I'm opening this as a draft PR to let anyone interested have a look and comment. I'm planning to release this as v0.3.5 once it's finished.

Here is a list of the changes (to be) included in this PR:

  • optimized, vectorized, mixed-precision implementations of sums and dot products, relying on a Float64 accumulator for Float32 inputs: sum_mixed and dot_mixed (see the sketch after this list);
  • inclusion of Float32 inputs in all verification tests;
  • the possibility to run performance tests on 32-bit inputs; as a side benefit of this work, generating input data with a given condition number for sums and dot products is now much more reliable;
  • updated README advertising these new features.
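
For reference, here is a minimal, scalar sketch of the mixed-precision idea from the first item above: Float32 inputs are promoted into a Float64 accumulator, which yields a result accurate to roughly the input precision at close to the cost of a naive sum. The actual sum_mixed/dot_mixed kernels in this PR are vectorized and more elaborate; the function name below is purely illustrative.

```julia
# Scalar illustration of a mixed-precision sum (not the PR's vectorized kernel).
function sum_mixed_sketch(x::AbstractVector{Float32})
    acc = 0.0                             # Float64 accumulator
    @inbounds @simd for i in eachindex(x)
        acc += Float64(x[i])              # promote each element before accumulating
    end
    return Float32(acc)                   # round back to the input precision once
end
```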

I'm still benchmarking dot_mixed, but my preliminary conclusions are:

  • mixed-precision implementations are always faster than compensated ones; they should be the preferred choice to get additional accuracy with Float32 inputs;
  • on AVX512 systems, mixed-precision implementations are almost as fast as naive ones; we could probably go as far as suggesting that mixed implementations be used by default (i.e. in Base.sum or LinearAlgebra.dot) for Float32 inputs on newer or high-end systems;
  • I'm considering introducing single, high-level entry points that would dispatch to the most efficient accurate implementation based on the input element types and the CPU being used (a minimal sketch follows this list). These entry points could be unexported functions with common names (e.g. AccurateArithmetic.sum and AccurateArithmetic.dot) that users would need to call explicitly, or exported functions with specific names (like accurate_sum and accurate_dot).
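
To make the last point more concrete, here is a hedged sketch of what such an entry point could look like when dispatching on the element type alone (the CPU-dependent part of the choice is not shown). sum_mixed is the kernel introduced in this PR; sum_oro is assumed here to be the package's existing compensated sum.

```julia
using AccurateArithmetic: sum_mixed, sum_oro

# Unexported entry point: pick an accurate kernel based on the element type.
accurate_sum(x::AbstractVector{Float32}) = sum_mixed(x)         # mixed precision: fast and accurate
accurate_sum(x::AbstractVector{<:AbstractFloat}) = sum_oro(x)   # compensated fallback (assumed kernel)
```

Users would then call AccurateArithmetic.accurate_sum(x) (or an exported accurate_sum) explicitly, leaving Base.sum untouched.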

@ffevotte ffevotte marked this pull request as ready for review July 14, 2020 15:38
@ffevotte ffevotte mentioned this pull request Jul 14, 2020
@ffevotte ffevotte merged commit 74a6445 into master Jul 16, 2020
@ffevotte ffevotte deleted the ff/float32 branch July 16, 2020 22:06
ffevotte referenced this pull request Jul 16, 2020