I was previously trying to use `std::fma`, but upon further investigation
it was turning into a single vfmadd instruction a lot less often than I thought,
in many cases ending up as a function call (which was itself extra slow
when emulating fma on hardware that lacked it).
The better strategy seems to be to just write `a*b+c`, which the gcc and icc
compilers automatically turn into vfmadd when it is available on the
hardware; adding a clang-specific pragma ensures the same behavior for
clang as well.
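
For illustration, here is a minimal sketch of that pattern. The description above does not name the pragma, so this assumes `#pragma clang fp contract(fast)`, which Clang documents for allowing a multiply and an add to be fused. The function name `madd` and the `float`-only signature are hypothetical, chosen just for this example.

```cpp
// Sketch only: `madd` is a hypothetical name, and the pragma shown is an
// assumption about which clang-specific pragma is meant.
inline float madd(float a, float b, float c)
{
#if defined(__clang__)
    // Permit clang to contract a*b+c into a fused multiply-add in this scope.
    // The pragma must appear at the start of the compound statement.
#    pragma clang fp contract(fast)
#endif
    // Plain expression: the compiler may emit a single FMA instruction when
    // the target supports it, or a separate multiply and add otherwise.
    // Unlike std::fma, this never requires a library call to emulate fusion.
    return a * b + c;
}
```

Whether a fused instruction is actually emitted still depends on the target flags (for example, building for hardware with FMA support), so it is worth checking the generated assembly. A call such as `madd(2.0f, 3.0f, 4.0f)` gives `10.0f` either way, since that result is exact with or without fusion.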
Thanks to Alex Wells of Intel for pointing this out to me (in OSL land).