
Optimization with 'fast math' produces incorrect results and segfaults #1155

Closed
ischoegl opened this issue Dec 8, 2021 · 10 comments · Fixed by #1161

Comments

@ischoegl
Member

ischoegl commented Dec 8, 2021

Problem description

As reported in #1150, compilation with default optimization options for the Intel compiler suite icx/icpx results in incorrect results. The behavior can be reproduced for gcc with the -ffinite-math-only option (which is one of the -ffast-math flags), e.g. for a vanilla gcc toolchain on Ubuntu 20.04:

$ scons build optimize_flags="-O3 -ffinite-math-only"

The option breaks strict IEEE 754 compliance, as it tells the compiler to "Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs." There are some instances where Cantera uses NaN internally, which presumably break under 'fast math' optimizations.

Steps to reproduce

  1. Compile as indicated above.
  2. An attempt to run the test suite results in numerous segfaults.
  3. Output for gcc with fast math is shown below (same as for icx in #1150, "Optimized Cantera build with the LLVM-based Intel compilers do not work correctly"):
In [3]: gas2 = ct.Solution('gri30.yaml')

In [4]: gas2.partial_molar_cp
Out[4]: 
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.])

In [5]: gas2.partial_molar_entropies
Out[5]: 
array([-1.84618157e-12,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06])

System information

  • Cantera version: 2.5.1, 2.6.0a3
  • OS: Ubuntu 20.04 (and presumably others)
  • Python/MATLAB/other software versions: gcc / icx

@bryanwweber
Member

@ischoegl and @tpg2114, thanks for digging into this! Does -ffast-math actually improve performance for the standard workloads that we run? If it does, then this is probably worth digging into. Otherwise, it might be best to leave the code as-is.

@ischoegl
Member Author

ischoegl commented Dec 8, 2021

Does the -ffast-math actually improve performance for standard workloads that we run?

Based on my understanding, yes. But since IEEE 754 compliance is not enforced (which is where the performance gains come from), this may make for some 'fun' refactoring. I mainly created this issue to make sure that users are aware of the limitation. It's somewhat puzzling that icx enables -fp-model fast by default. FWIW, here's a blog post that I found informative and an SO answer, but I didn't dig very deep.

@bryanwweber
Member

Based on my understanding, yes.

I'm not asking you to do this, but if anyone decides to pursue this issue, it'd be nice to have concrete benchmarks showing that adding these flags improves real-world performance enough to justify the (presumably) increased code complexity of handling this case.

@ischoegl
Member Author

ischoegl commented Dec 9, 2021

I'm not asking you to do this …

no worries. I’ll pass on this one 😜

@speth
Member

speth commented Dec 21, 2021

I investigated this a bit, using Clang++ on Ubuntu 20.04. The only flag that's part of -ffast-math that causes significant problems is -ffinite-math-only. With the rest of the flags enabled by specifying optimize_flags="-O3 -ffast-math -fno-finite-math-only", I get 4 test failures:

  • python: test_kinetics.TestReaction.test_Blowers_Masel_change_enthalpy
  • python: test_reaction.TestBlowersMaselRate.test_from_parts
  • VCS-LiSi-verbose
  • cxx-kinetics1

The first two, at least, are just an issue of assertEquals being used where assertNear would be a better choice, and I think the latter two can be resolved without too much difficulty.

Enabling -ffinite-math-only requires eliminating any check that relies on isnan working, or even on the comparison nan == <some number> returning false. I made some really crude changes to do this in this commit on my fork: speth@5abdaa8, and did some benchmarking of an ignition delay problem using a couple of different mechanisms. For these tests, I used Eigen for the linear algebra and the vendored copy of Sundials, so this is about the maximum impact that these flags can have, since there is very little outside code. I tested both GRI 3.0 and a larger mechanism with ~400 species. What I found was:

  • Compared to the baseline (-O3), using -O3 -ffast-math -fno-finite-math-only is about 2% faster in terms of time steps per second.
  • Compared to the baseline, -O3 -ffast-math is about 4% faster in terms of time steps per second.
  • The differences in the calculations cause the number of time steps needed for an individual simulation to vary unpredictably, with differences for individual simulations of up to 10% and an average around 2.5%, so it's hard to say that these optimizations necessarily lead to higher performance on a wall-time basis.

Given the unsatisfactory nature of the changes required to support using -ffinite-math-only and the relatively small performance gains, my recommendation is that we add a configuration-time check for whether isnan works correctly and if not, abort compilation with an error message stating that Cantera doesn't work with this flag.

@ischoegl
Member Author

Thanks for looking into this further, @speth! The two non-python tests look familiar, and I agree that assertEquals is easy to fix (should be done regardless).

As an aside, one thing I noticed in my own tests was a plethora of warnings coming out of fmt via AnyMap. In addition, there were also several 'legacy' fmt use cases that caused annoying warnings. Fixing what causes these fmt warnings may be the largest issue on hand, as I don't really like the nuclear option that disables these warnings.

Overall, I agree that it probably doesn't make sense to make this a priority. At the same time, it probably also makes sense to avoid using NaN as a sentinel value (which I have recently used), and to avoid exact / bit-wise equality checks in new code. It should be relatively simple to replace the NaN checks in the new reaction rate evaluators, which probably should happen prior to 2.6. In other words, what I'm arguing for is not to make it a priority to fix, but also not to make it harder to fix going forward.

@speth
Member

speth commented Dec 22, 2021

I don't see any warnings related to fmt with either GCC or Clang, so I guess that's specific to the Intel compiler.

I would argue that most of the ways that we use NaN in Cantera are very reasonable, and that the alternatives are often worse. For instance, in the modification I made to the CachedValue class to run with -ffinite-math-only, the initial cache check value now has to be some arbitrary but finite number. And while it's unlikely that a cached value would be checked against this initial value and return an erroneous result, it's not impossible.

@ischoegl
Member Author

ischoegl commented Dec 22, 2021

I don't see any warnings related to fmt with either GCC or Clang, so I guess that's specific to the Intel compiler.

That sounds plausible.

I would argue that most of the ways that we use NaN in Cantera are very reasonable, and that the alternatives are often worse.

I tend to agree: using NaN as a sentinel is efficient from a coding perspective. But apparently we're forcing the code to check for this possibility, which is less ideal from a computational perspective (although the penalty appears to be small). I also don't think that we need to avoid NaN as an output value, it's just that there probably need to be internal booleans that replace the checks. Overall, I don't think that there's any urgency to 'fix' this issue ... although it probably needs to remain open.

@g3bk47
Contributor

g3bk47 commented Dec 23, 2021

@speth @bryanwweber @ischoegl Although the issue has been closed, I am reporting some additional data: I ran a small test program benchmarking the computation of reaction rates with 16 different compilers/versions, each with and without fast-math. Quick summary: using fast-math, g++ becomes about 15 % faster and the Intel compilers less than 5 % faster. The relative accuracy in my test is within 10^-10 %. However, even without fast-math, final results for reaction rates between g++/clang++ and icpc/icpx are slightly different, which I cannot explain at the moment. For all details, see here:
https://github.com/g3bk47/CanteraCompilerPerformance
Let me know if I should benchmark any other code snippets with this setup.

@ischoegl
Member Author

ischoegl commented Dec 24, 2021

@g3bk47 … thanks for those results! Your comparison suite is impressive. From my perspective, a performance gain of up to 15% would be worth pursuing. While the segfaults and erroneous results are fixed here (i.e. those issues are addressed), I believe this warrants an enhancement request?
