
The gfortran benchmarks should use the "-march=native -ffast-math -funroll-loops" options #24568

Closed
certik opened this issue Nov 10, 2017 · 179 comments


@certik (Contributor) commented Nov 10, 2017

Specifically, when I compare

gfortran -O3 perf.f90 && ./a.out
gfortran -O3 -march=native -ffast-math -funroll-loops perf.f90 && ./a.out

on my machine, I get a 2x speedup on the iteration_pi_sum benchmark. I haven't tested the C code, but a similar speedup might be possible there.

Note: if you use a recent gfortran compiler (e.g. I just tested this with 7.2.0), you can just use:

gfortran -Ofast perf.f90

And it will turn on the proper options for you, and produce optimal results.

@Keno (Member) commented Nov 11, 2017

If we do that, we should allow julia the same liberties of course ;).

@StefanKarpinski (Member) commented Nov 11, 2017

Fast math mode is not functionally identical, so that's changing the algorithm (and potentially the result). Unrolling loops is fair game, but most compilers should consider doing that anyway.

@certik (Contributor, Author) commented Nov 11, 2017

@StefanKarpinski I checked on my machine and indeed -ffast-math seems to be solely responsible for bringing the iteration_pi_sum benchmark down from 20 (ms?) to 10 ms.

Regarding your point about changing the algorithm --- it all depends on what you want the benchmarks to show. I use Fortran every day, and I use -Ofast. Intel's ifort, I believe, does the same thing, in fact by default. As such, your interpretation of the benchmarks is not applicable to my work, since I use the best optimization options that the compiler allows.

Maybe a good compromise would be to mention that you are not using -ffast-math with gfortran (because you feel that's not fair to Julia), even though it can deliver up to a 2x speedup on some benchmarks.

Btw., do you have a real-life example where -ffast-math fails? I've been recommending it to everyone, so if you have a counterexample, please let me know.

My final thought is that a fair benchmark would simply compare against LLVM-based compilers. That way everybody uses the same backend, and we would be benchmarking how well each language can be lowered to LLVM IR. I think that would be meaningful --- except for the fact that there is currently no production LLVM Fortran compiler, though I believe that will eventually change. But at least for C it would make sense to use clang with exactly the same LLVM version as Julia.

Since there is no LLVM Fortran compiler yet, I think the next best thing is to let both LLVM and gfortran do their best. That includes -ffast-math for gfortran and an equivalent option for Julia/LLVM. Otherwise you would need to show that none of the dozen or so gfortran options implied by -ffast-math is used by LLVM.

@Keno (Member) commented Nov 11, 2017

I don't care either way, but Julia has fast math as well, of course, so either we turn it on everywhere or we turn it off everywhere. As for real-world examples, -ffast-math turns on -ffinite-math-only, which is a big problem for people who use floating-point NaNs as missing data.
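
The NaN-as-missing-data pattern Keno mentions can be illustrated with a small Python sketch (my own hypothetical example, not code from pandas or NumPy). The `x == x` test relies on IEEE semantics, where NaN compares unequal to itself; under -ffinite-math-only a C compiler may assume NaNs cannot occur and fold such checks away, silently corrupting the result:

```python
# Hypothetical NaN-as-missing-data sketch. The `x == x` filter relies
# on IEEE semantics: NaN != NaN. A fast-math C compiler may assume no
# NaNs exist and delete the check entirely.
data = [1.0, float("nan"), 3.0, float("nan"), 5.0]

present = [x for x in data if x == x]  # keep only non-NaN entries
mean = sum(present) / len(present)
print(mean)  # 3.0
```

Python itself always follows IEEE double semantics, which is why the filter behaves reliably here.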

@certik (Contributor, Author) commented Nov 11, 2017

@Keno do you know if that's how Pandas and NumPy do it? That would be a good argument against enabling it by default. For numerical computing I don't use NaNs --- except in Debug builds, where I want the program to fail when a NaN would be generated. I use the following gfortran options in Debug mode: -ffpe-trap=invalid,zero,overflow -finit-real=snan -finit-integer=-99999999. And I always use both Debug and Release modes, both in Fortran and in C++. In Julia, do most people just use one mode?

@Keno (Member) commented Nov 11, 2017

In julia fastmath is a local annotation, so the user can specify whether they are ok with non-IEEE behavior or not in a particular place in the code base.

@StefanKarpinski (Member) commented:

Fundamentally "fast math" does not compute the same thing, so we could either turn this on in every language or turn it on in no languages. Many languages don't support fast math at all, which is definitely a mark against them for high-performance computation, but if we do it in some but not others, then we're not comparing apples to apples. This hurts languages like Julia, Fortran and C where this is an option in the comparison, but those are also the languages that are kicking everyone else's butts, so I think we can afford it.

@oscardssmith (Member) commented:

One possibility would be to report two different results for pisum for languages with fastmath. This would acknowledge the existence of a good feature, while not being an unfair comparison.

@romeric (Contributor) commented Nov 14, 2017

I believe comparing the performance of different languages/compilers under -ffast-math would be fundamentally flawed. Under -O1/-O2/-O3 there are still fixed floating-point rules to follow (IEEE), but there are no such rules under -ffast-math, and one cannot expect all compilers to perform the same optimisations once IEEE floating-point compliance is abandoned.

Moreover, there is no fast-math mode for languages like Python, MATLAB/Octave, R, etc. NumPy/SciPy are typically built with the same flags as CPython itself, which in most environments means -O2.

The actual reason the pisum benchmark runs twice as fast is that under -ffast-math the compiler emits an AVX-vectorised version of the code instead of an SSE version, and that is precisely the source of the 2x speed-up. Clang/GCC perform the same optimisation for the C version of the code as well; this is easy to check by looking at the generated assembly. I think the reason this optimisation does not kick in under -O3 is that the compiler cannot guarantee the same statistical distribution of the rounding error when it works with wider vector lanes.
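
For context, the pisum kernel under discussion is a partial sum of 1/k^2, which converges to pi^2/6; this Python transcription is my own sketch of the reduction loop, not the benchmark source (the real benchmark repeats the inner loop many times). Reassociating exactly this kind of reduction is what -ffast-math permits and what enables wider vector lanes:

```python
import math

# Sketch of the pisum reduction (partial sum of 1/k^2 -> pi^2/6).
# The sequential dependence on `s` is what blocks vectorization
# under strict IEEE semantics; fast math allows reassociating it.
def pisum(n=10000):
    s = 0.0
    for k in range(1, n + 1):
        s += 1.0 / (k * k)
    return s

print(abs(pisum() - math.pi**2 / 6))  # truncation error, roughly 1/n
```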

@certik (Contributor, Author) commented Nov 14, 2017

So I think your argument is that you want to set the playground rules, and those rules are IEEE floating-point arithmetic. My argument is that the IEEE rules are not the relevant rules, and I asked above whether somebody can provide an example where -ffast-math produces incorrect code from a computational, numerical perspective. Instead of talking abstractly, let's discuss a particular example where -ffast-math fails from any practical perspective, e.g., some numerical algorithm breaks, or a convergence rate changes, etc.

If such an example does not exist, then you can give up IEEE, since it does not affect any good numerical algorithm in practice. That is my argument.

@StefanKarpinski (Member) commented Nov 14, 2017

Here's a simple example of computing the error of a floating-point addition:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    double a = atof(argv[1]);
    double b = atof(argv[2]);
    if (fabs(a) < fabs(b)) { double t = b; b = a; a = t; }
    double e = ((a + b) - a) - b;
    printf("addition error for `%g + %g`: %g\n", a, b, e);
    return 0;
}

IEEE versus fast math:

$ gcc -O3 tmp/fastmath.c -o tmp/ieeemath
$ gcc -O3 -ffast-math tmp/fastmath.c -o tmp/fastmath

$ tmp/ieeemath 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 3.66

$ tmp/fastmath 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 0

It's extremely naïve to think that things at least this subtle don't happen in real numerical codes.
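
The same computation can be reproduced without a C compiler: Python always evaluates doubles with IEEE semantics and never reassociates, so it shows the un-optimized answer (my own transcription of the C example above):

```python
# The error-of-addition example, transcribed to Python. Python
# follows IEEE double semantics with no compiler reassociation,
# so the error term survives.
a, b = 1.12e17, 12.34
if abs(a) < abs(b):
    a, b = b, a
e = ((a + b) - a) - b
print(e)  # 3.66 under IEEE semantics; the fast-math build prints 0
```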

@certik (Contributor, Author) commented Nov 14, 2017

@StefanKarpinski thanks for taking the time to figure out an example. Here is the same example that you can try in Fortran with, say, the Intel compiler:

program fastmath
integer, parameter :: dp = kind(0.d0)
real(dp) :: a, b, t, e
a = get_float_arg(1)
b = get_float_arg(2)
if (abs(a) < abs(b)) then
    t = b; b = a; a = t
end if
e = ((a + b) - a) - b
print "('addition error for `', es9.2, ' + ', es9.2, '`: ', es9.2)", a, b, e

contains

    real(dp) function get_float_arg(i) result(r)
    integer, intent(in) :: i
    integer, parameter :: maxlen=32
    character(len=maxlen) :: arg
    integer :: s
    if (.not. (0 < i .and. i <= command_argument_count())) then
        error stop "get_float_arg: `i` must satisfy `0 < i <= arg_count`."
    end if
    call get_command_argument(i, arg, status=s)
    if (s == -1) then
        error stop "get_float_arg: Argument too long, increase `maxlen`."
    else if (s > 0) then
        error stop "get_float_arg: Argument retrieval failed."
    else if (s /= 0) then
        error stop "get_float_arg: Unknown error."
    end if
    read(arg, *, iostat=s) r
    if (s /= 0) then
        error stop "get_float_arg: Failed to convert an argument to a float."
    end if
    end function

end

And run it as:

$ ifort fastmath.f90 -o intelmath
$ ./intelmath 1.12e17 12.34
addition error for ` 1.12E+17 +  1.23E+01`:  0.00E+00

Gfortran here behaves as gcc:

$ gfortran fastmath.f90 -o gfortranmath
$ ./gfortranmath 1.12e17 12.34
addition error for ` 1.12E+17 +  1.23E+01`:  3.66E+00
$ gfortran -Ofast fastmath.f90 -o gfortranmathfast
$ ./gfortranmathfast 1.12e17 12.34
addition error for ` 1.12E+17 +  1.23E+01`:  0.00E+00

The Intel compiler shows the expected behavior, since in Fortran compilers have the freedom to rewrite an expression into a mathematically equivalent form (equivalent over the real numbers, not floating point), so ((a + b) - a) - b and 0 are equivalent. Now, strictly speaking, here is what the Fortran 2008 Standard says:

7.1.5.2.4 Evaluation of numeric intrinsic operations
...
2. Once the interpretation of a numeric intrinsic operation is established, the processor may evaluate any mathematically equivalent expression, provided that the integrity of parentheses is not violated.
3. Two expressions of a numeric type are mathematically equivalent if, for all possible values of their primaries, their mathematical values are equal. However, mathematically equivalent expressions of numeric type may produce different computational results.

In other words, the expression as used above:

e = ((a + b) - a) - b

has only one way to interpret it, and the Intel compiler indeed broke the "integrity of parentheses". That is confirmed by an example in the Standard, which says that the expression (X + Y) + Z has a "forbidden alternative form" X + (Y + Z). However, if you change it to:

e = a + b - a - b

Then neither the ifort nor the gfortran results above change, but now the Standard allows evaluating the mathematically equivalent expression 0 instead; it even gives an example of an expression X - Y + Z with an "allowable alternative form" X - (Y - Z), so a + b - a - b can be rewritten to a + b - (a + b), and that to 0.

The conclusion is that the Fortran Standard encourages this kind of rewriting. The Standard itself says not to break parentheses, but in practice compilers do break them. Intel Fortran is a good, solid, reputable compiler, and it does this by default, as you can see above. In light of that, I must strongly disagree that this is "extremely naïve". It might not be what you want to do in Julia, but it's not naïve.

If you use icc, Intel in fact breaks this by default even in your own C example above, parentheses and all:

$ icc fastmath.c 
$ ./a.out 1.12e17 12.34
addition error for `1.12e+17 + 12.34`: 0

Now that we have discussed the standard and the compiler behavior, the more important problem with the example above is that when you write numerical code, you should not rely on such behavior. Besides the fact that the Intel compiler breaks it by default in both C and Fortran, the deeper problem is that such code is unreliable. As an example where you seem to depend on such behavior in Julia (JuliaMath/Roots.jl#12), consider the blog post that the PR is based on:

http://www.shapeoperator.com/2014/02/22/bisecting-floats/

where it is claimed that instead of using abs(fn(x1)) > tol || abs(fn(x2)) > tol one should use x1 < (x1 + x2)/2 < x2, and even that is later replaced with casting floats to integers and back. First of all, one should never cast a float to an integer like that, as it is not portable (and also not readable). Second, the version x1 < (x1 + x2)/2 < x2 is also not portable: it is prudent to assume the Intel compiler can improve, and since the Fortran standard encourages rewriting into mathematically equivalent expressions, a compiler that knows x1 < x2 can trivially replace x1 < (x1 + x2)/2 with .true., and the loop above will never terminate. Rather, the correct solution is to use tolerances --- but instead of tolerances on the function values, in this case one should simply use x2 - x1 > tol. Then one has to set a problem-dependent tolerance, but that one has to do anyway. In fact, many times one does not want to go to machine accuracy; a given problem might only require, say, 1e-6 or even less.
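
The tolerance-based termination advocated here can be sketched in Python (my own illustration, not code from the PR or the blog post; `f` is assumed continuous with a sign change on [x1, x2]):

```python
import math

# Bisection terminating on interval width (x2 - x1 > tol), not on
# function values or float-representation tricks, so it does not
# depend on IEEE details or on the compiler respecting them.
def bisect(f, x1, x2, tol=1e-12):
    assert f(x1) * f(x2) <= 0.0, "need a sign change on [x1, x2]"
    while x2 - x1 > tol:
        xm = (x1 + x2) / 2
        if f(x1) * f(xm) <= 0.0:
            x2 = xm  # root is in the left half
        else:
            x1 = xm  # root is in the right half
    return (x1 + x2) / 2

root = bisect(lambda x: x * x - 2.0, 0.0, 2.0)
print(abs(root - math.sqrt(2.0)))  # within the requested tolerance
```

As the surrounding discussion notes, tol must stay above the attainable error level for the problem at hand, or the loop cannot terminate.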

When numerical code is written like that (e.g., x2 - x1 > tol), it does not depend on IEEE conventions; in fact it does not even depend on the details of the floating-point implementation. One always has to do a convergence study (for bisection it's trivial, but for more complicated algorithms the values become noisy once you reach convergence, say with mesh refinement, and then you set the tolerance safely above the noise). For an example of what I mean, look at Fig. 1 in arXiv:1209.1752, where the error is plotted against various convergence parameters, and it always behaves the same way: it drops, then hits a certain level and stays there --- that's when it has converged. One can plot bisection the same way. In both cases, if I set the tolerance below this minimal error level, the algorithm will never terminate. Such is the nature of numerical computing. Typically the tolerance is problem dependent, though one can use relative, or both absolute and relative, errors to mitigate that a bit, but in the end it is going to be problem dependent.

Coming back to your example above, I claim that one should never write a numerical algorithm that depends on such behavior. You are subtracting two large numbers, so by definition you have lost all the accuracy. The result can be anything, and one should never rely on it in any way.

Based on my reasoning above, the conclusion is: when writing numerical code, write it mathematically and assume the compiler may use any mathematically equivalent expression; avoid numerical constructs that lose accuracy in floating point (such as subtracting two large numbers); never assume a particular representation of floating-point numbers (though you can and should assume how many significant digits can be represented); and compare floating-point numbers using a tolerance. There might be a few more rules that I forgot.

I welcome anyone to find holes in my arguments. I would be most interested to see where my reasoning is incorrect.

@PallHaraldsson (Contributor) commented:

@certik, I do not want to contradict @Keno ("In julia fastmath is a local annotation" --- yes, but not only), since there is also a julia command-line flag:

--math-mode={ieee,fast} Disallow or enable unsafe floating point optimizations (overrides @fastmath declaration)

@romeric "The actual reason that the pisum benchmark runs twice faster is that under -ffast-math the compiler emits an AVX vectorised version of the code instead of a SSE version"

I believe you should get that with Julia too, with either the local or the global setting; you can at least try. Do Fortran and C only have this blunt global instrument? It's fine for performance testing, but really you should enable it locally.

It would be good to indicate, at least in a footnote, that you can go faster in Julia and some other languages, even if numerically less accurately, and point out Julia's advantages.

@Keno (Member) commented Nov 14, 2017

The math-mode command-line flag is quite problematic, since there's a bunch of code in Base and elsewhere that assumes NaNs are possible. E.g. #21375

@StefanKarpinski (Member) commented Nov 14, 2017

Now, when we have the standard and compiler behavior discussed, the more important problem with the above example is that when you write numerical code, you should not rely on such behavior.

Without the exactly rounded operations guaranteed by IEEE --- and a compiler that respects the written evaluation order --- you cannot write important and useful algorithms like Kahan summation (whose error term is mathematically always zero). Julia can and does use this kind of code for correct answers in many places in the standard library. Perhaps your numerical code does not rely on a correct implementation of floating-point semantics, but insisting that everyone else has a moral obligation not to rely on it strikes me as a rather myopic position. If everyone thought like this, we'd still be living in the bad old days before IEEE. If the compiler is allowed to reassociate summation arbitrarily, you can literally get any possible answer when summing a small, fixed set of floating-point numbers. Numerical code goes from being fully well-defined with IEEE to being completely implementation-specific without it --- even for something as basic as adding a bunch of numbers. What results you get depends on what your compiler chooses to do, and it can change from compiler to compiler and version to version. That's a fairly awful situation to be in for anyone who cares about reproducibility.
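
To make the Kahan example concrete, here is a standard compensated-summation sketch in Python (my own illustration; Python always follows IEEE double semantics). The compensation `c` is computed from an expression that is mathematically zero, which is exactly what a reassociating fast-math compiler is licensed to delete:

```python
# Kahan compensated summation. The line `c = (t - s) - y` is
# mathematically zero, so fast math may legally fold it away;
# under IEEE it recovers the rounding error of `s + y`.
def kahan_sum(xs):
    s = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        y = x - c
        t = s + y
        c = (t - s) - y  # mathematically 0; numerically the round-off
        s = t
    return s

xs = [0.1] * 10
print(sum(xs), kahan_sum(xs))  # 0.9999999999999999 1.0
```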

Regarding Intel compilers: I consider not following IEEE behavior without opt-in to be a bug. This may be part of why their compilers have such a reputation for being fast, yet everyone actually uses gcc and clang (although historical cost likely has a great deal to do with that as well). It is probably also a large part of why libraries like openlibm cannot be safely compiled with icc.

All that said, I'm pretty tired of arguing about this, so feel free to make a PR adding "fast math" versions of the Fortran, C and Julia benchmarks. (Please do not make a PR that just changes the compiler options for Fortran from IEEE to fast math.)

@StefanKarpinski (Member) commented Nov 14, 2017

Also: your argument conflates numerical instability with incorrect answers due to invalid optimizations. These are not the same thing at all. Tricky as round-off issues are, they are predictable, and a skilled numerical analyst can tell you exactly how much round-off error can occur in a given computation. It seems like you may be falling into the common trap of imagining that floating-point operations introduce some sort of unpredictable error or fuzziness --- and if that were the case, why not give the compiler license to produce different fuzzy results? But this is a fundamental misunderstanding of floating-point arithmetic: it is completely predictable and always gives the correct answer rounded to the representable precision. Yes, the rounding introduces error, but in a well-defined, predictable way. There is no gremlin introducing random mistakes in the last digit. With -ffast-math, there is such a gremlin --- the compiler.

@certik (Contributor, Author) commented Nov 14, 2017

@StefanKarpinski the summation example is clever, but ultimately it's just a curiosity, since even Julia can't sum it correctly:

julia> foldl(+, (sumsto(10.)))
10.0

julia> sum(sumsto(10.))
1.99584030953472e292

julia> sum_kbn(sumsto(10.))
9.9792015476736e291

and the returned array contains numbers large and small, all over the place:

julia> sumsto(10.)
2046-element Array{Float64,1}:
  1.79769e308 
  4.94066e-324
             
  2.4948e291  
  4.9896e291  
 -1.79769e308 
  2.0         
  8.0         

So that breaks my rule above, "avoid numerical constructs which lose accuracy", since we are summing large and small numbers together, and as such it is something to avoid when writing numerical code.

The openlibm case, however, is a very good example that does apply. So I opened an issue (JuliaMath/openlibm#169) with the icc failures; let me have a look at what is going on. For example, the exp2 test gives a completely wrong answer with icc:

Failure: Test: exp2 (0.7) == 1.6245047927124710452
Result:
 is:          1.62362532701732886764e+00   0x1.9fa5e8d07f29e0000000p+0
 should be:   1.62450479271247094637e+00   0x1.9fdf8bcce533d0000000p+0
 difference:  8.79465695142078729418e-04   0x1.cd17e3304f8000000000p-11
 ulp       :  3960761376927.0000
 max.ulp   :  0.0000
Maximal error of `exp2'
 is      :  3960761376927.0000 ulp
 accepted:  0.0000 ulp

That's unacceptable. We have to find exactly where the problem is before we can draw conclusions.

As a workaround, one can always turn off the optimization that broke it. In the openlibm case that's -fp-model=precise, which I think turns on IEEE compliance. With it, all tests pass with Intel (JuliaMath/openlibm#169 (comment)). That should make icc usable for you, since you want IEEE compliance anyway. It's unsatisfactory for me, so I'll have a look at what is going on if I have time.

But this exp2 case is a good example of -ffast-math-style optimizations breaking people's code, and I wanted a good example. Thank you for that.

I personally agree with Intel, and their decision happens to be in line with my guidelines above for writing numerical code. And for special cases, one can always turn it off; I never argued there should be no way to turn it off.

I have submitted PRs to Julia before, but I will not submit this one, because you @StefanKarpinski seem "pretty tired of arguing about this" (your own words), and I did not appreciate how I was treated in this thread. You @StefanKarpinski attacked me ad hominem in almost every reply ("extremely naïve", "myopic", "your numerical code vs everybody else", etc.), and I am very tolerant, but this is enough even for me. I have no interest in discussing with people who attack me personally instead of discussing just the technical details. I came here with good intentions to improve the benchmarks, and I have improved your benchmarks before. In fact, I met you @StefanKarpinski at the SciPy conference and at the Google Mentor Summit; I had beers with you. I am very sad. But we live in a free country, and I will always defend your freedom of speech. I will not discuss this online anymore, however. If we ever meet in person again, I am happy to buy you a beer so that we can continue this discussion in person. I will now try to forget the attacks and just think about the technical details, where you provided good arguments, even if I disagree with your conclusion.

I am closing this issue, since as @StefanKarpinski explained, for Julia it makes sense to stick with IEEE by default. I initially thought that maybe the -ffast-math option had been omitted by mistake, but based on the discussion here it is clear you have thought about this a lot and reached a conclusion not to use it. That is fine with me. I think you should still mention this somewhere in the text accompanying the benchmarks.

@certik closed this as completed Nov 14, 2017
@StefanKarpinski (Member) commented Nov 14, 2017

I'm sorry that you feel personally attacked – that certainly was not my intention. We have fundamentally different views here. I apologize for any adjectives I used that you felt were insulting. From my perspective, this is the hundredth debate I've participated in about why language X (in this case Fortran) should be allowed to do something special that makes it faster on these benchmarks. It's quite tiring after the tenth such discussion or so. Our principles regarding these benchmarks have always been simple and consistent:

  • Every language implements the same computation using the same algorithm.

Turning on -ffast-math or the equivalent in some languages but not others violates that fundamental principle because it means some languages are computing something different than others. I made this point early on, but you insisted that it was wrong to write numerical code that requires IEEE math and demanded examples where compilers not following IEEE made a practical difference – which I've provided several of. You continue to have the perspective that code that relies on IEEE math is wrong. I continue to believe that a compiler which violates IEEE without explicitly being asked to do so is wrong – and is not computing what it was asked to compute.

In the particular case of openlibm, writing correct algorithms without the guarantees of IEEE will end up being less efficient than just turning IEEE arithmetic on. If you are in the business of writing mathematical code that gives consistent, predictable, correct answers, IEEE is not the enemy of performance; it is precisely the set of guarantees you need in order to know that you are doing exactly the minimal amount of work needed to get the right answer.

I hope that I get a chance to buy you a beer when we cross paths in the future and apologize in person for offending you – I got carried away with the argument and with having a similar debate for the hundredth time. For what it's worth, this point of contention is far more interesting and substantive than the previous ones have been.

@certik (Contributor, Author) commented Nov 15, 2017

@StefanKarpinski no offense taken, and apology accepted. I didn't know you had had hundreds of these debates before. I maintain some large open source projects too, so I know that users or developers keep bringing up the same issue over and over --- the same issue for me, but for each of the people who brought it up, it was their first time. As was the case here: I had never had this debate before.

Just a note that I didn't say to allow it for Fortran and not for Julia: either both or none. With -ffast-math, it might make sense to print the final accuracy/error, since things are not well defined anymore; the task then becomes "give me a given (say, 1e-14) accuracy". For openlibm, if it depends on IEEE, then just turn on IEEE in icc and it does the right thing. I've been looking into openlibm; something fishy is going on in the exp2 routine if departing from IEEE breaks it this completely. Once I figure it out, I'll try to rewrite it "my way" (as you say), and we can compare performance. If the performance is worse, then you have a point, and I'll admit that IEEE makes sense (in this particular case). I'll also check that I am getting the correct answer to within a few ulp. I understand that is not enough for openlibm, where the answer must have 0 ulp error to be compliant. But that's a different question; for me a few ulp of error is completely fine. A 1e-4 error, however, is unacceptable.

This debate made me think that I should write up the rules I follow when writing numerical code and put them up for scrutiny. I thought they were well known, but I don't think they are.

@StefanKarpinski (Member) commented Nov 15, 2017

Just a note that I didn't say to allow it for Fortran and not for Julia. Either both or none.

The issue with this is that the comparison isn't just between Julia and Fortran. C, Fortran and Julia all support "fast math" modes, but other languages don't and we need to apply the principle of "the same computation" to all the languages in the benchmark. That's why it would be ok to add fast math versions, but the basic cross-language comparison has to be IEEE – so that the comparison is fair. We could turn "fast math" on in just C, Fortran and Julia and they would look even better by comparison to other languages, but we're already clobbering them, so that doesn't seem very sporting.

@certik (Contributor, Author) commented Nov 15, 2017

@StefanKarpinski I understand this argument and I agree one should compare exactly the same computation. My own view of the benchmarks is that Julia already won against Python or Matlab, so I don't care about those, the case is settled. I do, however, care about the Fortran / Julia benchmark.

Anyway, back to the IEEE issue in openlibm. I figured out why exp2 fails: the failure is caused by icc rewriting z = x - ((x+redux)-redux) into 0, which was exactly your example! Funny. I said about your code that one should never do that in practice, and I still think that. So what is going on here? Here are the relevant definitions:

#define TBLBITS 8
#define TBLSIZE (1 << TBLBITS)
static const double
    redux    = 0x1.8p52 / TBLSIZE;

And here is the actual original code, that fails in icc:

STRICT_ASSIGN(double, t, x + redux);
t -= redux;
z = x - t;

The t variable is also used for some other things; I removed that from this listing, as icc compiles that part properly.

I will explain my reasoning in detail, so that you or others can scrutinize it. As I said, I welcome anybody to find holes in my arguments.

This code is hackish and unreadable. Look at it and tell me in 5 seconds what it is doing. Maybe you can, but I certainly can't. Here is what it is doing, which took me a long time to figure out:

z = x - floor(x*TBLSIZE + 0.5) / (double)TBLSIZE;

Now this is a whole different story --- this is a common pattern. I usually use it the other way around, as X = X - L*floor(X/L), which periodically shifts particles into the [0, L]^3 box. In the form written above, L = 1/TBLSIZE, so we are periodically shifting x into a box of [0, 1/TBLSIZE]; the 0.5, I think, just shifts the box to [-1/TBLSIZE/2, +1/TBLSIZE/2]. This is consistent with the mathematical formulation from the C file:

 * Method: (accurate tables)
 *
 *   Reduce x:
 *     x = 2**k + y, for integer k and |y| <= 1/2.
 *     Thus we have exp2(x) = 2**k * exp2(y).
 *
 *   Reduce y:
 *     y = i/TBLSIZE + z - eps[i] for integer i near y * TBLSIZE.
 *     Thus we have exp2(y) = exp2(i/TBLSIZE) * exp2(z - eps[i]),
 *     with |z - eps[i]| <= 2**-9 + 2**-39 for the table used.

First of all, there is a typo: it should be x = k + y. Second, you can see the 1/TBLSIZE compartment in there. The original code uses redux to do the floor together with the TBLSIZE scaling. Consider the value of redux in binary:

In [6]: "{0:b}".format(int(float.fromhex("0x1.8p52")))
Out[6]: '11000000000000000000000000000000000000000000000000000'

You can see that adding to it performs the floor; and since redux is divided by TBLSIZE = 256, which just removes some zeros from the end, adding and then subtracting redux does essentially the same thing as floor(x*TBLSIZE + 0.5) / (double)TBLSIZE. One can verify this in, e.g., Python:

In [7]: redux = float.fromhex("0x1.8p52") / 256

In [8]: "{0:b}".format(int(redux))
Out[8]: '110000000000000000000000000000000000000000000'

In [45]: x = -0.7; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[45]: (-0.0007812499999999556, -0.0007812499999999556)

In [46]: x = 0.7; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[46]: (0.0007812499999999556, 0.0007812499999999556)


In [49]: x = 1.123; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[49]: (0.0019062499999999982, 0.0019062499999999982)

In [50]: x = -1.123; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[50]: (-0.0019062499999999982, -0.0019062499999999982)

In [51]: x = -0.5; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[51]: (0.0, 0.0)

In [52]: x = 0.5; x-((x+redux)-redux), x-math.floor(x*256+0.5)/256
Out[52]: (0.0, 0.0)
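
The spot checks above can be extended to a randomized sweep (a sketch, not from the thread; it assumes CPython's round-to-nearest IEEE doubles, and random inputs essentially never land on the exact half-integer ties where the two rounding rules would differ):

```python
import math, random

redux = float.fromhex("0x1.8p52") / 256

random.seed(0)
for _ in range(10000):
    x = random.uniform(-4.0, 4.0)
    a = x - ((x + redux) - redux)            # the redux trick
    b = x - math.floor(x * 256 + 0.5) / 256  # the explicit formula
    assert a == b, (x, a, b)
print("identical on all samples")
```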

But the x-((x+redux)-redux) version is unreadable and relies on IEEE semantics, so it fails in icc. On the other hand, the version x - floor(x*TBLSIZE + 0.5) / (double)TBLSIZE corresponds to the math: it is mathematically correct, and thus it will be correct whether or not your floating point numbers follow IEEE, and thus it will also work in icc. And it does. And that is my whole point --- one should not write stuff like x-((x+redux)-redux), which is not portable and is mathematically equal to 0. One should translate the math into code, which in this case is x - floor(x*TBLSIZE + 0.5) / (double)TBLSIZE. This is correct mathematically, and in the code. This is the Fortran philosophy. Make the code look like the math. Formula translation (that's where the name Fortran comes from).

Ok, I am not here to preach, but I want to put this point across, as it is often misunderstood. Even if you don't agree, it's worth understanding the idea.

There is still one problem. The new version is much slower than the original. Well, let's work on that. So what is TBLSIZE? It happens to be 256. Ok, what if it is something else, like 250? Well, then the equivalence does not work anymore:

In [53]: redux = float.fromhex("0x1.8p52") / 250

In [56]: x = 0.71; x-((x+redux)-redux), x-math.floor(x*250+0.5)/250
Out[56]: (-0.0009375000000000355, -0.0020000000000000018)

I don't know, but I suspect the original might not work unless TBLSIZE is a power of two. My new version works just fine for any number. In the C code, TBLSIZE is always a power of two, so no problem there. So for multiplying and dividing a floating point number by a power of two, there is an ldexp function:

z = x - ldexp(floor(ldexp(x, TBLBITS)+0.5), -TBLBITS);

We now use TBLBITS, see the definition in the C code above. This is about as much as we can do while keeping this mathematically correct. A good compiler, however, should have very fast (and thus platform and floating point implementation dependent) implementations of the ldexp and floor functions. Instead of floor(a+0.5), we could have also used round(a), so that we don't have to do the +0.5 addition. Either way, the compiler can now generate very efficient code. I think a good compiler might be able to optimize this to be as fast as the original. I haven't checked what assembly icc generates. The redux trick is essentially a way to implement this, in a platform dependent way, and the compiler should do that. If not, and we really really need this for speed, well, one can always use some macros and have two implementations --- one general, but perhaps slower sometimes, and one platform dependent. In fact, the openlibm library is already using exactly this approach with the STRICT_ASSIGN macro. If anyone objects to this particular point, that a compiler can't generate as efficient code as the original, let me know. I can look into this further. Update: I implemented a solution that is as fast as the original, but readable and working with icc, in JuliaMath/openlibm#170.
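
For what it's worth, the ldexp form can be checked against the plain multiply/divide form in Python (a sketch with TBLBITS = 8, matching TBLSIZE = 256 in the C code):

```python
import math

TBLBITS = 8
TBLSIZE = 1 << TBLBITS  # 256

def z_div(x):
    # explicit formula: x - floor(x*TBLSIZE + 0.5) / TBLSIZE
    return x - math.floor(x * TBLSIZE + 0.5) / TBLSIZE

def z_ldexp(x):
    # same computation, with the power-of-two scaling spelled out via
    # ldexp, which multiplies by 2**n exactly
    return x - math.ldexp(math.floor(math.ldexp(x, TBLBITS) + 0.5), -TBLBITS)

for x in (-1.123, -0.7, 0.5, 0.7, 1.123):
    assert z_div(x) == z_ldexp(x)
print("ldexp and divide forms agree")
```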

Anyway, the final version now works with icc.

You can see all the details at JuliaMath/openlibm#169. I spent the whole evening on this. But this is a great example, and I think it proves my point that I was making above. One should write mathematical code, and then use -ffast-math.

One final disclaimer: I am not claiming that there can't be an example where my approach does not work. All I am claiming is that so far every example that anybody ever showed me can either be rewritten into mathematical form + -ffast-math, or it is not relevant in practice.

@StefanKarpinski
Sponsor Member

Great analysis and thanks for the PR.

As a counter-example, consider my original problem: how do you find the error of a single floating-point addition when the compiler is free to re-associate the operations arbitrarily?

@vtjnash
Sponsor Member

vtjnash commented Nov 15, 2017

I'm pretty sure the code was written to be fast, not readable. But it's also definitely not a hack to make use of standards-defined behavior in a language. C compilers like to make it very clear that they'll use any undefined behavior in the C standard to optimize / delete / rewrite your code. However, I think the flip side of that is that a compiler that intentionally miscompiles standards-compliant code is not a C compiler, and thus disqualified from being used as a benchmark.

write it mathematically, and assume that you can use any mathematically equivalent expressions, and at the same time, avoid numerical constructs which lose accuracy in floating point

Also, you've been arguing that the compiler should be allowed to use "mathematically equivalent expressions", but it is still constrained by floating point precision, so the transforms being made by icc are not equivalent. For example, above you state that adding 0.5 "just shifts the box". That would be true mathematically, but with real floating point numbers, those two boxes are a different size. That could result in different branch-cuts near the log/linear non-uniformities in the IEEE double format, or simply lower precision results. I don't know whether it is using that property here, but I would be hesitant to trust the output of a compiler that ignores the original author's expressed intent.

@certik
Contributor Author

certik commented Nov 15, 2017

@StefanKarpinski you do it exactly as I have done in the PR. In C, you can use volatile, but as they pointed out, that spills the register, so one would have to write this either in assembly, or provide a C implementation and then compile that particular file with IEEE semantics, to avoid writing assembly. In openlibm, you essentially extended these IEEE semantics to the whole code, so then the whole code must be compiled like that, but that's against my philosophy.

@vtjnash, @StefanKarpinski this discussion made me realize that there are two schools.

  • One is the IEEE school, which is where most (if not all) people on this thread belong. That school says that you rely on IEEE semantics, assume they always hold, and forbid the compiler from doing any kind of optimization that would change the order of floating point operations or otherwise change the bit-level results.

  • The other school, let's call it the Fortran school, assumes that the compiler is free to do any kind of mathematical rewriting. As I pointed out above, it turns out the Fortran standard itself actually requires keeping the parentheses, so the Fortran standard is not following the "Fortran school" either (what an irony!), if you see what I mean. But icc is following the Fortran school. So is -ffast-math in gcc.

In most of this thread, I think @StefanKarpinski assumed that there is only the IEEE school, and that the Fortran school is "extremely naïve", "myopic", or that there is no such school --- it's only "my code", while everybody else is using the IEEE school. I am not offended in any way, I am just trying to explain what happened in this thread. Doing this made me realize a few things about the IEEE school that I hadn't appreciated before, and I am also trying to figure out the rules (and write them down later!) of the Fortran school. The IEEE school has pretty clear rules. The Fortran school, it turns out, also has very firm rules, but they are not written down anywhere very clearly. But the rules are clearly there. That's why I will attempt to write them down.

The two schools are not equivalent, and that's what's causing the disagreement in this thread. They are, however, almost equivalent; they just stress different things. It's like a different angle of approach, and if you go far enough, the two angles almost meet. But they never become 100% equivalent.

Let's first answer this one:

how do you find the error of a single floating-point addition when the compiler is free to re-associate the operations arbitrarily?

  • In the IEEE school, you simply rely on IEEE semantics and do e = ((a + b) - a) - b, and enforce the IEEE semantics in the compiler. (I am short here, since I don't have to explain the IEEE school in this thread.)

  • In the Fortran school, you can't rely on IEEE semantics, that is a platform specific low level detail. The Fortran language provides lots of intrinsics to query the floating point features in a platform independent way, examples are: tiny, huge, epsilon, selected_real_kind, precision and many others. So if you need to know such an error, then there should be an intrinsic fp_add_error(a, b), that has to be implemented by the compiler, or a library. You don't implement this in Fortran. You can implement this intrinsic in any way you like (whether in C with IEEE semantics, or in assembly, or any other way in the compiler itself, does not matter). And you build your Fortran code using such platform independent intrinsics / building blocks.
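
To make the IEEE-school answer above concrete, here is Knuth's branch-free two-sum (a standard textbook algorithm, not code from this thread), which recovers the exact rounding error of one double addition under round-to-nearest --- and which a fast-math compiler would be free to fold away to e = 0:

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b == s + e exactly,
    valid for any ordering of |a| and |b| under round-to-nearest."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

s, e = two_sum(0.1, 0.2)
# verify exactness with rational arithmetic: s + e equals fl(0.1) + fl(0.2)
assert Fraction(s) + Fraction(e) == Fraction(0.1) + Fraction(0.2)
print(s, e)
```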

So the Fortran school stresses portability, and the code needs to look like the math. However, the Fortran school also stresses speed. That's why we give up IEEE semantics, and that's why Intel gives it up by default. Note that Intel initiated the IEEE standard in the first place (as I learned from @StefanKarpinski's link above, great link btw), but they still give it up by default. You might think what an irony. But it's not an irony, it's just two schools, and you need to support both. Anyway, so you need speed in Fortran, so as an example, in the PR above, we ended up with the ldexp(floor(ldexp(x, p)+0.5), -p) operation, and now we needed to make it as fast as possible. The Fortran school relies a lot on the compiler, and the compiler can do a lot --- unlike with IEEE semantics, in Fortran we are limited in what we can use, and as such, the compiler has great freedom to optimize the code aggressively. So in this case, the compiler knows exactly what we are doing, since ldexp(floor(ldexp(x, p)+0.5), -p) is mathematically well defined. So a good compiler can do amazing things with it. But let's say icc can't produce optimal code for this particular example. Then we can provide an efficient implementation of this, as I have done in the PR, and simply create a function ldexp_floor_ldexp(x, p) that does this quickly, but in a platform specific way, using the IEEE specific redux trick.

You can see that the schools can get quite close, but are not exactly equivalent. In the Fortran school, if the compiler can't quite optimize the platform independent way of doing things, you might need to help it by providing such intrinsics. However, in practice, the -ffast-math really is fast and faster than not using it. So the Fortran school gets the speed this way.

Now to @vtjnash's excellent comments:

However, I think the flip side of that is that a compiler that intentionally miscompiles standards-compliant code is not a C compiler, and thus disqualified from being used as a benchmark.

Yes, a Fortran school compiler can't be used for an IEEE school benchmark. No contradiction here.

Also, you've been arguing that the compiler should be allowed to use "mathematically equivalent expressions", but it is still constrained by floating point precision, so the transforms being made by icc are not equivalent.

In the IEEE school they are not equivalent. They are, however, equivalent in the Fortran school. No contradiction here. The Fortran school is different from the IEEE school, and the advantage of the Fortran school is precisely that you can use such mathematically equivalent expressions to greatly speed up the code; but at the same time, the Fortran school requires you to write your code in a platform independent way (or more specifically, in a floating point implementation independent way) without relying on IEEE semantics. It's a trade off.

For example, above you state that adding 0.5 "just shifts the box". That would be true mathematically, but with real floating point numbers, those two boxes are a different size.

Indeed. So in Fortran school you can do that, but in IEEE school you can't.

That could result in different branch-cuts near the log/linear non-uniformities in the IEEE double format, or simply lower precision results.

It could, and it matters for the IEEE school. But it does not matter for the Fortran school, because you write the algorithm in a way that does not rely on such things.

I don't know whether it is using that property here, but I would be hesitant to trust the output of a compiler that ignores the original author's expressed intent.

And this is the core of the argument. The key difference between the IEEE school and the Fortran school. In the IEEE school, you actually don't know in practice what exactly it is doing. Yes, it is well specified, but you don't know in practice, because you can't tell whether my reimplementation is still 100% equivalent or not. In other words, you can't tell if x-((x+redux)-redux) and x-math.floor(x*256+0.5)/256 are equivalent. Both are well specified in IEEE. But you can't tell in practice, because, as you said, it "could result in different branch-cuts near the log/linear non-uniformities in the IEEE double format, or simply lower precision results." In other words, the floor and the division may change things slightly compared to how x-((x+redux)-redux) does it. Absolutely! That is the key. Now, regarding the author's expressed intent. Well, what exactly was his intent? The two schools differ here too. The IEEE school says that his intent was exactly x-((x+redux)-redux), period. The Fortran school says that his intent was to implement the math, as described by the comment in the code. So it needed to put the numbers into boxes, which is mathematically done using the x-math.floor(x*256+0.5)/256 idiom. And so that is what the intent was. So in Fortran, that's how you write it. And then you leave it to the compiler to produce correct mathematical code (i.e., the correct result with the best accuracy it can get).

I would be hesitant to trust the output of a compiler that ignores the original author's expressed intent.

Again, a huge difference between the schools. In the IEEE school, indeed you must require an IEEE compiler, otherwise the code will completely fail, as exemplified by the openlibm example. In the Fortran school, the code itself exactly represents your mathematical intent. And the compiler thus can't ignore it. It must obey it, and by definition, it is also free to use any other mathematically equivalent representation to implement it. So icc can't miscompile my code. (Yes, it can if there are bugs in the compiler, but let's say there are no bugs.) That is the key to understand: in the Fortran school, the icc or ifort compilers can't miscompile code written in the Fortran school.

@StefanKarpinski also mentioned that he considers the default icc behavior a bug. It is indeed a bug in the IEEE school. But it is not only not a bug in the Fortran school, it is the desired feature!

Anyway, sorry for the long comments. But things are becoming very clear in my head. I think I nailed it. I had never realized, nor written down explicitly like this, that there are two schools, IEEE and Fortran, and that they differ so much. But at the same time, in practice, they overlap a lot. But they still differ, as I have exemplified above. And a given code can be a bug in one school but a feature in the other, and vice versa.

@vtjnash
Sponsor Member

vtjnash commented Nov 15, 2017

Which is mathematically done using the x-math.floor(x*256+0.5)/256 idiom

Mathematically, perhaps, but they aren't equivalent in hardware on real floating point numbers. That's why the IEEE standard was created: to replace the wild-west of "fortran school" implementation-defined behaviors. In particular, the rewrite assumes that x*256 + 0.5 is being performed in arbitrary-precision arithmetic and won't overflow (false for extremely large numbers, which return Inf), and that adding 0.5 will be precisely computed (possibly false for small numbers, for example those near 0.5/256).
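
The overflow half of that point is easy to demonstrate (a Python sketch; 1e307 is just an arbitrarily chosen huge input, far outside the range-reduced |y| <= 1/2 that exp2 actually feeds this code):

```python
import math

redux = float.fromhex("0x1.8p52") / 256

x = 1e307
print(x * 256)  # inf: the scaled value overflows double precision

try:
    math.floor(x * 256 + 0.5) / 256
except OverflowError as err:
    print("floor form fails:", err)

# the redux form degrades gracefully here: redux is negligible next to x,
# so the additions round straight back to x and the result is 0.0
print(x - ((x + redux) - redux))
```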

he considers the default icc behavior a bug

It is a bug. It is a clear violation of both the Fortran and C standards. There may be times when it is unclear if a compiler is taking advantage of UB when doing an optimization. But this is not UB. I don't expect that they'll fix this, since it'll impact their benchmark results. But it does call into question the validity of their performance claims, since they have been found to not be doing an apples-to-apples comparison.

@certik
Contributor Author

certik commented Nov 15, 2017

@vtjnash yes, in IEEE school I 100% agree with your comment. You are simply logically following the IEEE school's assumptions and I agree with your conclusions.

In the Fortran school, you don't care about how this is implemented in hardware. It is not wild-west (that's the only thing I disagree with in your comment -- the Fortran school is well defined too, or can be well defined, as I am discovering the rules myself). The x-math.floor(x*256+0.5)/256 operation is mathematically well defined. Period. Now the question is how to actually compile it. You have freedom here. I think it's precisely this freedom that you refer to as "wild-west". But I object, since this freedom does not mean producing incorrect results. You must produce the best result you can on the given hardware. The link that @StefanKarpinski posted, that was truly wild-west, since he described that if you didn't multiply by 1.0, it would return an incorrect answer. Well, that is against the Fortran school. So I object that they were following the Fortran school, by my definition above. That was not the Fortran school, just incorrect behavior. But yes, how exactly x-math.floor(x*256+0.5)/256 is implemented is not fully specified in the Fortran school, but the Fortran school has some requirements --- it must be numerically as accurate as you can get. And if, say, icc does not give as good accuracy, then that's a bug. Now, in practice, you don't get the precise IEEE rounding. So you might say you lose accuracy. Yes, you do, but only a little bit, and that is allowed by the Fortran school. So instead of 1e-15, you get only 1e-14 sometimes. Yes, I think that is fine with the Fortran school. What is not fine with the Fortran school, however, is if instead of 1e-15 you get 1e-6 or 1e-4, or even 0 accuracy. That's unacceptable.

@vtjnash
Sponsor Member

vtjnash commented Nov 15, 2017

as I am discovering the rules myself

Having implementation-defined behaviors and having actual rules (standards) are not substitutes. That's how we ended up with the mess that is the 386 and the horribly broken C standards written to try to formalize and accommodate it.

You must produce the best result you can, on the given hardware

But your implementation doesn't specify that, and doesn't do that. It specifies precisely that it should compute x * 256 + 0.5 using double-precision IEEE-format rounding and overflow. If you want the mathematically correct answer, then you need to use a mathematically correct library or language (such as Mathematica, SymPy, or Nemo).

it must be numerically as accurate as you can get

Then it must not be compiled with the fast-math flag. That flag explicitly discards accuracy in the interest of improving performance.

but only a little bit

There's actually no bound on how much accuracy might be lost in fast-math mode. You might instead get 0 accuracy. This depends on you checking the assembly output for your compiler on your code to see if it made a replacement that is not valid for your problem domain.

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Nov 15, 2017

It should be noted that every -ffast-math optimization example in this thread can be explicitly implemented by the programmer by straightforward mathematical simplification. If I really wanted ((x + y) - x) - y to be simplified to zero, I could just do that and write 0 instead. The compiler is not doing anything that the programmer couldn't do. On the other hand, in -ffast-math mode, there is no way to express that you want to compute the error of float addition short of using volatile (with a huge performance hit) or "write it in assembly". If "write it in assembly" is the answer, I think you've lost the argument. This was precisely my point above when I wrote:

IEEE is not the enemy of performance, it is precisely the set of guarantees you need in order to know that you are doing exactly the minimal amount of work needed to get the right answer.

The optimization that -ffast-math does in this case is easy to ask for explicitly, but the behavior that IEEE guarantees is impossible to express in -ffast-math mode. These are not symmetrical positions: IEEE mode is strictly more expressive than unconstrained "fast math".

There is, however, a class of optimizations that it's quite hard to express explicitly with strict IEEE semantics – namely the kinds of reassociations required for SIMD loop optimization: the compiler needs to be free to assume that it can reassociate reductions across loop iterations so that it can emit vectorized code instead of sequential code. Julia has a @simd annotation specifically to explicitly allow this transformation. But throwing the additional expressiveness of IEEE semantics out with the SIMD bathwater doesn't seem like a good tradeoff at all.
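
The reduction reassociation described here is easy to see at the language level (a Python sketch standing in for what a vectorizing compiler does; this is not Julia's actual @simd machinery):

```python
import random

random.seed(0)
data = [random.uniform(-1.0, 1.0) for _ in range(10000)]

# strict left-to-right sum: the order IEEE semantics pins down
seq = 0.0
for v in data:
    seq += v

# 4-lane "vectorized" sum: four partial accumulators combined at the end,
# which is exactly the reassociation a SIMD compiler needs permission for
lanes = [0.0] * 4
for i, v in enumerate(data):
    lanes[i % 4] += v
vec = sum(lanes)

# same numbers, same math, but typically different low-order bits
print(seq, vec, seq == vec)
```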

@certik
Contributor Author

certik commented Nov 17, 2017

@yuyichao exactly. IEEE allows you to trust the code even if you did the tests only once, while with -ffast-math, you have to do these tests all the time.

But the thing is, what exactly are those tests that they did? I feel it is exactly the same test that I did, unless you have some evidence to the contrary.

Perhaps the point is that with IEEE guarantees, it's not just the original author who did the test, but all the thousands of people who used the code. Since no error was reported, you can trust it. While with -ffast-math, you can only trust it as long as you trust that one particular test, but you have to redo the test if you change compilers.

So it's about having robust tests.

@yuyichao
Contributor

I feel it is exactly the same test that I did, unless you have some evidence to the contrary.

I personally have no evidence but that doesn't matter at all. The point is that the test only needs to be done once. And

you have to do these tests all the time.

is completely unacceptable.

@certik
Contributor Author

certik commented Nov 17, 2017

you have to do these tests all the time.

is completely unacceptable.

And that's where you and I differ. I test my numerical codes very thoroughly. I see only 3 (!) tests for exp2 in openlibm, which is unacceptable to me. Because if I submit a change to exp2, how will you verify that I didn't break anything? @vtjnash above said he didn't know whether my floor(x+0.5) change broke anything (and it turned out it did, and I had to change it to round). So the code is essentially read-only at this point. Whereas if you had thorough tests, you could keep changing the algorithm and then just test that the 1ulp maximum error is still satisfied.

And that is perhaps how we can wrap up the discussion. It's all about how you test the code.

@simonbyrne
Contributor

How do you know that the analytical error stays? That comes down to doing either a) or b), as you posted. I don't see any evidence that either of those was done. I think they simply did "smoke testing", exactly as I did, to determine the error bounds of it. Would you agree?

No. If you look at the code you will see that the approach closely resembles the Tang paper I mentioned above. The author of the code was clearly aware of the different sources of error (reduction error, approximation error and rounding error) as they detail them in the comments, so presumably they precisely kept track of them while writing it. My guess is that they did some informal calculations to check that their error was bounded, but didn't write up a formal proof (as that is a much longer and more thankless task).

@certik
Contributor Author

certik commented Nov 17, 2017

@simonbyrne ok, that might be. I personally trust an automatic test a lot more than just a hope that the author didn't make a mistake. If you see what I mean.

@certik
Contributor Author

certik commented Nov 17, 2017

@ViralBShah sorry, this is actually very on topic here, and we are close to a conclusion.

@simonbyrne, @yuyichao when I started working on JuliaMath/openlibm#171, the first thing I did was to write robust tests. @StefanKarpinski calls them a "smoke test", which seems to suggest that I haven't done a good job, but I really tried to do the best testing job I could. If any of you have a better idea of how to test it, let me know. Ok, so we have tests. Then I test the IEEE version. If it passes, then you can use the IEEE version on other machines, and, as you claim, you don't even have to run the tests.

Then you use this test suite to check other implementations of the same IEEE algorithm, that is, I changed x+redux-redux to round(x*256)/256. I haven't done any -ffast-math. Just a simple change. Now we need to figure out if I broke anything. So I run the tests. They passed. All good.

Now I can optionally turn on -ffast-math and check what broke. Nothing broke. So all good. Case closed. You don't have to use -ffast-math, but I can. And all I have to do is to run tests. But even if you never touch -ffast-math, you still need the same tests to verify that I didn't break anything with the round change in the IEEE version. That is the point.

You can also verify it by doing numerical analysis, as @simonbyrne explained. You should do both. I personally would not trust any change that was not thoroughly tested by automatic tests similar to what I have done. So once you have a solid testing infrastructure, that I argue you need anyway, well then you might as well optionally use -ffast-math, as long as tests pass, on a given machine/compilers.
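
A minimal sketch of that kind of accuracy test (not the actual openlibm suite; it exercises the C library's pow through Python's 2.0 ** x and compares against a 50-digit Decimal reference):

```python
import random
from decimal import Decimal, getcontext

getcontext().prec = 50        # reference precision far beyond double
LN2 = Decimal(2).ln()

def exp2_ref(x):
    """High-precision 2**x computed as exp(x * ln 2) in Decimal."""
    return (Decimal(x) * LN2).exp()

random.seed(0)
max_rel = 0.0
for _ in range(2000):
    x = random.uniform(-10.0, 10.0)
    got = Decimal(2.0 ** x)             # implementation under test
    ref = exp2_ref(x)
    max_rel = max(max_rel, abs(float((got - ref) / ref)))

print("max relative error:", max_rel)
assert max_rel < 1e-15  # a few ulp for doubles; the tolerance is the test
```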

That is how I develop.

@simonbyrne
Contributor

Ultimately I think you need both. Unit tests can't guarantee that you didn't make a mistake, as they can't exhaustively test all possible values. Written proofs can't guarantee that the author didn't miss some weird edge case, or fail to implement it correctly.

Then sometimes you get weird bugs which only affect 5 specific values.

@certik
Contributor Author

certik commented Nov 17, 2017

@simonbyrne well then, if you agree we need both (but you don't have both in openlibm, but I am sure that will be fixed), then I don't know what we are arguing about.

When I develop, I have two modes. A Debug mode (with no -ffast-math) and a Release mode (with -ffast-math). I test my numerical codes in both modes. And I do numerical analysis on my code too, as much as I can. But it's those automatic tests that allow me to trust -ffast-math on a given machine.

@mbauman
Sponsor Member

mbauman commented Nov 17, 2017

Ok, so what happens when those tests under -ffast-math fail? Is that a bug in the library? What if it's because -ffast-math swapped out the exp2 in those tests to my hypothetical one? How do you fix that?

@simonbyrne
Contributor

I guess "trust" is what it comes down to, which unfortunately is a personal thing. While you might trust the results of -ffast-math based on the results of unit tests, I would not.

@certik
Contributor Author

certik commented Nov 17, 2017

@mbauman exactly the right questions --- so the first answer is that they usually don't fail. Like they didn't fail after I fixed exp2 and turned on -ffast-math. Things just work, even I was surprised by that, especially after people were telling me how things can't be trusted.

So if they fail: in my codes, I usually have a tolerance for a numerical algorithm. And it can't be too tight anyway, i.e. say some error in total energy is equal to 1.234e-9. Ok, then I usually set my tolerance to 5e-9, or 1e-8. The reason is that if I swap compilers, it automatically swaps the standard library like openlibm, or the Lapack implementation, and things numerically change a bit. So I set the numerical accuracy such that when I test across multiple compilers and architectures, things still pass.

Ok, now we turn on -ffast-math. In my codes the -ffast-math almost always passed such "relaxed" tolerances from the previous step. If it doesn't, I just slightly adjust the tolerance. But at this point, this is extremely rare.

What if it's because -ffast-math swapped out the exp2 in those tests to my hypothetical one?

So even -ffast-math is supposed to provide "good accuracy". So your hypothetical one must provide "good accuracy". So it is no different from switching from reference Lapack to MKL. Things will numerically change. But just a little bit. So the total energy error typically stays within my already set tolerances. That has been my experience.
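
That tolerance-setting practice amounts to something like this (a sketch; the energy values are made up):

```python
import math

reference_energy = -1.137126743  # from a trusted run (made-up value)
computed_energy  = -1.137126740  # from the current build (made-up value)

# the observed cross-compiler/cross-library spread was ~3e-9, so the test
# tolerance is set a bit looser (1e-8) to survive a compiler or BLAS swap
assert math.isclose(computed_energy, reference_energy,
                    rel_tol=0.0, abs_tol=1e-8)
print("energy within tolerance")
```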

@yuyichao
Contributor

Because if I submit a change to exp2, how will you verify that I didn't break anything?

If you are submitting a change, you must provide a proof or tests that justify your change, which includes making sure it works everywhere, including future compilers. Sadly, the test is not there to catch all breakage in this case, since such a test is very expensive and should be done only once, when the change is made.

It's all about how you test the code.

Nope, it's all about how you can prove your change is correct. Or in general, how someone should justify any change that can change the result.

@certik
Contributor Author

certik commented Nov 17, 2017

I forgot to say that I also have to test in parallel with MPI, and there you can't assume anything --- even the same binary on the same cluster produced numerically different answers on two different nodes (both nodes had an identical architecture and system). Just machine accuracy noise, but it did (probably a hardware bug, who knows?). That broke our code (which was not written in the Fortran school ;), since it didn't use MPI to communicate the now slightly different forces on each atom, so each node was effectively running a different MD simulation, and eventually the pressure went negative and things broke. Had the code been written in the Fortran school, this would never happen. The fix was to use MPI to communicate the forces. This took me 12h of hard work on one Saturday to figure out.

Your answer might have been --- don't use that cluster. Well, that is not an answer for us here. We have to use what we got, and write good robust code, that will not break on imperfect platforms.

@certik
Contributor Author

certik commented Nov 17, 2017

Nope, it's all about how you can prove your change is correct. Or in general, how someone should justify any change that can change the result.

That is a good enough summary of this thread for me. I can agree this is what this thread is about.

@yuyichao
Contributor

I also have to test in parallel with MPI

That's unrelated to this discussion at all, but

and there you can't assume anything

Assuming it is due to parallelism, this is not true at all. You don't have to assume anything, but you can if you want.

Had the code been written in the Fortran school, this would never happen.

If it is about parallelism, then the code was not written in any school, since it doesn't take that into account. If it is due to a hardware bug, then this claim is also not true, since nothing can save you from that.

@certik
Contributor Author

certik commented Nov 17, 2017

@yuyichao the mistake in the code was to assume that if you have a calculation, say an Ewald subroutine, you compile it, then you take the binary to two different nodes that have exactly the same system and architecture, and run it, it will give you a total energy and forces as exactly the same double precision numbers. The code assumed that, and so it ran the Ewald on each node and just used those forces, assuming every node got the same force. But if each node can produce slightly different (it was a 1e-15 error) results, then you have to communicate the forces across nodes to ensure each node has exactly the same forces, otherwise the errors will slightly accumulate and things will break, as they did. (The forces are used to propagate atomic positions, and those must be identical on each node.)
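
The failure mode can be reproduced without MPI (a toy 1-D stand-in for the MD code: "node 1" computes its force with a 1e-15-scale discrepancy, mimicking the hardware difference, and the two trajectories drift apart unless the force is synchronized every step):

```python
def simulate(steps, sync):
    """Toy harmonic oscillator integrated independently on two 'nodes'."""
    x = [0.0, 0.0]
    v = [1.0, 1.0]
    dt = 0.01
    for _ in range(steps):
        f0 = -x[0]
        f1 = -x[1] * (1.0 + 1e-15)  # node 1's tiny force discrepancy
        if sync:
            f1 = f0                 # communicate the force: nodes agree
        v[0] += f0 * dt; x[0] += v[0] * dt
        v[1] += f1 * dt; x[1] += v[1] * dt
    return abs(x[0] - x[1])

print(simulate(100_000, sync=False))  # drift well above the 1e-15 seed
print(simulate(100_000, sync=True))   # exactly 0.0: bitwise identical
```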

So this is very relevant to this thread --- it's about reproducibility. The other reason I bring it up here is that it bears on how you set tolerances for results in tests. The tolerance must be relaxed a bit, because the same test must pass with, say, 4 cores or with 8 cores. The same goes for OpenMP. I try to write the tests so that they work with any number of cores, instead of depending on exactly 2 cores.
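The tolerance point can be shown without MPI: splitting the same reduction across a different number of "cores" changes the rounding order, so a test pinned to one exact bit pattern can break. A hedged Python sketch (a serial stand-in for an MPI-style reduction; `parallel_sum` is my own illustrative helper):

```python
import math

def parallel_sum(xs, ncores):
    # Each "core" sums its own chunk; the partial sums are then combined,
    # which is the rounding order a distributed reduction would see.
    chunk = (len(xs) + ncores - 1) // ncores
    partials = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return sum(partials)

xs = [1.0 / k**2 for k in range(1, 10001)]
s4 = parallel_sum(xs, 4)
s8 = parallel_sum(xs, 8)
# The bit patterns may differ with the core count, but a relative
# tolerance accepts any decomposition:
assert math.isclose(s4, s8, rel_tol=1e-12)
assert math.isclose(s4, math.pi**2 / 6, rel_tol=1e-3)
```

Writing the assertion with a relative tolerance is what makes the same test valid for any core count.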

@StefanKarpinski so there was a gremlin. And it was not the compiler, as the binary was identical on each node. It just reminded me of your comment. So I do assume there are gremlins, for good reason I think.

@yuyichao
Contributor

So this is very relevant to this thread --- it's about reproducibility.

This is more about predictability than reproducibility. As you have observed, openlibm might not give the correctly rounded value. However, it does give a bound that can be guaranteed and reasoned about across multiple versions, and as I have said many times, compiler options that may change the output are acceptable, as long as they can be reasoned about.
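"A few ulp" can be made concrete: an ulp is the spacing between adjacent representable doubles, and an error bound stated in ulps is what lets one reason about a routine across versions. A small Python helper (my own sketch, not openlibm's test harness; requires Python ≥ 3.9 for `math.nextafter`/`math.ulp`) that counts the representable doubles between two values:

```python
import math

def ulp_distance(a, b):
    # Count representable doubles between a and b by stepping with
    # nextafter; fine for the small distances discussed here.
    lo, hi = sorted((a, b))
    n = 0
    while lo < hi:
        lo = math.nextafter(lo, math.inf)
        n += 1
    return n

x = 1.0
assert ulp_distance(x, math.nextafter(x, 2.0)) == 1
# ulp(1.0) is 2**-52 for IEEE doubles:
assert math.ulp(1.0) == 2.0 ** -52
```

A routine whose maximum `ulp_distance` from the correctly rounded result is bounded (say, ≤ 1 ulp) is one you can reason about; "fast-math changed the result" only matters if it moves that bound.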

@certik
Contributor Author

certik commented Nov 17, 2017

I spent more time on the openlibm benchmarks I did, and I think I was wrong. Based on my latest investigation, the details of which I documented in JuliaMath/openlibm#171, I think there is no significant speedup from using -ffast-math on those routines. I tried exp2 and pow. I fooled myself with LTO: I suspect the compiler inlined the routine I was benchmarking at link time, but only sometimes, depending on compiler options (like -ffast-math). I tried both gcc 7.2 and Intel 16.0.2; both deliver comparable results, in terms of both accuracy and speed.

By playing with all this, my conclusion so far with openlibm and fast-math is:

  • The accuracy gets a tiny bit worse for some inputs, but the maximum error bound didn't change in pow and exp2.
  • The speed didn't change.
  • The inf/nan checks in exp2 do slow things down a bit. It's worth removing them if performance is important. This is unrelated to -ffast-math.
  • It seems possible to make code changes so that the code produces correct results with -ffast-math, without affecting its speed when compiled without -ffast-math. But given that I don't actually get any benefit from -ffast-math, I don't think it's worth the work. (It would make the code more portable in my school of thought, but less portable in yours.)

It was still useful; it answered some of our questions, like:

  • Do you pay a performance penalty for the inf/nan checks? Yes you do. A small one.
  • Is there any speedup from -ffast-math in openlibm? Sadly, no.
  • Can you make openlibm compile with icc and trust it? Yes. I trust it based on my tests. You wouldn't trust it, based on our discussion.
  • Did the maximum error of pow or exp2 with -ffast-math increase on my two machines with the gcc/Intel compilers? No.
  • Can I prove that it won't increase in future versions of the gcc/Intel compilers? No.
  • Do I think it will increase significantly in future versions of the gcc/Intel compilers? No. By significantly I mean by more than a few ulp.

However, the reason people use -ffast-math is that it does deliver speedups. I first worked on openlibm because @StefanKarpinski said one can't use icc with it, and I think we answered all the questions related to that. Then somebody asked about the speed of my changes. But it turns out there is no benefit from -ffast-math in openlibm. So regarding speed, we should go back to the original benchmark that started this thread:

real(dp) function pisum() result(s)
    integer :: j, k
    do j = 1, 500
        s = 0
        do k = 1, 10000
            s = s + 1._dp / k**2
        end do
    end do
end function

Where -ffast-math delivers a speedup.

And there, @StefanKarpinski essentially said the speedup is not worth it, compared to Python etc. But on that I disagree. It is worth it, especially since it seems to be 2x in this case. And it would be nice to have this in benchmarks, both Fortran and Julia, and get the best performance that one can get. As part of those benchmarks, it should also print the accuracy, so that we can assess if we can trust the result, on a given machine. Based on the discussion, we can certainly keep the IEEE benchmarks there too, as those will be the one accuracy will be compared against.

@simonbyrne
Contributor

I'm glad we are able to put this thread to rest, thanks for the effort on the benchmarks. Though I will note that the main objection to -ffast-math in the micro benchmarks was not "the speedup is not worth it", but that it was no longer an equivalent comparison.

@certik
Contributor Author

certik commented Nov 17, 2017

@simonbyrne yes, and that I agree with. It should be a separate benchmark.

@certik
Contributor Author

certik commented Nov 18, 2017

I got a 25% speedup on my laptop with -ffast-math in the pow function in openlibm!!

The details of the minimal gcc options are here:

JuliaMath/openlibm#171 (comment)

Good, I got my speedup; 25% is about right, that's usually what I get on real code, so I can safely put this thread to rest. You guys can now figure out how to get the same speedup without -ffast-math. ;)

@certik
Contributor Author

certik commented Nov 18, 2017

@mbauman regarding your question about division by zero, I moved it and answered here: https://discourse.julialang.org/t/when-if-a-b-x-1-a-b-divides-by-zero/7154, if you have any additional comments to that, we can discuss it there.
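(For readers who don't follow the link, the classic pitfall behind a guard like `if (a /= b) x = 1/(a - b)` is this: under strict IEEE arithmetic with gradual underflow, `a != b` guarantees `a - b != 0`, because a tiny difference is representable as a subnormal; but -ffast-math links in startup code that enables flush-to-zero on x86, which voids that guarantee. A Python check of the IEEE side, as a hedged sketch:)

```python
import math
import sys

b = 2.5e-308                       # a normal double near the bottom of the range
a = math.nextafter(b, 1.0)         # the very next representable double up
assert a != b
d = a - b                          # exact difference: one subnormal quantum
assert d != 0.0                    # gradual underflow guarantees this
assert 0 < d < sys.float_info.min  # d is subnormal (below the smallest normal)
# So under strict IEEE, `if a != b: x = 1/(a - b)` never divides by zero.
# With flush-to-zero (enabled by -ffast-math's startup code on x86),
# d would flush to 0.0 and the guarded division would trap after all.
```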

@certik
Contributor Author

certik commented Nov 19, 2017

Regarding summations algorithms, I moved it here: https://discourse.julialang.org/t/accurate-summation-algorithm/7163

(I posted a comment here, which I moved into that discourse post.)
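(For readers who don't follow the link: the standard compensated (Kahan) summation recovers most of the accuracy of the naive loop for a few extra flops per term --- and, notably, -ffast-math can silently break it, since the compensation step is algebraically zero and may be optimized away. A sketch:)

```python
def kahan_sum(xs):
    s = 0.0
    c = 0.0                  # running compensation for lost low-order bits
    for x in xs:
        y = x - c            # corrected next term
        t = s + y            # low bits of y are lost in this addition...
        c = (t - s) - y      # ...and recovered here (algebraically zero,
        s = t                #    which is why -ffast-math can delete it)
    return s

xs = [1.0] + [1e-16] * 10**6
naive = 0.0
for x in xs:                 # plain left-to-right summation
    naive += x
assert naive == 1.0          # every 1e-16 term was rounded away
assert abs(kahan_sum(xs) - 1.0000000001) < 1e-12
```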

@StefanKarpinski
Sponsor Member

StefanKarpinski commented Nov 19, 2017

And there, @StefanKarpinski essentially said the speedup is not worth it, compared to Python etc. But on that I disagree.

Yeah, that’s absolutely not what I said. I said that turning fast math on changes what you’re computing which makes this an unfair comparison with languages that can’t do those computations. I also said that measuring fast math is basically meaningless because it’s not even well defined what you are measuring.

@certik
Contributor Author

certik commented Nov 19, 2017

@StefanKarpinski I apologize for writing it sloppily. After I sent it, I realized it could be interpreted as "not worth the speedup since Julia is already beating Python" (which is not what you said, and I apologize that it sounded like that). What I meant by "the speedup is not worth it" was "it is not worth doing the comparison with -ffast-math, because each code then computes a different thing, so it's not clear what is being compared" (which is what you said before, and repeated now). And on that I disagree.

Again, I apologize, I didn't mean to argue something you didn't say.

@stevengj
Member

If I might chime in, I would say that -ffast-math is useful, but in a somewhat different way than has been proposed here. As others have pointed out, -ffast-math is a poor choice for inter-language benchmarking because then different languages are not computing the same function, and it is problematic for robust library code because it becomes hard to guarantee the results. But it is a nice way to quickly experiment with whether simple algebraic transformations might be worthwhile.

If you find that -ffast-math gives a big speedup on any of the library functions or benchmarks, great! In that case, this is a tip that we should then figure out exactly what transformation the compiler did that helped, determine whether it is safe (for reliable, accurate, portable results), and then decide whether to do it manually (e.g. by re-ordering operations) in non -ffast-math mode.

Sometimes, for benchmarking, an optimization might not be a good idea even if it speeds things up! In the pisum case, a sufficiently aggressive compiler might just sum the series for you, but the point of the benchmark is not to compute π²/6—the point is to measure the speed of loops and basic arithmetic operations, so it is important that all languages are performing the same arithmetic.
