The gfortran benchmarks should use the "-march=native -ffast-math -funroll-loops" options #24568
Comments
If we do that, we should allow Julia the same liberties of course ;).
Fast math mode is not functionally identical, so that's changing the algorithm (and potentially the result). Unrolling loops is fair game, but most compilers should consider doing that anyway.
@StefanKarpinski I checked on my machine and indeed the `-ffast-math` options give gfortran a significant speedup on these benchmarks.

Regarding your point about changing the algorithm --- it all depends on what you want to show by the benchmarks. I use Fortran every day, and I use `-ffast-math` (or the Intel equivalent) by default for production runs.

Maybe a good compromise could be to mention in the benchmarks that you are not using the `-ffast-math` option, and why.

Btw., do you have a real life example when `-ffast-math` actually breaks a good numerical code?

My final thought is that a fair benchmark would be to simply compare against LLVM based compilers. That way everybody is using the same backend, and thus we are benchmarking how well each language can be optimized into LLVM IR. I think that would be meaningful. Well, except for the fact that there is currently no production LLVM Fortran compiler, but eventually I believe that will change. But at least for the C language I think it will make sense to use clang and exactly the same LLVM version as Julia. Since there is no LLVM Fortran compiler yet, I think the next best thing is to let both LLVM and gfortran do their best. This includes `-ffast-math`.
I don't care either way, but Julia has fast math as well of course, so either we turn it on everywhere or we turn it off everywhere. As for real world examples, `-ffast-math` turns on `-ffinite-math-only`, which is a big problem for people who use floating-point NaNs as missing data.
@Keno do you know if that's how Pandas and NumPy do it? That would be a good argument against doing it by default. For numerical computing, I don't use NaNs --- except in Debug builds, where I want the code to fail as soon as a NaN would be generated, which I enable with a compiler option that traps invalid floating point operations.
In Julia, fastmath is a local annotation, so the user can specify whether they are OK with non-IEEE behavior or not in a particular place in the code base.
Fundamentally, "fast math" does not compute the same thing, so we could either turn this on in every language or in no language. Many languages don't support fast math at all, which is definitely a mark against them for high-performance computation, but if we enable it in some languages and not others, then we're not comparing apples to apples. This hurts languages like Julia, Fortran and C, where this option exists, but those are also the languages that are kicking everyone else's butts, so I think we can afford it.
One possibility would be to report two different results for the languages that support fast math: one computed with IEEE semantics and one with fast math enabled.
I believe comparing the performance of different languages/compilers under `-ffast-math` is of limited value. The actual reason that the `pisum` benchmark runs twice as fast is that under `-ffast-math` the compiler emits an AVX vectorised version of the code instead of an SSE version.
So I think your argument is that you want to set the playground rules, and those are IEEE floating point arithmetic. My argument is that the IEEE rules are not the relevant rules, and I asked above if somebody can provide an example where `-ffast-math` actually breaks a good numerical algorithm. If such an example does not exist, then you can give up IEEE, since it does not affect any good numerical algorithm in practice. That is my argument.
Here's a simple example of computing the error of a floating-point addition:

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    double a = atof(argv[1]);
    double b = atof(argv[2]);
    if (fabs(a) < fabs(b)) { double t = b; b = a; a = t; }
    double e = ((a + b) - a) - b;
    printf("addition error for `%g + %g`: %g\n", a, b, e);
    return 0;
}
```

IEEE versus fast math:
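(To illustrate with hypothetical arguments: for `a = 1` and `b = 1e-16`, IEEE semantics must round `a + b` to `1`, so the program prints an error of `-1e-16`; under `-ffast-math` the compiler may fold `((a + b) - a) - b` to `0` and report no error at all.)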
It's extremely naïve to think that things at least this subtle don't happen in real numerical codes.
@StefanKarpinski thanks for taking the time to figure out an example. Here is the same example that you can try in Fortran with, say, the Intel compiler:

```fortran
program fastmath
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: a, b, t, e
a = get_float_arg(1)
b = get_float_arg(2)
if (abs(a) < abs(b)) then
    t = b; b = a; a = t
end if
e = ((a + b) - a) - b
print "('addition error for `', es9.2, ' + ', es9.2, '`: ', es9.2)", a, b, e

contains

    real(dp) function get_float_arg(i) result(r)
    integer, intent(in) :: i
    integer, parameter :: maxlen=32
    character(len=maxlen) :: arg
    integer :: s
    if (.not. (0 < i .and. i <= command_argument_count())) then
        error stop "get_float_arg: `i` must satisfy `0 < i <= arg_count`."
    end if
    call get_command_argument(i, arg, status=s)
    if (s == -1) then
        error stop "get_float_arg: Argument too long, increase `maxlen`."
    else if (s > 0) then
        error stop "get_float_arg: Argument retrieval failed."
    else if (s /= 0) then
        error stop "get_float_arg: Unknown error."
    end if
    read(arg, *, iostat=s) r
    if (s /= 0) then
        error stop "get_float_arg: Failed to convert an argument to a float."
    end if
    end function
end program
```

And run it as:
Gfortran here behaves like gcc:
The Intel compiler has the expected behavior, since in Fortran the compilers have the freedom to rewrite an expression into an equivalent mathematical form (based on real numbers, not floating point), so the whole expression simplifies to `0` and a zero error is printed.
In other words, the expression as used above, `e = ((a + b) - a) - b`, has only one way to interpret it, and the Intel compiler indeed broke the "integrity of parentheses". That is confirmed by an example in the Standard, which uses the unparenthesized expression `a + b - a - b` as one that a compiler may rewrite, while the fully parenthesized form must be honored.

The conclusion is that the Fortran Standard encourages this kind of rewriting; even the Standard says not to break parentheses, but in practice compilers actually do break them. Intel Fortran is a good, solid, reputable compiler, and it does this by default, as you can see above.

In light of the above I must strongly disagree that this is "extremely naïve". It might not be what you want to do in Julia, but it's not naïve. If you use Fortran compilers in practice, you have to write code that does not depend on such behavior.
Now, when we have the standard and compiler behavior discussed, the more important problem with the above example is that when you write numerical code, you should not rely on such behavior. Besides the fact that the Intel compiler breaks it by default both in C and in Fortran, the more important problem is that such code is unreliable.

As an example of where you seem to depend on such behavior in Julia (JuliaMath/Roots.jl#12), consider this blog post that the PR is based on: http://www.shapeoperator.com/2014/02/22/bisecting-floats/ where it is claimed that instead of terminating bisection with a floating point tolerance, one should bisect down through the exact floating point representation itself. When numerical code is written like that (e.g., depending on the exact floating point representation and the exact rounding of individual operations), it is fragile.

Coming to your example above, I claim that one should never write a numerical algorithm that depends on such behavior. You are subtracting two large numbers, so by definition you lost all the accuracy. The result can be anything, and one should never rely on such a result in any way.

Based on my reasoning above, the conclusion is: when writing numerical code, write it mathematically, and assume that you can use any mathematically equivalent expressions; at the same time, avoid numerical constructs which lose accuracy in floating point (such as subtracting two large numbers), never assume any particular representation of floating point (but you can and should assume how many significant digits can be represented), and compare floating point numbers using a tolerance. There might be a few more rules that I forgot.

I welcome anyone to find holes in my arguments. I would be most interested to see where my reasoning is incorrect.
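As a sketch of the tolerance-based style these rules prescribe (illustrative code, not the Roots.jl implementation):

```julia
# Bisection that terminates on a user-supplied tolerance instead of
# relying on exact floating point midpoint or representation behavior.
function bisect(f, a, b; tol = 1e-12)
    fa = f(a)
    while b - a > tol
        m = (a + b) / 2
        if fa * f(m) <= 0
            b = m             # root is in [a, m]
        else
            a = m             # root is in [m, b]
            fa = f(m)
        end
    end
    return (a + b) / 2
end

bisect(x -> x^2 - 2, 1.0, 2.0)   # ≈ √2 = 1.4142135623730951...
```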
@certik, I do not want to contradict @Keno's "In julia fastmath is a local annotation" --- yes, but not only: Julia also has a global command line flag, `--math-mode=fast`.
@romeric "The actual reason that the pisum benchmark runs twice faster is that under -ffast-math the compiler emits an AVX vectorised version of the code instead of a SSE version" --- I believe you should get that with Julia too, with either the local or the global setting; you can at least try. Do Fortran and C only have this kind of global blunt instrument? It's good for performance testing, but you really should enable it locally. It would be good to indicate, at least in a footnote, that you can get faster results in Julia and some other languages, even if numerically less accurate, and point out Julia's advantages.
The `--math-mode` command line flag is quite problematic, since there's a bunch of code in Base and elsewhere that assumes NaNs are possible. E.g. #21375
Without reliable associativity as guaranteed by IEEE, you cannot write important and useful algorithms like Kahan summation (the error term is always zero mathematically). Julia can and does use this kind of code for correct answers in many places in the standard library. Perhaps your numerical code does not rely on a correct implementation of floating-point associativity, but insisting that everyone else has a moral obligation not to rely on it strikes me as a rather myopic position. If everyone thought like this, we'd still live in the dark bad old days before IEEE.

If the compiler is allowed to reassociate summation arbitrarily, you can literally get any possible answer when summing a small, fixed set of floating-point numbers. Numerical code goes from being fully well-defined with IEEE to being completely implementation-specific without it – even for something as basic as adding a bunch of numbers. What results you get depends on what your compiler chooses to do, and it can change from compiler to compiler and version to version. That's a fairly awful situation to be in for anyone who cares about reproducibility.

Regarding Intel compilers: I consider not following IEEE behavior without opt-in to be a bug. This may be part of why their compilers have such a reputation for being fast, yet people who care about correctness end up opting back into strict floating point behavior anyway.

All that said, I'm pretty tired of arguing about this, so feel free to make a PR adding "fast math" versions of the Fortran, C and Julia benchmarks. (Please do not make a PR just changing the compiler options for Fortran from IEEE to fast math.)
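For concreteness, here is a minimal sketch of Kahan summation (illustrative code, not the standard library implementation). The compensation term `c` is algebraically zero, which is exactly why a fast-math compiler is allowed to delete it and silently turn the algorithm back into naive summation:

```julia
function kahan_sum(xs)
    s = 0.0    # running sum
    c = 0.0    # compensation: low-order bits lost so far
    for x in xs
        y = x - c          # re-inject the previously lost bits
        t = s + y          # this addition may round away low bits of y
        c = (t - s) - y    # algebraically 0; under IEEE, exactly the lost bits
        s = t
    end
    return s
end

xs = [1e16, 1.0, 1.0]    # exact sum is 10000000000000002
foldl(+, xs)             # 1.0e16 --- naive summation drops both 1.0s
kahan_sum(xs)            # 1.0000000000000002e16 under IEEE semantics
```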
Also: your argument conflates numerical instability with incorrect answers due to invalid optimizations. These are not the same thing at all. Round-off errors are tricky, but they are predictable, and a skilled numerical analyst can tell you exactly how much round-off error can occur in a given computation.

It seems like you may be falling into the common trap of imagining that floating-point operations introduce some sort of unpredictable error or fuzziness. And if that were the case, then why not give the compiler license to produce different fuzzy results? But this is a fundamental misunderstanding of floating-point arithmetic: it is completely predictable and always gives the correct answer rounded to the representable precision. Yes, the rounding introduces error, but in a well-defined, predictable way. There is no gremlin introducing random mistakes in the last digit. When using fast math, however, what you get really does depend on the whims of the compiler, its version, and its flags.
@StefanKarpinski the summation example is clever, but ultimately it's just a curiosity, since even Julia can't sum it correctly:

```julia
julia> foldl(+, (sumsto(10.)))
10.0

julia> sum(sumsto(10.))
1.99584030953472e292

julia> sum_kbn(sumsto(10.))
9.9792015476736e291
```

and the returned array of numbers is large and small, all over the place:

```julia
julia> sumsto(10.)
2046-element Array{Float64,1}:
  1.79769e308
  4.94066e-324
  ⋮
  2.4948e291
  4.9896e291
 -1.79769e308
  2.0
  8.0
```

So that breaks my rule above, "avoid numerical constructs which lose accuracy": we are summing large and small numbers, which is exactly the kind of thing to be avoided when writing numerical code. The `sum` and `sum_kbn` results above are wrong by hundreds of orders of magnitude.
That's unacceptable. We would have to find exactly where the problem is before we can draw conclusions. As a workaround, one can always turn off the optimization that broke it.

I personally agree with Intel, and their decision happens to be in line with my guidelines above on how to write numerical code. And for special cases, one can always turn it off; I never argued there should be no way to turn it off.

I have submitted PRs to Julia before, but I will not submit this one, because you @StefanKarpinski seem "pretty tired of arguing about this" (using your own words), and I did not appreciate how I was treated in this thread. You @StefanKarpinski attacked me ad hominem in almost every reply of yours ("extremely naïve", "myopic", "your numerical code vs everybody else", etc.), and I am very tolerant, but this is enough even for me. I have no interest in discussing with people who attack me personally, instead of discussing just technical details. I came here with good intentions to improve the benchmarks, and I have improved your benchmarks before. In fact, I met you @StefanKarpinski at the SciPy conference and also at the Google Mentor Summit. I had beers with you. I am very sad. But we live in a free country, and I will always defend your right to free speech. I will not discuss this online anymore; however, if we ever meet in person again, I am happy to buy you a beer so that we can continue this discussion in person. I will now try to forget the attacks and just think about the technical details, where you provided good arguments, even if I disagree with your conclusion.

I am closing this issue, since as @StefanKarpinski explained, for Julia it makes sense to stick with IEEE by default. I initially thought that maybe the fast math options should simply be switched on, but I now see the rationale for keeping the default comparison IEEE.
I'm sorry that you feel personally attacked – that certainly was not my intention. We have fundamentally different views here, and I apologize for any adjectives I used that you felt were insulting.

From my perspective, this is the hundredth debate I've participated in about why language X (in this case Fortran) should be allowed to do something special that makes it faster on these benchmarks. It's quite tiring after the tenth such discussion or so. Our principle regarding these benchmarks has always been simple and consistent: every language has to perform the same computation, written the way a typical programmer in that language would write it.
Turning on fast math changes what is being computed, which is exactly what that principle rules out. In the particular case of these benchmarks, a fast-math variant is a different computation, so it can be reported alongside, but not instead of, the IEEE one.

I hope that I get a chance to buy you a beer when we cross paths in the future and apologize in person for offending you – I got carried away with the argument and with having a similar debate for the hundredth time. For what it's worth, this point of contention is far more interesting and substantive than the previous ones have been.
@StefanKarpinski no offense taken, and apology accepted. I didn't know you had hundreds of these debates before. I maintain some large open source projects also, so I know that some users or developers keep bringing up the same issue over and over again --- the same issue for me, but for each of the people who brought it up it was their first time. As was the case here: I never had such a debate before.

Just a note that I didn't say to allow it for Fortran and not for Julia. Either both or none --- with `-ffast-math` in Fortran and C, and the equivalent fast-math mode in Julia.

This debate made me think that I should write up the rules that I follow when writing numerical code, and put them up for scrutiny. I thought they were well known, but I don't think they are.
The issue with this is that the comparison isn't just between Julia and Fortran. C, Fortran and Julia all support "fast math" modes, but other languages don't, and we need to apply the principle of "the same computation" to all the languages in the benchmark. That's why it would be ok to add fast math versions, but the basic cross-language comparison has to be IEEE – so that the comparison is fair. We could turn "fast math" on in just C, Fortran and Julia and they would look even better by comparison to other languages, but we're already clobbering them, so that doesn't seem very sporting.
@StefanKarpinski I understand this argument and I agree one should compare exactly the same computation. My own view of the benchmarks is that Julia already won against Python or Matlab, so I don't care about those; that case is settled. I do, however, care about the Fortran / Julia benchmark.

Anyway, back to the IEEE issue, this time in openlibm's `exp2`. Here is the relevant setup:

```c
#define TBLBITS 8
#define TBLSIZE (1 << TBLBITS)

static const double
    redux = 0x1.8p52 / TBLSIZE;
```

And here is the actual original code, that fails in icc:

```c
STRICT_ASSIGN(double, t, x + redux);
t -= redux;
z = x - t;
```

(The `STRICT_ASSIGN` macro forces the assignment through a `volatile` temporary, so the compiler cannot keep extra precision or elide the store.)

I will explain my reasoning in detail, so that you or others can scrutinize it. As I said, I welcome anybody to find holes in my arguments.

This code is hackish and unreadable. Look at it and tell me in 5 seconds what it is doing. Maybe you can, but I certainly can't. Here is what it is doing, which took me a long time to figure out:

```c
z = x - floor(x*TBLSIZE + 0.5) / (double)TBLSIZE;
```

That is, round `x` to the nearest multiple of `1/TBLSIZE` and keep the remainder --- a common pattern. For reference, here is the method description from the comments in the source:

```c
 * Method: (accurate tables)
 *
 *   Reduce x:
 *     x = 2**k + y, for integer k and |y| <= 1/2.
 *     Thus we have exp2(x) = 2**k * exp2(y).
 *
 *   Reduce y:
 *     y = i/TBLSIZE + z - eps[i] for integer i near y * TBLSIZE.
 *     Thus we have exp2(y) = exp2(i/TBLSIZE) * exp2(z - eps[i]),
 *     with |z - eps[i]| <= 2**-9 + 2**-39 for the table used.
```

First of all, there is a mistake in the comment: it should be `x = k + y`, otherwise `exp2(x) = 2**k * exp2(y)` does not hold. Second, look at the binary representation of `redux`:

```python
In [6]: "{0:b}".format(int(float.fromhex("0x1.8p52")))
Out[6]: '11000000000000000000000000000000000000000000000000000'
```

You can see that adding to it performs the rounding, and then, since redux is divided by TBLSIZE=256, that just removes some zeros from the end; so by adding and then subtracting redux, you essentially do the same thing as `floor(x*TBLSIZE + 0.5) / TBLSIZE`.
But the rewritten version states the mathematical intent directly, instead of relying on the binary representation of double precision numbers.

Ok, I am not here to preach, but I want to put this point across, as it is often misunderstood. Even if you don't agree, it's worth understanding the idea.

There is still one problem: the new version is much slower than the original. Well, let's work on that. So what is TBLSIZE? It happens to be 256. Ok, what if it is something else, like 250? Well, then the equivalence does not work anymore:
I don't know, but I suspect the original trick might not work unless TBLSIZE is a power of two. My new version works just fine for any number. In the C code, TBLSIZE is always a power of two, so no problem there. And for multiplying and dividing a floating point number by a power of two there is an exact operation: `ldexp`.
We now use TBLBITS (see the definition in the C code above), calling `ldexp` to scale by powers of two exactly. This is about as much as we can do while keeping this mathematically correct. A good compiler should have very fast (and thus platform and floating point implementation dependent) implementations of the `ldexp` and `floor` functions, so instead of bit manipulation in the source code, the platform-specific tricks live in the compiler and the math library.

Anyway, the final version now works with icc. You can see all the details at JuliaMath/openlibm#169. I spent the whole evening on this. But this is a great example, and I think it proves the point that I was making above: one should write mathematical code, and reach for exact operations like `ldexp` where needed.

One final disclaimer: I am not claiming that there can't be an example where my approach does not work. All I am claiming is that so far, every example that anybody ever showed me can be rewritten into mathematical form plus exact operations such as `ldexp` and `floor`.
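To make the equivalence concrete, here is a sketch of the three formulations side by side, written in Julia (which keeps IEEE semantics by default); this is an illustration, not the openlibm code:

```julia
const TBLBITS = 8
const TBLSIZE = 1 << TBLBITS
const redux   = 0x1.8p52 / TBLSIZE

x = 3.7

# Original bit trick: adding redux pushes the bits of x below 1/TBLSIZE
# out of the significand, so the rounding hardware does the work.
t = (x + redux) - redux     # x rounded to the nearest 1/TBLSIZE
z_trick = x - t

# The mathematically explicit version of the same reduction.
z_math = x - floor(x * TBLSIZE + 0.5) / TBLSIZE

# Power-of-two scaling expressed via exact ldexp operations.
z_ldexp = x - ldexp(floor(ldexp(x, TBLBITS) + 0.5), -TBLBITS)

z_trick == z_math == z_ldexp    # true (up to tie-breaking differences)
```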
Great analysis and thanks for the PR. As a counter-example, consider my original problem: how do you find the error of a single floating-point addition when the compiler is free to re-associate the operations arbitrarily?
I'm pretty sure the code was written to be fast, not readable. But it's also definitely not a hack: it makes use of standards-defined behavior in the language. C compilers like to make it very clear that they'll use any undefined behavior in the C standard to optimize / delete / rewrite your code. However, I think the flip side of that is that a compiler that intentionally miscompiles standards-compliant code is not a C compiler, and thus disqualified from being used for a benchmark.
Also, you've been arguing that the compiler should be allowed to use "mathematically equivalent expressions", but it is still constrained by floating point precision, so the transforms being made by icc are not equivalent. For example, above you state that adding 0.5 "just shifts the box". That would be true mathematically, but with real floating point numbers, those two boxes are a different size. That could result in different branch cuts near the log/linear non-uniformities in the IEEE double format, or simply lower precision results. I don't know whether it is using that property here, but I would be hesitant to trust the output of a compiler that ignores the original author's expressed intent.
@StefanKarpinski you do it exactly as I have done in the PR. In C, you can use `volatile` assignments (which is what `STRICT_ASSIGN` is for) to force the individual operations you need to actually happen.

@vtjnash, @StefanKarpinski this discussion made me realize that there are two schools:

1. The IEEE school: the source code prescribes the exact floating point operations, and the compiler must execute them exactly as written.
2. The Fortran school: the source code expresses the mathematics, the compiler may use any mathematically equivalent rearrangement, and in exchange the code must be written so that it does not depend on any particular floating point behavior.
In most of this thread, I think @StefanKarpinski assumed that there is only the IEEE school, and that the Fortran school is "extremely naïve", "myopic", or that there is no such school --- it's only "my code", while everybody else is using the IEEE school. I am not offended in any way; I am just trying to explain what happened in this thread. Doing this made me realize a few things about the IEEE school that I didn't realize before, and I am also trying to figure out the rules (and write them down later!) of the Fortran school.

The IEEE school has pretty clear rules. The Fortran school, it turns out, also has very firm rules, but they are not written down anywhere very clearly. But the rules are clearly there; that's why I will attempt to write them down. The two schools are not equivalent, and that's what's causing the disagreement in this thread. They are almost equivalent, but they stress different things. It's like approaching things from two different angles, and if you go far enough, the two angles can almost meet --- but they never become 100% equivalent.

Let's first answer this one:

> I'm pretty sure the code was written to be fast, not readable.
So the Fortran school stresses portability, and the code needs to look like the mathematics. However, the Fortran school also stresses speed. That's why we give up IEEE semantics, and that's why Intel gives it up by default. Note that Intel initiated the IEEE standard in the first place (as I learned from @StefanKarpinski's link above --- great link, btw.), but they still give it up by default. You might think: what an irony. But it's not an irony, it's just two schools, and you need to support both.

Anyway, you need speed in Fortran, so, as an example, in the PR above we ended up with the `ldexp`/`floor` formulation, which the compiler is then free to implement as fast as the platform allows. You can see that the schools can get quite close, but are not exactly equivalent. In the Fortran school, if the compiler can't quite optimize the platform independent way of doing things, you might need to help it by providing such intrinsics.

Now to @vtjnash's excellent comments:
> ... a compiler that intentionally miscompiles standards-compliant code is not a C compiler, and thus disqualified from being used for a benchmark.

Yes --- a Fortran school compiler can't be used for an IEEE school benchmark. No contradiction here.
> ... it is still constrained by floating point precision, so the transforms being made by icc are not equivalent.

In the IEEE school they are not equivalent; they are, however, equivalent in the Fortran school. No contradiction here. The Fortran school is different from the IEEE school, and the advantage of the Fortran school is precisely that you can use such mathematically equivalent expressions to greatly speed up the code. At the same time, the Fortran school requires you to write your code in a platform independent way (more specifically, in a floating point implementation independent way), without relying on IEEE semantics. It's a trade off.
> For example, above you state that adding 0.5 "just shifts the box". That would be true mathematically, but with real floating point numbers, those two boxes are a different size.

Indeed. So in the Fortran school you can do that, but in the IEEE school you can't.
> That could result in different branch-cuts near the log/linear non-uniformities in the IEEE double format, or simply lower precision results.

It could, and that matters for the IEEE school. But it does not matter for the Fortran school, because there you write the algorithm in a way that does not rely on such things.
> I don't know whether it is using that property here, but I would be hesitant to trust the output of a compiler that ignores the original author's expressed intent.

And this is the core of the argument --- the key difference between the IEEE school and the Fortran school. In the IEEE school, you actually don't know in practice what exactly the code is doing. Yes, it is well specified, but in practice you can't tell whether my reimplementation is still 100% equivalent or not --- for example, whether the `floor`/`ldexp` version produces exactly the same bits as the original `redux` trick.
Again, a huge difference between the schools. In the IEEE school, you indeed must require an IEEE compiler, otherwise the code will completely fail, as exemplified by the openlibm example. In the Fortran school, the code itself exactly represents your mathematical intent, so the compiler can't ignore it: it must obey it, and, by definition, it is also free to use any other mathematically equivalent representation to implement it.

@StefanKarpinski also mentioned that he considers the default behavior of the Intel compilers a bug. In the IEEE school it is; in the Fortran school it is the expected default.

Anyway, sorry for the long comments, but things are becoming very clear in my head. I think I nailed it. I never realized, nor wrote down explicitly like this, that there are two schools, IEEE and Fortran, and that they differ so much --- while at the same time, in practice, they overlap a lot. But they still differ, as I have exemplified above. And a given piece of code can be a bug in one school, but a feature in the other, and vice versa.
Mathematically, perhaps, but they aren't equivalent in hardware on real floating point numbers. That's why the IEEE standard was created: to replace the wild-west of "Fortran school" implementation-defined behaviors. In particular, the rewrite assumes that scaling by `TBLSIZE` is exact, which only holds because it is a power of two.
It is a bug. It is a clear violation of both the Fortran and C standards. There may be times when it is unclear whether a compiler is taking advantage of UB when doing an optimization, but this is not UB. I don't expect that they'll fix this, since it'll impact their benchmark results. But it does call into question the validity of their performance claims, since they have been found not to be doing an apples-to-apples comparison.
@vtjnash yes, in the IEEE school I 100% agree with your comment. You are simply following the IEEE school's assumptions logically, and I agree with your conclusions. In the Fortran school, you don't care how this is implemented in hardware. It is not wild-west (that's the only thing I disagree with in your comment --- the Fortran school is well defined too, or can be well defined, as I am discovering the rules myself). The original `redux` trick is exactly the kind of code that is a feature in the IEEE school and a bug in the Fortran school.
Having implementation-defined behaviors and having actual rules (standards) are not substitutes. That's how we ended up with the mess that is the 386 and the horribly broken C standards written to try to formalize and accommodate it.
> In the Fortran school, the code itself exactly represents your mathematical intent.

But your implementation doesn't specify that, and doesn't do that. It specifies precisely that it should compute `floor(x*TBLSIZE + 0.5) / TBLSIZE` in floating point arithmetic, not over the real numbers.
Then it must not be compiled with a fast-math flag. That flag explicitly discards accuracy in the interest of improving performance.
There's actually no bound on how much accuracy might be lost in fast-math mode; you might instead get zero accuracy. This depends on you checking the assembly output of your compiler on your code, to see if it made a replacement that is not valid for your problem domain.
It should be noted that every fast-math build has to be validated anew: the set of transformations the compiler applies can change with every version and flag combination, so a result that was checked once cannot simply be assumed to hold for the next build.
The optimization that fast math performs here could, for the most part, be expressed explicitly in the source code under IEEE semantics. There is, however, a class of optimizations that is quite hard to express explicitly with strict IEEE semantics – namely, the kinds of reassociations required for SIMD loop optimization: the compiler needs to be free to assume that it can reassociate reductions across loop iterations so that it can emit vectorized code instead of sequential code. Julia has a `@simd` annotation for exactly this purpose: it grants that freedom for one specific loop, while the rest of the program keeps strict IEEE semantics.
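A sketch of what that local opt-in looks like (illustrative function, not code from the benchmark suite):

```julia
# @simd tells the compiler it may reassociate this one reduction so it
# can vectorize it; the rest of the program keeps strict IEEE semantics.
function sumsq_inv(n)
    s = 0.0
    @simd for k in 1:n
        s += 1.0 / (k * k)
    end
    return s
end

sumsq_inv(10_000)   # ≈ π²/6 ≈ 1.6448
```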
@yuyichao exactly. IEEE allows you to trust the code even if you ran the tests only once, while with fast math the tests would have to be redone for every compiler and option change.

But the thing is: what exactly are those tests that they did? I feel it is exactly the same kind of test that I did, unless you have some evidence to the contrary. Perhaps the point is that with IEEE guarantees, it's not just the original author who did the test, but all the thousands of people who used the code. Since no error was reported, you can trust it. With fast math, each of those runs would prove much less.

So it's about having robust tests.
I personally have no evidence, but that doesn't matter at all. The point is that the test only needs to be done once. And requiring the validation to be redone for every compiler and flag combination is completely unacceptable.
And that's where you and I differ. I test my numerical codes very thoroughly, and I see only 3 (!) tests for exp2 in openlibm, which is unacceptable to me --- because if I submit a change to `exp2`, those tests will not tell me whether I broke it.

And that is perhaps how we can wrap up the discussion: it's all about how you test the code.
No. If you look at the code you will see that the approach closely resembles the Tang paper I mentioned above. The author of the code was clearly aware of the different sources of error (reduction error, approximation error and rounding error), as they detail them in the comments, so presumably they precisely kept track of them while writing it. My guess is that they did some informal calculations to check that their error was bounded, but didn't write up a formal proof (as that is a much longer and more thankless task).
@simonbyrne ok, that might be. I personally trust an automatic test a lot more than the hope that the author didn't make a mistake, if you see what I mean.
@ViralBShah sorry, this is actually very on topic here, and we are close to a conclusion.

@simonbyrne, @yuyichao when I started working on JuliaMath/openlibm#171, the first thing I did was to write robust tests. @StefanKarpinski calls them "smoke tests", which seems to suggest that I haven't done a good job, but I really tried to do the best testing job I could. If any of you has a better idea how to test it, let me know.

Ok, so we have tests. Then I test the IEEE version. If it passes, then you can use the IEEE version on other machines and, as you claim, you don't even have to run the tests. Then you use this test suite to check other implementations of the same IEEE algorithm --- that is how I checked that my rewritten `exp2` still passes.

Now I can optionally turn on `-ffast-math` and run the same test suite again. You can also verify the change by doing numerical analysis, as @simonbyrne explained. You should do both. I personally would not trust any change that was not thoroughly tested by automatic tests similar to what I have done.

So once you have a solid testing infrastructure, which I argue you need anyway, then you might as well optionally use `-ffast-math`, knowing exactly what accuracy you are getting on your machine.

That is how I develop.
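For illustration, a test in this spirit might compare `exp2` against a high-precision reference at many random points and track the worst relative error (a sketch; the actual tests in the PR differ):

```julia
# Compare Float64 exp2 against a BigFloat reference on random inputs
# and report the worst relative error in units of eps(Float64).
function worst_rel_error(n = 100_000)
    worst = 0.0
    for _ in 1:n
        x = -20 + 40 * rand()
        ref = Float64(exp2(big(x)))    # high-precision value, rounded back
        worst = max(worst, abs(exp2(x) - ref) / abs(ref))
    end
    return worst / eps(Float64)
end

worst_rel_error()   # should stay within a few ulps for a correct exp2
```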
Ultimately I think you need both. Unit tests can't guarantee that you didn't make a mistake, as they can't exhaustively test all possible values. Written proofs can't guarantee that the author didn't miss some weird edge case, or didn't fail to implement it correctly. And sometimes you get weird bugs which only affect a handful of specific values.
@simonbyrne well then, if you agree we need both (you don't have both in openlibm yet, but I am sure that will be fixed), then I don't know what we are arguing about. When I develop, I have two modes: a Debug mode (with no optimizations and with all runtime checks enabled) and a Release mode (optimized, with fast math).
Ok, so what happens when those tests fail under `-ffast-math`?
I guess "trust" is what it comes down to, which unfortunately is a personal thing. While you might trust the results of a fast-math build that passes such tests, others here clearly would not.
@mbauman exactly the right questions --- so the first answer is that they usually don't fail, like they didn't fail after I fixed `exp2` above.

If they do fail: in my codes, I usually have a tolerance for a numerical algorithm, and it can't be too tight anyway. Say some error in total energy is equal to 1.234e-9; then I usually set my tolerance to 5e-9, or 1e-8. The reason is that if I swap compilers, that automatically swaps the standard library (like openlibm) or the Lapack implementation, and things numerically change a bit. So I set the numerical accuracy such that when I test across multiple compilers and architectures, things still pass.

Ok, now we turn on `-ffast-math` and run the same test suite --- so even the fast-math build has to meet the same tolerances.
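In test-suite form, the style described above might look like this (hypothetical names and values):

```julia
using Test

# The observed error across compilers was ~1.2e-9, so the tolerance
# leaves headroom for compiler, math library, and architecture noise
# --- and, in this workflow, for -ffast-math as well.
@test abs(total_energy - E_reference) < 5e-9
```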
If you are submitting a change, you must provide a proof or tests that justify your change, which includes making sure it works everywhere, including with future compilers. Sadly, the test is not there to catch all breakage in this case, since such testing is very expensive and should be done only once, when the change is made.

Nope, it's all about how you can prove your change is correct --- or, in general, how anyone should justify any change that can alter the result.
I forgot to say that I also have to test in parallel with MPI, and there you can't assume anything --- even the same binary on the same cluster produced numerically different answers on two different nodes (both nodes had an identical architecture and system). Just machine accuracy noise, but it happened (probably a hardware bug, who knows?). That broke our code (which was not written in the Fortran school ;), because it didn't use MPI to communicate the now slightly different forces on each atom, so each node was effectively running a different MD simulation, and eventually the pressure got negative and things broke. Had the code been written in the Fortran school, this would never have happened. The fix was to use MPI to communicate the forces. This took me 12 hours of hard work on one Saturday to figure out. Your answer might be --- don't use that cluster. Well, that is not an answer for us. We have to use what we've got, and write good robust code that will not break on imperfect platforms.
That is a good enough summary of this thread for me. I can agree this is what this thread is about.
That's unrelated to this discussion at all. But:

> even the same binary on the same cluster produced numerically different answers on two different nodes

Assuming it is due to parallelism, this is not true at all: you don't have to assume anything, but you can if you want.

> Had the code been written in the Fortran school, this would never have happened.

If it is about parallelism, then the code was not written in any school, since it doesn't take that into account. If it is due to a hardware bug, then this claim is also not true, since nothing can save you from that.
@yuyichao the mistake in the code was to assume that if you have a calculation, say an Ewald subroutine, you compile it, then you take the binary to two different nodes that have exactly the same system and architecture and run it, it will give you the total energy and forces as exactly the same double precision numbers. The code assumed that, and so it ran the Ewald sum on each node and just used those forces, assuming every node got the same values. But if each node can produce slightly different results (it was a 1e-15 error), then you have to communicate the forces across nodes to ensure each node has exactly the same forces, otherwise the errors slowly accumulate and things break, as they did. (The forces are used to propagate atomic positions, and those must be identical on each node.)

So this is very relevant to this thread --- it's about reproducibility. The other reason I bring it up is that it's about how you set tolerances for the results in tests. The tolerance must be relaxed a bit, because the same test must run with, say, 4 cores or 8 cores. The same with OpenMP. I just try to write the tests so that they work with any number of cores, instead of depending on working with exactly 2 cores only.

@StefanKarpinski so there was a gremlin after all --- and it was not the compiler, as the binary was identical on each node. It just reminded me of your comment. So I do assume there are gremlins, for good reason I think.
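The fix, in sketch form (using the MPI.jl interface; `compute_forces` and `natoms` are hypothetical placeholders):

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

# Compute the forces on one rank only and broadcast them, so every node
# propagates the atoms with bit-identical doubles instead of its own
# locally recomputed (and possibly last-bit different) values.
forces = MPI.Comm_rank(comm) == 0 ? compute_forces() : zeros(3, natoms)
MPI.Bcast!(forces, 0, comm)
```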
This is more about predictability than reproducibility. As you have observed, openlibm might not give the correctly rounded value; however, it does give some bound on the error that can be guaranteed and reasoned about across multiple versions. As I have said many times, compiler options that may change the output are acceptable, as long as they can be reasoned about.
I spent more time on the openlibm benchmarks I did, and I think I was wrong. Based on my latest investigation, details of which I documented in JuliaMath/openlibm#171, I think there is no significant speedup from using `-ffast-math` inside openlibm itself.

By playing with all this, my conclusion so far is that `-ffast-math` does not buy much for the openlibm library functions themselves.
It was still useful, though --- it did answer some of our questions.
However, the reason people use `-ffast-math` is for code like this:

```fortran
real(dp) function pisum() result(s)
integer :: j, k
do j = 1, 500
    s = 0
    do k = 1, 10000
        s = s + 1._dp / k**2
    end do
end do
end function
```

Where `-ffast-math` lets the compiler vectorize the reduction, roughly doubling the speed.

And there, @StefanKarpinski essentially said the speedup is not worth it, compared to Python etc. But on that I disagree. It is worth it, especially since it seems to be 2x in this case. And it would be nice to have this in the benchmarks, both Fortran and Julia, and get the best performance that one can get. As part of those benchmarks, they should also print the accuracy, so that we can assess whether we can trust the result on a given machine. Based on the discussion, we can certainly keep the IEEE benchmarks there too, as those will be the ones accuracy is compared against.
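For comparison, here is a Julia sketch of the same kernel, with fast-math permission granted only on the inner reduction (illustrative code, not the official benchmark implementation):

```julia
function pisum()
    s = 0.0
    for j in 1:500
        s = 0.0
        for k in 1:10000
            s += 1.0 / (k * k)    # strict IEEE semantics by default
        end
    end
    return s
end

function pisum_fast()
    s = 0.0
    for j in 1:500
        s = 0.0
        @fastmath for k in 1:10000    # local opt-in: ops may be reassociated
            s += 1.0 / (k * k)
        end
    end
    return s
end
```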
I'm glad we are able to put this thread to rest; thanks for the effort on the benchmarks. Though I will note that the main objection to `-ffast-math` above applies to the cross-language comparison: a fast-math version is a different computation and should be reported separately.
@simonbyrne yes, and with that I agree. It should be a separate benchmark.
I got a 25% speedup on my laptop with `-ffast-math` on the relevant benchmark. The details of the minimal gcc options are here: JuliaMath/openlibm#171 (comment)

Good --- I got my speedup; 25% is about right, that's usually what I get on real code, so I can put this thread safely to rest. You guys can now figure out how to get the same speedup without `-ffast-math`. ;)
@mbauman regarding your question about division by zero, I moved it and answered here: https://discourse.julialang.org/t/when-if-a-b-x-1-a-b-divides-by-zero/7154; if you have any additional comments on that, we can discuss it there.
Regarding summation algorithms, I moved the discussion here: https://discourse.julialang.org/t/accurate-summation-algorithm/7163 (I posted a comment here, which I moved into that discourse post.)
Yeah, that’s absolutely not what I said. I said that turning fast math on changes what you’re computing, which makes this an unfair comparison with languages that can’t do those computations. I also said that measuring fast math is basically meaningless, because it’s not even well defined what you are measuring.
@StefanKarpinski I apologize for writing it sloppily. After I sent it, I realized it could be interpreted as "not worth the speedup since Julia is already beating Python" (which is not what you said, and I apologize it sounded like that); what I meant by "the speedup is not worth it" is that it is not worth doing the comparison under fast math at all, for the reasons you gave.

Again, I apologize; I didn't mean to argue something you didn't say.
If I might chime in: sometimes, for benchmarking, an optimization might not be a good idea even if it speeds things up!
Specifically, when I compare the default build against one compiled with `-march=native -ffast-math -funroll-loops` on my machine, I get a 2x speedup on the `iteration_pi_sum` benchmark. I haven't tested the C code, but a similar speedup might be possible.

Note: if you use a recent gfortran compiler (e.g. I just tested this with 7.2.0), you can just use `-Ofast -march=native`, and it will turn on the proper options for you and produce optimal results.