
Conversation

@oscardssmith
Member

4x faster for Float32, 2x faster for Float64. This has the same problem as #37426 in that it needs exthorner to achieve the necessary precision.
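
For reference, the kind of extended-precision (compensated) Horner scheme being referred to can be sketched as below. This is only an illustration of the technique; the names `two_mul`, `two_sum`, and `exthorner_sketch` are hypothetical and not the PR's actual code.

```julia
# Illustrative sketch of an extended-precision Horner evaluation: carry the
# rounding error of each multiply/add step in a second Float64 using
# error-free transformations.
function two_mul(a::Float64, b::Float64)
    hi = a * b
    lo = fma(a, b, -hi)            # exact error of the product
    return hi, lo
end

function two_sum(a::Float64, b::Float64)
    hi = a + b
    t  = hi - a
    lo = (a - (hi - t)) + (b - t)  # exact error of the sum (Knuth's TwoSum)
    return hi, lo
end

function exthorner_sketch(x::Float64, p::Tuple)
    hi, lo = p[end], 0.0
    for i in length(p)-1:-1:1
        phi, plo = two_mul(hi, x)
        hi, slo  = two_sum(phi, p[i])
        lo = muladd(lo, x, plo + slo)  # accumulate the dropped low-order terms
    end
    return hi, lo                      # result ≈ hi + lo
end
```

For example, `exthorner_sketch(0.1, (1.0, 0.5, 1/6, 1/24))` returns the leading Float64 result plus a small correction term that would otherwise be lost to rounding.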

@StefanKarpinski
Member

This PR looks like it breaks a lot of tests.

@oscardssmith
Member Author

Yeah. I need this one to be a bit more accurate still. I think it's currently in the 1.1 ulp range, and it needs to be in the 0.8 ulp range.

@oscardssmith
Member Author

I just pushed a commit inspired by @simonbyrne's idea of using Float64 for extended precision Float32 code. This new version took way less time to write, is more accurate, and at least as fast. Now to get equally good Float64 results.
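
The trick being described is roughly the following: do all the arithmetic for the Float32 case in Float64, which has more than twice Float32's precision to absorb rounding error, and round back once at the end. The sketch below is illustrative only; the coefficients are plain Taylor terms rather than the tuned polynomial, and there is no range reduction.

```julia
# Sketch only (placeholder coefficients, small |x| only): compute the Float32
# result entirely in Float64 and round once at the end.
function expm1_f32_small_sketch(x::Float32)
    xd = Float64(x)
    # expm1(x)/x ≈ 1 + x/2 + x²/6 + x³/24 + x⁴/120 for small |x|
    p = evalpoly(xd, (1.0, 0.5, 1/6, 1/24, 1/120))
    return Float32(xd * p)   # the single rounding back to Float32
end
```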

@oscardssmith oscardssmith marked this pull request as draft October 2, 2020 04:35
@jmert
Contributor

jmert commented Oct 2, 2020

Now to get equally good Float64 results.

If I'm reading this correctly, the range reduction here is only to the interval [-ln(2)/2, ln(2)/2]. Assuming #36761 gets merged, couldn't you reuse its range-reduction machinery to reduce onto the smaller interval [-ln(2)/512, ln(2)/512], so that your polynomial only has to be optimized over that smaller range? At one point when playing around with the new exp, I thought I'd figured out that only a small change was required to turn your exp function into an expm1 function. (Maybe doing that just makes it too slow, though?)
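
For concreteness, the kind of table-based reduction being suggested looks roughly like the sketch below. The constants, the table stand-in, and the kernel polynomial are illustrative placeholders rather than #36761's actual machinery, and the final reconstruction is the naive version whose accuracy problem comes up in the replies that follow.

```julia
# Sketch of a table-based reduction for expm1 (illustrative constants/kernel):
#   x = N*log(2)/256 + r with |r| ≤ log(2)/512 and N = 256k + j, so
#   expm1(x) = 2^k * T[j] * expm1(r) + (2^k * T[j] - 1),  where T[j] = 2^(j/256)
const LogBo256INV_sk = 256 / log(2)
const LogBo256_sk    = log(2) / 256

function expm1_reduce_sketch(x::Float64)
    N = round(x * LogBo256INV_sk)
    r = muladd(N, -LogBo256_sk, x)               # reduced argument, |r| ≤ log(2)/512
    k, j = Int(N) >> 8, Int(N) & 255
    Tj = 2.0^(j / 256)                           # stands in for a table lookup
    p  = r * evalpoly(r, (1.0, 0.5, 1/6, 1/24))  # ≈ expm1(r) on the tiny interval
    scale = exp2(k) * Tj
    return muladd(scale, p, scale - 1.0)         # naive: loses low bits of `scale - 1`
end
```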

@oscardssmith
Member Author

@jmert the problem with the obvious way of doing that is that the table would be 2x bigger. That isn't a critical flaw, and I should probably try it. The reason a bigger table is needed is that my exp code repeatedly relies on the fact that all the values carrying rounding error are near 0, so it can be a little sloppier: the last 9 or so bits of precision get wiped out when you add 1. To get correct results from a table approach here, you instead need to store the high and low parts of each table entry in two separate Float64s rather than packing everything into a single Int64.
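
A minimal illustration of the hi/lo storage being described, assuming the same 2^(j/256) entries as the earlier sketch; the names and construction here are hypothetical, not the PR's table.

```julia
# Hypothetical hi/lo table: each 2^(j/256) stored as an unevaluated Float64 sum
# EXPM1_T_hi[j+1] + EXPM1_T_lo[j+1], so the low-order bits can survive the
# reconstruction step instead of being dropped when the value is packed into a
# single 64-bit word.
const EXPM1_T_hi = [Float64(big(2.0)^(j / 256)) for j in 0:255]
const EXPM1_T_lo = [Float64(big(2.0)^(j / 256) - EXPM1_T_hi[j + 1]) for j in 0:255]
```

During reconstruction both parts get scaled by 2^k and folded back in, so the bits that the naive `scale - 1.0` step in the earlier sketch would discard are retained.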

@jmert
Contributor

jmert commented Oct 2, 2020

Ah OK — that makes sense.

@oscardssmith
Member Author

@StefanKarpinski how important is expm1(-0.0) == -0.0 (vs 0.0)? Right now, that's the only test that seems to be failing (other than whitespace).
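
One common way implementations get this behavior without a dedicated special case is to return x itself for very small |x|, which passes the signed zero straight through. The sketch below is a hedged illustration of that idea, not necessarily what this PR does; the threshold is only indicative.

```julia
# Sketch: returning x when |x| is tiny both gives expm1(x) ≈ x to full
# precision and preserves the sign of zero, so expm1(-0.0) === -0.0.
function expm1_tiny_sketch(x::Float64)
    abs(x) < 0x1p-53 && return x   # -0.0 (and +0.0) come back unchanged
    return expm1(x)                # defer to the existing implementation otherwise
end
```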

@oscardssmith oscardssmith marked this pull request as ready for review October 25, 2020 03:43
@oscardssmith
Member Author

I've just updated this for the most recent master (specifically, fixing the merge with the betterExp code).

@oscardssmith
Member Author

Also, note that this shouldn't be merged until I refactor to move the high-precision evalpoly to math.jl.

@KristofferC
Member

Also, note that this shouldn't be merged until I refactor to move the high-precision evalpoly to math.jl.

You could make it into a draft PR perhaps?

@oscardssmith
Member Author

This PR is now ready for review. This new version isn't quite as fast as the original, but it is more accurate and still faster than the current implementation. Many points to @jmert for getting me to think about how to make the table for Float64 work. It turns out that if you are a little careful with how you put everything back together, it just works (as long as you use a separate method for small x). See https://doi.org/10.1145/146847.146928.
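
The overall shape being described, a dedicated small-|x| polynomial plus the table-based reconstruction otherwise, is roughly the following. The threshold and kernel are placeholders, not the PR's tuned values, and there is no overflow/underflow/NaN handling.

```julia
# Rough structure only: a separate polynomial path for small |x|, and the
# table-based path (expm1_reduce_sketch from the earlier sketch) for the rest.
function expm1_sketch(x::Float64)
    if abs(x) < 0.5 * log(2)
        # a minimax polynomial for expm1(x)/x would go here; Taylor terms as stand-ins
        return x * evalpoly(x, (1.0, 0.5, 1/6, 1/24, 1/120, 1/720, 1/5040))
    end
    return expm1_reduce_sketch(x)
end
```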

@ViralBShah ViralBShah added maths Mathematical functions performance Must go faster labels Feb 7, 2021
@oscardssmith
Member Author

Bump on this. Would be great to have it in 1.7.

@imciner2
Contributor

Sorry for being late in looking at this, but I am slightly concerned by the use of Float64 inside the Float32 function. I think that will hurt the performance of the code on 32-bit systems that don't support 64-bit floating point natively (since then it must emulate it using software).

@oscardssmith
Member Author

The sense I got was that, in general, processors with 32-bit floats almost always have 64-bit floats too. Is there a particular CPU you are thinking of where this isn't the case? If so, any chance you can run benchmarks on it?

@stevengj
Member

I think that will hurt the performance of the code on 32-bit systems that don't support 64-bit floating point natively (since then it must emulate it using software).

There is no such system AFAIK. All general-purpose CPUs for decades (whether 32-bit or 64-bit) have had hardware 64-bit floating-point arithmetic. (This is a common misunderstanding: "32 bit" only refers to the native pointer size.)

@KristofferC
Member

KristofferC commented Mar 26, 2021

There is no such system AFAIK.

I guess GPUs can be considered exceptions? And this should be easily compilable to a GPU just using CUDA.jl, I presume.

@oscardssmith
Member Author

oscardssmith commented Mar 26, 2021

GPUs are an exception, but if we want good GPU performance, we need specialized GPU methods (probably involving intrinsics). As such, I don't think the existence of GPUs changes what we should do in Base (especially since the current method, which calls a C library, doesn't work on GPUs at all).

@KristofferC
Member

Yeah, that makes sense.

@oscardssmith
Member Author

Given this, can we merge?

@oscardssmith oscardssmith mentioned this pull request Apr 3, 2021
@oscardssmith oscardssmith merged commit 7d6dfe1 into JuliaLang:master Apr 8, 2021
@oscardssmith oscardssmith deleted the better-expm1 branch April 8, 2021 20:04