Faster int to float conversion #45656
Conversation
|
The repo itself is under the BSD-2 license, which should be fine. I might play with this a bit, looks fun to optimize! You can conditionally define the top-level interface by doing |
|
Found an old 32 bit intel mac and tried it there. Seems that the same method was faster there, i.e. the one supposedly needing fast 64 bit floats. Not sure what architectures it might be worse on? |
|
For all architectures relevant to us, this will be better. You only don't have |
|
Some benchmarks

Float64

Old

julia> @benchmark Float64(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 5.656 ns … 26.065 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.906 ns ┊ GC (median): 0.00%
Time (mean ± σ): 6.823 ns ± 1.896 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▄ ▅▄ ▅▂▃▃▂▅▄▁ ▂
██████▇█████████▇▇▇▇██▇██▇█▇▇▇▇▇█▇█▇███▇▇▇█▆▆▆▆▆▇▆▆▆▆▆▅▅▆▄ █
5.66 ns Histogram: log(frequency) by time 14.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark Float64(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 5.040 ns … 27.323 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.571 ns ┊ GC (median): 0.00%
Time (mean ± σ): 5.958 ns ± 1.232 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆ █ ▄ ▅ ▄▂ ▄ ▅ ▂ ▇ ▂
███▇█▁█▄██▁█▄▁█▁▃█▅▃██▅▆▅▆▄▆▅▅▅▆▆▆▆▇▇▇█▇██████▇▇▇▆▇▇▆▆▆▆▅▇ █
5.04 ns Histogram: log(frequency) by time 9.85 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

New

julia> @benchmark Float64(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 3.898 ns … 19.335 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.330 ns ┊ GC (median): 0.00%
Time (mean ± σ): 4.528 ns ± 0.664 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▆▂ ▆ █ ▄
█▂▂▂▂██▅▃▂▃▇█▂▄▂▃█▅▂▂▂▂▂█▂▂▂▂▁▃▄▂▁▁▂▁▂█▂▂▂▁▁▁▃█▂▂▂▂▂▂▁█▆▂▃ ▃
3.9 ns Histogram: frequency by time 5.35 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark Float64(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 4.419 ns … 31.624 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.911 ns ┊ GC (median): 0.00%
Time (mean ± σ): 5.214 ns ± 1.021 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▂
▃▂█▂▅▂▄▂▃▄▂▄▂▂▃▂▂▅▂▂█▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂ ▂
4.42 ns Histogram: frequency by time 8.64 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

Float32

Old

julia> @benchmark Float32(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 5.047 ns … 30.022 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.272 ns ┊ GC (median): 0.00%
Time (mean ± σ): 6.009 ns ± 1.828 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
██▆▅▄▂▄▄▁ ▇▁ ▁▁ ▁ ▂
█████████▇▃██▇▃▄▄▃▄▅▆▅▇▆▇▆▅▄▅▄▅▅▃▅▅▆▆▅▅▅▆▅▅▅▄▁▅▃▅▄▆███████ █
5.05 ns Histogram: log(frequency) by time 13.9 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark Float32(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 5.046 ns … 20.603 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.554 ns ┊ GC (median): 0.00%
Time (mean ± σ): 5.830 ns ± 0.835 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅ █ ▂
█▂▂▄█▂▂▆▆▂▂█▅▂▃▃▃▃▃▂▃▃▃▂▂▃▃▄▂▂▃▂▃▃▂▂▃▂▅█▂▂▂▂▂▁▁▂▆▂▂▂▂▂▂▂▂▂ ▃
5.05 ns Histogram: frequency by time 7.58 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

New

julia> @benchmark Float32(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 4.413 ns … 19.153 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.741 ns ┊ GC (median): 0.00%
Time (mean ± σ): 5.135 ns ± 0.821 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▃
▃▂█▂▂▆▂▂▄▂▂▃▂▂▄▂▂▃▂▂▂▃▁▁▂█▂▂▁▁▁▂▂▁▂▁▁▂▁▁▂▂▂▂▂▁▂▁▂▁▁▂▂▂▂▂▂▂ ▂
4.41 ns Histogram: frequency by time 7.82 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark Float32(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 4.719 ns … 19.826 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.051 ns ┊ GC (median): 0.00%
Time (mean ± σ): 5.481 ns ± 1.033 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▃ ▆
███▆▁▄▁▅▁▁▂▁▅▁▂▂▁█▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
4.72 ns Histogram: frequency by time 9.93 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

Float16

This one could probably be improved further; the only change for now is doing the partial conversion through Float64 instead of Float32, since that was slightly faster.

Old

julia> @benchmark Float16(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 7.018 ns … 25.547 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.532 ns ┊ GC (median): 0.00%
Time (mean ± σ): 8.110 ns ± 1.289 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▁
▄▂█▆▆█▂▄▂▂▄▂▂▃▂▃▃▂▂▄▂▂▃▂▂▂█▄▂▂▁▂▁▁▁▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
7.02 ns Histogram: frequency by time 12.9 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark Float16(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 8.314 ns … 39.060 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 10.474 ns ┊ GC (median): 0.00%
Time (mean ± σ): 11.278 ns ± 2.858 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▅▃▃▁▆▂▄ ▃▂█▅▃▅ ▁ ▁ ▁ ▁▁ ▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁ ▂
█████████████████▆▆▆▆▆▆▆▇▇▇▇██▇████████████████████████▇▇▆▆ █
8.31 ns Histogram: log(frequency) by time 19.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.

New

julia> @benchmark Float16(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 6.733 ns … 23.785 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 7.132 ns ┊ GC (median): 0.00%
Time (mean ± σ): 7.716 ns ± 1.211 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇ ▅▅ ▁
██▅██▇▆▆▄▂▃▃▂▂▃▃▂▂▂▂▂▁▁▃▂▁▂▁▃▂▁▁▁█▆▃▃▃▂▃▃▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
6.73 ns Histogram: frequency by time 10.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark Float16(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
Range (min … max): 7.538 ns … 24.031 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 8.076 ns ┊ GC (median): 0.00%
Time (mean ± σ): 8.522 ns ± 1.282 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█ ▆ █
█▂▇█▁█▃▂▅▂▂▄▂▂▆▂▁▂▃▂▂▃▂▂▂▄▂▂▁▆▇▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂ ▃
7.54 ns Histogram: frequency by time 13.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
|
…albheim/int_to_float_conv
|
Played around a little with a Float32 version similar to the Float64 one, but that seems much trickier to get working well, if it is even possible. Since the mantissa is smaller, it requires either more numbers or more if-cases, and neither performed well in the small tests I did. I think the Float64 path is the one where the gain might be noticeable. There are some errors, but they seem to be related to something going on with the |
|
Welcome, and thank you! Very cool first PR. |
base/float.jl
Outdated
l + h
if x < UInt128(1) << 104 # Can fit in two 52 bits mantissa
    low_exp = 4.503599627370496e15 # 2.0^52
    high_exp = 2.028240960365167e31 # 2.0^104
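The excerpt above is the core of the trick: any x below 2^104 splits into a low 52-bit chunk and a high chunk that each fit a Float64 mantissa exactly, so two exact partial values plus one correctly rounded addition give the result. A self-contained sketch of how those constants get used, with illustrative names (the exact merged code may differ):

```julia
# Sketch of the two-limb UInt128 -> Float64 conversion discussed in this
# thread. Assumes x < 2^104 so both chunks fit a 52-bit mantissa exactly.
function two_limb_to_float64(x::UInt128)
    @assert x < UInt128(1) << 104
    low_exp  = 0x1p52   # 2.0^52
    high_exp = 0x1p104  # 2.0^104
    low_bits  = (x % UInt64) & Base.significand_mask(Float64)  # low 52 bits
    high_bits = (x >> 52) % UInt64                             # remaining <= 52 bits
    # OR each chunk into the mantissa of 2^52 / 2^104, then subtract the
    # anchor constant back out; both partial values are exact.
    low_value  = reinterpret(Float64, reinterpret(UInt64, low_exp)  | low_bits)  - low_exp
    high_value = reinterpret(Float64, reinterpret(UInt64, high_exp) | high_bits) - high_exp
    return low_value + high_value  # single correctly rounded addition
end
```

Because the final addition of two exact doubles is correctly rounded in IEEE 754, the whole conversion rounds correctly for this range.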
0x1p104
Never seen this syntax before, but it could be nice.
I was a bit disappointed that I couldn't write 2.0^52, since the compiler does not constant-fold the calculation in the most recent versions of Julia, though in 1.7.3 it does. Not sure if this is something to report as a regression or if it is known and there for a reason?
1.7.3
julia> f(x) = 2.0^52+x
f (generic function with 1 method)
julia> @code_llvm f(2.0)
; @ REPL[1]:1 within `f`
define double @julia_f_392(double %0) #0 {
top:
; ┌ @ float.jl:399 within `+`
%1 = fadd double %0, 0x4330000000000000
; └
ret double %1
}

1.8.0-beta
julia> f(x) = 2.0^52+x
f (generic function with 1 method)
julia> @code_llvm f(2.0)
; @ REPL[1]:1 within `f`
define double @julia_f_637(double %0) #0 {
top:
; ┌ @ intfuncs.jl:326 within `literal_pow`
; │┌ @ math.jl:1037 within `^`
%1 = call double @j_pow_body_639(double 2.000000e+00, i64 signext 52) #0
; └└
; ┌ @ float.jl:383 within `+`
%2 = fadd double %1, %0
; └
ret double %2
}
|
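For reference, the hexadecimal float literal suggested above (0x1p104) is parsed directly to the exact constant at parse time, so it sidesteps the runtime pow_body call seen in the 1.8.0-beta LLVM dump entirely:

```julia
# Hex float literals: 0x1pN denotes the Float64 value 2^N, fixed at parse time.
@assert 0x1p52 == 4.503599627370496e15   # same value as the 2.0^52 constant above
@assert 0x1p104 == 2.028240960365167e31  # same value as the 2.0^104 constant above
@assert reinterpret(UInt64, 0x1p52) == 0x4330000000000000  # bare exponent, zero mantissa
```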
Played around a little more with the branchless Float32 version, and compared to the current version it is sometimes faster and sometimes slower. The branchless one is very consistent at around 4.7 ns, while the current version varies between roughly 4 ns and 5 ns. I think this comes down to how well branch prediction is doing, though when I tried to force really bad branch prediction with a 50/50 mix of numbers above/below 2^24, it was actually faster. Testing with numbers always above 2^24 I get the 5 ns time, and with numbers always below I get 3.7 ns, so I guess it mostly comes down to the number of instructions on the different paths. Though I find it strange that random UInt128 inputs still give 4 ns, since a random UInt128 should be larger than 2^24 essentially all the time and so should land at 5 ns. Might be something completely different here that I'm missing...

The branchless version has a total of 28 LLVM instructions, while the current one has 4 instructions if x == 0, 21 instructions if x < 2^24, and 30 instructions if x >= 2^24. So which is best may depend on how much consistency is valued, and on whether we expect UInt128 values to be larger or smaller than 2^24 on average. In general, though, the current version was faster more often in my tests, so I would lean towards keeping it. I also thought the branchless version might be more parallelizable, so that broadcasting the conversion over an array could be faster, but that was also slower in the couple of tests I ran.

Otherwise I feel pretty done with the Float64 version, and Float16 now uses Float64 as the intermediate instead, since that is faster than Float32 at least on the computers I tested it on (one 64-bit Intel and one 32-bit Intel). |
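The 50/50 branch-stress setup described above can be reproduced with a small generator like this (names are illustrative; with BenchmarkTools loaded, pass it as `@benchmark Float32(x) setup=(x=mixed())`):

```julia
# Generator that defeats branch prediction for the x < 2^24 test:
# half the draws fall below the cutoff, half land at or above it.
const CUTOFF = UInt128(1) << 24

mixed() = rand(Bool) ? rand(UInt128(0):CUTOFF - UInt128(1)) :
                       rand(CUTOFF:typemax(UInt128))
```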
|
What's the status here, is it ready to go? |
|
I think this is mostly good to go, but I would like to see a set of test cases to make sure it gives the right answer. Specifically, testing |
I can try to have a look at that later today. |
|
Didn't really know what tests to put in, but

julia> function f(n, k=0)
a = (Int128(1) << n + 1) << k
b = Float64(a)
println("@test Float64($a) == $b")
end
f (generic function with 2 methods)
julia> for n in [52:55; 100:110]
k = rand(0:10)
f(n, k)
end
...and pasted into the tests. Maybe unnecessarily many? |
"""
    uabs(x::Integer)

Return the absolute value of `x`, possibly returning a different type should the
operation be susceptible to overflow. This typically arises when `x` is a two's complement
signed integer, so that `abs(typemin(x)) == typemin(x) < 0`, in which case the result of
`uabs(x)` will be an unsigned integer of the same size.
"""
uabs(x::Integer) = abs(x)
uabs(x::BitSigned) = unsigned(abs(x))
|
Had to move this up for it to compile Julia correctly.
After switching to using uabs(x) instead of unsigned(abs(x)), that is. Could also just go back, but since the function was already defined it felt like I might as well use it.
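As a quick illustration of why the docstring's overflow caveat matters (standard two's complement behaviour, checkable in any session):

```julia
# abs overflows at typemin: the magnitude 2^63 has no Int64 representation,
# so abs wraps back to typemin. Reinterpreting as unsigned recovers the
# true magnitude, which is exactly what uabs(x) = unsigned(abs(x)) relies on.
@assert abs(typemin(Int64)) == typemin(Int64)          # silent overflow
@assert unsigned(abs(typemin(Int64))) == UInt64(2)^63  # correct magnitude
```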
|
The 52-55 range is to check around the mantissa length for a 64-bit float, and the 100-110 range is to check both cases in the algorithm, since the cutoff is 104? |
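Concretely, the 53-bit boundary is where integer-to-Float64 conversion starts to round, which is why tests there are worth having (values easy to verify in a REPL):

```julia
# Every integer up to 2^53 is exactly representable as Float64;
# 2^53 + 1 is the first that is not, and sits exactly halfway between
# two representable neighbours, so it rounds to the even mantissa.
a = (Int128(1) << 53) + 1
@assert Float64(a) == 2.0^53        # halfway case: rounds down to 2^53
b = (Int128(1) << 53) + 2
@assert Float64(b) == 2.0^53 + 2.0  # exactly representable again
```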
|
Do these tests cover enough @oscardssmith, or should I add a couple more for the lsb/msb gaps? Otherwise I think it is ready. The failed tests seem (to me) to be unrelated: just before this run it passed all tests, with the only changes since being merging master and modifying some of the tests for this PR (which all pass). |
|
Thanks for pushing this through! |
Brought this up on discourse a few days ago and said I would try to put something together.
Finally got to it, and for now I have only tried it for:
- UInt128 -> Float64 - faster
- Int128 -> Float64 - faster
- UInt128 -> Float32 - faster
- Int128 -> Float32 - slower, so I didn't include it for now.

The fastest version on my computer is the one that seems to need fast 64-bit floating-point operations, and I'm not sure if there is an easy way to check for that and use different methods depending on that availability?
I don't have any other hardware that I know of to test the difference on other platforms, so not sure what to do there.
Not sure how the licensing works here; the implementation is based on both the code and information in the blog post and the code in the twitter thread, as well as a look at the existing implementations in this repo.