Conversation

@albheim
Contributor

@albheim albheim commented Jun 12, 2022

Brought this up on discourse a few days ago and said I would try to put something together.

Finally got to it, and for now I have only tried it for:
UInt128 -> Float64 - faster
Int128 -> Float64 - faster
UInt128 -> Float32 - faster
Int128 -> Float32 - slower, so I didn't include it for now.

The fastest version on my computer is the one that seems to need fast 64-bit floating-point operations, and I'm not sure if there is an easy way to check for that and pick a method based on availability.
I don't have any other hardware to test the difference on, so I'm not sure what to do about other platforms.

I'm not sure how the licensing works here; the implementation is based on the code and information in the blog post, the code in the Twitter thread, and a look at the implementations in this repo.
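For reference, the core trick these conversions rely on can be sketched like this (a simplified illustration of the x < 2^104 case only, with a made-up helper name — not the exact code in this PR):

```julia
# Simplified sketch (hypothetical helper, x < 2^104 case only): split the
# integer into two 52-bit-wide pieces, each exactly representable in a
# Float64, so the final addition is the only rounding step.
function u128_to_f64_small(x::UInt128)
    l = Float64(UInt64(x & 0xfffffffffffff))  # low 52 bits, exact
    h = Float64(UInt64(x >> 52))              # remaining (<= 52) bits, exact
    return h * 0x1p52 + l                     # h * 2^52 is exact; one rounding
end
```

Since both pieces are exact and multiplying by a power of two is exact, the single rounding happens in the final addition, which matches a correctly rounded conversion for inputs below 2^104.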

@Seelengrab
Contributor

The repo itself is under the BSD-2 license, which should be fine. I might play with this a bit, looks fun to optimize!

You can conditionally define the top-level interface by doing @static if Sys.ARCH == ....
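A minimal sketch of that (the constant name and the branch condition are made up, nothing from Base or this PR):

```julia
# Sketch of architecture-gated definitions at load time. `HAS_FAST_F64`
# and the chosen architectures are hypothetical placeholders.
@static if Sys.ARCH === :x86_64 || Sys.ARCH === :aarch64
    const HAS_FAST_F64 = true   # assume fast hardware Float64 here
else
    const HAS_FAST_F64 = false  # take a fallback path elsewhere
end
```

`@static` evaluates the condition at load time, so only the matching branch's definitions ever exist.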

@albheim
Contributor Author

albheim commented Jun 12, 2022

Found an old 32-bit Intel Mac and tried it there. The same method was faster there too, i.e. the one supposedly needing fast 64-bit floats. Not sure what architectures it might be worse on?

@oscardssmith
Member

For all architectures relevant to us, this will be better; hardware Float64 is only missing on embedded platforms.

@albheim
Contributor Author

albheim commented Jun 12, 2022

Some benchmarks

Float64

Old

julia> @benchmark Float64(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  5.656 ns … 26.065 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.906 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.823 ns ±  1.896 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇▄ ▅▄ ▅▂▃▃▂▅▄▁                                            ▂
  ██████▇█████████▇▇▇▇██▇██▇█▇▇▇▇▇█▇█▇███▇▇▇█▆▆▆▆▆▇▆▆▆▆▆▅▅▆▄ █
  5.66 ns      Histogram: log(frequency) by time     14.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Float64(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  5.040 ns … 27.323 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.571 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.958 ns ±  1.232 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆ █ ▄ ▅ ▄▂ ▄  ▅  ▂  ▇                                      ▂
  ███▇█▁█▄██▁█▄▁█▁▃█▅▃██▅▆▅▆▄▆▅▅▅▆▆▆▆▇▇▇█▇██████▇▇▇▆▇▇▆▆▆▆▅▇ █
  5.04 ns      Histogram: log(frequency) by time     9.85 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

New

julia> @benchmark Float64(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.898 ns … 19.335 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.330 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.528 ns ±  0.664 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    ▆▂     ▆                                 █       ▄     
  █▂▂▂▂██▅▃▂▃▇█▂▄▂▃█▅▂▂▂▂▂█▂▂▂▂▁▃▄▂▁▁▂▁▂█▂▂▂▁▁▁▃█▂▂▂▂▂▂▁█▆▂▃ ▃
  3.9 ns         Histogram: frequency by time        5.35 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Float64(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  4.419 ns … 31.624 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.911 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.214 ns ±  1.021 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █                 ▂                                       
  ▃▂█▂▅▂▄▂▃▄▂▄▂▂▃▂▂▅▂▂█▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂ ▂
  4.42 ns        Histogram: frequency by time        8.64 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Float32

Old

julia> @benchmark Float32(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  5.047 ns … 30.022 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.272 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.009 ns ±  1.828 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ██▆▅▄▂▄▄▁  ▇▁                                       ▁▁   ▁ ▂
  █████████▇▃██▇▃▄▄▃▄▅▆▅▇▆▇▆▅▄▅▄▅▅▃▅▅▆▆▅▅▅▆▅▅▅▄▁▅▃▅▄▆███████ █
  5.05 ns      Histogram: log(frequency) by time     13.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Float32(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  5.046 ns … 20.603 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.554 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.830 ns ±  0.835 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅   █                                  ▂                    
  █▂▂▄█▂▂▆▆▂▂█▅▂▃▃▃▃▃▂▃▃▃▂▂▃▃▄▂▂▃▂▃▃▂▂▃▂▅█▂▂▂▂▂▁▁▂▆▂▂▂▂▂▂▂▂▂ ▃
  5.05 ns        Histogram: frequency by time        7.58 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

New

julia> @benchmark Float32(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  4.413 ns … 19.153 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.741 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.135 ns ±  0.821 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █                      ▃                                  
  ▃▂█▂▂▆▂▂▄▂▂▃▂▂▄▂▂▃▂▂▂▃▁▁▂█▂▂▁▁▁▂▂▁▂▁▁▂▁▁▂▂▂▂▂▁▂▁▂▁▁▂▂▂▂▂▂▂ ▂
  4.41 ns        Histogram: frequency by time        7.82 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Float32(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  4.719 ns … 19.826 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.051 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.481 ns ±  1.033 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█▃              ▆                                          
  ███▆▁▄▁▅▁▁▂▁▅▁▂▂▁█▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  4.72 ns        Histogram: frequency by time        9.93 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Float16

There is probably more that could be done for this one; the only change for now is to do the partial conversion through Float64 instead of Float32, since that was slightly faster.

Old

julia> @benchmark Float16(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.018 ns … 25.547 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.532 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.110 ns ±  1.289 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    █  ▁                                                      
  ▄▂█▆▆█▂▄▂▂▄▂▂▃▂▃▃▂▂▄▂▂▃▂▂▂█▄▂▂▁▂▁▁▁▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  7.02 ns        Histogram: frequency by time        12.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Float16(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):   8.314 ns … 39.060 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.474 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.278 ns ±  2.858 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▅▃▃▁▆▂▄ ▃▂█▅▃▅             ▁  ▁ ▁ ▁▁ ▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁       ▂
  █████████████████▆▆▆▆▆▆▆▇▇▇▇██▇████████████████████████▇▇▆▆ █
  8.31 ns      Histogram: log(frequency) by time      19.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

New

julia> @benchmark Float16(x) setup=(x=rand(UInt128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  6.733 ns … 23.785 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.132 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.716 ns ±  1.211 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▇ ▅▅                            ▁                          
  ██▅██▇▆▆▄▂▃▃▂▂▃▃▂▂▂▂▂▁▁▃▂▁▂▁▃▂▁▁▁█▆▃▃▃▂▃▃▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  6.73 ns        Histogram: frequency by time        10.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark Float16(x) setup=(x=rand(Int128))
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min … max):  7.538 ns … 24.031 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.076 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.522 ns ±  1.282 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  ▆ █                                                      
  █▂▇█▁█▃▂▅▂▂▄▂▂▆▂▁▂▃▂▂▃▂▂▂▄▂▂▁▆▇▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂ ▃
  7.54 ns        Histogram: frequency by time        13.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
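The Float16 change above amounts to routing the conversion through Float64; roughly (sketch with a made-up name, not the exact Base method):

```julia
# Sketch of the Float16 route described above: convert via Float64.
# The double rounding is harmless here because Float64's 53-bit significand
# is far wider than twice Float16's 11 bits.
f16_via_f64(x::Union{UInt128,Int128}) = Float16(Float64(x))
```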

@oscardssmith oscardssmith added maths Mathematical functions performance Must go faster labels Jun 14, 2022
@albheim albheim marked this pull request as ready for review June 15, 2022 09:24
@albheim
Contributor Author

albheim commented Jun 15, 2022

Played around a little with a Float32 version similar to the Float64 one, but that seems much trickier to get working well, if it is possible at all. Since the mantissa is smaller, it requires either more numbers or more if cases, and neither performed well in the small tests I did. I think the Float64 one is the change that might be noticeable.

There are some errors, but they seem to be related to something going on with the range function, and I haven't found how that would be connected to this. Though I can't say I have looked at it very carefully.
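To put numbers on why the smaller mantissa hurts (a back-of-the-envelope count using 52- and 24-bit exact chunks, not anything from the PR's code):

```julia
# A 128-bit integer splits into ceil(128/52) = 3 chunks that are each exact
# in a Float64, but ceil(128/24) = 6 chunks that are each exact in a
# Float32 — so a branchless Float32 version needs more terms or more cases.
f64_chunks = cld(128, 52)
f32_chunks = cld(128, 24)
```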

@JeffBezanson
Member

Welcome, and thank you! Very cool first PR.

base/float.jl Outdated
l + h
if x < UInt128(1) << 104 # Can fit in two 52 bits mantissa
low_exp = 4.503599627370496e15 # 2.0^52
high_exp = 2.028240960365167e31 # 2.0^104
Member


0x1p104

Contributor Author


Never seen this syntax, but could be nice.

Was a bit disappointed that I couldn't write 2.0^52, since it seems the compiler does not fold that calculation in the most recent versions of Julia, while 1.7.3 does. Not sure if this is something to report as a regression, or if it is known and there for a reason?

1.7.3

julia> f(x) = 2.0^52+x
f (generic function with 1 method)

julia> @code_llvm f(2.0)
;  @ REPL[1]:1 within `f`
define double @julia_f_392(double %0) #0 {
top:
; ┌ @ float.jl:399 within `+`
   %1 = fadd double %0, 0x4330000000000000
; └
  ret double %1
}

1.8.0-beta

julia> f(x) = 2.0^52+x
f (generic function with 1 method)

julia> @code_llvm f(2.0)
;  @ REPL[1]:1 within `f`
define double @julia_f_637(double %0) #0 {
top:
; ┌ @ intfuncs.jl:326 within `literal_pow`
; │┌ @ math.jl:1037 within `^`
    %1 = call double @j_pow_body_639(double 2.000000e+00, i64 signext 52) #0
; └└
; ┌ @ float.jl:383 within `+`
   %2 = fadd double %1, %0
; └
  ret double %2
}
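For what it's worth, the hex float literal sidesteps the `literal_pow` question entirely, since the constant is spelled out directly:

```julia
# Hex float literals (C99-style, also valid Julia) encode exact powers of
# two directly, with no runtime `^` call involved.
low_exp  = 0x1p52    # == 2.0^52  == 4.503599627370496e15
high_exp = 0x1p104   # == 2.0^104 == 2.028240960365167e31
```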

@albheim
Contributor Author

albheim commented Jun 28, 2022

Played around a little more with the branchless Float32 version; compared to the current version it is sometimes faster and sometimes slower. The branchless one is very consistent at around 4.7 ns, while the current version varies between roughly 4 ns and 5 ns. I think this comes down to how well branch prediction is doing, though when I tried to force bad branch prediction with a 50/50 chance of a number above/below 2^24, it was actually faster.

Testing with numbers always above 2^24 I get the 5 ns time, and with numbers always below I get 3.7 ns, so I guess it mostly comes down to the number of instructions on the different paths. Though I find it strange that picking random UInt128 values still gives 4 ns, since a random UInt128 should be larger than 2^24 essentially always and so should take 5 ns. There might be something else going on here that I'm missing.

The branchless version is 28 LLVM instructions in total, while the current one is 4 instructions if x == 0, 21 instructions if x < 2^24, and 30 instructions if x >= 2^24.

So what is best may depend on how much consistency is valued, and on whether we expect UInt128 values to be larger or smaller than 2^24 on average. In general during my tests the current version was faster more often, so I would lean towards keeping it.

I also thought the branchless version might parallelize better, so broadcasting the conversion over an array might be faster, but that was also slower in the few tests I ran.

Otherwise I feel pretty done with the Float64 version, and Float16 now uses Float64 as the intermediate, since that is faster than Float32 at least on the computers I tested (one 64-bit Intel and one 32-bit Intel).
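The mixed-input experiment described above can be set up along these lines (`mixed` is a made-up helper; the exact distribution used may have differed):

```julia
# 50/50 mix of values below and above the 2^24 cutoff, intended to defeat
# branch prediction in the branchy Float32 conversion.
mixed() = rand(Bool) ? rand(UInt128(0):(UInt128(1) << 24 - 1)) :
                       rand(UInt128) | (UInt128(1) << 24)

# then, with BenchmarkTools loaded:
# @benchmark Float32(x) setup = (x = mixed())
```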

@KristofferC
Member

What's the status here, is it ready to go?

@oscardssmith
Member

I think this is mostly good to go, but I would like to see a set of test cases to make sure it gives the right answer. Specifically, testing typemin and cases where there are 52 to 55, and 100 to 110 bits between the msb and lsb of the input would probably be a good idea.

@albheim
Contributor Author

albheim commented Sep 1, 2022

> I think this is mostly good to go, but I would like to see a set of test cases to make sure it gives the right answer. Specifically, testing typemin and cases where there are 52 to 55, and 100 to 110 bits between the msb and lsb of the input would probably be a good idea.

I can try to have a look at that later today.

@albheim
Contributor Author

albheim commented Sep 1, 2022

Didn't really know what tests to put in, but typemin was already there with some others, and for the msb/lsb gaps I just generated a few with this (using Float64 from Julia 1.8, assuming that can be trusted to be correct)

julia> function f(n, k=0)
       a = (Int128(1) << n + 1) << k
       b = Float64(a)
       println("@test Float64($a) == $b")
       end
f (generic function with 2 methods)

julia> for n in [52:55; 100:110]
       k = rand(0:10)
       f(n, k)
       end
...

and pasted the output into the tests. Maybe that's unnecessarily many?

Comment on lines +138 to +148
"""
uabs(x::Integer)
Return the absolute value of `x`, possibly returning a different type should the
operation be susceptible to overflow. This typically arises when `x` is a two's complement
signed integer, so that `abs(typemin(x)) == typemin(x) < 0`, in which case the result of
`uabs(x)` will be an unsigned integer of the same size.
"""
uabs(x::Integer) = abs(x)
uabs(x::BitSigned) = unsigned(abs(x))

Contributor Author


Had to move this up for it to compile Julia correctly.

Contributor Author


That is, after switching to uabs(x) instead of unsigned(abs(x)). Could also just go back, but since the function was already defined it felt like I might as well use it.

@albheim
Contributor Author

albheim commented Sep 2, 2022

The 52-55 range checks around the mantissa length of a 64-bit float, and 100-110 checks both cases in the algorithm, since the cutoff is 104.
I was going to add a couple more tests and make sure they cover both signed and unsigned, but it felt like a lot, so I wanted to add fewer while making sure to hit the cases you were after.

@albheim
Contributor Author

albheim commented Sep 6, 2022

Do these tests cover enough @oscardssmith, or should I add a couple more for the lsb/msb gaps?

Otherwise I think it is ready. The failed tests seem (to me) to be unrelated; just before this run it passed all tests, with the only changes since being a merge of master and some modifications to this PR's tests (which all pass).

@oscardssmith oscardssmith merged commit 639a4ff into JuliaLang:master Sep 6, 2022
@oscardssmith
Member

Thanks for pushing this through!

@albheim albheim deleted the albheim/int_to_float_conv branch September 6, 2022 13:26