
Improve performance for FD multiplication: allow LLVM to optimize away the division by a constant. #43

Merged (1 commit, Dec 10, 2018)

Conversation

@NHDaly (Member) commented Dec 6, 2018

A large part of the runtime cost of `*(x::FD, y::FD)` comes from dividing the result of `widemul(x, y)` by the coefficient (`10^f`).

However, that coefficient is a known, compile-time static constant, so we should be able to replace that expensive division-by-a-constant with cheaper alternatives. LLVM can normally do this optimization, but in our code it's prevented from doing so because `fldmod` acts as a function barrier: if `fldmod` isn't inlined, it's compiled without any knowledge that one of its arguments is a compile-time constant.

This PR simply changes our multiplication (and `round` and `ceil`) to call a version of `fldmod` marked `@inline`, which allows LLVM to optimize the division-by-constant into a cheaper operation.
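For context, `fldmod(x, y)` returns the floored quotient and the remainder. A minimal Python sketch of its semantics (Python's `//`, `%`, and `divmod` also use floored division, so they match Julia's behavior for integers; this is an illustration, not the Julia implementation):

```python
# Sketch of what Julia's fldmod computes for integers:
# the floored quotient and the remainder (remainder takes the divisor's sign).
def fldmod(x: int, y: int) -> tuple[int, int]:
    return x // y, x % y  # floored quotient, remainder

assert fldmod(10404, 100) == (104, 4)
assert fldmod(-7, 100) == (-1, 93)   # floored, not truncated toward zero
assert fldmod(10404, 100) == divmod(10404, 100)
```

When `y` is a compile-time constant like `10^f`, a compiler can replace the division inside such a function with cheaper operations, but only if it can see the constant at the call site, which is what inlining enables.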

Before this PR, the call to `fldmod` takes around 80% of the time of `*` according to this profile:

[profile screenshot from statprofilehtml, Dec 6, 2018]

((2936+3+8275)/(13693+399) == 79.6%)

The above is generated from:

```julia
julia> using Profile, StatProfilerHTML, FixedPointDecimals
       function foo()
         out = FixedPointDecimals.FixedDecimal{Int32,2}(1.02);
         Profile.clear()
         @profile for i in 1:1_000_000_000; out = out * out; end
       end
       foo(); statprofilehtml()
```

After this PR, that number drops to 64%, which doesn't seem like much, but in my overall benchmark suite this change consistently improves the time for `*(::FD{Int32,2}, ::FD{Int32,2})` by ~70%! So I might be missing something in my profiling.

Here are my benchmark results after this change (and the `@eval` for `maxexp10(Int128)` change, in #41):

Each cell shows time (ns)* and (% improvement):

| Category | Type | `*` | `/` | `+` | `div` | `identity` |
|---|---|---|---|---|---|---|
| Int | Int32 | 0.35 (0%) | 3.86 (-2%) | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) |
| | Int64 | 0.35 (0%) | 3.93 (-4%) | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) |
| | Int128 | 0.39 (8%) | 13.28 (-3%) | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) |
| Float | Float32 | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) | 2.69 (0%) | 0.35 (0%) |
| | Float64 | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) | 2.69 (-4%) | 0.35 (0%) |
| FixedDecimal | FD{Int32,2} | 2.92 (72%) | 9.81 (-13%) | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) |
| | FD{Int64,2} | 16.61 (9%) | 21.94 (-4%) | 0.35 (0%) | 0.35 (88%) | 0.35 (0%) |
| | FD{Int128,2} | 2123.24 (82%) | 1986.91 (85%) | 0.35 (100%) | 558.76 (96%) | 5.40 (100%) |
| Big | BigFloat | 71.08 (0%) | 146.58 (-5%) | 49.14 (-16%) | 188.99 (-23%) | 0.35 (0%) |
| | BigInt | 228.66 (-13%) | 518.31 (-6%) | 196.80 (-8%) | 196.36 (2%) | 0.35 (0%) |

\* note: the times in each column are clamped to a minimum of 0.345 ns, which is the estimated time for a single clock cycle on my machine.
Comparisons are relative to benchmarks run on master. Those results are here: #37 (comment)

For the small numbers, it's fairly noisy (that 8% improvement in `*` for `Int128` is noise, for example), but for the larger numbers it seems fairly consistent between runs.

(Almost all of the improvements on the `FD{Int128,2}` line come from #41, since inlining `fldmod` doesn't seem to help when you're doing `BigInt` math.)

This allows LLVM to generate cheaper operations for dividing by a
constant power-of-ten.

On my machine, this drops the time for multiplying two
FixedDecimal{Int32,2} numbers from 10.30ns to 2.92ns, or around a 70%
improvement.
@NHDaly (Member, Author) commented Dec 6, 2018

Here's some more info on the optimization LLVM ends up performing:

The main idea is that it can replace the expensive division-by-a-constant with a multiplication by its inverse! That is:

- Logically, we are doing this calculation: `(a * b) / 10^f`
- Let's say `f = 2` and `T = Int64`, so we have: `(a*b) / 100`
- The idea behind the optimization is to avoid the division by multiplying by its inverse! That is, we'd like to instead say `a*b * (1/100)`. But of course, we can't represent `1/100` as an integer, so instead we multiply by `2^64/2^64`, giving us: `(a*b) * 2^64/100 / 2^64`
- Then, we can precompute `2^64/100` as a magic number, and we can replace the `/ 2^64` with a bitshift! This gives us: `((a*b) * CONSTANT) >> 64`

which is cheap and fast! :D Hooray! Here's LLVM in action:

```julia
julia> f(x) = x ÷ 100
f (generic function with 1 method)

julia> @code_native f(3)
        .section        __TEXT,__text,regular,pure_instructions
; Function f {
; Location: REPL[1]:1
; Function div; {
; Location: REPL[1]:1
        movabsq $-6640827866535438581, %rcx ## imm = 0xA3D70A3D70A3D70B
        movq    %rdi, %rax
        imulq   %rcx
        addq    %rdi, %rdx
        movq    %rdx, %rax
        shrq    $63, %rax
        sarq    $6, %rdx
        leaq    (%rdx,%rax), %rax
;}
        retq
        nopw    %cs:(%rax,%rax)
;}
```

So by inlining `fldmod`, we allow LLVM to make this optimization in FD multiplication.
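The arithmetic behind the trick can be sketched outside of Julia. Here is a Python illustration of the basic unsigned magic-number idea; note this is a simplified sketch, not the compiler's exact algorithm (real codegen, like the signed x86 output above, adds fixup steps for sign handling and full-range correctness):

```python
# Division by a constant via multiply-and-shift (unsigned sketch).
# Assumptions: divisor D = 100 (i.e. 10^f with f = 2), word size N = 64.
# The naive ceil(2^N / D) magic number shown here is only exact for
# dividends well below 2^64; compilers add fixups for the full range.

D = 100
N = 64
MAGIC = (1 << N) // D + 1  # ceil(2^64 / 100), the precomputed "inverse"

def div100(x: int) -> int:
    """Divide a (modest-sized) nonnegative x by 100 with one mul + one shift."""
    return (x * MAGIC) >> N

# Spot-check against true integer division:
for x in range(0, 1_000_000, 7):
    assert div100(x) == x // 100
assert div100(10404) == 104  # e.g. the product 102 * 102, divided by 10^2
```

The multiply-and-shift pair is far cheaper than a hardware divide, which is why exposing the constant divisor to LLVM (by inlining `fldmod`) pays off.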

@NHDaly (Member, Author) commented Dec 6, 2018

As I mentioned above, this improvement doesn't help for `FD{Int128}` at all. For that, we're working on a PR to perform manually the same optimization LLVM does above, so that we can use it for `Int128`s too and skip the division entirely. Through another trick, we can replace the `widemul(Int128, Int128)` by splitting each number into two `Int64`s and doing long-hand multiplication, which leaves us with two `Int128`s, one for each half of the "widened" number. The neat thing is that since we're going to shift by 128 bits anyway, we can just drop one of the two numbers to arrive at our final answer. :)

This lets us avoid `BigInt`s for `FD{Int128}`, which makes a big difference for performance. I'm cleaning up that code now and I'll send a PR when it's done.
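The split-multiplication idea described above can be sketched in Python (using explicit masking to mimic fixed-width integers). This is an unsigned illustration of the general technique, not FixedPointDecimals' actual code:

```python
# Long-hand 128x128 -> 256-bit multiplication via 64-bit halves.
# Each operand is split into two 64-bit limbs; four partial products are
# combined into the high and low 128-bit halves of the full product.
# If the next step shifts right by 128 bits, only the high half is needed.

MASK64 = (1 << 64) - 1

def widemul_hi_lo(a: int, b: int) -> tuple[int, int]:
    """Return (hi, lo): the two 128-bit halves of a * b, for 0 <= a, b < 2^128."""
    a_hi, a_lo = a >> 64, a & MASK64
    b_hi, b_lo = b >> 64, b & MASK64
    # Four 64x64 partial products, as in schoolbook multiplication.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # Sum the middle terms, carrying overflow into the high half.
    mid = (ll >> 64) + (lh & MASK64) + (hl & MASK64)
    lo = ((mid << 64) | (ll & MASK64)) & ((1 << 128) - 1)
    hi = hh + (lh >> 64) + (hl >> 64) + (mid >> 64)
    return hi, lo

a, b = 3 * 10**30, 7 * 10**25
hi, lo = widemul_hi_lo(a, b)
assert (hi << 128) | lo == a * b  # the halves reassemble the full product
assert hi == (a * b) >> 128       # shifting by 128 bits only ever needs `hi`
```

Since the subsequent `>> 128` discards `lo` entirely, an implementation can skip computing `lo` altogether, which is the "drop one of the two numbers" shortcut mentioned above.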

@codecov-io

Codecov Report

Merging #43 into master will increase coverage by <.01%.
The diff coverage is 100%.


```diff
@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
+ Coverage   98.82%   98.83%   +<.01%
==========================================
  Files           1        1
  Lines         170      171       +1
==========================================
+ Hits          168      169       +1
  Misses          2        2
```

| Impacted Files | Coverage Δ |
|---|---|
| src/FixedPointDecimals.jl | 98.83% <100%> (ø) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1768c58...c119821.

@TotalVerb (Collaborator) left a comment

LGTM. Can you file a bug on Julia for `fldmod` not being inlined? I think this is an example of a function that definitely should be.

@coveralls

Coverage Status

Coverage increased (+0.007%) to 98.83% when pulling c119821 on NHDaly:fldmod_inline into 1768c58 on JuliaMath:master.

@TotalVerb (Collaborator) commented

Thank you for your contributions, @NHDaly! They are well appreciated.

@NHDaly NHDaly deleted the fldmod_inline branch December 11, 2018 16:03
@NHDaly (Member, Author) commented Dec 18, 2018

> Thank you for your contributions, @NHDaly! They are well appreciated.

Thanks @TotalVerb. :) It's been a team effort over here at RelationalAI -- special thanks to @tveldhui, who originally had a lot of the ideas I've sent here and reviewed all my PRs. Sorry I haven't been crediting you directly, Todd!
