
Improve performance for FD multiplication: allow LLVM to optimize away the division by a constant. #43

Merged (1 commit, Dec 10, 2018)

Conversation

@NHDaly (Member) commented Dec 6, 2018

A large part of the runtime cost of `*(x::FD, y::FD)` comes from dividing the result of `widemul(x, y)` by the coefficient (`10^f`).

However, that coefficient is a known, compile-time static constant, so we should be able to replace that expensive division-by-a-constant with cheaper alternatives. LLVM can normally do this optimization, but in our code it's prevented from doing so because `fldmod` acts as a function barrier: if `fldmod` isn't inlined, it's compiled without any knowledge that one of its arguments is a compile-time constant.

This PR simply changes our multiplication (and `round` and `ceil`) to call a version of `fldmod` marked `@inline`, which allows LLVM to optimize the division-by-constant into a cheaper operation.
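For context, `fldmod(x, y)` returns the floored quotient and the remainder. A minimal Python sketch of its semantics (Python's `//`, `%`, and `divmod` also use floored division, so they match Julia's behavior for integers; this is an illustration, not the Julia implementation):

```python
# Sketch of what Julia's fldmod computes for integers:
# the floored quotient and the remainder (remainder takes the divisor's sign).
def fldmod(x: int, y: int) -> tuple[int, int]:
    return x // y, x % y  # floored quotient, remainder

assert fldmod(10404, 100) == (104, 4)
assert fldmod(-7, 100) == (-1, 93)   # floored, not truncated toward zero
assert fldmod(10404, 100) == divmod(10404, 100)
```

When `y` is a compile-time constant like `10^f`, a compiler can replace the division inside such a function with cheaper operations, but only if it can see the constant at the call site, which is what inlining enables.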

Before this PR, the call to `fldmod` takes around 80% of the time of `*` according to this profile:

[profile screenshot from statprofilehtml, Dec 6, 2018]

((2936+3+8275)/(13693+399) == 79.6%)

The above is generated from:

```julia
julia> using Profile, StatProfilerHTML, FixedPointDecimals
       function foo()
         out = FixedPointDecimals.FixedDecimal{Int32,2}(1.02);
         Profile.clear()
         @profile for i in 1:1_000_000_000; out = out * out; end
       end
       foo(); statprofilehtml()
```

After this PR, that number drops to 64%, which doesn't seem like much, but in my overall benchmark suite this change consistently improves the time for `*(::FD{Int32,2}, ::FD{Int32,2})` by ~70%! So I might be missing something in my profiling.

Here are my benchmark results after this change (and the `@eval` for `maxexp10(Int128)` change, in #41):

Each cell shows time (ns)* and (% improvement):

| Category | Type | `*` | `/` | `+` | `div` | `identity` |
|---|---|---|---|---|---|---|
| Int | Int32 | 0.35 (0%) | 3.86 (-2%) | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) |
| | Int64 | 0.35 (0%) | 3.93 (-4%) | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) |
| | Int128 | 0.39 (8%) | 13.28 (-3%) | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) |
| Float | Float32 | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) | 2.69 (0%) | 0.35 (0%) |
| | Float64 | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) | 2.69 (-4%) | 0.35 (0%) |
| FixedDecimal | FD{Int32,2} | 2.92 (72%) | 9.81 (-13%) | 0.35 (0%) | 0.35 (0%) | 0.35 (0%) |
| | FD{Int64,2} | 16.61 (9%) | 21.94 (-4%) | 0.35 (0%) | 0.35 (88%) | 0.35 (0%) |
| | FD{Int128,2} | 2123.24 (82%) | 1986.91 (85%) | 0.35 (100%) | 558.76 (96%) | 5.40 (100%) |
| Big | BigFloat | 71.08 (0%) | 146.58 (-5%) | 49.14 (-16%) | 188.99 (-23%) | 0.35 (0%) |
| | BigInt | 228.66 (-13%) | 518.31 (-6%) | 196.80 (-8%) | 196.36 (2%) | 0.35 (0%) |

\* note: the times in each column are clamped to a minimum of 0.345 ns, which is the estimated time for a single clock cycle on my machine.
Comparisons are relative to benchmarks run on master. Those results are here: #37 (comment)

For the small numbers, it's fairly noisy (that 8% improvement in `*` for `Int128` is noise, for example), but for the larger numbers it seems fairly consistent between runs.

(Almost all of the improvements on the `FD{Int128,2}` line come from #41, since inlining `fldmod` doesn't seem to help when you're doing `BigInt` math.)

This allows LLVM to generate cheaper operations for dividing by a
constant power-of-ten.

On my machine, this drops the time for multiplying two
FixedDecimal{Int32,2} numbers from 10.30ns to 2.92ns, or around a 70%
improvement.
@NHDaly (Member, Author) commented Dec 6, 2018

Here's some more info on the optimization LLVM ends up performing:

The main idea is that it can replace the expensive division-by-a-constant with a multiplication by its inverse! That is:

- Logically, we are doing this calculation: `(a * b) / 10^f`
- Let's say `f = 2` and `T = Int64`, so we have: `(a*b) / 100`
- The idea behind the optimization is to avoid the division by multiplying by its inverse! That is, we'd like to instead say `a*b * (1/100)`. But of course, we can't represent `1/100` as an integer, so instead we multiply by `2^64/2^64`, giving us: `(a*b) * 2^64/100 / 2^64`
- Then, we can precompute `2^64/100` as a magic number, and we can replace the `/ 2^64` with a bitshift! This gives us: `((a*b) * CONSTANT) >> 64`

which is cheap and fast! :D Hooray! Here's LLVM in action:

```julia
julia> f(x) = x ÷ 100
f (generic function with 1 method)

julia> @code_native f(3)
        .section        __TEXT,__text,regular,pure_instructions
; Function f {
; Location: REPL[1]:1
; Function div; {
; Location: REPL[1]:1
        movabsq $-6640827866535438581, %rcx ## imm = 0xA3D70A3D70A3D70B
        movq    %rdi, %rax
        imulq   %rcx
        addq    %rdi, %rdx
        movq    %rdx, %rax
        shrq    $63, %rax
        sarq    $6, %rdx
        leaq    (%rdx,%rax), %rax
;}
        retq
        nopw    %cs:(%rax,%rax)
;}
```

So by inlining `fldmod`, we allow LLVM to make this optimization in FD multiplication.
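The arithmetic behind the trick can be sketched outside of Julia. Here is a Python illustration of the basic unsigned magic-number idea; note this is a simplified sketch, not the compiler's exact algorithm (real codegen, like the signed x86 output above, adds fixup steps for sign handling and full-range correctness):

```python
# Division by a constant via multiply-and-shift (unsigned sketch).
# Assumptions: divisor D = 100 (i.e. 10^f with f = 2), word size N = 64.
# The naive ceil(2^N / D) magic number shown here is only exact for
# dividends well below 2^64; compilers add fixups for the full range.

D = 100
N = 64
MAGIC = (1 << N) // D + 1  # ceil(2^64 / 100), the precomputed "inverse"

def div100(x: int) -> int:
    """Divide a (modest-sized) nonnegative x by 100 with one mul + one shift."""
    return (x * MAGIC) >> N

# Spot-check against true integer division:
for x in range(0, 1_000_000, 7):
    assert div100(x) == x // 100
assert div100(10404) == 104  # e.g. the product 102 * 102, divided by 10^2
```

The multiply-and-shift pair is far cheaper than a hardware divide, which is why exposing the constant divisor to LLVM (by inlining `fldmod`) pays off.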

@NHDaly (Member, Author) commented Dec 6, 2018

As I mentioned above, this improvement doesn't help for `FD{Int128}` at all. For that, we're working on a PR to perform manually the same optimization LLVM does above, so that we can use it for `Int128`s too and skip the division entirely. Through another trick, we can replace the `widemul(Int128, Int128)` by splitting each number into two `Int64`s and doing long-hand multiplication, which leaves us with two `Int128`s, one for each half of the "widened" number. The neat thing is that since we're going to shift by 128 bits anyway, we can just drop one of the two numbers to arrive at our final answer. :)

This lets us avoid `BigInt`s for `FD{Int128}`, which makes a big difference for performance. I'm cleaning up that code now and I'll send a PR when it's done.
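The split-multiplication idea described above can be sketched in Python (using explicit masking to mimic fixed-width integers). This is an unsigned illustration of the general technique, not FixedPointDecimals' actual code:

```python
# Long-hand 128x128 -> 256-bit multiplication via 64-bit halves.
# Each operand is split into two 64-bit limbs; four partial products are
# combined into the high and low 128-bit halves of the full product.
# If the next step shifts right by 128 bits, only the high half is needed.

MASK64 = (1 << 64) - 1

def widemul_hi_lo(a: int, b: int) -> tuple[int, int]:
    """Return (hi, lo): the two 128-bit halves of a * b, for 0 <= a, b < 2^128."""
    a_hi, a_lo = a >> 64, a & MASK64
    b_hi, b_lo = b >> 64, b & MASK64
    # Four 64x64 partial products, as in schoolbook multiplication.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # Sum the middle terms, carrying overflow into the high half.
    mid = (ll >> 64) + (lh & MASK64) + (hl & MASK64)
    lo = ((mid << 64) | (ll & MASK64)) & ((1 << 128) - 1)
    hi = hh + (lh >> 64) + (hl >> 64) + (mid >> 64)
    return hi, lo

a, b = 3 * 10**30, 7 * 10**25
hi, lo = widemul_hi_lo(a, b)
assert (hi << 128) | lo == a * b  # the halves reassemble the full product
assert hi == (a * b) >> 128       # shifting by 128 bits only ever needs `hi`
```

Since the subsequent `>> 128` discards `lo` entirely, an implementation can skip computing `lo` altogether, which is the "drop one of the two numbers" shortcut mentioned above.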

@codecov-io

Codecov Report

Merging #43 into master will increase coverage by <.01%.
The diff coverage is 100%.


```diff
@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
+ Coverage   98.82%   98.83%   +<.01%
==========================================
  Files           1        1
  Lines         170      171       +1
==========================================
+ Hits          168      169       +1
  Misses          2        2
```

| Impacted Files | Coverage Δ |
|---|---|
| src/FixedPointDecimals.jl | 98.83% <100%> (ø) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1768c58...c119821.

@TotalVerb (Collaborator) left a comment

LGTM. Can you file a bug on Julia for `fldmod` not being inlined? I think this is an example of a function that definitely should be.

@coveralls

Coverage Status

Coverage increased (+0.007%) to 98.83% when pulling c119821 on NHDaly:fldmod_inline into 1768c58 on JuliaMath:master.

@TotalVerb (Collaborator) commented

Thank you for your contributions, @NHDaly! They are well appreciated.

@NHDaly NHDaly deleted the fldmod_inline branch December 11, 2018 16:03
@NHDaly (Member, Author) commented Dec 18, 2018

> Thank you for your contributions, @NHDaly! They are well appreciated.

Thanks @TotalVerb. :) It's been a team effort over here at RelationalAI -- special thanks to @tveldhui, who originally had a lot of the ideas I've sent here and reviewed all my PRs. Sorry I haven't been crediting you directly, Todd!
