
Force inlining on convert and rem #80

Merged
timholy merged 2 commits into master from teh/perf on Mar 10, 2017

Conversation

@timholy
Member

timholy commented Mar 1, 2017

This leads to a ~30% performance improvement in certain operations. It has a noticeable effect on visualizations.

This doesn't make conversions "perfect," though:

julia> using FixedPointNumbers
                                                                                   
julia> foo(x) = convert(N0f8, convert(Float32, x))   # in an ideal world, LLVM figures out this is a no-op
foo (generic function with 1 method)                                               
                                                                                   
julia> x = N0f8(0.8)
0.8N0f8                                                                            
                                                                                   
julia> @code_native foo(x)
        .text                                                                      
Filename: REPL[2]                                                                  
        pushq   %rbp                                                               
        movq    %rsp, %rbp                                                         
Source line: 66                                                                    
        movzbl  (%rdi), %eax                                                       
        vcvtsi2ssl      %eax, %xmm0, %xmm0                                         
        movabsq $139739572855192, %rax  # imm = 0x7F17A799E198                     
        vmulss  (%rax), %xmm0, %xmm0                                               
        movabsq $139739572855196, %rax  # imm = 0x7F17A799E19C                     
Source line: 47                                                                    
        vmovss  (%rax), %xmm2           # xmm2 = mem[0],zero,zero,zero             
        vmulss  %xmm2, %xmm0, %xmm1                                                
        vroundss        $4, %xmm1, %xmm0, %xmm1                                    
Source line: 48                                                                    
        vucomiss        %xmm1, %xmm2                                               
        jb      L71
        vxorps  %xmm2, %xmm2, %xmm2
        vucomiss        %xmm2, %xmm1
        jb      L71
Source line: 1
        vcvttss2si      %xmm1, %eax
        popq    %rbp
        retq
Source line: 48
L71:
        movabsq $throw_converterror, %rax
        movabsq $139739908406960, %rdi  # imm = 0x7F17BB99FEB0
        callq   *%rax
        nopl    (%rax)

@timholy
Member Author

timholy commented Mar 1, 2017

See JuliaImages/ImageCore.jl#26 for some slightly-scary analysis.

@vchuravy
Collaborator

vchuravy commented Mar 2, 2017

LLVM will probably never be able to figure it out. The LLVM IR is:

  %1 = getelementptr inbounds %Normed, %Normed* %0, i64 0, i32 0
  %2 = load i8, i8* %1, align 1
  %3 = uitofp i8 %2 to float
  %4 = fmul float %3, 0x3F70101020000000
  %5 = fmul float %4, 2.550000e+02
  %6 = call float @llvm.rint.f32(float %5)
  %notlhs = fcmp ult float %6, 0.000000e+00
  %notrhs = fcmp ugt float %6, 2.550000e+02
  %7 = or i1 %notrhs, %notlhs
  br i1 %7, label %L35, label %L37

And reinterpret(Float64, 0x3F70101020000000) * 2.550000e+02 == 1.0000000591389835, and I don't think constant propagation is good enough to realise that rint(x * 1.0000000591389835) == x..., but this is indeed an improvement over the status quo.
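That constant analysis can be double-checked outside Julia. Here is a small Python sketch of my own (done in Float64 throughout, whereas the IR above works in Float32, so it is only an illustration of the same arithmetic):

```python
import struct

# Reinterpret the bit pattern from the LLVM IR as a Float64,
# mirroring reinterpret(Float64, 0x3F70101020000000) in Julia.
c = struct.unpack("<d", struct.pack("<Q", 0x3F70101020000000))[0]

# c is Float32(1/255) promoted to Float64; multiplying by 255 does not
# give exactly 1.0, which is why LLVM cannot fold the two fmuls away.
scale = c * 255.0
print(scale)  # 1.0000000591389835

# Even so, rint(x * scale) == x for every integer input 0..255, because
# the accumulated error (at most 255 * 5.9e-8) stays well below 0.5.
print(all(round(x * scale) == x for x in range(256)))  # True
```

This makes vchuravy's point concrete: the fold would be value-preserving over the actual input range, but proving that requires interval reasoning that constant propagation does not do.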

As a side note, LLVM normally doesn't merge the following, but it does merge it with --math=fast:

  %4 = fdiv float %3, 2.550000e+02
  %5 = fmul float %4, 2.550000e+02

and it merges the above as well. I was hoping that using @fastmath here and there would be beneficial, but apparently it doesn't propagate all the way.
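The reason LLVM refuses this merge under strict IEEE semantics is that the round trip x / 255 * 255 is not exact, whereas it is for a power-of-two divisor. A small Python illustration of my own (double precision; the Float32 case behaves analogously):

```python
# (x / 255) * 255 != x for some x under round-to-nearest, so folding the
# fdiv/fmul pair changes results and is only legal with fast-math flags.
mismatches = [x for x in range(1, 10000) if (x / 255) * 255 != x]
print(bool(mismatches))  # True: the round trip is inexact for some inputs

# By contrast, a power-of-two divisor round-trips exactly, since both the
# division and the multiplication are exact in binary floating point
# (barring overflow/underflow), so LLVM could fold that case unconditionally.
print(all((x / 256) * 256 == x for x in range(1, 10000)))  # True
```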

@timholy
Member Author

timholy commented Mar 2, 2017

I hadn't looked as deeply as you, but I also suspected this would be hard for LLVM to analyze. Interesting that @fastmath isn't quite good enough (and that does seem like something that should be fixable).

How scared are you by JuliaImages/ImageCore.jl#26 (comment)? How many other things might be affected by this?

@vchuravy
Collaborator

vchuravy commented Mar 2, 2017

I had added fastmath in the wrong location; now the code_llvm looks like this (also, the rest of these operations should be foldable, but that might require improving LLVM):

define %Normed @julia_foo_68004(%Normed*) #0 !dbg !5 {
top:
  %1 = getelementptr inbounds %Normed, %Normed* %0, i64 0, i32 0
  %2 = load i8, i8* %1, align 1
  %3 = uitofp i8 %2 to float
  %4 = call float @llvm.rint.f32(float %3)
  %notlhs = fcmp ult float %4, 0.000000e+00
  %notrhs = fcmp ugt float %4, 2.550000e+02
  %5 = or i1 %notrhs, %notlhs
  br i1 %5, label %L35, label %L37

L35:                                              ; preds = %top
  %6 = fmul fast float %3, 0x3F70101020000000
  call void @julia_throw_converterror_68005(i8** inttoptr (i64 140606485076464 to i8**), float %6)
  call void @llvm.trap()
  unreachable

L37:                                              ; preds = %top
  %7 = fptoui float %4 to i8
  %8 = insertvalue %Normed undef, i8 %7, 0
  ret %Normed %8
}

and the code_native:

Source line: 66
	movzbl	(%rdi), %eax
	vcvtsi2ssl	%eax, %xmm0, %xmm0
Source line: 47
	vroundss	$4, %xmm0, %xmm0, %xmm1
	movabsq	$140606270307004, %rax  # imm = 0x7FE172CB7ABC
Source line: 48
	vmovss	(%rax), %xmm2           # xmm2 = mem[0],zero,zero,zero
	vucomiss	%xmm1, %xmm2
	jb	L48
	vxorps	%xmm2, %xmm2, %xmm2
	vucomiss	%xmm2, %xmm1
	jb	L48
Source line: 1
	vcvttss2si	%xmm1, %eax
	retq
L48:
	pushq	%rbp
	movq	%rsp, %rbp
	movabsq	$140606270307008, %rax  # imm = 0x7FE172CB7AC0
Source line: 66
	vmulss	(%rax), %xmm0, %xmm0
Source line: 48
	movabsq	$throw_converterror, %rax
	movabsq	$140606485076464, %rdi  # imm = 0x7FE17F9899F0
	callq	*%rax
	nopl	(%rax,%rax)

WRT how scared I am: I haven't looked too closely at why the runtime seems to explode (you mentioned code instability?); another reason might be that excessive inlining can also be detrimental to performance.

Before we merge this we should run the testsuite for affected packages.

@codecov-io

codecov-io commented Mar 2, 2017

Codecov Report

Merging #80 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master      #80   +/-   ##
=======================================
  Coverage   84.44%   84.44%           
=======================================
  Files           4        4           
  Lines         180      180           
=======================================
  Hits          152      152           
  Misses         28       28
Impacted Files Coverage Δ
src/normed.jl 90.36% <100%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 852909e...8e98366. Read the comment docs.

@timholy
Member Author

timholy commented Mar 10, 2017

I ran the tests for all of my installed dependents of FixedPointNumbers, ColorTypes, and Colors. Info on failures:

  • GLVisualize: currently borked for me due to an Intel driver bug
  • PGFPlots: seems to call internal methods in ImageMagick and hasn't been updated
  • Compose: overly-finicky tests (pixel-level discrepancies in rendered plots)

None of these seem to have anything to do with this PR. I can't promise there won't be performance regressions that nevertheless allow the tests to pass, but I think we've done due diligence here.

@timholy timholy merged commit f393589 into master Mar 10, 2017
@timholy timholy deleted the teh/perf branch March 10, 2017 13:02
timholy added a commit that referenced this pull request Apr 20, 2017
This reverts commit f393589, reversing
changes made to 852909e.
timholy added a commit that referenced this pull request Apr 20, 2017
Revert "Merge pull request #80 from JuliaMath/teh/perf"