
Force inlining on convert and rem #80

Merged
timholy merged 2 commits into master from teh/perf on Mar 10, 2017

Conversation

@timholy
Member

timholy commented Mar 1, 2017

This leads to a ~30% performance improvement in certain operations. It has a noticeable effect on visualizations.

This doesn't make conversions "perfect," though:

julia> using FixedPointNumbers
                                                                                   
julia> foo(x) = convert(N0f8, convert(Float32, x))   # in an ideal world, LLVM figures out this is a no-op
foo (generic function with 1 method)                                               
                                                                                   
julia> x = N0f8(0.8)
0.8N0f8                                                                            
                                                                                   
julia> @code_native foo(x)
        .text                                                                      
Filename: REPL[2]                                                                  
        pushq   %rbp                                                               
        movq    %rsp, %rbp                                                         
Source line: 66                                                                    
        movzbl  (%rdi), %eax                                                       
        vcvtsi2ssl      %eax, %xmm0, %xmm0                                         
        movabsq $139739572855192, %rax  # imm = 0x7F17A799E198                     
        vmulss  (%rax), %xmm0, %xmm0                                               
        movabsq $139739572855196, %rax  # imm = 0x7F17A799E19C                     
Source line: 47                                                                    
        vmovss  (%rax), %xmm2           # xmm2 = mem[0],zero,zero,zero             
        vmulss  %xmm2, %xmm0, %xmm1                                                
        vroundss        $4, %xmm1, %xmm0, %xmm1                                    
Source line: 48                                                                    
        vucomiss        %xmm1, %xmm2                                               
        jb      L71
        vxorps  %xmm2, %xmm2, %xmm2
        vucomiss        %xmm2, %xmm1
        jb      L71
Source line: 1
        vcvttss2si      %xmm1, %eax
        popq    %rbp
        retq
Source line: 48
L71:
        movabsq $throw_converterror, %rax
        movabsq $139739908406960, %rdi  # imm = 0x7F17BB99FEB0
        callq   *%rax
        nopl    (%rax)

@timholy
Member Author

timholy commented Mar 1, 2017

See JuliaImages/ImageCore.jl#26 for some slightly-scary analysis.

@vchuravy
Collaborator

vchuravy commented Mar 2, 2017

LLVM will probably never be able to figure it out. The LLVM IR is:

  %1 = getelementptr inbounds %Normed, %Normed* %0, i64 0, i32 0
  %2 = load i8, i8* %1, align 1
  %3 = uitofp i8 %2 to float
  %4 = fmul float %3, 0x3F70101020000000
  %5 = fmul float %4, 2.550000e+02
  %6 = call float @llvm.rint.f32(float %5)
  %notlhs = fcmp ult float %6, 0.000000e+00
  %notrhs = fcmp ugt float %6, 2.550000e+02
  %7 = or i1 %notrhs, %notlhs
  br i1 %7, label %L35, label %L37

And reinterpret(Float64, 0x3F70101020000000) * 2.550000e+02 == 1.0000000591389835, and I don't think constant propagation is good enough to realise that rint(x * 1.0000000591389835) == x..., but this is indeed an improvement over the status quo.
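That constant analysis can be double-checked outside Julia. Here is a small Python sketch of my own (done in Float64 throughout, whereas the IR above works in Float32, so it is only an illustration of the same arithmetic):

```python
import struct

# Reinterpret the bit pattern from the LLVM IR as a Float64,
# mirroring reinterpret(Float64, 0x3F70101020000000) in Julia.
c = struct.unpack("<d", struct.pack("<Q", 0x3F70101020000000))[0]

# c is Float32(1/255) promoted to Float64; multiplying by 255 does not
# give exactly 1.0, which is why LLVM cannot fold the two fmuls away.
scale = c * 255.0
print(scale)  # 1.0000000591389835

# Even so, rint(x * scale) == x for every integer input 0..255, because
# the accumulated error (at most 255 * 5.9e-8) stays well below 0.5.
print(all(round(x * scale) == x for x in range(256)))  # True
```

This makes vchuravy's point concrete: the fold would be value-preserving over the actual input range, but proving that requires interval reasoning that constant propagation does not do.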

As a side note, LLVM normally doesn't merge the following, but it does merge it with --math=fast:

  %4 = fdiv float %3, 2.550000e+02
  %5 = fmul float %4, 2.550000e+02

and it merges the above as well. I was hoping that using @fastmath here and there would be beneficial, but apparently it doesn't propagate all the way.
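The reason LLVM refuses this merge under strict IEEE semantics is that the round trip x / 255 * 255 is not exact, whereas it is for a power-of-two divisor. A small Python illustration of my own (double precision; the Float32 case behaves analogously):

```python
# (x / 255) * 255 != x for some x under round-to-nearest, so folding the
# fdiv/fmul pair changes results and is only legal with fast-math flags.
mismatches = [x for x in range(1, 10000) if (x / 255) * 255 != x]
print(bool(mismatches))  # True: the round trip is inexact for some inputs

# By contrast, a power-of-two divisor round-trips exactly, since both the
# division and the multiplication are exact in binary floating point
# (barring overflow/underflow), so LLVM could fold that case unconditionally.
print(all((x / 256) * 256 == x for x in range(1, 10000)))  # True
```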

@timholy
Member Author

timholy commented Mar 2, 2017

I hadn't looked as deeply as you, but I also suspected this would be hard for LLVM to analyze. Interesting that @fastmath isn't quite good enough (and that does seem like something that should be fixable).

How scared are you by JuliaImages/ImageCore.jl#26 (comment)? How many other things might be affected by this?

@vchuravy
Collaborator

vchuravy commented Mar 2, 2017

I had added fastmath in the wrong location; now the code_llvm looks like this (also, the rest of these operations should be foldable, but that might require improving LLVM):

define %Normed @julia_foo_68004(%Normed*) #0 !dbg !5 {
top:
  %1 = getelementptr inbounds %Normed, %Normed* %0, i64 0, i32 0
  %2 = load i8, i8* %1, align 1
  %3 = uitofp i8 %2 to float
  %4 = call float @llvm.rint.f32(float %3)
  %notlhs = fcmp ult float %4, 0.000000e+00
  %notrhs = fcmp ugt float %4, 2.550000e+02
  %5 = or i1 %notrhs, %notlhs
  br i1 %5, label %L35, label %L37

L35:                                              ; preds = %top
  %6 = fmul fast float %3, 0x3F70101020000000
  call void @julia_throw_converterror_68005(i8** inttoptr (i64 140606485076464 to i8**), float %6)
  call void @llvm.trap()
  unreachable

L37:                                              ; preds = %top
  %7 = fptoui float %4 to i8
  %8 = insertvalue %Normed undef, i8 %7, 0
  ret %Normed %8
}

and the code_native:

Source line: 66
	movzbl	(%rdi), %eax
	vcvtsi2ssl	%eax, %xmm0, %xmm0
Source line: 47
	vroundss	$4, %xmm0, %xmm0, %xmm1
	movabsq	$140606270307004, %rax  # imm = 0x7FE172CB7ABC
Source line: 48
	vmovss	(%rax), %xmm2           # xmm2 = mem[0],zero,zero,zero
	vucomiss	%xmm1, %xmm2
	jb	L48
	vxorps	%xmm2, %xmm2, %xmm2
	vucomiss	%xmm2, %xmm1
	jb	L48
Source line: 1
	vcvttss2si	%xmm1, %eax
	retq
L48:
	pushq	%rbp
	movq	%rsp, %rbp
	movabsq	$140606270307008, %rax  # imm = 0x7FE172CB7AC0
Source line: 66
	vmulss	(%rax), %xmm0, %xmm0
Source line: 48
	movabsq	$throw_converterror, %rax
	movabsq	$140606485076464, %rdi  # imm = 0x7FE17F9899F0
	callq	*%rax
	nopl	(%rax,%rax)

WRT how scared I am: I haven't looked too closely at why the runtime seems to explode (you mentioned code instability?); another reason might be that excessive inlining can also be detrimental to performance.

Before we merge this we should run the testsuite for affected packages.

@codecov-io

codecov-io commented Mar 2, 2017

Codecov Report

Merging #80 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master      #80   +/-   ##
=======================================
  Coverage   84.44%   84.44%           
=======================================
  Files           4        4           
  Lines         180      180           
=======================================
  Hits          152      152           
  Misses         28       28
Impacted Files Coverage Δ
src/normed.jl 90.36% <100%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 852909e...8e98366. Read the comment docs.

@timholy
Member Author

timholy commented Mar 10, 2017

I ran the tests for all of my installed dependents of FixedPointNumbers, ColorTypes, and Colors. Info on failures:

  • GLVisualize: currently borked for me due to an Intel driver bug
  • PGFPlots: seems to call internal methods in ImageMagick and hasn't been updated
  • Compose: overly-finicky tests (pixel-level discrepancies in rendered plots)

None of these seem to have anything to do with this PR. I can't promise there won't be performance regressions that nevertheless allow the tests to pass, but I think we've done due diligence here.

@timholy timholy merged commit f393589 into master Mar 10, 2017
@timholy timholy deleted the teh/perf branch March 10, 2017 13:02
timholy added a commit that referenced this pull request Apr 20, 2017
This reverts commit f393589, reversing
changes made to 852909e.
timholy added a commit that referenced this pull request Apr 20, 2017
Revert "Merge pull request #80 from JuliaMath/teh/perf"