optimizer: inline abstract union-split callsite #44512

Merged · 1 commit merged into master on Mar 30, 2022

Conversation

@aviatesk commented Mar 8, 2022

Currently the optimizer handles an abstract callsite only when there is a
single dispatch candidate (which covers most cases), and so inlining and static dispatch
are prohibited when the callsite is union-split (in other words, union-split
happens only when all the dispatch candidates are concrete).

However, certain patterns of code (most notably our Julia-level compiler code)
inherently need to deal with abstract callsites.
The following example is taken from a `Core.Compiler` utility:

```julia
julia> @inline isType(@nospecialize t) = isa(t, DataType) && t.name === Type.body.name
isType (generic function with 1 method)

julia> code_typed((Any,)) do x # abstract, but no union-split, successful inlining
           isType(x)
       end |> only
CodeInfo(
1 ─ %1 = (x isa Main.DataType)::Bool
└──      goto #3 if not %1
2 ─ %3 = π (x, DataType)
│   %4 = Base.getfield(%3, :name)::Core.TypeName
│   %5 = Base.getfield(Type{T}, :name)::Core.TypeName
│   %6 = (%4 === %5)::Bool
└──      goto #4
3 ─      goto #4
4 ┄ %9 = φ (#2 => %6, #3 => false)::Bool
└──      return %9
) => Bool

julia> code_typed((Union{Type,Nothing},)) do x # abstract, union-split, unsuccessful inlining
           isType(x)
       end |> only
CodeInfo(
1 ─ %1 = (isa)(x, Nothing)::Bool
└──      goto #3 if not %1
2 ─      goto #4
3 ─ %4 = Main.isType(x)::Bool
└──      goto #4
4 ┄ %6 = φ (#2 => false, #3 => %4)::Bool
└──      return %6
) => Bool
```

(Note that this is a limitation of the inlining algorithm itself, so user-provided
hints such as a callsite inlining annotation don't help here.)
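
For instance, here is a quick way to check that claim (a sketch that assumes the `isType` definition above and Julia's callsite-`@inline` support; output omitted):
```julia
# Even with an explicit callsite annotation, the abstract union-split call is
# not inlined on master: the limitation lives in the optimizer itself, not in
# the inlining-cost heuristics that the annotation overrides.
code_typed((Union{Type,Nothing},)) do x
    @inline isType(x)  # callsite inlining annotation
end |> only
```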

This commit enables inlining and static dispatch for abstract union-split callsites.
The core idea is that we can simulate our dispatch semantics by
generating `isa` checks in order of the specificities of the dispatch candidates:

```julia
julia> code_typed((Union{Type,Nothing},)) do x # union-split, successful inlining
           isType(x)
       end |> only
CodeInfo(
1 ─ %1  = (isa)(x, Nothing)::Bool
└──       goto #3 if not %1
2 ─       goto #9
3 ─ %4  = (isa)(x, Type)::Bool
└──       goto #8 if not %4
4 ─ %6  = π (x, Type)
│   %7  = (%6 isa Main.DataType)::Bool
└──       goto #6 if not %7
5 ─ %9  = π (%6, DataType)
│   %10 = Base.getfield(%9, :name)::Core.TypeName
│   %11 = Base.getfield(Type{T}, :name)::Core.TypeName
│   %12 = (%10 === %11)::Bool
└──       goto #7
6 ─       goto #7
7 ┄ %15 = φ (#5 => %12, #6 => false)::Bool
└──       goto #9
8 ─       Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
└──       unreachable
9 ┄ %19 = φ (#2 => false, #7 => %15)::Bool
└──       return %19
) => Bool
```

Inlining/static dispatch of abstract union-split callsites will improve
performance in such situations (and so this commit should improve the
latency of our JIT compilation). In particular, it helps us avoid
excessive specializations of `Core.Compiler` code by statically resolving
`@nospecialize`d callsites; as a result, the number of precompiled
statements is reduced from 1956 (master) to 1901 (this commit).

As a side effect, the implementation of our inlining algorithm also gets much
simpler, since we no longer need the previous special handling for abstract callsites.

One possible drawback is increased code size.
This change does increase the size of the sysimage somewhat,
but I think these numbers are in an acceptable range:

> master

```
❯ du -sh usr/lib/julia/*
 17M    usr/lib/julia/corecompiler.ji
188M    usr/lib/julia/sys-o.a
164M    usr/lib/julia/sys.dylib
 23M    usr/lib/julia/sys.dylib.dSYM
101M    usr/lib/julia/sys.ji
```

> this commit

```
❯ du -sh usr/lib/julia/*
 17M    usr/lib/julia/corecompiler.ji
190M    usr/lib/julia/sys-o.a
166M    usr/lib/julia/sys.dylib
 23M    usr/lib/julia/sys.dylib.dSYM
102M    usr/lib/julia/sys.ji
```

@aviatesk added the compiler:optimizer and compiler:latency labels on Mar 8, 2022
@aviatesk commented Mar 8, 2022

@nanosoldier runbenchmarks("inference", vs=":master")

@nanosoldier

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@aviatesk commented Mar 8, 2022

OK, the "inference" benchmark result looks good. Can I ask for your reviews, @Keno and @vtjnash?

@nanosoldier runbenchmarks(!"scalar", vs=":master")

Comment on lines -1225 to -1232

```julia
if handled_all_cases && revisit_idx !== nothing
    # If there's only one case that's not a dispatchtuple, we can
    # still unionsplit by visiting all the other cases first.
    # This is useful for code like:
    #   foo(x::Int) = 1
    #   foo(@nospecialize(x::Any)) = 2
    # where only a small number of specific dispatchable
    # cases are split off from an ::Any typed fallback.
    (i, j) = revisit_idx
    match = infos[i].results[j]
    handled_all_cases &= handle_match!(match, argtypes, flag, state, cases, true)
```
@aviatesk

This PR subsumes the improvement of #44421, so I just deleted this special handling. The test case added in #44421 is untouched and still passes on this PR.
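
For reference, a minimal way to exercise that pattern (reusing the `foo` methods from the quoted comment; the expected `isa`-nest shape below is my reading of the new transformation, not verified output):
```julia
foo(x::Int) = 1
foo(@nospecialize(x::Any)) = 2

bar(x) = foo(x)

# With this PR the union-split caller should lower to a plain isa-nest,
# roughly `isa(x, Int) ? 1 : 2`, without needing the deleted revisit logic.
code_typed(bar, Tuple{Any})
```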

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@aviatesk force-pushed the avi/abstractinline branch 2 times, most recently from bf346fb to d76f2a4, on March 9, 2022 05:28
@aviatesk commented Mar 9, 2022

@nanosoldier runbenchmarks("inference", vs=":master")

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@aviatesk

@nanosoldier runbenchmarks("inference", vs=":master")

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@Keno commented Mar 14, 2022

Are we sure that this actually works in general? Method matching is very complicated, e.g. when matching against diagonal signatures, which I'm pretty sure the current isa-generating code will not handle properly. We should be able to split all cases that inference decided to union-split, but we'd have to remember which arguments it split on and then generate an appropriate isa nest for those. Unless I'm missing something.
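
(For illustration, a hypothetical diagonal-dispatch example, not taken from this thread, of how per-argument `isa` checks alone can mis-dispatch:)
```julia
# A diagonal signature requires both arguments to have the *same* concrete
# type T, which a flat per-argument isa-nest cannot express.
d(x::T, y::T) where {T<:Integer} = :same
d(@nospecialize(x), @nospecialize(y)) = :fallback

d(1, 2)     # :same     -- both Int, so T = Int applies
d(1, 0x02)  # :fallback -- isa(x, Integer) && isa(y, Integer) both hold,
            #              yet the diagonal method does not apply
```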

@Keno commented Mar 14, 2022

E.g., I just tried the simplest possible case I could think of here:

```julia
julia> f(x::Int, y::Int) = 1
f (generic function with 3 methods)

julia> f(x::Any, y::Int) = 2
f (generic function with 3 methods)

julia> f(x::Int, y::Any) = 3
f (generic function with 3 methods)

julia> g(x,y) = f(x,y)
g (generic function with 1 method)

julia> code_typed(g, Tuple{Any, Any})
ERROR: AssertionError: invalid order of dispatch candidate
Stacktrace:
  [1] handle_cases!(ir::Core.Compiler.IRCode, idx::Int64, stmt::Expr, atype::Any, cases::Vector{Core.Compiler.InliningCase}, fully_covered::Bool, todo::Vector{Pair{Int64, Any}}, params::Core.Compiler.OptimizationParams)
    @ Core.Compiler ./compiler/ssair/inlining.jl:1309
```

@aviatesk force-pushed the avi/abstractinline branch 2 times, most recently from 5cf29ed to b0eca9a, on March 16, 2022 03:55
@aviatesk

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@aviatesk

Thanks @Keno for pointing out that the assertion is not valid.
I've now concluded that we can just process the dispatch candidates in order without that assertion, since the conditions necessary for the isa-checks to simulate the actual dispatch semantics should already hold. In particular:

  • we bail out from these cases when there is any ambiguity
  • `ml_matches` has already sorted the matches in order of their specificity when there is no ambiguity
  • we exclude `case.sig::UnionAll` cases, so we don't need to account for the type-equality constraints from type variables

I also added this comment describing the transformation done by `ir_inline_unionsplit!`:

`ir_inline_unionsplit!`

The core idea of this function is to simulate the dispatch semantics by generating
(flat) `isa`-checks corresponding to the signatures of the union-split dispatch candidates,
and then inlining their bodies into each `isa`-conditional block.

This `isa`-based virtual dispatch requires some pre-conditions to hold in order to simulate
the actual semantics correctly.

The first one is that the dispatch candidates need to be processed in order of their specificity,
and the corresponding `isa`-checks should reflect the method specificities, since their
signatures are not necessarily concrete anymore.
Fortunately, `ml_matches` should have already sorted them in that way, except in cases where there is
any ambiguity, from which we already bail out at this point.

Another consideration is the type-equality constraint imposed by type variables: the `isa`-checks are
not enough to simulate the dispatch semantics in cases like the following.

Given a definition:
```julia
f(x::T, y::T) where T<:Integer = ...
```
transforming a callsite
```julia
(x::Any, y::Any)
```
into the optimized form
```julia
if isa(x, Integer) && isa(y, Integer)
    f(x::Integer, y::Integer)
else
    ... # fallback
end
```
would not enforce the constraint that `x` and `y` share the same type `T`.
But again, we should already bail out from such cases at this point, essentially by
excluding cases where `case.sig::UnionAll`.

In short, here we can process the dispatch candidates in order, assuming we haven't changed
their order somehow somewhere up to this point.
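
To make the first pre-condition concrete, here is a rough sketch of the ordering requirement (hypothetical `h`/`virtual_h`, not code from this PR):
```julia
# Two candidates of different specificity for an abstract (::Any) callsite.
h(x::Int) = :int                # more specific
h(@nospecialize x) = :fallback  # less specific

# The isa-based virtual dispatch the optimizer conceptually emits:
virtual_h(@nospecialize x) = isa(x, Int) ? :int : :fallback

# Checking the general candidate first (isa(x, Any) is always true) would
# shadow the Int method, which is why the candidates must stay sorted by
# specificity, as ml_matches provides.
```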


Now I think we can move this forward once we confirm that this PR doesn't come with any regressions, which might arise from the performance penalty of `isa`-checks against abstract types, as @vtjnash pointed out to me in a meeting.

@aviatesk

@nanosoldier runbenchmarks("linalg" || "simd" || "inference", vs=":master")

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@Keno commented Mar 17, 2022

I'm still quite skeptical that this works, because in general, method matching cannot necessarily be compiled down to an isa nest. Maybe there's some argument here that I'm missing for why it should work here, but I'm skeptical. I guess we'll run PkgEval and see if that comes up with any good examples.

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@vtjnash commented Mar 24, 2022

@nanosoldier runbenchmarks("comprehension", vs=":master")

@nanosoldier

Your benchmark job has completed - no performance regressions were detected. A full report can be found here.

@nanosoldier

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@aviatesk

@nanosoldier runtests(["BasicInterpolators", "CategoricalDistributions", "Diagonalizations", "Evolutionary", "GeometryPrimitives", "MLJSerialization", "OptimizationAlgorithms", "PointwiseKDEs", "PolyaGammaSamplers", "SimpleBufferStream", "SquidGame", "Caesar", "DroneSurveillance", "EclipsingBinaryStars", "PowerFlows", "SBMLToolkit", "Thebes"], vs = ":master")

@nanosoldier

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@nanosoldier

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@aviatesk

@nanosoldier runtests(["Diagonalizations", "Evolutionary", "PointwiseKDEs", "PolyaGammaSamplers", "SquidGame"], vs = ":master")

@nanosoldier

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@aviatesk

The only real failure is from SquidGame.jl. Posted a reproducer here.

@aviatesk

@nanosoldier runbenchmarks(!"scalar", vs=":master")

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

Base automatically changed from avi/44705 to master March 30, 2022 00:27
@aviatesk

Going to merge this if CI runs successfully.

@timholy commented Apr 29, 2022

Probably worth cross-referencing #45051 (comment) from here.
