
Insert eager calls to finalize for otherwise-dead finalizeable objects #44056

Closed

Conversation

jpsamaroo
Member

Finalizers are a great way to defer the freeing side of memory management until some later point; however, they can have unpredictable behavior when the data they free is not fully known to the GC (e.g. GPU allocations, or distributed references). This can result in problems like out-of-memory situations, excessive memory usage, and sometimes more costly freeing behavior (in the event that locks need to be taken).

This seems like a bad situation, but there is a silver lining: some code patterns which allocate such objects don't actually need the allocations to stick around very long, and the lifetime of the object could (in theory) be statically determined by the compiler. Thankfully, with the ongoing work of integrating EscapeAnalysis.jl into the optimizer, we can use the generated escape information to improve this situation.

This PR uses escape info from EA to determine when an object has an attached finalizer and when its lifetime is provably finite (i.e. the object does not escape the analyzed scope). For such objects, we can insert an early call to `finalize(obj)` at the end of `obj`'s lifetime, which will allow the object's finalizer to be enqueued for execution immediately, minimizing how long finalizable objects stay live in the GC.
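
For illustration, here is a minimal sketch of the rewrite described above (the `Handle` type and `work` function are hypothetical, and this is not the actual pass output): once EA proves the object never escapes, an explicit `finalize` call is inserted at the end of its lifetime.

```julia
# Hypothetical handle whose finalizer frees memory the GC cannot account for.
mutable struct Handle
    ptr::Ptr{Cvoid}
    function Handle()
        h = new(Libc.malloc(64))
        finalizer(obj -> Libc.free(obj.ptr), h)
        return h
    end
end

function work()
    h = Handle()                                    # provably never escapes `work`
    unsafe_store!(convert(Ptr{UInt8}, h.ptr), 0x01)
    # ... last use of `h` ...
    finalize(h)  # <- inserted by the pass, so the finalizer runs now instead
                 #    of whenever the GC eventually collects `h`
    return nothing
end
```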

@jpsamaroo jpsamaroo added GC Garbage collector compiler:optimizer Optimization passes (mostly in base/compiler/ssair/) labels Feb 6, 2022
@jpsamaroo jpsamaroo changed the base branch from avi/EscapeAnalysis to avi/EASROA February 7, 2022 19:46
Member

@aviatesk aviatesk left a comment

I rebased avi/EASROA. Maybe rebasing this branch against it would fix the build error?

base/compiler/optimize.jl (review thread; outdated, resolved)
@jpsamaroo
Member Author

Latest push post-rebase still spams UndefRefError() in adce_pass!

@jpsamaroo
Member Author

Error:

Internal error: encountered unexpected error in runtime:
UndefRefError()
getindex at ./array.jl:921 [inlined]
getindex at ./compiler/ssair/ir.jl:238 [inlined]
is_union_phi at ./compiler/ssair/passes.jl:1186 [inlined]
adce_pass! at ./compiler/ssair/passes.jl:1240
run_passes at ./compiler/optimize.jl:606
optimize at ./compiler/optimize.jl:585 [inlined]
_typeinf at ./compiler/typeinfer.jl:253
typeinf at ./compiler/typeinfer.jl:209
typeinf_edge at ./compiler/typeinfer.jl:831
abstract_call_method at ./compiler/abstractinterpretation.jl:561
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:114
abstract_call_known at ./compiler/abstractinterpretation.jl:1475
unknown function (ip: 0x7f13e64d468d)
_jl_invoke at /home/jpsamaroo/julia-fin-el/src/gf.c:2311
ijl_invoke at /home/jpsamaroo/julia-fin-el/src/gf.c:2337
unknown function (ip: 0x7f13e6aceb4c)
unknown function (ip: 0x7f13e6aceaad)

@jpsamaroo
Member Author

Per discussion: the issue here is probably that we don't check that the allocation passed to finalize dominates the return where we insert the call, so we'll need to query the domtree as well.

This change will also potentially cause extended lifetimes for some allocations, but that's apparently a general issue that needs resolving.
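
To illustrate the dominance problem, here is a hedged sketch (reusing the hypothetical `Handle` type from the earlier example, with `use` standing in for an arbitrary non-escaping use): if the allocation sits in a branch, it does not dominate the function's return, so naively inserting `finalize(h)` there would reference a value that is undefined on the other path.

```julia
function maybe_work(cond::Bool)
    if cond
        h = Handle()   # allocated only on this branch
        use(h)
        # a legal insertion point is here, where the allocation dominates
    end
    # naive insertion point: `h` is undefined whenever `cond` is false
    return nothing
end
```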

@aviatesk aviatesk force-pushed the avi/EASROA branch 4 times, most recently from ef8f7be to 04f2d48 Compare February 10, 2022 15:44
@jpsamaroo jpsamaroo marked this pull request as ready for review February 11, 2022 15:25
@jpsamaroo jpsamaroo added the needs tests Unit tests are required for this change label Feb 11, 2022
@jpsamaroo
Member Author

We're now factoring in dominance information (hopefully correctly), which appears to have fixed the errors!

This commit ports [EscapeAnalysis.jl](https://github.com/aviatesk/EscapeAnalysis.jl) into Julia base.
You can find the documentation of this escape analysis at [this GitHub page](https://aviatesk.github.io/EscapeAnalysis.jl/dev/)[^1].

[^1]: The same documentation will be included into Julia's developer
      documentation by this commit.

This escape analysis will hopefully be an enabling technology for various
memory-related optimizations in Julia's high-level compilation pipeline.
Possible target optimizations include alias-aware SROA (JuliaLang#43888),
array SROA (JuliaLang#43909), `mutating_arrayfreeze` optimization (JuliaLang#42465),
stack allocation of mutables, finalizer elision and so on[^2].

[^2]: It would be also interesting if LLVM-level optimizations can consume
      IPO information derived by this escape analysis to broaden
      optimization possibilities.

The primary motivation for porting EA in this PR is to check its impact
on latency as well as to get feedback from a broader range of developers.
The plan is that we first introduce EA in this commit, and then merge the
dependent PRs built on top of this commit, like JuliaLang#43888, JuliaLang#43909 and JuliaLang#42465.

This commit simply defines and runs EA inside the Julia base compiler and
enables the existing test suite with it. In this commit, we just run EA
before inlining to generate the IPO cache. In the dependent PRs, EA will be
invoked again after inlining and used for various local optimizations.
Enhances SROA of mutables using the novel Julia-level escape analysis (on top of JuliaLang#43800):
1. alias-aware SROA, mutable ϕ-node elimination
2. `isdefined` check elimination
3. load-forwarding for non-eliminable but analyzable mutables

---

1. alias-aware SROA, mutable ϕ-node elimination

EA's alias analysis allows this new SROA to handle nested mutable allocations
pretty well. Now we can completely eliminate the heap allocations from
this insanely nested example with a single analysis/optimization pass:
```julia
julia> function refs(x)
           (Ref(Ref(Ref(Ref(Ref(Ref(Ref(Ref(Ref(Ref((x))))))))))))[][][][][][][][][][]
       end
refs (generic function with 1 method)

julia> refs("julia"); @allocated refs("julia")
0
```

EA can also analyze the escape of ϕ-nodes as well as their aliasing.
Mutable ϕ-nodes can be eliminated even in very tricky cases like:
```julia
julia> code_typed((Bool,String,)) do cond, x
           # these allocations form multiple ϕ-nodes
           if cond
               ϕ2 = ϕ1 = Ref{Any}("foo")
           else
               ϕ2 = ϕ1 = Ref{Any}("bar")
           end
           ϕ2[] = x
           y = ϕ1[] # => x
           return y
       end
1-element Vector{Any}:
 CodeInfo(
1 ─     goto #3 if not cond
2 ─     goto #4
3 ─     nothing::Nothing
4 ┄     return x
) => Any
```

Combined with the alias analysis and ϕ-node handling above,
allocations in the following "realistic" examples will be optimized:
```julia
julia> # demonstrate the power of our field / alias analysis with realistic end to end examples
       # adapted from http://wiki.luajit.org/Allocation-Sinking-Optimization#implementation%5B
       abstract type AbstractPoint{T} end

julia> struct Point{T} <: AbstractPoint{T}
           x::T
           y::T
       end

julia> mutable struct MPoint{T} <: AbstractPoint{T}
           x::T
           y::T
       end

julia> add(a::P, b::P) where P<:AbstractPoint = P(a.x + b.x, a.y + b.y);

julia> function compute_point(T, n, ax, ay, bx, by)
           a = T(ax, ay)
           b = T(bx, by)
           for i in 0:(n-1)
               a = add(add(a, b), b)
           end
           a.x, a.y
       end;

julia> function compute_point(n, a, b)
           for i in 0:(n-1)
               a = add(add(a, b), b)
           end
           a.x, a.y
       end;

julia> function compute_point!(n, a, b)
           for i in 0:(n-1)
               a′ = add(add(a, b), b)
               a.x = a′.x
               a.y = a′.y
           end
       end;

julia> compute_point(MPoint, 10, 1+.5, 2+.5, 2+.25, 4+.75);

julia> compute_point(MPoint, 10, 1+.5im, 2+.5im, 2+.25im, 4+.75im);

julia> @allocated compute_point(MPoint, 10000, 1+.5, 2+.5, 2+.25, 4+.75)
0

julia> @allocated compute_point(MPoint, 10000, 1+.5im, 2+.5im, 2+.25im, 4+.75im)
0

julia> compute_point(10, MPoint(1+.5, 2+.5), MPoint(2+.25, 4+.75));

julia> compute_point(10, MPoint(1+.5im, 2+.5im), MPoint(2+.25im, 4+.75im));

julia> @allocated compute_point(10000, MPoint(1+.5, 2+.5), MPoint(2+.25, 4+.75))
0

julia> @allocated compute_point(10000, MPoint(1+.5im, 2+.5im), MPoint(2+.25im, 4+.75im))
0

julia> af, bf = MPoint(1+.5, 2+.5), MPoint(2+.25, 4+.75);

julia> ac, bc = MPoint(1+.5im, 2+.5im), MPoint(2+.25im, 4+.75im);

julia> compute_point!(10, af, bf);

julia> compute_point!(10, ac, bc);

julia> @allocated compute_point!(10000, af, bf)
0

julia> @allocated compute_point!(10000, ac, bc)
0
```

2. `isdefined` check elimination

This commit also implements a simple optimization to eliminate
`isdefined` calls by checking load-forwardability.
This optimization may be especially useful for eliminating the extra allocation
involved with a capturing closure, e.g.:
```julia
julia> callit(f, args...) = f(args...);

julia> function isdefined_elim()
           local arr::Vector{Any}
           callit() do
               arr = Any[]
           end
           return arr
       end;

julia> code_typed(isdefined_elim)
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Vector{Any}, svec(Any, Int64), 0, :(:ccall), Vector{Any}, 0, 0))::Vector{Any}
└──      goto #3 if not true
2 ─      goto #4
3 ─      $(Expr(:throw_undef_if_not, :arr, false))::Any
4 ┄      return %1
) => Vector{Any}
```

3. load-forwarding for non-eliminable but analyzable mutables

EA also allows us to forward loads even when the mutable allocation
can't be eliminated but its fields are still known precisely.
The load forwarding might be useful since it may derive new type information
that succeeding optimization passes can use (or just because it allows
simpler code transformations down the road):
```julia
julia> code_typed((Bool,String,)) do c, s
           r = Ref{Any}(s)
           if c
               return r[]::String # adce_pass! will further eliminate this type assert call also
           else
               return r
           end
       end
1-element Vector{Any}:
 CodeInfo(
1 ─ %1 = %new(Base.RefValue{Any}, s)::Base.RefValue{Any}
└──      goto #3 if not c
2 ─      return s
3 ─      return %1
) => Union{Base.RefValue{Any}, String}
```

---

Please refer to the newly added test cases for more examples.
Also, EA's alias analysis already succeeds in reasoning about arrays,
so this EA-based SROA will hopefully be generalized to array SROA as well.
Co-authored-by: Shuhei Kadowaki <aviatesk@gmail.com>
@yuyichao
Contributor

yuyichao commented Mar 4, 2022

Note that calling finalize is expensive since it's not designed to be used this way. This is especially true in code that uses finalizers a lot, which seems to be what this is targeting. In other words, this transformation will make your code run slower if it wasn't running out of memory.

If you can prove that the object has only known finalizers, and you know what those are, you should be able to call the finalizer directly without going through the normal GC logic. If you can't prove that an exception won't occur, which you probably won't be able to prove at this level, you can just set a flag in the GC to tell it not to run any currently registered finalizers (finalize may need to check this flag as well). If that's too much code to generate, you can simply add a new C API to pass in the finalizer directly, so that you can skip the scan of the finalizer list and have that C API set the appropriate flags for the GC.

Also, as I've said many times before, it is fairly easy to effectively let the GC know about these objects. In most cases, all you need to do is call GC.gc() when your allocation/file opening fails.
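
For reference, a minimal sketch of the retry-on-failure pattern described in that last paragraph; `raw_alloc` is a hypothetical stand-in for whatever external allocation (GPU memory, file handle, etc.) the package performs:

```julia
function alloc_with_gc_retry(n)
    try
        return raw_alloc(n)   # hypothetical external allocation that can fail
    catch
        # Let the GC run pending finalizers, returning externally-held
        # resources it cannot otherwise see, then retry once.
        GC.gc()
        return raw_alloc(n)   # rethrows if it still fails
    end
end
```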

@AriMKatz

AriMKatz commented Mar 7, 2022

@chflood any thoughts on this? (Particularly with respect to GPUs.)

@maleadt
Member

maleadt commented Mar 8, 2022

The counter-argument against directly calling the finalizer thunk is that users may not have written their finalizers such that they can be safely called in the same scope as the allocation (for example, if the finalizer takes a non-reentrant lock that is already held in the allocation scope).

FWIW, for CUDA.jl it'd be advantageous to free in the same scope, because IIUC that means using the same task, which has the effect that memory operations can be ordered against the task-local stream. If they get executed in the finalizer's task, that means using a global stream and ordering against all streams.

Are packages currently doing such resource handling in finalizers, given we don't have #35689?

@jpsamaroo
Member Author

> Are packages currently doing such resource handling in finalizers, given we don't have #35689?

Yes, MemPool.jl is using a global non-reentrant lock (taken from CUDAdrv/CUDAnative originally), which is taken during allocation, and during finalization. I wouldn't mind changing it to a regular ReentrantLock if this is considered to be an unsupported pattern (I'm not sure if it really needs to be non-reentrant anymore, now that we disable finalizers while taking locks; @krynju).
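
A hedged sketch of the pattern being described here (the names are illustrative, not MemPool.jl's actual internals): the same non-reentrant lock is taken while allocating and inside the finalizer, which is exactly why eagerly running the finalizer inside the allocation scope could deadlock.

```julia
const POOL_LOCK = Threads.SpinLock()   # non-reentrant global lock

function pool_alloc(n)
    lock(POOL_LOCK)
    try
        buf = Ref(Libc.malloc(n))      # stand-in for the real pooled allocation
        finalizer(buf) do b
            lock(POOL_LOCK)            # freeing needs the same lock; spins
            try                        # forever if the allocation scope
                Libc.free(b[])         # still holds it
            finally
                unlock(POOL_LOCK)
            end
        end
        return buf
    finally
        unlock(POOL_LOCK)
    end
end
```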

@chflood
Member

chflood commented Mar 8, 2022

In principle I love the idea of escape analysis stack-allocating objects that go away without garbage collector intervention; however, objects with finalizers pose special issues that I don't fully understand.

I'm still learning Julia, so please indulge my questions. Does Julia have a rule about finalizers running exactly once like Java does? Can a finalizer bring an object back to life by stashing it somewhere? My concern is that the GC might accidentally run a finalizer again on a zombie object.

There are also memory-model issues in Java, as detailed here, which may or may not be applicable to Julia. If you run the finalizer without some sort of memory barrier, is it possible that instructions may be reordered in incorrect ways? GC provides that memory barrier.

@oscardssmith
Member

We don't appear to document such a property (or really anything about how finalizers are run). https://docs.julialang.org/en/v1.9-dev/base/base/#Base.finalizer and https://docs.julialang.org/en/v1.9-dev/manual/multi-threading/#Safe-use-of-Finalizers are the only places where they are documented at all. We should probably figure out what properties we want to guarantee and document them.

@jpsamaroo
Member Author

It appears to me that the current implementation of finalize ends up calling all finalizers for the object directly (instead of queuing them for later), meaning that finalizers must be safe to execute immediately in the allocation scope if this pass calls finalize. Is this something that we want to assume for finalizers? Or do we want to assume that finalizers must be executed outside of the allocation scope, and thus change the approach in this PR to a delayed one?

@jpsamaroo jpsamaroo marked this pull request as draft March 8, 2022 19:08
@jpsamaroo jpsamaroo removed the needs tests Unit tests are required for this change label Mar 8, 2022
@yuyichao
Contributor

yuyichao commented Mar 9, 2022

> I agree with adding this fast-path; would it be reasonable to punt that to a future PR, or do you want to see that done here before this is considered for merge?

The C API should be fairly straightforward as well. The issue with the PR as-is is that it will introduce a regression.

> Can a finalizer bring an object back to life by stashing it somewhere?

Yes.

> My concern is that the GC might accidentally run a finalizer again on a zombie object.

No, that is not supposed to happen; each finalizer will run only once. It is removed from the list before being called.

> Is this something that we want to assume for finalizers?

Not just that: a finalizer can run at any time that a GC can run. There were proposals about running them on a separate thread, but that is not done yet and still has issues regarding GC being triggered on the finalizer thread.
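
For completeness, here is a sketch paraphrasing one of the strategies suggested in the manual's "Safe use of Finalizers" section (and assuming the `POOL_LOCK`/`Ref`-based handle from the earlier sketch): if the finalizer cannot safely do its work right now, it re-registers itself so a later GC can retry.

```julia
function finalize_handle(h)
    if trylock(POOL_LOCK)
        try
            Libc.free(h[])
        finally
            unlock(POOL_LOCK)
        end
    else
        # Can't take the lock here without risking deadlock; re-register this
        # same finalizer so the runtime tries again at a later, safer point.
        finalizer(finalize_handle, h)
    end
    return nothing
end
```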

@aviatesk aviatesk force-pushed the avi/EASROA branch 2 times, most recently from 9c84ddc to cdef102 Compare March 23, 2022 07:11
Keno added a commit that referenced this pull request May 11, 2022
This is a variant of the eager-finalization idea
(e.g. as seen in #44056), but with a focus on the mechanism
of finalizer insertion, since I need a similar pass downstream.
Integration of EscapeAnalysis is left to #44056.

My motivation for this change is somewhat different. In particular,
I want to be able to insert finalize calls such that I can
subsequently SROA the mutable object. This requires a couple of
design points that are more stringent than the pass from #44056,
so I decided to prototype them as an independent PR. The primary
things I need here that are not seen in #44056 are:

- The ability to forgo finalizer registration with the runtime
  entirely (requires additional legality analysis)
- The ability to inline the registered finalizer at the deallocation
  point (to enable subsequent SROA)

To this end, adding a finalizer is promoted to a builtin
that is recognized by inference and inlining (such that inference
can produce an inferred version of the finalizer for inlining).

The current status is that this fixes the minimal example I wanted
to have work, but does not yet extend to the motivating case I had.
Nevertheless, I felt that this was a good checkpoint to synchronize
with other efforts along these lines.

Currently working demo:

```
julia> const total_deallocations = Ref{Int}(0)
Base.RefValue{Int64}(0)

julia> mutable struct DoAlloc
           function DoAlloc()
               this = new()
               Core._add_finalizer(this, function(this)
                   global total_deallocations[] += 1
               end)
               return this
           end
       end

julia> function foo()
           for i = 1:1000
               DoAlloc()
           end
       end
foo (generic function with 1 method)

julia> @code_llvm foo()
;  @ REPL[3]:1 within `foo`
define void @julia_foo_111() #0 {
top:
  %.promoted = load i64, i64* inttoptr (i64 140370001753968 to i64*), align 16
;  @ REPL[3]:2 within `foo`
  %0 = add i64 %.promoted, 1000
;  @ REPL[3] within `foo`
  store i64 %0, i64* inttoptr (i64 140370001753968 to i64*), align 16
;  @ REPL[3]:4 within `foo`
  ret void
}
```
@Keno Keno mentioned this pull request May 11, 2022
Keno added a commit that referenced this pull request May 12, 2022
Keno added a commit that referenced this pull request May 12, 2022
Keno added a commit that referenced this pull request May 16, 2022
Keno added a commit that referenced this pull request May 16, 2022
Keno added a commit that referenced this pull request May 25, 2022
Keno added a commit that referenced this pull request May 25, 2022
Keno added a commit that referenced this pull request May 25, 2022
Keno added a commit that referenced this pull request May 29, 2022
Keno added a commit that referenced this pull request Jun 7, 2022
* Eager finalizer insertion

* rm redundant copy

Co-authored-by: Shuhei Kadowaki <40514306+aviatesk@users.noreply.github.com>
@jpsamaroo
Member Author

Abandoned in favor of #45272

@jpsamaroo jpsamaroo closed this Jul 14, 2022