
Add TapedTask type annotations to storage[:tapedtask] #121

Merged: 6 commits merged into master on Feb 20, 2022
Conversation

rikhuijzer (Contributor)

In all these instances, the compiler wasn't able to figure out the type in Julia 1.7.2. By adding the type annotations, the number of allocations decreased. For example, this PR reduces the number of allocations for

```julia
callback !== nothing && callback()
```

inside `(tf::TapedFunction)(args...; callback=nothing)` from 1 allocation (48 bytes) to 0 allocations, according to `@time`.

I ran the same selection of the Turing tests as in #119 again, with `@time`:

master

```
inference/AdvancedSMC.jl 61.216612 seconds (123.97 M allocations: 7.719 GiB, 3.37% gc time, 96.79% compilation time)
inference/gibbs.jl 450.156926 seconds (1.44 G allocations: 87.710 GiB, 3.63% gc time, 18.94% compilation time)
contrib/inference/sghmc.jl 13.233059 seconds (39.25 M allocations: 2.370 GiB, 4.88% gc time, 98.14% compilation time)
inference/gibbs.jl 448.414789 seconds (1.37 G allocations: 81.548 GiB, 3.64% gc time, 7.42% compilation time)
contrib/inference/sghmc.jl 6.587861 seconds (36.67 M allocations: 1.669 GiB, 7.35% gc time, 73.75% compilation time)
inference/gibbs.jl 470.303001 seconds (1.32 G allocations: 79.921 GiB, 3.33% gc time, 7.07% compilation time)
contrib/inference/sghmc.jl 6.462751 seconds (21.34 M allocations: 1.156 GiB, 4.66% gc time, 77.33% compilation time)
stdlib/RandomMeasures.jl 11.844663 seconds (23.81 M allocations: 1.434 GiB, 4.63% gc time, 69.74% compilation time)
```

this PR

```
inference/AdvancedSMC.jl 60.147337 seconds (123.88 M allocations: 7.718 GiB, 3.35% gc time, 96.87% compilation time)
inference/gibbs.jl 444.311275 seconds (1.38 G allocations: 85.330 GiB, 3.59% gc time, 19.28% compilation time)
contrib/inference/sghmc.jl 12.994735 seconds (39.14 M allocations: 2.363 GiB, 5.17% gc time, 98.51% compilation time)
inference/gibbs.jl 435.198188 seconds (1.30 G allocations: 78.764 GiB, 3.66% gc time, 7.51% compilation time)
contrib/inference/sghmc.jl 6.246251 seconds (36.67 M allocations: 1.669 GiB, 7.00% gc time, 73.60% compilation time)
inference/gibbs.jl 452.938648 seconds (1.26 G allocations: 77.115 GiB, 3.40% gc time, 7.10% compilation time)
contrib/inference/sghmc.jl 6.420642 seconds (21.34 M allocations: 1.156 GiB, 5.01% gc time, 76.51% compilation time)
stdlib/RandomMeasures.jl 12.161753 seconds (21.92 M allocations: 1.351 GiB, 3.96% gc time, 67.34% compilation time)
```

So, this saves about 40 seconds and 7 GB of allocations on the Turing tests.

Let's not register a new version after merging this PR. I think that I can find some other optimization soon.

This reduces the number of allocations for

```julia
@time callback !== nothing && callback()
```

inside `(tf::TapedFunction)(args...; callback=nothing)` from 1
allocation (48 bytes) to 0 allocations.
devmotion (Member)

Let's not register a new version after merging this PR. I think that I can find some other optimization soon.

Why not? If there are other additional optimizations, we can register another version with them. We can register as many versions as we want, and according to ColPrac we even should make a new release as soon as new bugfixes or features are merged into the main branch (it's a bit less clear if only tests are updated).

devmotion (Member) left a comment


Looks good. I'm curious: did you check whether a function barrier would fix these cases as well?

src/tapedtask.jl (outdated review thread, resolved)
rikhuijzer and others added 2 commits February 19, 2022 19:57
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
rikhuijzer (Contributor, Author)

> Looks good. I'm curious: did you check whether a function barrier would fix these cases as well?

Would you prefer function barriers if they fix these cases? And if yes, could you tell me what the benefit is here? I only learned about function barriers today, so they need to sink in a bit more.

rikhuijzer (Contributor, Author)

> I'm curious: did you check whether a function barrier would fix these cases as well?

And to answer your question: nope.

I don't mind trying it out. But, in that case, could you give an example of how I can rewrite one of the changes to a function barrier? 🙂

devmotion (Member)

The advantage of a function barrier is that you don't have to reason about types, you can write code more generically, and you don't limit the implementation to the annotated type. Additionally, the part of the function that you move to a separate function will be compiled for the types of whatever it is called with, so Julia can (potentially) generate more efficient code than with an abstract type annotation.
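(For readers unfamiliar with the pattern, here is a minimal sketch of a function barrier; `kernel`, `process`, and `:values` are illustrative names, not part of Libtask:)

```julia
# Hypothetical example of a function barrier.
# The loop is compiled for the concrete element type of `v`,
# even though the caller only knows it as `Any`.
function kernel(v::Vector{T}) where {T<:Number}
    s = zero(T)
    for x in v
        s += x  # fast: T is concrete inside this method
    end
    return s
end

function process(storage::Dict{Symbol,Any})
    v = storage[:values]  # inferred as Any here
    return kernel(v)      # the barrier: one dynamic dispatch, then fast code
end

process(Dict{Symbol,Any}(:values => [1.0, 2.0, 3.0]))  # 6.0
```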

rikhuijzer (Contributor, Author) commented Feb 19, 2022

I've tested both. For the first one, it also results in 0 allocations when writing:

```julia
function producer_core(ttask)
    if length(ttask.produced_val) > 0
        val = pop!(ttask.produced_val)
        put!(ttask.produce_ch, val)
        take!(ttask.consume_ch) # wait for next consumer
    end
    return nothing
end

producer() = producer_core(current_task().storage[:tapedtask])
```

instead of having one method and a type annotation.

For the other one, it results in 2 allocations:

```julia
function produce_core(val, ttask)
    length(ttask.produced_val) > 1 &&
        error("There is a produced value which is not consumed.")
    push!(ttask.produced_val, val)
    return nothing
end

function produce(val)
    is_in_tapedtask() || return nothing
    @time produce_core(val, current_task().storage[:tapedtask])
end
```

whereas

```julia
function produce_core(val)
    ttask = current_task().storage[:tapedtask]::TapedTask
    length(ttask.produced_val) > 1 &&
        error("There is a produced value which is not consumed.")
    push!(ttask.produced_val, val)
    return nothing
end

function produce(val)
    is_in_tapedtask() || return nothing
    @time produce_core(val)
end
```

results in only 1 allocation. Also, the type annotation in this case should reduce compile time slightly because it avoids compiling one extra method instance; it should avoid a method table lookup; and it makes it easier for the reader to see that `ttask` contains an object of type `TapedTask` (if that wasn't already clear from the variable name).

Let me know how to continue from here to get this PR merged.

devmotion (Member)

Did you compare timings, i.e. compile time and run time? I would assume that a function barrier is useful in particular if there are large optimizations, and hence performance gains, from knowing type parameters.

rikhuijzer (Contributor, Author)

> Did you compare timings, i.e. compile time and run time? I would assume that a function barrier is useful in particular if there are large optimizations, and hence performance gains, from knowing type parameters.

In this case, it shouldn't matter because `isconcretetype(Libtask.TapedTask) == true`. So the function barrier would only introduce one method instance.
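(As a sketch of the reasoning, with made-up stand-in types rather than Libtask's:)

```julia
struct Plain          # no type parameters, like TapedTask at this point
    x::Int
end

struct Param{T}       # parametric, like TapedFunction{F}
    x::T
end

isconcretetype(Plain)       # true: one method instance per annotated function
isconcretetype(Param)       # false: a UnionAll, one instance per concrete T
isconcretetype(Param{Int})  # true
```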

devmotion (Member)

Oh, somehow I assumed it had a type parameter `F` and a field of type `TapedFunction{F}`. Is there a specific reason why it has a field of the abstract type `TapedFunction` instead? Otherwise, it could be beneficial to change it?

rikhuijzer (Contributor, Author)

> Oh, somehow I assumed it had a type parameter `F` and a field of type `TapedFunction{F}`. Is there a specific reason why it has a field of the abstract type `TapedFunction` instead? Otherwise, it could be beneficial to change it?

I was thinking exactly the same this afternoon and have experimented with making `TapedTask` a parametric type, like so:

```julia
struct TapedTask{F}
    task::Task
    tf::TapedFunction{F}
    produce_ch::Channel{Any}
    consume_ch::Channel{Int}
    produced_val::Vector{Any}

    function TapedTask{F}( ... )
end
```

in the hope that it would fix inference at points where ttask.tf was called. However, it didn't provide any performance benefits.

For the real performance gains, it would be great if we could get one or more structs to be stack allocated. Unfortunately, that's currently not possible at all because we can't allocate a `TapedFunction.func` (which, per its comment, may be a function, a constructor, or a callable object) on the stack.

Another thing that I noticed is that `TapedFunction.retval` is not inferred. Is this the case for the whole Turing ecosystem? Being able to estimate the return type should be possible for many models, I think.

devmotion (Member)

I think it would be good to change the type nevertheless if there is no compelling reason to keep the abstract field. Generally, fields with abstract types should be avoided if possible.
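(A minimal sketch of why abstract fields hurt inference; the types here are illustrative, not Libtask's:)

```julia
abstract type AbstractFun end
struct Fun <: AbstractFun end

# Abstract field: accessing `h.f` is type-unstable, since any
# subtype of AbstractFun could be stored there.
struct HolderAbstract
    f::AbstractFun
end

# Parametric field: the concrete field type is part of the holder's
# type, so `h.f` is fully inferred.
struct HolderParam{F<:AbstractFun}
    f::F
end
```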

devmotion (Member)

I guess it might be useful to do this in a separate PR first and then reevaluate this PR.

rikhuijzer (Contributor, Author)

> I think it would be good to change the type nevertheless if there is no compelling reason to keep the abstract field. Generally, fields with abstract types should be avoided if possible.

Okay, just to check: that's a breaking change then, because `CTask = TapedTask` is exported.

devmotion (Member) commented Feb 19, 2022

Yes, since it is exported, it would be breaking. It would be good to evaluate whether the alias is needed. I guess it exists for historical reasons (when there was only CTask but no TapedTask), but it seems a bit confusing without a clear benefit. Additionally, it seems to be defined incorrectly: `CTask = TapedTask` should be `const CTask = TapedTask`. I opened #122.
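(The `const` matters because a non-const global binding can be reassigned, so the compiler can't constant-fold uses of it; a sketch with a stand-in type:)

```julia
struct TapedTaskLike end  # hypothetical stand-in for TapedTask

CTaskBad = TapedTaskLike         # non-const global: type-unstable at use sites
const CTaskGood = TapedTaskLike  # const: the binding is fixed at compile time

make() = CTaskGood()  # inferred to return TapedTaskLike
```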

yebai (Member) commented Feb 20, 2022

Many thanks @rikhuijzer!

yebai merged commit 689abb2 into master on Feb 20, 2022.
The delete-merged-branch bot deleted the rh/perf3 branch on February 20, 2022 at 20:22.
devmotion (Member)

@rikhuijzer Did you rerun the benchmarks after the `TapedTask{F}` change and compare the (now non-concrete) type annotations with a function barrier?

rikhuijzer (Contributor, Author)

> @rikhuijzer Did you rerun the benchmarks after the `TapedTask{F}` change and compare the (now non-concrete) type annotations with a function barrier?

Now the function barrier should indeed be better, because the type hint is no longer concrete. I've just released Libtask 0.7; that release doesn't contain this PR. You or I can look into changing it to a function barrier and release once that's done.

rikhuijzer (Contributor, Author)

> @rikhuijzer Did you rerun the benchmarks after the `TapedTask{F}` change and compare the (now non-concrete) type annotations with a function barrier?

I tested it yesterday and couldn't find any differences in speed. My guess is that `ttask.tf` is a reference anyway, so it doesn't matter whether the type is known or not. Still, I would need to investigate more to figure out the exact reason. For now, I won't, because I don't think it's worth the time.
