Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peformance Optimization #138

Merged
merged 35 commits into from
Jun 15, 2022
Merged

Peformance Optimization #138

merged 35 commits into from
Jun 15, 2022

Conversation

KDr2
Copy link
Member

@KDr2 KDr2 commented Apr 2, 2022

The first try is using generated function to run instruction.

@KDr2
Copy link
Member Author

KDr2 commented Apr 2, 2022

Benchmarks on master:

benchmarking rosenbrock...
  Run Original Function:  357.635 μs (16 allocations: 6.10 MiB)
  Run TapedFunction:  507.503 μs (298 allocations: 6.11 MiB)
  Run TapedFunction (compiled):  509.553 μs (327 allocations: 6.12 MiB)
  Run TapedTask: #produce=1;   570.172 μs (423 allocations: 6.12 MiB)
benchmarking ackley...
  Run Original Function:  1.278 ms (0 allocations: 0 bytes)
  Run TapedFunction:  731.830 ms (17798224 allocations: 479.10 MiB)
  Run TapedFunction (compiled):  684.299 ms (19798251 allocations: 640.85 MiB)
  Run TapedTask: #produce=100000;   1.393 s (18798365 allocations: 502.00 MiB)
benchmarking matrix_test...
  Run Original Function:  125.608 μs (16 allocations: 469.22 KiB)
  Run TapedFunction:  145.718 μs (338 allocations: 478.50 KiB)
  Run TapedFunction (compiled):  142.328 μs (370 allocations: 481.42 KiB)
  Run TapedTask: #produce=1;   217.167 μs (477 allocations: 489.12 KiB)
benchmarking neural_net...
  Run Original Function:  492.670 ns (4 allocations: 576 bytes)
  Run TapedFunction:  5.017 μs (91 allocations: 3.66 KiB)
  Run TapedFunction (compiled):  4.561 μs (101 allocations: 4.48 KiB)
  Run TapedTask: #produce=1;   62.029 μs (191 allocations: 9.30 KiB)
done

Benchmarks on this PR:

benchmarking rosenbrock...
  Run Original Function:  362.513 μs (16 allocations: 6.10 MiB)
  Run TapedFunction:  490.431 μs (266 allocations: 6.11 MiB)
  Run TapedFunction (compiled):  473.882 μs (295 allocations: 6.11 MiB)
  Run TapedTask: #produce=1;   564.089 μs (390 allocations: 6.12 MiB)
benchmarking ackley...
  Run Original Function:  1.232 ms (0 allocations: 0 bytes)
  Run TapedFunction:  739.744 ms (15598196 allocations: 456.21 MiB)
  Run TapedFunction (compiled):  694.899 ms (17598223 allocations: 617.96 MiB)
  Run TapedTask: #produce=100000;   1.470 s (16498337 allocations: 477.59 MiB)
benchmarking matrix_test...
  Run Original Function:  124.928 μs (16 allocations: 469.22 KiB)
  Run TapedFunction:  146.478 μs (322 allocations: 478.81 KiB)
  Run TapedFunction (compiled):  142.938 μs (354 allocations: 481.73 KiB)
  Run TapedTask: #produce=1;   221.376 μs (457 allocations: 489.33 KiB)
benchmarking neural_net...
  Run Original Function:  478.605 ns (4 allocations: 576 bytes)
  Run TapedFunction:  5.212 μs (81 allocations: 3.52 KiB)
  Run TapedFunction (compiled):  4.806 μs (91 allocations: 4.34 KiB)
  Run TapedTask: #produce=1;   63.829 μs (210 allocations: 10.17 KiB)
done

The allocation only reduced A LITTLE. There must be some other places where memory is allocated.

Copy link
Member

@devmotion devmotion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm the benchmark results are not very convincing 😥

src/tapedfunction.jl Outdated Show resolved Hide resolved
@KDr2
Copy link
Member Author

KDr2 commented Apr 2, 2022

Hmm the benchmark results are not very convincing disappointed_relieved

Indeed, maybe we are not on the right track...

src/tapedfunction.jl Outdated Show resolved Hide resolved
Copy link
Contributor

@rikhuijzer rikhuijzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked further into the contents of inputs = map(x -> val(_lookup(tf, x)), instr.input) which is one of the allocations culprits.

It turned out that the eltypes of inputs can be a DataType, Box, Matrix and many more types. This means that input will become a vector of pointers and all elements need to be allocated. Even when we re-write things so that the vector is allocated, we only reduce some allocations on the vector, but we do not reduce allocations on the elements towards which the pointers are pointing. Those will cost much more in terms of allocations since they are bigger.

src/tapedfunction.jl Outdated Show resolved Hide resolved
@KDr2
Copy link
Member Author

KDr2 commented Apr 4, 2022

but we do not reduce allocations on the elements towards which the pointers are pointing.

I think the elements are already stored in tf.bindings, but the way we are passing them prevents updating them in place?

@KDr2
Copy link
Member Author

KDr2 commented Apr 13, 2022

mem profile result: https://paste.debian.net/1237788/

@yebai
Copy link
Member

yebai commented Apr 13, 2022

mem profile result: https://paste.debian.net/1237788/

thanks, @KDr2 - can you list the hotspots and try to fix them?

@KDr2
Copy link
Member Author

KDr2 commented Apr 27, 2022

I tried to inspect on TypedFunction:

# const TypedFunction = FunctionWrapper
struct TypedFunction{OT, IT<:Tuple}
    func::Function
    retval::Base.RefValue{OT}
    TypedFunction{OT, IT}(f::Function) where {OT, IT<:Tuple} = new{OT, IT}(f, Ref{OT}())
end

function (f::TypedFunction{OT, IT})(args...) where {OT, IT<:Tuple}
    output = f.func(args...)
    # 1. conversion/type-assertion -> allocation
    retv = f.retval
    retv[] = OT === Nothing ? nothing : output
    return retv[]
end


getter(bindings::Dict, v::Symbol) = bindings[v]
setter(bindings::Dict, v::Symbol, c) = bindings[v] = c

bs = Dict(:a=>1, :b=>"x")
gint = TypedFunction{Int, Tuple{Dict, Symbol}}(getter)
gstr = TypedFunction{String, Tuple{Dict, Symbol}}(getter)

sint = TypedFunction{Int, Tuple{Dict, Symbol, Int}}(setter)
sstr = TypedFunction{String, Tuple{Dict, Symbol, String}}(setter)

using BenchmarkTools

@btime gint(bs, :a)
@btime gstr(bs, :b)
@btime sint(bs, :a, 1)
@btime sstr(bs, :b, "y")

Both BenchmarkTools and --track-allocation report no mem allocation.

@KDr2
Copy link
Member Author

KDr2 commented Apr 27, 2022

The latest commit eliminates allocations from return retval[] and Symbol("_", i + 1).

@yebai
Copy link
Member

yebai commented Apr 27, 2022

thanks @KDr2 - nice improvements. I included the results below. Allocations for ackley went down to 6.7 million from 17 million. Let's keep optimising the remaining hotspots.

Results on master and with e1529f1 (below)

benchmarking rosenbrock...
  Run Original Function:  667.309 μs (16 allocations: 6.10 MiB)
  Run TapedFunction:  845.212 μs (114 allocations: 6.11 MiB)
  Run TapedFunction (compiled):  780.911 μs (143 allocations: 6.11 MiB)
  Run TapedTask: #produce=1;   1.150 ms (234 allocations: 6.12 MiB)
benchmarking ackley...
  Run Original Function:  2.042 ms (0 allocations: 0 bytes)
  Run TapedFunction:  915.595 ms (6798568 allocations: 205.97 MiB)
  Run TapedFunction (compiled):  856.984 ms (8798595 allocations: 367.72 MiB)
  Run TapedTask: #produce=100000;   2.121 s (7298712 allocations: 216.66 MiB)
benchmarking matrix_test...
  Run Original Function:  152.102 μs (16 allocations: 469.22 KiB)
  Run TapedFunction:  187.503 μs (136 allocations: 473.12 KiB)
  Run TapedFunction (compiled):  193.502 μs (168 allocations: 476.05 KiB)
  Run TapedTask: #produce=1;   470.607 μs (268 allocations: 483.55 KiB)
benchmarking neural_net...
  Run Original Function:  706.436 ns (4 allocations: 576 bytes)
  Run TapedFunction:  5.117 μs (21 allocations: 1.19 KiB)
  Run TapedFunction (compiled):  4.414 μs (31 allocations: 2.02 KiB)
  Run TapedTask: #produce=1;   185.902 μs (113 allocations: 6.56 KiB)

@yebai yebai mentioned this pull request May 3, 2022
src/tapedfunction.jl Outdated Show resolved Hide resolved
@KDr2
Copy link
Member Author

KDr2 commented May 12, 2022

The latest results:

benchmarking rosenbrock...
  Run Original Function:  661.008 μs (16 allocations: 6.10 MiB)
  Run TapedFunction:  817.910 μs (95 allocations: 6.11 MiB)
  Run TapedFunction (compiled):  779.715 μs (124 allocations: 6.11 MiB)
  Run TapedTask: #produce=1;   1.147 ms (175 allocations: 6.11 MiB)
benchmarking ackley...
  Run Original Function:  2.018 ms (0 allocations: 0 bytes)
  Run TapedFunction:  607.439 ms (6098555 allocations: 140.36 MiB)
  Run TapedFunction (compiled):  527.161 ms (8098582 allocations: 233.44 MiB)
  Run TapedTask: #produce=100000;   1.793 s (6598652 allocations: 149.52 MiB)
benchmarking matrix_test...
  Run Original Function:  147.702 μs (16 allocations: 469.22 KiB)
  Run TapedFunction:  170.502 μs (106 allocations: 471.50 KiB)
  Run TapedFunction (compiled):  171.903 μs (138 allocations: 473.03 KiB)
  Run TapedTask: #produce=1;   448.807 μs (189 allocations: 479.00 KiB)
benchmarking neural_net...
  Run Original Function:  700.721 ns (4 allocations: 576 bytes)
  Run TapedFunction:  3.413 μs (18 allocations: 944 bytes)
  Run TapedFunction (compiled):  2.911 μs (28 allocations: 1.39 KiB)
  Run TapedTask: #produce=1;   179.703 μs (99 allocations: 5.62 KiB)
done

@KDr2
Copy link
Member Author

KDr2 commented May 24, 2022

Time consumption and memory allocation per instruction: https://paste.debian.net/1241862/.

src/tapedfunction.jl Outdated Show resolved Hide resolved
@KDr2
Copy link
Member Author

KDr2 commented Jun 8, 2022

It seems that this change (impact vector bindings)brings a little performance decrease in the Turing.jl unit tests tho it's done during compile time, maybe there are a large number of tapedfunctions generated during the test.

BTW, docs are added.

@yebai

@KDr2 KDr2 requested a review from yebai June 8, 2022 23:21
@KDr2 KDr2 linked an issue Jun 8, 2022 that may be closed by this pull request
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Outdated Show resolved Hide resolved
src/tapedfunction.jl Show resolved Hide resolved
@yebai yebai merged commit 8ea2899 into master Jun 15, 2022
@yebai yebai deleted the perf branch June 15, 2022 19:53
@yebai
Copy link
Member

yebai commented Jun 15, 2022

Many thanks, @KDr2 for the hard work! Also thanks to @devmotion and @rikhuijzer for the comments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The performance issue
4 participants