
Current method of caching arrays seems to be inefficient in some scenarios. #92

Closed
KristofferC opened this issue Feb 1, 2016 · 5 comments

Comments

@KristofferC
Collaborator

I was profiling my code a bit, and after doing some optimization I noticed that for Jacobian computations, ~60-70% of the time is spent in the cache functions of ForwardDiff.

Here is a profile screenshot with some annotations:

[profile screenshot]

It seems that a lot of the time in the cache method is spent in ht_keyindex in dict.jl, which I believe is a key lookup. This line and the one after it both look up the key in the dictionary, which is unnecessary. I am not sure there is a good way to avoid the double lookup right now; JuliaLang/julia#12157 is relevant.

Another option would be to reconsider the whole cache method so that some other data structure is used instead of a Dict. The Dict approach is very flexible, since it is possible to insert and retrieve arrays of arbitrary type and size, but the lookups seem to be a bit too slow.
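
For reference, the pattern in question looks roughly like the sketch below, and Base's get! with a default function only computes the hash slot once. (The function names and the (type, length) key here are illustrative, not ForwardDiff's actual code.)

# Double lookup: haskey and getindex each hash and probe the key (ht_keyindex).
function workvec_double{T}(cache::Dict, ::Type{T}, n::Int)
    key = (T, n)
    if haskey(cache, key)   # first lookup
        return cache[key]   # second lookup of the same key
    else
        v = Vector{T}(n)
        cache[key] = v
        return v
    end
end

# Single lookup: get! finds (or claims) the slot once and
# only calls the default block on a miss.
function workvec_single{T}(cache::Dict, ::Type{T}, n::Int)
    return get!(cache, (T, n)) do
        Vector{T}(n)
    end
end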

If you want to reproduce the profiling data:

Add these packages:

Pkg.clone("https://github.com/KristofferC/Voigt.jl")
Pkg.clone("https://github.com/KristofferC/ConstLab.jl")
Pkg.add("Parameters")
Pkg.add("Devectorize")

Include this file: https://gist.github.com/KristofferC/71a0797b5b9bf42b8a43

And benchmark with (hopefully I didn't forget anything):

E = 200000.0
ν = 0.3
σy = 800.0
n = 2.0
l = 1.e-2
Hl = 0.1 * E
Hg = 4.e7
q = 0.1
D = 1.0
tstar = 1000.0
angles = [20, 40]
const mp = CrystPlastMP(E, ν, σy, n, Hl, Hg, q, D, l, tstar, angles)
const ms = CrystPlastMS(length(angles))

comp_res!(y, x) = compute_residual!(y, x, zeros(6), zeros(2), 0.1, mp, ms)
jac = ForwardDiff.jacobian(comp_res!, output_length = 10)
jac(zeros(10))

@profile for i = 1:10^5 jac(zeros(10)) end
KristofferC changed the title from "Current method of caching array seems to be a bottleneck." to "Current method of caching arrays seems to be inefficient in some scenarios." on Feb 1, 2016
@KristofferC
Collaborator Author

I experimented a bit, and with some ugly hacking and slashing in the code I now run with

type JacobianCache{T, Q}
    workvec::Vector{T}
    output::Vector{T}
    partials::Vector{Q}
end

as cache (only vector mode works so far).

This (naturally) eliminates the key lookup overhead:

[profile screenshot]

The negative effect of this is that I need to specify the output size in order to allocate the correct types for the gradient numbers.

I also specified the input size when creating the function that computes the Jacobian, but with resize! on the fields in the cache that might not be necessary.

I believe that in the large majority of cases the user has a known, constant input size and hence a known, constant output size. It might make sense to optimize that case a bit extra if possible, even if it requires the user to explicitly give the output size from the start.
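
For concreteness, creating and reusing such a cache could look something like the sketch below. The constructor, the field lengths, and the resize helper are my guesses for illustration; Q stands for the gradient number type.

# Sketch: build the cache once with known input/output lengths,
# then resize! the buffers instead of reallocating them.
function make_cache{T, Q}(::Type{T}, ::Type{Q}, input_length::Int, output_length::Int)
    return JacobianCache{T, Q}(Vector{T}(input_length),
                               Vector{T}(output_length),
                               Vector{Q}(input_length))
end

function resize_input!(cache::JacobianCache, n::Int)
    resize!(cache.workvec, n)    # resize! reuses the existing buffer where possible
    resize!(cache.partials, n)
    return cache
end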

@KristofferC
Collaborator Author

If we already use @generated to instantiate new functions for the types that are needed for the cache, why not instantiate the needed arrays as well? They will act just like static variables for the function.

Like:

@generated function foo{N,T}(a::NTuple{N, T}, x)
    println("Creating buffer")  # the generator body runs once per (N, T) signature
    cache_arr = zeros(T, N)     # allocated at generation time, captured in the returned code
    return quote
        print($cache_arr)       # the same array is reused across calls
        fill!($cache_arr, x)
        print($cache_arr)
    end
end

julia> foo((1.0,2.0,3.0), 2.0)
Creating buffer
[0.0,0.0,0.0][2.0,2.0,2.0]

julia> foo((1.0,2.0,3.0), 1.0)
[2.0,2.0,2.0][1.0,1.0,1.0]

julia> foo((1.0,2.0,3.0,4.0), 1.0)
Creating buffer
[0.0,0.0,0.0,0.0][1.0,1.0,1.0,1.0]

@KristofferC
Collaborator Author

With a diff like: https://gist.github.com/KristofferC/5e2169d0057df82494cbdf6fb341bfba

and using:

# Brown's almost-linear function, a standard least-squares test problem
function brown_almost_linear(x::Vector)
    fvec = similar(x)
    n = length(x)
    sum1 = sum(x) - (n+1)
    for k = 1:(n-1)
        fvec[k] = x[k] + sum1
    end
    fvec[n] = prod(x) - 1
    return fvec
end

g = ForwardDiff.jacobian(brown_almost_linear)

I get (with all tests passing):

julia> @time for i in 1:10^5 g(rand(5)) end
  0.091972 seconds (600.00 k allocations: 86.975 MB, 5.09% gc time)

and on master

julia> @time for i in 1:10^5 g(rand(5)) end
  0.293641 seconds (800.00 k allocations: 91.553 MB, 2.07% gc time)

There is probably some reason why a compiler person wouldn't want you to do this, but it seems quite powerful...

@jrevels
Member

jrevels commented Apr 4, 2016

If I understand you correctly, what you're suggesting is similar to what I started to do in my original refactor last year, but it was decided to abandon that approach since the memory cached in the @generated function couldn't easily be freed by the user.

There's probably a sane way to combine these approaches, such that we can make cache retrieval a compile-time lookup instead of a runtime lookup, but still allocate the memory such that it can be freed by the user when necessary.
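
One speculative way to get both (nothing like this is implemented; cached_buffer, CACHE_REFS and clearcache! are invented names): splice a Ref into the generated code so retrieval is resolved at generation time, but register every Ref in a global list so a user-facing clearcache!() can drop the buffers and let them be garbage collected.

const CACHE_REFS = Any[]

function clearcache!()
    for r in CACHE_REFS
        r[] = nothing   # drop the buffer; it is reallocated on next use
    end
end

@generated function cached_buffer{T, N}(::Type{T}, ::Type{Val{N}})
    ref = Ref{Any}(nothing)   # created once per (T, N) instantiation
    push!(CACHE_REFS, ref)    # registered so clearcache!() can reach it
    return quote
        buf = $ref[]
        if buf === nothing
            buf = zeros($T, $N)   # lazily (re)allocated after a clearcache!()
            $ref[] = buf
        end
        buf::Vector{$T}
    end
end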

@KristofferC
Collaborator Author

It could be possible to fall back to the current cache lookup once the input length is large enough. For small input lengths, the size of the cached array is probably of the same order of magnitude as the generated code, which doesn't get freed anyway.
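
As a sketch of that idea, reusing the hypothetical cached_buffer from above and an equally hypothetical dict_workvec for the current Dict-based path (the cutoff value is made up too):

const STATIC_CACHE_CUTOFF = 100   # hypothetical threshold

function get_workvec{T}(::Type{T}, n::Int)
    if n <= STATIC_CACHE_CUTOFF
        return cached_buffer(T, Val{n})   # compile-time lookup, lives as long as the code
    else
        return dict_workvec(T, n)         # runtime Dict lookup, freeable
    end
end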
