
WIP: Cache refactor #105

Closed
wants to merge 3 commits into from

Conversation

KristofferC
Collaborator

This is a WIP to refactor the cache to be a bit more efficient.

The design thoughts are given here:

Currently, the overhead from retrieving vectors from the cache is significant for smallish problems because we do two dict lookups for each vector retrieval. This can be reduced to two dict lookups per function call by storing all the vectors we need for the function in a type and keeping a vector of those types (one per thread) in a cache instead. The first commit implements this.

The second commit implements an additional keyword argument to the macros, cache. A ForwardDiffCache is currently created by cache = ForwardDiff.ForwardDiffCache(Val{input_length}, eltype(X), Val{chunk_size}). If such an argument is given, we use the passed-in cache; otherwise we create or reuse one from the global cache. This requires a bit of extra code and the performance gains are so-so, so we have to decide whether it is worth the extra code and documentation.

Changes from giant-refactor branch:

  • There is now only a single Dict, whose values are ForwardDiffCaches, each of which contains a vector of size NTHREADS of a new type GradientCache. This means that at the start of a gradient call we only do one lookup in the Dict; the rest of the accesses are type-stable array and field accesses.
  • Adds a totalsizeofcache function that counts the total number of bytes held by the cache.
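
A minimal sketch of the single-lookup layout described above, written in modern Julia syntax (the type and field names, and the Dict key layout, are illustrative assumptions, not the actual ForwardDiff internals):

```julia
# Hypothetical sketch: one global Dict whose values hold a per-thread
# vector of work caches, so a gradient call does a single hash lookup
# followed by plain array/field accesses.
struct GradientCache{T}
    workvec::Vector{T}   # per-thread scratch storage
end

struct ForwardDiffCache{T}
    caches::Vector{GradientCache{T}}  # one entry per thread
end

# Keyed on (input length, element type, chunk size) -- an assumption.
const CACHE = Dict{Tuple{Int,DataType,Int},ForwardDiffCache}()

function get_gradient_cache(::Type{T}, len::Int, chunk::Int) where {T}
    # The one Dict lookup per gradient call; creates the cache on a miss.
    fdc = get!(CACHE, (len, T, chunk)) do
        ForwardDiffCache{T}([GradientCache{T}(Vector{T}(undef, len))
                             for _ in 1:Threads.nthreads()])
    end
    # From here on: array index + field access for the current thread.
    return fdc.caches[Threads.threadid()]
end
```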

TODO:

  • Benchmark properly
  • Integrate it with jacobian once that is added
  • Reduce memory use a bit by only allocating partials_remainder for thread tid

@KristofferC
Collaborator Author

Some benchmarking (for rosenbrock):

# giant-refactor
julia> @time for i = 1:10^5 g(rand(5)) end
  0.352903 seconds (800.00 k allocations: 35.095 MB, 1.12% gc time)

# this branch
julia> @time for i = 1:10^5 g(rand(5)) end
  0.166368 seconds (900.00 k allocations: 39.673 MB, 5.06% gc time)

# Hardcoding a cache vector of GradientCaches, i.e. no Dict at all:
julia> @time for i = 1:10^5 g(rand(5)) end
  0.114197 seconds (800.00 k allocations: 35.095 MB, 2.73% gc time)

So it seems it is still worthwhile to let a user skip the dict entirely by specifying the input data.
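
The "no Dict at all" variant benchmarked above can be sketched as a hard-coded per-thread cache for one fixed input signature (names and layout are hypothetical; modern Julia syntax):

```julia
# Hypothetical sketch: per-thread scratch vectors for a single fixed
# input length, so retrieval is a plain array index with no hashing.
const INPUT_LEN = 5

struct HardGradientCache
    workvec::Vector{Float64}
end

const HARD_CACHES = [HardGradientCache(Vector{Float64}(undef, INPUT_LEN))
                     for _ in 1:Threads.nthreads()]

# Type-stable, allocation-free retrieval for the current thread.
current_cache() = HARD_CACHES[Threads.threadid()]
```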

@jrevels jrevels mentioned this pull request Feb 11, 2016
@KristofferC KristofferC force-pushed the cache_refactor branch 2 times, most recently from eb90bbc to 07aff58 Compare February 12, 2016 10:22
@KristofferC
Collaborator Author

One idea that might save some memory is to not use the input length as a key in the dict but instead just call resize! on the pertinent arrays.

AFAIU, resize! only allocates memory when the underlying buffer grows and is extremely fast when the size is decreased.
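
The resize! behavior described can be checked with a small standalone snippet (not part of the PR):

```julia
v = zeros(Float64, 1000)
resize!(v, 10)     # shrinking only adjusts the length; the buffer is kept
@assert length(v) == 10
resize!(v, 500)    # growing back is cheap while within the old capacity;
                   # note: elements past the old length are uninitialized
@assert length(v) == 500
```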

@KristofferC
Collaborator Author

Tried it, but it didn't work out so well.

@jrevels jrevels force-pushed the giant-refactor branch 3 times, most recently from c0038b9 to cf1024d Compare February 12, 2016 23:21
@KristofferC KristofferC force-pushed the cache_refactor branch 3 times, most recently from 98b09a0 to affd82a Compare February 13, 2016 17:08
@KristofferC
Collaborator Author

Added the possibility to run with a pre-created cache:

For example:

cache = ForwardDiff.ForwardDiffCache(Val{5}, Float64)
g = ForwardDiff.@gradient(rosenbrock, cache = cache)

Some benchmarks at: https://gist.github.com/KristofferC/007e8ced53e2bb0484e6

TL;DR (rosenbrock function used, giant_refactor is the 1x baseline):

  • input_length = 1000: no difference between anything on this branch and giant_refactor
  • input_length = 50: explicit cache 0.8x, this branch only 0.8x
  • input_length = 20: explicit cache 0.55x, this branch only 0.6x
  • input_length = 5: explicit cache 0.4x, this branch only 0.5x

Explicitly passing the cache gained less than I expected. It might not be worth it.

@KristofferC
Collaborator Author

Rebased and updated original post with some details.

@jrevels jrevels deleted the cache_refactor branch June 27, 2016 00:37