Reduce number of dependencies #44

Open · sethaxen opened this issue Sep 21, 2021 · 17 comments

@sethaxen (Member)

@ParadaCarleton and I discussed trying to slim down the package to make it a more lightweight dependency. Currently on my machine this package loads in 5s, vs 5ms with https://github.com/arviz-devs/PSIS.jl, which only depends on standard library packages.

The two reasons to add dependencies are (1) convenience and (2) performance. In my view, features like pretty-printed results are not worth the extra load time if a package could otherwise be lightweight; alternatively, they can be made conditional dependencies.
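As a concrete illustration of the conditional-dependency option, here is a minimal sketch using Requires.jl (the usual mechanism at the time of writing, before package extensions). The module layout, file name, and UUID below are placeholders, not ParetoSmooth.jl's actual code:

```julia
module ParetoSmooth

using Requires

function __init__()
    # Hypothetical conditional dependency: the pretty-printing glue code is
    # only included if the user also loads PrettyTables. The UUID must be the
    # real one from the dependency's Project.toml; this string is a placeholder.
    @require PrettyTables="08abe8d2-0d0c-5749-adfa-8a2ac140af0d" include("pretty_printing.jl")
end

end # module
```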

For performance, we discussed that Tullio+LoopVectorization may be unnecessary, as PSIS.jl is much faster than ParetoSmooth.jl for vectors and only 20% slower for arrays (when PSIS.jl uses the @threads macro as ParetoSmooth.jl does). If removing them only causes a ~20% decrease in performance but decreases load time significantly, that seems acceptable.
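For a sense of what this trade-off looks like in code, here is a sketch of the same reduction written both ways; the column-sum kernel is illustrative only, not one of ParetoSmooth.jl's actual kernels:

```julia
using Tullio, LoopVectorization
using Base.Threads: @threads

# Tullio + LoopVectorization version: one line, but heavy dependencies.
colsum_tullio(x::AbstractMatrix) = @tullio s[j] := x[i, j]

# Plain Base version: only stdlib threading and SIMD hints.
function colsum_threads(x::AbstractMatrix)
    s = zeros(eltype(x), size(x, 2))
    @threads for j in axes(x, 2)
        acc = zero(eltype(x))
        @simd for i in axes(x, 1)
            acc += x[i, j]
        end
        s[j] = acc
    end
    return s
end
```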

@sethaxen (Member, Author)

On my machine, a non-exhaustive list of load times of the heaviest dependencies in a clean environment:

  • LoopVectorization: 2.7s
  • AxisKeys: 0.86s
  • MCMCDiagnosticTools: 0.80s
  • Distributions: 0.77s
  • Polyester: 0.60s

@ParadaCarleton (Member) commented Sep 21, 2021

Yep, makes sense. I think a fair number of these can be cut out.

MCMCDiagnosticTools, as we discussed earlier, can be cut out pretty easily. Another way to cut it out, besides what we discussed, is just copy/pasting the FFTESS code from MCMCDiagnosticTools so we don't need to take everything else, which might be the easiest option. I think both Distributions.jl and Polyester.jl can be cut out too at very low cost, and I'll make a PR doing that soon.

It might be worth checking whether switching from AxisKeys to DimensionalData reduces compile time. My suspicion is it will -- AxisKeys works by turning every array into a functor that returns the output associated with an index, so maybe it compiles a function for every array?

LoopVectorization may be harder to cut out without losing performance, but I've been considering whether switching to TensorOperations.jl for the calculations might be faster than LoopVectorization+Tullio for larger arrays anyway; I'm unsure how it would affect compile times.

@ParadaCarleton (Member) commented Sep 21, 2021

```julia
julia> @time using AxisKeys
  1.176212 seconds (2.43 M allocations: 158.393 MiB, 1.06% gc time)

julia> @time using DimensionalData
  0.089362 seconds (241.28 k allocations: 17.780 MiB, 25.46% compilation time)
```

Wow, yeah, the difference is huge. I stuck with AxisKeys because I wanted it to interface easily with ArviZ.jl, but DimensionalData does exactly the same job in under a tenth of the load time. Switching should be very easy (if a bit tedious), since the only real difference is that DimensionalData uses a slightly different constructor and square brackets for lookup (instead of round brackets).
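To make the "slightly different constructor and brackets" concrete, a minimal sketch of the two APIs side by side (illustrative data; assuming the current AxisKeys and DimensionalData interfaces):

```julia
using AxisKeys, DimensionalData

# AxisKeys: keys attached via keywords, lookup with round brackets
ka = KeyedArray(rand(3); time=[0.1, 0.2, 0.3])
ka(time=0.2)           # select the entry with key 0.2

# DimensionalData: dimensions wrap the keys, lookup with square brackets
da = DimArray(rand(3), (Ti([0.1, 0.2, 0.3]),))
da[Ti(At(0.2))]        # same selection via the At selector
```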

@ParadaCarleton (Member) commented Sep 21, 2021

```julia
julia> @time using LoopVectorization, Tullio
  1.722977 seconds (4.75 M allocations: 256.594 MiB, 3.16% gc time)

julia> @time using TensorOperations
  0.171592 seconds (425.81 k allocations: 27.868 MiB)
```

Similar story here. I'll probably switch them out even if the performance drops slightly, although I've been hitting some very annoying errors -- for some reason, replacing @tullio with @tensor throws errors I can't make heads or tails of, such as complaints about := (even though := is definitely supported by TensorOperations).
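One likely source of such errors (a guess, not a diagnosis of the actual failures): @tensor only accepts strict multilinear Einstein-notation expressions, while @tullio also allows arbitrary scalar functions inside the index expression, so a mechanical swap breaks on anything non-multilinear. A sketch:

```julia
using Tullio, TensorOperations

A, B = rand(4, 4), rand(4, 4)

@tullio C[i, j] := A[i, k] * B[k, j]   # fine: a plain contraction
@tensor D[i, j] := A[i, k] * B[k, j]   # also fine; := allocates the output

@tullio E[i, j] := max(A[i, j], B[i, j])    # fine with Tullio...
# @tensor F[i, j] := max(A[i, j], B[i, j])  # ...but not valid @tensor syntax
```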

@devmotion (Member)

> MCMCDiagnosticTools, as we discussed earlier, can be cut out pretty easily. Another way to cut it out, besides what we discussed, is just copy/pasting the FFTESS code from MCMCDiagnosticTools so we don't need to take everything else, which might be the easiest option.

IMO copying code is usually one of the worst alternatives. Regarding FFT, I assume you could/should replace FFTW with AbstractFFTs, as we do in MCMCDiagnosticTools and MCMCChains as well.

@ParadaCarleton (Member) commented Sep 21, 2021

> MCMCDiagnosticTools, as we discussed earlier, can be cut out pretty easily. Another way to cut it out, besides what we discussed, is just copy/pasting the FFTESS code from MCMCDiagnosticTools so we don't need to take everything else, which might be the easiest option.

> IMO copying code is usually one of the worst alternatives. Regarding FFT, I assume you could/should replace FFTW with AbstractFFTs, as we do in MCMCDiagnosticTools and MCMCChains as well.

FFT isn't the important bit -- what matters is that we need a function to calculate the ESS but don't want to pull in everything else in MCMCDiagnosticTools.jl along with it.

Seth suggested letting users calculate the relative effective sample size themselves; I think this is reasonable, but it's a bit inconvenient.
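For reference, the only place the relative ESS enters PSIS is the tail length used for the generalized Pareto fit, so accepting it as a user-supplied keyword is cheap. A hypothetical sketch (the function name is illustrative; the heuristic is the PSIS paper's M = min(S/5, 3*sqrt(S/r_eff))):

```julia
# Hypothetical sketch: with r_eff = 1.0 as a neutral default, users who skip
# the ESS computation simply get the unadjusted tail length.
psis_tail_length(ndraws::Integer; r_eff::Real=1.0) =
    min(cld(ndraws, 5), ceil(Int, 3 * sqrt(ndraws / r_eff)))
```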

@devmotion (Member)

Sure, the FFTW suggestion was just an additional comment.

I haven't benchmarked the precompilation and loading times of MCMCDiagnosticTools, but it mainly depends on lightweight packages: https://github.com/devmotion/MCMCDiagnosticTools.jl/blob/main/Project.toml Most of these packages are just interfaces (AbstractFFTs, DataAPI, MLJModelInterface, Tables), stdlibs (LinearAlgebra, Random, Statistics), or standard packages from the math/stats ecosystem that basically every package depends on (SpecialFunctions and StatsBase). The only other dependency is Distributions, which is also quite common in stats/probability theory packages. If there are any issues regarding loading/compilation times, it would be good to track them down (invalidations in one of these dependencies or in MCMCDiagnosticTools?) and fix them.

From just looking at the list of dependencies, it is not clear to me why you would want to remove the dependency on MCMCDiagnosticTools and copy some of the code. In my experience this usually leads to diverging code bases, i.e., you will miss out on any improvements or bug fixes in MCMCDiagnosticTools and vice versa; this happened regularly and is still a problem, e.g., with the copied-from-Distributions-and-slightly-modified code in DistributionsAD.

@ParadaCarleton (Member)

> Sure, the FFTW suggestion was just an additional comment.

> I haven't benchmarked the precompilation and loading times of MCMCDiagnosticTools, but it mainly depends on lightweight packages: https://github.com/devmotion/MCMCDiagnosticTools.jl/blob/main/Project.toml Most of these packages are just interfaces (AbstractFFTs, DataAPI, MLJModelInterface, Tables), stdlibs (LinearAlgebra, Random, Statistics), or standard packages from the math/stats ecosystem that basically every package depends on (SpecialFunctions and StatsBase). The only other dependency is Distributions, which is also quite common in stats/probability theory packages. If there are any issues regarding loading/compilation times, it would be good to track them down (invalidations in one of these dependencies or in MCMCDiagnosticTools?) and fix them.
>
> From just looking at the list of dependencies, it is not clear to me why you would want to remove the dependency on MCMCDiagnosticTools and copy some of the code. In my experience this usually leads to diverging code bases, i.e., you will miss out on any improvements or bug fixes in MCMCDiagnosticTools and vice versa; this happened regularly and is still a problem, e.g., with the copied-from-Distributions-and-slightly-modified code in DistributionsAD.

Makes sense. From what I can see, MCMCDiagnosticTools' load time comes primarily from Distributions.jl. Do you happen to know where Distributions.jl is used in MCMCDiagnosticTools? If it's just a few lines that need, say, a Normal PDF calculation, that could be hand-coded instead to remove the dependency.

Another way to cut load times might be switching from FFTESS to another ESS method so we can drop FFTW, but I have no idea how they compare performance/accuracy-wise. (Using AbstractFFTs doesn't make much difference since users will still have to load their own FFT library.)
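(For the hand-coding idea above: if the usage really were just a scalar normal density, the replacement is a one-liner. A sketch, not checked against the actual call sites in MCMCDiagnosticTools:)

```julia
# Standard-normal density by hand; equivalent to Distributions.pdf(Normal(), x)
# or StatsFuns.normpdf(x).
normpdf_manual(x::Real) = exp(-abs2(x) / 2) / sqrt(2π)
```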

@devmotion (Member)

> Another way to cut load times might be switching from FFTESS to another ESS method so we can drop FFTW, but I have no idea how they compare performance/accuracy-wise.

Performance-wise in most of my benchmarks computing the ESS with FFT is the worst out of all methods.

> (Using AbstractFFTs doesn't make much difference since users will still have to load their own FFT library.)

That's the main point of using AbstractFFTs: users can pick whatever FFT library they want that implements the interface.
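The pattern, as a minimal sketch (the autocovariance function is illustrative, not MCMCDiagnosticTools' implementation): the package depends only on the AbstractFFTs interface, and whichever backend the user loads supplies the actual transforms.

```julia
using AbstractFFTs   # interface-only dependency; ships no FFT implementation

# Biased FFT-based autocovariance, zero-padded to avoid circular wrap-around.
# rfft/irfft will error unless the user has loaded a backend such as FFTW.
function autocov_fft(x::AbstractVector{<:Real})
    n = length(x)
    xc = x .- (sum(x) / n)
    npad = nextpow(2, 2n)
    X = rfft(vcat(xc, zeros(npad - n)))
    return irfft(complex(abs2.(X)), npad)[1:n] ./ n
end

# The user picks the backend, e.g.:
# using FFTW   # or any other package implementing the AbstractFFTs API
```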

@devmotion (Member)

Some benchmarks: TuringLang/MCMCChains.jl#156 and TuringLang/MCMCChains.jl#215. FFT (using FFTW) was by far the slowest method.

@devmotion (Member)

This is also why it's neither the default nor recommended 😄

@ParadaCarleton (Member)

> This is also why it's neither the default nor recommended

Haha, makes sense! I just copied it from R's loo package, since I figured they must have some good reason for using it. I'll swap it out for another one then -- is the default the best to stick with?

@devmotion (Member)

According to the linked benchmarks, yes.

I checked where exactly Distributions is used in MCMCDiagnosticTools. It turns out it is mainly in the discrete diagnostics (for sampling and for evaluating CDFs; the former could maybe be replaced by some, possibly less optimized, calls to StatsBase, and the latter by calling into StatsFuns), but unfortunately it is crucial for the Rstar diagnostic: for probabilistic predictions we return a scaled Poisson binomial distribution (https://github.com/devmotion/MCMCDiagnosticTools.jl/blob/90f502dec5f116f7c12a42df1aa285137dffe6af/src/rstar.jl#L129). So I don't think it can be removed as a dependency.
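(To illustrate the StatsFuns option for the CDF part -- these scalar functions exist without the rest of the Distributions machinery; the particular distributions shown are only examples:)

```julia
using Distributions, StatsFuns

cdf(Normal(), 1.3) ≈ normcdf(1.3)        # true
cdf(Poisson(4.0), 6) ≈ poiscdf(4.0, 6)   # true
```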

BTW, the dependencies of Distributions are lightweight as well (https://github.com/JuliaStats/Distributions.jl/blob/master/Project.toml), maybe apart from StatsFuns, which still depends on Rmath (there are JuliaStats/StatsFuns.jl#125 and JuliaStats/StatsFuns.jl#113, though), so probably any performance issues are caused by Distributions itself. Since it is loaded by most stats/probability theory/Turing packages anyway, I wonder how helpful it is to try to remove it.

@sethaxen (Member, Author)

> Another way to cut load times might be switching from FFTESS to another ESS method so we can drop FFTW, but I have no idea how they compare performance/accuracy-wise.

> Performance-wise in most of my benchmarks computing the ESS with FFT is the worst out of all methods.

> Haha, makes sense! I just copied it from R's loo package, since I figured they must have some good reason for using it. I'll swap it out for another one then -- is the default the best to stick with?

> According to the linked benchmarks, yes.

There are more factors to consider than benchmark speed, though. ESS computation is not a computational bottleneck in the statistical workflow (the inference method, e.g. MCMC, is), so IMO we should use whatever is currently considered the best ESS method. Right now, that appears to be the rank-normalized bulk ESS method from Vehtari et al. (2021). That paper notes:

> In our experiments, FFT-based autocorrelation estimates have also been computationally more accurate than naive autocovariance computation.

While ESSMethod references this paper, it has a number of substantial differences from it (not just the use of FFT, but also at least the maximum lag and the choice of bias), and in my experience it can disagree quite significantly with Stan's and ArviZ's ESS methods, both of which implement the bulk ESS method and tend to agree with each other to within less than one effective sample.

For some reason though, the loo package doesn't seem to use the rank-normalized version.
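For readers unfamiliar with the method: the key step is transforming the pooled draws to normal quantiles of their ranks before running the usual ESS estimator. A minimal sketch, following the formula in the paper (not taken from any of the implementations discussed here):

```julia
using StatsBase, StatsFuns

# Replace pooled draws by normal quantiles of their fractional ranks; the
# ordinary bulk ESS estimator is then run on the transformed draws.
function rank_normalize(x::AbstractArray{<:Real})
    S = length(x)
    r = tiedrank(vec(x))                          # ranks, with ties averaged
    z = norminvcdf.((r .- 3 / 8) ./ (S + 1 / 4))  # offset used in the paper
    return reshape(z, size(x))
end
```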

@devmotion (Member)

I think it is nevertheless questionable whether the FFT-based approach should be considered the best one. It is difficult to comment on accuracy and efficiency without inspecting the actual implementation and making sure that all methods are implemented in an optimal way. And even then, efficiency might differ between programming languages due to differences in, e.g., how efficiently FFTs are implemented or whether the language yields optimized native code for for-loops or requires vectorization. (As a side remark, there's a paper that claims the best way to compute the PMF values of a Poisson binomial distribution is based on FFTs; however, it turned out to be better to use the dynamic programming approach that the authors claimed was worse, and we therefore replaced the FFT approach with it in Distributions.)

BTW, all three ESS implementations use the same ESS algorithm from the split-rhat paper; really the only difference is how the autocorrelation terms are computed.

@tpapp also noticed in MCMCDiagnostics that often only a few autocorrelation terms are needed, which I assume explains some of the efficiency differences.
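That observation is the crux of the non-FFT approach: with Geyer-style truncation, autocorrelation terms are accumulated lag by lag and the loop stops as soon as the sum of an adjacent pair turns non-positive, so often only a few terms are ever computed, whereas the FFT pays for all lags up front. A sketch (not any of the implementations discussed here):

```julia
using Statistics

# Integrated autocorrelation time with Geyer's initial-positive-sequence
# truncation; ESS ≈ n / τ.
function integrated_autocorr_time(x::AbstractVector{<:Real})
    n = length(x)
    xc = x .- mean(x)
    c0 = sum(abs2, xc) / n
    ρ(k) = sum(i -> xc[i] * xc[i + k], 1:(n - k)) / (n * c0)
    τ, k = 1.0, 1
    while k + 1 < n
        pair = ρ(k) + ρ(k + 1)    # adjacent-pair sum
        pair > 0 || break         # stop at the first non-positive pair
        τ += 2 * pair
        k += 2
    end
    return τ
end
```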

@sethaxen (Member, Author)

> BTW, all three ESS implementations use the same ESS algorithm from the split-rhat paper; really the only difference is how the autocorrelation terms are computed.

Which split-rhat paper? If you mean Vehtari et al. (2021) (cited in the docstrings), there are several differences, and as I noted, the key innovation in that paper is computing the ESS on rank-normalized samples, as well as the bulk, tail, and quantile ESS estimates, none of which are currently in MCMCDiagnosticTools.

> I think it is nevertheless questionable whether the FFT-based approach should be considered the best one. It is difficult to comment on accuracy and efficiency without inspecting the actual implementation and making sure that all methods are implemented in an optimal way. And even then, efficiency might differ between programming languages due to differences in, e.g., how efficiently FFTs are implemented or whether the language yields optimized native code for for-loops or requires vectorization. (As a side remark, there's a paper that claims the best way to compute the PMF values of a Poisson binomial distribution is based on FFTs; however, it turned out to be better to use the dynamic programming approach that the authors claimed was worse, and we therefore replaced the FFT approach with it in Distributions.)

Your point seems to be that papers can make mistakes, and I of course agree. But if we dismiss their advice, it should only be after running our own experiments. Unfortunately, they don't detail their FFT experiments or the magnitude of the improvement they saw from switching to FFT.

Here's a quick check indicating that, at least for small IID samples, their method has slightly lower variance than ESSMethod, and FFTESSMethod maybe has a slightly lower variance than ESSMethod as well, though probably not enough to be worth it. But the big improvement of their method shows up for Cauchy random variables, where it has a much lower variance due to rank-normalization.

```julia
using ArviZ, MCMCDiagnosticTools, Random, StatsBase, FFTW

ndraws = 100
nreps = 100_000

Random.seed!(42)

randcauchy(n) = map(_ -> randn() / randn(), 1:n)  # ratio of normals is Cauchy
rand_funs = [randn, rand, randexp, randcauchy]
ess_funs = [
    ESSMethod => x -> MCMCDiagnosticTools.ess_rhat(reshape(x, :, 1, 1))[1][1],
    FFTESSMethod => x -> MCMCDiagnosticTools.ess_rhat(reshape(x, :, 1, 1); method=FFTESSMethod())[1][1],
    ArviZ => x -> ArviZ.ess(reshape(x, 1, :)),
]

for f in rand_funs, (ess_name, ess_fun) in ess_funs
    # collect into a vector so mean and std are computed on the same samples
    μ, σ = mean_and_std([ess_fun(f(ndraws)) for _ in 1:nreps])
    println("$f: $(round(μ; digits=2)) ± $(round(σ; digits=2)) ($ess_name)")
end
```
```
randn: 100.46 ± 30.39 (ESSMethod)
randn: 100.54 ± 30.23 (FFTESSMethod)
randn: 99.05 ± 28.04 (ArviZ)
rand: 100.48 ± 30.57 (ESSMethod)
rand: 100.42 ± 30.35 (FFTESSMethod)
rand: 99.18 ± 27.94 (ArviZ)
randexp: 100.19 ± 27.99 (ESSMethod)
randexp: 100.21 ± 27.59 (FFTESSMethod)
randexp: 98.99 ± 27.86 (ArviZ)
randcauchy: 102.65 ± 210.04 (ESSMethod)
randcauchy: 101.47 ± 260.84 (FFTESSMethod)
randcauchy: 98.98 ± 27.92 (ArviZ)
```

We should, of course, also test on (auto/anti)correlated chains.

So probably more important than FFT vs. no FFT is adding the bulk/tail/quantile rank-normalized ESS methods to MCMCDiagnosticTools. It would be preferable if the user could mix and match autocovariance-computation methods, whether rank normalization is performed, and how the maximum lag and bias are computed, but this would require a refactor.

@devmotion (Member)

> Which split-rhat paper? If you mean Vehtari et al. (2021) (cited in the docstrings), there are several differences, and as I noted, the key innovation in that paper is computing the ESS on rank-normalized samples, as well as the bulk, tail, and quantile ESS estimates, none of which are currently in MCMCDiagnosticTools.

It is based on this paper (well, on an earlier version, https://arxiv.org/pdf/1903.08008v1.pdf, since the method was implemented quite a while ago), just without the rank normalization and the other improvements in section 4. More concretely, it uses the formulas in section 3 (and appendix A in the earlier versions; it seems this has all been merged into section 3 now).

> Your point seems to be that papers can make mistakes, and I of course agree. But if we dismiss their advice, it should only be after running our own experiments.

Sure, we shouldn't just dismiss existing results. But we conducted benchmarks, and since all our runs showed a significant overhead of FFT vs. non-FFT, and apparently the same was observed in MCMCDiagnostics, I am sceptical that an FFT-based approach is the best.

> But the big improvement of their method shows up for Cauchy random variables, where it has a much lower variance due to rank-normalization.

I agree, it seems there is an improvement here. A more detailed inspection of the distributions would probably be good; 100 ± 260 seems to indicate a very skewed distribution or negative values.

> So probably more important than FFT vs. no FFT is adding the bulk/tail/quantile rank-normalized ESS methods to MCMCDiagnosticTools. It would be preferable if the user could mix and match autocovariance-computation methods, whether rank normalization is performed, and how the maximum lag and bias are computed, but this would require a refactor.

Sure, it would be good to add such improvements. But as you say, this is orthogonal to the original discussion about FFT vs. non-FFT that I commented on.
