Use DataFrames for ChainSummaries #79

Merged: 47 commits, Apr 16, 2019

cpfiffer (Member) commented Apr 9, 2019

This PR moves MCMCChains away from the I/O-driven ChainSummary and ChainSummaries structs and replaces them with a much simpler ChainDataFrame struct, which brings much of the versatility of DataFrames to printing, organization, and indexing. It's now a lot easier to get values out of chain summaries (see #71).

The bulk of the work is done. There are still a couple of steps remaining, namely tidying up and making sure all the tests are doing what they're supposed to:

  • Remove ChainSummary and ChainSummaries entirely
  • Make all the tests pass

There is a new function called summarize which maps a function parameter-wise and returns a dataframe. Hopefully this'll make it quicker to get some non-traditional summary stats, and it's used a lot on the backend. Example:

using MCMCChains, Statistics

chn = Chains(rand(500, 3, 2))
x = summarize(chn, mean, var, std)
│ Row │ parameters │ mean     │ var       │ std      │
│     │ Symbol     │ Float64  │ Float64   │ Float64  │
├─────┼────────────┼──────────┼───────────┼──────────┤
│ 1   │ Param1     │ 0.49674  │ 0.0877798 │ 0.296277 │
│ 2   │ Param2     │ 0.492215 │ 0.0807893 │ 0.284235 │
│ 3   │ Param3     │ 0.49734  │ 0.0850419 │ 0.291619 │

The indexing on these is overloaded to put a higher priority on rows:

x[:mean] # Select the mean column
x[:Param1, :mean] # Select the mean for Param1
x[:Param1, :] # Select the Param1 row

The show function for Chains now uses this as well:

julia> chn
Object of type Chains, with data of type 500×3×2 Array{Float64,3}

Iterations        = 1:500
Thinning interval = 1
Chains            = 1, 2
Samples per chain = 500
parameters        = Param1, Param2, Param3

Summary Statistics

│ Row │ parameters │ mean     │ std      │ naive_se   │ mcse       │ ess     │
│     │ Symbol     │ Float64  │ Float64  │ Float64    │ Float64    │ Float64 │
├─────┼────────────┼──────────┼──────────┼────────────┼────────────┼─────────┤
│ 1   │ Param1     │ 0.49674  │ 0.296277 │ 0.00936909 │ 0.0109575  │ 731.097 │
│ 2   │ Param2     │ 0.492215 │ 0.284235 │ 0.00898829 │ 0.00939995 │ 914.329 │
│ 3   │ Param3     │ 0.49734  │ 0.291619 │ 0.00922182 │ 0.00675694 │ 1000.0  │
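
As a rough sanity check on the columns in the Summary Statistics table above, the printed numbers can be reproduced by hand. The sketch below copies the std and mcse values from the table and assumes n = 1000 pooled draws (500 samples × 2 chains). The naive_se column matches std / sqrt(n). The ess column also happens to match min((std / mcse)^2, n) for these three rows, but that is only an observation about the printed output, not a statement of how MCMCChains computes ess; the variable names are illustrative.

```julia
# Values copied from the Summary Statistics table above.
n    = 500 * 2                                   # samples per chain × number of chains
stds = [0.296277, 0.284235, 0.291619]            # std column
mcse = [0.0109575, 0.00939995, 0.00675694]       # mcse column

# Naive standard error: treats all n draws as if they were independent.
naive_se = stds ./ sqrt(n)

# Assumption (inferred from the printed numbers only): an effective sample
# size derived from the MCSE and capped at the total number of draws.
ess_guess = min.((stds ./ mcse) .^ 2, n)

println(naive_se)
println(ess_guess)
```

For Param3 the uncapped ratio exceeds the number of draws, which would explain why the table shows exactly 1000.0 in that row.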

All the stats functions include keyword arguments for sections and append_chains. If append_chains=true, all the chains are pooled into one chain. Otherwise, you'll get a vector of ChainDataFrames, one for each chain. The sections keyword lets you restrict the summary to particular sections of the chain.
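
To make the difference between pooled and per-chain summaries concrete, here is a minimal sketch using plain arrays and the Statistics standard library. It does not call MCMCChains; the array layout (iterations × parameters × chains) mirrors the Chains(rand(500, 3, 2)) example, and the names pooled and per_chain are purely illustrative.

```julia
using Statistics

draws = rand(500, 3, 2)            # iterations × parameters × chains
nparams = size(draws, 2)
nchains = size(draws, 3)

# Analogous to append_chains=true: all chains are pooled,
# giving one value per parameter.
pooled = [mean(draws[:, p, :]) for p in 1:nparams]

# Analogous to append_chains=false: one summary per chain,
# each holding one value per parameter.
per_chain = [[mean(draws[:, p, c]) for p in 1:nparams] for c in 1:nchains]

println(pooled)
println(per_chain)
```

With chains of equal length, the pooled mean equals the average of the per-chain means; for statistics such as the variance, the two views generally differ.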

goedman and others added 30 commits March 30, 2019 15:52
…ains.jl into dfchainsummary

# Conflicts:
#	src/MCMCChains.jl
#	src/dfchainsummary.jl
Fix issue between Julia 1.1 and Julia 1.2
1. Renamed dfchainsummary.jl to summarize.jl.
2. Renamed and added several more test cases to summarize_tests.jl.
3. Updated MCMCChains.jl accordingly.
cpfiffer (Member, Author) commented

I've added an ess function that borrows heavily from the Stan implementation. Its output now appears whenever describe, summarystats, or show(chn) is called.

@cpfiffer cpfiffer marked this pull request as ready for review April 14, 2019 15:18
cpfiffer (Member, Author) commented

Looks like our tests are passing. This is now ready for a formal review.

goedman (Collaborator) commented Apr 14, 2019

I think this looks great Cameron!

trappmartin (Member) commented

I'm happy to review the PR by tomorrow.

goedman (Collaborator) left a comment

As I said earlier, this looks great. Once merged I will update all models in CmdStan and StatisticalRethinking to use summarize. But right now all ~60 models run.

goedman (Collaborator) commented Apr 15, 2019

Cameron, just as a thought, would it be useful to have describe() return the summary data frame?
Working through the CmdStan examples I think I'll end up calling both most of the time.

cpfiffer (Member, Author) commented

I actually wasn't sure what to do about that, so I'm happy to take suggestions. The reason it doesn't currently return anything is that I thought of it primarily as an I/O function that shows you information about the chain, not one that returns the dataframes. I had it set to do both, but if you ran describe from the REPL it would print the values twice: once for the show calls inside the function, and once for the return value.

Perhaps a better option would be to have describe(chn) only return the dataframes (with no printing or internal show calls), and then have show(chn) call describe on the back end. Does that sound slightly more reasonable?
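
A minimal sketch of that proposal, under stated assumptions: ToyChain and describe_summary are hypothetical names invented for illustration and are not part of MCMCChains. The point is only the division of labor, with one function that returns summary data and prints nothing, and a show method that calls it for display.

```julia
using Statistics

# Hypothetical stand-in for a chain: iterations × parameters.
struct ToyChain
    draws::Matrix{Float64}
end

# Returns the summary as data and prints nothing, so calling it
# from the REPL displays the result only once (as the return value).
function describe_summary(c::ToyChain)
    return (mean = vec(mean(c.draws; dims=1)),
            std  = vec(std(c.draws; dims=1)))
end

# Display goes through show, which reuses describe_summary.
function Base.show(io::IO, c::ToyChain)
    s = describe_summary(c)
    println(io, "ToyChain with ", size(c.draws, 1), " iterations")
    for i in eachindex(s.mean)
        println(io, "  Param", i, ": mean = ", round(s.mean[i]; digits=3),
                ", std = ", round(s.std[i]; digits=3))
    end
end

chain = ToyChain(rand(500, 3))
summary_data = describe_summary(chain)   # data only, no printing
show(stdout, chain)                      # printing only
```

With this split, a caller who wants both the printout and the values computes the summary once and can pass it along, rather than having a single function both print and return.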

trappmartin (Member) left a comment

Thanks for the work, looks good. However, I think this PR is a bit too large, which makes it hard for me to review carefully. Next time, please try to keep PRs smaller so that each one doesn't try to solve so many issues.

src/discretediag.jl (outdated, resolved)
src/discretediag.jl (resolved)
trappmartin (Member) commented Apr 15, 2019

Looks good to me. I would not add any further functionality, as this PR is already quite large. Better to merge the PR and open a new one for the describe function.

goedman (Collaborator) commented Apr 15, 2019

Yes, running it in the REPL would require s = describe(chns); I guess, with the ;. I could live with that. Part of my suggestion is to prevent, for very big models, generating the data frame twice. But in general the time the sampling takes dominates the post processing anyway.

cpfiffer (Member, Author) commented

Merging for now, since I've only modified comments and previous tests were passing. Thanks for the review, guys!

@cpfiffer cpfiffer merged commit 6b693f5 into master Apr 16, 2019
@cpfiffer cpfiffer deleted the dfchainsummary branch April 16, 2019 01:07