Use DataFrames for ChainSummaries #79

cpfiffer · 2019-04-09T04:04:05Z

This PR switches MCMCChains off the strange I/O-driven ChainSummary and ChainSummaries structs and onto a much simpler ChainDataFrame struct, which brings a lot of the versatility of DataFrames to printing, organization, and indexing. It's now a lot easier to get values out of ChainSummary structs (see #71).

Currently, the bulk of the work is done. A couple of steps remain, namely tidying up and making sure all the tests do what they're supposed to:

Remove ChainSummary and ChainSummaries entirely
Make all the tests pass

There is a new function called summarize which maps a function parameter-wise and returns a dataframe. Hopefully this'll make it quicker to get some non-traditional summary stats, and it's used a lot on the backend. Example:

using MCMCChains, Statistics

chn = Chains(rand(500, 3, 2))
x = summarize(chn, mean, var, std)

│ Row │ parameters │ mean     │ var       │ std      │
│     │ Symbol     │ Float64  │ Float64   │ Float64  │
├─────┼────────────┼──────────┼───────────┼──────────┤
│ 1   │ Param1     │ 0.49674  │ 0.0877798 │ 0.296277 │
│ 2   │ Param2     │ 0.492215 │ 0.0807893 │ 0.284235 │
│ 3   │ Param3     │ 0.49734  │ 0.0850419 │ 0.291619 │
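
For readers who want to see what "maps a function parameter-wise" amounts to, here is a minimal sketch that works on a plain array rather than a Chains object. It is not the code in this PR: `summarize_sketch` and the `ParamN` naming are assumptions made for illustration, and it returns a vector of NamedTuples instead of a dataframe.

```julia
using Statistics

# Hypothetical stand-in for `summarize`, not the MCMCChains implementation.
# `samples` is an iterations × parameters × chains array.
function summarize_sketch(samples::AbstractArray{<:Real,3}, funs...)
    nparams = size(samples, 2)
    cols = (:parameters, map(nameof, funs)...)      # column names, e.g. :mean
    return map(1:nparams) do i
        draws = vec(samples[:, i, :])               # pool all chains for parameter i
        vals = (Symbol("Param", i), map(f -> f(draws), funs)...)
        NamedTuple{cols}(vals)                      # one row per parameter
    end
end

rows = summarize_sketch(rand(500, 3, 2), mean, var, std)
foreach(println, rows)
```

Each row carries the parameter name followed by one entry per function, which is the same shape as the table above.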

The indexing on these is overloaded to put a higher priority on rows:

x[:mean] # Select the mean column
x[:Param1, :mean] # Select the mean for Param1
x[:Param1, :] # Select the Param1 row
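
As a rough illustration of how this kind of indexing can be overloaded, the sketch below defines a toy table type with three `getindex` methods. `ToyFrame` and its fields are hypothetical and are not how ChainDataFrame is defined; the point is only that a leading Symbol in a two-index lookup is resolved against the parameter names.

```julia
# Hypothetical toy table: row names plus named Float64 columns.
struct ToyFrame
    rownames::Vector{Symbol}
    columns::Dict{Symbol,Vector{Float64}}
end

# One symbol: treated as a column name, mirroring `x[:mean]`.
Base.getindex(t::ToyFrame, col::Symbol) = t.columns[col]

# (row, column): find the row by parameter name, then pick the column entry.
function Base.getindex(t::ToyFrame, row::Symbol, col::Symbol)
    i = findfirst(isequal(row), t.rownames)
    i === nothing && throw(KeyError(row))
    return t.columns[col][i]
end

# (row, :): collect every column's value for that parameter.
Base.getindex(t::ToyFrame, row::Symbol, ::Colon) =
    Dict(c => t[row, c] for c in keys(t.columns))

x = ToyFrame([:Param1, :Param2], Dict(:mean => [0.5, 0.4], :std => [0.3, 0.2]))
println(x[:mean])
println(x[:Param1, :mean])
println(x[:Param2, :])
```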

The show function for Chains now uses this as well:

julia> chn
Object of type Chains, with data of type 500×3×2 Array{Float64,3}

Iterations        = 1:500
Thinning interval = 1
Chains            = 1, 2
Samples per chain = 500
parameters        = Param1, Param2, Param3

Summary Statistics

│ Row │ parameters │ mean     │ std      │ naive_se   │ mcse       │ ess     │
│     │ Symbol     │ Float64  │ Float64  │ Float64    │ Float64    │ Float64 │
├─────┼────────────┼──────────┼──────────┼────────────┼────────────┼─────────┤
│ 1   │ Param1     │ 0.49674  │ 0.296277 │ 0.00936909 │ 0.0109575  │ 731.097 │
│ 2   │ Param2     │ 0.492215 │ 0.284235 │ 0.00898829 │ 0.00939995 │ 914.329 │
│ 3   │ Param3     │ 0.49734  │ 0.291619 │ 0.00922182 │ 0.00675694 │ 1000.0  │

All the stats functions include keyword arguments for sections and append_chains. If append_chains=true, then all the chains are smushed together into one chain. Otherwise, you'll get a vector of ChainDataFrames, one for each chain. sections just allows you to subset your chain by sections.
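
To make the append_chains behaviour concrete, here is a small sketch on a raw array. `chain_means` is a hypothetical helper, not one of the package's stats functions; it only mirrors the described semantics of pooling versus one result per chain.

```julia
using Statistics

# Hypothetical helper, not the package API.
# `samples` is an iterations × parameters × chains array.
function chain_means(samples::AbstractArray{<:Real,3}; append_chains::Bool = true)
    if append_chains
        # Pool the draws from every chain: one mean per parameter.
        return vec(mean(samples; dims = (1, 3)))
    else
        # Keep chains separate: one vector of per-parameter means per chain.
        return [vec(mean(samples[:, :, c]; dims = 1)) for c in 1:size(samples, 3)]
    end
end

# Two chains, two parameters: chain 1 is all ones, chain 2 is all threes.
samples = cat(ones(4, 2), 3 .* ones(4, 2); dims = 3)
pooled = chain_means(samples)
per_chain = chain_means(samples; append_chains = false)
println(pooled)
println(per_chain)
```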

cpfiffer · 2019-04-14T05:11:25Z

I've added an ess function which liberally lifts from the Stan implementation. It'll show up now whenever describe, summarystats, or show(chn) is called.
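
For orientation, the basic idea behind an effective sample size estimate can be sketched as below. This is a deliberately simplified single-chain version and is not the Stan-derived code added in this PR: it sums autocorrelations until the first non-positive one, whereas Stan's estimator combines multiple chains and uses a more careful truncation rule. `ess_sketch` is a hypothetical name.

```julia
using Statistics

# Simplified single-chain ESS: n / (1 + 2 * sum of positive-lag autocorrelations),
# stopping the sum at the first non-positive autocorrelation.
function ess_sketch(x::AbstractVector{<:Real})
    n = length(x)
    μ = mean(x)
    v = sum(abs2, x .- μ) / n            # biased variance, consistent with the lag sums
    v == 0 && return float(n)            # constant chain: nothing to correct
    acc = 0.0
    for lag in 1:(n - 1)
        # Sample autocorrelation at this lag.
        ρ = sum((x[1:(n - lag)] .- μ) .* (x[(lag + 1):n] .- μ)) / (n * v)
        ρ <= 0 && break
        acc += ρ
    end
    return n / (1 + 2 * acc)
end

println(ess_sketch(collect(1.0:6.0)))          # strongly trending series
println(ess_sketch([1.0, -1.0, 1.0, -1.0]))    # alternating series
```

A trending series has positive autocorrelation and so an ESS below its length, while the alternating series stops at the first (negative) lag and keeps its full length.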

cpfiffer · 2019-04-14T15:19:12Z

Looks like our tests are passing. This is now ready for a formal review.

goedman · 2019-04-14T17:27:46Z

I think this looks great, Cameron!

trappmartin · 2019-04-14T17:28:55Z

I'm happy to review the PR by tomorrow.

goedman

As I said earlier, this looks great. Once merged I will update all models in CmdStan and StatisticalRethinking to use summarize. But right now all ~60 models run.

goedman · 2019-04-15T16:54:42Z

Cameron, just as a thought, would it be useful to have describe() return the summary data frame?
Working through the CmdStan examples I think I'll end up calling both most of the time.

cpfiffer · 2019-04-15T17:07:57Z

I actually wasn't sure what to do about that, so I'm happy to take suggestions there. My thinking as to why it doesn't return anything is that I had thought of it as primarily an I/O function that shows you information about the chain, not one that returns the dataframes. I had it set to do both, but if you ran describe from the REPL it would print the value out twice: once for the show calls in the function, and once for the return value.

Perhaps a better option would be to have describe(chn) only return the dataframes (with no printing or internal show calls), and then set show(chn) to call describe in the back end. Does that sound slightly more reasonable?
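
A minimal sketch of that split, using a toy type rather than Chains, might look like the following. `ToyChain` and `describe_sketch` are hypothetical names; the only point is that the summary function returns data silently and all printing goes through show.

```julia
using Statistics

# Hypothetical toy type standing in for a chain; not the MCMCChains `Chains` struct.
struct ToyChain
    draws::Vector{Float64}
end

# Returns the summary and prints nothing.
describe_sketch(c::ToyChain) = (mean = mean(c.draws), std = std(c.draws))

# Display goes through `show`, which calls the pure function in the back end.
function Base.show(io::IO, c::ToyChain)
    s = describe_sketch(c)
    print(io, "ToyChain: mean = ", s.mean, ", std = ", s.std)
end

c = ToyChain([1.0, 2.0, 3.0])
s = describe_sketch(c)   # the call itself prints nothing
println(c)               # printing happens only through show
```

With this arrangement the REPL shows a returned summary once, as the return value, and a trailing `;` suppresses it entirely.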

trappmartin

Thanks for the work, looks good. However, I think this PR is a bit too large, which makes it hard for me to review carefully. Maybe next time make sure that the PRs stay smaller and don't try to solve so many issues.

src/discretediag.jl

trappmartin · 2019-04-15T18:34:28Z

Looks good to me. I would not add any further functionality, as this PR is already quite large. Better to merge the PR and open a new one for the describe function.

goedman · 2019-04-15T19:49:34Z

Yes, running it in the REPL would require s = describe(chns); I guess, with the ;. I could live with that. Part of my suggestion is to prevent, for very big models, generating the data frame twice. But in general the time the sampling takes dominates the post processing anyway.

cpfiffer · 2019-04-16T01:07:14Z

Merging for now, since I've only modified comments and previous tests were passing. Thanks for the review, guys!

goedman and others added 30 commits March 30, 2019 15:52

DataFrame summary

cdc5685

Added ChainDataFrame struct.

715cbcf

Added names() and size()

0ecacad

Added a "sorted" keyword to the DataFrame constructor to sort params

ad20ad7

Added summarize function and rebuilt the dfchainsummary function

3a5ac1a

Added exports

e73ee2d

Fixed exports

df44be8

Wrapped dataframe in ChainDataFrame

cd44ef9

Temp updates

3ec0cad

Merge branch 'dfchainsummary' of https://github.com/TuringLang/MCMCCh…

6fbd1cb

…ains.jl into dfchainsummary # Conflicts: # src/MCMCChains.jl # src/dfchainsummary.jl

Using Cameron's functions

94029c7

Updated to dfchainsummary

ae14c91

Merge branch 'master' into dfchainsummary

c74ac63

Update dfchainsummary.jl

c9462c1

Fix issue between Julia 1.1 and Julia 1.2

Several changes:

5bbbe62

1. Renamed dfchainsummary.jl to summarize.jl. 2. Renamed and added several more test cases to summarize_tests.jl. 3. Updated MCMCChains.jl accordingly.

Exported autocor.

4050119

Allowed for sections to be a vector or a single parameter.

98e3514

Added DF versions of autocor and cor.

632afdf

Commenting out old stats.jl functions.

51feea6

Merge branch 'dfchainsummary' of https://github.com/TuringLang/MCMCCh…

06ca679

…ains.jl into dfchainsummary

Add cor, autocor. Working on changerate.

adbec60

Export/import changes.

0035e5a

ChainDataFrame migrations.

be26d70

Fixed constructors.jl spacing.

111f8e1

Set diagnostic functions to use ChainDataFrames

5fdb786

Added all stats functions, cleaning up.

bb5e331

Allowed summarize(chn) to allow for multiple chains.

857c578

Fixing some tests.

8c042c6

Added catch statement to not remove Union{Missing, T} when we have mi…

5efec2b

…ssing.

Fixes ignore missing bug, autocorplot

93dbc42

cpfiffer added 12 commits April 13, 2019 20:55

Fixes skipping missing values for summarystats.

1d0d0e1

Removes my earlier weird amendments to summarize

0d34c0c

Added shorthand function cskip which wraps collect(skipmissing( . ))

62d4aa9

Minor shuffling around of missing_tests.jl

e8f6465

Updated tests to use "sections" instead of "section"

826b971

Changed the ordering of variables in the testing suite.

4b946d4

Export ess function

a3fb223

Delete old chainsummary file

a1a99ea

Added a names(chn, sections) function.

dd3cb6f

Adds ESS function.

3985bc7

Added the ability to append a dataframe to a summarize function.

1639ee3

Fixed incorrect test specification.

9120ed5

cpfiffer marked this pull request as ready for review April 14, 2019 15:18

cpfiffer requested review from goedman and trappmartin April 14, 2019 15:19

goedman approved these changes Apr 14, 2019

cpfiffer mentioned this pull request Apr 15, 2019

Calculation of Effective Sample Size for multiple chains #85

Closed

trappmartin approved these changes Apr 15, 2019

Addressed comments.

d3d9ec2

cpfiffer merged commit 6b693f5 into master Apr 16, 2019

cpfiffer deleted the dfchainsummary branch April 16, 2019 01:07
