Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StatsModels #201

Merged
merged 7 commits into from Dec 11, 2017
Merged

StatsModels #201

merged 7 commits into from Dec 11, 2017

Conversation

dmbates
Copy link
Contributor

@dmbates dmbates commented Nov 26, 2017

Switch to StatsModels v0.1.0

Tests now use DataFrames v0.11.0, CSV and CategoricalArrays. Because of the change from DataArrays::pool to CategoricalArrays::categorical the order of the elements in the pool changed. This caused with changes in the coefficients and standard errors in some models used for tests.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks!

src/GLM.jl Outdated
using Base.LinAlg.LAPACK: potrf!, potrs!
using Base.LinAlg.BLAS: gemm!, gemv!
using Base.LinAlg: QRCompactWY, Cholesky, BlasReal
using StatsBase: StatsBase, CoefTable, StatisticalModel, RegressionModel
using StatsFuns: logit, logistic
using StatsModels: @formula, Formula, ModelFrame, ModelMatrix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we should reexport StatsModels for now, to avoid the need to call using StatsModels explicitly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really dislike reexporting entire packages. Often I will check

whos(PkgName)

to see what the functions and types in a package are and I hate it when everything in, say, Distributions is in that list.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm not sure what's the best approach here. @ararslan @kleinschmidt Ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the trouble of having using StatsModels inside the package; if GLM users want to mess about with statsmodels stuff then I think it's reasonable to have them explicitly ask for it, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, my concern is about people who just want to fit a GLM. It would be too bad to require them to run using StatsModels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation may not be worse but this PR changes the relationship between GLM and StatsModels. Before, if you just did using GLM you wouldn't get DataFrames or any of the StatsModeling functionality; you'd need to also do using DataFrames. I'm fine with this PR pulling in StatsModels and DataFrames whenever I do using GLM now, because I'm never fitting a GLM to non-tabular data anyway. But there may be people out there who don't want the added dependency on DataFrames and could be surprised by it when they update. Just something to consider.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But there may be people out there who don't want the added dependency on DataFrames and could be surprised by it when they update. Just something to consider.

Yes, hopefully we'll get rid of the dependency on DataFrames at some point. Until then I'd say it's fine if those people need to install DataFrames.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have kind of lost track of the discussion here. Some of the early comments were about @formula not being available without using StatsModels but @formula is exported from this package, as shown below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question was, isn't it weird that you need to call using StatsModels to adjust contrasts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I think we should go with your original suggestion of re-exporting the symbols from StatsModels, as there aren't very many. I will amend the PR accordingly.

test/runtests.jl Outdated
@test isapprox(aic(lm1), -36.409684288095946)
@test isapprox(aicc(lm1), -24.409684288095946)
@test isapprox(bic(lm1), -37.03440588041178)
@test isapprox(StatsBase.aic(lm1), -36.409684288095946)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StatsBase. shouldn't be needed AFAIK?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that. It's vestigial. Initially I wasn't including StatsBase and using fully qualified names instead.

@@ -69,8 +70,8 @@ admit[:rank] = pool(admit[:rank])
@test isapprox(aicc(gm2), 470.7312329339146)
@test isapprox(bic(gm2), 494.4662797585473)
@test isapprox(coef(gm2),
[-3.9899786606380734, 0.0022644256521549043, 0.8040374535155766,
-0.6754428594116577, -1.3402038117481079,-1.5514636444657492])
[-5.3301824723861895, 0.002264425652154903, 0.8040374535155792,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen your comment about why coefficients change, but that's still not clear to me. The order of levels should be the same as with PDAs AFAIK. I guess that's rather a change in StatsModels? Anyway, calling levels! should be enough to get the same behavior as before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are with respect to the contrasts chosen for the levels of admit[:rank], which is read as an integer then converted to categorical. I think the pool conversion from a vector of integers to a PooledDataArray just uses the unique levels without sorting them whereas the categorical conversion does sort. You would know better than I.

In any case, the contrasts chosen here are the sensible ones. I would prefer not to enforce bug-for-bug compatibility with the last version.

@@ -129,8 +130,7 @@ end
end

## Example with offsets from Venables & Ripley (2002, p.189)
anorexia = readtable(joinpath(glm_datadir, "anorexia.csv.gz"))
anorexia[:Treat] = pool(anorexia[:Treat])
anorexia = CSV.read(joinpath(glm_datadir, "anorexia.csv"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe pass categorical=true, that way we're safe if the default changes in a future version (as it's been discussed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the documentation the default value of categorical is true. The reason for the explicit call here is that the levels are integers. I could use an explicit types argument in the call to CSV.read but I don't think that would gain much.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I hadn't realized that categorical=true just means that the result will be categorical if the proportion of unique values is below a certain threshold. I guess that's OK. (I think we would have to pass CategoricalValue as the type to be certain we get a categorical vector.)

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, at least reexporting StatsModels is simpler for users (and we already reexport all StatsBase modeling functions).

Maybe README.md needs updating?

@dmbates
Copy link
Contributor Author

dmbates commented Nov 28, 2017

I guess I don't understand how the @reexport macro works. Notice below that generics and macros, like @formula and coefnames are exported as StatsModels.#@formula, etc. but the types ModelMatrix and ModelFrame are exported as GLM.#ModelMatrix and GLM.#ModelFrame. A package that uses both GLM and StatsModels encounters a name conflict.

julia> whos(GLM)
                      @formula      0 bytes  StatsModels.#@formula
             AbstractContrasts     92 bytes  DataType
                     Bernoulli     40 bytes  UnionAll
                      Binomial     40 bytes  UnionAll
                   CauchitLink     92 bytes  DataType
                   CloglogLink     92 bytes  DataType
               ContrastsCoding    176 bytes  DataType
                     DensePred     92 bytes  DataType
                 DensePredChol     96 bytes  UnionAll
                   DensePredQR     56 bytes  UnionAll
                   DummyCoding    124 bytes  DataType
                 EffectsCoding    124 bytes  DataType
                       Formula    188 bytes  DataType
                           GLM    128 KB     Module
                         Gamma     40 bytes  UnionAll
        GeneralizedLinearModel    280 bytes  UnionAll
                       GlmResp    200 bytes  UnionAll
                 HelmertCoding    124 bytes  DataType
                  IdentityLink     92 bytes  DataType
               InverseGaussian     40 bytes  UnionAll
                   InverseLink     92 bytes  DataType
             InverseSquareLink     92 bytes  DataType
                       LinPred     92 bytes  DataType
                  LinPredModel     92 bytes  DataType
                   LinearModel    160 bytes  UnionAll
                          Link     92 bytes  DataType
                        LmResp     80 bytes  UnionAll
                       LogLink     92 bytes  DataType
                     LogitLink     92 bytes  DataType
                    ModelFrame      0 bytes  GLM.#ModelFrame
                   ModelMatrix      0 bytes  GLM.#ModelMatrix
                        Normal     40 bytes  UnionAll
                       Poisson     40 bytes  UnionAll
                    ProbitLink     92 bytes  DataType
                      SqrtLink     92 bytes  DataType
                   StatsModels    115 KB     Module
                         adjr2      0 bytes  StatsBase.#adjr2
                         adjr²      0 bytes  StatsBase.#adjr2
                 canonicallink      0 bytes  GLM.#canonicallink
                          coef      0 bytes  StatsBase.#coef
                     coefnames      0 bytes  StatsModels.#coefnames
                     coeftable      0 bytes  StatsBase.#coeftable
                       confint      0 bytes  StatsBase.#confint
                      delbeta!      0 bytes  GLM.#delbeta!
                      deviance      0 bytes  StatsBase.#deviance
                      devresid      0 bytes  GLM.#devresid
                           dof      0 bytes  StatsBase.#dof
                  dof_residual      0 bytes  StatsBase.#dof_residual
                      dropterm      0 bytes  StatsModels.#dropterm
                           fit      0 bytes  StatsBase.#fit
                          fit!      0 bytes  StatsBase.#fit!
                       formula      0 bytes  GLM.#formula
                         ftest      0 bytes  GLM.#ftest
                           glm      0 bytes  GLM.#glm
                        glmvar      0 bytes  GLM.#glmvar
                   inverselink      0 bytes  GLM.#inverselink
                       linkfun      0 bytes  GLM.#linkfun
                       linkinv      0 bytes  GLM.#linkinv
                       linpred      0 bytes  GLM.#linpred
                      linpred!      0 bytes  GLM.#linpred!
                            lm      0 bytes  GLM.#lm
                      logistic      0 bytes  StatsFuns.#logistic
                         logit      0 bytes  StatsFuns.#logit
                 loglikelihood      0 bytes  StatsBase.#loglikelihood
                model_response      0 bytes  StatsBase.#model_response
                         mueta      0 bytes  GLM.#mueta
                       mustart      0 bytes  GLM.#mustart
                          nobs      0 bytes  StatsBase.#nobs
                  nulldeviance      0 bytes  StatsBase.#nulldeviance
             nullloglikelihood      0 bytes  StatsBase.#nullloglikelihood
                       predict      0 bytes  StatsBase.#predict
                            r2      0 bytes  StatsBase.#r2
                     residuals      0 bytes  StatsBase.#residuals0 bytes  StatsBase.#r2
                 setcontrasts!      0 bytes  StatsModels.#setcontrasts!
                        stderr      0 bytes  StatsBase.#stderr
                      updateμ!      0 bytes  GLM.#updateμ!
                          vcov      0 bytes  StatsBase.#vcov
                       wrkresp      0 bytes  GLM.#wrkresp

@nalimilan
Copy link
Member

Maybe @reexport is confused by the fact that ModelMatrix and ModelFrame are both types and functions (via constructors)? The GLM.#ModelMatrix indicates a function type (the type of a type is DataType). Probably worth filing an issue.

@dmbates
Copy link
Contributor Author

dmbates commented Nov 28, 2017

Thanks @nalimilan. I just discovered that problem myself. For the time being I will remove those constructors. I think it is another case of my getting trying to get too cute with the code.

@dmbates dmbates merged commit 3577753 into master Dec 11, 2017
@dmbates
Copy link
Contributor Author

dmbates commented Dec 11, 2017

Travis failure was transient.

@ararslan ararslan deleted the StatsModels branch December 11, 2017 21:15
@nalimilan
Copy link
Member

Do you want to tag a release?

@dmbates
Copy link
Contributor Author

dmbates commented Dec 12, 2017

I'll do so now.

@dmbates
Copy link
Contributor Author

dmbates commented Dec 12, 2017

I found that I needed to make one more change. I am waiting on Travis.

andreasnoack pushed a commit that referenced this pull request Nov 9, 2018
It also fixes errors due to #201
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants