StatsModels #201
Conversation
Cool, thanks!
src/GLM.jl
Outdated
using Base.LinAlg.LAPACK: potrf!, potrs!
using Base.LinAlg.BLAS: gemm!, gemv!
using Base.LinAlg: QRCompactWY, Cholesky, BlasReal
using StatsBase: StatsBase, CoefTable, StatisticalModel, RegressionModel
using StatsFuns: logit, logistic
using StatsModels: @formula, Formula, ModelFrame, ModelMatrix
I wonder whether we should reexport StatsModels for now, to avoid the need to call using StatsModels explicitly.
I really dislike reexporting entire packages. Often I will check whos(PkgName) to see what the functions and types in a package are, and I hate it when everything in, say, Distributions is in that list.
Yes, I'm not sure what the best approach is here. @ararslan @kleinschmidt Ideas?
I don't see the trouble with having using StatsModels inside the package; if GLM users want to mess about with StatsModels stuff, I think it's reasonable to have them explicitly ask for it, right?
Yeah, my concern is about people who just want to fit a GLM. It would be too bad to require them to run using StatsModels.
The situation may not be worse, but this PR changes the relationship between GLM and StatsModels. Before, if you just did using GLM you wouldn't get DataFrames or any of the StatsModels functionality; you'd need to also do using DataFrames. I'm fine with this PR pulling in StatsModels and DataFrames whenever I do using GLM now, because I'm never fitting a GLM to non-tabular data anyway. But there may be people out there who don't want the added dependency on DataFrames and could be surprised by it when they update. Just something to consider.
But there may be people out there who don't want the added dependency on DataFrames and could be surprised by it when they update. Just something to consider.
Yes, hopefully we'll get rid of the dependency on DataFrames at some point. Until then I'd say it's fine if those people need to install DataFrames.
I have kind of lost track of the discussion here. Some of the early comments were about @formula not being available without using StatsModels, but @formula is exported from this package, as shown below.
The question was, isn't it weird that you need to call using StatsModels to adjust contrasts?
In that case I think we should go with your original suggestion of re-exporting the symbols from StatsModels, as there aren't very many. I will amend the PR accordingly.
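The selective re-export pattern being proposed can be sketched with a self-contained toy example; `Inner` and `Wrapper` below are hypothetical stand-ins for StatsModels and GLM, not the actual PR code:

```julia
# Toy sketch of selective re-export: a wrapper module imports only the
# symbols it wants from a dependency and exports them again, so users
# of the wrapper don't need `using` on the dependency itself.
module Inner                # hypothetical stand-in for StatsModels
export wanted, unwanted
wanted() = :ok              # a symbol we want Wrapper's users to see
unwanted() = :noise         # a symbol we do NOT re-export
end

module Wrapper              # hypothetical stand-in for GLM
using ..Inner: wanted       # pull in just the chosen name...
export wanted               # ...and export it again
end

using .Wrapper
wanted()                    # available without `using Inner`
```

This keeps whos(GLM) free of the dependency's full namespace while still sparing users an explicit using StatsModels for the handful of re-exported names.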
test/runtests.jl
Outdated
@test isapprox(aic(lm1), -36.409684288095946)
@test isapprox(aicc(lm1), -24.409684288095946)
@test isapprox(bic(lm1), -37.03440588041178)
@test isapprox(StatsBase.aic(lm1), -36.409684288095946)
The StatsBase. prefix shouldn't be needed AFAIK?
Thanks for catching that. It's vestigial. Initially I wasn't including StatsBase and was using fully qualified names instead.
@@ -69,8 +70,8 @@ admit[:rank] = pool(admit[:rank])
 @test isapprox(aicc(gm2), 470.7312329339146)
 @test isapprox(bic(gm2), 494.4662797585473)
 @test isapprox(coef(gm2),
-              [-3.9899786606380734, 0.0022644256521549043, 0.8040374535155766,
-               -0.6754428594116577, -1.3402038117481079,-1.5514636444657492])
+              [-5.3301824723861895, 0.002264425652154903, 0.8040374535155792,
I've seen your comment about why coefficients change, but that's still not clear to me. The order of levels should be the same as with PDAs AFAIK. I guess that's rather a change in StatsModels? Anyway, calling levels! should be enough to get the same behavior as before.
The changes are with respect to the contrasts chosen for the levels of admit[:rank], which is read as an integer then converted to categorical. I think the pool conversion from a vector of integers to a PooledDataArray just uses the unique levels without sorting them, whereas the categorical conversion does sort. You would know better than I.
In any case, the contrasts chosen here are the sensible ones. I would prefer not to enforce bug-for-bug compatibility with the last version.
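The level-ordering difference under discussion can be shown directly. This is a minimal sketch assuming the CategoricalArrays package is installed; the integer codes are made up for illustration:

```julia
using CategoricalArrays

# `categorical` sorts the level pool, whereas the old DataArrays `pool`
# kept levels in order of first appearance -- hence a different reference
# level, and different contrasts, when the models are refitted.
rank = [3, 1, 2, 2, 4, 1]      # made-up integer codes, 3 appears first
cv = categorical(rank)
levels(cv)                      # sorted pool: [1, 2, 3, 4]

# `levels!` reorders the pool in place, e.g. to recover the old
# first-appearance ordering (and thus the old contrasts):
levels!(cv, [3, 1, 2, 4])
levels(cv)                      # now [3, 1, 2, 4]
```

So a test that must reproduce the pre-PR coefficients could call levels! after conversion instead of relying on the conversion's default ordering.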
@@ -129,8 +130,7 @@ end
 end

 ## Example with offsets from Venables & Ripley (2002, p.189)
-anorexia = readtable(joinpath(glm_datadir, "anorexia.csv.gz"))
-anorexia[:Treat] = pool(anorexia[:Treat])
+anorexia = CSV.read(joinpath(glm_datadir, "anorexia.csv"))
Maybe pass categorical=true; that way we're safe if the default changes in a future version (as has been discussed).
According to the documentation the default value of categorical is true. The reason for the explicit call here is that the levels are integers. I could use an explicit types argument in the call to CSV.read but I don't think that would gain much.
OK, I hadn't realized that categorical=true just means that the result will be categorical if the proportion of unique values is below a certain threshold. I guess that's OK. (I think we would have to pass CategoricalValue as the type to be certain we get a categorical vector.)
Looks good, at least reexporting StatsModels is simpler for users (and we already reexport all StatsBase modeling functions).
Maybe README.md needs updating?
I guess I don't understand how the

julia> whos(GLM)
@formula 0 bytes StatsModels.#@formula
AbstractContrasts 92 bytes DataType
Bernoulli 40 bytes UnionAll
Binomial 40 bytes UnionAll
CauchitLink 92 bytes DataType
CloglogLink 92 bytes DataType
ContrastsCoding 176 bytes DataType
DensePred 92 bytes DataType
DensePredChol 96 bytes UnionAll
DensePredQR 56 bytes UnionAll
DummyCoding 124 bytes DataType
EffectsCoding 124 bytes DataType
Formula 188 bytes DataType
GLM 128 KB Module
Gamma 40 bytes UnionAll
GeneralizedLinearModel 280 bytes UnionAll
GlmResp 200 bytes UnionAll
HelmertCoding 124 bytes DataType
IdentityLink 92 bytes DataType
InverseGaussian 40 bytes UnionAll
InverseLink 92 bytes DataType
InverseSquareLink 92 bytes DataType
LinPred 92 bytes DataType
LinPredModel 92 bytes DataType
LinearModel 160 bytes UnionAll
Link 92 bytes DataType
LmResp 80 bytes UnionAll
LogLink 92 bytes DataType
LogitLink 92 bytes DataType
ModelFrame 0 bytes GLM.#ModelFrame
ModelMatrix 0 bytes GLM.#ModelMatrix
Normal 40 bytes UnionAll
Poisson 40 bytes UnionAll
ProbitLink 92 bytes DataType
SqrtLink 92 bytes DataType
StatsModels 115 KB Module
adjr2 0 bytes StatsBase.#adjr2
adjr² 0 bytes StatsBase.#adjr2
canonicallink 0 bytes GLM.#canonicallink
coef 0 bytes StatsBase.#coef
coefnames 0 bytes StatsModels.#coefnames
coeftable 0 bytes StatsBase.#coeftable
confint 0 bytes StatsBase.#confint
delbeta! 0 bytes GLM.#delbeta!
deviance 0 bytes StatsBase.#deviance
devresid 0 bytes GLM.#devresid
dof 0 bytes StatsBase.#dof
dof_residual 0 bytes StatsBase.#dof_residual
dropterm 0 bytes StatsModels.#dropterm
fit 0 bytes StatsBase.#fit
fit! 0 bytes StatsBase.#fit!
formula 0 bytes GLM.#formula
ftest 0 bytes GLM.#ftest
glm 0 bytes GLM.#glm
glmvar 0 bytes GLM.#glmvar
inverselink 0 bytes GLM.#inverselink
linkfun 0 bytes GLM.#linkfun
linkinv 0 bytes GLM.#linkinv
linpred 0 bytes GLM.#linpred
linpred! 0 bytes GLM.#linpred!
lm 0 bytes GLM.#lm
logistic 0 bytes StatsFuns.#logistic
logit 0 bytes StatsFuns.#logit
loglikelihood 0 bytes StatsBase.#loglikelihood
model_response 0 bytes StatsBase.#model_response
mueta 0 bytes GLM.#mueta
mustart 0 bytes GLM.#mustart
nobs 0 bytes StatsBase.#nobs
nulldeviance 0 bytes StatsBase.#nulldeviance
nullloglikelihood 0 bytes StatsBase.#nullloglikelihood
predict 0 bytes StatsBase.#predict
r2 0 bytes StatsBase.#r2
residuals 0 bytes StatsBase.#residuals
r² 0 bytes StatsBase.#r2
setcontrasts! 0 bytes StatsModels.#setcontrasts!
stderr 0 bytes StatsBase.#stderr
updateμ! 0 bytes GLM.#updateμ!
vcov 0 bytes StatsBase.#vcov
wrkresp 0 bytes GLM.#wrkresp
Maybe
Thanks @nalimilan. I just discovered that problem myself. For the time being I will remove those constructors. I think it is another case of my trying to get too cute with the code.
Travis failure was transient.
Do you want to tag a release?
I'll do so now.
I found that I needed to make one more change. I am waiting on Travis.
Switch to StatsModels v0.1.0. Tests now use DataFrames v0.11.0, CSV, and CategoricalArrays. Because of the change from DataArrays.pool to CategoricalArrays.categorical, the order of the elements in the pool changed. This caused changes in the coefficients and standard errors in some models used for tests.