StatsModels #201
Conversation
Cool, thanks!
src/GLM.jl
Outdated
using Base.LinAlg.LAPACK: potrf!, potrs!
using Base.LinAlg.BLAS: gemm!, gemv!
using Base.LinAlg: QRCompactWY, Cholesky, BlasReal
using StatsBase: StatsBase, CoefTable, StatisticalModel, RegressionModel
using StatsFuns: logit, logistic
using StatsModels: @formula, Formula, ModelFrame, ModelMatrix
I wonder whether we should reexport StatsModels for now, to avoid the need to call using StatsModels explicitly.
I really dislike reexporting entire packages. Often I will check whos(PkgName) to see what the functions and types in a package are, and I hate it when everything in, say, Distributions is in that list.
Yes, I'm not sure what the best approach is here. @ararslan @kleinschmidt Ideas?
I don't see the trouble with having using StatsModels inside the package; if GLM users want to mess about with StatsModels stuff, I think it's reasonable to have them explicitly ask for it, right?
Yeah, my concern is about people who just want to fit a GLM. It would be too bad to require them to run using StatsModels.
The situation may not be worse, but this PR changes the relationship between GLM and StatsModels. Before, if you just did using GLM you wouldn't get DataFrames or any of the StatsModels functionality; you'd need to also do using DataFrames. I'm fine with this PR pulling in StatsModels and DataFrames whenever I do using GLM now, because I'm never fitting a GLM to non-tabular data anyway. But there may be people out there who don't want the added dependency on DataFrames and could be surprised by it when they update. Just something to consider.
But there may be people out there who don't want the added dependency on DataFrames and could be surprised by it when they update. Just something to consider.
Yes, hopefully we'll get rid of the dependency on DataFrames at some point. Until then I'd say it's fine if those people need to install DataFrames.
I have kind of lost track of the discussion here. Some of the early comments were about @formula not being available without using StatsModels, but @formula is exported from this package, as shown below.
The question was, isn't it weird that you need to call using StatsModels to adjust contrasts?
In that case I think we should go with your original suggestion of re-exporting the symbols from StatsModels, as there aren't very many. I will amend the PR accordingly.
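The selective re-export pattern being proposed can be sketched with a self-contained toy example; `Inner` and `Wrapper` below are hypothetical stand-ins for StatsModels and GLM, not the actual PR code:

```julia
# Toy sketch of selective re-export: a wrapper module imports only the
# symbols it wants from a dependency and exports them again, so users
# of the wrapper don't need `using` on the dependency itself.
module Inner                # hypothetical stand-in for StatsModels
export wanted, unwanted
wanted() = :ok              # a symbol we want Wrapper's users to see
unwanted() = :noise         # a symbol we do NOT re-export
end

module Wrapper              # hypothetical stand-in for GLM
using ..Inner: wanted       # pull in just the chosen name...
export wanted               # ...and export it again
end

using .Wrapper
wanted()                    # available without `using Inner`
```

This keeps whos(GLM) free of the dependency's full namespace while still sparing users an explicit using StatsModels for the handful of re-exported names.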
test/runtests.jl
Outdated
@test isapprox(aic(lm1), -36.409684288095946)
@test isapprox(aicc(lm1), -24.409684288095946)
@test isapprox(bic(lm1), -37.03440588041178)
@test isapprox(StatsBase.aic(lm1), -36.409684288095946)
The StatsBase. prefix shouldn't be needed AFAIK?
Thanks for catching that. It's vestigial. Initially I wasn't including StatsBase and was using fully qualified names instead.
@@ -69,8 +70,8 @@ admit[:rank] = pool(admit[:rank])
 @test isapprox(aicc(gm2), 470.7312329339146)
 @test isapprox(bic(gm2), 494.4662797585473)
 @test isapprox(coef(gm2),
-              [-3.9899786606380734, 0.0022644256521549043, 0.8040374535155766,
-               -0.6754428594116577, -1.3402038117481079,-1.5514636444657492])
+              [-5.3301824723861895, 0.002264425652154903, 0.8040374535155792,
I've seen your comment about why coefficients change, but that's still not clear to me. The order of levels should be the same as with PDAs AFAIK. I guess that's rather a change in StatsModels? Anyway, calling levels! should be enough to get the same behavior as before.
The changes are with respect to the contrasts chosen for the levels of admit[:rank], which is read as an integer then converted to categorical. I think the pool conversion from a vector of integers to a PooledDataArray just uses the unique levels without sorting them, whereas the categorical conversion does sort. You would know better than I.
In any case, the contrasts chosen here are the sensible ones. I would prefer not to enforce bug-for-bug compatibility with the last version.
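The level-ordering difference under discussion can be shown directly. This is a minimal sketch assuming the CategoricalArrays package is installed; the integer codes are made up for illustration:

```julia
using CategoricalArrays

# `categorical` sorts the level pool, whereas the old DataArrays `pool`
# kept levels in order of first appearance -- hence a different reference
# level, and different contrasts, when the models are refitted.
rank = [3, 1, 2, 2, 4, 1]      # made-up integer codes, 3 appears first
cv = categorical(rank)
levels(cv)                      # sorted pool: [1, 2, 3, 4]

# `levels!` reorders the pool in place, e.g. to recover the old
# first-appearance ordering (and thus the old contrasts):
levels!(cv, [3, 1, 2, 4])
levels(cv)                      # now [3, 1, 2, 4]
```

So a test that must reproduce the pre-PR coefficients could call levels! after conversion instead of relying on the conversion's default ordering.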
@@ -129,8 +130,7 @@ end
 end

 ## Example with offsets from Venables & Ripley (2002, p.189)
-anorexia = readtable(joinpath(glm_datadir, "anorexia.csv.gz"))
-anorexia[:Treat] = pool(anorexia[:Treat])
+anorexia = CSV.read(joinpath(glm_datadir, "anorexia.csv"))
Maybe pass categorical=true; that way we're safe if the default changes in a future version (as has been discussed).
According to the documentation the default value of categorical is true. The reason for the explicit call here is that the levels are integers. I could use an explicit types argument in the call to CSV.read but I don't think that would gain much.
OK, I hadn't realized that categorical=true just means that the result will be categorical if the proportion of unique values is below a certain threshold. I guess that's OK. (I think we would have to pass CategoricalValue as the type to be certain we get a categorical vector.)
Looks good, at least reexporting StatsModels is simpler for users (and we already reexport all StatsBase modeling functions).
Maybe README.md needs updating?
I guess I don't understand how the

julia> whos(GLM)
@formula 0 bytes StatsModels.#@formula
AbstractContrasts 92 bytes DataType
Bernoulli 40 bytes UnionAll
Binomial 40 bytes UnionAll
CauchitLink 92 bytes DataType
CloglogLink 92 bytes DataType
ContrastsCoding 176 bytes DataType
DensePred 92 bytes DataType
DensePredChol 96 bytes UnionAll
DensePredQR 56 bytes UnionAll
DummyCoding 124 bytes DataType
EffectsCoding 124 bytes DataType
Formula 188 bytes DataType
GLM 128 KB Module
Gamma 40 bytes UnionAll
GeneralizedLinearModel 280 bytes UnionAll
GlmResp 200 bytes UnionAll
HelmertCoding 124 bytes DataType
IdentityLink 92 bytes DataType
InverseGaussian 40 bytes UnionAll
InverseLink 92 bytes DataType
InverseSquareLink 92 bytes DataType
LinPred 92 bytes DataType
LinPredModel 92 bytes DataType
LinearModel 160 bytes UnionAll
Link 92 bytes DataType
LmResp 80 bytes UnionAll
LogLink 92 bytes DataType
LogitLink 92 bytes DataType
ModelFrame 0 bytes GLM.#ModelFrame
ModelMatrix 0 bytes GLM.#ModelMatrix
Normal 40 bytes UnionAll
Poisson 40 bytes UnionAll
ProbitLink 92 bytes DataType
SqrtLink 92 bytes DataType
StatsModels 115 KB Module
adjr2 0 bytes StatsBase.#adjr2
adjr² 0 bytes StatsBase.#adjr2
canonicallink 0 bytes GLM.#canonicallink
coef 0 bytes StatsBase.#coef
coefnames 0 bytes StatsModels.#coefnames
coeftable 0 bytes StatsBase.#coeftable
confint 0 bytes StatsBase.#confint
delbeta! 0 bytes GLM.#delbeta!
deviance 0 bytes StatsBase.#deviance
devresid 0 bytes GLM.#devresid
dof 0 bytes StatsBase.#dof
dof_residual 0 bytes StatsBase.#dof_residual
dropterm 0 bytes StatsModels.#dropterm
fit 0 bytes StatsBase.#fit
fit! 0 bytes StatsBase.#fit!
formula 0 bytes GLM.#formula
ftest 0 bytes GLM.#ftest
glm 0 bytes GLM.#glm
glmvar 0 bytes GLM.#glmvar
inverselink 0 bytes GLM.#inverselink
linkfun 0 bytes GLM.#linkfun
linkinv 0 bytes GLM.#linkinv
linpred 0 bytes GLM.#linpred
linpred! 0 bytes GLM.#linpred!
lm 0 bytes GLM.#lm
logistic 0 bytes StatsFuns.#logistic
logit 0 bytes StatsFuns.#logit
loglikelihood 0 bytes StatsBase.#loglikelihood
model_response 0 bytes StatsBase.#model_response
mueta 0 bytes GLM.#mueta
mustart 0 bytes GLM.#mustart
nobs 0 bytes StatsBase.#nobs
nulldeviance 0 bytes StatsBase.#nulldeviance
nullloglikelihood 0 bytes StatsBase.#nullloglikelihood
predict 0 bytes StatsBase.#predict
r2 0 bytes StatsBase.#r2
residuals 0 bytes StatsBase.#residuals
r² 0 bytes StatsBase.#r2
setcontrasts! 0 bytes StatsModels.#setcontrasts!
stderr 0 bytes StatsBase.#stderr
updateμ! 0 bytes GLM.#updateμ!
vcov 0 bytes StatsBase.#vcov
wrkresp 0 bytes GLM.#wrkresp
Maybe
Thanks @nalimilan. I just discovered that problem myself. For the time being I will remove those constructors. I think it is another case of my trying to get too cute with the code.
Travis failure was transient.
Do you want to tag a release?
I'll do so now.
I found that I needed to make one more change. I am waiting on Travis.
Switch to StatsModels v0.1.0. Tests now use DataFrames v0.11.0, CSV, and CategoricalArrays. Because of the change from DataArrays.pool to CategoricalArrays.categorical, the order of the elements in the pool changed. This caused changes in the coefficients and standard errors in some models used for tests.