
null deviance and null loglikelihood for GLM #352

Closed
wants to merge 11 commits

Conversation

pdeffebach
Contributor

This is a PR to fix #256, noting that GLM lacks nulldeviance and nullloglikelihood functions. It's important to have these because likelihood-ratio tests and R^2 are computed from the null deviance.

I am not 100% sure this is correct. So far my results match R's glm with the gaussian() and binomial() families, but to be honest I am still taking econometrics courses, so my knowledge of this is slim.

I will add nullloglikelihood next, then tests for everything I can, then add r2 functions and anything else that might use them.
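For context, the null deviance is the deviance of the intercept-only model, which (when the model has an intercept and no offset) predicts the sample mean for every observation. A minimal Python sketch of the formulas involved (not GLM.jl code; the function names are made up for illustration):

```python
import math

def gaussian_deviance(y, mu):
    # Gaussian deviance: the residual sum of squares.
    return sum((yi - mi) ** 2 for yi, mi in zip(y, mu))

def binomial_deviance(y, mu):
    # Bernoulli deviance; terms with y in {0, 1} use the
    # convention 0 * log(0) = 0.
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi > 0:
            total += yi * math.log(yi / mi)
        if yi < 1:
            total += (1 - yi) * math.log((1 - yi) / (1 - mi))
    return 2 * total

def nulldeviance(y, deviance):
    # The null model predicts the sample mean everywhere
    # (valid only with an intercept and no offset).
    ybar = sum(y) / len(y)
    return deviance(y, [ybar] * len(y))
```

For y = [1.0, 2.0, 3.0] the Gaussian null deviance is just the total sum of squares, 2.0.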

@codecov-io

codecov-io commented Jan 6, 2020

Codecov Report

Merging #352 into master will decrease coverage by 2.37%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #352      +/-   ##
==========================================
- Coverage   79.56%   77.18%   -2.38%     
==========================================
  Files           7        7              
  Lines         685      653      -32     
==========================================
- Hits          545      504      -41     
- Misses        140      149       +9     
Impacted Files Coverage Δ
src/glmfit.jl 77.35% <0.00%> (-0.02%) ⬇️
src/lm.jl 78.88% <0.00%> (-15.37%) ⬇️
src/glmtools.jl 68.42% <0.00%> (-7.21%) ⬇️
src/linpred.jl 62.10% <0.00%> (-0.53%) ⬇️
src/ftest.jl 98.43% <0.00%> (-0.03%) ⬇️
src/negbinfit.jl 92.72% <0.00%> (+11.03%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7e2b1b9...f385a65. Read the comment docs.

@nalimilan (Member) left a comment

Thanks! Should be OK if it matches R, but can you check the other families too? You can reproduce examples from the tests.

src/glmfit.jl (2 resolved review threads, outdated)
@pdeffebach
Contributor Author

I realized I forgot about this; I apologize. I will pick this up slowly, starting with a rebase and then adding more checks against R as I continue writing tests.

@codecov-commenter

codecov-commenter commented Jun 8, 2020

Codecov Report

Merging #352 into master will decrease coverage by 2.37%. The diff coverage is 0.00%.
Powered by Codecov. Last update 7e2b1b9...77c5b3b. Read the comment docs.

@pdeffebach pdeffebach changed the base branch from master to docsupdate June 8, 2020 15:45
@pdeffebach pdeffebach changed the base branch from docsupdate to master June 8, 2020 15:46
@pdeffebach
Contributor Author

Bumping this for @andreasnoack and @nalimilan. I forgot about this PR for a long time but am determined to finish it.

It just adds a nulldeviance method; previously this was defined for lm but not for glm.

I think I need some help getting this over the finish line. As you can see in the test file, I don't know how to add tests for models with offset and for negative binomial models.

The original intention of this PR was to add nullloglikelihood as well. I don't quite know the way forward on that, but hopefully it is easy. My RCall testing setup has been very useful, but it was also a lot of work to set up. I would rather not merge this PR, delete all the RCall code, and then have to rewrite it for the nullloglikelihood tests.

@nalimilan
Member

Thanks. Looks good. For offsets, I think you can just add + offset(x) to the formula in R (there's an example in ?glm) to check whether results are consistent.

For the negative binomial, I think there are two different cases:

  • If theta is fixed (as in two testsets), you can just fit the null model in R with the same value using glm(formula, family = MASS::negative.binomial(2), ...). That should give the same results as in Julia.
  • If theta is not fixed (as in the two testsets calling negbin), the null deviance will correspond to the null model but with the same theta value as the considered model, which is not equivalent to fitting the null model separately. You can check that the null log-likelihood is correct by fitting the model with theta fixed to the value that negbin estimated.
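For the fixed-theta case, the negative-binomial deviance has a closed form in the MASS parameterization, so the null deviance can be sketched directly. A hedged Python illustration (not GLM.jl code; the names are invented, and 0 · log(0) is taken as 0):

```python
import math

def negbin_deviance(y, mu, theta):
    # Negative-binomial deviance with fixed dispersion theta
    # (same parameterization as MASS::negative.binomial):
    # d_i = 2 * [y log(y/mu) - (y + theta) log((y + theta)/(mu + theta))]
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi > 0:
            total += yi * math.log(yi / mi)
        total -= (yi + theta) * math.log((yi + theta) / (mi + theta))
    return 2 * total

def negbin_nulldeviance(y, theta):
    # Null model: every observation predicted by the sample mean,
    # with theta held at the full model's value.
    ybar = sum(y) / len(y)
    return negbin_deviance(y, [ybar] * len(y), theta)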

Regarding nullloglikelihood, I think you can just take the definition of loglikelihood and adjust it to use the mean as the prediction, as you did for nulldeviance.
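Concretely, that suggestion amounts to evaluating the log-likelihood at mu_i = ȳ. For the Gaussian family with the variance at its MLE (which is what R's logLik reports for gaussian fits), a Python sketch (illustrative only, not GLM.jl code):

```python
import math

def gaussian_nullloglik(y):
    # Log-likelihood of the intercept-only Gaussian model, with the
    # variance at its MLE (RSS/n):
    # ll = -(n/2) * (log(2*pi*sigma^2) + 1)
    n = len(y)
    ybar = sum(y) / n
    rss = sum((yi - ybar) ** 2 for yi in y)
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
```

For y = [1.0, 2.0, 3.0] this gives about -3.6486, which should agree with the intercept-only fit in R.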

@pdeffebach
Contributor Author

I just took a look at this again and spent a few hours trying to get offset to work. No luck.

I tried replicating what R does here but couldn't get numbers to match. In particular, I played around with gm6 in the runtests.jl file.

Additionally, I noticed that R computes a different null deviance depending on whether or not there is an intercept. However, GLM stores whether a model has an intercept only in the top-level object gm6, not in the AbstractGLM model or the GlmResp object. It's not clear where user-land starts and which functions are internal.

I will take another look at this soon and see if things start making sense.
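One likely source of the mismatch: with an offset, the null model's fitted mean is no longer simply ȳ, because the intercept must be re-estimated with the offset kept in the linear predictor. For Poisson with a log link there happens to be a closed form; a Python sketch (illustrative, not GLM.jl code, with a made-up name):

```python
import math

def poisson_null_mu_with_offset(y, offset):
    # Intercept-only Poisson model with log link and offset:
    # the score equation sum(y_i) = sum(exp(b0 + off_i)) has the
    # closed-form solution b0 = log(sum(y) / sum(exp(off))).
    b0 = math.log(sum(y) / sum(math.exp(o) for o in offset))
    return [math.exp(b0 + o) for o in offset]
```

With zero offsets this reduces to mu_i = ȳ, but for other families and links there is generally no closed form, so the null model must be refit numerically.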

@pdeffebach
Contributor Author

pdeffebach commented Aug 21, 2020

I guess a concrete question: we have nested structs starting with

m = glm(...)
  1. TableRegressionModel (m)
  2. AbstractGLM (m.model)
  3. GlmResp (m.model.rr)

Currently, nulldeviance takes in only (3). But to depend on the intercept term we really need m to go into the function, and then pass down into the inner functions whether or not the model has an intercept. Is that acceptable?

@nalimilan
Member

It's probably OK to have the functions take an AbstractGLM object. TableRegressionModel is supposed to go away at some point, in which case the relevant information would be available directly from AbstractGLM. But we allow fitting models from raw matrices, without going through a formula/TableRegressionModel, so it would be better to detect whether the model matrix has a constant column. That should be quite trivial, since you just have to check whether the first column is all 1s.

If offset doesn't work, can you just throw an error when an offset has been specified?

@kleinschmidt Opinions?

@kleinschmidt
Member

One possibility is extending the use of hasintercept which right now just applies to terms

https://github.com/JuliaStats/StatsModels.jl/blob/04a0ccfb80f6ef7fe4a618556e9d3d21d1c445d1/src/terms.jl#L562-L565

The default would check the formula stored with a model and packages can implement their own specializations as needed...

@nalimilan
Member

We're in GLM here so we don't have to follow any public interface. :-)

@kleinschmidt
Member

Well, at the moment, in order to support GLM models as they are usually fit by folks, you'd have to define a nulldeviance(::TableRegressionModel) method to unwrap the model, which is pretty extreme type piracy... although now that I think about it, I don't see how my solution solves that problem either... and this PR isn't changing anything there.

The advantage of using hasintercept is that it's potentially more efficient (once we require that models store the formula) than checking all the values in the model matrix. But that's probably a pretty minor gain.

@pdeffebach
Contributor Author

But we allow fitting models from raw matrices, without going through a formula/TableRegressionModel, so it would be better to detect whether the model matrix has a constant.

Following discussion on Slack, I will write a little function that checks whether some column of the matrix is constant. This should be fast, since if the matrix was constructed via a formula the first column will be all ones.
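The check described above could look like this. A Python sketch of the idea (GLM.jl would do this on the model matrix in Julia; the function name here is made up):

```python
def has_constant_column(X):
    # X is a list of rows. A model "has an intercept" for nulldeviance
    # purposes if some column is constant and nonzero (all ones when
    # the matrix was built from a formula with an intercept).
    ncols = len(X[0])
    for j in range(ncols):
        col = [row[j] for row in X]
        if all(v == col[0] for v in col) and col[0] != 0:
            return True
    return False
```

A column of zeros is excluded, since it cannot represent an intercept; the common formula-built case (first column all ones) is caught on the first iteration.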

After this I will add a few tests of models without intercepts.

I am going to abandon my effort to get offsets to work. If someone using offsets sees the error "nulldeviance with offsets not yet implemented", hopefully they know more than I do and can implement it themselves.

I also still have to work on negative binomial models.

@nalimilan
Member

I ended up reimplementing this at #479. The handling of offsets is indeed tricky. Closing in favor of the other PR.

@nalimilan nalimilan closed this Apr 27, 2022
Successfully merging this pull request may close these issues.

nulldeviance function
5 participants