Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions src/StatsBase.jl
Original file line number Diff line number Diff line change
Expand Up @@ -205,6 +205,8 @@ export
residuals,
r2,
r²,
vif,
viftable,

ConvergenceException,

Expand Down
23 changes: 23 additions & 0 deletions src/statmodels.jl
Original file line number Diff line number Diff line change
Expand Up @@ -423,6 +423,29 @@ Return all parameters of a model.
params(model) = error("params is not defined for $(typeof(model))")
function params! end

"""
vif(model, param)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
vif(model, param)
vif(model::StatisticalModel, param)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to restrict this to ::RegressionModel? AFAIK the VIF is only used for linear regression, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's used for any model falling within the generalized linear model and there might even be meaningful definitions for a broader class of models. But it's much easier to start narrow and broaden, so I'm fine restricting this to RegressionModel for now.


Return the variance inflation factor for a parameter in a model.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about saying "coefficient" rather than "parameter"? That's how we call these in general (function coef and type CoefTable, while params returns distributions parameters IIRC).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Yeah, coef is also a better match if we restrict to RegressionModel.


The [variance inflation factor (VIF)](https://en.wikipedia.org/wiki/Variance_inflation_factor) measures
the increase in the variance of a parameter's estimate in a model with multiple parameters relative to
the variance in of a paremeter's estimate in a model containing only that parameter.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
the variance in of a paremeter's estimate in a model containing only that parameter.
the variance of a parameter's estimate in a model containing only that parameter.

"""
vif(model::StatisticalModel, param) =
error("vif(model::StatisticalModel, param) is not defined for $(typeof(model)).")


"""
viftable(model)

Return the variance inflation factor for all parameters in a model as a `CoefTable`.

See also [`vif`](@ref)
"""
viftable(model::StatisticalModel) =
Comment on lines +442 to +446
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm hesitant about adding this given that we have the coeftable function with the same name as the CoefTable type. If we add a function for the VIF, we should add pvaluetable, tvaluetable, confinttable and so on. How about adding a keyword argument vif=false to coeftable instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. Tough one. I see your point, but I'm not sure I like having the kwarg in coeftable because it makes it sound like it's returning the usual CoefTable with an extra column for for VIF.

My motivation for having a table method is:

  • A lot of the VIF computation is common to all model params, so having all VIFs in one call might save some overhead
  • Specifying parameters can be awkward (especially for interaction terms), so this is a decent way to dodge that for people who want VIF for overall diagnostics
  • Often people want VIF for all parameters anyway, so this makes that convenient.

Maybe having a second vif method without the param argument would be better? Still have type stability then and it matches the behavior in other common languages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very tricky. Using a single-argument vif method is appealing, but that would be inconsistent with coef, which returns a vector of coefficients. I think this is sub-optimal and we should return a named vector like in R. That sounds more appropriate than a CoefTable when there's a single value, and that can be changed without breaking since the result is still an AbstractArray.

Unfortunately, so far we haven't standardized on a package for arrays with names. Maybe StatsBase could remain agnostic regarding the exact kind of vector which is returned, but packages could choose to return e.g. a NamedArray, AxisArray, KeyedArray or similar type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, we could do single-argument vif as AbstractArray (then implementations are free to choose a named array of some type if they so desire), but then also have a separate function for extracting names. I'm tempted to have vifnames(model::RegressionModel) = coefnames(model), thus giving a reasonable default that type owners are free to override. This would be a reasonable default because the VIFs are computed for each coef. I would omit two-argument vif for now -- it might potentially be useful in the future (computing the VIFs isn't quite "free" in the same way that extracting the coefficients for a fitted model is), having vif behavior match coef behavior seems reasonable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would vifnames return something different from coefnames? Couldn't a single function be enough?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that you generally don't compute the VIF for the intercept term, which means that

vifnames(model) = filter(!=("(Intercept)"), coefnames(model))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Sorry for the delay.) That's a good point, but I'm not sure that justifies adding a method to the API if that's just for convenience. Maybe we can just expect users to skip the intercept manually? Anyway, as I noted above, I hope that we will be able to return a vector with names from vif and coef instead of adding functions to get names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, then I think we can just document that the vif filters out the (Intercept) term. Now I know enough to mock up a real VIF implementation for GLM or MixedModels and then we can see how much we like the API when applied downstream. 😄

error("viftable(model::StatisticalModel) is not defined for $(typeof(model)).")

## coefficient tables with specialized show method

mutable struct CoefTable
Expand Down
3 changes: 3 additions & 0 deletions test/statmodels.jl
Original file line number Diff line number Diff line change
Expand Up @@ -131,3 +131,6 @@ end

err = @test_throws ArgumentError ConvergenceException(10,.1,.2)
@test err.value.msg == "Change must be greater than tol."

@test_throws MethodError vif()
@test_throws MethodError viftable()