
Get rid of TableRegressionModel by storing formula directly in model #339

Merged (18 commits, Nov 20, 2022)

Conversation

@nalimilan (Member) commented Oct 18, 2019

This is an exploration of what is needed to get rid of TableRegressionModel in GLM, to help design the correct API in StatsModels (JuliaStats/StatsModels.jl#32). Overall, the required changes are remarkably limited.

The approach taken here is to have packages define fit(::Type{M}, f::FormulaTerm, data, ...) for their model type M, and call modelmatrix(f::FormulaTerm, data, contrasts) to get the model matrix from that. Model objects then have to store the formula, which can be retrieved using the formula function, added to the StatisticalModel API. fit can still allow passing matrices directly without any formula, in which case the stored formula is nothing.
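A minimal sketch of this pattern (MyModel and its fields are illustrative, not part of the PR; the existing schema/apply_schema/modelcols pipeline stands in for the proposed modelmatrix(f, data, contrasts)):

```julia
using StatsModels, StatsBase

# Hypothetical model type; real packages would store more fitting state:
struct MyModel <: StatsBase.RegressionModel
    formula::Union{FormulaTerm,Nothing}  # `nothing` when fit from raw matrices
    beta::Vector{Float64}
end

# Packages define `fit` on a FormulaTerm and build the model matrix themselves:
function StatsBase.fit(::Type{MyModel}, f::FormulaTerm, data;
                       contrasts=Dict{Symbol,Any}())
    fs = apply_schema(f, schema(f, data, contrasts), MyModel)
    y, X = modelcols(fs, data)
    return MyModel(fs, X \ y)   # least-squares fit as a stand-in
end

# The stored formula is exposed through the `formula` generic:
StatsModels.formula(m::MyModel) = m.formula
```

With the formula stored in the model itself, no wrapper type is needed to support formula-aware fallbacks.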

Thanks to the formula generic function, fallback definitions can be provided in StatsModels for many functions, like coefnames or predict. The latter is the trickiest case, since missing values have to be handled: the StatsModels fallback method allocates vectors, passes them to predict! (which has to be defined by the package), and returns them.
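A toy sketch of that allocate-then-delegate mechanism (ToyModel, toy_predict and toy_predict! are made-up names, not the actual API): the generic method allocates the result vector and hands it to a package-defined in-place method.

```julia
struct ToyModel
    slope::Float64
end

# "Package-defined" in-place method: fills only rows whose predictor is present.
function toy_predict!(res::AbstractVector, m::ToyModel, x::AbstractVector)
    for i in eachindex(x)
        x[i] === missing || (res[i] = m.slope * x[i])
    end
    return res
end

# "StatsModels-level" fallback: allocate with missings, delegate, return.
function toy_predict(m::ToyModel, x::AbstractVector)
    res = Vector{Union{Float64,Missing}}(missing, length(x))
    return toy_predict!(res, m, x)
end
```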

For simplicity, the PR currently includes a few functions which should be defined in StatsModels once we have agreed on the API. modelmatrix already exists there and just needs to handle missing values.


For reference, here are notes @kleinschmidt wrote after we discussed this issue:

  1. Remove the TableStatisticalModel wrapper
  2. Repurpose ModelFrame as a data transformer that keeps track of, e.g., which rows in the model matrix have missing values (in order to map the missings-free array rows onto the underlying table rows).
  3. Including the formula in the model frame allows removing rows introduced by terms like lead and lag, as long as the filtering of missing values happens after modelcols is called.
  4. Models are responsible for holding onto the model frame. When the input data is not in the form of a table when the model is constructed, a fallback IdentityTransformer will be a placeholder. (Or maybe Nothing).
  5. Part of the api will be an accessor function transformer(::MyModel) which will return the transformer (ModelFrame, IdentityTransformer, etc.). This then allows StatsModels.jl to define generic methods like coeftable and coefnames that are sensitive to the presence of a model frame and the formula it contains.

And an updated summary:

  1. Maybe we want to think of terms as "lazy views" into the underlying data. With this philosophy, we could have an additional apply_data step, which annotates each term with a pointer to the underlying data; the terms are then coalesced into array-likes with modelcols or Tables.matrix or something similar.
  2. Then we can think of missing-dropping as another lazy view that only exposes rows with no missings. We can apply this to the original table OR to the "table" that's generated on the output end. This would be implemented as a wrapper around the table.
  3. Treating the missing-dropping as a view has the advantage of allowing us to naturally write predicted values into the corresponding rows of the original table (or at least in a way that aligns with those rows).
  4. This also has the advantage of making the formula (and even individual terms) completely self-contained data transformers, which eliminates the need for a ModelFrame struct.
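Points 2–3 above can be sketched in plain Julia (no packages; the doubling is a stand-in for an actual model): represent the missing-dropping "view" as a row mask, fit on the missings-free array, then write predictions back so they align with the original table rows.

```julia
x = Union{Float64,Missing}[1.0, missing, 3.0, 4.0, missing]
keep = map(!ismissing, x)                 # the mask behind the lazy view
xc = Float64.(x[keep])                    # missings-free array fed to the model
pred = 2 .* xc                            # stand-in for model predictions
out = Vector{Union{Float64,Missing}}(missing, length(x))
out[keep] = pred                          # lands in the corresponding rows
```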

@kleinschmidt (Member) left a comment

I think this is looking pretty good actually. I'm pleased to see how minimal the required changes are!

The idea is that we're replacing ModelFrame and ModelMatrix with modelmatrix, right? And requiring that models store the returned formula? That seems like a good "separation of concerns": you're splitting out the convenience of the one-step schema-apply-modelcols stuff that happens in the ModelFrame/ModelMatrix constructors from the structural assumptions of those as structs. That feels like a good division to make, to me: it doesn't lock anyone into using modelmatrix (e.g., if a package like MixedModels needs more control over what happens in that pipeline) but still provides a convenient fallback.

I think the one thing I'm still not sure about is how you'd handle the "missing mask" that ModelFrame currently stores. Committing to storing the formula means that we'd have to find a way to incorporate that kind of information into the formula itself; I'm in favor of that approach in principle but I haven't really worked out the consequences of that choice.

@nalimilan (Member, Author)

Yes, actually it doesn't appear that ModelFrame and ModelMatrix are needed now. So indeed packages are free to avoid modelmatrix and call individual steps instead. Though it would be good to make modelmatrix as general as possible so that most packages can use it: that makes it easier to change the pipeline if needed without breaking too many packages.

Regarding the missing mask, that indeed raises an issue with the current state of the PR: it doesn't store the mask (nor the data). That means functions like fitted (not sure if there are others) return the fitted values without the missing values from the original data.

There's also a related issue with the predict/predict! system I described above: in that approach, predict includes missing values, but predict! skips them. That sounds inconsistent/arbitrary. We should probably find another mechanism, like predict_nonmissing! or something using dispatch. The same approach could be used to define a fallback for fitted if we stored the missing mask in the formula.

But is the formula the best place to store the missing mask? That sounds like a slight abuse, since that's not information that can be known statically.

@kleinschmidt (Member)

My philosophy is that terms should be self-contained table-to-array transformers, including FormulaTerm. Whether or not things like the missing mask should be part of that transformation I'm not so clear on. But the missing mask probably at least should be aware of the formula so that things like lead/lag can be supported. Then again, you could also leave it up to the modeling package what to do with missing values in the arrays and make it part of the API that modelcols can return arrays with missing values. But that seems a little undesirable to me.

@nalimilan (Member, Author)

Right, but does that say anything about where the missingness mask should live? Terms that can generate missing values (like lead and lag) just need to mark corresponding observations as missing in the mask, and either no matrix row would be generated for them (if we can sync with other terms), or they would be dropped after the fact. In both cases, that operation can happen without storing the mask in the formula, right?

For example, modelmatrix could return the missingness mask. But that wouldn't be convenient for model implementations since they would have to pass it to all functions, and defining formula wouldn't be enough to get the nice fallback definitions to work. So I see two solutions:

  • store the mask in the formula
  • store it in a structure which behaves like a formula but also stores the missingness mask

I agree modeling packages shouldn't have to deal with missing values by default, especially since it can get tricky with lead and lag and because we could implement optimizations in common cases.
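One possible shape for the second option above (MaskedFormula and its accessors are hypothetical names, not part of the PR): a thin wrapper that carries the formula together with the missingness mask, so a single stored object recovers both.

```julia
struct MaskedFormula{F}
    formula::F              # the formula the model was fit with
    nonmissing::BitVector   # true for rows kept in the model matrix
end

formula_part(mf::MaskedFormula) = mf.formula    # hypothetical accessors
mask_part(mf::MaskedFormula)    = mf.nonmissing
```

Generic fallbacks like fitted or predict could then consult mask_part to re-insert missings, without the formula type itself having to know about the data.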

@kleinschmidt (Member)

And as far as making modelmatrix as general as possible, I think returning a tuple of arrays is the right way to go. That's all MixedModels needs I think. I'm not sure about the econometrics models though (@Nosferican or @matthieugomez could weigh in...)

@Nosferican (Contributor) commented Oct 19, 2019

In Econometrics.jl and similar packages there are multiple intermediate models (e.g., instrumental variable models with two stages, or transformations such as feature absorption). A formula also has multiple parts, since there are multiple model components (response, exogenous features, endogenous features, instruments, high-dimensional features to absorb, etc.). At least in my package, I don't rely on ModelFrame or ModelMatrix and just have separate calls to modelcols for each parsed/split formula. The models store the formulas as part of the model.

I would love to see tools in the panel/temporal context to make model building less involved in each package. It is on my to-do list to work out a nice predict(model, newdata), which I believe should be doable, but I haven't implemented it just yet... Some term names change a bit, especially for multi-response models like mlogit, and I have to keep track of dropped terms (i.e., linearly dependent ones). As for modelmatrix(model::RegressionModel), I think it should return a Matrix{<:Real}.

It might be good to have in modelcols or related functions:

  • eterms: I believe these existed some time ago to keep track of the features involved (important for computing marginal effects, e.g. with logistic regression).
  • mask: nice to have, especially after lead/lag/diff and similar operators.

Both should be fine to obtain, I think: the mask by checking each model term's rows for any missings. For eterms, I think 🤔 Formula had it, but it may have been only ModelFrame... that one would be good to keep.
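The per-part pattern described above can be sketched with the existing StatsModels API (the column names and the split into exogenous/instrument parts are illustrative): each model component gets its own formula and its own modelcols call, with no ModelFrame or ModelMatrix involved.

```julia
using StatsModels

data = (y = [1.0, 2.0, 3.0], x = [1.0, 0.0, 1.0], z = [0.5, 1.5, 2.5])

# One formula per model component:
parts = (exogenous = @formula(y ~ x), instruments = @formula(y ~ 0 + z))

# Build one matrix per part; `last` keeps the right-hand-side columns:
mats = map(parts) do f
    fs = apply_schema(f, schema(f, data))
    last(modelcols(fs, data))
end
```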

@codecov-commenter commented Jun 19, 2022

Codecov Report

Base: 87.39% // Head: 88.75% // Increases project coverage by +1.35% 🎉

Coverage data is based on head (5c6bf20) compared to base (9dc4d6b).
Patch coverage: 97.52% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #339      +/-   ##
==========================================
+ Coverage   87.39%   88.75%   +1.35%     
==========================================
  Files           7        8       +1     
  Lines         952     1040      +88     
==========================================
+ Hits          832      923      +91     
+ Misses        120      117       -3     
Impacted Files Coverage Δ
src/GLM.jl 50.00% <ø> (ø)
src/ftest.jl 100.00% <ø> (ø)
src/glmfit.jl 81.79% <94.28%> (+0.77%) ⬆️
src/lm.jl 94.26% <97.43%> (+0.93%) ⬆️
src/deprecated.jl 100.00% <100.00%> (ø)
src/linpred.jl 91.17% <100.00%> (+5.56%) ⬆️
src/negbinfit.jl 82.66% <100.00%> (ø)

☔ View full report at Codecov.

@nalimilan nalimilan marked this pull request as ready for review June 19, 2022 20:43
@nalimilan (Member, Author)

I've rebased and cleaned the PR. For now I propose we just merge this to get rid of TableRegressionModel while keeping feature parity with the current release. AFAICT, we don't have to find a solution to the issue of how to store the missingness mask for that. Later we can try to improve GLM (e.g. by allowing predict to insert missing values where appropriate) and StatsModels (by moving some convenience methods like predict and a simpler modelmatrix from this PR there).

@kleinschmidt (Member) left a comment

This is looking great, thanks for resurrecting it. I do want to see the coverage to make sure that all the new fiddly bits are actually covered by the tests. I don't see it in the checks, but from manually looking up the PR in codecov (https://app.codecov.io/gh/JuliaStats/GLM.jl/compare/339/diff) it looks like one place where I think I found a bug isn't covered (predictions with non-nothing intervals), so that at least needs some tests.

src/glmfit.jl (outdated):
l::Link=canonicallink(d);
offset::Union{AbstractVector, Nothing} = nothing,
wts::Union{AbstractVector, Nothing} = nothing,
dofit::Bool = true,
Member

If they're calling fit instead of a constructor, I kinda feel like we shouldn't add the complexity of a dofit kwarg.

Member Author

Yeah that doesn't make a lot of sense, but it's supported currently so I'd say we'd better keep supporting it. I've added deprecations so that we can remove them at some point.

$PREDICT_COMMON
"""
function predict!(res::Union{AbstractVector,
NamedTuple{(:prediction, :lower, :upper),
Member

Should we allow Tables.ColumnTable? Then we could have an intermediate method that takes res::Any and just calls Tables.columntable(res) to invoke this method, making it possible to pass e.g. DataFrames.

Member Author

We could do that, but TBH I don't expect people to use this method a lot (if at all). I mainly added it because it makes the design more composable and because I needed it internally. So I'd be inclined to wait and see whether people request it before making the code more complex.

@palday (Member) left a comment

very close, but some additional nitpicking

@nalimilan nalimilan requested a review from palday October 19, 2022 20:08
@kleinschmidt (Member) left a comment

I think in general this is great, thanks for doing all this! My only quibble is that there are a few places where inconsistencies with the old behavior might sneak in:

  • Printing does not print the formula anymore.
  • apply_schema previously would get the ACTUAL type of model being fit; now it just gets LinPredModel, which may cause issues if someone has defined special overloads for a particular subtype. Instead, I'd suggest passing the type parameter M to model_frame as a fourth argument.

nalimilan and others added 2 commits October 21, 2022 09:40
Co-authored-by: Dave Kleinschmidt <dave.f.kleinschmidt@gmail.com>
@nalimilan (Member, Author)

Good catch. While we're at it, I propose that we no longer print type parameters, which are not super useful and obscure the interesting part of the output. See last commit. If we support switching between QR and Cholesky in the future, we may print only the name of the decomposition without all the parameters.

I've also changed model_frame to take the model type as you suggest.

@bkamins (Contributor) commented Oct 21, 2022

I propose that we no longer print type parameters

For standard display I think we should print as little as possible. In my lectures I always need to tell my students "ignore this type info as it is not relevant to you".
