Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make predict easier? #134

Closed
nilshg opened this issue Oct 22, 2020 · 6 comments
Closed

Make predict easier? #134

nilshg opened this issue Oct 22, 2020 · 6 comments

Comments

@nilshg
Copy link
Contributor

nilshg commented Oct 22, 2020

predict currently raises an error, advising the user to estimate the model with the save = :fe kwarg and the access the fixed effects using fe().

If I'm not mistaken, this advise only covers the retrieval of fixed effects, but doesn't actually give a prediction. I'm currently doing:

fem = reg(df, @formula(y ~ x1 + x2 + fe(x3) + fe(x4)), save = :fe)

[df.x1 df.x2] * coef(fem) .+ fe(fem).fe_x3 .+ fe(fem).fe_x4

which I believe produces the correct predictions from the model (do correct me if I'm getting something wrong here!)

Could this be turned into a predict method, which accepts the fem object directly provided it has been estimated with save = :fe?

@nilshg
Copy link
Contributor Author

nilshg commented Oct 22, 2020

FWIW I'm now using this function:

function predict(m::FixedEffectModel, df)
	!isempty(m.fe) || error("Model object does not contain fixed effects, to predict estimate the model with save = :fe")
	Matrix(df[!, m.coefnames]) * coef(m) .+ sum(Matrix(m.fe), dims = 2)
end

which seems to do the job, but clearly is lacking any sort of uncertainty estimates / prediction intervals

@matthieugomez
Copy link
Member

Yes, would be nice to define predict in this case.
The issue is that predict should work on any dataset, not only the one that was originally passed to the function. Yet, your solution only works on the original dataframe. A better way would be to do a join on the variables used to define fixed effects.

You'd need to modify this function

function StatsBase.predict(x::FixedEffectModel, df)

@nilshg
Copy link
Contributor Author

nilshg commented Oct 31, 2020

I'll have a look at this and try to get a PR together

@nilshg
Copy link
Contributor Author

nilshg commented Nov 7, 2020

Okay I got stuck and am wondering whether I'm missing something:

  • predict is supposed to accept two arguments, the fitted model and some data for the covariates (could be old or new)
  • The only information we have on the fixed effects is in the fe field of the struct, which is an n-by-fe DataFrame

Based on this, it seems to me that it's not in general possible to make a prediction from the model object and some new data - the only way to know which number in fe belongs to which level of a fixed effect is by having the old data available (to match up the columns in fe with the columns holding the original categories.

If this is correct, it seems to me there are two options here to allow predictions from new data:

  1. Require passing the old data to predict as well to identify which estimates belong to which category for each fixed effects; or
  2. Change the construction of augmentdf in reg() to include the info about the original categories. This could be done by either just including those columns as well and make fe n-by-2*fe, or by changing this more drastically to have a smaller DataFrame or Dict with just the mapping category => estimate of fixed effect instead.

I suppose (2) is potentially breaking to the extent that people have relied on the format of the fe field, while (1) could be implemented without changes to the layout of the fields in the model, but at the cost of diverging somewhat from the usual predict API.

Any comments appreciated - happy to implement either of these or both, or indeed none of them if I've missed something and there's a better way!

@matthieugomez
Copy link
Member

matthieugomez commented Apr 12, 2021

Sorry for the late response — you're exactly right. I think Option2 is best. fe(x::FixedEffectModel, keepkeys = true) now returns a DataFrame with the original categories (#164). It is not breaking, and it should allow you to write a better predict using DataFrames leftjoin

@matthieugomez
Copy link
Member

Fixed by #185

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants