Terms 2.0: son of Terms #71

Merged
merged 193 commits into from
Mar 10, 2019

Conversation

kleinschmidt
Member

@kleinschmidt kleinschmidt commented Aug 7, 2018

This is a pretty major re-thinking of how to represent terms in a formula. It builds on #4, #54, and #57. The basic idea is that the @formula macro lowers a formula expression to an expression where symbols are "wrapped" in a Term struct, and overloads operators like +, &, and ~ with methods of Terms that generate higher-order terms like interactions. Additionally, this PR includes a mechanism by which calls to functions that don't have special meanings in the formula DSL are lowered to a call to capture_call which gets the original function called, the original expression, and an anonymous function that "wraps" that call. The default result of that function is that it passes these onto a FunctionTerm constructor, but in principle package authors could intercept things at this point. Whether or not a call is considered 'special' is also customizable, dispatching on is_special(Val(::Symbol)). Edit: For posterity's sake, the extension mechanism is now to provide methods for apply_schema(::FunctionTerm{typeof(myfunc)}, schema, Modeltype) which return your custom term type.
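
To make the operator-overloading idea concrete, here is a minimal sketch of what the lowered expression could evaluate to. It uses toy types (`ToyTerm`, `Leaf`, `Interaction`, `Formula`) that are hypothetical and are not the types in this PR; it only illustrates how `+`, `&`, and `~` can build higher-order terms once every symbol has been wrapped.

```julia
# Toy stand-ins for illustration only; these are NOT the types defined in this PR.
abstract type ToyTerm end

struct Leaf <: ToyTerm
    sym::Symbol
end

struct Interaction <: ToyTerm
    terms::Vector{Leaf}
end

struct Formula
    lhs::ToyTerm
    rhs::Vector{ToyTerm}
end

# `&` builds an interaction, flattening an interaction on the left.
Base.:&(a::Leaf, b::Leaf) = Interaction([a, b])
Base.:&(a::Interaction, b::Leaf) = Interaction([a.terms..., b])

# `+` collects terms into a flat vector of predictors.
Base.:+(a::ToyTerm, b::ToyTerm) = ToyTerm[a, b]
Base.:+(a::Vector{ToyTerm}, b::ToyTerm) = ToyTerm[a..., b]

# `~` separates the response from the predictors.
Base.:~(lhs::ToyTerm, rhs::ToyTerm) = Formula(lhs, ToyTerm[rhs])
Base.:~(lhs::ToyTerm, rhs::Vector{ToyTerm}) = Formula(lhs, rhs)

# Roughly what a macro could emit for `y ~ a + b + a & b`
# after wrapping each symbol in a leaf term:
y, a, b = Leaf(:y), Leaf(:a), Leaf(:b)
f = y ~ a + b + a & b
```

Because `&` binds more tightly than `+`, and `+` more tightly than `~`, ordinary Julia precedence yields a response, two main effects, and one interaction without any formula-specific parsing.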

The other major new component is a schema representation that I "borrowed" from JuliaDB.ML. Schemas are computed from a namedtuple of vectors (e.g., what DataStreams calls a Data.Table and Tables.jl calls a ColumnTable), and when applied to a formula will replace leaf Terms with CategoricalTerms or ContinuousTerms.
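
As a rough illustration of the schema idea, the sketch below computes a per-column summary from a named tuple of vectors. The names `ToyContinuous`, `ToyCategorical`, `toy_concrete`, and `toy_schema` are hypothetical, and the rule "real-valued columns are continuous, everything else is categorical" is an assumption made for this example rather than the rule used in the PR.

```julia
# Hypothetical per-column summaries; not the types from this PR.
struct ToyContinuous
    name::Symbol
end

struct ToyCategorical
    name::Symbol
    levels::Vector
end

# Assumption for this sketch: real-valued columns are continuous,
# every other column is categorical with sorted unique levels.
toy_concrete(name::Symbol, col::AbstractVector{<:Real}) = ToyContinuous(name)
toy_concrete(name::Symbol, col::AbstractVector) = ToyCategorical(name, sort(unique(col)))

# A schema maps each column name to its concrete term.
toy_schema(table::NamedTuple) =
    Dict(name => toy_concrete(name, col) for (name, col) in pairs(table))

table = (y = [1.2, 3.4, 5.6], x = [1, 2, 3], g = ["b", "a", "b"])
sch = toy_schema(table)
```

Applying such a schema to a formula would then amount to looking up each leaf term's name and substituting the concrete term found there.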

The major conceptual difference is that any subtype of AbstractTerm can generate model matrix columns, and the way that columns are combined to make higher order model matrices is handled by dispatch. This provides, I think, much more flexibility in how package authors can "plug into" the formula pipeline, because they are no longer restricted to using fully-formed ModelFrames.

That being said, I've tried to keep the ModelFrame/ModelMatrix structure for now to make it easier to see how things have changed. I'd also like to consider how we actually use these structures to generate and fit models (e.g. #32). But that's orthogonal enough that this is worth considering as is.

This is work in progress and I haven't even tried to get the tests passing yet because I wanted to talk about this at juliacon. But I think it's close enough to the kind of structure I've had in mind for a long time that it's worth considering.

@kleinschmidt
Member Author

I intentionally didn't build in missing-skipping in concrete_term because someone might want to actually model missing data as another level.

Function composition should work normally:

julia> f = @formula(y ~ log(abs2(x)))
FormulaTerm
Response:
  y(unknown)
Predictors:
  (x)->log(abs2(x))

julia> f.rhs
(x)->log(abs2(x))

julia> f.rhs.fanon
#7 (generic function with 1 method)

julia> f.rhs.fanon(10)
4.605170185988092

julia> log(abs2(10))
4.605170185988092

As for using try/catch to support auto-de-vectorizing of column-wise functions, I'm hesitant. Or I'll need more convincing that it's a good idea and doesn't hurt performance or usability all that much. So better handled as a PR after this is merged :) Or as a "special term" as I suggested above, where you can define behavior directly for an entire column of data by dispatching on the data type in model_cols.
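
A sketch of the "special term" alternative, under assumptions: `ToyColumnTable`, `ToyElementwise`, `ToyLag`, and `toy_model_cols` are hypothetical names, not this PR's API. The point is only that a term needing the whole column (a lag here) can get its own method for column-table data, while ordinary function terms stay elementwise.

```julia
# Hypothetical alias: a named tuple whose values are all vectors.
const ToyColumnTable = NamedTuple{names, <:Tuple{Vararg{AbstractVector}}} where names

# Hypothetical term that wraps an elementwise function of one column.
struct ToyElementwise{F}
    f::F
    name::Symbol
end

# Hypothetical term that needs the whole column at once.
struct ToyLag
    name::Symbol
    n::Int
end

# Elementwise term on a single row (a named tuple of scalars).
toy_model_cols(t::ToyElementwise, row::NamedTuple) = t.f(row[t.name])

# Elementwise term on a column table: apply the function to each element.
toy_model_cols(t::ToyElementwise, data::ToyColumnTable) = map(t.f, data[t.name])

# The lag term only defines the column-table method.
# Assumes a 1-based vector; the first `n` entries have no lagged value.
function toy_model_cols(t::ToyLag, data::ToyColumnTable)
    col = data[t.name]
    return [i > t.n ? col[i - t.n] : missing for i in eachindex(col)]
end

data = (x = [1, 2, 3],)
lagged = toy_model_cols(ToyLag(:x, 1), data)
logged = toy_model_cols(ToyElementwise(log, :x), data)
single = toy_model_cols(ToyElementwise(log, :x), (x = 2.0,))
```

No try/catch is needed in this design: whether a term sees one row or a whole column is decided by dispatch on the type of the data argument.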

@Nosferican
Contributor

Shouldn't FunctionTerm yield either a CategoricalTerm or a ContinuousTerm after the function is applied?

@kleinschmidt
Member Author

kleinschmidt commented Mar 4, 2019

No, it should yield a FunctionTerm after apply_schema unless someone has somewhere defined a method for apply_schema(::FunctionTerm{F}, data, Model).

Edit: a function term should also return a single value when called with the arguments listed in its names parameter, pulled from a named tuple.
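
A minimal sketch of that calling convention, assuming a hypothetical `ToyFunctionTerm` with illustrative field names (`fanon`, `names`) and a hypothetical helper `toy_cols`; the real struct in the PR may be laid out differently.

```julia
# Hypothetical stand-in for a function term; field names are illustrative only.
struct ToyFunctionTerm{F}
    fanon::F                 # anonymous function wrapping the original call
    names::Vector{Symbol}    # argument names, in call order
end

# Evaluate on one row: pull the named arguments out of the named tuple.
(t::ToyFunctionTerm)(row::NamedTuple) = t.fanon((row[n] for n in t.names)...)

# Evaluate on a named tuple of vectors, one row at a time.
function toy_cols(t::ToyFunctionTerm, table::NamedTuple)
    cols = (table[n] for n in t.names)
    return [t.fanon(args...) for args in zip(cols...)]
end

t = ToyFunctionTerm(x -> log(abs2(x)), [:x])
one = t((x = 10, y = 1.0))              # a single value, same as log(abs2(10))
many = toy_cols(t, (x = [1, 2, 3],))    # one value per row
```

Columns that the term does not name (`y` above) are simply ignored, which is what lets the same row be passed to every term in a formula.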

@kleinschmidt
Member Author

And actually you could implement both CategoricalTerm and ContinuousTerm as special cases of (a more general, n-ary) FunctionTerm: CategoricalTerm as a closure that captures the contrasts matrix and then does a getindex, and ContinuousTerm as x -> convert(Float64, x). They're all just transformers that take a table and return some numerical array element by element (potentially with shortcuts in situations where you can do better if you have the whole column at once).
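
The closure view can be sketched as follows. The level names, the hand-written dummy-coded contrasts matrix, and the variable names are all assumptions made for this illustration.

```julia
# Continuous: convert each element to Float64.
continuous = x -> convert(Float64, x)

# Categorical: a closure that captures a contrasts matrix and indexes one row per level.
# Dummy coding for levels "a", "b", "c", written by hand for illustration.
level_names = ["a", "b", "c"]
contrasts = [0.0 0.0;
             1.0 0.0;
             0.0 1.0]
index_of = Dict(l => i for (i, l) in enumerate(level_names))
categorical = g -> contrasts[index_of[g], :]

continuous(3)       # 3.0
categorical("b")    # [1.0, 0.0]

# Applying both element by element and stacking the rows gives a model matrix.
rows = [vcat(continuous(x), categorical(g)) for (x, g) in zip([1, 2, 3], ["a", "c", "b"])]
X = permutedims(reduce(hcat, rows))
```

Each row of `X` holds the continuous value followed by the two contrast columns for that row's level.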

@kleinschmidt
Member Author

Okay, I've merged the prediction interval changes from master (#81); other changes to master in the meantime are superseded/functionally covered by this PR (#82, #72, #76).

Member

@nalimilan nalimilan left a comment

Let's merge then!

@matthieugomez
Contributor

matthieugomez commented Mar 15, 2019

Can you talk a bit more about why you decided to apply functions elementwise? It sounds like a pretty big departure from the rest of Julia syntax. Moreover, using a lagged variable (or converting a continuous variable to a categorical one) on the fly sounds like a common use case, but it cannot be done in your framework (AFAIK).
You mention something about lazy transformations, but I'm not convinced that it is worth doing just for this reason.

@nalimilan
Member

See also #75 (comment). It's probably better to move the discussion there, or to file a dedicated issue, as this PR is coming dangerously close to 200 comments.
