New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terms 2.0: son of Terms #71

Open
wants to merge 142 commits into
base: master
from

Conversation

Projects
None yet
@kleinschmidt
Copy link
Member

kleinschmidt commented Aug 7, 2018

This is a pretty major re-thinking of how to represent terms in a formula. It builds on #4, #54, and #57. The basic idea is that the @formula macro lowers a formula expression to an expression where symbols are "wrapped" in a Term struct, and overloads operators like +, &, and ~ with methods of Terms that generate higher-order terms like interactions. Additionally, this PR includes a mechanism by which calls to functions that don't have special meanings in the formula DSL are lowered to a call to capture_call which gets the original function called, the original expression, and an anonymous function that "wraps" that call. The default result of that function is that it passes these onto a FunctionTerm constructor, but in principle package authors could intercept things at this point. Whether or not a call is considered 'special' is also customizable, dispatching on is_special(Val(::Symbol)).

The other major new component is a schema representation that I "borrowed" from JuliaDB.ML. Schemas are computed from a namedtuple of vectors (e.g., what DataStreams calls a Data.Table), and when applied to a formula will replace leaf Terms with Categorical/ContinuousTerms.

The major conceptual difference is that any subtype of AbstractTerm can generate model matrix columns, and the way that columns are combined to make higher order model matrices is handled by dispatch. This provides, I think, much more flexibility in how package authors can "plug into" the formula pipeline, because they are no longer restricted to using fully-formed ModelFrames.

That being said, I've tried to keep the ModelFrame/ModelMatrix structure for now to make it easier to see how things have changed. I'd also like to consider how we actually use these structures to generate and fit models (e.g. #32). But that's orthogonal enough that this is worth considering as is.

This is work in progress and I haven't even tried to get the tests passing yet because I wanted to talk about this at juliacon. But I think it's close enough to the kind of structure I've had in mind for a long time that it's worth considering.

kleinschmidt added some commits Jul 31, 2018

@kleinschmidt

This comment has been minimized.

Copy link
Member

kleinschmidt commented Jan 15, 2019

Okay I've written fairly comprehensive docs (including docstrings for all the API functions that I could think of).

@oxinabox FWIW that produces exactly the same result in the current released version so I don't think it should block merging this. It is an important edge case to check but better to handle in another PR.

Show resolved Hide resolved docs/src/internals.md Outdated
Show resolved Hide resolved docs/src/internals.md Outdated
push!(encountered_columns, view(trms.factors, :, i_term))
## copied from DataFrames:
function _nonmissing!(res, col)
# workaround until JuliaLang/julia#21256 is fixed

This comment has been minimized.

@Nosferican

Nosferican Jan 15, 2019

Contributor

That issue seems to have been fixed.

@kleinschmidt

This comment has been minimized.

Copy link
Member

kleinschmidt commented Jan 15, 2019

@kleinschmidt

This comment has been minimized.

Copy link
Member

kleinschmidt commented Jan 16, 2019

Okay I think this is about ready for final review. @nalimilan, @dmbates, @Nosferican take a final look if you don't mind? If you want to build the docs locally you can do the now-standard

Pkg.activate("docs")
include("docs/make.jl")
@Nosferican

This comment has been minimized.

Copy link
Contributor

Nosferican commented Jan 16, 2019

There seems to be some legacy code (e.g., model_matrix now modelmatrix in StatsBase, model_response and response in StatsBase, appveyor config file using master and release-0.6, etc.) REQUIRE also seems to be using Missings (is that still necessary if targeting 1.0?).

@kleinschmidt

This comment has been minimized.

Copy link
Member

kleinschmidt commented Jan 16, 2019

ah I'd missed that rename. and actually didn't even realize that modelmatrix was a function from StatsBase.

@Nosferican

This comment has been minimized.

Copy link
Contributor

Nosferican commented Jan 16, 2019

Importing those methods and extending it to ModelFrame/ModelMatrix might be the optimal approach. I think, coefnames is extended, but if not, same advice.

@kleinschmidt

This comment has been minimized.

Copy link
Member

kleinschmidt commented Jan 16, 2019

Agreed.

@kleinschmidt

This comment has been minimized.

Copy link
Member

kleinschmidt commented Jan 16, 2019

I don't think we even have appveyor enabled for this repo? I've never set up one of those so it's a bit above my pay grade.

@Nosferican

This comment has been minimized.

Copy link
Contributor

Nosferican commented Jan 16, 2019

The config dates to 0.5/0.6 timeframe so I suspected as much. I am fairly certain nothing on this package is Windows or 32-bits sensitive (or even Mac OS), so we could drop it and keep Travis. Travis currently tests on Mac as well. I think is overly cautious so if you wanna save a tree or two we could leave it with Linux only given the scope of the package. I just tested the package without DataStructures and seems like that dependency is also legacy.

If we target 1.0,

name = "StatsModels"
uuid = "3eaba693-59b7-5ba5-a881-562e759f1c8d"

[compat]
julia = "1"

[deps]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
SparseArrays = "2f01184e-e22b-5df5-ae63-d93ebab69eaf"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"

[extras]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test", "DataFrames"]

and REQUIRE as

julia 1
CategoricalArrays
StatsBase
Tables

Not that the cap on StatsBase loses bite since 0.7-1.2 julia compatibility already complies with >0.22.0.

Also, precompile(true) at top no longer needed so we can drop it if focusing on 0.7/1.0.

@Nosferican

This comment has been minimized.

Copy link
Contributor

Nosferican commented Jan 16, 2019

The Missings.T is 0.6, no?

@Nosferican

This comment has been minimized.

Copy link
Contributor

Nosferican commented Jan 16, 2019

New ModelFrame is neat! I like the schema.
One issue while playing around was how to access the mf.schema,

data = DataFrame(y = categorical(["a","a","b","b"]), x = zeros(4))
fm = @formula(y ~ x)
mf = ModelFrame(fm, data)
mf.schema[Term(:y)] # the Term keys are a bit unintuitive, maybe the keys as Symbol or a note?

For the, CategoricalTerm, maybe include one more field for nominal vs ordinal? Maybe a ordinal::Bool that uses isordered(obj) during construction?

@kleinschmidt

This comment has been minimized.

Copy link
Member

kleinschmidt commented Jan 16, 2019

I think you can get the ordered property from the contrasts matrix (which stores the eltype) right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment