Formula should include : and * interactions #18

Closed
HarlanH opened this issue Jul 15, 2012 · 5 comments

@HarlanH
Contributor

HarlanH commented Jul 15, 2012

No description provided.

@doobwa
Contributor

doobwa commented Jul 19, 2012

I'm curious about how to go about this. In the following, it seems that + binds more tightly than : when Julia parses the expression into Expr objects (which of course is the wrong order for the model notation).

julia> f = Formula(:(y ~ x1 + x1:x2))
Formula([y],[:(+(x1,x1),x2)])

julia> f.rhs[1].args
2-element Any Array:
 +(x1,x1)
 x2      

Doesn't this make it harder to use the : notation without changing Expr objects?
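
The parse described in this comment can be reproduced by quoting just the right-hand side. This is a minimal sketch, assuming a recent Julia 1.x release, where the tree is stored and printed differently from the 2012 output shown in the thread; the variable names are only illustrative.

```julia
# Quote the right-hand side so Julia's parser builds the expression tree.
ex = :(x1 + x1:x2)

# `:` has lower precedence than `+`, so `:` ends up at the top of the tree
# and `x1 + x1` becomes its first operand.
# (Assumes the Julia 1.x layout, where the operator is stored in args[1].)
top_operator = ex.args[1]
lhs_of_colon = ex.args[2]
rhs_of_colon = ex.args[3]

println("top-level operator: ", top_operator)
println("left operand:       ", lhs_of_colon)
println("right operand:      ", rhs_of_colon)
```

The left operand comes out as x1 + x1 rather than x1, which is the grouping the model notation does not want.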

@tshort
Contributor

tshort commented Jul 19, 2012

Maybe we'll have to change operators. :: looks like it might work. So would & and %. Here's a list of operators ordered by precedence from julia-parser.scm:

(define ops-by-prec
  '#((= := += -= *= /= //= .//= .*= ./= |\\=| |.\\=| ^= .^= %= |\|=| &= $= => <<= >>= >>>= ~ |.+=| |.-=|)
     (?)
     (|\|\||)
     (&&)
     ; note: there are some strange-looking things in here because
     ; the way the lexer works, every prefix of an operator must also
     ; be an operator.
     (<- -- -->)
     (> < >= <= == === != |.>| |.<| |.>=| |.<=| |.==| |.!=| |.=| |.!| |<:| |>:|)
     (: |..|)
     (+ - |.+| |.-| |\|| $)
     (<< >> >>>)
     (* / |./| % & |.*| |\\| |.\\|)
     (// .//)
     (^ |.^|)
     (|::|)
     (|.|)))

@tshort tshort closed this as completed Jul 19, 2012
@HarlanH HarlanH reopened this Jul 19, 2012
@HarlanH
Contributor Author

HarlanH commented Jul 19, 2012

(Tom, think you hit the close button by mistake! A bit of a GitHub UI quirk...)

I concur. I think we should go with & instead of :, as in y ~ 1 + x + x&y. There are also those redundant formula features I never use, like subtracting a predictor: y ~ 1 + x * y - y and whatnot. I don't really care if we support those or not. I'd prefer we stick with 0+ to remove the intercept term too, and not support - 1, which I find harder to read.
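
To see why & fits the parser better than :, the proposed formula can be quoted and its right-hand side inspected. This is a rough sketch, assuming a recent Julia 1.x release (where ~ parses as an ordinary binary call and a chain of + is flattened into a single call); it only looks at the raw Expr and does not use the Formula type from this package.

```julia
# Quote the formula proposed above, with `&` marking the interaction.
ex = :(y ~ 1 + x + x&y)

# In Julia 1.x the tree is a call to `~` with the response and the
# right-hand side as its two operands.
response = ex.args[2]
rhs = ex.args[3]

# `&` binds more tightly than `+`, so `x & y` survives as a single term.
# The `+` chain is one call: args[1] is `+`, the rest are the terms.
terms = rhs.args[2:end]

println("response: ", response)
for t in terms
    println("term: ", t)
end
```

Here the interaction stays grouped as one term of the sum, so no change to how Expr objects are built is needed.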

@doobwa
Contributor

doobwa commented Jul 19, 2012

There is something to be said for supporting R's syntax: it's been around long enough for people to be familiar with it, and the Python people are starting to use it as well. Would this be possible if we instead parsed strings? Having said that, though, it doesn't seem worth it.

On the other hand, the number of operations we're talking about is pretty minimal, so people will just need to look up Julia's way of doing it. One direction I think would be cool: extend this notation to also include namespaces of features a la Vowpal Wabbit's sparse format. For example, if you have a sparse, bag-of-words representation for a text document, all of these features could be under the words namespace. If you also have a categorical variable for day of week, then y ~ words * day would create interaction terms between all the word features and the day feature.
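
The namespace idea amounts to pairing every feature in one group with every feature in another. A minimal sketch of just that pairing step follows; the feature names, the day levels, and the "&" naming of the resulting terms are all made up for illustration and are not Vowpal Wabbit's format or anything this package provides.

```julia
# Hypothetical members of a `words` namespace and levels of a `day` variable.
words = ["cat", "dog"]
days = ["mon", "tue", "wed"]

# One interaction term per (word, day) pair, named with `&` purely as a label.
interactions = ["$(w)&$(d)" for w in words for d in days]

println(length(interactions), " interaction terms")
for name in interactions
    println(name)
end
```

With a real bag-of-words namespace the number of pairs grows quickly, so a sparse representation of the resulting columns would presumably be needed.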

@HarlanH
Contributor Author

HarlanH commented Jul 19, 2012

Yeah, I don't think a single-character change is a big deal here, and using Julia's parser seems a big enough win that I think we should stick with it.

As for namespaces (cool -- I need to actually try VW out sometime!), we'd need a way to define them separately from the formula. Would we want to include something like "colname groups" in the DataFrame? So, you'd somehow define "dims" to be a colname group for "height", "width", and "depth", then you could use "dims" instead of a list of those three column names? That could be useful for other things too. df["dims"] becomes a shorthand for df[["height", "width", "depth"]], and df["predictors"] and df["response"] seem natural things to define, too. So you could then call lm(:(response ~ predictors + covariants), df) or something. That's fairly awesome. I'm going to spin off an issue!
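
The "colname groups" lookup could be sketched roughly as below, assuming the groups live in a plain Dict kept alongside the data. The names groups and expand_colnames are hypothetical; nothing here is an existing DataFrame feature.

```julia
# Hypothetical registry of column-name groups (not part of any DataFrame API).
groups = Dict("dims" => ["height", "width", "depth"])

# Replace each group name with its member columns; other names pass through.
function expand_colnames(names, groups)
    expanded = String[]
    for name in names
        append!(expanded, get(groups, name, [name]))
    end
    return expanded
end

cols = expand_colnames(["weight", "dims"], groups)
println(cols)
```

Indexing with df["dims"] would then reduce to indexing with the expanded list of column names.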

@HarlanH HarlanH mentioned this issue Jul 19, 2012
@doobwa doobwa mentioned this issue Jul 23, 2012
@tshort tshort closed this as completed Aug 4, 2012
nalimilan pushed a commit that referenced this issue May 26, 2022

3 participants