Remove tilde macro from formula interface #116

Closed
simonbyrne opened this issue Dec 2, 2015 · 9 comments

Comments

@simonbyrne
Member

I think we should get rid of the tilde macro: it is way too inconsistent with the rest of the language (see discussion in JuliaLang/julia#4882), and takes up a valuable keyboard ASCII symbol. One simple way to replace it is to use a wrapping macro:

@model Y ~ X1 + X2

but there might be something better?

This would also affect MixedModels.jl: are there any other packages which use it?

cc: @johnmyleswhite, @dmbates

@nalimilan
Member

I wouldn't remove it until the discussion settles. It isn't clear to me yet what's going to happen to ~. If it were changed to return a package-neutral expression object, it could be used by several packages without conflicts.

Also, the syntax you suggest is rather verbose; we might as well work directly with expressions: :(x ~ y + z).

@dmbates
Contributor

dmbates commented Dec 2, 2015

I'm okay with changing the model formula specification if that is the conclusion once the dust settles, as @nalimilan said. I can adjust the MixedModels package to whatever approach is chosen.

Naturally the GLM, DataArrays, DataFrames, etc. packages were heavily influenced by the way things were done in R. I am less convinced these days that slavishly following the R model is the way to go. I do like the formula/data specification and would prefer not to abandon it but bug-for-bug compatibility (such as the implicit intercept term in formulas) with R is not necessary.

By the way, I have settled on a numerical approach in the MixedModels package and am in the process of documenting it. After that I could move the package to the JuliaStats group if desired. I kept it under my repositories mainly because I was changing the computational approach frequently and didn't want to explain to collaborators why everything was ripped apart yet again.

@StefanKarpinski

At this point I'm in favor of something like @model Y ~ X1 + X2 which would return some Model type of object (which could retain the original expression object). It's hard to imagine people specifying so many models by hand that the extra seven characters are prohibitively expensive. Using bare expressions for this strikes me as less good. Making ~ a macro was an interesting experiment but far too many people have found it very strange and confusing.

@dmbates
Contributor

dmbates commented Dec 2, 2015

@StefanKarpinski The model formula is generally used as an argument to a function like lm, lmm, or glm that creates the numerical representation of the model.

Initially we used an expression, like :(Y ~ 1 + X1 + X2) to delay evaluation of the terms. The special treatment of ~ was introduced more-or-less for R compatibility. As I said, I don't think R compatibility should be the predominant consideration now.

The point of the exercise is to be able to parse the expression into a left hand side and a right hand side where terms on the right hand side can have special meanings and the evaluation is done in an environment determined by the data argument.

In some ways I think I would prefer to return to the expression syntax instead of creating a special purpose macro, which, in most cases, would need to be written @model(Y ~ 1 + X1 + X2) to be a self-contained expression for an argument.
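
To illustrate the parsing step described above, here is a minimal sketch of splitting a quoted formula into its left-hand and right-hand sides. It assumes a Julia version in which `~` parses as an ordinary infix call (as in Julia 1.x) rather than as the special-cased macro discussed in this thread. `split_formula` and `terms` are hypothetical helper names, not functions from GLM, DataFrames, or MixedModels.

```julia
# Sketch only: `split_formula` and `terms` are hypothetical helpers.

# Split `lhs ~ rhs` into its two sides without evaluating either one.
function split_formula(ex::Expr)
    # Assumes `:(Y ~ 1 + X1 + X2)` is a plain call to `~` with two arguments.
    if ex.head === :call && length(ex.args) == 3 && ex.args[1] === :~
        return ex.args[2], ex.args[3]
    end
    throw(ArgumentError("expected a formula of the form lhs ~ rhs"))
end

# Collect the top-level terms of a `+` chain on the right-hand side.
function terms(rhs)
    if rhs isa Expr && rhs.head === :call && rhs.args[1] === :+
        return collect(Any, rhs.args[2:end])
    end
    return Any[rhs]   # a single term, e.g. a lone column name
end

lhs, rhs = split_formula(:(Y ~ 1 + X1 + X2))
rhs_terms = terms(rhs)
```

Nothing here looks up `Y`, `X1`, or `X2`; the symbols stay unevaluated so that a later step can resolve them against the data argument.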

@simonster
Member

In the past I proposed that this should be a custom string literal that uses $ interpolation to differentiate local variables from DataFrame variables, e.g. you could do:

df = DataFrame(X1=randn(10), X2=randn(10))
Y = randn(10)
fit(GeneralizedLinearModel, model"$Y ~ X1 + X2", df)

@nalimilan
Member

I've thought about the custom string literal, but it has the drawback that highlighting is lost. Since a formula is valid Julia syntax, I don't see the point of putting it inside a string.

(Also, more fundamentally but quite OT, I don't think we should support accessing variables outside of the data frame argument as in R: in my experience, it's a nightmare for package authors, and doesn't bring any real advantage. It's even confusing for students who have objects in the global scope and variables with the same name in the data frame, and modify only one of the two.)

What's the issue with the :(x ~ y + z) syntax?

@simonbyrne
Member Author

I agree with Doug that it's a bit of an R smell, and he is correct that we would have to use @model(Y ~ X1 + X2), as the unparenthesised version will capture any trailing arguments separated by commas.

The main advantage of the @model approach is that it can create a Formula or Model object, rather than an Expr. Given that they have very different uses, I don't think we should conflate the two. The other reason is that it serves as a warning that what we're doing is non-standard (i.e. column names become variables, we're adding bases to form a vector/affine space, not vectors to form another vector).
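
As a rough sketch of what such a macro could look like (not an existing implementation), the block below defines a hypothetical `Formula` struct and a `@model` macro that stores both sides of the formula together with the original expression. The field names and the error message are assumptions made for illustration, and the code assumes a Julia version where `~` parses as an ordinary infix call.

```julia
# Hypothetical container type; field names are illustrative only.
struct Formula
    lhs::Any    # response, e.g. :Y
    rhs::Any    # predictors, e.g. :(X1 + X2)
    ex::Expr    # the original expression, retained as suggested above
end

macro model(ex)
    # Accept only expressions of the form `lhs ~ rhs`.
    if !(ex isa Expr && ex.head === :call && length(ex.args) == 3 && ex.args[1] === :~)
        throw(ArgumentError("@model expects an expression of the form lhs ~ rhs"))
    end
    lhs, rhs = ex.args[2], ex.args[3]
    # QuoteNode keeps the pieces as data, so column names are never
    # evaluated as variables in the caller's scope.
    return :(Formula($(QuoteNode(lhs)), $(QuoteNode(rhs)), $(QuoteNode(ex))))
end

# Parenthesised call, so the macro sees only the formula and cannot
# swallow any arguments that follow it in an enclosing function call.
f = @model(Y ~ 1 + X1 + X2)
```

The result is a `Formula` value rather than a bare `Expr`, so methods can dispatch on it separately from ordinary quoted code.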

@pmilovanov

Just adding my $0.02: indeed, having @~ lurking around as a special case macro is a slippery slope. Support infix macros in general or not at all.

@andreasnoack
Member

This discussion continues in JuliaStats/StatsModels.jl#3


7 participants