Consider changing the column identifier syntax to $column #187

Closed
jkrumbiegel opened this issue Oct 7, 2020 · 15 comments
@jkrumbiegel
Contributor

jkrumbiegel commented Oct 7, 2020

I don't know if this is something that is even up for debate. But one aspect of DataFramesMeta which I think could be improved is that I can't use symbols as symbols in computation expressions without wrapping them in ^().

To use columns with string identifiers, I have to use cols("columnname") or for integer indices cols(123). I think this could be more streamlined by deciding on one way to use all three types.

I think the $ syntax lends itself to this very nicely, because $ is a symbol that is not used in normal Julia code aside from string interpolation (where it's a direct syntax transformation), so it doesn't conflict with other things.

My proposal is to make this change:

df = DataFrame(:group => rand([:alpha, :beta], 100), "total weight" => rand(100))

# old syntax
@transform(df, valid = (:group .== ^(:alpha)) .& (cols("total weight") .< 0.5))

# new syntax
@transform(df, valid = ($group .== :alpha) .& ($"total weight" .< 0.5))

This also works with integers like $1 .* $20 and with any other expression that can result in a column identifier like $("column" * "1").

I know that this package has existed for a long time with the old syntax, but I think with this change it's clearer where macro transformations are taking place (the unusual $ symbols stick out) and there is no confusion with real symbols, or the cols function.
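To make the proposal concrete, here is a toy sketch of how a macro could treat `$` as the single column marker for symbols, strings, and integers. It is not DataFramesMeta's or DFMacros' implementation: `getcol`, `rewrite`, and `@cols_demo` are hypothetical names, and a NamedTuple of vectors stands in for a DataFrame so the example has no dependencies.

```julia
# Hypothetical column lookup on a NamedTuple-of-vectors "table".
getcol(tbl, id::Symbol) = getproperty(tbl, id)
getcol(tbl, id::AbstractString) = getproperty(tbl, Symbol(id))
getcol(tbl, id::Integer) = tbl[id]

# Recursively rewrite every `$id` node into a `getcol(tbl, id)` call.
rewrite(x, tbl) = x                      # literals, symbols, QuoteNodes pass through
function rewrite(ex::Expr, tbl)
    if ex.head === :$
        id = ex.args[1]
        # A bare name such as `$group` means the column :group, not a variable.
        id = id isa Symbol ? QuoteNode(id) : id
        return :(getcol($tbl, $id))
    end
    return Expr(ex.head, (rewrite(a, tbl) for a in ex.args)...)
end

# Hypothetical demo macro: only `$...` is transformed, `:alpha` stays a plain Symbol.
macro cols_demo(tbl, ex)
    esc(rewrite(ex, tbl))
end

tbl = NamedTuple{(:group, Symbol("total weight"))}(
    ([:alpha, :beta, :alpha], [0.2, 0.9, 0.7]),
)

valid = @cols_demo(tbl, ($group .== :alpha) .& ($"total weight" .< 0.5))
first_col = @cols_demo(tbl, $1)          # integer identifiers use the same marker
```

Because only `$` nodes are rewritten, ordinary symbols such as `:alpha` need no escaping, which is the point of the proposal.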

@pdeffebach
Collaborator

Thanks for this proposal. I think this is a very good idea, particularly with the use of strings with spaces.

I was also thinking of having some syntax like s"a variable name with spaces" to make it easier to refer to string column names without cols.

One problem is that $x is usually meant for escaping x rather than working with the literal x. This is a bit of an inconsistency. Another thing I like about :x is the syntax highlighting, but maybe that isn't so important. It would be great if other people could chime in and give their thoughts.

I think my general reaction is to just discourage using Symbols as data. We have pooled arrays and similar for efficient use of strings. Is there a particular use-case you have in mind for people with lots of Symbols in their data?

Note that something like @transform(df, y = :x + cols(1)) is actually about to be disallowed in #183 which is a bit regrettable but good for consistency.

@jkrumbiegel
Contributor Author

In my example I use symbols as data just to show the problem. I think it's much more common that functions you want to apply to your data need some symbols passed to them to select a behavior. The symbols don't have to come out of the dataframe for the syntax collision to be annoying for the user. In my view it's valuable to conflict with as little standard syntax as possible, and I'd argue the $ syntax helps with that.

@pdeffebach
Collaborator

I wonder if actually we want $ to escape symbols instead of ^. That is more consistent with the notion of $ as evaluating the expression unaltered as opposed to the literal.

@jkrumbiegel
Contributor Author

Yes, I always felt that the ^ syntax was peculiar. I wonder why it was chosen at the time.

Although I still feel that it's nicer to mark column references with the unusual $, which maps nicely to strings and integers. The unusual symbols mark the places where the macro transformations are actually taking place, and even to newcomers it should be relatively clear what's going on there. Whereas the mixture of :column, cols("column"), etc. does not look so good to me. It feels a bit arbitrary, especially with cols looking like a function which it really isn't, but which could collide with a variable cols that someone has created before.

@jkrumbiegel
Contributor Author

Ah one more idea..

> One problem is that $x is usually meant for escaping x rather than working with the literal x

There is another symbol which is sometimes used for macros like these, which would work: &.
If you are worried that the $ operator communicates something else going on, I'd argue that & is relatively common as a reference operator. So it could be the column reference operator in the macro.

@transform(df, valid = (&group .== :alpha) .& (&"total weight" .< 0.5))

pdeffebach added this to the Decision 1.0 milestone Mar 7, 2021
@pdeffebach
Collaborator

@jkrumbiegel how does DFMacros do string interpolation and broadcasting?

I was pleasantly surprised to learn that

@transform df @byrow c = "$(:a)_$(:b)"

works on master. And while writing docs I realized that

@transform df c = @. :a - $(mean(:a))

is a pretty useful pattern.

If we do $ instead of cols I want to make sure there are still easy ways to interpolate. Perhaps with $$?

@jkrumbiegel
Contributor Author

This works in DFMacros:

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
@transform(df, "$(:a)$(:b)")

3×3 DataFrame
 Row │ a      b      a_b_function 
     │ Int64  Int64  String       
─────┼────────────────────────────
   1 │     1      4  14
   2 │     2      5  25
   3 │     3      6  36

The $ doesn't conflict because string interpolation is a parse-time thing: the $ doesn't appear in the AST.

julia> Meta.@dump "$a$b$c"
Expr
  head: Symbol string
  args: Array{Any}((3,))
    1: Symbol a
    2: Symbol b
    3: Symbol c
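The same contrast can be checked without any macro, as a small sketch using only `Meta.parse` from Base (the variable names `in_string` and `bare` are just for illustration). A `$` inside a string literal leaves only a `:string` node behind, while a bare `$x` outside a string survives as an explicit `:$` node that a macro can detect and rewrite.

```julia
# `$` inside a string literal is consumed by the parser: only a :string node remains.
in_string = Meta.parse("\"\$a\$b\"")
println(in_string.head)        # head of the parsed string expression
println(in_string.args)        # the interpolated pieces, with no `$` left over

# A bare `$x` outside a string stays in the AST as an Expr with head :$.
bare = Meta.parse("\$x + 1")
dollar_node = bare.args[2]     # args are: the operator `+`, the `$x` node, and 1
println(dollar_node.head)
```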

@jkrumbiegel
Contributor Author

And this also works:

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
x = 5
y = :b
@transform(df, "$(x * :a)$($y)")

3×3 DataFrame
 Row │ a      b      a_b_function 
     │ Int64  Int64  String       
─────┼────────────────────────────
   1 │     1      4  54
   2 │     2      5  105
   3 │     3      6  156

@pdeffebach
Collaborator

Ah, I see. So the order of operations works itself out. That's good to know.

How about broadcasting?

@jkrumbiegel
Contributor Author

What about it?

@pdeffebach
Collaborator

I want to make sure users can still use the $ that, inside @., protects part of an expression against broadcasting:

julia> using DataFrames, DFMacros, Statistics;

julia> df = DataFrame(a = [1, 2, 3], b = [4, 5, 6]);

julia> @. df.a - $(mean(df.a))
3-element Vector{Float64}:
 -1.0
  0.0
  1.0

julia> @transform df :c = @c @. :a - $(mean(:a))
ERROR: MethodError: no method matching iterate(::Symbol)
Closest candidates are:

@jkrumbiegel
Contributor Author

Ah I see, I have never used that, is it a common expression? I think that's too niche for me to consider in DFMacros at least. But in principle, one could use $$ or something like that.

@pdeffebach
Collaborator

Yeah, I've never used it either, but I think it might end up as an important performance consideration. Check out the docs here where I outline a few gotchas that are probably relevant for DFMacros.jl.

I was going to make an example, but it looks like I have a bug with the use of $ anyways on master when used with @byrow...

@pdeffebach
Collaborator

Ah here's an MWE that isn't too niche

julia> using DataFramesMeta, Statistics;

julia> df = DataFrame(a = 1:10);

julia> function expensive(a)
           sleep(.5)
           mean(a)
       end;

julia> @time @with df @byrow :a - expensive(df.a);
  5.086925 seconds (63.54 k allocations: 3.822 MiB, 1.47% compilation time)

julia> @time @with df @. :a - $(expensive(:a));
  0.546911 seconds (7.63 k allocations: 372.349 KiB, 8.19% compilation time)
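The timing gap comes from how often the expensive call runs. A minimal sketch on plain vectors, with no DataFrames involved, shows the mechanism; `counted_mean` and the `calls` counter are hypothetical helpers added only to make the number of evaluations visible.

```julia
using Statistics

a = [1, 2, 3]
calls = Ref(0)                               # counts how often the helper runs
counted_mean(x) = (calls[] += 1; mean(x))

# With `$(...)`, `@.` leaves the call undotted: it runs once on the whole vector.
protected = @. a - $(counted_mean(a))
calls_protected = calls[]

calls[] = 0
# Without `$`, the call is broadcast too: it runs once per element, on scalars,
# so each element is compared with its own mean rather than the column mean.
unprotected = @. a - counted_mean(a)
calls_unprotected = calls[]

println((calls_protected, calls_unprotected))
```

Note that dropping the `$` changes the result as well as the cost, since the function then sees one element at a time.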

@pdeffebach
Collaborator

Closed via #266
