
Plan for DataFrames 1.0 #148

Closed
pdeffebach opened this issue May 8, 2020 · 28 comments
pdeffebach (Collaborator) commented May 8, 2020

@bkamins @nalimilan

I am tentatively offering to take over maintenance of DataFramesMeta after DataFrames reaches 1.0. I think this will be feasible for me because the new features in DataFrames let us really cut down the LOC in DataFramesMeta.

First, all work will be performed by one macro. This macro will turn

:x + :y + reduce(+, r"^a") + reduce(*, Between(:x2, :x5))

into

function temp_name(a, b, c, d)
    a + b + reduce(+, c) + reduce(*, d)
end

[:x, :y,  r"^a", Between(:x2, :x5)] => temp_name

where r"^a" and Between(:x2, :x5) each stand in for a vector of column vectors. This is currently disallowed in DataFrames (and I may even have argued against it at some point), but it now seems intuitive.

EDIT: The term asArgs(...) has come up before. Maybe this is a good use-case for that.

This means that a workflow would be like

@transform(df, :x + :y + reduce(+, r"^a") + reduce(*, Between(:x2, :x5)) => :newcol) ===
transform(df,
    [:x, :y, r"^a", Between(:x2, :x5)] =>
    ((a, b, c, d) -> a + b + reduce(+, c) + reduce(*, d)) =>
    :newcol)

The only code DataFramesMeta really executes is the code that constructs the first part of the pair.

Perhaps keyword arguments could be added, but people seem happy with the Pairs syntax and it would be nice if we mirrored DataFrames closely.

Like DataFrames, operating on columns will be the default. Maybe we can introduce a @byrow macro which just takes an expression and wraps it in ByRow(...).
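For illustration, the proposed @byrow macro is hypothetical, but its expansion could lean directly on DataFrames' existing ByRow wrapper; the second line below is real, runnable DataFrames API:

```julia
using DataFrames

df = DataFrame(x = [1, 2], y = [10, 20])

# Hypothetical: @transform(df, @byrow :x + :y => :z) would expand to
transform(df, [:x, :y] => ByRow(+) => :z)
# which adds a column :z with values [11, 22]
```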

Unless there are large performance benefits, I think we no longer need to support the @linq pipeline. My impression is that the closure @linq creates no longer helps us. Rather, we can re-export Pipe.jl's @pipe or Lazy.jl's @>.
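As a sketch of what relying on a generic piping package could look like (Pipe.jl's @pipe substitutes `_` for the previous result; the column names here are made up):

```julia
using DataFrames, Pipe

df = DataFrame(a = 1:4, b = 5:8)

result = @pipe df |>
    filter(:a => >(1), _) |>              # keep rows where a > 1
    transform(_, [:a, :b] => (+) => :s)   # elementwise sum of the two columns
```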

Escaping rules are changed so that escaping symbols (and only symbols) is the default. This mirrors DataFrames more closely and, selfishly, makes it easier to put @transform etc. in functions. Perhaps this is really hard to do, but I would like to explore it so that working without literals is just as easy as in base DataFrames.

@with could still work. But I don't know if people use it, since escaping gets complicated very quickly.

Deprecate @based_on and @by; these are now @combine.

Add some features in DataFrames to make up for @where(gd::GroupedDataFrame, ...) (filter) and @orderby (sort).

I think this will result in a DataFramesMeta that both mirrors DataFrames closely and is easier to maintain.

Let me know what you think of this proposal.

bkamins (Member) commented May 8, 2020

It's excellent that you would be willing to work on this! I will think about your proposal over the weekend and let you know.

mkborregaard commented

I use @with :-)

nalimilan (Member) commented

Thanks for the offer!

I'm fine with dropping @based_on and @by. @orderby also sounds like it should be a method of sort. For @linq, it would indeed be nice if we could reuse a more general piping system from another package. But @with sounds useful.

I don't really understand what reduce would do in your example. Can you elaborate?

Like DataFrames, operating on columns will be the default. Maybe we can introduce a @byrow macro which just takes an expression and wraps it in ByRow(...).

Note that we already have @byrow!.

Escaping rules are changed so that escaping symbols (and only symbols) is the default. This mirrors DataFrames more closely and, selfishly, it's easier to put @transform etc. in functions. Perhaps this is really hard to do, but I would like to explore it so that working without literals is just as easy as in base data frames.

What do you mean? What kind of objects are escaped currently that wouldn't anymore?

Regarding keyword arguments, dropping that syntax would be quite disruptive. Maybe better keep supporting it, as it's relatively natural to write select(df, mean_x = mean(:x)): contrary to DataFrames, the input columns are indicated in the same expression as the function, so the incols => fun => outcol syntax is just expression => outcol, which is less useful. Or at the minimum it would be nice to provide at least a working deprecation for some time.
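To make the contrast concrete (the pair form below is real DataFrames API; the keyword form is DataFramesMeta macro syntax, shown only as a comment):

```julia
using DataFrames, Statistics

df = DataFrame(x = [1.0, 2.0, 3.0])

# DataFrames pair syntax: input columns => function => output name
combine(df, :x => mean => :mean_x)   # 1×1 DataFrame with mean_x = 2.0

# DataFramesMeta keyword form folds the input columns into the expression,
# so only the output name stands alone:
# @combine(df, mean_x = mean(:x))
```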

Cc: @piever

pdeffebach (Collaborator, Author) commented May 8, 2020

I don't really understand what reduce would do in your example. Can you elaborate?

I just needed an example of a function that would take in a Vector of Vectors to illustrate how collections of columns would be implemented. So the reduce would be applied to a Vector of Vectors whose names start with "a".

Note that we already have @byrow!.

I forgot about @byrow!. My idea was related to ByRow in the sense of

transform(df, [:x, :y] => ByRow(sum)) ===
@transform(df, @byrow sum([:x, :y]))

I don't know what I would do with @byrow!. I think it's a very nice macro, but its functionality is technically superseded by transform with ByRow. But I do like the anonymous function syntax a lot. I think that DataFrames allowing functions to return NamedTuples would make it even easier to implement and maintain. For example

@byrow! df begin
    @newcol z # allow user to avoid type annotations?
    @newcol d
    :d = :x + :y
    :z = :x * :y
end

could get turned into

function temp_name(a, b)
    temp_1 = a + b
    temp_2 = a * b
    (d = temp_1, z = temp_2)
end

transform(df, [:x, :y] => ByRow(temp_name))

This is already very close to the current implementation of @byrow!

I want to emphasize that my vision of the future of DataFramesMeta is simply to make it slightly easier to construct transform and combine statements and not a whole lot else.

pdeffebach (Collaborator, Author) commented

Escaping rules are changed so that escaping symbols (and only symbols) is the default. This mirrors DataFrames more closely and, selfishly, it's easier to put @transform etc. in functions. Perhaps this is really hard to do, but I would like to explore it so that working without literals is just as easy as in base data frames.

What do you mean? What kind of objects are escaped currently that wouldn't anymore?

julia> s = :x; t = :y;
julia> @transform(df, z = t + s) # fails 
julia> @transform(df, z = cols(t) + cols(s)) # works

Given that

transform(df, [s, t] => +)

works, it seems natural to try to make the above escaping rules work with variables.
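For reference, a runnable version of the point above, in plain DataFrames with no macros (column names and data are made up):

```julia
using DataFrames

df = DataFrame(x = 1:3, y = 4:6)
s = :x; t = :y

# Column names held in variables need no special escaping in plain DataFrames:
transform(df, [s, t] => (+) => :z)   # adds :z == [5, 7, 9]
```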

matthieugomez commented May 10, 2020

Thanks @pdeffebach for improving this package.

I agree with @nalimilan that it would be better to keep kwargs.

It may be worth thinking about defining only one macro, that combines piping + transforming second argument of a combine/select/transform call. That would be much cleaner than defining a macro for every verb IMO.

One last thing: is it too late to use x rather than :x to refer to a column name?

pdeffebach (Collaborator, Author) commented

I was thinking that it might be nice to define only one macro, that combines piping + transforming second argument of a combine/select/transform call.

That might be a good idea, but could also make things more confusing since select would have different behavior depending on the environment.

One last thing: is it too late to use x rather than :x to refer to a column name?

I am against this. If anything, I would like to do the opposite, where if x is a variable assigned to a symbol it would use that symbol, i.e.

x = :my_col
@transform(df, :p + x => :y) # @transform(df, :p + :my_col => :y)

The behavior you describe would make it very difficult to use these macros inside of functions where the columns operated on are inputted as variables instead of literals. It's a major pain point with dplyr that I would like to avoid.

matthieugomez commented May 10, 2020

In this case, one could do @transform(df, y = p + col(x)) no?

matthieugomez commented May 10, 2020

In any case, it's true that @transform(df, y = :x) is kind of weird. It sounds simpler to have either :y = :x or y = x and, in particular, to allow the LHS and RHS to be substituted in the same way (I think you have opened some issues about that).

pdeffebach (Collaborator, Author) commented

In this case, one could do @transform(df, y = p + col(x)) no?

Yes, this is the current behavior, and it is likely to stay, since my proposal for more "normal" scoping rules might be very hard to implement. It would involve evaluating everything in the expression to see if it's a symbol.

in any case it’s true that @transform(df, y= :x) is kind of weird. It sounds simpler to have either :y = :x or y = x, and, in particular, to allow the LHS and RHS to be substituted in the same way (I think you have opened some issues about that)

Yes. I would really like to get that to work, and perhaps even deprecate y = x unless y and x are both variables defined as symbols. The problem is that the way Julia places keyword arguments in an expression is weird and I don't understand it. But other macro packages have gotten around it, so I think this should be possible.

nalimilan (Member) commented

Got it. Probably regexes shouldn't be interpreted as referring to columns by default, so you'd have to write cols(r"x").

I actually agree with @matthieugomez that it would be nice to treat any variable as referring to a column. In particular, that would avoid problems when you need to use an actual symbol, which is quite common when passing options to functions. As it is relatively rare to refer to a variable from the outer scope, having to escape it sounds OK. But of course, changing this would be quite disruptive.

I am against this. If anything, I would like to do the opposite, where if x is a variable assigned to a symbol it would use that symbol, i.e.

x = :my_col
@transform(df, :p + x => :y) # @transform(df, :p + :my_col => :y)

The behavior you describe would make it very difficult to use these macros inside of functions where the columns operated on are inputted as variables instead of literals. It's a major pain point with dplyr that I would like to avoid.

Why not, but then you need to provide a convenient syntax to escape variables that refer to the outer scope (just like in the approach from my last paragraph).

pdeffebach (Collaborator, Author) commented

@nalimilan Just so I understand what you are proposing, let's say we have

x = :mycol
y = 5
@transform(df, :z .+ y .+ x => :a)

This would error: to refer to something that isn't a symbol or string, like the variable y, you would have to escape it with something akin to the current ^(y) syntax.

But the following

x = :mycol
y = 5
@transform(df, :z  .+ x => :a)

Would work because x is the only thing that is being escaped and x is a symbol.

This is okay. It's odd that the escaping rules are complex, but what's most important to me is that you can work with variables that are symbols, so that functions like

function make_index(df, vars::Vector{Symbol}, newname::Symbol)
    return @pipe df |>
        @transform(_, reduce(+, vars) => newname) |>
        @transform(_, (newname .- mean(newname)) ./ std(newname) => newname)
end

would work without worrying too much about escaping.

Notice that in the above example, I don't use anything from any other scope. I don't have a lot of experience in a classroom setting, but I would bet most users of dplyr don't use any scope outside their immediate data frame.

matthieugomez commented

Do you really think these macros will be used inside functions, by the way? Wouldn't people be better off using DataFrames.jl at that point?

pdeffebach (Collaborator, Author) commented

Do you really think these macros will be used inside functions, by the way? Wouldn't people be better off using DataFrames.jl at that point?

I think it's easier to manage a team of RAs if you are able to enforce a single standard for how data cleaning should work. Having two syntaxes makes this hard.

Plus, it's common to write code in global scope and then realize a lot of cleaning should be put in functions. This proposal makes it a lot easier for new users to put code in functions, with no need to rewrite everything.

I would definitely use it. I like piping and the way it reads like a sentence. I would prefer to use this workflow wherever possible.

nalimilan (Member) commented

@pdeffebach Yes, that's what I described based on your original proposal. But I'm not sure how much I like it. It might be a bit too magical.

CameronBieganek commented May 15, 2020

I would also really like to keep the keyword argument syntax. Perhaps they can both be supported?

I don't have too strong an opinion on whether we should use variables like x or symbols like :x to represent columns in the DataFramesMeta macros. However, it seems like if we use variables to represent columns, then we can use expression interpolation instead of the cols wrapper:

julia> var = :y
:y

julia> :(x + $var)
:(x + y)

julia> var = [:y, :z];

julia> :(x + $(var...))
:(x + y + z)

julia> :(x + hypot.($(var...)))
:(x + hypot.(y, z))

julia> k = 5; var = :y;

julia> :($k*x + $var)
:(5x + y)

Seems pretty slick to me. 😁

tbeason commented Jun 13, 2020

I use @with a ton as well. I mostly use it when actually using the data in the DataFrame, not when manipulating it, i.e. result = @with df f(:x, :y)

If there is a better/simpler way to do that, I'd be happy to change my workflow, but I have used lines like that 100+ times in my code.

pdeffebach (Collaborator, Author) commented

@tbeason Thanks for your feedback. @with will definitely stay after this package gets more attention.

I have been rereading this discussion, and I think the best way forward is to try to have a non-breaking release which uses DataFrames's new transform and select where possible. From there it is easier to discuss changes to functionality.

tk3369 commented Jun 19, 2020

I use @linq most of the time. When I first came to DataFramesMeta, it was great, as it allowed me to write very clean code compared to vanilla DataFrames operations. My SQL background grumbled about a few things, though:

  1. column names are symbols, and I had to put colons everywhere
  2. operations must be broadcasted, but I kept forgetting to write that dot

A query that looks like this:

@linq df |>
    where(:A .== 1, :B .== 2, occursin.("XY", :C)) |>
    transform(D = :A .+ :B, E = SubString.(:C, 1, 3)) |>
    select(:A, :C, :E)

...is honestly quite ugly. So, I had wished that I could do:

@linq df |>
    where(A == 1, B == 2, occursin("XY", C)) |>
    transform(D = A + B, E = SubString(C, 1, 3)) |>
    select(A, C, E)

I do realize that it causes an ambiguity when I need to use a real variable in the transformation pipeline. Sure, but I think that's less common. To handle that, we could possibly wrap the variable with syntax like var(x).

I worked with R's dplyr a little bit before. I really like their syntax e.g. filter

If we just use the column names as is, then the whole Symbol vs. String issue goes away, right?

pdeffebach (Collaborator, Author) commented Jun 19, 2020

Thanks for your feedback!

  1. I think transform should keep acting on full columns; otherwise, there would be different behavior between DataFrames and DataFramesMeta. However, @where would probably be an exception, as DataFrames' filter acts on rows (I would be in favor of renaming @where to @filter to make this comparison more explicit).

    Personally, I also appreciate the explicit broadcasting, I think the ambiguity about broadcasting in dplyr causes more mental overhead than I would like.

  2. I am going back and forth on whether to use literals as column names. I am glad to have your feedback. Given that there is strong support for literals as column names, I propose the following, which I think is feasible.

     @transform(df, newvar = var1 + var2)

looks for columns :var1 and :var2 in df and adds them, making a new column :newvar.

    y = "newvar"
    b = "var1"
    c = "var2"
    @transform(df, `y` = `b` + `c`)

looks for the variables in the current scope b and c, which are Symbols or Strings, and then uses those columns to make a new variable called :newvar

     c = ones(nrow(df))
     @transform(df, newvar = var1 + $c)

looks for the column :var1 in df and the variable c in local scope, which is a Vector.

Users familiar with Stata will appreciate the backticks. I think this series of rules is relatively robust.

What do others think?

There are a number of issues to consider when it comes to using literals, however. For example, what about functions?

tbeason commented Jun 19, 2020

I don't understand your middle example. Shouldn't y always be "not decorated"?

After working longer with the newer select-based DataFrames.jl functions, I find you can do much more with them than you could before. Could you clarify what gaps DataFramesMeta.jl fills now or will be expected to fill (besides providing @with 😃)?

pdeffebach (Collaborator, Author) commented

I don't understand your middle example. Shouldn't y always be "not decorated"?

See my example above; the middle example shows that you can programmatically construct a variable with a new name. This is very easy to do in Stata and frustrating to do in dplyr.

The problem with transform and select in DataFrames.jl is something like

@transform(df, z = (:a + :b) / (:c + :d) * sum(:y))

needs

transform(df, [:a, :b, :c, :d, :y] => ((a, b, c, d, y) -> (a + b) / (c + d) * sum(y)) => :z)

which is very verbose!

The goal for DataFramesMeta, aside from @with is to provide a convenient syntax to map expressions to [variables] => fun => newvar calls using DataFrames.transform.

CameronBieganek commented

The backtick and dollar sign notation seems pretty good to me. 👍

tk3369 commented Jun 20, 2020

I agree. I like the proposal above @pdeffebach. It's going to be awesome! :-)

Do I have to use the @transform macro, though (instead of just transform)? Would it be possible to do without it when @linq is used?

nalimilan (Member) commented

I like the proposed use of $, but I'm less happy about using backticks. Maybe keep the current cols(c) syntax, or use cols($c)? In the latter case, $ would then always indicate that you access a variable in the local scope.

tbeason commented Jun 20, 2020

Ah ok that does make much more sense about verbosity. Thanks for the clarity. I like the backticks and the $ for those uses.

Do you have a sense of how this all would compose with some other piping system like Pipe.jl, Lazy.jl, Underscores.jl (which is the one I've been using lately), etc? I find that Underscores.jl works pretty well with the DataFrames.jl select/transform/filter functions (using two underscores to refer to the table in your expression). Can the DataFramesMeta.jl macros play nicely with a different piping solution or will @linq still be needed?

pdeffebach (Collaborator, Author) commented

Are people okay with me closing this?

There are a lot of ideas in this thread, but I think they are mostly addressed in more specific issues, such as x vs :x and using multiple columns in @transform.

A lot of things are already addressed, like referring to columns with string names.

bkamins (Member) commented Mar 7, 2021

I would close such a meta-issue if we do not have a list of things to do. It is better to open concrete issues for separate things, I think.
