Plan for DataFrames 1.0 #148
Comments
It is excellent that you would be willing to work on this! I will think about your proposal over the weekend and let you know. |
I use |
Thanks for the offer! I'm fine with dropping I don't really understand what
Note that we already have
What do you mean? What kind of objects are escaped currently that wouldn't anymore? Regarding keyword arguments, dropping that syntax would be quite disruptive. Maybe better keep supporting it, as it's relatively natural to write Cc: @piever |
I just needed an example of a function that would take in a
I forgot about
I don't know what I would do with
could get turned into
This is already very close to the current implementation of I want to emphasize that my vision of the future of DataFramesMeta is simply to make it slightly easier to construct |
Given that
works, it seems natural to try and make the above escaping rules work with variables. |
Thanks @pdeffebach for improving this package. I agree with @nalimilan that it would be better to keep kwargs. It may be worth thinking about defining only one macro that combines piping with transforming the second argument of a combine/select/transform call. That would be much cleaner than defining a macro for every verb IMO. One last thing: is it too late to use x rather than :x to refer to a column name? |
That might be a good idea, but could also make things more confusing since
I am against this. If anything, I would like to do the opposite, where if
The behavior you describe would make it very difficult to use these macros inside of functions where the columns operated on are inputted as variables instead of literals. It's a major pain point with dplyr that I would like to avoid. |
In this case, one could do |
in any case it’s true that |
Yes, this is the current behavior, and it is likely to stay since my proposal for more "normal" scoping rules might be very hard to implement. It would involve evaluating everything in the expression to see if it's a symbol.
Yes. I would really like to get that to work, and perhaps even deprecate |
Got it. Probably regexes shouldn't be interpreted as referring to columns by default, so you'd have to write I actually agree with @matthieugomez that it would be nice to treat any variable as referring to a column. In particular that would avoid problems when you need to use an actual symbol, which is quite common to pass options to functions. As it is relatively rare to refer to a variable from the outer scope, having it escape it sounds OK. But of course changing this would be quite disruptive.
Why not, but then you need to provide a convenient syntax to escape variables that refer to the outer scope (just like in the approach from my last paragraph). |
@nalimilan Just so I understand what you are proposing, let's say we have
This would error because if we wanted to refer to something that wasn't a symbol or string in the variable But the following
Would work because This is okay. It's odd that escaping rules are complex, but what's most important to me is that you can work with variables that are symbols so functions like
would work without worrying too much about escaping. Notice that in the above example, I don't use anything from any other scope. I don't have a lot of experience in a classroom setting but I would bet most users of |
Do you really think these macros will be used inside functions btw? Would not people be better off using DataFrames.jl at that point? |
I think it's easier to manage a team of RAs if you are able to enforce a single standard for how data cleaning should work. Having two syntaxes makes this hard. Plus, it's common to write code in global scope and then realize that a lot of the cleaning should be put in functions. This proposal makes it a lot easier for new users to put code in functions, with no need to rewrite everything. I would definitely use it. I like piping and the way it reads like a sentence. I would prefer to use this workflow wherever possible. |
@pdeffebach Yes, that's what I described based on your original proposal. But I'm not sure how much I like it. It might be a bit too magical. |
I would also really like to keep the keyword argument syntax. Perhaps they can both be supported? I don't have too strong an opinion on whether we should use variables like
julia> var = :y
:y
julia> :(x + $var)
:(x + y)
julia> var = [:y, :z];
julia> :(x + $(var...))
:(x + y + z)
julia> :(x + hypot.($(var...)))
:(x + hypot.(y, z))
julia> k = 5; var = :y;
julia> :($k*x + $var)
:(5x + y)
Seems pretty slick to me. 😁 |
I use If there is a better/simpler way to do that, I'd be happy to change my workflow, but I have used lines like that 100+ times in my code. |
@tbeason Thanks for your feedback. I have been reading up on this discussion and I think the best way forward is to try and have a non-breaking release which uses DataFrames's new |
I use
A query that looks like this:
@linq df |>
    where(:A .== 1, :B .== 2, occursin.("XY", :C)) |>
    transform(D = :A .+ :B, E = SubString.(:C, 1, 3)) |>
    select(:A, :C, :E)
...is honestly quite ugly. So, I had wished that I could do:
@linq df |>
    where(A == 1, B == 2, occursin("XY", C)) |>
    transform(D = A + B, E = SubString(C, 1, 3)) |>
    select(A, C, E)
I do realize that it causes an ambiguity when I need to use a real variable in the transformation pipeline. Sure, but I think that's less common. To handle that, we could possibly wrap the variable with syntax like
I worked with R's dplyr a little bit before. I really like their syntax e.g. filter
If we just use the column names as is, then the whole Symbol vs. String issue goes away, right? |
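To make the trade-off concrete, here is a minimal sketch in plain Julia of how bare names could be rewritten into column lookups while `$name` stays an ordinary variable from the enclosing scope. The names `columnize` and `@cols` are hypothetical and are not part of DataFramesMeta, and a NamedTuple of vectors stands in for a data frame. It only handles plain calls and dot calls; comparison chains and keyword arguments would need extra care.

```julia
# Hypothetical helper (not a DataFramesMeta API): rewrite bare names in `ex`
# into column lookups on `tbl`; `$name` is left as a normal variable.
const INTERP_HEAD = Symbol("\$")

function columnize(ex, tbl::Symbol)
    rec(a) = columnize(a, tbl)
    if ex isa Symbol
        # A bare name such as `A` becomes `getproperty(tbl, :A)`.
        return Expr(:call, :getproperty, tbl, QuoteNode(ex))
    elseif !(ex isa Expr)
        return ex                          # literals such as 1 or "XY"
    elseif ex.head === INTERP_HEAD
        return ex.args[1]                  # `$k` -> `k` from the caller's scope
    elseif ex.head === :call
        # Leave the function or operator name alone; rewrite only the arguments.
        return Expr(:call, ex.args[1], map(rec, ex.args[2:end])...)
    elseif ex.head === :. && length(ex.args) == 2 && Meta.isexpr(ex.args[2], :tuple)
        # Dot call such as `occursin.("XY", C)`: keep the function name as is.
        return Expr(:., ex.args[1], Expr(:tuple, map(rec, ex.args[2].args)...))
    else
        return Expr(ex.head, map(rec, ex.args)...)
    end
end

# Hypothetical macro: `tbl` must be a plain variable name.
macro cols(tbl, ex)
    esc(columnize(ex, tbl))
end

tbl = (A = [1, 2, 3], B = [10, 20, 30], C = ["XYZ", "ABC", "XY1"])
k = 5
@cols(tbl, A .+ B .* $k)          # columns A and B, outer variable k
@cols(tbl, occursin.("XY", C))    # function name is not treated as a column
```

The sketch also shows the cost discussed in this thread: every bare name that should not be a column has to be marked explicitly.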
Thanks for your feedback!
looks for columns
looks for the variables in the current scope
looks for the column Users familiar with Stata will appreciate the backticks. I think this series of rules is relatively robust. What do others think? There are a number of issues to consider when it comes to using literals, however. For example, what about functions? |
I don't understand your middle example. Shouldn't After working longer with the newer |
See my example above: the middle example shows that you can programmatically construct a variable with a new name. This is very easy to do in Stata and frustrating to do in The problem with
needs
which is very verbose! The goal for DataFramesMeta, aside from |
The backtick and dollar sign notation seems pretty good to me. 👍 |
I agree. I like the proposal above @pdeffebach. It's going to be awesome! :-) Do I have to use |
I like the proposed use of |
Ah ok that does make much more sense about verbosity. Thanks for the clarity. I like the backticks and the Do you have a sense of how this all would compose with some other piping system like Pipe.jl, Lazy.jl, Underscores.jl (which is the one I've been using lately), etc? I find that Underscores.jl works pretty well with the DataFrames.jl |
Are people okay with me closing this? There are a lot of ideas in this thread, but I think they are mostly addressed in more specific issues, such as A lot of things are already addressed, like referring to columns with |
I would close such meta-issue if we do not have a list of things to do. It is better to open concrete issues for separate things I think. |
@bkamins @nalimilan
I am tentatively offering to take over maintenance of DataFramesMeta after it reaches 1.0. I think this will be feasible for me because with the new features in data frames we can really cut down the LOC in DataFramesMeta.
First, all work will be performed in one macro. This macro will turn
into
Where r^"a" and Between(:x2, :x5) are vectors of vectors. This is currently disallowed in DataFrames (and I may have even argued against it at some point) but now seems intuitive.
EDIT: The term asArgs(...) has come up before. Maybe this is a good use-case for that.
This means that a workflow would be like
The only code that really gets executed by DataFramesMeta is working with the first part of the pair.
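As a rough illustration of "working with the first part of the pair", the sketch below collects the quoted symbols in an expression and builds a `[columns] => function` pair from them. The names `extract_columns!` and `@colfun` are hypothetical and this is not the package's implementation; it is plain Julia with no dependencies. Note that it treats every quoted symbol as a column, which is exactly the ambiguity with "real" symbols raised elsewhere in this thread.

```julia
# Hypothetical helper: replace each quoted symbol (e.g. :A) with a generated
# argument name, recording which columns were seen and in what order.
function extract_columns!(ex, cols::Vector{Symbol}, args::Vector{Symbol})
    if ex isa QuoteNode && ex.value isa Symbol
        i = findfirst(==(ex.value), cols)
        if i === nothing
            push!(cols, ex.value)
            push!(args, gensym(ex.value))
            i = length(cols)
        end
        return args[i]
    elseif ex isa Expr
        return Expr(ex.head, map(a -> extract_columns!(a, cols, args), ex.args)...)
    else
        return ex
    end
end

# Hypothetical macro: `@colfun(:A .+ :B)` builds `[:A, :B] => (a, b) -> a .+ b`.
macro colfun(ex)
    cols, args = Symbol[], Symbol[]
    body = extract_columns!(ex, cols, args)
    fun = Expr(:->, Expr(:tuple, args...), body)
    colsexpr = Expr(:vect, map(QuoteNode, cols)...)
    :($colsexpr => $(esc(fun)))
end

p = @colfun(:A .+ :B)
first(p)                    # the source columns
last(p)([1, 2], [3, 4])     # the generated function applied to two vectors
```

A pair of this shape is what would then be handed to a DataFrames verb, so the macro itself never has to touch the data.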
Perhaps keyword arguments could be added, but people seem happy with the Pairs syntax and it would be nice if we mirrored DataFrames closely.
Like DataFrames, operating on columns will be the default. Maybe we can introduce a @byrow macro which just takes an expression and wraps it in ByRow(...).
Unless there are large performance benefits, I think we no longer need to support the @linq pipeline. My impression is that the closure that @linq creates no longer helps us. Rather, we can re-export Pipe.jl or Lazy.jl's @>.
Escaping rules are changed so that escaping symbols (and only symbols) is the default. This mirrors DataFrames more closely and, selfishly, it's easier to put @transform etc. in functions. Perhaps this is really hard to do, but I would like to explore it so that working without literals is just as easy as in base data frames.
@with could still work. But I don't know if people use it, since escaping gets complicated very quickly.
Deprecate @based_on, @by. These are now @combine
Add some features in DataFrames to make up for @where(gd::GroupedDataFrame, ...) (filter) and @orderby (sort).
I think that this will result in a DataFramesMeta that both mirrors DataFrames closely and is easier to maintain.
Let me know what you think of this proposal.
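For the @byrow idea, a minimal sketch of what "takes an expression and wraps it" could look like is given here. To keep it runnable without any packages, `RowWise` is a hypothetical stand-in for DataFrames' ByRow, and this `@byrow` is only an illustration, not an existing macro.

```julia
# Stand-in for DataFrames.ByRow so the sketch runs without any packages.
struct RowWise{F}
    fun::F
end

# Calling the wrapper applies the function element by element across the columns.
(r::RowWise)(cols...) = map(r.fun, cols...)

# Hypothetical @byrow: wrap the function expression that follows in RowWise(...).
macro byrow(ex)
    :(RowWise($(esc(ex))))
end

add_one = @byrow x -> x + 1
add_one([1, 2, 3])

times = @byrow((x, y) -> x * y)
times([1, 2], [3, 4])
```

With the real ByRow, the wrapped function would be used as the middle part of a `source => function => destination` pair.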