-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add unique
macros
#340
Add unique
macros
#340
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
I think we should change the implementation a bit to focus on select
and mirror the internals of DataFrames.jl here.
Also, this means that
@unique(df, [:a, :b])
means something very different from
unique(df, [:a, :b]
that's okay, but this should be documented.
src/macros.jl
Outdated
unique(df, :x => ByRow((x,y) -> x + y)) | ||
``` | ||
|
||
a transformation which cannot be conveniently expressed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can delete this line.
src/macros.jl
Outdated
""" | ||
@unique(x, args) | ||
|
||
Select unique rows in `AbstractDataFrame`s. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also communicate that args
can be transformations?
What happens if multiple arguments are passed? Currently it errors
julia> @unique(df, :x, :b)
ERROR: MethodError: no method matching unique(::DataFrame, ::Symbol, ::Symbol)
Maybe just allow one argument for consistency? Or we need to do some other implementation which uses select
under the hood and allows multiple arguments.
I think that's the best way forward.
src/macros.jl
Outdated
exprs, outer_flags = create_args_vector(args...; wrap_byrow=true) | ||
t = (fun_to_vec(ex; no_dest=true, outer_flags=outer_flags) for ex in exprs) | ||
quote | ||
$DataFrames.unique($x, $(t...)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bkamins Looking at the docs for unique
, the fact that transformations can be passed to unique
is not explicitly documented.
Instead of using unique
, do
df2 = select(df, t...; copycols = false)
rowidxs = (!).(nonunique(df2))
df[rowidxs, :]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cols
in unique
is documented as:
in selected columns or their transformations
so it explicitly mentions that transformations are allowed. But maybe a better wording could be used (this was written a long time ago, when we have a much less precise vocabulary in DataFrames.jl 😄). If you have a proposal out of hand then please comment.
test/unique.jl
Outdated
|
||
@test @unique(df, $:A .+ :B) ≅ unique(df, [:A, :B] => (x, y) -> x .+ y) | ||
@test @unique(df, $:A.+$:B) ≅ unique(df, [:A, :B] => (x, y) -> x .+ y) | ||
@test @unique(df, :A.+$:B) ≅ unique(df, [:A, :B] => (x, y) -> x .+ y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test with string column names?
test/unique.jl
Outdated
df = DataFrame(A=[1, 1, 3, missing], B=[1, 1, 2, 1]) | ||
|
||
d = @unique df begin | ||
:A .+ :B |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You didn't test with multiple arguments ;)
julia> @unique df begin
:x
:b
end
ERROR: MethodError: no method matching unique(::DataFrame, ::Symbol, ::Symbol)
Hi, @pdeffebach. Thank you for your thoughtful feedback! I've addressed these requested changes in 88a779b. In this commit, I implemented the The documentation is updated to describe the difference between Please note that I used |
src/macros.jl
Outdated
t = (fun_to_vec(ex; no_dest=true, outer_flags=outer_flags) for ex in exprs) | ||
tmp = gensym("tmp") | ||
quote | ||
$(tmp) = DataFrames.select($x, $(t...); copycols = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this all into a separate function? That way we can dispatch on ::AbstractDataFrame
and avoid thinking about GroupedDataFrame
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in
function DataFramesMeta.make_unique(df, args...)
...
end
function unique_helper(x, args...)
...
quote
$make_unique($x, $t...)
end
end
src/macros.jl
Outdated
to given combinations of values in selected columns or their transformation. | ||
`args` can be any column selector or transformation accepted by `select`. | ||
|
||
The implementation of `@unique` differs from the `unique` function in DataFrames.jl. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This paragraph is too technical.
Please create a ### Details
section beneath ### Returns
to discuss the implementation. The main thing the user needs to know is that
unique(df, [:a, :b])
is different from @unique(df, [:a, :b])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. Also, I want to note that in the current version, @unique(df, [:a,:b])
will error, while @unique(df, $[:a,:b])
works. Would you like me to adjust the function such that @unique(df, [:a,:b])
works in the same manner as @unique(df, $[:a,:b])
. Or, should I alter the function to throw an error similar to that which is observed via @select(df, [:a,:b])
:
# example
df = DataFrame(x = ones(10), z = rand(0.0:1.0,10), d = [1,1,1,1,1,2,2,2,2,2])
unique(df, [:x, :z, :d])
@unique(df, $[:x, :z, :d])
@unique(df, [:x, :z, :d])
ERROR: ArgumentError: length 3 of vector returned from function #5 is different from number of rows 10 of the source data frame.
select(df, [:x,:z])
@select(df, [:x,:z])
ERROR: LoadError: ArgumentError: Malformed expression in DataFramesMeta.jl macro```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I have the solution.
Follow the implementation of @orderby
and use gensym_names = true
. That will fix the errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to be wrong, but I find that@orderby(df, [:x, :z, :d])
results in the same error as above on v0.12.0
, and that following the implementation of @orderby
doesn't seem to resolve these errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine. If the user understands that it uses select
machinery under the hood, and we communicate that unique(df, [:x, :y, :z])
is fundamentally different than @unique(df, [:x, :y, :z])
then we are fine.
The problem is that
@unique(df, [:x, :y, :z])
creates a three element array and tries to make that a column. If the data frame is a different length, it will error.
The major problem here is that @unique(df, [:x, :y, :z])
will work if the DataFrames has 3 rows...
I think this is okay though. As long as users know what it's doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this deviates enough from DataFrames.unique
to merit a new name? Similar to sort
and @orderby
. Maybe we should call this @distinct
? I hope we don't get sued by rstudio for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the deviation is that you want to allow @distinct(df, :a, :b, :c)
? While @distinct(df, $[:a, :b, :c])
is required as @distinct(df, [:a, :b, :c])
errors? Then I think it would be better to have a new name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
I agree, a new name is probably better here. @MatthewRGonzalez can you change to @distinct
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I'll commit with some updated documentation later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
src/macros.jl
Outdated
and | ||
|
||
``` | ||
@unique df :x + :y |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@unique df :x + :y | |
@unique(df, :x + :y) |
src/macros.jl
Outdated
│ Int64 Int64 | ||
─────┼────────────── | ||
1 │ 1 5 | ||
```` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```` |
src/macros.jl
Outdated
t = (fun_to_vec(ex; no_dest=true, outer_flags=outer_flags) for ex in exprs) | ||
tmp = gensym("tmp") | ||
quote | ||
if isempty($(args)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a separate function where we re-implement DataFrames.unique
also helps reduce code-duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Docs could be re-written a little but we can do that later.
Happy to merge after you make the following small changes.
src/macros.jl
Outdated
of `df` are thus determined by this intermediate data frame. This focus on `select` allows | ||
for multiple arguments to be passed conveniently in the form of column names or transformations. | ||
|
||
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $[:x,:y])` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $[:x,:y])` | |
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $DOLLAR[:x,:y])` |
src/macros.jl
Outdated
of `df` are thus determined by this intermediate data frame. This focus on `select` allows | ||
for multiple arguments to be conveniently passed in the form of column names or transformations. | ||
|
||
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $[:x,:y])` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $[:x,:y])` | |
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $DOLLAR[:x,:y])` |
src/macros.jl
Outdated
end | ||
|
||
""" | ||
rdistinct(d, args...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rdistinct(d, args...) | |
rdistinct!(d, args...) |
src/macros.jl
Outdated
│ Int64 Int64 | ||
─────┼────────────── | ||
1 │ 1 5 | ||
```` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
```` |
JK I figured out how to push to this branch so I will merge when tests pass. Thanks for your hard work on this! |
@pdeffebach It looks like the checks failed due to issues with |
Tests pass locally... which is confusing. If you feel comfortable rebasing and changing the logic of the macro to mirror the new implementations of But rebasing is hard and you've done a lot of work on this so far, so I can take it from here if you need me to. |
Rebasing made easy is:
The point is that if you do it this way you only need to resolve conflicts once and against the final version of the code. This is much easier than sequentially resolving all commits. |
c46e8ea
to
13632ed
Compare
…itional tests Rename to , add helper functions, and update documentation doc fixes small fixes
13632ed
to
940f1e4
Compare
@pdeffebach I followed the advice above to update this PR. The tests pass locally after rebasing and I've updated the macro logic (e04fb01) to mirror the new implementations of |
@pdeffebach can check best the details of the implementation, but I would like to thank you for an excellent and big PR (and tests now also pass CI 😄). |
Thanks! |
This pull request resolves #233 by providing new functionalities to
DataFramesMeta.jl
New Macros
The following macros are added
@unique
and@runique
@unique!
and@runique!
These macros extend the functionality of
DataFrames.unique
andDataFrames.unique!
to theDataFramesMeta.jl
ecosystem. Both general and row-wise versions are included for consistency with similar macro functions.Tests
unique.jl
byrow
inbyrow.jl