Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unique macros #340

Merged
merged 3 commits into from
Oct 28, 2022
Merged

Conversation

MatthewRGonzalez
Copy link
Contributor

This pull request resolves #233 by providing new functionalities to DataFramesMeta.jl

New Macros

The following macros are added

  • @unique and @runique
  • @unique! and @runique!

These macros extend the functionality of DataFrames.unique and DataFrames.unique! to the DataFramesMeta.jl ecosystem. Both general and row-wise versions are included for consistency with similar macro functions.

Tests

  • General tests are added in unique.jl
  • Tested with byrow in byrow.jl

Copy link
Collaborator

@pdeffebach pdeffebach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I think we should change the implementation a bit to focus on select and mirror the internals of DataFrames.jl here.

Also, this means that

@unique(df, [:a, :b])

means something very different from

unique(df, [:a, :b]

that's okay, but this should be documented.

src/macros.jl Outdated
unique(df, :x => ByRow((x,y) -> x + y))
```

a transformation which cannot be conveniently expressed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can delete this line.

src/macros.jl Outdated
"""
@unique(x, args)

Select unique rows in `AbstractDataFrame`s.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also communicate that args can be transformations?

What happens if multiple arguments are passed? Currently it errors

julia> @unique(df, :x, :b)
ERROR: MethodError: no method matching unique(::DataFrame, ::Symbol, ::Symbol)

Maybe just allow one argument for consistency? Or we need to do some other implementation which uses select under the hood and allows multiple arguments.

I think that's the best way forward.

src/macros.jl Outdated
exprs, outer_flags = create_args_vector(args...; wrap_byrow=true)
t = (fun_to_vec(ex; no_dest=true, outer_flags=outer_flags) for ex in exprs)
quote
$DataFrames.unique($x, $(t...))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkamins Looking at the docs for unique, the fact that transformations can be passed to unique is not explicitly documented.
Instead of using unique, do

df2 = select(df, t...; copycols = false)
rowidxs = (!).(nonunique(df2))
df[rowidxs, :]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cols in unique is documented as:

in selected columns or their transformations

so it explicitly mentions that transformations are allowed. But maybe a better wording could be used (this was written a long time ago, when we have a much less precise vocabulary in DataFrames.jl 😄). If you have a proposal out of hand then please comment.

test/unique.jl Outdated

@test @unique(df, $:A .+ :B) ≅ unique(df, [:A, :B] => (x, y) -> x .+ y)
@test @unique(df, $:A.+$:B) ≅ unique(df, [:A, :B] => (x, y) -> x .+ y)
@test @unique(df, :A.+$:B) ≅ unique(df, [:A, :B] => (x, y) -> x .+ y)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you test with string column names?

test/unique.jl Outdated
df = DataFrame(A=[1, 1, 3, missing], B=[1, 1, 2, 1])

d = @unique df begin
:A .+ :B
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You didn't test with multiple arguments ;)

julia> @unique df begin 
           :x
           :b
       end
ERROR: MethodError: no method matching unique(::DataFrame, ::Symbol, ::Symbol)

@MatthewRGonzalez
Copy link
Contributor Author

Hi, @pdeffebach. Thank you for your thoughtful feedback!

I've addressed these requested changes in 88a779b. In this commit, I implemented the select method you described above and also added a condition to handle cases where t... is not present. This additional condition allows @unique(df) to operate in the same manner as unique(df)

The documentation is updated to describe the difference between @unique and unique, and also describes that arguments can be transformations. Tests are also updated to include multiple arguments and string column names.

Please note that I used gensym in the helper functions in an attempt to keep the macros hygienic. If there's a more appropriate or concise method that you'd prefer, please let me know!

src/macros.jl Outdated
t = (fun_to_vec(ex; no_dest=true, outer_flags=outer_flags) for ex in exprs)
tmp = gensym("tmp")
quote
$(tmp) = DataFrames.select($x, $(t...); copycols = false)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put this all into a separate function? That way we can dispatch on ::AbstractDataFrame and avoid thinking about GroupedDataFrames

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in

function DataFramesMeta.make_unique(df, args...)
    ...
end

function unique_helper(x, args...)
    ...
    quote
        $make_unique($x, $t...)
    end
end

src/macros.jl Outdated
to given combinations of values in selected columns or their transformation.
`args` can be any column selector or transformation accepted by `select`.

The implementation of `@unique` differs from the `unique` function in DataFrames.jl.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph is too technical.

Please create a ### Details section beneath ### Returns to discuss the implementation. The main thing the user needs to know is that

unique(df, [:a, :b])

is different from @unique(df, [:a, :b])

Copy link
Contributor Author

@MatthewRGonzalez MatthewRGonzalez Oct 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem. Also, I want to note that in the current version, @unique(df, [:a,:b]) will error, while @unique(df, $[:a,:b]) works. Would you like me to adjust the function such that @unique(df, [:a,:b]) works in the same manner as @unique(df, $[:a,:b]). Or, should I alter the function to throw an error similar to that which is observed via @select(df, [:a,:b]):

# example

df = DataFrame(x = ones(10), z = rand(0.0:1.0,10), d =  [1,1,1,1,1,2,2,2,2,2])

unique(df, [:x, :z, :d]) 
@unique(df, $[:x, :z, :d]) 
@unique(df, [:x, :z, :d]) 
ERROR: ArgumentError: length 3 of vector returned from function #5 is different from number of rows 10 of the source data frame.

select(df, [:x,:z])
@select(df, [:x,:z])
ERROR: LoadError: ArgumentError: Malformed expression in DataFramesMeta.jl macro```


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I have the solution.

Follow the implementation of @orderby and use gensym_names = true. That will fix the errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to be wrong, but I find that@orderby(df, [:x, :z, :d]) results in the same error as above on v0.12.0, and that following the implementation of @orderby doesn't seem to resolve these errors.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine. If the user understands that it uses select machinery under the hood, and we communicate that unique(df, [:x, :y, :z]) is fundamentally different than @unique(df, [:x, :y, :z]) then we are fine.

The problem is that

@unique(df, [:x, :y, :z])

creates a three element array and tries to make that a column. If the data frame is a different length, it will error.

The major problem here is that @unique(df, [:x, :y, :z]) will work if the DataFrames has 3 rows...

I think this is okay though. As long as users know what it's doing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this deviates enough from DataFrames.unique to merit a new name? Similar to sort and @orderby. Maybe we should call this @distinct? I hope we don't get sued by rstudio for that.

cc @bkamins @nalimilan

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the deviation is that you want to allow @distinct(df, :a, :b, :c)? While @distinct(df, $[:a, :b, :c]) is required as @distinct(df, [:a, :b, :c]) errors? Then I think it would be better to have a new name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

I agree, a new name is probably better here. @MatthewRGonzalez can you change to @distinct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'll commit with some updated documentation later on.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

src/macros.jl Outdated
and

```
@unique df :x + :y
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
@unique df :x + :y
@unique(df, :x + :y)

src/macros.jl Outdated
│ Int64 Int64
─────┼──────────────
1 │ 1 5
````
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
````

src/macros.jl Outdated
t = (fun_to_vec(ex; no_dest=true, outer_flags=outer_flags) for ex in exprs)
tmp = gensym("tmp")
quote
if isempty($(args))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a separate function where we re-implement DataFrames.unique also helps reduce code-duplication.

Copy link
Collaborator

@pdeffebach pdeffebach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Docs could be re-written a little but we can do that later.

Happy to merge after you make the following small changes.

src/macros.jl Outdated
of `df` are thus determined by this intermediate data frame. This focus on `select` allows
for multiple arguments to be passed conveniently in the form of column names or transformations.

Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $[:x,:y])`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $[:x,:y])`
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $DOLLAR[:x,:y])`

src/macros.jl Outdated
of `df` are thus determined by this intermediate data frame. This focus on `select` allows
for multiple arguments to be conveniently passed in the form of column names or transformations.

Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $[:x,:y])`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $[:x,:y])`
Users should be cautious when passing function arguments as vectors. E.g., `@distinct(df, $DOLLAR[:x,:y])`

src/macros.jl Outdated
end

"""
rdistinct(d, args...)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rdistinct(d, args...)
rdistinct!(d, args...)

src/macros.jl Outdated
│ Int64 Int64
─────┼──────────────
1 │ 1 5
````
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
````

@pdeffebach
Copy link
Collaborator

JK I figured out how to push to this branch so I will merge when tests pass.

Thanks for your hard work on this!

@MatthewRGonzalez
Copy link
Contributor Author

MatthewRGonzalez commented Oct 27, 2022

@pdeffebach It looks like the checks failed due to issues with create_args_vector(). I'm thinking this is due to my use of create_args_vector() instead of get_df_args_kwargs() introduced in #323. I'll resolve this soon. Sorry for not catching this sooner!

@pdeffebach
Copy link
Collaborator

Tests pass locally... which is confusing.

If you feel comfortable rebasing and changing the logic of the macro to mirror the new implementations of @select etc. that might help.

But rebasing is hard and you've done a lot of work on this so far, so I can take it from here if you need me to.

@bkamins
Copy link
Member

bkamins commented Oct 27, 2022

But rebasing is hard and you've done a lot of work on this so far, so I can take it from here if you need me to.

Rebasing made easy is:

  1. on your local git squash all commits in the PR into a single commit
  2. rebase

The point is that if you do it this way you only need to resolve conflicts once and against the final version of the code. This is much easier than sequentially resolving all commits.

Matthew Gonzalez added 2 commits October 27, 2022 12:31
…itional tests

Rename to , add helper functions, and update documentation

doc fixes

small fixes
@MatthewRGonzalez
Copy link
Contributor Author

@pdeffebach I followed the advice above to update this PR. The tests pass locally after rebasing and I've updated the macro logic (e04fb01) to mirror the new implementations of @select etc. I apologize for the additional clutter. Please let me know if this is what you had in mind!

@bkamins
Copy link
Member

bkamins commented Oct 27, 2022

@pdeffebach can check best the details of the implementation, but I would like to thank you for an excellent and big PR (and tests now also pass CI 😄).

@pdeffebach pdeffebach merged commit 6739447 into JuliaData:master Oct 28, 2022
@pdeffebach
Copy link
Collaborator

Thanks!

@MatthewRGonzalez MatthewRGonzalez deleted the UniqueFeature branch October 28, 2022 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a unique macro
3 participants