Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

silent error when using , in operations #320

Open
vjd opened this issue Jan 9, 2022 · 11 comments
Open

silent error when using , in operations #320

vjd opened this issue Jan 9, 2022 · 11 comments
Labels

Comments

@vjd
Copy link

vjd commented Jan 9, 2022

consider the use of @orderby below. Look at the last example. Somehow, the syntax allows the use of , does not error, but provides an incorrect result.

julia> dd = DataFrame(a = [3, 1, 2], b = [5,4, 6])
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   13      5
   21      4
   32      6

# works correctly
julia> @orderby(dd, :a, :b)
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   11      4
   22      6
   33      5

# works correctly
julia> @orderby dd :a :b
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   11      4
   22      6
   33      5

# works silenty, but incorrect
julia> @orderby dd :a, :b
3×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   13      5
   21      4
   32      6
@pdeffebach
Copy link
Collaborator

Weird, thanks for the issue report. I would have expected the last one to sort the tuples lexicographically, but it does not seem to do that.

@pdeffebach
Copy link
Collaborator

pdeffebach commented Jan 10, 2022

oh wait, it is sorting the tuple lexicographically. But within this it's sorting the arrays lexicographically. This is "right" in the sense that it's behaving as Julia behaves, but it is unintuitive behavior.

MacroTools.@macroexpand(@orderby dd :a, :b) |> MacroTools.prettify
:((DataFramesMeta).orderby(dd, DataFramesMeta.make_source_concrete([:b, :a]) => (((bat, falcon)->(falcon, bat)) => Symbol("##257"))))

with

function orderby(x::SubDataFrame, @nospecialize(args...))
    t = DataFrames.select(x, args...)
    x[sortperm(t), :]
end

so we see

julia> select(dd, DataFramesMeta.make_source_concrete([:b, :a]) => (((bat, falcon)->(falcon, bat)) => Symbol("##257")))
3×1 DataFrame
 Row │ ##257
     │ Tuple…
─────┼────────────────────────
   1 │ ([3, 1, 2], [5, 4, 6])
   2 │ ([3, 1, 2], [5, 4, 6])
   3 │ ([3, 1, 2], [5, 4, 6])

Sorry, i'll finish writing this up tomorrow

@pdeffebach
Copy link
Collaborator

Okay I'm back.

Here's whats going on. The transformation

([:b, :a]) => ((bat, falcon)->(falcon, bat))

creates a tuple of vectors. Because a tuple of vectors is functionally a scalar in DataFrames.jl's transformation language (i.e. not a table or a vector), it gets "spread" across rows, the same way

julia> df = DataFrame(a = [1, 2]);

julia> transform(df, :a => (t -> 1) => :b)
2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1
   2 │     2      1

works.

So it's creating the data frame

3×1 DataFrame
 Row │ ##257
     │ Tuple…
─────┼────────────────────────
   1 │ ([3, 1, 2], [5, 4, 6])
   2 │ ([3, 1, 2], [5, 4, 6])
   3 │ ([3, 1, 2], [5, 4, 6])

and then sorting by the ##257 column. Which means the order doesn't change (or is undefined ?) after the sort happens.

I agree that it's very confusing that @orderby df :a, :b looks so similar to @orderby(df, :a, :b).

One potential solution is to disallow expressions that look like Expr(:tuple, ...) to avoid this problem... but I'm worried that could have unforseen side effects.

Now that you understand what DataFramesMeta.jl is doing behind the scenes, do you have more thoughts on how to approach this issue?

@bkamins
Copy link
Member

bkamins commented Jan 11, 2022

So in essence:

@orderby df :a, :b

is the same as

@orderby df tuple(:a, :b)

due to how Julia parser works.

Right?

@pdeffebach
Copy link
Collaborator

Yes, exactly.

@vjd
Copy link
Author

vjd commented Jan 11, 2022

I'm not sure what the correct solution here would be. Maybe we document the use of tuple in such cases, which is essentially equivalent to writing the syntax with the parenthesis. So maybe just Advocate the use of one common syntax rather than two different ways of writing? Given a choice, to be consistent with dataframes.jl, it is probably easier to write the macros with parentheses and start deprecating the one without. It will be consistent also with the syntax that is used in R where most of the users are coming from

@bkamins
Copy link
Member

bkamins commented Jan 12, 2022

it is probably easier to write the macros with parentheses and start deprecating the one without

This is what I do (i.e. I always use a function call style). But we cannot "deprecate it" - this is an intentional feature of Julia not of DataFramesMeta.jl.
Probably we should highlight to users that there are two styles allowed by Julia and explained them better.

@pdeffebach
Copy link
Collaborator

So maybe just Advocate the use of one common syntax rather than two different ways of writing?

Yes, this doesn't fully solve the problem, since @mymacro a, b is always going to be parsed as a tuple by default. Even if the recommended way is @mymacro(a, b).

it is probably easier to write the macros with parentheses and start deprecating the one without.

The reason we support the non-parentheses version is because of @passmissing and other macro flags. Ironically, for the same problems that we are writing out here.

@rorderby(df, @passmissing :y, :b)

will get parsed as

@rorderby(df, @passmissing(tuple(:y, :b))

the solution to this problem was to allow multiple lines so the macro-flags could get properly parsed

@rorderby df begin 
    @passmissing :y 
    :b 
end

It will be consistent also with the syntax that is used in R where most of the users are coming from

Yes but @rtransform df :a or inside a chain block is closer to Stata, which is also a market-share to appeal to. Plus, I dislike the requirement of parentheses in dplyr, because I think they discourage complicated transformations. I was actually hoping to get rid of all the parentheses in documentation to get people to use the begin ... end version.

I think the way forward is to disallow @rtransform :a, :b so users dont get confused.

@jkrumbiegel
Copy link
Contributor

jkrumbiegel commented Jan 13, 2022

I think the way forward is to disallow @rtransform :a, :b so users dont get confused.

I'm not sure that helps, as you can of course disallow a tuple, but you would also need to disallow macros that work on tuples. This is because @rtransform @byrow something, other_thing would resolve to one macro expression as an argument, and I'm not sure that you don't get into trouble if you also disallow tuples within a macro as well.

One solution would be to disallow multiple arguments except listed in a begin end block, but that feels overly restrictive.

The premise of this whole issue is that there is a silent "error", but of course it's not that, it's just confusing syntax. But given that Julia's macro parsing rules are fixed and you can't discriminate between a call with or without parentheses inside the macro, I think this is just not avoidable.

@bkamins
Copy link
Member

bkamins commented Jan 14, 2022

But given that Julia's macro parsing rules are fixed and you can't discriminate between a call with or without parentheses inside the macro

This was my point above. I think it is best to very clearly explain the options and their consequences in the manual. @pdeffebach - if you prefer I can submit a PR.

@pdeffebach
Copy link
Collaborator

Maybe we need a "Gotcha's" section of the docs? Could be a good opportunity to beef up the docs before 1.0.

@pdeffebach pdeffebach added the 1.0 label Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants