Skip to content

Conversation

@bkamins
Copy link
Member

@bkamins bkamins commented Mar 30, 2022

Fixes #2740

@bkamins bkamins added this to the 1.4 milestone Mar 30, 2022
@bkamins bkamins requested review from nalimilan and pdeffebach March 30, 2022 20:40
@bkamins
Copy link
Member Author

bkamins commented Mar 30, 2022

I have thought about this PR.
Maybe the suggestion to accept only vectors for AbstractDataFrame input, but allow scalars for GroupedDataFrame input is not that bad in the end?
(this is not what I currently implemented)

Rationale:

  1. we can explain in the docstring the difference and rationale.
  2. for AbstractDataFrame the "wrong" condition can either keep all or drop all rows - neither of these operations seem useful.
  3. we can always later change the behavior of AbstractDataFrame to match GroupedDataFrame (i.e. allow scalars) if users complain, but I assume that in practice no one will want a scalar with AbstractDataFrame.

What do you think?

@bkamins
Copy link
Member Author

bkamins commented Mar 31, 2022

After having slept over the issue I came to the conclusion that the behavior "disallow vector for AbstractDataFrame" is desirable. I will write appropriate explanations and warning message prompting users to open an issue if they would want this behavior to be changed. I will ping you here when the PR update is done.

@bkamins
Copy link
Member Author

bkamins commented Apr 6, 2022

bump (from my perspective this PR is finished 😄). I have now rebased it and added some small cleanups in the documentaion.

Comment on lines +139 to +149
# we special case 0-length cond, as in this case broadcasting does not
# guarantee setting a proper eltype for the result
if isempty(cond)
if eltype(cond) !== Bool
throw(ArgumentError("passed conditions produce $(eltype(cond)) " *
"as element type of the result while only " *
"Bool is allowed."))
end
else
@assert eltype(cond) === Bool
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. Contrary to what the comment says, the check is dependent on the eltype even for empty vectors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When cond is non-empty then broadcasting mechanism guarantees that in line 148 must hold. That is why we have @assert there.

Conversely, if cond is empty broadcasting lets incorrect element type slip through:

julia> subset(DataFrame(), [] => ByRow(() -> "aaa"))
ERROR: ArgumentError: passed conditions produce Union{} as element type of the result while only Bool is allowed.

As you can see here although the condition returns String it is not correctly identified as incorrect by _and.(cols...) call because the output vector is empty.

Is it clear now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But won't that error also be thrown when cond is empty and broadcast uses eltype Any because inference fails?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think it is possible. Can you think of an example? Here are things that I have checked:

julia> DataFrames._and.([]) # supertype
0-element BitVector

julia> DataFrames._and.(Number[]) # supertype
0-element BitVector

julia> DataFrames._and.(Union{Bool, Missing}[]) # supertype
0-element BitVector

julia> DataFrames._and.(Int[]) # empty intersection
Union{}[]

julia> DataFrames._and.(Char[]) # empty intersection
Union{}[]

julia> DataFrames._and.(Union{Int, Missing}[]) # empty intersection
Union{}[]

and my rule is the best we can count on AFAICT.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how it's possible that inferences guesses that the result is Bool in so difficult cases. In theory there's no guaranty that the return type is inferred, but since I can't come up with an example...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The empty container is not likely in practice, so I prefer to be on the safe side.

bkamins and others added 2 commits April 7, 2022 17:20
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@matthieugomez
Copy link
Contributor

I think it’s the best solution too. Thanks!

@bkamins
Copy link
Member Author

bkamins commented Apr 21, 2022

@nalimilan - could you please have a look at this PR?

@bkamins bkamins merged commit c1c5b14 into main Apr 21, 2022
@bkamins bkamins deleted the bk/subset_flexible branch April 21, 2022 21:57
@bkamins
Copy link
Member Author

bkamins commented Apr 21, 2022

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Potential confusion with subset

4 participants