Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

require AbstractVector from subset selectors #2744

Merged
merged 6 commits into from
May 4, 2021
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 10 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,13 @@
# DataFrames v1.0.x Patch Release Notes

* patch v1.0.3: make sure `subset` checks if the passed condition function
returns a vector of values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this if bumping to 1.1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated and added more explanations. Fortunately the change we make will now throw an error so it will be easy to catch.

([#2744](https://github.com/JuliaData/DataFrames.jl/pull/2744))
* patch v1.0.2: fix of performance issue of `groupby` when using multi-threading
([#2736](https://github.com/JuliaData/DataFrames.jl/pull/2736))
* patch v1.0.1: fix of performance issue of `groupby` when using `PooledVector`
([2733](https://github.com/JuliaData/DataFrames.jl/pull/2733))

# DataFrames v1.0 Release Notes

## Breaking changes
Expand Down
2 changes: 1 addition & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
name = "DataFrames"
uuid = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
version = "1.0.2"
version = "1.0.3"

[deps]
Compat = "34da2185-b29b-5c13-b0c7-acf172513d20"
Expand Down
31 changes: 27 additions & 4 deletions src/abstractdataframe/subset.jl
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,21 @@ function _and_missing(x::Any...)
"but only true, false, or missing are allowed"))
end

# we are guaranteed that ByRow returns a vector
# this workaround is needed for 0-argument ByRow
assert_bool_vec(fun::ByRow) = fun

function assert_bool_vec(@nospecialize(fun))
return function(x...)
val = fun(x...)
if !(val isa AbstractVector)
throw(ArgumentError("functions passed to `subset` must return " *
"an AbstractVector."))
bkamins marked this conversation as resolved.
Show resolved Hide resolved
end
return val
end
end

# Note that _get_subset_conditions will have a large compilation time
# if more than 32 conditions are passed as `args`.
function _get_subset_conditions(df::Union{AbstractDataFrame, GroupedDataFrame},
Expand All @@ -39,7 +54,7 @@ function _get_subset_conditions(df::Union{AbstractDataFrame, GroupedDataFrame},
conditions = Any[if a isa ColumnIndex
a => Symbol(:x, i)
elseif a isa Pair{<:Any, <:Base.Callable}
first(a) => last(a) => Symbol(:x, i)
first(a) => assert_bool_vec(last(a)) => Symbol(:x, i)
else
throw(ArgumentError("condition specifier $a is not supported by `subset`"))
end for (i, a) in enumerate(args)]
Expand Down Expand Up @@ -77,7 +92,8 @@ Each argument passed in `args` can be either a single column selector or a
described for [`select`](@ref).

Note that as opposed to [`filter`](@ref) the `subset` function works on whole
columns (or all rows in groups for `GroupedDataFrame`).
columns and always makes a subset of rows (or all rows in groups for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "makes a subset of rows" mean? Also it's not mentioned for subset!.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is intended to mean that selection happens row-wise always (not whole groups) - this was the original reason why this PR is made. In particular this is a crucial difference between subset and filter for GroupedDataFrame (as the latter works per group). Could you please propose a better wording to convey this message?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Then maybe this?

works on whole columns (and selects rows within groups for `GroupedDataFrame`
rather than whole groups).

And I'd mention that the result should be a vector of Boolean in the first paragraph as that's really an essential information.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - added

`GroupedDataFrame`) and must return a vector.

If `skipmissing=false` (the default) `args` are required to produce vectors
containing only `Bool` values. If `skipmissing=true`, additionally `missing` is
Expand Down Expand Up @@ -180,7 +196,11 @@ Each argument passed in `args` can be either a single column selector or a
described for [`select`](@ref).

Note that as opposed to [`filter!`](@ref) the `subset!` function works on whole
columns (or all rows in groups for `GroupedDataFrame`).
columns (or all rows in groups for `GroupedDataFrame`) and must return a vector.
If you want to filter rows of an `AbstractDataFrame` or groups of
`GroupedDataFrame` by a function that returns `Bool` value use [`filter!`](@ref)
or [`filter`](@ref) (note that [`filter!`](@ref) is not supported for
`GroupedDataFrame`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sync this with subset docstring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will sync once we finalize subset docstring.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


If `skipmissing=false` (the default) `args` are required to produce vectors
containing only `Bool` values. If `skipmissing=true`, additionally `missing` is
Expand All @@ -191,7 +211,10 @@ If `ungroup=false` the resulting data frame is re-grouped based on the same
grouping columns as `gdf` and a `GroupedDataFrame` is returned.

If `GroupedDataFrame` is subsetted then it must include all groups present in the
`parent` data frame, like in [`select!`](@ref).
`parent` data frame, like in [`select!`](@ref). Also note that in general
the parent of the passed `GroupedDataFrame` is mutated in place, which means
that the `GroupedDataFrame` is not safe to use as most likely it will
get corrupted due to row removal in the parent.

See also: [`subset`](@ref), [`filter!`](@ref), [`select!`](@ref)

Expand Down
232 changes: 0 additions & 232 deletions test/grouping.jl
Original file line number Diff line number Diff line change
Expand Up @@ -3482,238 +3482,6 @@ end
@test df == df2
end

@testset "subset and subset!" begin
refdf = DataFrame(x = repeat(Any[true, false], 4),
y = repeat([true, false, missing, missing], 2),
z = repeat([1, 2, 3, 3], 2),
id = 1:8)

for df in (copy(refdf), @view copy(refdf)[1:end-1, :])
df2 = copy(df)
@test subset(df, :x) ≅ filter(:x => identity, df)
@test df ≅ df2
@test subset(df, :x) isa DataFrame
@test subset(df, :x, view=true) ≅ filter(:x => identity, df)
@test subset(df, :x, view=true) isa SubDataFrame
@test_throws ArgumentError subset(df, :y)
@test_throws ArgumentError subset(df, :y, :x)
@test subset(df, :y, skipmissing=true) ≅ filter(:y => x -> x === true, df)
@test subset(df, :y, skipmissing=true, view=true) ≅ filter(:y => x -> x === true, df)
@test subset(df, :y, :y, skipmissing=true) ≅ filter(:y => x -> x === true, df)
@test subset(df, :y, :y, skipmissing=true, view=true) ≅ filter(:y => x -> x === true, df)
@test subset(df, :x, :y, skipmissing=true) ≅
filter([:x, :y] => (x, y) -> x && y === true, df)
@test subset(df, :y, :x, skipmissing=true) ≅
filter([:x, :y] => (x, y) -> x && y === true, df)
@test subset(df, :x, :y, skipmissing=true, view=true) ≅
filter([:x, :y] => (x, y) -> x && y === true, df)
@test subset(df, :x, :y, :id => ByRow(<(4)), skipmissing=true) ≅
filter([:x, :y, :id] => (x, y, id) -> x && y === true && id < 4, df)
@test subset(df, :x, :y, :id => ByRow(<(4)), skipmissing=true, view=true) ≅
filter([:x, :y, :id] => (x, y, id) -> x && y === true && id < 4, df)
@test subset(df, :x, :id => ByRow(<(4))) ≅
filter([:x, :id] => (x, id) -> x && id < 4, df)
@test subset(df, :x, :id => ByRow(<(4)), view=true) ≅
filter([:x, :id] => (x, id) -> x && id < 4, df)
@test_throws ArgumentError subset(df)
@test isempty(subset(df, :x, :x => ByRow(!)))
@test_throws ArgumentError subset(df, :x => x -> false, :x => x -> missing)
@test_throws ArgumentError subset(df, :x => x -> true, :x => x -> missing)
@test_throws ArgumentError subset(df, :x => x -> true, :x => x -> 2)
end

for df in (copy(refdf), @view copy(refdf)[1:end-1, :]),
gdf in (groupby_checked(df, :z), groupby_checked(df, :z)[[3, 2, 1]])
df2 = copy(df)
@test subset(gdf, :x) ≅ filter(:x => identity, df)
@test df ≅ df2
@test subset(gdf, :x) isa DataFrame
@test subset(gdf, :x, ungroup=false) ≅
groupby_checked(filter(:x => identity, df), :z)
@test subset(gdf, :x, ungroup=false) isa GroupedDataFrame{DataFrame}
@test subset(gdf, :x, view=true) ≅ filter(:x => identity, df)
@test subset(gdf, :x, view=true) isa SubDataFrame
@test subset(gdf, :x, view=true, ungroup=false) ≅
groupby_checked(filter(:x => identity, df), :z)
@test subset(gdf, :x, view=true, ungroup=false) isa GroupedDataFrame{<:SubDataFrame}
@test_throws ArgumentError subset(gdf, :y)
@test_throws ArgumentError subset(gdf, :y, :x)
@test subset(gdf, :y, skipmissing=true) ≅ filter(:y => x -> x === true, df)
@test subset(gdf, :y, skipmissing=true, view=true) ≅ filter(:y => x -> x === true, df)
@test subset(gdf, :y, :y, skipmissing=true) ≅ filter(:y => x -> x === true, df)
@test subset(gdf, :y, :y, skipmissing=true, view=true) ≅ filter(:y => x -> x === true, df)
@test subset(gdf, :x, :y, skipmissing=true) ≅
filter([:x, :y] => (x, y) -> x && y === true, df)
@test subset(gdf, :y, :x, skipmissing=true) ≅
filter([:x, :y] => (x, y) -> x && y === true, df)
@test subset(gdf, :x, :y, skipmissing=true, view=true) ≅
filter([:x, :y] => (x, y) -> x && y === true, df)
@test subset(gdf, :x, :y, :id => ByRow(<(4)), skipmissing=true) ≅
filter([:x, :y, :id] => (x, y, id) -> x && y === true && id < 4, df)
@test subset(gdf, :x, :y, :id => ByRow(<(4)), skipmissing=true, view=true) ≅
filter([:x, :y, :id] => (x, y, id) -> x && y === true && id < 4, df)
@test subset(gdf, :x, :id => ByRow(<(4))) ≅
filter([:x, :id] => (x, id) -> x && id < 4, df)
@test subset(gdf, :x, :id => ByRow(<(4)), view=true) ≅
filter([:x, :id] => (x, id) -> x && id < 4, df)
@test_throws ArgumentError subset(gdf)
@test isempty(subset(gdf, :x, :x => ByRow(!)))
@test_throws ArgumentError subset(gdf, :x => x -> false, :x => x -> missing)
@test_throws ArgumentError subset(gdf, :x => x -> true, :x => x -> missing)
@test_throws ArgumentError subset(gdf, :x => x -> true, :x => x -> 2)
end

df = copy(refdf)
@test subset!(df, :x) === df
@test subset!(df, :x) ≅ df ≅ filter(:x => identity, refdf)
df = copy(refdf)
@test_throws ArgumentError subset!(df, :y)
@test df ≅ refdf
df = copy(refdf)
@test subset!(df, :y, skipmissing=true) === df
@test subset!(df, :y, skipmissing=true) ≅ df ≅ filter(:y => x -> x === true, refdf)
df = copy(refdf)
@test subset!(df, :x, :y, skipmissing=true) === df
@test subset!(df, :x, :y, skipmissing=true) ≅ df ≅
filter([:x, :y] => (x, y) -> x && y === true, refdf)
df = copy(refdf)
@test subset!(df, :x, :y, :id => ByRow(<(4)), skipmissing=true) ≅ df ≅
filter([:x, :y, :id] => (x, y, id) -> x && y === true && id < 4, refdf)
df = copy(refdf)
@test subset!(df, :x, :id => ByRow(<(4))) ≅ df ≅
filter([:x, :id] => (x, id) -> x && id < 4, refdf)
df = copy(refdf)
@test_throws ArgumentError subset!(df)
df = copy(refdf)
@test isempty(subset!(df, :x, :x => ByRow(!)))
@test isempty(df)

df = copy(refdf)
@test_throws ArgumentError subset!(df, :x => x -> false, :x => x -> missing)
@test_throws ArgumentError subset!(df, :x => x -> true, :x => x -> missing)
@test_throws ArgumentError subset!(df, :x => x -> true, :x => x -> 2)

df = copy(refdf)
gdf = groupby_checked(df, :z)
@test subset!(gdf, :x) === df

df = copy(refdf)
gdf = groupby_checked(df, :z)
gdf2 = subset!(gdf, :x, ungroup=false)
@test gdf2 isa GroupedDataFrame{DataFrame}
@test parent(gdf2) === df
@test gdf2 ≅ groupby_checked(df, :z) ≅ groupby_checked(filter(:x => identity, refdf), :z)

df = copy(refdf)
gdf = groupby_checked(df, :z)
@test subset!(gdf, :x) ≅ df ≅ filter(:x => identity, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test_throws ArgumentError subset!(gdf, :y)
@test df ≅ refdf
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test subset!(gdf, :y, skipmissing=true) === df
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test subset!(gdf, :y, skipmissing=true) ≅ df ≅ filter(:y => x -> x === true, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test subset!(gdf, :x, :y, skipmissing=true) === df
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test subset!(gdf, :x, :y, skipmissing=true) ≅ df ≅
filter([:x, :y] => (x, y) -> x && y === true, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test subset!(gdf, :x, :y, :id => ByRow(<(4)), skipmissing=true) ≅ df ≅
filter([:x, :y, :id] => (x, y, id) -> x && y === true && id < 4, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test subset!(gdf, :x, :id => ByRow(<(4))) ≅ df ≅
filter([:x, :id] => (x, id) -> x && id < 4, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test_throws ArgumentError subset!(gdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test isempty(subset!(gdf, :x, :x => ByRow(!)))
@test isempty(df)
df = copy(refdf)
gdf = groupby_checked(df, :z)
@test_throws ArgumentError subset!(gdf, :x => x -> false, :x => x -> missing)
@test_throws ArgumentError subset!(gdf, :x => x -> true, :x => x -> missing)
@test_throws ArgumentError subset!(gdf, :x => x -> true, :x => x -> 2)

df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test subset!(gdf, :x) ≅ df ≅ filter(:x => identity, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test_throws ArgumentError subset!(gdf, :y)
@test df ≅ refdf
df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test subset!(gdf, :y, skipmissing=true) ≅ df ≅ filter(:y => x -> x === true, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test subset!(gdf, :x, :y, skipmissing=true) ≅ df ≅
filter([:x, :y] => (x, y) -> x && y === true, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test subset!(gdf, :x, :y, :id => ByRow(<(4)), skipmissing=true) ≅ df ≅
filter([:x, :y, :id] => (x, y, id) -> x && y === true && id < 4, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test subset!(gdf, :x, :id => ByRow(<(4))) ≅ df ≅
filter([:x, :id] => (x, id) -> x && id < 4, refdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test_throws ArgumentError subset!(gdf)
df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test isempty(subset!(gdf, :x, :x => ByRow(!)))
@test isempty(df)

df = copy(refdf)
gdf = groupby_checked(df, :z)[[3, 2, 1]]
@test_throws ArgumentError subset!(gdf, :x => x -> false, :x => x -> missing)
@test_throws ArgumentError subset!(gdf, :x => x -> true, :x => x -> missing)
@test_throws ArgumentError subset!(gdf, :x => x -> true, :x => x -> 2)

@test_throws ArgumentError subset!(view(refdf, :, :), :x)
@test_throws ArgumentError subset!(groupby_checked(view(refdf, :, :), :z), :x)

df = DataFrame(g=[2, 2, 1, 1, 1, 1, 3, 3, 3], x = 1:9)
@test subset(df, :x => x -> x .< mean(x)) == DataFrame(g=[2, 2, 1, 1], x = 1:4)
@test subset(groupby_checked(df, :g), :x => x -> x .< mean(x)) ==
DataFrame(g=[2, 1, 1, 3], x=[1, 3, 4, 7])

@test_throws ArgumentError subset(df, :x => x -> missing)
@test isempty(subset(df, :x => x -> missing, skipmissing=true))
@test isempty(subset(df, :x => x -> false))
@test subset(df, :x => x -> true) ≅ df
@test_throws ArgumentError subset(df, :x => x -> (a=x,))
@test_throws ArgumentError subset(df, :x => (x -> (a=x,)) => AsTable)

@test_throws ArgumentError subset(DataFrame(x=false, y=missing), :x, :y)
@test_throws ArgumentError subset(DataFrame(x=missing, y=false), :x, :y)
@test_throws ArgumentError subset(DataFrame(x=missing, y=false), :x)
@test_throws ArgumentError subset(DataFrame(x=false, y=missing), :y)
@test_throws ArgumentError subset(DataFrame(x=false, y=1), :x, :y)
@test_throws ArgumentError subset(DataFrame(x=1, y=false), :x, :y)
@test_throws ArgumentError subset(DataFrame(x=1, y=false), :y, :x)
@test_throws ArgumentError subset(DataFrame(x=false, y=1), :y)

@test_throws ArgumentError subset(DataFrame(x=false, y=1), :x, :y, skipmissing=true)
@test_throws ArgumentError subset(DataFrame(x=1, y=false), :x, :y, skipmissing=true)
@test_throws ArgumentError subset(DataFrame(x=1, y=false), :y, :x, skipmissing=true)
@test_throws ArgumentError subset(DataFrame(x=false, y=1), :y, skipmissing=true)

@test_throws ArgumentError DataFrames._and()
@test_throws ArgumentError DataFrames._and_missing()
end

@testset "make sure we handle idx correctly when groups are reordered" begin
df = DataFrame(g=[2, 2, 1, 1, 1], id = 1:5)
@test select(df, :g, :id, :id => ByRow(identity) => :id2) ==
Expand Down
1 change: 1 addition & 0 deletions test/runtests.jl
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ my_tests = ["utils.jl",
"conversions.jl",
"sort.jl",
"grouping.jl",
"subset.jl",
"join.jl",
"iteration.jl",
"duplicates.jl",
Expand Down
Loading