Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BREAKING] deprecate DataFrame constructors #2464

Merged
merged 21 commits into from
Nov 6, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Oct 1, 2020

Fixes #2433. Before I change tests and documentation please review if we agree with what is proposed to get deprecated. Thank you.

@bkamins bkamins added the breaking The proposed change is breaking. label Oct 1, 2020
@bkamins bkamins added this to the 1.0 milestone Oct 1, 2020
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/deprecated.jl Outdated Show resolved Hide resolved
src/deprecated.jl Show resolved Hide resolved
src/deprecated.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
@bkamins bkamins mentioned this pull request Oct 2, 2020
3 tasks
@matthieugomez
Copy link
Contributor

matthieugomez commented Oct 3, 2020 via email

@bkamins
Copy link
Member Author

bkamins commented Oct 6, 2020

OK - so the decision is (unless I grossly misunderstood something 😄): we keep the current proposal (including deprecating eltype constructors).

I will wait for 1-2 days in case more comments come in and then update the whole package not to use deprecated functionality (mainly tests).

@bkamins
Copy link
Member Author

bkamins commented Oct 8, 2020

It should be good for a review.

@bkamins
Copy link
Member Author

bkamins commented Oct 8, 2020

@rofinn - can you please check if with this PR your use case works as expected?

@rofinn
Copy link
Member

rofinn commented Oct 8, 2020

Yep, looks good to me. I only really use the tables or pairs constructors anyways. 👍

@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2020

Last commit follows the discussion in JuliaData/Tables.jl#204 and fixes embarrassing cases we had:

julia> DataFrame(Pair[])
1×1 DataFrame
│ Row │ makeunique │
│     │ Bool       │
├─────┼────────────┤
│ 1   │ 0          │

julia> DataFrame(Any["a" => [1]])
1×2 DataFrame
│ Row │ first  │ second │
│     │ String │ Array… │
├─────┼────────┼────────┤
│ 1   │ a      │ [1]    │

@bkamins
Copy link
Member Author

bkamins commented Oct 11, 2020

This should be good for a final check before merging.

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/deprecated.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
if x isa AbstractVector && all(col -> isa(col, AbstractVector), x)
if !Tables.istable(x) && x isa AbstractVector && !isempty(x)
# here we handle the cases that are accepted by standard DataFrame constructors
# but eltype(x) is more flexible than assumed in these methods
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand this comment. Do you mean that here we handle cases where eltype(x) doesn't match what other (more standard) constructors accept?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. What we handle is essentially Any[...] case (or some weird Union{...}[...]) as it will not be handled by standard constructors.

It was not very clear earlier and this is essentially what this code path is for only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I think what confused me is "that are accepted by standard DataFrame constructors", since these are not really accepted, or they wouldn't have to be dealt with in this method. :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you - as usual - propose a better wording (or what we have now is OK?). Thank you!

Comment on lines 39 to 42
@test (df .+ df) ./ 2 !== df
@test df .+ Matrix(df) == 2 .* df
@test Matrix(df) .+ df == 2 .* df
@test (Matrix(df) .+ df .== 2 .* df) == DataFrame(trues(size(df)), names(df))
@test (Matrix(df) .+ df .== 2 .* df) == DataFrame(Tables.table(trues(size(df)), header=names(df)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks at these examples, it feels unfortunate to lose the symmetry between Matrix(df) and DataFrame(mat), especially given the fact that broadcasting between data frames and matrices is allowed.

I think it makes sense to have some asymmetry, since Matrix(df) can just drop names, but DataFrame(mat) needs to generate names: so requiring DataFrame(mat, colnames) isn't illegitimate. But DataFrame(Tables.table(df, header=colnames)) really breaks that symmetry (and it's really verbose as we see here). Maybe we could keep DataFrame(mat, colnames), since in that case there's no ambiguity (mat is treated as a matrix, not as a Table.jl object).

(Sorry for suggesting this now. I can adjust the tests if you like.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with this (I can fix the tests - it should be quick this way as we have Tables.table in all places we need to fix 😄).

@oxinabox + @pdeffebach : can you comment here please?

The proposal is to have DataFrame(::AbstractMatrix, ::Union{AbstractVector{Symbol}, AbstractVector{<:AbstractString}) signature (with no default for the second positional argument, so it would be required).

Then there would be no ambiguity with DataFrame(table) constructor that can handle matrices as tables.

Copy link
Contributor

@oxinabox oxinabox Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once again, I stand by my claim that every extra constructor adds confusion.
For this case i think nicer than using Tables.jl is:

DataFrame(names .=> eachcol(mat))

that is a nice clear explict statement about what is going on, and is I think clearer than
DataFrame(mat, names) since we are explictly stating that we are going to use the columns one per name.
Also naturally extends to:
DataFrame(names .=> vec.(eachrow(mat))
for constructing dataframes based on the rows of the matrix.
Particularly relevent as sometimes we do observations as columns, e.g. Distances.jl, the Models.jl interface, some others.

Constructing DataFrames from a matrix is in general a weird thing to do outside of toy examples
because a typical dataframe has a mixture of different types of columns,
and a typical matrix does not.
I don't think we should go out of our way to encourage people writing unrealistic toy examples, by making it easier

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nalimilan - @oxinabox has convinced me. Writing DataFrame(names .=> eachcol(mat)) is indeed clean. We have to use Tables.table for now because eachcol is not supported in Julia 1.0, but once we move the support to new Julia LTS we can clean it up.

Copy link
Member

@nalimilan nalimilan Oct 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I find problematic is that by supporting Matrix(df) and df .+ mat, we consider that data frames and matrices are objects with compatible shapes. This alone means that we exclude the view of observations as columns (which BTW is the exception in the ecosystem, even in Distances you now have to explicitly give dims). Yet DataFrame(names .=> eachcol(mat)) conceptually requires you to decompose the matrix into columns to create a data frame. The notion that shapes are compatible is completely lost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* `DataFrame` constructor is about being a table (which in particular implies having column names)

I don't think this rule holds in general, does it? It's only true for the one-argument constructor (and even in that case, not e.g. for the ::GroupedDataFrame one). For the two-argument constructor, we don't actually allow tables as the first argument. So it would be OK to accept matrices (just like we accept vectors of vectors).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this rule holds in general, does it?

We have to have some minimal set of constructors. For instance I hate DataFrame(a=[1,2], b=[2,3]) constructor as it is inconsistent with the general design we have, but I think we should keep it as it is convenient. How I understand @oxinabox is that he wants to limit the number of special cases as much as possible, and it seems that for matrices DataFrame(names .=> eachcol(matrix)) is an easy enough pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance I hate DataFrame(a=[1,2], b=[2,3]) constructor as it is inconsistent with the general design we have

I was going to suggest deprecating that as well; for DataFrame(:a=>[1,2], ...) 😂
because getting rid of it would mean we are free to add what ever keywords we wanted to the constructor later without clashing with column names.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have this clash already with copycols and makeunique kwargs which are handled inconsistently:

julia> DataFrame(makeunique=2, copycols=false)
1×1 DataFrame
│ Row │ makeunique │
│     │ Int64      │
├─────┼────────────┤
│ 1   │ 2          │

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to suggest deprecating that as well; for DataFrame(:a=>[1,2], ...)

Oh, is that an option here? I'm 100% on board with that change 😸

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looks mostly good I'd say.

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
test/broadcasting.jl Outdated Show resolved Hide resolved
src/other/tables.jl Outdated Show resolved Hide resolved
test/broadcasting.jl Show resolved Hide resolved
src/dataframe/dataframe.jl Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Nov 3, 2020

I have applied all the suggestions. It should be good for another look at.

@bkamins
Copy link
Member Author

bkamins commented Nov 3, 2020

CI passes cleanly (I am testing against current master with all the changes made since this PR was started)

src/dataframe/dataframe.jl Show resolved Hide resolved
src/dataframe/dataframe.jl Show resolved Hide resolved
@@ -337,4 +344,11 @@ end
end
end

@testset "Dict constructor corner case" begin
@test DataFrame(Dict('a' => 1, true => 2)) == DataFrame("a" => 1, "true" => 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really allow this?

Copy link
Member Author

@bkamins bkamins Nov 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the original design. Actually I thought that maybe it is best to allow only string or Symbol as keys. It is last time to change it :). Do we go this way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds safer -- as usual. :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - I will change this then. Thank you!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done - now we are as strict as with Pair constructors (except for sorting when Dict is passed).

@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2020

CI passes, only coverage fails

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2020

Thank you! I will merge this once the CI passes.

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins bkamins merged commit 55533d1 into JuliaData:master Nov 6, 2020
@bkamins bkamins deleted the deprecate_constructors branch November 6, 2020 21:55
@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2020

Thank you!

@0kto
Copy link

0kto commented Nov 16, 2020

Hi all, I am late to this discussion.
I updated yesterday, and with v0.22 I receive the deprecation warning:

DataFrame` constructor with passed eltypes is deprecated.
[...]

Reading through the docs and the source code, I was not able to find a suitable replacement:
In my scripts I initialize a DataFrame object by defining column-names and types (and zero rows), to which I append data read from (rather messy) instrument data files. Having the eltypes defined before appending data to the DataFrame, bad data (where the Type is not matching) fails and immediately helps figuring out if there is a problem. Also, adding a column is as easy as adding a Dictionary entry, making my code easily usable for non-developers.

So, my question is if there is a suitable replacement I haven't seen, or if there is a recommended workflow I could implement instead. Or if you could revive at least one constructor that accepts a Type declaration for columns.

@pdeffebach
Copy link
Contributor

How about this? I think it's decently intuitive

julia> d = Dict(:id => String, :income => Float64, :education => Int64);

julia> df = DataFrame([n => T[] for (n, T) in d])
0×3 DataFrame

@bkamins
Copy link
Member Author

bkamins commented Nov 16, 2020

or just e.g.:

DataFrame([:a, :b] .=> [T[] for T in [Int, Float64]])

Dict is not 100% correct as it will reorder columns.

@0kto
Copy link

0kto commented Nov 16, 2020

Thanks, that is really helpful.
Since I am using an OrderedDict from the DataStructures.jl package, order should be preserved.
Cheers!

@bkamins
Copy link
Member Author

bkamins commented Nov 16, 2020

Yes only Dict is sorted.

khosravipasha added a commit to Tractables/LogicCircuits.jl that referenced this pull request Nov 22, 2020
broke our code, probably major refactoring needed
JuliaData/DataFrames.jl#2464
khosravipasha added a commit to Tractables/ProbabilisticCircuits.jl that referenced this pull request Nov 22, 2020
broke our code, probably major refactoring needed
JuliaData/DataFrames.jl#2464
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deprecate DataFrame(::AbstractMatrix)