Conversation

@Sov-trotter
Contributor

@Sov-trotter Sov-trotter commented Mar 28, 2020

@bkamins Can you take a look at it?

@bkamins bkamins linked an issue Mar 28, 2020 that may be closed by this pull request
function Base.isapprox(df1::AbstractDataFrame, df2::AbstractDataFrame)
size(df1, 2) == size(df2, 2) || return false
isequal(index(df1), index(df2)) || return false
Matrix(df1) ≈ Matrix(df2) || return false
Member

@bkamins bkamins Mar 28, 2020

The whole point of this PR is to avoid the conversion to Matrix, as this is inefficient.

Contributor Author

Can you suggest an alternative to the matrix conversion?
Should I align with the isequal approach?

Member

Yes. But in order to stay consistent with Base, the norm should be calculated using the rules that are defined in LinearAlgebra/src/generic.jl:1588.

Member

Also, it will probably be more complex than for isequal, as we also have to handle missing efficiently, which means that a barrier function would probably be useful there.
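
A minimal sketch of the function-barrier pattern mentioned here, using a plain collection of column vectors rather than a data frame. The names `columns_isequal` and `_col_isequal` are hypothetical and not part of DataFrames.jl, and the inner comparison uses `isequal` only to keep the example short.

```julia
# Hypothetical helpers for illustration; not the DataFrames.jl implementation.
# The outer loop sees columns of unknown type, so the per-column work is moved
# into a separate function that Julia compiles for each concrete column type.
function columns_isequal(cols1, cols2)
    length(cols1) == length(cols2) || return false
    for (c1, c2) in zip(cols1, cols2)
        # Dynamic dispatch happens once per column, not once per element.
        _col_isequal(c1, c2) || return false
    end
    return true
end

# The barrier: specialized on the concrete types of c1 and c2.
function _col_isequal(c1::AbstractVector, c2::AbstractVector)
    length(c1) == length(c2) || return false
    for i in eachindex(c1, c2)
        isequal(c1[i], c2[i]) || return false   # isequal treats missing as equal to missing
    end
    return true
end
```

The same structure would apply to an approximate comparison, with the inner loop replaced by whatever per-column rule is chosen.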

Test for "approximate" equality between two DataFrames
"""

function Base.isapprox(df1::AbstractDataFrame, df2::AbstractDataFrame)
Member

kwargs defined in isapprox in Base are missing here.

@bkamins
Member

bkamins commented Mar 28, 2020

Also, I think that for isapprox to be useful for data frames we should handle columns whose nonmissingtype(eltype) is not a subtype of Number in a special way, i.e. run a standard isequal test on them. Or we could add a kwarg that lets the user choose the behavior (whether such columns should be ignored or not).

This is exactly the case where defining isapprox for data frames is valuable. E.g. you have some key columns that are strings and value columns that are numbers, and you want to check that the keys are equal and the numbers are approximately equal.

There is a corner case of columns like Any[1, 2, 3] here, but I think it is not a problem if we treat them as non-number.
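
A small sketch of the column test described above, assuming the rule is "a column is numeric when `nonmissingtype(eltype(col)) <: Number`". The helper name `is_numeric_column` is hypothetical.

```julia
# Hypothetical helper: classify a column as numeric or not, ignoring Missing
# in the element type. Not part of DataFrames.jl.
is_numeric_column(col::AbstractVector) = nonmissingtype(eltype(col)) <: Number

is_numeric_column([1.0, 2.0])                        # Float64 column
is_numeric_column(Union{Missing, Int}[1, missing])   # Int column that allows missing
is_numeric_column(["a", "b"])                        # String column
is_numeric_column(Any[1, 2, 3])                      # eltype is Any, so treated as non-number
```

Under this rule the `Any[1, 2, 3]` corner case falls on the non-number side, as suggested in the comment.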

Also we should properly handle missing values in comparisons (and return missing correctly if needed). This is not implemented in Base for performance reasons, but for data frames we are not going to be fast anyway. What I mean is that in Base you have to write the following (which is also slightly inaccurate, because it compares elementwise instead of using a norm; we should handle this):

julia> all([1,2,3] .≈ [1,2,missing])
missing

because the fast variant

julia> [1,2,3] ≈ [1,2,missing]
ERROR: MethodError: no method matching rtoldefault(::Type{Int64}, ::Type{Union{Missing, Int64}}, ::Int64)

errors.
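
One way to get the three-valued result without building the intermediate Boolean vector is sketched below. The function name `approx_with_missing` is hypothetical; it compares element by element, so it mirrors `all(x .≈ y)` rather than the norm-based array method in Base.

```julia
# Hypothetical helper: elementwise isapprox with three-valued logic.
# Returns false on the first definite mismatch, missing if no mismatch was
# found but some comparison involved missing, and true otherwise.
function approx_with_missing(x::AbstractVector, y::AbstractVector; kwargs...)
    length(x) == length(y) || return false
    seen_missing = false
    for (a, b) in zip(x, y)
        r = isapprox(a, b; kwargs...)   # scalar isapprox returns missing if a or b is missing
        if ismissing(r)
            seen_missing = true
        elseif !r
            return false
        end
    end
    return seen_missing ? missing : true
end
```

Keyword arguments such as `atol` and `rtol` are forwarded to the scalar `isapprox` unchanged.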

In summary - if we do not cover all these corner cases in this PR I do not think it is worth adding, because the user can then use the standard isapprox for matrices without a problem.

Because of all these complexities (that require a very careful design) I have put "helpwanted" but not put "intro issue" in the original issue.

@nalimilan
Member

> Also we should properly handle missing values in comparisons (and return missing correctly if needed). This is not implemented in Base for performance reasons, but for data frames we are not going to be fast anyway.

Are you sure that's for performance? The fact that [1,2,3] ≈ [1,2,missing] throws an error ("no method matching rtoldefault(::Type{Int64}, ::Type{Union{Missing, Int64}}, ::Int64)") looks like an oversight in Base, since 1 ≈ missing returns missing. So we could try to fix that in Base, and define isapprox on DataFrame as calling isapprox on each column, just like == and isequal.

The question of how to handle non-numeric columns is interesting. We could use == for all non-Union{Number,Missing} columns. Or we could require people to pass that type as an argument.

@bkamins
Member

bkamins commented Mar 28, 2020

> So we could try to fix that in Base

I was considering this, but it is an unrelated thing and I did not want to complicate this PR. It should be fixable in Base. In general, isapprox should behave like == when all floats are identical, and [1,2,3] == [1,2,missing] is missing.

> and define isapprox on DataFrame as calling isapprox on each column, just like == and isequal.

In Base isapprox considers a norm of whole matrix not norms of its columns separately. But maybe you are right - we could say that data frame is not a matrix, but a nested vector of vectors (this is what Base also supports) which would make sense.
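
A small numeric illustration of why the two views can disagree, using plain vectors as stand-ins for two data frame columns. The values are made up for the example; a column on a large scale dominates the whole-matrix norm and hides a relative error in a column on a small scale.

```julia
using LinearAlgebra  # the array method of isapprox and norm live here

# Two "data frames" with two columns each; only the small-scale column differs.
a1, a2 = [1.0e6, 2.0e6], [1.0e-3, 2.0e-3]
b1, b2 = [1.0e6, 2.0e6], [1.1e-3, 2.0e-3]

# One norm over the whole matrix: the 1e-4 difference is tiny next to ~2e6.
whole_matrix = isapprox(hcat(a1, a2), hcat(b1, b2))    # true

# One norm per column: the second column differs by about 10% in one entry.
per_column = isapprox(a1, b1) && isapprox(a2, b2)      # false
```

Treating the data frame as a vector of columns therefore gives a stricter test whenever columns live on different scales.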

> The question of how to handle non-numeric columns is interesting. We could use == for all non-Union{Number,Missing} columns.

This is what I assumed.

> Or we could require people to pass that type as an argument.

This is also what I assumed, but with these columns tested using == by default. Why: normally a data frame you want to test with isapprox is the result of some transformation like combine, where it is natural to have some non-numeric columns (e.g. grouping columns) alongside the values.

As I have written above, if we do not give all these options then we do not have to add this method at all (mostly), as you can write:

all(df .≈ matrix_or_data_frame_you_want_to_compare_against)

and use the fact that a data frame fully supports broadcasting (so we hit a similar situation to the any(ismissing, collection) case 😄).

(I say mostly, because then we compute elementwise norms, which in some cases might give a different result, but in others - like for the L-infinity norm - it will be the same.)
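
A quick check of the L-infinity remark, under the assumption that only an absolute tolerance is used (with `atol > 0` the default `rtol` is 0). The vectors and the helper names `maxnorm`, `by_norm` and `elementwise` are made up for the example.

```julia
using LinearAlgebra

x = [1.0, 2.0, 3.0]
y = [1.05, 2.0, 2.95]

maxnorm(v) = norm(v, Inf)   # L-infinity norm: largest absolute entry

# Norm-based comparison of the whole vectors using the L-infinity norm.
by_norm(atol) = isapprox(x, y; atol=atol, norm=maxnorm)

# Elementwise comparison, as all(x .≈ y) would do, with the same tolerance.
elementwise(atol) = all(isapprox.(x, y; atol=atol))
```

With a purely absolute tolerance both functions answer the same question (is every entry within `atol`?). With a relative tolerance they can differ, because the norm-based test scales `rtol` by the norm of the whole vector while the elementwise test scales it per entry.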

@nalimilan
Member

> In Base isapprox considers a norm of whole matrix not norms of its columns separately. But maybe you are right - we could say that data frame is not a matrix, but a nested vector of vectors (this is what Base also supports) which would make sense.

Yes, AFAICT that's the only efficient way of doing this operation anyway.

I'm fine with the plan you propose.

@bkamins
Member

bkamins commented Mar 30, 2020

@Sov-trotter - I think we have settled on the design now, unless you have some comments on it.
Now, if you want you can propose an implementation in this PR.

@Sov-trotter
Contributor Author

I think we can go with the implementation you suggest. Meanwhile, should we begin with a PR in Base?

@bkamins
Member

bkamins commented Apr 5, 2020

Yes - these two things can probably be done in parallel.

@bkamins
Member

bkamins commented Aug 20, 2020

@Sov-trotter - hi, I just wanted to check with you if you are still willing to work on finalizing this PR? Thank you!

@Sov-trotter
Contributor Author

Yeah. Extremely sorry for holding this up for so long. I am not really sure what exactly should be done here?

@bkamins
Member

bkamins commented Aug 20, 2020

As discussed above:

  • do not use a conversion to matrix, but rather do isapprox for all columns
  • the function should work column-wise and use == for columns that are not AbstractVector{<:Union{Missing, Number}}
  • the function should support the full API of isapprox:
isapprox(x, y; rtol::Real=atol>0 ? 0 : √eps, atol::Real=0, nans::Bool=false, norm::Function)

except as noted in the comments above (i.e. non-numeric columns are compared using == and the norm is calculated per column)
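
The design summarized in this comment could be sketched roughly as follows. This is not the DataFrames.jl implementation: to stay self-contained it works on NamedTuples of column vectors instead of AbstractDataFrame objects, the names `table_isapprox` and `column_isapprox` are hypothetical, and returning missing as soon as a numeric column contains a missing value is a simplifying assumption.

```julia
using LinearAlgebra  # the array method of isapprox is defined there

# Hypothetical stand-in for Base.isapprox(df1, df2): `t1` and `t2` are
# NamedTuples of column vectors, not AbstractDataFrame objects.
function table_isapprox(t1::NamedTuple, t2::NamedTuple; kwargs...)
    keys(t1) == keys(t2) || return false       # same column names in the same order
    result = true
    for (c1, c2) in zip(values(t1), values(t2))
        r = column_isapprox(c1, c2; kwargs...)
        r === false && return false            # a definite mismatch wins
        ismissing(r) && (result = missing)     # otherwise remember the unknown
    end
    return result
end

# Numeric columns: norm-based isapprox per column; atol, rtol, nans and norm
# are forwarded unchanged to the array method.
function column_isapprox(c1::AbstractVector{<:Union{Missing, Number}},
                         c2::AbstractVector{<:Union{Missing, Number}}; kwargs...)
    length(c1) == length(c2) || return false
    isempty(c1) && return true
    # Simplification: any missing value makes the column result missing.
    (any(ismissing, c1) || any(ismissing, c2)) && return missing
    # Drop Missing from the element types before calling the array method.
    v1 = convert(Vector{nonmissingtype(eltype(c1))}, c1)
    v2 = convert(Vector{nonmissingtype(eltype(c2))}, c2)
    return isapprox(v1, v2; kwargs...)
end

# Every other column (strings, Any[1, 2, 3], ...) is compared with ==.
column_isapprox(c1::AbstractVector, c2::AbstractVector; kwargs...) = c1 == c2
```

For example, `table_isapprox((key=["a", "b"], val=[1.0, 2.0]), (key=["a", "b"], val=[1.0, 2.05]); atol=0.1)` compares the `key` columns exactly and the `val` columns within the absolute tolerance.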

@Sov-trotter
Contributor Author

Also this goes in DataFrames? We are not adding on to Base?

@bkamins
Member

bkamins commented Aug 20, 2020

Yes - I would just add Base.isapprox taking two AbstractDataFrame objects

@bkamins
Member

bkamins commented Aug 21, 2020

Closing as #2373 is the same

@bkamins bkamins closed this Aug 21, 2020
Development

Successfully merging this pull request may close these issues.

Define isapprox for DataFrames

3 participants