Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Might Tables.jl be a new Home for StatsPlots.@df ? #287

Open
BeastyBlacksmith opened this issue Aug 23, 2022 · 29 comments
Open

Might Tables.jl be a new Home for StatsPlots.@df ? #287

BeastyBlacksmith opened this issue Aug 23, 2022 · 29 comments

Comments

@BeastyBlacksmith
Copy link

The @df macro allows writing column names in function calls as in

using StatsPlots
import RDatasets
iris = RDatasets.dataset("datasets", "iris")
@df iris scatter(:SepalLength, :SepalWidth, group = :Species)
@df iris sum(:SepalLength)  # not only useful for plotting

It will be removed from StatsPlots.jl probably in 1.0 and if we don't find a better home, it will got to Plots.jl (JuliaPlots/Plots.jl#3351), but actually I think it should be somewhere more general purpose.

If it's going here it should be probably be renamed to something like @table.

@rofinn
Copy link
Member

rofinn commented Aug 23, 2022

I don't think Tables.jl is the right place for it. Maybe we should just have a TablePlots.jl package similar to how we have TableOperations.jl?

@BeastyBlacksmith
Copy link
Author

BeastyBlacksmith commented Aug 23, 2022

Maybe we should just have a TablePlots.jl package

That doesn't make sense, since this macro is not tied to plotting at all. It works for any function call/ block of function calls. It's only historic accident that it ended up in StatsPlots.jl, since there was no Table interface back then.

The only thing that is kind of plotting related is the automatic passing of a label keyword if, the function to be called supports this, but that part could also be stripped and put in a Plots-specific extension of this macro.

@nalimilan
Copy link
Member

This sounds like something which could live in DataFramesMeta (with a different name). Cc: @pdeffebach @bkamins

@pdeffebach
Copy link

Can someone try and see how much of the @df macro functionality can be replicated by just using @with from DataFramesMeta.jl?

@bkamins
Copy link
Member

bkamins commented Aug 25, 2022

I already checked it. @with is better. The only problem is that @with works with data frames, while @df works with any tables.

@pdeffebach
Copy link

Ah.

Current implementation of @with does not use any DataFrames.jl-specific things actually, so I could make it broader.

@bkamins
Copy link
Member

bkamins commented Aug 25, 2022

So if we document that @with contract is Tables.jl (which I think should be easy) then @df can be made an alias for @with if someone wants it.

Other than that @with has all functionality that @df has but it has some more functionality for better handling special cases (like $)

@BeastyBlacksmith
Copy link
Author

Thats a good idea. Could we then move @with to Tables.jl?

@nalimilan
Copy link
Member

Tables.jl is supposed to be lightweight interface package that other packages depend on, not something where convenience functions intended for end users should live. I don't think any package exists currently for that.

@DilumAluthge
Copy link
Contributor

Perhaps TableOperations.jl?

@BeastyBlacksmith
Copy link
Author

Perhaps TableOperations.jl?

Sounds good to me

@pdeffebach
Copy link

One problem is the current implementation relies on ByRow, which is defined in DataFrames.jl. I wonder if ByRow can go somewhere else too?

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

ByRow is not depending on DataFrames.jl, so:

struct ByRow{T} <: Function
    fun::T
end

# invoke the generic AbstractVector function to ensure function is called
# exactly once for each element
(f::ByRow)(cols::AbstractVector...) =
    invoke(map,
           Tuple{typeof(f.fun), ntuple(i -> AbstractVector, length(cols))...},
           f.fun, cols...)
(f::ByRow)(table::NamedTuple) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]

could be extracted to Tables.jl if all agree it is useful to have it there (we probably would need to restrict table::NamedTuple in the second signature to ensure we are working with Tables.columntable).

@nalimilan
Copy link
Member

@quinnj What do you think?

@quinnj
Copy link
Member

quinnj commented Sep 6, 2022

Yeah, I'd be down for that. If someone can make a PR, we're getting close to doing a new minor release here soon once I get the getrows stuff sorted out.

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

I will make a PR since we need to sync it with DataFrames.jl 1.4 release.

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

@quinnj - why does ColumnTable allow other containers than vectors?

const ColumnTable = NamedTuple{names, T} where {names, T <: NTuple{N, AbstractArray{S, D} where S}} where {N, D}

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

Also let us confirm that the definition of ByRow for Tables.jl would be:

struct ByRow{T} <: Function
    fun::T
end

# invoke the generic AbstractVector function to ensure function is called
# exactly once for each element
(f::ByRow)(cols::AbstractVector...) =
    invoke(map,
           Tuple{typeof(f.fun), ntuple(i -> AbstractVector, length(cols))...},
           f.fun, cols...)
(f::ByRow)(col::AbstractVector) =
    invoke(map,
           Tuple{typeof(f.fun), (AbstractVector,)},
           f.fun, col)
(f::ByRow)(table) = [f.fun(nt) for nt in Tables.namedtupleiterator(table)]

The idea is:

  • if we pass vector or vectors (in the case of a single vector we treat it as a vector even if it is a table) - we just apply map to the passed data (making sure to always process each element)
  • if we pass a table (other than a vector) - then we pass named tuples to the function

What is an important corner case is when a table is a vector. Then ByRow will ignore the fact that such vector is a table (for Tables.RowTable this is fortunately the same behavior, but in general e.g. vector of structs, e.g. [1=>2, 3=>4] will be treated as a vector and not as a table).

This is what we need in DataFrames.jl, but I want to confirm that the same expectation is OK for Tables.jl.

@nalimilan
Copy link
Member

Better always treat the argument as a table, otherwise the behavior is inconsistent. Do we really need the (f::ByRow)(cols::AbstractVector...) method in DataFrames or can it be considered as an implementation detail that could be done using a separate function?

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

Better always treat the argument as a table, otherwise the behavior is inconsistent.

I know and that is why I am asking. The problem is that whatever you do if you write in DataFrames.jl e.g.:

julia> using DataFrames

julia> df = DataFrame(col=[1=>2, 3=>4])
2×1 DataFrame
 Row │ col
     │ Pair…
─────┼───────
   1 │ 1=>2
   2 │ 3=>4

julia> select(df, :col => ByRow(identity))
2×1 DataFrame
 Row │ col_identity
     │ Pair…
─────┼──────────────
   1 │ 1=>2
   2 │ 3=>4

you get a no-op, while if you made ByRow always treat a table as a table you would get instead:

julia> select(df, :col => x -> collect(Tables.namedtupleiterator(x)))
2×1 DataFrame
 Row │ col_function
     │ NamedTuple…
─────┼─────────────────────────
   1 │ (first = 1, second = 2)
   2 │ (first = 3, second = 4)

so it would be a change in DataFrames.jl (moreover - a change that we do not want).

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

Which means that maybe we need to restrict ByRow to just NamedTuple of vectors (which brings me back to the earlier question - why ColumnTable allows arrays of any dimensionality - this is OK, but I would prefer to have a clear situation)

@quinnj
Copy link
Member

quinnj commented Sep 6, 2022

I have a diff locally that changes it to:

const ColumnTable = NamedTuple{names, T} where {names, T <: NTuple{N, AbstractVector{S} where S}} where {N}

as I also noticed this definition being too general.

@nalimilan
Copy link
Member

Then this is problematic. Vectors of named tuples are generally considered as completely valid tables, so it would be weird for them not to work with ByRow if it was moved to Tables.jl. I don't know what to do...

@bkamins
Copy link
Member

bkamins commented Sep 6, 2022

If we want to move @with and let it use @byrow we need the behavior from DataFrames.jl, the reason is that e.g.:

julia> using DataFramesMeta

julia> df = DataFrame(col=[1=>2, 3=>4])
2×1 DataFrame
 Row │ col
     │ Pair…
─────┼───────
   1 │ 1=>2
   2 │ 3=>4

julia> @with df @byrow begin
           :col
       end
2-element Vector{Pair{Int64, Int64}}:
 1 => 2
 3 => 4

is a desired behavior (and not transforming it to named tuples as I have shown above)

I don't know what to do...

As I have said - the only thing that we can do is restrict ByRow to its current behavior in DataFrames.jl (i.e. working on vectors as vectors or Tables.ColumnTable as a special case).
Then users - if they have other tables - would have to convert them to Tables.ColumnTable - it is a bit problematic but not end of the world.

Note that it is only a problem if ByRow is used outside of a table, which is not its typical use-case. Normally ByRow is designed to work on columns of a table and then there is no problem.

@bkamins bkamins mentioned this issue Sep 7, 2022
@bkamins
Copy link
Member

bkamins commented Sep 7, 2022

I have opened #293 as a follow up to ByRow discussion.

@nalimilan
Copy link
Member

Then users - if they have other tables - would have to convert them to Tables.ColumnTable - it is a bit problematic but not end of the world.

My fear is rather that custom table types will override ByRow to work on them, so that in the end it supports most table types, except AbstractVectors (including Vector{<:NamedTuple}.

Note that it is only a problem if ByRow is used outside of a table, which is not its typical use-case. Normally ByRow is designed to work on columns of a table and then there is no problem.

I think the difficulty is that we designed ByRow to work in DataFrames, where AsTable => ByRow(f) passes a NamedTuple to ByRow so there's no need to support other kinds of tables. But in TableOperations for example it could make sense for AsTable to support other types. Not sure. Let's hope that ColumnTable is the right choice for most table types, including row-oriented ones.

@bkamins
Copy link
Member

bkamins commented Sep 7, 2022

Let's hope that ColumnTable is the right choice for most table types, including row-oriented ones.

RowTable is also supported now without a problem if a source is row-oriented.

@nalimilan
Copy link
Member

Right. The point is that as soon as people start using ByRow with more than just ColumnTable, it's likely that somebody will try it with Vector{<:NamedTuple}. And that could happen indirectly, due to a user passing such an object to a package which assumed any table would work.

@bkamins
Copy link
Member

bkamins commented Sep 8, 2022

I agree. That is why I tried to highlight in the documentation that the default API is vectors, and ColumnTable is a special case exception.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants