Selections.jl + DataFrames.jl #1936

Closed
Drvi opened this issue Aug 29, 2019 · 15 comments

@Drvi
Contributor

Drvi commented Aug 29, 2019

Hi!

I've put together a package, Selections.jl, that implements quite powerful column selection and renaming capabilities for DataFrames.jl, and I would love to see it incorporated into DataFrames.jl.

You can select columns based on their names, positions, ranges and regular expressions, just like DataFrames does. Apart from that one can select columns by boolean indexing and by applying predicate functions to column names or values or both; so you can (de)select columns having more than 60 % missing values, whose names are all caps containing the string "ID" like this:

using DataFrames: DataFrame
using Selections, Statistics

julia> df = DataFrame(A_ID = 1:4, b_ID = repeat([missing], 4), C_ID = [missing, missing, missing, 1])
4×3 DataFrame
│ Row │ A_ID  │ b_ID    │ C_ID    │
│     │ Int64 │ Missing │ Int64⍰  │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ missing │ missing │
│ 3   │ 3     │ missing │ missing │
│ 4   │ 4     │ missing │ 1       │


julia> select(df, if_pairs((k,v) -> uppercase(k) == k && occursin("ID", k) && (mean(ismissing.(v)) > 0.6)))
4×1 DataFrame
│ Row │ C_ID    │
│     │ Int64⍰  │
├─────┼─────────┤
│ 1   │ missing │
│ 2   │ missing │
│ 3   │ missing │
│ 4   │ 1       │

  • You can also chain all kinds of conditions together using & and | in order to create quite complex selection rules.
  • All selection conditions can be negated which will select the complement of the original selection.
  • You can also rename selected columns and you can apply multiple renaming functions to multiple columns based on the selection criteria
# here I use Selections.rename to make sure I keep all the columns in their original order
julia> rename(df, -1 => key_suffix("_B"), r"^[A-Z]" => key_prefix("ac_"))
4×3 DataFrame
│ Row │ ac_A_ID │ b_ID_B  │ ac_C_ID_B │
│     │ Int64   │ Missing │ Int64⍰    │
├─────┼─────────┼─────────┼───────────┤
│ 1   │ 1       │ missing │ missing   │
│ 2   │ 2       │ missing │ missing   │
│ 3   │ 3       │ missing │ missing   │
│ 4   │ 4       │ missing │ 1         │

Please see the README.md for a more comprehensive description of the package.
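
To make the predicate-based selection from the first example concrete without depending on any package, here is a minimal sketch in plain Julia. It treats a table as a NamedTuple of column vectors and keeps the columns for which a predicate on (name, values) returns true. The helper names `select_if_pairs` and `missing_share` are assumptions made for illustration; they are not part of Selections.jl or DataFrames.jl, and the real `if_pairs` may behave differently.

```julia
# Hypothetical helper (not the Selections.jl API): keep the columns of a
# NamedTuple-of-vectors for which pred(name::String, values) returns true.
function select_if_pairs(pred, coltable::NamedTuple)
    kept = [k for (k, v) in pairs(coltable) if pred(String(k), v)]
    return NamedTuple{Tuple(kept)}(coltable)
end

# Fraction of missing values in a column.
missing_share(v) = count(ismissing, v) / length(v)

tbl = (A_ID = [1, 2, 3, 4],
       b_ID = [missing, missing, missing, missing],
       C_ID = [missing, missing, missing, 1])

# All-caps name containing "ID" and more than 60 % missing values.
result = select_if_pairs(tbl) do k, v
    uppercase(k) == k && occursin("ID", k) && missing_share(v) > 0.6
end
```

Only `C_ID` satisfies all three conditions here: `A_ID` has no missing values and `b_ID` is not all caps.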

Currently Selections exports both select and rename, which conflict with the functions of the same name exported by DataFrames. So my question is: would you like this functionality to be a part of DataFrames? I'd be more than happy to make the necessary changes (e.g., make the API compliant with DataAPI) and iron out the API if you think there is room for improvement. In any case, I'd love to get some feedback on the package so that it can be useful for the community.

Thank you for reading this.:)

@bkamins
Member

bkamins commented Aug 29, 2019

Hi,

Thank you for your interest and willingness to contribute.

The way we approached column selection in DataFrames.jl is the following:

  • we try to be consistent across packages (especially JuliaDB.jl), therefore we have standard Not, All and Between selectors (they have some overlap with what you propose if I understand your proposal correctly)
  • the approach we currently take is that column selections can depend only on AbstractIndex of a DataFrame (not its contents) - this is related to the fact that we try to have the same column selection rules that can be used in many functions (like select, categorical, allowmissing, etc.)

All this is not carved in stone - please feel free to comment.

Given that your package has some overlap with the current (and different) way we handle similar things, I would recommend, if you are willing, splitting your request into a series of atomic proposals (i.e. what exactly you propose to change in or add to the functionality we have now).

@quinnj
Member

quinnj commented Aug 29, 2019

@Drvi, very cool stuff! @bkamins and I have actually discussed moving the DataFrames selection code out into its own package, so it's great to see someone do this! One thing we talked about was making the selection code Tables.jl-based instead of DataFrames.jl-specific. Have you looked into this at all? It'd be great if we could make all the logic just work on the result of doing Tables.schema(table). Anyway, excited to see progress here.

@Drvi
Contributor Author

Drvi commented Sep 2, 2019

Thanks @quinnj and @bkamins. So if I understand correctly, the best approach would be to make Selections.jl work with Tables.jl so that DataFrames can opt in to it later, once the package is ready. That makes a lot of sense.

My bigger plan for Selections was to extend the way people can select, rename, order and even transform columns, i.e. to be able to refer to columns not only by their name but by their properties (like eltype or some statistic based on the actual values). The api I have in mind was

# can change order and/or names of columns + can deselect columns
select(df, selection_fun() => renaming_fun())
select!(df, selection_fun() => renaming_fun())

# can change the names of columns
rename(df, selection_fun() => renaming_fun())
rename!(df, selection_fun() => renaming_fun())

# can change the order of columns
permutecols(df, selection_fun() => ordering_fun())
permutecols!(df, selection_fun() => ordering_fun())

# can change the values of columns by applying fun() to each one
transform(df, selection_fun() => fun())
# transform!(df, selection_fun() => fun())

My original goal was to emulate dplyr's select_at, select_if, mutate_at, mutate_if and so on. These were super useful to me, but with selection_fun() creating appropriate selection types, we can dispatch easily to the correct selection logic while keeping the API nice and modular.

If my understanding is correct, Selections are more general than JuliaDB's selectors (except that I don't have a special selection for the primary keys of a table), as you can combine many of them with & and |.

It'd be great if we could make all the logic just work on the result of doing Tables.schema(table)

Yes, relying on Tables.jl would be great, but I'm not sure if it supports all the operations I need:

  • select a single column as an array by its Symbol name
  • select multiple columns as a table by a vector of Symbol names (possibly in different order)
  • changing column names without copying the table (for rename!())
  • some way of deleting columns inplace (for select!())
  • some way of permuting columns inplace (for select!() and permutecols!())

I'd appreciate any guidance here, as I'm not that familiar with what Tables.jl can and cannot do. If there is no way to do this with Tables.jl, people would have to define these operations themselves to opt in.

@nalimilan
Member

Interesting. AFAICT, we could integrate Selections.jl with DataFrames and Tables.jl by defining an AbstractSelection type in DataAPI or Tables.jl. Tables (including DataFrames) would call such objects, passing them column vectors and their names, and they would return the (new) names of selected columns. This could also be made to work for any package implementing the Tables.jl interface, at least for some operations.
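
The protocol described above (a table calls the selection object with column names and column vectors, and gets back the names of the selected columns) could look roughly like the following sketch in plain Julia. Everything here is hypothetical: `AbstractSelection` is defined locally rather than in DataAPI or Tables.jl, and `EltypeSelection` and `select_columns` are illustrative names, with a NamedTuple of vectors standing in for a real table.

```julia
# Hypothetical abstract type; in the proposal it would live in DataAPI or Tables.jl.
abstract type AbstractSelection end

# Hypothetical concrete selection: pick columns whose eltype is a subtype of T.
struct EltypeSelection{T} <: AbstractSelection end
EltypeSelection(::Type{T}) where {T} = EltypeSelection{T}()

# The selection object is callable: it receives names and columns
# and returns the names of the selected columns.
function (s::EltypeSelection{T})(names, columns) where {T}
    return [n for (n, c) in zip(names, columns) if eltype(c) <: T]
end

# What a table type would do on its side: call the selection, then subset itself.
function select_columns(sel::AbstractSelection, coltable::NamedTuple)
    chosen = sel(keys(coltable), values(coltable))
    return NamedTuple{Tuple(chosen)}(coltable)
end

tbl = (a = [1, 2], b = ["x", "y"], c = [1.0, 2.0])
numeric = select_columns(EltypeSelection(Number), tbl)
```

With this split, the selection logic never needs to know the concrete table type; it only sees names and column vectors.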

@Drvi
Contributor Author

Drvi commented Sep 8, 2019

@nalimilan Making Selections.jl just return pairs of old_column_names => new_column_names, or just a vector of column_names, while depending only on Tables.jl would indeed be easy to do. However, right now Tables.jl doesn't really cover renaming or in-place functionality, so, AFAIK, it won't be possible to implement the full desired API.

@nalimilan
Member

Do we really need Tables.jl to support renaming? As long as the abstract type exists, we can do whatever we want using selections in DataFrames. Generic functions that work on all table types can come later.

@Drvi
Contributor Author

Drvi commented Dec 22, 2019

Hi everyone. I finally got some time to finish the overhaul of Selections.jl.

Please see the README.md for an introduction.

The highlights of this new version:

  • It depends only on Tables.jl and is tested against DataFrames.jl, JuliaDB.jl and TypedTables.jl
  • It now contains functionality for column selection, column renaming and transformation as well.
    • Transformations use wrapper functions like byrow.(f) or bycol!(f) to signal how the transformation should be applied (the first is applied to each row, the second updates the whole column in place). Related: Row-wise vs. whole vector functions #1952
    • Renaming is guaranteed to produce unique names. More specifically -- when provided a vector of new names that has wrong length, the renaming is skipped (with a warning). In all other cases the names are made unique (also with warning). Related: Allow rename when selecting #1975
  • You can use keyword arguments to create new columns.
  • You can still refer to multiple groups of columns, and apply renamings and transformations to them (the groups can be overlapping or not; there are optimizations for broadcasting for chaining multiple broadcasted transformations). Now you can also explicitly refer to columns that were not matched by a specific selection (allowing for if-else type of logic, like (cols(Number) => bycol(standardize)) | (else_cols() => key_suffix("_non_numeric"))) Related: Standardizing working with multiple columns #2016

I'd love some feedback from anyone interested!

cc: @nalimilan @quinnj @bkamins.

@nalimilan
Member

Interesting, thanks! Sounds very powerful. A few remarks:

  • At Allow rename when selecting #1975 we seem to have a consensus to use :oldname => function => :newname instead of newname = :oldname => select. Do you think you could use the same convention in Selections?
  • Things like cols(Float64) => bycol(scale) seem to go against the convention we use (in combine for now, but probably soon in select too) that cols => fun passes all columns in cols to fun, rather than applying fun separately to each column in cols. One needs to do cols .=> fun for that. See Allow broadcasting All and Between DataAPI.jl#10 for an implementation for custom selections.
  • The fact that byrow passes (rowtable, name) to the user-provided function sounds inconvenient for the most common case where you just want to apply an element-wise transformation to each row (as discussed at Row-wise vs. whole vector functions #1952). As in combine, byrow could pass the column vector directly to the function (instead of a named tuple) if .=> is used. It could make sense to do the same as eachcol, i.e. pass a Boolean as the second argument to byrow to decide whether the name is passed or not (no by default).
  • bytab looks like it could be a special case of bycol as it passes whole columns. When not using .=>, bycol could pass a named tuple of columns, which would replace bytab.
  • In DataFrames we throw errors by default when duplicate names are generated, unless makeunique=true is passed. Maybe you could do the same thing? Same for incorrect length of new names: in Julia we generally avoid printing warnings when something is incorrect: either it's correct and it works, or it's not and it fails (with an option to allow it).
  • The -cols(:a) syntax looks like it's inspired by R, but it's not very Julian. Do you think !cols(:a) could work? Or maybe not(:a) is enough? BTW, why have not when Not already exists?
  • What's the difference between cols(T) and if_eltype(T)? I find it possibly a bit confusing that the latter also includes Union{T, Missing}, which differs from what eltype returns.
  • Likewise, is if_match needed or could it be replaced with cols(r"...")? I even wonder whether if_keys, if_values and if_pairs couldn't be replaced with cols((k, v) -> ...) (you wouldn't use keys or values if you don't care about them).
  • all_cols seems redundant with All() from DataAPI. Though maybe it's needed for technical reasons. Also looks like colrange is similar to Between, though more powerful since it accepts a step argument (that would be added to Between).
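
The difference between `cols => fun` and `cols .=> fun` mentioned in the list above comes from plain Julia broadcasting: the first builds a single Pair holding all columns, the second builds one Pair per column. The sketch below shows this using only Base Julia; the `apply` and `total` helpers are hypothetical and only mimic the convention on a NamedTuple of vectors, they are not how DataFrames.jl implements it.

```julia
cols = [:a, :b]
total(columns...) = sum(sum, columns)   # hypothetical function taking one or more columns

whole = cols => total     # one Pair: all columns are passed to `total` together
each  = cols .=> total    # a Vector of Pairs: `total` is applied to each column separately

# Hypothetical interpreter of the two forms for a NamedTuple-of-vectors table.
apply(tbl::NamedTuple, spec::Pair{<:AbstractVector{Symbol}}) =
    last(spec)((tbl[c] for c in first(spec))...)
apply(tbl::NamedTuple, spec::Pair{Symbol}) = last(spec)(tbl[first(spec)])
apply(tbl::NamedTuple, specs::AbstractVector{<:Pair}) = [apply(tbl, s) for s in specs]

tbl = (a = [1, 2, 3], b = [10, 20, 30])
together   = apply(tbl, whole)   # total(tbl.a, tbl.b)
separately = apply(tbl, each)    # [total(tbl.a), total(tbl.b)]
```

Because the distinction is carried by the shape of the argument (one Pair versus a vector of Pairs), the receiving function can dispatch on it without any extra wrapper.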

@Drvi
Contributor Author

Drvi commented Dec 22, 2019

Thank you for your comments! There were some excellent points.

At #1975 we seem to have a consensus to use :oldname => function => :newname instead of newname = :oldname => select. Do you think you could use the same convention in Selections?

This is something I need to think about more deeply. IIUC, it is a generalization of what Selections are currently doing, because currently the args... never introduce new columns (which makes resolving the nested transforms easy). It changes how things need to be evaluated in an interesting way.

Args & kwargs:

select(df, 
    s1 => r1 => t1, # All the queries in `args...` are chained together.
    s2 => r2 => t2, # First the `selections` are evaluated to identify the columns to retain  
    s3 => r3 => t3, # and what functions to apply to them (and how to apply them).
    s4 => r4 => t4, # E.g. when the set of selected columns is empty, 
    ...             # no `transforms` would be applied.
    ;
    col1 = S1 => T1, ## When the `args...` are done, add `col1` to the modified table
    col2 = S2 => T2, ## Add `col2` to the modified table with `col1` already present
    ...
)

Args only:

select(df, 
    s1 => r1 => t1, # Chain the first two selection queries and materialize them, only 
    s2 => r2 => t2, # the selected columns (after their corresponding transforms) are available 
    S1 => T1 => col1, ## Adds/overwrites column `col1` of the newly materialized table.
    s3 => r3 => t3, # Keep chaining until `S2 => T2 => col2` is met, then
    s4 => r4 => t4  # materialize again. For this phase, `col1` is available
    ... 
    ;
    kwargs  ## are up for grabs 
)

I think that the relative position of S1 => T1 => col1 in the query is an interesting area that will result in a more flexible design -- do I want to create this column after a certain transformation took place or before that? This "args only" approach is basically a notation for:

select(
    select(
        select(df,
            s1 => r1 => t1, 
            s2 => r2 => t2;
            col1 = S1 => T1), 
        s3 => r3 => t3, 
        s4 => r4 => t4
        ...;
        col2 = S2 => T2), 
    ...)

Things like cols(Float64) => bycol(scale) seem to go against the convention we use (in combine for now, but probably soon in select too) that cols => fun passes all columns in cols to fun, rather than applying fun separately to each column in cols. One needs to do cols .=> fun for that. See JuliaData/DataAPI.jl#10 for an implementation for custom selections.

bytab looks like it could be a special case of bycol as it passes whole columns. When not using .=>, bycol could pass a named tuple of columns, which would replace bytab.

If you want to pass all columns into the function, you can use cols(Float64) => bytab(f) instead of cols(Float64) => bycol(f), is this what you meant?

Using broadcasting on the pair constructor (.=>) is a very interesting idea. I'm not sure how to integrate it with the transformation / renaming of multiple columns, though.

This is a summary of my current approach:

| Wrapper             | Inner function signature    | (For each column in `s`) stores results in 
|---------------------|-----------------------------|-----------------------------------------------------------
| `s => bycol[!](f)`  | `f(column)`                 | replaces the source `column` [inplace]
| `s => byrow[!](f)`  | `f(rowtable, name::Symbol)` | replaces `column` that corresponds to `name` [inplace]
| `s => bytab[!](f)`  | `f(coltable, name::Symbol)` | replaces `column` that corresponds to `name` [inplace]
| `s => bycol[!].(f)` | `f(element)`                | replaces `column` that corresponds to `element`s [inplace]
| `s => byrow[!].(f)` | `f(row, name::Symbol)`      | replaces `column` that corresponds to `name` [inplace]

The column is usually an AbstractVector, rowtable is a Vector of NamedTuples (the rows), and coltable is a named tuple (keys ~ colnames, values ~ columns).

If what I'm currently doing is the behavior you'd expect for .=>, what would the table look like for => (that doesn't apply results to each column)?

| Wrapper              | Inner function signature    | Stores results in
|----------------------|-----------------------------|-----------------------------------------------------------
| `s => bycol[!](f)`   | `f(columntable)`            | ?
| `s => byrow[!](f)`   | `f(rowtable, name::Symbol)` | ?
| `s => bycol[!].(f)`  | ?                           | ? "Broadcasting over named tuple"
| `s => byrow[!].(f)`  | `f(row, name::Symbol)`      | ?

The fact that byrow passes (rowtable, name) to the user-provided function sounds inconvenient for the most common case where you just want to apply an element-wise transformation to each row (as discussed at #1952). As in combine, byrow could pass the column vector directly to the function (instead of a named tuple) if .=> is used. It could make sense to do the same as eachcol, i.e. pass a Boolean as the second argument to byrow to decide whether the name is passed or not (no by default).

You could use byrow.() (note the dot) to pass (row, name), which I assume is much more useful than the variant without the dot. I'm not happy with the name::Symbol arg, though. If you want to apply a function elementwise, then there is also bycol[!].(), which should work just fine.

I hope that in time, I'll develop macro alternatives that would behave differently based on the input function signature (at least I hope that it is possible:-)), so that @byrow.((row, name) -> f(row, name)) and @byrow.(row -> f(row)) would pass the required inputs intelligently. Your idea with the Bool switch seems pretty viable too.

In DataFrames we throw errors by default when duplicate names are generated, unless makeunique=true is passed. Maybe you could do the same thing? Same for incorrect length of new names: in Julia we generally avoid printing warnings when something is incorrect: either it's correct and it works, or it's not and it fails (with an option to allow it).

That makes a lot of sense, throwing an error by default seems like a better idea.
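
The "error by default, opt in to deduplication" behavior agreed on here can be sketched in plain Julia as follows. The function name `unique_names` and the `_1`, `_2` suffix scheme are assumptions for illustration only; this is not the code used by DataFrames.jl or Selections.jl.

```julia
# Hypothetical helper: reject duplicate column names unless makeunique=true,
# in which case later duplicates get a numeric suffix.
function unique_names(names::Vector{Symbol}; makeunique::Bool=false)
    allunique(names) && return names
    makeunique || throw(ArgumentError(
        "duplicate column names found; pass makeunique=true to deduplicate them"))
    seen = Set{Symbol}()
    out = Symbol[]
    for n in names
        candidate = n
        k = 1
        # Append _1, _2, ... until the candidate does not clash with an earlier name.
        while candidate in seen
            candidate = Symbol(n, "_", k)
            k += 1
        end
        push!(seen, candidate)
        push!(out, candidate)
    end
    return out
end

deduplicated = unique_names([:a, :b, :a]; makeunique=true)
```

Calling `unique_names([:a, :a])` without the keyword throws an `ArgumentError` instead of printing a warning, which matches the "it works or it fails" convention described above.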

The -cols(:a) syntax looks like it's inspired by R, but it's not very Julian. Do you think !cols(:a) could work? Or maybe not(:a) is enough? BTW, why have not when Not already exists?

Yes, this is indeed inspired by R. I liked that I didn't have to wrap integers in cols in order to invert the selection. In fact, both ! and ~ currently work for negating selections, as does not(:a). I think forbidding - is not a big problem.

Ad not() vs Not() -- starting with a capital letter felt wrong... there is no "negation" struct in Selections, each selection carries around a bool that determines if the selection should be inverted or not.

What's the difference between cols(T) and if_eltype(T)? I find it possibly a bit confusing that the latter also includes Union{T, Missing}, which differs from what eltype returns.

cols is a general function that translates its inputs into actual selections using constructor functions like if_eltype (and if_eltype is actually just a special case of if_values) -- so cols(T) actually calls if_eltype. I agree that the Union{T, Missing} default might be confusing, but that was the behavior I wanted to use most of the time (maybe I can add a kwarg to turn this behavior off?). If T? would mean Union{T, Missing} I wouldn't hesitate, for sure.

Likewise, is if_match needed or could it be replaced with cols(r"...")? I even wonder whether if_keys, if_values and if_pairs couldn't be replaced with cols((k, v) -> ...) (you wouldn't use keys or values if you don't care about them).

Yes, this is a very similar situation. if_matches is a special case of if_keys. cols(::Regex) calls if_matches.

I'm afraid that cols(::Callable) defaulting to if_pairs() might confuse people -- if they provide a single argument function, they'll get a MethodError that won't be very helpful.

all_cols seems redundant with All() from DataAPI. Though maybe it's needed for technical reasons. Also looks like colrange is similar to Between, though more powerful since it accepts a step argument (that would be added to Between).

Ad DataAPI: I didn't use the names from DataAPI, because they felt a little restrictive -- I wanted to help the user by providing common prefixes to functions (like if_, key_ and by). I'm very open to changing the names though. With your proposal that cols(::Callable) fall back on if_pairs, I think the user could use cols() to achieve pretty much everything. Maybe an alias() function could replace the keys_ functions as well?

Ad all_cols(): Currently cols() without any arguments throws an error; all_cols() fits into a pattern of what I called "contextual" selections (along with other_cols() and else_cols()). But cols() might as well return all_cols(). As I've said, I'm open to change; I hope these discussions will help me figure out better names / semantics.

@nalimilan
Member

If you want to pass all columns into the function, you can use cols(Float64) => bytab(f) instead of cols(Float64) => bycol(f), is this what you meant?

Yes. The difference between => and .=> could more or less replace that between byrow/bycol and byrow./bycol. (I think).

If what I'm currently doing is the behavior you'd expect for .=>, how would the table look like for => (that doesn't apply results to each column)?

The idea is just that => passes all columns to the function, rules regarding the return value are the same as for .=>.

You could use byrow.() (note the dot) to pass (row, name), which I assume is much more useful than the variant without the dot. I'm not happy with the name::Symbol arg, though. If you want to apply a function elementwise, then there is also bycol[!].(), which should work just fine.

I guess what I find weird is that bycol calls the provided function on each column (vector), so I would expect byrow to call the function on each row (named tuple) or on each element if you selected a single column. Also a vector of named tuples is really not an efficient representation of a data frame so I'm not sure it's very useful to pass that by default with byrow.

Anyway this discussion sounds like the same problem as #1952, so maybe we can find a common solution or pattern.

I hope that in time, I'll develop macro alternatives that would behave differently based on the input function signature (at least I hope that it is possible:-)), so that @byrow.((row, name) -> f(row, name)) and @byrow.(row -> f(row)) would pass the required inputs intelligently. Your idea with the Bool switch seems pretty viable too.

Yes, that's certainly doable, that could even be just @byrow f(row, name). I don't think the dot would work in that position, though. See DataFramesMeta for something similar.

Ad not() vs Not() -- starting with a capital letter felt wrong... there is no "negation" struct in Selections, each selection carries around a bool that determines if the selection should be inverted or not.

Well that's probably something to consider: with InvertedIndices's Not, a negation can just wrap any selection, and you can dispatch on Not{T} if you want. That's what we do in DataFrames, and probably Not will be moved to Base at some point. Anyway what matters most is the user interface: if Not can be used everywhere in Julia indexing it sounds better to use that.
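
The two designs being compared (a Bool flag stored in every selection versus a wrapper type that can negate any selection) differ mainly in where dispatch happens. Below is a minimal sketch of the wrapper approach in plain Julia. `Negated` is a hypothetical stand-in written for this example so that it runs without packages; it is not InvertedIndices' `Not`, and `resolve` is an illustrative name.

```julia
# Hypothetical wrapper type standing in for a Not-style negation.
struct Negated{T}
    inner::T
end

# Resolve a selection to a vector of column names.
resolve(names::Vector{Symbol}, sel::Symbol) = [sel]
resolve(names::Vector{Symbol}, sel::Regex) =
    [n for n in names if occursin(sel, String(n))]

# Negation is handled once, for any inner selection type, by dispatching on the wrapper.
resolve(names::Vector{Symbol}, sel::Negated) =
    setdiff(names, resolve(names, sel.inner))

names = [:A_ID, :b_ID, :C_ID]
lowercase_start = resolve(names, Negated(r"^[A-Z]"))
double_negation = resolve(names, Negated(Negated(:b_ID)))
```

With the wrapper, new selection types get negation for free as long as they implement `resolve`, whereas the flag-based design requires each selection to carry and honor its own inversion state.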

Ad DataAPI: I didn't use the names from DataAPI, because they felt a little restrictive -- I wanted to help the user by providing common prefixes to functions (like if_, key_ and by). I'm very open to changing the names though. With your proposal that cols(::Callable) fall back on if_pairs, I think the user could use cols() to achieve pretty much everything. Maybe an alias() function could replace the keys_ functions as well?

Yes, that would be a possibility.

Ad all_cols(): Currently cols() without any arguments throws an error; all_cols() fits into a pattern of what I called "contextual" selections (along with other_cols() and else_cols()). But cols() might as well return all_cols(). As I've said, I'm open to change; I hope these discussions will help me figure out better names / semantics.

Actually I wasn't suggesting to have cols() return all_cols(), but to use All() instead of all_cols() (if technically possible) since we already use All() in DataFrames.

@bkamins
Member

bkamins commented Apr 16, 2020

Given the current functionality of select, select!, transform, transform!, rename, rename!, and combine in DataFrames.jl I think this should be closed. Still Selections.jl is a nice extension - it would be good to make sure it works well in combination with DataFrames.jl.

@quinnj
Member

quinnj commented Apr 16, 2020

I would still love to abstract all the selection/transform stuff into a Tables.jl-based package someday, probably once things settle down more.

@bkamins
Member

bkamins commented Apr 16, 2020

Sure - the design is mostly independent of DataFrame internals (except for the ! methods, which should rather not go into this umbrella package anyway).

@nalimilan
Member

It could be interesting to reconsider how Selections.jl fits in the new system with select and AsTable in particular. Also if we decide not to allow All() .=> f and Between() .=> f it could make sense to have a common wrapper that allows broadcasting common to DataFrames and Selections (like Spread or just All as discussed at #2171).

@bkamins
Member

bkamins commented Oct 26, 2021

Closing this as the discussion did not have a follow up.

@bkamins bkamins closed this as completed Oct 26, 2021