Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accepting array element in rows specificed by named tuples, in combine #3335

Closed
jonalm opened this issue May 22, 2023 · 23 comments · Fixed by #3348
Closed

Accepting array element in rows specificed by named tuples, in combine #3335

jonalm opened this issue May 22, 2023 · 23 comments · Fixed by #3348
Labels
Milestone

Comments

@jonalm
Copy link

jonalm commented May 22, 2023

Creating a DataFrame from a list of named tuple works fine, even when the nt elements contain arrays

DataFrame([
    (; foo="bar", boo=[1,2]),
    (; foo="baz", boo=[1,2])
]
)

But that is not the case if the DataFrame is created by a combine operations

combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
    return (;
        foo="foo",
        boo=[1,2]
    )
end

gives

    nested task error: ArgumentError: mixing single values and vectors in a named tuple is not allowed

The above example would work as expected if one removes these two lines:

throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed"))
and
throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed"))

An obvious use-case for needing this is when you want some ordered indices or values from the group, in a single row for each group.

Would you be willing to reconsider these exceptions, and allow arrays as named tuple elements when using combine?

@bkamins bkamins added this to the 1.6 milestone May 22, 2023
@bkamins
Copy link
Member

bkamins commented May 22, 2023

Would you be willing to reconsider these exceptions, and allow arrays as named tuple elements when using combine?

Please correct me if I am wrong, but this would imply that (; x=[1, 2, 3], y=[2, 3, 4]) would be interpreted as a named tuple defining a single row. Whereas currently, it defines three rows.

@jonalm
Copy link
Author

jonalm commented May 23, 2023

@bkamins I guess there are implicitly two suggestions here for the output of the function argument to combine:

  1. (Non breaking) remove the above mentioned error, and allow named tuple outputs with a combinations of scalar and array values, and interpret those as rows in the resulting DataFrame.

  2. (Breaking) change the behaviour such that named tuple output always produce a row in the resulting DataFrame.

The rational for 2 is that it is easy to explicitly wrap the output (of the function argument to combine) into a DataFrame, if you want the combined DataFrame to be a stack of DataFrames. But it is not clear (at least not to me) how to return a row with both scalar and array elements. It would be natural to achieve that with a named tuple output (always interpreted as a row). 1 would be a non-breaking quick fix to have this functionality for named tuples with mixed (scalar/array) elements, but may render the behaviour unexpected to some users.

My preference is shaped by my usage of combine: in the majority of cases I want to stack row outputs for each group.

@bkamins
Copy link
Member

bkamins commented May 23, 2023

But it is not clear (at least not to me) how to return a row with both scalar and array elements.

It was discussed on Slack.
If you have f() = (; x=1, y=[1,2])
You should return instead of:

julia> f()
(x = 1, y = [1, 2])

rather wrap all in additional vectors

julia> map(x -> [x], f())
(x = [1], y = [[1, 2]])

or equivalently:

julia> map(Base.vect, f())
(x = [1], y = [[1, 2]])

(or directly in definition of f wrap vectors in vectors)

I am aware that it is a bit inconvenient.

We will not make the option 2. (breaking change) because it is breaking.

Option 1. could be acceptable, but it is problematic as the rule determining how named tuple is handled would become even more complex than it is now (in general: users already find combine processing rule to have many cases)

Indeed in the code when have these two errors thrown because we wanted to leave the freedom in the future to decide what to do in that case (@nalimilan style 😄). So the conclusion is that we can discuss it, but we want to reach a consensus before making such a fundamental change. One of the reasons for this is that typically users could expect that (; x=1, y=[1, 2]) should be expanded to DataFrame(x=[1,1], y=[1,2]) instead (as discussed - this is what normally would DataFrame constructor do).

@jonalm
Copy link
Author

jonalm commented May 23, 2023

So if I understand you correctly @bkamins: the best way of having a single row output from the combining function, which is robust against the potential content of that row (in the above mentioned case: arrays), is to wrap that row into a DataFrame with one single row, or to wrap it into a named tuple which would be expanded into a single rowed DataFrame in the DataFrame constructor?

If that is the case: wouldn't it be more natural to have a convention for explicitly returning a row? If named tuples are out of the question, could we have a Row-like type (DataFrameRow?) which acts like a named tuple?

@bkamins
Copy link
Member

bkamins commented May 23, 2023

DataFrameRow - you can use it already. The only limitation is that it is not convenient to create it currently.

Indeed, though, it would be useful to add support for:

combine(gdf, src => Tables.Row ∘ fun => AsTable)

where your fun is returning a named tuple.

Do you think it would be useful to add something like this?

CC @nalimilan, @pdeffebach, @ararslan

The change it would require in the minilanguage is to add support for Tables.AbstractRow, just like we did it for pushing in #3245.

I think it would be OK to add it.

@pdeffebach
Copy link
Contributor

pdeffebach commented May 23, 2023

I wrote a package for this purpose at some point. I didn't make it very far clearly, but the bones are here in DataRows.jl.

But party of the reason it was abandoned was precisely because combine(f, gd) doesn't play well with Tables.jl. combine(f, gd) doesn't check if something is Tables.istable and behave accordingly. We are stuck with NamedTuples, which as you mention have an ambiguity in the case of mixed types.

I mentioned this on Slack, but I do think the long term solution is 2.0, where combine(f, df) is re-written to match combine(gd, src => fun => dest).

As for the current behavior, I think it's not a big deal to require the use of arrays. If performance is important, Refs are always available.

@bkamins
Copy link
Member

bkamins commented May 23, 2023

doesn't check if something is Tables.istable and behave accordingly.

The reason is that Tables.istable is not a very reliable function. I prefer not to use it. Having said that, here we are talking about Tables.AbstractRow subtypes, which is a well defined interface.

As for the current behavior, I think it's not a big deal to require the use of arrays

Still, I think recognizing Tables.AbstractRow is a simple fix that should not be problematic.


My question is, for 1.x release branch of DataFrames.jl if we feel that adding Tables.AbstractRow support is OK (and robust for 2.x release)?

@bkamins
Copy link
Member

bkamins commented May 23, 2023

@pdeffebach - to expand. Tables.istable is trait based, which is fragile. Tables.AbstractRow rule is dispatch based, which makes it more robust.

@jonalm
Copy link
Author

jonalm commented May 24, 2023

Indeed, though, it would be useful to add support for:

combine(gdf, src => Tables.Row ∘ fun => AsTable)

where your fun is returning a named tuple.

Do you think it would be useful to add something like this?

@bkamins I'm not sure I understand the implications of your suggestion. How I would use combine with the proposed solution?

Also, perhaps a bit tangential, why do you consider a trait based approach to be fragile?

@bkamins
Copy link
Member

bkamins commented May 24, 2023

How I would use combine with the proposed solution?

Just as I have written above:

combine(gdf, src => Tables.Row ∘ fun => AsTable)

passing return value of fun to Tables.Row forces the value returned by fun to be treated as one row.

why do you consider a trait based approach to be fragile?

For two reasons:

  • major: if Tables.istable returns true you are sure that you are working with a table; but if it returns false it does not mean that it is not a table (it only means that it might not be a table)
  • minor: with type-based dispatch we are sure that the behavior cannot change dynamically; with trait based dispatch the behavior can be changed by the user (i.e. the user could define a method for Tables.istable and change the behavior of DataFrames.jl - indeed sometimes it could be useful, but in general I prefer to be able to determine statically how DataFrames.jl will evaluate some expression in cases it is possible)

(but this is unrelated to your issue, as table-row is not a table, so Tables.istable will not help here)

@jonalm
Copy link
Author

jonalm commented May 24, 2023

How I would use combine with the proposed solution?

Just as I have written above:

combine(gdf, src => Tables.Row ∘ fun => AsTable)

sorry for being an idiot, but I don't get what src would be in your example?

@bkamins
Copy link
Member

bkamins commented May 24, 2023

src is a specification of source columns that should be passed to fun. See "Basic transformations" section (and sections that follow it) in https://bkamins.github.io/julialang/2020/12/24/minilanguage.html.

@jonalm
Copy link
Author

jonalm commented May 24, 2023

@bkamins Thanks, and thanks for elaborating on the trait issue.

I think your proposal would be very useful.

@pdeffebach
Copy link
Contributor

pdeffebach commented May 24, 2023

My only comment is that a DataFrames-specific type, like AsRow might be more useful.

@bkamins
Copy link
Member

bkamins commented May 24, 2023

What would be the benefit of AsRow over Tables.Row that is already defined in the ecosystem?

The definition would be AsRow(fun) = Tables.Row ∘ fun. (so it would be a function not a type)

@sprmnt21
Copy link

I did not understand if such a workaround was considered and deemed unsatisfactory for some reason.
When in doubt, I propose it, so I better understand why it's not good for the case.

combine(groupby(DataFrame(:a=>[1,2,3,3]),:a)) do g
    DataFrame([(;
        b="foo",
        c=[1,2]
    )])
end

@bkamins
Copy link
Member

bkamins commented May 26, 2023

I better understand why it's not good for the case.

OP wants (; b="foo", c=[1,2]) to be interpreted as a single row not as a table that would expand :b value into two rows.

@sprmnt21
Copy link

I get that you're looking for this, isn't it?

julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
           DataFrame([(;
               foo="foo",
               boo=[1,2]
           )])
       end
2×3 DataFrame
 Row │ group  foo     boo    
     │ Int64  String  Array…
─────┼───────────────────────
   1 │     1  foo     [1, 2]
   2 │     2  foo     [1, 2]`

@jonalm
Copy link
Author

jonalm commented May 26, 2023

@sprmnt21 : Yes. This issue is a request to avoid having to construct a full DataFrame to get a row.

@bkamins
Copy link
Member

bkamins commented May 26, 2023

Ah - I missed [ in @sprmnt21 proposal. This makes sense, so what currently works is:

julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group), :group => (g ->
                  [(;
                      foo="foo",
                      boo=[1,2]
                  )]) => AsTable)
2×3 DataFrame
 Row │ group  foo     boo
     │ Int64  String  Array…
─────┼───────────────────────
   1 │     1  foo     [1, 2]
   2 │     2  foo     [1, 2]

and (this is a different output, but the same intention)

julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group), :group => ByRow(g ->
                  (;
                      foo="foo",
                      boo=[1,2]
                  )) => AsTable)
4×3 DataFrame
 Row │ group  foo     boo
     │ Int64  String  Array…
─────┼───────────────────────
   1 │     1  foo     [1, 2]
   2 │     1  foo     [1, 2]
   3 │     2  foo     [1, 2]
   4 │     2  foo     [1, 2]

What we would need to change is the "legacy" syntax (i.e. taking a single function), where the current rule is that we only recognize:

a data frame, a matrix, a NamedTuple, or a DataFrameRow

(and this is a closed list of recognized objects)

@sprmnt21
Copy link

If I'm not getting too off topic, I'd like to try to put the question in a more general way.
What I'd find easiest to follow as a mindset would be one in which the result of cols =>fun whatever is, is intended to occupy a "cell" without being implicitly "interpreted" (expanded, flattened, ...).

Here are some fictitious examples:

julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
           (; foo="foo", boo=[1,2] )
       end
2×2 DataFrame
 Row │ group  x1
     │ Int64  NamedTup…
─────┼────────────────────────────────────
   1 │     1  (foo = "foo", boo = [1, 2])
   2 │     2  (foo = "foo", boo = [1, 2])



julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
           DataFrame([(;
               foo="foo",
               boo=[1,2]
           )])
       end
2×2 DataFrame
 Row │ group  x1
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  1×2 DataFrame
   2 │     2  1×2 DataFrame


julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
           [
               (; foo="bar", boo=[1,2]),
               (; foo="baz", boo=[1,2])
           ]
       end
2×2 DataFrame
 Row │ group  x1
     │ Int64  Array…
─────┼──────────────────────────────────────────
   1 │     1  NamedTuple{(:foo, :boo), Tuple{S…
   2 │     2  NamedTuple{(:foo, :boo), Tuple{S…

The "expansion" of a NamedTuple or a DataFrame or any other "complex" structure should be explicitly required/constructed by the user.

@ararslan
Copy link
Member

What would be the benefit of AsRow over Tables.Row that is already defined in the ecosystem?

The definition would be AsRow(fun) = Tables.Row ∘ fun. (so it would be a function not a type)

AsRow(f) permits the use of do blocks, e.g.

combine(gdf, src => AsRow() do x
            # some great stuff happens here
        end => AsTable)

With Tables.Row ∘ fun, the closest to this you can get is

combine(gdf, src => Tables.Row  (x -> begin
            # some other stuff happens here
        end) => AsTable)

which is a bit more awkward to read and to write.

@nalimilan
Copy link
Member

Adding support for Tables.Row seems uncontroversial. Maybe we can try that and see whether AsRow(fun) = Tables.Row ∘ fun is really needed? The fact that you would write AsRow() and AsTable in the same transformation sounds a bit weird to me. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants