Extend setindex api #1571

coreywoodfield · 2018-10-16T23:35:08Z

This resolves #1569

I wasn't sure if we want the second commit, but it was pretty easy to add so I just did it (and currently, trying to make such a call interprets the iterable as a scalar and broadcasts it to all the columns in the row - that may be desirable when the columns hold collections, but that seems like an uncommon use case, if the columns don't hold collections it just throws an error and doesn't do anything). I'm happy to remove it though if we don't want it.

One thing that I considered that could be added to this is that perhaps we want to allow updating just some values in a row with a named tuple (or a dictionary), I just wasn't sure what syntax we would want. It'd be simplest to allow using a named tuple missing some columns in the assignment df[1, :] = tuple, and only update the columns that the tuple has, but it'd be more clear if we required df[1, keys(tuple)] = tuple. I'm happy to add either one to this pull request if we want.

nalimilan

Thanks. I think another case to cover is col_inds::AbstractVector{Bool}.

I also wonder how this interacts with @bkamins work on indexing.

src/dataframe/dataframe.jl

nalimilan · 2018-10-17T08:30:27Z

src/dataframe/dataframe.jl

@@ -500,6 +500,57 @@ function Base.setindex!(df::DataFrame,
    return df
 end

+# df[SingleRowIndex, :] = Union{AbstractDict, NamedTuple}
+function Base.setindex!(df::DataFrame,
+                        row::Union{AbstractDict, NamedTuple},


Can you also allow DataFrameRow, and add tests for it?

DataFrameRow isn't defined until after these functions are, how should I deal with that? I could move the definition of DataFrameRow up, or add a definition of setindex! after DataFrameRow gets defined. Both of those don't seem like great options to me, but I'm leaning towards adding a definition of setindex! in dataframerow.jl

Alternatively I could take in an AbstractDataFrame and then verify the size, that might be best

Yes, sounds like the best approach. BTW, there's a method for DataFrame above which doesn't look great (it uses the column order rather than names, and doesn't handle errors).

So should I use column names strictly? I could see a few different use cases for setindex! where it takes in a dataframe, which gives us a few different options.

We could just use column names and throw an error if every column name doesn't match

We could check to see if the column names match, use them if they do, or just use column order if they don't

We could only use column order

I'm not sure which one is really the behavior that people would expect (if someone is just thinking of the dataframe as a fancy matrix, then all they're worrying about is the dimensions), but I think the second option is the most forgiving of different expectations, so I'd think we should probably do that. If we do do that, it seems appropriate that we should also change the code for df[MultiRowIndex, MultiColumnIndex] = DataFrame around line 580 to have the same behavior. It seems like we should be able to expect that if we were to delegate one to the other that they'd give the same behavior

What we do with push! and vcat is to use names and ignore order, so we should just do that here too for consistency.

nalimilan · 2018-10-17T08:35:55Z

src/dataframe/dataframe.jl

+end
+# df[SingleRowIndex, MultiColumnIndex] = iterable
+function Base.setindex!(df::DataFrame,
+                        iterable::AbstractVector,


Should be Any if you want to support any iterable. But that will require more changes and a deprecation phase, since we currently allow e.g. df[1,:] = 1, which should be deprecated in favor of df[1,:] .= 1. Unless you want to dive into this, just rename the argument to something like v. Also it would be nice to allow Tuple and test it.

I changed it to Any and it spits out a deprecation warning if a single value is passed in. df[1,:] .= 1 isn't working and I'm looking into that

src/dataframe/dataframe.jl

test/dataframe.jl

coreywoodfield · 2018-10-18T03:56:40Z

So I've got setting a row in a dataframe with a 1-row dataframe in a place where I'm happy with it, but there may be a few things left to do, and it may be a few days before I can try and tackle them. I added two @test_skip tests in the setindex! tests for things that were mentioned in the discussion here but aren't working yet, those are broadcasted assignment and setting a row with a DataFrameRow (which we sort of decided not to do, I still think it would make sense to have though). Beyond that I think it'd be good to add checks on calling setindex! with a DataFrame, and I've done some work on that that I have locally on my machine, but a lot of it was overlapping with the methods that take in Colons so I'd need to dedicate a bit more time to make sure I understand how each of those is delegating before I'd be confident that I had a working method for setting from a dataframe.

Here's what I have left to do:

Get broadcasted assignment working (or decide to not deprecate implicit broadcasting of scalar values--I think my check here is pretty robust, and will do what the user expects)
Add a method to setindex! that takes in a DataFrameRow, or decide not to do it, as it'd have to be after the definition of DataFrameRow, and thus separated from the rest of the setindex! definitions
Add error checking to the setindex! methods that take in a DataFrame, and update them to use column names rather than column order when applicable (I gathered from @nalimilan's comments that that would be preferred)

If I could get guidance on what exactly we want on these, I should be able to get them done within the next week or two

…ataframe

coreywoodfield · 2018-10-25T22:05:58Z

@nalimilan I implemented the third point, so that column names are now used instead of order. I had to change a few tests so that they didn't break, and it seems like this change could affect a lot of existing workflows, so I was thinking we should probably have a deprecation period, where we allow the old behavior if column names don't match, but also spit out a deprecation warning. What do you think?

nalimilan · 2018-10-28T09:48:18Z

Yes, it would be nice to print deprecations to avoid breaking code. Also, gave you looked at how broadcasting could be implemented for setindex?

This is likely to have conflicts with #1534.

nalimilan · 2018-10-28T09:30:07Z

src/dataframerow/dataframerow.jl

@@ -4,6 +4,15 @@ struct DataFrameRow{T <: AbstractDataFrame}
    row::Int
 end

+# this is defined here because it has to be after the DataFrameRow definition


I don't really like this, it breaks the current organization of the indexing code. Can't you change the inclusion order of files so that DataFrameRow is defined before DataFrame? If that's not possible, we could move all type definitions to a separate file which would be included first.

nalimilan · 2018-10-28T09:33:35Z

src/dataframe/dataframe.jl

+    end
+end
+
+function checkpreconditions(df::DataFrame,


Better call checknames and checkdimensions directly, defining this helper method actually requires more lines of code.

nalimilan · 2018-10-28T09:33:48Z

src/dataframe/dataframe.jl

+end
+
+function checknames(df::DataFrame,
+                  new_df::AbstractDataFrame,


Please align arguments (here and elsewhere).

nalimilan · 2018-10-28T09:34:11Z

src/dataframe/dataframe.jl

+    new_rows, new_cols = size(new_df)
+    # note that length(::Number) == 1
+    if (new_rows, new_cols) != (length(row_inds), length(col_inds))
+        msg = "Tried to assign $new_rows×$new_cols dataframe to " *


Should be "data frame" (here and elsewhere).

nalimilan · 2018-10-28T09:37:38Z

test/cat.jl

-        df[1, 1:2] = df[2, 2:3]
-        df[1:2, 1:2] = df[2:3, 2:3]
-        df[[true,false,false,true], 2:3] = df[1:2,1:2]
+        df[1, 1:2] = df[2, 1:2]


Do we have checks that an error is thrown when names don't match?

nalimilan · 2018-10-28T09:44:38Z

src/dataframe/dataframe.jl

-    for j in 1:length(col_inds)
-        insert_single_entry!(df, new_df[j][1], row_ind, col_inds[j])
+    # remove this check after implicit scalar broadcasting is deprecated
+    if applicable(length, v) && Base.IteratorSize(v) != Base.HasShape{0}()


Maybe add a special case for AbstractString too, since these are HasLength but they act as scalar when broadcasting, and are quite common in data frames uses. Also put the IteratorSize call first since it can be resolved statically (using isa HasShape{0} or !== would help), but currently applicable cannot.

nalimilan · 2018-10-28T09:46:21Z

src/dataframe/dataframe.jl

+              "across columns is deprecated, use .= instead, i.e. " *
+              "df[row, :] .= val"
+        Base.depwarn(msg, :setindex!)
+        for col_ind in col_inds


The try... catch logic is also needed here, right? Same below.

bkamins · 2018-10-28T14:37:59Z

src/dataframe/dataframe.jl

-end
-function Base.setindex!(df::DataFrame,
-                        v::Any,
+                        row::Union{AbstractDict, NamedTuple},


row_ind must be Integer, but not Bool.

bkamins · 2018-10-28T14:39:52Z

src/dataframe/dataframe.jl

-    for col_ind in col_inds
-        insert_single_entry!(df, v, row_ind, col_ind)
+                        ::Colon)
+    old_row = df[row_ind, :]


This will create DataFrameRow so we cannot use it to handle cleanup.

Indeed. So I guess the best solution is to call copy?

Right point.

bkamins · 2018-10-28T14:42:14Z

src/dataframe/dataframe.jl

-        insert_single_entry!(df, v, row_ind, col_ind)
+                        ::Colon)
+    old_row = df[row_ind, :]
+    for nm in _names(df)


I would think that we should be sure that names(df) matches exactly names of NamedTuple/AbstractDict and not is a subset. Why do we want to accept the subset.

Whatever way we choose I would check this condition instead of try/catch and then cleanup.

Agreed. Though try... catch is still needed to handle conversion issues AFAICT.

bkamins · 2018-10-28T14:44:26Z

src/dataframe/dataframe.jl

                        row_ind::Real,
                        col_inds::AbstractVector{Bool})
-    setindex!(df, new_df, row_ind, findall(col_inds))
+    setindex!(df, v, row_ind, findall(col_inds))


we should check here if col_inds has a correct length.

bkamins · 2018-10-28T14:49:46Z

src/dataframe/dataframe.jl

 end
 function Base.setindex!(df::DataFrame,
-                        new_df::DataFrame,
+                        v::Any,
                        row_ind::Real,
                        col_inds::AbstractVector{<:ColumnIndex})


We should handle the case when col_inds have duplicates also AbstractVector{<:ColumnIndex} allows mixing numbers and symbols which should not be allowed.

bkamins · 2018-10-28T14:50:20Z

src/dataframe/dataframe.jl

+            msg = "Length of iterable does not match DataFrame column count."
+            throw(ArgumentError(msg))
+        end
+        old_row = df[row_ind, col_inds]


This will create a DataFrameRow so we cannot use it as a fallback

bkamins · 2018-10-28T14:52:27Z

src/dataframe/dataframe.jl

+                         col_inds::Any)
+    new_rows, new_cols = size(new_df)
+    # note that length(::Number) == 1
+    if (new_rows, new_cols) != (length(row_inds), length(col_inds))


do we handle the case that row_inds or col_inds have duplicates?
Also do we avoid calling this method if either of them is AbstractVector{Bool}?

bkamins

Great work! Thank you.

I have added some comments, but it will need another round probably. Can you please check specification of getindex #1534 (new file in docs) to make sure that we are consistent.

bkamins · 2018-10-28T14:54:54Z

src/dataframe/dataframe.jl

+                        new_df::AbstractDataFrame,
+                        row_ind::Real,
+                        col_inds::AbstractVector{Bool})
+    setindex!(df, new_df, row_ind, findall(col_inds))


we should check the length of col_inds here.

bkamins · 2018-10-28T14:55:23Z

src/dataframe/dataframe.jl

+function Base.setindex!(df::DataFrame,
+                        new_df::AbstractDataFrame,
+                        row_ind::Real,
+                        col_inds::AbstractVector{<:ColumnIndex})


this allows mixing integers and symbols.

bkamins · 2018-10-28T14:56:03Z

src/dataframerow/dataframerow.jl

+# df[SingleRowIndex, MultiColumnIndex] = DataFrameRow
+function Base.setindex!(df::DataFrame,
+          val::DataFrameRow,
+          row_ind::Real,


it can be only Integer other than Bool.

bkamins · 2018-10-28T15:34:31Z

I would also recommend, after rebasing it against #1534 (when it is merged) to add a specification of setindex! to docs/lib/indexing.md so that we can check if what is implemented is the same as what is indented (and ideally this specification should cover both = and .= assignment options like getindex also covers view now).

bkamins · 2018-10-29T10:48:56Z

@coreywoodfield, I am aware that the number of things I have outlined is large. Therefore, if you do not have time to work on all of that I think that the minimal I would recommend is:

rebase against WIP: clean up getindex: part 2 #1534 as it is now merged;
concentrate in code only on the features you want to change (not fixing all other things) - but in those changes be consistent with WIP: clean up getindex: part 2 #1534;
add a specification of RHS of setindex! in the cases you cover to the documentation page where getindex is now documented (especially outlining how data is interpreted in RHS in cases when in can contain column names and what is the contract assumed).
LHS of setindex! should strictly conform the LHS specification of getindex.

The rest can be done later (I can work on it since #1534 is finished, but of course you are also welcome - let us coordinate the efforts 😄). Thank you!

coreywoodfield · 2018-12-11T22:10:46Z

Sorry I'm taking so long on this, I'm still intending to finish it, but I've been swamped with school and other things. I wanted to tackle all of the setindex! stuff, but at this point I'd like to just get the Dict code working, and then I might try and contribute more depending on how much time I have

bkamins · 2018-12-11T23:18:14Z

Thank you for your contribution. Very soon (after #1603 is merged) similar work has to be done for setindex! (I have outlined the key steps above).

Given your last comment I would recommend the following approach:

you finish Dict code (and concentrate only on this case)
I plan to write down rules for setindex! in the next few weeks in docs/src/lib/indexing.md for DataFrame, SubDataFrame and DataFrameRow - when the plan is proposed we should discuss it (if you are willing to do it please comment)
Then a PR or set of PRs with deprecations only will have to be implemented
Next we have to release DataFrames.jl
Finally we need a PR with the target functionality

bkamins · 2018-12-16T14:28:30Z

A key question here is do we want to setindex! to work by name matching on the RHS. This is what you propose (and it is reasonable), but this is not what we do with setindex! now with a data frame on RHS. In general I would go for this direction, i.e. we select a column in the target, get its name and look for this name in source.

bkamins · 2019-07-24T14:46:09Z

@coreywoodfield After 0.19 I will implement the rules set in https://juliadata.github.io/DataFrames.jl/latest/lib/indexing/ soon. So I think this PR can be closed (but please comment).

@coreywoodfield + @nalimilan The only question is if we should allow AbstractDict as RHS (I planned to disallow it, but we allow it in push! so maybe we also can allow it for setindex!). A comment would be appreciated.

nalimilan · 2019-07-25T13:45:53Z

Why not allow dicts, but that sounds like low priority.

bkamins · 2019-07-25T13:49:41Z

I will write the code soon anyway - I can add dicts then as this is simple

coreywoodfield · 2019-07-25T17:16:21Z

Sounds good, thanks!

I think that generally it makes sense for the allowed signatures for push! to match those of setindex!. If we allow things to be done with push! that we don't allow with setindex!, then it seems to me like we're promoting bad practices (namely, creating an empty dataframe and filling it row by row rather than allocating a dataframe with the appropriate size and then filling it). If someone already has Dicts, and they can push! Dicts, it only makes sense to me that they should also be able to setindex! with Dicts. So in my view, we should either allow it in both places or disallow it in both places (but preferably allow it)

bkamins · 2019-07-25T17:33:49Z

Agreed. This is what we will do.

coreywoodfield added 2 commits October 16, 2018 17:26

Extended setindex! to allow df[row, :] = dict

7e64b16

Extended setindex! to allow df[row, cols] = iterable

59d55a0

nalimilan reviewed Oct 17, 2018

View reviewed changes

coreywoodfield added 4 commits October 17, 2018 11:31

fixup! Extended setindex! to allow df[row, :] = dict

b51dafe

fixup! Extended setindex! to allow df[row, cols] = iterable

76ff582

fixup! Extended setindex! to allow df[row, cols] = iterable

094b0fd

Updated code for setting one row of a dataframe with a 1-row dataframe

aded09d

coreywoodfield force-pushed the extend-setindex-api branch from aabd0df to aded09d Compare October 18, 2018 03:41

coreywoodfield added 3 commits October 25, 2018 15:21

fixup! Updated code for setting one row of a dataframe with a 1-row d…

dae6f69

…ataframe

Added method to df setindex! to take a DataFrameRow

5089079

Added checks for df[x,y] = other_df

1207575

nalimilan requested a review from bkamins October 28, 2018 09:48

nalimilan reviewed Oct 28, 2018

View reviewed changes

bkamins reviewed Oct 28, 2018

View reviewed changes

bkamins requested changes Oct 28, 2018

View reviewed changes

bkamins mentioned this pull request Oct 28, 2018

WIP: clean up getindex: part 2 #1534

Merged

bkamins mentioned this pull request Nov 3, 2018

Fix depwarns of getindex #1585

Merged

pdeffebach mentioned this pull request Nov 10, 2018

[Deprecation] groupby JuliaData/DataFramesMeta.jl#116

Closed

bkamins mentioned this pull request Dec 16, 2018

setindex!/broadcast! design #1645

Closed

coreywoodfield closed this Jul 25, 2019

Extend setindex api #1571

Extend setindex api #1571

Conversation

coreywoodfield commented Oct 16, 2018

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coreywoodfield Oct 17, 2018 • edited Loading

Choose a reason for hiding this comment

coreywoodfield commented Oct 18, 2018 • edited Loading

coreywoodfield commented Oct 25, 2018

nalimilan commented Oct 28, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins commented Oct 28, 2018

bkamins commented Oct 29, 2018 • edited Loading

coreywoodfield commented Dec 11, 2018

bkamins commented Dec 11, 2018

bkamins commented Dec 16, 2018

bkamins commented Jul 24, 2019

nalimilan commented Jul 25, 2019

bkamins commented Jul 25, 2019

coreywoodfield commented Jul 25, 2019

bkamins commented Jul 25, 2019

coreywoodfield Oct 17, 2018 •

edited

Loading

coreywoodfield commented Oct 18, 2018 •

edited

Loading

bkamins commented Oct 29, 2018 •

edited

Loading