Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend setindex api #1571

Closed

Conversation

coreywoodfield
Copy link

This resolves #1569

I wasn't sure if we want the second commit, but it was pretty easy to add so I just did it (and currently, trying to make such a call interprets the iterable as a scalar and broadcasts it to all the columns in the row - that may be desirable when the columns hold collections, but that seems like an uncommon use case, if the columns don't hold collections it just throws an error and doesn't do anything). I'm happy to remove it though if we don't want it.

One thing that I considered that could be added to this is that perhaps we want to allow updating just some values in a row with a named tuple (or a dictionary), I just wasn't sure what syntax we would want. It'd be simplest to allow using a named tuple missing some columns in the assignment df[1, :] = tuple, and only update the columns that the tuple has, but it'd be more clear if we required df[1, keys(tuple)] = tuple. I'm happy to add either one to this pull request if we want.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think another case to cover is col_inds::AbstractVector{Bool}.

I also wonder how this interacts with @bkamins work on indexing.

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
@@ -500,6 +500,57 @@ function Base.setindex!(df::DataFrame,
return df
end

# df[SingleRowIndex, :] = Union{AbstractDict, NamedTuple}
function Base.setindex!(df::DataFrame,
row::Union{AbstractDict, NamedTuple},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also allow DataFrameRow, and add tests for it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrameRow isn't defined until after these functions are, how should I deal with that? I could move the definition of DataFrameRow up, or add a definition of setindex! after DataFrameRow gets defined. Both of those don't seem like great options to me, but I'm leaning towards adding a definition of setindex! in dataframerow.jl

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively I could take in an AbstractDataFrame and then verify the size, that might be best

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sounds like the best approach. BTW, there's a method for DataFrame above which doesn't look great (it uses the column order rather than names, and doesn't handle errors).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I use column names strictly? I could see a few different use cases for setindex! where it takes in a dataframe, which gives us a few different options.

  • We could just use column names and throw an error if every column name doesn't match
  • We could check to see if the column names match, use them if they do, or just use column order if they don't
  • We could only use column order

I'm not sure which one is really the behavior that people would expect (if someone is just thinking of the dataframe as a fancy matrix, then all they're worrying about is the dimensions), but I think the second option is the most forgiving of different expectations, so I'd think we should probably do that. If we do do that, it seems appropriate that we should also change the code for df[MultiRowIndex, MultiColumnIndex] = DataFrame around line 580 to have the same behavior. It seems like we should be able to expect that if we were to delegate one to the other that they'd give the same behavior

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we do with push! and vcat is to use names and ignore order, so we should just do that here too for consistency.

end
# df[SingleRowIndex, MultiColumnIndex] = iterable
function Base.setindex!(df::DataFrame,
iterable::AbstractVector,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be Any if you want to support any iterable. But that will require more changes and a deprecation phase, since we currently allow e.g. df[1,:] = 1, which should be deprecated in favor of df[1,:] .= 1. Unless you want to dive into this, just rename the argument to something like v. Also it would be nice to allow Tuple and test it.

Copy link
Author

@coreywoodfield coreywoodfield Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to Any and it spits out a deprecation warning if a single value is passed in. df[1,:] .= 1 isn't working and I'm looking into that

src/dataframe/dataframe.jl Outdated Show resolved Hide resolved
test/dataframe.jl Outdated Show resolved Hide resolved
test/dataframe.jl Show resolved Hide resolved
test/dataframe.jl Show resolved Hide resolved
@coreywoodfield
Copy link
Author

coreywoodfield commented Oct 18, 2018

So I've got setting a row in a dataframe with a 1-row dataframe in a place where I'm happy with it, but there may be a few things left to do, and it may be a few days before I can try and tackle them. I added two @test_skip tests in the setindex! tests for things that were mentioned in the discussion here but aren't working yet, those are broadcasted assignment and setting a row with a DataFrameRow (which we sort of decided not to do, I still think it would make sense to have though). Beyond that I think it'd be good to add checks on calling setindex! with a DataFrame, and I've done some work on that that I have locally on my machine, but a lot of it was overlapping with the methods that take in Colons so I'd need to dedicate a bit more time to make sure I understand how each of those is delegating before I'd be confident that I had a working method for setting from a dataframe.

Here's what I have left to do:

  • Get broadcasted assignment working (or decide to not deprecate implicit broadcasting of scalar values--I think my check here is pretty robust, and will do what the user expects)
  • Add a method to setindex! that takes in a DataFrameRow, or decide not to do it, as it'd have to be after the definition of DataFrameRow, and thus separated from the rest of the setindex! definitions
  • Add error checking to the setindex! methods that take in a DataFrame, and update them to use column names rather than column order when applicable (I gathered from @nalimilan's comments that that would be preferred)

If I could get guidance on what exactly we want on these, I should be able to get them done within the next week or two

@coreywoodfield
Copy link
Author

@nalimilan I implemented the third point, so that column names are now used instead of order. I had to change a few tests so that they didn't break, and it seems like this change could affect a lot of existing workflows, so I was thinking we should probably have a deprecation period, where we allow the old behavior if column names don't match, but also spit out a deprecation warning. What do you think?

@nalimilan
Copy link
Member

Yes, it would be nice to print deprecations to avoid breaking code. Also, gave you looked at how broadcasting could be implemented for setindex?

This is likely to have conflicts with #1534.

@@ -4,6 +4,15 @@ struct DataFrameRow{T <: AbstractDataFrame}
row::Int
end

# this is defined here because it has to be after the DataFrameRow definition
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this, it breaks the current organization of the indexing code. Can't you change the inclusion order of files so that DataFrameRow is defined before DataFrame? If that's not possible, we could move all type definitions to a separate file which would be included first.

end
end

function checkpreconditions(df::DataFrame,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better call checknames and checkdimensions directly, defining this helper method actually requires more lines of code.

end

function checknames(df::DataFrame,
new_df::AbstractDataFrame,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please align arguments (here and elsewhere).

new_rows, new_cols = size(new_df)
# note that length(::Number) == 1
if (new_rows, new_cols) != (length(row_inds), length(col_inds))
msg = "Tried to assign $new_rows×$new_cols dataframe to " *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "data frame" (here and elsewhere).

df[1, 1:2] = df[2, 2:3]
df[1:2, 1:2] = df[2:3, 2:3]
df[[true,false,false,true], 2:3] = df[1:2,1:2]
df[1, 1:2] = df[2, 1:2]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have checks that an error is thrown when names don't match?

for j in 1:length(col_inds)
insert_single_entry!(df, new_df[j][1], row_ind, col_inds[j])
# remove this check after implicit scalar broadcasting is deprecated
if applicable(length, v) && Base.IteratorSize(v) != Base.HasShape{0}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a special case for AbstractString too, since these are HasLength but they act as scalar when broadcasting, and are quite common in data frames uses. Also put the IteratorSize call first since it can be resolved statically (using isa HasShape{0} or !== would help), but currently applicable cannot.

"across columns is deprecated, use .= instead, i.e. " *
"df[row, :] .= val"
Base.depwarn(msg, :setindex!)
for col_ind in col_inds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try... catch logic is also needed here, right? Same below.

end
function Base.setindex!(df::DataFrame,
v::Any,
row::Union{AbstractDict, NamedTuple},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row_ind must be Integer, but not Bool.

for col_ind in col_inds
insert_single_entry!(df, v, row_ind, col_ind)
::Colon)
old_row = df[row_ind, :]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will create DataFrameRow so we cannot use it to handle cleanup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. So I guess the best solution is to call copy?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right point.

insert_single_entry!(df, v, row_ind, col_ind)
::Colon)
old_row = df[row_ind, :]
for nm in _names(df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that we should be sure that names(df) matches exactly names of NamedTuple/AbstractDict and not is a subset. Why do we want to accept the subset.

Whatever way we choose I would check this condition instead of try/catch and then cleanup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Though try... catch is still needed to handle conversion issues AFAICT.

row_ind::Real,
col_inds::AbstractVector{Bool})
setindex!(df, new_df, row_ind, findall(col_inds))
setindex!(df, v, row_ind, findall(col_inds))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check here if col_inds has a correct length.

end
function Base.setindex!(df::DataFrame,
new_df::DataFrame,
v::Any,
row_ind::Real,
col_inds::AbstractVector{<:ColumnIndex})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should handle the case when col_inds have duplicates also AbstractVector{<:ColumnIndex} allows mixing numbers and symbols which should not be allowed.

msg = "Length of iterable does not match DataFrame column count."
throw(ArgumentError(msg))
end
old_row = df[row_ind, col_inds]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will create a DataFrameRow so we cannot use it as a fallback

col_inds::Any)
new_rows, new_cols = size(new_df)
# note that length(::Number) == 1
if (new_rows, new_cols) != (length(row_inds), length(col_inds))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we handle the case that row_inds or col_inds have duplicates?
Also do we avoid calling this method if either of them is AbstractVector{Bool}?

Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thank you.

I have added some comments, but it will need another round probably. Can you please check specification of getindex #1534 (new file in docs) to make sure that we are consistent.

new_df::AbstractDataFrame,
row_ind::Real,
col_inds::AbstractVector{Bool})
setindex!(df, new_df, row_ind, findall(col_inds))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check the length of col_inds here.

function Base.setindex!(df::DataFrame,
new_df::AbstractDataFrame,
row_ind::Real,
col_inds::AbstractVector{<:ColumnIndex})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this allows mixing integers and symbols.

# df[SingleRowIndex, MultiColumnIndex] = DataFrameRow
function Base.setindex!(df::DataFrame,
val::DataFrameRow,
row_ind::Real,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it can be only Integer other than Bool.

@bkamins
Copy link
Member

bkamins commented Oct 28, 2018

I would also recommend, after rebasing it against #1534 (when it is merged) to add a specification of setindex! to docs/lib/indexing.md so that we can check if what is implemented is the same as what is indented (and ideally this specification should cover both = and .= assignment options like getindex also covers view now).

@bkamins
Copy link
Member

bkamins commented Oct 29, 2018

@coreywoodfield, I am aware that the number of things I have outlined is large. Therefore, if you do not have time to work on all of that I think that the minimal I would recommend is:

  • rebase against WIP: clean up getindex: part 2 #1534 as it is now merged;
  • concentrate in code only on the features you want to change (not fixing all other things) - but in those changes be consistent with WIP: clean up getindex: part 2 #1534;
  • add a specification of RHS of setindex! in the cases you cover to the documentation page where getindex is now documented (especially outlining how data is interpreted in RHS in cases when in can contain column names and what is the contract assumed).
  • LHS of setindex! should strictly conform the LHS specification of getindex.

The rest can be done later (I can work on it since #1534 is finished, but of course you are also welcome - let us coordinate the efforts 😄). Thank you!

@coreywoodfield
Copy link
Author

Sorry I'm taking so long on this, I'm still intending to finish it, but I've been swamped with school and other things. I wanted to tackle all of the setindex! stuff, but at this point I'd like to just get the Dict code working, and then I might try and contribute more depending on how much time I have

@bkamins
Copy link
Member

bkamins commented Dec 11, 2018

Thank you for your contribution. Very soon (after #1603 is merged) similar work has to be done for setindex! (I have outlined the key steps above).

Given your last comment I would recommend the following approach:

  1. you finish Dict code (and concentrate only on this case)
  2. I plan to write down rules for setindex! in the next few weeks in docs/src/lib/indexing.md for DataFrame, SubDataFrame and DataFrameRow - when the plan is proposed we should discuss it (if you are willing to do it please comment)
  3. Then a PR or set of PRs with deprecations only will have to be implemented
  4. Next we have to release DataFrames.jl
  5. Finally we need a PR with the target functionality

@bkamins
Copy link
Member

bkamins commented Dec 16, 2018

A key question here is do we want to setindex! to work by name matching on the RHS. This is what you propose (and it is reasonable), but this is not what we do with setindex! now with a data frame on RHS. In general I would go for this direction, i.e. we select a column in the target, get its name and look for this name in source.

@bkamins
Copy link
Member

bkamins commented Jul 24, 2019

@coreywoodfield After 0.19 I will implement the rules set in https://juliadata.github.io/DataFrames.jl/latest/lib/indexing/ soon. So I think this PR can be closed (but please comment).

@coreywoodfield + @nalimilan The only question is if we should allow AbstractDict as RHS (I planned to disallow it, but we allow it in push! so maybe we also can allow it for setindex!). A comment would be appreciated.

@nalimilan
Copy link
Member

Why not allow dicts, but that sounds like low priority.

@bkamins
Copy link
Member

bkamins commented Jul 25, 2019

I will write the code soon anyway - I can add dicts then as this is simple

@coreywoodfield
Copy link
Author

Sounds good, thanks!

I think that generally it makes sense for the allowed signatures for push! to match those of setindex!. If we allow things to be done with push! that we don't allow with setindex!, then it seems to me like we're promoting bad practices (namely, creating an empty dataframe and filling it row by row rather than allocating a dataframe with the appropriate size and then filling it). If someone already has Dicts, and they can push! Dicts, it only makes sense to me that they should also be able to setindex! with Dicts. So in my view, we should either allow it in both places or disallow it in both places (but preferably allow it)

@bkamins
Copy link
Member

bkamins commented Jul 25, 2019

Agreed. This is what we will do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Creating an empty DataFrame is unwieldy and has unexpected behavior
3 participants