-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend setindex api #1571
Extend setindex api #1571
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think another case to cover is col_inds::AbstractVector{Bool}
.
I also wonder how this interacts with @bkamins work on indexing.
@@ -500,6 +500,57 @@ function Base.setindex!(df::DataFrame, | |||
return df | |||
end | |||
|
|||
# df[SingleRowIndex, :] = Union{AbstractDict, NamedTuple} | |||
function Base.setindex!(df::DataFrame, | |||
row::Union{AbstractDict, NamedTuple}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also allow DataFrameRow
, and add tests for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrameRow
isn't defined until after these functions are, how should I deal with that? I could move the definition of DataFrameRow
up, or add a definition of setindex!
after DataFrameRow
gets defined. Both of those don't seem like great options to me, but I'm leaning towards adding a definition of setindex!
in dataframerow.jl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively I could take in an AbstractDataFrame
and then verify the size, that might be best
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sounds like the best approach. BTW, there's a method for DataFrame
above which doesn't look great (it uses the column order rather than names, and doesn't handle errors).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So should I use column names strictly? I could see a few different use cases for setindex!
where it takes in a dataframe, which gives us a few different options.
- We could just use column names and throw an error if every column name doesn't match
- We could check to see if the column names match, use them if they do, or just use column order if they don't
- We could only use column order
I'm not sure which one is really the behavior that people would expect (if someone is just thinking of the dataframe as a fancy matrix, then all they're worrying about is the dimensions), but I think the second option is the most forgiving of different expectations, so I'd think we should probably do that. If we do do that, it seems appropriate that we should also change the code for df[MultiRowIndex, MultiColumnIndex] = DataFrame
around line 580 to have the same behavior. It seems like we should be able to expect that if we were to delegate one to the other that they'd give the same behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we do with push!
and vcat
is to use names and ignore order, so we should just do that here too for consistency.
src/dataframe/dataframe.jl
Outdated
end | ||
# df[SingleRowIndex, MultiColumnIndex] = iterable | ||
function Base.setindex!(df::DataFrame, | ||
iterable::AbstractVector, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be Any
if you want to support any iterable. But that will require more changes and a deprecation phase, since we currently allow e.g. df[1,:] = 1
, which should be deprecated in favor of df[1,:] .= 1
. Unless you want to dive into this, just rename the argument to something like v
. Also it would be nice to allow Tuple
and test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to Any
and it spits out a deprecation warning if a single value is passed in. df[1,:] .= 1
isn't working and I'm looking into that
aabd0df
to
aded09d
Compare
So I've got setting a row in a dataframe with a 1-row dataframe in a place where I'm happy with it, but there may be a few things left to do, and it may be a few days before I can try and tackle them. I added two Here's what I have left to do:
If I could get guidance on what exactly we want on these, I should be able to get them done within the next week or two |
@nalimilan I implemented the third point, so that column names are now used instead of order. I had to change a few tests so that they didn't break, and it seems like this change could affect a lot of existing workflows, so I was thinking we should probably have a deprecation period, where we allow the old behavior if column names don't match, but also spit out a deprecation warning. What do you think? |
Yes, it would be nice to print deprecations to avoid breaking code. Also, gave you looked at how broadcasting could be implemented for This is likely to have conflicts with #1534. |
@@ -4,6 +4,15 @@ struct DataFrameRow{T <: AbstractDataFrame} | |||
row::Int | |||
end | |||
|
|||
# this is defined here because it has to be after the DataFrameRow definition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like this, it breaks the current organization of the indexing code. Can't you change the inclusion order of files so that DataFrameRow
is defined before DataFrame
? If that's not possible, we could move all type definitions to a separate file which would be included first.
end | ||
end | ||
|
||
function checkpreconditions(df::DataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better call checknames
and checkdimensions
directly, defining this helper method actually requires more lines of code.
end | ||
|
||
function checknames(df::DataFrame, | ||
new_df::AbstractDataFrame, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please align arguments (here and elsewhere).
new_rows, new_cols = size(new_df) | ||
# note that length(::Number) == 1 | ||
if (new_rows, new_cols) != (length(row_inds), length(col_inds)) | ||
msg = "Tried to assign $new_rows×$new_cols dataframe to " * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "data frame" (here and elsewhere).
df[1, 1:2] = df[2, 2:3] | ||
df[1:2, 1:2] = df[2:3, 2:3] | ||
df[[true,false,false,true], 2:3] = df[1:2,1:2] | ||
df[1, 1:2] = df[2, 1:2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have checks that an error is thrown when names don't match?
for j in 1:length(col_inds) | ||
insert_single_entry!(df, new_df[j][1], row_ind, col_inds[j]) | ||
# remove this check after implicit scalar broadcasting is deprecated | ||
if applicable(length, v) && Base.IteratorSize(v) != Base.HasShape{0}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a special case for AbstractString
too, since these are HasLength
but they act as scalar when broadcasting, and are quite common in data frames uses. Also put the IteratorSize
call first since it can be resolved statically (using isa HasShape{0}
or !==
would help), but currently applicable
cannot.
"across columns is deprecated, use .= instead, i.e. " * | ||
"df[row, :] .= val" | ||
Base.depwarn(msg, :setindex!) | ||
for col_ind in col_inds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try... catch
logic is also needed here, right? Same below.
end | ||
function Base.setindex!(df::DataFrame, | ||
v::Any, | ||
row::Union{AbstractDict, NamedTuple}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
row_ind
must be Integer
, but not Bool
.
for col_ind in col_inds | ||
insert_single_entry!(df, v, row_ind, col_ind) | ||
::Colon) | ||
old_row = df[row_ind, :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will create DataFrameRow
so we cannot use it to handle cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. So I guess the best solution is to call copy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right point.
insert_single_entry!(df, v, row_ind, col_ind) | ||
::Colon) | ||
old_row = df[row_ind, :] | ||
for nm in _names(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think that we should be sure that names(df)
matches exactly names of NamedTuple
/AbstractDict
and not is a subset. Why do we want to accept the subset.
Whatever way we choose I would check this condition instead of try/catch
and then cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Though try... catch
is still needed to handle conversion issues AFAICT.
row_ind::Real, | ||
col_inds::AbstractVector{Bool}) | ||
setindex!(df, new_df, row_ind, findall(col_inds)) | ||
setindex!(df, v, row_ind, findall(col_inds)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should check here if col_inds
has a correct length.
end | ||
function Base.setindex!(df::DataFrame, | ||
new_df::DataFrame, | ||
v::Any, | ||
row_ind::Real, | ||
col_inds::AbstractVector{<:ColumnIndex}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should handle the case when col_inds
have duplicates also AbstractVector{<:ColumnIndex}
allows mixing numbers and symbols which should not be allowed.
msg = "Length of iterable does not match DataFrame column count." | ||
throw(ArgumentError(msg)) | ||
end | ||
old_row = df[row_ind, col_inds] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will create a DataFrameRow
so we cannot use it as a fallback
col_inds::Any) | ||
new_rows, new_cols = size(new_df) | ||
# note that length(::Number) == 1 | ||
if (new_rows, new_cols) != (length(row_inds), length(col_inds)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we handle the case that row_inds
or col_inds
have duplicates?
Also do we avoid calling this method if either of them is AbstractVector{Bool}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Thank you.
I have added some comments, but it will need another round probably. Can you please check specification of getindex
#1534 (new file in docs) to make sure that we are consistent.
new_df::AbstractDataFrame, | ||
row_ind::Real, | ||
col_inds::AbstractVector{Bool}) | ||
setindex!(df, new_df, row_ind, findall(col_inds)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should check the length of col_inds
here.
function Base.setindex!(df::DataFrame, | ||
new_df::AbstractDataFrame, | ||
row_ind::Real, | ||
col_inds::AbstractVector{<:ColumnIndex}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this allows mixing integers and symbols.
# df[SingleRowIndex, MultiColumnIndex] = DataFrameRow | ||
function Base.setindex!(df::DataFrame, | ||
val::DataFrameRow, | ||
row_ind::Real, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it can be only Integer
other than Bool
.
I would also recommend, after rebasing it against #1534 (when it is merged) to add a specification of |
@coreywoodfield, I am aware that the number of things I have outlined is large. Therefore, if you do not have time to work on all of that I think that the minimal I would recommend is:
The rest can be done later (I can work on it since #1534 is finished, but of course you are also welcome - let us coordinate the efforts 😄). Thank you! |
Sorry I'm taking so long on this, I'm still intending to finish it, but I've been swamped with school and other things. I wanted to tackle all of the |
Thank you for your contribution. Very soon (after #1603 is merged) similar work has to be done for Given your last comment I would recommend the following approach:
|
A key question here is do we want to |
@coreywoodfield After 0.19 I will implement the rules set in https://juliadata.github.io/DataFrames.jl/latest/lib/indexing/ soon. So I think this PR can be closed (but please comment). @coreywoodfield + @nalimilan The only question is if we should allow |
Why not allow dicts, but that sounds like low priority. |
I will write the code soon anyway - I can add dicts then as this is simple |
Sounds good, thanks! I think that generally it makes sense for the allowed signatures for |
Agreed. This is what we will do. |
This resolves #1569
I wasn't sure if we want the second commit, but it was pretty easy to add so I just did it (and currently, trying to make such a call interprets the iterable as a scalar and broadcasts it to all the columns in the row - that may be desirable when the columns hold collections, but that seems like an uncommon use case, if the columns don't hold collections it just throws an error and doesn't do anything). I'm happy to remove it though if we don't want it.
One thing that I considered that could be added to this is that perhaps we want to allow updating just some values in a row with a named tuple (or a dictionary), I just wasn't sure what syntax we would want. It'd be simplest to allow using a named tuple missing some columns in the assignment
df[1, :] = tuple
, and only update the columns that the tuple has, but it'd be more clear if we requireddf[1, keys(tuple)] = tuple
. I'm happy to add either one to this pull request if we want.