Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: clean up getindex: part 2 #1534

Merged
merged 10 commits into from
Oct 29, 2018
Merged

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Sep 24, 2018

This PR was hard for me. Therefore I split it into implementation (pushing now) and tests (which I will add when we decide this is the functionality we want).

It implements discussion in #1190, #142 and #1533 and some minor clean-ups.

A particular difficulty was what to deprecate and what to change. @nalimilan - an initial review is appreciated (I will work on it further if we decide this is a right direction).

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yes, this sounds like a step in the right direction. My main interrogation is about the nature of SubDataFrame (see comment inline).

new_columns = Any[dv[row_inds] for dv in columns(df)]
return DataFrame(new_columns, copy(index(df)))
end

# df[:, :] => DataFrame
function Base.getindex(df::DataFrame, ::Colon, ::Colon)
Base.depwarn("indexing with colon as row will create a copy of rows in the future", :getindex)
Base.depwarn("indexing with colon as row will create a copy in the future" *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you mean a copy of column vectors?

function SubDataFrame(parent::DataFrame, row::Integer)
Base.depwarn("Selecting a single row from a `DataFrame` will return `DataFrameRow` in future.", :getindex)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SubDataFrame

return SubDataFrame(parent, [Int(row)])
end

function SubDataFrame(parent::DataFrame, rows::AbstractVector{<:Integer})
if any(isa.(rows, Bool))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any(x -> x isa Bool, rows) would avoid an allocation.

Below and elsewhere, better use backticks consistently around types.

return SubDataFrame(parent, convert(Vector{Int}, rows))
end

function SubDataFrame(parent::DataFrame, rows::AbstractVector{Bool})
if length(rows) != nrow(parent)
throw(ArgumentError("invalid length of Vector{Bool} row index"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to print "(got $m, expected $n)". Also, the object is not necessarily a Vector{Bool}.

any(ismissing, rowinds) && throw(MissingException("missing values are not allowed in indices"))
return SubDataFrame(adf, convert(Vector{Missings.T(T)}, rowinds))
function Base.view(adf::AbstractDataFrame, rowinds)
Base.depwarn("view(adf, x) will create a `SubDataFrame` containing columns `x` and all rows in the future.", :view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better give the replacement syntax. Same for other deprecations.

function Base.view(adf::AbstractDataFrame, rowinds::Any)
return SubDataFrame(adf, rowinds)
function Base.view(adf::AbstractDataFrame, rowinds, colinds)
return SubDataFrame(adf[colinds], rowinds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something I realize looking at this is that SubDataFrame is a view on rows of a data frame, but not on columns. So while view(adf, 1, :) will reflect changes due to e.g. rename!(adf, col => newcol), view(df, 1, cols) won't (since adf[colinds] creates a new parent data frame). The difference is subtle: we get a view of column vectors, but not a view of the parent data frame.

I wonder whether it would be more consistent to add a cols fields to SubDataFrame so that it's a real view on both rows and columns. That also depends on whether it would have a significant performance hit or not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it when I asked the same about DataFrameRow earlier. I came to the conclusion that it is impossible to consistently handle columns. The reason is that we allow dual column indexing: by their numbers and by their names. Later we can rename columns or move them breaking one type of indexing or the other and if we break number-name mapping in the parent it is not clear what we should do in the view.

Finally SubDataFrame in general is supposed to be only used when original DataFrame is not mutated in terms of rows/columns (e.g. removing rows in source DataFrame can break SubDataFrame) - this is something we should clearly indicate in the documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about it when I asked the same about DataFrameRow earlier. I came to the conclusion that it is impossible to consistently handle columns. The reason is that we allow dual column indexing: by their numbers and by their names. Later we can rename columns or move them breaking one type of indexing or the other and if we break number-name mapping in the parent it is not clear what we should do in the view.

I'm not sure that's really an issue. Anyway as you note one isn't supposed to change the structure of the parent. What bothers me more is the inconsistency between passing : for columns and passing a vector. Though in practice it shouldn't really matter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To sum up: we should leave not creating a copy when passing : for performance reasons. I will add a note in the docs when parent is copied and when not.

In fact - this is consistent with the thinking that DataFrame is a collection of rows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To sum up: we should leave not creating a copy when passing : for performance reasons.

Yes, I agree, what I meant is that if SubDataFrame also stored the indices of a subset of columns we wouldn't need to make a copy when not passing : either. I'm not saying we should definitely do this, even less than it should be done in this PR, but maybe something to keep in mind.

return SubDataFrame(adf[[colind]], rowinds)
end

function Base.view(adf::AbstractDataFrame, rowinds, colind::Bool)
throw(ArgumentError("invalid column index of type Bool"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also print the value?

@@ -132,11 +148,13 @@ nrow(sdf::SubDataFrame) = ncol(sdf) > 0 ? length(rows(sdf))::Int : 0
ncol(sdf::SubDataFrame) = length(index(sdf))

function Base.getindex(sdf::SubDataFrame, colinds)
return parent(sdf)[rows(sdf), colinds]
return view(parent(sdf), rows(sdf), colinds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does't this need a deprecation too?

@bkamins
Copy link
Member Author

bkamins commented Sep 29, 2018

Before implementing, based on #1533 and trying to be consistent with Base behavior I have added the documentation of target getindex/view behavior to this PR before moving forward.

This is complex and subtly breaking in several places (I would recommend that during a review each proposal is verified against current behavior - at least this is what I had to do and sometimes were surprised by the result).

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's really an impressive list!

It should probably go to a separate page as it's really distinct from docstrings. It would also be great if you could summarize in a few sentences the main rules that we follow (in particular when does copying happens).

docs/src/lib/functions.md Outdated Show resolved Hide resolved
docs/src/lib/functions.md Outdated Show resolved Hide resolved
docs/src/lib/functions.md Outdated Show resolved Hide resolved
docs/src/lib/functions.md Outdated Show resolved Hide resolved
docs/src/lib/functions.md Outdated Show resolved Hide resolved
* `df[rows, cols]` -> a `DataFrame` containing columns `cols` and `df[rows, col]` as a vector in each `col` in `cols`.
* `@view df[col]` -> an alias of a vector contained in column `col`;
* `@view df[cols]` -> a `SubDataFrame` with parent `df` if `cols` is a colon and `df[cols]` otherwise;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the special treatment of colon could be problematic, and we should also use df[:] as the parent in that case? Same in similar cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retaining df will be faster and I guess it will be important for groupby when you have a lot of small groups.

This is a case that currently was not on the table as @view df[x] treats x as rows, but it is going to be changed to ensure consistency.

As long as df is not mutated this should not matter in practice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"As long as df is not mutated" is what bothers me. :-)

Why do you expect view(df, :) to be used in groupby? Presumably if the overhead of calling df[cols] is OK when cols != :, it's also OK when cols == :? Or is the latter more common than the former?

docs/src/lib/functions.md Outdated Show resolved Hide resolved
`SubDataFrame`:
* `sdf[col]` -> a view of vector contained in column `col` of `parent(sdf)` with `DataFrames.rows(sdf)` as selector;
* `sdf[cols]` -> a freshly allocated `DataFrame` containing copies of vectors contained in columns `cols` of `sdf`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems inconsistent with sdf[col] and df[cols], which don't make copies. Better return a SubDataFrame.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure what is best here myself.

In Base such an operation destroys a view. Also some code in DataFrames.jl relies on the fact that sdf[cols] returns a DataFrame if I remember correctly.

I will leave as is in the next iteration, but let us discuss it further.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in Base x[inds] always makes a copy, so it's consistent that it's also the case for views. For DataFrame, df[cols] doesn't copy the column vectors, so it would be consistent to do that for SubDataFrame. The goal is that for consistency if you mutate the result, the original AbstractDataFrame is changed in both cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have rewritten it to:

a SubDataFrame, with parent parent(sdf) if cols is a colon and parent(sdf)[cols] otherwise

* `@view df[cols]` -> a `SubDataFrame` with parent `df` if `cols` is a colon and `df[cols]` otherwise;
* `@view df[row, col]` -> translates to `view(df[col], row)` (a `0`-dimensional view into `df[col]`);
* `@view df[row, cols]` -> a `DataFrameRow` with parent `df` if `cols` is a colon and `df[cols]` otherwise;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is going to require creating a new DataFrame for each row? That sounds really bad. Maybe we could change DataFrameRow to also take a subset of rows? That's really a corner case though, so it could be fixed later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the same situation with SubDataFrame as current SubDataFrame and DaraFrameRow were designed with the assumption that only row-indexing is performed.

I would leave it for a separate PR (we will have to handle colon and subset of columns separately and support a mapping from column numbers of a view to column numbers of a parent in the latter). In the worst case user should first create a parent that has correct columns and then run a view on all of them. I will open an issue for this to keep track of it.

Copy link
Member

@nalimilan nalimilan Oct 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkamins
Copy link
Member Author

bkamins commented Oct 6, 2018

I have reviewed the comments. The only problematic thing is what sdf[cols] should produce.

* `df[cols]` -> a freshly allocated `DataFrame` containing the vectors contained in columns `cols`;
* `df[row, col]` -> the value contained in row `row` of column `col`, the same as `df[col][row]`;
* `df[row, cols]` -> a `NamedTuple` containing data from row `row` in columns `cols`;
* `df[rows, col]` -> a copy of the vector `df[col]` with only the entries corresponding to `rows` selected, the same as df[col][rows];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing backticks.

* `df[rows, cols]` -> a `DataFrame` containing columns `cols` and `df[rows, col]` as a vector in each `col` in `cols`.
* `@view df[col]` -> an alias of a vector contained in column `col`;
* `@view df[cols]` -> a `SubDataFrame` with parent `df` if `cols` is a colon and `df[cols]` otherwise;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"As long as df is not mutated" is what bothers me. :-)

Why do you expect view(df, :) to be used in groupby? Presumably if the overhead of calling df[cols] is OK when cols != :, it's also OK when cols == :? Or is the latter more common than the former?

@bkamins
Copy link
Member Author

bkamins commented Oct 7, 2018

Why do you expect view(df, :) to be used in groupby? Presumably if the overhead of calling df[cols] is OK when cols != :, it's also OK when cols == :? Or is the latter more common than the former?

Currently we have:

Base.getindex(gd::GroupedDataFrame, idx::Int) =
    view(gd.parent, gd.idx[gd.starts[idx]:gd.ends[idx]])

which essentially will have to be rewritten as view(gd.parent, gd.idx[gd.starts[idx]:gd.ends[idx]], :) given this PR (after the deprecation period).

So, in short, because we rewrite view(df, i) to mean view(df, i, :) after this PR the form with a colon will be most common in users code. In the current code view(df, i) is a very popular pattern (I would say the only that is encountered in practice given current design).

Also note that currently view(df, r, c) always makes a copy of df while view(df, r) never makes a copy.

@nalimilan
Copy link
Member

OK, I see. The best fix is really to make SubDataFrame store column indices (#1557), or to make it really cheap to construct a new DataFrame. Until then, better keep view(df, rows, :) efficient I guess.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I had forgotten about my other comments regarding the implementation.

@bkamins
Copy link
Member Author

bkamins commented Oct 7, 2018

The good thing is that we will go through a deprecation period when this would not matter much 😄, and in the mean time I will try to make #1557 efficient.

@bkamins
Copy link
Member Author

bkamins commented Oct 7, 2018

Actually there is a lot to improve in the implementation, apart from the earlier comments, once we have decided on the functionality.

Anyway this PR will be mostly deprecations given what we have discussed.

@bkamins bkamins force-pushed the getindex_round_2 branch 2 times, most recently from dbcab70 to 7f25eb9 Compare October 8, 2018 13:09
# df[SingleColumnIndex] => AbstractDataVector
const ColumnIndex = Union{Integer, Symbol}

# df[SingleColumnIndex] => AbstractVector, alias
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be even more precise than "alias": the vector itself is returned, not a view of it.


# df[SingleRowIndex, :] => DataFrame
function Base.getindex(df::DataFrame, row_ind::Integer, ::Colon)
Base.depwarn("Selecting a single row from a `DataFrame` will return a `NamedTupe` in future.", :getindex)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Tuple". Also "in the future".

return DataFrame(new_columns, Index(_names(df)[selected_columns]))
end

# df[MultiRowIndex, SingleColumnIndex] => AbstractVector
# df[SingleRowIndex, :] => DataFrame
function Base.getindex(df::DataFrame, row_ind::Bool, ::Colon)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For readability, wouldn't it be better to do row_ind isa Bool inside the method below? Same below in a similar case.

# df[SingleRowIndex, :] => DataFrame
function Base.getindex(df::DataFrame, row_ind::Integer, ::Colon)
Base.depwarn("Selecting a single row from a `DataFrame` will return a `NamedTupe` in future.", :getindex)
new_columns = AbstractVector[[dv[[row_ind]]] for dv in columns(df)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't there too many brackets here?

src/dataframerow/dataframerow.jl Outdated Show resolved Hide resolved
if length(idxs) != length(idx)
throw(ArgumentError("missing values are not allowed for column indexing"))
end
idxs = disallowmissing(idx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this call really useful now?

function SubDataFrame(sdf::SubDataFrame, rowinds::Colon)
return sdf
SubDataFrame(parent::DataFrame, rows) =
SubDataFrame(parent, (1:nrow(parent))[rows])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this trick?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that:

  • we are sure that we get Vector{Int} as a result;
  • (1:nrow(parent))[rows] will use machinery from Base to catch any problematic values in rows, OTOH anything that is valid in Base will be valid here.

If not for performance and error clarity reasons this could be the only definition we use. I leave it because it is future proof - if anything outside from the definitions in other methods gets allowed in Base for indexing we will allow it.

But I can drop it as it is not strictly needed for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use that approach elsewhere, do we? I'd rather be consistent. Also the error thrown will be probably be unclear for users.


function Base.view(adf::AbstractDataFrame, rowinds::Any)
function Base.view(adf::AbstractDataFrame, rowinds)
Base.depwarn("`view(adf, x)` will be translated to `view(adf, x, :)` in the future.", :view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better tell people that it's the new syntax they should use, that's clearer.

end

function Base.view(adf::AbstractDataFrame, rowind::Integer, colinds)
Base.depwarn("`view(adf, x)` will create a `DataFrameRow` in the future." *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecation doesn't match the called method.

src/subdataframe/subdataframe.jl Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2018

OK. I tried to clean up everything (thank you for the patience - this PR is complex unfortunately).

We are left with a NamedTuple debate. But as you have said - we will see how it goes. A user always can get a DataFrame by selecting [row] instead of row. The upside is that after this it gets type stable. The downside is immutability.

But I guess in #1533 we investigated all the options and this one had least downsides. How well this goes also hinges on JuliaLang/julia#29417 decision, but I guess it did not get a lot of attention there.

@bkamins
Copy link
Member Author

bkamins commented Oct 9, 2018

Just to add (is it is better to be sure what we are doing) - I would be also OK, if df[row, columns] produced a DataFrame not a NamedTuple but I guess the majority thinking is that we should drop a dimension if we select a single row only (it has an upside that we have a convenient way to create a NamedTuple from a row in a DataFrame which could be convenient for users).

@bkamins bkamins force-pushed the getindex_round_2 branch 2 times, most recently from fb0d604 to 17a6543 Compare October 10, 2018 19:28
@bkamins
Copy link
Member Author

bkamins commented Oct 10, 2018

OK - I dug through getting this PR green on CI.
Probably it is worth to have another look at it.

Especially as I came to the conclusion that:

  • it is very cheap to create DataFrameRow;
  • I am not sure I know how to create NamedTuple from a row of DataFrame cheap - @nalimilan have you experimented with that?

@bkamins
Copy link
Member Author

bkamins commented Oct 12, 2018

Hopefully I have cleaned up all depwarn messages.

return DataFrame(new_columns, Index(_names(df)[selected_columns]))
end

# df[:, SingleColumnIndex] => AbstractVector
# df[:, MultiColumnIndex] => DataFrame
function Base.getindex(df::DataFrame, row_ind::Colon, col_inds)
Base.depwarn("indexing with colon as row will create a copy in the future" *
" use df[col_inds] to get the columns without copying", :getindex)
Base.depwarn("Sndexing with colon as row will create a copy in the future. " *
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love Sndexing into data frames

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new operation I am working on. A combination of "Selecting" and "Indexing" in one go with a bit more of the latter than the former.

@nalimilan
Copy link
Member

I would say it's quite simple, just iterate over columns and extract the requested entry. It should be as fast as we can hope -- which unfortunately should be relatively slow, but not slower than iterating over all entries in a DataFrameRow. The main question is whether the compiler will be able to avoid actually accessing all columns when only some of them are used. I guess we should do some benchmarking before switching.

I've played a bit with creating tuples for each row and it appears that even when passing a tuple of N columns to a function, the compiler isn't able to avoid the creating of a N-tuple for each row when we use only a single row (see this gist). Of course the same applies to named tuples. And inside a function taking a data frame, the type instability makes this even harder for the compiler.

In the end DataFrameRow is much faster than Tuple (I didn't even test NamedTuple) when naively iterating over rows: see this gist. So maybe we should make an exception an have getindex return a DataFrameRow. Even if that's inconsistent with AbstractArray, in practice it shouldn't matter much.

@quinnj I guess you've considered these issues a lot when designing Tables.jl?

@bkamins
Copy link
Member Author

bkamins commented Oct 13, 2018

That is exactly what my experience was (I used something similar to #1564 to test it against NamedTuples). Also - as I note in #1564 there - is an issue that tuples you produce in your gist are covariant (a better situation) while named tuples as planned are invariant which makes things even worse.

@nalimilan
Copy link
Member

Why worse? Can it actually be worse than that anyway? :-)

@bkamins
Copy link
Member Author

bkamins commented Oct 25, 2018

@nalimilan Is there a conclusion on the NamedTuple case and how does it affect this PR? (should we keep it or change NamedTuple to DataFrameRow?)

@nalimilan
Copy link
Member

I'd say use DataFrameRow for now instead of NamedTuple.

@bkamins
Copy link
Member Author

bkamins commented Oct 26, 2018

Good. One last thing before I clean it up.

As DataFrameRow is a kind of view are we sure that we want df[1, cols] return DataFrameRow or should it be DataFrame (this is current behavior).

I am OK with both (given the discussion we had before both alternatives have pros and cons) - but prefer to double check before moving forward.

@nalimilan
Copy link
Member

I'd go with DataFrameRow, as returning a DataFrame is really slow (probably even slower than NamedTuple).

@pdeffebach
Copy link
Contributor

Is it too late to just make our own two types <: AbstractVector? That way we can make sure that DataFrameRow (attached to a dataframe) behaves exactly like DataFrameRow (not attached to a dataframe).

@bkamins
Copy link
Member Author

bkamins commented Oct 27, 2018

@pdeffebach Can you explain what you mean in a bit more detail please because I am not sure what you mean (in general I guess it is not too late for changes - we try to clean up DataFrames.jl as much as possible before 1.0 release).

@pdeffebach
Copy link
Contributor

pdeffebach commented Oct 27, 2018

Something like AttachedDataFrameRow and DetachedDataFrameRow (without the awkward names)

struct AttachedDataFrameRow <: AbstractVector # or similar
    parent::DataFrame
    row::Int
end

Base.getindex(... work with integer using parent Index)
Base.getindex(... work with symbol using parent Index)
function row(dfrow::AttachedDataFrameRow) ...
# Define array interface 

Then there is DetachedDataFrameRow that tries to keep the exact same API

struct DetachedDataFrameRow <: AbstractVector # or similar
    data::Vector
    colindex::Index
    row::Int
end

function DetachedDataFrameRow(df::DataFrame, row::Int)
    data = [df[row, name] for name in names(df)]
    DetachedDataFrameRow(data, getfield(df, :colindex), row)
end

Base.copy(dfrow::AttachedDataFrameRow) = DetachedDataFrameRow(parent(dfrow), row(dfrow))

Base.getindex(... work with integer)
Base.getindex(... work with symbol)
function row(dfrow::DetachedDataFrameRow) ...
# Define array interface  

I like this plan because it means we don't have to worry about losing behavior when a DataFrameRow-like object is attached verses a copy, even down to both using a Index object. We also don't have to make compromises: we can index with mutiple symbols dfrow[:x1, :x2, :x3] (benefit of DataFrame), yet still have mean(dfrow) work (benefit of NamedTuple.

On the other hand, I don't like this plan because its one more type we have to export and maintain.

@nalimilan
Copy link
Member

The problem is that creating a new Vector to hold the data is slow, so DetachedDataFrameRow wouldn't really be usable as a replacement for DataFrameRow.

@pdeffebach
Copy link
Contributor

pdeffebach commented Oct 27, 2018

Thanks for the feedback.

If that's the case, then relatively large change in API (DataFrame on copy vs DataFrameRow attached) is preferred, so people are disincentived to copy it too much.

I must admit I don't really know the use-case of copying it. Seems like it will be pretty rare. However DataFrame is suboptimal to me because it might make the switch to mean(dfrow) more difficult in the future. It makes sense with the current DataFrameRow behavior because both that and a DataFrame are quite Dict-like.

EDIT: I think i misinterpreted the above conversation. returning NamedTuple on copy seems great.

@bkamins
Copy link
Member Author

bkamins commented Oct 27, 2018

I have improved the documentation to explain design considerations.

@nalimilan
Copy link
Member

Thanks. It this good to go then?

@bkamins
Copy link
Member Author

bkamins commented Oct 28, 2018

I would say is is good to go. Also then I think that #1571 should be rebased against it and try to be consistent with target functionality.

@nalimilan nalimilan merged commit ca84253 into JuliaData:master Oct 29, 2018
@nalimilan
Copy link
Member

Great work!

@bkamins bkamins deleted the getindex_round_2 branch October 29, 2018 09:53
@nalimilan
Copy link
Member

There are lots of deprecation warnings when running tests. Some seem to be potentially problematic for users. For example:

┌ Warning: `sdf[colind]` will create a view of `parent(sdf)[colind]` in the future. Use sdf[:, [colind]]` to get a `DataFrame`.
│   caller = isequal(::SubDataFrame{Array{Int64,1}}, ::DataFrame) at abstractdataframe.jl:283
└ @ DataFrames ~/.julia/dev/DataFrames/src/abstractdataframe/abstractdataframe.jl:283

This seems to happen when calling isequal with a SubDataFrame. Maybe we should provide a temporary isequal method in deprecated.jl until we remove the deprecation? I guess this applies to other functions.

@bkamins
Copy link
Member Author

bkamins commented Nov 1, 2018

That was the issue I was afraid of when asking about cleaning up deprecations process.
I will go through the tests of DataFrames.jl, try to clean up and submit a PR today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants