Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Behavior of df[1, cols] and DataFrameRow #1533

Closed
bkamins opened this issue Sep 24, 2018 · 28 comments
Closed

Behavior of df[1, cols] and DataFrameRow #1533

bkamins opened this issue Sep 24, 2018 · 28 comments

Comments

@bkamins
Copy link
Member

bkamins commented Sep 24, 2018

@nalimilan This is speculative, but I want to discuss it before we finish getindex cleanup.

  1. I think we should keep DataFrameRow structure (there were some discussion of returning NamedTuple from eachrow but it would not allow for mutation);
  2. Currently getindex[1, :] returns a DataFrame but maybe it should return a DataFrameRow? Also then maybe DataFrameRow could accept only a selection of columns to make getindex[1, columns] work consistently and return DataFrameRow (this would be breaking and I am not 100% convinced we want it)?
  3. If we do not go for point 2) maybe it would be reasonable to expose some function that allows to get a row of a DataFrame as DataFrameRow directly (something like getrow(df, 1))?
@nalimilan
Copy link
Member

That definitely makes sense, especially if we decide that iterating over a data frame gives rows (and therefore DataFrameRow objects). That would also be much more efficient than creating a new DataFrame. Overall I wouldn't expect this to break a lot of code, but who knows. Anyway we should go through a deprecation period first, and we can always go back if it turns out to be a bad idea.

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2018

Before I move on can you clarify me the following design issues (those happened before I started contributing):

  • why DataFrameRow is not a subtype of AbstractDataFrame?
  • why SubDataFrame does not allow arbitrary AbstractDataFrame as a parent?
  • why DataFrameRow cannot be removed and replaced by SubDataFrame (assuming that the questions above do not have a strong reason to unify them - I guess all the mechanics where DataFrameRow is used in DataFrames.jl ecosystem could be rewritten for SubDataFrame and the mechanics allowing to iterate over DataFrameRow could be removed).

Probably you see what it is getting at: remove DataFrameRow in favor of SubDataFrame. Then df[1, :] would return a DataFrame still but @view df[1,:] would get a view of a row - and the same would be returned by eachrow iterator.

We could even retain DataFrameRow internally in DataFrames.jl if it is really needed but do not expose it to the user but instead expose publicly only SubDataFrame, e.g. in eachrow.

@nalimilan
Copy link
Member

* why `DataFrameRow` is not a subtype of `AbstractDataFrame`?

AFAIK it doesn't support the full interface. For example row[1, :x] doesn't work. Also iterating over a row should yield values like for NamedTuple, but we don't want that for data frames.

why SubDataFrame does not allow arbitrary AbstractDataFrame as a parent?

I guess that's just an oversight (we currently don't have other AbstractDataFrame types which can usefully be combine with SubDataFrame).

* why `DataFrameRow` cannot be removed and replaced by `SubDataFrame` (assuming that the questions above do not have a strong reason to unify them - I guess all the mechanics where `DataFrameRow` is used in DataFrames.jl ecosystem could be rewritten for `SubDataFrame` and the mechanics allowing to iterate over `DataFrameRow` could be removed).

I think it makes sense to differentiate a row from a subset of rows, if only because of the iteration difference I noted above. Having a different type can also allow for more efficient code, just like in by I was able to write faster methods for when a NamedTuple is returned rather than a DataFrame. There's a lot of hashing and comparison code for DataFrameRow which is used in places where performance is critical. Of course we could probably rewrite this for SubDataFrames where the indices are a scalar, but that wouldn't really simplify the code.

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2018

Thanks. This is what I see in the code. That is why I have said:

We could even retain DataFrameRow internally in DataFrames.jl if it is really needed but do not expose it to the user but instead expose publicly only SubDataFrame, e.g. in eachrow.

i.e. we internally use DataFrameRow but do not expose it to the users.

What is more SubDataFrame knows about the type of row-selector so SubDataFrame{Int} would have the same performance as DataFrameRow if we allow to create a view of a single column row (in fact apart from the fact that currently they have different methods implemented they would be identical).

@nalimilan
Copy link
Member

What is more SubDataFrame knows about the type of row-selector so SubDataFrame{Int} would have the same performance as DataFrameRow if we allow to create a view of a single column row (in fact apart from the fact that currently they have different methods implemented they would be identical).

Yes, but it would be weird to have iteration behave differently depending just on a type parameter, wouldn't it?

@bkamins
Copy link
Member Author

bkamins commented Sep 24, 2018

OK - let us leave it as is (but we have to be clear, at some point, in the documentation why we differentiate between DataFrameRow and SubDataFrame and what is the benefit of it). And actually if there is a benefit we should have either:

  • df[1, cols] return a DataFrameRow
  • df[[1], cols] return a DataFrame
  • @view df[1, cols] a DataFrameRow (the same as above)
  • @view df[[1], cols] a SubDataFrame

or:

  • df[1, cols] return a DataFrame
  • df[[1], cols] return a DataFrame
  • @view df[1, cols] a SubDataFrame
  • @view df[[1], cols] a SubDataFrame
  • row(df, 1) return a DataFrameRow

which one would you prefer?

@nalimilan
Copy link
Member

I think I'd prefer version 1, but as I said if it turns out its inconvenient we could change our strategy.

@pdeffebach
Copy link
Contributor

I like option 1 because #1449 gets us on the road to doing

mean(df[2, :])

Returning a DataFrame loses that progress.

I really like this df[[1], cols] returning AbstractDataFrame objects, and clarifying between vector and scalar sub-setting.

@nalimilan
Copy link
Member

Another option similar to 1 would be to have df[1, :] return a NamedTuple but view(df, 1) return a DataFrameRow. That would mirror more closely what happens with Matrix, i.e. the result of getindex lives independently from its parent, and it cannot be used to mutate the parent. That would have the advantage of consistency with other methods. The drawback would be 1) that mutation requires using view, which can be harder to find (but we also provide eachrow, which is more convenient anyway), and 2) that generating a NamedTuple could be slow if there are many columns and you only use a few of them (especially since the type instability of data frame doesn't allow benefiting from specialization).

Something to also take into account is that it would be easier to start with NamedTuple and move later to DataFrameRow if needed, since the former's immutability makes it less likely that code will break than if we make the change in the opposite direction.

@pdeffebach
Copy link
Contributor

It would be nice for df[1,:] to return the same thing as first(eachrow(df)). Because I anticipate users using

[f(x) for x in eachrow(df)]

and

[f(df[i, :] for i in nrow(df)]

interchangeably.

@bkamins
Copy link
Member Author

bkamins commented Sep 25, 2018

@nalimilan I do not feel NamedTuple is a good approach as it is not mutable which breaks basic property of DataFrame.

However, I do not feel that eachrow(df) and df[i, :] have to be the same - as eachrow returns a view (which can be clearly explained to the users). Actually the argument that df[i, :] should return a copy is convincing.

Here is a summary what I think we should do:

  • df[col] return an original vector

  • df[cols] return a DataFrame of original vectors

  • @view df[col] return an original vector

  • @view df[cols] return a DataFrame of original vectors

  • df[row, col] return a value

  • df[rows, col] return a copy vector

  • @view df[row, col] a value

  • @view df[rows, col] a view

  • df[row, cols] return a DataFrame of copy vectors (this is disputable as it could be DataFrameRow or NamedTuple - here I guess we should choose what is most convenient as all options have their pros and cons)

  • df[rows, cols] return a DataFrame of copy vectors

  • @view df[row, cols] a DataFrameRow (this is disputable as it could be a SubDataFrame to simplify the interface as I have argued above but I guess we dropped it)

  • @view df[rows, cols] a SubDataFrame

When we settle on this I will update #1534.

@nalimilan
Copy link
Member

I agree with all of these except (as expected) the two ones you highlighted:

* `df[row, cols]` return a `DataFrame` of copy vectors (_this is disputable as it could be `DataFrameRow` or `NamedTuple` - here I guess we should choose what is most convenient as all options have their pros and cons_)

Returning a vector doesn't sound appropriate, as we would completely lose the column names which are an essential property defining a row. Either a DataFrameRow or a NamedTuple would be better. As I said I'm slightly leaning towards NamedTuple as it's always easier to be more restrictive first and make things less restrictive later if needed. I don't really get your position about this, since you said mutating is an essential property but then that returning a copy is convincing, two properties which seem incompatible to me.

I agree that eachrow(df) and df[i, :] can diverge, in particular because the former could return a type-specialized iterator which could allow for faster code (with a function barrier, or with compiler improvements), while the latter is doomed to be slow.

* `@view df[row, cols]` a `DataFrameRow` (_this is disputable as it could be a `SubDataFrame` to simplify the interface as I have argued above but I guess we dropped it_)

AFAICT this is not possible unless we find a solution for iteration, right?

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2018

I will lay out below what I think to let us consciously decide as this is a delicate thing and I guess we all want DataFrames 1.0 release to be stable in this area.

Case of df[row, cols]

mutating is essential

I mean that the result of df[row, cols] should be mutable, because df is mutable

returning a copy is convincing

Here I understand that we agree, i.e. it would be better that the result of df[row, cols] when mutated would not influence df.

Actually I think we should voice a third requirement as this is what I understand you want:

the result of df[row, cols] should drop from 2-D like object (DataFrame) to 1-D like object.

Now let me summarize the options we have discussed (I add NamedArray because you mentioned an option to return a vector):

Return type Mutable Copying 1D Breaking change
DataFrame yes yes no no (current behavior)
DataFrameRow yes no yes mildly
NamedTuple no yes yes yes (code that currently works might break, because it is immutable)
NamedArray yes yes yes yes (introduces a new dependency)

and every option has some disadvantage.

  • Based on this I lean towards DataFrame because it is not-breaking (this is what we currently have - although there is a duplication of functionality between df[[1], :] and df[1, :]).
  • I am OK with DataFrameRow as a second option as it is only mildly breaking and users are aware of this type in DataFrames.jl ecosystem (although you will be able to get it via view(df, 1, cols) so we would have a duplication of functionality). Actually we could make DataFrameRow be a wrapper around a one-row DataFrame like this df[1,:] would be view(df[[1], :], 1, :) to make it not-influence df on mutation but I am not sure if it would give any advantage over simply returning a DataFrame;
  • Returning NamedTuple can break existing code in unexpected places and introduces a new type for a user to think about (and already user has to handle DataFrame, SubDataFrame and DataFrameRow which is a lot).
  • Finally NamedArray creates a significant dependency.

Case @view df[row, cols]

I understand you support DataFrameRow here - right? I am OK with this choice (my only objection is that we have one more type for the user to learn that is why I put it on the table).

@nalimilan
Copy link
Member

Good summary. Though note that returning a DataFrameRow is as breaking as returning a NamedTuple, since mutating the DataFrameRow will now mutate the parent, while currently it doesn't affect it. So code which mutates the result of df[1,:] would change its behavior significantly.

Case @view df[row, cols]
I understand you support DataFrameRow here - right? I am OK with this choice (my only objection is that we have one more type for the user to learn that is why I put it on the table).

Right. I'm also open to investigating merging DataFrameRow and SubDataFrame, but maybe better make this a separate discussion since it shouldn't really change semantics (just simplify code and reduce the number of exported types).

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2018

DataFrameRow - I agree this is breaking that is why I considered this view(df[[1], :], 1, :) pattern to avoid mutation of source. Anyway - given what was written what is your opinion? As said - I am leaning towards DataFrame as I have said above because it is non-breaking (and we can always change it in the future if we have more evidence what would be best - but if you feel you have a strong case against it I am open).

Regarding merging DataFrameRow and SubDataFrame - agreed to postpone this discussion for another PR.

@nalimilan
Copy link
Member

I'd return either a DataFrameRow or a NamedTuple. Returning a DataFrame sounds like a bad solution since it's inefficient and one can always get the same result with df[1:1, :] or df[[1], :]. It's also inconsistent with Matrix indexing and with df[:, 1], which both return an object of a different type than the parent. Also I prefer breaking things earlier rather than later.

@bkamins
Copy link
Member Author

bkamins commented Sep 26, 2018

OK - you have convinced me for NamedTuple because:

  • it is an in-built concept, so we are not requiring user to learn something new
  • it can play nicely when passed to functions (in particular even if it is passed as a whole we get type stability for free inside the function)
  • we give a way to get other objects if user wants it If the drawbacks, i.e. compilation time or immutability, are significant to the user she can always choose to create what is best.

Let us wait if we get any comments and then I will update the PR accordingly.

@bkamins
Copy link
Member Author

bkamins commented Sep 27, 2018

OK - moving next step further. If df[1, cols] returns a NamedTuple then what should copy(::DataFrameRow) return?

@pdeffebach
Copy link
Contributor

Oh man. The difference might come down to the fact that a DataFrameRow lets you know which row number it came from. Philosophically when you copy something, you are removing its connection to the underlying Dataframe, so shouldn't you not care if you lose that row number? Hence returning a NamedTuple. But having copy return a different type is not right...

@bkamins
Copy link
Member Author

bkamins commented Sep 27, 2018

@nalimilan I assume you know why I am asking 😄 and DataFrameRow is a view (for the reference of others see #1536).
For me it is a valid question because it tests the consistency of the setup. And following the behavior in Base if we go the way we plan here we should return a NamedTuple as you have said which is a bit weird.

@pdeffebach
Copy link
Contributor

pdeffebach commented Sep 27, 2018

note that you can't copy a SubDataFrame so we might not need to define this.

DataFrame(df[1,:]) == DataFrame(DataFrameRow(df, 1))?

@nalimilan
Copy link
Member

note that you can't copy a SubDataFrame so we might not need to define this.

See #1536.

I think it's fine for copy to return a different type, that's what Base does for SubArray. What's more annoying is that NamedTuple is immutable, while one would often call copy to be able to mutate an object without affecting its parent. We could also return a DataFrameRow with a one-row parent. Anyway I'm really not sure that's a common pattern, so I also vote for leaving this undefined at least for now.

@bkamins
Copy link
Member Author

bkamins commented Sep 28, 2018

OK - so I will start updating #1534 following what was discussed here.

@pdeffebach
Copy link
Contributor

One issue with NamedTuple is that you can't use a vector of symbols to Index like you can with a DataFrameRow.

Though I'm not sure the use case for df[1,:]. It's tough to know how much this will be a problem in practice.

julia> t = (a = 1, b = 2, c = 5)
julia> t[[:a, :b]]
julia> t[[:a, :b]]
ERROR: MethodError: no method matching getindex(::NamedTuple{(:a, :b, :c),Tuple{Int64,Int64,Int64}}, ::Array{Symbol,1})
Closest candidates are:
  getindex(::NamedTuple, ::Int64) at namedtuple.jl:101
  getindex(::NamedTuple, ::Symbol) at namedtuple.jl:102
Stacktrace:
 [1] top-level scope at none:0

@XilinJia
Copy link
Contributor

Can I request, if it is relavant here, that DataFrameRow should support "haskey", as now it doesn't?

@bkamins
Copy link
Member Author

bkamins commented Sep 28, 2018

@LeoK987 - can you open a separate issue for this so that we do not lose track of this request and can discuss it?

@bkamins
Copy link
Member Author

bkamins commented Sep 28, 2018

@pdeffebach I think it is worth a PR to Julia Base, as indexing by a vector works for Tuple so it would be consistent if it worked for NamedTuple. EDIT: JuliaLang/julia#29417

(I will wait a bit with #1534 as it seems we still get some feedback)

@bkamins
Copy link
Member Author

bkamins commented Dec 14, 2018

Closing as df[1, cols] returns a DataFrameRow at least for now.
DataFrameRow now supports haskey.

@bkamins bkamins closed this as completed Dec 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants