Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make `getproperty(df, col)` return a full length view of the column #1844

Open
oxinabox opened this issue Jun 10, 2019 · 22 comments

Comments

Projects
None yet
4 participants
@oxinabox
Copy link
Contributor

commented Jun 10, 2019

Similar for the single arg getindex (getindex(df, col::Symbol)), and eachcol.

We are progressively knocking out ways of ending up with columns of different sizes.

Right now here are ways that you can't end up with incosistant sizes:

  • You can't assign in a new column that doesn't have the right size, as setproperty checks the size.
  • You can't pass in a Vector then resize it using another reference, as the DataFrame constructor copies.
  • You can't use normal indexing indexing to get access to columns that you then resize, as normal indexing returns copies
  • you can't use @view indexing returns SubDataFrames which disallow size mutating operations, and who's column vectors are views anyway, so also disallow resizing operations (right?)
  • and size setproperty (and 1 arg setindex) check the size of the column being added, you can't just insert one with incorrect size.

Right now the only way I can think of getting access to a inner column and then mutating its size,
is the use of getproperty or single argument getindex or eachcol.
Which return the actually underlying Vector.

And I was thinking: It would be great if we could just wrap those in some kind of view like array wrapper that doesn't allow resizing in place, but allows all the other operations one would hope from a Vector, including in-place setindex of elements.

Turns out such a view likes wrapper does exist.
The SubArray.
We can just return a the equivalent of @view df.col[:].
The overhead of creating a view is tiny, as is the overhead of working with a view (at least when the indexing is simple which it is in this case).
You can do all the things to a SubArray as long as you don't call resize! on it -- that is a MethodError.

If someone needs to actually access the raw array, then they can use parent(df.col).
Only reason I can think that would really be needed is for if a method has a overly strict set of type constraints, then accessing the parent would be an alternative to collect (with its own trade-offs).

I think this would knock off the last possible way to end up with a corrupt DataFrame,
via column related shenanigans.

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

There would be still one hole:

df.a = vector

and then resize vector.

Still - patching up things is not that bad :). Actually this would be consistent with df[row,cols] returning a view.

An alternative (and this is something we really should discuss here - as we will always have holes) is to create a list of methods in DataFrames.jl that actively check if all columns have the same length before performing their operation (essentially - all expensive methods).

@oxinabox

This comment has been minimized.

Copy link
Contributor Author

commented Jun 10, 2019

Those are 2 good points, lets open seperate issues to discuss them.
So we can keep things focused, on this change.
As you say this plugs some leaks anyway,
We can do this change and the others too, or not.

Openned:

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 10, 2019

Thanks for opening the issues (they will end up in on PR probably later, but it is easier to manage the discussion this way).

Let us wait what other people think df.a should do.

Following up #1846 (comment) here it would be related to the fact that df.a should expand to getcol and getcol could have one (or more) of the following behaviors:

  1. return a view
  2. return a copy
  3. return a vector as is

Now naturally getcol can take a kwarg and allow for all three options above, but then still we have a question what df.a should default to (and a minor question what should be the name of kwarg, as copycols would be useful only if we provided a choice between options 2. and 3., but we have this option 1. that is potentially appealing).

@oxinabox

This comment has been minimized.

Copy link
Contributor Author

commented Jun 10, 2019

Return as is:

Pros return as is

  • consistent with normal struct getfield
  • preserve ability to do inplace mutatation
  • fast and nonallocating

Cons return as is:

  • unsafe to resizing

Returning copy

Pros returning copy:

  • consistent with getindex
    • this is particularly nice for the single argument getindex overload, but we've already accepted that as being kinda special as a thing.
  • safe to resizing

Cons returning copy

  • Allocating,
  • and we have no nice obvious syntax for getting a nonallocating version for getproperty version.
    Though @view df[:col] already works .
    @view df.col gives an error (we could consider upstreaming that to julialang/julia. @view could be changed to turn @view getproperty(x. name) into getproperty_view(x,name). (and if we really wanted we could monkey patch the macro, but that is super gross)

Return a view

Pros return a view

  • safe
  • fast and nonallocating (Tiny overhead)

Cons return a view

  • Kinda weird type to return.
@oxinabox

This comment has been minimized.

Copy link
Contributor Author

commented Jun 10, 2019

Thinking a bit more.
Creating masks is a thing people tend to do.
It isn't a great pattern (filter is better) but still it is a holdover from other languages.
If df.col returned a copy
that would mean that df[:, df.col.==1] would do 3 large allocations
whereas currently (and with a view) df[:, df.col.==1] does only 2 large allocations.

More out there idea:
We could be to make getcol return a newly defined Column type: which is a subtype of AbstractVector, that can not be resized.

  • Really that is not all that hard to implement, it only actually requires a few methods.
  • All the safety advantages of a full length view
  • evan smaller overhead then a SubArray
  • We are now open to being allowed to do things like write special cases for broadcast equality to return another special AbstractVector subtype, such that df[:, df.col.==1] could be transformed into a filter operation.
  • We also could potentially add functionality in the future to allow it to do copy-on-write under certain circumstances.

Neither of the last two points have to be done immediately, but they would not be breaking changes.
Important with the 1.0 release coming up soon.

@nalimilan

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

Having df.col and df[col] return a SubArray is appealing. A custom Column wrapper is also interesting, but I'm not sure it's worth the additional complexity. A SubArray with : should be as efficient as a simpler wrapper since : has its own type (Colon). And df[df.col .== 1, :] should actually be very efficient, probably the most efficient way of subsetting a data frame that one can imagine (since allocating a vector of Bool allows computing the size of the result using a simple sum, rather than repeatedly pushing to a vector of indices). EDIT: see also #1848.

I'm hesitant about getcol. It wouldn't be very useful if it was exactly the same thing as getproperty (i.e. returning a view). On the other hand, having it behave differently (returning the raw vector, or a copy) could be confusing. I'm not even sure we really need getcol if it's equivalent to parent(df[col]) or to copy(df[col]). I guess it depends 1) on whether it's very common to require the raw vector (I doubt it, but experience can tell), and 2) whether we also need setcol (#1846).

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

SubArray will use Base.OneTo but it will be fast.

getcol is not strictly needed - as you say. I liked it because it would be consistent if we decided to define getcol. Anyway - it can always be added later.

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 13, 2019

So is the conclusion to return a view and add instructions to the users that they can later copy or parent on them to get what they want (i.e. a copy or a raw source verctor)?

As a side note then df.col and df[:col] will be essentially the same as @view(df, :, :col).

Also maybe if we go this way in one shot we should deprecate df[cols] in favor of select(df, cols)? I am asking because if we do this change df.col and df[:col] will become very non-standard and produce a view while df[cols] will still produce a copy of a data frame.

@nalimilan

This comment has been minimized.

Copy link
Contributor

commented Jun 14, 2019

Yeah, let's do that. The nice thing with that change is that we could even revert it without breaking anything if needed.

Though to replace df[cols], df[:, cols] may be more logical/similar than select(df, cols).

(Then we may even remove df[col], but that's another issue.)

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 14, 2019

df[:, cols] can stay in any case as it is not problematic.

@oxinabox

This comment has been minimized.

Copy link
Contributor Author

commented Jun 14, 2019

Changing return types is breaking.
But it should not practically break much.

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 20, 2019

Side note: when this change is implemented whole codebase of DataFrames.jl should be reviewed as there are places that internally assume that df[col] returns an un-copied original column.

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 26, 2019

Given the recent discussion #1856 will be probably closed (unless some new comments are raised) and the following rules will be implemented:

  • getcol(df, col; copycols=false) - get a column col from a data frame with or without copying (the question what should be default, but I think false by default here is OK)
  • df[col] is deprecated and redirects to getcol(df, col; copycols=false)
  • df[:, col] and df.col - get a column col from a data frame with copying, the same as getcol(df, col; copycols=true)
  • df[cols] is deprecated to select(df, cols) or equivalently df[:, cols]

Also note that this in particular means that df[:, col] and df.col are synonyms (this is important in setcol! discussion in the other issue)

Partially related change:

  • df[cols] for multiple columns is deprecated to select(df, cols) or equivalently df[:, cols] (we should decide which deprecation to print, I think df[:, cols])
  • eachcol will get a copycols keyword argument which will default to true (but maybe the default should be false?)

The deprecations of df[col] and df[cols] should be printed only once I think per Julia session (but we might do something else here if you prefer).

If we are OK with this I will open a new PR implementing this (given we have agreement on the four questions I asked above).

@quinnj

This comment has been minimized.

Copy link
Member

commented Jun 26, 2019

I still think having df.col do copycols=true will be too disruptive and break a lot of things for people.

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 26, 2019

too disruptive and break a lot of things for people.

Given #1846 (comment) the plan would be for df[!, col] not to copy.

In what scenarios do you think df.col doing a copy is disruptive?. Something like:

x = df.col
x[5] = 10

and people assume it would update df?
?

@nalimilan

This comment has been minimized.

Copy link
Contributor

commented Jun 26, 2019

That, but I'd be mainly concerned about the performance impact of copying on common operations like mean(df.col). That's why this issue started about returning a view (not a copy).

@bkamins

This comment has been minimized.

Copy link
Member

commented Jun 26, 2019

Having df.col being the same as df[!, co] is not the end of the world, so I am OK with this. Then eachcol would not copy (I guess then we do not have to add copycols kwarg to it) and we do not need getcol.

@bkamins

This comment has been minimized.

Copy link
Member

commented Jul 12, 2019

After #1866 df.col and df[!, col] are the same and return an un-copied vector. The rule is that single column operations do not copy (although it is acknowledged that this is slightly unsafe).

@quinnj

This comment has been minimized.

Copy link
Member

commented Jul 17, 2019

Should we close this then? Or do we still want to do this for 1.0? I think I'd actually be ok w/ doing something like:

struct DataFrameColumn{T} <: AbstractVector{T}
    source::T
end

which basically supported normal AbstractArray interface, getindex, setindex!, but not resizing operations.

@oxinabox

This comment has been minimized.

Copy link
Contributor Author

commented Jul 17, 2019

I want that because we can later and more functionality around it.

@bkamins

This comment has been minimized.

Copy link
Member

commented Jul 17, 2019

The major stopper for using one or the other approach was that this design is not compatible with CategoricalArrays.jl (at least currently).

All methods using CategoricalArray are not able (at least currently) rely on the vector being CategoricalVector. The key challenge is that we need to distinguish two cases:

  • we have a collection of CategoricalValues coming from different CategoricalArrays
  • we have a collection of CategoricalValues coming from the same CategoricalArray

eltype will return the same for both such collections, but the expected behavior is different. A similar issue was already encountered and reported in MLJ.jl so it is not a purely artificial case.

Let us wait for @nalimilan to comment on this.

Also, my personal view (but not a very strong one) is that when we have #1845 the problem with possible resizing of underlying vectors will be reduced so this is not a must-be functionality.

@bkamins

This comment has been minimized.

Copy link
Member

commented Jul 17, 2019

Also - if we go for some kind of wrapper I would love to have:

  • this wrapper (or abstract version of this wrapper)
  • a wrapper for read-only vector (or abstract version of it)

in DataAPI.jl so that we do not populate "data ecosystem" with numerous types that do very similar thing (also I think it would simplify their adoption as people would learn to use them in different contexts).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.