Make `getproperty(df, col)` return a full length view of the column #1844

oxinabox · 2019-06-10T20:03:07Z

Similar for the single arg getindex (getindex(df, col::Symbol)), and eachcol.

We are progressively knocking out ways of ending up with columns of different sizes.

Right now here are ways that you can't end up with incosistant sizes:

You can't assign in a new column that doesn't have the right size, as setproperty checks the size.
You can't pass in a Vector then resize it using another reference, as the DataFrame constructor copies.
You can't use normal indexing indexing to get access to columns that you then resize, as normal indexing returns copies
you can't use @view indexing returns SubDataFrames which disallow size mutating operations, and who's column vectors are views anyway, so also disallow resizing operations (right?)
and size setproperty (and 1 arg setindex) check the size of the column being added, you can't just insert one with incorrect size.

Right now the only way I can think of getting access to a inner column and then mutating its size,
is the use of getproperty or single argument getindex or eachcol.
Which return the actually underlying Vector.

And I was thinking: It would be great if we could just wrap those in some kind of view like array wrapper that doesn't allow resizing in place, but allows all the other operations one would hope from a Vector, including in-place setindex of elements.

Turns out such a view likes wrapper does exist.
The SubArray.
We can just return a the equivalent of @view df.col[:].
The overhead of creating a view is tiny, as is the overhead of working with a view (at least when the indexing is simple which it is in this case).
You can do all the things to a SubArray as long as you don't call resize! on it -- that is a MethodError.

If someone needs to actually access the raw array, then they can use parent(df.col).
Only reason I can think that would really be needed is for if a method has a overly strict set of type constraints, then accessing the parent would be an alternative to collect (with its own trade-offs).

I think this would knock off the last possible way to end up with a corrupt DataFrame,
via column related shenanigans.

The text was updated successfully, but these errors were encountered:

bkamins · 2019-06-10T20:15:01Z

There would be still one hole:

df.a = vector

and then resize vector.

Still - patching up things is not that bad :). Actually this would be consistent with df[row,cols] returning a view.

An alternative (and this is something we really should discuss here - as we will always have holes) is to create a list of methods in DataFrames.jl that actively check if all columns have the same length before performing their operation (essentially - all expensive methods).

oxinabox · 2019-06-10T20:29:52Z

Those are 2 good points, lets open seperate issues to discuss them.
So we can keep things focused, on this change.
As you say this plugs some leaks anyway,
We can do this change and the others too, or not.

Openned:

bkamins · 2019-06-10T20:40:48Z

Thanks for opening the issues (they will end up in on PR probably later, but it is easier to manage the discussion this way).

Let us wait what other people think df.a should do.

Following up #1846 (comment) here it would be related to the fact that df.a should expand to getcol and getcol could have one (or more) of the following behaviors:

return a view
return a copy
return a vector as is

Now naturally getcol can take a kwarg and allow for all three options above, but then still we have a question what df.a should default to (and a minor question what should be the name of kwarg, as copycols would be useful only if we provided a choice between options 2. and 3., but we have this option 1. that is potentially appealing).

oxinabox · 2019-06-10T21:06:34Z

Return as is:

Pros return as is

consistent with normal struct getfield
preserve ability to do inplace mutatation
fast and nonallocating

Cons return as is:

unsafe to resizing

Returning copy

Pros returning copy:

consistent with getindex
- this is particularly nice for the single argument getindex overload, but we've already accepted that as being kinda special as a thing.
safe to resizing

Cons returning copy

Allocating,
and we have no nice obvious syntax for getting a nonallocating version for getproperty version.
Though @view df[:col] already works .
@view df.col gives an error (we could consider upstreaming that to julialang/julia. @view could be changed to turn @view getproperty(x. name) into getproperty_view(x,name). (and if we really wanted we could monkey patch the macro, but that is super gross)

Return a view

Pros return a view

safe
fast and nonallocating (Tiny overhead)

Cons return a view

Kinda weird type to return.

oxinabox · 2019-06-10T21:52:32Z

Thinking a bit more.
Creating masks is a thing people tend to do.
It isn't a great pattern (filter is better) but still it is a holdover from other languages.
If df.col returned a copy
that would mean that df[:, df.col.==1] would do 3 large allocations
whereas currently (and with a view) df[:, df.col.==1] does only 2 large allocations.

More out there idea:
We could be to make getcol return a newly defined Column type: which is a subtype of AbstractVector, that can not be resized.

Really that is not all that hard to implement, it only actually requires a few methods.
All the safety advantages of a full length view
evan smaller overhead then a SubArray
We are now open to being allowed to do things like write special cases for broadcast equality to return another special AbstractVector subtype, such that df[:, df.col.==1] could be transformed into a filter operation.
We also could potentially add functionality in the future to allow it to do copy-on-write under certain circumstances.

Neither of the last two points have to be done immediately, but they would not be breaking changes.
Important with the 1.0 release coming up soon.

nalimilan · 2019-06-11T11:41:44Z

Having df.col and df[col] return a SubArray is appealing. A custom Column wrapper is also interesting, but I'm not sure it's worth the additional complexity. A SubArray with : should be as efficient as a simpler wrapper since : has its own type (Colon). And df[df.col .== 1, :] should actually be very efficient, probably the most efficient way of subsetting a data frame that one can imagine (since allocating a vector of Bool allows computing the size of the result using a simple sum, rather than repeatedly pushing to a vector of indices). EDIT: see also #1848.

I'm hesitant about getcol. It wouldn't be very useful if it was exactly the same thing as getproperty (i.e. returning a view). On the other hand, having it behave differently (returning the raw vector, or a copy) could be confusing. I'm not even sure we really need getcol if it's equivalent to parent(df[col]) or to copy(df[col]). I guess it depends 1) on whether it's very common to require the raw vector (I doubt it, but experience can tell), and 2) whether we also need setcol (#1846).

bkamins · 2019-06-11T17:11:05Z

SubArray will use Base.OneTo but it will be fast.

getcol is not strictly needed - as you say. I liked it because it would be consistent if we decided to define getcol. Anyway - it can always be added later.

bkamins · 2019-06-13T22:08:50Z

So is the conclusion to return a view and add instructions to the users that they can later copy or parent on them to get what they want (i.e. a copy or a raw source verctor)?

As a side note then df.col and df[:col] will be essentially the same as @view(df, :, :col).

Also maybe if we go this way in one shot we should deprecate df[cols] in favor of select(df, cols)? I am asking because if we do this change df.col and df[:col] will become very non-standard and produce a view while df[cols] will still produce a copy of a data frame.

nalimilan · 2019-06-14T06:59:08Z

Yeah, let's do that. The nice thing with that change is that we could even revert it without breaking anything if needed.

Though to replace df[cols], df[:, cols] may be more logical/similar than select(df, cols).

(Then we may even remove df[col], but that's another issue.)

bkamins · 2019-06-14T08:51:28Z

df[:, cols] can stay in any case as it is not problematic.

oxinabox · 2019-06-14T09:08:04Z

Changing return types is breaking.
But it should not practically break much.

bkamins · 2019-06-20T09:42:12Z

Side note: when this change is implemented whole codebase of DataFrames.jl should be reviewed as there are places that internally assume that df[col] returns an un-copied original column.

bkamins · 2019-06-26T10:51:50Z

Given the recent discussion #1856 will be probably closed (unless some new comments are raised) and the following rules will be implemented:

getcol(df, col; copycols=false) - get a column col from a data frame with or without copying (the question what should be default, but I think false by default here is OK)
df[col] is deprecated and redirects to getcol(df, col; copycols=false)
df[:, col] and df.col - get a column col from a data frame with copying, the same as getcol(df, col; copycols=true)
df[cols] is deprecated to select(df, cols) or equivalently df[:, cols]

Also note that this in particular means that df[:, col] and df.col are synonyms (this is important in setcol! discussion in the other issue)

Partially related change:

df[cols] for multiple columns is deprecated to select(df, cols) or equivalently df[:, cols] (we should decide which deprecation to print, I think df[:, cols])
eachcol will get a copycols keyword argument which will default to true (but maybe the default should be false?)

The deprecations of df[col] and df[cols] should be printed only once I think per Julia session (but we might do something else here if you prefer).

If we are OK with this I will open a new PR implementing this (given we have agreement on the four questions I asked above).

quinnj · 2019-06-26T13:08:01Z

I still think having df.col do copycols=true will be too disruptive and break a lot of things for people.

bkamins · 2019-06-26T13:11:17Z

too disruptive and break a lot of things for people.

Given #1846 (comment) the plan would be for df[!, col] not to copy.

In what scenarios do you think df.col doing a copy is disruptive?. Something like:

x = df.col
x[5] = 10

and people assume it would update df?
?

nalimilan · 2019-06-26T13:38:50Z

That, but I'd be mainly concerned about the performance impact of copying on common operations like mean(df.col). That's why this issue started about returning a view (not a copy).

bkamins · 2019-06-26T13:42:49Z

Having df.col being the same as df[!, co] is not the end of the world, so I am OK with this. Then eachcol would not copy (I guess then we do not have to add copycols kwarg to it) and we do not need getcol.

bkamins · 2019-07-12T20:13:42Z

After #1866 df.col and df[!, col] are the same and return an un-copied vector. The rule is that single column operations do not copy (although it is acknowledged that this is slightly unsafe).

quinnj · 2019-07-17T12:49:40Z

Should we close this then? Or do we still want to do this for 1.0? I think I'd actually be ok w/ doing something like:

struct DataFrameColumn{T} <: AbstractVector{T}
    source::T
end

which basically supported normal AbstractArray interface, getindex, setindex!, but not resizing operations.

oxinabox · 2019-07-17T13:26:31Z

I want that because we can later and more functionality around it.

bkamins · 2019-07-17T14:56:43Z

The major stopper for using one or the other approach was that this design is not compatible with CategoricalArrays.jl (at least currently).

All methods using CategoricalArray are not able (at least currently) rely on the vector being CategoricalVector. The key challenge is that we need to distinguish two cases:

we have a collection of CategoricalValues coming from different CategoricalArrays
we have a collection of CategoricalValues coming from the same CategoricalArray

eltype will return the same for both such collections, but the expected behavior is different. A similar issue was already encountered and reported in MLJ.jl so it is not a purely artificial case.

Let us wait for @nalimilan to comment on this.

Also, my personal view (but not a very strong one) is that when we have #1845 the problem with possible resizing of underlying vectors will be reduced so this is not a must-be functionality.

bkamins · 2019-07-17T15:05:33Z

Also - if we go for some kind of wrapper I would love to have:

this wrapper (or abstract version of this wrapper)
a wrapper for read-only vector (or abstract version of it)

in DataAPI.jl so that we do not populate "data ecosystem" with numerous types that do very similar thing (also I think it would simplify their adoption as people would learn to use them in different contexts).

bkamins · 2019-07-19T06:51:41Z

xref: https://discourse.julialang.org/t/release-announcements-for-dataframes-jl/18258/61 (as I think it summarizes why I slightly lean towards allowing an access to an unwrapped vector using df[!, col]).

nalimilan · 2019-07-22T12:35:05Z

I also think the current (new) design is OK. Wrapping column vector in a wrapper would be overkill and confusing for users, for a limited gain. The CategoricalArray case is a good illustration; we could (and should) probably fix this, but that shows how a wrapper makes everything more complex, when one of the advantages of DataFrames is that it's relatively simple.

bkamins · 2019-12-01T13:53:18Z

I think we have settled for the design that returns the column not its view, so closing this issue (please reopen if you feel it should be discussed more).

This was referenced Jun 10, 2019

Add column length checks to expensive operations #1845

Closed

Consider making setproperty copy Vector #1846

Closed

bkamins mentioned this issue Jun 23, 2019

WIP: make indexing and eachcol return views of data frame columns #1856

Closed

bkamins closed this as completed Dec 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make `getproperty(df, col)` return a full length view of the column #1844

Make `getproperty(df, col)` return a full length view of the column #1844

oxinabox commented Jun 10, 2019 •

edited by nalimilan

Loading

bkamins commented Jun 10, 2019

oxinabox commented Jun 10, 2019 •

edited

Loading

bkamins commented Jun 10, 2019 •

edited

Loading

oxinabox commented Jun 10, 2019 •

edited

Loading

oxinabox commented Jun 10, 2019 •

edited

Loading

nalimilan commented Jun 11, 2019 •

edited

Loading

bkamins commented Jun 11, 2019

bkamins commented Jun 13, 2019

nalimilan commented Jun 14, 2019

bkamins commented Jun 14, 2019

oxinabox commented Jun 14, 2019

bkamins commented Jun 20, 2019

bkamins commented Jun 26, 2019

quinnj commented Jun 26, 2019

bkamins commented Jun 26, 2019 •

edited

Loading

nalimilan commented Jun 26, 2019

bkamins commented Jun 26, 2019

bkamins commented Jul 12, 2019

quinnj commented Jul 17, 2019

oxinabox commented Jul 17, 2019

bkamins commented Jul 17, 2019

bkamins commented Jul 17, 2019

bkamins commented Jul 19, 2019

nalimilan commented Jul 22, 2019

bkamins commented Dec 1, 2019

Make getproperty(df, col) return a full length view of the column #1844

Make getproperty(df, col) return a full length view of the column #1844

Comments

oxinabox commented Jun 10, 2019 • edited by nalimilan Loading

bkamins commented Jun 10, 2019

oxinabox commented Jun 10, 2019 • edited Loading

bkamins commented Jun 10, 2019 • edited Loading

oxinabox commented Jun 10, 2019 • edited Loading

Return as is:

Pros return as is

Cons return as is:

Returning copy

Pros returning copy:

Cons returning copy

Return a view

Pros return a view

Cons return a view

oxinabox commented Jun 10, 2019 • edited Loading

nalimilan commented Jun 11, 2019 • edited Loading

bkamins commented Jun 11, 2019

bkamins commented Jun 13, 2019

nalimilan commented Jun 14, 2019

bkamins commented Jun 14, 2019

oxinabox commented Jun 14, 2019

bkamins commented Jun 20, 2019

bkamins commented Jun 26, 2019

quinnj commented Jun 26, 2019

bkamins commented Jun 26, 2019 • edited Loading

nalimilan commented Jun 26, 2019

bkamins commented Jun 26, 2019

bkamins commented Jul 12, 2019

quinnj commented Jul 17, 2019

oxinabox commented Jul 17, 2019

bkamins commented Jul 17, 2019

bkamins commented Jul 17, 2019

bkamins commented Jul 19, 2019

nalimilan commented Jul 22, 2019

bkamins commented Dec 1, 2019

Make `getproperty(df, col)` return a full length view of the column #1844

Make `getproperty(df, col)` return a full length view of the column #1844

oxinabox commented Jun 10, 2019 •

edited by nalimilan

Loading

oxinabox commented Jun 10, 2019 •

edited

Loading

bkamins commented Jun 10, 2019 •

edited

Loading

oxinabox commented Jun 10, 2019 •

edited

Loading

oxinabox commented Jun 10, 2019 •

edited

Loading

nalimilan commented Jun 11, 2019 •

edited

Loading

bkamins commented Jun 26, 2019 •

edited

Loading