WIP: implementation of new setindex! and broadcasting rules #1646

bkamins · 2018-12-19T10:42:20Z

@nalimilan Following #1645 I have written down the possible target rules, so that we can edit them.

I will add a specification for SubDataFrame and DataFrameRow when we settle AbstractDataFrame.

CC @coreywoodfield


nalimilan · 2018-12-20T15:25:05Z

docs/src/lib/indexing.md

+> `df[rows, col] = v`
+
+* The same rules as for `df[col] = v` but on selected rows and always with copying.
+* Empty data frames are not allowed and no column adding is possible.


Maybe df[:, col] = v and/or df[:, col] .= v should allow empty data frames?

We have df[col] = v and df[col] .= v for this case. It would render df[:, col] = v inconsistent.

And this is exactly the inconsistency in the current design of DataFrames.jl that I am trying to fix here. See the example of how x and y are treated differently:

julia> x = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> y = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> df = DataFrame(x=x,y=y)
3×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │

julia> df[:, :x] = [0,0,0]
3-element Array{Int64,1}:
 0
 0
 0

julia> df[1:3, :y] = [0,0,0]
3-element Array{Int64,1}:
 0
 0
 0

julia> df
3×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 0     │ 0     │
│ 2   │ 0     │ 0     │
│ 3   │ 0     │ 0     │

julia> x
3-element Array{Int64,1}:
 1
 2
 3

julia> y
3-element Array{Int64,1}:
 0
 0
 0

julia> df[:, :x] = 'a':'c'
'a':1:'c'

julia> df
3×2 DataFrame
│ Row │ x    │ y     │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1   │ 'a'  │ 0     │
│ 2   │ 'b'  │ 0     │
│ 3   │ 'c'  │ 0     │

julia> df[1:3, :y] = 'a':'c'
'a':1:'c'

julia> df
3×2 DataFrame
│ Row │ x    │ y     │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1   │ 'a'  │ 97    │
│ 2   │ 'b'  │ 98    │
│ 3   │ 'c'  │ 99    │
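The asymmetry in this session can be reproduced without DataFrames.jl. The following is a minimal Base-only sketch, assuming a plain `Dict` (the name `cols` is hypothetical) stands in for the data frame's column storage. Replacing a column swaps in a new vector and leaves the original untouched, while writing into selected rows mutates the stored vector and converts values to its element type.

```julia
# Base-only analogy; `cols` is a hypothetical stand-in for column storage,
# not DataFrames.jl internals.
x = [1, 2, 3]
y = [1, 2, 3]
cols = Dict{Symbol,AbstractVector}(:x => x, :y => y)

# Replace the whole column: the stored vector is swapped for a new one,
# so `x` keeps its values and the column's element type can change.
cols[:x] = collect('a':'c')

# Write into existing rows: the stored vector (the same object as `y`)
# is mutated, and each Char is converted to Int.
cols[:y][1:3] = 'a':'c'

println(x)                 # x is unchanged
println(y)                 # y now holds the code points of 'a', 'b', 'c'
println(eltype(cols[:x]))  # Char
```

This is only an analogy for the two behaviours being discussed; it says nothing about how DataFrames.jl stores columns.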

I agree this is problematic, but I don't see the relationship with empty data frames (it's been a long time....).

The proposed rule is that if we use df[:, col] or df[rows, col] in general we write to an existing vector in the DataFrame in-place. So - naturally it has to fail in an empty DataFrame.

Or - from another angle - the proposed rule is that on LHS df[:, col] is rewritten as df[axes(df, 1), col].

This approach is consistent with Base, where the manual states here:

This includes Colon (:) to select all indices within the entire dimension

I would be OK to bend the rules in this case, if there were no other convenient way to assign a new column to a DataFrame, but there is one and it is df[col] = v.

But if you really think it is useful we can make an exception for an empty DataFrame.
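The analogy with Base can be checked on a plain vector. This is a small sketch of the Base behaviour only (it does not involve DataFrames.jl): on the left-hand side, `:` behaves like `axes(v, 1)`, and assigning a non-empty value into a container with no existing elements fails instead of growing it.

```julia
# On the left-hand side, `:` selects every existing index of the dimension.
v = [1, 2, 3]
w = [1, 2, 3]
v[:] = [10, 20, 30]
w[axes(w, 1)] = [10, 20, 30]
same_result = (v == w)

# With no existing elements there is nothing to write into, so a
# non-empty right-hand side is rejected rather than growing the vector.
e = Int[]
failed = try
    e[:] = [1, 2, 3]
    false
catch err
    err isa DimensionMismatch
end

println(same_result)
println(failed)
```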

But if we have e.g.

df = DataFrame(x=[])
df[:, 1] = []

Then what's the problem? We've set all entries in df.x to the values from the RHS -- it's just that there aren't any of them. :-)

There is no problem 😄. Your df is not empty, it has 0-rows, but it is a different story. An empty data frame is returned by the call DataFrame().

Note that this distinction between kinds of emptiness is very relevant in other cases.

This should work:

df = DataFrame()
df.a = [1,2,3]

while this should fail

df = DataFrame(x=[])
df.a = [1,2,3]

An interesting corner case is:

julia> df = DataFrame(x=[1,2])
2×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df.x = [1,2,3,4]
ERROR: ArgumentError: New columns must have the same length as old columns

which fails now, and I think it should still fail.

The short rule - an empty data frame DataFrame() has an undefined number of rows (not 0) while any non-empty data frame has a concrete number of rows.
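To make the short rule concrete, here is a toy model, not DataFrames.jl code: `ToyFrame`, `toy_nrow` and `add_column!` are hypothetical names introduced only for illustration. It assumes the stricter variant in which the row count of a frame with zero columns is `nothing`, so any column length is accepted there, while a frame with at least one column (even with zero rows) fixes the length.

```julia
# Toy model only; `ToyFrame`, `toy_nrow` and `add_column!` are hypothetical
# names, not DataFrames.jl API.
struct ToyFrame
    columns::Vector{Vector{Any}}
end

# The number of rows is only defined once at least one column exists.
toy_nrow(tf::ToyFrame) = isempty(tf.columns) ? nothing : length(first(tf.columns))

function add_column!(tf::ToyFrame, v::AbstractVector)
    n = toy_nrow(tf)
    if n !== nothing && length(v) != n
        throw(ArgumentError("New columns must have the same length as old columns"))
    end
    push!(tf.columns, collect(Any, v))
    return tf
end

zero_cols = ToyFrame(Vector{Vector{Any}}())  # plays the role of DataFrame()
zero_rows = ToyFrame([Any[]])                # plays the role of DataFrame(x=[])

add_column!(zero_cols, [1, 2, 3])            # accepted: row count was undefined

rejected = try
    add_column!(zero_rows, [1, 2, 3])        # rejected: 3 rows into a 0-row frame
    false
catch err
    err isa ArgumentError
end
```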

Actually we could consider making nrow(DataFrame()) return nothing instead of 0. This would be technically stricter, but I always thought it would not be very useful, so I have not proposed this change.

OK, then maybe say "with zero columns" to avoid any possible ambiguity.

> Actually we could consider to make nrow(DataFrame()) return nothing instead of 0, this would be technically stricter, but I always thought it would not be very useful, so I have not proposed to add this change.

Or even missing, since it's really unknown. ;-) But 0 is probably the most useful answer.

OK - I will replace empty with zero columns everywhere to disambiguate.


bkamins · 2019-02-03T11:21:46Z

docs/src/lib/indexing.md

+      (which means that passing `Integer` or `Bool` will fail for nonexistent columns,
+      but adding new columns as `Symbol` is allowed)
+    * then `df[col] = v[col]` is called for each column name `col`
+* if `v` is a vector of vectors or a tuple of vectors the same process is performed but


I have added this rule. I do not know if we want to keep it (or keep it only for vectors and not for tuples)

Sounds fine. Am I right that this doesn't introduce any ambiguity if you want a column which is a vector of vectors (since you would have to wrap it in a vector/tuple/data frame anyway since this is a multi-column assignment)?

Right - you would have to wrap it. Single column assignment is another rule.

Actually this is very delicate. My idea is that df[cols] = v should ideally give the same result as df[cols] = DataFrame(v) just without having to materialize the intermediate DataFrame.

The same for df[rows, cols] = v.

But now I see that:

julia> x = ([1,2,3], [1,2,3])
([1, 2, 3], [1, 2, 3])

julia> DataFrame(x)
ERROR: ArgumentError: 'Tuple{Array{Int64,1},Array{Int64,1}}' iterates 'Array{Int64,1}' values, which don't satisfy the Tables.jl Row-iterator interface

julia> x = [[1,2,3], [1,2,3]]
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [1, 2, 3]

julia> DataFrame(x)
3×2 DataFrame
│ Row │ x1    │ x2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │

which is quite surprising (in general, the catch-all DataFrame constructor from Tables.jl & friends always catches me off guard 😞). So I think we should drop the tuple here and keep only the vector?

Why doesn't Tables.jl accept a tuple here?

It accepts it, but assumes that the contents are row-wise, not column-wise; see e.g.:

julia> DataFrame(((a=1,b=2),(a=3,b=4)))
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

That is why I say that this Tables.jl API is tricky when combined with what we already had in DataFrames.jl.

Following your suggestion I have commented on the constructors in a separate issue DataFrame constructors #1599.

Are you on latest versions of all packages from Tables.jl & friends? For me it works:

julia> DataFrame([(a=1,b=2),(a=3,b=4)])
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

(and even if it did not work it would not be an error in DataFrames.jl 😀)

OK, actually I tested that on my PooledArrays branch which was behind master. It works now.

Given the discussion in #1599 (comment) I changed my mind and would leave only vector of vectors here for now.

The reason is that, in general, when you have a collection it is not clear whether it should be treated row-wise or column-wise. We can add other types of v later if there is a need.

My suggestion was that the collection type should not matter to decide whether it's row- or column-wise: tuples and vectors should be treated the same. Two constructors would take tuples and vectors of vectors. Other collections would be assumed to be row-oriented and handled by Tables.jl.

OK :). So I will add tuples here and in the constructors thread.
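The row-wise versus column-wise ambiguity discussed in this thread can be illustrated with Base containers alone. This is a sketch with hypothetical variable names: the same 2×2 table is written once as a vector of columns and once as a vector of named-tuple rows, and reading the column-oriented data as rows yields a different table.

```julia
# The same 2×2 table written in two orientations (Base only).
by_columns = [[1, 3], [2, 4]]                   # each inner vector is a column
by_rows    = [(a = 1, b = 2), (a = 3, b = 4)]   # each NamedTuple is a row

# Transposing the row-oriented form recovers the columns.
columns_from_rows = [[row[j] for row in by_rows] for j in 1:2]

# Reading the column-oriented data as if it were rows gives a different table.
misread_as_rows = [[col[j] for col in by_columns] for j in 1:2]

println(columns_from_rows == by_columns)   # true
println(misread_as_rows == by_columns)     # false
```

A vector of vectors carries no marker of its orientation, which is why the thread settles on a convention rather than trying to infer it.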

bkamins · 2019-02-03T11:22:07Z

docs/src/lib/indexing.md

+    * `length(v)` must be equal to `length(cols)`
+    * column names in `v` must be the same as selected by `cols`
+    * an operation `df[row, col] = v[col]` for `col in cols` is performed
+* if `v` is a vector or a tuple:


this additional rule is related to this rule.
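The three bullet points for a named `v` in the hunk above can be sketched as follows. This is a toy model, not the DataFrames.jl implementation: `set_row!` is a hypothetical helper, the columns live in a plain `Dict`, and the sketch assumes that "the same names" also means the same order.

```julia
# Toy sketch; `set_row!` is a hypothetical helper, not DataFrames.jl API.
function set_row!(columns::AbstractDict{Symbol,<:AbstractVector},
                  row::Integer, cols::Vector{Symbol}, v::NamedTuple)
    # `length(v)` must be equal to `length(cols)`
    length(v) == length(cols) ||
        throw(DimensionMismatch("v has $(length(v)) entries but $(length(cols)) columns were selected"))
    # column names in `v` must match `cols` (assumed here: in the same order)
    collect(keys(v)) == cols ||
        throw(ArgumentError("column names in v do not match the selected columns"))
    # perform `df[row, col] = v[col]` for each selected column
    for col in cols
        columns[col][row] = v[col]
    end
    return columns
end

columns = Dict(:a => [1, 2, 3], :b => [4, 5, 6])
set_row!(columns, 2, [:a, :b], (a = 10, b = 20))

name_mismatch = try
    set_row!(columns, 2, [:a, :b], (a = 1, c = 2))
    false
catch err
    err isa ArgumentError
end
```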

richgoldberg · 2019-03-22T05:43:07Z

Regarding the specification in docs/src/lib/indexing.md:

> `df[rows, col] = v`
>
> * The same rules as for `df[col] = v` but on selected rows and always with copying.
> * Empty data frames are not allowed and no column adding is possible.
>
> `df[rows, col] .= v`
>
> * The same rules as for `df[col] .= v` but on selected rows and always with copying.
> * Empty data frames are not allowed and no column adding is possible.

This doesn't seem to be the case right now (maybe you know this?) even when there is no issue with empty DataFrames. Here's an example where df[rows, col] .= v does not behave like df[col] .= v. Further, it doesn't throw any sort of error or warning to the user.

# Example: start with a DataFrame and update a column in place

# Initial DataFrame to modify
using DataFrames
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
display(df)

# Case 1: df[col] = v works
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
v = ["a", "b", "c"]
df[:b] = v
display(df)

# Case 2: df[rows, col] = v works
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
v = ["a", "b", "c"]
df[:, :b] = v
display(df)

# Case 3: df[col] .= v works
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
v = ["a", "b", "c"]
df[:b] .= v
display(df)

# Case 4: df[rows, col] .= v doesn't work!
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
v = ["a", "b", "c"]
df[:, :b] .= v
display(df)

Note: Run in Julia 1.1.0 using DataFrames v. 0.17.1

nalimilan · 2019-03-22T07:59:28Z

This PR describes what should happen, not what happens right now.


bkamins · 2019-05-05T09:03:12Z

While implementing the setindex! and broadcasting rules I started to have a radical thought: we should not allow general assignment to subsets of a DataFrame, but only to df[col], df[row, col], and df[rows, col]. In other words, allow assignment only to a single column, which is a vector, so that the rules are clear.

What I mean is that assignment to df[cols], df[row, cols], and df[rows, cols] is problematic in two ways:

* Should we match by column name or by column index? (We could set some rules, but I am not sure which is better.)
* When broadcasting, how should we treat the right-hand side? df[cols] and df[rows, cols] should theoretically be viewed as a collection of rows, but df[row, cols] should be a collection of 1-element columns.

Additionally, I am not sure how useful having df[cols], df[row, cols], or df[rows, cols] on the LHS is in practice; if it is really needed, it can be handled explicitly by a simple iteration over columns.
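The first problem, and the explicit-iteration alternative, can be shown with a small Base-only sketch. The names `target`, `cols` and `src` are hypothetical, and a `Dict` of vectors stands in for a data frame. When the right-hand side lists its columns in a different order than the selection, matching by position and matching by name write different values.

```julia
# Toy setup; a Dict of vectors stands in for a data frame (hypothetical names).
target = Dict(:a => [0, 0], :b => [0, 0])
cols = [:a, :b]
src = (b = [1, 2], a = [3, 4])   # same names as `cols`, different order

# Explicit iteration over columns, matching by position.
by_position = deepcopy(target)
for (i, col) in enumerate(cols)
    copyto!(by_position[col], src[i])
end

# Explicit iteration over columns, matching by name.
by_name = deepcopy(target)
for col in cols
    copyto!(by_name[col], src[col])
end

println(by_position[:a])   # values taken from the first entry of `src`
println(by_name[:a])       # values taken from the entry named `a`
```

Writing the loop out makes the matching rule visible at the call site, which is the argument for not building either choice into multi-column assignment.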

@nalimilan, @oxinabox - do you have any thoughts on this? (If it is not clear what I mean here, please comment and I can expand on my thoughts.)

bkamins · 2019-05-05T20:52:19Z

Just to give more perspective on this, I am not convinced we want to accept any of the following:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> df2 = DataFrame(x=4:6)
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df[[:b]] = df2
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> df[[:c, :d]] = 10
10

julia> df
3×4 DataFrame
│ Row │ a     │ b     │ c     │ d     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 10    │ 10    │
│ 2   │ 2     │ 5     │ 10    │ 10    │
│ 3   │ 3     │ 6     │ 10    │ 10    │

julia> df[1, [1]] = df2
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df
3×4 DataFrame
│ Row │ a     │ b     │ c     │ d     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 4     │ 4     │ 10    │ 10    │
│ 2   │ 2     │ 5     │ 10    │ 10    │
│ 3   │ 3     │ 6     │ 10    │ 10    │

julia> df[3, [1]] = df2
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df
3×4 DataFrame
│ Row │ a     │ b     │ c     │ d     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 4     │ 4     │ 10    │ 10    │
│ 2   │ 2     │ 5     │ 10    │ 10    │
│ 3   │ 4     │ 6     │ 10    │ 10    │

and in general I now feel that allowing a data frame on the RHS of an assignment is problematic.

What we could allow (though I am not sure how useful it would be, which is why I am hesitant to propose these; we can always add them later) is:

* df[row, cols] = DataFrameRow | AbstractDict | NamedTuple | iterable (in general, to have the same functionality as push!)
* df[rows, cols] = matrix (treating df as a 2-dimensional object)
* df[row, cols] .= any value that is 0-dimensional in terms of broadcasting, and df[rows, cols] .= any value that is 0-, 1- or 2-dimensional in terms of broadcasting (in general: adhering to standard broadcasting, which assumes numerical axes, and disallowing anything else on the RHS)
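For reference, this is what "standard broadcasting" means for an ordinary Base matrix. The sketch below shows only Base behaviour with 0-, 1- and 2-dimensional right-hand sides; whether a data frame should follow it exactly is the open question in this comment.

```julia
m = zeros(Int, 3, 2)

# 0-dimensional RHS: the scalar fills the selected row.
m[1, :] .= 7
after_scalar = copy(m)

# 1-dimensional RHS: the vector runs down the rows and is repeated
# across the selected columns.
m[2:3, :] .= [1, 2]
after_vector = copy(m)

# 2-dimensional RHS: shapes must match the selection element by element.
m[2:3, :] .= [10 20; 30 40]
after_matrix = copy(m)

display(after_scalar)
display(after_vector)
display(after_matrix)
```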

bkamins · 2019-07-12T19:57:48Z

#1866 implements current rules

nalimilan reviewed Dec 20, 2018


bkamins mentioned this pull request Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins commented Feb 3, 2019


This was referenced Feb 10, 2019

DataFrame constructors #1599

Closed

push! which promotes type #1716

Closed

Add DataFrame constructors allowing NTuple and collection of Pair-s #1717

Merged

bkamins mentioned this pull request Feb 17, 2019

Policy regarding in-place operations #1695

Closed

bkamins mentioned this pull request Apr 2, 2019

broadcasted setindex not working as expected #1507

Closed

oxinabox reviewed Apr 2, 2019


bkamins and others added 10 commits April 28, 2019 16:29

initial specification of rules

ecb0d2d

Apply suggestions from code review

346d8b8

Co-Authored-By: bkamins <bkamins@sgh.waw.pl>

Incorporate review messages

d1878bd

a small additional comment

a37a4e7

disallow a tuple on RHS of assignments and clarify what empty data fr…

a3d2be5

…ame is

better df[row, cols] = v rule

4c34717

expect vector of vectors in df[rows,cols]=v on RHS

03982f0

re-introduce tuple of vectors

c8f7f49

Update docs/src/lib/indexing.md

9f56e86

Co-Authored-By: bkamins <bkamins@sgh.waw.pl>

improvement of description of rules

7fa4896

bkamins force-pushed the broadcasted_assignment branch from 7e590b0 to 7fa4896 Compare April 28, 2019 14:30

bkamins added 3 commits April 28, 2019 17:05

improve specification of the rules

36262aa

improve description

3af8490

first part of implementation of setindex! (WIP)

f78f524

bkamins changed the title ~~Initial specification of new setindex! rules~~ WIP: implementation of new setindex! and broadcasting rules Apr 28, 2019

bkamins added 3 commits May 4, 2019 09:39

Merge branch 'master' into broadcasted_assignment

6c7985f

additional fixes

f41a830

make df[cols] = v perform a copy of passed columns

cca9f77

implement temporary setindex! rules

90855b9

bkamins mentioned this pull request May 6, 2019

Broadcasting in DataFrames #1804

Merged

bkamins closed this Jul 12, 2019

bkamins deleted the broadcasted_assignment branch July 15, 2019 14:58

