WIP: implementation of new setindex! and broadcasting rules #1646

Closed
wants to merge 17 commits into from

@bkamins
Member

bkamins commented Dec 19, 2018

@nalimilan Following #1645 I have written down the possible target rules, so that we can edit them.

I will add a specification for SubDataFrame and DataFrameRow when we settle AbstractDataFrame.

CC @coreywoodfield

docs/src/lib/indexing.md (outdated)
> `df[rows, col] = v`

* The same rules as for `df[col] = v` but on selected rows and always with copying.
* Empty data frames are not allowed and no column adding is possible.
Member

Maybe df[:, col] = v and/or df[:, col] .= v should allow empty data frames?

Member Author

We have df[col] = v and df[col] .= v for this case. It would render df[:, col] = v inconsistent.

This is exactly the inconsistency in the current design of DataFrames.jl that I am trying to fix. See how x and y are treated differently in this example:

julia> x = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> y = [1,2,3]
3-element Array{Int64,1}:
 1
 2
 3

julia> df = DataFrame(x=x,y=y)
3×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │

julia> df[:, :x] = [0,0,0]
3-element Array{Int64,1}:
 0
 0
 0

julia> df[1:3, :y] = [0,0,0]
3-element Array{Int64,1}:
 0
 0
 0

julia> df
3×2 DataFrame
│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 0     │ 0     │
│ 2   │ 0     │ 0     │
│ 3   │ 0     │ 0     │

julia> x
3-element Array{Int64,1}:
 1
 2
 3

julia> y
3-element Array{Int64,1}:
 0
 0
 0

julia> df[:, :x] = 'a':'c'
'a':1:'c'

julia> df
3×2 DataFrame
│ Row │ x    │ y     │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1   │ 'a'  │ 0     │
│ 2   │ 'b'  │ 0     │
│ 3   │ 'c'  │ 0     │

julia> df[1:3, :y] = 'a':'c'
'a':1:'c'

julia> df
3×2 DataFrame
│ Row │ x    │ y     │
│     │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1   │ 'a'  │ 97    │
│ 2   │ 'b'  │ 98    │
│ 3   │ 'c'  │ 99    │

Member

I agree this is problematic, but I don't see the relationship with empty data frames (it's been a long time....).

Member Author

The proposed rule is that df[:, col] = v, and more generally df[rows, col] = v, writes in-place into an existing column vector of the DataFrame. So it naturally has to fail on an empty DataFrame.

Or - from another angle - the proposed rule is that on LHS df[:, col] is rewritten as df[axes(df, 1), col].

This approach is consistent with Base, where the manual states:

> This includes Colon (:) to select all indices within the entire dimension

I would be OK to bend the rules in this case, if there were no other convenient way to assign a new column to a DataFrame, but there is one and it is df[col] = v.

But if you really think it is useful we can make an exception for an empty DataFrame.
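
To illustrate the Base behavior referred to above, here is a minimal sketch using plain arrays only (no DataFrames.jl). The names `A`, `col_before` and `y` are made up for this example. It shows that `:` on the left-hand side selects the same indices as `axes(A, 1)`, and that indexed assignment writes into existing storage and converts values to the existing element type.

```julia
# Base arrays only; DataFrames.jl is not used here.
A = [1 10; 2 20; 3 30]            # 3×2 Matrix{Int}
col_before = view(A, :, 1)        # a view sharing memory with column 1

# On the left-hand side, `:` selects every index along that dimension ...
A[:, 1] = [7, 8, 9]
# ... which is the same set of indices as `axes(A, 1)`.
A[axes(A, 1), 2] = [70, 80, 90]

# Both assignments wrote into the existing storage, so the view sees the change.
@assert col_before == [7, 8, 9]
@assert A == [7 70; 8 80; 9 90]

# In-place assignment keeps the element type and converts the new values.
y = [1, 2, 3]
y[1:3] = 'a':'c'
@assert eltype(y) == Int
@assert y == [97, 98, 99]
```

This is the behavior the proposal wants `df[rows, col] = v` to follow: since it writes into an existing column, there has to be a column to write into.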

Member

But if we have e.g.

df = DataFrame(x=[])
df[:, 1] = []

Then what's the problem? We've set all entries in df.x to the values from the RHS -- it's just that there aren't any of them. :-)

Member Author

There is no problem 😄. Your df is not empty: it has zero rows, which is a different story. An empty data frame is what the call DataFrame() returns.

Member Author

bkamins commented Feb 10, 2019

Note that this distinction between kinds of emptiness is very relevant in other cases.

This should work:

df = DataFrame()
df.a = [1,2,3]

while this should fail:

df = DataFrame(x=[])
df.a = [1,2,3]

An interesting corner case is:

julia> df = DataFrame(x=[1,2])
2×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df.x = [1,2,3,4]
ERROR: ArgumentError: New columns must have the same length as old columns

which fails now, and I think it should still fail.

The short rule - an empty data frame DataFrame() has an undefined number of rows (not 0) while any non-empty data frame has a concrete number of rows.

Actually we could consider making nrow(DataFrame()) return nothing instead of 0. This would be technically stricter, but I always thought it would not be very useful, so I have not proposed this change.
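
To make the zero-columns versus zero-rows distinction concrete, here is a minimal sketch of the rule. It is a toy model only: `ToyTable`, `rowcount` and `addcolumn!` are hypothetical names invented for illustration and are not DataFrames.jl code. The sketch also takes the stricter option of returning `nothing` for the undefined row count.

```julia
# Hypothetical toy type, not the DataFrames.jl implementation.
struct ToyTable
    names::Vector{Symbol}
    columns::Vector{Vector}
end
ToyTable() = ToyTable(Symbol[], Vector[])

# With zero columns the row count is undefined; otherwise all columns agree.
rowcount(t::ToyTable) = isempty(t.columns) ? nothing : length(first(t.columns))

function addcolumn!(t::ToyTable, name::Symbol, v::Vector)
    n = rowcount(t)
    if n !== nothing && length(v) != n
        throw(ArgumentError("New columns must have the same length as old columns"))
    end
    push!(t.names, name)
    push!(t.columns, v)
    return t
end

# Zero columns: a column of any length can be added.
t = ToyTable()
addcolumn!(t, :a, [1, 2, 3])
@assert rowcount(t) == 3

# One column with zero rows: the row count is fixed at 0, so this must fail.
z = ToyTable([:x], [Any[]])
@assert rowcount(z) == 0
failed = try
    addcolumn!(z, :a, [1, 2, 3])
    false
catch err
    err isa ArgumentError
end
@assert failed
```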

Member

OK, then maybe say "with zero columns" to avoid any possible ambiguity.

> Actually we could consider making nrow(DataFrame()) return nothing instead of 0. This would be technically stricter, but I always thought it would not be very useful, so I have not proposed this change.

Or even missing, since it's really unknown. ;-) But 0 is probably the most useful answer.

Member Author

OK - I will replace "empty" with "zero columns" everywhere to disambiguate.

@bkamins mentioned this pull request Jan 15, 2019
(which means that passing `Integer` or `Bool` will fail for nonexistent columns,
but adding new columns as `Symbol` is allowed)
* then `df[col] = v[col]` is called for each column name `col`
* if `v` is a vector of vectors or a tuple of vectors the same process is performed but
Member Author

I have added this rule. I do not know if we want to keep it (or keep it only for vectors and not for tuples)

Member

Sounds fine. Am I right that this doesn't introduce any ambiguity if you want a column which is a vector of vectors (since you would have to wrap it in a vector/tuple/data frame anyway since this is a multi-column assignment)?

Member Author

Right - you would have to wrap it. Single column assignment is another rule.

Actually this is very delicate. My idea is that df[cols] = v should ideally give the same result as df[cols] = DataFrame(v) just without having to materialize the intermediate DataFrame.

The same for df[rows, cols] = v.

But now I see that:

julia> x = ([1,2,3], [1,2,3])
([1, 2, 3], [1, 2, 3])

julia> DataFrame(x)
ERROR: ArgumentError: 'Tuple{Array{Int64,1},Array{Int64,1}}' iterates 'Array{Int64,1}' values, which don't satisfy the Tables.jl Row-iterator interface

julia> x = [[1,2,3], [1,2,3]]
2-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [1, 2, 3]

julia> DataFrame(x)
3×2 DataFrame
│ Row │ x1    │ x2    │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │

which is quite surprising (in general, the catch-all DataFrame constructor from Tables.jl & friends always catches me off guard 😞). So I think we should drop the tuple here and keep only the vector?

Member

Why doesn't Tables.jl accept a tuple here?

Member Author

It accepts it, but assumes that the contents are row-wise, not column-wise; see e.g.:

julia> DataFrame(((a=1,b=2),(a=3,b=4)))
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

That is why I say that this Tables.jl API is tricky when combined with what we already had in DataFrames.jl.

Member Author

  1. Following your suggestion I have commented on the constructors in a separate issue, DataFrame constructors #1599.
  2. Are you on the latest versions of all packages from Tables.jl & friends? For me it works:
julia> DataFrame([(a=1,b=2),(a=3,b=4)])
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │
│ 2   │ 3     │ 4     │

(and even if it did not work it would not be an error in DataFrames.jl 😀)

Member

OK, actually I tested that on my PooledArrays branch which was behind master. It works now.

Member Author

Given the discussion in #1599 (comment) I changed my mind and would leave only vector of vectors here for now.

The reason is that, in general, when you have a collection it is not clear whether it should be treated row-wise or column-wise. We can add other values of v later if there is a need.

Member

My suggestion was that the collection type should not matter to decide whether it's row- or column-wise: tuples and vectors should be treated the same. Two constructors would take tuples and vectors of vectors. Other collections would be assumed to be row-oriented and handled by Tables.jl.

Member Author

OK :). So I will add tuples here and in the constructors thread.

* `length(v)` must be equal to `length(cols)`
* column names in `v` must be the same as selected by `cols`
* an operation `df[row, col] = v[col]` for `col in cols` is performed
* if `v` is a vector or a tuple:
Member Author

this additional rule is related to this rule.

@richgoldberg

Regarding the specification in docs/src/lib/indexing.md:

df[rows, col] = v
The same rules as for df[col] = v but on selected rows and always with copying.
Empty data frames are not allowed and no column adding is possible.

df[rows, col] .= v
The same rules as for df[col] .= v but on selected rows and always with copying.
Empty data frames are not allowed and no column adding is possible.

This doesn't seem to be the case right now (maybe you know this?) even when there is no issue with empty DataFrames. Here's an example where df[rows, col] .= v does not behave like df[col] .= v. Further, it doesn't throw any sort of error or warning to the user.

# Example:  Start with a DataFrame and update a column in-place

# initial DataFrame to modify
using DataFrames
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
display(df)

#Case 1: df[col] = v works
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
v = ["a", "b", "c"]
df[ :b] = v
display(df)

#Case 2: df[rows, col] = v works
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
v = ["a", "b", "c"]
df[:, :b] = v
display(df)

#Case 3: df[col] .= v works
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
v = ["a", "b", "c"]
df[ :b] .= v
display(df)

#Case 4: df[rows, col] .= v doesn't work!
df = DataFrame(a = [1,3,5], b = ["A", "B", "C"])
v = ["a", "b", "c"]
df[:, :b] .= v
display(df)

Note: Run in Julia 1.1.0 using DataFrames v. 0.17.1

@nalimilan
Member

This PR describes what should happen, not what happens right now.

@bkamins changed the title from "Initial specification of new setindex! rules" to "WIP: implementation of new setindex! and broadcasting rules" Apr 28, 2019
@bkamins
Member Author

bkamins commented May 5, 2019

While implementing the setindex! and broadcasting rules I started to have a radical thought: we should not allow general assignment to DataFrame subsets, but only to df[col], df[row, col], and df[rows, col]. That is, we would only allow assignment into a single column, which is a vector, and then the rules are clear.

What I mean is that assignment to df[cols], df[row, cols], df[rows, cols] is problematic in two ways:

  • should we do matching by column name or column index (we could set some rules but I am not sure which is better);
  • when broadcasting, how should we treat the right-hand side? df[cols] and df[rows, cols] theoretically should be viewed as a collection of rows, but df[row, cols] should be a collection of 1-element columns.

Additionally, I am not sure how useful having df[cols], df[row, cols], or df[rows, cols] on the LHS is in practice. If it is really needed, it can be handled explicitly by a simple iteration over columns.

@nalimilan, @oxinabox - do you have any thoughts on it (if it is not clear what I mean here please comment and I can expand on my thoughts).
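
The "simple iteration over columns" mentioned above could look roughly like the following sketch. It does not use DataFrames.jl: a `Dict` from column name to vector stands in for the table, and `cols`, `newvals` and `rows` are hypothetical names chosen for this example. The point is that the caller states explicitly that matching is done by column name.

```julia
# Hypothetical stand-in for a table: column name => column vector.
cols = Dict(:a => [1, 2, 3], :b => [4, 5, 6])
newvals = Dict(:a => [10, 20], :b => [40, 50])
rows = 1:2

# One single-column, in-place assignment per column, matched by name.
for name in (:a, :b)
    cols[name][rows] = newvals[name]
end

@assert cols[:a] == [10, 20, 3]
@assert cols[:b] == [40, 50, 6]
```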

@bkamins
Member Author

bkamins commented May 5, 2019

To give more perspective on this: I am not convinced we want to accept any of the following:

julia> df = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> df2 = DataFrame(x=4:6)
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df[[:b]] = df2
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 4     │
│ 2   │ 2     │ 5     │
│ 3   │ 3     │ 6     │

julia> df[[:c, :d]] = 10
10

julia> df
3×4 DataFrame
│ Row │ a     │ b     │ c     │ d     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 4     │ 10    │ 10    │
│ 2   │ 2     │ 5     │ 10    │ 10    │
│ 3   │ 3     │ 6     │ 10    │ 10    │

julia> df[1, [1]] = df2
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df
3×4 DataFrame
│ Row │ a     │ b     │ c     │ d     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 4     │ 4     │ 10    │ 10    │
│ 2   │ 2     │ 5     │ 10    │ 10    │
│ 3   │ 3     │ 6     │ 10    │ 10    │

julia> df[3, [1]] = df2
3×1 DataFrame
│ Row │ x     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 4     │
│ 2   │ 5     │
│ 3   │ 6     │

julia> df
3×4 DataFrame
│ Row │ a     │ b     │ c     │ d     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 4     │ 4     │ 10    │ 10    │
│ 2   │ 2     │ 5     │ 10    │ 10    │
│ 3   │ 4     │ 6     │ 10    │ 10    │

and in general now I feel that allowing a data frame on the RHS of an assignment is problematic.

What we could allow (I am hesitant to propose these because I am not sure how useful they would be, and we can always add them later) is:

  • df[row, cols] = DataFrameRow | AbstractDict | NamedTuple | iterable (in general to have the same functionality as push!)
  • df[rows, cols] = matrix (treating df as a 2-dimensional object)
  • df[row, cols] .= any 0-dimensional value in terms of broadcasting and df[rows, cols] .= any 0, 1 or 2-dimensional value in terms of broadcasting (in general: adhering to standard broadcasting, which assumes numerical axes, and disallowing anything else on the RHS)
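
For reference, this is how standard broadcasting into a plain `Matrix` behaves in Base for 0-, 1- and 2-dimensional right-hand sides. It is only a sketch of the Base semantics that the last bullet would adhere to, under the assumption that a data frame is treated as a 2-dimensional object. It does not show DataFrames.jl behavior, and `M` is a made-up name.

```julia
M = zeros(Int, 3, 2)

M[1, :] .= 5                    # 0-dimensional (scalar) value into one row
@assert M[1, :] == [5, 5]

M[2:3, :] .= 7                  # scalar into a block of rows
M[2:3, :] .= [1, 2]             # length-2 vector, repeated across the columns
@assert M == [5 5; 1 1; 2 2]

M[2:3, :] .= [10 20; 30 40]     # 2-dimensional value with a matching shape
@assert M == [5 5; 10 20; 30 40]
```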

@bkamins
Member Author

bkamins commented Jul 12, 2019

#1866 implements current rules

@bkamins bkamins closed this Jul 12, 2019
@bkamins bkamins deleted the broadcasted_assignment branch July 15, 2019 14:58