Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setindex!/broadcast! design #1645

Closed
bkamins opened this issue Dec 16, 2018 · 15 comments
Closed

setindex!/broadcast! design #1645

bkamins opened this issue Dec 16, 2018 · 15 comments

Comments

@bkamins
Copy link
Member

bkamins commented Dec 16, 2018

Here is a review of the current API (#1571 not included, but accounted for in the comments). I have added decision fields - sometimes I am pretty sure what to do. Sometimes it is a question. The biggest question I have is if we actually want to go for support of broadcasting when LHS is a DataFrame at all. Maybe it is OK to leave the current mechanics we have as a DataFrame is not an AbstractArray. I am not really sure what is best. The benefit of adding broadcasting is that we are more consistent with Base in terms of notation (now when setindex! is called broadcasting is done implicitly).

setindex!(df::DataFrame, v::AbstractVector, col_ind::ColumnIndex)

- set `df[col_ind]` to `v` in place
- will add a column if it is a new name or `ncol(df)+1` index
- decision: I really do not like allowing `ncol(df)+1` index
  and I would deprecate it. This is a work for `hcat!` which should
  be exported.

setindex!(df::DataFrame, v, col_ind::ColumnIndex)

- set `df[col_ind]` to `[v]` if `ncol(df)==0`
- set `df[col_ind]` to `fill(v, nrow(df))` if `ncol(df)>0`
- decision: what do we do when `ncol(df)>0` but `nrow(df)==0`?
- decision: do we deprecate it and replace with `broadcast!`?

setindex!(df::DataFrame, new_df::DataFrame, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, new_df::DataFrame, col_inds::AbstractVector{<:ColumnIndex})

- insert columns `col_inds` from `new_df` to `df` in place as `col_inds`
- handling of `Bool` and `Integer` indexing is weird, see below
- `new_df` uses integer indexing left-to-right; no column name matching is performed
- decision: fix `Bool` and `Integer` indexing to be strict
- decision: do we want to allow `NamedTuple` and `AbstractDict` of vectors also here?
- decision: do we want to allow `AbstractMatrix` here?
- decision: do we want to do column name matching

setindex!(df::DataFrame, v::AbstractVector, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, val::Any, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, val::Any, col_inds::AbstractVector{<:ColumnIndex})

- `copy` of `v` is always performed if it is a `vector`
- handling of `Bool` and `Integer` indexing is weird, see below
- decision: do we want to keep making a `copy`?
- decision: fix `Bool` and `Integer` indexing to be strict
- decision: do we deprecate `val::Any` it and replace with `broadcast!`?
julia> df = DataFrame()
0×0 DataFrame


julia> df[[true,true,true,true]] = 1:2
1:2

julia> df
2×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │ 2     │

setindex!(df::DataFrame, v, ::Colon)

- assuming all else is fixed this can stay as is

setindex!(df::DataFrame, v::Any, row_ind::Real, col_ind::ColumnIndex)

- can stay as is except for `row_ind` signature
- decision: restrict `row_ind::Integer` (`Bool` is caught by `insert_single_entry!`)

setindex!(df::DataFrame, v::Any, row_ind::Real, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::Any, row_ind::Real, col_inds::AbstractVector{<:ColumnIndex})

- decision: do we deprecate it and replace with `broadcast!`?

setindex!(df::DataFrame, new_df::DataFrame, row_ind::Real, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, new_df::DataFrame, row_ind::Real, col_inds::AbstractVector{<:ColumnIndex})

- weird: uses 1st row of new_df
- decision: deprecate (allow `NamedTuple`, `AbstractDict` and `DataFrameRow` - PR in preparation)
- decision: do we want to allow `AbstractVector` here?

setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{Bool}, col_ind::ColumnIndex)
setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{<:Real}, col_ind::ColumnIndex)
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{Bool}, col_ind::ColumnIndex)
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{<:Real}, col_ind::ColumnIndex)

- `insert_multiple_entries!` uses broadcast assignment `.=` internally
- decision: do we deprecate it and replace with `broadcast!` allowing only vectors here?

setindex!(df::DataFrame, new_df::DataFrame, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, new_df::DataFrame, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{<:ColumnIndex})
setindex!(df::DataFrame, new_df::DataFrame, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, new_df::DataFrame, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{<:ColumnIndex})

- `new_df` uses integer indexing left-to-right; no column name matching is performed
- decision: do we want to do column name matching
- decision: restrict `row_inds` signature to `Intger`
- decision: do we want to allow `NamedTuple` and `AbstractDict` of vectors also here?
- decision: do we want to allow `AbstractMatrix` here?

setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{<:ColumnIndex})
setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{<:ColumnIndex})

- decision: restrict `row_inds` signature to `Intger`

setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{<:ColumnIndex})
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{<:ColumnIndex})

- decision: do we deprecate it and replace with `broadcast!`?

setindex!(df::DataFrame, new_df::DataFrame, row_inds::Colon, col_inds::Colon=Colon())

- does a copy of `new_df` to `df`
- this is inconsistent with other `setindex!` methods which traverse `new_df` left to right
- decision: expand `Colon` to appropriate ranges and call defined methods

setindex!(df::DataFrame, v, ::Colon, ::Colon)

- this can stay
- decision: use `axes` in implementation

setindex!(df::DataFrame, v, row_inds, ::Colon)

- this can stay
- decision: use `axes` in implementation

setindex!(df::DataFrame, v, ::Colon, col_inds)

- this is inconsistent as it leads to in-place operation for `v isa AbstractVector`
- decision: expand `Colon` to appropriate range and call defined methods

I am not sure what is the best way to work with this as the list is really long and there are many interconnected decisions to be made.

@nalimilan, @coreywoodfield:
In particular in relation to #1571 note that current setindex! mechanics uses only position - not name of RHS if RHS is a DataFrame.

@bkamins
Copy link
Member Author

bkamins commented Dec 16, 2018

Regarding the broadcasting support, the reason why I consider leaving "automatic broadcasting" for DataFrame is for example the following situation:

julia> using DataFrames

julia> df = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> df.b = 1
1

julia> df
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 1     │
│ 3   │ 3     │ 1     │

julia> df.c .= 1
ERROR: KeyError: key :c not found
Stacktrace:
 [1] getindex at .\dict.jl:478 [inlined]
 [2] getindex at C:\Users\bogum\.julia\packages\DataFrames\H4l83\src\other\index.jl:137 [inlined]
 [3] getindex at C:\Users\bogum\.julia\packages\DataFrames\H4l83\src\dataframe\dataframe.jl:245 [inlined]
 [4] getproperty(::DataFrame, ::Symbol) at C:\Users\bogum\.julia\packages\DataFrames\H4l83\src\abstractdataframe\abstractdataframe.jl:226
 [5] top-level scope at none:0

When we would disable automatic broadcasting then .df.b = 1 would not be allowed.
The issue here is that for arrays setindex! and also broadcast! guarantee not to change their dimensionality. For DataFrame we have the semantics that setindex! allows to add a column to a DataFrame.

@nalimilan
Copy link
Member

Overall I'd answer "yes" to most of your questions. A few remarks:

setindex!(df::DataFrame, v::AbstractVector, col_ind::ColumnIndex)

- set `df[col_ind]` to `v` in place
- will add a column if it is a new name or `ncol(df)+1` index
- decision: I really do not like allowing `ncol(df)+1` index
  and I would deprecate it. This is a work for `hcat!` which should
  be exported.

Given that we allow creating a column with a new name, which will add it at the end, isn't it consistent to allow indexing the ncol(df)+1?


`setindex!(df::DataFrame, v, ::Colon)`

  • assuming all else is fixed this can stay as is

Can you develop? Does this function automatically broadcast?

setindex!(df::DataFrame, v::AbstractVector, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, val::Any, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, val::Any, col_inds::AbstractVector{<:ColumnIndex})

- `copy` of `v` is always performed if it is a `vector`
- handling of `Bool` and `Integer` indexing is weird, see below
- decision: do we want to keep making a `copy`?
- decision: fix `Bool` and `Integer` indexing to be strict
- decision: do we deprecate `val::Any` it and replace with `broadcast!`?

Strictly speaking, even the first method implies broadcasting if we assume the similarity with matrices. So maybe we should also require broadcast! here.

Regarding your second comment, it's indeed annoying that we can't allow df.c .= 1 to create a new column. We can support df[:c] .= 1, but the error thrown by df.c .= 1 will likely confuse users and we can't print an helpful message.

So I'm not sure whether we should disallow implicit broadcasting. That would be more consistent with Base, but maybe less convenient. Or maybe we should allow df.c .= 1 but require broadcasting for all more complex forms?

@bkamins
Copy link
Member Author

bkamins commented Dec 17, 2018

isn't it consistent to allow indexing the ncol(df)+1.

I would disallow it because setindex! in base normally does not allow the change of dimensionality. I would try to minimize the differences we introduce so allow new column with Symbol indexing, but disallow it with integer indexing. As I have noted above, the problem is that if we allow it we accept weird behaviors like:

julia> df = DataFrame(a=1:2)
2×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df[[false, true, true, false]] = 10:11
10:11

Now tell me what is df after this. The answer is:

julia> df
2×3 DataFrame
│ Row │ a     │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 10    │ 10    │
│ 2   │ 2     │ 11    │ 11    │

which is not intuitive. As I have mentioned as an alternative to append a column in-place I would export hcat! (which additionally has a benefit that you do not have to specify the column number).

Can you develop? (on df[:])

The problem is the following:

julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> df2 = DataFrame(x=11:12, y=13:14, z=15:16)
2×3 DataFrame
│ Row │ x     │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 11    │ 13    │ 15    │
│ 2   │ 12    │ 14    │ 16    │

julia> df[:] = df2
ERROR: type is immutable

(this is a bug that we should fix anyway - it was introduced by me and I did not catch this special case), but in general after this df is expected to be the same as df2 - i.e. we loose columns :a and :b from df while e.g.:

julia> df[:] = 1
1

julia> df
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 1     │ 1     │

writes into these columns,
and moreover:

julia> df[1:end] = df2
2×3 DataFrame
│ Row │ x     │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 11    │ 13    │ 15    │
│ 2   │ 12    │ 14    │ 16    │

julia> df
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 11    │ 13    │
│ 2   │ 12    │ 14    │

also leaves them but changes their values and it is natural to expect that df[:] and df[1:end] on LHS should produce the same thing.

Strictly speaking, even the first method implies broadcasting if we assume the similarity with matrices.

This is the main problem I have. I.e. if we treat DataFrame as 2-dimensional and apply broadcasting approach then we have (and should) broadcast when we get:

  • a scalar (possibly in two dimensions depending on the LHS)
  • a vector (broadcast over columns)
  • a row-object (broadcast over rows)

This would leave standard setindex! only reserved for cases when both dimensions match exactly. We could do it. Then the only exception I would allow is to allow df.c .= something to add a column and df[cols] .= something to add multiple columns (then we have to decide if we make a copy of something if it is a vector).

@nalimilan
Copy link
Member

Makes sense.

Then the only exception I would allow is to allow df.c .= something to add a column and df[cols] .= something to add multiple columns (then we have to decide if we make a copy of something if it is a vector).

Unfortunately I don't think we can support df.c .= something, as it's parsed as getproperty(df, :c) .= something before we have a chance to do anything.

@bkamins
Copy link
Member Author

bkamins commented Dec 18, 2018

In general not only df.c .= something for a new column will not work but also df[:c] .= something will not work and similarly df[cols] .= something will not work if cols contains a new column (maybe it is possible to make it work but it will involve heavy tweaking).

The general reason is that broadcasting is designed not to mutate the size of the target and for DataFrame we allow it (and actually this is a key benefit over other table-type designs).

This means that (unless we do heavy tweaking of broadcasting mechanism) we have to keep:
df[col] = something and df[cols] = something doing implicit broadcasting.

So the question is - do we want to go for it as most likely we will have to keep inconsistency anyway? The alternative is to keep doing implicit broadcast as we do currently (and this would be a recommended style) and do not care about explicit broadcasting (unless it already works).

I have the possible rules with broadcasting written down if you would want to see them. The only benefit would be that the following would work and now it fails:

julia> df = DataFrame(a = [[1,2],[3,4]])
2×1 DataFrame
│ Row │ a      │
│     │ Array… │
├─────┼────────┤
│ 1   │ [1, 2] │
│ 2   │ [3, 4] │

julia> df[1:2, 1] = [5,6]
ERROR: MethodError: Cannot `convert` an object of type Int64 to an object of type Array{Int64,1}

julia> df[1:2, 1] .= [5,6]
ERROR: MethodError: Cannot `convert` an object of type Int64 to an object of type Array{Int64,1}

at the cost of problems when users want to add columns via broadcasting.

@nalimilan
Copy link
Member

Are you sure that adding new columns requires more work than supporting broadcast in general? AFAICT we need to implement lots of custom methods anyway.

@bkamins
Copy link
Member Author

bkamins commented Dec 18, 2018

Yes - try running it yourself - but in general the problem is that broadcast! tries to make a view of the LHS (it makes sense) and then assign to it via copyto!. Actually to make broadcast! work we do not have to implement much - mostly copyto! methods.
But the problem is that when a column does not exist then you are not able to make a view containing it. You could try to create it "on the way" and have it present when the view is called, but you have a problem that you do not easily know what the eltype of this column should be.

@bkamins
Copy link
Member Author

bkamins commented Dec 18, 2018

Notice that even in https://docs.julialang.org/en/latest/manual/interfaces/#man-interfaces-broadcasting-1 it is recommended to define only copyto!. We could of course try to rewrite all broadcast! mechanics from scratch.

@bkamins
Copy link
Member Author

bkamins commented Dec 18, 2018

As I have said - it is possible to work around it, e.g. like this (this is not a full solution, but it roughly shows how you can do it, but it is messy):

julia> Base.dotview(df::DataFrame, s::Symbol) = (df, s)

julia> Base.copyto!((df, col)::Tuple{DataFrame, Symbol}, x::Base.Broadcast.Broadcasted{P,Q,R,S}) where {P,Q,R,S} =
    haskey(df, col) ? copyto!(df[col], x) : (t = Vector{S.parameters[1]}(undef, nrow(df)); copyto!(t, x); df[col] = t)

julia> df = DataFrame(a=1:2)
2×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df[:b] .= 100
2-element Array{Int64,1}:
 100
 100

julia> df
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 100   │
│ 2   │ 2     │ 100   │

@pdeffebach
Copy link
Contributor

Unfortunately I don't think we can support df.c .= something, as it's parsed as getproperty(df, :c) .= something before we have a chance to do anything.

This isn't a huge concern for me. It's not like df.c exists already. I think of broadcasting as iterating over an existing array and filling in values, so that doesn't apply here.

@bkamins
Copy link
Member Author

bkamins commented Dec 19, 2018

I think of broadcasting as iterating over an existing array and filling in values

This is the whole problem - in DataFrames.jl setindex! is used also for growing a DataFrame as opposed to Base. Of course you could say that users should write df.c = fill(something, nrow(df)) to add a new column, but I guess this is something we wanted to avoid.

@nalimilan
Copy link
Member

Thanks for the explanation. I didn't expect this to be so easy. Actually your example above is quite convincing about the fact that we can allow broadcasting to create columns. ;-)

Anyway, I guess we can start implementing standard broadcasting support, keeping the automatic broadcasting via =, and then we can evaluate how hard/hacky it would be to allow creating columns with broadcast and whether we want to deprecate automatic broadcasting (in some cases, or all the time).

@bkamins
Copy link
Member Author

bkamins commented Dec 19, 2018

OK - I will push a PR with a specification of a target functionality so that we can work on a file.
Actually defining:

Base.dotview(adf::AbstractDataFrame, idxs) = (adf, idxs)
Base.dotview(dfr::DataFrameRow, idxs) = (dfr, idxs)

and

Base.copyto!((adf, idxs)::Tuple{AbstractDataFrame, Any}, ::Base.Broadcast.Broadcasted)
Base.copyto!((dfr, idxs)::Tuple{DataFrameRow, Any}, ::Base.Broadcast.Broadcasted)

seems to allow us to intercept and process whatever we want.

CC: @mbauman - if you can spare some time could you please comment if this is an appropriate way to specialize broadcasted assignment in DataFrames.jl? The challenge we have is that we have non-standard indexing (e.g. by column names) and that we want a broadcasted assignment to be able to add columns (in which case it will not be in place, but will have to allocate these new columns).

@nalimilan
Copy link
Member

Thanks. Though I'd avoid dispatching on Tuple{AbstractDataFrame, Any}, and use a custom type instead. That sounds cleaner.

@bkamins
Copy link
Member Author

bkamins commented Sep 8, 2019

All is done here.

@bkamins bkamins closed this as completed Sep 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants