`setindex!`/`broadcast!` design #1645

bkamins · 2018-12-16T14:22:38Z

Here is a review of the current API (#1571 not included, but accounted for in the comments). I have added decision fields - sometimes I am pretty sure what to do. Sometimes it is a question. The biggest question I have is if we actually want to go for support of broadcasting when LHS is a DataFrame at all. Maybe it is OK to leave the current mechanics we have as a DataFrame is not an AbstractArray. I am not really sure what is best. The benefit of adding broadcasting is that we are more consistent with Base in terms of notation (now when setindex! is called broadcasting is done implicitly).

setindex!(df::DataFrame, v::AbstractVector, col_ind::ColumnIndex)

- set `df[col_ind]` to `v` in place
- will add a column if it is a new name or `ncol(df)+1` index
- decision: I really do not like allowing `ncol(df)+1` index
  and I would deprecate it. This is a work for `hcat!` which should
  be exported.

setindex!(df::DataFrame, v, col_ind::ColumnIndex)

- set `df[col_ind]` to `[v]` if `ncol(df)==0`
- set `df[col_ind]` to `fill(v, nrow(df))` if `ncol(df)>0`
- decision: what do we do when `ncol(df)>0` but `nrow(df)==0`?
- decision: do we deprecate it and replace with `broadcast!`?

setindex!(df::DataFrame, new_df::DataFrame, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, new_df::DataFrame, col_inds::AbstractVector{<:ColumnIndex})

- insert columns `col_inds` from `new_df` to `df` in place as `col_inds`
- handling of `Bool` and `Integer` indexing is weird, see below
- `new_df` uses integer indexing left-to-right; no column name matching is performed
- decision: fix `Bool` and `Integer` indexing to be strict
- decision: do we want to allow `NamedTuple` and `AbstractDict` of vectors also here?
- decision: do we want to allow `AbstractMatrix` here?
- decision: do we want to do column name matching

setindex!(df::DataFrame, v::AbstractVector, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, val::Any, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, val::Any, col_inds::AbstractVector{<:ColumnIndex})

- `copy` of `v` is always performed if it is a `vector`
- handling of `Bool` and `Integer` indexing is weird, see below
- decision: do we want to keep making a `copy`?
- decision: fix `Bool` and `Integer` indexing to be strict
- decision: do we deprecate `val::Any` it and replace with `broadcast!`?

julia> df = DataFrame()
0×0 DataFrame


julia> df[[true,true,true,true]] = 1:2
1:2

julia> df
2×4 DataFrame
│ Row │ x1    │ x2    │ x3    │ x4    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │ 2     │

setindex!(df::DataFrame, v, ::Colon)

- assuming all else is fixed this can stay as is

setindex!(df::DataFrame, v::Any, row_ind::Real, col_ind::ColumnIndex)

- can stay as is except for `row_ind` signature
- decision: restrict `row_ind::Integer` (`Bool` is caught by `insert_single_entry!`)

setindex!(df::DataFrame, v::Any, row_ind::Real, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::Any, row_ind::Real, col_inds::AbstractVector{<:ColumnIndex})

- decision: do we deprecate it and replace with `broadcast!`?

setindex!(df::DataFrame, new_df::DataFrame, row_ind::Real, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, new_df::DataFrame, row_ind::Real, col_inds::AbstractVector{<:ColumnIndex})

- weird: uses 1st row of new_df
- decision: deprecate (allow `NamedTuple`, `AbstractDict` and `DataFrameRow` - PR in preparation)
- decision: do we want to allow `AbstractVector` here?

setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{Bool}, col_ind::ColumnIndex)
setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{<:Real}, col_ind::ColumnIndex)
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{Bool}, col_ind::ColumnIndex)
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{<:Real}, col_ind::ColumnIndex)

- `insert_multiple_entries!` uses broadcast assignment `.=` internally
- decision: do we deprecate it and replace with `broadcast!` allowing only vectors here?

setindex!(df::DataFrame, new_df::DataFrame, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, new_df::DataFrame, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{<:ColumnIndex})
setindex!(df::DataFrame, new_df::DataFrame, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, new_df::DataFrame, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{<:ColumnIndex})

- `new_df` uses integer indexing left-to-right; no column name matching is performed
- decision: do we want to do column name matching
- decision: restrict `row_inds` signature to `Intger`
- decision: do we want to allow `NamedTuple` and `AbstractDict` of vectors also here?
- decision: do we want to allow `AbstractMatrix` here?

setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{<:ColumnIndex})
setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::AbstractVector, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{<:ColumnIndex})

- decision: restrict `row_inds` signature to `Intger`

setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{Bool}, col_inds::AbstractVector{<:ColumnIndex})
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, v::Any, row_inds::AbstractVector{<:Real}, col_inds::AbstractVector{<:ColumnIndex})

- decision: do we deprecate it and replace with `broadcast!`?

setindex!(df::DataFrame, new_df::DataFrame, row_inds::Colon, col_inds::Colon=Colon())

- does a copy of `new_df` to `df`
- this is inconsistent with other `setindex!` methods which traverse `new_df` left to right
- decision: expand `Colon` to appropriate ranges and call defined methods

setindex!(df::DataFrame, v, ::Colon, ::Colon)

- this can stay
- decision: use `axes` in implementation

setindex!(df::DataFrame, v, row_inds, ::Colon)

- this can stay
- decision: use `axes` in implementation

setindex!(df::DataFrame, v, ::Colon, col_inds)

- this is inconsistent as it leads to in-place operation for `v isa AbstractVector`
- decision: expand `Colon` to appropriate range and call defined methods

I am not sure what is the best way to work with this as the list is really long and there are many interconnected decisions to be made.

@nalimilan, @coreywoodfield:
In particular in relation to #1571 note that current setindex! mechanics uses only position - not name of RHS if RHS is a DataFrame.

The text was updated successfully, but these errors were encountered:

bkamins · 2018-12-16T18:00:28Z

Regarding the broadcasting support, the reason why I consider leaving "automatic broadcasting" for DataFrame is for example the following situation:

julia> using DataFrames

julia> df = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> df.b = 1
1

julia> df
3×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 1     │
│ 3   │ 3     │ 1     │

julia> df.c .= 1
ERROR: KeyError: key :c not found
Stacktrace:
 [1] getindex at .\dict.jl:478 [inlined]
 [2] getindex at C:\Users\bogum\.julia\packages\DataFrames\H4l83\src\other\index.jl:137 [inlined]
 [3] getindex at C:\Users\bogum\.julia\packages\DataFrames\H4l83\src\dataframe\dataframe.jl:245 [inlined]
 [4] getproperty(::DataFrame, ::Symbol) at C:\Users\bogum\.julia\packages\DataFrames\H4l83\src\abstractdataframe\abstractdataframe.jl:226
 [5] top-level scope at none:0

When we would disable automatic broadcasting then .df.b = 1 would not be allowed.
The issue here is that for arrays setindex! and also broadcast! guarantee not to change their dimensionality. For DataFrame we have the semantics that setindex! allows to add a column to a DataFrame.

nalimilan · 2018-12-17T14:40:37Z

Overall I'd answer "yes" to most of your questions. A few remarks:

setindex!(df::DataFrame, v::AbstractVector, col_ind::ColumnIndex)

- set `df[col_ind]` to `v` in place
- will add a column if it is a new name or `ncol(df)+1` index
- decision: I really do not like allowing `ncol(df)+1` index
  and I would deprecate it. This is a work for `hcat!` which should
  be exported.

Given that we allow creating a column with a new name, which will add it at the end, isn't it consistent to allow indexing the ncol(df)+1?

`setindex!(df::DataFrame, v, ::Colon)`
assuming all else is fixed this can stay as is

Can you develop? Does this function automatically broadcast?

setindex!(df::DataFrame, v::AbstractVector, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, val::Any, col_inds::AbstractVector{Bool})
setindex!(df::DataFrame, val::Any, col_inds::AbstractVector{<:ColumnIndex})
- `copy` of `v` is always performed if it is a `vector`
- handling of `Bool` and `Integer` indexing is weird, see below
- decision: do we want to keep making a `copy`?
- decision: fix `Bool` and `Integer` indexing to be strict
- decision: do we deprecate `val::Any` it and replace with `broadcast!`?

Strictly speaking, even the first method implies broadcasting if we assume the similarity with matrices. So maybe we should also require broadcast! here.

Regarding your second comment, it's indeed annoying that we can't allow df.c .= 1 to create a new column. We can support df[:c] .= 1, but the error thrown by df.c .= 1 will likely confuse users and we can't print an helpful message.

So I'm not sure whether we should disallow implicit broadcasting. That would be more consistent with Base, but maybe less convenient. Or maybe we should allow df.c .= 1 but require broadcasting for all more complex forms?

bkamins · 2018-12-17T15:44:03Z

isn't it consistent to allow indexing the ncol(df)+1.

I would disallow it because setindex! in base normally does not allow the change of dimensionality. I would try to minimize the differences we introduce so allow new column with Symbol indexing, but disallow it with integer indexing. As I have noted above, the problem is that if we allow it we accept weird behaviors like:

julia> df = DataFrame(a=1:2)
2×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df[[false, true, true, false]] = 10:11
10:11

Now tell me what is df after this. The answer is:

julia> df
2×3 DataFrame
│ Row │ a     │ x2    │ x3    │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 10    │ 10    │
│ 2   │ 2     │ 11    │ 11    │

which is not intuitive. As I have mentioned as an alternative to append a column in-place I would export hcat! (which additionally has a benefit that you do not have to specify the column number).

Can you develop? (on df[:])

The problem is the following:

julia> df = DataFrame(a=1:2, b=3:4)
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 4     │

julia> df2 = DataFrame(x=11:12, y=13:14, z=15:16)
2×3 DataFrame
│ Row │ x     │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 11    │ 13    │ 15    │
│ 2   │ 12    │ 14    │ 16    │

julia> df[:] = df2
ERROR: type is immutable

(this is a bug that we should fix anyway - it was introduced by me and I did not catch this special case), but in general after this df is expected to be the same as df2 - i.e. we loose columns :a and :b from df while e.g.:

julia> df[:] = 1
1

julia> df
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 1     │ 1     │

writes into these columns,
and moreover:

julia> df[1:end] = df2
2×3 DataFrame
│ Row │ x     │ y     │ z     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 11    │ 13    │ 15    │
│ 2   │ 12    │ 14    │ 16    │

julia> df
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 11    │ 13    │
│ 2   │ 12    │ 14    │

also leaves them but changes their values and it is natural to expect that df[:] and df[1:end] on LHS should produce the same thing.

Strictly speaking, even the first method implies broadcasting if we assume the similarity with matrices.

This is the main problem I have. I.e. if we treat DataFrame as 2-dimensional and apply broadcasting approach then we have (and should) broadcast when we get:

a scalar (possibly in two dimensions depending on the LHS)
a vector (broadcast over columns)
a row-object (broadcast over rows)

This would leave standard setindex! only reserved for cases when both dimensions match exactly. We could do it. Then the only exception I would allow is to allow df.c .= something to add a column and df[cols] .= something to add multiple columns (then we have to decide if we make a copy of something if it is a vector).

nalimilan · 2018-12-17T16:40:55Z

Makes sense.

Then the only exception I would allow is to allow df.c .= something to add a column and df[cols] .= something to add multiple columns (then we have to decide if we make a copy of something if it is a vector).

Unfortunately I don't think we can support df.c .= something, as it's parsed as getproperty(df, :c) .= something before we have a chance to do anything.

bkamins · 2018-12-18T20:57:24Z

In general not only df.c .= something for a new column will not work but also df[:c] .= something will not work and similarly df[cols] .= something will not work if cols contains a new column (maybe it is possible to make it work but it will involve heavy tweaking).

The general reason is that broadcasting is designed not to mutate the size of the target and for DataFrame we allow it (and actually this is a key benefit over other table-type designs).

This means that (unless we do heavy tweaking of broadcasting mechanism) we have to keep:
df[col] = something and df[cols] = something doing implicit broadcasting.

So the question is - do we want to go for it as most likely we will have to keep inconsistency anyway? The alternative is to keep doing implicit broadcast as we do currently (and this would be a recommended style) and do not care about explicit broadcasting (unless it already works).

I have the possible rules with broadcasting written down if you would want to see them. The only benefit would be that the following would work and now it fails:

julia> df = DataFrame(a = [[1,2],[3,4]])
2×1 DataFrame
│ Row │ a      │
│     │ Array… │
├─────┼────────┤
│ 1   │ [1, 2] │
│ 2   │ [3, 4] │

julia> df[1:2, 1] = [5,6]
ERROR: MethodError: Cannot `convert` an object of type Int64 to an object of type Array{Int64,1}

julia> df[1:2, 1] .= [5,6]
ERROR: MethodError: Cannot `convert` an object of type Int64 to an object of type Array{Int64,1}

at the cost of problems when users want to add columns via broadcasting.

nalimilan · 2018-12-18T21:14:26Z

Are you sure that adding new columns requires more work than supporting broadcast in general? AFAICT we need to implement lots of custom methods anyway.

bkamins · 2018-12-18T21:21:01Z

Yes - try running it yourself - but in general the problem is that broadcast! tries to make a view of the LHS (it makes sense) and then assign to it via copyto!. Actually to make broadcast! work we do not have to implement much - mostly copyto! methods.
But the problem is that when a column does not exist then you are not able to make a view containing it. You could try to create it "on the way" and have it present when the view is called, but you have a problem that you do not easily know what the eltype of this column should be.

bkamins · 2018-12-18T21:23:21Z

Notice that even in https://docs.julialang.org/en/latest/manual/interfaces/#man-interfaces-broadcasting-1 it is recommended to define only copyto!. We could of course try to rewrite all broadcast! mechanics from scratch.

bkamins · 2018-12-18T22:11:47Z

As I have said - it is possible to work around it, e.g. like this (this is not a full solution, but it roughly shows how you can do it, but it is messy):

julia> Base.dotview(df::DataFrame, s::Symbol) = (df, s)

julia> Base.copyto!((df, col)::Tuple{DataFrame, Symbol}, x::Base.Broadcast.Broadcasted{P,Q,R,S}) where {P,Q,R,S} =
    haskey(df, col) ? copyto!(df[col], x) : (t = Vector{S.parameters[1]}(undef, nrow(df)); copyto!(t, x); df[col] = t)

julia> df = DataFrame(a=1:2)
2×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df[:b] .= 100
2-element Array{Int64,1}:
 100
 100

julia> df
2×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 100   │
│ 2   │ 2     │ 100   │

pdeffebach · 2018-12-19T05:39:17Z

Unfortunately I don't think we can support df.c .= something, as it's parsed as getproperty(df, :c) .= something before we have a chance to do anything.

This isn't a huge concern for me. It's not like df.c exists already. I think of broadcasting as iterating over an existing array and filling in values, so that doesn't apply here.

bkamins · 2018-12-19T07:25:07Z

I think of broadcasting as iterating over an existing array and filling in values

This is the whole problem - in DataFrames.jl setindex! is used also for growing a DataFrame as opposed to Base. Of course you could say that users should write df.c = fill(something, nrow(df)) to add a new column, but I guess this is something we wanted to avoid.

nalimilan · 2018-12-19T08:43:33Z

Thanks for the explanation. I didn't expect this to be so easy. Actually your example above is quite convincing about the fact that we can allow broadcasting to create columns. ;-)

Anyway, I guess we can start implementing standard broadcasting support, keeping the automatic broadcasting via =, and then we can evaluate how hard/hacky it would be to allow creating columns with broadcast and whether we want to deprecate automatic broadcasting (in some cases, or all the time).

bkamins · 2018-12-19T09:35:12Z

OK - I will push a PR with a specification of a target functionality so that we can work on a file.
Actually defining:

Base.dotview(adf::AbstractDataFrame, idxs) = (adf, idxs)
Base.dotview(dfr::DataFrameRow, idxs) = (dfr, idxs)

and

Base.copyto!((adf, idxs)::Tuple{AbstractDataFrame, Any}, ::Base.Broadcast.Broadcasted)
Base.copyto!((dfr, idxs)::Tuple{DataFrameRow, Any}, ::Base.Broadcast.Broadcasted)

seems to allow us to intercept and process whatever we want.

CC: @mbauman - if you can spare some time could you please comment if this is an appropriate way to specialize broadcasted assignment in DataFrames.jl? The challenge we have is that we have non-standard indexing (e.g. by column names) and that we want a broadcasted assignment to be able to add columns (in which case it will not be in place, but will have to allocate these new columns).

nalimilan · 2018-12-20T09:25:17Z

Thanks. Though I'd avoid dispatching on Tuple{AbstractDataFrame, Any}, and use a custom type instead. That sounds cleaner.

bkamins · 2019-09-08T03:40:53Z

All is done here.

bkamins mentioned this issue Dec 19, 2018

WIP: implementation of new setindex! and broadcasting rules #1646

Closed

nalimilan mentioned this issue Mar 15, 2019

Policy regarding in-place operations #1695

Closed

bkamins closed this as completed Sep 8, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`setindex!`/`broadcast!` design #1645

`setindex!`/`broadcast!` design #1645

bkamins commented Dec 16, 2018

bkamins commented Dec 16, 2018

nalimilan commented Dec 17, 2018

bkamins commented Dec 17, 2018

nalimilan commented Dec 17, 2018

bkamins commented Dec 18, 2018

nalimilan commented Dec 18, 2018

bkamins commented Dec 18, 2018

bkamins commented Dec 18, 2018

bkamins commented Dec 18, 2018

pdeffebach commented Dec 19, 2018

bkamins commented Dec 19, 2018

nalimilan commented Dec 19, 2018

bkamins commented Dec 19, 2018

nalimilan commented Dec 20, 2018

bkamins commented Sep 8, 2019

setindex!/broadcast! design #1645

setindex!/broadcast! design #1645

Comments

bkamins commented Dec 16, 2018

bkamins commented Dec 16, 2018

nalimilan commented Dec 17, 2018

bkamins commented Dec 17, 2018

nalimilan commented Dec 17, 2018

bkamins commented Dec 18, 2018

nalimilan commented Dec 18, 2018

bkamins commented Dec 18, 2018

bkamins commented Dec 18, 2018

bkamins commented Dec 18, 2018

pdeffebach commented Dec 19, 2018

bkamins commented Dec 19, 2018

nalimilan commented Dec 19, 2018

bkamins commented Dec 19, 2018

nalimilan commented Dec 20, 2018

bkamins commented Sep 8, 2019

`setindex!`/`broadcast!` design #1645

`setindex!`/`broadcast!` design #1645