Add broadcasting of AbstractDataFrame #1840

Merged on Jun 23, 2019 (30 commits)
8bb061d
add broadcasting of AbstractDataFrame
bkamins Jun 8, 2019
6a6e954
switch to Tables.allocatecolumn
bkamins Jun 8, 2019
6f151ac
one more fix
bkamins Jun 8, 2019
628229b
revert similar and fix tests
bkamins Jun 8, 2019
8af3a0f
Apply suggestions from code review
bkamins Jun 9, 2019
c9a98bd
corrections after code review
bkamins Jun 9, 2019
5b4cc36
fix typo
bkamins Jun 9, 2019
24566e4
fix broadcasting assignment bug
bkamins Jun 20, 2019
2b82f79
fix SubDataFrame case
bkamins Jun 20, 2019
3810f75
add unaliasing of data frame against data frame
bkamins Jun 20, 2019
8e335f1
small fixes in legacy code
bkamins Jun 20, 2019
80a131e
optimized broadcasting
bkamins Jun 20, 2019
82e53a6
correct unaliasing
bkamins Jun 20, 2019
87206a2
small performance optimization
bkamins Jun 20, 2019
fba7cef
performance improvements
bkamins Jun 20, 2019
0e63fb8
add more broadcasting tests
bkamins Jun 20, 2019
bb4862b
more tests
bkamins Jun 20, 2019
5b8d2ec
Merge branch 'master' into new_dataframe_broadcasting
bkamins Jun 21, 2019
699cb6b
Merge branch 'master' into new_dataframe_broadcasting
bkamins Jun 21, 2019
6780b26
even more tests
bkamins Jun 21, 2019
432a530
getcolbc cleanup
bkamins Jun 21, 2019
3fdf733
fix after a code review
bkamins Jun 21, 2019
a67a3f5
unalias optimizations
bkamins Jun 21, 2019
5784100
more tests for common cases
bkamins Jun 21, 2019
b1813e1
improve helper signature
bkamins Jun 21, 2019
6a87c42
minor improvements
bkamins Jun 22, 2019
27c730a
minor improvements 2
bkamins Jun 22, 2019
9a68d30
Apply suggestions from code review
bkamins Jun 23, 2019
5d3ec40
fixes after code review
bkamins Jun 23, 2019
1f46086
Fix indentation
nalimilan Jun 23, 2019
9 changes: 9 additions & 0 deletions docs/src/lib/indexing.md
@@ -47,6 +47,7 @@ For performance reasons, accessing, via `getindex` or `view`, a single `row` and
* `df[col]` -> the vector contained in column `col`;
* `df[cols]` -> a freshly allocated `DataFrame` containing the copies of vectors contained in columns `cols`;
* `df[row, col]` -> the value contained in row `row` of column `col`, the same as `df[col][row]`;
* `df[CartesianIndex(row, col)]` -> the same as `df[row,col]`;
* `df[row, cols]` -> a `DataFrameRow` with parent `df` if `cols` is a colon and `df[cols]` otherwise;
* `df[rows, col]` -> a copy of the vector `df[col]` with only the entries corresponding to `rows` selected, the same as `df[col][rows]`;
* `df[rows, cols]` -> a `DataFrame` containing copies of columns `cols` with only the entries corresponding to `rows` selected.
@@ -83,6 +84,14 @@ Under construction

## Broadcasting

The following broadcasting rules apply to `AbstractDataFrame` objects:
* `AbstractDataFrame` behaves in broadcasting like a two-dimensional collection compatible with matrices.
* If an `AbstractDataFrame` takes part in broadcasting then a `DataFrame` is always produced as a result.
  In this case the requested broadcasting operation produces an object with exactly two dimensions.
  An exception is when an `AbstractDataFrame` is used only as a source of broadcast assignment into an object
  of dimensionality higher than two.
* If multiple `AbstractDataFrame` objects take part in broadcasting then they have to have identical column names.

It is possible to assign a value to `AbstractDataFrame` and `DataFrameRow` objects using the `.=` operator.
In such an operation `AbstractDataFrame` is considered as two-dimensional and `DataFrameRow` as single-dimensional.
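
The rules above can be illustrated with a short sketch. It assumes a DataFrames.jl release that includes this change; the column names and values are made up for illustration, and the exact behavior may differ between package versions.

```julia
using DataFrames  # assumption: a DataFrames.jl release that includes this change

df = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

# A data frame combined with a scalar broadcasts like a 3x2 matrix
# and yields a new DataFrame.
plus_one = df .+ 1

# A vector with one entry per row is broadcast across every column.
scaled = df .* [10, 20, 30]

# Data frames with different column names are rejected.
mismatch_rejected = try
    df .+ DataFrame(x = [1, 2, 3], y = [1, 2, 3])
    false
catch err
    err isa ArgumentError
end

# `.=` writes into the existing columns of a data frame ...
df .= 0

# ... and into a single row through a DataFrameRow.
row = df[1, :]
row .= 1
```

Note that `df .= 0` keeps the element type of each column, so the zero is converted per column rather than replacing the column vectors.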

168 changes: 163 additions & 5 deletions src/other/broadcasting.jl
@@ -1,3 +1,78 @@
### Broadcasting

Base.getindex(df::AbstractDataFrame, idx::CartesianIndex{2}) = df[idx[1], idx[2]]
Base.setindex!(df::AbstractDataFrame, val, idx::CartesianIndex{2}) =
    (df[idx[1], idx[2]] = val)

Base.broadcastable(df::AbstractDataFrame) = df

struct DataFrameStyle <: Base.Broadcast.BroadcastStyle end

Base.Broadcast.BroadcastStyle(::Type{<:AbstractDataFrame}) =
    DataFrameStyle()

Base.Broadcast.BroadcastStyle(::DataFrameStyle, ::Base.Broadcast.BroadcastStyle) = DataFrameStyle()
Base.Broadcast.BroadcastStyle(::Base.Broadcast.BroadcastStyle, ::DataFrameStyle) = DataFrameStyle()
Base.Broadcast.BroadcastStyle(::DataFrameStyle, ::DataFrameStyle) = DataFrameStyle()

function copyto_widen!(res::AbstractVector{T},
                       bc::Base.Broadcast.Broadcasted{DataFrameStyle},
                       pos, col) where T
    for i in pos:length(axes(bc)[1])
        val = bc[CartesianIndex(i, col)]
        S = typeof(val)
        if S <: T || promote_type(S, T) <: T
            res[i] = val
        else
            # widen the element type, copy what was already stored, and continue
            newres = similar(Vector{promote_type(S, T)}, length(res))
            copyto!(newres, 1, res, 1, i-1)
            newres[i] = val
            # pass `col` through so the widened copy keeps reading the same column
            return copyto_widen!(newres, bc, i + 1, col)
        end
    end
    return res
end
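
The widening strategy used by `copyto_widen!` can be sketched without DataFrames.jl. The version below is a simplified illustration, not the code from this pull request: `collect_widen` and `fill_widen!` are hypothetical names, and a plain function `f(i)` stands in for indexing into the `Broadcasted` object.

```julia
# Hypothetical helper: store `f(i)` for i in pos:length(res), switching to a
# wider element type the first time a value does not fit into `res`.
function fill_widen!(res::AbstractVector{T}, f, pos) where T
    for i in pos:length(res)
        val = f(i)
        S = typeof(val)
        if S <: T || promote_type(S, T) <: T
            res[i] = val
        else
            # allocate a wider vector, copy the prefix, and continue from i + 1
            newres = similar(res, promote_type(S, T))
            copyto!(newres, 1, res, 1, i - 1)
            newres[i] = val
            return fill_widen!(newres, f, i + 1)
        end
    end
    return res
end

# Hypothetical driver: the first value decides the initial element type.
function collect_widen(f, n::Integer)
    n == 0 && return Any[]
    v1 = f(1)
    res = Vector{typeof(v1)}(undef, n)
    res[1] = v1
    return fill_widen!(res, f, 2)
end

vals = Any[1, 2, 2.5, missing]
out = collect_widen(i -> vals[i], length(vals))
```

Starting from the type of the first value and widening only on demand keeps the common case (a homogeneous column) free of abstractly typed storage.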

function getcolbc(bcf::Base.Broadcast.Broadcasted{Style}, colind) where {Style}
    # we assume that bcf is already flattened and unaliased
    newargs = map(bcf.args) do x
        Base.Broadcast.extrude(x isa AbstractDataFrame ? x[colind] : x)
    end
    Base.Broadcast.Broadcasted{Style}(bcf.f, newargs, bcf.axes)
end

function Base.copy(bc::Base.Broadcast.Broadcasted{DataFrameStyle})
    ndim = length(axes(bc))
    if ndim != 2
        throw(DimensionMismatch("cannot broadcast a data frame into $ndim dimensions"))
    end
    bcf = Base.Broadcast.flatten(bc)
    colnames = unique([_names(df) for df in bcf.args if df isa AbstractDataFrame])
    if length(colnames) != 1
        wrongnames = setdiff(union(colnames...), intersect(colnames...))
        msg = join(wrongnames, ", ", " and ")
        throw(ArgumentError("Column names in broadcasted data frames must match. " *
                            "Non matching column names are $msg"))
    end
    nrows = length(axes(bcf)[1])
    df = DataFrame()
    for i in axes(bcf)[2]
        if nrows == 0
            col = Any[]
        else
            bcf′ = getcolbc(bcf, i)
            v1 = bcf′[CartesianIndex(1, i)]
            startcol = similar(Vector{typeof(v1)}, nrows)
            startcol[1] = v1
            col = copyto_widen!(startcol, bcf′, 2, i)
        end
        df[colnames[1][i]] = col
    end
    return df
end

### Broadcasting assignment

struct LazyNewColDataFrame
    df::DataFrame
    col::Symbol
@@ -50,9 +125,88 @@ function _copyto_helper!(dfcol::AbstractVector, bc::Base.Broadcast.Broadcasted,
    end
end

function Base.Broadcast.broadcast_unalias(dest::AbstractDataFrame, src)
    for col in eachcol(dest)
        src = Base.Broadcast.unalias(col, src)
    end
    src
end

function Base.Broadcast.broadcast_unalias(dest, src::AbstractDataFrame)
    wascopied = false
    for (i, col) in enumerate(eachcol(src))
        if Base.mightalias(dest, col)
            if src isa SubDataFrame
                if !wascopied
                    src = SubDataFrame(copy(parent(src), copycols=false),
                                       index(src), rows(src))
                end
                parentidx = parentcols(index(src), i)
                parent(src)[parentidx] = Base.unaliascopy(parent(src)[parentidx])
            else
                if !wascopied
                    src = copy(src, copycols=false)
                end
                src[i] = Base.unaliascopy(col)
            end
            wascopied = true
        end
    end
    src
end

function _broadcast_unalias_helper(dest::AbstractDataFrame, scol::AbstractVector,
                                   src::AbstractDataFrame, col2::Int, wascopied::Bool)
    # col1 can be checked till col2 point as we are writing broadcasting
    # results from 1 to ncol
    # we go downwards because aliasing when col1 == col2 is most probable
    for col1 in col2:-1:1
        dcol = dest[col1]
        if Base.mightalias(dcol, scol)
            if src isa SubDataFrame
                if !wascopied
                    src = SubDataFrame(copy(parent(src), copycols=false),
                                       index(src), rows(src))
                end
                parentidx = parentcols(index(src), col2)
                parent(src)[parentidx] = Base.unaliascopy(parent(src)[parentidx])
            else
                if !wascopied
                    src = copy(src, copycols=false)
                end
                src[col2] = Base.unaliascopy(scol)
            end
            return src, true
        end
    end
    return src, wascopied
end

function Base.Broadcast.broadcast_unalias(dest::AbstractDataFrame, src::AbstractDataFrame)
    if size(dest, 2) != size(src, 2)
        throw(ArgumentError("Dimension mismatch in broadcasting."))
    end
    wascopied = false
    for col2 in axes(dest, 2)
        scol = src[col2]
        src, wascopied = _broadcast_unalias_helper(dest, scol, src, col2, wascopied)
    end
    src
end
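
The `broadcast_unalias` methods above build on `Base.mightalias`, `Base.unalias`, and `Base.unaliascopy`. These are internal, unexported Base functions, so the sketch below is only a rough illustration on plain vectors of what they check; it is not part of this pull request and their behavior may change between Julia versions.

```julia
x = [1, 2, 3]
v = view(x, 1:2)          # shares memory with x
other = [1, 2, 3]         # equal contents, separate memory

# mightalias is a conservative check for possibly shared memory
shares_with_view = Base.mightalias(x, v)
shares_with_other = Base.mightalias(x, other)

# unalias(dest, src) returns src itself when there is no aliasing,
# and an independent copy when there might be
safe_v = Base.unalias(x, v)
safe_other = Base.unalias(x, other)

# writing to the destination no longer affects the unaliased source
x[1] = 100
```

This is why the data frame methods copy only the columns that might alias the destination: an unconditional copy of every source column would be correct but more expensive.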

function Base.copyto!(df::AbstractDataFrame, bc::Base.Broadcast.Broadcasted)
    # before this change the body was a single loop:
    #     for col in axes(df, 2)
    #         _copyto_helper!(df[col], bc, col)
    bcf = Base.Broadcast.flatten(bc)
    colnames = unique([_names(df) for df in bcf.args if df isa AbstractDataFrame])
    if length(colnames) > 1 || (length(colnames) == 1 && _names(df) != colnames[1])
        wrongnames = setdiff(union(colnames...), intersect(colnames...))
        msg = join(wrongnames, ", ", " and ")
        throw(ArgumentError("Column names in broadcasted data frames must match. " *
                            "Non matching column names are $msg"))
    end

    bcf′ = Base.Broadcast.preprocess(df, bcf)
    for i in axes(df, 2)
        _copyto_helper!(df[i], getcolbc(bcf′, i), i)
    end
    df
end
@@ -65,13 +219,17 @@ function Base.copyto!(df::AbstractDataFrame, bc::Base.Broadcast.Broadcasted{<:Ba
        end
        df
    else
        # before this change: copyto!(df, convert(Broadcasted{Nothing}, bc))
        copyto!(df, convert(Base.Broadcast.Broadcasted{Nothing}, bc))
    end
end

Base.Broadcast.broadcast_unalias(dest::DataFrameRow, src) =
    Base.Broadcast.broadcast_unalias(parent(dest), src)

function Base.copyto!(dfr::DataFrameRow, bc::Base.Broadcast.Broadcasted)
    # before this change the loop read `bc` directly:
    #     for I in eachindex(bc)
    #         dfr[I] = bc[I]
    bc′ = Base.Broadcast.preprocess(dfr, bc)
    for I in eachindex(bc′)
        dfr[I] = bc′[I]
    end
    dfr
end