
Add broadcasting of AbstractDataFrame #1840

Merged
merged 30 commits into JuliaData:master on Jun 23, 2019

Conversation

bkamins
Member

@bkamins bkamins commented Jun 8, 2019

This resolves #1806 and #1610.

@nalimilan - in particular, have a look at how a CategoricalVector is transformed in broadcasting using the current similar rules. This is visible in the test starting with the df3 = coalesce.(df, nothing) line.

@mbauman - could you have a look at this PR if you have time to spare (and also possibly at the whole other/broadcasting.jl file, which now lays out the full rules of broadcasting for data frames)?

@bkamins
Member Author

bkamins commented Jun 8, 2019

@nalimilan - the tests pass on Julia 1.1, but fail on Julia 1.0 due to a check against the CategoricalArray type - do you have an idea what is going on? (If not, I can install Julia 1.0 locally and check, but maybe you know the reason.) My intuition is that this is similar-related again 😞.

@bkamins
Member Author

bkamins commented Jun 8, 2019

Fortunately, everything works after all. It was another change in broadcasting between 1.0 and 1.1 that caused the problem, but it is unrelated to this PR.

Member

@nalimilan nalimilan left a comment


@nalimilan - in particular, have a look at how a CategoricalVector is transformed in broadcasting using the current similar rules. This is visible in the test starting with the df3 = coalesce.(df, nothing) line.

That sounds almost too good to be true! It's really great that broadcasting preserves CategoricalArray when the function returns a CategoricalValue or Missing, but returns a plain Array{T} when it returns CategoricalValue or Nothing (and not Array{Union{CategoricalValue{T},Nothing}}, which isn't very useful).

That works because promote_type(Nothing, CategoricalString{UInt32}) === Union{Nothing,String}. Unfortunately, that's not completely consistent with what happens with standard broadcasting over arrays, since it uses promote_typejoin, which returns Union{Nothing,CategoricalString{UInt32}}. Maybe we should discuss separately whether we should keep using promote_type instead of promote_typejoin (the idea was that you almost always want homogeneous column types when working with tabular data, contrary to general programming with arrays which is harder to decide).
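The underlying difference can be illustrated with Base functions alone (a minimal sketch; the actual divergence discussed here involves CategoricalString and requires CategoricalArrays): promote_type consults user-defined promote_rule methods and may replace types entirely, while typejoin (on which promote_typejoin builds) only walks up the type hierarchy, except that unions with Nothing or Missing are preserved rather than collapsed.

```julia
# promote_type applies promote_rule methods and can change the element type;
# typejoin only finds the smallest common supertype.
@assert promote_type(Int, Float64) === Float64   # Ints promote to Float64
@assert typejoin(Int, Float64) === Real          # common supertype only

# Base special-cases Nothing (and Missing) in promotion: the union is kept
# instead of collapsing to Any, which is what enables the behavior above.
@assert promote_type(Nothing, String) === Union{Nothing, String}
```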

@bkamins
Member Author

bkamins commented Jun 9, 2019

Maybe we should discuss separately whether we should keep using promote_type instead of promote_typejoin

You have the most experience here (but, as you say, we can leave it for later). I think the difference is not that big in practice, as this case should not happen often (if at all). If we change it, then we should probably make sure it is consistent everywhere (in particular in the original copyto_widen!).

bkamins and others added 3 commits June 9, 2019 17:50
@bkamins
Member Author

bkamins commented Jun 20, 2019

@nalimilan While analyzing @mbauman's suggestions I identified an aliasing bug we had in broadcast assignment.
I have added the fixes to this PR, as they are required here.
The test cases are included here as well.

@bkamins bkamins changed the title Add broadcasting of AbstractDataFrame WIP: Add broadcasting of AbstractDataFrame Jun 20, 2019
@bkamins bkamins changed the title WIP: Add broadcasting of AbstractDataFrame Add broadcasting of AbstractDataFrame Jun 20, 2019
@bkamins
Member Author

bkamins commented Jun 20, 2019

@nalimilan - this PR should be good for another look. I have implemented fast handling of columns and alias detection in all cases (I hope). Additional test case suggestions would be appreciated, as the code, although not very long, was non-trivial for me.

@bkamins
Member Author

bkamins commented Jun 21, 2019

I have rebased this PR against master and added even more tests.

@bkamins
Member Author

bkamins commented Jun 21, 2019

Benchmarks

Simple broadcasting

julia> m = rand(100000, 10);

julia> df = DataFrame(m);

julia> @benchmark $m .+ 1
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.548 ms (0.00% GC)
  median time:      3.682 ms (0.00% GC)
  mean time:        4.252 ms (20.66% GC)
  maximum time:     65.567 ms (93.74% GC)
  --------------
  samples:          1174
  evals/sample:     1

julia> @benchmark $df .+ 1
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  145
  --------------
  minimum time:     889.600 μs (0.00% GC)
  median time:      1.112 ms (0.00% GC)
  mean time:        1.391 ms (10.95% GC)
  maximum time:     58.203 ms (94.78% GC)
  --------------
  samples:          3582
  evals/sample:     1

julia> @benchmark $m .+ [1 2 3 4 5 6 7 8 9 10]
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  3
  --------------
  minimum time:     2.530 ms (0.00% GC)
  median time:      3.665 ms (0.00% GC)
  mean time:        4.212 ms (20.24% GC)
  maximum time:     64.252 ms (93.99% GC)
  --------------
  samples:          1186
  evals/sample:     1

julia> @benchmark $df .+ [1 2 3 4 5 6 7 8 9 10]
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  156
  --------------
  minimum time:     931.900 μs (0.00% GC)
  median time:      1.147 ms (0.00% GC)
  mean time:        1.350 ms (9.48% GC)
  maximum time:     60.199 ms (94.59% GC)
  --------------
  samples:          3687
  evals/sample:     1

julia> @benchmark $m .+ (1:100000)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.537 ms (0.00% GC)
  median time:      3.624 ms (0.00% GC)
  mean time:        4.167 ms (20.01% GC)
  maximum time:     72.307 ms (94.54% GC)
  --------------
  samples:          1202
  evals/sample:     1

julia> @benchmark $df .+ (1:100000)
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  156
  --------------
  minimum time:     1.019 ms (0.00% GC)
  median time:      1.239 ms (0.00% GC)
  mean time:        1.446 ms (8.60% GC)
  maximum time:     58.719 ms (94.23% GC)
  --------------
  samples:          3456
  evals/sample:     1

julia> @benchmark $m .+ $m
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.451 ms (0.00% GC)
  median time:      3.571 ms (0.00% GC)
  mean time:        4.144 ms (20.80% GC)
  maximum time:     68.536 ms (94.01% GC)
  --------------
  samples:          1205
  evals/sample:     1

julia> @benchmark $m .+ $df
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  155
  --------------
  minimum time:     1.187 ms (0.00% GC)
  median time:      1.415 ms (0.00% GC)
  mean time:        1.623 ms (8.59% GC)
  maximum time:     63.595 ms (91.71% GC)
  --------------
  samples:          3070
  evals/sample:     1

julia> @benchmark $df .+ $df
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  166
  --------------
  minimum time:     944.500 μs (0.00% GC)
  median time:      1.171 ms (0.00% GC)
  mean time:        1.331 ms (9.21% GC)
  maximum time:     63.837 ms (95.01% GC)
  --------------
  samples:          3752
  evals/sample:     1

Assignment broadcasting (here the performance of data frames drops):

julia> @benchmark $m .+= 1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     441.401 μs (0.00% GC)
  median time:      500.201 μs (0.00% GC)
  mean time:        562.079 μs (0.00% GC)
  maximum time:     3.071 ms (0.00% GC)
  --------------
  samples:          8841
  evals/sample:     1

julia> @benchmark $df .+= 1
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  133
  --------------
  minimum time:     1.508 ms (0.00% GC)
  median time:      1.635 ms (0.00% GC)
  mean time:        1.795 ms (7.40% GC)
  maximum time:     51.025 ms (93.99% GC)
  --------------
  samples:          2775
  evals/sample:     1

julia> @benchmark $m .+= $m
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     473.200 μs (0.00% GC)
  median time:      552.801 μs (0.00% GC)
  mean time:        619.016 μs (0.00% GC)
  maximum time:     2.216 ms (0.00% GC)
  --------------
  samples:          8034
  evals/sample:     1

julia> @benchmark $df .+= $df
BenchmarkTools.Trial:
  memory estimate:  15.27 MiB
  allocs estimate:  218
  --------------
  minimum time:     2.442 ms (0.00% GC)
  median time:      2.828 ms (9.58% GC)
  mean time:        2.993 ms (8.48% GC)
  maximum time:     50.711 ms (89.78% GC)
  --------------
  samples:          1670
  evals/sample:     1

Now the case where the anti-aliasing analysis hits us hardest:

julia> m = rand(10, 1000);

julia> df = DataFrame(m);

julia> @benchmark $m .+= $m
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.100 μs (0.00% GC)
  median time:      7.420 μs (0.00% GC)
  mean time:        7.909 μs (0.00% GC)
  maximum time:     48.540 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark $df .+= $df
BenchmarkTools.Trial:
  memory estimate:  903.03 KiB
  allocs estimate:  10582
  --------------
  minimum time:     850.199 μs (0.00% GC)
  median time:      948.401 μs (0.00% GC)
  mean time:        1.123 ms (5.74% GC)
  maximum time:     50.658 ms (97.44% GC)
  --------------
  samples:          4439
  evals/sample:     1

This one is slow:

julia> @benchmark $m .+= $df
BenchmarkTools.Trial:
  memory estimate:  625.00 KiB
  allocs estimate:  40000
  --------------
  minimum time:     1.249 ms (0.00% GC)
  median time:      1.401 ms (0.00% GC)
  mean time:        1.644 ms (1.89% GC)
  maximum time:     45.792 ms (96.29% GC)
  --------------
  samples:          3046
  evals/sample:     1

because it does not use the custom optimizations (we have a data frame on the RHS, but not on the LHS).

OTOH this is faster than $df .+= $df:

julia> @benchmark $df .+= $m
BenchmarkTools.Trial:
  memory estimate:  502.75 KiB
  allocs estimate:  7045
  --------------
  minimum time:     615.500 μs (0.00% GC)
  median time:      689.050 μs (0.00% GC)
  mean time:        818.695 μs (5.18% GC)
  maximum time:     50.480 ms (98.05% GC)
  --------------
  samples:          6094
  evals/sample:     1

as aliasing analysis in this case is cheap.

So in summary: if you have a very large number of columns, it is better to work with matrices than with data frames (which probably could be expected).

@bkamins
Member Author

bkamins commented Jun 21, 2019

In the data frame unaliasing code I can add a line:

col1 == col2 && dcol === scol && continue

which:

  1. greatly improves the performance in the common case when the same data frame is on the left and right hand side of broadcasting
  2. further penalizes the case of many columns

The general reason why unaliasing is slow is that in this code we have to do:

            dcol = dest[col1]
            scol = src[col2]

which is not type stable (so all subsequent checks require dynamic dispatch), and this is performed O(ncol(df)^2) times.
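The loop structure described above can be sketched as follows (a hypothetical toy model, not the actual DataFrames.jl code; MiniDF and unalias_columns are made-up names). Columns are stored with an abstract element type, so every access dispatches dynamically, and every destination column is compared with every source column:

```julia
# Toy stand-in for a data frame: a vector of columns with abstract eltype,
# so column access is not type stable (mirroring the problem described above).
struct MiniDF
    cols::Vector{AbstractVector}
end
ncol(df::MiniDF) = length(df.cols)

# O(ncol(dest) * ncol(src)) unaliasing pass: replace any source column that
# might alias a destination column by a defensive copy before assignment.
function unalias_columns(dest::MiniDF, src::MiniDF)
    newcols = copy(src.cols)  # shallow copy of the column list only
    for col1 in 1:ncol(dest), col2 in 1:ncol(src)
        dcol = dest.cols[col1]   # dynamic dispatch on every access
        scol = newcols[col2]
        # proposed fast path: the same column in the same position on both
        # sides needs no defensive copy
        col1 == col2 && dcol === scol && continue
        if Base.mightalias(dcol, scol)
            newcols[col2] = Base.unaliascopy(scol)
        end
    end
    return MiniDF(newcols)
end

v = [1.0, 2.0]
dest = MiniDF(AbstractVector[v, [3.0, 4.0]])
src = MiniDF(AbstractVector[v, v])   # column 2 aliases dest column 1
out = unalias_columns(dest, src)
```

Here out.cols[1] is still the original vector v (the fast path fired), while out.cols[2] has been replaced by an equal but independent copy.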

@nalimilan
Member

Sounds like a good idea. It shouldn't be very costly compared to the unalias checks we already perform. That will make df .+= 1 faster, right?

@bkamins
Member Author

bkamins commented Jun 21, 2019

After discussion with @nalimilan I have added some performance improvements in unaliasing. Here are the updated benchmarks. I report the relevant cases.

Long data frame:

julia> m = rand(100_000, 10);

julia> df = DataFrame(m);

julia> @benchmark $m .+ $m
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.662 ms (0.00% GC)
  median time:      3.253 ms (0.00% GC)
  mean time:        3.842 ms (20.50% GC)
  maximum time:     7.473 ms (53.80% GC)
  --------------
  samples:          1299
  evals/sample:     1

julia> @benchmark $df .+ $df
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  166
  --------------
  minimum time:     1.088 ms (0.00% GC)
  median time:      3.016 ms (0.00% GC)
  mean time:        3.029 ms (16.34% GC)
  maximum time:     6.696 ms (37.92% GC)
  --------------
  samples:          1648
  evals/sample:     1

julia> @benchmark $m .+= 1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     428.500 μs (0.00% GC)
  median time:      441.201 μs (0.00% GC)
  mean time:        484.818 μs (0.00% GC)
  maximum time:     1.839 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $df .+= 1
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  143
  --------------
  minimum time:     1.414 ms (0.00% GC)
  median time:      2.009 ms (0.00% GC)
  mean time:        2.357 ms (13.52% GC)
  maximum time:     5.696 ms (38.79% GC)
  --------------
  samples:          2120
  evals/sample:     1

Wide data frame:

julia> m = rand(10, 1000);

julia> df = DataFrame(m);

julia> @benchmark $m .+ $m
BenchmarkTools.Trial:
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     7.125 μs (0.00% GC)
  median time:      8.025 μs (0.00% GC)
  mean time:        10.341 μs (13.86% GC)
  maximum time:     617.800 μs (98.54% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark $df .+ $df
BenchmarkTools.Trial:
  memory estimate:  556.36 KiB
  allocs estimate:  10550
  --------------
  minimum time:     917.701 μs (0.00% GC)
  median time:      1.003 ms (0.00% GC)
  mean time:        1.096 ms (4.56% GC)
  maximum time:     5.177 ms (67.39% GC)
  --------------
  samples:          4555
  evals/sample:     1

julia> @benchmark $m .+= 1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.357 μs (0.00% GC)
  median time:      4.714 μs (0.00% GC)
  mean time:        4.862 μs (0.00% GC)
  maximum time:     40.800 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark $df .+= 1
BenchmarkTools.Trial:
  memory estimate:  533.95 KiB
  allocs estimate:  8044
  --------------
  minimum time:     550.200 μs (0.00% GC)
  median time:      601.600 μs (0.00% GC)
  mean time:        675.285 μs (6.58% GC)
  maximum time:     4.716 ms (84.92% GC)
  --------------
  samples:          7382
  evals/sample:     1

julia> @benchmark $m .+= $m
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.080 μs (0.00% GC)
  median time:      6.560 μs (0.00% GC)
  mean time:        6.964 μs (0.00% GC)
  maximum time:     54.600 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark $df .+= $df
BenchmarkTools.Trial:
  memory estimate:  965.53 KiB
  allocs estimate:  12582
  --------------
  minimum time:     815.999 μs (0.00% GC)
  median time:      839.800 μs (0.00% GC)
  mean time:        981.483 μs (6.75% GC)
  maximum time:     3.718 ms (72.33% GC)
  --------------
  samples:          5093
  evals/sample:     1

We can see that in the case of a wide data frame (1000 columns) unaliasing costs ~25% of the total time. The biggest penalty in this regime comes from the fact that data frame column access is not type stable (when we process a small number of long columns we do not have this problem, as most of the work is done behind a function barrier).

@bkamins
Member Author

bkamins commented Jun 22, 2019

@nalimilan Thank you for working on it.

@bkamins
Member Author

bkamins commented Jun 22, 2019

and thanks to @mbauman for the excellent suggestion of the column extraction approach.

@bkamins
Member Author

bkamins commented Jun 22, 2019

Design principles of data frame broadcasting:

  • we add the possibility to index an AbstractDataFrame using CartesianIndex
  • AbstractDataFrame is directly broadcastable (it is passed as-is)
  • if an AbstractDataFrame is used anywhere in broadcasting, it forces the broadcast to have DataFrameStyle (this works fine with Base, but will lead to errors if other types that define custom broadcasting rules are mixed with AbstractDataFrame; in such cases it is not clear what should be produced, so throwing an error is acceptable)
  • internally, each column of an AbstractDataFrame is processed separately (both on the LHS and the RHS); this is achieved by dynamically generating, for each processed column, a new broadcast object that contains only that column instead of the original data frame. This happens to work because, as @mbauman observed, the standard broadcasting machinery broadcasts this vector exactly as expected (the key fact is that a data frame is exactly two-dimensional and its vectors are columns, matching the data frame's layout). This approach makes broadcasting fast when the data frame has many rows, but it has a performance overhead when there are many columns to process (the processing cost is O(ncol(df))). Additionally, as a minor simplification and optimization, the original broadcast object is flattened first, so we are sure we are working on a non-nested object
  • when a data frame is used in non-assignment broadcasting we do not have to unalias its columns; however, if a data frame appears on the left or right hand side of broadcast assignment, we have to perform unaliasing. Typically this is an O(ncol(df)) operation; the only special case is an AbstractDataFrame present on both the LHS and the RHS, in which case the operation has O(ncol(df)^2) cost. This is the second reason why broadcasting over an AbstractDataFrame with many columns may get expensive; practical tests show that up to 1000 columns the timings are in a reasonable range, and with tens of columns and thousands of rows the performance of broadcasting of AbstractDataFrame should be good
  • as a special case, broadcasting into a column of a DataFrame allows column creation, so an operation df[newcol] .= value will work even if newcol is not present in the DataFrame
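The per-column decomposition described above can be sketched with a toy model (hypothetical, not the DataFrames.jl implementation; broadcast_df is a made-up name): treat the "data frame" as a NamedTuple of column vectors, substitute each column for the data frame argument, and reuse standard vector broadcasting column by column.

```julia
# For each column, run an ordinary vector broadcast with the remaining
# arguments; vectors broadcast columnwise, matching a data frame's
# 2-dimensional, column-major layout.
broadcast_df(f, df::NamedTuple, args...) = map(col -> f.(col, args...), df)

df = (a = [1, 2, 3], b = [10.0, 20.0, 30.0])
out = broadcast_df(+, df, 1)
# out.a == [2, 3, 4]; out.b == [11.0, 21.0, 31.0]
```

Note that the cost of the map over columns is O(ncol), which is the overhead mentioned above for wide tables.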

@bkamins
Member Author

bkamins commented Jun 22, 2019

@nalimilan I have done a final round of code review after your approval, which led to some small changes (unfortunately this PR is hard for me to get right in one run). They are collected in the last two commits (the first adds tests, the second changes documentation and implementation).

Changes:

  • I have made the documentation more precise regarding what kinds of broadcasting operations are allowed (but your review here would be most welcome, as always 😄)
  • we now throw an explicit dimension mismatch error when trying to broadcast a data frame into more than 2 dimensions (earlier we also errored, but the error was cryptic); however, we allow broadcast assignment into higher-dimensional objects (the data frame is then treated as a normal 2-dimensional array-like object)
  • there was a tricky case where an additional anti-aliasing method was needed (initially I thought it was not needed, but I came up with a scenario where it is required): we also have to unalias when src is a DataFrame and dest is not a data frame
  • I have added tests for broadcast assignment of a data frame into another type of object (including the case that required anti-aliasing) and for broadcasting a data frame in combination with a 3-dimensional object

bkamins and others added 3 commits June 23, 2019 21:33
@bkamins bkamins merged commit 4d8983c into JuliaData:master Jun 23, 2019
@bkamins bkamins deleted the new_dataframe_broadcasting branch June 23, 2019 22:03
Successfully merging this pull request may close these issues.

Allow data frame and DataFrameRow to take part in broadcasting