
Add broadcasting of AbstractDataFrame #1840

Merged
merged 30 commits into JuliaData:master on Jun 23, 2019

Conversation

bkamins
Member

@bkamins bkamins commented Jun 8, 2019

This resolves #1806 and #1610.

@nalimilan - in particular, have a look at how a CategoricalVector is transformed in broadcasting using the current similar rules. This is visible in the test starting with the df3 = coalesce.(df, nothing) line.

@mbauman - could you have a look at this PR if you have time to spare (and also possibly at the whole other/broadcasting.jl file, which now lays out the full rules of broadcasting for data frames)?

@bkamins
Member Author

bkamins commented Jun 8, 2019

@nalimilan - the tests pass on Julia 1.1, but fail on Julia 1.0 due to a check against the CategoricalArray type - do you have an idea what is going on? (If not, I can install Julia 1.0 locally and check, but maybe you know the reason.) My intuition is that this is similar-related again 😞.

@bkamins
Member Author

bkamins commented Jun 8, 2019

Fortunately, everything works after all. It was another change in broadcasting between 1.0 and 1.1 that caused the problem, but it is unrelated to this PR.

Member

@nalimilan nalimilan left a comment


@nalimilan - in particular, have a look at how a CategoricalVector is transformed in broadcasting using the current similar rules. This is visible in the test starting with the df3 = coalesce.(df, nothing) line.

That sounds almost too good to be true! It's really great that broadcasting preserves CategoricalArray when the function returns a CategoricalValue or Missing, but returns a plain Array{T} when it returns CategoricalValue or Nothing (and not Array{Union{CategoricalValue{T},Nothing}}, which isn't very useful).

That works because promote_type(Nothing, CategoricalString{UInt32}) === Union{Nothing,String}. Unfortunately, that's not completely consistent with what happens with standard broadcasting over arrays, since it uses promote_typejoin, which returns Union{Nothing,CategoricalString{UInt32}}. Maybe we should discuss separately whether we should keep using promote_type instead of promote_typejoin (the idea was that you almost always want homogeneous column types when working with tabular data, contrary to general programming with arrays which is harder to decide).
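The underlying difference can be illustrated with Base functions alone (a minimal sketch; the actual divergence discussed here involves CategoricalString and requires CategoricalArrays): promote_type consults user-defined promote_rule methods and may replace types entirely, while typejoin (on which promote_typejoin builds) only walks up the type hierarchy, except that unions with Nothing or Missing are preserved rather than collapsed.

```julia
# promote_type applies promote_rule methods and can change the element type;
# typejoin only finds the smallest common supertype.
@assert promote_type(Int, Float64) === Float64   # Ints promote to Float64
@assert typejoin(Int, Float64) === Real          # common supertype only

# Base special-cases Nothing (and Missing) in promotion: the union is kept
# instead of collapsing to Any, which is what enables the behavior above.
@assert promote_type(Nothing, String) === Union{Nothing, String}
```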

@bkamins
Member Author

bkamins commented Jun 9, 2019

Maybe we should discuss separately whether we should keep using promote_type instead of promote_typejoin

You have the most experience here (but, as you say, we can leave it for later). I think the difference is not that big in practice, as this case should not happen often (if at all). If we change it, then we should probably make sure it is consistent everywhere (in particular in the original copyto_widen!).

bkamins and others added 3 commits June 9, 2019 17:50
@bkamins
Member Author

bkamins commented Jun 20, 2019

@nalimilan While analyzing @mbauman's suggestions I identified an aliasing bug we had in broadcast assignment.
I have added the fixes to this PR, as they are required here.
The test cases are included here as well.

@bkamins bkamins changed the title Add broadcasting of AbstractDataFrame WIP: Add broadcasting of AbstractDataFrame Jun 20, 2019
@bkamins bkamins changed the title WIP: Add broadcasting of AbstractDataFrame Add broadcasting of AbstractDataFrame Jun 20, 2019
@bkamins
Member Author

bkamins commented Jun 20, 2019

@nalimilan - this PR should be good for another look. I have implemented fast handling of columns and alias detection in all cases (I hope). Additional test case suggestions would be appreciated, as the code, although not very long, was non-trivial for me.

@bkamins
Member Author

bkamins commented Jun 21, 2019

I have rebased this PR against master and added even more tests.

@bkamins
Member Author

bkamins commented Jun 21, 2019

Benchmarks

Simple broadcasting

julia> m = rand(100000, 10);

julia> df = DataFrame(m);

julia> @benchmark $m .+ 1
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.548 ms (0.00% GC)
  median time:      3.682 ms (0.00% GC)
  mean time:        4.252 ms (20.66% GC)
  maximum time:     65.567 ms (93.74% GC)
  --------------
  samples:          1174
  evals/sample:     1

julia> @benchmark $df .+ 1
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  145
  --------------
  minimum time:     889.600 μs (0.00% GC)
  median time:      1.112 ms (0.00% GC)
  mean time:        1.391 ms (10.95% GC)
  maximum time:     58.203 ms (94.78% GC)
  --------------
  samples:          3582
  evals/sample:     1

julia> @benchmark $m .+ [1 2 3 4 5 6 7 8 9 10]
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  3
  --------------
  minimum time:     2.530 ms (0.00% GC)
  median time:      3.665 ms (0.00% GC)
  mean time:        4.212 ms (20.24% GC)
  maximum time:     64.252 ms (93.99% GC)
  --------------
  samples:          1186
  evals/sample:     1

julia> @benchmark $df .+ [1 2 3 4 5 6 7 8 9 10]
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  156
  --------------
  minimum time:     931.900 μs (0.00% GC)
  median time:      1.147 ms (0.00% GC)
  mean time:        1.350 ms (9.48% GC)
  maximum time:     60.199 ms (94.59% GC)
  --------------
  samples:          3687
  evals/sample:     1

julia> @benchmark $m .+ (1:100000)
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.537 ms (0.00% GC)
  median time:      3.624 ms (0.00% GC)
  mean time:        4.167 ms (20.01% GC)
  maximum time:     72.307 ms (94.54% GC)
  --------------
  samples:          1202
  evals/sample:     1

julia> @benchmark $df .+ (1:100000)
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  156
  --------------
  minimum time:     1.019 ms (0.00% GC)
  median time:      1.239 ms (0.00% GC)
  mean time:        1.446 ms (8.60% GC)
  maximum time:     58.719 ms (94.23% GC)
  --------------
  samples:          3456
  evals/sample:     1

julia> @benchmark $m .+ $m
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.451 ms (0.00% GC)
  median time:      3.571 ms (0.00% GC)
  mean time:        4.144 ms (20.80% GC)
  maximum time:     68.536 ms (94.01% GC)
  --------------
  samples:          1205
  evals/sample:     1

julia> @benchmark $m .+ $df
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  155
  --------------
  minimum time:     1.187 ms (0.00% GC)
  median time:      1.415 ms (0.00% GC)
  mean time:        1.623 ms (8.59% GC)
  maximum time:     63.595 ms (91.71% GC)
  --------------
  samples:          3070
  evals/sample:     1

julia> @benchmark $df .+ $df
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  166
  --------------
  minimum time:     944.500 μs (0.00% GC)
  median time:      1.171 ms (0.00% GC)
  mean time:        1.331 ms (9.21% GC)
  maximum time:     63.837 ms (95.01% GC)
  --------------
  samples:          3752
  evals/sample:     1

Assignment broadcasting (here the performance of data frames drops):

julia> @benchmark $m .+= 1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     441.401 μs (0.00% GC)
  median time:      500.201 μs (0.00% GC)
  mean time:        562.079 μs (0.00% GC)
  maximum time:     3.071 ms (0.00% GC)
  --------------
  samples:          8841
  evals/sample:     1

julia> @benchmark $df .+= 1
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  133
  --------------
  minimum time:     1.508 ms (0.00% GC)
  median time:      1.635 ms (0.00% GC)
  mean time:        1.795 ms (7.40% GC)
  maximum time:     51.025 ms (93.99% GC)
  --------------
  samples:          2775
  evals/sample:     1

julia> @benchmark $m .+= $m
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     473.200 μs (0.00% GC)
  median time:      552.801 μs (0.00% GC)
  mean time:        619.016 μs (0.00% GC)
  maximum time:     2.216 ms (0.00% GC)
  --------------
  samples:          8034
  evals/sample:     1

julia> @benchmark $df .+= $df
BenchmarkTools.Trial:
  memory estimate:  15.27 MiB
  allocs estimate:  218
  --------------
  minimum time:     2.442 ms (0.00% GC)
  median time:      2.828 ms (9.58% GC)
  mean time:        2.993 ms (8.48% GC)
  maximum time:     50.711 ms (89.78% GC)
  --------------
  samples:          1670
  evals/sample:     1

Now the case where the anti-aliasing analysis hits us hardest:

julia> m = rand(10, 1000);

julia> df = DataFrame(m);

julia> @benchmark $m .+= $m
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.100 μs (0.00% GC)
  median time:      7.420 μs (0.00% GC)
  mean time:        7.909 μs (0.00% GC)
  maximum time:     48.540 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark $df .+= $df
BenchmarkTools.Trial:
  memory estimate:  903.03 KiB
  allocs estimate:  10582
  --------------
  minimum time:     850.199 μs (0.00% GC)
  median time:      948.401 μs (0.00% GC)
  mean time:        1.123 ms (5.74% GC)
  maximum time:     50.658 ms (97.44% GC)
  --------------
  samples:          4439
  evals/sample:     1

This one is slow:

julia> @benchmark $m .+= $df
BenchmarkTools.Trial:
  memory estimate:  625.00 KiB
  allocs estimate:  40000
  --------------
  minimum time:     1.249 ms (0.00% GC)
  median time:      1.401 ms (0.00% GC)
  mean time:        1.644 ms (1.89% GC)
  maximum time:     45.792 ms (96.29% GC)
  --------------
  samples:          3046
  evals/sample:     1

because it does not use the custom optimizations (we have a data frame on the RHS, but not on the LHS).

OTOH this is faster than $df .+= $df:

julia> @benchmark $df .+= $m
BenchmarkTools.Trial:
  memory estimate:  502.75 KiB
  allocs estimate:  7045
  --------------
  minimum time:     615.500 μs (0.00% GC)
  median time:      689.050 μs (0.00% GC)
  mean time:        818.695 μs (5.18% GC)
  maximum time:     50.480 ms (98.05% GC)
  --------------
  samples:          6094
  evals/sample:     1

as aliasing analysis in this case is cheap.

So in summary: if you have a very large number of columns, it is better to work with matrices than with data frames (which probably could be expected).

@bkamins
Member Author

bkamins commented Jun 21, 2019

In the data frame unaliasing code I can add a line:

col1 == col2 && dcol === scol && continue

which:

  1. greatly improves the performance in the common case when the same data frame is on the left and right hand side of broadcasting
  2. further penalizes the case of many columns

The general reason why unaliasing is slow is that in this code we have to do:

            dcol = dest[col1]
            scol = src[col2]

which is not type stable (so all subsequent checks require dynamic dispatch), and this is performed O(ncol(df)^2) times.
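The loop structure described above can be sketched as follows (a hypothetical toy model, not the actual DataFrames.jl code; MiniDF and unalias_columns are made-up names). Columns are stored with an abstract element type, so every access dispatches dynamically, and every destination column is compared with every source column:

```julia
# Toy stand-in for a data frame: a vector of columns with abstract eltype,
# so column access is not type stable (mirroring the problem described above).
struct MiniDF
    cols::Vector{AbstractVector}
end
ncol(df::MiniDF) = length(df.cols)

# O(ncol(dest) * ncol(src)) unaliasing pass: replace any source column that
# might alias a destination column by a defensive copy before assignment.
function unalias_columns(dest::MiniDF, src::MiniDF)
    newcols = copy(src.cols)  # shallow copy of the column list only
    for col1 in 1:ncol(dest), col2 in 1:ncol(src)
        dcol = dest.cols[col1]   # dynamic dispatch on every access
        scol = newcols[col2]
        # proposed fast path: the same column in the same position on both
        # sides needs no defensive copy
        col1 == col2 && dcol === scol && continue
        if Base.mightalias(dcol, scol)
            newcols[col2] = Base.unaliascopy(scol)
        end
    end
    return MiniDF(newcols)
end

v = [1.0, 2.0]
dest = MiniDF(AbstractVector[v, [3.0, 4.0]])
src = MiniDF(AbstractVector[v, v])   # column 2 aliases dest column 1
out = unalias_columns(dest, src)
```

Here out.cols[1] is still the original vector v (the fast path fired), while out.cols[2] has been replaced by an equal but independent copy.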

@nalimilan
Member

Sounds like a good idea. It shouldn't be very costly compared to the unalias checks we already perform. That will make df .+= 1 faster, right?

@bkamins
Member Author

bkamins commented Jun 21, 2019

After discussion with @nalimilan I have added some performance improvements in unaliasing. Here are the updated benchmarks. I report the relevant cases.

Long data frame:

julia> m = rand(100_000, 10);

julia> df = DataFrame(m);

julia> @benchmark $m .+ $m
BenchmarkTools.Trial:
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     2.662 ms (0.00% GC)
  median time:      3.253 ms (0.00% GC)
  mean time:        3.842 ms (20.50% GC)
  maximum time:     7.473 ms (53.80% GC)
  --------------
  samples:          1299
  evals/sample:     1

julia> @benchmark $df .+ $df
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  166
  --------------
  minimum time:     1.088 ms (0.00% GC)
  median time:      3.016 ms (0.00% GC)
  mean time:        3.029 ms (16.34% GC)
  maximum time:     6.696 ms (37.92% GC)
  --------------
  samples:          1648
  evals/sample:     1

julia> @benchmark $m .+= 1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     428.500 μs (0.00% GC)
  median time:      441.201 μs (0.00% GC)
  mean time:        484.818 μs (0.00% GC)
  maximum time:     1.839 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark $df .+= 1
BenchmarkTools.Trial:
  memory estimate:  7.64 MiB
  allocs estimate:  143
  --------------
  minimum time:     1.414 ms (0.00% GC)
  median time:      2.009 ms (0.00% GC)
  mean time:        2.357 ms (13.52% GC)
  maximum time:     5.696 ms (38.79% GC)
  --------------
  samples:          2120
  evals/sample:     1

Wide data frame:

julia> m = rand(10, 1000);

julia> df = DataFrame(m);

julia> @benchmark $m .+ $m
BenchmarkTools.Trial:
  memory estimate:  78.20 KiB
  allocs estimate:  2
  --------------
  minimum time:     7.125 μs (0.00% GC)
  median time:      8.025 μs (0.00% GC)
  mean time:        10.341 μs (13.86% GC)
  maximum time:     617.800 μs (98.54% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> @benchmark $df .+ $df
BenchmarkTools.Trial:
  memory estimate:  556.36 KiB
  allocs estimate:  10550
  --------------
  minimum time:     917.701 μs (0.00% GC)
  median time:      1.003 ms (0.00% GC)
  mean time:        1.096 ms (4.56% GC)
  maximum time:     5.177 ms (67.39% GC)
  --------------
  samples:          4555
  evals/sample:     1

julia> @benchmark $m .+= 1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.357 μs (0.00% GC)
  median time:      4.714 μs (0.00% GC)
  mean time:        4.862 μs (0.00% GC)
  maximum time:     40.800 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     7

julia> @benchmark $df .+= 1
BenchmarkTools.Trial:
  memory estimate:  533.95 KiB
  allocs estimate:  8044
  --------------
  minimum time:     550.200 μs (0.00% GC)
  median time:      601.600 μs (0.00% GC)
  mean time:        675.285 μs (6.58% GC)
  maximum time:     4.716 ms (84.92% GC)
  --------------
  samples:          7382
  evals/sample:     1

julia> @benchmark $m .+= $m
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.080 μs (0.00% GC)
  median time:      6.560 μs (0.00% GC)
  mean time:        6.964 μs (0.00% GC)
  maximum time:     54.600 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

julia> @benchmark $df .+= $df
BenchmarkTools.Trial:
  memory estimate:  965.53 KiB
  allocs estimate:  12582
  --------------
  minimum time:     815.999 μs (0.00% GC)
  median time:      839.800 μs (0.00% GC)
  mean time:        981.483 μs (6.75% GC)
  maximum time:     3.718 ms (72.33% GC)
  --------------
  samples:          5093
  evals/sample:     1

We can see that in the case of a wide data frame (1000 columns) unaliasing costs ~25% of the total time. The biggest penalty in this regime comes from the fact that data frame column access is not type stable (when we process a small number of long columns we do not have this problem, as most of the work is done behind a function barrier).

@bkamins
Member Author

bkamins commented Jun 22, 2019

@nalimilan Thank you for working on it.

@bkamins
Member Author

bkamins commented Jun 22, 2019

and thanks to @mbauman for the excellent suggestion of the column extraction approach.

@bkamins
Member Author

bkamins commented Jun 22, 2019

Design principles of data frame broadcasting:

  • we add the possibility to index an AbstractDataFrame using CartesianIndex
  • AbstractDataFrame is directly broadcastable (it is passed as-is)
  • if an AbstractDataFrame is used anywhere in broadcasting, it forces the broadcast to have DataFrameStyle (this works fine with Base, but will lead to errors if other types that define custom broadcasting rules are mixed with AbstractDataFrame; in such cases it is not clear what should be produced, so throwing an error is acceptable)
  • internally, each column of an AbstractDataFrame is processed separately (both on the LHS and the RHS); this is achieved by dynamically generating, for each processed column, a new broadcast object that contains only that column instead of the original data frame. This happens to work because, as @mbauman observed, the standard broadcasting machinery broadcasts this vector exactly as expected (the key fact is that a data frame is exactly two-dimensional and its vectors are columns, matching the data frame's layout). This approach makes broadcasting fast when the data frame has many rows, but it has a performance overhead when there are many columns to process (the processing cost is O(ncol(df))). Additionally, as a minor simplification and optimization, the original broadcast object is flattened first, so we are sure we are working on a non-nested object
  • when a data frame is used in non-assignment broadcasting we do not have to unalias its columns; however, if a data frame appears on the left or right hand side of broadcast assignment, we have to perform unaliasing. Typically this is an O(ncol(df)) operation; the only special case is an AbstractDataFrame present on both the LHS and the RHS, in which case the operation has O(ncol(df)^2) cost. This is the second reason why broadcasting over an AbstractDataFrame with many columns may get expensive; practical tests show that up to 1000 columns the timings are in a reasonable range, and with tens of columns and thousands of rows the performance of broadcasting of AbstractDataFrame should be good
  • as a special case, broadcasting into a column of a DataFrame allows column creation, so an operation df[newcol] .= value will work even if newcol is not present in the DataFrame
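The per-column decomposition described above can be sketched with a toy model (hypothetical, not the DataFrames.jl implementation; broadcast_df is a made-up name): treat the "data frame" as a NamedTuple of column vectors, substitute each column for the data frame argument, and reuse standard vector broadcasting column by column.

```julia
# For each column, run an ordinary vector broadcast with the remaining
# arguments; vectors broadcast columnwise, matching a data frame's
# 2-dimensional, column-major layout.
broadcast_df(f, df::NamedTuple, args...) = map(col -> f.(col, args...), df)

df = (a = [1, 2, 3], b = [10.0, 20.0, 30.0])
out = broadcast_df(+, df, 1)
# out.a == [2, 3, 4]; out.b == [11.0, 21.0, 31.0]
```

Note that the cost of the map over columns is O(ncol), which is the overhead mentioned above for wide tables.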

@bkamins
Member Author

bkamins commented Jun 22, 2019

@nalimilan I have done a final round of code review after your approval, which led to some small changes (unfortunately this PR is hard for me to get right in one run). They are collected in the last two commits (the first adds tests, the second changes documentation and implementation).

Changes:

  • I have made the documentation more precise regarding what kinds of broadcasting operations are allowed (but your review here would be most welcome, as always 😄)
  • we now throw an explicit dimension mismatch error when trying to broadcast a data frame into more than 2 dimensions (earlier we also errored, but the error was cryptic); however, we allow broadcast assignment into higher-dimensional objects (the data frame is then treated as a normal 2-dimensional array-like object)
  • there was a tricky case where an additional anti-aliasing method was needed (initially I thought it was not needed, but I came up with a scenario where it is required): we also have to unalias when src is a DataFrame and dest is not a data frame
  • I have added tests for broadcast assignment of a data frame into another type of object (including the case that required anti-aliasing) and for broadcasting a data frame in combination with a 3-dimensional object

bkamins and others added 3 commits June 23, 2019 21:33
@bkamins bkamins merged commit 4d8983c into JuliaData:master Jun 23, 2019
@bkamins bkamins deleted the new_dataframe_broadcasting branch June 23, 2019 22:03
Successfully merging this pull request may close these issues.

Allow data frame and DataFrameRow to take part in broadcasting