
broadcast in DataFrames.jl #1643

Closed
wants to merge 1 commit into from

Conversation

bkamins
Member

@bkamins bkamins commented Dec 14, 2018

This is a small PR, but a big decision towards making DataFrames.jl broadcasting-ready. What I recommend is:

  • treat an AbstractDataFrame as an AbstractVector of DataFrameRows when broadcasting;
  • treat a DataFrameRow as a one-dimensional generic collection when broadcasting.

The major consequence is that the output of broadcasting a function over these types will be a vector in both cases.
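Under this proposal the behavior could be exercised today by wrapping explicitly; a minimal sketch using the current DataFrames.jl API:

```julia
using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

# Treating the data frame as a vector of DataFrameRows: broadcasting a
# function over the rows yields a plain Vector, one element per row.
sums = sum.(eachrow(df))   # [5, 7, 9]
```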

@bkamins
Member Author

bkamins commented Dec 14, 2018

This PR will pass tests after #1637 is merged.

Member

@nalimilan nalimilan left a comment


Thanks. I have a few hesitations about this.

First, I wonder whether returning a vector when broadcasting over a DataFrame is the most useful option. We could implement custom methods which return a DataFrame, and allow returning a named tuple to create columns. This would be essentially transmute in the dplyr terminology. Or a possible intermediate approach would be to return a vector if the function returns a single value, but a data frame if it returns a named tuple.
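A rough sketch of the intermediate approach described here; `rowapply` is a hypothetical helper for illustration, not an existing DataFrames.jl function:

```julia
using DataFrames

# Hypothetical helper: apply f to each row; if f returns a NamedTuple,
# collect the results into a new DataFrame (columns named after the
# NamedTuple fields), otherwise return a plain Vector.
function rowapply(f, df::AbstractDataFrame)
    vals = [f(r) for r in eachrow(df)]
    return eltype(vals) <: NamedTuple ? DataFrame(vals) : vals
end

df = DataFrame(x = 1:3, y = 4:6)
rowapply(r -> r.x + r.y, df)                      # Vector: [5, 7, 9]
rowapply(r -> (s = r.x + r.y, p = r.x * r.y), df) # DataFrame with columns :s, :p
```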

The other annoying point is that this syntax will necessarily be slow, and we don't provide a fast alternative (like we do for by). Providing a syntax which is convenient but slow could be worse than not providing it at all.

@@ -0,0 +1,9 @@
module TestDataFrame
Member


Can you put this in an existing file?

Member Author


OK - I will move it to indexing.jl.

@@ -471,3 +471,70 @@ CSV.write(output, df)
```

The behavior of CSV functions can be adapted via keyword arguments. For more information, see `?CSV.read` and `?CSV.write`, or checkout the online [CSV.jl documentation](https://juliadata.github.io/CSV.jl/stable/).

## Broadcasting
Member


I'm not sure we should mention this here given that this syntax is inefficient. We should first show how to achieve this efficiently. Broadcasting over columns is also very useful.

Member Author


My point with this PR is exactly to decide what kind of broadcasting we want to provide by default. We essentially have four options for broadcasting purposes:

  • make DataFrame be treated as scalar, then we do Ref(df);
  • make DataFrame be treated as a collection of rows, then we do eachrow(df);
  • make DataFrame be treated as a collection of columns, then we do eachcol(df, false);
  • make DataFrame be treated as a two-dimensional object, like an array; an inefficient way to do this would be Matrix(df); a more efficient implementation would probably be needed if we decided on it.

We have to select one. The other three would have to be written out explicitly by the user when broadcasting.

So the question is: how do we want to treat an AbstractDataFrame in broadcasting by default (when it is not wrapped in some other object)?
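For illustration, each of the four candidate defaults can be spelled out explicitly today (using the one-argument `eachcol`; at the time of this discussion the two-argument `eachcol(df, false)` mentioned above played that role):

```julia
using DataFrames

df = DataFrame(x = 1:2, y = 3:4)

nrow.(Ref(df))        # (1) scalar: the whole df passed to the function once
length.(eachrow(df))  # (2) collection of rows: [2, 2], one result per row
length.(eachcol(df))  # (3) collection of columns: [2, 2], one result per column
string.(Matrix(df))   # (4) two-dimensional, like an array: a 2×2 Matrix
```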

"1"
"3"

julia> (row -> string.(row)).(df)
Member


I don't think this syntax is recommended (and it's quite obscure for newcomers).

Member Author


OK - I will remove it.

@bkamins
Member Author

bkamins commented Dec 15, 2018

Given the following (I have just noticed it, as it got covered by the inline comments):

The other annoying point is that this syntax will necessarily be slow, and we don't provide a fast alternative

I am OK to leave it undefined for now, but at least let us try to decide what should be the dimensionality of AbstractDataFrame when broadcasted over (a question I have asked above).

@pdeffebach
Contributor

One thing I've been thinking about is whether we might want to have our own map function that works similarly to the by optimizations.

Having

map(df, [:x1, :x2]) do x 

could be nice. I mean, maybe not, but it's worth wondering whether we should keep row iteration behind a few more function calls to make it performant.
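A hypothetical sketch of such a column-restricted `map` (the name `typedmap` and its implementation are illustrative only), using a NamedTuple of the selected columns plus a function barrier for type stability:

```julia
using DataFrames

# Hypothetical column-restricted map (not an existing API): build a
# NamedTuple of just the requested columns so the kernel below the
# function barrier sees a concrete type.
function typedmap(f, df::AbstractDataFrame, cols::Vector{Symbol})
    nt = NamedTuple{Tuple(cols)}(Tuple(df[!, c] for c in cols))
    return _typedmap(f, nt)
end

# Function barrier: nt has a fully concrete NamedTuple type here.
_typedmap(f, nt::NamedTuple) =
    [f(map(v -> v[i], nt)) for i in 1:length(first(nt))]

df = DataFrame(x1 = 1:3, x2 = 4:6, other = ["a", "b", "c"])
typedmap(df, [:x1, :x2]) do r
    r.x1 + r.x2
end   # [5, 7, 9]
```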

@bkamins
Member Author

bkamins commented Dec 15, 2018

Here is my current thinking (it would solve many of the challenges we face in one shot):

  • add typed field to DataFrame structure and typeddirty::Bool field
  • this would be either a) a NamedTuple of AbstractVectors or b) a TypedTable (this would introduce a dependency, but we would gain many things for free); either option would reuse the stored columns;
  • the typed field would be materialized lazily, only on read; it would be materialized only when typeddirty is true, and after materialization typeddirty would be set to false;
  • we set typeddirty to true on any operation that changes column types (column addition/deletion/disallowmissing etc., or setindex! on a column that changes its type) - we have full control over all methods that are able to do this modification;
  • probably some locking should be done in order to make it thread safe.

In this way we keep a normal DataFrame with all its benefits, but "for free" we get a typed version of it when it is needed.

And for example broadcast, map, by etc. could use this typed field. The big benefit is that we materialize typed only once if the columns do not change (so we avoid compilation overhead on every call). Of course the typed field would have a volatile type, so we would need a function barrier inside every function that wants to take advantage of this field.
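A rough illustration of the proposed lazy typed field (names follow the proposal above; the locking point is omitted, and this is a sketch, not the actual implementation):

```julia
# Illustrative sketch of the proposed lazy `typed` field.
mutable struct LazyTypedDF
    columns::Dict{Symbol, AbstractVector}
    colnames::Vector{Symbol}
    typed::Any          # cached NamedTuple of AbstractVectors, or nothing
    typeddirty::Bool
end

LazyTypedDF(cols::Pair{Symbol, <:AbstractVector}...) =
    LazyTypedDF(Dict(cols...), [first(c) for c in cols], nothing, true)

# Materialize the NamedTuple only when stale; reuse the stored columns.
function gettyped(df::LazyTypedDF)
    if df.typeddirty
        df.typed = NamedTuple{Tuple(df.colnames)}(
            Tuple(df.columns[n] for n in df.colnames))
        df.typeddirty = false
    end
    return df.typed
end

# Any column-type-changing operation just flags the cache as dirty.
function setcolumn!(df::LazyTypedDF, name::Symbol, v::AbstractVector)
    df.columns[name] = v
    name in df.colnames || push!(df.colnames, name)
    df.typeddirty = true
    return df
end

df = LazyTypedDF(:a => [1, 2], :b => [3.0, 4.0])
nt = gettyped(df)               # materialized once: (a = [1, 2], b = [3.0, 4.0])
setcolumn!(df, :a, [1.0, 2.0])  # flags the cache; next gettyped re-materializes
```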

Notice that then we would not need to write:

map(df, [:x1, :x2]) do x 

but simply:

map(df) do x 

would be enough and efficient.

If df had thousands of columns then you would write

map(df[[:x1,:x2]]) do x 

at the cost of generating typed for this small data frame, but saving on compilation of a function taking a possibly large NamedTuple. And if you wanted to do several operations on a subset, you would write df2 = df[[:x1, :x2]], and then df2 would cache its typed field across the repeated calls.

Do you see any disadvantages of this approach?

@nalimilan
Member

Interesting. I'm not sure that would work if we only stored a field: it would have to be part of the type, with e.g. DataFrame{Nothing} for the type-unstable default. Maybe we could simply have a type parameter reflecting the type of the columns field, which could either be a named tuple (type stable) or a kind of named vector (type-unstable, with eltype AbstractVector).
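A minimal sketch of the type-parameter idea (the name `DataFrame2` is hypothetical): the parameter reflects the type of the columns field, so the named-tuple variant is type-stable while the vector-of-columns default is not.

```julia
# Illustrative sketch, not the actual design.
struct DataFrame2{C}
    columns::C
    names::Vector{Symbol}
end

# Type-unstable default: the eltype of columns is the abstract AbstractVector.
untyped = DataFrame2(AbstractVector[[1, 2], ["a", "b"]], [:x, :y])

# Type-stable variant: columns stored in a concrete NamedTuple.
typed = DataFrame2((x = [1, 2], y = ["a", "b"]), [:x, :y])
```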

But this discussion is off-topic here. Maybe continue it at #1335?

@nalimilan
Member

nalimilan commented Dec 26, 2018

x-ref #1019, #1513

@bkamins bkamins mentioned this pull request Jan 15, 2019
31 tasks
@bkamins
Member Author

bkamins commented Jun 8, 2019

Closing this, as we have a new approach to data frame broadcasting now.

@bkamins bkamins closed this Jun 8, 2019
@bkamins bkamins deleted the dataframe_broadcast branch June 8, 2019 16:55