broadcast in DataFrames.jl #1643

Closed
67 changes: 67 additions & 0 deletions docs/src/man/getting_started.md
@@ -471,3 +471,70 @@ CSV.write(output, df)
```

The behavior of CSV functions can be adapted via keyword arguments. For more information, see `?CSV.read` and `?CSV.write`, or check out the online [CSV.jl documentation](https://juliadata.github.io/CSV.jl/stable/).

## Broadcasting
Member

I'm not sure we should mention this here given that this syntax is inefficient. We should first show how to achieve this efficiently. Broadcasting over columns is also very useful.

Member Author

My point with this PR is exactly to decide what kind of broadcasting we want to provide by default. We essentially have four options:

  • treat a DataFrame as a scalar, which is what Ref(df) gives;
  • treat a DataFrame as a collection of rows, which is what eachrow(df) gives;
  • treat a DataFrame as a collection of columns, which is what eachcol(df, false) gives;
  • treat a DataFrame as a two-dimensional object, like an array; Matrix(df) is an inefficient way to get this, and we would probably need a more efficient implementation if we decided on it.

We have to select one as the default. The other three have to be requested explicitly by the user when broadcasting.

So the question is: how do we want to treat an AbstractDataFrame in broadcasting by default (when it is not wrapped in some other object)?
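
For illustration, the four options can be spelled out explicitly, so that none of them relies on whatever default is eventually chosen. This is only a sketch, not part of the PR: it assumes the DataFrames.jl package is installed, and it uses the one-argument `eachcol(df)`, which is an assumption about the installed release (the comment above uses `eachcol(df, false)`).

```julia
using DataFrames

df = DataFrame(A = 1:3, B = 3:-1:1)

# 1. Scalar: Ref(df) keeps the whole table as a single broadcast element.
dims = size.(Ref(df), [1, 2])        # [3, 2]

# 2. Collection of rows: each element is a DataFrameRow.
row_sums = sum.(eachrow(df))         # [4, 4, 4]

# 3. Collection of columns: each element is a column vector.
#    Assumption: one-argument eachcol(df); the comment above uses eachcol(df, false).
col_sums = sum.(eachcol(df))         # [6, 6]

# 4. Two-dimensional object: Matrix(df) copies the data into a plain array.
doubled = 2 .* Matrix(df)            # [2 6; 4 4; 6 2]
```

Whichever interpretation becomes the default, the other three remain available through these wrappers.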


When you broadcast a function over an `AbstractDataFrame`, it is treated as an `AbstractVector` of rows, and each row is represented as a `DataFrameRow`:

```jldoctest dataframe
julia> df = DataFrame(A = 1:3, B = 3:-1:1)
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 1     │

julia> identity.(df)
3-element Array{DataFrameRow{DataFrame},1}:
DataFrameRow (row 1)
A 1
B 3
DataFrameRow (row 2)
A 2
B 2
DataFrameRow (row 3)
A 3
B 1

julia> copy.(df)
3-element Array{NamedTuple{(:A, :B),Tuple{Int64,Int64}},1}:
(A = 1, B = 3)
(A = 2, B = 2)
(A = 3, B = 1)
```

In the last example we used the `copy` function, which converts a `DataFrameRow` into a `NamedTuple`.
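
The same conversion can be written with explicit row iteration, which does not depend on how a bare data frame is treated in broadcasting. This is a minimal sketch, assuming the DataFrames.jl package is installed and that `copy` on a `DataFrameRow` returns a `NamedTuple`, as described above.

```julia
using DataFrames

df = DataFrame(A = 1:3, B = 3:-1:1)

# Explicit row iteration: no default broadcasting rule for df is needed.
tuples = copy.(eachrow(df))

# Each element is a NamedTuple such as (A = 1, B = 3), detached from df.
first_row = tuples[1]
```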

A `DataFrameRow` is treated as a collection of the values stored in its columns, so you can apply standard collection functions to it and also broadcast functions over it to get a vector:

```jldoctest dataframe
julia> df = DataFrame(A = 1:3, B = 3:-1:1)
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 1     │

julia> dfr = df[1, :]
DataFrameRow (row 1)
A 1
B 3

julia> sum(dfr)
4

julia> string.(dfr)
2-element Array{String,1}:
"1"
"3"

julia> (row -> string.(row)).(df)
Member

I don't think this syntax is recommended (and it's quite obscure for newcomers).

Member Author

OK - I will remove it.

3-element Array{Array{String,1},1}:
["1", "3"]
["2", "2"]
["3", "1"]
```
2 changes: 2 additions & 0 deletions src/abstractdataframe/abstractdataframe.jl
@@ -223,6 +223,8 @@ Base.axes(df::AbstractDataFrame, i::Integer) = axes(df)[i]

Base.ndims(::AbstractDataFrame) = 2

Base.broadcastable(adf::AbstractDataFrame) = eachrow(adf)

Base.getproperty(df::AbstractDataFrame, col_ind::Symbol) = getindex(df, col_ind)
Base.setproperty!(df::AbstractDataFrame, col_ind::Symbol, x) = setindex!(df, x, col_ind)
# Private fields are never exposed since they can conflict with column names
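
To see what the one-line `Base.broadcastable` definition above does, here is a self-contained sketch using only Julia Base. `ToyTable` and `toyrows` are hypothetical names invented for this illustration and are not part of DataFrames.jl; the point is only that returning a vector of rows from `Base.broadcastable` makes a dotted call operate row by row.

```julia
# Hypothetical toy type, not part of DataFrames.jl.
struct ToyTable
    columns::NamedTuple
end

# Collect the rows of a ToyTable as a vector of NamedTuples.
toyrows(t::ToyTable) =
    [map(col -> col[i], t.columns) for i in 1:length(first(t.columns))]

# Opt in to "collection of rows" semantics for broadcasting.
Base.broadcastable(t::ToyTable) = toyrows(t)

t = ToyTable((x = 1:4, y = 5:8, z = 9:12))

# sum is applied to each row in turn.
row_sums = sum.(t)
```

With this definition `sum.(t)` gives one sum per row, which mirrors the first `@test` in the test file added by this PR.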
9 changes: 9 additions & 0 deletions test/broadcasting.jl
@@ -0,0 +1,9 @@
module TestDataFrame
Member

Can you put this in an existing file?

Member Author

OK - I will move it to indexing.jl.

using DataFrames, Test

@testset "broadcast DataFrame & DataFrameRow" begin
    df = DataFrame(x=1:4, y=5:8, z=9:12)
    @test sum.(df) == [15, 18, 21, 24]
    @test (row -> row .+ 1).(df) == [i .+ [0, 4, 8] for i in 2:5]
end
end
1 change: 1 addition & 0 deletions test/runtests.jl
@@ -27,6 +27,7 @@ my_tests = ["utils.jl",
"tables.jl",
"tabletraits.jl",
"indexing.jl",
"broadcasting.jl",
"deprecated.jl"]

println("Running tests:")