broadcast in DataFrames.jl #1643
@@ -471,3 +471,70 @@ CSV.write(output, df)
```

The behavior of CSV functions can be adapted via keyword arguments. For more information, see `?CSV.read` and `?CSV.write`, or checkout the online [CSV.jl documentation](https://juliadata.github.io/CSV.jl/stable/).

## Broadcasting

When you broadcast a function over an `AbstractDataFrame` it is treated as an `AbstractVector` of rows and each row is represented as a `DataFrameRow`:

```jldoctest dataframe
julia> df = DataFrame(A = 1:3, B = 3:-1:1)
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 1     │

julia> identity.(df)
3-element Array{DataFrameRow{DataFrame},1}:
DataFrameRow (row 1)
A 1
B 3
DataFrameRow (row 2)
A 2
B 2
DataFrameRow (row 3)
A 3
B 1

julia> copy.(df)
3-element Array{NamedTuple{(:A, :B),Tuple{Int64,Int64}},1}:
(A = 1, B = 3)
(A = 2, B = 2)
(A = 3, B = 1)
```

In the last example we used the `copy` function which transforms a `DataFrameRow` into a `NamedTuple`.

A `DataFrameRow` is treated as a collection of values stored in its columns so you can apply to it standard functions that accept collections and also broadcast functions over it to get a vector:

```jldoctest dataframe
julia> df = DataFrame(A = 1:3, B = 3:-1:1)
3×2 DataFrame
│ Row │ A     │ B     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 3     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 1     │

julia> dfr = df[1, :]
DataFrameRow (row 1)
A 1
B 3

julia> sum(dfr)
4

julia> string.(dfr)
2-element Array{String,1}:
"1"
"3"

julia> (row -> string.(row)).(df)
3-element Array{Array{String,1},1}:
["1", "3"]
["2", "2"]
["3", "1"]
```

Review comment on the line `julia> (row -> string.(row)).(df)`:
I don't think this syntax is recommended (and it's quite obscure for newcomers).

Reply:
OK - I will remove it.

@@ -0,0 +1,9 @@
module TestDataFrame

using DataFrames, Test

@testset "broadcast DataFrame & DataFrameRow" begin
    df = DataFrame(x=1:4, y=5:8, z=9:12)
    @test sum.(df) == [15, 18, 21, 24]
    @test ((row -> row .+ 1)).(df) == [i .+ [0, 4, 8] for i in 2:5]
end
end

Review comment on the line `module TestDataFrame`:
Can you put this in an existing file?

Reply:
OK - I will move it to indexing.jl.

Review comment:
I'm not sure we should mention this here, given that this syntax is inefficient. We should first show how to achieve this efficiently. Broadcasting over columns is also very useful.

Reply:
My point with this PR is exactly to decide what kind of broadcasting we want to provide by default. We essentially have four options for how a `DataFrame` is treated for broadcasting purposes:

- `DataFrame` is treated as a scalar, which is what `Ref(df)` does;
- `DataFrame` is treated as a collection of rows, which is what `eachrow(df)` does;
- `DataFrame` is treated as a collection of columns, which is what `eachcol(df, false)` does;
- `DataFrame` is treated as a two-dimensional object, like an array; an inefficient way to do this would be `Matrix(df)`, and a more efficient implementation would probably be needed if we decided on it.

We have to select one. The other three have to be keyed in by the user manually when broadcasting.

So the question is: how do we want to treat an `AbstractDataFrame` in broadcasting by default (when it is not wrapped by some other object)?
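
For reference, the four options in this reply can be written out explicitly. This is only a minimal sketch, not the implementation under discussion in the PR. It assumes a recent DataFrames.jl release, where (to my understanding) `eachcol(df)` iterates over plain column vectors, which is the role `eachcol(df, false)` plays in the discussion; the data frame is the one from the test case, and the variable names are illustrative.

```julia
using DataFrames

df = DataFrame(x = 1:4, y = 5:8, z = 9:12)

# 1. Scalar: Ref(df) shields the data frame from broadcasting,
#    so only the row indices 1:2 are broadcast over.
as_scalar = getindex.(Ref(df), 1:2, :x)

# 2. Collection of rows: the function is applied to each DataFrameRow.
by_row = sum.(eachrow(df))

# 3. Collection of columns: the function is applied to each column vector.
#    (Assumption: eachcol(df) yields the columns themselves, without names.)
by_col = sum.(eachcol(df))

# 4. Two-dimensional object: convert to a Matrix first (this copies the data),
#    then broadcast elementwise.
as_matrix = Matrix(df) .+ 1
```

Whichever of these becomes the default, the other three remain available by wrapping the data frame explicitly in this way.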