Creating a column via broadcasting in an empty data frame #1889

Closed
grahamgill opened this issue Jul 18, 2019 · 7 comments · Fixed by #1890

Comments

@grahamgill

grahamgill commented Jul 18, 2019

In DataFrames 0.18.4, I can for example do the following:
df[cols] = 0.0
where df is a data frame and cols is a vector of Symbols naming columns not in df.

As a result, df gets new columns of type Float64 containing 0.0 entries in all rows. If df is an empty data frame, i.e. with 0 rows, it still gets the new columns with type Float64, but of course having 0 rows.

This seems logical: creating a column through broadcasting to an empty data frame creates an empty column. In particular it means I don't have to test separately for the special case of an empty data frame.

In DataFrames 0.19.0 the same code gives me an error when df is an empty data frame:
ArgumentError: creating a column via broadcasting is not allowed on empty data frames
from broadcasting.jl in the Base.copyto! method.

df[!,col] .= 0.0 works in 0.19.0, where df is non-empty and col identifies a single new column: a new column col is created with 0.0 entries in all rows. Can this also be made to work for empty df, similar to the behaviour of 0.18.4?

NOTE: by "empty" here I mean a data frame with columns but with no rows. The behaviour for a data frame that is completely empty, with no columns or rows, is not consistent with what I've described, as discussed a few comments further down.

@bkamins
Member

bkamins commented Jul 18, 2019

The problem is that in general you cannot know what the result of a broadcasted expression would be if you never execute it. That is why this is disallowed.

However, now that I think of it, we could allow it in one special case, when bc meets the following conditions:

bc isa Base.Broadcast.Broadcasted{<:Base.Broadcast.AbstractArrayStyle{0}} &&
    bc.f === identity && bc.args isa Tuple{Any} && Base.Broadcast.isflat(bc)

(as this is actually the most common case you mention).
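As a rough illustration of what this condition accepts, the check can be wrapped in a small helper and tried on lazily built broadcast objects. The helper name `is_scalar_identity` is hypothetical and not part of DataFrames, and `Base.Broadcast.isflat` is an internal Base function, so treat this as a sketch rather than the code that ends up in the PR.

```julia
using Base.Broadcast: Broadcasted, AbstractArrayStyle, broadcasted, isflat

# Hypothetical helper (not DataFrames API): true when `bc` only passes a single
# 0-dimensional value through `identity`, as the right-hand side of `x .= 0.0` does.
is_scalar_identity(bc) =
    bc isa Broadcasted{<:AbstractArrayStyle{0}} &&
    bc.f === identity && bc.args isa Tuple{Any} && isflat(bc)

scalar_bc = broadcasted(identity, 0.0)   # lazy form of a plain scalar assignment
sum_bc = broadcasted(+, 1, 2)            # still 0-dimensional, but a real computation

is_scalar_identity(scalar_bc)   # true
is_scalar_identity(sum_bc)      # false, because `bc.f` is `+`
eltype(scalar_bc.args[1])       # Float64
```

In the accepted case the element type of the new column can be read off the single argument without executing anything, which is what makes a zero-row data frame unproblematic there.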

I will make a PR and let us wait for the feedback.

@grahamgill
Author

Thanks @bkamins for the explanation of the general difficulty and for the PR on this particular use case. Julia newbie here, still a lot to learn.

@itsdfish

Hi. We just had a discussion on Discourse and I found this issue through the linked PR. I just want to clarify a point that the original poster made:

> If df is an empty data frame, i.e. with 0 rows, it still gets the new columns with type Float64, but of course having 0 rows.

Unless I misunderstood, this does not appear to be the case:

(v1.1) pkg> st DataFrames
    Status `~/.julia/environments/v1.1/Project.toml`
  [a93c6f00] DataFrames v0.18.4
  [2913bbd2] StatsBase v0.30.0

julia> using DataFrames

julia> df = DataFrame()
0×0 DataFrame

julia> df[:col] = 0.0
0.0

julia> df
1×1 DataFrame
│ Row │ col     │
│     │ Float64 │
├─────┼─────────┤
│ 1   │ 0.0     │

This is the behavior I was hoping could be restored if it fits within the other goals of DataFrames.

@grahamgill
Author

grahamgill commented Jul 19, 2019

That's interesting, I hadn't encountered that. I should have been clearer about what I meant by "empty data frame". My use case was a data frame with columns but with 0 rows, the result of filtering an existing data frame with no rows matching the filter.

Then it seems there's an inconsistency in 0.18.4 with respect to a completely empty data frame, because:

julia> using DataFrames

julia> df2 = DataFrame(A=Int64[])
0×1 DataFrame

julia> dump(df2)
DataFrame
  columns: Array{AbstractArray{T,1} where T}((1,))
    1: Array{Int64}((0,)) Int64[]
...

julia> df2[:B] = 0.0
0.0

julia> df2
0×2 DataFrame


julia> dump(df2)
DataFrame
  columns: Array{AbstractArray{T,1} where T}((2,))
    1: Array{Int64}((0,)) Int64[]
    2: Array{Float64}((0,)) Float64[]
...

@itsdfish

itsdfish commented Jul 19, 2019

Very interesting. Thanks for clarifying. I was not aware of this divergent behavior. I would have expected your code to produce an error. You can also create a new empty column like so:

df = DataFrame()
df[:C] = Float64[]

which is what I would expect.

@grahamgill
Author

Sure, yes, thanks. The use case, however, was:

  1. partitioning a long format data frame according to some filters,
  2. unstack()ing the resulting data frames to wide format,
  3. then adding some additional columns containing a default scalar value via broadcast.

The filters in step 1 can plausibly return no rows for some partitions. The 0.18.4 behaviour for creating a column via broadcast in a data frame with columns but no rows does "the right thing" with respect to step 3, minimising the amount of special case checking required.
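For reference, step 3 can also be written without broadcasting, which avoids the special case altogether. This is a minimal sketch assuming DataFrames 0.19 or later (for the `df[!, col]` syntax); the helper `add_default_columns!` and the frame `wide` are hypothetical names used only for illustration.

```julia
using DataFrames

# Hypothetical helper: add each name in `cols` as a new column holding `value`
# in every row. `fill(value, nrow(df))` is a zero-length vector when `df` has no rows.
function add_default_columns!(df::DataFrame, cols, value)
    for col in cols
        df[!, col] = fill(value, nrow(df))
    end
    return df
end

# Stands in for an unstacked partition whose filter matched no rows.
wide = DataFrame(id = Int[], x = Float64[])
add_default_columns!(wide, [:y, :z], 0.0)

size(wide)       # (0, 4)
eltype(wide.y)   # Float64
```

The same call works unchanged on a partition that does have rows, so no separate branch is needed for the empty case.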

@bkamins
Member

bkamins commented Jul 19, 2019

Can you please check the code and do some tests of #1890 so that we are sure we have the functionality you expected?
