Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow scalar broadcasting into an empty data frame #1890

Merged
merged 12 commits into from
Jul 25, 2019

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Jul 18, 2019

Fix #1889
@grahamgill - in tests you can see what will be allowed and what will be still disallowed.

@bkamins
Copy link
Member Author

bkamins commented Jul 19, 2019

I have added new broadcasting rules for an empty DataFrame now. Please comment if this is what was expected. Tests are appreciated as the requested functionality is tricky (also I have updated the docs so you can check if what we have is what you expected).

@itsdfish
Copy link

As far as I can tell, everything looks great. Thanks for taking the time to add those changes! I think this makes DataFrames more flexible.

@bkamins
Copy link
Member Author

bkamins commented Jul 19, 2019

@nalimilan + @quinnj : you might find the last commit interesting, see e693124

I also checked that col_tmp when returned from this function passes col_tmp isa AbstractVector test, so something bad happens when the function is compiled probably (most likely type inference). The problem arises when col_tmp is a categorical([1,2,missing]) vector. It is fully reproducible by checking out this repo without the last commit.

The problem is present in Julia 1.0 only. Do you have any idea what could cause it?

CC @vtjnash - it seems to me like a bug in Julia 1.0.

@bkamins
Copy link
Member Author

bkamins commented Jul 19, 2019

Just as a reference this is a MWE of what was not working under Julia 1.0:

using DataFrames
v = categorical([1,2, missing])
df = DataFrame(reshape(1.5:15.5, (3,5)))
df[!, :c1] .= v

@bkamins
Copy link
Member Author

bkamins commented Jul 21, 2019

Fixes #1893.
An additional fix for df[1,1] .=v, df[1,:col] .=v and df[CartesianIndex(1,1)] .=v as it was not consistent with Base.

@bkamins
Copy link
Member Author

bkamins commented Jul 22, 2019

@nalimilan This should be good for a review. I have gotten rid of "speculative code (the possible bug in Base was worked around by a cleaner code). These are the only problems reported with 0.19 that needed fixing and are:

  • treat DataFrame() as an object with undefined number of rows in broadcasting and create as many rows as the right hand side specifies; in particular one row if it is a scalar (this is consistent with the old behavior of DataFrames.jl and current behavior in non-broadcasting df[!, :col] = v which does not copy as opposed to broadcasting which copies).
  • allow 0-row and multi-column DataFrame to be broadcasted into using a scalar like df[!, :col] .= scalar by creating a new column with 0 rows
  • treat df[1,1] .= v as broadcasting into the contents of the cell (fixed, this is consistent with Base)

@nalimilan
Copy link
Member

* treat `DataFrame()` as an object with undefined number of rows in broadcasting and create as many rows as the right hand side specifies; in particular one row if it is a scalar (this is consistent with the old behavior of DataFrames.jl and current behavior in non-broadcasting `df[!, :col] = v` which does not copy as opposed to broadcasting which copies).

I'm not sure that's a great idea. That would introduce an inconsistency between 0-rows data frames depending on whether they have 0 columns or more. I think it's fine to allow for some flexibility for 0-column data frames in cases that would otherwise throw errors; but introducing inconsistencies is problematic. And indeed it makes the description of the rules more complex.

* allow 0-row and multi-column `DataFrame` to be broadcasted into using a scalar like `df[!, :col] .= scalar` by creating a new column with 0 rows

This sounds OK as long as 0-column and multi-column data frames behave consistently.

* treat `df[1,1] .= v` as broadcasting into the contents of the cell (fixed, this is consistent with Base)

+1

@bkamins
Copy link
Member Author

bkamins commented Jul 22, 2019

@nalimilan - your comments are along my initial thoughts. Let me give an extended explanation of the reason for the change and I would appreciate your feedback.

The problem is that: 1) users want what I have implemented here; 2) currently in setindex! behaves this way. What I mean is that you can do:

df = DataFrame()
df[!, :a] = v

and v can have any number of rows you like. And this first column then defines the nrow for a DataFrame afterwards. So - in setindex! actually we treat 0-column DataFrame differently than a DataFrame with some columns. And what users want is to keep this behavior also for broadcasted assignment (in other words - the rules are longer, but are consistent between setindex! and broadcasted assignment).

Also in the past df.col = 1 worked for 0 column data frames and produced one row. Now we disallow this and this is breaking (this is not end of the world but still ...).

@nalimilan
Copy link
Member

  1. currently in setindex! behaves this way. What I mean is that you can do:
df = DataFrame()
df[!, :a] = v

and v can have any number of rows you like. And this first column then defines the nrow for a DataFrame afterwards. So - in setindex! actually we treat 0-column DataFrame differently than a DataFrame with some columns. And what users want is to keep this behavior also for broadcasted assignment (in other words - the rules are longer, but are consistent between setindex! and broadcasted assignment).

AFAICT that's a very different thing, since it doesn't conflict with the general behavior when there are columns: we just allow something that would otherwise fail (because the length of the new vector isn't 0). But this PR proposes to introduce an inconsistency: broadcasting into a non-existent column in a zero-row data frame would give a zero-row data frame if it has columns, but a data frame with length(v) rows otherwise. That sounds really problematic to me.

Also in the past df.col = 1 worked for 0 column data frames and produced one row. Now we disallow this and this is breaking (this is not end of the world but still ...).

Yes, that's a bit annoying. But df.col = [1] is quite short and efficient. Maybe we can point to that as a replacement (and avoid throwing an error)?

@bkamins
Copy link
Member Author

bkamins commented Jul 22, 2019

But df.col = [1] is quite short and efficient

I agree (and that is why initially I wanted to create 0 rows always even for a DataFrame with 0 columns).

@itsdfish + @grahamgill : can you please comment how much you want 0-column data frame to be special and why? We have a tension here between consistency (which was our original design intention) and usability (this is what you wanted).

@itsdfish
Copy link

itsdfish commented Jul 22, 2019

@bkamins, from time to time, I find myself adding scalars to empty DataFrames when restructuring data. Wrapping scalars with [] isn't too bad. The main downside is that it cannot handle cases such as df[!,:col] .=v`, where v could be a vector or scalar. Nonetheless, I would recommend doing whatever provides a simple and consistent interface, even if that means dropping scalar support.

@bkamins
Copy link
Member Author

bkamins commented Jul 23, 2019

@itsdfish thank you for the feedback.

I think we can wait one or two days for other feedback and then decide. If there are no new voices I will implement the following rule:

when broadcasting into a data frame df the target number of rows is always nrow(df)

@nalimilan I guess this is the invariant you wanted - right?. This will mean that we cannot broadcast vectors of positive length into 0-row data frames (no matter if they have columns or not) and we allow to broadcast scalars taking their type to create a column.
Actually we could allow broadcasting 0-length arrays also (Base allows this).

For the setindex! the rule is simpler:

when using df.col = v you always get the vector of length(v) as a result; this has to be equal to nrow(df) if ncol(df)>0

@bkamins
Copy link
Member Author

bkamins commented Jul 23, 2019

@nalimilan - I have slept over this and I am still unsure what is best 😢 (I include both setindex! and broadcasting cases as I think it is relevant what we do in both). Here are the options. Can you please comment on them? Thank you

operation current this PR @nalimilan
df = DataFrame(); df[!, :a] .= 1; fails new [1] col new Int[] col
df = DataFrame(); df[!, :a] .= sin.(1); fails new [sin(1)] col new Float64[] col
df = DataFrame(); df[!, :a] .= [1,2]; fails new [1,2] col new Int[] col
df = DataFrame(); df[!, :a] .= []; fails fails - I can make it work new [] col
df = DataFrame(); df[!, :a] .= sin.([]); fails fails - it must fail new [] col
df = DataFrame(x=[]); df[!, :a] .= 1; fails new Int[] col new Int[] col
df = DataFrame(x=[]); df[!, :a] .= sin.(1); fails fails - I can make it work new Float64[] col
df = DataFrame(x=[]); df[!, :a] .= [1,2]; fails fails new Int[] col
df = DataFrame(x=[]); df[!, :a] .= []; fails fails - I can make it work new [] col
df = DataFrame(); df[!, :a] .= sin.([]); fails fails - and must fail new [] col
df = DataFrame(); df[!, :a] = 1; fails fails fails
df = DataFrame(); df[!, :a] = [1,2]; works works works
df = DataFrame(); df[!, :a] = []; works works works
df = DataFrame(x=[]); df[!, :a] = 1; fails fails fails
df = DataFrame(x=[]); df[!, :a] = [1,2]; fails fails fails
df = DataFrame(x=[]); df[!, :a] = []; works works works

@nalimilan
Copy link
Member

I'd suggest df = DataFrame(); df[!, :a] .= v and df = DataFrame(x=[]); df[!, :a] .= v to be consistent with x = []; x .= v, i.e. create an empty column. There wouldn't be any difference between DataFrame() and DataFrame(x=[]) -- the idea being that the former should support all cases that the latter supports (and possibly a few more for convenience, but never less). I've filled the table.

I'd also be OK with throwing errors when attempting to use .= to broadcast to 0 rows (i.e. the current status), as it's probably not very useful and possibly confusing, and as it can be added later if needed.

@bkamins
Copy link
Member Author

bkamins commented Jul 23, 2019

@nalimilan Thank you for a prompt response. You are a robot 😄. Here are my comments to your suggestions (retaining your proposal, but showing the problematic cases):

df = DataFrame(x=[]); df[!, :a] .= [1,2];

must fail (it is a DimensionMismatch), which means that also under your proposal:

df = DataFrame(); df[!, :a] .= [1,2];

also should fail.

So essentially we have a choice:

  • disallow broadcasting into 0-rows data frame (current behavior)
  • allow broadcasting into 0-rows data frame (no matter if it has or does not have columns) and create a 0-row column as long as RHS has length 1 and up to 1 dimension.

The last condition (length and dimensionality) is relevant as this is what Base currently allows you to do:

julia> x = []
0-element Array{Any,1}

julia> x .= rand()
0-element Array{Any,1}

julia> x
0-element Array{Any,1}

julia> x .= rand(1)
0-element Array{Any,1}

julia> x
0-element Array{Any,1}

julia> x .= rand(1, 1)
ERROR: DimensionMismatch("cannot broadcast array to have fewer dimensions")

@bkamins
Copy link
Member Author

bkamins commented Jul 23, 2019

@itsdfish + @grahamgill:
we have distilled down the choice to two options as noted above. In short: 0-rows is not allowed or it always creates 0-rows vector if this is what Base would allow.

I am leaning towards the second option (this is consistent with Base).

You should get all you wanted initially except. For 0-column DataFrame:

  • df[!, :col] .= [1,2] will fail; you should write df[!, :col] = [1,2] to be non-copying and df[!, :col] = copy([1,2]) to copy;
  • df[!, :col] .= [1] will create a 0-length vector (not 1-length vector); the recommended syntax to add the vector you wanted is as above;

(also @nalimilan - just please confirm that this is exactly what we want 😄 as this is tricky, because it will lead to setindex! and broadcasted assignment inconsistency)

@bkamins
Copy link
Member Author

bkamins commented Jul 23, 2019

@nalimilan - just to expand on my last comment. DataFrame() object is IMO more common than 0-row DataFrame with some columns. Actually a behavior currently implemented (allowing you do to df[!, :col] .= [1] creating a copy of [1] is more convenient than creating a 0-row new column) so we could consider having 0-column DataFrame to behave specially.

@itsdfish
Copy link

@bkamins, I'll defer to your recommendation. That looks good.

@grahamgill
Copy link

But df.col = [1] is quite short and efficient

I agree (and that is why initially I wanted to create 0 rows always even for a DataFrame with 0 columns).

@itsdfish + @grahamgill : can you please comment how much you want 0-column data frame to be special and why? We have a tension here between consistency (which was our original design intention) and usability (this is what you wanted).

If a new column is created by broadcasting a scalar into a data frame with existing columns, then the new column necessarily has as many rows as the others. If that can also work when there are 0 rows, that's what interests me, because it eliminates a lot of edge case checking. If the data frame has both 0 columns and 0 rows, then I'd prefer consistency, with the resulting data frame also having 0 rows.

@grahamgill
Copy link

Thanks for the summary. I prefer the second option. I didn't realise I was stirring the pot so much with my initial question in #1889

@itsdfish + @grahamgill:
we have distilled down the choice to two options as noted above. In short: 0-rows is not allowed or it always creates 0-rows vector if this is what Base would allow.

I am leaning towards the second option (this is consistent with Base).

You should get all you wanted initially except. For 0-column DataFrame:

  • df[!, :col] .= [1,2] will fail; you should write df[!, :col] = [1,2] to be non-copying and df[!, :col] = copy([1,2]) to copy;
  • df[!, :col] .= [1] will create a 0-length vector (not 1-length vector); the recommended syntax to add the vector you wanted is as above;

(also @nalimilan - just please confirm that this is exactly what we want 😄 as this is tricky, because it will lead to setindex! and broadcasted assignment inconsistency)

@bkamins
Copy link
Member Author

bkamins commented Jul 23, 2019

Thank you both for the feedback. Really appreciated.

@nalimilan - so we go your way I think broadcasting to isempty(df) always produces 0 rows. It will actually make this PR simpler (I do not have to mess with tricky corner cases). Just please confirm the corner cases I highlighted before I change this PR.

@nalimilan
Copy link
Member

Sounds good!

@bkamins
Copy link
Member Author

bkamins commented Jul 23, 2019

I was talking with @mbauman about this and what I will implement is:

  1. assume that if isempty(df) == true then we always treat it as a 0 number of rows;
  2. I will develop it in a way that it is 1-to-1 with Base (actually the behavior in Base might change here in the future; I will make the code in a way that will automatically keep track of Base behavior);

Thanks to all who contributed to the discussion.

@bkamins
Copy link
Member Author

bkamins commented Jul 23, 2019

Broadcasting rules: Round 4. (and I hope the last round).

What we do:

  1. broadcasting always takes nrow(df) before the operation to determine the number of rows to create
  2. we use exactly the same rules as Base what we accept or what we do not accept (and if something changes there we will synchronize automatically)

(and as a side effect - we need less code to express this)

@nalimilan - this should be good for a final review.

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
docs/src/lib/indexing.md Outdated Show resolved Hide resolved
src/other/broadcasting.jl Outdated Show resolved Hide resolved
test/broadcasting.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Jul 24, 2019

CI passes (we have only Coverage decrease as usual 😞)

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small suggestion.

docs/src/lib/indexing.md Outdated Show resolved Hide resolved
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Jul 25, 2019

Thank you. I am going to merge it today and make a release (unless you see that we should wait - e.g. for #1887).

@nalimilan
Copy link
Member

Fine with me.

@bkamins bkamins merged commit ed25099 into JuliaData:master Jul 25, 2019
@bkamins bkamins deleted the more_flexible_broadcast branch July 25, 2019 21:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Creating a column via broadcasting in an empty data frame
4 participants