Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues using the df[!, col] syntax during broadcasts #1959

Closed
eperim opened this issue Sep 18, 2019 · 11 comments
Closed

Issues using the df[!, col] syntax during broadcasts #1959

eperim opened this issue Sep 18, 2019 · 11 comments

Comments

@eperim
Copy link

eperim commented Sep 18, 2019

When updating my code to comply with v0.19, I was changing the df[col] calls to df[!, col] as instructed. However, that led to

MethodError: no method matching length(::DataFrames.LazyNewColDataFrame{Symbol})

while it works fine if I use df[:, col] instead.

I don't have a MWE for this, because it happens somewhere deep in a stack containing proprietary packages. However, it seems to me that the ! syntax is creating some intermediate types under the hood for which not all properties are defined, and that can lead to issues when doing broadcasts over columns.

I'd suggest that, at least, the deprecation warning be fixed to suggest the safe : syntax instead of the unsafe ! one.

@eperim
Copy link
Author

eperim commented Sep 18, 2019

@oxinabox

@oxinabox
Copy link
Contributor

I'd suggest that, at least, the deprecation warning be fixed to suggest the safe : syntax instead of the unsafe ! one.

Problem is they do do different things. It might be nice if it suggested both.
Mostly people should use the : one, but the ! ione is supposed to be identical (including unsafety) to df[col]

But also there must be some bug with !

@eperim
Copy link
Author

eperim commented Sep 18, 2019

It doesn't seem to be exactly identical for me. Calling df[col] worked fine before, but calling df[!, col] leads to that error from the original message.

@bkamins
Copy link
Member

bkamins commented Sep 18, 2019

What you must be writing is something like:

df[!, col] .= some_operations

which replaces column col with a new column.

df[:, col] .= some_operations

is almost the same, but it is an in-place operation, so in some cases it will have a different effect (the most important is when you want to create a new column). The most important are the 1-2 functions up the stack. Chiefly - what function wanted to call length on LazyNewColDataFrame (in DataFrames.jl we never make such a call).

Use one or the other depending on what your objective is.

Now - to solve the problem - can you please share with me the stacktrace (you can cut off all proprietary methods and leave only the ones that are from Base or DataFrames.jl). The reason is that broadcasting API, as documented here does not require objects that are assigned to using broadcasted assignment to define length, they only need to define axes. This means that the operation you are trying to do calls length on an object that is not required to have length defined.

@bkamins
Copy link
Member

bkamins commented Sep 18, 2019

Also note that the old syntax df[col] .= something was an undocumented behavior. Actually now its equivalent is df[:, col].

I will add maybeview(::DataFrame, ::ColumnIndex) deprecation to fix a warning.

(but still length should never be called AFAICT so could you please pass this part of a stacktrace)

@bkamins
Copy link
Member

bkamins commented Sep 18, 2019

see #1960

@eperim
Copy link
Author

eperim commented Sep 18, 2019

Here's the stacktrace, if that helps:

  Got exception outside of a @test
  MethodError: no method matching length(::DataFrames.LazyNewColDataFrame{Symbol})
  Closest candidates are:
    length(!Matched::Core.SimpleVector) at essentials.jl:582
    length(!Matched::Base.MethodList) at reflection.jl:732
    length(!Matched::Core.MethodTable) at reflection.jl:806
    ...
  Stacktrace:
   [1] _similar_for(::UnitRange{Int64}, ::Type, ::DataFrames.LazyNewColDataFrame{Symbol}, ::Base.HasLength) at ./array.jl:532
   [2] _collect(::UnitRange{Int64}, ::DataFrames.LazyNewColDataFrame{Symbol}, ::Base.HasEltype, ::Base.HasLength) at ./array.jl:563
   [3] collect(::DataFrames.LazyNewColDataFrame{Symbol}) at ./array.jl:557
   [4] broadcastable(::DataFrames.LazyNewColDataFrame{Symbol}) at ./broadcast.jl:617
   [5] broadcasted at ./broadcast.jl:1171 [inlined]
   [6] (::DataFrame, ::typeof(getindex), ::Type{Arg{1}}, ::Tuple{}, ::Array{Float64,1}, ::Array{Float64,1}, ::DataFrame, ::Function, ::Symbol) at /Users/ericperim/.julia/packages/Nabla/D60dX/src/sensitivities/indexing.jl:14
   [7] (::typeof(getindex), ::Type{Arg{1}}, ::Tuple{}, ::Array{Float64,1}, ::Array{Float64,1}, ::DataFrame, ::Function, ::Symbol) at /Users/ericperim/.julia/packages/Nabla/D60dX/src/sensitivities/indexing.jl:18
   [8] propagate(::Branch{Array{Float64,1}}, ::Tape) at /Users/ericperim/.julia/packages/Nabla/D60dX/src/core.jl:135
   [9] propagate(::Tape, ::Tape) at /Users/ericperim/.julia/packages/Nabla/D60dX/src/core.jl:146
   [10] ∇ at /Users/ericperim/.julia/packages/Nabla/D60dX/src/core.jl:179 [inlined]
   [11] (::Branch{Float64}) at /Users/ericperim/.julia/packages/Nabla/D60dX/src/core.jl:180
...
   [36] include at ./boot.jl:317 [inlined]
   [37] include_relative(::Module, ::String) at ./loading.jl:1044
   [38] include(::Module, ::String) at ./sysimg.jl:29
   [39] include(::String) at ./client.jl:392
   [40] top-level scope at none:0
   [41] eval(::Module, ::Any) at ./boot.jl:319
   [42] exec_options(::Base.JLOptions) at ./client.jl:243
   [43] _start() at ./client.jl:425

@bkamins
Copy link
Member

bkamins commented Sep 18, 2019

Yes it helps 😄. We get an unintended interaction with @views:

julia> df = DataFrame(rand(3,4))
3×4 DataFrame
│ Row │ x1       │ x2       │ x3       │ x4       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.387991 │ 0.115378 │ 0.968019 │ 0.726938 │
│ 2   │ 0.824922 │ 0.585347 │ 0.845218 │ 0.58268  │
│ 3   │ 0.656691 │ 0.19712  │ 0.326753 │ 0.910748 │

julia> @views df[!, 1] .+= 1
ERROR: MethodError: no method matching length(::DataFrames.LazyNewColDataFrame{Int64})

In order to handle this I think we will need to define:

broadcastable(::DataFrames.LazyNewColDataFrame)

@mbauman - could you kindly confirm that this is a right approach.

CC @nalimilan

@bkamins
Copy link
Member

bkamins commented Sep 18, 2019

I have added broadcastable definition to #1960.

You can add:

Base.broadcastable(df::DataFrames.LazyNewColDataFrame) = df.df[!, df.col]

to your code just to check if this resolves your case (it should). It would be great if you could confirm that all is OK.

@eperim
Copy link
Author

eperim commented Sep 18, 2019

Just confirmed it here, and overloading Base.broadcastable does solve the issue. Thanks!

@bkamins
Copy link
Member

bkamins commented Sep 19, 2019

Fixed with #1960

@bkamins bkamins closed this as completed Sep 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants