Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid aliasing columns on assignment #1052

Closed
wants to merge 2 commits into from

Conversation

TotalVerb
Copy link
Contributor

@TotalVerb TotalVerb commented Sep 4, 2016

A Stack Overflow user has reported behaviour that I thought was strange, with regard to aliasing between columns: http://stackoverflow.com/questions/39320783/julia-dataframe-changing-one-cell-changes-entire-row

This makes upgrade_vector copying and also unhoists the broadcasting, to remove aliasing. Of course, the performance of these operations will be much worse.

Fixes #1051.

@nalimilan
Copy link
Member

Thanks! Indeed, it's more consistent to always return a newly-allocated vector from upgrade_vector.

@ararslan
Copy link
Member

ararslan commented Sep 4, 2016

Looks like this is causing this line to fail on 0.4:

df = DataFrame(A = Array(String, 3))

@andreasnoack
Copy link
Member

andreasnoack commented Sep 4, 2016

Two comments:

  1. I don't think we should copy when it's not necessary. If we follow that policy everywhere we'll end up as inefficient as R. The single change in line 414 seems to be sufficient to fix the issue without introducing copying anywhere else. See also the discussion in implement copy(::Void) = nothing JuliaLang/julia#15546. As I read that issue, the general view is that aliasing can depend on types but not values and here it's about types.
  2. Slighty off topic, sorry. As mentioned elsewhere, I think we should get rid this kind of 2d indexing of DataFrames. It's bad R heritage, it's inefficient and it leads to issues like the one we are fixing here.

@TotalVerb
Copy link
Contributor Author

@andreasnoack I'll separate out the single change in line 414 and keep this open for discussion.

@nalimilan
Copy link
Member

Actually, I find it weird that df[:, 1:end]=0.0 replaces the original columns with new vectors. I'd rather have it modify existing columns in place (and fail when conversion isn't possible), to be consistent with what happens with standard arrays.

We could also deprecate this indexing method, but probably better keep this discussion separate.

@andreasnoack
Copy link
Member

Good point. I think you are right here. It is how we usually do this and it is also likely to be more efficient.

@TotalVerb
Copy link
Contributor Author

So what's the decision on setting columns to vectors? Should that copy the data into the existing vector, or replace (and thus alias)?

@andreasnoack
Copy link
Member

I'd vote for copying the data into the existing and failing if not possible.

@TotalVerb
Copy link
Contributor Author

Closing, as it seems like the redesign will make a lot of these issues redundant.

@TotalVerb TotalVerb closed this Feb 27, 2017
@nalimilan
Copy link
Member

I don't think we have precise plans to fix this in a different way, and DataFrames still exists, so better keep it open.

@nalimilan nalimilan reopened this Feb 27, 2017
@TotalVerb
Copy link
Contributor Author

This PR does not implement the suggestion of @andreasnoack. I suspect that suggestion can be worded as simply

---
 src/dataframe/dataframe.jl | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/dataframe/dataframe.jl b/src/dataframe/dataframe.jl
index 4ce59b3..2ec54aa 100644
--- a/src/dataframe/dataframe.jl
+++ b/src/dataframe/dataframe.jl
@@ -325,15 +325,15 @@ function insert_single_column!(df::DataFrame,
     end
     if haskey(index(df), col_ind)
         j = index(df)[col_ind]
-        df.columns[j] = dv
+        @compat df.columns[j] .= dv
     else
         if typeof(col_ind) <: Symbol
             push!(index(df), col_ind)
-            push!(df.columns, dv)
+            push!(df.columns, upgrade_vector(dv))
         else
             if isnextcol(df, col_ind)
                 push!(index(df), nextcolname(df))
-                push!(df.columns, dv)
+                push!(df.columns, upgrade_vector(dv))
             else
                 error("Cannot assign to non-existent column: $col_ind")
             end
@@ -380,7 +380,7 @@ end
 function Base.setindex!(df::DataFrame,
                 v::AbstractVector,
                 col_ind::ColumnIndex)
-    insert_single_column!(df, upgrade_vector(v), col_ind)
+    insert_single_column!(df, v, col_ind)
 end
 
 # df[SingleColumnIndex] = Single Item (EXPANDS TO NROW(DF) if NCOL(DF) > 0)
-- 
2.9.3

which is a more minimal change than this PR, albeit possibly disruptive.

@nalimilan
Copy link
Member

Could you open a PR against DataTables? Disruptive changes are appreciated there. ;-)

@quinnj
Copy link
Member

quinnj commented Sep 7, 2017

upgrade_vector no longer exists, so I don't think this is an issue anymore.

@quinnj quinnj closed this Sep 7, 2017
@nalimilan
Copy link
Member

We still have upgrade_scalar, and the tests added in the PR still fail on master.

@nalimilan
Copy link
Member

See #1528.

@nalimilan nalimilan closed this Sep 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants