Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preventing problems with aliased columns #1885

Closed
jablauvelt opened this issue Jul 17, 2019 · 11 comments · Fixed by #1887
Closed

Preventing problems with aliased columns #1885

jablauvelt opened this issue Jul 17, 2019 · 11 comments · Fixed by #1887

Comments

@jablauvelt
Copy link

A note in the documentation for append!() says:

Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===) as it will silently produce a wrong result in such a situation.

Unfortunately, I only found this note, and understood it, after many hours of confusion and debugging (I have only been using Julia for a couple of months). I was getting strange errors that didn't seem to be related to append!(), and therefore took a while to track down. It turns out that appending a DataFrame to another DataFrame with aliased columns caused the problem. Here's a reproducible example:

x1 = DataFrame(a = [1, 2, 3], b= ["x", "y", "z"])
x1.c = x1.b # here is the aliasing that causes the problem
x2 = DataFrame(a = [4, 5], b= ["x2", "y2"], c = ["z", "z"])
append!(x1, x2) # succeeds with no warning

# The dataframe looks normal when displayed:
show(x1)
# 5×3 DataFrame
# │ Row │ a     │ b      │ c      │
# │     │ Int64 │ String │ String │
# ├─────┼───────┼────────┼────────┤
# │ 1   │ 1     │ x      │ x      │
# │ 2   │ 2     │ y      │ y      │
# │ 3   │ 3     │ z      │ z      │
# │ 4   │ 4     │ x2     │ x2     │
# │ 5   │ 5     │ y2     │ y2     │

# But columns b and c are longer than column a:
[length(col) for col in eachcol(x1)]
# 3-element Array{Int64,1}:
# 5
# 7
# 7

# Here is what b (and c) actually look like:
x1.b
# 7-element Array{String,1}:
# "x"
# "y"
# "z"
# "x2"
# "y2"
# "z"
# "z"

# This leads to all sorts of strange errors, such as:
x1[x1.a .== 2, :]
# ERROR: BoundsError: attempt to access 7-element Array{String,1} at index [Base.LogicalIndex(Bool[false, true, false, false, false])]

# This is initially where I started seeing problems, but my append!() logic was
# many steps away from this filtering line, so it took a while to track down the culprit.

I know the note in the documentation technically warns users of this behavior, but many users who are less familiar with Julia, and aren't thinking of object references/copying and so forth, will also be very confused. I think it would help immensely to add some basic checks on DataFrames to ensure that their columns are consistent sizes or that there aren't problems created by aliased columns. These checks would go a long way towards protecting Julia's reputation of being safe and transparent.

I feel like there are a few options for resolving this, but I'm not sure what's preferred:

  1. Add some inherent checks to DataFrame objects, such that it is impossible for them to lose their rectangular shape: I'm not sure if this is possible, or if this adds too much overhead, but it would in many senses be good if DataFrames came with an ironclad guarantee of row and column consistency. As an illustrative extreme, checks could be made every time an DataFrame is modified. Of course that might affect performance significantly, so maybe there are strategic junctures to perform these checks that don't add too much overhead.
  2. For methods that could cause errors due to aliasing, add checks/warnings/workarounds for aliasing: I'm not sure how hard it is to identify all the right methods. But if it's doable, it seems like it wouldn't add too much overhead to have a quick check at the beginning of each method. For example, a simple check that each column has its own unique memory address. This could result in a warning, workaround, or outright error, depending on how we want to handle it.
  3. Disallow aliasing, or force copying of aliased columns: I know that if someone has a DataFrame with 500 columns, and 499 of them all are the exact same, then it's a waste of memory to make each column their own object. But at the same time, it's not clear that there is any real application to such a scenario, and maybe discourage this. Perhaps we disallow aliasing for DataFrames, or at least automatically make a copy in situations where aliasing would happen.
  4. Just add a check to append!(): This might be the simplest, since we know there is a problem with this method (and it's unclear right now to me whether other methods are affected by this too).
@bkamins
Copy link
Member

bkamins commented Jul 17, 2019

Thank you for this report and detailed analysis. Actually it goes in hand with what @oxinabox calls for for some time now (and we keep patching it - see the new indexing rules and the fact that constructors copy columns by default).

What we plan to do is:

  • implement Add column length checks to expensive operations #1845 (I am working on it currently) - we will the check for expensive operations; we will add append! to the list as this is a good suggestion
  • also it is a good suggestion that actually we could check for aliases; we already do it for broadcasting and I have recently implemented full unaliasing mechanism (unfortunately this is O(ncol(df)^2) operation) - so this is a decision point when we would want to do it; comments are welcome
  • disallowing aliasing can be considered; it has cost O(ncol(df)) so again - comments are welcome if we want to do it

My current thinking is to (and this would go to #1845 implementation):

  • check for data frame consistency before expensive operations
  • check for data frame consistency after operations that could cause a problem (largely this is append! and push!)
  • do not perform aliasing analysis as it would be too expensive and in general allow aliases (but this is a mild preference - we can discuss here as I see pros of disallowing aliases)

CC @nalimilan, @oxinabox, @quinnj

@jablauvelt
Copy link
Author

Thanks for the quick and comprehensive response! I meant to say that I did see #1845 when searching, but wasn't sure how closely related it was to this - good to know that they are.

Dumb newbie question here, but why would we only check for data frame consistency before expensive operations? I'd normally assume that if it's not an expensive operation then we'd "have even more time" for these checks. Is it because non-expensive operations tend to be applied more often, and therefore the overhead of these checks would add up much more quickly?

@bkamins
Copy link
Member

bkamins commented Jul 17, 2019

You do non-expensive operations in a loop usually like this:

df = DataFrame(a=1,b=2)
for i in 1:10^6
    push!(df, [i,i])
end

adding a check to push! would kill performance.

Also note that in your original code the "proper" way to create a new column is:

x1[!, c] .= x1.b

which avoids aliasing (this is of course does not solve the problem but I just wanted to note the intended pattern for such cases).

@jablauvelt
Copy link
Author

OK, good to know. I was also wondering about the proper way of creating a new column, short of using copy(x1.b), so that's helpful. Thanks.

@bkamins
Copy link
Member

bkamins commented Jul 18, 2019

see #1887

@nalimilan
Copy link
Member

In your actual use case, did you call append!(x1, x2), or append!(x1, x3) with x3 another, unrelated data frame? It's relatively easy to fix the former since we can check aliasing between x1 and x2, but it's much harder if appending x3 since the aliasing isn't visible by append! (unless we keep a global table of aliasing, which is a possibility).

@bkamins
Copy link
Member

bkamins commented Jul 22, 2019

We have aliasing detection code already, as it was needed for broadcasting.
But aliasing is not a problem - if it happens in #1887 I detect its consequences (wrong column names) and print an error (i.e. I do not need to know we have aliasing to know we have a problem 😄).

@nalimilan
Copy link
Member

Disallow aliasing, or force copying of aliased columns: I know that if someone has a DataFrame with 500 columns, and 499 of them all are the exact same, then it's a waste of memory to make each column their own object. But at the same time, it's not clear that there is any real application to such a scenario, and maybe discourage this. Perhaps we disallow aliasing for DataFrames, or at least automatically make a copy in situations where aliasing would happen.

FWIW, that's what we do by default in recent releases e.g. with select, but the x1.c = x2.b doesn't allow us to know that x2.b comes from a data frame: all we know is that it's a vector. We could always make a copy of that vector, but that can be sub-optimal if you build that vector manually and don't intend to resize it, and that would mean x1.a isn't equivalent to x1[!, :a], making things more complex to grasp. Though as I said maybe we could use a global table of column vectors to detect that the vector is used by another data frame, and print a warning.

@nalimilan
Copy link
Member

We have aliasing detection code already, as it was needed for broadcasting.
But aliasing is not a problem - if it happens in #1887 I detect its consequences (wrong column names) and print an error (i.e. I do not need to know we have aliasing to know we have a problem smile).

@bkamins Aliasing detection when broadcasting only needs to check columns between data frames involved in the operation, just like the append!(x1, x2) case. But detecting aliasing with other data frames requires a different system.

@bkamins
Copy link
Member

bkamins commented Jul 22, 2019

I agree (but my major point was that we do not need to detect aliasing anyway and aliasing detection is very expensive so we also do not want to do this - it is much cheaper to check afterwards if something got broken)

@nalimilan
Copy link
Member

Ah sorry I thought that in the OP the aliasing was between x1 and x2, while is it between two columns of x1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants