Preventing problems with aliased columns #1885

jablauvelt · 2019-07-17T20:09:09Z

A note in the documentation for append!() says:

Please note that append! must not be used on a DataFrame that contains columns that are aliases (equal when compared with ===) as it will silently produce a wrong result in such a situation.

Unfortunately, I only found this note, and understood it, after many hours of confusion and debugging (I have only been using Julia for a couple of months). I was getting strange errors that didn't seem to be related to append!(), and therefore took a while to track down. It turns out that appending a DataFrame to another DataFrame with aliased columns caused the problem. Here's a reproducible example:

x1 = DataFrame(a = [1, 2, 3], b= ["x", "y", "z"])
x1.c = x1.b # here is the aliasing that causes the problem
x2 = DataFrame(a = [4, 5], b= ["x2", "y2"], c = ["z", "z"])
append!(x1, x2) # succeeds with no warning

# The dataframe looks normal when displayed:
show(x1)
# 5×3 DataFrame
# │ Row │ a     │ b      │ c      │
# │     │ Int64 │ String │ String │
# ├─────┼───────┼────────┼────────┤
# │ 1   │ 1     │ x      │ x      │
# │ 2   │ 2     │ y      │ y      │
# │ 3   │ 3     │ z      │ z      │
# │ 4   │ 4     │ x2     │ x2     │
# │ 5   │ 5     │ y2     │ y2     │

# But columns b and c are longer than column a:
[length(col) for col in eachcol(x1)]
# 3-element Array{Int64,1}:
# 5
# 7
# 7

# Here is what b (and c) actually look like:
x1.b
# 7-element Array{String,1}:
# "x"
# "y"
# "z"
# "x2"
# "y2"
# "z"
# "z"

# This leads to all sorts of strange errors, such as:
x1[x1.a .== 2, :]
# ERROR: BoundsError: attempt to access 7-element Array{String,1} at index [Base.LogicalIndex(Bool[false, true, false, false, false])]

# This is initially where I started seeing problems, but my append!() logic was
# many steps away from this filtering line, so it took a while to track down the culprit.

I know the note in the documentation technically warns users of this behavior, but many users who are less familiar with Julia, and aren't thinking of object references/copying and so forth, will also be very confused. I think it would help immensely to add some basic checks on DataFrames to ensure that their columns are consistent sizes or that there aren't problems created by aliased columns. These checks would go a long way towards protecting Julia's reputation of being safe and transparent.

I feel like there are a few options for resolving this, but I'm not sure what's preferred:

Add some inherent checks to DataFrame objects, such that it is impossible for them to lose their rectangular shape: I'm not sure if this is possible, or if this adds too much overhead, but it would in many senses be good if DataFrames came with an ironclad guarantee of row and column consistency. As an illustrative extreme, checks could be made every time an DataFrame is modified. Of course that might affect performance significantly, so maybe there are strategic junctures to perform these checks that don't add too much overhead.
For methods that could cause errors due to aliasing, add checks/warnings/workarounds for aliasing: I'm not sure how hard it is to identify all the right methods. But if it's doable, it seems like it wouldn't add too much overhead to have a quick check at the beginning of each method. For example, a simple check that each column has its own unique memory address. This could result in a warning, workaround, or outright error, depending on how we want to handle it.
Disallow aliasing, or force copying of aliased columns: I know that if someone has a DataFrame with 500 columns, and 499 of them all are the exact same, then it's a waste of memory to make each column their own object. But at the same time, it's not clear that there is any real application to such a scenario, and maybe discourage this. Perhaps we disallow aliasing for DataFrames, or at least automatically make a copy in situations where aliasing would happen.
Just add a check to append!(): This might be the simplest, since we know there is a problem with this method (and it's unclear right now to me whether other methods are affected by this too).

The text was updated successfully, but these errors were encountered:

bkamins · 2019-07-17T20:25:06Z

Thank you for this report and detailed analysis. Actually it goes in hand with what @oxinabox calls for for some time now (and we keep patching it - see the new indexing rules and the fact that constructors copy columns by default).

What we plan to do is:

implement Add column length checks to expensive operations #1845 (I am working on it currently) - we will the check for expensive operations; we will add append! to the list as this is a good suggestion
also it is a good suggestion that actually we could check for aliases; we already do it for broadcasting and I have recently implemented full unaliasing mechanism (unfortunately this is O(ncol(df)^2) operation) - so this is a decision point when we would want to do it; comments are welcome
disallowing aliasing can be considered; it has cost O(ncol(df)) so again - comments are welcome if we want to do it

My current thinking is to (and this would go to #1845 implementation):

check for data frame consistency before expensive operations
check for data frame consistency after operations that could cause a problem (largely this is append! and push!)
do not perform aliasing analysis as it would be too expensive and in general allow aliases (but this is a mild preference - we can discuss here as I see pros of disallowing aliases)

CC @nalimilan, @oxinabox, @quinnj

jablauvelt · 2019-07-17T20:45:07Z

Thanks for the quick and comprehensive response! I meant to say that I did see #1845 when searching, but wasn't sure how closely related it was to this - good to know that they are.

Dumb newbie question here, but why would we only check for data frame consistency before expensive operations? I'd normally assume that if it's not an expensive operation then we'd "have even more time" for these checks. Is it because non-expensive operations tend to be applied more often, and therefore the overhead of these checks would add up much more quickly?

bkamins · 2019-07-17T21:34:16Z

You do non-expensive operations in a loop usually like this:

df = DataFrame(a=1,b=2)
for i in 1:10^6
    push!(df, [i,i])
end

adding a check to push! would kill performance.

Also note that in your original code the "proper" way to create a new column is:

x1[!, c] .= x1.b

which avoids aliasing (this is of course does not solve the problem but I just wanted to note the intended pattern for such cases).

jablauvelt · 2019-07-17T21:56:09Z

OK, good to know. I was also wondering about the proper way of creating a new column, short of using copy(x1.b), so that's helpful. Thanks.

bkamins · 2019-07-18T10:43:19Z

see #1887

nalimilan · 2019-07-22T16:54:27Z

In your actual use case, did you call append!(x1, x2), or append!(x1, x3) with x3 another, unrelated data frame? It's relatively easy to fix the former since we can check aliasing between x1 and x2, but it's much harder if appending x3 since the aliasing isn't visible by append! (unless we keep a global table of aliasing, which is a possibility).

bkamins · 2019-07-22T16:58:49Z

We have aliasing detection code already, as it was needed for broadcasting.
But aliasing is not a problem - if it happens in #1887 I detect its consequences (wrong column names) and print an error (i.e. I do not need to know we have aliasing to know we have a problem 😄).

nalimilan · 2019-07-22T17:00:08Z

Disallow aliasing, or force copying of aliased columns: I know that if someone has a DataFrame with 500 columns, and 499 of them all are the exact same, then it's a waste of memory to make each column their own object. But at the same time, it's not clear that there is any real application to such a scenario, and maybe discourage this. Perhaps we disallow aliasing for DataFrames, or at least automatically make a copy in situations where aliasing would happen.

FWIW, that's what we do by default in recent releases e.g. with select, but the x1.c = x2.b doesn't allow us to know that x2.b comes from a data frame: all we know is that it's a vector. We could always make a copy of that vector, but that can be sub-optimal if you build that vector manually and don't intend to resize it, and that would mean x1.a isn't equivalent to x1[!, :a], making things more complex to grasp. Though as I said maybe we could use a global table of column vectors to detect that the vector is used by another data frame, and print a warning.

nalimilan · 2019-07-22T17:01:45Z

We have aliasing detection code already, as it was needed for broadcasting.
But aliasing is not a problem - if it happens in #1887 I detect its consequences (wrong column names) and print an error (i.e. I do not need to know we have aliasing to know we have a problem smile).

@bkamins Aliasing detection when broadcasting only needs to check columns between data frames involved in the operation, just like the append!(x1, x2) case. But detecting aliasing with other data frames requires a different system.

bkamins · 2019-07-22T17:16:50Z

I agree (but my major point was that we do not need to detect aliasing anyway and aliasing detection is very expensive so we also do not want to do this - it is much cheaper to check afterwards if something got broken)

nalimilan · 2019-07-22T17:19:02Z

Ah sorry I thought that in the OP the aliasing was between x1 and x2, while is it between two columns of x1.

bkamins mentioned this issue Jul 17, 2019

Add column length checks to expensive operations #1845

Closed

bkamins mentioned this issue Jul 18, 2019

First proposal of consistency checks #1887

Merged

bkamins closed this as completed in #1887 Jul 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preventing problems with aliased columns #1885

Preventing problems with aliased columns #1885

jablauvelt commented Jul 17, 2019

bkamins commented Jul 17, 2019

jablauvelt commented Jul 17, 2019

bkamins commented Jul 17, 2019

jablauvelt commented Jul 17, 2019

bkamins commented Jul 18, 2019

nalimilan commented Jul 22, 2019

bkamins commented Jul 22, 2019

nalimilan commented Jul 22, 2019

nalimilan commented Jul 22, 2019

bkamins commented Jul 22, 2019

nalimilan commented Jul 22, 2019

Preventing problems with aliased columns #1885

Preventing problems with aliased columns #1885

Comments

jablauvelt commented Jul 17, 2019

bkamins commented Jul 17, 2019

jablauvelt commented Jul 17, 2019

bkamins commented Jul 17, 2019

jablauvelt commented Jul 17, 2019

bkamins commented Jul 18, 2019

nalimilan commented Jul 22, 2019

bkamins commented Jul 22, 2019

nalimilan commented Jul 22, 2019

nalimilan commented Jul 22, 2019

bkamins commented Jul 22, 2019

nalimilan commented Jul 22, 2019