Add column length checks to expensive operations #1845

oxinabox · 2019-06-10T20:23:20Z

... we will always have holes) is to create a list of methods in DataFrames.jl that actively check if all columns have the same length before performing their operation (essentially - all expensive methods).

In short: checking column lengths is cheap.
We should do that in a bunch of operations;
even if we do think we have patched all holes that let a user resize columns out of sync,
we can still catch bugs in internal methods via this (defensive programming).

bkamins · 2019-06-10T20:32:47Z

This is my point 😄. Thank you!
(I will get to it after I am finished with the broadcasting and setindex! cleanups, unless someone else gets to it first, in which case I am happy to review.)

bkamins · 2019-06-11T09:12:41Z

Here is a benchmark of checking cost for 20,000 columns (which I think is a typically reasonable maximum in practice):

julia> df = DataFrame(rand(10, 20_000));

julia> function samelength(df)
           rows = nrow(df)
           for col in _columns(df)
               length(col) == rows || return false
           end
           return true
       end
samelength (generic function with 1 method)

julia> @benchmark samelength($df)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     492.100 μs (0.00% GC)
  median time:      497.400 μs (0.00% GC)
  mean time:        627.434 μs (0.00% GC)
  maximum time:     2.256 ms (0.00% GC)
  --------------
  samples:          7955
  evals/sample:     1

So given this, the question is which functions should get this check.

oxinabox · 2019-06-11T12:14:07Z

20,000 columns is definitely an upper bound.
E.g. basically all spreadsheet programs crash out at over 16,384 columns.

I think it should be called if you do

show / print
first (when called with the nrows argument)

Which will catch things during interactive use.

join
groupby

bkamins · 2019-06-11T17:08:08Z

It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.

I would also add:

oxinabox · 2019-06-11T18:00:45Z

It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.

If we do, then we should maybe define: nrows(check_column_constancy=true)
And then use nrows(false) in the hotloops.
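
A minimal sketch of that opt-out idea, not the DataFrames.jl implementation. `ToyFrame`, `toy_samelength`, `toy_nrow`, and the `check` keyword are hypothetical names used only for illustration, and the toy type is just a vector of column vectors:

```julia
# Hypothetical stand-in for a data frame: a vector of column vectors.
struct ToyFrame
    columns::Vector{Vector{Float64}}
end

# True when every column has the same length as the first one.
function toy_samelength(tf::ToyFrame)
    isempty(tf.columns) && return true
    rows = length(first(tf.columns))
    return all(col -> length(col) == rows, tf.columns)
end

# Row count with a consistency check that callers can switch off.
# The keyword name `check` is an assumption made for this sketch.
function toy_nrow(tf::ToyFrame; check::Bool=true)
    if check && !toy_samelength(tf)
        throw(DimensionMismatch("columns have different lengths"))
    end
    return isempty(tf.columns) ? 0 : length(first(tf.columns))
end
```

Ordinary callers would use `toy_nrow(tf)` and get the check; a hot loop that has already validated the frame could call `toy_nrow(tf; check=false)`.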

bkamins · 2019-06-11T21:25:57Z

Agreed. Mostly this other form will be needed internally only so this should be OK (the key place where we will probably use it is creation of DataFrameRow from eachrow as doing the check on each iteration would be very slow)

oxinabox · 2019-06-11T23:13:27Z

I feel like @inbounds should suppress this check.
We can opt into that by wrapping the check in @boundscheck.
Then if @inbounds is active (and things are wrapped in @propagate_inbounds, as required)
the check would be suppressed.

idk though it might be too magic -- people are used to @inbounds applying to indexing operations,
and this is only indirectly indexing.
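
A rough sketch of how that could look, on a toy type and not on DataFrames.jl itself. `CheckedFrame`, `check_lengths`, `row_count`, and `sum_first_column` are hypothetical names. Note that Julia only elides a `@boundscheck` block when the function containing it is inlined into a caller that uses `@inbounds`, and never when Julia is started with `--check-bounds=yes`:

```julia
# Hypothetical stand-in for a data frame: a vector of column vectors.
struct CheckedFrame
    columns::Vector{Vector{Int}}
end

# Throw if any column's length differs from the first column's.
function check_lengths(cf::CheckedFrame)
    isempty(cf.columns) && return nothing
    rows = length(first(cf.columns))
    for (i, col) in enumerate(cf.columns)
        length(col) == rows ||
            throw(DimensionMismatch("column $i has length $(length(col)), expected $rows"))
    end
    return nothing
end

# @inline asks the compiler to inline this into callers, which is what
# allows a caller's @inbounds to elide the @boundscheck block.
@inline function row_count(cf::CheckedFrame)
    @boundscheck check_lengths(cf)
    return isempty(cf.columns) ? 0 : length(first(cf.columns))
end

# A caller that opts out of the consistency check before its hot loop.
function sum_first_column(cf::CheckedFrame)
    @inbounds n = row_count(cf)
    total = 0
    for i in 1:n
        total += cf.columns[1][i]
    end
    return total
end
```

Any function sitting between the `@inbounds` caller and `row_count` would need `Base.@propagate_inbounds` for the elision to carry through, which is part of why this may feel too magic.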

nalimilan · 2019-06-12T11:43:30Z

Doing checks from nrow might be a bit too much. People could expect it to be very cheap, like length on arrays.

bkamins · 2019-06-12T12:57:17Z

The working rule I have in my head now for performing this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.

As for nrow I think that it does not hurt to add an argument telling which columns should be checked with the default to check the first column. A signature like:

nrow(::AbstractDataFrame, cols=nothing)

oxinabox · 2019-06-12T13:00:14Z

I think making nrow take a columns argument would be confusing to the user.
It is supposed to be impossible for columns to have different lengths.
And that looks like it is saying that they can, and that I am querying different columns,
and that makes it look like this is a feature.

bkamins · 2019-06-12T16:05:32Z

Good point. So we should have such a function internally anyway.
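
One way such an internal helper might look, as a sketch under assumptions. The name `check_consistency` and its signature are hypothetical; it works on a plain vector of columns, with an optional selection of column indices that defaults to all of them:

```julia
# Check that the selected columns all share one length.
# `cols` is any iterable of indices into `columns`; it defaults to all columns.
function check_consistency(columns::AbstractVector{<:AbstractVector},
                           cols=eachindex(columns))
    isempty(cols) && return true
    rows = length(columns[first(cols)])
    for i in cols
        if length(columns[i]) != rows
            throw(DimensionMismatch(
                "column $i has length $(length(columns[i])), expected $rows"))
        end
    end
    return true
end
```

Keeping this separate from nrow means the user-facing function keeps its simple signature, while internal code can validate exactly the columns it is about to touch.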

nalimilan · 2019-06-12T16:26:06Z

The working rule I have in my head now for performing this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.

Sounds reasonable. Though I'm not sure we need a very clear rule, since throwing an error if a data frame is corrupt will always be OK, and missing a check would be OK too. What matters is that we add as many checks as we can without affecting performance.

bkamins · 2019-06-13T22:03:39Z

So here is my list of functions that should do the checks. Please add/remove from it and then we can make a PR:

groupby, aggregate
join
melt, stack, unstack, stackdf, meltdf
categorical!
copy
describe, show
allowmissing!, completecases, disallowmissing!, dropmissing, dropmissing!
nonunique, unique!
eachrow, eachcol
filter, filter!
hcat, vcat
mapcols
repeat
sort, sort!

In particular I left out: getindex, setindex!, view, select, select!, append! and push! (we might opt to add a check for some of them as well).

bkamins · 2019-07-17T20:17:43Z

Add append! to the list, see #1885.

bkamins · 2019-07-25T22:02:33Z

Fixed by #1887

oxinabox mentioned this issue Jun 10, 2019

Make getproperty(df, col) return a full length view of the column #1844

Closed

This was referenced Jul 17, 2019

Preventing problems with aliased columns #1885

Closed

First proposal of consistency checks #1887

Merged

bkamins closed this as completed Jul 25, 2019
