Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column length checks to expensive operations #1845

Closed
oxinabox opened this issue Jun 10, 2019 · 15 comments
Closed

Add column length checks to expensive operations #1845

oxinabox opened this issue Jun 10, 2019 · 15 comments

Comments

@oxinabox
Copy link
Contributor

From: #1844 (comment)

... we will always have holes) is to create a list of methods in DataFrames.jl that actively check if all columns have the same length before performing their operation (essentially - all expensive methods).

In short: checking column lengths is cheap,
We should do that a bunch of operations,
even of we do think we have patched all holes that let a user resize columns out of sync,
we can still catch bugs in internal methods via this (defensive programming)

@bkamins
Copy link
Member

bkamins commented Jun 10, 2019

This is my point 😄. Thank you!
(I will get to it after I am finished with broadcasting and setindex! cleanups) - unless someone else will by then (which I am happy to review then)

@bkamins
Copy link
Member

bkamins commented Jun 11, 2019

Here is a benchmark of checking cost for 20,000 columns (which I think is a typically reasonable maximum in practice):

julia> df = DataFrame(rand(10, 20_000));

julia> function samelength(df)
           rows = nrow(df)
           for col in _columns(df)
               length(col) == rows || return false
           end
           return true
       end
samelength (generic function with 1 method)

julia> @benchmark samelength($df)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     492.100 μs (0.00% GC)
  median time:      497.400 μs (0.00% GC)
  mean time:        627.434 μs (0.00% GC)
  maximum time:     2.256 ms (0.00% GC)
  --------------
  samples:          7955
  evals/sample:     1

So given this the question is for what functions we should add this check.

@oxinabox
Copy link
Contributor Author

oxinabox commented Jun 11, 2019

20,000 columns is definately an upper bound.
E.g. basically all spreedsheet programs crash out at over 16,384 columns.

I think it should be called if you do

  • show / print
  • first (when called with the nrows argument)

Which will catch things during in interactive use.

  • join
  • groupby

@bkamins
Copy link
Member

bkamins commented Jun 11, 2019

It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.

I would also add:

  • reshape operations
  • filtering
  • removing duplicates
  • removing missings
  • eachcol/eachrow upon creation of the object

@oxinabox
Copy link
Contributor Author

It is tempting to check it also with every call of nrows but I have to check if we do not call it somewhere in hot loops.

If we do, then we should maybe define: nrows(check_column_constancy=true)
And then use nrows(false) in the hotloops.

@bkamins
Copy link
Member

bkamins commented Jun 11, 2019

Agreed. Mostly this other form will be needed internally only so this should be OK (the key place where we will probably use it is creation of DataFrameRow from eachrow as doing the check on each iteration would be very slow)

@oxinabox
Copy link
Contributor Author

I feel like @inbounds should supress this check.
We can opt into that by wrapping the check in @boundscheck.
then if @inbounds is active (and things are wrapped in @propergate_inbounds, as required)
the check would be supressed.

idk though it might be too magic -- people are used to @inbounds applying to indexing operations,
and this is only indirectly indexing.

@nalimilan
Copy link
Member

Doing checks from nrow might be a bit too much. People could expect it to be very cheap, like length on arrays.

@bkamins
Copy link
Member

bkamins commented Jun 12, 2019

The working rule I have in my head now for performing of this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.

As for nrow I think that it does not hurt to add an argument telling which columns should be checked with the default to check the first column. A signature like:

nrow(::AbstractDataFrame, cols=nothing)

@oxinabox
Copy link
Contributor Author

I think maing nrow take a columns argument would be confusing to the user.
It is supposed to be impossible for columns to have different lengths.
And that looks like it is saying that they can, and that I am querying different columns,
And that make it look like this is a feature.

@bkamins
Copy link
Member

bkamins commented Jun 12, 2019

Good point. So we should have such function internally anyway.

@nalimilan
Copy link
Member

The working rule I have in my head now for performing of this check is that:
each time an internal function is called and it works on a subset of columns (possibly all) it should check if these columns have the same length.

Sounds reasonable. Though I'm not sure we need a very clear rule, since throwing an error if a data frame is corrupt will always be OK, and missing a check would be OK too. What matters it that we add as many checks we can without affecting performance.

@bkamins
Copy link
Member

bkamins commented Jun 13, 2019

So here is my list of functions that should do the checks. Please add/remove from it and then we can make a PR:

  • groupby, aggregate
  • join
  • melt, stack, unstack, stackdf, meltdf
  • categorical!
  • copy
  • describe, show
  • allowmissing!, completecases, disallowmissing!, dropmissing, dropmissing!
  • nonunique, unique!
  • eachrow, eachcol
  • filter, filter!
  • hcat, vcat
  • mapcols
  • repeat
  • sort, sort!

In particular I left out: getindex, setindex!, view, select, select!, append! and push! (we might have an opinion to add a check sometimes for them also).

@bkamins
Copy link
Member

bkamins commented Jul 17, 2019

Add append! to the list, see #1885.

@bkamins
Copy link
Member

bkamins commented Jul 25, 2019

Fixed by #1887

@bkamins bkamins closed this as completed Jul 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants