Skip to content

Conversation

@bkamins
Copy link
Member

@bkamins bkamins commented Dec 26, 2019

We were using wrong syntax for BoundsError in the whole package (it does not allow a custom message). This PR fixes this.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Dec 27, 2019

Unfortunately BoundsError is not parametric. I have a fix ready that defines custom BounsErrorDF that handles everything we need. Do you think it would be OK to introduce our own exception type?

@nalimilan
Copy link
Member

I'd say it's OK. Though maybe call it DataFrameBoundsError?

@bkamins
Copy link
Member Author

bkamins commented Dec 28, 2019

Now we would have another problem 😢. Sometimes in very similar expressions we call different code paths that either would throw BoundsError or DataFrameBoundsError (sometimes we index into arrays that throw BoundsError). Intercepting all such cases will be hard. It seems that throwing just BoundsError is the best option (although admittedly not very informative).

Do you have any other idea?

@nalimilan
Copy link
Member

nalimilan commented Dec 29, 2019

Are there many examples like this? It doesn't sound very user-friendly to throw an error about an underlying vector rather than printing a more explicit error. EDIT: can you point to some examples?

Anyway I'd rather throw different exception types than less explicit errors.

@bkamins
Copy link
Member Author

bkamins commented Dec 29, 2019

I would have to review the whole code base. For sure it is the case with views. E.g. if we create them with : as column selector we have an optimized rule that reuses index of the parent. This has a consequence that if we index into a view there is a different code path for :-column view and any other column selection view. But maybe this is fixable (I want to avoid try-catch blocks substituting the exception type as this would be too slow - but probably this is doable).

@bkamins
Copy link
Member Author

bkamins commented Dec 29, 2019

I have analyzed this and it would be a lot of work for little gain IMO.
What I proposed instead as a partial solution (that does not hurt anyway) is to define appropriate summary methods (in most recent commit). In this way at least user gets the information about the dimensionality of the indexed object, e.g.:

julia> df = DataFrame(ones(2,3))
2×3 DataFrame
│ Row │ x1      │ x2      │ x3      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 1.0     │ 1.0     │
│ 2   │ 1.0     │ 1.0     │ 1.0     │

julia> df[0, 1]
ERROR: BoundsError: attempt to access 2×3 DataFrame
  at index [0, 1]

julia> df[1, 0]
ERROR: BoundsError: attempt to access 2×3 DataFrame
  at index [1, 0]

julia> df[1, :z]
ERROR: ArgumentError: column name :z not found in the data frame; existing most similar names are: :x1, :x2 and :x3

julia> df[1, :][0]
ERROR: BoundsError: attempt to access 3-element DataFrames.Index
  at index [0]

julia> df[1, 1:2][0]
ERROR: BoundsError: attempt to access 2-element UnitRange{Int64} at index [0]

if this is acceptable I will finalize this PR in this way (we can do a major overhaul of thrown error types after 1.0 release if we want I think).

@nalimilan
Copy link
Member

Sounds OK, except for the one mentioning DataFrames.Index. Do you think it can be avoided?

There's also this which isn't super nice:

julia> df[1:10, :]
ERROR: BoundsError: attempt to access 2×3 DataFrame
  at index [1:10, Colon()]

Though it's even worse for Array:

julia> rand(2, 2)[1:10, :]
ERROR: BoundsError: attempt to access 2×2 Array{Float64,2} at index [1:10, Base.Slice(Base.OneTo(2))]

I've filed JuliaLang/julia#34217 to improve this in Base.

BTW, this looks like it could be improved at some point (though it's somewhat orthogonal to this PR):

julia> df[:, 1:10]
ERROR: BoundsError: attempt to access 3-element Array{AbstractArray{T,1} where T,1} at index [1:10]

@bkamins
Copy link
Member Author

bkamins commented Dec 30, 2019

This is doable. I have just two questions to you and one comment. Thank you.

Sounds OK, except for the one mentioning DataFrames.Index. Do you think it can be avoided?

Yes - we can make summary of AbstractIndex not print its type but print any custom message so instead of printing "3-element DataFrames.Index" we would print "data frame with 3 columns" (AbstractIndex is internal so we are free to do non-standard things with it). Do you think it is OK?

There's also this which isn't super nice:

I can fix it to print : (I can pass ":" string instead of :). Do we go this way?

BTW, this looks like it could be improved at some point

Yes, agreed - but this is exactly this class of problems that require a significant redesign of indexing, which I think can be done later (it has to be benchmarked very carefully before making such changes).

@nalimilan
Copy link
Member

Yes - we can make summary of AbstractIndex not print its type but print any custom message so instead of printing "3-element DataFrames.Index" we would print "data frame with 3 columns" (AbstractIndex is internal so we are free to do non-standard things with it). Do you think it is OK?

Clever. Yeah, why not, if catching the error before or after to throw a better error isn't possible.

I can fix it to print : (I can pass ":" string instead of :). Do we go this way?

We could use ":" on older Julia versions on : on 1.4-DEV and above, assuming JuliaLang/julia#34217 will be merged in one way of another?

@bkamins
Copy link
Member Author

bkamins commented Dec 30, 2019

Yeah, why not, if catching the error before or after to throw a better error isn't possible.

So this is what I will implement now. In order to improve things - as noted earlier - we would need to reimplement all AbstractDataFrame related methods to do bounds checking on their level and always do @inbounds for AbstractIndex calls. But - although it seems a simple change - I have a feeling that it will be not that easy (especially with SubDataFrame and DataFrameRow), so that is why I would postpone it post 1.0, as e.g. moving select forward (and other 1.0 marked issues/PR) is more pressing now. And even if we do this AbstractIndex is only used in the context of AbstractDataFrame anyway so we will not be misleading anyone.

We could use ":" on older Julia versions on : on 1.4-DEV and above

I have just checked the implementation in Julia you have proposed. Essentially the best thing would be to move it entirely. So maybe let us just leave things here as is and follow that Julia prints by default.

@bkamins
Copy link
Member Author

bkamins commented Dec 30, 2019

OK, changed. The error messages now look like this:

julia> DataFrame(rand(2,3))[:, 0]
ERROR: BoundsError: attempt to access data frame with 3 columns
  at index [0]
Stacktrace:
 [1] getindex at D:\AppData\.julia\dev\DataFrames\src\other\index.jl:164 [inlined]
 [2] getindex(::DataFrame, ::Colon, ::Int64) at D:\AppData\.julia\dev\DataFrames\src\dataframe\dataframe.jl:342
 [3] top-level scope at REPL[5]:1

julia> DataFrame(rand(2,1))[:, 0]
ERROR: BoundsError: attempt to access data frame with 1 column
  at index [0]
Stacktrace:
 [1] getindex at D:\AppData\.julia\dev\DataFrames\src\other\index.jl:164 [inlined]
 [2] getindex(::DataFrame, ::Colon, ::Int64) at D:\AppData\.julia\dev\DataFrames\src\dataframe\dataframe.jl:342
 [3] top-level scope at REPL[6]:1

@nalimilan
Copy link
Member

OK, sounds good! Can you add tests for summary though?

@bkamins
Copy link
Member Author

bkamins commented Jan 1, 2020

OK - added. I have added tests with and without IO parameter as this is what caused the problem in the past. The tests cover all relevant types we define in DataFrames.jl.

Also this means that I have added a test for eachcol.

Here is one small issue we could discuss. When displaying eachrow(df) (the same with eachcol(df)) we print the following:

julia> eachrow(DataFrame(ones(2,3)))
2×3 DataFrameRows
│ Row │ x1      │ x2      │ x3      │
│     │ Float64 │ Float64 │ Float64 │
├─────┼─────────┼─────────┼─────────┤
│ 1   │ 1.0     │ 1.0     │ 1.0     │
│ 2   │ 1.0     │ 1.0     │ 1.0     │

and we give both dimensions of the parent.

However, in summary we give:

julia> summary(eachrow(DataFrame(ones(2,3))))
"2-element DataFrameRows"

indicating the length of the iterable.

I thought it is OK to do this this way, but maybe you feel we should make it consistent (in which case I would change what we print in show to use summary for the header). Do you have any opinion on this?

@bkamins
Copy link
Member Author

bkamins commented Jan 1, 2020

CI produces ❌ due to coveralls

@nalimilan
Copy link
Member

Thanks. Ideally I'd prefer showing both dimensions for consistency and because they are both relevant, but that would be misleading when printing BoundsError so I guess it's better to only print the length.

@bkamins
Copy link
Member Author

bkamins commented Jan 1, 2020

but that would be misleading when printing BoundsError

This was exactly my thinking (i.e. it is an AbstractVector so for iteration-related stuff we should report the length of this vector).

@bkamins bkamins merged commit f46a404 into JuliaData:master Jan 1, 2020
@bkamins bkamins deleted the fix_boundserror branch January 1, 2020 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants