RFC: start(arr) vs. 1 for bounds checking, iteration vs. indexing #11713

ScottPJones · 2015-06-15T13:23:29Z

In the review of my PR #11575, @tkelman requested I change a 1 to start(dat) in the bounds checking.
I did find a case where start() on an AbstractArray returns a tuple, instead of 1.
I also found numerous cases, just in Base (I'm sure there are many more in the registered Packages),
where 1 was used instead of start() for bounds checking.

@nalimilan and @mbauman both responded, with some good information, and some ideas for what might need to be done to resolve these issues.

What really needs to be done here?


yuyichao · 2015-06-15T13:47:48Z

Shouldn't it be first?

mbauman · 2015-06-15T14:03:55Z

first returns the first element; this is about getting the initial index or the start of the iterable state. Those two things can be the same, but they don't have to be. I definitely don't think we should assume that they are the same.
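A minimal sketch of that difference, assuming a plain Vector. The `firstindex` call is an assumption here: it comes from Julia releases later than this thread and is shown only for comparison.

```julia
A = [10, 20, 30]

# first(A) gives the first *element* of the collection.
elem = first(A)              # 10

# first(eachindex(A)) gives the first *index* of the collection.
idx = first(eachindex(A))    # 1

# Later Julia releases also provide firstindex for the same purpose.
idx2 = firstindex(A)         # 1
```

For a Vector the first index happens to be 1, but the element and the index are unrelated values, which is why `first` alone does not answer the bounds-checking question.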

So, is there a need for another verb here? One motivating case would be Fortran-like offset arrays, which are more possible these days without colon lowering to 1:endof(A). Unfortunately there will almost certainly be all sorts of confusion about which first is the start. Maybe first(eachindex(A)) will inline enough to be optimized away?

mbauman · 2015-06-15T14:19:09Z

Just getting the first index isn't good enough for an offset array. You need the first index in a given dimension. Not only is the lower bound wrong, the upper bound (size(A, d)) would be wrong, too. This may not be something that's worth bending over backwards to support, especially since the package can provide its own checkbounds method. Perhaps the answer is to just use checkbounds wherever it's possible.

mikewl · 2015-06-15T14:27:43Z

Slightly tangential but couldn't start be introduced as a keyword similar to end in the array? Then A[start:end] would work for any offset array. <Moan at me to file an issue for it if anyone thinks this would be useful please!>

mbauman · 2015-06-15T14:34:44Z

No, end only works since it is already a reserved keyword that can't be used as a variable name.

end lowers to size or trailingsize, so it would be wrong for an offset array, too. That's actually a place where we would need this concept in Base to support such a package.

ScottPJones · 2015-06-15T14:37:44Z

Ufff... looks like I've opened another can of worms!
So, will the test: start(dat) <= startpos <= endof(dat) actually work for any dat::AbstractArray,
where start(dat) might return a tuple? Thanks!

mbauman · 2015-06-15T14:39:03Z

No. Use checkbounds(dat, startpos).

mikewl · 2015-06-15T14:44:20Z

Thanks! So more work would be needed in Base for offset arrays then before one could attempt such a package? Could you think of any more places work would be needed offhand?

ScottPJones · 2015-06-15T14:49:57Z

@Mike43110 I found just by grep'ing over a hundred places with BoundsError that would need to be examined...
@mbauman OK, thanks! 😢 Yet another change to #11575 (and I don't even use the "safe" bounds-checking version!)

ScottPJones · 2015-06-15T15:13:30Z

@mbauman The code has:

    startpos < start(dat) && throw(BoundsError(dat, startpos))
    (startpos <= endpos <= endof(dat)) || throw(BoundsError(dat, endpos))

checkbounds(dat,startpos) only checks the start position (and, if the end position is also checked,
it is unnecessarily checked against the end of the array).
How do I also check that startpos <= endpos? It seems like there needs to be a:
checkbounds(dat,startpos,endpos) to correctly handle what my code currently does...
Help! 😀

mbauman · 2015-06-15T15:18:25Z

Ah, yes, that makes sense. You could use checkbounds(dat, startpos:endpos), but I wouldn't worry about it too much. Right now we make the implicit assumption that valid indices in an abstract array A along dimension d are in 1:size(A, d). To do anything else would be a large change that wouldn't be your problem.
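A small sketch of the two call forms mentioned above, assuming a byte vector as a stand-in for `dat`. The `checkbounds(Bool, ...)` form is an assumption: it is available in Julia releases later than this thread and returns a Bool instead of throwing.

```julia
dat = UInt8[0x61, 0x62, 0x63]

# Throwing form: returns nothing when the index (or range) is valid,
# and throws a BoundsError otherwise.
single_ok = checkbounds(dat, 2)
range_ok  = checkbounds(dat, 1:3)

# Non-throwing form (later Julia releases): returns true or false.
inside  = checkbounds(Bool, dat, 3)
outside = checkbounds(Bool, dat, 4)
partial = checkbounds(Bool, dat, 2:5)   # upper end is out of bounds
```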

nalimilan · 2015-06-15T15:37:08Z

@ScottPJones Your second line doesn't look completely correct actually. If one passes endpos = startpos - 1, they will get a BoundsError about trying to index the array at an incorrect position. But that's misleading, as the problem is with startpos relative to endpos. I think you should use two calls to checkbounds, and one custom check for startpos <= endpos.

ScottPJones · 2015-06-15T16:08:37Z

@nalimilan OK, but will startpos <= endpos work for the tuples returned by some types of AbstractArray? What sort of error would you recommend reporting then?

mbauman · 2015-06-15T16:31:37Z

Indexing and iteration are separate concepts. Whatever gets returned by start (like those tuples) is a "private" iteration state and should just get passed along to next and done. You generally shouldn't index into a collection with its iteration state. So you shouldn't be passing (or receiving) those tuples as your start and end indices. If you'll be indexing with it, just use 1 as your default startpos.
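A sketch of the two concepts side by side. It assumes the iteration protocol of later Julia releases, where a single `iterate` function replaced `start`, `next`, and `done`. The function names `collect_by_iteration` and `collect_by_indexing` are hypothetical, chosen for illustration.

```julia
# Iteration: the state is opaque. It is only ever handed back to iterate,
# never used as an index into the collection.
function collect_by_iteration(itr)
    out = Any[]
    result = iterate(itr)
    while result !== nothing
        item, state = result
        push!(out, item)
        result = iterate(itr, state)
    end
    return out
end

# Indexing: the indices come from eachindex and are used with A[i].
function collect_by_indexing(A::AbstractArray)
    out = Any[]
    for i in eachindex(A)
        push!(out, A[i])
    end
    return out
end
```

Both functions visit the same elements of an array, but only the second one relies on indices, so only the second one needs bounds to be meaningful.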

ScottPJones · 2015-06-15T16:48:07Z

I was using next, with integer pos and endpos, doing while pos <= endpos, copying from other code I've seen... so far, nobody had brought that up as an issue, after two months of reviews...
It seems, from looking at the code in Base, that this distinction has not been very clear to everybody...

mbauman · 2015-06-15T17:09:38Z

Yes, this is fairly subtle and the split only happened recently in #10704. It definitely needs more documentation, and I'll put it on my list for my planned Interfaces manual page.

Looking over at #11575, you're not working with general AbstractArrays, though. You're coding directly against the internal implementation of strings, which all use Vector{<:Unsigned}. And the iteration states for Vector are its indices. The folks who are more interested in the exact implementation of these string methods can speak more to this, but I think it's probably ok to rely upon the private internals of Vector in that case.

JeffBezanson · 2015-06-15T17:36:32Z

first(eachindex(A)) certainly seems like the right way to get the first index of something.

ScottPJones · 2015-06-15T17:42:50Z

And to correctly check both a start and end position for bounds of a vector, what would be the best way?

ScottPJones · 2015-06-15T17:46:32Z

    startpos < first(eachindex(dat)) && throw(BoundsError(dat, startpos))
    checkbounds(dat, endpos)
    endpos < startpos && throw(ArgumentError("End position ($endpos) is less than start position ($startpos)"))

would that be good?

mbauman · 2015-06-15T17:53:41Z

Depends on what kind of error messages you want, and what kind of vector you're working with. I'd probably use checkbounds(dat, startpos:endpos). I just checked and it inlines wonderfully without heap allocations or a GC frame. But it doesn't check that startpos <= endpos, and that should be a separate ArgumentError, so keep the third line.

The error reports indexing with a range, though, so there's a tradeoff for the simplicity. Personally, I think it's fine.
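Put together, the suggestion could look roughly like the sketch below. The function name `check_span` is hypothetical, and placing the ordering check first is a choice made for this sketch. An empty range such as 3:2 passes checkbounds on its own, which is why the separate ordering check is still needed.

```julia
# Validate a start/end pair against a vector.
# Throws ArgumentError if the positions are out of order,
# and BoundsError if either end of the range is outside the vector.
function check_span(dat::AbstractVector, startpos::Integer, endpos::Integer)
    endpos < startpos &&
        throw(ArgumentError("End position ($endpos) is less than start position ($startpos)"))
    checkbounds(dat, startpos:endpos)
    return nothing
end

# Helper for trying it out: returns the thrown exception, or nothing.
function try_span(dat, startpos, endpos)
    try
        check_span(dat, startpos, endpos)
        return nothing
    catch err
        return err
    end
end
```

With a three-element vector, `try_span(dat, 1, 3)` returns nothing, `try_span(dat, 2, 5)` returns a BoundsError, and `try_span(dat, 3, 2)` returns an ArgumentError.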

ScottPJones · 2015-06-15T17:58:30Z

OK, thanks!

nalimilan · 2015-06-15T19:46:37Z

@mbauman As you say, the error message is not that great with your proposal. Wouldn't calling checkbounds twice be more correct, and as efficient because of inlining?

kmsquire · 2015-06-15T20:30:51Z

@JeffBezanson wrote: "first(eachindex(A)) certainly seems like the right way to get the first index of something."

I had thought that eachindex wasn't meant to guarantee ordering, so that, e.g., parallel iterators could be implemented easily with it.

Cc: @timholy

mbauman · 2015-06-15T20:43:22Z

@nalimilan You're right, the emitted LLVM is very similar. The difference is likely negligible:

checkbounds(X, a:b): 4 comparisons, 3 branches, 1 add + 1 select, 1 pointer load
checkbounds(X, a); checkbounds(X, b): 4 comparisons, 4 branches, 1 pointer load

ScottPJones · 2015-06-15T21:10:05Z

@mbauman When is your Interfaces manual planned for? It sounds like that would be a wonderful and much needed addition to the documentation! (I realize everybody is quite busy, and documentation is often the last thing somebody wants to do...)

mbauman · 2015-06-15T22:27:46Z

It's currently planned to get done on the day it gets done. :) Definitely aiming for 0.4.

jakebolewski · 2015-08-18T04:55:37Z

The discussion here seems best continued on other related AbstractArray / indexing issues and PRs slated for 0.5.

ScottPJones mentioned this issue Jun 15, 2015

Add UTF encoding validity functions #11575

Merged

jakebolewski closed this as completed Aug 18, 2015
