Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: start(arr) vs. 1 for bounds checking, iteration vs. indexing #11713

Closed
ScottPJones opened this issue Jun 15, 2015 · 27 comments
Closed

RFC: start(arr) vs. 1 for bounds checking, iteration vs. indexing #11713

ScottPJones opened this issue Jun 15, 2015 · 27 comments

Comments

@ScottPJones
Copy link
Contributor

In the review of my PR #11575, @tkelman requested I change a 1 to start(dat) in the bounds checking.
I did find a case where start() on an AbstractArray returns a tuple, instead of 1.
I also found numerous cases, just in Base (I'm sure there are many more in the registered Packages),
where 1 was used instead of start() for bounds checking.

@nalimilan and @mbauman both responded, with some good information, and some ideas for what might need to be done to resolve these issues.

What really needs to be done here?

@yuyichao
Copy link
Contributor

Shouldn't it be first?

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

first returns the first element; this is about getting the initial index or the start of the iterable state. Those two things can be the same, but they don't have to be. I definitely don't think we should assume that they are the same.

So, is there a need for another verb here? One motivating case would be Fortran-like offset arrays, which are more possible these days without colon lowering to 1:endof(A). Unfortunately there will almost certainly be all sorts of confusion about which first is the start. Maybe first(eachindex(A)) will inline enough to be optimized away?

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

Just getting the first index isn't good enough for an offset array. You need the first index in a given dimension. Not only is the lower bound wrong, the upper bound (size(A, d)) would be wrong, too. This may not be something that's worth bending over backwards to support, especially since the package can provide its own checkbounds method. Perhaps the answer is to just use checkbounds where-ever it's possible.

@mikewl
Copy link
Contributor

mikewl commented Jun 15, 2015

Slightly tangential but couldn't start be introduced as a keyword similar to end in the array? Then A[start:end] would work for any offset array. <Moan at me to file an issue for it if anyone thinks this would be useful please!>

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

No, end only works since it is already a reserved keyword that can't be used as a variable name.

end lowers to size or trailingsize, so it would be wrong for an offset array, too. That's actually a place where we would need this concept in Base to support such a package.

@ScottPJones
Copy link
Contributor Author

Ufff... looks like I've opened another can of worms!
So, will the test: start(dat) <= startpos <= endof(dat) actually work for any dat::AbstractArray,
where start(dat) might return a tuple? Thanks!

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

No. Use checkbounds(dat, startpos).

@mikewl
Copy link
Contributor

mikewl commented Jun 15, 2015

Thanks! So more work would be needed in Base for offset arrays then before one could attempt such a package? Could you think of any more places work would be needed offhand?

@ScottPJones
Copy link
Contributor Author

@Mike43110 I found just by grep'ing over a hundred places with BoundsError that would need to be examined...
@mbauman OK, thanks! 😢 Yet another change to #11575 (and I don't even use the "safe" bounds-checking version!)

@ScottPJones
Copy link
Contributor Author

@mbauman The code has:

    startpos < start(dat) && throw(BoundsError(dat, startpos))
    (startpos <= endpos <= endof(dat)) || throw(BoundsError(dat, endpos))

checkbounds(dat,startpos) only checks the start position (and, if the end position is also checked,
unnecessarily checks it against the end of the array.
How do I also check that startpos <= endpos? It seems like there needs to be a:
checkbounds(dat,startpos,endpos) to correctly handle what my code currently does...
Help! 😀

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

Ah, yes, that makes sense. You could use checkbounds(dat, startpos:endpos), but I wouldn't worry about it too much. Right now we make the implicit assumption that valid indices in an abstract array A along dimension d are in 1:sizeof(A, d). To do anything else would be a large change that wouldn't be your problem.

@nalimilan
Copy link
Member

@ScottPJones Your second line doesn't look completely correct actually. If one passes endpos = startpos - 1, he will get a BoundsError about trying to index the array at an incorrect position. But that's misleading, as the problem is with startpos relative to endpos. I think you should use two calls to checkbounds, and one custom check for startpos <= endpos.

@ScottPJones
Copy link
Contributor Author

@nalimilan OK, but will startpos <= endpos work for the tuples returned by some types of AbstractArray? What sort of error would you recommend reporting then?

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

Indexing and iteration are separate concepts. Whatever gets returned by start (like those tuples) is a "private" iteration state and should just get passed along to next and done. You generally shouldn't index into a collection with its iteration state. So you shouldn't be passing (or receiving) those tuples as your start and end indices. If you'll be indexing with it, just use 1 as your default startpos.

@ScottPJones
Copy link
Contributor Author

I was using next, with integer pos and endpos, doing while pos <= endpos, copying from other code I've seen... so far, nobody had brought that up as an issue, after two months of reviews...
It seems, from looking at the code in Base, that this distinction has not been very clear to everybody...

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

Yes, this is fairly subtle and the split only happened recently in #10704. It definitely needs more documentation, and I'll put it on my list for my planned Interfaces manual page.

Looking over at #11575, you're not working with general AbstractArrays, though. You're coding directly against the internal implementation of strings, which all use Vector{<:Unsigned}. And the iteration states for Vector are its indices. The folks who are more interested in the exact implementation of these string methods can speak more to this, but I think it's probably ok to rely upon the private internals of Vector in that case.

@JeffBezanson
Copy link
Member

first(eachindex(A)) certainly seems like the right way to get the first index of something.

@ScottPJones
Copy link
Contributor Author

And to correctly check both a start and end position for bounds of a vector, what would be the best way?

@ScottPJones
Copy link
Contributor Author

    startpos < first(eachindex(dat)) && throw(BoundsError(dat, startpos))
    checkbounds(dat, endpos)
    endpos < startpos && throw(ArgumentError("End position ($endpos) is less than start position ($startpos)"))

would that be good?

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

Depends on what kind of error messages you want, and what kind of vector you're working with. I'd probably use checkbounds(dat, startpos:endpos) — I just checked and it inlines wonderfully without heap allocations or a GC frame. But it doesn't check that startpos < endpos, and that should be a separate ArgumentError, so keep the third line.

The error reports indexing with a range, though, so there's a tradeoff for the simplicity. Personally, I think it's fine.

@ScottPJones
Copy link
Contributor Author

OK, thanks!

@nalimilan
Copy link
Member

@mbauman As you say, the error message is not that great with your proposal. Wouldn't calling checkbounds twice be more correct, and as efficient because of inlining?

@kmsquire
Copy link
Member

first(eachindex(A)) certainly seems like the right way to get the first index of something.

I had thought that eachindex wasn't meant to guarantee ordering, so that, e.g., parallel iterators could be implemented easily with it.

Cc:@timholy

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

@nalimilan You're right, the emitted LLVM is very similar. The difference is likely negligible:

checkbounds(X, a:b): 4 comparisons, 3 branches, 1 add + 1 select, 1 pointer load
checkbounds(X, a); checkbounds(X, b): 4 comparisons, 4 branches, 1 pointer load

@ScottPJones
Copy link
Contributor Author

@mbauman When is your Interfaces manual planned for? It sounds like that would be a wonderful and much needed addition to the documentation! (I realize everybody is quite busy, and documentation is often the last thing somebody wants to do...)

@mbauman
Copy link
Member

mbauman commented Jun 15, 2015

It's currently planned to get done on the day it gets done. :) Definitely aiming for 0.4.

@jakebolewski
Copy link
Member

The discussion here seems best continued on other related AbstractArray / indexing issues/PR's slated for 0.5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants