New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NEP: Add proposal for oindex and vindex. #6256
Conversation
Some thoughts...
|
Regardless of my objections, wonderful write-up Sebastian, thanks for putting this together. And you even managed to keep the word "subspace" out of the explanation! ;-) |
I am sure I can squeeze in subspace somewhere ;). |
9dd0aed
to
f7417c5
Compare
Some quick comments from a Numpy near-beginner :
|
Yeah, I should call it, "making indexing understandable" ;) (kidding, but the title is really bad). Did you check the examples? Without them it is unfortunatly impossible to understand things unless you are extremely well versed in "fancy indexing". Even with them it might be (please tell me if it is, I would like a final version that is understandable...). Naming is difficult, and no vectorized only makes half sense. Actually I think broadcasted indexing might be more to the point, maybe I should add/change that. The reason to add both is, I hope explained. Fancy indexing which is much like the proposed "vindex" has random transposing which is confusing as hell, plus the boolean part of it is useless. About the transposing, note that numpy uses |
See, I told you, you should have used "subspace", it makes things soooo much more understandable... ;-) |
The way I think about it is:
|
@seberg: Also, I haven't had a chance to look at the NEP yet (sorry!), but select all rows that contain no NaNsarr[np.any(np.isnan(arr), axis=0, keepdims=True)]
|
@njsmith, yet another one of those thinngs.... Actually have not even thought about that one too much. The biggest problem is that only in 1.10 we started deprecating the weird behaviour it used to use (plus implementation wise, it requires some extra logic at another place, but oh well). One of those things we could in principle change for new indexing attributes though I guess (and then later after deprecation also for standard indexing). |
Actually, rethinking what I thought about vindex+boolean back then. Maybe it does make sense to have fancy boolean indexing, have to think about it. On the other hand the whole combination buisness (unless oindex) is likely a corner caes of a corner case. |
I agree, there is probably a case for vectorized boolean indexing -- but not for combined integer/boolean indexing or more than one boolean indexer. |
@shoyer, was that a case for a boolean indexing? Something like a @njsmith, I understand your case and would not mind it to work. For the moment we are at least making it an error. Making it work right will be annoying with all the transposing logic (unless you make it "works only if there is no other index" which is may be the only place where it makes a lot of sense). Anyway, I think it might be ugly to get there unfortunately, and I would prefer to delay it for the time being since the time table for that change would be at best in two years.... (though maybe we could mention it) For anyone interested, I added a silly "motivational example" which features an Anna trying to do data analysis ;). You can bash it all you like, honestly! Thought an example like this might make it possible to understand at least the basic ideas for someone who is not me. |
OK, it is way too silly ;). I will reword it some time, but I think the examples itself are useful. |
doc/neps/indexing.rst
Outdated
Again, however, the columns need to be handled more individually (but in | ||
groups), and the ``oindex`` attribute works well:: | ||
|
||
>>> arr.oindex[bad_data, [2, 5]] = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the RHS should be 0
.
Agreed. I don't think this is an essential feature, so I'm fine leaving it out -- there are lots of other ways to do it. |
doc/neps/indexing.rst
Outdated
converted to integer array indices and then broadcast), and hardly | ||
useful. | ||
There is no well defined broadcast for booleans, so that boolean | ||
indices are logically always "``outer``" type indices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't entirely agree with this. Boolean indexing with a >1d indexer that flattens (e.g., like nonzero, x[x > 0]
) is not outer indexing, either, because it does not follow the same rules for preserving the size of dimensions. Unfortunately, this sort of indexing is very popular. I would still allow it without a warning in normal indexing, but not with either oindex
or vindex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean outer indexing with respect to what to do with the "new" dimension. Broadcast it to the rest like vindex or keep it where it was like oindex. They are outer in this sense (maybe should add the example):
arr[arr[:, :, 0, 0] > 1, arr[0, 0, :, :] < 1]
returning a 2D array, not 1D like fancy indexing. Or you mean we should just not allow that? To me at least in oindex it makes sense, vindex is what I am not sure of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should allow it. It breaks the rule that outer indexing is always scalars or 1d arrays. I guess it is still technically one index per axis, but it's not at all what I would expect for outer indexing. In particular, whereas outer indexing with at most 1d arrays can be easily done with labeled array libraries like pandas and xray, this cannot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I never anticipated a limitation to 1d arrays for outer indexing (just not much a difference from my point of view)! Is it really so bad if numpy supports a bit more than others in this regard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we want multidimensional outer indexing. It took me some
thinking to work out what it is, but it does turn out to be pretty simple I
think. It should be fine for labeled arrays too, I think? Each dimension
of the index contributes one dimension to the output, so tick labels at
least are fine. (Axis names of course are another matter -- maybe call them
"{}-{}".format(indexed_axis_name, indexer_axis_name)?) Among other things,
arr[multidimlookup] is a common and non-confusing usage, so according to
the rule that we'd like to be able to deprecate places where getitem and
oindex disagree, we should support it in oindex...
On Sep 27, 2015 2:53 AM, "seberg" notifications@github.com wrote:
In doc/neps/indexing.rst
#6256 (comment):+1. Most new users will usually expect outer indexing (consistent with
- slicing). This is also the most common way of handling this in other
- packages or languages.
+2. The axes introduced by the array indices are at the front, unless- all array indices are consecutive, in which case one can deduce where
- the user "expects" them to be:
- *
arr[:, [0, 1], :, [0, 1]]
will have the first dimension shaped 2.- *
arr[:, [0, 1], [0, 1]]
will have the second dimension shaped 2.
+3. When a boolean array index is mixed with another boolean or integer
- array, the result is very hard to understand (the boolean array is
- converted to integer array indices and then broadcast), and hardly
- useful.
- There is no well defined broadcast for booleans, so that boolean
- indices are logically always "
outer
" type indices.Oh, I never anticipated a limitation to 1d arrays for outer indexing (just
not much a difference from my point of view)! Is it really so bad if numpy
supports a bit more than others in this regard?—
Reply to this email directly or view it on GitHub
https://github.com/numpy/numpy/pull/6256/files#r40503106.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tick labels resulting from indexing are actually taken from the array being indexed rather than the array doing the indexing in xray and pandas. So indexing along multiple dimensions at once means that we need to collapse to something like a multi index / tuples. Likewise, it's really not obvious what to name the new axis name (I can think of several alternatives to your suggestion).
This doesn't mean we can't allow multidimensional indexing with oindex for numpy, but it is something unlikely to ever be supported by xray and pandas (or a hypothetical future version of numpy arrays with axis labels, for that matter). Of course, it's quite straightforward for xray/pandas to simply raise an error in these situations. We already do something similar when we get np.newaxis
, which definitely should be OK to use with oindex.
Does it ever make sense to do multidimensional indexing with anything other than a Boolean array? If not, we should say so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ever? Definitely. Though it might be more from the use case where you want to use the indexing labels for the dimensions more likely than those being indexed.
I.e. the old example look_up_table[image]
.
Another benefit is that it makes implementation of vectorized indexing also much simpler for authors of duck array types. I think this example from dask.array is pretty compelling: https://github.com/blaze/dask/pull/439/files#diff-0dc21516366e88095fc3dbf2e9e62f9aR2358 |
Well, we could also provide code do to the transpose for you possibly. The question is, what is less surprising ;).
which I think has to transpose also because:
transposes for any N-Dimensional array including 0D and scalars and 0D are pretty much identical. |
a937fb1
to
8e53dd9
Compare
I think I will just post this to the list in a day or two (possibly closing this and opening a new issue). If anyone else has time to see if it is not way too hard to understand (which I am sure it is at least to some degree), please tell me :). |
Considering the enthusiastic response on the mailing list ;) (I know, nobody cares for anything that sounds too preliminary...), @njsmith when you have some spare cycles, mind looking over this? You are much better at writing this kind of stuff and maybe I can bring it into a state where people actually comment on ;). |
Note that the current state of gh-6075 should (hopefully bug free) implement the NEP with the exception of deprecating non-obvious advanced indexing from plain indexing. The examples should all run fine. |
OK, going to be a bit pushy, I know you guys probably follow things and have no time ;). but it really is now in a state where feedback and discussion is needed to do anything more on it. Of course I could refactor, but really the only huge pile of work would be the changed necessary because of deprecations and I am not even going to think about doing anything on it until we know relatively clear what the path is. So, I would like to send this out again soon, or lets say before the end of the year, possibly to more than just the numpy list. If you got suggestions where to send it (scipy, xray?, dask?), otherwise helping to polish the NEP up would be appreciated, too. I am sure it can use it, but I am not about to do any larger changes. |
I think this PR is subject to a reverse-bikeshedding effect: It's hard to evaluate such an important set of changes! I was concerned about multidimensional integer arrays as indices, since that's my main use of fancy indexing. In summary, I think it all makes sense, though it took me a little bit to figure out what oindex with multidim indices does. In case my examples are useful to anyone, here's what I tried out: Here's my most common use case of fancy indexing at work: I have a 2d array, which I index with 2 2d int-index arrays (one for each axis), producing a new 2d array with the shape of the index arrays. (Basically a 2d lookup table). >>> eps = np.random.rand(30,4)
>>> seq = np.random.randint(0,4,size=(100,30))
>>> eps.vindex[np.arange(30), seq].shape # this is old behavior
(100, 30)
>>> eps.oindex[np.arange(30), seq].shape # a funny new array
(30, 100, 30) It took me some thought but I get what is going on now in the oindex case, though it's not useful to me here. Second, I would guess a big use of fancy indexing is in lookup tables in image processing. Oindex does what is expected for a simple lut (this is with @seberg's PR): >>> lut = np.random.rand(10,3)
>>> img = np.random.randint(0,10,size=(256,256))
>>> lut[img].shape
(256, 256, 3) Overall this I like the way this looks, although it is going to be a little annoying (but perhaps clearer) to add |
I can speak for xray and probably dask as well on this point (though @mrocklin may also have thoughts to share). I would suggest posting to the scipy-users and pydata lists, though these do have a lot of overlap with numpy-discussion. I can also tweet about this stuff :). In my opinion, these changes will be extremely useful for xray, dask and other projects with similar goals (e.g., pandas):
And of course there are all the advantages these would bring to users even if they only use numpy. To me, these changes seem like a clear win with little downside. I'm happy with the proposed functionality. My only thought is that perhaps we should use a longer, more explicit name for the legacy indexing attribute, e.g., |
ab124c7
to
fd09c61
Compare
I wonder if we can cleanup/simplify boolean indexing as well. I would propose:
I don't think we need a separate "boolean index", because a full boolean array is equivalent to |
@shoyer, yeah, I think you are right. I was trying to be unnecessarily general back then. It is a good idea to start of with a more minimal version. Making it allow more at some point would still be possible in any case... |
just work. Subclasses that *do* provide it must be updated accordingly | ||
and should preferably not subclass working versions of these attributes. | ||
|
||
All subclasses will inherit the attributes, however, it seems possible |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My tendency would be to require that subclasses override the new methods if they need to; I don't like the extra code.
But one possible solution to ensure things "just break" might be to go through the __getitem__
call with an extra keyword set, say index_method='outer|vector|fancy|
- if a subclass cannot take this extra keyword, it will immediately fail. (Obviously, this keyword needs to have a default of None
that can then change from meaning fancy
to outer
).
Indeed, a keyword argument would be super-nice if PEP 472 were accepted...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably, the that was hanging me was that getting masked arrays right was damn annoying, plus we also need to have a way for things like memmap. I am not sure why a kwarg is super nice if that PEP is accepted, we get a mix of different meanings for kwargs, so trying to not use kwargs because of the PEP might be a point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, on closer reading of the PEP, my suggestion would require the kwarg
strategy; https://www.python.org/dev/peps/pep-0472/#strategy-kwargs-argument
Now it doesn't look like the PEP is actually going anywhere...
Anyway, my overall sense is that this should not hold up the NEP, or, indeed, the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially call self.__getitem__(indices, index_method='outer')
directly. We just can't write self[indices, index_method='outer']
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is indeed what I had in mind, and I thought the PEP might eventually make that possible as well, but it turns out that is true only for one of its options. Anyway, at some level that is a separate concern. The main advantage of the index_method
approach is that subclasses would fail automatically (and that one could change the default at some point).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misread the PEP, my gut feeling is, if arr[indices, index_method='outer']
would work, that would not be nice. Unless you want to use it instead of the arr.oindex[indices]
syntax maybe. Simply because index_method='outer'
could also mean get me "outer" from the axis "index_method"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would preclude having an axis called index_method
!
Anyway, right now we cannot use keywords inside [...]
at all, so that part is not so relevant. The main question is whether it is worth solving the subclass issue by having __getitem__
acquire an extra argument that we use internally (and oindex
and vindex
just being shortcuts to it).
I should add that I don't really have an opinion one way or another; I do think it is fine to expect subclasses to pay attention and adjust to new ndarray
features (we just have to try our best not to break existing features).
(2, 5, 8) | ||
>>> arr.vindex[:, [0], :, [0, 1]].shape | ||
(2, 5, 7) | ||
>>> arr.vindex[:, [0], 0, :].shape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I echo @jaimefrio's comment from way back that having [0]
remove a dimension and 0
keep it is extremely confusing. But it may just mean I still do not understand fancy indexing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Frankly, I am a bit confused what you mean :). Any index "removes" the dimension in some sense (for vindex, oindex in some sense replaces it), the question is just where to put the new dims. Jaime had a comment suggesting to maybe keep the axis swapping in a less confusing manner then it used to exist.
Right now, I disagree with this. vindex
should be as consistent as possible, there is nothing special enough about [0]
compared to [0, 1]
. There might be a "less confusing then currently fancy indexing like transposing" thing, but I personally believe you can use oindex then, it already does what you want, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just don't understand why here a leading 1
appears. But as said, my understanding of fancy indexing is quite hopeless. I do agree with the always-transpose-to-the-front rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the thing is, there is a 1 sized dimension created by the [1]
so it has to go somewhere? And in vindex that somewhere is the front?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is logical, but then why is it not happening in the examples right above? Or are those mistakes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, its vindex, not oindex. All indexes (except slices I guess) are broadcast together for the result (just like legacy fancy indexing) and iterated together. So the extra dimension at the start has size 2 because it has the size of [0]
and [0, 1]
broadcast together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A, see, I knew I was missing something. Though this is particularly silly. Sorry for the noise.
p.s. I wish we could just change |
I dunno if that is in there, but I think we can add that as a Further steps thing or so. PS: I will not do anything here implementation wise at least for a few months, was just procrastinating... |
Not sure why the build fails to find cython, but probably should be rebased off master anyway. Would be nice to merge this so it doesn't suffer bitrot |
Rebased on master, not that I think it matters, probably just some fluke. I did not yet go through all of it to see it suffered bitrot anyway, and Stephan wants to change a few things? But could also squash it, we just merge it and then he/anyone else can update it too. |
I think the solution described in the NEP of making
ndarray.vindex.__getitem__ check if type(self).__getitem__ is
ndarray.__gettiem__ would suffice. Basically, you will need to implement
vindex and oindex yourself if you override __getitem__.
…On Wed, Jun 20, 2018 at 1:18 PM Marten van Kerkwijk < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In doc/neps/nep-0021-advanced-indexing.rst
<#6256 (comment)>:
> +**Orthogonal** oindex oix
+**Vectorized** vindex vix
+**Legacy** l/findex legacy_index
+============== ======== ======= ============
+
+
+Subclasses
+~~~~~~~~~~
+
+Subclasses are a bit problematic in the light of these changes. There are
+some possible solutions for this. For most subclasses (those which do not
+provide ``__getitem__`` or ``__setitem__``) the special attributes should
+just work. Subclasses that *do* provide it must be updated accordingly
+and should preferably not subclass working versions of these attributes.
+
+All subclasses will inherit the attributes, however, it seems possible
Yes, it would preclude having an axis called index_method!
Anyway, right now we cannot use keywords inside [...] at all, so that
part is not so relevant. The main question is whether it is worth solving
the subclass issue by having __getitem__ acquire an extra argument that
we use internally (and oindex and vindex just being shortcuts to it).
I should add that I don't really have an opinion one way or another; I do
think it is fine to expect subclasses to pay attention and adjust to new
ndarray features (we just have to try our best not to break existing
features).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#6256 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1jBxLZv4ltV_4qDUmbfwhm5BZF_Nks5t-q3-gaJpZM4FzUYz>
.
|
@shoyer any reason not to merge this as a draft? |
Indeed, let's merge this. I'll revise it in a follow-on PR. |
👍 |
squashed commits |
See #11414 for my follow-on edits. |
NOTE: A working (but not final) implementation of oindex can be found in PR #6075.
This is meant for discussion more then actually adding it at this point. I will wait a bit longer with inviting the mailing list (I think it is maybe better for the release and governance to settle), but if anyone has more interest, this may be easier to comment on, at least for the actual text. There are still spelling mistakes, etc. but I think it is likely complete enough for some comments for anyone more seriously interested. Or for discussions about some of the less clear points for that matter.
Direct link to rendered version: https://github.com/numpy/numpy/pull/6256/files?short_path=01e4dd9 (can also be reached by clicking on files changed -> Display rich diff icon top right)