Add RangeIndex #10076

Merged
merged 27 commits into from
Apr 18, 2025

Conversation

benbovy
Member

@benbovy benbovy commented Feb 25, 2025

Ready for review (copied and adapted the example from #9543 (comment)).

benbovy added 9 commits March 20, 2025 09:27
- Use start, stop, step terms

- Make RangeIndex.__init__ private and more flexible, add
  RangeIndex.arange and RangeIndex.linspace public factories

- General support of RangeIndex slicing

- RangeIndex.isel with arbitrary 1D values: convert to PandasIndex

- Add RangeIndex.to_pandas_index
... when check_default_indexes=False.
@benbovy
Member Author

benbovy commented Mar 21, 2025

I've made further progress on this. Some design questions (thoughts welcome!):

Create a new RangeIndex

  • I'm not sure yet about the public API. Currently RangeIndex.__init__ is "private" (more flexible and easier for internals) and there are two public factories, RangeIndex.arange and RangeIndex.linspace, inspired by the NumPy API. Creating a new dataset with a range index would look like:
import xarray as xr
from xarray.indexes import RangeIndex

index = RangeIndex.arange("x", "x", 0.0, 1.0, 0.1)
ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index))
<xarray.Dataset> Size: 80B
Dimensions:  (x: 10)
Coordinates:
  * x        (x) float64 80B 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Data variables:
    *empty*
Indexes:
    x        RangeIndex
  • RangeIndex doesn't support set_xindex. Do we want to support it? If yes, what would the input coordinate look like? An existing range with explicit values from which RangeIndex would try to infer a constant step value? Is that useful to have, given that the point of RangeIndex is to avoid materializing coordinate values in memory? Or a 1D coordinate with three values representing start, stop and step? Any other alternative?
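
To make the "infer a constant step value" option concrete, here is a minimal pure-Python sketch. The helper name `infer_range` and its tolerance parameter are hypothetical and not part of Xarray; it only illustrates what such an inference could involve, namely that every value has to be materialized and checked once.

```python
import math


def infer_range(values, rel_tol=1e-9):
    """Return (start, stop, step) if `values` are evenly spaced, else raise.

    Hypothetical helper for illustration only; not an Xarray API.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to infer a step")
    start = values[0]
    # Average spacing between the first and last value.
    step = (values[-1] - values[0]) / (len(values) - 1)
    if step == 0:
        raise ValueError("cannot infer a range from constant values")
    for i, value in enumerate(values):
        expected = start + i * step
        if not math.isclose(value, expected, rel_tol=rel_tol, abs_tol=abs(step) * rel_tol):
            raise ValueError(f"values are not evenly spaced at position {i}")
    # Half-open interval [start, stop): stop lies one step past the last value.
    stop = start + len(values) * step
    return start, stop, step
```

For example, `infer_range([0.0, 0.5, 1.0, 1.5])` returns `(0.0, 2.0, 0.5)`, while `[0.0, 0.1, 0.3]` is rejected.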

Index import

Should we expose all public built-in Xarray indexes at the top level? Or only at the xarray.indexes level?

Currently the Index base class and CFTimeIndex (not an Xarray index, but it could eventually be refactored into one) are exposed at the top level, while PandasIndex, PandasMultiIndex and RangeIndex (this PR) are only exposed at the xarray.indexes level. We might want to make that consistent.

@benbovy
Member Author

benbovy commented Mar 21, 2025

Note: this Xarray RangeIndex is designed for floating-point value ranges. For integer ranges it is probably best to use a PandasIndex wrapping a pandas.RangeIndex. I added a note in the docstrings here. More work on the documentation is needed, but probably in a later PR addressing Xarray indexes in general.
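
As a rough illustration of why a floating-point range does not need its values stored, the toy class below keeps only start, stop and step and derives everything else arithmetically. `LazyRange` and its methods are hypothetical names for this sketch (it assumes `step > 0`) and do not reflect how this PR implements RangeIndex.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyRange:
    """Toy stand-in for a range over [start, stop); not Xarray's class."""

    start: float
    stop: float
    step: float  # assumed to be strictly positive in this sketch

    @property
    def size(self) -> int:
        # Same length rule as documented for numpy.arange.
        return max(0, math.ceil((self.stop - self.start) / self.step))

    def value_at(self, position: int) -> float:
        # Coordinate values are computed on demand, never stored.
        if not 0 <= position < self.size:
            raise IndexError(position)
        return self.start + position * self.step

    def nearest_position(self, label: float) -> int:
        # Label-based lookup is plain arithmetic, no search needed.
        position = round((label - self.start) / self.step)
        if not 0 <= position < self.size:
            raise KeyError(label)
        return position
```

With `LazyRange(0.0, 1.0, 0.1)`, the size is 10 and the label 0.32 maps to position 3, without ever building the 10-element array.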

@benbovy benbovy marked this pull request as ready for review March 21, 2025 11:55
dim : str
Dimension name.
start : float, optional
Start of interval (default: 0.0). The interval includes this value.
Contributor

Could consider adding a closed kwarg like pd.Interval, but in a future PR of course.

"`Coordinates.from_xindex()`"
)

@property
Contributor

Can these all be cached_property?

Member Author

Would there be much benefit in caching those simple aliases to attributes of the underlying transform?
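
A small sketch of the trade-off being discussed, using hypothetical stand-in classes (`Transform` and `IndexWithAliases` are illustrative names, and the `step` formula is an assumption made for the example): a plain `property` that forwards to an attribute does almost no work, whereas `functools.cached_property` mainly pays off for derived values.

```python
from functools import cached_property


class Transform:
    """Hypothetical stand-in for the object holding the range parameters."""

    def __init__(self, start, stop, size):
        self.start = start
        self.stop = stop
        self.size = size


class IndexWithAliases:
    def __init__(self, transform):
        self.transform = transform

    @property
    def start(self):
        # Plain alias: a single attribute lookup per access, nothing to cache.
        return self.transform.start

    @cached_property
    def step(self):
        # Derived value: computed on first access, then stored on the instance.
        return (self.transform.stop - self.transform.start) / self.transform.size
```

One thing to keep in mind with the cached variant is that the stored value would go stale if the underlying transform were ever mutated.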

dtype : dtype, optional
The dtype of the coordinate variable (default: float64).

Examples
Collaborator

Suggested change
Examples
Note that all `start`, `stop` & `step` must be passed, which is more explicit than `np.arange` or `range`
Examples

(optional, no strong view)

Member Author

Note that all start, stop & step must be passed

This isn't exactly true, but yes the API here is more explicit than np.arange and range, e.g., RangeIndex.arange(10.0) means start=10 while np.arange(10.0) means stop=10.

RangeIndex.arange(10.0) doesn't make much sense, though, considering the default value of stop=1.0. I'll see if we can get closer to np.arange using typing.overload.

Collaborator

RangeIndex.arange(10.0) doesn't make much sense, though, considering the default value of stop=1.0. I'll see if we can get closer to np.arange using typing.overload.

yeah. no objection to the more explicit approach — it's useful-but-a-bit-magic that arange / range changes the meaning of the first arg based on how many are supplied

Member Author

Mimicking numpy.arange behavior is surprisingly difficult! (At least for me; I've been struggling with this.)

I got close with some simple logic, but then I hit the same issue as numpy/numpy#17878 (i.e., RangeIndex.arange(start=10) returns a range in the [0, 10) interval, which makes no sense). I could fix it after some heavy refactoring, but that makes the code / API ugly to the point where I'm not sure I want to push it here :).

Numpy relies on the Python C API PyArg_ParseTupleAndKeywords(), but AFAIK there seems to be no easy way to know from Python whether a value has been passed as a positional or keyword argument.
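
For reference, a variadic signature can recover that information in pure Python, at the cost of hiding the parameter names from the signature (which may be part of the "ugly" trade-off mentioned above). The function below is a hypothetical sketch, not code from this PR: a lone positional value is read as `stop`, while `start=10` on its own is rejected.

```python
def arange_bounds(*args, **kwargs):
    """Return (start, stop, step); hypothetical sketch, not an Xarray API."""
    names = ("start", "stop", "step")
    if len(args) > len(names):
        raise TypeError("at most three positional arguments are accepted")
    unknown = set(kwargs) - set(names)
    if unknown:
        raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
    if len(args) == 1 and "stop" not in kwargs:
        # A lone positional value is read as `stop`, like range(10).
        params = {"stop": args[0]}
    else:
        # Otherwise positional values map to start, stop, step in order.
        params = dict(zip(names, args))
    duplicated = set(params) & set(kwargs)
    if duplicated:
        raise TypeError(f"got multiple values for: {sorted(duplicated)}")
    params.update(kwargs)
    if "stop" not in params:
        # e.g. arange_bounds(start=10): refuse rather than return [0, 10).
        raise TypeError("stop is required")
    return params.get("start", 0.0), params["stop"], params.get("step", 1.0)
```

Here `arange_bounds(10.0)` gives `(0.0, 10.0, 1.0)` and `arange_bounds(1.0, stop=5.0)` gives `(1.0, 5.0, 1.0)`, whereas `arange_bounds(start=10)` raises `TypeError`.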

Collaborator

!!

I think the explicit approach is very valid, but maybe we just call it out / ensure people need to pass kwargs where it could be confusing

Member Author

In 4a128d0 I've chosen to mimic the numpy and pandas APIs anyway, with some simple logic and the caveat above clearly documented.

I find myself (likely others too) doing range(10) or np.arange(10.0) so many times while I doubt many will write something like np.arange(start=10).

pandas.RangeIndex(start=10) actually still returns RangeIndex(start=0, stop=10, step=1) and I haven't seen anyone complaining in the pandas issues (or I missed it).
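
The "simple logic" approach and its caveat can be sketched as follows. `resolve_range_args` is a hypothetical helper written for this illustration (assuming defaults of 0.0 for start and 1.0 for step); it is not copied from the commit above.

```python
def resolve_range_args(start=None, stop=None, step=None):
    """Resolve (start, stop, step) the way a lone first argument is usually read.

    Hypothetical sketch; with a fixed signature, positional and keyword
    arguments cannot be told apart.
    """
    if start is None and stop is None:
        raise TypeError("at least one of start or stop is required")
    if stop is None:
        # Only one bound was given: treat it as `stop`, whether it was
        # passed positionally or as start=...
        start, stop = 0.0, start
    elif start is None:
        start = 0.0
    if step is None:
        step = 1.0
    return float(start), float(stop), float(step)
```

Both `resolve_range_args(10.0)` and `resolve_range_args(start=10)` return `(0.0, 10.0, 1.0)`, which is exactly the documented caveat.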

@wpbonelli
Contributor

Hope it's alright to chime in here.

RangeIndex doesn't support set_xindex. Do we want to support it? If yes, what would the input coordinate look like? An existing range with explicit values from which RangeIndex would try to infer a constant step value? Is that useful to have, given that the point of RangeIndex is to avoid materializing coordinate values in memory? Or a 1D coordinate with three values representing start, stop and step? Any other alternative?

One use case for set_xindex (i.e. situation where a materialized coordinate pre-exists an index) could be to "alias" a dimension coordinate?

E.g. say you have a regular grid with x/y/z dims/coords, and you want to index not only positionally but explicitly with e.g. i/j/k or row/col/lay. You could write an integer range index which accepts those names for the dimension lookup. Is that reasonable? Maybe I'm thinking about this wrong. I asked a very beginner question along these lines a few weeks ago, still trying to wrap my head around indexing.

(I understand for integer ranges one wants a PandasIndex wrapping RangeIndex — is there an example of that anywhere?)

@benbovy
Member Author

benbovy commented Apr 16, 2025

@wpbonelli I don't think this is well documented, but you could do:

import numpy as np
import pandas as pd
import xarray as xr

da = xr.DataArray(np.zeros((5, 10)), dims=("rows", "cols"))

da.coords["r"] = ("rows", pd.RangeIndex(da.sizes["rows"]))
da = da.set_xindex("r")

da
# <xarray.DataArray (rows: 5, cols: 10)> Size: 400B
# array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
# Coordinates:
#   * r        (rows) int64 40B 0 1 2 3 4
# Dimensions without coordinates: rows, cols

da.xindexes["r"]
# PandasIndex(RangeIndex(start=0, stop=5, step=1, name='r'))

This relies on the fact that Xarray internally keeps track of the pandas.Index object when wrapped as variable data, though.

A more explicit way of achieving the same result:

from xarray.indexes import PandasIndex

r_index = PandasIndex(pd.RangeIndex(da.sizes["rows"], name="r"), dim="rows")
da = da.assign_coords(xr.Coordinates.from_xindex(r_index))

With one caveat (also in pandas): `RangeIndex.arange(4.0)` creates an
index within the range [0.0, 4.0) (`start` is interpreted as `stop`).
This caveat is documented.
@benbovy
Member Author

benbovy commented Apr 16, 2025

This is ready for another round of review! I don't think that CI failures are related to anything in this PR.

@wpbonelli
Contributor

@benbovy thanks! sorry to hijack the thread.

@dcherian dcherian added the plan to merge Final call for comments label Apr 16, 2025
@dcherian dcherian enabled auto-merge (squash) April 18, 2025 00:53
@dcherian dcherian disabled auto-merge April 18, 2025 04:30
@dcherian dcherian merged commit 3816901 into pydata:main Apr 18, 2025
30 of 32 checks passed
@github-project-automation github-project-automation bot moved this from In progress to Done in Explicit Indexes Apr 18, 2025
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Regular (linspace) Coordinates/Index
5 participants