Skip to content

Avoid copying vectorized indexes #10316

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 27, 2025
Merged

Avoid copying vectorized indexes #10316

merged 6 commits into from
May 27, 2025

Conversation

jder
Copy link
Contributor

@jder jder commented May 13, 2025

While investigating #10311, I noticed that the large memory usage described there was exacerbated by duplicating the passed indexes in VectorizedIndexer.__init__. While this could be a defensive copy to avoid someone else mutating the passed ndarrays, looking at the PR it was introduced I would guess this was an unintentional regression as part of supporting duck arrays, since asarray defaults to not copying the data and astype defaults to copying the data, but cc @dcherian in case that's not true.

Also included in this PR is increasing the size of the benchmark to show the improvement more clearly. (On my machine, it's a ~20% time savings.) Happy to break this out or remove it if you'd prefer to leave the benchmarks as is.

  • Tests added

Copy link

welcome bot commented May 13, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@dcherian
Copy link
Contributor

dcherian commented May 13, 2025

Very nice! I have two minor requests:

  1. Can you use duck_array_ops.astype instead?
  2. There is another astype in OuterIndexer that could be updated.

Can you also add a note in whats-new.rst under "Performance" please

@jder jder force-pushed the no-copy-indexes branch from 337f8d6 to 498188d Compare May 14, 2025 14:14
@jder
Copy link
Contributor Author

jder commented May 14, 2025

OK, @dcherian I believe this is RFAL. The test failures seem to be because dask released a new version yesterday -- for the ones I spot checked, I can reproduce them locally with dask==2025.5.0 but not dask==2025.04.1.

@jder jder force-pushed the no-copy-indexes branch from 498188d to 0af50b5 Compare May 21, 2025 17:24
@dcherian dcherian enabled auto-merge (squash) May 27, 2025 17:31
@dcherian dcherian merged commit fad1185 into pydata:main May 27, 2025
31 checks passed
Copy link

welcome bot commented May 27, 2025

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

@dcherian
Copy link
Contributor

Thanks for tracking this down @jder! I know this is not an easy part of the codebase to edit :)

@jder
Copy link
Contributor Author

jder commented May 27, 2025

@dcherian Thanks for the quick review, feedback, and landing it!

- Lazily indexed arrays now use less memory to store keys by avoiding copies
in :py:class:`~xarray.indexing.VectorizedIndexer` and :py:class:`~xarray.indexing.OuterIndexer`
(:issue:`10316`).
By `Jesse Rusak <https://github.com/jder>`_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants