Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to validation checks #189

Merged
merged 4 commits into from May 3, 2022
Merged

Improvements to validation checks #189

merged 4 commits into from May 3, 2022

Conversation

groutr
Copy link
Contributor

@groutr groutr commented Apr 18, 2022

Some minor improvements to _validation.py that center around the usage of np.asarray. Using np.asarray over np.array can avoid unnecessary copies (ie, when the input is already an array).

When comparing the shapes, we don't need to hold all the arrays in memory.

* Use asarray to avoid unnecessary copies of arrays
* Lazily convert to array to conserve memory
* Compare shapes directly
If we have a Series, we can avoid expensive allocation to ndarray.
@aaraney
Copy link
Member

aaraney commented Apr 22, 2022

@groutr, thanks for the contribution!

I am not sure the _array_attr function is necessary. np.asarray does not copy if it receives an input np.array. I think it would be more clear to just cast to np.asarray and check for the property in places you are proposing to call this method.

@groutr
Copy link
Contributor Author

groutr commented Apr 22, 2022

@aaraney _array_attr is there because these validation functions are most likely to receive pandas objects, not ndarrays. We are hoping to take advantage of pandas exposing the array properties without the unnecessary cast to ndarray. With _array_attr, checking the attribute is an order of magnitude faster.

# aa = pd.Series(range(1000))
In [93]: %timeit _array_attr(aa, "ndim").ndim
425 ns ± 41.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [94]: %timeit np.asarray(aa).ndim
4.45 µs ± 213 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@aaraney
Copy link
Member

aaraney commented Apr 22, 2022

Ah, I see. That makes a lot of sense. Thanks for clarifying and providing the above example, @groutr.

My only request would be adding some type hints to the function declaration. The return type hint might be difficult to nail down for all cases, so I think the straight path of npt.ArrayLike should be fine unless you have a better suggestion.

@jarq6c jarq6c self-requested a review April 22, 2022 21:01
@jarq6c jarq6c added the enhancement New feature or request label Apr 22, 2022
Copy link
Collaborator

@jarq6c jarq6c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think we just want to bump the patch version in python/metrics/src/hydrotools/metrics/_version.py to 1.2.2

I tested these changes locally and it is indeed faster. Thanks for the contribution!

@aaraney
Copy link
Member

aaraney commented Apr 25, 2022

Thanks again, @groutr!

@jarq6c jarq6c merged commit 312a90b into NOAA-OWP:main May 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants