All-lazy statistics #2418

Closed
7 tasks
pp-mo opened this issue Mar 7, 2017 · 12 comments

Comments

pp-mo (Member) commented Mar 7, 2017

Ideally, make all stats calculate via dask, instead of requiring an alternative 'real data' algorithm.

Wishlist:

  • retain support for non-lazy user functions: automatically(?) convert them to lazy? (see the sketch after this list)
  • when stat is computed, don't realise the source cube
  • mostly about Aggregators, but also ..
    • consider 'aggregated_by' as well as 'collapse' ..
    • .. ? and rolling_window ?
    • N.B. collapse, aggregated_by + rolling_window are the only existing stats methods
  • (vague objective) make user-defined custom aggregators as simple as possible
  • simplify mechanism of Aggregator creation so only one calculation function is supplied (lazy or not)
  • (internal code hygiene) simplify the cube / aggregator contract: as far as possible, move detail away from cube code into generic Aggregator code
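
As a rough illustration of the first wishlist item (automatically converting a non-lazy user function into a lazy calculation), something along these lines could work. This is only a sketch, not Iris code; the helper name `as_lazy_aggregation` is invented for the example.

```python
import dask.array as da
import numpy as np


def as_lazy_aggregation(numpy_func, lazy_data, axis):
    """Apply a plain NumPy reduction lazily to a dask array."""
    # Put the collapsed axis into a single chunk so that every block sees
    # the full extent of that axis and the NumPy function stays exact.
    rechunked = lazy_data.rechunk({axis: lazy_data.shape[axis]})
    return da.map_blocks(
        numpy_func,
        rechunked,
        axis=axis,        # forwarded to numpy_func
        drop_axis=axis,   # the output blocks lose the collapsed axis
        dtype=rechunked.dtype,
    )


# Example: lazily collapse axis 0 with np.median, which has no simple
# single-pass out-of-core equivalent.
lazy_data = da.random.random((100, 200), chunks=(25, 50))
result = as_lazy_aggregation(np.median, lazy_data, axis=0)
print(result)            # still a lazy dask array; nothing computed yet
print(result.compute())  # realised only on request
```

The rechunk step is what keeps a plain NumPy reduction exact: each block spans the whole collapsed axis, while the other axes stay chunked and lazy.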
pp-mo (Member, Author) commented Mar 7, 2017

Providing a Lazy-Only-Aggregator is really pretty easy.
E.g. some quick proof-of-principle example code here:
https://gist.github.com/pp-mo/12f97ec50ebc3e6f5c90cbf6d271d41c

However, the way the real/lazy split is handled explicitly in existing cube operations is currently rather messy, and properly simplifying all that might be rather more work.

N.B. there are just 3 lazy cube-stats operations at present:
collapsed, aggregated_by and rolling_window.
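
For readers without access to the gist, a lazy-only aggregator in that spirit might look roughly like the sketch below. This is illustrative only, not the gist's actual code; the `LazyAggregator` name is made up here.

```python
import dask.array as da


class LazyAggregator:
    """Aggregator defined purely in terms of a dask reduction function."""

    def __init__(self, name, lazy_func):
        self.name = name
        self.lazy_func = lazy_func  # e.g. dask.array.mean

    def lazy_aggregate(self, lazy_data, axis, **kwargs):
        # Operates on and returns dask arrays: nothing is computed here.
        return self.lazy_func(lazy_data, axis=axis, **kwargs)


LAZY_MEAN = LazyAggregator("mean", da.mean)

lazy_data = da.ones((4, 6), chunks=(2, 3))
collapsed = LAZY_MEAN.lazy_aggregate(lazy_data, axis=0)
print(collapsed.compute())  # [1. 1. 1. 1. 1. 1.]
```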

DPeterK changed the title from "All-lazy statistics" to "All-lazy statistics (5 man days' effort)" Apr 12, 2017
lbdreyer (Member) commented Jul 4, 2017

@pp-mo surely this has been completed by now?

pp-mo (Member, Author) commented Jul 4, 2017

surely this has been completed by now?

Absolutely not: it would ideally mean that Aggregators have only an aggregate method and not an alternative lazy_aggregate, or at least that the "main" one should call the "lazy" one.
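
The "main calls lazy" arrangement could look something like the following sketch. It is illustrative only: the class is not Iris's Aggregator, although the aggregate / lazy_aggregate names mirror it.

```python
import dask.array as da
import numpy as np


class DelegatingAggregator:
    """Sketch: the real-data path simply delegates to the lazy path."""

    def __init__(self, lazy_func):
        self.lazy_func = lazy_func  # a dask reduction, e.g. dask.array.sum

    def lazy_aggregate(self, lazy_data, axis, **kwargs):
        return self.lazy_func(lazy_data, axis=axis, **kwargs)

    def aggregate(self, data, axis, **kwargs):
        # Wrap the real data as a single-chunk dask array, run the one lazy
        # calculation, then realise the result immediately.
        lazy_data = da.from_array(data, chunks=data.shape)
        return self.lazy_aggregate(lazy_data, axis=axis, **kwargs).compute()


agg = DelegatingAggregator(da.sum)
print(agg.aggregate(np.arange(6).reshape(2, 3), axis=1))  # [ 3 12]
```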

pelson changed the title from "All-lazy statistics (5 man days' effort)" to "All-lazy statistics" Oct 27, 2017
pelson (Member) commented Oct 27, 2017

I believe I'm up for doing this, but it isn't clear which parts remain. @pp-mo - could you produce a list of the things that, when done, would allow us to close this issue?

rcomer (Member) commented Nov 24, 2017

If this were done, would the operators still accept user-defined non-lazy aggregators?

pelson (Member) commented Jan 3, 2018

If this were done, would the operators still accept user-defined non-lazy aggregators?

It would be relatively easy to turn ufunc-type aggregators into lazy ones if it weren't done out of the box (though I'd expect to be able to pass non-lazy aggregators and have iris defer the call to the aggregator until the correct point). Do you have concerns about aggregators similar to median, which cannot be computed out of core in a single pass?
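
As a concrete illustration of the distinction (a sketch using dask's generic da.reduction, not anything Iris-specific): an associative statistic such as sum reduces each chunk and then combines the partial results, which is exactly the step that cannot be done exactly for a median.

```python
import dask.array as da
import numpy as np


def _chunk_sum(block, axis, keepdims, **kwargs):
    # Reduce one block; extra keywords from dask are ignored.
    return np.sum(block, axis=axis, keepdims=keepdims)


def _combine_sums(partials, axis, keepdims, **kwargs):
    # Partial sums can simply be summed again; this combinability is what
    # makes the reduction work out of core. Partial *medians* cannot be
    # combined exactly, which is the difficulty with median-like statistics.
    return np.sum(partials, axis=axis, keepdims=keepdims)


def lazy_sum(lazy_data, axis):
    return da.reduction(
        lazy_data, _chunk_sum, _combine_sums, axis=axis, dtype=lazy_data.dtype
    )


x = da.from_array(np.arange(12).reshape(3, 4), chunks=(1, 4))
print(lazy_sum(x, axis=0).compute())  # [12 15 18 21]
```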

rcomer (Member) commented Jan 3, 2018

It's not something I've done recently, but I have written my own aggregators in the past. The most obvious example is iris.analysis.WPERCENTILE, which I originally wrote for use in my team. Here is the underlying numpy function. Would it need rewriting?

Even for the most straightforward cases, where a Dask equivalent to a numpy/scipy function exists, there is a convenience in just grabbing the numpy/scipy version and creating an aggregator from it. If a user has a relatively small data set, learning about Dask and its functions seems like unnecessary overhead.

pelson (Member) commented Jan 3, 2018

Agreed. Ideally we would be able to take any aggregator and make it lazy (though not out of core / parallelised).
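
One sketch of that idea (illustrative only; the helper name is invented): wrap the whole calculation in dask.delayed, so the aggregator is deferred but still realises the full source array in one go when the result is computed.

```python
import dask
import dask.array as da
import numpy as np


def lazy_whole_array_aggregation(numpy_func, lazy_data, axis):
    """Defer an arbitrary NumPy-based statistic until compute time.

    Lazy, but neither chunked nor parallelised: the whole source array is
    realised in one go when the result is finally computed.
    """
    delayed_result = dask.delayed(numpy_func)(lazy_data, axis=axis)
    out_shape = tuple(n for i, n in enumerate(lazy_data.shape) if i != axis)
    # Assumes the statistic preserves the source dtype; a real implementation
    # would need to work the result dtype out properly.
    return da.from_delayed(delayed_result, shape=out_shape, dtype=lazy_data.dtype)


data = da.from_array(np.arange(24.0).reshape(4, 6), chunks=(2, 3))
result = lazy_whole_array_aggregation(np.median, data, axis=0)
print(result)            # still lazy
print(result.compute())  # [ 9. 10. 11. 12. 13. 14.]
```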

pp-mo (Member, Author) commented Oct 9, 2018

Added a summary list of things to do (edited into the top description box).
This also refers to 'aggregated_by' -- previously also discussed under #2928

rcomer (Member) commented Sep 28, 2020

I believe implementing this would fix #3190.

github-actions bot (Contributor) commented

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days' time.

github-actions bot added the Stale (a stale issue/pull-request) label Feb 11, 2022
github-actions bot (Contributor) commented

This stale issue has been automatically closed due to a lack of community activity.

If you still care about this issue, then please either:

  • Re-open this issue, if you have sufficient permissions, or
  • Add a comment pinging @SciTools/iris-devs who will re-open on your behalf.
