Enable custom statistics to return multiple results #3904

berndbecker · 2020-10-07T14:42:14Z

✨ Feature Request

Allow a custom statistic to return a tuple rather than a scalar.
The goal: at each grid point, store a vector of threshold exceedance counts for increasing durations
in lieu of the time dimension (the vector is much shorter).

In the example

https://scitools-iris.readthedocs.io/en/stable/generated/gallery/general/plot_custom_aggregation.html
a single number is returned at each gridpoint. I am after functionality that returns more than one value for each grid point.

Motivation

Not sure if this is an issue, but I have colleagues who calculated threshold exceedance durations at great pains. Feedback on my request from an AVD surgery also pointed to heightened frustration at how complicated "this" is. By "this" I mean
doing something on a time series stored at each grid point (3-D cube) and retaining a set of numbers rather than collapsing the time dimension to just one number (max, min, mean).

I'm always frustrated when something is almost doable but does not quite work and
you have to go all the way back and do it with a sledgehammer.

Additional context


I need a push to understand custom statistics better.

In the attached example
(run with module load scitools/experimental-current,
python /net/home/h02/frtm/prog/wcssp/wcssp5/scripts/ts_exceedance.py)

I am compiling a threshold exceedance duration, or survival function,
for rainfall time series: how many rainy periods lasted longer than 1, 2, ..., 5 days, and so on.
This works as a demonstrator on a single time series.

Next I would like to run the same custom statistic at each grid point as in
https://scitools.org.uk/iris/docs/latest/examples/General/custom_aggregation.html#general-custom-aggregation

But I struggle to understand the shape of the data being passed to the aggregator: what should axis be?
And I have no idea how to store the survivors vector over the time series dimension.

But I am convinced it is not really that difficult.
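A minimal sketch of the calculation in plain NumPy, to make the shape question concrete. It assumes the function receives the full (time, latitude, longitude) array plus an `axis` argument naming the dimension to reduce, and that "duration" means a run of consecutive steps above the threshold. The name `exceedance_survival` and its parameters are hypothetical, not part of Iris, and it counts runs lasting at least `d` steps rather than strictly longer.

```python
import numpy as np


def exceedance_survival(data, threshold, durations, axis=0):
    """Count, per grid point, the runs above `threshold` lasting >= d steps.

    Hypothetical helper: returns an array with a leading "duration"
    dimension in place of the reduced `axis`.
    """
    # Move the axis being reduced (e.g. time) to the front.
    wet = np.moveaxis(np.asarray(data) > threshold, axis, 0)
    durations = np.asarray(durations)
    spatial_shape = wet.shape[1:]

    run = np.zeros(spatial_shape, dtype=int)  # current run length per point
    counts = np.zeros((durations.size,) + spatial_shape, dtype=int)
    # Reshape durations so they broadcast against the spatial dimensions.
    d = durations.reshape((-1,) + (1,) * len(spatial_shape))

    # Append one all-dry step so runs touching the end of the series close.
    closing = np.zeros((1,) + spatial_shape, dtype=bool)
    for step in np.concatenate([wet, closing], axis=0):
        ended = ~step & (run > 0)      # a run finished at this step
        counts += ended & (run >= d)   # credit every duration it reached
        run = np.where(step, run + 1, 0)
    return counts


# One rainy grid point with runs of length 2, 3 and 1; all others are dry.
series = np.array([0, 5, 5, 0, 5, 5, 5, 0, 5], dtype=float)
rain = np.zeros((series.size, 2, 2))
rain[:, 0, 0] = series
survival = exceedance_survival(rain, threshold=1.0, durations=[1, 2, 3])
print(survival.shape)     # (3, 2, 2): (durations, latitude, longitude)
print(survival[:, 0, 0])  # [3 2 1]
```

The loop runs over time only, so each step is vectorised over all grid points. The result has shape (durations, latitude, longitude), which an aggregator that reduces to a single value per point cannot hold.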


rcomer · 2020-10-07T14:54:37Z

Possibly related: #3810 #3331

rcomer · 2020-10-08T17:12:58Z

So, if I've understood, you start with a cube that is (time, latitude, longitude), and you want to end up with a cube that is (durations, latitude, longitude), having done your calculation over time at each grid point. The problem is that the standard iris Aggregator class is designed to reduce the dimensionality down to just (latitude, longitude) when used with collapsed.

We do have the PercentileAggregator class, which has the capacity to add a "percent" dimension if you want to calculate more than one percentile. So we know that it is possible to add dimensions. That class is hard-coded to calculate percentiles though so, if you wanted to make use of it to calculate some other dimension-adding statistic, I think you'd need to subclass it. It also isn't even listed in the docs.
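To illustrate what "adding a dimension" means, here is a plain NumPy sketch (not the Iris implementation): asking `np.percentile` for one percentile collapses the time axis, while asking for several replaces it with a new leading axis. The array sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((30, 4, 5))  # assumed layout: (time, latitude, longitude)

# One percentile: the time axis disappears.
single = np.percentile(data, 50, axis=0)

# Several percentiles: the time axis is replaced by a "percent" axis.
several = np.percentile(data, [10, 50, 90], axis=0)

print(single.shape)   # (4, 5)
print(several.shape)  # (3, 4, 5)
```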

So possibly what we need here is to generalise PercentileAggregator into a class that could create dimension-adding aggregators based on user-defined functions.

rcomer · 2020-10-08T18:43:59Z

Having said that, this particular statistic presumably needs information from the time coordinate. I think all the existing aggregation calculations only use the cube data. 🤔

berndbecker · 2020-10-09T09:34:20Z

The threshold exceedance duration can do without information from the time coordinate for the time being. The PercentileAggregator would deliver what I expected, for starters, to be an easy operation. For a later generalization, a more complex combination of metadata is a possibility, but that can wait.

berndbecker · 2020-10-21T10:23:48Z

Perhaps it is easier if the shape of the tuple to be returned is set at the beginning. I.e. it could be the list of linear regression coefficients, or the first 4 moments of a normal distribution, or the list of percentiles as in the PercentileAggregator, or a list of durations in time units.
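As a sketch of the "fixed-size tuple" idea, the linear regression case can be written in plain NumPy. The output length is known up front (2 coefficients), whatever the length of the time axis. The name `linear_trend` is hypothetical, and the time steps are assumed to be evenly spaced, so the step index stands in for the time coordinate.

```python
import numpy as np


def linear_trend(data, axis=0):
    """Fit y = slope * step + intercept along `axis` at every grid point.

    Hypothetical helper: returns an array of shape (2, ...) holding
    (slope, intercept) in place of the reduced axis.
    """
    moved = np.moveaxis(np.asarray(data, dtype=float), axis, 0)
    n_steps = moved.shape[0]
    x = np.arange(n_steps)
    # np.polyfit accepts a 2-D y and fits each column independently.
    flat = moved.reshape(n_steps, -1)
    coeffs = np.polyfit(x, flat, deg=1)  # row 0: slope, row 1: intercept
    return coeffs.reshape((2,) + moved.shape[1:])


# Synthetic field with a known slope and intercept at each point.
steps = np.arange(10.0)[:, None, None]
true_slope = np.array([[1.0, 2.0], [3.0, 4.0]])
true_intercept = np.array([[0.5, -1.0], [2.0, 0.0]])
field = steps * true_slope + true_intercept

trend = linear_trend(field, axis=0)
print(trend.shape)  # (2, 2, 2): (coefficient, latitude, longitude)
```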

bjlittle · 2020-10-28T10:42:39Z

@rcomer Fancy taking this on?

rcomer · 2020-10-28T11:38:04Z

Hey @bjlittle, sorry I think I'd struggle to justify time on this one. My PRs generally fall into two categories:

- it directly affects my (or someone in my group's) work
- it's small enough to do "in the margins", so I don't need to justify the time

While this one doesn't look huge, it looks like more than a 5 min job.

rcomer · 2021-08-23T12:53:51Z

> So possibly what we need here is to generalise PercentileAggregator into a class that could create dimension-adding aggregators based on user-defined functions.

While digging to find something else, I noticed that PercentileAggregator was in fact originally written as AdditiveAggregator but was changed "after review discussion" as part of #1569. So there were reasons to make it specific, but I can't see from that PR what the reasons were.

Here be dragons.

rcomer · 2021-09-22T10:03:27Z

Note that #3901 also makes changes to the percentile aggregator, so it may be better to wait until that is resolved before starting work on this. Otherwise we could create some nasty code conflicts.

trexfeathers · 2022-04-06T09:41:22Z

Hi @berndbecker, sorry for the delay on this - it's both difficult and slightly niche! Is it still something you'd be interested in seeing in Iris?

If you think others would also be interested, we encourage you and them to try out the new voting feature.

berndbecker · 2022-04-06T10:04:34Z

Hi Martin,

Nice to hear from you! This feature request fits with others working on threshold exceedance, percentiles, etc. So much functionality is nearly there that it could be very rewarding, with some effort, to make this happen.

Albeit, for now, I am working on clustering on single-point time series. Dismantling a cube into single time series, running the clustering and reassembling a cube from the single-point results is painful and fraught with error. Having the facility described in #3904 would come in handy here as well.

People are shouting out for something similar here as well: https://web.yammer.com/main/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMTYyODUwMzA0OTkwNDEyOCJ9?search=aggregator&groupScope=eyJfdHlwZSI6Ikdyb3VwIiwiaWQiOiIxMDU5MjUyMCJ9

All the best, Bernd.

rcomer · 2022-04-06T12:55:08Z

@wjbenfold has #4676 to implement an aggregator for number of days of data matching certain criteria (e.g. above a threshold), which I think addresses that Yammer thread. However, it would only handle a single threshold value at a time I think.

wjbenfold · 2022-04-06T15:59:02Z

I'm currently intending that it can handle being between two thresholds (or any other criterion you can write as a lambda) but only one condition at a time, yes

pp-mo · 2022-11-16T12:06:20Z

I just changed the title to something a bit more general.
I actually think there are two different possibilities here for extending the capabilities:

firstly, a statistic calculation that returns multiple statistical components
- in these cases, the cube method (collapsed/aggregated_by/rolling_window) would naturally return multiple cubes instead of one
- a classic example would be a linear regression operator, which computes "slope" + "intercept" values together

secondly, a statistical operation repeated over multiple thresholds, categories, etc.
- in these cases, the result would have an extra dimension, e.g. threshold, category, histogram-bin
- as an example, we already have the PERCENTILE operator.
  But we don't have an easy way of creating a custom statistic of this sort.
- a relevant example that came up lately: calculating frequency of occurrence (over a time period) from category values (over time + locations)

From an efficiency point of view, it is always possible to make multiple statistical cubes, and use the CubeList.realise_data method to efficiently calculate multiple statistics over the same data.
Also, the 'extra dimension' cases can be constructed by creating multiple statistical result cubes; adding a defining scalar coord to each; and merging them into one.
But obviously, from a simplicity + convenience PoV this can be improved !!
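The "one result per category, then merge" workaround can be sketched in plain NumPy for the frequency-of-occurrence example. This is an analogue of the pattern, not Iris code: each per-category array stands in for a result cube, and `np.stack` stands in for the merge over the defining scalar coord. The name `category_frequency` is hypothetical.

```python
import numpy as np


def category_frequency(data, categories, axis=0):
    """Fraction of steps along `axis` in which each category occurs.

    Hypothetical helper: one reduced result per category, stacked along
    a new leading "category" dimension.
    """
    data = np.asarray(data)
    # One collapsed result per category (the "multiple cubes" step).
    per_category = [np.mean(data == cat, axis=axis) for cat in categories]
    # Combine them along a new dimension (the "merge" step).
    return np.stack(per_category, axis=0)


# Category values over (time, latitude, longitude).
cats = np.empty((4, 1, 2), dtype=int)
cats[:, 0, 0] = [1, 1, 2, 3]
cats[:, 0, 1] = [2, 2, 2, 2]

freq = category_frequency(cats, categories=[1, 2, 3])
print(freq.shape)     # (3, 1, 2): (category, latitude, longitude)
print(freq[:, 0, 0])  # [0.5  0.25 0.25]
```

This makes one pass over the data per category, which is the convenience and efficiency cost that a dimension-adding custom aggregator would avoid.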

berndbecker added the New: Feature label Oct 7, 2020

bjlittle added the "Peloton 🚴‍♂️" label Oct 28, 2020

pp-mo self-assigned this Aug 25, 2021

trexfeathers removed the New: Feature label Apr 6, 2022

pp-mo added the "Feature: Statistics" label Sep 7, 2022

trexfeathers unassigned pp-mo Nov 16, 2022

pp-mo changed the title ~~custom statistic to return a tuple rather than a scalar~~ Allow custom statistics to return multiple results Nov 16, 2022

pp-mo changed the title ~~Allow custom statistics to return multiple results~~ Enable custom statistics to return multiple results Nov 16, 2022
