Reducers - Post processing of aggregation results #8110

Closed
colings86 opened this issue Oct 16, 2014 · 39 comments

@colings86
Contributor

Overview

This feature would introduce a new 'reducers' module, allowing for the processing and transformation of aggregation results. The current aggregation module is very powerful and can compute varied and complex analytics, but it is limited when calculating analytics which depend on numerous independently calculated aggregated metrics. Currently, the only way to calculate these kinds of complex metrics is to write client-side logic. The reducer module would add a new phase to the search request in which the coordinating node passes the reduced aggregations on for further processing.

Key Concepts & Terminology

  • Reducer - A reducer is the computation unit which receives a bucket aggregation and uses its set of buckets to perform calculations and generate a new aggregation representing either a single answer (a new metric aggregation) or a new set of buckets (a new bucket aggregation). Reducers run after aggregations and act on the result tree returned by the aggregations or on the result tree of previous reducers. There are two types of reducer:
    • Bucket reducer - Takes all specified buckets and divides them into groups (selections). A bucket reducer may create new buckets to add to the selections but is not able to modify existing buckets.
    • Metric reducer - Performs calculations on the input aggregation and outputs a single result (i.e. cannot create buckets)

Reducer Structure

The following snippet captures the basic structure of reducers:

"reducers" : [
    {
        "<reducer_name>" : {
            "<reducer_type>" : {
                <reducer_body>
            },
            "reducers" : [ 
                {
                    <sub_reducer_1>
                },
                {
                    <sub_reducer_2>
                },
                ...
           ] 
        }
    },
    {
        "<reducer_name_2>" : { ... }
    },
    ...
]

Some rules for reducer requests:

  • reducers and aggregations are separate in the request structure
  • reducers are specified in order because they can process the output of other reducers
  • reducers can be nested, in which case they only see the data for the bucket they're in, although they can refer to any value in the tree with _root.path.to.value (see the sketch below)
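
For illustration, a nested reducer could reference a value at the root of the tree like this (a sketch only; the ratio reducer and every name here are hypothetical, not part of this proposal):

"reducers" : [
    {
        "per_window" : {
            "sliding_window" : {
                "buckets" : "months",
                "window" : 3
            },
            "reducers" : [
                {
                    "share_of_total" : {
                        "ratio" : {
                            "numerator" : "avg_price.value",
                            "denominator" : "_root.total_price.value"
                        }
                    }
                }
            ]
        }
    }
]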

Use Cases

Some possible implementations of reducers are detailed below:

Bucket Reducers

  • Sliding window - Can be used with the results of a histogram aggregation to produce selections representing a sliding time window of buckets. The reducer would create a selection for each bucket in the histogram aggregation containing that bucket and the subsequent buckets, up to a configurable window size of N buckets.

Metric Reducers

  • Avg/Min/Max/Sum/Stats - Same as their aggregation equivalents, except they calculate the result for a particular field across their input aggregation's buckets (see the sketch after this list).
  • Delta - Calculates the range of the values of a field across the input buckets
  • Gradient - Calculates the range of the values of two fields across the input buckets and returns the result of the first divided by the second (e.g. change in price divided by change in time)
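
As a sketch of the first of these (the parameter names are guesses, not part of any agreed design), an avg metric reducer computing the overall average of a metric across a histogram's buckets might look like:

"reducers" : [
    {
        "overall_avg_price" : {
            "avg" : {
                "buckets" : "months",
                "field" : "avg_price.value"
            }
        }
    }
]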

Example

As an example, imagine that you have sales data in an index and you would like to calculate the derivative of the price field over time. This is useful for determining trends in the data, such as whether sales are increasing linearly over time.

You might start with the following aggregation:

{
  "aggs": {
    "months": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

The results of these aggregations might look something like the following:

"aggregations": {
  "months": {
     "buckets": [
        {
           "key_as_string": "2014/02/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 6,
           "avg_price": {
              "value": 11
           }
        },
        {
           "key_as_string": "2014/03/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 4,
           "avg_price": {
              "value": 19
           }
        },
        {
           "key_as_string": "2014/04/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 8,
           "avg_price": {
              "value": 31
           }
        }
     ]
  }
}

The sliding window reducer could then be used to request time windows of two months. A nested delta reducer inside the sliding window reducer could be used to calculate the rate of change of price for each two month selection.

"reducers": [
  {
    "two_months": {
      "sliding_window": {
        "buckets": "months",
        "window": 2
      },
      "reducers": [
        {
          "price_derivative": {
            "gradient": {
              "field": "avg_price.value"
            }
          }
        }
      ]
    }
  }
]

Which would produce the following response (derivative value is shown in price per month):

{
  "reducers": {
    "two_months": {
      "selections": [
        {
          "buckets": [
            {
              "key_as_string": "2014/02/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 6,
              "avg_price": {
                "value": 11
              }
            },
            {
              "key_as_string": "2014/03/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 4,
              "avg_price": {
                "value": 19
              }
            }
          ],
          "price_derivative": {
            "value": 8
          }
        },
        {
          "buckets": [
            {
              "key_as_string": "2014/03/01 00:00:00",
              "key": 1391212800000,
               "doc_count": 4,
               "avg_price": {
                "value": 19
               }
            },
            {
              "key_as_string": "2014/04/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 8,
              "avg_price": {
                "value": 31
              }
            }
          ],
          "price_derivative": {
            "value": 12
          }
        }
      ]
    }
  }
}
@s1monw
Contributor

s1monw commented Oct 16, 2014

+1

@roytmana

Hi @colings86

Will it allow for expressions against aggregated values? Say I need to calculate a metric:
number of docs [where field1=value1] divided by number of docs [where field2=value2]

Would a transformation merging (adding new and replacing existing) buckets from one agg into another be possible? My use cases for it are:

  1. Need to have a terms agg where several terms MUST be present even if they would be cut off by the agg's size/ordering. I do this by aggregating twice, once normally and once with a filter, and adding the filtered results in (at which point I can choose to add them to the top/bottom or sort properly, handling duplicates too)
  2. Need to sub-aggregate only chosen parent agg buckets, not all of them. I achieve this by doing two aggs, one for all terms without the sub-agg and one for selected terms with sub-aggs, and merging the second into the first

Thanks
alex

@clintongormley

Also see #4983 for other use cases.

@martijnvg
Member

+1 :)

@colings86
Contributor Author

@roytmana currently the implementations of the reducers are still being discussed, but the design does allow for creating new buckets (which could be a merge of existing buckets), though not for modifying existing buckets.

@abhijitiitr

+1

@roytmana

Thank you @colings86. Is replacement of an existing bucket considered to be modification? I would not want to modify an existing bucket but rather replace it altogether.

I think combining two aggregation results is a hugely useful use case (I outline just a couple of cases in my previous post which I, and others who build generic interactive solutions based on ES, would be interested in).

@colings86
Contributor Author

I am trying to implement the Union Reducer and have hit a dilemma on which I'd like some opinions. Basically, reducers need to be able to access and use multiple aggregations from anywhere in the aggregation tree. Thinking for the moment only about top-level reducers, I think it would be good if the logic to extract the aggregations was done in the Reducer class so that the implementations don't have to worry about doing it. That would mean the doReduce method would be passed a list of aggregations rather than a single aggregation.

The problem is that this makes reducers which only care about one aggregation a bit clunky, since each has to check that the list contains only one aggregation and then get it. It's not a lot of code, but it would be repeated in a lot of places. I think the options are:

  1. Deal with the list in each implementation and just live with the repetition
  2. Create a SingleAggregationBucketReducer and a SingleAggregationMetricReducer, but now we have 4 types of reducer instead of two
  3. Suggestion from @polyfractal: have the implementations receive a list of aggregations, but provide a helper method which checks there is only one aggregation in the list and returns it (throwing an exception if there is more than one), for use by reducers which only want a single aggregation as input

Interested to hear people's thoughts on this.

@brwe
Contributor

brwe commented Nov 17, 2014

I am actually not sure we need to cater for reducers that only take one aggregation. Instead we might just implement all the "single" reducers so that they can also operate on an arbitrary number of aggregations and then return a result for each of the curves. For example, a reducer running on the following aggregation could return one or several results depending on whether there are one or more terms in the terms aggregation:

{
  "aggs": {
    "terms": {
      "terms": {
        "field": "label",
        "size": 10
      },
      "aggs": {
        "histo": {
          "histogram": {
            "field": "number",
            "interval": 1
          }
        }
      }
    }
  }
}

@Grauen

Grauen commented Nov 18, 2014

+1

@mewwts

mewwts commented Nov 24, 2014

+1. Would love the ability to do a simple metric aggregation on bucket fields. For example, give me the average "doc_count" etc.
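
Under the pipeline aggregations API that eventually shipped in 2.0 (see the later comments in this thread), this became expressible as an avg_bucket sibling aggregation; a minimal sketch, assuming a date_histogram named months:

"aggs": {
  "months": {
    "date_histogram": { "field": "date", "interval": "month" }
  },
  "avg_monthly_doc_count": {
    "avg_bucket": { "buckets_path": "months>_count" }
  }
}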

@dan-dr

dan-dr commented Jan 8, 2015

yes. yes yes yes.

@colings86
Contributor Author

Having started to implement bucket reducers using the style of API above, we have found the API to be overly complex and very verbose even for straightforward tasks (such as a simple derivative of a metric). A lot of this is down to the user having to carefully build their aggregations and then define multiple reducers and link them to the aggregations. We found that the only way to build reducer requests with any confidence is to first build the aggregations and then inspect the response to work out what the reducers request should look like. This is not what we want. We would like an API where reducers can be defined at the same time as the aggregations, with a structure which is intuitive from the aggregation request structure (i.e. you shouldn't need to see the aggregations response to be able to write your reducers request).

So we have rethought the API, moving to a design more native to the existing aggregations framework, so that reducers look and feel like just another aggregation.

Now there is a new type of aggregation which takes the output of an aggregation tree and computes a new aggregation tree based on those results. The original sub-aggregation tree is destroyed in the process. This can be used to implement functionality such as derivatives and moving averages, where the original data is not needed in the final output. This new functionality opens the door to even more aggregation implementations; the first will be the derivative (#9293).
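
A rough sketch of the new style, based on the derivative in #9293 (the final syntax may differ), with the computation declared inline as a sub-aggregation of the histogram:

"aggs": {
  "months": {
    "date_histogram": { "field": "date", "interval": "month" },
    "aggs": {
      "avg_price": { "avg": { "field": "price" } },
      "price_derivative": { "derivative": { "buckets_path": "avg_price" } }
    }
  }
}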

@jayhilden

👍

@jasalmeron

So there is no option to directly get the division between two aggregations? I have this search query, and I wonder if it's possible to obtain the division between the balanceIncrement sums after the nested aggregations are done. Ideally I want to calculate the payout:

sum of balanceIncrement where 'type' = "prize", divided by sum of balanceIncrement where 'type' in ("play", "ext"), grouped by room and then by day. What I have now is that the summation is done, but I can't find a way to then divide between them to get the payout. This is my query:

/fb/balance_trasnfer/_search?pretty

{   
    "query" : {
        "filtered": {
            "filter": {
               "range" : {
                    "transferDate" : {
                        "from" : "${BAL_TRANSFER_DATE_INIT}",
                        "to" : "${BAL_TRANSFER_DATE_PLUS_7_DAYS}"
                    }
                }
            }
        }
    },
    "size": 0,
    "aggs" : {
        "transfers_over_day" : {
            "date_histogram" : {
                "field" : "transferDate",
                "interval" : "day"
            },
            "aggs": {
                "group_by_room": {
                    "terms": {
                        "field": "roomName"
                    },                
                    "aggs" : {
                        "wins-and-bets" : {
                             "filters" : {
                                "filters" : {
                                  "win" :   { "term" : { "type" : "prize"}},
                                  "bet" : { "term" : { "type" : ["play","ext"]}}
                                }
                              },
                             "aggs" : {
                                 "sum_balance": {
                                   "sum" : {"field":"balanceIncrement"}
                                 }
                            }                    
                        }               
                    }
                }
            }
        }        
    }
}
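
With the pipeline aggregations that later shipped (see the comments below), this division can be done server-side by placing a bucket_script next to wins-and-bets inside group_by_room; a sketch, untested against this mapping:

"payout": {
  "bucket_script": {
    "buckets_path": {
      "wins": "wins-and-bets['win']>sum_balance",
      "bets": "wins-and-bets['bet']>sum_balance"
    },
    "script": "params.wins / params.bets"
  }
}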

@tcucchietti
Contributor

👍

@d-chabrovsky

+1

{
    "aggregations": {
        "currentMessageID": {
            "buckets": [
                {
                    "key": "5fbdfcb6-e6e9-446f-be59-e9527c006656",
                    "doc_count": 6,
                    "status": {
                        "buckets": [ {
                            "key": "FAILURE",
                            "doc_count": 6
                        } ]
                    },
                    "operationDateTm": { "value": 1422343776347 }
                },
                {
                    "key": "25a561c3-d45f-4f17-9ea9-e8b7a7cc14e8",
                    "doc_count": 6,
                    "status": {
                        "buckets": [ {
                            "key": "SUCCESS",
                            "doc_count": 1
                        } ]
                    },
                    "operationDateTm": { "value": 1422343964242 }
                }
            ]
        }
    }
}

Would be great to be able to filter out top buckets having "SUCCESS" in the sub-aggregation's status field.

@neilvarnas-zz

Why has this issue been closed ?

@colings86
Contributor Author

As described in the comment above (#8110 (comment)), the approach and API proposed in this issue turned out to be too complex and not user-friendly. Work is still continuing to provide this kind of functionality, but with a better, more intuitive API. The start of this is the implementation of derivatives in #9293.

@neilvarnas-zz

Thank you, I managed to miss that comment.

@brupm

brupm commented Mar 16, 2015

👍

@lakshmi-guruparan

Hi All,

I have a requirement to aggregate on 3 fields. I need to perform a range operation on the doc_count value of the last aggregation. Is there a way I can accomplish this? Let's say my query looks like the below -

"aggregations" : {
"g1" : {
"terms" : {
"field" : "TECHNOLOGY",
"size" : 100000
},
"aggregations" : {
"g2" : {
"terms" : {
"field" : "FEATURE_TITLE",
"size" : 100000
},
"aggregations" : {
"g3" : {
"terms" : {
"field" : "SOFTWARE_TYPE",
"size" : 100000,
}
}
}
}
}
}
}

I want to explore the possibilities of filtering the results based on the doc_count value of the last aggregation (g3). Below is the result -
g1: {
    buckets: [
        {
            key: "LAN Switching",
            doc_count: 271,
            g2: {
                buckets: [
                    {
                        key: "EtherChannel",
                        doc_count: 8,
                        g3: {
                            buckets: [
                                {
                                    key: "IOS",
                                    doc_count: 8
                                }
                            ]
                        }
                    }
                ]
            }
        },
        ...
    ]
}
Thanks.

@OPtoss

OPtoss commented May 12, 2015

While I see why reducers can create an overly complex API, I don't think the specialized route is any better. There are many actions I may want to perform on the resulting buckets from an aggregation besides the derivative. The derivative approach here (#9293) is simple and easy, but very specific. How do you plan to cover all the potential use cases for this?

For instance, I may want the entire filter API to filter the buckets returned from an aggregation. In my particular case, I want to only find the buckets with "doc_count" equal to 1. But I can see many more complex use cases for these "meta-filters".

Edit: I just found this #9876, which might answer how you're planning for other use cases. But I don't see an answer for my particular use case. Those seem to be aggregating aggregation results, not filtering them. Any idea how I can resolve my use case?

@clintongormley

For instance, I may want the entire filter API to filter the buckets returned from an aggregation. In my particular case, I want to only find the buckets with "doc_count" equal to 1.

We plan to add a having agg which allows you to do exactly what you describe.

But I can see many more complex use cases for these "meta-filters".

Anything that these post-processing aggs can do can also be done on the client. So the major benefit of adding these is convenience for the user. We'll keep adding aggs that are commonly useful, but if something is really specific, then it can still be done client side.
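
For reference, the having agg later shipped as the bucket_selector pipeline aggregation; a minimal sketch of the doc_count equal to 1 case described above (the terms field name is hypothetical):

"aggs": {
  "ids": {
    "terms": { "field": "some_field" },
    "aggs": {
      "only_singletons": {
        "bucket_selector": {
          "buckets_path": { "count": "_count" },
          "script": "params.count == 1"
        }
      }
    }
  }
}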

@acarstoiu

Anything that these post-processing aggs can do, can also be done on the client. So the major benefit of adding these is convenience for the user. We'll keep adding aggs that are commonly useful, but if something is really specific, then it can still be done client side.

Actually no, not really.
Of course one can theoretically crunch the data on the client side following any deterministic recipe, but that's not the idea! Going down that line, you'll end up saying that databases are optional, as you can always read all the data into a huge (in-memory) structure and perform endless transformations on it 👎
I'm using Elasticsearch precisely because I do not want to bother with the feasibility of such a medieval approach ✌️

So, no, I don't want to transfer tens of megabytes of aggregated data from Elasticsearch into my application just to be able to return to my consumer a few figures obtainable by further aggregating the aggregations. No, I would very much like the aggregations of aggregations to take place on the Elasticsearch coordinating node, as the application delivers high-level statistics.

Some post-aggregations applicable on the aggregations would cover more than 90% of the needs, I reckon. There's no need for fancier aggregations, the existing ones are enough; it's just that they cannot be used on their own results. And they should 😉

@clintongormley

@acarstoiu thank you for your essay. If, instead of venting your spleen, you'd followed some of the links above, you would have discovered the wonder that is pipeline aggregations: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline.html

Please do remember that there are real humans here, with real feelings, writing software which you are using.

@acarstoiu

I'm aware of #9876, I just wanted to point out that your recommendation is evolutionarily wrong.
That has nothing to do with my using the software your team wrote, nor with my spleen (thanks for asking, by the way).

I hope this time was short enough for you (and your feelings). Last, but not least, thank you for the link, you're a good writer too!

@jason2055141

Hi!

Why does an Elasticsearch sum agg in the API with a timestamp show a different result from the Kibana interface?
Thanks in advance.

@colings86
Contributor Author

Please ask questions like this on the https://discuss.elastic.co/ forums; this issues list is used for tracking bugs and feature requests. Also, so people can help you with your problem, you will need to include a lot more detail about the problem you are seeing. Please read the following on how best to ask questions on the forum to get the best help: https://www.elastic.co/help

@nishasingla1224

I am using elasticsearch-1.5.1 and I want to apply a filter over derived values. I tried using reducers, but I guess 1.5.1 doesn't support them. Please suggest a solution.

@polyfractal
Contributor

@nishasingla1224 The "bucket reducer" functionality was added to Elasticsearch 2.0 under the name "Pipeline Aggregations". You can read the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html

You will need to upgrade to 2.0+ to use them, however.

@shuangshui

@acarstoiu Now, is there a way to just return the size of the agg bucket, without the bucket content?
In my case, the bucket is usually very big and not useful here.

@jrj-d

jrj-d commented Jun 23, 2016

Hi everyone, +1 for @shuangshui's question

@sourcec0de

I am also interested in the answer to @shuangshui's question.
@polyfractal, any ideas?

@polyfractal
Contributor

@shuangshui @jrj-d @sourcec0de You can potentially use response filtering via the filter_path parameter to cut down the size of the response. This allows you to select which parts of the JSON tree you want to see. But it only selects trees/subtrees, not parts of individual elements, so you won't be able to select just the count... you'll have to grab the relevant subtree.

Still, it can help to cut down very large responses, especially when you're just interested in some kind of metric calculated over a bunch of histograms, for example.
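
For example (index and aggregation names hypothetical), keeping only the bucket doc counts of a date histogram:

GET /sales/_search?size=0&filter_path=aggregations.months.buckets.doc_count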
