Reducers - Post processing of aggregation results #8110

Closed
colings86 opened this issue Oct 16, 2014 · 39 comments

@colings86
Contributor

Overview

This feature would introduce a new 'reducers' module, allowing for the processing and transformation of aggregation results. The current aggregation module is very powerful and can compute varied and complex analytics, but it is limited when calculating analytics which depend on numerous independently calculated aggregated metrics. Currently, the only way to calculate these kinds of complex metrics is to write client-side logic. The reducer module would add a new phase to the search request in which the coordinating node passes the reduced aggregations on for further processing.

Key Concepts & Terminology

  • Reducer - A reducer is the computation unit which receives a bucket aggregation and uses its set of buckets to perform calculations and generate a new aggregation representing either a single answer (a new metric aggregation) or a new set of buckets (a new bucket aggregation). Reducers run after aggregations and act on the result tree returned by the aggregations or on the result tree of previous reducers. There are two types of reducer:
    • Bucket reducer - Takes all specified buckets and divides them into groups (selections). A bucket reducer may create new buckets to add to the selections but is not able to modify existing buckets.
    • Metric reducer - Performs calculations on the input aggregation and outputs a single result (i.e. cannot create buckets)

Reducer Structure

The following snippet captures the basic structure of reducers:

"reducers" : [
    {
        "<reducer_name>" : {
            "<reducer_type>" : {
                <reducer_body>
            },
            "reducers" : [ 
                {
                    <sub_reducer_1>
                },
                {
                    <sub_reducer_2>
                },
                ...
           ] 
        }
    },
    {
        "<reducer_name_2>" : { ... }
    },
    ...
]

Some rules for reducer requests:

  • reducers and aggregations are separate in the request structure
  • reducers are specified in order because they can process the output of other reducers
  • reducers can be nested, in which case they only see the data for the bucket they're in, although they can refer to any value in the tree with _root.path.to.value (see the sketch below)
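
For illustration, a nested reducer could reference a value at the root of the tree like this (a sketch only; the ratio reducer and every name here are hypothetical, not part of this proposal):

"reducers" : [
    {
        "per_window" : {
            "sliding_window" : {
                "buckets" : "months",
                "window" : 3
            },
            "reducers" : [
                {
                    "share_of_total" : {
                        "ratio" : {
                            "numerator" : "avg_price.value",
                            "denominator" : "_root.total_price.value"
                        }
                    }
                }
            ]
        }
    }
]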

Use Cases

Some possible implementations of reducers are detailed below:

Bucket Reducers

  • Sliding window - Can be used with the results of a histogram aggregation to produce selections representing a sliding time window of buckets. The reducer would create a selection for each bucket in the histogram aggregation containing that bucket and the subsequent buckets, up to a configurable window size of N buckets.

Metric Reducers

  • Avg/Min/Max/Sum/Stats - Same as their aggregation equivalents, except they calculate the result for a particular field across their input aggregation's buckets (see the sketch after this list).
  • Delta - Calculates the range of the values of a field across the input buckets
  • Gradient - Calculates the range of the values of two fields across the input buckets and returns the result of the first divided by the second (e.g. change in price divided by change in time)
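
As a sketch of the first of these (the parameter names are guesses, not part of any agreed design), an avg metric reducer computing the overall average of a metric across a histogram's buckets might look like:

"reducers" : [
    {
        "overall_avg_price" : {
            "avg" : {
                "buckets" : "months",
                "field" : "avg_price.value"
            }
        }
    }
]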

Example

As an example, imagine that you have sales data in an index and you would like to calculate the derivative of the price field over time. This is useful for determining trends in the data, such as whether sales are increasing linearly over time.

You might start with the following aggregation:

{
  "aggs": {
    "months": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

The results of these aggregations might look something like the following:

"aggregations": {
  "months": {
     "buckets": [
        {
           "key_as_string": "2014/02/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 6,
           "avg_price": {
              "value": 11
           }
        },
        {
           "key_as_string": "2014/03/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 4,
           "avg_price": {
              "value": 19
           }
        },
        {
           "key_as_string": "2014/04/01 00:00:00",
           "key": 1391212800000,
           "doc_count": 8,
           "avg_price": {
              "value": 31
           }
        }
     ]
  }
}

The sliding window reducer could then be used to request time windows of two months. A nested delta reducer inside the sliding window reducer could be used to calculate the rate of change of price for each two month selection.

"reducers": [
  {
    "two_months": {
      "sliding_window": {
        "buckets": "months",
        "window": 2
      },
      "reducers": [
        {
          "price_derivative": {
            "gradient": {
              "field": "avg_price.value"
            }
          }
        }
      ]
    }
  }
]

Which would produce the following response (derivative value is shown in price per month):

{
  "reducers": {
    "two_months": {
      "selections": [
        {
          "buckets": [
            {
              "key_as_string": "2014/02/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 6,
              "avg_price": {
                "value": 11
              }
            },
            {
              "key_as_string": "2014/03/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 4,
              "avg_price": {
                "value": 19
              }
            }
          ],
          "price_derivative": {
            "value": 8
          }
        },
        {
          "buckets": [
            {
              "key_as_string": "2014/03/01 00:00:00",
              "key": 1391212800000,
               "doc_count": 4,
               "avg_price": {
                "value": 19
               }
            },
            {
              "key_as_string": "2014/04/01 00:00:00",
              "key": 1391212800000,
              "doc_count": 8,
              "avg_price": {
                "value": 31
              }
            }
          ],
          "price_derivative": {
            "value": 12
          }
        }
      ]
    }
  }
}
@s1monw
Contributor

s1monw commented Oct 16, 2014

+1

@roytmana

Hi @colings86

Will it allow for expressions against aggregated values? Say I need to calculate a metric:
number of docs [where field1=value1] divided by number of docs [where field2=value2]

Would a transformation merging (adding new and replacing existing) buckets from one agg into another be possible? My use cases for it are:

  1. Need to have a terms agg where several terms MUST be present even if they would be cut off by the agg's size/ordering. I do this by aggregating twice, once normally and once with a filter, and adding the filtered results in (at which point I can choose to add them to the top/bottom or sort properly, handling duplicates too)
  2. Need to sub-aggregate only chosen parent agg buckets, not all of them. I achieve this by doing two aggs, one for all terms without the sub-agg and one for selected terms with sub-aggs, and merging the second into the first

Thanks
alex

@clintongormley

Also see #4983 for other use cases.

@martijnvg
Member

+1 :)

@colings86
Contributor Author

@roytmana currently the implementations of the reducers are still being discussed, but the design does allow for creating new buckets (which could be a merge of existing buckets), though not for modifying existing buckets.

@abhijitiitr

+1

@roytmana

Thank you @colings86. Is replacement of an existing bucket considered to be modification? I would not want to modify an existing bucket but rather replace it altogether.

I think combining two aggregation results is a hugely useful use case (I outline just a couple of cases in my previous post which I, and others who build generic interactive solutions based on ES, would be interested in).

@colings86
Contributor Author

I am trying to implement the Union Reducer and have hit a dilemma on which I'd like some opinions. Basically, reducers need to be able to access and use multiple aggregations from anywhere in the aggregation tree. Thinking for the moment only about top-level reducers, I think it would be good if the logic to extract the aggregations was done in the Reducer class so that the implementations don't have to worry about doing it. That would mean the doReduce method would be passed a list of aggregations rather than a single aggregation.

The problem is that this makes reducers which only care about one aggregation a bit clunky, since each has to check that the list contains only one aggregation and then get it. It's not a lot of code, but it would be repeated in a lot of places. I think the options are:

  1. Deal with the list in each implementation and just live with the repetition
  2. Create a SingleAggregationBucketReducer and a SingleAggregationMetricReducer, but now we have 4 types of reducer instead of two
  3. Suggestion from @polyfractal: have the implementations receive a list of aggregations, but provide a helper method which checks there is only one aggregation in the list and returns it (throwing an exception if there is more than one), for use by reducers which only want a single aggregation as input

Interested to hear people's thoughts on this.

@brwe
Contributor

brwe commented Nov 17, 2014

I am actually not sure we need to cater for reducers that only take one aggregation. Instead we might just implement all the "single" reducers so that they can also operate on an arbitrary number of aggregations and then return a result for each of the curves. For example, a reducer running on the following aggregation could return one or several results depending on whether there are one or more terms in the terms aggregation:

{
  "aggs": {
    "terms": {
      "terms": {
        "field": "label",
        "size": 10
      },
      "aggs": {
        "histo": {
          "histogram": {
            "field": "number",
            "interval": 1
          }
        }
      }
    }
  }
}

@Grauen

Grauen commented Nov 18, 2014

+1

@mewwts

mewwts commented Nov 24, 2014

+1. Would love the ability to do a simple metric aggregation on bucket fields. For example, give me the average "doc_count" etc.
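
Under the pipeline aggregations API that eventually shipped in 2.0 (see the later comments in this thread), this became expressible as an avg_bucket sibling aggregation; a minimal sketch, assuming a date_histogram named months:

"aggs": {
  "months": {
    "date_histogram": { "field": "date", "interval": "month" }
  },
  "avg_monthly_doc_count": {
    "avg_bucket": { "buckets_path": "months>_count" }
  }
}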

@dan-dr

dan-dr commented Jan 8, 2015

yes. yes yes yes.

@colings86
Contributor Author

Having started to implement bucket reducers using the style of API above, we have found the API to be overly complex and very verbose even for straightforward tasks (such as a simple derivative of a metric). A lot of this is down to the user having to carefully build their aggregations and then define multiple reducers and link them to the aggregations. We found that the only way to build reducer requests with any confidence is to first build the aggregations and then inspect the response to work out what the reducers request should look like. This is not what we want. We would like an API where reducers can be defined at the same time as the aggregations, with a structure which is intuitive from the aggregation request structure (i.e. you shouldn't need to see the aggregations response to be able to write your reducers request).

So we have rethought the API, moving to a design more native to the existing aggregations framework, so that reducers look and feel like just another aggregation.

Now there is a new type of aggregation which takes the output of an aggregation tree and computes a new aggregation tree based on those results. The original sub-aggregation tree is destroyed in the process. This can be used to implement functionality such as derivatives and moving averages, where the original data is not needed in the final output. This new functionality opens the door to even more aggregation implementations; the first will be the derivative (#9293).
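
A rough sketch of the new style, based on the derivative in #9293 (the final syntax may differ), with the computation declared inline as a sub-aggregation of the histogram:

"aggs": {
  "months": {
    "date_histogram": { "field": "date", "interval": "month" },
    "aggs": {
      "avg_price": { "avg": { "field": "price" } },
      "price_derivative": { "derivative": { "buckets_path": "avg_price" } }
    }
  }
}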

@jayhilden

👍

@jasalmeron

So there is no option to directly get the division between two aggregations? I have this search query, and I wonder if it's possible to obtain the division between the balanceIncrement sums after the nested aggregations are done. Ideally I want to calculate the payout:

sum of balanceIncrement where 'type' = "prize", divided by sum of balanceIncrement where 'type' in ("play", "ext"), grouped by room and then by day. What I have now is that the summation is done, but I can't find a way to then divide between them to get the payout. This is my query:

/fb/balance_trasnfer/_search?pretty

{   
    "query" : {
        "filtered": {
            "filter": {
               "range" : {
                    "transferDate" : {
                        "from" : "${BAL_TRANSFER_DATE_INIT}",
                        "to" : "${BAL_TRANSFER_DATE_PLUS_7_DAYS}"
                    }
                }
            }
        }
    },
    "size": 0,
    "aggs" : {
        "transfers_over_day" : {
            "date_histogram" : {
                "field" : "transferDate",
                "interval" : "day"
            },
            "aggs": {
                "group_by_room": {
                    "terms": {
                        "field": "roomName"
                    },                
                    "aggs" : {
                        "wins-and-bets" : {
                             "filters" : {
                                "filters" : {
                                  "win" :   { "term" : { "type" : "prize"}},
                                  "bet" : { "term" : { "type" : ["play","ext"]}}
                                }
                              },
                             "aggs" : {
                                 "sum_balance": {
                                   "sum" : {"field":"balanceIncrement"}
                                 }
                            }                    
                        }               
                    }
                }
            }
        }        
    }
}
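
With the pipeline aggregations that later shipped (see the comments below), this division can be done server-side by placing a bucket_script next to wins-and-bets inside group_by_room; a sketch, untested against this mapping:

"payout": {
  "bucket_script": {
    "buckets_path": {
      "wins": "wins-and-bets['win']>sum_balance",
      "bets": "wins-and-bets['bet']>sum_balance"
    },
    "script": "params.wins / params.bets"
  }
}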

@tcucchietti
Contributor

👍

@d-chabrovsky

+1

{
    "aggregations": {
        "currentMessageID": {
            "buckets": [
                {
                    "key": "5fbdfcb6-e6e9-446f-be59-e9527c006656",
                    "doc_count": 6,
                    "status": {
                        "buckets": [ {
                            "key": "FAILURE",
                            "doc_count": 6
                        } ]
                    },
                    "operationDateTm": { "value": 1422343776347 }
                },
                {
                    "key": "25a561c3-d45f-4f17-9ea9-e8b7a7cc14e8",
                    "doc_count": 6,
                    "status": {
                        "buckets": [ {
                            "key": "SUCCESS",
                            "doc_count": 1
                        } ]
                    },
                    "operationDateTm": { "value": 1422343964242 }
                }
            ]
        }
    }
}

Would be great to be able to filter out top buckets having "SUCCESS" in the sub-aggregation's status field.

@neilvarnas-zz

Why has this issue been closed ?

@colings86
Contributor Author

As described in the comment above (#8110 (comment)), the approach and API proposed in this issue turned out to be too complex and not user-friendly. Work is still continuing to provide this kind of functionality, but with a better, more intuitive API. The start of this is the implementation of derivatives in #9293.

@neilvarnas-zz

Thank you, I managed to miss that comment.

@brupm

brupm commented Mar 16, 2015

👍

@lakshmi-guruparan

Hi All,

I have a requirement to aggregate on 3 fields. I need to perform a range operation on the doc_count value of the last aggregation. Is there a way I can accomplish this? Let's say my query looks like the below -

"aggregations" : {
"g1" : {
"terms" : {
"field" : "TECHNOLOGY",
"size" : 100000
},
"aggregations" : {
"g2" : {
"terms" : {
"field" : "FEATURE_TITLE",
"size" : 100000
},
"aggregations" : {
"g3" : {
"terms" : {
"field" : "SOFTWARE_TYPE",
"size" : 100000,
}
}
}
}
}
}
}

I want to explore the possibilities of filtering the results based on the doc_count value of the last aggregation (g3). Below is the result -
g1: {
    buckets: [
        {
            key: "LAN Switching",
            doc_count: 271,
            g2: {
                buckets: [
                    {
                        key: "EtherChannel",
                        doc_count: 8,
                        g3: {
                            buckets: [
                                {
                                    key: "IOS",
                                    doc_count: 8
                                }
                            ]
                        }
                    }
                ]
            }
        },
        ...
    ]
}
Thanks.

@OPtoss

OPtoss commented May 12, 2015

While I see why reducers can create an overly complex API, I don't think the specialized route is any better. There are many actions I may want to perform on the resulting buckets from an aggregation besides the derivative. The derivative approach here (#9293) is simple and easy, but very specific. How do you plan to cover all the potential use cases for this?

For instance, I may want the entire filter API to filter the buckets returned from an aggregation. In my particular case, I want to only find the buckets with "doc_count" equal to 1. But I can see many more complex use cases for these "meta-filters".

Edit: I just found this #9876, which might answer how you're planning for other use cases. But I don't see an answer for my particular use case. Those seem to be aggregating aggregation results, not filtering them. Any idea how I can resolve my use case?

@clintongormley

For instance, I may want the entire filter API to filter the buckets returned from an aggregation. In my particular case, I want to only find the buckets with "doc_count" equal to 1.

We plan to add a having agg which allows you to do exactly what you describe.

But I can see many more complex use cases for these "meta-filters".

Anything that these post-processing aggs can do can also be done on the client. So the major benefit of adding these is convenience for the user. We'll keep adding aggs that are commonly useful, but if something is really specific, then it can still be done client side.
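
For reference, the having agg later shipped as the bucket_selector pipeline aggregation; a minimal sketch of the doc_count equal to 1 case described above (the terms field name is hypothetical):

"aggs": {
  "ids": {
    "terms": { "field": "some_field" },
    "aggs": {
      "only_singletons": {
        "bucket_selector": {
          "buckets_path": { "count": "_count" },
          "script": "params.count == 1"
        }
      }
    }
  }
}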

@acarstoiu

Anything that these post-processing aggs can do, can also be done on the client. So the major benefit of adding these is convenience for the user. We'll keep adding aggs that are commonly useful, but if something is really specific, then it can still be done client side.

Actually no, not really.
Of course one can theoretically crunch the data on the client side following any deterministic recipe, but that's not the idea! Going down that line, you'll end up saying that databases are optional, as you can always read all the data into a huge (in-memory) structure and perform endless transformations on it 👎
I'm using Elasticsearch precisely because I do not want to bother with the feasibility of such a medieval approach ✌️

So, no, I don't want to transfer tens of megabytes of aggregated data from Elasticsearch into my application just to be able to return to my consumer a few figures obtainable by further aggregating the aggregations. No, I would very much like the aggregations of aggregations to take place on the Elasticsearch coordinating node, as the application delivers high-level statistics.

Some post-aggregations applicable on the aggregations would cover more than 90% of the needs, I reckon. There's no need for fancier aggregations, the existing ones are enough; it's just that they cannot be used on their own results. And they should 😉

@clintongormley

@acarstoiu thank you for your essay. If, instead of venting your spleen, you'd followed some of the links above, you would have discovered the wonder that is pipeline aggregations: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-aggregations-pipeline.html

Please do remember that there are real humans here, with real feelings, writing software which you are using.

@acarstoiu

I'm aware of #9876, I just wanted to point out that your recommendation is evolutionarily wrong.
That has nothing to do with my using the software your team wrote, nor with my spleen (thanks for asking, by the way).

I hope this time was short enough for you (and your feelings). Last, but not least, thank you for the link, you're a good writer too!

@jason2055141

Hi!

Why does an Elasticsearch sum agg in the API with a timestamp show a different result from the Kibana interface?
Thanks in advance.

@colings86
Contributor Author

Please ask questions like this on the https://discuss.elastic.co/ forums; this issues list is used for tracking bugs and feature requests. Also, so people can help you with your problem, you will need to include a lot more detail about the problem you are seeing. Please read the following on how best to ask questions on the forum to get the best help: https://www.elastic.co/help

@nishasingla1224

I am using elasticsearch-1.5.1 and I want to apply a filter over derived values. I tried using reducers, but I guess 1.5.1 doesn't support them. Please suggest a solution.

@polyfractal
Contributor

@nishasingla1224 The "bucket reducer" functionality was added to Elasticsearch 2.0 under the name "Pipeline Aggregations". You can read the docs here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html

You will need to upgrade to 2.0+ to use them, however.

@shuangshui

@acarstoiu Now, is there a way to just return the size of the agg bucket, without the bucket content?
In my case, the bucket is usually very big and not useful here.

@jrj-d

jrj-d commented Jun 23, 2016

Hi everyone, +1 for @shuangshui's question

@sourcec0de

I am also interested in the answer to @shuangshui's question.
@polyfractal, any ideas?

@polyfractal
Contributor

@shuangshui @jrj-d @sourcec0de You can potentially use response filtering via the filter_path parameter to cut down the size of the response. This allows you to select which parts of the JSON tree you want to see. But it only selects trees/subtrees, not parts of individual elements, so you won't be able to select just the count... you'll have to grab the relevant subtree.

Still, it can help to cut down very large responses, especially when you're just interested in some kind of metric calculated over a bunch of histograms, for example.
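
For example (index and aggregation names hypothetical), keeping only the bucket doc counts of a date histogram:

GET /sales/_search?size=0&filter_path=aggregations.months.buckets.doc_count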
