Return the sum of the doc counts of other buckets in terms aggregations. #8213

Closed
wants to merge 2 commits into from

Conversation

jpountz
Contributor

@jpountz jpountz commented Oct 23, 2014

This commit adds a new field to the response of the terms aggregation called
`sum_other_doc_count`, which is equal to the sum of the doc counts of the buckets
that did not make it into the list of top buckets. It is typically useful for adding
a sector called e.g. `other` when using terms aggregations to build pie charts.

Example query and response:

```json
GET test/_search?search_type=count
{
  "aggs": {
    "colors": {
      "terms": {
        "field": "color",
        "size": 3
      }
    }
  }
}
```

```json
{
   [...],
   "aggregations": {
      "colors": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 4,
         "buckets": [
            {
               "key": "blue",
               "doc_count": 65
            },
            {
               "key": "red",
               "doc_count": 14
            },
            {
               "key": "brown",
               "doc_count": 3
            }
         ]
      }
   }
}
```
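
As a rough illustration of the pie chart use case, here is a minimal sketch in plain Java that turns the top buckets plus `sum_other_doc_count` into sector shares. It does not use the Elasticsearch client; the class name `PieSectors`, the method `sectors` and the `other` label are hypothetical names chosen for this example, and the counts are the ones from the response above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PieSectors {

    /**
     * Builds pie chart sectors (label -> share of all counted values) from the
     * top buckets of a terms aggregation plus its sum_other_doc_count.
     * Hypothetical helper for illustration, not part of any Elasticsearch API.
     */
    static Map<String, Double> sectors(Map<String, Long> topBuckets, long sumOtherDocCount) {
        long total = sumOtherDocCount;
        for (long count : topBuckets.values()) {
            total += count;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        if (total == 0) {
            return shares; // nothing to draw
        }
        for (Map.Entry<String, Long> bucket : topBuckets.entrySet()) {
            shares.put(bucket.getKey(), (double) bucket.getValue() / total);
        }
        // Only add the extra sector when some buckets were actually cut off.
        if (sumOtherDocCount > 0) {
            shares.put("other", (double) sumOtherDocCount / total);
        }
        return shares;
    }

    public static void main(String[] args) {
        Map<String, Long> buckets = new LinkedHashMap<>();
        buckets.put("blue", 65L);
        buckets.put("red", 14L);
        buckets.put("brown", 3L);
        // sum_other_doc_count is 4 in the example response.
        sectors(buckets, 4L).forEach((label, share) ->
                System.out.printf("%s: %.1f%%%n", label, share * 100));
    }
}
```

With multi-valued fields the total used here is a number of (document, term) pairs rather than a number of documents, so the shares describe term occurrences across documents.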
@roytmana

If we go with a sum of counts rather than a bucket approach, could we configure this field to sum not just counts but also values in another field? Then, if we build a chart of sales $, the other segment in your example would show the total of other sales $ rather than a count.

@jpountz
Contributor Author

jpountz commented Oct 23, 2014

@roytmana We want to see how we can make this feature more flexible in 2.0 (e.g. with sub aggregations), but it is challenging, so for now (1.4) we only plan to return document counts.

@roytmana

@jpountz Thanks, that makes sense. I should not be too greedy :-) The count is helpful. Will a missing bucket be implemented in upcoming 1.x versions?

@jpountz
Contributor Author

jpountz commented Oct 24, 2014

@roytmana Missing is also something we are thinking about. The issue is a bit more complicated than "other" because there are more options: should we add a bucket for missing, allow replacing the missing value with arbitrary values (like sorting does), or both? And also, how do we implement it in a unified way across all aggregations, since this is something that would make sense on nearly all aggregations?

assertSearchResponse(resp);
terms = resp.getAggregations().get("terms");
assertEquals(Math.min(size, totalNumTerms), terms.getBuckets().size());
assertEquals(sumOfDocCounts, sumOfDocCounts(terms));
Contributor

Should we not also check that `terms.getSumOtherDocCounts() == totalHits - sumOfDocCounts`?

Contributor Author

It might not work in the multivalued case: imagine that your size is 1 and you have 2 documents that have colors:

doc1: red, blue
doc2: red

The number of hits would be 2, the sum of doc counts would be 2 (counts for red), and the sum of other doc counts would be 1 (blue). The number of hits minus the sum of doc counts would be 0, so it doesn't match.
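
The two-document case above can be reproduced with a small sketch in plain Java. This is a simplified, single-shard model written for illustration, not the Elasticsearch implementation; the class `MultiValuedOther` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class MultiValuedOther {

    /** For each term, counts the documents containing it (once per document). */
    static Map<String, Long> docCounts(List<List<String>> docs) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Sum of the doc counts of the `size` largest buckets. */
    static long sumTop(Map<String, Long> counts, int size) {
        return counts.values().stream()
                .sorted(Comparator.reverseOrder())
                .limit(size)
                .mapToLong(Long::longValue)
                .sum();
    }

    /** Sum of the doc counts of every bucket that is not among the top `size`. */
    static long sumOther(Map<String, Long> counts, int size) {
        long all = counts.values().stream().mapToLong(Long::longValue).sum();
        return all - sumTop(counts, size);
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("red", "blue"),  // doc1
                Arrays.asList("red"));         // doc2
        Map<String, Long> counts = docCounts(docs);
        int size = 1;
        long totalHits = docs.size();
        System.out.println("sum of top doc counts:   " + sumTop(counts, size));
        System.out.println("sum of other doc counts: " + sumOther(counts, size));
        System.out.println("totalHits - sum of top:  " + (totalHits - sumTop(counts, size)));
    }
}
```

Because doc1 contributes to both the red and the blue bucket, the bucket counts add up to more than the number of hits, which is why the direct check only holds for single-valued fields.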

Contributor

Ok, I see that now. My concern was that the test is checking the value of the others doc count in an indirect way (from a readability point of view) since we are checking sum of all the doc counts rather than directly checking the other doc count, but actually there probably isn't another way to do it.

@colings86
Contributor

@jpountz Code looks good, left some comments on the tests

@colings86 colings86 removed the review label Oct 24, 2014
@roytmana

Ok, thank you @jpountz. I was hoping to get some sense of when these features may become available. I need to start moving my BI framework off facets, but without buckets for _other and _missing, and without agg filtering on an array of terms (including numeric ones), it would be very challenging, and any workaround would be a pure waste if missing/other is finally supported.

@jpountz
Contributor Author

jpountz commented Oct 24, 2014

@colings86 I replied/addressed your comments

@jpountz jpountz added the review label Oct 24, 2014
@colings86
Contributor

@jpountz Looks good

jpountz added a commit that referenced this pull request Oct 27, 2014
@jpountz jpountz closed this in 7ea490d Oct 27, 2014
jpountz added a commit that referenced this pull request Oct 27, 2014
@jpountz jpountz deleted the feature/terms_agg_other branch October 27, 2014 11:51
@niemyjski
Contributor

Will you be pushing a 1.4.2 with this fix?

@jpountz
Contributor Author

jpountz commented Nov 7, 2014

@niemyjski I am confused, what bug are you talking about?

@niemyjski
Contributor

We were getting a null reference exception when we upgraded (elastic/elasticsearch-net#1042).

@gmarz
Contributor

gmarz commented Nov 7, 2014

Hey @niemyjski this is an issue with NEST, not elasticsearch. As mentioned in elastic/elasticsearch-net#1041, we'll be releasing a new version next week. In the meantime you can pick up the CI package here.

@clintongormley clintongormley changed the title from "Aggregations: Return the sum of the doc counts of other buckets in terms aggregations." to "Return the sum of the doc counts of other buckets in terms aggregations." Jun 6, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015