Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terms aggregations order wrong when sorting NaN's #5236

Closed
thanodnl opened this issue Feb 24, 2014 · 2 comments
Closed

Terms aggregations order wrong when sorting NaN's #5236

thanodnl opened this issue Feb 24, 2014 · 2 comments

Comments

@thanodnl
Copy link
Contributor

I have a strong believe there is an issue in the sorting of term aggregations.

Have a look here. If we look at the comment above it indicates that it would like to push NaN's to the bottom of the list (which would be the correct behaviour according to me). But when I test this out it does not work

Loading the following test data:

$ curl -XDELETE 'localhost:9200/sorting?pretty=true'
$ curl -XPOST 'localhost:9200/_bulk?pretty=true' -d '
{ index: { _index: "sorting", _type: "object" }}
{ t: "a", a:1, b:1 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "a", a:2, b:4 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "a", a:3, b:9 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "b", a:4, b:16 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "b", a:5, b:25, c: 42 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "b", a:6, b:36, c: 50 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "c", a:7, b:49 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "c", a:8, b:64 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "c", a:9, b:81 }
{ index: { _index: "sorting", _type: "object" }}
{ t: "c", a:10, c:100 }
'

You can start running some aggregations:

$ curl 'localhost:9200/sorting/_search?pretty=true' -d '
{
  "size": 0,
  "aggs": {
    "t": {
      "terms": {
        "field": "t",
        "order": {
          "aStats.min": "desc"
        }
      },
      "aggs": {
        "aStats": {
          "stats": {
            "field": "c"
          }
        }
      }
    }
  }
}
'

When we run this aggregations if turns out that a term without value's in the c field ends in the top.

{
  "aggregations": {
    "t": {
      "buckets": [
        {
          "key": "a",
          "doc_count": 3,
          "aStats": {
            "count": 0,
            "min": null,
            "max": null,
            "avg": null,
            "sum": null
          }
        },
        {
          "key": "c",
          "doc_count": 4,
          "aStats": {
            "count": 1,
            "min": 100,
            "max": 100,
            "avg": 100,
            "sum": 100
          }
        },
        {
          "key": "b",
          "doc_count": 3,
          "aStats": {
            "count": 2,
            "min": 42,
            "max": 50,
            "avg": 46,
            "sum": 92
          }
        }
      ]
    }
  }
}

When running the same analysis with facetting you would see that the term without the values in c are pushed to the bottom of the list

$ curl 'localhost:9200/sorting/_search?pretty=true' -d '
{
  "size": 0,
  "facets": {
    "t": {
      "terms_stats": {
        "key_field": "t",
        "value_field": "c",
        "order": "min"
      }
    }
  }
}
'

This looks to me as correct behaviour.

I have done some research on why this is happening, and in fact the if statement referenced above is comparing the aggregated metric to Double.NaN. In java it turns out that NaN is not equal to NaN :), luckily the guys working at java thought of this and added a function to check for NaN values Double.isNaN. Changing the line accordingly makes the return statement next work since it is skipped always at the moment. But...

On line 216 it returns 1 or -1 depending on the ordering provided. This would result in NaN floating to the top of the list when ordering descending. Which has the strange effect that NaN's would be at the top of the list. My believe is that is should always return 1 if v1 is NaN.

Last part of the bug is that only v1 is being checked to be NaN. You would also need to check v2 for being NaN and return -1(!) if so. This would, as the comment suggest always push 'NaN' values to the bottom of the sorted list. This resembles the most to how facets sort at the moment.

Concrete effects of this bug is that we are not able to use aggregations for a table like view (which is the main benefit of using aggs, since you can get multiple columns at once) to show terms sorted descending on the avg of a sparse field we have in our collection of documents which was able when using facets.

Please have a look, and note that this error is made on two spots in the same file. (second spot is on line 230).

I tested (not via the test suite, but by hand) out a fix locally and that seems works for us. If you would like to share my patch by opening a pull request I could, although I need to check my code against your guidelines, and add some automated tests :)

@jpountz
Copy link
Contributor

jpountz commented Feb 24, 2014

Thanks for opening such a detailed bug report, your observation and the way you propose to fix this issue sound good to me so a pull request would be highly welcome!

@jpountz jpountz assigned jpountz and unassigned uboness Feb 24, 2014
thanodnl added a commit to thanodnl/elasticsearch that referenced this issue Feb 24, 2014
@thanodnl
Copy link
Contributor Author

I opened the pull request, If you need me to change things just let me know.

jpountz pushed a commit to jpountz/elasticsearch that referenced this issue Feb 27, 2014
jpountz pushed a commit that referenced this issue Mar 4, 2014
@jpountz jpountz closed this as completed in 0e57915 Mar 4, 2014
jpountz pushed a commit that referenced this issue Mar 4, 2014
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants