facet.stat with avg() function returns NaN #21

Open
servicepilot opened this issue Sep 12, 2014 · 3 comments

Comments

@servicepilot

I want to get the top 10 countries with the highest average load time. The problem is that countries with a NaN average are at the top of the list (before countries with a numeric value). I know the NaN value is generated by a division by zero.
It would be useful to exclude from the result the values which have a document count equal to zero.
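For illustration, here is a minimal Python sketch of why an average over a bucket with no documents ends up as NaN. The function name `average` is hypothetical and is not part of Heliosearch; it only imitates the floating-point result of dividing a zero sum by a zero count.

```python
import math


def average(total: float, count: int) -> float:
    """Mean of a bucket, imitating IEEE 754 where 0.0 / 0.0 is NaN."""
    if count == 0:
        # Python raises ZeroDivisionError for 0.0 / 0, so NaN is returned
        # explicitly to imitate what floating-point division would give.
        return float("nan")
    return total / count


print(average(10.0, 1))             # 10.0
print(math.isnan(average(0.0, 0)))  # True
```

A NaN produced this way compares as neither greater nor smaller than any number, so where it lands in a sorted list depends entirely on how the sort treats it.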

@yonik
Member

yonik commented Sep 18, 2014

What version is this? I can't reproduce this with the latest version of Heliosearch.

@servicepilot
Author

Hello Yonik,

Thanks for your answer.

Below is the version I use:
solr-spec 4.11.0
solr-impl hs_0.07 Unversioned directory - yonik - 2014-09-07 21:01:57
lucene-spec 4.11.0
lucene-impl hs_0.07 Unversioned directory - yonik - 2014-09-07 20:59:47

  1. I created a "test" core with only 2 documents (1 for today and 1 for yesterday).

http://127.0.0.1:8983/solr/test/select?q=*%3A*&wt=json&indent=true
{
  "responseHeader":{
    "status":0,
    "QTime":1,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "id":"0001",
        "text":"change.me",
        "object":"object_1",
        "size":10,
        "timestamp":"2014-09-19T16:00:00Z",
        "_version_":1479657155584851968},
      {
        "id":"0002",
        "text":"change.me",
        "timestamp":"2014-09-18T20:00:00Z",
        "object":"object_2",
        "size":15,
        "_version_":1479657258503634944}]
  }}

  2. Then I want today's top 10 objects by average size.

http://127.0.0.1:8983/solr/test/select?q=timestamp:[2014-09-19T00:00:00.000Z%20TO%202014-09-19T23:59:59.999Z]%20AND%20(*)%20AND%20((*))&fq=object:[*%20TO%20*]&facet=true&facet.field=object&facet.limit=10&facet.offset=0&facet.stat=column0:avg(size)&facet.sort=column0%20desc&facet.stat=count()&rows=0&wt=json&indent=true

{
  "responseHeader":{
    "status":0,
    "QTime":7,
    "params":{
      "facet":"true",
      "indent":"true",
      "facet.offset":"0",
      "facet.sort":"column0 desc",
      "q":"timestamp:[2014-09-19T00:00:00.000Z TO 2014-09-19T23:59:59.999Z] AND (*) AND ((*))",
      "facet.stat":["column0:avg(size)",
        "count()"],
      "facet.limit":"10",
      "facet.field":"object",
      "wt":"json",
      "fq":"object:[* TO *]",
      "rows":"0"}},
  "response":{"numFound":1,"start":0,"docs":[]
  },
  "facets":{
    "object":{
      "stats":{
        "column0":10.0,
        "count()":1},
      "buckets":[{
          "val":"object_2",
          "column0":"NaN",
          "count()":0},
        {
          "val":"object_1",
          "column0":10.0,
          "count()":1}]}}}
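For anyone trying to reproduce this, the request above can be assembled programmatically. The following is a minimal Python sketch that rebuilds the same query string from the parameters echoed in the response; the helper name `build_query_url` is hypothetical, and the host, core name, and parameter values are simply the ones quoted in this issue.

```python
from urllib.parse import urlencode

# Host and core name as used in this issue.
BASE_URL = "http://127.0.0.1:8983/solr/test/select"


def build_query_url(day: str) -> str:
    """Build the facet.stat request for one day (format: YYYY-MM-DD)."""
    # A list of pairs is used because facet.stat appears twice.
    params = [
        ("q", f"timestamp:[{day}T00:00:00.000Z TO {day}T23:59:59.999Z]"
              " AND (*) AND ((*))"),
        ("fq", "object:[* TO *]"),
        ("facet", "true"),
        ("facet.field", "object"),
        ("facet.limit", "10"),
        ("facet.offset", "0"),
        ("facet.stat", "column0:avg(size)"),
        ("facet.sort", "column0 desc"),
        ("facet.stat", "count()"),
        ("rows", "0"),
        ("wt", "json"),
        ("indent", "true"),
    ]
    return BASE_URL + "?" + urlencode(params)


print(build_query_url("2014-09-19"))
```

Note that `urlencode` encodes spaces as `+` rather than `%20`; both are standard query-string encodings of a space.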

  3. For "object_2", which has no document for the selected period, I get the "NaN" value, and it is ranked first in the top 10.

Let me know if you have enough information.

Boris

@yonik
Member

yonik commented Sep 19, 2014

Ah, I was trying to reproduce with no values in a particular bucket, but now I see the issue was with no documents in a bucket. Thanks for the clarification, I can reproduce it now.

I'll add mincount support so one can specify "mincount=1".
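Until such an option is available, the same effect can be approximated on the client. The sketch below is only an illustration in Python, not the server-side implementation: the function `apply_mincount` is hypothetical, and the bucket list is copied from the response quoted earlier in this issue.

```python
# Buckets copied from the "facets" section of the response in this issue.
BUCKETS = [
    {"val": "object_2", "column0": "NaN", "count()": 0},
    {"val": "object_1", "column0": 10.0, "count()": 1},
]


def apply_mincount(buckets, mincount=1, count_key="count()"):
    """Keep only buckets whose document count is at least `mincount`."""
    return [b for b in buckets if b.get(count_key, 0) >= mincount]


kept = apply_mincount(BUCKETS, mincount=1)
print([b["val"] for b in kept])  # ['object_1']
```

This only works when the request includes the `count()` stat, and the filtering happens after `facet.limit` has been applied, so fewer than 10 buckets may remain.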

There are other longer-term issues here too:

  • should NaN continue to be used when no documents match a bucket, or should we just use 0?
    Although NaN is technically slightly more correct, it may be too much of a practical pain.
  • how should avg handle missing values in a bucket?
  • if we do allow NaN values, how should they be sorted? (Certainly not like they are today... I'd consider that a bug.)
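To make the sorting question concrete, here is one possible policy sketched in Python: NaN buckets always go last, whatever the sort direction. This is an assumption for discussion, not what Heliosearch does; `sort_buckets` and `is_nan_stat` are hypothetical names, and the sample data is taken from the response in this issue.

```python
import math


def is_nan_stat(value) -> bool:
    """True for a float NaN or for the string "NaN" used in the JSON output."""
    return value == "NaN" or (isinstance(value, float) and math.isnan(value))


def sort_buckets(buckets, key, descending=True):
    """Sort buckets by a numeric stat, always placing NaN values last."""
    numeric = [b for b in buckets if not is_nan_stat(b[key])]
    missing = [b for b in buckets if is_nan_stat(b[key])]
    numeric.sort(key=lambda b: b[key], reverse=descending)
    return numeric + missing


buckets = [
    {"val": "object_2", "column0": "NaN", "count()": 0},
    {"val": "object_1", "column0": 10.0, "count()": 1},
]
print([b["val"] for b in sort_buckets(buckets, "column0")])
# ['object_1', 'object_2']
```

The alternative raised in the first bullet, reporting 0 instead of NaN, would avoid the special case but would make an empty bucket indistinguishable from one whose average really is 0.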


2 participants