facet.stat with avg() function returns NaN #21

servicepilot · 2014-09-12T12:56:25Z

I want to get the top 10 countries with the highest average load time. The problem is that countries with NaN are at the top of the list (before countries with a numeric value). I know the NaN value is generated by a /0 operation.
It would be useful to exclude from the result the values that have a document count equal to zero.

yonik · 2014-09-18T19:23:12Z

What version is this? I can't reproduce this with the latest version of Heliosearch.

servicepilot · 2014-09-19T07:54:31Z

Hello Yonik,

Thanks for your answer.

Below is the version I use:
solr-spec 4.11.0
solr-impl hs_0.07 Unversioned directory - yonik - 2014-09-07 21:01:57
lucene-spec 4.11.0
lucene-impl hs_0.07 Unversioned directory - yonik - 2014-09-07 20:59:47

I created a "test" core with only 2 documents (1 for today and 1 for yesterday).

http://127.0.0.1:8983/solr/test/select?q=*%3A*&wt=json&indent=true
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"indent":"true",
"q":"*:*",
"wt":"json"}},
"response":{"numFound":2,"start":0,"docs":[
{
"id":"0001",
"text":"change.me",
"object":"object_1",
"size":10,
"timestamp":"2014-09-19T16:00:00Z",
"_version_":1479657155584851968},
{
"id":"0002",
"text":"change.me",
"timestamp":"2014-09-18T20:00:00Z",
"object":"object_2",
"size":15,
"_version_":1479657258503634944}]
}}

Then I want today's top 10 objects by average size.

http://127.0.0.1:8983/solr/test/select?q=timestamp:[2014-09-19T00:00:00.000Z%20TO%202014-09-19T23:59:59.999Z]%20AND%20(*)%20AND%20((*))&fq=object:[*%20TO%20*]&facet=true&facet.field=object&facet.limit=10&facet.offset=0&facet.stat=column0:avg(size)&facet.sort=column0%20desc&facet.stat=count()&rows=0&wt=json&indent=true

{
"responseHeader":{
"status":0,
"QTime":7,
"params":{
"facet":"true",
"indent":"true",
"facet.offset":"0",
"facet.sort":"column0 desc",
"q":"timestamp:[2014-09-19T00:00:00.000Z TO 2014-09-19T23:59:59.999Z] AND (*) AND ((*))",
"facet.stat":["column0:avg(size)",
"count()"],
"facet.limit":"10",
"facet.field":"object",
"wt":"json",
"fq":"object:[* TO *]",
"rows":"0"}},
"response":{"numFound":1,"start":0,"docs":[]
},
"facets":{
"object":{
"stats":{
"column0":10.0,
"count()":1},
"buckets":[{
"val":"object_2",
"column0":"NaN",
"count()":0},
{
"val":"object_1",
"column0":10.0,
"count()":1}]}}}

For "object_2", which has no documents in the selected period, I get the value "NaN", and it is ranked first in the top 10.

Let me know if you have enough information

Boris

yonik · 2014-09-19T14:11:05Z

Ah, I was trying to reproduce with no values in a particular bucket, but now I see the issue was with no documents in a bucket. Thanks for the clarification, I can reproduce it now.

I'll add mincount support so one can specify "mincount=1".
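
Until such a parameter exists, the same effect can be approximated on the client. A minimal sketch, assuming the response shape shown earlier in this thread; `drop_empty_buckets` is a hypothetical helper, not part of Heliosearch:

```python
import json

# Facet section copied from the response earlier in this thread.
response = json.loads("""
{"facets": {"object": {"buckets": [
    {"val": "object_2", "column0": "NaN", "count()": 0},
    {"val": "object_1", "column0": 10.0, "count()": 1}]}}}
""")


def drop_empty_buckets(buckets, mincount=1):
    """Keep only buckets whose document count is at least mincount."""
    return [b for b in buckets if b["count()"] >= mincount]


buckets = drop_empty_buckets(response["facets"]["object"]["buckets"])
print([b["val"] for b in buckets])
```

Note that this filters after the server has already applied facet.limit, so a top-10 request may come back with fewer than 10 usable buckets; a server-side mincount would not have that drawback.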

There are other longer-term issues here too:

Should NaN continue to be used when no documents match a bucket, or should we just use 0? Although NaN is technically slightly more correct, it may be too much of a practical pain.
How should avg handle missing values in a bucket?
If we do allow NaN values, how should they be sorted? Certainly not like they are today; I'd consider that a bug.
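
To make the options concrete, here is a rough sketch of two of the behaviours under discussion: an average that is NaN for an empty bucket, and a descending sort that places NaN after every numeric value. The names `avg` and `sort_desc_nan_last` are hypothetical, and this is only an illustration, not how Heliosearch implements faceting:

```python
import math


def avg(values):
    """Mean of values; NaN for an empty bucket (the 0/0 case)."""
    return sum(values) / len(values) if values else math.nan


def sort_desc_nan_last(buckets, key):
    """Sort buckets by key, largest first, with NaN after every number."""
    def sort_key(bucket):
        value = bucket[key]
        if math.isnan(value):
            return (1, 0.0)      # NaN buckets go to the end
        return (0, -value)       # numeric buckets, descending
    return sorted(buckets, key=sort_key)


buckets = [
    {"val": "object_2", "column0": avg([])},    # no documents in range
    {"val": "object_1", "column0": avg([10])},  # one document, size 10
]
ordered = sort_desc_nan_last(buckets, "column0")
print([b["val"] for b in ordered])
```

Returning 0 instead of NaN from `avg` would remove the need for special sort handling, at the cost of making an empty bucket indistinguishable from one whose average really is 0.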
