
Bucket script aggregation returns invalid value for missing docs #27377

Closed
shaharmor opened this issue Nov 14, 2017 · 14 comments · Fixed by #73297 · May be fixed by #44516
Assignees
Labels
:Analytics/Aggregations Aggregations >bug help wanted adoptme high hanging fruit Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@shaharmor

shaharmor commented Nov 14, 2017

Elasticsearch version (bin/elasticsearch --version): 6.0.0-rc1

JVM version (java -version):
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux elasticsearch-data-hot-003 4.11.0-1013-azure #13-Ubuntu SMP Mon Oct 2 17:59:06 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:
When doing a bucket_script aggregation that depends on a cumulative_sum aggregation of another sum aggregation, if the sum aggregation returns null values (because there are no docs in that time interval bucket), the bucket_script aggregation also returns null instead of using the cumulative sum value gathered so far.

Steps to reproduce:


  1. Add docs that span 5 minutes and look like this:
{
  "@timestamp": "...",
  "bytes": 100
}
  2. Run a query over a time range that extends past the end of that 5-minute window (so there will be date_histogram buckets without docs), with this aggregation:
{
"aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m",
        "min_doc_count": 0,
        "time_zone": "UTC"
      },
      "aggs": {
        "sum_bytes": {
          "sum": {
            "field": "bytes"
          }
        },
        "cumulative_bytes": {
          "cumulative_sum": {
            "buckets_path": "sum_bytes"
          }
        },
        "bucket": {
          "bucket_script": {
            "buckets_path": {
              "bytes": "cumulative_bytes"
            },
            "script": {
              "source": "params.bytes",
              "lang": "painless"
            }
          }
        }
      }
    }
  }
}
  3. Check the response and see that in the date_histogram buckets without docs, the bucket aggregation does not show the value it is supposed to (the cumulative_bytes value).
@martijnvg
Member

@shaharmor Did you try setting a missing value on the date_histogram aggregation? GitHub issues are meant for bugs and feature requests, and this sounds more like a question that should be asked on the forum first.
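For reference, the missing parameter on a date_histogram supplies a fallback date for documents that lack the field; a minimal sketch of that suggestion (the fallback date is illustrative only):

{
  "timeseries": {
    "date_histogram": {
      "field": "@timestamp",
      "interval": "1m",
      "min_doc_count": 0,
      "time_zone": "UTC",
      "missing": "2017-01-01T00:00:00"
    }
  }
}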

@shaharmor
Author

shaharmor commented Nov 14, 2017

I'm not sure how a missing value on the date_histogram would help, as there are no docs to add the missing field to...

Anyway, I created a forum thread as well: https://discuss.elastic.co/t/bucket-script-fails-when-some-docs-are-missing/107592/1

@colings86
Contributor

colings86 commented Nov 20, 2017

This is actually a bug.

It seems that the bucket_script aggregation is not executed on buckets that have a doc count of zero. The following recreation script highlights this:

PUT test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "bytes": {
          "type": "long"
        }
      }
    }
  }
}

PUT test/doc/1
{
  "@timestamp": "2017-01-01T00:00:00",
  "bytes": 100
}

PUT test/doc/2
{
  "@timestamp": "2017-01-01T00:05:00",
  "bytes": 100
}

GET test/_search
{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m",
        "min_doc_count": 0,
        "time_zone": "UTC"
      },
      "aggs": {
        "sum_bytes": {
          "sum": {
            "field": "bytes"
          }
        },
        "cumulative_bytes": {
          "cumulative_sum": {
            "buckets_path": "sum_bytes"
          }
        },
        "bucket": {
          "bucket_script": {
            "buckets_path": {
              "bytes": "cumulative_bytes"
            },
            "script": {
              "source": "params.bytes",
              "lang": "painless"
            }
          }
        }
      }
    }
  }
}

The response from the search request is:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "timeseries": {
      "buckets": [
        {
          "key_as_string": "2017-01-01T00:00:00.000Z",
          "key": 1483228800000,
          "doc_count": 1,
          "sum_bytes": {
            "value": 100
          },
          "cumulative_bytes": {
            "value": 100
          },
          "bucket": {
            "value": 100
          }
        },
        {
          "key_as_string": "2017-01-01T00:01:00.000Z",
          "key": 1483228860000,
          "doc_count": 0,
          "sum_bytes": {
            "value": 0
          },
          "cumulative_bytes": {
            "value": 100
          }
        },
        {
          "key_as_string": "2017-01-01T00:02:00.000Z",
          "key": 1483228920000,
          "doc_count": 0,
          "sum_bytes": {
            "value": 0
          },
          "cumulative_bytes": {
            "value": 100
          }
        },
        {
          "key_as_string": "2017-01-01T00:03:00.000Z",
          "key": 1483228980000,
          "doc_count": 0,
          "sum_bytes": {
            "value": 0
          },
          "cumulative_bytes": {
            "value": 100
          }
        },
        {
          "key_as_string": "2017-01-01T00:04:00.000Z",
          "key": 1483229040000,
          "doc_count": 0,
          "sum_bytes": {
            "value": 0
          },
          "cumulative_bytes": {
            "value": 100
          }
        },
        {
          "key_as_string": "2017-01-01T00:05:00.000Z",
          "key": 1483229100000,
          "doc_count": 1,
          "sum_bytes": {
            "value": 100
          },
          "cumulative_bytes": {
            "value": 200
          },
          "bucket": {
            "value": 200
          }
        }
      ]
    }
  }
}

In the response above, the buckets with keys 2017-01-01T00:01:00.000Z to 2017-01-01T00:04:00.000Z should have a bucket sub-aggregation whose value is 100 (the same as the cumulative_sum aggregation, since the script just outputs that value).

@liketic
Contributor

liketic commented Nov 22, 2017

Hi @colings86 Could you please provide some instructions to fix this? Thanks in advance!

@colings86
Contributor

I just took a look at the code around this and I think it's going to need some thought about how we can fix this bug without adversely affecting other pipeline aggregations. The problem arises when we resolve the value for the buckets_path for the empty bucket in BucketHelpers.resolveBucketValue here (https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/search/aggregations/pipeline/BucketHelpers.java#L174):

if (Double.isInfinite(value) || Double.isNaN(value) || (bucket.getDocCount() == 0 && !isDocCountProperty)) {
    switch (gapPolicy) {
    case INSERT_ZEROS:
        return 0.0;
    case SKIP:
    default:
        return Double.NaN;
    }
} else {
    return value;
}

That if statement returns Double.NaN (because the gap policy is SKIP) for any empty bucket, where an empty bucket is defined as a bucket with a doc count of zero.
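As an aside, the gap policy can be set explicitly on the bucket_script aggregation; a minimal sketch against the reporter's request is below. With insert_zeros the script does run on empty buckets, but it sees 0.0 rather than the cumulative value, so this does not solve the issue either.

"bucket": {
  "bucket_script": {
    "buckets_path": {
      "bytes": "cumulative_bytes"
    },
    "gap_policy": "insert_zeros",
    "script": {
      "source": "params.bytes",
      "lang": "painless"
    }
  }
}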

In BucketScriptPipelineAggregator.reduce() we don't execute on buckets which return Double.NaN so the bucket is skipped.

BucketHelpers.resolveBucketValue() is used by all of the pipeline aggregations so we need to be careful that any changes do not adversely affect the other pipeline aggregations.

We could potentially change the bucket_script aggregation so it executes on all buckets regardless of the retrieved value, and also pass the doc count of the current bucket to the script so it can determine what value to output, but I'm worried that this might make the aggregation too unwieldy for the user.

I'll mark this issue as discuss for now as I think we need to decide on an approach before it is implemented.

@colings86 colings86 added discuss and removed help wanted adoptme labels Nov 23, 2017
@colings86
Contributor

Discussed in FixItFriday and we decided that this requires more in-depth discussion within the Search and Aggregations team.

@colings86
Contributor

Discussed with the search and aggs team and we decided that we should pass the actual value to the script (instead of converting it to NaN or 0.0 using the gap policy), and also pass the doc count of the bucket to the script, so the user writing the script can know whether the bucket is empty and decide how to interpret the value.
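Under that approach, a user-written script could branch on the bucket's doc count; a hypothetical Painless source is sketched below (the doc_count parameter name is illustrative only, since the exact API had not been decided):

// keep the cumulative value even for empty buckets:
params.bytes

// or reproduce the old skip-like behaviour by zeroing empty buckets
// (params.doc_count is a hypothetical name for the passed-in doc count):
params.doc_count == 0 ? 0 : params.bytes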

@colings86
Contributor

We should ensure that the solution to this issue also fixes the issue in #27544

@shaharmor
Author

@colings86 Is this something that is being actively worked on? Any ETA?

@colings86
Contributor

colings86 commented Dec 19, 2017

@shaharmor it's not being actively worked on right now, but since we now have a way forward it is available to be picked up and worked on. There is no ETA for this currently.

@colings86
Contributor

@elastic/es-search-aggs

@smanolloff

smanolloff commented Jan 15, 2019

For anyone stumbling upon the same issue with bucket_selector aggregations, here is my workaround:

I will use the example from the docs:

{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "total_sales": {
          "sum": {
            "field": "price"
          }
        },
        "sales_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalSales": "total_sales"
            },
            "script": "params.totalSales < 200"
          }
        }
      }
    }
  }
}

The issue: the script will select only non-empty buckets. If you try something like params.totalSales == null || params.totalSales < 200, it still won't select the empty ones.
Workaround is to make use of the special _count path:

        "sales_bucket_filter": {
          "bucket_selector": {
            "buckets_path": {
              "totalSales": "total_sales",
              "bucketCount": "_count"
            },
            "script": "params.bucketCount == 0 || params.totalSales < 200"
          }
        }

At least in 6.3, this works for bucket_selector, but not for bucket_script, unfortunately. Can't say for other pipeline aggs.

@shaharmor
Author

Is this being fixed in 7?

@polyfractal
Contributor

Hi @shaharmor, I'm afraid there is still no progress on this issue. It's still on our radar, but we don't have anyone working on it at this time. We'll update this ticket when there's movement.

@polyfractal polyfractal self-assigned this Jul 17, 2019
@rjernst rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020
@polyfractal polyfractal removed their assignment Mar 18, 2021
@imotov imotov self-assigned this May 12, 2021
imotov added a commit to imotov/elasticsearch that referenced this issue May 20, 2021
Adds a new keep_values gap policy that works like skip, except if the metric
calculated on an empty bucket provides a non-null non-NaN value, this value is
used for the bucket.

Fixes elastic#27377
imotov added a commit that referenced this issue Jun 8, 2021
Adds a new keep_values gap policy that works like skip, except if the metric
calculated on an empty bucket provides a non-null non-NaN value, this value is
used for the bucket.

Fixes #27377

Co-authored-by: Mark Tozzi <mark.tozzi@gmail.com>
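
For anyone reading this later: the fix referenced above adds a keep_values gap policy, so in versions that contain it the reporter's bucket_script can presumably be written like this sketch:

"bucket": {
  "bucket_script": {
    "buckets_path": {
      "bytes": "cumulative_bytes"
    },
    "gap_policy": "keep_values",
    "script": {
      "source": "params.bytes",
      "lang": "painless"
    }
  }
}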