min_doc_count=0 doesn't work with a date_histogram with a filter #4843

Closed
cmaitchison opened this issue Jan 22, 2014 · 17 comments

@cmaitchison

I'm trying to create a date_histogram for recent events, where days where no events happen are still shown.

{
  "aggs": {
    "events_last_week": {
      "filter": {
        "range": {
          "@timestamp": {
            "from": "2014-01-10"
          }
        }
      },
      "aggs": {
        "events_last_week_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "format": "yyyy-MM-dd",
            "interval": "1d"
          }
        }
      }
    }
  }
}

I get a response like this

"aggregations":  {
  "events_last_week": {
    "doc_count": 33861,
    "events_last_week_histogram": [
      {
        "key_as_string": "2014-01-10",
        "key": 1389744000000,
        "doc_count": 2120
      }, {
        "key_as_string": "2014-01-16",
        "key": 1389830400000,
        "doc_count": 3823
      }, {
        "key_as_string": "2014-01-17",
        "key": 1389916800000,
        "doc_count": 27918
      }
    ]
  }
}

The empty days are not returned. If I construct the query without the filter, the empty days are returned correctly.

There is also an issue even when the empty days are returned correctly without the filter. If, for example, today is "2014-01-22", and the latest timestamp in my data is "2014-01-17", then the 5 days between these two dates are not returned as empty buckets, though all the empty buckets prior to "2014-01-17" are returned correctly.

uboness commented Jan 22, 2014

@cmaitchison

I can't really reproduce it; I ran the same queries as you and I get the right responses. What ES version are you working with? We introduced min_doc_count in 1.0.0.RC1.

There is also an issue even when the empty days are returned correctly without the filter. If, for example, today is "2014-01-22", and the latest timestamp in my data is "2014-01-17", then the 5 days between these two dates are not returned as empty buckets, though all the empty buckets prior to "2014-01-17" are returned correctly.

The gaps that are filled are based on the dates in the documents you're aggregating: the first histogram bucket is based on the earliest date in the document set and the last bucket on the latest date in the set, and we then fill in all gaps between these two buckets.

We can consider adding a "range" setting to the histograms which would let you define the value range (or date range in the case of date_histogram) over which the buckets are created. In your case, that means that if you define a range of the form "range": { "to": "now" } along with "min_doc_count": 0, we'll return all the empty buckets up until now (beyond the dates in the document set).
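
This later landed as the extended_bounds option on the histogram aggregations (see the comments further down in this thread). A minimal sketch of it on the date_histogram from the original query, assuming a release that supports extended_bounds; the max here stands in for "today" from the original report:

{
  "aggs": {
    "events_last_week_histogram": {
      "date_histogram": {
        "field": "@timestamp",
        "format": "yyyy-MM-dd",
        "interval": "1d",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2014-01-10",
          "max": "2014-01-22"
        }
      }
    }
  }
}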

uboness commented Jan 22, 2014

@cmaitchison scratch that... I finally managed to reproduce it (it happens when you have a single shard)... will work on a fix

@cmaitchison

Wow, nice find! I would never have thought to mention that.


@cmaitchison

Also related to this title, I've found that min_doc_count=0 does not work if all of the buckets would be empty after applying the filter. I can reproduce this issue on an index with 2 shards.

{
  "aggs": {
    "filtered_events": {
      "filter": {
        "and": [
          {
            "range": {
              "@timestamp": {
                "from": 1390267500000,
                "to":   1390267560000
              }
            }
          }
        ]
      },
      "aggs": {
        "filtered_events_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "interval": "1s"
          }
        }
      }
    }
  }
}

The above query should return 60 results, one for each second in the minute. If any events are found in that minute, then 60 results are returned. If no events are found in that minute, then 0 results are returned, where you would expect 60 empty buckets.

My use case is zooming in on a series on a chart. The zero-value results are very helpful for knowing where to plot the zeros on the x-axis.
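
As pointed out further down in this thread, later releases add an extended_bounds option that fixes the bucket range independently of the documents, which addresses this case. A sketch for the per-second query above, assuming extended_bounds is available; the bounds mirror the filter's range in epoch milliseconds, with max set to the key of the last one-second bucket so that exactly 60 buckets should come back:

{
  "aggs": {
    "filtered_events": {
      "filter": {
        "range": {
          "@timestamp": {
            "from": 1390267500000,
            "to": 1390267560000
          }
        }
      },
      "aggs": {
        "filtered_events_histogram": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "1s",
            "min_doc_count": 0,
            "extended_bounds": {
              "min": 1390267500000,
              "max": 1390267559000
            }
          }
        }
      }
    }
  }
}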

@cmaitchison

Another related issue I am finding is that sometimes the intervals do not go back far enough.

{
  "aggs": {
    "events_last_week": {
      "filter": {
        "and": [
          {
            "range": {
              "@timestamp": {
                "from": 1390267432894,
                "to": 1390267547037
              }
            }
          }
        ]
      },
      "aggs": {
        "events_last_week_histogram": {
          "date_histogram": {
            "min_doc_count": 0,
            "field": "@timestamp",
            "interval": "second"
          }
        }
      }
    }
  }
}

returns exactly

{
  "aggregations": {
    "events_last_week": {
      "doc_count": 1099,
      "events_last_week_histogram": [
        {
          "key": 1390267526000,
          "doc_count": 12
        },
        {
          "key": 1390267527000,
          "doc_count": 0
        },
        {
          "key": 1390267528000,
          "doc_count": 29
        },
        {
          "key": 1390267529000,
          "doc_count": 32
        },
        {
          "key": 1390267530000,
          "doc_count": 58
        },
        {
          "key": 1390267531000,
          "doc_count": 64
        },
        {
          "key": 1390267532000,
          "doc_count": 35
        },
        {
          "key": 1390267533000,
          "doc_count": 36
        },
        {
          "key": 1390267534000,
          "doc_count": 43
        },
        {
          "key": 1390267535000,
          "doc_count": 52
        },
        {
          "key": 1390267536000,
          "doc_count": 58
        },
        {
          "key": 1390267537000,
          "doc_count": 62
        },
        {
          "key": 1390267538000,
          "doc_count": 76
        },
        {
          "key": 1390267539000,
          "doc_count": 70
        },
        {
          "key": 1390267540000,
          "doc_count": 53
        },
        {
          "key": 1390267541000,
          "doc_count": 72
        },
        {
          "key": 1390267542000,
          "doc_count": 81
        },
        {
          "key": 1390267543000,
          "doc_count": 48
        },
        {
          "key": 1390267544000,
          "doc_count": 88
        },
        {
          "key": 1390267545000,
          "doc_count": 45
        },
        {
          "key": 1390267546000,
          "doc_count": 83
        },
        {
          "key": 1390267547000,
          "doc_count": 2
        }
      ]
    }
  }
}

But it is missing all of the empty buckets between 1390267432894 and 1390267526000. Again, this is with a 2-shard index on 1.0.0.RC1.

uboness commented Jan 23, 2014

@cmaitchison as I mentioned above, the histogram operates on the dataset and extracts the min/max of the histogram from the documents (the earliest/latest). There is no direct relation between the filter aggregation and the histogram aggregation (aggregations are unaware of other aggregations in their hierarchy). We could potentially add a range feature to the histogram, but if we do, it'll have to be post 1.0.

In the first example you gave, since there are no documents in that minute, there are no buckets (as we can't determine the min/max values). For the second example, it might be that the earliest document in the doc set has a later timestamp than the from value in the filter.

uboness added a commit that referenced this issue Jan 23, 2014
… shard, the reduce call was not propagated properly down the agg hierarchy.

Closes #4843
@cmaitchison

Thanks, @uboness, for your help and excellent explanation. A range option on the histogram is definitely a feature I would use. For now I can fill in the gaps on the client side. Thanks again.

uboness commented Jan 23, 2014

@cmaitchison no worries... thank you for the bug report! important one!

@erikvanzijst

I'm interested in hard range boundaries (returning empty buckets to fill gaps between from and to in the case of missing documents) as well. Is there an issue tracking this, or shall I raise one?

@deanchen

For anyone who arrived at this thread via Google, hard ranges are supported via the extended_bounds param: http://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-histogram-aggregation.html
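
Applied to the original query in this thread, that looks roughly like the following; this is only a sketch, with extended_bounds mirroring the filter's from date and the "today" from the original report:

{
  "aggs": {
    "events_last_week": {
      "filter": {
        "range": {
          "@timestamp": {
            "from": "2014-01-10"
          }
        }
      },
      "aggs": {
        "events_last_week_histogram": {
          "date_histogram": {
            "field": "@timestamp",
            "format": "yyyy-MM-dd",
            "interval": "1d",
            "min_doc_count": 0,
            "extended_bounds": {
              "min": "2014-01-10",
              "max": "2014-01-22"
            }
          }
        }
      }
    }
  }
}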

taf2 commented Jun 18, 2015

I'm now experiencing the same issue as reported, running ES 1.6.0.

histogram = {
  intervals: {
    date_histogram: {
      field: 'called_at',
      interval: 'day',
      order: { _key: "asc" },
      min_doc_count: 0 # doesn't appear to have any impact on the final result.
    },
    aggs: stats
  }
}

taf2 commented Jun 18, 2015

It looks like when nesting a date_histogram within a terms aggregation, there is no way for min_doc_count to auto-fill the zero results.

aggs: {
  groups: {
    terms: {
      min_doc_count: 0,
      script: '...'
    },
    aggs: {
      intervals: {
        date_histogram: {
          field: 'called_at',
          interval: 'day',
          order: { _key: "asc" },
          min_doc_count: 0 # doesn't appear to have any impact on the final result.
        },
        aggs: stats
      }
    }
  }
}
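
For reference, the same nested aggregation as a JSON request body, with extended_bounds added to the inner date_histogram. This is only a sketch: the dates are placeholders, the stats sub-aggregation from the snippet above is omitted, and pairing min_doc_count: 0 with extended_bounds should force the empty day buckets within each term bucket that exists, though whether that is sufficient for this nested case is exactly what a full recreation would confirm:

{
  "aggs": {
    "groups": {
      "terms": {
        "script": "..."
      },
      "aggs": {
        "intervals": {
          "date_histogram": {
            "field": "called_at",
            "interval": "day",
            "order": { "_key": "asc" },
            "min_doc_count": 0,
            "extended_bounds": {
              "min": "2015-06-01",
              "max": "2015-06-18"
            }
          }
        }
      }
    }
  }
}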

@clintongormley

@taf2 please could you open an issue with a complete recreation which explains the problem?

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
… shard, the reduce call was not propagated properly down the agg hierarchy.

Closes elastic#4843
@quillan86

Is this bug still there? I am trying to do the exact same thing as the OP right now.

vicapow commented Apr 20, 2020

me too! :)

@mashahabi15

And me as well. :)

@Crijavi4

Hi, I found the same issue, but it can be worked around by adding the extended_bounds object to the date_histogram aggregation, something like this:

{"extended_bounds":{"min":"+timeInit+","max":"+timeFin+"}} where timeInit and timeFin are the same period specified in the range filter in miliseconds

I hope this can help somebody.
