Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggs: clean up date_histogram #9062

Closed
jpountz opened this issue Dec 24, 2014 · 21 comments
Closed

Aggs: clean up date_histogram #9062

jpountz opened this issue Dec 24, 2014 · 21 comments
Assignees

Comments

@jpountz
Copy link
Contributor

jpountz commented Dec 24, 2014

The date_histogram aggregation supports both pre_zone and post_zone. This is bad because specifying two different values for these parameters makes elasticsearch return dates that are not in UTC. Instead we should only support a zone parameter that is the time zone that we will use to compute the buckets, and return buckets in UTC.

Similarly we have pre_offset and post_offset while we should only support a single offset parameter that would allow to do things like:

  • daily buckets should go from 6AM to 6AM the next day instead of 12AM
  • monthly buckets should go from the 10th of the month to the 10th of the next month instead of the 1st
  • etc.
@rjernst
Copy link
Member

rjernst commented Dec 29, 2014

+1

@cbuescher
Copy link
Member

Since I just changed the "pre_/post_offset" in #9417 for the histogram I can also look into this. I changed the date_histogram to only use the simpler "offset" option which only allows shifting the bucket start and endpoints.

Just to make sure I understand the way offest should work here: If I have two docs, one at 5am and one 7am on Feb 3 and I use a daily bucket with "offset" : "6h", then the first doc should go into the Feb 2nd bucket, keyed with "2015-02-02T06:00:00" and the second in the Feb 3rd bucket.

As for the changes in 'pre_zone' and 'post_zone'. As far as I understand it so far 'post_zone' only affects the 'valueForKey' of the underlying Roundings, so it would probably be sufficient to rename 'pre_zone' to 'zone' and drop the other parameter. Or am I missing something here?

@jpountz I could work on this further, adapt the existing tests etc. Would need some advice on what to consider in terms of bwc though, because dropping the parameters will affect serialization. Not sure how much of an issues this is when this only goes to v2.0.0.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 3, 2015

Just to make sure I understand the way offest should work here: If I have two docs, one at 5am and one 7am on Feb 3 and I use a daily bucket with "offset" : "6h", then the first doc should go into the Feb 2nd bucket, keyed with "2015-02-02T06:00:00" and the second in the Feb 3rd bucket.

Yes.

As for the changes in 'pre_zone' and 'post_zone'. As far as I understand it so far 'post_zone' only affects the 'valueForKey' of the underlying Roundings, so it would probably be sufficient to rename 'pre_zone' to 'zone' and drop the other parameter. Or am I missing something here?

Elasticsearch should always return UTC dates so I think we need both?

The time zone parameter should essentially behave the same way as the offset parameter, except that it is not fixed but adapts itself to things like daylight saving times. When a user specifies a timezone, it means that all buckets should start at offset (midnight by default) in this time zone. So for instance with Europe/Berlin, daily buckets would be 2015-02-02T23:00:00Z, 2015-02-03T23:00:00Z, ...

If we keep this Europe/Berlin + daily buckets example, when a date is aggregated, I think the workflow should look like this:

  1. temporary convert the date from UTC to the desired time zone, for instance 2015-02-03T70:00:00 would become 2015-02-03T08:00:00
  2. apply the rounding: 2015-02-03T08:00:00 becomes 2015-02-03T00:00:00
  3. convert back from the desired time zone to UTC: 2015-02-03T00:00:00 becomes 2015-02-02T23:00:00 which is midnight in Berlin.

Would need some advice on what to consider in terms of bwc though, because dropping the parameters will affect serialization. Not sure how much of an issues this is when this only goes to v2.0.0.

For backward compatibility I think we need to do the following:

  • On 1.5:
    • Add support for the offset parameter which does not exist there yet
    • Make pre_offset, post_offset, pre_zone, and post_zone parsed with the ParseField class. This way users will have a way to know they are using deprecated settings (see Use parseField to parse all parameter names #8964)
    • Modify the migration guide (in docs/reference/migration) to say that the pre_offset, post_offset parameter are deprecated in favour of offset and that pre_zone and post_zone are deprecated in favour of time_zone.
  • On 2.0
    • Remove the pre_offset, post_offset, pre_zone and post_zone parameters. The fact that serialization is affected is not an issue since 2.0 will require a full cluster restart, so it will not need to be able to talk with a 1.x node.
    • Add a note to the 2.0 migration guide explaining that these parameters were deprecated and have been removed
    • Potentially add a note that the behaviour of the time_zone parameter changed (I may be wrong but it seems to me like it does not always return UTC dates today).

@clintongormley Would be great if you could confirm that what I wrote above looks good to you.

@jpountz jpountz assigned cbuescher and unassigned jpountz Feb 3, 2015
@cbuescher
Copy link
Member

Working on this I start to see now why the current options are a bit confusing. Removing the "pre/post offset" part seems fairly straightforward, but for the pre/postZone deletion I had trouble getting all the test to run after my modifications to the TimeZoneRounding classes.

When I finally got my Tests running I stumbled into a test failure in IndicesQueryCacheTests which I first thought was due to my code changes, but digging into it I found I can reproduce the failure also on master.

There is an assertion in
https://github.com/elasticsearch/elasticsearch/blob/29c24d75e733d8f3409bac8488dce0e950118968/src/main/java/org/elasticsearch/search/aggregations/bucket/histogram/InternalHistogram.java#L398

that checks keys of subsequent buckets. When using a postZone(-1) in IndicesQueryCacheTests this can be made to fail. I added examples and an isolated test for this in the "test-timeZoneRounding" branch in my repo at

https://github.com/cbuescher/elasticsearch/compare/test-timeZoneRounding

@jpountz should I open separate issue for this?

@jpountz
Copy link
Contributor Author

jpountz commented Feb 4, 2015

Haven't dug yet but your description of the issue looks similar to #7673

@cbuescher
Copy link
Member

Looks related. The problem seems to lie in the nextRoundingValue() methods in the unit-based TimeZoneRoundings, haven't checked the proposed fix fo #7673 in #9029 but that solution is probably incomplete because it only corrects the case for one of the TimeZoneRounding classes.

@cbuescher
Copy link
Member

Just to clarify a few things I got really confused about when digging deeper into the current state of TimeZoneRoundings:

Currently pre_Zone offset is applied in roundKey() and postZone in valueForKey(). Would it be possible to just do all the zone conversion in roundKey() and return that in UTC after the cleanup? Apart from beeing confusing the current solution leads to things like roundKey() not beeing an idempotent operation when preZone!=0.

Also, should one safely assume that the input and output values to the rounding methods are alway UTC after the cleanup? Currently this seems not to be the case since the values returned by roundKey() and valueForKey() can both be in local time zones.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 5, 2015

Currently pre_Zone offset is applied in roundKey() and postZone in valueForKey(). Would it be possible to just do all the zone conversion in roundKey() and return that in UTC after the cleanup? Apart from beeing confusing the current solution leads to things like roundKey() not beeing an idempotent operation when preZone!=0.

The reason why it works this way is that these rounding methods are called in HistogramCollector.collect which is a typical bottleneck when running histogram aggregations. By splitting the rounding logic into roundKey and valueForKey, we still call roundKey for every value (typically millions of times), but valueForKey only once for every unique roundKey (typically only a couple tens). I don't know if roundKey should be idempotent, but X -> valueForKey(roundKey(X)) certainly should.

That said, we should favour correctness over performance so if this distinction makes things harder to reason about, let's disable it for now (eg. by putting all the logic into roundKey and making valueForKey return the provided argument).

Also, should one safely assume that the input and output values to the rounding methods are alway UTC after the cleanup? Currently this seems not to be the case since the values returned by roundKey() and valueForKey() can both be in local time zones.

The return value of roundKey is really an identifier of a bucket, it doesn't carry any meaning and has no notion of timezone. However, indeed, one can safely assume that the input of roundKey is a UTC date, and the output of valueForKey should be a UTC date as well.

I know the current state of timezone handling is a bit messy so if it makes things easier for you, feel free to break this issue into several smaller issues/pull requests.

@cbuescher
Copy link
Member

Probably separating removal of pre/postOffset from the whole time zone issue in two pull requests makes sense. I think that part is almost done, will revise the tests and go over the documentation for that one tomorrow and maybe can issue a PR then. Also merging this to 1.x seperately would be good.
The whole time zone thing also relies a bit on the discussion in #9029, so I see how that develops.

@cbuescher
Copy link
Member

From discussion this morning: also the 'pre_zone_adjust_large_interval' should be removed if possible.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 6, 2015

+1

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Feb 6, 2015
…ffset' and 'post_offset'

Add offset option to 'date_histogram' replacing and simplifying the previous 'pre_offset' and 'post_offset' options.
This change is part of a larger clean up task for `date_histogram` from issue elastic#9062.
@cbuescher
Copy link
Member

Opened a pull request for the 'offset' part of this task. Will do the changes needed for the deprecation of 'pre/postOffset' on the 1.x branch separately.

cbuescher pushed a commit that referenced this issue Feb 9, 2015
…ffset' and 'post_offset'

Add offset option to 'date_histogram' replacing and simplifying the previous 'pre_offset' and 'post_offset' options.
This change is part of a larger clean up task for `date_histogram` from issue #9062.
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Feb 9, 2015
Add offset option to 'date_histogram', deprecating the previous 'pre_offset' and 'post_offset' options.
This change is part of a larger clean up task for `date_histogram` from issue elastic#9062.
cbuescher pushed a commit that referenced this issue Feb 9, 2015
Add offset option to 'date_histogram', deprecating the previous 'pre_offset' and 'post_offset' options.
This change is part of a larger clean up task for `date_histogram` from issue #9062.
@cbuescher
Copy link
Member

Writing randomized tests I started wondering if the 'interval' setting should only be allowed to be positive. This isn't supported at the moment but if I issue a date_histogram request on with negative interval setting I seem to get an OOM.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 10, 2015

+1 on validating the interval

@cbuescher
Copy link
Member

Should I open a separate issue for a fix for that also on 1.x? Using negative intervals I think I can reliable reproduce an OOM error. This can be rejected quiet early in the REST call already.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 10, 2015

Yes, please!

@cbuescher
Copy link
Member

I think I have the time zone simplification in #9637. However there are a few potential simplifications that I would still like to discuss in that PR.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 12, 2015

Something else that would be nice to do when we have a single time_zone parameter would be to use this time_zone to format bucket keys. Here is a quick code example that probably better explains what I mean:

FormatDateTimeFormatter formatter = DateFieldMapper.Defaults.DATE_TIME_FORMATTER;
// this is midnight in Berlin
DateTime dateTime = formatter.parser().parseDateTime("2015-02-03T23:00:00Z");

// What we are basically doing today:
System.out.println(formatter.printer().print(dateTime));
// prints 2015-02-03T23:00:00.000Z

// What we could do instead
System.out.println(formatter.printer().withZone(DateTimeZone.forID("Europe/Berlin")).print(dateTime));
// prints 2015-02-04T00:00:00.000+01:00

Like the post_zone parameter this helps make buckets look like they still start at midnight, but on the contrary to the post_zone parameter we did not move the date to a different time zone, we are only printing it in the desired time zone (hence the "+01:00" at the end).

@cbuescher
Copy link
Member

Printing bucket keys in local time zones is another great idea, maybe this should be configurable as well? I would prefer opening a separate issue from the current PR though.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 13, 2015

+1 on a different issue, I was just suggesting it here since I think it's quite tied to this issue. I tend to think as time zones as something which is quite complicated to get right (a bit like encodings) so the fewer parameters we have about time zones, the better I think? Maybe we could just architecture the code so that adding this parameter would be straightforward to do in the future (if we ever decide to add it)?

@cbuescher
Copy link
Member

Just opened the above follow up issue for using the time zone formatting for bucket keys.

cbuescher pushed a commit that referenced this issue Mar 4, 2015
The options for pre_zone and post_zone in date_histogram are going to be removed
in 2.0 (docs and ocs and ee #9062) in favor of the already existing time_zone option.
This commit deprecates those fields using ParseFields and adds deprecation notice to
docs and the migration guide.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants