Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up time zone options for date_histogram #9637

Closed
wants to merge 7 commits into from

Conversation

cbuescher
Copy link
Member

Remove the current time zone options pre_zone and post_zone in favor of a simplified time_zone option.
The former were hard to understand and made it possible to return dates that are not in UTC. The time_zone parameter can be used to compute the buckets in the specified time zone and the buckets are returned in UTC.
Also the pre_zone_adjust_large_interval parameter is removed because now buckets are always returned in UTC.

Closes #9062

@cbuescher
Copy link
Member Author

Two thing about TimeZoneRounding that I still don't find intuitive to understand and are a potential source of missunderstandings and errors:

  1. The roundKey() and valueForKey() methods in the Roundings classes both look like they take time dates (in millis) as arguments, but only the former does. The valueForKey() method should only be called with arguments that are the result of calling the roundKey() method with a date. Resons for this are internal optimizations mentioned here (Aggs: clean up date_histogram #9062 (comment)). One possibility to make this less error prone would be to pull all the time zone calculations in roundKey() and have all the internal keys be valid UTC dates instead of just ids like right now, but this would mean that the additional calculations that now are in valueForKey() would be called internally for every value during the collect phase in the aggregation.
  2. The nextRoundingValue() should also only be used with certain input values, otherwise the semantics in terms of the underlying Rounding is not clear. At the moment however this is not enforced, and so calling that method e.g. for interval-based Roundings with input values that are not rounded returns values that are not rounded according to the underlying Rounding, (that is round(nextRoundingValue(v)) =< nextRoundingValue(v)). This can lead to bugs like the one discussed in Strange bucket for date_histogram with negative post_zone and zero min_doc_count #7673. One solution would be to take the rounding into account by (1) round the input value (2) add interval or time unit (3) return the rounded result. However this comes at the cost of extra computations that are avoided now.

@cbuescher
Copy link
Member Author

Also, after the refactoring the six TimeZoneRounding implementations are mostly identical. I think we can try to reduce them to just two (one for TimeUnit, the other for intervals).

@@ -125,6 +125,12 @@ The `histogram` and the `date_histogram` aggregation now support a simplified `o
`post_offset` rounding options. Instead of having to specify two separate offset shifts of the underlying buckets, the `offset` option
moves the bucket boundaries in positive or negative direction depending on its argument.

The `date_histogram` options for `pre_zone` and `post_zone` are replaced by the `time_zone` option. So far the `time_zone` option was
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this paragraph can be simplified to something like:

The `date_histogram` options for `pre_zone` and `post_zone` have been replaced by the `time_zone` option.  The behavior of `time_zone` is equivalent to the former `pre_zone` option.

I think the migration guide should be to the point about what options were available before, and are available now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, time_zone basically replaces the old pre_zone and the explanation is too verbose here. However I would stress here that the rounding values returned by round(utcTime) which should be equivalent to keyForValue(roundKey(utcTime)) (and thus be used as the bucket keys) are always UTC now. Before you could shift them to any other time zone using post_zone which to avoid was one of the points of #9062. At least thats what I thought it was, but I could also be mistaken.

@rjernst
Copy link
Member

rjernst commented Feb 11, 2015

@cbuescher I'm sorry I haven't been following the discussion about this closely, but I did notice right away here that the logic seems wrong to me. I would expect a single time_zone setting to be the equivalent of passing both pre and post zone before. I thought we always stored dates as UTC, is that not true? If so, setting time zone should tell us what timezone to parse incoming dates, and what time zone to output dates in string form (so that they can be displayed directly, without having to do translation on the client side). It is setting just the pre or post zone alone that I think is bogus.

Also, after the refactoring the six TimeZoneRounding implementations are mostly identical. I think we can try to reduce them to just two (one for TimeUnit, the other for intervals).

Big +1!

@cbuescher
Copy link
Member Author

@rjernst I might have missed something here too.

I would expect a single time_zone setting to be the equivalent of passing both pre and post zone before.

I thought that post_zone before was applied after rounding, so that ES could return dates that are not UTC,

I thought we always stored dates as UTC, is that not true?

I think yes, that should be true. Also in InternalDateHistogram.Bucket.getKey() the internal long keys are treated as if they are in UTC. Thats why all values returned by rounding should also be in UTC

setting time zone should tell us what timezone to parse incoming dates, and what time zone to output dates in string form (so that they can be displayed directly, without having to do translation on the client side)

Agree with the first part, but not sure about the second. I thought that after clean up dates and bucket keys (and key_as_string) are always UTC, so client might need to convert that for displaying it.

@jpountz would you agree with the above statements or am I still confusing something here?

@jpountz
Copy link
Contributor

jpountz commented Feb 11, 2015

What @cbuescher says is correct:

  • pre_zone was what the new time_zone is
  • post_zone was a way to convert buckets from UTC to a different time zone

Also, after the refactoring the six TimeZoneRounding implementations are mostly identical. I think we can try to reduce them to just two (one for TimeUnit, the other for intervals).

Huge +1 to that

Two thing about TimeZoneRounding that I still don't find intuitive to understand and are a potential source of missunderstandings and errors

For the record, I'm perfectly fine with simplifying things for now (like putting all the rounding in roundKey) if it helps us be correct. We can still look at optimizing that stuff later...

@cbuescher
Copy link
Member Author

I will add the DST tests @orenash mentioned in #9491. They all pass with the cleanup refactorings here. Then I will add another commit with a clean up of the six rounding classes.

@@ -303,6 +283,8 @@ public void writeTo(StreamOutput out) throws IOException {
}

UTCIntervalTimeZoneRounding(long interval) {
if (interval < 1)
throw new ElasticsearchIllegalArgumentException("Negative time interval not supported");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding this!

@jpountz
Copy link
Contributor

jpountz commented Feb 11, 2015

For the record, the change looks good to me so far.

I will add the DST tests @orenash mentioned in #9491. They all pass with the cleanup refactorings here. Then I will add another commit with a clean up of the six rounding classes.

Sounds great!

}
long rounded = field.roundFloor(timeLocal);
if (!isUtc) {
return timeZone.convertLocalToUTC(rounded, true, utcMillis);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpountz Doing the conversion back from local time to UTC here makes more sense to me, but I'm still wondering about performance maybe beeing better if this would be in valueForKey(). On the other hand that would implicitly create a dependency about how the two methods should be used. Should I make performance tests here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbuescher Making performance tests to see whether we would benefit from moving some computations to valueForKey would be awesome, but I'm already happy that we are making the time zone handling more reasonable, so I would be totally fine with investigating performance in a different issue/PR.

@cbuescher
Copy link
Member Author

I've removed four of the six TimeZoneRoundings, this will probably result in some test cases not be needed anymore. Will go through the unit and integration tests to see what is redundant there if you're generally okay with this simplification.

this.field = unit.field();
this.durationField = field.getDurationField();
this.timeZone = timeZone;
this.isUtc = timeZone.equals(DateTimeZone.UTC);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we need this boolean to guard against some timezone conversions. When the tz is UTC, it looks like the conversion is going to be very cheap anyway (just substracting a 0 if I'm not mistaken). It would also make the impl simpler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, only reason I added it was potential performance gains. But as you said above, simplify it here and do performance tests to check if it makes any change at all is the right way I think.

@jpountz
Copy link
Contributor

jpountz commented Feb 13, 2015

I've removed four of the six TimeZoneRoundings, this will probably result in some test cases not be needed anymore. Will go through the unit and integration tests to see what is redundant there if you're generally okay with this simplification.

Absolutely, this is great!

@cbuescher
Copy link
Member Author

Removed the special utc flag in the roundings and reverted some of the auto formatting which should make the diff more readable. There's not that many redundant tests as I thought, better to leave most of them in than miss a case that we test now. I only deleted the tests for removed options (like the "pre_zone_adjust...")

@jpountz
Copy link
Contributor

jpountz commented Feb 13, 2015

LGTM

@cbuescher
Copy link
Member Author

I trief comparing the performance of having the local-to-utc backward conversion either in roundKey() or valueForKey() and could not measure much difference calling it on 10e7 random values, once with a time conversion in roundKey() for every call, the other times just after each 1000/10000/100000 docs in valueForKey(). I dont know if that are usual ratios for buckets, but the measured difference was very small (maybe 1%), so I'll leave it more readable like it is now.

@jpountz since this pr is a little bit larger, do you suggest someone else looking at it or should I just merge this with master?

@jpountz
Copy link
Contributor

jpountz commented Feb 16, 2015

@cbuescher Please merge!

the other times just after each 1000/10000/100000 docs in valueForKey()

Not sure I understand that part, are these numbers the numbers of buckets that were generated in your test?

@cbuescher
Copy link
Member Author

I tested calling roundKey/valueForKey() with the implemenation where the local-to-utc conversion is only done once every bucket in valueForKey(). The time saving in just doing the back conversion depends on the number of values that fall into each bucket, so I varied these (those are the numbers above) but could not see any significant speed up, so I leave it without this optimization which makes the code harder to read.

@jpountz
Copy link
Contributor

jpountz commented Feb 16, 2015

OK, I understand now. This means that we could simplify further by removing roundKey and valueForKey and just use round() (in another PR).

@cbuescher
Copy link
Member Author

Except that these methods are defined in the the abstract Rounding class and it might be different in other implementations than the TimeZoneRoundings.

@cbuescher cbuescher closed this in 30fd70f Feb 16, 2015
@clintongormley clintongormley changed the title Aggregations: Clean up date_histogram Clean up time zone options for date_histogram Jun 6, 2015
@cbuescher cbuescher deleted the fix/9062 branch March 31, 2016 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Aggs: clean up date_histogram
4 participants