New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clean up time zone options for date_histogram
#9637
Conversation
Two thing about TimeZoneRounding that I still don't find intuitive to understand and are a potential source of missunderstandings and errors:
|
Also, after the refactoring the six TimeZoneRounding implementations are mostly identical. I think we can try to reduce them to just two (one for TimeUnit, the other for intervals). |
@@ -125,6 +125,12 @@ The `histogram` and the `date_histogram` aggregation now support a simplified `o | |||
`post_offset` rounding options. Instead of having to specify two separate offset shifts of the underlying buckets, the `offset` option | |||
moves the bucket boundaries in positive or negative direction depending on its argument. | |||
|
|||
The `date_histogram` options for `pre_zone` and `post_zone` are replaced by the `time_zone` option. So far the `time_zone` option was |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this paragraph can be simplified to something like:
The `date_histogram` options for `pre_zone` and `post_zone` have been replaced by the `time_zone` option. The behavior of `time_zone` is equivalent to the former `pre_zone` option.
I think the migration guide should be to the point about what options were available before, and are available now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, time_zone
basically replaces the old pre_zone
and the explanation is too verbose here. However I would stress here that the rounding values returned by round(utcTime) which should be equivalent to keyForValue(roundKey(utcTime)) (and thus be used as the bucket keys) are always UTC now. Before you could shift them to any other time zone using post_zone
which to avoid was one of the points of #9062. At least thats what I thought it was, but I could also be mistaken.
@cbuescher I'm sorry I haven't been following the discussion about this closely, but I did notice right away here that the logic seems wrong to me. I would expect a single
Big +1! |
@rjernst I might have missed something here too.
I thought that
I think yes, that should be true. Also in InternalDateHistogram.Bucket.getKey() the internal long keys are treated as if they are in UTC. Thats why all values returned by rounding should also be in UTC
Agree with the first part, but not sure about the second. I thought that after clean up dates and bucket keys (and key_as_string) are always UTC, so client might need to convert that for displaying it. @jpountz would you agree with the above statements or am I still confusing something here? |
What @cbuescher says is correct:
Huge +1 to that
For the record, I'm perfectly fine with simplifying things for now (like putting all the rounding in roundKey) if it helps us be correct. We can still look at optimizing that stuff later... |
@@ -303,6 +283,8 @@ public void writeTo(StreamOutput out) throws IOException { | |||
} | |||
|
|||
UTCIntervalTimeZoneRounding(long interval) { | |||
if (interval < 1) | |||
throw new ElasticsearchIllegalArgumentException("Negative time interval not supported"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this!
} | ||
long rounded = field.roundFloor(timeLocal); | ||
if (!isUtc) { | ||
return timeZone.convertLocalToUTC(rounded, true, utcMillis); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpountz Doing the conversion back from local time to UTC here makes more sense to me, but I'm still wondering about performance maybe beeing better if this would be in valueForKey(). On the other hand that would implicitly create a dependency about how the two methods should be used. Should I make performance tests here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbuescher Making performance tests to see whether we would benefit from moving some computations to valueForKey would be awesome, but I'm already happy that we are making the time zone handling more reasonable, so I would be totally fine with investigating performance in a different issue/PR.
I've removed four of the six TimeZoneRoundings, this will probably result in some test cases not be needed anymore. Will go through the unit and integration tests to see what is redundant there if you're generally okay with this simplification. |
this.field = unit.field(); | ||
this.durationField = field.getDurationField(); | ||
this.timeZone = timeZone; | ||
this.isUtc = timeZone.equals(DateTimeZone.UTC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we need this boolean to guard against some timezone conversions. When the tz is UTC, it looks like the conversion is going to be very cheap anyway (just substracting a 0 if I'm not mistaken). It would also make the impl simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, only reason I added it was potential performance gains. But as you said above, simplify it here and do performance tests to check if it makes any change at all is the right way I think.
Absolutely, this is great! |
…ting changes in tests
Removed the special utc flag in the roundings and reverted some of the auto formatting which should make the diff more readable. There's not that many redundant tests as I thought, better to leave most of them in than miss a case that we test now. I only deleted the tests for removed options (like the "pre_zone_adjust...") |
LGTM |
I trief comparing the performance of having the local-to-utc backward conversion either in roundKey() or valueForKey() and could not measure much difference calling it on 10e7 random values, once with a time conversion in roundKey() for every call, the other times just after each 1000/10000/100000 docs in valueForKey(). I dont know if that are usual ratios for buckets, but the measured difference was very small (maybe 1%), so I'll leave it more readable like it is now. @jpountz since this pr is a little bit larger, do you suggest someone else looking at it or should I just merge this with master? |
@cbuescher Please merge!
Not sure I understand that part, are these numbers the numbers of buckets that were generated in your test? |
I tested calling roundKey/valueForKey() with the implemenation where the local-to-utc conversion is only done once every bucket in valueForKey(). The time saving in just doing the back conversion depends on the number of values that fall into each bucket, so I varied these (those are the numbers above) but could not see any significant speed up, so I leave it without this optimization which makes the code harder to read. |
OK, I understand now. This means that we could simplify further by removing roundKey and valueForKey and just use round() (in another PR). |
Except that these methods are defined in the the abstract Rounding class and it might be different in other implementations than the TimeZoneRoundings. |
date_histogram
date_histogram
Remove the current time zone options
pre_zone
andpost_zone
in favor of a simplifiedtime_zone
option.The former were hard to understand and made it possible to return dates that are not in UTC. The
time_zone
parameter can be used to compute the buckets in the specified time zone and the buckets are returned in UTC.Also the
pre_zone_adjust_large_interval
parameter is removed because now buckets are always returned in UTC.Closes #9062