Fix date_trunc() behavior for decades, centuries and millenniums and add the ability to extract() these values out of a timestamp. #5056
Conversation
…havior of PostgreSQL when truncating dates by millennium, century or decade. Negative years are handled, as is the issue of years such as 1900 and 2000 being truncated to 1901 and 2001 respectively.
centuries and decades from timestamps.
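As a hedged illustration of the PostgreSQL semantics this PR targets (the helpers below are hypothetical, not the PR's actual code): decades start at years divisible by 10, while centuries and millennia start at years ending in 1 (1901, 2001, ...), since the Common Era has no year 0.

```rust
// Hypothetical sketch of year truncation for AD years under PostgreSQL
// semantics; illustrative only, not the PR's actual implementation.
fn trunc_decade(year: i32) -> i32 {
    // Decades start at years divisible by 10: 2000-2009 is one decade.
    year.div_euclid(10) * 10
}

fn trunc_century(year: i32) -> i32 {
    // Centuries start at years ending in 1: the 20th century is 1901-2000.
    (year - 1).div_euclid(100) * 100 + 1
}

fn trunc_millennium(year: i32) -> i32 {
    // Millennia likewise: the 2nd millennium is 1001-2000.
    (year - 1).div_euclid(1000) * 1000 + 1
}

fn main() {
    assert_eq!(trunc_decade(2000), 2000);
    assert_eq!(trunc_century(2000), 1901); // 2000 still belongs to the 20th century
    assert_eq!(trunc_millennium(2000), 1001);
    assert_eq!(trunc_century(2001), 2001); // 2001 opens the 21st
}
```

Note the asymmetry: `date_trunc('decade', ...)` on year 2000 stays at 2000, while `date_trunc('century', ...)` goes back to 1901, which is why the two cases need different formulas.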
Thanks very much! @quodlibetor or @sploiselle, can you review? Also, we'll need to get some test cases for the new/changed behavior.
Hey @benesch, thanks for the quick consideration. I added some tests in test/sqllogictest; hopefully that's enough. The additions don't work for the INTERVAL/TIME date_part()/date_trunc() functions yet, but those require more work since they are handled separately from timestamps, dates and date-times, so I think that should go into a separate PR.
    fn extract_decade(&self) -> f64 {
        f64::from(self.year().div_euclid(10))
    }
Euclidean division doesn't produce different results than standard division in the provided test cases. Can you let me know why you chose this method, or provide a test case that standard division fails?
The current version of Materialize returns the error "unsupported timestamp units 'decade'" on the test for decade extraction that I added in the last commit.
PostgreSQL returns the right result, "200 200 199 0 0 -1 -1 -2".
The years obtained from self.year() in the test are [2001, 2000, 1999, 1, 0, -1, -10, -11].
Had I implemented it with normal (truncating) division, as in

    fn extract_decade(&self) -> f64 {
        f64::from(self.year() / 10)
    }

Materialize would return "200 200 199 0 0 0 -1 -1", which is not how PostgreSQL handles it.
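The difference shows up only on the negative years; a minimal standalone sketch comparing the two kinds of division over the same test years:

```rust
fn main() {
    // The test years quoted above; year 0 and below are BC dates in
    // astronomical numbering.
    let years: [i32; 8] = [2001, 2000, 1999, 1, 0, -1, -10, -11];

    // Truncating division (/) rounds toward zero, so -1 and -11 land in
    // decades 0 and -1...
    let truncating: Vec<i32> = years.iter().map(|y| y / 10).collect();
    // ...while Euclidean division rounds toward negative infinity,
    // matching PostgreSQL's "200 200 199 0 0 -1 -1 -2".
    let euclidean: Vec<i32> = years.iter().map(|y| y.div_euclid(10)).collect();

    assert_eq!(truncating, vec![200, 200, 199, 0, 0, 0, -1, -1]);
    assert_eq!(euclidean, vec![200, 200, 199, 0, 0, -1, -1, -2]);
}
```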
Ah, you are correct!
Thanks for this @zRedShift! The code looks reasonable. I'm trying to wrap my head around the exact semantics of this math, and it's kind of comforting that postgres has some of the same thoughts as I do:
Yep, just following their logic for feature parity. If only the CE had started with year 0, most of this math could have been spared, but we're not that lucky. You'd think: why would a real-time streaming database concern itself with timestamps of events that happened over two millennia ago? But a faithful reproduction of the original behavior is still nice to have.
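A hedged sketch of that no-year-zero wrinkle for century extraction (the `century` helper is hypothetical, not this PR's code; chrono-style astronomical numbering is assumed, where year 0 is 1 BC, which PostgreSQL reports as century -1):

```rust
// Hypothetical illustration, not the PR's actual implementation.
fn century(year: i32) -> i32 {
    if year > 0 {
        // Years 1..=100 are century 1, 1901..=2000 are century 20.
        (year + 99) / 100
    } else {
        // There is no century 0: year 0 (1 BC) through -99 (100 BC)
        // all fall in century -1.
        year / 100 - 1
    }
}

fn main() {
    assert_eq!(century(2001), 21);
    assert_eq!(century(2000), 20);
    assert_eq!(century(1), 1);
    assert_eq!(century(0), -1); // 1 BC
    assert_eq!(century(-100), -2); // 101 BC
}
```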
That's high-volume computational archaeology 😂. I agree doing this right is important!
LGTM!
Thanks again! Sorry this took a while; I wanted to make sure that I understood it!
I've got one nit, but if you'd rather not handle it I'm happy to merge as-is, just let me know.
src/repr/src/adt/datetime.rs
Outdated
@@ -96,7 +96,9 @@ impl FromStr for DateTimeUnits {
     "isodow" => Ok(Self::IsoDayOfWeek),
     "isodoy" => Ok(Self::IsoDayOfYear),
     "h" | "hour" | "hours" | "hr" | "hrs" => Ok(Self::Hour),
-    "microsecond" | "microseconds" => Ok(Self::Microseconds),
+    "us" | "usec" | "microsecond" | "microseconds" | "useconds" | "usecs" => {
nit: could you reorganize this slightly:
"us" | "usec" | "microsecond" | "microseconds" | "useconds" | "usecs" => { | |
"us" | "usec" | "usecs" | "useconds" | "microsecond" | "microseconds" => { |
I agree, that's a good change. I just modelled it on the same order as milliseconds down below, so I'll change the order for both of those.
BTW, I've seen other codebases use burntsushi's fst when there are this many strings to match, since the code generated from a match expression doesn't turn the strings into a trie; but since the statements only get parsed once, I don't think there's any benefit to using it here.
Indeed, in this case it's basically part of the query compilation step, so hopefully it isn't often in a hot path. We generally like to have benchmarks or other evidence for optimizations that make the code meaningfully more complex (even if only under the hood). We should keep it in mind, though; there might be some places where it would help!
Thanks again for this!
Fixes #5057
Postgres implementation of truncation and extraction