
Fix date_trunc() behavior for decades, centuries and millenniums and add the ability to extract() these values out of a timestamp. #5056

Merged
merged 6 commits into MaterializeInc:main on Dec 17, 2020

Conversation

zRedShift
Contributor

@zRedShift zRedShift commented Dec 15, 2020

…havior

of PostgreSQL when truncating dates by millennium, century or decade.
Negative years are dealt with, as well as the issue with years such as
1900 and 2000 being truncated to 1901 and 2001 respectively.
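For context, here is a minimal sketch of the PostgreSQL-style boundary math for positive years (the helper names are illustrative, not the PR's actual code; BC years need the separate handling the PR adds). Since there is no year 0, the nth century runs from year 100*(n-1)+1 through 100*n, so 1900 truncates to 1801 and 2000 to 1901 rather than the buggy 1901 and 2001.

```rust
// Hedged sketch, assuming PostgreSQL's convention: centuries and millennia
// start at years ending in ...01, because the common era begins at year 1.
fn truncate_century_year(year: i32) -> i32 {
    assert!(year > 0, "negative (BC) years need separate handling, as in the PR");
    ((year + 99) / 100) * 100 - 99
}

fn truncate_millennium_year(year: i32) -> i32 {
    assert!(year > 0, "negative (BC) years need separate handling, as in the PR");
    ((year + 999) / 1000) * 1000 - 999
}

fn main() {
    // 1900 is the last year of the 19th century (1801..=1900).
    assert_eq!(truncate_century_year(1900), 1801);
    // 2000 is the last year of the 20th century (1901..=2000).
    assert_eq!(truncate_century_year(2000), 1901);
    assert_eq!(truncate_century_year(2001), 2001);
    // 2000 belongs to the 2nd millennium (1001..=2000).
    assert_eq!(truncate_millennium_year(2000), 1001);
    assert_eq!(truncate_millennium_year(2001), 2001);
    println!("ok");
}
```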
@CLAassistant

CLAassistant commented Dec 15, 2020

CLA assistant check
All committers have signed the CLA.

@benesch
Member

benesch commented Dec 15, 2020

Thanks very much! @quodlibetor or @sploiselle, can you review? Also, we'll need to get some test cases for the new/changed behavior.

@zRedShift
Contributor Author

Hey @benesch, thanks for the quick consideration. I added some tests in test/sqllogictest; hopefully that's enough. The additions don't work for the INTERVAL/TIME date_part()/date_trunc() functions yet, but those require more work since they're handled separately from timestamps, dates, and date-times, so I think that should go into a separate PR.

@zRedShift zRedShift marked this pull request as ready for review December 15, 2020 14:33
}

fn extract_decade(&self) -> f64 {
    f64::from(self.year().div_euclid(10))
Contributor

Euclidean division doesn't produce different results from standard division in the provided test cases. Can you let me know why you chose this method, or provide a test case where standard division fails?

Contributor Author

@zRedShift zRedShift Dec 15, 2020

The current version of Materialize will return the error "unsupported timestamp units 'decade'" on the test for decade extraction that I added in the last commit.

PostgreSQL will return the right result, "200 200 199 0 0 -1 -1 -2".
The years obtained from self.year() are [2001, 2000, 1999, 1, 0, -1, -10, -11] in the test.

Had I implemented it as

fn extract_decade(&self) -> f64 {
    f64::from(self.year() / 10)
}

That is, using normal division, Materialize would return "200 200 199 0 0 0 -1 -1", which is not how postgres handles it.
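The difference between the two operators can be sketched directly (a standalone illustration, not code from the PR): Rust's `/` truncates toward zero, while `div_euclid` rounds toward negative infinity, which is what PostgreSQL's decade extraction expects for BC years.

```rust
fn main() {
    // The years from the test, as quoted above.
    let years: [i32; 8] = [2001, 2000, 1999, 1, 0, -1, -10, -11];

    // Truncating division: -1 / 10 == 0, -11 / 10 == -1.
    let truncating: Vec<i32> = years.iter().map(|y| y / 10).collect();
    // Euclidean division: -1 -> -1, -11 -> -2.
    let euclidean: Vec<i32> = years.iter().map(|y| y.div_euclid(10)).collect();

    // Matches the "not how postgres handles it" output quoted above.
    assert_eq!(truncating, [200, 200, 199, 0, 0, 0, -1, -1]);
    // Matches PostgreSQL's "200 200 199 0 0 -1 -1 -2".
    assert_eq!(euclidean, [200, 200, 199, 0, 0, -1, -1, -2]);
    println!("ok");
}
```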

Contributor

Ah, you are correct!

@quodlibetor
Contributor

Thanks for this @zRedShift! The code looks reasonable; I'm still trying to wrap my head around the exact semantics of this math, and it's kind of comforting that postgres has some of the same thoughts as I do:

https://github.com/postgres/postgres/blob/a58db3aa10e62e4228aa409ba006014fa07a8ca2/src/backend/utils/adt/timestamp.c#L4007-L4018

@zRedShift
Contributor Author

Yep, just following their logic for feature parity. If only the CE started with the year 0, most of this math could've been spared, but we're not that lucky.

You'd think: why would a real-time streaming database concern itself with timestamps of events that happened over two millennia ago? But a faithful reproduction of the original behavior is still worth having.

@quodlibetor
Contributor

why would a real-time streaming database concern itself with timestamps of events that happened over 2 millennia ago?

That's high-volume computational archaeology 😂. I agree doing this right is important!

Contributor

@quodlibetor quodlibetor left a comment

LGTM!

Thanks again! Sorry this took a while; I wanted to make sure that I understood it!

I've got one nit, but if you'd rather not handle it I'm happy to merge as-is, just let me know.

@@ -96,7 +96,9 @@ impl FromStr for DateTimeUnits {
"isodow" => Ok(Self::IsoDayOfWeek),
"isodoy" => Ok(Self::IsoDayOfYear),
"h" | "hour" | "hours" | "hr" | "hrs" => Ok(Self::Hour),
"microsecond" | "microseconds" => Ok(Self::Microseconds),
"us" | "usec" | "microsecond" | "microseconds" | "useconds" | "usecs" => {
Contributor

nit: could you reorganize this slightly:

Suggested change
"us" | "usec" | "microsecond" | "microseconds" | "useconds" | "usecs" => {
"us" | "usec" | "usecs" | "useconds" | "microsecond" | "microseconds" => {

Contributor Author

I agree, that's a good change. I just modelled it on the milliseconds arm further down, which uses the same order, so I'll change the order for both of them.

BTW, I've seen other codebases use burntsushi's fst when there are this many strings to match, since the code generated from a match expression doesn't turn the strings into a trie. But since the statements just get parsed once, I don't think there's any benefit from using it here.
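For illustration, a minimal sketch of the arm style the suggestion produces: short aliases first, then the long forms. The enum name matches the diff context above, but the milliseconds aliases here are assumptions, not copied from Materialize.

```rust
#[derive(Debug, PartialEq)]
enum DateTimeUnits {
    Microseconds,
    Milliseconds,
}

// Sketch of the suggested ordering; returns None for unrecognized units.
fn parse_unit(s: &str) -> Option<DateTimeUnits> {
    match s {
        "us" | "usec" | "usecs" | "useconds" | "microsecond" | "microseconds" => {
            Some(DateTimeUnits::Microseconds)
        }
        // Hypothetical milliseconds arm, mirroring the same alias ordering.
        "ms" | "msec" | "msecs" | "mseconds" | "millisecond" | "milliseconds" => {
            Some(DateTimeUnits::Milliseconds)
        }
        _ => None,
    }
}

fn main() {
    assert_eq!(parse_unit("usecs"), Some(DateTimeUnits::Microseconds));
    assert_eq!(parse_unit("milliseconds"), Some(DateTimeUnits::Milliseconds));
    assert_eq!(parse_unit("fortnight"), None);
    println!("ok");
}
```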

Contributor

Indeed, in this case it's basically part of the query compilation step, so hopefully it isn't often in a hot path. We generally like to have benchmarks or other evidence for optimizations that make code meaningfully more complex (even if only under the hood). We should keep it in mind, though; there might be some places where it would help!

@quodlibetor quodlibetor merged commit 4a64938 into MaterializeInc:main Dec 17, 2020
@quodlibetor
Contributor

Thanks again for this!

@zRedShift zRedShift deleted the timestamp-truncation branch December 17, 2020 21:51
Successfully merging this pull request may close these issues.

date_trunc() behaves incorrectly when asked to truncate to millennium, century or decade