Initial implementation of DateTime64 #5187

Closed
wants to merge 15 commits

Conversation

@Gladdy (Contributor) commented May 4, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Request for feedback - not for merging yet.

#4860
There is still some functionality that should be added; however, I believe this is a valid first prototype in terms of functionality, and feedback would be very much appreciated.

For the changelog. Remove if this is a non-significant change.
Add DateTime64

Category (leave one):

  • New Feature

Short description (up to a few sentences):
Adds a DateTime64 column type. As the name implies, it consists of 64 bits. Currently the only supported interpretation is nanoseconds since the epoch. It supports formatting, input as a string, and some basic transformations as described in DateTimeTransforms.h.

Detailed description (optional):
The interpretation of the 64-bit value has been factored out, so it should be straightforward to also add support for milliseconds/microseconds since the epoch (this would work in a similar fashion to how the timezone changes how the field is interpreted).

The reason for initially picking nanoseconds since the epoch is that this is what the Python data processing ecosystem has settled on (e.g. pandas uses int64_t nanoseconds since the epoch for its timestamps). The test below outlines the currently supported functionality. As the column is based on Int64, joins work fine; what is currently lacking is arithmetic.

CREATE TABLE A(t DateTime64) ENGINE = MergeTree() ORDER BY t;
INSERT INTO A(t) VALUES (1556879125123456789);
INSERT INTO A(t) VALUES ('2019-05-03 11:25:25.123456789');

SELECT toString(t, 'UTC'), toDate(t), toStartOfDay(t), toStartOfQuarter(t), toTime(t), toStartOfMinute(t) FROM A ORDER BY t;
2019-05-03 10:25:25.123456789	2019-05-03	2019-05-03 00:00:00	2019-04-01	1970-01-02 11:25:25	2019-05-03 11:25:00
2019-05-03 10:25:25.123456789	2019-05-03	2019-05-03 00:00:00	2019-04-01	1970-01-02 11:25:25	2019-05-03 11:25:00
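
Joins on the Int64-backed column can be sketched as follows (table B and its contents are hypothetical, not part of this PR's tests):

CREATE TABLE B(t DateTime64, v UInt32) ENGINE = MergeTree() ORDER BY t;
INSERT INTO B(t, v) VALUES (1556879125123456789, 42);

SELECT t, v FROM A INNER JOIN B USING (t);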

@alexey-milovidov alexey-milovidov added can be tested pr-feature Pull request with new product feature labels May 4, 2019
@filimonov (Contributor) commented May 5, 2019

This implementation does not extend the supported datetime range, and it makes nanoseconds the 'default choice' for users who need (let's say) only milliseconds.

In my opinion:

  1. Precision should be configurable (as in the Decimal data type), because nanoseconds are needed quite rarely, while micro-/milliseconds are needed quite often. (A syntax sketch follows this list.)
  2. Introducing 'wider' DateTime fields should also cover wider time ranges than 1970..2105 (this can simply fall back to traditional, slower calendar calculations for times not covered by the lookup tables).
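
A syntax sketch of what configurable precision could look like, borrowing the Decimal convention (the DateTime64(N) parameter is a proposal here, not something this PR implements):

CREATE TABLE events
(
    t_ms DateTime64(3),  -- milliseconds
    t_us DateTime64(6)   -- microseconds
) ENGINE = MergeTree() ORDER BY t_ms;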

@alexey-milovidov (Member) commented May 5, 2019

> Precision should be configurable (as in the Decimal data type), because nanoseconds are needed quite rarely, while micro-/milliseconds are needed quite often.

Even milliseconds force the use of at least a 64-bit data type instead of a 32-bit one. Consequently, there is not much difference between millisecond and nanosecond resolution.

> Introducing 'wider' DateTime fields should also cover wider time ranges than 1970..2105 (this can simply fall back to traditional, slower calendar calculations for times not covered by the lookup tables).

+1.

In a straightforward implementation (a number of nanoseconds since the epoch), the Int64 data type gives us about 292 years around 1970:

example.yandex.net :) SELECT (0x7FFFFFFFFFFFFFFF / 1000000000) / 86400 / 365

SELECT ((9223372036854775807 / 1000000000) / 86400) / 365

┌─divide(divide(divide(9223372036854775807, 1000000000), 86400), 365)─┐
│                                                    292.471208677536 │
└─────────────────────────────────────────────────────────────────────┘

It is not obvious whether it's enough.

Another implementation approach is to store the fractional component in a separate subcolumn and data stream (the way Tuple, Array, and Nullable data types are stored and processed).
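
For illustration only, the subcolumn layout can be approximated with existing types: seconds plus a fractional part in a Tuple, whose elements are already stored as separate streams (not a proposed storage format, and the table is hypothetical):

CREATE TABLE C(t Tuple(DateTime, UInt32)) ENGINE = MergeTree() ORDER BY tuple();
INSERT INTO C(t) VALUES ((toDateTime('2019-05-03 11:25:25'), 123456789));
SELECT tupleElement(t, 1) AS seconds, tupleElement(t, 2) AS nanos FROM C;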

@filimonov (Contributor) commented May 5, 2019

> Precision should be configurable (as in the Decimal data type), because nanoseconds are needed quite rarely, while micro-/milliseconds are needed quite often.

> Even milliseconds force the use of at least a 64-bit data type instead of a 32-bit one. Consequently, there is not much difference between millisecond and nanosecond resolution.

If you use microseconds instead of nanoseconds, there is room to store 1000x more seconds, which gives about 292471 years in both directions; that is enough for all possible cases.
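
The same back-of-envelope check as in the earlier comment, this time dividing by 10^6 for microseconds:

SELECT ((9223372036854775807 / 1000000) / 86400) / 365
-- ≈ 292471 years on each side of the epoch, since Int64 is signed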

Or you can decrease subsecond precision to zero and store the age of the universe in ClickHouse :)

Another (more realistic) scenario: I have microseconds in the data stream, but want to store only 1/10-second precision in the DB. So the insert comes in as '2019-05-05 23:02:00.123141203' and I want '2019-05-05 23:02:00.1' to be stored.
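
Under the configurable-precision sketch above, that scenario might look like this (both the syntax and the rounding behavior are hypothetical):

CREATE TABLE ticks(t DateTime64(1)) ENGINE = MergeTree() ORDER BY t;
INSERT INTO ticks(t) VALUES ('2019-05-05 23:02:00.123141203');
SELECT t FROM ticks;  -- would return 2019-05-05 23:02:00.1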

Also, you could use bit shifts instead of decimal division to remove subsecond precision, i.e. give the whole lower 31 bits to nanoseconds, the lower 21 bits to microseconds, or the lower 10 bits to milliseconds, with the rest going to seconds (but that would make it incompatible with plain UInt64, and a direct typecast would give strange results).
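
The bit layout can be illustrated with plain UInt64 arithmetic and ClickHouse's bit functions (illustrative only, not a proposed API):

WITH
    toUInt64(1556879125) AS sec,
    toUInt64(123456789) AS nanos
SELECT
    bitOr(bitShiftLeft(sec, 31), nanos) AS packed,  -- seconds in the upper bits, nanos in the lower 31
    bitShiftRight(packed, 31) AS seconds_back,      -- recover the seconds
    bitAnd(packed, 0x7FFFFFFF) AS nanos_back        -- recover the nanoseconds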

If DateTime64 gets some fixed precision, I think it should be fixed at microseconds, not nanoseconds. I don't know of any database natively supporting nanoseconds.

> Introducing 'wider' DateTime fields should also cover wider time ranges than 1970..2105 (this can simply fall back to traditional, slower calendar calculations for times not covered by the lookup tables).

+1.

> In a straightforward implementation (a number of nanoseconds since the epoch), the Int64 data type gives us about 292 years around 1970:
> It is not obvious whether it's enough.

  • SQL Server: January 1, 1753 through December 31, 9999
  • MySQL: '1000-01-01 00:00:00' to '9999-12-31 23:59:59'
  • Postgres: 4713 BC..294276 AD
  • Oracle: '0001-01-01-00.00.00.000000'..'9999-12-31-23.59.59.999999'

@filimonov (Contributor) commented May 5, 2019

> In a straightforward implementation (a number of nanoseconds since the epoch), the Int64 data type gives us about 292 years around 1970:

BTW: DateLUT should be adjusted to support that anyway, right? Because it will start receiving Int64 instead of UInt32.

@alexey-milovidov (Member) commented

> which gives about 292471 years in both directions; that is enough for all possible cases.

Yes. Some calendar issues will arise (for calculations on historical events), but we can just ignore them.

> Another (more realistic) scenario: I have microseconds in the data stream, but want to store only 1/10-second precision in the DB. So the insert comes in as '2019-05-05 23:02:00.123141203' and I want '2019-05-05 23:02:00.1' to be stored.

Ok, but it will be doable nevertheless.

> Also, you could use bit shifts instead of decimal division to remove subsecond precision, i.e. give the whole lower 31 bits to nanoseconds, the lower 21 bits to microseconds, or the lower 10 bits to milliseconds, with the rest going to seconds (but that would make it incompatible with plain UInt64, and a direct typecast would give strange results).

The compiler will translate the division into a multiplication (latency 3 clock cycles) and a bit shift (latency 1 clock cycle). As all operations are done in a loop, the loop will be unrolled and vectorized. SSE4.1 has packed multiplication of two 64-bit integers. It will be slower than a plain bit shift, but not by much (about two to three times).

> If DateTime64 gets some fixed precision, I think it should be fixed at microseconds, not nanoseconds. I don't know of any database natively supporting nanoseconds.

As far as I know, InfluxDB uses nanosecond precision by default.

> BTW: DateLUT should be adjusted to support that anyway, right? Because it will start receiving Int64 instead of UInt32.

We can introduce (probably via a template) another DateLUT for Int64 and keep the existing one (for UInt32) to avoid any performance penalty. The existing DateLUT has a very nice memory layout (but the difference should be measured).

@suxw8813 commented

Does the windowFunnel function support DateTime64?

@vitlibar vitlibar self-assigned this May 31, 2019
@filimonov (Contributor) commented

IMHO it should have dynamic precision, like Decimal, and like MySQL's fractional seconds: https://dev.mysql.com/doc/refman/5.6/en/fractional-seconds.html

@filimonov filimonov assigned filimonov and unassigned filimonov Sep 3, 2019
@stale stale bot added the st-wontfix Known issue, no plans to fix it currenlty label Oct 20, 2019
@blinkov blinkov removed the st-wontfix Known issue, no plans to fix it currenlty label Oct 20, 2019
@ClickHouse ClickHouse deleted a comment from stale bot Oct 29, 2019
@stavrolia (Contributor) commented

The continuation of this PR is here.

@stavrolia stavrolia closed this Nov 27, 2019
@Enmk Enmk mentioned this pull request Dec 11, 2019