Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: new partition grammar - parser part #3347

Merged
merged 8 commits into from Feb 27, 2024

Conversation

waynexia
Copy link
Member

I hereby agree to the terms of the GreptimeDB CLA

What's changed and what's your intention?

This is the first patch to the new partition rule. It only changes the parser and grammar. Other logics are left unchanged or blank. I wanna to merge this first because this part may trigger a lot of discussion.

A brief take out of the new grammar. It contains two parts, one is a list of columns, which is used as an allow-list of which columns can be present in the partition rule. And the second is a set of initial rule list. The grammar looks like the following

CREATE TABLE (...)
PARTITION ON COLUMNS (A, B, C) (
    <RULE LIST>
)

An example:

CREATE TABLE (
    A INT,
    B INT,
    C INT,
    TS TIMESTAMP TIME INDEX,
    PRIMARY KEY (A, B, C)
)
PARTITION ON (A, B, C) (
  A < 10,
  A > 10 AND A < 20,
  A > 20 AND B < 100,
  A > 20 AND B >= 100
);

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR does not require documentation updates.

Refer to a related PR or issue link (optional)

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@github-actions github-actions bot added Invalid PR Title docs-required This change requires docs update. labels Feb 21, 2024
@waynexia waynexia changed the title New partition grammar feat: new partition grammar - parser part Feb 21, 2024
@waynexia waynexia changed the title feat: new partition grammar - parser part feat!: new partition grammar - parser part Feb 21, 2024
@github-actions github-actions bot added the breaking-change This pull request contains breaking changes. label Feb 21, 2024
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Copy link

codecov bot commented Feb 21, 2024

Codecov Report

Attention: Patch coverage is 97.10983% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 84.46%. Comparing base (450dfe3) to head (2661dda).
Report is 43 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3347      +/-   ##
==========================================
- Coverage   85.26%   84.46%   -0.80%     
==========================================
  Files         881      893      +12     
  Lines      143843   147284    +3441     
==========================================
+ Hits       122641   124406    +1765     
- Misses      21202    22878    +1676     

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@killme2008
Copy link
Contributor

Looks cool.

I have a question: assuming a user wants to partition data by date, for example, one partition per day, is this syntax currently supported?

@waynexia
Copy link
Member Author

I have a question: assuming a user wants to partition data by date, for example, one partition per day, is this syntax currently supported?

This is not supported in this place (or one can manually enumerate the partition, but this isn't what we want).

Maybe we can implement some special rule for TIME INDEX in the future? Like

PARTITION ON COLUMNS (..., ts) (
  ...,
  ts WITHIN '1d'
)

But is it a good practice to include TIME INDEX in the partition columns?

@fengjiachun
Copy link
Collaborator

Looks cool.

I have a question: assuming a user wants to partition data by date, for example, one partition per day, is this syntax currently supported?

It seems that this mode is not supported now, and it will make the number of regions dynamic. This also has a certain impact on the structure of routing info. There may also be some inactive regions, which means that no new data is being written.
I have a little concern about the above question, and I am slightly doubtful whether this mode is really necessary, given that our DB naturally organizes data according to timestamps.

@killme2008
Copy link
Contributor

Maybe we can implement some special rule for TIME INDEX in the future? Like

PARTITION ON COLUMNS (..., ts) (
  ...,
  ts WITHIN '1d'
)

That would be great!

But is it a good practice to include TIME INDEX in the partition columns?

Yes, it's not a good practice to partition data by TIME INDEX in common. But in some use cases, there is not too much data, and users want to organize and manage the data by date.

@WenyXu
Copy link
Member

WenyXu commented Feb 22, 2024

BTW, Do we need to add some alias for each <, >, >= , like adding lt, less than for <. (Maybe in another PR)

@MichaelScofield
Copy link
Collaborator

  1. Partitioning by date like that requires creating region on demand. It's not trivial work. However, once we have dynamic partitioning, it should be easy to Implement this static partitioning. I suggest we take a consideration of starting from there.
  2. Looking at the old partition grammar, I find the new grammar a little less interesting, more of like a "syntactic sugar". For example, the old:
CREATE TABLE dist_table(
    ts TIMESTAMP DEFAULT current_timestamp(),
    n INT,
    row_id INT,
    PRIMARY KEY(n),
    TIME INDEX (ts)
)
PARTITION BY RANGE COLUMNS (n) (
    PARTITION r0 VALUES LESS THAN (5),
    PARTITION r1 VALUES LESS THAN (9),
    PARTITION r2 VALUES LESS THAN (MAXVALUE),
)
engine=mito;

versus the new:

CREATE TABLE dist_table(
    ts TIMESTAMP DEFAULT current_timestamp(),
    n INT,
    row_id INT,
    PRIMARY KEY(n),
    TIME INDEX (ts)
)
PARTITION ON (n) (
    n < 5,
    n < 10,
    n >= 10,
)
engine=mito;

It's of course good to have this simplification, and I'm ok with this new grammar. However, I'm afraid if the partition rules grow, or the partition methods like list or hash being supported, in the end the new grammar would have became just as obscure as the old.

  1. Partitioning by time is natural to human, and is common in tsdb. At least timescaledb and influxdb both first decide to do it. Originally there's a hotspot issue we are concerning of. But now under the reproposing of partition grammar, I think it's time to reconsider

@waynexia
Copy link
Member Author

Yes, it's not a good practice to partition data by TIME INDEX in common. But in some use cases, there is not too much data, and users want to organize and manage the data by date.

Partitioning by time is natural to human, and is common in tsdb. At least timescaledb and influxdb both first decide to do it. Originally there's a hotspot issue we are concerning of. But now under the reproposing of partition grammar, I think it's time to reconsider

To remind, our database has the same capability corresponding to hypertables from TimescaleDB - or as a non-pg-based storage system, we simply do not have that issue that needs "hypertable".

Data would be organized to appropriate time windows during compaction, automatically. What can benefit more from encouraging users to set a time span?

BTW, Do we need to add some alias for each <, >, >= , like adding lt, less than for <. (Maybe in another PR)

Here uses the same parser with WHERE clause. Alias like <> for != is available. lt and less than can be chosen if we want more than WHERE.

However, I'm afraid if the partition rules grow, or the partition methods like list or hash being supported, in the end the new grammar would have became just as obscure as the old.

That's a good point. There should be a grammar to define repeated rules (like the time partition). I added a task to the tracker

@MichaelScofield
Copy link
Collaborator

The old grammar comes from mysql, it's complex but complete. If the new grammar looks just like the old, I'm afraid we are wasting time reinventing it. So I'm neutral to this change. I say let's design a whole new syntax for the unique dynamic partitioning feat for GreptimeDB!

Copy link
Contributor

@killme2008 killme2008 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep pushing!

@killme2008 killme2008 added this pull request to the merge queue Feb 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 26, 2024
@waynexia waynexia added this pull request to the merge queue Feb 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 26, 2024
@killme2008 killme2008 added this pull request to the merge queue Feb 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 26, 2024
@killme2008
Copy link
Contributor

@waynexia waynexia added this pull request to the merge queue Feb 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 27, 2024
@waynexia waynexia added this pull request to the merge queue Feb 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 27, 2024
@waynexia waynexia added this pull request to the merge queue Feb 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 27, 2024
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia waynexia added this pull request to the merge queue Feb 27, 2024
Merged via the queue into GreptimeTeam:main with commit 3544c93 Feb 27, 2024
16 checks passed
@waynexia waynexia deleted the new-partition-grammar branch February 27, 2024 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change This pull request contains breaking changes. docs-required This change requires docs update.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants