Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPTIMIZE DEDUPLICATE BY COLUMNS #17846

Merged
merged 13 commits into from
Dec 20, 2020

Conversation

Enmk
Copy link
Contributor

@Enmk Enmk commented Dec 6, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Extended OPTIMIZE ... DEDUPLICATE syntax to allow explicit (or implicit with asterisk/column transformers) list of columns to check for duplicates on.
...

Detailed description / Documentation draft:
Following syntax variants are now supported:

OPTIMIZE TABLE table DEDUPLICATE; -- the old one
OPTIMIZE TABLE table DEDUPLICATE BY *; -- not the same as the old one, excludes MATERIALIZED columns (see the note below)
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);

Note that * behaves just like in SELECT: MATERIALIZED, and ALIAS columns are not used for expansion.
Also, it is an error to specify empty list of columns, or write an expression that results in an empty list of columns, or deduplicate by an ALIAS column.
Column transformers other than EXCEPT are not supported.

Please see tests for examples.
...

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Dec 6, 2020
@Enmk Enmk changed the title Optimize deduplicate OPTIMIZE DEDUPLICATE BY COLUMNS Dec 6, 2020
@Enmk Enmk force-pushed the Optimize_deduplicate branch 2 times, most recently from 80eed32 to cafe449 Compare December 7, 2020 06:31
Extended OPTIMIZE ... DEDUPLICATE syntax to allow explicit (or implicit with asterisk/column transformers) list of columns to check for duplicates on.

Following syntax variants are now supported:

OPTIMIZE TABLE table DEDUPLICATE; -- the old one
OPTIMIZE TABLE table DEDUPLICATE BY *;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);

Note that * behaves just like in SELECT: MATERIALIZED, and ALIAS columns are not used for expansion.
Also, it is an error to specify empty list of columns, or write an expression that results in an empty list of columns, or deduplicate by an ALIAS column.
Column transformers other than EXCEPT are not supported.
* no more undefined values for attributes in ReplicatedMergeTreeLogEntry
* validation of string serialization format
Also a minor cleanup of the test code.
Also logging expanded list of columns passed from `DEDUPLICATE BY` to actual deduplication routines.
for (;;)
{
String tmp_column_name;
readJSONString(tmp_column_name, in);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite dangerous, because JSON does not support non-unicode data while we support arbitrary bytes for column names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will lead to idiosynchrasy as two different escaping methods will be used in the same file.
Let's reuse existing "tsv-escaped" method.

Copy link
Contributor Author

@Enmk Enmk Dec 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I've done some research, and it looks like the method we use with TabSeparatedWithNames (I believe this is what you meant by 'tsv-secaped') wouldn't cut it.

INPUT as stated in C++ code:

{"name with space", "\"column\"", "'column'", "колонка", "\u30ab\u30e9\u30e0", "\x00\x01\x03 column \x10\x11\x12"},

writeEscapedString:

deduplicate_by_columns: [name with space,"column",\'column\',колонка,カラム,\0 column ]

writeCSV:

deduplicate_by_columns: ["name with space","""column""","'column'","колонка","カラム","\0 column "]

writeJSONString:

deduplicate_by_columns: ["name with space","\"column\"","'column'","колонка","カラム","\u0000\u0001\u0003 column \u0010\u0011\u0012"]

Please note how writeEscapedString and writeCSV eats all non-unicode binary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note how writeEscapedString and writeCSV eats all non-unicode binary.

No, they don't eat binary data. The bytes are written but they are lost in copy-paste.
Actually our JSON format is also binary safe, but it can be misleading to applications.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ clickhouse-client --query "SELECT '\x00\x01\x02\x03'"
\0
$ clickhouse-client --query "SELECT '\x00\x01\x02\x03'" | xxd
00000000: 5c30 0102 030a                           \0....

Copy link
Contributor Author

@Enmk Enmk Dec 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, my bad. Changed to CSV format, since that makes it easier to parse list of columns. Right now serialized list of columns from the unit tests like this:

deduplicate_by_columns: "name with space","""column""","'column'","колонка","カラム"," column "

However, I had to ditch using \0 in test data since it makes impossible to validate result with std::regex.

@alexey-milovidov alexey-milovidov self-assigned this Dec 17, 2020
@filimonov
Copy link
Contributor

filimonov commented Dec 17, 2020

Test name | Test status | Test time, sec.
-- | -- | 
test_materialize_mysql_database/test.py::test_select_without_columns_5_7[clickhouse_node1] | FAIL | 189.75
test_materialize_mysql_database/test.py::test_select_without_columns_8_0[clickhouse_node1

Broken in master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants