-
Notifications
You must be signed in to change notification settings - Fork 6.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OPTIMIZE DEDUPLICATE BY COLUMNS #17846
OPTIMIZE DEDUPLICATE BY COLUMNS #17846
Conversation
80eed32
to
cafe449
Compare
Extended OPTIMIZE ... DEDUPLICATE syntax to allow explicit (or implicit with asterisk/column transformers) list of columns to check for duplicates on. Following syntax variants are now supported: OPTIMIZE TABLE table DEDUPLICATE; -- the old one OPTIMIZE TABLE table DEDUPLICATE BY *; OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX; OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY); OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3; OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex'); OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX; OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY); Note that * behaves just like in SELECT: MATERIALIZED, and ALIAS columns are not used for expansion. Also, it is an error to specify empty list of columns, or write an expression that results in an empty list of columns, or deduplicate by an ALIAS column. Column transformers other than EXCEPT are not supported.
cafe449
to
70ea507
Compare
…olumn transformers
* no more undefined values for attributes in ReplicatedMergeTreeLogEntry * validation of string serialization format
tests/queries/0_stateless/01581_deduplicate_by_columns_replicated.sql
Outdated
Show resolved
Hide resolved
Also a minor cleanup of the test code.
Also logging expanded list of columns passed from `DEDUPLICATE BY` to actual deduplication routines.
Updated test and minor cleanup
for (;;) | ||
{ | ||
String tmp_column_name; | ||
readJSONString(tmp_column_name, in); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite dangerous, because JSON does not support non-unicode data while we support arbitrary bytes for column names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will lead to idiosynchrasy as two different escaping methods will be used in the same file.
Let's reuse existing "tsv-escaped" method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I've done some research, and it looks like the method we use with TabSeparatedWithNames
(I believe this is what you meant by 'tsv-secaped') wouldn't cut it.
INPUT as stated in C++ code:
{"name with space", "\"column\"", "'column'", "колонка", "\u30ab\u30e9\u30e0", "\x00\x01\x03 column \x10\x11\x12"},
writeEscapedString
:
deduplicate_by_columns: [name with space,"column",\'column\',колонка,カラム,\0 column ]
writeCSV
:
deduplicate_by_columns: ["name with space","""column""","'column'","колонка","カラム","\0 column "]
writeJSONString
:
deduplicate_by_columns: ["name with space","\"column\"","'column'","колонка","カラム","\u0000\u0001\u0003 column \u0010\u0011\u0012"]
Please note how writeEscapedString
and writeCSV
eats all non-unicode binary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note how writeEscapedString and writeCSV eats all non-unicode binary.
No, they don't eat binary data. The bytes are written but they are lost in copy-paste.
Actually our JSON format is also binary safe, but it can be misleading to applications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ clickhouse-client --query "SELECT '\x00\x01\x02\x03'"
\0
$ clickhouse-client --query "SELECT '\x00\x01\x02\x03'" | xxd
00000000: 5c30 0102 030a \0....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, my bad. Changed to CSV format, since that makes it easier to parse list of columns. Right now serialized list of columns from the unit tests like this:
deduplicate_by_columns: "name with space","""column""","'column'","колонка","カラム"," column "
However, I had to ditch using \0
in test data since it makes impossible to validate result with std::regex
.
Broken in master |
…of JSON-like notation.
d31579e
to
e166aae
Compare
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Extended
OPTIMIZE ... DEDUPLICATE
syntax to allow explicit (or implicit with asterisk/column transformers) list of columns to check for duplicates on....
Detailed description / Documentation draft:
Following syntax variants are now supported:
Note that
*
behaves just like inSELECT
:MATERIALIZED
, andALIAS
columns are not used for expansion.Also, it is an error to specify empty list of columns, or write an expression that results in an empty list of columns, or deduplicate by an
ALIAS
column.Column transformers other than
EXCEPT
are not supported.Please see tests for examples.
...