OPTIMIZE DEDUPLICATE BY COLUMNS #17846

Enmk · 2020-12-06T19:29:54Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Extended the OPTIMIZE ... DEDUPLICATE syntax to accept an explicit list of columns (or an implicit one, via asterisk or column transformers) to check for duplicates on.
...

Detailed description / Documentation draft:
The following syntax variants are now supported:

OPTIMIZE TABLE table DEDUPLICATE; -- the old one
OPTIMIZE TABLE table DEDUPLICATE BY *; -- not the same as the old one, excludes MATERIALIZED columns (see the note below)
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY * EXCEPT (colX, colY);
OPTIMIZE TABLE table DEDUPLICATE BY col1,col2,col3;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex');
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT colX;
OPTIMIZE TABLE table DEDUPLICATE BY COLUMNS('column-matched-by-regex') EXCEPT (colX, colY);

Note that * behaves just like in SELECT: MATERIALIZED and ALIAS columns are not used for expansion.
Also, it is an error to specify an empty list of columns, to write an expression that results in an empty list of columns, or to deduplicate by an ALIAS column.
Column transformers other than EXCEPT are not supported.
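
The expansion rules above can be modelled roughly as follows. This is only an illustrative sketch in Python, not the ClickHouse implementation: `Column`, `expand_deduplicate_by` and the `kind` field are hypothetical names, and treating `COLUMNS('regex')` as a partial match over ordinary columns is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Column:
    name: str
    kind: str = "ordinary"  # "ordinary", "materialized" or "alias"


def expand_deduplicate_by(table, selector, excluded=()) -> List[str]:
    """Model of DEDUPLICATE BY <selector> [EXCEPT <excluded>].

    selector is "*", a compiled regex (standing in for COLUMNS('...')),
    or an explicit list of column names.
    """
    by_name = {c.name: c for c in table}

    if isinstance(selector, str) and selector == "*":
        # Like SELECT *: MATERIALIZED and ALIAS columns are skipped.
        names = [c.name for c in table if c.kind == "ordinary"]
    elif isinstance(selector, re.Pattern):
        # Assumption: the regex is matched against the same set as "*".
        names = [c.name for c in table
                 if c.kind == "ordinary" and selector.search(c.name)]
    else:
        names = list(selector)
        for name in names:
            column = by_name.get(name)
            if column is None:
                raise ValueError(f"unknown column: {name}")
            if column.kind == "alias":
                raise ValueError(f"cannot deduplicate by ALIAS column: {name}")

    # EXCEPT is the only supported column transformer.
    dropped = set(excluded)
    names = [n for n in names if n not in dropped]

    if not names:
        raise ValueError("DEDUPLICATE BY resolved to an empty list of columns")
    return names
```

With a table of `id`, `val`, a MATERIALIZED `mat` and an ALIAS `al`, the selector `"*"` yields `id, val`, while an explicit list may still name `mat`.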

Please see tests for examples.
...


alexey-milovidov · 2020-12-17T00:45:52Z

src/Storages/MergeTree/ReplicatedMergeTreeLogEntry.cpp

+                    for (;;)
+                    {
+                        String tmp_column_name;
+                        readJSONString(tmp_column_name, in);


This is quite dangerous, because JSON does not support non-Unicode data while we support arbitrary bytes for column names.

This will lead to an idiosyncrasy, as two different escaping methods will be used in the same file.
Let's reuse the existing "tsv-escaped" method.

Well, I've done some research, and it looks like the method we use with TabSeparatedWithNames (I believe this is what you meant by 'tsv-escaped') wouldn't cut it.

INPUT as stated in C++ code:

{"name with space", "\"column\"", "'column'", "колонка", "\u30ab\u30e9\u30e0", "\x00\x01\x03 column \x10\x11\x12"},

writeEscapedString:

deduplicate_by_columns: [name with space,"column",\'column\',колонка,カラム,\0 column ]

writeCSV:

deduplicate_by_columns: ["name with space","""column""","'column'","колонка","カラム","\0 column "]

writeJSONString:

deduplicate_by_columns: ["name with space","\"column\"","'column'","колонка","カラム","\u0000\u0001\u0003 column \u0010\u0011\u0012"]

Please note how writeEscapedString and writeCSV eat all non-Unicode binary data.

> Please note how writeEscapedString and writeCSV eat all non-Unicode binary data.

No, they don't eat binary data. The bytes are written but they are lost in copy-paste.
Actually our JSON format is also binary safe, but it can be misleading to applications.

$ clickhouse-client --query "SELECT '\x00\x01\x02\x03'"
\0
$ clickhouse-client --query "SELECT '\x00\x01\x02\x03'" | xxd
00000000: 5c30 0102 030a                           \0....
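
The same point can be checked without a server. The sketch below takes the six bytes from the xxd dump as given and shows that only the two-character escape `\0` is printable, which is presumably why the remaining bytes disappear in copy-paste. The variable names are illustrative only.

```python
# Bytes exactly as they appear in the xxd dump: the NUL byte is written as the
# two characters "\0", bytes 0x01..0x03 are written raw, followed by a newline.
raw = b"\\0\x01\x02\x03\n"

# Hex view, grouped in pairs of bytes like xxd.
print(raw.hex(" ", 2))

# What survives if only printable ASCII is kept, as in a typical copy-paste.
visible = "".join(chr(b) for b in raw if 32 <= b < 127)
print(visible)
```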

Agree, my bad. Changed to CSV format, since that makes it easier to parse the list of columns. Right now the serialized list of columns from the unit tests looks like this:

deduplicate_by_columns: "name with space","""column""","'column'","колонка","カラム"," column "

However, I had to ditch using \0 in the test data, since it makes it impossible to validate the result with std::regex.
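
For illustration, the CSV-style quoting shown above (every name wrapped in double quotes, embedded double quotes doubled) can be reproduced and parsed back with Python's standard `csv` module. This is only a sketch: Python's `csv` stands in for ClickHouse's writeCSV/readCSV here and is not guaranteed to match them byte for byte, the helper names are hypothetical, and a later commit in this PR switched to single quotes around column names.

```python
import csv
import io
from typing import List


def serialize_columns(names: List[str]) -> str:
    """Quote every name and double any embedded double quotes."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="")
    writer.writerow(names)
    return buf.getvalue()


def parse_columns(line: str) -> List[str]:
    """Inverse of serialize_columns for a single serialized line."""
    return next(csv.reader([line]))


names = ["name with space", '"column"', "'column'", "колонка", "カラム"]
line = serialize_columns(names)
print("deduplicate_by_columns: " + line)
```

Because the quoting is unambiguous, `parse_columns(line)` returns the original list even for names that contain quotes or spaces.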

filimonov · 2020-12-17T06:04:59Z

Test name | Test status | Test time, sec.
-- | -- | --
test_materialize_mysql_database/test.py::test_select_without_columns_5_7[clickhouse_node1] | FAIL | 189.75
test_materialize_mysql_database/test.py::test_select_without_columns_8_0[clickhouse_node1

Broken in master

robot-clickhouse added the doc-alert and pr-feature labels Dec 6, 2020

Enmk changed the title ~~Optimize deduplicate~~ OPTIMIZE DEDUPLICATE BY COLUMNS Dec 6, 2020

Enmk force-pushed the Optimize_deduplicate branch 2 times, most recently from 80eed32 to cafe449, December 7, 2020 06:31

Enmk force-pushed the Optimize_deduplicate branch from cafe449 to 70ea507, December 7, 2020 06:44

filimonov added the altinity label Dec 7, 2020

Enmk added 4 commits December 7, 2020 13:18

Fixed parsing invalid cases: prohibit empty lists and APPLY/REPLACE column transformers

957bbfc

Fixed and refined unit test

dbdc018

* no more undefined values for attributes in ReplicatedMergeTreeLogEntry
* validation of string serialization format

Updated tests

f01a566

Minor: cleanup

168155e

filimonov reviewed Dec 7, 2020

tests/queries/0_stateless/01581_deduplicate_by_columns_replicated.sql (outdated, resolved)

Fixed building tests with GCC-10

8c5daf0

Also a minor cleanup of the test code.

filimonov reviewed Dec 8, 2020

src/Interpreters/InterpreterOptimizeQuery.cpp (outdated, resolved)

Enmk added 2 commits December 8, 2020 19:44

Fixed test to be less flaky

59fc301

Also logging expanded list of columns passed from `DEDUPLICATE BY` to actual deduplication routines.

Enforcing all sorting keys to be present in DEDUPLICATE BY columns

a2f85a0

Updated test and minor cleanup

filimonov mentioned this pull request Dec 14, 2020

21.1 release checklist #17951

Closed

filimonov reviewed Dec 15, 2020

src/Interpreters/InterpreterOptimizeQuery.cpp (resolved)

Enmk added 2 commits December 15, 2020 13:41

Checking that columns from PARTITION BY are present in DEDUPLICATE BY

bf8c7cd

Fixed compilation

90041ba

filimonov approved these changes Dec 16, 2020

alexey-milovidov requested changes Dec 17, 2020

alexey-milovidov self-assigned this Dec 17, 2020

Using CSV-like strings for list of columns to deduplicate by instead of JSON-like notation.

e166aae

Enmk force-pushed the Optimize_deduplicate branch from d31579e to e166aae, December 18, 2020 11:47

Single quotes around column names

e5ec81f

alexey-milovidov approved these changes Dec 20, 2020

Merge branch 'master' into Enmk-Optimize_deduplicate

9be5fa9

alexey-milovidov merged commit 6eba458 into ClickHouse:master Dec 20, 2020

alexey-milovidov mentioned this pull request Dec 21, 2020

Fixes in ODBC dictionary reload and ODBC bridge reachability #18278

Merged

atereh mentioned this pull request May 13, 2021

DOCSUP-5919: Update OPTIMIZE description in docs #24084

Merged
