Allow parallel replicas for JOIN with analyzer [part 2] #58916

KochetovNicolai · 2024-01-17T15:49:16Z

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support LEFT JOIN, ALL INNER JOIN, and simple subqueries for parallel replicas (only with analyzer). New setting parallel_replicas_prefer_local_join chooses local JOIN execution (by default) vs GLOBAL JOIN. All tables should exist on every replica from cluster_for_parallel_replicas. New settings min_external_table_block_size_rows and min_external_table_block_size_bytes are used to squash small blocks that are sent for temporary tables (only with analyzer).
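To illustrate what squashing small blocks means for min_external_table_block_size_rows and min_external_table_block_size_bytes, here is a minimal sketch. It is not the ClickHouse implementation: ToyBlock, ToySquasher and squashAll are hypothetical names. The rule that a merged block is released as soon as either non-zero threshold is reached, and that 0 disables a threshold, is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

/// Hypothetical stand-in for a block sent to a temporary table; only sizes matter here.
struct ToyBlock
{
    size_t rows = 0;
    size_t bytes = 0;
};

/// Accumulates small blocks and releases one merged block once either
/// threshold is reached. A threshold of 0 is treated as disabled (assumption).
class ToySquasher
{
public:
    ToySquasher(size_t min_rows_, size_t min_bytes_) : min_rows(min_rows_), min_bytes(min_bytes_) {}

    /// Returns a merged block when enough data has been accumulated.
    std::optional<ToyBlock> add(const ToyBlock & block)
    {
        pending.rows += block.rows;
        pending.bytes += block.bytes;
        if (!isEnough())
            return std::nullopt;
        return flushPending();
    }

    /// Returns whatever is left at the end of the stream, if anything.
    std::optional<ToyBlock> finish()
    {
        if (pending.rows == 0)
            return std::nullopt;
        return flushPending();
    }

private:
    bool isEnough() const
    {
        /// With both thresholds disabled every block is passed through as is.
        if (min_rows == 0 && min_bytes == 0)
            return true;
        return (min_rows != 0 && pending.rows >= min_rows)
            || (min_bytes != 0 && pending.bytes >= min_bytes);
    }

    ToyBlock flushPending()
    {
        ToyBlock out = pending;
        pending = ToyBlock{};
        return out;
    }

    size_t min_rows;
    size_t min_bytes;
    ToyBlock pending;
};

/// Feeds all input blocks through the squasher and collects the merged blocks.
std::vector<ToyBlock> squashAll(const std::vector<ToyBlock> & input, size_t min_rows, size_t min_bytes)
{
    ToySquasher squasher(min_rows, min_bytes);
    std::vector<ToyBlock> output;
    for (const auto & block : input)
        if (auto merged = squasher.add(block))
            output.push_back(*merged);
    if (auto rest = squasher.finish())
        output.push_back(*rest);
    return output;
}
```

With five input blocks of 10 rows each and a row threshold of 25, this sketch sends two blocks instead of five: one of 30 rows and a final one of 20 rows.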

robot-clickhouse-ci-1 · 2024-01-17T15:52:20Z

This is an automated comment for commit 03720d5 with description of existing statuses. It's updated for the latest CI running

Successful checks

Check name	Description	Status
Docs check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Mergeable Check	Checks if all other necessary checks are successful	✅ success
Style check	Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report	✅ success

Check name	Description	Status
CI running	A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR	⏳ pending

novikd

In general LGTM

novikd · 2024-02-13T15:39:36Z

src/Planner/Utils.cpp

+    auto ast = queryNodeToSelectQuery(query_node);
+    /// Remove CTEs information from distributed queries.
+    /// Now, if cte_name is set for subquery node, AST -> String serialization will only print cte name.
+    /// But CTE is defined only for top-level query part, so may not be sent.
+    /// Removing cte_name forces subquery to be always printed.
+    removeCTEs(ast);


Maybe it would be better to add a flag to ConvertToASTOptions and not add CTEs in QueryNode::toASTImpl at all?

I've tried, but it did not work for some reason.
Will do it later.
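The effect described in the diff comment above can be sketched with a toy serializer. This is only an illustration under assumptions: ToySubquery, ToyQuery, serialize and removeCteNames are hypothetical and do not mirror the real AST classes or removeCTEs. The point is that a subquery carrying a CTE name is printed as the name alone, so the text sent to another server is not self-contained until the name is cleared.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

/// Hypothetical subquery node: when cte_name is set, the serializer prints only the name.
struct ToySubquery
{
    std::string cte_name;    /// Empty when the subquery is not a CTE reference.
    std::string select_text; /// The subquery body.
};

/// Hypothetical query that reads from a list of subqueries.
struct ToyQuery
{
    std::vector<ToySubquery> sources;
};

/// Prints each source either as its CTE name or as a parenthesized body.
std::string serialize(const ToyQuery & query)
{
    std::string out = "SELECT * FROM ";
    for (size_t i = 0; i < query.sources.size(); ++i)
    {
        if (i != 0)
            out += ", ";
        const ToySubquery & source = query.sources[i];
        out += source.cte_name.empty() ? "(" + source.select_text + ")" : source.cte_name;
    }
    return out;
}

/// Clears CTE names so that every subquery body is printed in full.
void removeCteNames(ToyQuery & query)
{
    for (ToySubquery & source : query.sources)
        source.cte_name.clear();
}
```

In this sketch, a source with cte_name "t" serializes as `SELECT * FROM t` before the names are cleared and as `SELECT * FROM (SELECT 1)` afterwards, which is the form that stays valid without the top-level WITH clause.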

novikd · 2024-02-13T15:40:58Z

src/Planner/PlannerContext.h

 class GlobalPlannerContext
 {
 public:
-    GlobalPlannerContext() = default;
+    explicit GlobalPlannerContext(const QueryNode * parallel_replicas_node_, const TableNode * parallel_replicas_table_)


Please, add a comment here.

There is a comment below ... Should I add another one?

novikd · 2024-02-13T15:43:06Z

src/Planner/PlannerContext.h

+    /// The query which will be executed with parallel replicas.
+    /// In case if only the most inner subquery can be executed with parallel replicas, node is nullptr.
+    const QueryNode * const parallel_replicas_node = nullptr;
+    /// Table which is used with parallel replicas reading. Now, only one table is supported by the protocol.
+    /// It is the left-most table of the query (in JOINs, UNIONs and subqueries).
+    const TableNode * const parallel_replicas_table = nullptr;


Is it really necessary in global context?

Discussed it. Could not find a better place so far.

novikd · 2024-02-13T15:56:38Z

src/Storages/buildQueryTreeForShard.cpp

+        CollectStoragesVisitor collect_storages;
+        collect_storages.visit(node);


This is not optimal and leads to O(N^2) traversal. It's better to collect this info in enterImpl and do the check in leaveImpl. We can leave it as is, but it's better to rewrite it in a follow-up PR.

I don't see why.
We traverse only the left table expression in RewriteJoinToGlobalJoinVisitor, but check allStoragesAreMergeTree only for the right table expressions.
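For reference, the "collect on enter, check on leave" idea from the review comment can be sketched as a single post-order pass over a toy tree. It does not settle whether the code in this PR is actually quadratic. ToyNode, SinglePassCollector, makeTable and makeJoin are hypothetical names, and the real visitor interfaces are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <utility>
#include <vector>

/// Hypothetical query tree node: either a table or an inner node with children.
struct ToyNode
{
    bool is_table = false;
    bool is_merge_tree = false;
    std::vector<std::unique_ptr<ToyNode>> children;
};

std::unique_ptr<ToyNode> makeTable(bool is_merge_tree)
{
    auto node = std::make_unique<ToyNode>();
    node->is_table = true;
    node->is_merge_tree = is_merge_tree;
    return node;
}

std::unique_ptr<ToyNode> makeJoin(std::unique_ptr<ToyNode> left, std::unique_ptr<ToyNode> right)
{
    auto node = std::make_unique<ToyNode>();
    node->children.push_back(std::move(left));
    node->children.push_back(std::move(right));
    return node;
}

/// Visits every node exactly once and records, per node, whether all tables
/// in its subtree are MergeTree. A later check is then a map lookup instead
/// of a fresh traversal of the subtree.
struct SinglePassCollector
{
    size_t visits = 0;
    std::unordered_map<const ToyNode *, bool> all_merge_tree;

    bool visit(const ToyNode & node)
    {
        ++visits; /// "enter" step
        bool result = !node.is_table || node.is_merge_tree;
        for (const auto & child : node.children)
            result = visit(*child) && result; /// always descend so every child is recorded
        all_merge_tree[&node] = result; /// "leave" step
        return result;
    }
};
```

For a tree of five nodes the collector performs five visits in total, regardless of how many nodes later ask whether their subtree contains only MergeTree tables.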

novikd · 2024-02-13T15:58:54Z

src/Planner/findParallelReplicasQuery.h

+
+/// Find a query which can be executed with parallel replicas up to WithMergableStage.
+/// Returned query will always contain some (>1) subqueries, possibly with joins.
+const QueryNode * findParallelReplicasQuery(const QueryTreeNodePtr & query_tree_node, SelectQueryOptions & select_query_options);


Maybe better to call it findQueryForParallelReplicas?

novikd · 2024-02-13T16:04:11Z

src/Planner/findParallelReplicasQuery.cpp

+    return res;
+}
+
+static const TableNode * findTableForParallelReplicas(const IQueryTreeNode * query_tree_node)


I'd prefer to have a non recursive implementation.

Agree.
However, it would make more sense if we rewrote all the visitors to a non-recursive implementation :)
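As a rough picture of what a non-recursive variant could look like, the sketch below walks down to the left-most table with a loop instead of recursion. It is a simplification under assumptions: ToyTreeNode, Kind, makeTreeNode and findLeftMostTable are hypothetical, and the real findTableForParallelReplicas performs checks that are not modeled here.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

/// Hypothetical node kinds; the first child is taken to be the left-most source.
enum class Kind { Table, Query, Join, Union };

struct ToyTreeNode
{
    Kind kind = Kind::Query;
    std::string name; /// Only meaningful for tables.
    std::vector<std::unique_ptr<ToyTreeNode>> children;
};

std::unique_ptr<ToyTreeNode> makeTreeNode(Kind kind, std::string name = {})
{
    auto node = std::make_unique<ToyTreeNode>();
    node->kind = kind;
    node->name = std::move(name);
    return node;
}

/// Iteratively follows the first child of queries, joins and unions until a
/// table is reached. Returns nullptr if the chain ends without a table.
const ToyTreeNode * findLeftMostTable(const ToyTreeNode * node)
{
    while (node != nullptr)
    {
        if (node->kind == Kind::Table)
            return node;
        if (node->children.empty())
            return nullptr;
        node = node->children.front().get();
    }
    return nullptr;
}
```

Because only one child is followed at each step, the loop needs no explicit stack. A visitor that has to inspect every child would need one.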

nikitamikhaylov · 2024-02-13T17:24:43Z

@KochetovNicolai Please add new settings to the history

┌─name────────────────────────────────┐
│ min_external_table_block_size_rows  │
│ min_external_table_block_size_bytes │
│ parallel_replicas_prefer_local_join │
└─────────────────────────────────────┘

KochetovNicolai · 2024-02-13T17:33:23Z

@nikitamikhaylov I've added it a long time ago. Looks like the check is broken somehow.

KochetovNicolai · 2024-02-14T12:30:32Z

Build is green. Merging.

robot-clickhouse-ci-1 added the pr-improvement Pull request with some product improvements label Jan 17, 2024

KochetovNicolai changed the title ~~Allow parallel replicas for join with analyzer [part 2]~~ Allow parallel replicas for JOIN with analyzer [part 2] Jan 17, 2024

novikd self-assigned this Jan 18, 2024

KochetovNicolai force-pushed the allow-parallel-replicas-for-join-with-analyzer-2 branch from a19db02 to ec0fce3 Compare January 18, 2024 13:20

KochetovNicolai force-pushed the allow-parallel-replicas-for-join-with-analyzer-2 branch from ec0fce3 to 6bf28c8 Compare January 29, 2024 14:11

KochetovNicolai added 10 commits February 5, 2024 17:05

Allow to send a chain of subqueries for parallel replicas with analyzer.

b60228a

Prohibit any inner join.

7f2a5d3

Squash temporary tables.

1892318

Add settings to squash external table blocks.

8692e8f

Support non global in mode.

15bf263

Fixing fasttest.

ec74571

Update test.

9c6538b

Add a test.

8a933e9

Update test

e5a8e36

Fixing tests.

29780b1

KochetovNicolai force-pushed the allow-parallel-replicas-for-join-with-analyzer-2 branch from cba10d7 to 29780b1 Compare February 5, 2024 17:07

KochetovNicolai added 3 commits February 5, 2024 17:10

Fixing merge

6563d0b

Fix more tests.

6b06fcf

Remove commented code. Add more comments.

29908dd

KochetovNicolai marked this pull request as ready for review February 6, 2024 15:59

KochetovNicolai added 3 commits February 7, 2024 12:25

Merge branch 'master' into allow-parallel-replicas-for-join-with-anal…

c434748

…yzer-2

Merge branch 'master' into allow-parallel-replicas-for-join-with-anal…

01d0ca3

…yzer-2

Update SettingsChangesHistory.h

a547116

nikitamikhaylov added the pr-must-backport-cloud label Feb 8, 2024

KochetovNicolai added 2 commits February 13, 2024 11:11

Merge branch 'master' into allow-parallel-replicas-for-join-with-anal…

666b3d6

…yzer-2

Fixing test.

5daab7a

novikd reviewed Feb 13, 2024

Review fixes.

c2ad769

KochetovNicolai added 3 commits February 13, 2024 17:37

Trying to fix settings change

72bcadb

Fixing build.

d697c12

Merge branch 'master' into allow-parallel-replicas-for-join-with-anal…

03720d5

…yzer-2

KochetovNicolai merged commit ebf47dd into master Feb 14, 2024
16 of 37 checks passed

KochetovNicolai deleted the allow-parallel-replicas-for-join-with-analyzer-2 branch February 14, 2024 12:30

robot-ch-test-poll2 added the pr-backports-created-cloud label Feb 14, 2024

robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 15, 2024
