[Improve][connector-clickhouse] Clickhouse support parallelism reading schema #9446
Conversation
Pull Request Overview
This PR adds support for parallel schema reading in the ClickHouse connector by leveraging the table part files from the system.parts table. Key changes include:
- New configuration options (e.g., partition_list, filter_query, batch_size) and test cases to support parallel reading.
- Updates to the core proxy, splitter, enumerator, source reader, and associated state management for splitting and reading parts concurrently.
- Documentation updates explaining the new parallel reader features.
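The part-to-split idea summarized above can be illustrated with a minimal sketch. This is not the PR's actual code; the class name and the exact query shape are assumptions based on the summary (the PR states the part list comes from `system.parts`):

```java
public class PartListQuerySketch {
    // Each active data part of a MergeTree table becomes one candidate read
    // split for a parallel source reader; the part names are obtained from
    // the system.parts system table.
    public static String buildPartListQuery(String database, String table) {
        return String.format(
                "SELECT name FROM system.parts WHERE database = '%s' AND table = '%s' AND active = 1",
                database, table);
    }
}
```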
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
seatunnel-e2e/connector-clickhouse-e2e/.../clickhouse_with_parallelism_read.conf | Added test configuration for parallel read demonstration |
ClickhouseIT.java | Added new test methods and constants to verify parallel reading functionality |
TablePartSplitterTest.java | Introduced tests for generating splits, including duplicate parts handling |
ClickhouseValueReaderTest.java | Added tests to validate various batch reading scenarios |
ClickhouseProxy.java | Implemented methods to retrieve part lists and query data per part |
ClickhouseSourceState.java | Updated state object to include pending splits |
TablePartSplitter.java | Created new splitting logic for ClickHouse parts |
ClickhouseSourceSplitEnumerator.java | Added new split enumerator to support parallel splits assignment |
ClickhouseSourceSplit.java | Defined a split abstraction based on ClickHouse parts |
ClickhouseValueReader.java | Modified value reader to iteratively process splits and update part offsets |
ClickhouseSourceTable.java | Updated source table configuration to include new options |
ClickhouseSourceReader.java | Refactored source reader to integrate parallelism mode with split queue management |
ClickhouseSourceFactory.java | Enhanced factory to build source tables and incorporate new parallelism parameters |
ClickhouseSource.java | Updated the connector interface to support parallel reading with new enumerator and reader |
ClickhousePart.java | Introduced Comparable interface implementation (stubbed in current diff) |
ClickhouseTable.java | Added getter for local database name |
ClickhouseConnectorErrorCode.java | Added new error codes for part retrieval and query issues |
ClickhouseSourceOptions.java | Defined new options: part_size, partition_list, batch_size, and filter_query |
ClickhouseBaseOptions.java | Added table option to support table name configuration |
docs/en/connector-v2/source/Clickhouse.md | Updated documentation with instructions and tips for parallel reading |
Comments suppressed due to low confidence (1)
seatunnel-connectors-v2/connector-clickhouse/src/main/java/org/apache/seatunnel/connectors/seatunnel/clickhouse/source/ClickhousePart.java:77
- The compareTo method always returns 0, which effectively treats all instances as equal. Consider implementing a proper comparison (for example, based on the part name) or removing Comparable if natural ordering is not intended.
public int compareTo(ClickhousePart o) { return 0; }
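A fix along the lines Copilot suggests (comparing by part name) might look like the following sketch. The field name is an assumption, not the PR's actual `ClickhousePart` definition:

```java
public class PartByName implements Comparable<PartByName> {
    private final String name;

    public PartByName(String name) {
        this.name = name;
    }

    @Override
    public int compareTo(PartByName o) {
        // Order parts by name instead of always returning 0, so natural
        // ordering is meaningful (e.g. for sorted collections of splits).
        return this.name.compareTo(o.name);
    }
}
```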
"select name from system.parts where database = '%s' and table = '%s'",
database, table);

if (partitionList != null && !partitionList.isEmpty()) {
The SQL query in getPartList is built by directly concatenating the partition list values. Consider using a parameterized query or properly escaping input values to mitigate the risk of SQL injection.
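One way to address this, sketched below under the assumption that the partition filter ends up in an `IN (...)` clause, is to emit JDBC placeholders and bind the partition values as parameters instead of concatenating them:

```java
import java.util.Collections;

public class SafePartQuerySketch {
    // Builds the part-list query with JDBC placeholders; the database,
    // table, and partition values are then bound via PreparedStatement
    // rather than concatenated into the SQL string.
    public static String buildPartListQuery(int partitionCount) {
        StringBuilder sql = new StringBuilder(
                "SELECT name FROM system.parts WHERE database = ? AND table = ?");
        if (partitionCount > 0) {
            String placeholders =
                    String.join(", ", Collections.nCopies(partitionCount, "?"));
            sql.append(" AND partition IN (").append(placeholders).append(")");
        }
        return sql.toString();
    }
}
```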
| username | String | Yes | - | `ClickHouse` user username. |
| password | String | Yes | - | `ClickHouse` user password. |
| database | String | No | - | The `ClickHouse` database. |
| table | String | No | - | The `ClickHouse` table. If it is a distributed table, the cluster is obtained based on the table engine. If it is a local table, the cluster is built based on the input `host`. |
Suggested change:
- | table | String | NO | - | The `ClickHouse` table. If it is a distributed table, the cluster is obtained based on the table engine. If it is a local table, build the cluster based on the input `host` |
+ | table_path | String | NO | - | The `ClickHouse` table. If it is a distributed table, the cluster is obtained based on the table engine. If it is a local table, build the cluster based on the input `host` |
Same as JDBC
If the `table_path` parameter is used instead, should the `database` parameter also be removed? Is it uniformly represented by the `table_path` parameter?
Yes.
@@ -211,6 +213,17 @@ public void testClickHouseWithMultiTableSink(TestContainer container) throws Exc
        }
    }

    @TestTemplate
    public void testClickhouseWithParallelismRead(TestContainer testContainer)
Could you add test cases to verify that `filter_query` and `partition_list` work properly?
Ok. I will add more test cases.
String sql =
        String.format(
                "select * from %s.%s where %s limit %d, %d",
Is this implemented this way because, for a single part, `limit m, n` can guarantee the order?
This implementation is designed to read parts in batches, to avoid pulling large amounts of data when reading in parallel. Each `ClickhousePart` object has an `offset` attribute that records how much of the current part has already been read, thereby preserving the order of batch reads.
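The batched per-part read described here can be sketched as follows. The names are assumptions, not the PR's code; `_part` is ClickHouse's virtual column identifying the data part a row belongs to:

```java
public class BatchQuerySketch {
    // The reader keeps an offset per part and advances it by batchSize
    // after each fetch, issuing "LIMIT offset, batchSize" queries scoped
    // to a single part via the _part virtual column.
    public static String nextBatchQuery(
            String database, String table, String partName, long offset, int batchSize) {
        return String.format(
                "SELECT * FROM %s.%s WHERE _part = '%s' LIMIT %d, %d",
                database, table, partName, offset, batchSize);
    }
}
```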
After reading the ClickHouse documentation, I found that ClickHouse supports a `LIMIT ... WITH TIES` clause, which can ensure that rows with the same value in the `ORDER BY` field are queried in the same batch. Meanwhile, the table's `ORDER BY` field is used to define the sorting key when querying a part. Can this solution solve the problem?
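As a sketch of what such a query might look like (names are assumptions): `WITH TIES` requires an `ORDER BY` and additionally returns any rows that tie with the last row on the sort key, so equal sort-key values are never split across batches.

```java
public class WithTiesQuerySketch {
    // Batch query using LIMIT ... WITH TIES so rows sharing the last
    // sort-key value are all returned in the same batch.
    public static String batchQueryWithTies(
            String table, String sortKey, long offset, int batchSize) {
        return String.format(
                "SELECT * FROM %s ORDER BY %s LIMIT %d, %d WITH TIES",
                table, sortKey, offset, batchSize);
    }
}
```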
I think it's good.
…er, add sql parallelism read strategy and fix other problem.
I have made the following updates:
Thanks for helping with the review!
@@ -40,6 +40,13 @@ public class ClickhouseBaseOptions {
                    .noDefaultValue()
                    .withDescription("Clickhouse database name");

    /** Clickhouse database name */
    public static final Option<String> TABLE =
            Options.key("table")
Suggested change:
- Options.key("table")
+ Options.key("table_path")
done
…aseOptions, fix ClickhouseValueReader bug and add unit tests
public String getTableIdentifier() {
    if (StringUtils.isEmpty(tablePath)) {
        // Extract table identifier from SQL
        return ClickhouseUtil.extractTablePathFromSql(sql);
    }

    return tablePath;
}
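For illustration only, a naive version of what `extractTablePathFromSql` might do is shown below (this is a hypothetical sketch, not the PR's implementation): take the first identifier after `FROM`, which only works for simple single-table statements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TablePathSketch {
    private static final Pattern FROM_TABLE =
            Pattern.compile("(?i)\\bfrom\\s+([\\w.]+)");

    // Returns the first "db.table" (or "table") identifier after FROM;
    // joins and subqueries would need real SQL parsing.
    public static String extractTablePathFromSql(String sql) {
        Matcher m = FROM_TABLE.matcher(sql);
        return m.find() ? m.group(1) : null;
    }
}
```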
If the SQL contains a join statement, how should it be handled?
If SQL in complex scenarios is considered, there may not be a suitable distributed parallel execution solution at present, since there are many possible cases. One solution I thought of is to select one of the shards and execute the SQL there with single concurrency. Is this feasible?
> select one of the shards and execute SQL directly in single concurrency.

I agree with this.
> If sql in complex scenarios is considered, since there may be multiple situations, there may not be a suitable distributed parallel execution solution at present. One solution I thought of is to select one of the shards and execute SQL directly in single concurrency. Is this feasible?

Will this cause an error when sorting?
Should DynamicChunkSplitter be used to address complex SQL scenarios?
The first question: for complex SQL scenarios, the user's SQL will only be executed on a single shard, with no additional ORDER BY or LIMIT operations applied, just like a query executed directly on the ClickHouse server.
The second question: I thought about it. There could be multiple possibilities in complex SQL scenarios, such as users using GROUP BY, or global joins across multiple tables, etc. Splitting SQL in these scenarios might be rather complex, and no suitable solution comes to mind for the time being.
Perhaps we can first execute such complex SQL with the single-concurrency solution described above, and then continue to optimize it later?
For aggregation scenarios, are there any potential issues? For example, in a "TOP N" scenario where such SQL can only be executed on a distributed table?
Yes, for these scenarios (such as GROUP BY or JOIN), the SQL input by the user needs to target a distributed table and is executed with single concurrency on a single shard. However, if the user specifies a local table, it will only be executed on the node where the local table is located.
In my current concurrent SQL execution scheme, it is best if the user's SQL contains only a single table and WHERE filter conditions. SQL in other complex scenarios will be executed with the single-concurrency solution described above.
Is this feasible?
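The fallback rule discussed above might be sketched like this. It is an entirely hypothetical heuristic, not the PR's code: only plain single-table `SELECT ... [WHERE ...]` statements take the part-based parallel path, and anything with joins, GROUP BY, or UNION runs once on a single shard.

```java
public class SqlModeSketch {
    // Crude classifier: returns true only for simple single-table
    // SELECT statements with an optional WHERE clause.
    public static boolean supportsParallelRead(String sql) {
        String normalized = sql.toLowerCase();
        if (normalized.contains(" join ")
                || normalized.contains("group by")
                || normalized.contains("union")) {
            return false;
        }
        return normalized.matches(
                "(?s)^\\s*select\\s+.+?\\s+from\\s+[\\w.]+(\\s+where\\s+.+)?\\s*$");
    }
}
```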
Ok. Thank you for your answer. I think your plan is very good
Purpose of this pull request
Add a parallel reading scheme to the ClickHouse source connector.
related pr #9421
The ClickHouse source connector supports parallel reading of data. For query-table mode, parallel reading is implemented based on the part files of the table, which are obtained from the system.parts table.
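Based on the options named in this PR (`partition_list`, `filter_query`, `batch_size`), a source configuration might look like the following sketch. The host, table, and values are hypothetical, and the exact option names and types may differ in the merged version:

```hocon
source {
  Clickhouse {
    host = "clickhouse-host:8123"
    database = "default"
    table = "t_orders"                       # hypothetical table name
    username = "default"
    password = ""
    partition_list = ["2024-01", "2024-02"]  # read only these partitions
    filter_query = "status = 'PAID'"         # extra filter condition
    batch_size = 1024                        # rows fetched per batch
  }
}
```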
The `partition_list` and `filter_query` parameters are used to filter data. The `batch_size` parameter is used to control the amount of data read each time, to avoid OOM exceptions.
Does this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide