From 793be14807d095e91a5500602f0bd79573182951 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Philip=20Dub=C3=A9?=
Date: Thu, 4 Sep 2025 02:51:50 +0000
Subject: [PATCH 1/2] DB pipes: warn about NULLs when using custom partition key

---
 .../data-ingestion/clickpipes/mysql/parallel_initial_load.md | 1 +
 .../data-ingestion/clickpipes/postgres/parallel_initial_load.md | 1 +
 2 files changed, 2 insertions(+)

diff --git a/docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md b/docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md
index 27b77bd6ddc..13d5cbcedbc 100644
--- a/docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md
+++ b/docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md
@@ -48,3 +48,4 @@ You can run **SHOW processlist** in MySQL to see the parallel snapshot in action
 ### Limitations {#limitations-parallel-mysql-snapshot}
 - The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
 - When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.
+- The partition key column should not contain `NULL`s, as they will be skipped by the partitioning logic.
diff --git a/docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md b/docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md
index 983ff29b011..1a057032e80 100644
--- a/docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md
+++ b/docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md
@@ -45,3 +45,4 @@ You can analyze **pg_stat_activity** to see the parallel snapshot in action. The
 
 - The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
 - When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.
+- The partition key column should not contain `NULL`s, as they will be skipped by the partitioning logic.

From 740d9b8e3070584a3a7d670ae1941e7b116102a3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Philip=20Dub=C3=A9?=
Date: Thu, 4 Sep 2025 03:13:46 +0000
Subject: [PATCH 2/2] lint: present tense

---
 .../clickpipes/mysql/parallel_initial_load.md | 8 ++++----
 .../clickpipes/postgres/parallel_initial_load.md | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md b/docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md
index 13d5cbcedbc..bf93de349ef 100644
--- a/docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md
+++ b/docs/integrations/data-ingestion/clickpipes/mysql/parallel_initial_load.md
@@ -34,13 +34,13 @@ Let's talk about the below settings:
 Snapshot parameters
 
 #### Snapshot number of rows per partition {#numrows-mysql-snapshot}
-This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks will be processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.
+This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks are processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.
 
 #### Initial load parallelism {#parallelism-mysql-snapshot}
-This setting controls how many partitions will be processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel.
This can be increased to speed up the initial load, but it is recommended to keep it to a reasonable value depending on your source instance specs to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.
+This setting controls how many partitions are processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel. This can be increased to speed up the initial load, but it is recommended to keep it to a reasonable value depending on your source instance specs to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.
 
 #### Snapshot number of tables in parallel {#tables-parallel-mysql-snapshot}
-Not really related to parallel snapshot, but this setting controls how many tables will be processed in parallel during the initial load. The default value is 1. Note that is on top of the parallelism of the partitions, so if you have 4 partitions and 2 tables, the ClickPipe will read 8 partitions in parallel.
+Not really related to parallel snapshot, but this setting controls how many tables are processed in parallel during the initial load. The default value is 1. Note that is on top of the parallelism of the partitions, so if you have 4 partitions and 2 tables, the ClickPipe will read 8 partitions in parallel.
 
 ### Monitoring parallel snapshot in MySQL {#monitoring-parallel-mysql-snapshot}
 You can run **SHOW processlist** in MySQL to see the parallel snapshot in action. The ClickPipe will create multiple connections to the source database, each reading a different partition of the source table. If you see **SELECT** queries with different ranges, it means that the ClickPipe is reading the source tables.
You can also see the COUNT(*) and the partitioning query in here.
@@ -48,4 +48,4 @@ You can run **SHOW processlist** in MySQL to see the parallel snapshot in action
 ### Limitations {#limitations-parallel-mysql-snapshot}
 - The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
 - When adding tables to an existing ClickPipe, you cannot change the snapshot parameters. The ClickPipe will use the existing parameters for the new tables.
-- The partition key column should not contain `NULL`s, as they will be skipped by the partitioning logic.
+- The partition key column should not contain `NULL`s, as they are skipped by the partitioning logic.
diff --git a/docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md b/docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md
index 1a057032e80..1314213f81c 100644
--- a/docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md
+++ b/docs/integrations/data-ingestion/clickpipes/postgres/parallel_initial_load.md
@@ -27,15 +27,15 @@ Let's talk about the below settings:
 
 #### Snapshot number of rows per partition {#numrows-pg-snapshot}
 
-This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks will be processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.
+This setting controls how many rows constitute a partition. The ClickPipe will read the source table in chunks of this size, and chunks are processed in parallel based on the initial load parallelism set. The default value is 100,000 rows per partition.
 
 #### Initial load parallelism {#parallelism-pg-snapshot}
 
-This setting controls how many partitions will be processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel.
This can be increased to speed up the initial load, but it is recommended to keep it to a reasonable value depending on your source instance specs to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.
+This setting controls how many partitions are processed in parallel. The default value is 4, which means that the ClickPipe will read 4 partitions of the source table in parallel. This can be increased to speed up the initial load, but it is recommended to keep it to a reasonable value depending on your source instance specs to avoid overwhelming the source database. The ClickPipe will automatically adjust the number of partitions based on the size of the source table and the number of rows per partition.
 
 #### Snapshot number of tables in parallel {#tables-parallel-pg-snapshot}
 
-Not really related to parallel snapshot, but this setting controls how many tables will be processed in parallel during the initial load. The default value is 1. Note that is on top of the parallelism of the partitions, so if you have 4 partitions and 2 tables, the ClickPipe will read 8 partitions in parallel.
+Not really related to parallel snapshot, but this setting controls how many tables are processed in parallel during the initial load. The default value is 1. Note that is on top of the parallelism of the partitions, so if you have 4 partitions and 2 tables, the ClickPipe will read 8 partitions in parallel.
 
 ### Monitoring parallel snapshot in Postgres {#monitoring-parallel-pg-snapshot}
 
@@ -45,4 +45,4 @@ You can analyze **pg_stat_activity** to see the parallel snapshot in action. The
 
 - The snapshot parameters cannot be edited after pipe creation. If you want to change them, you will have to create a new ClickPipe.
 - When adding tables to an existing ClickPipe, you cannot change the snapshot parameters.
The ClickPipe will use the existing parameters for the new tables.
-- The partition key column should not contain `NULL`s, as they will be skipped by the partitioning logic.
+- The partition key column should not contain `NULL`s, as they are skipped by the partitioning logic.
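
Reviewer note: the limitation added in this series, and the parallelism arithmetic in the touched paragraphs, can be illustrated with a small sketch. This is only a hypothetical model, not the ClickPipes implementation: the function names (`plan_partitions`, `rows_covered`, `total_parallel_reads`) and the assumption that partitions are contiguous inclusive ranges over the sorted non-NULL key values are invented for illustration.

```python
from typing import Optional, Sequence


def plan_partitions(
    keys: Sequence[Optional[int]], rows_per_partition: int
) -> list[tuple[int, int]]:
    """Split the non-NULL key values into inclusive (min, max) ranges.

    Assumption (hypothetical model): range-based partitioning can only
    compare real values, so a row whose key is None (SQL NULL) lands in
    no range at all.
    """
    if rows_per_partition <= 0:
        raise ValueError("rows_per_partition must be positive")
    present = sorted(k for k in keys if k is not None)
    ranges = []
    for start in range(0, len(present), rows_per_partition):
        chunk = present[start:start + rows_per_partition]
        ranges.append((chunk[0], chunk[-1]))
    return ranges


def rows_covered(
    keys: Sequence[Optional[int]], ranges: Sequence[tuple[int, int]]
) -> int:
    """Count the rows whose key falls inside at least one planned range."""
    return sum(
        1
        for k in keys
        if k is not None and any(lo <= k <= hi for lo, hi in ranges)
    )


def total_parallel_reads(initial_load_parallelism: int, tables_in_parallel: int) -> int:
    """Table parallelism multiplies partition parallelism (4 x 2 = 8 in the docs)."""
    return initial_load_parallelism * tables_in_parallel


if __name__ == "__main__":
    keys = [1, 2, None, 3, 4, None, 5]
    ranges = plan_partitions(keys, rows_per_partition=2)
    print(ranges)
    print(rows_covered(keys, ranges), "of", len(keys), "rows covered")
    print(total_parallel_reads(4, 2), "concurrent partition reads")
```

Under this model the two rows with a `None` key are never assigned to a range, which is the behavior the new bullet warns about; checking the source column for `NULL`s before choosing it as the partition key avoids the surprise.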