Clickhouse copier row count mismatch #60540

Closed
pranavmehta94 opened this issue Feb 29, 2024 · 0 comments · Fixed by #61058
Labels
potential bug To be reviewed by developers and confirmed/rejected.

pranavmehta94 commented Feb 29, 2024

We are trying to re-shard and re-balance an existing 2-shard/2-replica ClickHouse cluster by moving its data to a 3-shard/2-replica cluster with clickhouse-copier.
We used the following links as references for clickhouse-copier:
https://github.com/ClickHouse/ClickHouse/blob/master/docs/en/operations/utilities/clickhouse-copier.md
https://altinity.com/blog/2018-8-22-clickhouse-copier-in-practice
In our tests, after a successful copier run, the row count on the destination cluster does not match the row count on the source cluster.

The source cluster table details are as follows:
DDL queries

  1. Create database: create database database-name on cluster 'src-cluster'
  2. Create local replicated table on src cluster: CREATE TABLE database-name.local-table-name ON CLUSTER 'src-cluster' (id UInt32, flightName String, socurce String, destination String) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}//', '{replica}') ORDER BY id;
  3. Create Distributed table: CREATE TABLE database-name.dist-table-name ON CLUSTER 'src-cluster' AS database-name.local-table-name ENGINE = Distributed('src-cluster', 'database-name', 'local-table-name', id);

Data generation Insert query:
INSERT INTO database-name.dist-table-name SELECT number, randomPrintableASCII(randUniform(5, 25)), randomPrintableASCII(randUniform(5, 25)), randomPrintableASCII(randUniform(5, 25)) FROM numbers(2000000000)

row count on src cluster: 2005242245 (query: select count(*) from database-name.dist-table-name)
(Note: We ran the insert command twice: once with 2B rows, and a second time with ~5M rows.)

The destination cluster table details are as follows:
DDL queries

  1. Create database: create database database-name on cluster 'dest-cluster'
  2. Create local replicated table on dest cluster: these tables were created by copier
  3. Create Distributed table: CREATE TABLE database-name.dist-table-name ON CLUSTER 'dest-cluster' AS database-name.local-table-name ENGINE = Distributed('dest-cluster', 'database-name', 'local-table-name', id);
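
For context on why the same `id` can land on a different shard after the move, here is a minimal Python sketch of how a Distributed table with sharding key `id` picks a shard: the key is taken modulo the total shard weight, and the remainder selects a shard by cumulative weight range. The equal weights of 1 per shard are an assumption (the issue does not state the weights), and `shard_for` and `distribution` are hypothetical helper names, not ClickHouse functions.

```python
from collections import Counter


def shard_for(sharding_value: int, weights: list[int]) -> int:
    """Return the 0-based shard index for a sharding-key value.

    Mirrors the documented rule: take the value modulo the total weight,
    then pick the shard whose cumulative weight range contains the remainder.
    """
    total = sum(weights)
    remainder = sharding_value % total
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight
        if remainder < upper:
            return index
    raise ValueError("weights must be positive")


def distribution(ids, weights: list[int]) -> Counter:
    """Count how many of the given ids fall on each shard."""
    return Counter(shard_for(i, weights) for i in ids)


# Assumed equal weights: 2 shards on the source, 3 on the destination.
ids = range(12)
print("2 shards:", dict(distribution(ids, [1, 1])))
print("3 shards:", dict(distribution(ids, [1, 1, 1])))
```

Under this rule the total row count is independent of the number of shards, so re-sharding alone should not change the result of `count(*)` on the Distributed table.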

row count on dest cluster: 2319920618 (query: select count(*) from database-name.dist-table-name)

After running copier to move and re-balance the data onto the destination cluster, the Distributed table on the destination cluster has about 315M more rows than the Distributed table on the source cluster.
We do not know why the row counts differ between the source and destination clusters. Can somebody please explain the cause of this mismatch?
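
The size of the mismatch can be computed directly from the two `count(*)` results quoted above. This is only a small arithmetic sketch; `surplus` is a hypothetical helper name.

```python
# Row counts reported by select count(*) on each Distributed table.
SRC_ROWS = 2_005_242_245
DST_ROWS = 2_319_920_618


def surplus(src: int, dst: int) -> tuple[int, float]:
    """Return (extra rows on the destination, extra rows as a % of the source)."""
    extra = dst - src
    return extra, 100.0 * extra / src


extra, pct = surplus(SRC_ROWS, DST_ROWS)
print(f"{extra:,} extra rows on destination ({pct:.2f}% of source)")
```

The destination holds 314,678,373 more rows than the source.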

How to reproduce

  • ClickHouse image version: clickhouse/clickhouse-server:23.10
  • copier task.xml (attached in zip along with copier logs)
    CH copier.zip

Expected behavior

The row count returned by querying the Distributed table on the destination cluster should match the row count returned by the same query on the Distributed table in the source cluster.
