Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClickHouse Cloud Uploads: Appending duplicate CSV file doesn't append new values #237

Closed
calherries opened this issue May 20, 2024 · 4 comments · Fixed by #238
Closed

ClickHouse Cloud Uploads: Appending duplicate CSV file doesn't append new values #237

calherries opened this issue May 20, 2024 · 4 comments · Fixed by #238
Labels
bug Something isn't working

Comments

@calherries
Copy link
Contributor

calherries commented May 20, 2024

Bug with #236

If you append the same CSV file to a table that you have uploaded to the table previously, new rows will not be inserted into the table.

Steps to reproduce:

  1. Enable CSV uploads on a ClickHouse Cloud DB.
  2. Refresh the page.
  3. Upload a CSV file to a collection.
  4. Append the same CSV file to the existing model created in the previous step. Note the rows in the table is the same from step 3 even though the UI reports the append action as succeeding.
  5. Modify the data in the CSV file, and repeat step 4, appending the CSV file to the model. Note the data is now actually appended to the data in the table.
@calherries calherries added the bug Something isn't working label May 20, 2024
@slvrtrn
Copy link
Collaborator

slvrtrn commented May 20, 2024

I think that's how deduplication works.

If you create a table in CH Cloud like this:

CREATE TABLE foo
(
    `i` UInt32,
    `s` String
)
ENGINE = MergeTree
ORDER BY i;

And then insert the same thing two times:

INSERT INTO foo VALUES (1, 'q');
INSERT INTO foo VALUES (1, 'q');

The result will be:

SELECT * FROM foo FORMAT JSONEachRow;
{"i":1,"s":"q"}

I think this is related: https://clickhouse.com/blog/common-getting-started-issues-with-clickhouse#5-deduplication-at-insert-time

@calherries calherries changed the title ClickHouse Cloud Uploads: Appending the same CSV file doesn't append new values ClickHouse Cloud Uploads: Appending duplicate CSV file doesn't append new values May 20, 2024
@calherries
Copy link
Contributor Author

Got it. So to allow users to upload duplicate data, we'd have to create a unique column. We can do that with an auto-incrementing integer column or UUID column. I'm now thinking this is a good reason to use UUIDs for the auto-generated primary key column _mb_row_id for ClickHouse uploads. Previously I avoided that because I wanted to wait for auto-incrementing types to be supported by ClickHouse.

@slvrtrn
Copy link
Collaborator

slvrtrn commented May 20, 2024

Is uploading duplicates over a particular (small) time window a popular use case?

Additionally, you can, of course, try to disable https://clickhouse.com/docs/en/operations/settings/merge-tree-settings#replicated-deduplication-window.

@slvrtrn
Copy link
Collaborator

slvrtrn commented May 20, 2024

To enable duplicate inserts, if that is a popular use-case and is necessary, you could do the following during the table creation:

CREATE TABLE foo
(
    `i` UInt32,
    `s` String
)
ENGINE = MergeTree
ORDER BY i
SETTINGS replicated_deduplication_window=0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants