

How to guarantee data deduplication in ClickHouse? #1178

lttPo opened this issue Aug 31, 2017 · 3 comments


commented Aug 31, 2017

At first, I thought ReplacingMergeTree could do this. After trying several times (inserting a set of data from a file with version 1, then inserting the same data set with version 2), I found that this method can't achieve data deduplication, even if I create a materialized view selecting with the FINAL keyword, or group by max(ver).

I also read in the documentation that deduplication is not guaranteed by ReplacingMergeTree.

So how do I guarantee data deduplication in cases where I have to insert the same set of data many times (e.g., when loading data from a file into a table is interrupted by unexpected exceptions)?
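For reference, a minimal sketch of the scenario described above (table and column names are hypothetical). ReplacingMergeTree collapses rows with the same sorting key only during background merges, which run at an unspecified time, so duplicates can remain visible until then:

```sql
-- ReplacingMergeTree keeps at most one row per sorting key,
-- but only collapses duplicates during background merges.
CREATE TABLE events
(
    id UInt64,
    payload String,
    ver UInt32
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY id;

-- FINAL forces merge-time collapsing at read time
-- (deduplicated result, but the query is slower):
SELECT * FROM events FINAL;
```

This is why the engine alone cannot *guarantee* deduplication: a plain SELECT between merges can still return both versions of a row.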


commented Aug 31, 2017

If you use Replicated tables, they will deduplicate inserted blocks of data: if an exactly identical block of data (same rows in the same order) is inserted twice, it will effectively be inserted once.

(See the documentation, after "Blocks of data are deduplicated.")

Non-replicated tables don't have this feature.
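A sketch of this approach, assuming a hypothetical replicated table (the ZooKeeper path and replica macro are placeholders you would adapt to your cluster):

```sql
-- Identical insert blocks into a Replicated* table are deduplicated,
-- so retrying a failed INSERT is safe.
CREATE TABLE events_replicated
(
    id UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_replicated', '{replica}')
ORDER BY id;

INSERT INTO events_replicated VALUES (1, 'a'), (2, 'b');
-- Re-running the exact same INSERT (same rows, same order) is
-- recognized as a duplicate block and effectively becomes a no-op:
INSERT INTO events_replicated VALUES (1, 'a'), (2, 'b');
```

Note that the guarantee applies per inserted block: if a retry sends the rows in a different order or split into different batches, it is not considered the same block.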



commented Sep 5, 2017

I can't remember which version introduced DEDUPLICATE for the OPTIMIZE query, but it was effective for me in removing dupes. It does appear to rewrite the whole table. Also, I can't remember whether I needed to use FINAL.
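As a sketch, assuming a table named `events` (a hypothetical name), the one-shot cleanup described above looks like this:

```sql
-- Removes fully identical rows (all columns equal) by merging parts.
-- FINAL forces the merge even if the data is already in a single part,
-- which is why this can rewrite the whole table.
OPTIMIZE TABLE events FINAL DEDUPLICATE;
```

Since it rewrites data, this is best used as an occasional cleanup step rather than after every load.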




commented Jul 27, 2018

@lttPo do you have any further questions?
