Skip to content

Dedupe data using ReplacingMergeTree() #12525

@gane5hvarma

Description

@gane5hvarma

Hi, I wanted to dedupe data using replacingMergeTree. One issue i'm facing is that i have two columns event_received_at(datetime) and event_id(string). I want to dedupe data by event_id, but sorting should be by event_received_at.

According to the docs replacingMergeTree dedupes based on sorting key for me it is event_received_at, but there can be two different events which has the same event_received_at.
https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/replacingmergetree/

So is it possible to dedupe by event_id and sort by event_received_at using replacingMergeTree
or there is another mergeTree that could achieve this functionality?

tried below example:

CREATE TABLE IF NOT EXISTS test.app
(

    event_id     String,
    event_received_at   DateTime
) ENGINE = ReplacingMergeTree() order by (event_id, event_received_at) primary key event_id;


INSERT INTO test.app FORMAT CSV
abc, 2019-12-23 14:36:31
abc, 2020-12-23 14:36:31
def, 2019-12-23 14:36:31


select * from test.app final;
output:
abc, 2019-12-23 14:36:31
abc, 2020-12-23 14:36:31
def, 2019-12-23 14:36:31

expecting output to be like this

abc, 2020-12-23 14:36:31
def, 2019-12-23 14:36:31

if its not possible by replacingMergeTree is there another mergeTree which could achieve the above functionality

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions