
[RFC] Table Engine with Unique Key support #41817

Open
ucasfl opened this issue Sep 27, 2022 · 7 comments

@ucasfl
Collaborator

ucasfl commented Sep 27, 2022

Motivation

Real-time updates have many application scenarios, such as order processing and row-level deduplication.

Currently, in ClickHouse, we can use ReplacingMergeTree to meet this need to some extent. However, the deduplication of the ReplacingMergeTree engine at read time relies on Merge-on-Read (SELECT with FINAL), which leads to worse read performance.

In order to get real-time updates with better read performance, we would like to implement a new MergeTree-based table engine. In this new table engine, we use a mark-delete + insert approach to implement updates and deletes, which gives much better read performance while slowing down writes.

Usage

  1. Create table.
vccliu1645086827254-0 :) create table t(n1 UInt32, n2 UInt32, v UInt32) engine=UniqueMergeTree(v) order by n1;

CREATE TABLE t
(
    `n1` UInt32,
    `n2` UInt32,
    `v` UInt32
)
ENGINE = UniqueMergeTree(v)
ORDER BY n1

Query id: 040482de-fe32-4617-9c1a-9bea4e0f83fc

Ok.

0 rows in set. Elapsed: 0.009 sec. 

The table creation is similar to other MergeTree tables. The table deduplicates data by the ORDER BY expression. We can specify a version column as the engine parameter; if a version column is specified, the table keeps the row with the maximum version, otherwise it keeps the most recently written row.

By the way, when a partition expression is specified: if the partition expression is part of the ORDER BY expression, the UNIQUENESS is table-level, otherwise the UNIQUENESS is partition-level.
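For illustration, here is a sketch of a table whose partition expression is not part of the ORDER BY expression, so its UNIQUENESS would only be partition-level (t_partitioned is just a hypothetical name, following the syntax above):

-- Sketch only: a partitioned variant of the table above.
-- The partition key n2 is not part of ORDER BY n1, so n1 would only be
-- unique within each partition (partition-level UNIQUENESS).
CREATE TABLE t_partitioned
(
    `n1` UInt32,
    `n2` UInt32,
    `v` UInt32
)
ENGINE = UniqueMergeTree(v) -- v is the version column: the row with the max v is kept
PARTITION BY n2
ORDER BY n1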

  2. Insert data

Writes to this table engine use upsert semantics.

vccliu1645086827254-0 :) insert into t values(1, 2, 3) (2, 3, 4) (3, 4, 5)

INSERT INTO t FORMAT Values

Query id: a0489158-da13-4fa2-8f1b-e70bc57de0a1

Ok.

3 rows in set. Elapsed: 0.003 sec. 

vccliu1645086827254-0 :) select * from t

SELECT *
FROM t

Query id: 992cddb8-4b7f-4b59-9526-37ae2f13d7b0

┌─n1─┬─n2─┬─v─┐
│  1 │  2 │ 3 │
│  2 │  3 │ 4 │
│  3 │  4 │ 5 │
└────┴────┴───┘

3 rows in set. Elapsed: 0.002 sec. 

Since these keys did not exist in the table, the rows are inserted directly. Now insert new data that reuses key 1:

vccliu1645086827254-0 :) insert into t values(1, 4, 5) (4, 3, 4) (5, 4, 5)

INSERT INTO t FORMAT Values

Query id: f2ab0773-9706-496d-b5f6-9b2536632132

Ok.

3 rows in set. Elapsed: 0.002 sec. 

vccliu1645086827254-0 :) select * from t

SELECT *
FROM t

Query id: 363a1088-9e56-4590-94b4-92d913978ce7

┌─n1─┬─n2─┬─v─┐
│  2 │  3 │ 4 │
│  3 │  4 │ 5 │
└────┴────┴───┘
┌─n1─┬─n2─┬─v─┐
│  1 │  4 │ 5 │
│  4 │  3 │ 4 │
│  5 │  4 │ 5 │
└────┴────┴───┘

5 rows in set. Elapsed: 0.002 sec. 

This time, since key 1 already exists, the old record is marked as deleted and the new records are inserted.

We can also delete data by specifying the __delete_op column in an INSERT:

vccliu1645086827254-0 :) insert into t(n1, __delete_op) values (4, 1)

INSERT INTO t (n1, __delete_op) FORMAT Values

Query id: accd807c-7dda-442e-a957-0a89080a79f5

Ok.

1 rows in set. Elapsed: 0.001 sec. 

vccliu1645086827254-0 :) select * from t

SELECT *
FROM t

Query id: 91c49061-3053-463d-b359-5497aad55544

┌─n1─┬─n2─┬─v─┐
│  2 │  3 │ 4 │
│  3 │  4 │ 5 │
└────┴────┴───┘
┌─n1─┬─n2─┬─v─┐
│  1 │  4 │ 5 │
│  5 │  4 │ 5 │
└────┴────┴───┘

4 rows in set. Elapsed: 0.002 sec. 

__delete_op = 1 means delete and 0 means upsert, so inserts and deletes can be mixed.
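As a sketch based on the description above (the exact values and column list are illustrative assumptions, reusing table t from the examples), a single INSERT could mix upserts and deletes by setting __delete_op per row:

-- Sketch: one INSERT that upserts key 1 and deletes key 5 in the same batch.
-- __delete_op = 0 means upsert, __delete_op = 1 means delete.
INSERT INTO t (n1, n2, v, __delete_op) VALUES
    (1, 9, 9, 0),  -- upsert: replaces the existing row with key 1
    (5, 0, 0, 1)   -- delete: marks the row with key 5 as deleted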

Design

  1. Introduce a delete bitmap to record the deleted rows of each part. The bitmap stores the row numbers of the deleted records in that part. Since a write may delete rows from old parts, the bitmap is multi-versioned: we keep multiple bitmaps for each part, and reads always use the currently valid bitmap with the maximum version.
  2. Write: when writing data, look up each key in the table. If the inserted key does not exist, just insert the data; if it does exist, mark the old key as deleted (by updating the related bitmap) and insert the new data. To speed up the key lookup, we introduce a key index that maps each primary key to its location. When data is upserted, the key index is updated as well.
  3. Read: no merge is needed at read time; the delete bitmap is simply used to filter out deleted rows (a query-level sketch follows this list). To speed up reads, the newest delete bitmap of each part can be cached in memory.
  4. Write-write conflict: writes can conflict with each other (updates to the same key), so we introduce a table-level lock to serialize multiple writes.
  5. Write-merge conflict: writes and merges can also conflict (some rows are deleted while a merge is in progress). For this case we introduce a delete buffer that records the rows deleted during a merge. When a write finds that the part containing the deleted record is being merged, it puts the deleted key into the related delete buffer; when the merge commits the new part, it checks the delete buffer and generates a delete bitmap for the merged part from those keys. We also add a setting to limit the size of the delete buffer: if too many rows are deleted during a merge, the merge is aborted and retried, so that we can save space.
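To make the read path concrete, here is a query-level sketch (replacing_table is a hypothetical ReplacingMergeTree table used only for contrast; t is the UniqueMergeTree table from the usage section):

-- With ReplacingMergeTree, deduplicated results rely on Merge-on-Read:
SELECT * FROM replacing_table FINAL

-- With the proposed UniqueMergeTree, a plain SELECT already returns
-- deduplicated data, because deleted rows are filtered out by the
-- delete bitmap at read time:
SELECT * FROM t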

Limitation

The UNIQUENESS is shard-level.

Currently, I have implemented a demo of this table engine. What do you think about this feature? If it is acceptable, I would like to submit it to the community in the future.

@ucasfl ucasfl added the feature label Sep 27, 2022
@canhld94
Contributor

Some of my thoughts (learned from a reference design)

  • The unique key is not necessarily the order key; we can have a separate index (e.g. B+ tree, radix tree) for the unique key.
  • To modify data (from the user side), we can provide explicit queries like DELETE and UPSERT.
  • Need an efficient concurrency control mechanism.

@Algunenano
Member

This seems to have a lot of similarities with lightweight deletes (#37893). I'm not saying it should be implemented on top of that feature, but __delete_op looks an awful lot like _row_exists and the mechanism added there to ignore "deleted" data that hasn't been removed from parts yet.

@azat
Collaborator

azat commented Sep 30, 2022

Writes and merges can also conflict (some rows are deleted while a merge is in progress).

@ucasfl how is this possible, if the data is deleted at write time under the lock?

@ucasfl
Collaborator Author

ucasfl commented Sep 30, 2022

Writes and merges can also conflict (some rows are deleted while a merge is in progress).

@ucasfl how is this possible, if the data is deleted at write time under the lock?

The merge thread does not know that some data was deleted before it commits the merged part (it does not hold the write lock the whole time), so the deleted data could appear again in the merged part.

@ucasfl
Collaborator Author

ucasfl commented Oct 4, 2022

@alexey-milovidov Hi, what do you think about it?

@albertcht

albertcht commented Oct 31, 2022

@ucasfl This idea is great! Personally I think it's really useful in many use cases, and it sounds similar to UniqueMergeTree in ByteHouse. I hope this feature can be implemented in the ReplacingMergeTree engine or as another independent engine.

@ucasfl
Collaborator Author

ucasfl commented Feb 22, 2023

RFC about replicated engine: #46650
