
[RFC] Table Engine with Unique Key support #41817

Open
ucasfl opened this issue Sep 27, 2022 · 7 comments

@ucasfl
Collaborator

ucasfl commented Sep 27, 2022

Motivation

Real-time updates have many application scenarios, such as order processing and row-level deduplication.

Currently, in ClickHouse, we can use ReplacingMergeTree to meet this need to some extent. However, the deduplication of the ReplacingMergeTree engine at read time relies on Merge-on-Read (SELECT with FINAL), which leads to worse read performance.

In order to get real-time updates with better read performance, we would like to implement a new MergeTree-based table engine. In this new table engine, we use a mark-delete + insert approach to implement updates and deletes, which gives much better read performance while slowing down writes.

Usage

  1. Create table.
vccliu1645086827254-0 :) create table t(n1 UInt32, n2 UInt32, v UInt32) engine=UniqueMergeTree(v) order by n1;

CREATE TABLE t
(
    `n1` UInt32,
    `n2` UInt32,
    `v` UInt32
)
ENGINE = UniqueMergeTree(v)
ORDER BY n1

Query id: 040482de-fe32-4617-9c1a-9bea4e0f83fc

Ok.

0 rows in set. Elapsed: 0.009 sec. 

The table creation is similar to other MergeTree tables. The table deduplicates data by the ORDER BY expression. We can specify a version column as the engine parameter; if a version column is specified, the table keeps the row with the maximum version, otherwise it keeps the most recently written row.

By the way, when a partition expression is specified: if the partition expression is part of the ORDER BY expression, the UNIQUENESS is table-level, otherwise the UNIQUENESS is partition-level.
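For illustration, here is a sketch of a table whose partition expression is not part of the ORDER BY expression, so its UNIQUENESS would only be partition-level (t_partitioned is just a hypothetical name, following the syntax above):

-- Sketch only: a partitioned variant of the table above.
-- The partition key n2 is not part of ORDER BY n1, so n1 would only be
-- unique within each partition (partition-level UNIQUENESS).
CREATE TABLE t_partitioned
(
    `n1` UInt32,
    `n2` UInt32,
    `v` UInt32
)
ENGINE = UniqueMergeTree(v) -- v is the version column: the row with the max v is kept
PARTITION BY n2
ORDER BY n1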

  2. Insert data

Writes to this table engine use upsert semantics.

vccliu1645086827254-0 :) insert into t values(1, 2, 3) (2, 3, 4) (3, 4, 5)

INSERT INTO t FORMAT Values

Query id: a0489158-da13-4fa2-8f1b-e70bc57de0a1

Ok.

3 rows in set. Elapsed: 0.003 sec. 

vccliu1645086827254-0 :) select * from t

SELECT *
FROM t

Query id: 992cddb8-4b7f-4b59-9526-37ae2f13d7b0

┌─n1─┬─n2─┬─v─┐
│  1 │  2 │ 3 │
│  2 │  3 │ 4 │
│  3 │  4 │ 5 │
└────┴────┴───┘

3 rows in set. Elapsed: 0.002 sec. 

Since these keys did not exist in the table, the rows are inserted directly. Now insert new data that reuses key 1:

vccliu1645086827254-0 :) insert into t values(1, 4, 5) (4, 3, 4) (5, 4, 5)

INSERT INTO t FORMAT Values

Query id: f2ab0773-9706-496d-b5f6-9b2536632132

Ok.

3 rows in set. Elapsed: 0.002 sec. 

vccliu1645086827254-0 :) select * from t

SELECT *
FROM t

Query id: 363a1088-9e56-4590-94b4-92d913978ce7

┌─n1─┬─n2─┬─v─┐
│  2 │  3 │ 4 │
│  3 │  4 │ 5 │
└────┴────┴───┘
┌─n1─┬─n2─┬─v─┐
│  1 │  4 │ 5 │
│  4 │  3 │ 4 │
│  5 │  4 │ 5 │
└────┴────┴───┘

5 rows in set. Elapsed: 0.002 sec. 

This time, since key 1 already exists, the old record is marked as deleted and the new records are inserted.

We can also delete data by specifying the __delete_op column in an INSERT:

vccliu1645086827254-0 :) insert into t(n1, __delete_op) values (4, 1)

INSERT INTO t (n1, __delete_op) FORMAT Values

Query id: accd807c-7dda-442e-a957-0a89080a79f5

Ok.

1 rows in set. Elapsed: 0.001 sec. 

vccliu1645086827254-0 :) select * from t

SELECT *
FROM t

Query id: 91c49061-3053-463d-b359-5497aad55544

┌─n1─┬─n2─┬─v─┐
│  2 │  3 │ 4 │
│  3 │  4 │ 5 │
└────┴────┴───┘
┌─n1─┬─n2─┬─v─┐
│  1 │  4 │ 5 │
│  5 │  4 │ 5 │
└────┴────┴───┘

4 rows in set. Elapsed: 0.002 sec. 

__delete_op = 1 means delete and 0 means upsert, so inserts and deletes can be mixed.
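As a sketch based on the description above (the exact values and column list are illustrative assumptions, reusing table t from the examples), a single INSERT could mix upserts and deletes by setting __delete_op per row:

-- Sketch: one INSERT that upserts key 1 and deletes key 5 in the same batch.
-- __delete_op = 0 means upsert, __delete_op = 1 means delete.
INSERT INTO t (n1, n2, v, __delete_op) VALUES
    (1, 9, 9, 0),  -- upsert: replaces the existing row with key 1
    (5, 0, 0, 1)   -- delete: marks the row with key 5 as deleted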

Design

  1. Introduce a delete bitmap to record the deleted rows of each part. The bitmap stores the row numbers of the deleted records in that part. Since a write may delete rows from old parts, the bitmap is multi-versioned: we keep multiple bitmaps for each part, and reads always use the currently valid bitmap with the maximum version.
  2. Write: when writing data, look up each key in the table. If the inserted key does not exist, just insert the data; if it does exist, mark the old key as deleted (by updating the related bitmap) and insert the new data. To speed up the key lookup, we introduce a key index that maps each primary key to its location. When data is upserted, the key index is updated as well.
  3. Read: no merge is needed at read time; the delete bitmap is simply used to filter out deleted rows (a query-level sketch follows this list). To speed up reads, the newest delete bitmap of each part can be cached in memory.
  4. Write-write conflict: writes can conflict with each other (updates to the same key), so we introduce a table-level lock to serialize multiple writes.
  5. Write-merge conflict: writes and merges can also conflict (some rows are deleted while a merge is in progress). For this case we introduce a delete buffer that records the rows deleted during a merge. When a write finds that the part containing the deleted record is being merged, it puts the deleted key into the related delete buffer; when the merge commits the new part, it checks the delete buffer and generates a delete bitmap for the merged part from those keys. We also add a setting to limit the size of the delete buffer: if too many rows are deleted during a merge, the merge is aborted and retried, so that we can save space.
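To make the read path concrete, here is a query-level sketch (replacing_table is a hypothetical ReplacingMergeTree table used only for contrast; t is the UniqueMergeTree table from the usage section):

-- With ReplacingMergeTree, deduplicated results rely on Merge-on-Read:
SELECT * FROM replacing_table FINAL

-- With the proposed UniqueMergeTree, a plain SELECT already returns
-- deduplicated data, because deleted rows are filtered out by the
-- delete bitmap at read time:
SELECT * FROM t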

Limitation

The UNIQUENESS is shard-level.

Currently, I have implemented a demo of this table engine. What do you think about this feature? If it is acceptable, I would like to submit it to the community in the future.

@ucasfl ucasfl added the feature label Sep 27, 2022
@canhld94
Contributor

Some of my thoughts (learned from a reference design)

  • The unique key is not necessarily the order key; we can have a separate index (e.g. B+ tree, radix tree) for the unique key.
  • To modify data (from the user side), we can provide explicit queries like DELETE and UPSERT.
  • Need an efficient concurrency control mechanism.

@Algunenano
Member

This seems to have a lot of similarities with lightweight deletes (#37893). I'm not saying it should be implemented on top of that feature, but __delete_op looks an awful lot like _row_exists and the mechanism added there to ignore "deleted" data that hasn't been removed from parts yet.

@azat
Collaborator

azat commented Sep 30, 2022

Writes and merges can also conflict (some rows are deleted while a merge is in progress).

@ucasfl how is this possible, if the data is deleted at write time under the lock?

@ucasfl
Collaborator Author

ucasfl commented Sep 30, 2022

Writes and merges can also conflict (some rows are deleted while a merge is in progress).

@ucasfl how is this possible, if the data is deleted at write time under the lock?

The merge thread does not know that some data was deleted before it commits the merged part (it does not hold the write lock the whole time), so the deleted data could appear again in the merged part.

@ucasfl
Collaborator Author

ucasfl commented Oct 4, 2022

@alexey-milovidov Hi, what do you think about it?

@albertcht

albertcht commented Oct 31, 2022

@ucasfl This idea is great! Personally I think it's really useful in many use cases, and it sounds similar to UniqueMergeTree in ByteHouse. I hope this feature can be implemented in the ReplacingMergeTree engine or as another independent engine.

@ucasfl
Collaborator Author

ucasfl commented Feb 22, 2023

RFC about replicated engine: #46650
