[RFC] Table Engine with Unique Key support #41817
Comments
Some of my thoughts (learned from a reference design)
This seems to have a lot of similarities with lightweight deletes (#37893). I'm not saying it should be implemented on top of that feature, but
@ucasfl How is this possible if the data will be deleted on write under a lock?
The merge thread does not always hold the write lock, so it may not know that some data was deleted before the merged part is confirmed; the deleted data can then reappear in the merged part.
@alexey-milovidov Hi, what do you think about it?
@ucasfl This idea is great! Personally, I think it's really useful in many use cases, and it sounds similar to UniqueMergeTree in ByteHouse. I hope this feature can be implemented in the ReplacingMergeTree engine or as a separate, independent engine.
RFC about the replicated engine: #46650
Motivation
Real-time updates have many application scenarios, such as order processing and row-level deduplication.

Currently, in ClickHouse, we can use `ReplacingMergeTree` to meet this need to some extent. However, deduplication in the `ReplacingMergeTree` engine relies on Merge-on-Read (`SELECT` with `FINAL`), which leads to worse read performance.

To get real-time updates with better read performance, we would like to implement a new MergeTree-based table engine. In this new engine, we use a mark-delete + insert method to implement updates and deletes. It has much better read performance, while write performance is slowed down.

Usage
Table creation is similar to other MergeTree tables. The table deduplicates data by the `ORDER BY` expression. We can specify a `version` column as a parameter of the engine; if a `version` column is specified, the table keeps the row with the max version, otherwise it keeps the most recently inserted row.

Note that when a partition expression is specified: if the partition expression is part of the `ORDER BY` expression, the uniqueness is table-level; otherwise, the uniqueness is partition-level.

Writes to this table engine use `upsert` semantics.

When new data is inserted and the key does not exist yet, the data is inserted directly.
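The intended upsert semantics can be illustrated with a toy Python model (this is not engine code; the function name and the tie-breaking rule for equal versions are assumptions):

```python
# Toy model of the proposed upsert semantics (illustration only, not engine code).
# Assumption: one ORDER BY key per row and an optional version column.

def upsert(table, key, row, version=None):
    """Insert `row` under `key`. If the key already exists, the old row is
    (conceptually) mark-deleted and the new row inserted; with a version
    column, the row with the max version wins."""
    if key not in table:
        table[key] = (row, version)          # key absent: plain insert
    else:
        _, old_version = table[key]
        if version is None or old_version is None or version >= old_version:
            table[key] = (row, version)      # newer (or unversioned) write wins

t = {}
upsert(t, 1, "a", version=5)
upsert(t, 1, "b", version=10)   # key 1 exists: old row mark-deleted, "b" wins
upsert(t, 2, "x", version=7)
upsert(t, 2, "y", version=3)    # stale version: "x" is kept
```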
This time, since key `1` already exists, the old record is marked `deleted`, and then the new record is inserted.

We can also delete data by specifying a `__delete_op` column in the insert: `__delete_op = 1` means delete and `0` means upsert, so inserts and deletes can be mixed.

Design
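The write and read paths described in this section can be sketched with a toy Python model (all names are invented; the real parts, delete bitmaps, and key index are persistent structures, and writes are assumed to be serialized by the table-level lock):

```python
# Toy sketch of the part / delete-bitmap / key-index bookkeeping (invented
# names, not engine code): each part holds rows, a per-part delete bitmap
# marks dead rows, and the key index maps a key to its current location.

parts = []          # one list of (key, value) rows per data part
delete_bitmap = []  # per part: set of deleted row numbers
key_index = {}      # key -> (part_no, row_no)

def write(rows):
    """Upsert a batch: mark old locations deleted, append a new part."""
    part_no = len(parts)
    parts.append(list(rows))
    delete_bitmap.append(set())
    for row_no, (key, _value) in enumerate(rows):
        if key in key_index:                      # key exists: mark-delete old row
            old_part, old_row = key_index[key]
            delete_bitmap[old_part].add(old_row)
        key_index[key] = (part_no, row_no)        # keep the key index up to date

def read():
    """Merge-free read: filter each part through its delete bitmap."""
    return [row for p, rows in enumerate(parts)
                for r, row in enumerate(rows) if r not in delete_bitmap[p]]

write([(1, "a"), (2, "b")])
write([(1, "a2")])          # key 1: the old row in part 0 is mark-deleted
print(read())               # -> [(2, 'b'), (1, 'a2')]
```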
Delete bitmaps are multi-version: we keep multiple bitmaps for each part, and a read always uses the current valid bitmap with the max version.

Write
: when writing data, we look up the key in the table. If the newly inserted key does not exist, we just insert the data; if it exists, we mark the old row as deleted (by updating the related delete bitmap) and insert the new data. To speed up the key lookup, we introduce a key index that maps each primary key to its location. When data is upserted, the key index is updated as well.

Read
: there is no need to merge data at read time; we just use the delete bitmap to filter out deleted rows. To speed up reads, the newest-version delete bitmap can be cached in memory.

Write-write
conflict: in this engine, concurrent writes can conflict (updating the same key), so we introduce a table-level lock to serialize multiple writes.

Write-merge
conflict: write and merge can also have conflict(some data is deleted when merging). For this conflict, we introduce a delete buffer to record the deleted record when merging. When writing found the deleted record related part is merging, it will put the deleted key into the related delete buffer, then when merge confirm new part, it will check the delete buffer and generate delete bitmap for new merged part with deleted keys. Also, we add a setting to control the size of the delete buffer, if too many data deleted during merge, the merge should be abort and try again, such that we can save more space.Limit
The uniqueness is shard-level.

Currently, I have implemented a demo of the table engine. What do you think about this feature? If it is acceptable, I would like to submit it to the community in the future.
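For illustration, the delete-buffer handling of write-merge conflicts described in the Design section could be sketched as (toy Python, invented names; the real design also involves part states and locking):

```python
# Toy sketch of the delete-buffer idea for write-merge conflicts
# (invented names; the real engine tracks parts, locks, and versions).

delete_buffer = set()   # keys deleted while a merge of their parts is in flight
merging = False

def delete_key(key):
    """A concurrent write deletes `key`; if its part is being merged,
    remember the key so the merged part can mask it on commit."""
    if merging:
        delete_buffer.add(key)

def confirm_merge(merged_rows):
    """On merge commit, build the new part's delete bitmap from the buffer."""
    bitmap = {row_no for row_no, (key, _value) in enumerate(merged_rows)
              if key in delete_buffer}
    delete_buffer.clear()
    return bitmap

merging = True
delete_key(2)                                       # key 2 deleted mid-merge
bitmap = confirm_merge([(1, "a"), (2, "b"), (3, "c")])
# the row for key 2 is masked in the new merged part
```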