
Setting do_not_merge_across_partitions_select_final = 1 does not filter deleted rows in ReplacingMergeTree #49685

Closed
geopet85 opened this issue May 9, 2023 · 18 comments · Fixed by #53511
Labels
bug Confirmed user-visible misbehaviour in official release

Comments

@geopet85 commented May 9, 2023

We are using ClickHouse 23.3.2 and we have a few ReplacingMergeTree tables defined like below:

CREATE TABLE subscriptions_replacing
(
    `subscription_id` UInt64,
    `account_id` UInt64,
    ...
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/{cluster}/tables/{database}/{table}', '{replica}', _version, _is_deleted)
PRIMARY KEY (account_id, subscription_id)
ORDER BY (account_id, subscription_id)
SETTINGS min_age_to_force_merge_seconds = 300, min_age_to_force_merge_on_partition_only = 1, index_granularity = 8192

When querying with FINAL, we get different results if we add the setting do_not_merge_across_partitions_select_final = 1. It seems to skip filtering the deleted rows:

$ select count(), _is_deleted from subscriptions_replacing group by _is_deleted

┌─count()─┬─_is_deleted─┐
│ 1096219 │           0 │
│  264359 │           1 │
└─────────┴─────────────┘

2 rows in set. Elapsed: 0.010 sec. Processed 1.36 million rows, 1.36 MB (139.59 million rows/s., 139.59 MB/s.)

$ select count() from subscriptions_replacing final 

┌─count()─┐
│ 1096219 │
└─────────┘

1 row in set. Elapsed: 0.066 sec. Processed 1.37 million rows, 34.22 MB (20.85 million rows/s., 521.13 MB/s.)

$ select count() from subscriptions_replacing final SETTINGS do_not_merge_across_partitions_select_final = 1

┌─count()─┐
│ 1360578 │
└─────────┘

1 row in set. Elapsed: 0.019 sec. Processed 1.36 million rows, 34.01 MB (72.04 million rows/s., 1.80 GB/s.)

$ select count() from subscriptions_replacing final where _is_deleted=0 SETTINGS do_not_merge_across_partitions_select_final = 1 

┌─count()─┐
│ 1096219 │
└─────────┘

1 row in set. Elapsed: 0.021 sec. Processed 1.36 million rows, 34.01 MB (65.00 million rows/s., 1.62 GB/s.)

Unfortunately I couldn't reproduce this in a local setup.

Is it expected that this setting affects the recent ReplacingMergeTree behaviour of filtering out deleted rows, or is this a bug?

@den-crane (Contributor)

repro https://fiddle.clickhouse.com/23eb97fc-b305-4ffc-a299-fad738bda8ad

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id);

insert into t select number, 0, 1 from numbers(1e3);

insert into t select number, 1, 1 from numbers(1e2);

optimize table t final;

select count() from t final;
┌─count()─┐
│     900 │
└─────────┘

select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
┌─count()─┐
│    1000 │
└─────────┘

select count() from t;
┌─count()─┐
│    1000 │
└─────────┘

@den-crane (Contributor)

@youennL-cs

@nickitat added the bug (Confirmed user-visible misbehaviour in official release) label and removed the unexpected behaviour label on May 9, 2023
@youennL-cs (Contributor)

For me, this is the expected behaviour.
The do_not_merge_across_partitions_select_final setting avoids merges, which is where the filter is applied. So it's normal that rows flagged as _is_deleted are still present with this setting.

You can add CLEANUP to the OPTIMIZE statement to remove these rows, but they'll be removed once and for all.

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id);

insert into t select number, 0, 1 from numbers(1e3);

insert into t select number, 1, 1 from numbers(1e2);

optimize table t final;

select count() from t final;
┌─count()─┐
│     900 │
└─────────┘

select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
┌─count()─┐
│    1000 │
└─────────┘

select count() from t;
┌─count()─┐
│    1000 │
└─────────┘

optimize table t final cleanup;

select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
┌─count()─┐
│     900 │
└─────────┘

select count() from t;
┌─count()─┐
│     900 │
└─────────┘

Maybe this case should be added to the documentation.

@den-crane (Contributor) commented May 11, 2023

Hm,

So a table-level MergeTree setting solves the issue (SETTINGS clean_deleted_rows = 'Always').

https://fiddle.clickhouse.com/de6ccea6-63bb-4bcc-ab9d-fc19f4d4502c
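
A minimal sketch of what the fiddle presumably shows, reconstructed from the earlier repro (the only change is the table-level clean_deleted_rows setting; the 900 result matches the CLEANUP run above):

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id)
SETTINGS clean_deleted_rows = 'Always';

insert into t select number, 0, 1 from numbers(1e3);
insert into t select number, 1, 1 from numbers(1e2);

optimize table t final;  -- with 'Always', rows flagged _is_deleted = 1 are dropped during the merge

select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
-- expected: 900, since the flagged rows no longer exist in the part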

@youennL-cs (Contributor)

If you add this setting, clean_deleted_rows = 'Always', then every time you run an OPTIMIZE ... FINAL it will behave the same as OPTIMIZE ... FINAL CLEANUP, and data flagged as _is_deleted = 1 will be permanently removed.
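
A sketch of the equivalence, using the table t from the repro above:

optimize table t final;          -- with clean_deleted_rows = 'Always', also drops rows with _is_deleted = 1
optimize table t final cleanup;  -- explicit cleanup; same effect on the flagged rows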

@den-crane (Contributor)

@youennL-cs
Is 'Never' the default because 'Always' leads to unexpected behaviour? A delete during a merge can happen for the recent parts while the data stays undeleted in the earlier parts.

@geopet85 (Author)

I understand the reasoning, and I can always filter using the _is_deleted column, as restated in the sketch below (we have some ReplacingMergeTrees where we don't want to introduce clean_deleted_rows = 'Always' because we want to use them to backfill downstream models again with the to-be-deleted rows).

However, I feel that having a setting (do_not_merge_across_partitions_select_final) which introduces different filtering behaviour based on another setting (clean_deleted_rows) could create some confusion for ReplacingMergeTrees, especially since do_not_merge_across_partitions_select_final is suggested for better performance with FINAL.

Could we at least document this somehow?
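
For reference, the explicit-filter workaround already shown in the original report:

select count() from subscriptions_replacing final where _is_deleted = 0 SETTINGS do_not_merge_across_partitions_select_final = 1;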

@youennL-cs (Contributor)

> Is 'Never' the default because 'Always' leads to unexpected behaviour? A delete during a merge can happen for the recent parts while the data stays undeleted in the earlier parts.

@den-crane,
No, both have only expected behaviour.
'Never' is the default because that's what we agreed on (#41005 (comment)) and because we think people prefer choosing when to delete data instead of deleting it at every merge. And that's what @geopet85 confirmed in his last message.
But if some people want to clean everything up at every merge, they can with the 'Always' option.

And yes, this specific behaviour with this setting should be documented, to avoid future confusion.

@den-crane (Contributor)

@youennL-cs I mean that a ReplacingMergeTree can merge just the newer parts:

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id)
SETTINGS clean_deleted_rows = 'Always';

insert into t select 1, 0, 1;  -- first part
insert into t select 1, 0, 2; -- second part
....
insert into t select 1, 0, 3; -- third part
insert into t select 1, 1, 4; -- fourth part (delete)

I guess ClickHouse is able to merge parts 3 + 4; in the result there will be no rows with key 1.
Parts 1 & 2 will stay undeleted.

@den-crane (Contributor) commented May 12, 2023

The row magically reappears after OPTIMIZE:

drop table t;
CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id)
settings clean_deleted_rows= 'Always' as select number, 0, 1 from numbers (1e7);

optimize table t final;

insert into t select number, 0, 2 from numbers(600);
insert into t select number, 1, 3 from numbers(300);
insert into t select number, 0, 4 from numbers(10);

select * from t final where account_id = 11;    --------- no row
Ok.
0 rows in set. Elapsed: 0.005 sec. Processed 9.09 thousand rows, 154.56 KB (2.00 million rows/s., 34.02 MB/s.)

optimize table t;                          --------- optimize

select * from t final where account_id = 11;
┌─account_id─┬─_is_deleted─┬─_version─┐
│         11 │           0 │        1 │  ---- row has resurrected
└────────────┴─────────────┴──────────┘
1 row in set. Elapsed: 0.013 sec. Processed 8.50 thousand rows, 144.53 KB (643.01 thousand rows/s., 10.93 MB/s.)

@den-crane (Contributor) commented May 12, 2023

I think one more mode is needed, clean_deleted_rows = 'Final' or 'FirstFinal': when a merge knows that it is merging all (the first) blocks into a single block, then it's safe to delete rows.
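
A hypothetical illustration of the proposal (this mode does not exist; the value 'FirstFinal' is only the suggestion above):

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id)
SETTINGS clean_deleted_rows = 'FirstFinal';
-- hypothetical semantics: drop rows flagged _is_deleted = 1 only when the merge
-- includes the first block of the partition, so no older version can resurrect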

@filimonov (Contributor) commented May 15, 2023

Yep, that was discussed in PR #41005 (comment).

And that's why it's not enabled by default: it can give unreliable results.

Maybe we can try to implement logic in the merge like the following: "if the merge includes the minimum block of the partition, it's safe to remove rows marked as deleted". Right now optimize deletes the last state of the row.

@den-crane (Contributor) commented Jun 22, 2023

> For me, this is the expected behaviour.
> The do_not_merge_across_partitions_select_final setting avoids merges, which is where the filter is applied. So it's normal that rows flagged as _is_deleted are still present with this setting.

Idea 1.

Even with do_not_merge_across_partitions_select_final = 1, a single part is processed by FINAL, and the _is_deleted rows can be filtered by that FINAL processing.

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id);

insert into t select number, 0, 1 from numbers(1e8);
insert into t select number, 1, 1 from numbers(1);
optimize table t final;

select count() from t final;
--
99999999

Elapsed: 1.629 sec.


select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
--
100000000

Elapsed: 0.313 sec


select count() from t final where _is_deleted=0  SETTINGS do_not_merge_across_partitions_select_final = 1;
--
99999999

Elapsed: 0.332 sec

select count() from (select _is_deleted from t final where _is_deleted=0)  SETTINGS do_not_merge_across_partitions_select_final = 1;
--
99999999

Elapsed: 0.293 sec

The result is correct and there is no performance degradation.
Maybe I'm missing some scenario, but right now I cannot reproduce any issues with filtering by _is_deleted = 0.

Idea 2.

Another solution is to store meta-information in a part saying that it contains deleted rows, and to disable do_not_merge_across_partitions_select_final automatically for such parts.

Something like has_lightweight_delete: 1, or storing the number of rows with _is_deleted set.
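
For comparison, lightweight deletes already expose per-part metadata of this kind (a sketch; the column exists in recent ClickHouse versions):

select name, has_lightweight_delete
from system.parts
where table = 't' and active;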

@aadant commented Jun 23, 2023

@den-crane @youennL-cs: the setting should not affect the result even if you document this odd behavior. So let's apply the filtering when FINAL is set.

@filimonov (Contributor)

> For me, this is the expected behaviour.
> The do_not_merge_across_partitions_select_final setting avoids merges, which is where the filter is applied. So it's normal that rows flagged as _is_deleted are still present with this setting.

That is wrong. That setting does not avoid merges; it limits the scope of FINAL to the parts of a single partition (so it makes FINAL work just like a usual merge, while by default it also tries to find duplicates across partitions).

The expectation that it should work sounds reasonable and valid. Probably it is not working due to some silly missing line somewhere.

@youennL-cs
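
A sketch illustrating the scope difference (hypothetical partitioned table; the expected counts follow from the description above):

CREATE TABLE p
(
    `k` UInt64,
    `d` Date,
    `v` UInt64
)
ENGINE = ReplacingMergeTree(v)
PARTITION BY toYYYYMM(d)
ORDER BY k;

insert into p values (1, '2023-01-01', 1);
insert into p values (1, '2023-02-01', 2);  -- same key, different partition

select count() from p final;
-- expected: 1, duplicates are found across partitions

select count() from p final SETTINGS do_not_merge_across_partitions_select_final = 1;
-- expected: 2, each partition is processed independently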

@filimonov (Contributor) commented Jun 30, 2023

It's this logic working there:

/// If do_not_merge_across_partitions_select_final is true and there is only one part in partition
/// with level > 0 then we won't postprocess this part and if num_streams > 1 we
/// can use parallel select on such parts. We save such parts in one vector and then use
/// MergeTreeReadPool and MergeTreeThreadSelectProcessor for parallel select.
if (num_streams > 1 && settings.do_not_merge_across_partitions_select_final &&
    std::distance(parts_to_merge_ranges[range_index], parts_to_merge_ranges[range_index + 1]) == 1 &&
    parts_to_merge_ranges[range_index]->data_part->info.level > 0)
{
    sum_marks_in_lonely_parts += parts_to_merge_ranges[range_index]->getMarksCount();
    lonely_parts.push_back(std::move(*parts_to_merge_ranges[range_index]));
    continue;
}

So if there is more than a single part in a partition, the is_deleted logic works as expected.

We can just try to disable that part of the logic there by adding a condition like

    && data.merging_params.is_deleted_column.empty())

But that means that we will still do FINAL on partitions containing a single part (but in the scope of one partition only).

@aadant commented Jul 3, 2023

Also, I think you should run the test suite with do_not_merge_across_partitions_select_final = 1 once the bug is fixed.
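
For example (a sketch; how the suite toggles settings in practice may differ):

SET do_not_merge_across_partitions_select_final = 1;
select count() from t final;  -- once fixed, should match the result of plain FINAL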
