
Setting do_not_merge_across_partitions_select_final = 1 does not filter deleted rows in ReplacingMergeTree #49685

Closed
geopet85 opened this issue May 9, 2023 · 18 comments · Fixed by #53511
Labels
bug Confirmed user-visible misbehaviour in official release

Comments

@geopet85 commented May 9, 2023

We are using ClickHouse 23.3.2 and we have a few ReplacingMergeTree tables defined like below:

CREATE TABLE subscriptions_replacing
(
    `subscription_id` UInt64,
    `account_id` UInt64,
    ...
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/{cluster}/tables/{database}/{table}', '{replica}', _version, _is_deleted)
PRIMARY KEY (account_id, subscription_id)
ORDER BY (account_id, subscription_id)
SETTINGS min_age_to_force_merge_seconds = 300, min_age_to_force_merge_on_partition_only = 1, index_granularity = 8192

When querying with FINAL, we get different results if we add the setting do_not_merge_across_partitions_select_final = 1. It seems to skip filtering the deleted rows:

$ select count(), _is_deleted from subscriptions_replacing group by _is_deleted

┌─count()─┬─_is_deleted─┐
│ 1096219 │           0 │
│  264359 │           1 │
└─────────┴─────────────┘

2 rows in set. Elapsed: 0.010 sec. Processed 1.36 million rows, 1.36 MB (139.59 million rows/s., 139.59 MB/s.)

$ select count() from subscriptions_replacing final 

┌─count()─┐
│ 1096219 │
└─────────┘

1 row in set. Elapsed: 0.066 sec. Processed 1.37 million rows, 34.22 MB (20.85 million rows/s., 521.13 MB/s.)

$ select count() from subscriptions_replacing final SETTINGS do_not_merge_across_partitions_select_final = 1

┌─count()─┐
│ 1360578 │
└─────────┘

1 row in set. Elapsed: 0.019 sec. Processed 1.36 million rows, 34.01 MB (72.04 million rows/s., 1.80 GB/s.)

$ select count() from subscriptions_replacing final where _is_deleted=0 SETTINGS do_not_merge_across_partitions_select_final = 1 

┌─count()─┐
│ 1096219 │
└─────────┘

1 row in set. Elapsed: 0.021 sec. Processed 1.36 million rows, 34.01 MB (65.00 million rows/s., 1.62 GB/s.)

Unfortunately I couldn't reproduce this in a local setup.

Is it expected that this setting affects the recent ReplacingMergeTree behaviour of filtering out deleted rows, or is this a bug?

@den-crane (Contributor)

repro https://fiddle.clickhouse.com/23eb97fc-b305-4ffc-a299-fad738bda8ad

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id);

insert into t select number, 0, 1 from numbers(1e3);

insert into t select number, 1, 1 from numbers(1e2);

optimize table t final;

select count() from t final;
┌─count()─┐
│     900 │
└─────────┘

select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
┌─count()─┐
│    1000 │
└─────────┘

select count() from t;
┌─count()─┐
│    1000 │
└─────────┘

@den-crane (Contributor)

@youennL-cs

@nickitat added the bug (Confirmed user-visible misbehaviour in official release) label and removed the unexpected behaviour label on May 9, 2023
@youennL-cs (Contributor)

For me, this is the expected behaviour.
The do_not_merge_across_partitions_select_final setting avoids merges, which is where the filter is applied. So it's normal that rows flagged as _is_deleted are still present with this setting.

You can add CLEANUP to the OPTIMIZE statement to remove these rows, but they'll be removed once and for all.

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id);

insert into t select number, 0, 1 from numbers(1e3);

insert into t select number, 1, 1 from numbers(1e2);

optimize table t final;

select count() from t final;
┌─count()─┐
│     900 │
└─────────┘

select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
┌─count()─┐
│    1000 │
└─────────┘

select count() from t;
┌─count()─┐
│    1000 │
└─────────┘

optimize table t final cleanup;

select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
┌─count()─┐
│     900 │
└─────────┘

select count() from t;
┌─count()─┐
│     900 │
└─────────┘

Maybe this case should be added to the documentation.

@den-crane (Contributor) commented May 11, 2023

Hm,

So a table-level MergeTree setting solves the issue (SETTINGS clean_deleted_rows = 'Always').

https://fiddle.clickhouse.com/de6ccea6-63bb-4bcc-ab9d-fc19f4d4502c
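
A minimal sketch of what the fiddle presumably shows, reconstructed from the earlier repro (the only change is the table-level clean_deleted_rows setting; the 900 result matches the CLEANUP run above):

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id)
SETTINGS clean_deleted_rows = 'Always';

insert into t select number, 0, 1 from numbers(1e3);
insert into t select number, 1, 1 from numbers(1e2);

optimize table t final;  -- with 'Always', rows flagged _is_deleted = 1 are dropped during the merge

select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
-- expected: 900, since the flagged rows no longer exist in the part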

@youennL-cs (Contributor)

If you add this setting, clean_deleted_rows = 'Always', then every time you run an OPTIMIZE ... FINAL it will behave the same as OPTIMIZE ... FINAL CLEANUP, and data flagged as _is_deleted = 1 will be permanently removed.
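
A sketch of the equivalence, using the table t from the repro above:

optimize table t final;          -- with clean_deleted_rows = 'Always', also drops rows with _is_deleted = 1
optimize table t final cleanup;  -- explicit cleanup; same effect on the flagged rows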

@den-crane (Contributor)

@youennL-cs
Is 'Never' the default because 'Always' leads to unexpected behaviour? A delete during a merge can happen for the recent parts while the data stays undeleted in the earlier parts.

@geopet85 (Author)

I understand the reasoning, and I can always filter using the _is_deleted column, as restated in the sketch below (we have some ReplacingMergeTrees where we don't want to introduce clean_deleted_rows = 'Always' because we want to use them to backfill downstream models again with the to-be-deleted rows).

However, I feel that having a setting (do_not_merge_across_partitions_select_final) which introduces different filtering behaviour based on another setting (clean_deleted_rows) could create some confusion for ReplacingMergeTrees, especially since do_not_merge_across_partitions_select_final is suggested for better performance with FINAL.

Could we at least document this somehow?
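
For reference, the explicit-filter workaround already shown in the original report:

select count() from subscriptions_replacing final where _is_deleted = 0 SETTINGS do_not_merge_across_partitions_select_final = 1;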

@youennL-cs (Contributor)

> Is 'Never' the default because 'Always' leads to unexpected behaviour? A delete during a merge can happen for the recent parts while the data stays undeleted in the earlier parts.

@den-crane,
No, both have only expected behaviour.
'Never' is the default because that's what we agreed on (#41005 (comment)) and because we think people prefer choosing when to delete data instead of deleting it at every merge. And that's what @geopet85 confirmed in his last message.
But if some people want to clean everything up at every merge, they can with the 'Always' option.

And yes, this specific behaviour with this setting should be documented, to avoid future confusion.

@den-crane (Contributor)

@youennL-cs I mean that a ReplacingMergeTree can merge just the newer parts:

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id)
SETTINGS clean_deleted_rows = 'Always';

insert into t select 1, 0, 1;  -- first part
insert into t select 1, 0, 2; -- second part
....
insert into t select 1, 0, 3; -- third part
insert into t select 1, 1, 4; -- fourth part (delete)

I guess ClickHouse is able to merge parts 3 + 4; in the result there will be no rows with key 1.
Parts 1 & 2 will stay undeleted.

@den-crane (Contributor) commented May 12, 2023

The row magically reappears after OPTIMIZE:

drop table t;
CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id)
settings clean_deleted_rows= 'Always' as select number, 0, 1 from numbers (1e7);

optimize table t final;

insert into t select number, 0, 2 from numbers(600);
insert into t select number, 1, 3 from numbers(300);
insert into t select number, 0, 4 from numbers(10);

select * from t final where account_id = 11;    --------- no row
Ok.
0 rows in set. Elapsed: 0.005 sec. Processed 9.09 thousand rows, 154.56 KB (2.00 million rows/s., 34.02 MB/s.)

optimize table t;                          --------- optimize

select * from t final where account_id = 11;
┌─account_id─┬─_is_deleted─┬─_version─┐
│         11 │           0 │        1 │  ---- row has resurrected
└────────────┴─────────────┴──────────┘
1 row in set. Elapsed: 0.013 sec. Processed 8.50 thousand rows, 144.53 KB (643.01 thousand rows/s., 10.93 MB/s.)

@den-crane (Contributor) commented May 12, 2023

I think one more mode is needed, clean_deleted_rows = 'Final' or 'FirstFinal': when a merge knows that it is merging all (the first) blocks into a single block, then it's safe to delete rows.
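
A hypothetical illustration of the proposal (this mode does not exist; the value 'FirstFinal' is only the suggestion above):

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id)
SETTINGS clean_deleted_rows = 'FirstFinal';
-- hypothetical semantics: drop rows flagged _is_deleted = 1 only when the merge
-- includes the first block of the partition, so no older version can resurrect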

@filimonov (Contributor) commented May 15, 2023

Yep, that was discussed in PR #41005 (comment).

And that's why it's not enabled by default: it can give unreliable results.

Maybe we can try to implement logic in the merge like the following: "if the merge includes the minimum block of the partition, it's safe to remove rows marked as deleted". Right now optimize deletes the last state of the row.

@den-crane (Contributor) commented Jun 22, 2023

> For me, this is the expected behaviour.
> The do_not_merge_across_partitions_select_final setting avoids merges, which is where the filter is applied. So it's normal that rows flagged as _is_deleted are still present with this setting.

Idea 1.

Even with do_not_merge_across_partitions_select_final = 1, a single part is processed by FINAL, and the _is_deleted rows can be filtered by that FINAL processing.

CREATE TABLE t
(
    `account_id` UInt64,
    `_is_deleted` UInt8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version, _is_deleted)
ORDER BY (account_id);

insert into t select number, 0, 1 from numbers(1e8);
insert into t select number, 1, 1 from numbers(1);
optimize table t final;

select count() from t final;
--
99999999

Elapsed: 1.629 sec.


select count() from t final SETTINGS do_not_merge_across_partitions_select_final = 1;
--
100000000

Elapsed: 0.313 sec


select count() from t final where _is_deleted=0  SETTINGS do_not_merge_across_partitions_select_final = 1;
--
99999999

Elapsed: 0.332 sec

select count() from (select _is_deleted from t final where _is_deleted=0)  SETTINGS do_not_merge_across_partitions_select_final = 1;
--
99999999

Elapsed: 0.293 sec

The result is correct and there is no performance degradation.
Maybe I'm missing some scenario, but right now I cannot reproduce any issues with filtering by _is_deleted = 0.

Idea 2.

Another solution is to store meta-information in a part saying that it contains deleted rows, and to disable do_not_merge_across_partitions_select_final automatically for such parts.

Something like has_lightweight_delete: 1, or storing the number of rows with _is_deleted set.
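
For comparison, lightweight deletes already expose per-part metadata of this kind (a sketch; the column exists in recent ClickHouse versions):

select name, has_lightweight_delete
from system.parts
where table = 't' and active;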

@aadant commented Jun 23, 2023

@den-crane @youennL-cs: the setting should not affect the result even if you document this odd behavior. So let's apply the filtering when FINAL is set.

@filimonov (Contributor)

> For me, this is the expected behaviour.
> The do_not_merge_across_partitions_select_final setting avoids merges, which is where the filter is applied. So it's normal that rows flagged as _is_deleted are still present with this setting.

That is wrong. That setting does not avoid merges; it limits the scope of FINAL to the parts of a single partition (so it makes FINAL work just like a usual merge, while by default it also tries to find duplicates across partitions).

The expectation that it should work sounds reasonable and valid. Probably it is not working due to some silly missing line somewhere.

@youennL-cs
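
A sketch illustrating the scope difference (hypothetical partitioned table; the expected counts follow from the description above):

CREATE TABLE p
(
    `k` UInt64,
    `d` Date,
    `v` UInt64
)
ENGINE = ReplacingMergeTree(v)
PARTITION BY toYYYYMM(d)
ORDER BY k;

insert into p values (1, '2023-01-01', 1);
insert into p values (1, '2023-02-01', 2);  -- same key, different partition

select count() from p final;
-- expected: 1, duplicates are found across partitions

select count() from p final SETTINGS do_not_merge_across_partitions_select_final = 1;
-- expected: 2, each partition is processed independently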

@filimonov (Contributor) commented Jun 30, 2023

It's this logic working there:

/// If do_not_merge_across_partitions_select_final is true and there is only one part in partition
/// with level > 0 then we won't postprocess this part and if num_streams > 1 we
/// can use parallel select on such parts. We save such parts in one vector and then use
/// MergeTreeReadPool and MergeTreeThreadSelectProcessor for parallel select.
if (num_streams > 1 && settings.do_not_merge_across_partitions_select_final &&
    std::distance(parts_to_merge_ranges[range_index], parts_to_merge_ranges[range_index + 1]) == 1 &&
    parts_to_merge_ranges[range_index]->data_part->info.level > 0)
{
    sum_marks_in_lonely_parts += parts_to_merge_ranges[range_index]->getMarksCount();
    lonely_parts.push_back(std::move(*parts_to_merge_ranges[range_index]));
    continue;
}

So if there is more than a single part in a partition, the is_deleted logic works as expected.

We can just try to disable that part of the logic there by adding a condition like

    && data.merging_params.is_deleted_column.empty())

But that means that we will still do FINAL on partitions containing a single part (but in the scope of one partition only).

@aadant commented Jul 3, 2023

Also, I think you should run the test suite with do_not_merge_across_partitions_select_final = 1 once the bug is fixed.
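
For example (a sketch; how the suite toggles settings in practice may differ):

SET do_not_merge_across_partitions_select_final = 1;
select count() from t final;  -- once fixed, should match the result of plain FINAL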
