-
Notifications
You must be signed in to change notification settings - Fork 6.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Engine Merge obeys row policy #50209
Conversation
This is an automated comment for commit c2816ec with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page Successful checks
|
I believe that test failures in 00002_log_and_exception_messages_formatting are irrelevant, although AST fuzzer (ubsan) issue IS relevant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address the comments
src/Storages/StorageMerge.cpp
Outdated
required_columns.begin(), required_columns.end(), | ||
std::inserter(filter_columns, filter_columns.begin())); | ||
|
||
source_step_with_filter->addFilter(filter_dag_ptr, filter_columns.front()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only with columns from RBAC that are not required columns? Would filter work properly if there are some required columns in expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why only 1 filter, what if there are many columns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'extra' column here is a complex condition that may contain logical operations, something like
'field1 > x and field2 > y'.
That's why only one filter is needed.
I'll try to simulate collision between the column and what is listed in required column, this is an interesting point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be beneficial to put that info into comment for this piece of code
src/Storages/StorageMerge.cpp
Outdated
Names filter_columns; | ||
|
||
std::set_difference(fa_actions_columns_sorted.begin(), fa_actions_columns_sorted.end(), | ||
required_columns.begin(), required_columns.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
required_columns.begin(), required_columns.end(), | |
required_columns_sorted.begin(), required_columns_sorted.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, required_columns_sorted must be used, thanks.
src/Storages/StorageMerge.cpp
Outdated
|
||
|
||
std::set_difference(fa_actions_columns_sorted.begin(), fa_actions_columns_sorted.end(), | ||
required_columns.begin(), required_columns.end(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, required_columns_sorted
?
src/Storages/StorageMerge.cpp
Outdated
if (row_policy_filter) | ||
{ | ||
ASTPtr expr = row_policy_filter->expression; | ||
|
||
auto syntax_result = TreeRewriter(local_context).analyze(expr, pipe_columns); | ||
auto expression_analyzer = ExpressionAnalyzer{row_policy_filter->expression, syntax_result, local_context}; | ||
|
||
auto actions_dag = expression_analyzer.getActionsDAG(true, false); | ||
auto filter_actions = std::make_shared<ExpressionActions>(actions_dag, ExpressionActionsSettings::fromContext(local_context, CompileExpressions::yes)); | ||
auto required_columns = filter_actions->getRequiredColumns(); | ||
|
||
LOG_TRACE(&Poco::Logger::get("ReadFromMerge::convertinfSourceStream"), "filter_actions_dag: {},<> {}, <> {}", | ||
filter_actions->getActionsDAG().dumpNames(), filter_actions->getActionsDAG().dumpDAG(), filter_actions->getSampleBlock().dumpStructure()); | ||
|
||
|
||
auto fa_actions_columns_sorted = filter_actions->getSampleBlock().getNames(); | ||
std::sort(fa_actions_columns_sorted.begin(), fa_actions_columns_sorted.end()); | ||
|
||
Names required_columns_sorted = required_columns; | ||
std::sort(required_columns_sorted.begin(), required_columns_sorted.end()); | ||
|
||
Names filter_columns; | ||
|
||
|
||
std::set_difference(fa_actions_columns_sorted.begin(), fa_actions_columns_sorted.end(), | ||
required_columns.begin(), required_columns.end(), | ||
std::inserter(filter_columns, filter_columns.begin())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks almost identical to the code block you've added above, maybe factor it out into a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add RBAC based on 2 or more columns?
src/Storages/StorageMerge.cpp
Outdated
} | ||
|
||
return builder; | ||
} | ||
|
||
void ReadFromMerge::RowPolicyData::init(RowPolicyFilterPtr row_policy_filter_ptr_, | ||
const std::shared_ptr<DB::IStorage> storage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove const
here, it breaks the build.
Jul 14 14:16:37 /build/src/Storages/StorageMerge.cpp:391:9: error: parameter 2 is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls,-warnings-as-errors]
src/Storages/StorageMerge.cpp
Outdated
@@ -961,8 +1129,14 @@ void ReadFromMerge::convertingSourceStream( | |||
const Aliases & aliases, | |||
ContextPtr local_context, | |||
QueryPipelineBuilder & builder, | |||
const QueryProcessingStage::Enum & processed_stage) | |||
const QueryProcessingStage::Enum & processed_stage, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove const
here too
Integration test failures are not related. Please, review proposed change. |
Hello @vitlibar , |
src/Storages/StorageMerge.cpp
Outdated
|
||
if (row_policy_data.needCare()) | ||
{ | ||
if (auto * source_step_with_filter = dynamic_cast<SourceStepWithFilter*>((plan.getRootNode()->step.get()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if that step
is not derived from SourceStepWithFilter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here that it's an optimization if the storage can use filtering while reading (e.g. MergeTree).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conditions can be combined to be shorter:
if (plan.isInitialized() and row_policy_data)
{
if (auto * source_step_with_filter = dynamic_cast<SourceStepWithFilter*>((plan.getRootNode()->step.get())))
row_policy_data->addStorageFilter(source_step_with_filter);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not obvious that "The conditions can be combined to be shorter".
Could you explain why you are sure that going further with non initialized plan is safe and makes sense?
src/Storages/StorageMerge.cpp
Outdated
{ | ||
if (row_policy_data.needCare()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needCare()
is not a very clear name. row_policy_data.hasRowPolicy()
seems better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed, although it is usually considered bad practice if a method name contain class name (something like Vector::vector_size).
Nevertheless, let it be 'hasRowPolicy'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better remove the method at all, see #50209 (comment)
DROP ROW POLICY IF EXISTS 02763_filter_4 ON 02763_merge_merge_1; | ||
DROP ROW POLICY IF EXISTS 02763_filter_5 ON 02763_merge_fancycols; | ||
DROP ROW POLICY IF EXISTS 02763_filter_6 ON 02763_merge_fancycols; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please write in the description to this PR a very simple example how it worked before your PR and how it's going to work now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made it by myself:
CREATE TABLE tbl1 (x UInt32) ENGINE=MergeTree ORDER BY tuple();
INSERT INTO tbl1 SELECT number AS x FROM numbers(10);
CREATE ROW POLICY r1 ON tbl1 USING x%2==0 TO ALL;
SELECT * FROM tbl1; -- applies the row policy
SELECT * FROM merge('default', 'tbl1'); -- doesn't apply the row policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original behavior is described in the issue
#50163
which is referred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My example is shorter :)
79c7c4f
to
0554018
Compare
Hello @vitlibar , I think I've addressed your comments, except binary search. |
Hello @vitlibar , do you have ideas/suggestions what should be improved further? |
auto storage_columns = storage_metadata_snapshot->getColumns(); | ||
auto needed_columns = storage_columns.getAllPhysical(); | ||
|
||
auto syntax_result = TreeRewriter(local_context).analyze(expr, needed_columns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to use here storage_columns.getAllPhysical()
as needed_columns
? I mean what if a storage has an alias column and a row policy filter involves that alias? Can you please check that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to use here
storage_columns.getAllPhysical()
asneeded_columns
? I mean what if a storage has an alias column and a row policy filter involves that alias? Can you please check that case?
Very good point, quite possible that it is a mistake. Will check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed we have an issue here.
But the problem is deeper, Engine Merge is not very good at aliases in general, they seem do not work via function https://fiddle.clickhouse.com/4648fd3d-3c1a-462e-919b-90dafcafc131
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is even deeper ;) , looks like Engine Merge does not work with tables with different structures. And 'does not work' means gives wrong result.
https://fiddle.clickhouse.com/27a3d1d1-51a4-491c-b6ad-5421c11d1f0b
Sorry, my bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summing up issue with alias based row policy.
I've fixed it for cases where Engine Merge is used directly, not via table function merge(). Works ok. Not pushed yet.
I haven't fixed it for merge() yet.
How it looks https://pastila.nl/?00d38392/d68be2ab03fb64572f5a3f1b3bd383b2#2RTThmq3iM6cf1C1nLQVUg==
Alternatively we can
- expose alias via merge table function - trivial fix. Can be treated as improvement, but breaks current behavior.
- go with this PR further as is and fix alias based row policy for merge() in a separated PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expose alias via merge table function - trivial fix. Can be treated as improvement, but breaks current behavior.
I suppose ideally the merge()
function should return aliases only if that alias is requested explicitly; or if asterisk is used and asterisk_include_alias_columns
is set. Because it's better to be consistent with how we treat aliases for normal tables. But of course if an alias is used in a row policy then it should work no matter if we're going to show that alias to a user or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that it is easier to have both
- merge() properly returning aliases and respecting asterisk_include_alias_columns
- alias based row policies working
implemented,
than (2) but not (1).
Working on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course a better solution is always better, I just suggested to split work between two PRs. But ok, no problem, do it the way you like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last push brought many changes in terms of lines of code, but actually changes are not significant.
I've introduced processAliases method because I have to add new names (extendNames) before taking care of aliases and it is too much clutter for initializePipeline().
Besides this, I renamed convertingSourceStream to convertAndFilterSourceStream , because filtering is needed in-between.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asterisk_include_alias_columns is left out of the scope despite my initial intention (and claim).
src/Storages/StorageMerge.h
Outdated
const StorageWithLockAndName & storage_with_lock, | ||
Aliases & aliases, | ||
const Block & sample_block, | ||
ContextMutablePtr modified_context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly does the function processAliases()
do? I know it's named processAliases
but how it processes them? A comment here would be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved this part of code as is, but Ok, I'll try to add some comments.
src/Storages/StorageMerge.cpp
Outdated
} | ||
|
||
Aliases aliases; | ||
processAliases(real_column_names, storage_with_lock, aliases, sample_block, modified_context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's better to rename processAliases
to extractAliases()
and make it return Aliases
?
src/Storages/StorageMerge.cpp
Outdated
} | ||
if (!column_names_as_aliases.empty()) | ||
{ | ||
real_column_names = column_names_as_aliases; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it replaces real_column_names
only if !column_names_as_aliases.empty()
Such a complicated logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make the function processAliases()
return column_names_as_aliases
as is, and do replace read_column_names
not here but in the function which calls processAliases()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
processAliases should return both Aliases and column_names, that's why non constant references are used.
I think that a more complex approach with std::pair or something does not make real sense here.
If you don't mind I would like to keep processAliases in its current form and to avoid further interference in its logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You've made an interference already, it doesn't add too much interference if we pass arguments this way or another. I'm ok with passing results via reference parameters. For me it's just better to avoid parameters which are input and output at the same time - just to make the life easier. So why can't we make column_names_as_aliases
a pure output parameter and do this logic in createSource()
:
QueryPipelineBuilderPtr ReadFromMerge::createSources(...)
{
...
Aliases aliases;
Names column_names_as_aliases;
processAliases(context, storage_with_lock, sample_block, modified_context, aliases, column_names_as_aliases);
if (!column_names_as_aliases.empty())
real_column_names = column_names_as_aliases;
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be because this logic is a part of processing aliases ? ;)
Please resolve the conflict |
The changes are rather significant. |
merge_row_policy: changes because of refactoring in master
@@ -42,6 +42,7 @@ | |||
02003_WithMergeableStateAfterAggregationAndLimit_LIMIT_BY_LIMIT_OFFSET | |||
02404_memory_bound_merging | |||
02725_agg_projection_resprect_PK | |||
02763_row_policy_storage_merge_alias |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically because of
$ grep merge_alias analyzer_tech_debt.txt
01214_test_storage_merge_aliases_with_where
01925_test_storage_merge_aliases
Aliases in Merge do not work if analyzer enabled.
The build is broken (see). Can you please fix it? |
Yes, sure. |
@vitlibar |
Test failures are not caused by the proposed changes. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Engine
Merge
filters the records according to the row policies of the underlying tables.