Materializing an ordinary column with default expression should not override past values #58023
Signed-off-by: Duc Canh Le <duccanh.le@ahrefs.com>
…_override_past_values
SELECT '-- Compact parts';

CREATE TABLE tab (id Int64, dflt Int64 DEFAULT 54321) ENGINE MergeTree ORDER BY id;
It would be nice to also test MATERIALIZED expressions (e.g. as columns dflt_default and dflt_materialized).
I will add a test for it.
Done
tests/queries/0_stateless/02946_materialize_column_must_not_override_past_values.sql
{
/// For ordinary column with default or materialized expression, MATERIALIZE COLUMN should not override past values
/// So we only mutate column if `command.column_name` is a default/materialized column or if the part does not have physical column file
auto column_ordinary = table_columns.getOrdinary().tryGetByName(command.column_name);
To be more precise here, we should not test for the absence of an ordinary column (l. 81), we should test for the presence of a materialized or default column.
or if the part does not have physical column file
When is that the case? Does it not suffice to check for the column type (ordinary, default, materialized, etc.)?
To be more precise here, we should not test for the absence of an ordinary column (l. 81), we should test for the presence of a materialized or default column.
I think we should only consider ordinary columns. Once the expression of a MATERIALIZED column changes, the column should be rewritten for all parts so that its values stay consistent with the expression. @rschu1ze wdyt about this behaviour?
That is why I only consider ordinary columns here. As far as I know, ALTER MODIFY COLUMN already checks that the column must be a MATERIALIZED column or an ordinary column with a default expression, so at this step, this condition:
if(!column_ordinary || !part->tryGetColumn(command.column_name) || !part->hasColumnFiles(* column_ordinary))
checks whether the column is a default column that is absent from the part. I'm not 100% sure, though, that we need to check both !part->tryGetColumn(command.column_name) and !part->hasColumnFiles(* column_ordinary). I put both here just for safety.
After more checking, here is my understanding. (I hope it is correct)
For columns of type DEFAULT, one can either
- specify no value, in which case the DEFAULT value is assumed and no column is created for the part (1)
- specify an (arbitrary, including the DEFAULT) value, in which case a column is created in the part (which takes precedence over the DEFAULT value expression during search) (2)
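To make cases (1) and (2) concrete, here is a minimal stand-alone sketch of the read path. It is not ClickHouse code: `PartModel` and `readColumn` are hypothetical names, and the DEFAULT expression is assumed to be a constant for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of a data part (not a ClickHouse type): the column is
// either physically stored in the part (case 2) or absent (case 1).
struct PartModel
{
    std::size_t rows;
    std::optional<std::vector<std::int64_t>> stored_column;
};

// Read the column for one part. Stored values take precedence; if the column
// is absent, every row is filled from the current DEFAULT expression, which
// is assumed to be a constant in this sketch.
std::vector<std::int64_t> readColumn(const PartModel & part, std::int64_t current_default)
{
    if (part.stored_column)
        return *part.stored_column;
    return std::vector<std::int64_t>(part.rows, current_default);
}
```

Under this model, changing the DEFAULT expression changes what is read from parts in case (1) but leaves parts in case (2) untouched.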
Running ALTER COLUMN MATERIALIZE on a DEFAULT column currently goes over all parts.
- If no part exists, write one based on the current DEFAULT expression. That's okay.
- If a part exists, drop it and write a new part based on the current DEFAULT expression.
The problem is in the latter case that we don't know if the existing part was created
- because (2) happened before or
- because (1) happened before, followed by ALTER COLUMN MATERIALIZE.
We could guess (i.e. do all column rows in the part contain the same value and was this value derived from the previous default expression) but that seems fragile, so it is better to not touch existing parts at all.
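The two strategies can be contrasted in a small stand-alone sketch. This only illustrates the reasoning above and is not the actual mutation code: `ColumnPart`, `materializeOverwriting`, and `materializeMissingOnly` are hypothetical names, and the DEFAULT expression is assumed to be a constant.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of one part's view of a DEFAULT column.
struct ColumnPart
{
    std::size_t rows;
    std::optional<std::vector<std::int64_t>> stored;  // empty: column absent in this part
};

// Strategy that rewrites every part from the current DEFAULT expression.
// Values that were inserted explicitly are lost.
void materializeOverwriting(std::vector<ColumnPart> & parts, std::int64_t current_default)
{
    for (auto & part : parts)
        part.stored = std::vector<std::int64_t>(part.rows, current_default);
}

// Strategy that leaves existing column data alone and only writes the column
// for parts where it is absent.
void materializeMissingOnly(std::vector<ColumnPart> & parts, std::int64_t current_default)
{
    for (auto & part : parts)
        if (!part.stored)
            part.stored = std::vector<std::int64_t>(part.rows, current_default);
}
```

With the second strategy there is no need to guess how an existing column came to be stored, because stored data is never rewritten.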
For columns of type MATERIALIZED, one cannot specify a non-default value. ClickHouse will always create a part for the column. As a result, running ALTER COLUMN MATERIALIZE on a MATERIALIZED column is possible but kind of pointless, as there are no missing parts to materialize (unlike for DEFAULT columns). As of today, doing so rewrites all existing parts with the latest MATERIALIZED expression:
CREATE TABLE tab (id Int64, dflt String MATERIALIZED 'dflt') ENGINE MergeTree ORDER BY id;
INSERT INTO tab (id) VALUES (1);
ALTER TABLE tab MATERIALIZE COLUMN dflt;
INSERT INTO tab (id) VALUES (2);
ALTER TABLE tab MODIFY COLUMN dflt String DEFAULT 'dflt_new';
INSERT INTO tab (id) VALUES (3);
ALTER TABLE tab MATERIALIZE COLUMN dflt;
INSERT INTO tab (id) VALUES (4);
SELECT * FROM tab ORDER BY id;
produces
1 dflt_new
2 dflt_new
3 dflt_new
4 dflt_new
I am asking myself which behavior makes the most sense. I guess there are arguments for both behaviors. The current behavior is in line with the promise that MATERIALIZED columns are "always calculated" (docs). The new behavior is more consistent with this PR.
I think we should only consider ordinary column. A materialize column, once we change the materialize expression, it should be rewritten for all parts, so the values of materialized column are consistent with materialize expression. @rschu1ze wdyt about this behaviour?
I see you are in favor of the current behavior. Okay, let's go for that. Note that with this PR, the result is
1 dflt
2 dflt
3 dflt_new
4 dflt_new
which is different from the current behavior.
And that's why I only consider ordinary column here. As far as I concern ALTER MODIFY COLUMN already checks that column must be materialize column or ordinary column with default expression, so at this step, this condition:
if(!column_ordinary || !part->tryGetColumn(command.column_name) || !part->hasColumnFiles(* column_ordinary))
means to check: if the column is default and it's absent in the part.
What is confusing in the code is that in l. 80, one would expect that getOrdinary() gets columns without default/materialized/ephemeral/alias expression but it gets columns with default expression instead. So what the condition says is check: if the column is a MATERIALIZED column (as earlier checks ensure it is either a DEFAULT or MATERIALIZED column) OR the part is missing. I guess we can remove !column_ordinary and we are good.
Though I'm not 100% sure that we need to check both !part->tryGetColumn(command.column_name) and !part->hasColumnFiles(* column_ordinary). I put both here just for safety.
!part->hasColumnFiles(* column_ordinary) looks like the right method to call.
Final note: It would be great if you could change the example in https://clickhouse.com/docs/en/sql-reference/statements/alter/column#materialize-column to use DEFAULT columns instead and add a note that MATERIALIZED columns are completely rewritten.
@rschu1ze Thanks for the detailed response. Yes, everything you described about DEFAULT columns is correct, but about MATERIALIZED columns:
Note that with this PR, the result is
1 dflt
2 dflt
3 dflt_new
4 dflt_new
which is different from the current behavior.
This PR keeps the current behaviour for MATERIALIZED columns (it has not touched MATERIALIZED columns at all, from the very beginning).
So what the condition says is check: if the column is a MATERIALIZED column (as earlier checks ensure it is either a DEFAULT or MATERIALIZED column) OR the part is missing
Yes, it's exactly what I want here: if the column is a MATERIALIZED column, or the part doesn't have the column file, then we run the mutation.
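Stated as a stand-alone predicate, the rule reads as follows. This is a sketch for discussion, not the code in the PR: the function name and both boolean parameters are hypothetical stand-ins for the `column_ordinary` lookup and the `part->hasColumnFiles(...)` check quoted above.

```cpp
// Hypothetical restatement of the condition discussed in this thread.
// `is_ordinary_with_default` stands in for "column_ordinary is set";
// `part_has_column_files` stands in for part->hasColumnFiles(*column_ordinary).
bool shouldRunMaterializeMutation(bool is_ordinary_with_default, bool part_has_column_files)
{
    // A MATERIALIZED column is not "ordinary" in this sense: always rewrite it.
    if (!is_ordinary_with_default)
        return true;
    // A DEFAULT column is only written for parts that lack the column files,
    // so values already stored in a part are never overridden.
    return !part_has_column_files;
}
```

Dropping the `!column_ordinary` term would make the first branch disappear, so MATERIALIZED columns with existing files would no longer be rewritten.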
I guess we can remove !column_ordinary and we are good.
So I don't think we could do this ^
!part->hasColumnFiles(* column_ordinary) looks like the right method to call.
Thanks, I will fix the code.
Final note: It would be great if you could change the example in https://clickhouse.com/docs/en/sql-reference/statements/alter/column#materialize-column to use DEFAULT columns instead and add a note that MATERIALIZED columns are completely rewritten.
Ok, I will do this.
@canhld94 Hi, just wanted to ping you and ask if you like to finish this one up? I'd be happy to merge.
@rschu1ze yes, I have nothing more to add
…_override_past_values
a52bf77 to eeaa9fb
Fixes #56119
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Running ALTER COLUMN MATERIALIZE on a column with DEFAULT or MATERIALIZED expression now writes the correct values: the default value for existing parts with default value, or the non-default value for existing parts with non-default value. Previously, the default value was written for all existing parts.