Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Materializing an ordinary column with default expression should not override past values #58023

Merged

Conversation

canhld94
Copy link
Contributor

@canhld94 canhld94 commented Dec 19, 2023

Fixes #56119

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Running ALTER COLUMN MATERIALIZE on a column with DEFAULT or MATERIALIZED expression now writes the correct values: The default value for existing parts with default value or the non-default value for existing parts with non-default value. Previously, the default value was written for all existing parts.

Signed-off-by: Duc Canh Le <duccanh.le@ahrefs.com>
@UnamedRus

This comment was marked as outdated.

@canhld94

This comment was marked as outdated.

@UnamedRus

This comment was marked as outdated.

@canhld94

This comment was marked as outdated.

Signed-off-by: Duc Canh Le <duccanh.le@ahrefs.com>
Signed-off-by: Duc Canh Le <duccanh.le@ahrefs.com>
@rschu1ze rschu1ze mentioned this pull request Jan 12, 2024
@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Jan 12, 2024

SELECT '-- Compact parts';

CREATE TABLE tab (id Int64, dflt Int64 DEFAULT 54321) ENGINE MergeTree ORDER BY id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to test also MATERIALIZED expressions (e.g. as columns dflt_default and dflt_materialized).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a test for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

{
/// For ordinary column with default or materialized expression, MATERIALIZE COLUMN should not override past values
/// So we only mutate column if `command.column_name` is a default/materialized column or if the part does not have physical column file
auto column_ordinary = table_columns.getOrdinary().tryGetByName(command.column_name);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more precise here, we should not test for the absence of an ordinary column (l. 81), we should test for the presence of a materialized or default column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or if the part does not have physical column file

When is that the case? Does it not suffice to check for the column type (ordinary, default, materialized, etc.)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more precise here, we should not test for the absence of an ordinary column (l. 81), we should test for the presence of a materialized or default column.

I think we should only consider ordinary column. A materialize column, once we change the materialize expression, it should be rewritten for all parts, so the values of materilized column are consistent with materialize expression. @rschu1ze wdyt about this behaviour?

And that's why I only consider ordinary column here. As far as I concern ALTER MODIFY COLUMN already checks that column must be materialize column or ordinary column with default expression, so at this step, this condition:

if(!column_ordinary || !part->tryGetColumn(command.column_name) || !part->hasColumnFiles(* column_ordinary))

means to check: if the column is default and it's absent in the part. Though I'm not 100% sure that we need to check both !part->tryGetColumn(command.column_name) and !part->hasColumnFiles(* column_ordinary). I put both here just for safety.

Copy link
Member

@rschu1ze rschu1ze Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After more checking, here is my understanding. (I hope it is correct)

For columns of type DEFAULT, one can either

  • specify no value, in which case the DEFAULT value is assumed and no column is created for the part (1)
  • specify a (arbitrary, including the DEFAULT) value, in which case a column is created in the part (which takes precedence over the DEFAULT value expression during search) (2)

Running ALTER COLUMN MATERIALIZE on a DEFAULT column currently goes over all parts.

  • If no part exists, write one based on the current DEFAULT expression. That's okay.
  • If a part exists, drop it and write a new part based on the current DEFAULT expression.

The problem is in the latter case that we don't know if the existing part was created

  • because (2) happened before or
  • because (1) happened before, followed by ALTER COLUMN MATERIALIZE.

We could guess (i.e. do all column rows in the part contain the same value and was this value derived from the previous default expression) but that seems fragile, so it is better to not touch existing parts at all.


For columns of type MATERIALIZE, one cannot specify a non-default value. ClickHouse will always create a part for the column. As a result, running ALTER COLUMN on a MATERIALIZE column is possible but kind of pointless as there are no missing parts to materialize (unlike for DEFAULT columns). As of today, doing so rewrites all existing parts with the latest MATERIALIZE expression:

CREATE TABLE tab (id Int64, dflt String MATERIALIZED 'dflt') ENGINE MergeTree ORDER BY id;

INSERT INTO tab (id) VALUES (1);
ALTER TABLE tab MATERIALIZE COLUMN dflt;

INSERT INTO tab (id) VALUES (2);

ALTER TABLE tab MODIFY COLUMN dflt String DEFAULT 'dflt_new';
INSERT INTO tab (id) VALUES (3);

ALTER TABLE tab MATERIALIZE COLUMN dflt;

INSERT INTO tab (id) VALUES (4);

SELECT * FROM tab ORDER BY id;

produces

1	dflt_new
2	dflt_new
3	dflt_new
4	dflt_new	

Asking myself which behavior makes the most sense. I guess there are arguments for both behaviors. The current behavior is inline with the promise that MATERIALIZED columns are "always calculated" (docs). The new behavior is more consistent with this PR.

I think we should only consider ordinary column. A materialize column, once we change the materialize expression, it should be rewritten for all parts, so the values of materilized column are consistent with materialize expression. @rschu1ze wdyt about this behaviour?

I see you are in favor of the current behavior. Okay, let's go for that. Note that with this PR, the result is

1	dflt
2	dflt
3	dflt_new
4	dflt_new	

which is different from the current behavior.

And that's why I only consider ordinary column here. As far as I concern ALTER MODIFY COLUMN already checks that column must be materialize column or ordinary column with default expression, so at this step, this condition:

if(!column_ordinary || !part->tryGetColumn(command.column_name) || !part->hasColumnFiles(* column_ordinary))

means to check: if the column is default and it's absent in the part.

What is confusing in the code is that in l. 80, one would expect that getOrdinary() gets columns without default/materialized/ephemeral/alias expression but it gets columns with default expression instead. So what the condition says is check: if the column is a MATERIALIZED column (as earlier checks ensure it is either a DEFAULT or MATERIALIZED column) OR the part is missing. I guess we can remove !column_ordinary and we are good.

Though I'm not 100% sure that we need to check both !part->tryGetColumn(command.column_name) and !part->hasColumnFiles(* column_ordinary). I put both here just for safety.

!part->hasColumnFiles(* column_ordinary) looks like the right method to call.

Final note: It would be great if you could change the example in https://clickhouse.com/docs/en/sql-reference/statements/alter/column#materialize-column to use DEFAULT columns instead and add a note that MATERIALIZED columns are completely rewrittten.

Copy link
Contributor Author

@canhld94 canhld94 Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschu1ze Thanks for detail response. Yes, everything you described about DEFAULT column is correct, but about MATERIALIZED column:

Note that with this PR, the result is

1 dflt
2 dflt
3 dflt_new
4 dflt_new

which is different from the current behavior.

This PR keeps the current behaviour for MATERIALIZED columns, (it doesn't touch MATERIALIZED column at all at the very beginning).

So what the condition says is check: if the column is a MATERIALIZED column (as earlier checks ensure it is either a DEFAULT or MATERIALIZED column) OR the part is missing

Yes, it's exactly what I want here: if the column is a MATERIALIZED column, or the part doesn't have the column file, then we run the mutation.

I guess we can remove !column_ordinary and we are good.

So I don't think we could do this ^

!part->hasColumnFiles(* column_ordinary) looks like the right method to call.

Thanks, I will fix the code.

Final note: It would be great if you could change the example in https://clickhouse.com/docs/en/sql-reference/statements/alter/column#materialize-column to use DEFAULT columns instead and add a note that MATERIALIZED columns are completely rewrittten.

Ok, I will do this.

@rschu1ze rschu1ze self-assigned this Jan 12, 2024
@rschu1ze rschu1ze added the can be tested Allows running workflows for external contributors label Jan 12, 2024
@robot-ch-test-poll4
Copy link
Contributor

robot-ch-test-poll4 commented Jan 12, 2024

This is an automated comment for commit eeaa9fb with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Mergeable CheckChecks if all other necessary checks are successful✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
SQLTestThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
SQLancerFuzzing tests that detect logical bugs with SQLancer tool✅ success
SqllogicRun clickhouse on the sqllogic test set against sqlite and checks that all statements are passed✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success
Check nameDescriptionStatus
A SyncThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS❌ failure
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests❌ failure
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors❌ failure

@rschu1ze
Copy link
Member

@canhld94 Hi, just wanted to ping you and ask if you like to finish this one up? I'd be happy to merge.

@canhld94
Copy link
Contributor Author

@rschu1ze yes, I have nothing more to add

@rschu1ze rschu1ze force-pushed the materialize_non_override_past_values branch from a52bf77 to eeaa9fb Compare February 15, 2024 11:47
@rschu1ze rschu1ze merged commit 7e11fc7 into ClickHouse:master Feb 20, 2024
277 of 285 checks passed
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 21, 2024
rschu1ze added a commit to rschu1ze/ClickHouse that referenced this pull request Mar 1, 2024
rschu1ze added a commit that referenced this pull request Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
can be tested Allows running workflows for external contributors pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Materialize column overwrites all past value. It should be documented.
6 participants