Skip to content

Try to squash matview concurrent inserts#87280

Merged
antonio2368 merged 6 commits intomasterfrom
matview-squashing-in-parallel-processing
Oct 2, 2025
Merged

Try to squash matview concurrent inserts#87280
antonio2368 merged 6 commits intomasterfrom
matview-squashing-in-parallel-processing

Conversation

@antonio2368
Copy link
Copy Markdown
Member

@antonio2368 antonio2368 commented Sep 18, 2025

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Squash data from all threads before inserting to materialized views depending on the settings min_insert_block_size_rows_for_materialized_views and min_insert_block_size_bytes_for_materialized_views.
Previously, if parallel_view_processing was enabled, each thread inserting to a specific materailized view would squash insert independently which could lead to higher number of generated parts.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Sep 18, 2025

Workflow [PR], commit [7deeb3a]

Summary:

job_name test_name status info comment
Upgrade check (amd_msan) failure
Killed by signal (in clickhouse-server.log) FAIL
Fatal message in clickhouse-server.log (see fatal_messages.txt) FAIL
Killed by signal (output files) FAIL
Found signal in gdb.log FAIL

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Sep 18, 2025
@antonio2368 antonio2368 force-pushed the matview-squashing-in-parallel-processing branch from b943071 to 32bad6b Compare September 18, 2025 12:21
@CheSema CheSema self-assigned this Sep 18, 2025
@antonio2368 antonio2368 force-pushed the matview-squashing-in-parallel-processing branch 2 times, most recently from a7f4efa to f30c774 Compare September 18, 2025 13:25
@antonio2368 antonio2368 force-pushed the matview-squashing-in-parallel-processing branch from f30c774 to 031a9d5 Compare September 18, 2025 13:30

insert_chains.reserve(sink_stream_size);

/// Squashing from multiple streams breaks deduplication for now so the optimization will be disabled
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How it will work if there is multiple inserts and some of the inserts have transactions and able to initiate rollback after insertion?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for not being clear, I'll write proper changelog once the PR is ready.
This squashes insert data from same inserts but different threads.
Currently, we create squashing transform per thread so each of it will squash the data independently.
I'm trying to see if I can squash everything in one place to reduce number of generated parts, especially if MV generates much smaller parts than the source insert.

@antonio2368 antonio2368 marked this pull request as ready for review September 22, 2025 10:14
@antonio2368
Copy link
Copy Markdown
Member Author

@CheSema do you think we need a setting to have old behavior?
Even though this is better when it comes to number of parts it could increase insert latency depending on the SELECT queries done by the MVs.

@CheSema
Copy link
Copy Markdown
Member

CheSema commented Sep 23, 2025

Sema Checherinda do you think we need a setting to have old behavior? Even though this is better when it comes to number of parts it could increase insert latency depending on the SELECT queries done by the MVs.

I'm really worried here about deduplication. I know that the order of the resulting chunks in inner query has to be preserved and thy can not be mixed. So that feature could be incompatible with the way how we deduplicate inserted data.

query_sample_block,
async_insert,
/*skip_destination_table*/ no_destination,
/*max_insert_threads*/ 1,
Copy link
Copy Markdown
Member

@CheSema CheSema Sep 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we need more inserting threads here?

@antonio2368
Copy link
Copy Markdown
Member Author

I'm really worried here about deduplication. I know that the order of the resulting chunks in inner query has to be preserved and thy can not be mixed. So that feature could be incompatible with the way how we deduplicate inserted data.

It will be disabled if we deduplication is enabled. I'm thinking about adding setting that could disable squashing in such way even with disabled deduplication.

@clickhouse-gh clickhouse-gh bot added pr-improvement Pull request with some product improvements and removed pr-not-for-changelog This PR should not be mentioned in the changelog labels Sep 23, 2025
}
}

if (deduplicate_blocks_in_dependent_materialized_views || !has_squashing_transforms)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you do it with respect to deduplication.

else if (stage == Stage::Finish)
{
if (auto exception = runStep([this] { onFinish(); }, thread_group))
GenerateResult res;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not clear why do we need this changes?
why do we have a result on Finish stage?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PlanSquashingTransform was IInflatingTransform because we don't create chunk until we have enough data. To properly handle errors from MV it needed to become ExceptionKeepingTransform.
As the logic of IInflatingTransform is much simple, the easiest thing for me was to add the logic from it to ExceptionKeepingTransform.

@CheSema
Copy link
Copy Markdown
Member

CheSema commented Sep 23, 2025

As I understand the code the setting insert thread count was introduced not for speeding up inner queries, it does it as side effect, but mainly as making writing to destination table concurrently.

Many be we need here more detailed settings like:
inner_mv_queries_concurency
insertion_concurency

@antonio2368
Copy link
Copy Markdown
Member Author

As I understand the code the setting insert thread count was introduced not for speeding up inner queries, it does it as side effect, but mainly as making writing to destination table concurrently.

Yes, but if you have chained MVs, with this PR it will squash data from different threads which means instead of running 4 smaller SELECTs (and creating more parts) you run 1 heavy SELECT (and creating only 1 part).
So you end up doing more during the insert instead of during merges and SELECTs on the destination table.
It general this should be fine, creating fewer parts is always a win but it could confuse some users.

And I agree for the settings, it's a bit confusing how max_insert_threads interacts with MVs. Let's do it in different PR not to clutter this one.

@antonio2368 antonio2368 force-pushed the matview-squashing-in-parallel-processing branch from ad1fc8c to 7deeb3a Compare September 29, 2025 11:45
@antonio2368 antonio2368 added this pull request to the merge queue Oct 2, 2025
Merged via the queue into master with commit b6d1cd8 Oct 2, 2025
121 of 123 checks passed
@antonio2368 antonio2368 deleted the matview-squashing-in-parallel-processing branch October 2, 2025 15:23
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants