Try to squash matview concurrent inserts#87280
Conversation
Workflow [PR], commit [7deeb3a] Summary: ❌
insert_chains.reserve(sink_stream_size);

/// Squashing from multiple streams breaks deduplication for now so the optimization will be disabled
How will this work if there are multiple inserts and some of them have transactions and are able to initiate a rollback after insertion?
Sorry for not being clear, I'll write a proper changelog once the PR is ready.
This squashes insert data coming from the same insert but different threads.
Currently, we create a squashing transform per thread, so each of them squashes the data independently.
I'm trying to see if I can squash everything in one place to reduce the number of generated parts, especially if the MV generates much smaller parts than the source insert.
@CheSema do you think we need a setting to keep the old behavior?
I'm really worried here about deduplication. I know that the order of the resulting chunks in the inner query has to be preserved and they cannot be mixed, so this feature could be incompatible with how we deduplicate inserted data.
query_sample_block,
async_insert,
/*skip_destination_table*/ no_destination,
/*max_insert_threads*/ 1,
what if we need more inserting threads here?
It will be disabled if deduplication is enabled. I'm thinking about adding a setting that could disable squashing this way even with deduplication disabled.
}
}

if (deduplicate_blocks_in_dependent_materialized_views || !has_squashing_transforms)
I see, you do it with respect to deduplication.
else if (stage == Stage::Finish)
{
    if (auto exception = runStep([this] { onFinish(); }, thread_group))
GenerateResult res;
It's not clear why we need these changes.
Why do we have a result in the Finish stage?
PlanSquashingTransform was an IInflatingTransform because we don't create a chunk until we have enough data. To properly handle errors from MVs it needed to become an ExceptionKeepingTransform.
As the logic of IInflatingTransform is much simpler, the easiest thing for me was to add its logic to ExceptionKeepingTransform.
As I understand the code, the insert thread count setting was introduced not for speeding up inner queries (it does that as a side effect) but mainly to make writing to the destination table concurrent. Maybe we need more detailed settings here, like:
Yes, but if you have chained MVs, with this PR it will squash data from different threads, which means that instead of running 4 smaller SELECTs (and creating more parts) you run 1 heavy SELECT (and create only 1 part). And I agree about the settings; it's a bit confusing how max_insert_threads interacts with MVs. Let's do it in a different PR so as not to clutter this one.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Squash data from all threads before inserting into materialized views, depending on the settings min_insert_block_size_rows_for_materialized_views and min_insert_block_size_bytes_for_materialized_views. Previously, if parallel_view_processing was enabled, each thread inserting into a specific materialized view would squash inserts independently, which could lead to a higher number of generated parts.
Documentation entry for user-facing changes