Implement lazy columns replication in JOIN and ARRAY JOIN #88752

Merged
Avogar merged 44 commits into ClickHouse:master from Avogar:lazy-column-replication
Oct 28, 2025
Member

@Avogar Avogar commented Oct 17, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Implement lazy columns replication in JOIN and ARRAY JOIN. Avoid converting special column representations such as Sparse and Replicated to full columns in some output formats. This avoids unnecessary copying of data in memory.

Closes #82669.

To use lazy replication, enable the settings enable_lazy_columns_replication and allow_special_serialization_kinds_in_output_formats (both are disabled by default for now).
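
For readers unfamiliar with the idea, the difference between eager and lazy replication can be sketched as follows. This is a minimal illustration, not the code from this PR: `ToyColumn`, `ToyReplicated`, `replicateEager` and `replicateLazy` are hypothetical names, and a `std::vector<std::string>` stands in for a real column.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a full column of strings; not a ClickHouse type.
using ToyColumn = std::vector<std::string>;

// Eager replication: row i is physically copied counts[i] times.
ToyColumn replicateEager(const ToyColumn & src, const std::vector<size_t> & counts)
{
    assert(src.size() == counts.size());
    ToyColumn res;
    for (size_t i = 0; i < src.size(); ++i)
        for (size_t j = 0; j < counts[i]; ++j)
            res.push_back(src[i]);
    return res;
}

// Lazy replication (hypothetical miniature): the nested column is left
// untouched and only a vector of row indexes is built.
struct ToyReplicated
{
    const ToyColumn * nested;     // referenced, not copied
    std::vector<size_t> indexes;  // indexes[k] = row of `nested` seen at position k

    size_t size() const { return indexes.size(); }
    const std::string & at(size_t k) const { return (*nested)[indexes[k]]; }
};

ToyReplicated replicateLazy(const ToyColumn & src, const std::vector<size_t> & counts)
{
    assert(src.size() == counts.size());
    ToyReplicated res{&src, {}};
    for (size_t i = 0; i < src.size(); ++i)
        res.indexes.insert(res.indexes.end(), counts[i], i);  // counts[i] copies of index i
    return res;
}
```

Both functions expose the same logical rows, but the lazy variant stores one integer per output row instead of one copy of the value, which matters when values are large strings or arrays.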

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Oct 17, 2025

Workflow [PR], commit [57d94ce]

Summary:

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel) | | failure | | |
| | 03657_gby_overflow_any_sparse | FAIL | cidb | |
| Stress test (amd_msan) | | failure | | |
| | Server died | FAIL | cidb | |
| | Hung check failed, possible deadlock found (see hung_check.log) | FAIL | cidb | |
| | Killed by signal (in clickhouse-server.log) | FAIL | cidb | |
| | Fatal message in clickhouse-server.log (see fatal_messages.txt) | FAIL | cidb | |
| | Killed by signal (output files) | FAIL | cidb | |
| | Found signal in gdb.log | FAIL | cidb | |

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Oct 17, 2025
@KochetovNicolai KochetovNicolai self-assigned this Oct 21, 2025
namespace DB
{

/// Wrapper around ColumnVector to store indexes.
Member

I think this part was ejected from LowCardinality with no change, so I can skip reviewing it.

Member Author

Yes, exactly

Member Author

However, I added a few new simple methods that I needed in ColumnReplicated.

Member Author

@Avogar Avogar Oct 21, 2025

I found a bug in the ColumnVariant::index implementation during testing. I will also create a separate bug-fix PR with this change so it can be backported.

ColumnReplicated::ColumnReplicated(MutableColumnPtr && nested_column_)
    : nested_column(std::move(nested_column_))
{
    indexes.insertIndexesRange(0, nested_column->size());
Member

Can we deprecate this ctor? If ColumnReplicated can replace any column, then we can always use the initial one without indexes.

Member

Also, maybe we can check that no sparse/LC is possible inside. (not sure, but at least it does not make a lot of sense to me).

Member Author

> Can we deprecate this ctor? If ColumnReplicated can replace any column, then we can always use the initial one without indexes.

We cannot use the initial one if we need to insert into it from a ColumnReplicated. This is needed in MergingSortedTransform; otherwise we would have to convert ColumnReplicated to a full column there and lose its benefits.
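
A rough sketch of the point being made: if the destination is itself wrapped with identity indexes (as the constructor under review does), rows can be appended from a replicated source one at a time without first expanding that source. Everything here is hypothetical (`ToyColumn`, `ToyReplicated`, the toy `insertFrom`); the thread does not describe how the real `insertFrom` treats the nested column, so this version simply copies the value.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a full column; not a ClickHouse type.
using ToyColumn = std::vector<std::string>;

// Hypothetical miniature of a replicated column: nested values plus row indexes.
struct ToyReplicated
{
    std::shared_ptr<ToyColumn> nested;
    std::vector<size_t> indexes;

    // Identity wrap, analogous to the reviewed constructor:
    // position i maps to nested row i.
    explicit ToyReplicated(std::shared_ptr<ToyColumn> nested_)
        : nested(std::move(nested_))
    {
        for (size_t i = 0; i < nested->size(); ++i)
            indexes.push_back(i);
    }

    size_t size() const { return indexes.size(); }
    const std::string & at(size_t k) const { return (*nested)[indexes[k]]; }

    // Append logical row k of `src` without expanding `src` to a full column.
    // Assumption of this sketch: the value is copied into our own nested column.
    void insertFrom(const ToyReplicated & src, size_t k)
    {
        assert(k < src.size());
        nested->push_back(src.at(k));
        indexes.push_back(nested->size() - 1);
    }
};
```

In this toy model a merge step would start from `ToyReplicated out{std::make_shared<ToyColumn>()}` and call `out.insertFrom(input, k)` for each selected row, so the inputs stay in their compact form throughout.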

Member Author

> Also, maybe we can check that no sparse/LC is possible inside. (not sure, but at least it does not make a lot of sense to me).

Sparse inside Replicated 100% makes sense: a Sparse column can contain large values that we want to avoid replicating.

LC inside Replicated doesn't make sense, so let's avoid it.

Member

> Sparse inside Replicated 100% makes sense

Sparse is kind of included in replicated. At least you can reuse the same internal column when converting sparse->replicated, and only rebuild indexes

Member Author

If it's ok, I will add proper Sparse -> Replicated conversion later, maybe even in a separate PR.
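
The conversion discussed in this thread could look roughly like the sketch below, which keeps the sparse column's values as the nested column and only builds a new index vector. It is not the planned implementation. The toy layout is an assumption of the sketch: `values[0]` holds the default, `values[1..]` hold the non-default values in row order, and `offsets[j]` is the row of `values[j + 1]`. `ToySparse` and `sparseToReplicatedIndexes` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a full column; not a ClickHouse type.
using ToyColumn = std::vector<std::string>;

// Toy sparse layout (assumption of this sketch, see the text above).
struct ToySparse
{
    ToyColumn values;             // values[0] is the default value
    std::vector<size_t> offsets;  // rows that hold non-default values
    size_t rows;                  // total number of logical rows
};

// Build replicated-style indexes over `values` without touching the values:
// default rows point at slot 0, non-default rows at their slot in `values`.
std::vector<size_t> sparseToReplicatedIndexes(const ToySparse & s)
{
    assert(s.values.size() == s.offsets.size() + 1);
    std::vector<size_t> indexes(s.rows, 0);
    for (size_t j = 0; j < s.offsets.size(); ++j)
    {
        assert(s.offsets[j] < s.rows);
        indexes[s.offsets[j]] = j + 1;
    }
    return indexes;
}
```

Row `r` of the converted column then reads `values[indexes[r]]`, so the potentially large values are never copied during the conversion.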

@Avogar Avogar added this pull request to the merge queue Oct 28, 2025
Merged via the queue into ClickHouse:master with commit 8cb04b3 Oct 28, 2025
121 of 124 checks passed
@Avogar Avogar deleted the lazy-column-replication branch October 28, 2025 21:10
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Oct 28, 2025
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-backports-created-cloud (deprecated label, NOOP) and pr-must-backport-synced (The `*-must-backport` labels are synced into the cloud Sync PR) labels Oct 28, 2025
robot-ch-test-poll4 added a commit that referenced this pull request Oct 29, 2025
Cherry pick #88752 to 25.10: Implement lazy columns replication in JOIN and ARRAY JOIN
robot-clickhouse added a commit that referenced this pull request Oct 29, 2025
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Oct 29, 2025
Avogar added a commit that referenced this pull request Oct 29, 2025
Backport #88752 to 25.10: Implement lazy columns replication in JOIN and ARRAY JOIN
azat added a commit to azat/ClickHouse that referenced this pull request Nov 3, 2025
Tuple itself cannot be Sparse, but some previous ClickHouse versions may
write this into serialization.json, and such a table will then fail
to load.

Follow-up for: ClickHouse#88752

Labels

  • pr-backports-created: Backport PRs are successfully created, it won't be processed by CI script anymore
  • pr-backports-created-cloud: deprecated label, NOOP
  • pr-must-backport-synced: The `*-must-backport` labels are synced into the cloud Sync PR
  • pr-performance: Pull request with some performance improvements
  • pr-synced-to-cloud: The PR is synced to the cloud repo
  • v25.10-must-backport

Development

Successfully merging this pull request may close these issues.

Implement ColumnReplicated for the result of IColumn->replicate (similar to ColumnSparse)

6 participants