Fix cluster functions with hive partitioning by ianton-ru · Pull Request #1863 · Altinity/ClickHouse

ianton-ru · 2026-06-02T13:59:22Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix cluster functions with hive partitioning

Documentation entry for user-facing changes

Fixes #1855.
The list of virtual columns can include hive columns.
In the non-cluster case, the iterator is created with a filtered list of hive columns from which the virtual columns are removed (see https://github.com/Altinity/ClickHouse/blob/antalya-26.3/src/Storages/prepareReadingFromFormat.cpp#L49).
In the cluster case, the iterator was created with the full lists. As a result, hive columns were added twice. The first copy was later filled with data while the second stayed empty, so the attempt to filter failed with the error SIZES_OF_COLUMNS_DOESNT_MATCH.
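The failure mode can be shown with a minimal C++ sketch. This is not ClickHouse code: `Column`, `Block`, `fillHiveColumn` and `checkSizes` are hypothetical stand-ins. The sketch assumes only what is described above: the reader fills the first column with a given hive name, and the filter step rejects a block whose columns differ in size.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for ClickHouse columns and blocks (not the real API).
struct Column
{
    std::string name;
    std::vector<std::string> values;
};
using Block = std::vector<Column>;

// Fills only the FIRST column named `name` with `rows` copies of `value`.
// A second column with the same name stays empty, as described in the PR.
void fillHiveColumn(Block & block, const std::string & name, const std::string & value, std::size_t rows)
{
    for (auto & column : block)
    {
        if (column.name == name)
        {
            column.values.assign(rows, value);
            return;
        }
    }
}

// Hypothetical stand-in for the size check that reports SIZES_OF_COLUMNS_DOESNT_MATCH.
void checkSizes(const Block & block)
{
    if (block.empty())
        return;
    const std::size_t expected = block.front().values.size();
    for (const auto & column : block)
        if (column.values.size() != expected)
            throw std::runtime_error("SIZES_OF_COLUMNS_DOESNT_MATCH: column " + column.name);
}
```

When the block contains a duplicated `year` column, one copy is filled and the other stays empty, so `checkSizes` throws. When the block is built from a deduplicated column list, the check passes.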

CI/CD Options

Exclude tests:

Regression jobs to run:

github-actions · 2026-06-02T14:00:37Z

Workflow [PR], commit [9d5e1d1]

ianton-ru · 2026-06-02T14:01:03Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f8cc3a0eb6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

ianton-ru · 2026-06-02T17:46:14Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9850be48c3


ianton-ru · 2026-06-02T18:50:41Z

@codex review

chatgpt-codex-connector · 2026-06-02T18:58:20Z

Codex Review: Didn't find any major issues. What shall we delve into next?


arthurpassos · 2026-06-03T18:39:23Z


+NamesAndTypesList IStorageCluster::getHivePartitionColumnsWithoutVirtuals() const
+{
+    // Virtual columns can contain hive columns, so we remove these hive columns to avoid duplicates.


iirc, that wasn't supposed to be the case

Hive columns are added to the virtuals in the getVirtualsForFileLikeStorage method when partition_strategy == PartitionStrategyFactory::StrategyType::NONE.

Well, I have just checked the source code, and the behavior has indeed been modified. Upstream is now maintaining this. When I implemented it, it wasn't supposed to be like this.

Regardless, I think this change is fine. Will approve it.
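The condition discussed in this exchange can be modeled with a small hypothetical C++ sketch. `StrategyType`, `NameAndType`, `buildVirtuals` and `countInConcatenation` are simplified stand-ins, and other partition strategies are collapsed into a made-up `OTHER` value. The sketch assumes only what the thread states: hive columns are appended to the virtuals when the strategy is NONE. Under that condition, concatenating the hive list with the virtuals, which is what the cluster path did before the fix, yields each hive column twice.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for PartitionStrategyFactory::StrategyType.
enum class StrategyType
{
    NONE,
    OTHER, // placeholder for every non-NONE strategy
};

struct NameAndType
{
    std::string name;
    std::string type;
};
using NamesAndTypes = std::vector<NameAndType>;

// Simplified model of getVirtualsForFileLikeStorage as described in the thread:
// hive columns are appended to the virtuals only when the strategy is NONE.
NamesAndTypes buildVirtuals(const NamesAndTypes & base_virtuals, const NamesAndTypes & hive_columns, StrategyType strategy)
{
    NamesAndTypes result = base_virtuals;
    if (strategy == StrategyType::NONE)
        result.insert(result.end(), hive_columns.begin(), hive_columns.end());
    return result;
}

// Counts how often `name` occurs when two column lists are simply concatenated.
std::size_t countInConcatenation(const NamesAndTypes & a, const NamesAndTypes & b, const std::string & name)
{
    auto matches = [&](const NameAndType & column) { return column.name == name; };
    return static_cast<std::size_t>(std::count_if(a.begin(), a.end(), matches) + std::count_if(b.begin(), b.end(), matches));
}
```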

Selfeer · 2026-06-08T13:24:45Z

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1863 (Fix cluster functions with hive partitioning):

Confirmed defects:

No confirmed defects in reviewed scope.

Coverage summary:

Scope reviewed: PR diff in src/Storages/IStorageCluster.cpp, src/Storages/IStorageCluster.h, src/Storages/ObjectStorage/StorageObjectStorageCluster.cpp, src/Storages/ObjectStorage/StorageObjectStorageCluster.h, src/Storages/StorageFileCluster.cpp, src/Storages/StorageFileCluster.h, src/Storages/StorageURLCluster.cpp, src/Storages/StorageURLCluster.h, and added integration tests in tests/integration/test_file_cluster/test.py, tests/integration/test_s3_cluster/test.py, tests/integration/test_storage_url/test.py.
Categories failed: None confirmed.
Categories passed: Call-graph and transition checks for task iterator construction/filtering; fail-open/fail-closed checks for missing/filtered paths; shared-state/interleaving checks for iterator/filter state; exception/partial-update checks (no new mutation-heavy state transitions); required C++ bug classes reviewed (lifetime, iterator invalidation, races/deadlocks, exception safety, integer/signedness, ownership/RAII, UB) with no confirmed defect in changed paths.
Assumptions/limits: Static audit only (no runtime execution in this pass); conclusions are limited to changed code paths and their direct interactions.

Selfeer · 2026-06-08T13:33:33Z

PR #1863 CI Triage: "Fix cluster functions with hive partitioning"

PR: #1863
Title: Fix cluster functions with hive partitioning
Branch: bugfix/antalya-26.3/1855_s3cluster_hive
Head SHA: 9d5e1d15b7e88e2017a778370f5c8239f843f0e1
Author: Anton Ivashkin (@ianton-ru)

Files Changed

src/Storages/IStorageCluster.cpp / .h — new getHivePartitionColumnsWithoutVirtuals() helper; moves hive_partition_columns_to_read_from_file_path to base class
src/Storages/ObjectStorage/StorageObjectStorageCluster.cpp / .h — use getHivePartitionColumnsWithoutVirtuals()
src/Storages/StorageFileCluster.cpp / .h — use getHivePartitionColumnsWithoutVirtuals(); fix setVirtuals() to pass partition strategy for hive column dedup
src/Storages/StorageURLCluster.cpp / .h — use getHivePartitionColumnsWithoutVirtuals()
tests/integration/test_file_cluster/test.py — new hive partitioning + WHERE filter test
tests/integration/test_s3_cluster/test.py — new hive partitioning + WHERE filter test
tests/integration/test_storage_url/test.py — new hive partitioning + WHERE filter test
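The refactoring listed above, a shared helper plus a member moved into the base class, can be sketched as a hypothetical C++ model. This is not the PR's code. `ClusterStorageBase`, `FileClusterStorage`, `columnsForTaskIterator` and the constructor are illustrative; only the names `getHivePartitionColumnsWithoutVirtuals` and `hive_partition_columns_to_read_from_file_path` come from the PR. The filtering rule, which drops any hive column whose name is already a virtual column, follows the PR description.

```cpp
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

struct NameAndType
{
    std::string name;
    std::string type;
};
using NamesAndTypes = std::vector<NameAndType>;

// Hypothetical stand-in for IStorageCluster: the hive column list lives in the
// base class, so every cluster storage shares one deduplication helper.
class ClusterStorageBase
{
public:
    ClusterStorageBase(NamesAndTypes virtuals_, NamesAndTypes hive_columns_)
        : virtuals(std::move(virtuals_))
        , hive_partition_columns_to_read_from_file_path(std::move(hive_columns_))
    {
    }

    virtual ~ClusterStorageBase() = default;

    // Returns the hive columns whose names are not already among the virtual columns.
    NamesAndTypes getHivePartitionColumnsWithoutVirtuals() const
    {
        std::unordered_set<std::string> virtual_names;
        for (const auto & column : virtuals)
            virtual_names.insert(column.name);

        NamesAndTypes result;
        for (const auto & column : hive_partition_columns_to_read_from_file_path)
            if (virtual_names.count(column.name) == 0)
                result.push_back(column);
        return result;
    }

    // Columns a derived storage hands to its task iterator (hypothetical hook).
    virtual NamesAndTypes columnsForTaskIterator() const = 0;

protected:
    NamesAndTypes virtuals;
    NamesAndTypes hive_partition_columns_to_read_from_file_path;
};

// Hypothetical stand-in for StorageFileCluster. In this model, the URL and
// object storage variants would override the hook the same way.
class FileClusterStorage : public ClusterStorageBase
{
public:
    using ClusterStorageBase::ClusterStorageBase;

    NamesAndTypes columnsForTaskIterator() const override
    {
        return getHivePartitionColumnsWithoutVirtuals();
    }
};
```

Keeping the helper in the base class gives all three cluster storages a single deduplication rule instead of three copies.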

Scope of change: Purely hive partition column deduplication in cluster storage task iterators. No fault tolerance, retry logic, or node failure handling is modified.

Summary

Category	Count	Tests
PR-caused regression	0	—
Pre-existing flaky	4	`00157_cache_dictionary`, `03443_shared_storage_snapshots`, stress test state pollution, swarms node failure timing
Infrastructure issue (known broken)	1	`test_different_versions` (NETLINK_ERROR)

Conclusion: None of the 5 CI failures are caused by this PR. Safe to merge.

Failure-by-Failure Analysis

1. `Stateless tests (arm_binary, sequential)` — `00157_cache_dictionary` FAIL

Report: https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1863&sha=9d5e1d15b7e88e2017a778370f5c8239f843f0e1&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20sequential%29
Test timed out after ~318–322 seconds (sequential test timeout threshold)
Dictionary cache test — no relation to cluster functions or hive partitioning
Same failure observed in PR #1868 (Profile events for Iceberg catalogs) and other unrelated PRs
Verdict: Pre-existing flaky test; cache dictionary tests are well known to be timing-sensitive

2. `Stateless tests (arm_asan, azure, sequential, 2/2)` — `03443_shared_storage_snapshots` FAIL

Report: https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1863&sha=9d5e1d15b7e88e2017a778370f5c8239f843f0e1&name_0=PR&name_1=Stateless%20tests%20%28arm_asan%2C%20azure%2C%20sequential%2C%202%2F2%29
Test timed out after ~217–226 seconds (two attempts, both timed out)
Shared storage snapshots test — no relation to cluster functions or hive partitioning
Verdict: Pre-existing flaky test — timing-sensitive test unrelated to PR changes

3. `Stress test (amd_debug)` — `Cannot start clickhouse-server` FAIL

Report: https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1863&sha=9d5e1d15b7e88e2017a778370f5c8239f843f0e1&name_0=PR&name_1=Stress%20test%20%28amd_debug%29
Identical failure pattern to PR #1868 (Profile events for Iceberg catalogs): a leftover Hybrid engine table (test_1.t3) from a previous stress run blocks server startup

Error:

Code: 60. DB::Exception: Table test_1.local_hot does not exist:
Cannot attach table `test_1`.`t3` ...
ENGINE = Hybrid(remote('localhost:9000', 'test_1', 'local_hot'), ...)

PR does not touch Hybrid engine code
Verdict: Pre-existing flaky — stress test state pollution from Altinity-specific Hybrid engine table not cleaned up between runs

4. `Integration tests (arm_binary, distributed plan, 4/4)` — Two failures

4a. `test_storage_s3_queue/test_0.py::test_move_after_processing[another_bucket-AzureQueue]` (counted as Failure: 1/1039)

Report: https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1863&sha=9d5e1d15b7e88e2017a778370f5c8239f843f0e1&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%204%2F4%29

Error:

assert 0 == (5 * 10)
  where 0 = int('0\n')
  where '0\n' = query("SELECT sum(rows_processed) FROM system.azure_queue_log
                       WHERE table = 'move_after_processing_unordered_AzureQueue_ztwdcp'")

AzureQueue log shows 0 rows processed — queue message delivery/timing issue
No relation to cluster functions or hive partitioning
Verdict: Pre-existing flaky test — AzureQueue processing timing

4b. `test_backup_restore_on_cluster/test_different_versions.py::test_different_versions` (filtered out by broken tests handler)

Broken tests handler log:

test_backup_restore_on_cluster/test_different_versions.py::test_different_versions
Matched rule: {'reason': 'INVESTIGATE - NETLINK_ERROR', 'message': 'NETLINK_ERROR'}

Correctly filtered as a known infrastructure networking issue (Docker NETLINK_ERROR)
Not counted in the CI failure summary
Verdict: Known infrastructure issue — CI properly recognized and filtered this
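For readers unfamiliar with the broken tests handler, here is a hypothetical C++ sketch of how a rule like the one matched above could be applied. The handler's real matching logic is not shown in this triage. The sketch assumes a plain substring match of the rule's `message` against the failure output. `BrokenTestRule` and `matchBrokenRule` are illustrative names.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of a broken-tests rule such as
// {'reason': 'INVESTIGATE - NETLINK_ERROR', 'message': 'NETLINK_ERROR'}.
struct BrokenTestRule
{
    std::string reason;
    std::string message;
};

// Assumption: a failure is filtered out when its output contains the rule's message.
// Returns the reason of the first matching rule, or nothing if no rule matches.
std::optional<std::string> matchBrokenRule(const std::string & failure_output, const std::vector<BrokenTestRule> & rules)
{
    for (const auto & rule : rules)
        if (failure_output.find(rule.message) != std::string::npos)
            return rule.reason;
    return std::nullopt;
}
```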

5. `Regression release swarms` — `/swarms/feature/node failure/check restart swarm node` FAIL

Report: https://altinity-build-artifacts.s3.amazonaws.com/REFs/1863/merge/9d5e1d15b7e88e2017a778370f5c8239f843f0e1/regression/x86_64/with_analyzer/zookeeper/without_thread_fuzzer/swarms/report.html

Test purpose: Verify that a query running on a swarm cluster fails when a swarm node is killed during execution.

What the test does:

Loads 100 rows into an Iceberg table
Runs SELECT count(), hostName() FROM iceberg_table WHERE NOT ignore(sleepEachRow(1)) GROUP BY hostName() SETTINGS object_storage_cluster='static_swarm_cluster' — expected to take ~100 seconds
In parallel: stops clickhouse3 and immediately restarts it
Expects the query to fail with exit code 138 (DB::Exception: Query was cancelled.)

What actually happened:

[clickhouse1] 100    clickhouse2
[clickhouse1] bash# echo $?
[clickhouse1] 0

The query succeeded (exit code 0, expected 138)
All 100 rows returned from clickhouse2 (original distribution was clickhouse2: 47, clickhouse3: 53)
Query took 1m 47s ≈ 100 rows × ~1 second/row (all rows processed by clickhouse2 after clickhouse3 recovery)
clickhouse3 was stopped for only ~1 second (docker-compose stop + start took ~3s total)

Why this is NOT caused by the PR:

The PR changes are limited to:

Adding getHivePartitionColumnsWithoutVirtuals() — filters duplicate hive columns in task iterators
Moving hive_partition_columns_to_read_from_file_path to the base class

Neither change touches fault tolerance, retry logic, or node failure handling in the cluster query execution path. The getTaskIteratorExtension change only affects which columns are passed to the file iterator at query setup time.

Selfeer · 2026-06-08T13:33:57Z

Need to take one more, closer look at the swarm test failures before marking this as verified.

Fix cluster functions with hive partitioning

f8cc3a0

chatgpt-codex-connector Bot reviewed Jun 2, 2026


Comment thread tests/integration/test_s3_cluster/test.py Outdated

ianton-ru added antalya bugfix antalya-26.3 labels Jun 2, 2026

ianton-ru added 3 commits June 2, 2026 18:29

Fix for url function

776d0ae

Fix for file function

7de10eb

Remove code copy-paste

9850be4

chatgpt-codex-connector Bot reviewed Jun 2, 2026


Comment thread tests/integration/test_file_cluster/test.py Outdated

Remove confused AI comments

9d5e1d1

arthurpassos reviewed Jun 3, 2026


arthurpassos approved these changes Jun 4, 2026



arthurpassos Jun 3, 2026


ianton-ru Jun 4, 2026


arthurpassos Jun 4, 2026


4 participants
